Never use a "for" loop or "apply". 1000x speed bump in Pandas

I learned Python years ago, and one idea I came across back then stuck with me: “if you reach for a loop, check to see if there is a functional way instead.” For pandas I would now say: “instead of looping, check to see if you can vectorize it.”

There’s an early pandas lesson that asks me to write a for loop to modify some column names. That’s not exactly something that “needs” optimization or vectorization, BUT we’re dealing with pandas, so if something can be vectorized, I think it should be. Pandas has all sorts of built-in methods that allow for vectorization, which can be hundreds of times or even 1000x faster than looping on large data. These methods are super easy to use as well. Check out some of these string methods: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods
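
Here’s a rough sketch of what I mean for the column-name case. The DataFrame and its column names are made up for illustration (they’re not from the lesson), and the cleanup steps are just an assumption about what “modify some column names” might involve:

```python
import pandas as pd

# Hypothetical DataFrame with messy column names (not from the course).
df = pd.DataFrame(columns=[" Order ID", "Unit Price ", "Ship Date"])

# Loop version: clean each name one at a time.
cleaned = []
for name in df.columns:
    cleaned.append(name.strip().lower().replace(" ", "_"))

# Vectorized version: the .str accessor applies each step to every label at once.
vectorized = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

df.columns = vectorized
print(list(df.columns))
```

With only three column names you won’t see any speed difference. The point is more that the same `.str` chain works on a Series with millions of rows, which is where skipping the Python-level loop starts to matter.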

Also, this article was neat: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

This video shows some of it: https://www.youtube.com/watch?v=nxWginnBklU

Update: literally 2 screens later in the course there’s mention of the .str accessor for the pandas string methods. I’m wondering if we’ll cover .str.contains too. In NumPy, I saw that we can use np.where to replace simple “if” conditions and np.select to vectorize more complex nested if/elif situations. Good job, DQ teachers! Well done :)
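
For anyone curious, here’s a minimal sketch of those three together. The data, column names, and grade cutoffs are all made up by me for illustration, so treat it as one possible way to use them, not as what the course teaches:

```python
import numpy as np
import pandas as pd

# Hypothetical data, just for illustration.
df = pd.DataFrame({
    "name": ["apple pie", "banana", "apple tart", "cherry"],
    "score": [95, 72, 58, 81],
})

# .str.contains gives a boolean Series: True where the substring appears.
df["has_apple"] = df["name"].str.contains("apple", regex=False)

# np.where replaces a simple if/else: value if the condition holds, otherwise the other value.
df["result"] = np.where(df["score"] >= 60, "pass", "fail")

# np.select replaces an if/elif/else chain: the first matching condition wins.
conditions = [df["score"] >= 90, df["score"] >= 80, df["score"] >= 60]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, default="F")

print(df)
```

The order of `conditions` matters here, the same way the order of `elif` branches does: a score of 95 also satisfies `>= 60`, but it gets “A” because that condition comes first.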

One of the shortcomings I see in the early Python material at DQ is that there’s no mention of comprehensions, lambda functions, map, filter, reduce, zip, etc., but when looking at code out in the wild, you see them frequently. Even something like enumerate() can be a little more performant than manual indexing if you do need to loop. I’ve been helped by Google, Stack Overflow, and YouTube a lot, and when it comes to data science I think optimization and speed should be considered, since we’re sometimes dealing with millions of data points. I understand the “you should learn it the hard way first” argument, but I disagree. I think you should learn it the “common” way first, with some mindfulness of optimization and speed. It’s actually harder to comprehend the more “functional” ways after you learn to “loop” everything, but it’s way easier to learn the functional way first and then the easy “looping” way. My two cents.
