Using Series.apply vs calculating directly

What are the functional differences, advantages, and disadvantages between the following two lines of code which accomplish the same thing?

distance = abs(3 - dc_listings['accommodates'])
distance2 = dc_listings['accommodates'].apply(lambda x: abs(3-x))

I’m trying to get a better understanding of what’s going on under the hood so I can make more efficient choices when there are multiple ways of doing something.

Functionally, no difference. It’s essentially simple arithmetic. For something more complicated you might not be able to use the first approach and then apply() might be better.
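To make the "something more complicated" case concrete, here is a minimal sketch. The sample values and the describe_fit helper are hypothetical and only stand in for dc_listings['accommodates']; the point is that branching logic like this is awkward to write as one arithmetic expression, so apply() is a reasonable fallback.

```python
import pandas as pd

# Hypothetical sample data standing in for dc_listings['accommodates']
accommodates = pd.Series([1, 2, 3, 4, 6, 8])


def describe_fit(n, target=3):
    """Hypothetical helper: label a listing relative to a target capacity."""
    if n == target:
        return "exact"
    if n < target:
        return "too small"
    # Larger than the target: distinguish slightly larger from much larger
    return "roomy" if n - target <= 2 else "oversized"


# apply() calls describe_fit once per element, in Python
labels = accommodates.apply(describe_fit)
print(labels.tolist())
```

Logic like this can often still be vectorized (for example with numpy.select), but apply() is usually the quickest thing to write when speed is not yet a concern.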

apply() is generally slower, so people often recommend against relying on it too much. In your example, the first approach is faster than the one using apply().

The big question here is how long your code takes to execute in each case. I ran some tests:

import numpy as np
import pandas as pd

# Create a dataframe with 100000 random values
x = pd.DataFrame(np.random.randn(100000), columns=['values'])

# Your scenario 1
%timeit abs(3 - x['values'])

# Your scenario 2
%timeit x['values'].apply(lambda v: abs(3 - v))

# My preferred code using vectorized Series.abs()
%timeit (3 - x['values']).abs()

I come up with the following times on my machine:

Scenario 1: 1.54 ms ± 50.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Scenario 2 (apply): 45.9 ms ± 6.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
Scenario 3: 1.49 ms ± 29.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
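Note that %timeit is an IPython/Jupyter magic and will not run in a plain Python script. A rough equivalent using the standard-library timeit module might look like the sketch below. The variable names and the choice of 10 runs are my own assumptions, and the numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Same setup as in the test above: 100000 random values in one column
df = pd.DataFrame(np.random.randn(100000), columns=["values"])

# Each candidate is wrapped in a zero-argument callable so timeit can run it
candidates = {
    "built-in abs()": lambda: abs(3 - df["values"]),
    "apply()": lambda: df["values"].apply(lambda v: abs(3 - v)),
    "Series.abs()": lambda: (3 - df["values"]).abs(),
}

runs = 10  # arbitrary choice to keep the benchmark short
timings = {}
for name, func in candidates.items():
    total_seconds = timeit.timeit(func, number=runs)
    timings[name] = total_seconds / runs
    print(f"{name}: {timings[name] * 1000:.2f} ms per call")
```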

As you can see, the differences are huge. With 100000 observations, the code using apply() is slower than the other two implementations by a factor of roughly 30 (45.9 ms vs. about 1.5 ms). That being said, if you want code that performs well, use vectorized methods whenever possible. In this sense, I see no good reason to call abs() inside Series.apply() as in your scenario 2. Just use the vectorized Series.abs() that comes with pandas instead. It performs better and is actually easier to read.

Note: Your scenario 1 and my code should do the same thing under the hood, if I am correct: calling the built-in abs() on a Series dispatches to the __abs__ method of the Series. You treat abs() as a function; I prefer to use it as a DataFrame or Series method. In this case it comes down to personal preference.
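If you want to convince yourself that all three approaches produce the same values, a quick check along these lines should do. This is only a sketch on hypothetical random data, not on dc_listings.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
s = pd.Series(np.random.randn(1000))

a = abs(3 - s)                     # built-in abs() called on a Series
b = s.apply(lambda v: abs(3 - v))  # element-by-element Python calls
c = (3 - s).abs()                  # Series.abs() method

# assert_series_equal raises an AssertionError if the Series differ
pd.testing.assert_series_equal(a, b)
pd.testing.assert_series_equal(a, c)
print("all three results match")
```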

Good read for this topic: A Beginner’s Guide to Optimizing Pandas Code for Speed

Best
htw

Sorry to be a month late seeing these replies, but thank you. These are both very informative answers!