What are the functional differences, advantages, and disadvantages between the following two lines of code which accomplish the same thing?
distance = abs(3 - dc_listings['accommodates'])
distance2 = dc_listings['accommodates'].apply(lambda x: abs(3-x))
I’m trying to get a better understanding of what’s going on under the hood so I can make more efficient choices when there are multiple ways of doing something.
Functionally, there is no difference; it's essentially simple arithmetic. For something more complicated, you might not be able to use the first approach, and then apply() might be the better choice.
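To illustrate the "something more complicated" case, here is a minimal sketch of per-element logic that does not reduce to one arithmetic expression. The Series and the size_label function are hypothetical stand-ins, not part of the original question, and branching like this can often still be vectorized by other means, so treat it only as an example of where apply() is convenient.

```python
import pandas as pd

# Hypothetical stand-in for dc_listings['accommodates']
accommodates = pd.Series([1, 2, 3, 4, 6])

def size_label(n):
    # Plain Python branching, evaluated once per element
    if n < 3:
        return "small"
    if n == 3:
        return "exact match"
    return "large"

# apply() calls size_label on each value and collects the results
labels = accommodates.apply(size_label)
print(labels.tolist())
```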
apply() is a bit slower in general, and people often recommend against overusing it. In your example, the first approach is a bit faster than the one with apply().
The big question here is how long your code takes to execute in each case. If I run some tests:
import numpy as np
import pandas as pd
# Create a dataframe with 100000 random values
x = pd.DataFrame(np.random.randn(100000), columns=['values'])
# Your scenario 1
%timeit abs(3 - x['values'])
# Your scenario 2
%timeit x['values'].apply(lambda v: abs(3 - v))
# My preferred code using vectorized Series.abs()
%timeit (3 - x['values']).abs()
I come up with the following times on my machine:
Scenario 1: 1.54 ms ± 50.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Scenario 2 (apply): 45.9 ms ± 6.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Scenario 3: 1.49 ms ± 29.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As you can see, the differences are huge. With 100000 observations, the code using apply() is slower by a factor of about 30 compared with the other two implementations. That being said, if you want to write code that performs well, then you want to use vectorized methods whenever possible. In this sense, I see no good reason to use abs() as a function within Series.apply() as in your scenario 2. Just use the vectorized Series.abs() that comes with pandas instead. It performs better and it is actually easier to read.
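The %timeit lines above only work in IPython or Jupyter. If you want to reproduce the comparison in a plain script, something like the following sketch should work, using the standard-library timeit module. The variable names, the repeat counts, and the "best of 3" reporting are my own choices, and the numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# 100000 random values, as in the test above
s = pd.Series(np.random.randn(100_000), name="values")

candidates = {
    "abs(3 - s)": lambda: abs(3 - s),
    "s.apply(...)": lambda: s.apply(lambda v: abs(3 - v)),
    "(3 - s).abs()": lambda: (3 - s).abs(),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats, 10 calls each, stored as seconds per call
    timings[label] = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{label:>15}: {timings[label] * 1e3:.2f} ms per call")
```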
Note: Your scenario 1 and my code should do the same thing behind the scenes, if I am correct. You treat abs() as a function; I prefer to treat it as a DataFrame or Series method. In this case it comes down to personal preference.
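A quick way to check this yourself is sketched below. As far as I know, Python's built-in abs() simply calls the object's __abs__ method, and pandas implements that in terms of .abs(), so the two vectorized spellings should give identical results. The variable names are mine, and the apply() result is compared with a floating-point tolerance to be on the safe side.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(1000), name="values")

via_builtin = abs(3 - s)                   # built-in abs() dispatches to Series.__abs__
via_method = (3 - s).abs()                 # Series.abs() called directly
via_apply = s.apply(lambda v: abs(3 - v))  # one Python-level call per element

# Compare the two vectorized spellings, then the apply() version
print(via_builtin.equals(via_method))
print(np.allclose(via_builtin, via_apply))
```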
Good read for this topic: A Beginner’s Guide to Optimizing Pandas Code for Speed
Sorry to be a month late seeing these replies, but thank you. These are both very informative answers!