Looking for generalized input on code speed improvements

I’m starting to think about code optimization for speed in general, and I'm curious whether anyone has resources for conceptual guidelines. I'd also welcome feedback on the two approaches to the same code below. I ran times for both, and mine was negligibly faster, but I am wondering what tradeoffs each approach makes. Which would be faster if the dataset were 1000x bigger, and why? Thanks!

Screen Link:
https://app.dataquest.io/m/89/introduction-to-decision-trees/9/overview-of-dataset-entropy

My Code:

possibilities = income['high_income'].unique()
outcomes = len(income['high_income'])
entropies = []
for e in possibilities:
    occurrences = len(income[income['high_income']==e])
    ratio = occurrences/outcomes
    P = ratio * math.log(ratio, len(possibilities))
    entropies.append(P)

income_entropy = -(sum(entropies))

DQ code:

prob_0 = income[income["high_income"] == 0].shape[0] / income.shape[0]
prob_1 = income[income["high_income"] == 1].shape[0] / income.shape[0]
income_entropy = -(prob_0 * math.log(prob_0, 2) + prob_1 * math.log(prob_1, 2))

In this scenario, there’s not much difference between the two.

Since your possibilities is of length 2, your for loop runs two times. If that were higher, your for loop would run more times.

In the solution code, you would need prob_2, prob_3, and so on.

The operation is still the same, but yours scales better when there are more than 2 possibilities, since you don't have to write out each statement separately.


Just simulate it. There are numerous tools for generating test data under pd.util.testing, such as makeDataFrame().
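
For example, here is a minimal sketch of the simulate-and-time idea (the 1,000,000-row size is just a placeholder, and for a single 0/1 column it's simpler to generate random integers than to use makeDataFrame()):

import math
import time
import numpy as np
import pandas as pd

# simulate a much bigger high_income column (hypothetical size)
income = pd.DataFrame({'high_income': np.random.randint(0, 2, size=1_000_000)})

# time the loop-based version
start = time.perf_counter()
possibilities = income['high_income'].unique()
outcomes = len(income['high_income'])
entropies = []
for e in possibilities:
    occurrences = len(income[income['high_income'] == e])
    ratio = occurrences / outcomes
    entropies.append(ratio * math.log(ratio, len(possibilities)))
loop_entropy = -sum(entropies)
print('loop version:', time.perf_counter() - start)

# time the DQ-style version
start = time.perf_counter()
prob_0 = income[income['high_income'] == 0].shape[0] / income.shape[0]
prob_1 = income[income['high_income'] == 1].shape[0] / income.shape[0]
dq_entropy = -(prob_0 * math.log(prob_0, 2) + prob_1 * math.log(prob_1, 2))
print('DQ version:', time.perf_counter() - start)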

occurrences = len(income[income['high_income']==e])
Why filter the entire dataframe when you only need information from 1 column?
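
For instance, one way to get the same count from just the column (a hypothetical rewrite of that line):

occurrences = (income['high_income'] == e).sum()  # counts matches without building a filtered DataFrame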

P = ratio * math.log(ratio, len(possibilities))
Why is your log base changing dynamically? Shouldn't it always be base 2?

Whenever you want group counts, you can try collections.Counter() instead of looping: it gives you a count for each unique key, which you can then turn into a numpy array and use vectorized operations on from then on.

Dividing by outcomes can be vectorized. Same for math.log — change it to np.log. That's why sklearn and pandas source code is full of numpy functions. (What happens in numpy's log function? Are there ways to improve the performance? - Stack Overflow)
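
Putting those suggestions together, a minimal sketch (assuming the income DataFrame from the mission is already loaded; np.log2 is used here since the entropy is in bits):

from collections import Counter
import numpy as np

counts = Counter(income['high_income'])            # count of each unique value, no hand-written loop
counts_arr = np.array(list(counts.values()), dtype=float)
probs = counts_arr / counts_arr.sum()              # vectorized division by the total number of outcomes
income_entropy = -(probs * np.log2(probs)).sum()   # vectorized log, base 2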


Thanks for the feedback! I will try out some of the testing stuff in the future; for now I'm just trying to get an understanding of how different approaches perform.

Good question! I guess I thought I had to, ha! Also, I noticed the DQ approach used .shape to return the length instead of the Python len() function. Is there a general preference here, pandas/numpy methods over Python functions?

True. I did it dynamically because I am practicing building things that way for easier adjustments, and I also feel it makes the code easier for me to follow as I write it. It definitely takes longer to write, especially if there are a lot of dynamic variables, but does it slow down the actual processing much, generally speaking?

So, if I use collections.Counter() I can get rid of the for loop, and then I could use vectorized functions for the calculations, right? I'll need to noodle around with that a bit to figure it out. Thank you!

My guess is df.shape[0] may be marginally faster since it reads a stored attribute rather than going through a function call, but len() may communicate intent slightly better.

Yes, it is good to think in terms of parameterizing inputs. My point here is that you are not setting the base as a free parameter (which I think it shouldn't be anyway), but tying it to the number of unique possibilities. That actually goes against information theory, doesn't it? (From my understanding, information theory always uses base 2, at least for entropy calculations in decision trees.)
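
To make the difference concrete, here's a quick made-up check: with two classes the two bases agree, but with three or more they diverge (base 2 gives entropy in bits, while tying the base to the class count caps the entropy at 1):

import math

probs = [1/3, 1/3, 1/3]                                    # three equally likely classes (hypothetical)
h_bits = -sum(p * math.log(p, 2) for p in probs)           # ≈ 1.585 bits
h_tied = -sum(p * math.log(p, len(probs)) for p in probs)  # = 1.0 when the base is the class count
print(h_bits, h_tied)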

Variable access should not slow the code down much, unless the attribute lookup has to go through a whole chain of instance → class → parent class → base object. (LEGB? Meet ICPO, Python's search strategy for attributes — Reuven Lerner)

Counter is also great because it will not fail when you access a non-existent key; it gives you a default value of 0 instead. It also lets you do math on entire Counter objects. Besides the official docs, I like this site for quick exposure: Counter - Python Module of the Week.
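
A quick illustration of both points (made-up counts):

from collections import Counter

c = Counter(['a', 'a', 'b'])
print(c['z'])                    # 0 — a missing key does not raise a KeyError
print(c + Counter(['b', 'c']))   # Counter({'a': 2, 'b': 2, 'c': 1}) — arithmetic on whole Counters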
