 # Looking for generalized input on code speed improvements

I’m starting to think about code optimization for speed in general, and am curious whether anyone has resources for conceptual guidelines, and/or would welcome feedback on the two approaches to the same code below. I ran times for both; mine was negligibly faster, but I’m wondering what tradeoffs each approach makes. Which would be faster if the dataset were 1000× bigger, and why? Thanks!

My Code:

```python
import math

possibilities = income['high_income'].unique()
outcomes = len(income['high_income'])
entropies = []
for e in possibilities:
    occurances = len(income[income['high_income'] == e])
    ratio = occurances / outcomes
    P = ratio * math.log(ratio, len(possibilities))
    entropies.append(P)

income_entropy = -(sum(entropies))
```

DQ code:

```python
prob_0 = income[income["high_income"] == 0].shape[0] / income.shape[0]
prob_1 = income[income["high_income"] == 1].shape[0] / income.shape[0]
income_entropy = -(prob_0 * math.log(prob_0, 2) + prob_1 * math.log(prob_1, 2))
```

In this scenario, there’s not much difference between the two.

Since your `possibilities` is of length `2`, your for loop runs twice. If that were higher, your for loop would simply run more times.

In the solution code, you would need a `prob_2`, a `prob_3`, and so on.

The computation is the same either way, but your approach scales better when there are more than 2 `possibilities`, since you don’t have to write out each statement separately.
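For the more-than-two-classes case, a vectorized version avoids both the loop and the hand-written `prob_k` variables. This is just a sketch with a made-up `income` frame (only the column name comes from the thread):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the thread's `income` DataFrame.
income = pd.DataFrame({"high_income": [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]})

# value_counts(normalize=True) returns each class's probability in one pass,
# so this works unchanged for 2 classes or 200.
probs = income["high_income"].value_counts(normalize=True).to_numpy()
income_entropy = -np.sum(probs * np.log2(probs))
```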


Just simulate it. There are helper tools like `pd.util.testing.makeDataFrame()` for generating test data.
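(Note that `pd.util.testing` has been deprecated in recent pandas releases, so a quick `np.random` simulation does the same job. A rough sketch of timing the two approaches on a larger simulated dataset; the size is arbitrary and only the column name matches the thread:)

```python
import time

import numpy as np
import pandas as pd

# Simulated dataset for benchmarking.
n = 1_000_000
income = pd.DataFrame({"high_income": np.random.randint(0, 2, size=n)})

# Filtering approach: one full-frame boolean filter per class.
start = time.perf_counter()
counts_filter = [len(income[income["high_income"] == e]) for e in (0, 1)]
t_filter = time.perf_counter() - start

# Vectorized approach: one pass over a single column.
start = time.perf_counter()
counts_vec = income["high_income"].value_counts()
t_vec = time.perf_counter() - start

print(f"filter: {t_filter:.4f}s  value_counts: {t_vec:.4f}s")
```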

`occurances = len(income[income['high_income']==e])`
Why filter the entire dataframe when you only need information from 1 column?

`P = ratio * math.log(ratio, len(possibilities))`
Why is your log base changing dynamically? Shouldn’t it always be base 2?

Whenever you want group counts, try `collections.Counter()` instead of looping: it gives you a mapping of unique keys to counts, which you can turn into a numpy array and use vectorized operations from then on.
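A minimal sketch of that Counter-to-numpy pipeline, using made-up values:

```python
from collections import Counter

import numpy as np

values = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]  # made-up sample
counts = Counter(values)                  # maps unique keys -> counts

# Turn the counts into an array and switch to vectorized math.
arr = np.array(list(counts.values()), dtype=float)
probs = arr / arr.sum()
entropy = -np.sum(probs * np.log2(probs))
```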

Dividing by `outcomes` can be vectorized. Same for `math.log`: change it to `np.log`. That’s why the sklearn and pandas source code is full of numpy functions. (See “What happens in numpy’s log function? Are there ways to improve the performance?” on Stack Overflow.)
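To see what “vectorized” means here, compare a Python-level loop with a single `np.log2` call over the whole array; both compute the same numbers, but the loop makes one interpreter call per element (toy ratios, made up):

```python
import math

import numpy as np

ratios = np.array([0.4, 0.6])

# Loop: one Python-level math.log call per element.
loop_result = [r * math.log(r, 2) for r in ratios]

# Vectorized: a single np.log2 call processes the whole array in C.
vec_result = ratios * np.log2(ratios)
```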


Thanks for the feedback! I will try out some of the testing stuff in the future, just trying to get an understanding of the way different approaches function performance-wise now.

Good question! I guess I thought I had to, ha! Also, I noticed the DQ approach used `.shape` to get the length instead of the Python `len()` function. Is there a general preference here, pandas/numpy methods over Python functions?

True. I did it dynamically because I am practicing building things that way for easier adjustments, and I also feel it makes the code easier for me to follow as I write it. It definitely takes longer to write, especially when there are a lot of dynamic variables, but does it slow down the actual processing much, generally speaking?

so, if I use `collections.Counter()` I can get rid of the for loop, and then I could use vectorized functions for the calculations, right? I’ll need to noodle around with that a bit to figure it out. Thank you!

My guess is the two are close: both `len(df)` and `df.shape[0]` are constant-time metadata lookups with no filtering step, so any difference should be tiny. But `len` may communicate intent slightly better.

Yes, it is good to think in terms of parameterizing inputs. My point here is that you are not setting the base as a free parameter (which I think it shouldn’t be), but tying it to the number of unique possibilities. That arguably goes against information theory: from my understanding, information theory always uses base 2, at least for entropy calculations in decision trees.
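One way to see what the base choice does: entropy in base b is just base-2 entropy divided by log2(b), so tying the base to the number of classes rescales the maximum to 1 but never changes which distribution is more uncertain. A small check with made-up probabilities:

```python
import math

probs = [0.4, 0.6]  # made-up class probabilities

h2 = -sum(p * math.log(p, 2) for p in probs)  # base-2 entropy
h4 = -sum(p * math.log(p, 4) for p in probs)  # base-4 entropy

# Changing base only rescales the result: H_b = H_2 / log2(b)
print(h2, h4, h2 / math.log2(4))
```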

Variable access should not slow the code much, unless that attribute has to go through a whole chain of instance → class → parent class → base object to find the attribute. (See “LEGB? Meet ICPO, Python’s search strategy for attributes” by Reuven Lerner.)

`Counter` is also great because it will not fail when you access a non-existent key; it gives you a default value of 0 instead. It also lets you do math on entire Counter objects. Besides the official docs, I like this site for quick exposure: “Counter — Python Module of the Week.”
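Both behaviors in a few lines (the sample data is made up):

```python
from collections import Counter

a = Counter([0, 1, 1, 0, 1])
b = Counter([1, 1, 0])

missing = a[5]      # missing keys return 0 instead of raising KeyError
combined = a + b    # element-wise addition of the two count mappings
```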
