Dataquest frequency distribution table generation is slower than the method used in one of the first missions?

Hello everyone!

I was working through the Significance Testing mission and found the solution provided by Dataquest to be a bit slower than the other method for generating frequency distribution dictionaries, which we learned back at the beginning of the Data Science path. Let me explain:

Dataquest approach

sampling_distribution = {}
for df in mean_differences:
    if sampling_distribution.get(df, False):
        sampling_distribution[df] = sampling_distribution[df] + 1
    else:
        sampling_distribution[df] = 1

Tracking the execution time:

import timeit

start_time_a = timeit.default_timer()

sampling_distribution = {}
for df in mean_differences:
    if sampling_distribution.get(df, False):
        sampling_distribution[df] = sampling_distribution[df] + 1
    else:
        sampling_distribution[df] = 1
elapsed_a = timeit.default_timer() - start_time_a

elapsed_a: 0.000935843214392662
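As an aside, the same get() idea can be written without the if/else at all by using 0 as the default value. This is a variation, not what the mission shows, and the sample data here is made up to stand in for mean_differences:

```python
# Hypothetical sample data standing in for the mission's mean_differences
mean_differences = [0.5, -1.2, 0.5, 3.0]

sampling_distribution = {}
for df in mean_differences:
    # get(df, 0) returns the current count, or 0 if df hasn't been seen yet
    sampling_distribution[df] = sampling_distribution.get(df, 0) + 1

print(sampling_distribution)  # {0.5: 2, -1.2: 1, 3.0: 1}
```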

Different Approach

Here I am not using the dictionary method get():

start_time_b = timeit.default_timer()
frequency = dict()

for i in mean_differences:
    if i in frequency:
        frequency[i] += 1
    else:
        frequency[i] = 1
elapsed_b = timeit.default_timer() - start_time_b

frequency == sampling_distribution
elapsed_a > elapsed_b

elapsed_b: 0.0006739329546689987

Both of the last statements evaluate to True.
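For what it's worth, the standard library also offers collections.Counter, which builds the same frequency table in one line. A minimal sketch, with made-up data standing in for mean_differences:

```python
from collections import Counter

# Hypothetical sample data standing in for the mission's mean_differences
mean_differences = [0.5, -1.2, 0.5, 3.0, -1.2, 0.5]

# Counter builds the same {value: count} mapping in a single pass
frequency = Counter(mean_differences)

# It compares equal to the dict built manually with a loop,
# since Counter is a dict subclass
manual = {}
for d in mean_differences:
    if d in manual:
        manual[d] += 1
    else:
        manual[d] = 1

print(frequency == manual)  # True
```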

Is this by any chance relevant if we sample even more permutations?

Thanks in advance!


It might be relevant, but likely only when you have really large datasets to go through. But for those there are alternative implementations using different libraries/tools that are likely better.

For get(), Python has to do “more work” in order to execute it. Understanding why it's “more work” requires some CS fundamentals, so I won't get into that here. But I found this answer, which explains it if anyone is interested - https://stackoverflow.com/questions/36566331/why-does-dict-getkey-run-slower-than-dictkey

Using in doesn’t have quite the same overhead as get(), by the looks of it. Of course, in my view, the claim that one is slower than the other only really holds if you time both over a series of runs with different data sizes and average the results.
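A rough sketch of that kind of repeated timing, using timeit.repeat on both approaches. The data here is randomly generated rather than the mission's mean_differences, and the function names are my own:

```python
import random
import timeit

# Random data standing in for mean_differences
data = [random.gauss(0, 1) for _ in range(10_000)]

def with_get(values):
    counts = {}
    for v in values:
        if counts.get(v, False):
            counts[v] = counts[v] + 1
        else:
            counts[v] = 1
    return counts

def with_in(values):
    counts = {}
    for v in values:
        if v in counts:
            counts[v] += 1
        else:
            counts[v] = 1
    return counts

# Run each version several times and keep the best (lowest) timing,
# which is less noisy than a single measurement
t_get = min(timeit.repeat(lambda: with_get(data), number=100, repeat=5))
t_in = min(timeit.repeat(lambda: with_in(data), number=100, repeat=5))
print(f"get(): {t_get:.4f}s  in: {t_in:.4f}s")
```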


Well, first of all, thank you for taking the time on this question.

This was just a small thing I noticed when doing the exercise, and I was curious to see what others think about it. I remember doing this at the beginning and liking the simplicity of using in when building the logic in a loop, since it works for multiple data structures as well. The second approach is also more readable, in my opinion. The StackOverflow answer is very good and detailed, so thank you for that resource.

I guess the people writing the Dataquest missions have their own ways of achieving the same thing, in this case with this simple loop. Or maybe they want you to see more possibilities when writing code. Anyway, thanks for the response!


Nice work @eliasalvarez96! I was also curious about how these two methods of accomplishing the same thing compare.
