Why do we sum the frequencies list instead of counting it?

Screen link here

I just have a basic question here - why are we summing all values and not using len() to count the frequency?

Since there are no values in sampling_distribution greater than or equal to 2.52, the p_value will be 0 regardless, but it seems to me that using np.sum() is the incorrect approach in theory.

l = [1,1,2,2,2,3]

l_dict = {1:2, 2:3, 3:1}
l_dict_morethanequal2 = {2:3, 3:1}
sum(l_dict_morethanequal2.values())  # 4
len(l_dict_morethanequal2)  # 2 -- counts the number of keys, not how many times each key appears

len counts the number of elements in an iterable. What counts as an element depends on the object, and for an unfamiliar object that takes some study. In a dictionary, the elements are the key-value pairs, which has nothing to do with the numeric values of the keys or the values.
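A quick illustration of the point above, using a small toy dictionary (assumed data, not from the exercise):

```python
# len() counts key-value pairs; sum() over .values() totals the counts
d = {1: 2, 2: 3, 3: 1}
print(len(d))           # 3 -- three key-value pairs
print(sum(d.values()))  # 6 -- total number of underlying observations
```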


Here is the solution provided - note that frequencies is a list:

frequencies = []
for sp in sampling_distribution.keys():
    if sp >= 2.52:
        frequencies.append(sampling_distribution[sp])
p_value = np.sum(frequencies) / 1000

Given that frequencies is a list, isn’t this the incorrect approach? If it were a dictionary, summing the values would certainly make sense.

frequencies is a list of values from sampling_distribution[sp], filtered using the dict keys. Each value says, for each sp >= 2.52, how many times it occurred. The datatype of the container does not matter; what matters is which values we are filtering for.

If you don’t want to use a container, an accumulator variable current_sum initialized to 0 works too. If you don’t want to initialize anything, even operating directly on sampling_distribution works too:
sum(value for sp, value in sampling_distribution.items() if sp >= 2.52).
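To make the equivalence concrete, here is a sketch of all three approaches on an assumed toy frequency table (the real sampling_distribution from the exercise would have many more entries):

```python
import numpy as np

# Toy frequency table (assumed data): mean difference -> number of occurrences
sampling_distribution = {2.50: 4, 2.52: 2, 2.53: 3}

# 1. List + np.sum, as in the provided solution
frequencies = []
for sp in sampling_distribution.keys():
    if sp >= 2.52:
        frequencies.append(sampling_distribution[sp])
p_value_list = np.sum(frequencies) / 1000

# 2. Accumulator variable, no container needed
current_sum = 0
for sp, value in sampling_distribution.items():
    if sp >= 2.52:
        current_sum += value
p_value_acc = current_sum / 1000

# 3. Generator expression directly on the dictionary
p_value_gen = sum(value for sp, value in sampling_distribution.items() if sp >= 2.52) / 1000

print(p_value_list, p_value_acc, p_value_gen)  # all three give 0.005
```

All three count the same 2 + 3 = 5 occurrences; the container (or lack of one) is purely a style choice.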

This exercise is looking for how many sp are greater than or equal to 2.52. So if 2.52 had 2 copies and 2.53 had 3 copies, in total there are 5 instances of sp >= 2.52; it’s a sum of 2 from 2.52 plus 3 from 2.53. If there were no frequency dict and the data were given like 2.51…2.52,2.52,2.53,2.53,2.53, then len works.

len shows the number of items in a list. It is usually used for checking whether a sequence is empty or non-empty, for control flow in algorithms. For descriptive analytics, len is pretty useless; that’s why frequency dictionaries exist, and why people look at bar charts (the visual representation of a frequency dictionary) rather than a list of numbers, which is impossible to work with when unsorted, and not much better when sorted if it’s millions of elements long.

Usually a more space- and time-saving method (if a list of values has many duplicates) is to apply collections.Counter(iterable) to get a dictionary of unique values and how many of each there are, then look up the keys we want to analyze and get their values. This method is useful in tons of data structures and algorithms problems.
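A minimal sketch of the Counter approach, again on an assumed raw list of statistics rather than the exercise's actual data:

```python
from collections import Counter

# Raw list with duplicates (assumed data)
data = [2.50, 2.50, 2.52, 2.52, 2.53, 2.53, 2.53]

# Counter collapses the list into unique values mapped to their frequencies
counts = Counter(data)

# Then filter the keys of interest and sum their counts
tail = sum(v for k, v in counts.items() if k >= 2.52)
print(tail)  # 5
```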

To use len to answer this exercise, the duplicate copies of numbers must be left alone in a list without being summarized into a frequency dictionary, so no sampling_distribution dictionary would be used. Then we can use filter(lambda x: x >= 2.52, list_of_statistics) to keep the list elements >= 2.52 and len to count how many there are.
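A sketch of that len-based route on the same assumed raw list, to show it agrees with summing the frequency dictionary:

```python
# Raw list, left unsummarized (assumed data): 2 copies of 2.52, 3 copies of 2.53
list_of_statistics = [2.50, 2.50, 2.52, 2.52, 2.53, 2.53, 2.53]

# Keep only the elements >= 2.52, then count them with len
at_least = list(filter(lambda x: x >= 2.52, list_of_statistics))
print(len(at_least))  # 5 -- matches the 2 + 3 from the frequency dictionary
```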

Take another look at my first reply, with the example of l and l_dict, and see how they represent the same data differently. Analogously, frequencies would contain the [3, 1] from keys 2 and 3 respectively in l_dict_morethanequal2 = {2:3, 3:1}, filtered from l_dict.


Thank you for the incredibly detailed response hanqi, that makes a lot more sense. I’ll need to review this section more and experiment around with it.

This is an interesting question. On the previous screen we have a list of all the mean differences. Why don’t we simply filter that list and either sum() it or len() it? Instead we create a frequency table, filter the frequency table, and then find the sum. The frequency table might be helpful to have if you were going to build a graph of the permutation results, but otherwise it just seems like an extra step.
