# Why do we sum the frequencies list instead of counting it?

I just have a basic question here - why are we summing all the values and not using `len()` to count the frequencies?

Since there are no values in sampling_distribution greater than or equal to 2.52, the p_value will be 0 regardless, but it seems to me that using np.sum() is the incorrect approach in theory.

```python
l = [1, 1, 2, 2, 2, 3]

l_dict = {1: 2, 2: 3, 3: 1}
l_dict_morethanequal2 = {2: 3, 3: 1}

sum(l_dict_morethanequal2.values())  # 4, total occurrences of values >= 2
len(l_dict_morethanequal2)           # 2, counts number of keys, not how many times each key appears
```

`len` counts the number of elements in an iterable. What an element is depends on the object, and for an unfamiliar object that takes some study. In a dictionary, the elements are the key-value pairs; the count has nothing to do with the contents of the keys or values.
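A quick runnable check of that point, reusing the `l` and `l_dict` from the example above:

```python
l = [1, 1, 2, 2, 2, 3]
l_dict = {1: 2, 2: 3, 3: 1}

# len on the dict counts key-value pairs only
print(len(l_dict))           # 3

# summing the values recovers the total number of list elements
print(sum(l_dict.values()))  # 6, same as len(l)
```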


Here is the solution provided - note that frequencies is a list:

```python
frequencies = []
for sp in sampling_distribution.keys():
    if sp >= 2.52:
        frequencies.append(sampling_distribution[sp])
p_value = np.sum(frequencies) / 1000
```

Given that frequencies is a list, isn’t this the incorrect approach? If it were a dictionary, summing the values would certainly make sense.

`frequencies` is a list of values filtered (using the dict keys) from `sampling_distribution`. Each value says, for a given `sp >= 2.52`, how many times that statistic occurred. The datatype of the container does not matter; what matters is which values we are filtering for.

If you don’t want to use a container, an accumulator variable `current_sum` initialized to 0 works too. If you don’t want to initialize anything at all, operating directly on `sampling_distribution` also works:
`sum(value for sp, value in sampling_distribution.items() if sp >= 2.52)`

This exercise is looking for how many `sp` are greater than or equal to 2.52. So if 2.52 appeared 2 times and 2.53 appeared 3 times, in total there are 5 instances of `sp >= 2.52`: a sum of 2 from 2.52 plus 3 from 2.53. If there were no frequency dict and the data were given raw, like 2.51…2.52,2.52,2.53,2.53,2.53, then `len` would work.

`len` shows the number of items in a list. It is usually used for checking whether a sequence is empty or non-empty, for control flow in algorithms. For descriptive analytics, `len` is pretty useless; that’s why frequency dictionaries exist, and why people look at bar charts (the visual representation of a frequency dictionary) rather than at a raw list of numbers, which is impossible to work with when unsorted and not much better when sorted if it is millions of elements long.

Usually a more space- and time-saving method (when a list of values has many duplicates) is to apply `collections.Counter(iterable)` to get a dictionary of unique values and how many of each there are, then look up the keys we want to analyze and get their values. This method is useful in tons of data structures and algorithms problems.
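A minimal sketch of that `Counter` approach, on a hypothetical raw list of statistics with duplicates (the values are illustrative, not the course data):

```python
from collections import Counter

# Hypothetical raw statistics, with duplicates
statistics = [2.40, 2.50, 2.50, 2.52, 2.52, 2.53]

freq = Counter(statistics)  # maps each unique value to its count

# Sum the counts of the keys we care about
tail_count = sum(count for value, count in freq.items() if value >= 2.52)
print(tail_count)  # 3
```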

To use `len` to answer this exercise, the duplicate copies of numbers must be left alone in a list, without being summarized into a frequency dictionary, so no `sampling_distribution` dictionary would be used. Then we can `filter(lambda x: x >= 2.52, list_of_statistics)` for the list elements >= 2.52 and use `len` to count how many there are.
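A sketch of that `len`-based route, again on a hypothetical raw list (the name `list_of_statistics` and its contents are illustrative):

```python
# Hypothetical raw statistics, left unsummarized
list_of_statistics = [2.40, 2.50, 2.50, 2.52, 2.52, 2.53]

# Keep only the elements at or beyond the threshold, then count them
extreme = list(filter(lambda x: x >= 2.52, list_of_statistics))
print(len(extreme))  # 3
```

With raw data like this, `len` and the frequency-dict sum give the same answer; they are just two representations of the same count.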

Take another look at my first reply with the example of `l` and `l_dict`, and see how they represent the same data differently. Analogously, `frequencies` would contain the `[3, 1]` from key 2 and key 3 respectively in `l_dict_morethanequal2 = {2: 3, 3: 1}`, filtered from `l_dict`.


Thank you for the incredibly detailed response hanqi, that makes a lot more sense. I’ll need to review this section more and experiment around with it.