Significance Testing 106-8 questions

Screen Link:
https://app.dataquest.io/m/106/significance-testing/8/p-value

My Code:

import numpy as np

frequencies = []
for key in sampling_distribution:
    if key >= 2.52:
        frequencies.append(key)

p_value = np.sum(frequencies) / 1000

Is there any difference from the solution?:

frequencies = []
for sp in sampling_distribution.keys():
    if sp >= 2.52:
        frequencies.append(sampling_distribution[sp])
p_value = np.sum(frequencies) / 1000

Why use sampling_distribution.keys() when we know that iterating over a dictionary iterates over its keys by default? Is it just to be extra clear, or am I missing something?

Also, we are trying to find the number of times a value of 2.52 or higher appeared in our simulations, where 2.52 is the difference in mean weight loss between the two groups in our test. So when we check whether a key is greater than or equal to 2.52, shouldn't we also count keys that are <= -2.52? Something like this:

frequencies = []
for sp in sampling_distribution.keys():
    if sp >= 2.52 or sp <= -2.52:
        frequencies.append(sampling_distribution[sp])
p_value = np.sum(frequencies) / 1000

Thanks.

Hello @probot

I don’t know if I’m missing something, but your code appends the dictionary keys (the mean differences themselves) rather than the values (their frequencies).

To calculate the p_value we need to sum the frequencies of all outcomes that match the hypothesis. In this case that's sp >= 2.52, where 2.52 is the observed mean difference we found in the previous mission.

Think of the p_value as the fraction of the sampling distribution that is at least as extreme as the observed difference.
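
As a minimal sketch, assume sampling_distribution maps each simulated mean difference to the number of times it occurred across the 1,000 simulations (the dictionary contents below are made up for illustration):

import numpy as np

# Hypothetical sampling distribution: mean difference -> count out of 1000 simulations
sampling_distribution = {2.31: 40, 2.52: 18, 2.60: 7, 3.10: 2}

frequencies = []
for sp in sampling_distribution.keys():
    if sp >= 2.52:
        # Append the frequency (the value), not the mean difference (the key)
        frequencies.append(sampling_distribution[sp])

p_value = np.sum(frequencies) / 1000
print(p_value)  # (18 + 7 + 2) / 1000 = 0.027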

As for for key in dict_name versus for key in dict_name.keys(): the output is the same either way. Iterating over the dictionary directly visits its keys, and .keys() just returns a view of those same keys, so the performance difference is negligible. Writing .keys() is mainly a readability choice that makes it explicit you are iterating over keys.
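
If you want to convince yourself that the two loops are equivalent, here is a quick sanity check (the dictionary is made up for illustration):

d = {2.31: 40, 2.52: 18, 2.60: 7}

# Both forms visit exactly the same keys in the same order
assert list(d) == list(d.keys())

# Summing the values gives the same total either way
assert sum(d[key] for key in d) == sum(d[key] for key in d.keys()) == 65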

And I think @kakoori helps answer the other part of your question.
