
Chi squared test lesson 4

Screen Link:
https://app.dataquest.io/m/99/chi-squared-tests/4/generating-a-distribution

My Code:

chi_squared_values = []
from numpy.random import random
import matplotlib.pyplot as plt
count_male=0
count_female=0
expected=16280.5
for i in range (1000):
    vector=random((32561,))
    for num in vector:
        if num<0.5:
            count_male+=1
        else:
            count_female+=1
    
    male_diff=(count_male-expected)**2/expected
    female_diff=(count_female-expected)**2/expected
    chi_squared=male_diff+female_diff
    chi_squared_values.append(chi_squared)
    
plt.hist(chi_squared_values)

What I expected to happen:
dq

What actually happened:
real

Solution code:

chi_squared_values = []
from numpy.random import random
import matplotlib.pyplot as plt

for i in range(1000):
    sequence = random((32561,))
    sequence[sequence < .5] = 0
    sequence[sequence >= .5] = 1
    male_count = len(sequence[sequence == 0])
    female_count = len(sequence[sequence == 1])
    male_diff = (male_count - 16280.5) ** 2 / 16280.5
    female_diff = (female_count - 16280.5) ** 2 / 16280.5
    chi_squared = male_diff + female_diff
    chi_squared_values.append(chi_squared)

plt.hist(chi_squared_values)

Looking at the solution code, I'm well aware that I didn't take the best approach to generating the distribution, but I can't see where my logic is wrong. Can anyone show me, please?
Thx in advance =)

Your logic is almost entirely correct, except for one small detail:

count_male=0
count_female=0
expected=16280.5
for i in range (1000):

Because you have your count_male and count_female initialized outside the for loop, their values accumulate over all 1000 iterations.

Instead, you want them reset to zero at the start of every iteration. It's just a small change:

expected=16280.5
for i in range (1000):
    count_male=0
    count_female=0
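
To make the effect of that change visible, here is a minimal sketch that runs the simulation both ways. The function name `simulate`, the `reset_each_iteration` flag, and the use of a seeded `numpy.random.default_rng` are my own assumptions for illustration (the original code uses `numpy.random.random`), and the counting is done with `np.count_nonzero` rather than an inner Python loop only to keep it fast.

```python
import numpy as np

N_ROWS = 32561
EXPECTED = N_ROWS / 2  # 16280.5, as in the original code


def simulate(n_iter=1000, reset_each_iteration=True, seed=0):
    """Return one chi-squared value per iteration (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded only so the run is repeatable
    values = []
    count_male = 0
    count_female = 0
    for _ in range(n_iter):
        if reset_each_iteration:
            # The fix: start every iteration with fresh counters.
            count_male = 0
            count_female = 0
        vector = rng.random(N_ROWS)
        n_male = int(np.count_nonzero(vector < 0.5))
        count_male += n_male
        count_female += N_ROWS - n_male
        male_diff = (count_male - EXPECTED) ** 2 / EXPECTED
        female_diff = (count_female - EXPECTED) ** 2 / EXPECTED
        values.append(male_diff + female_diff)
    return values


fixed = simulate(reset_each_iteration=True)
buggy = simulate(reset_each_iteration=False)

print("largest value with reset:   ", max(fixed))
print("last value without reset:   ", buggy[-1])
```

Without the reset, the counts keep growing while `expected` stays at 16280.5, so the chi-squared values climb with every iteration instead of clustering near zero.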

Thanks so much, this makes a lot of sense. =)))))


What's wrong with my technique for generating the distribution? If I apply the technique you showed, I get the correct distribution. I used the same kind of technique that we learned in significance testing to check for random chance.


The random number generator you used produces only integers, and its arguments are:

  1. low, including this number
  2. high, not including this number
  3. length, length of the array to be generated.

Since the upper bound is not included, and the only integer between 0 and .999999 is 0, you are creating a giant list of zeros. However, if you put in (0,2,32561), you would get a list of 1's and 0's randomly assigned, and you could skip the step of rounding up or down. Good job trying out other methods of randomization!
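
A quick way to check this yourself is sketched below. The code being discussed isn't shown in the thread, so I'm assuming the generator is `numpy.random.randint`, whose arguments (`low`, `high`, `size`) match the description above. The variable names are made up for illustration.

```python
import numpy as np

np.random.seed(0)  # seeded only so the run is repeatable

# high=1 is excluded, so 0 is the only value that can come out.
all_zeros = np.random.randint(0, 1, 32561)

# high=2 is excluded, so each element is 0 or 1.
coin_flips = np.random.randint(0, 2, 32561)

male_count = int(np.count_nonzero(coin_flips == 0))
female_count = int(np.count_nonzero(coin_flips == 1))

print("distinct values with (0, 1, 32561):", np.unique(all_zeros))
print("distinct values with (0, 2, 32561):", np.unique(coin_flips))
print("male_count + female_count =", male_count + female_count)
```

With the second call, the counts can go straight into the chi-squared formula without thresholding at 0.5 first.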