Distribution Algorithm

Chi-squared tests: Lecture 4

What’s wrong with my distribution technique? When I apply the technique Dataquest shows in the answer, I get the correct distribution. I then applied the same kind of algorithm we learned in significance testing to plot the distribution, but in this case the chi-squared value comes out very high in every simulation. I know the code is right, so my question is: why can’t we use this distribution algorithm here?

import numpy as np
import matplotlib.pyplot as plt

chi_squared_values = []
for i in range(1000):
    gender = np.random.randint(0,1,32561)   # array with one entry per person (32561 males and females in total)
    for index, item in enumerate(gender):
        if np.random.randn() >= 0.5:                # randomly assign each entry to male or female
            gender[index] = 1    # this is for female
        else:
            gender[index] = 0    # this is for male
            
    # counting total number of male and female        
    gender = list(gender)
    female = gender.count(1)
    male = gender.count(0)

    # chi squared difference
    female_difference = (female - 16280.5)**2 / 16280.5
    male_difference = (male - 16280.5)**2 / 16280.5
    chi_squared_difference = female_difference + male_difference
    
    chi_squared_values.append(chi_squared_difference)

plt.style.use('fivethirtyeight')
plt.figure(figsize=(6,3), dpi= 95)
plt.hist(chi_squared_values)
plt.show()

Hi @rakibulislammm:

Please format your code (not just the screenshot) and provide a question link according to these guidelines.

Got it. Please check now. It should be fine this time.

@masterryan.prof

Could you please answer my question?

Hi @rakibulislammm:

You used np.random.randint where the instructions call for np.random.random. The two produce very different output.
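A minimal sketch of how these NumPy functions differ, assuming a fixed seed purely for reproducibility (the seed, the helper chi_squared, and the variable names are illustrative and not part of the mission). The question's code also thresholds np.random.randn(), which draws from a standard normal distribution rather than a uniform one, so only about 31% of draws land at or above 0.5. That imbalance alone is enough to push the chi-squared value into the thousands.

```python
import numpy as np

np.random.seed(0)  # illustrative seed, only for reproducibility
n = 32561
expected = n / 2  # 16280.5 expected per gender

# randint(0, 1, n): the upper bound is exclusive, so every value is 0
ints = np.random.randint(0, 1, n)

# randn(): standard normal draws (mean 0, std 1), not uniform on [0, 1)
normal_share = np.mean(np.random.randn(n) >= 0.5)

# random(n): uniform draws on [0, 1), so about half land at or above 0.5
uniform_share = np.mean(np.random.random(n) >= 0.5)

def chi_squared(female_count):
    """Chi-squared statistic for a two-way split with equal expected counts."""
    male_count = n - female_count
    return ((female_count - expected) ** 2 + (male_count - expected) ** 2) / expected

chi_normal = chi_squared(normal_share * n)    # split driven by randn()
chi_uniform = chi_squared(uniform_share * n)  # split driven by random()
print(ints.max(), normal_share, uniform_share, chi_normal, chi_uniform)
```

Swapping randn() for a uniform draw is what brings the simulated statistic back to small values.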

In this line you are suggesting that gender is a dictionary, which is not correct. Try printing the output of gender.

You are reassigning values back to gender, and I don't know why you would want to do this when it is not a dictionary but a list.

Since the mission does not require styling, you may add it in your own notebook, but do not submit it in the mission because it will raise an error (the checker expects a specific output as the answer).

Here is a modified solution that you could try out.

import numpy as np
import matplotlib.pyplot as plt

chi_squared_values = []
excepted_val = 16280.5  # expected count per gender: 32561 / 2
for i in range(1000):
    # We initialise counter as a dictionary so that we can easily
    # retrieve the count of a particular gender based on the key
    # (i.e. the gender itself).
    counter = {'male' : 0, 'female' : 0}
    for val in np.random.random(32561):
        if val < 0.5 :
            counter['male'] += 1
        else:
            counter['female'] += 1
    male_diff = (counter['male'] - excepted_val)**2 / excepted_val
    female_diff = (counter['female'] - excepted_val)**2 / excepted_val
    chi_squared_difference = male_diff + female_diff
    chi_squared_values.append(chi_squared_difference)

plt.hist(chi_squared_values)
plt.show()

Hope this clarifies.

Thanks for explaining this, especially the part about initializing counter as a dictionary so we can count male and female easily. :+1: :clap:

@jinyushan1990: Good to hear. @rakibulislammm: If you found my answer useful in clarifying your doubts, could you mark it as the solution so others with similar doubts can quickly find the answer? Thanks

My problem is not with the code I posted; it’s about the algorithm. I probably did not explain it properly, and I am sorry for that. In lecture 5 of significance testing, we followed the code below.

mean_difference = 2.52
print(all_values)
mean_differences = []
for i in range(1000):
    group_a = []
    group_b = []
    for value in all_values:
        assignment_chance = np.random.rand()
        if assignment_chance >= 0.5:
            group_a.append(value)
        else:
            group_b.append(value)
    iteration_mean_difference = np.mean(group_b) - np.mean(group_a)
    mean_differences.append(iteration_mean_difference)
    
plt.hist(mean_differences)
plt.show()

If you look at the part of that code shown below, Dataquest used a different procedure to run the simulation than the one used in lecture 4 of chi-squared.

Simulation code from the significance testing lecture

for i in range(1000):
    group_a = []
    group_b = []
    for value in all_values:
        assignment_chance = np.random.rand()
        if assignment_chance >= 0.5:
            group_a.append(value)
        else:
            group_b.append(value)

Simulation code from the chi-squared lecture

for i in range(1000):
    sequence = random((32561,))
    sequence[sequence < .5] = 0
    sequence[sequence >= .5] = 1

My question is this: when I apply the simulation procedure from significance testing to the chi-squared simulation (as in my first post), I do not get the correct distribution, even though the procedure should be the same in both cases, because in both we are just splitting the data into two groups.

Thank you so much for all your effort. I really appreciate it.

We can use a similar version of the code you mentioned above to achieve the result. The issue isn’t with the algorithm but with how you implemented certain things: you need to adapt the algorithm to your current problem, not leave it as is and expect it to “magically” work for a totally different problem.

male_diff = (len(group_b) - excepted_val)**2 / excepted_val
female_diff = (len(group_a) - excepted_val)**2 / excepted_val

If you use append() and 2 lists, you will need to find the length of the two lists, similar to what the dictionary above is doing for you by taking the count.

You need to use the correct function (np.random.random), as specified earlier.

You should also iterate over the output of np.random.random to generate a vector/list of 32561 0s and 1s, as one of the prompts specifies:

  • Pass (32561,) into the numpy.random.random function to get a vector with 32561 elements.

  • For each of the numbers, if it is less than .5, replace it with 0, otherwise replace it with 1.
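The two prompt steps above can be sketched without an explicit Python loop. This is only an illustration under assumptions: the names female_count, male_count, and expected are not from the mission, and 1 is taken to mean female, as in the original question.

```python
import numpy as np

# Step 1: a vector of 32561 uniform draws on [0, 1)
sequence = np.random.random((32561,))

# Step 2: replace values below .5 with 0 and all others with 1
sequence[sequence < .5] = 0
sequence[sequence >= .5] = 1

# Count each category (1 = female, 0 = male, as in the question)
female_count = int(sequence.sum())
male_count = len(sequence) - female_count

# Chi-squared statistic against an even split
expected = len(sequence) / 2  # 16280.5
chi_squared = ((male_count - expected) ** 2 + (female_count - expected) ** 2) / expected
print(male_count, female_count, chi_squared)
```

Counting the 1s here plays the same role as taking len() of the two lists or reading the counts out of the dictionary.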

Why do you need to find the mean here? Remember to modify your code to fit the question's requirements.

Here is how I adapted the code.

import numpy as np
import matplotlib.pyplot as plt

chi_squared_values = []
excepted_val = 16280.5  # expected count per gender: 32561 / 2
for i in range(1000):
    group_a = []  # draws >= 0.5, counted as female
    group_b = []  # draws < 0.5, counted as male
    for value in np.random.random(32561):
        if value >= 0.5:
            group_a.append(value)
        else:
            group_b.append(value)
    # Only the group sizes matter here, not the values themselves
    male_diff = (len(group_b) - excepted_val)**2 / excepted_val
    female_diff = (len(group_a) - excepted_val)**2 / excepted_val
    chi_squared_difference = male_diff + female_diff
    chi_squared_values.append(chi_squared_difference)

plt.hist(chi_squared_values)
plt.show()
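One hedged way to sanity-check a simulation like this: with two categories the statistic has one degree of freedom, so the simulated values should average close to 1, and roughly 5% of them should exceed the usual critical value of about 3.84. The sketch below counts with NumPy instead of Python lists to keep it fast; the names mean_value and share_above_critical are illustrative assumptions.

```python
import numpy as np

n = 32561
expected = n / 2  # 16280.5
chi_squared_values = []
for _ in range(1000):
    # Uniform draws; values at or above 0.5 are counted as female
    female = np.count_nonzero(np.random.random(n) >= 0.5)
    male = n - female
    chi_squared_values.append(
        ((male - expected) ** 2 + (female - expected) ** 2) / expected
    )

# A chi-squared distribution with 1 degree of freedom has mean 1
mean_value = np.mean(chi_squared_values)

# About 5% of values should exceed the 5% critical value (~3.84)
share_above_critical = np.mean(np.array(chi_squared_values) > 3.84)
print(mean_value, share_above_critical)
```

If the mean comes out in the hundreds or thousands instead, the random draw feeding the split is the first thing to inspect.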