Chi-squared logic

I don’t understand the logic of the solution. The final return is a list, chi_squared_values, but in the for loop we come up with only one number instead of a list.
Screen Link: https://app.dataquest.io/m/99/chi-squared-tests/4/generating-a-distribution

My understanding of the code below is that we randomly draw 32561 numbers between 0 and 1; if a value is > 0.5 we assign it as male, and the others are assigned as female. Then we count the males and females, which gives 2 numbers. From these 2 numbers we calculate male_diff and female_diff and add them together, which gives one number, chi_squared. After appending it to chi_squared_values, why does it become a list eventually…
My Code:

from numpy.random import random

chi_squared_values = []  # created once, before the loop
for i in range(1000):
    sequence = random((32561,))
    sequence[sequence < .5] = 0
    sequence[sequence >= .5] = 1
    male_count = len(sequence[sequence == 0])
    female_count = len(sequence[sequence == 1])
    male_diff = (male_count - 16280.5) ** 2 / 16280.5
    female_diff = (female_count - 16280.5) ** 2 / 16280.5
    chi_squared = male_diff + female_diff
    chi_squared_values.append(chi_squared)

This is the same pattern as

l = []
for i in range(3):
    l.append(i)

A list was created and single values are appended to it.

What does "it" refer to in "why it becomes a list"?

If you mean chi_squared, it is a single value on every pass through the loop (it is overwritten each time), and chi_squared_values is a list from before the loop starts until after it ends.
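To make the difference visible, here is a minimal sketch that runs the same calculation on three made-up male counts (hypothetical numbers, not output from the exercise) and prints the type of chi_squared and the length of chi_squared_values on each pass:

```python
# Hypothetical male counts from three simulated populations of 32,561 people
# (made-up numbers, chosen only to keep the example deterministic).
hypothetical_male_counts = [16300, 16250, 16281]
total = 32561
expected = total / 2  # 16280.5, the expected count for each group

chi_squared_values = []  # the list exists before the loop starts
for male_count in hypothetical_male_counts:
    female_count = total - male_count
    male_diff = (male_count - expected) ** 2 / expected
    female_diff = (female_count - expected) ** 2 / expected
    chi_squared = male_diff + female_diff   # one float, overwritten each pass
    chi_squared_values.append(chi_squared)  # the list grows by one item
    print(type(chi_squared).__name__, len(chi_squared_values))
```

The printed type stays the same on every pass while the length goes up by one, which is all that "appending a single value to a list" means.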

If you are referring to the output of plt.hist:

(array([785., 128.,  46.,  21.,  11.,   6.,   1.,   1.,   0.,   1.]),
 array([3.07115875e-05, 1.37455852e+00, 2.74908633e+00, 4.12361414e+00,
        5.49814195e+00, 6.87266976e+00, 8.24719757e+00, 9.62172538e+00,
        1.09962532e+01, 1.23707810e+01, 1.37453088e+01]),
 <a list of 10 Patch objects>)

You can ignore this until you want to manually edit the matplotlib charts. The docs explain the return values: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html
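If you are curious about the first two arrays, they are the per-bin counts and the bin edges. A rough sketch using numpy.histogram, which the matplotlib docs say plt.hist uses for the binning, shows the same shape of result without drawing anything. The data here is stand-in data drawn from a chi-squared distribution with one degree of freedom (my assumption, not the exercise's values), and bins=10 is assumed to match the 10 bins in the output above:

```python
import numpy as np

# Stand-in data: 1000 draws from a chi-squared distribution (1 degree of freedom).
# This is an assumption for illustration, not the values from the exercise.
rng = np.random.default_rng(1)
values = rng.chisquare(df=1, size=1000)

# counts: how many values fall in each bin; bin_edges: where the bins start and end.
counts, bin_edges = np.histogram(values, bins=10)

print(len(counts), len(bin_edges))  # one more edge than there are bins
print(counts.sum())                 # every value lands in exactly one bin
```

This is also why the counts in the plt.hist output above add up to 1000: there is one count per bin and one chi-squared value per loop iteration.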

Thank you for your reply! I was confused about the connection between for i in range(1000) and the rest of the code… but now I have figured it out. We just repeat calculating chi_squared 1000 times. That is why we have a list at the end… I did not see i used in the rest of the code… that is why I was confused.

This sounds like you only realized there was a list after the whole loop block. chi_squared_values was initialized as an empty list before the loop began, so you can expect that list to be used for something. The same confusion may arise when list comprehensions are used, because then no list is initialized before the loop.
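For comparison, here is a sketch of how the same simulation might look as a list comprehension. The helper name simulate_chi_squared and the use of numpy's default_rng with a fixed seed are my own choices for illustration, not part of the exercise solution:

```python
import numpy as np

def simulate_chi_squared(rng, n_people=32561):
    """Simulate one population and return its single chi-squared value."""
    sequence = rng.random(n_people)            # floats in [0, 1)
    male_count = int(np.sum(sequence < 0.5))   # values below 0.5 count as male
    female_count = n_people - male_count
    expected = n_people / 2
    male_diff = (male_count - expected) ** 2 / expected
    female_diff = (female_count - expected) ** 2 / expected
    return male_diff + female_diff

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# No empty list is created first; the comprehension builds the list directly.
chi_squared_values = [simulate_chi_squared(rng) for _ in range(1000)]
print(len(chi_squared_values))
```

The result is the same kind of object as in the explicit loop, a list of 1000 single values, but the list never appears as an empty [] beforehand.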

It depends on how clearly the coder wants to express the idea. In this case, the loop could be written as for _ in range(1000), which signals that the loop variable is not used. Often 1000 is replaced by len(x), so you iterate as many times as another object is long, which makes the code dynamic. Another common pattern is for object in collection, where you don’t care about position or order and just want to process every object in a collection.
Another pattern is for index, item in enumerate(collection), where you want the index in addition to every item in the collection. If you only want the index to work on another collection in parallel, for item1, item2 in zip(collection1, collection2) aligns the two automatically and saves you the need to produce an index. If you still need the index to point to something other than the two collections, for index, (item1, item2) in enumerate(zip(collection1, collection2)) is the syntax. You can learn about tuple unpacking to understand why it looks like this. You can even zip three collections.
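A small sketch of these loop patterns side by side, using two made-up lists (names and ages are hypothetical example data):

```python
names = ["Ann", "Bob", "Cy"]  # hypothetical example data
ages = [31, 45, 27]

# Repeat a fixed number of times; "_" signals the loop variable is unused.
repeats = []
for _ in range(3):
    repeats.append("tick")

# Index plus item from one collection.
labelled = []
for index, name in enumerate(names):
    labelled.append(f"{index}: {name}")

# Two collections in parallel, no index needed.
pairs = []
for name, age in zip(names, ages):
    pairs.append((name, age))

# Index plus both items; the inner parentheses unpack the tuple zip yields.
rows = []
for index, (name, age) in enumerate(zip(names, ages)):
    rows.append((index, name, age))

print(repeats, labelled, pairs, rows, sep="\n")
```

In every case a list is created before the loop and one item is appended per pass, the same as chi_squared_values in the exercise.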

Hi candiceliu93,

Thank you for raising the question here.
I was wondering if you could help explain why we need to assign values > 0.5 as male and values < 0.5 as female. I am still a bit lost about this part.

Thanks,

I’ll just run down the actions performed in each part of the code:

for i in range(1000):
This starts a for loop; we are going to perform the following actions 1,000 times.
sequence = random((32561,))
This part creates a list-like array with 32,561 values, each a random decimal between 0 and 1. Each value in this array represents a person. Since there is a 0.5 probability of any member of the population being male and 0.5 of being female, we assign half of the range of possible numbers to men and the other half to women.
sequence[sequence < .5] = 0
sequence[sequence >= .5] = 1
This code keeps the sequence we just generated intact but changes its values in place to either 0 or 1, representing a man or a woman. Other labels could be used instead; the new values just need to be consistent so they can be counted. Basically, we just created a new population of equal size and determined each person's gender with a simulated coin flip.
male_count = len(sequence[sequence == 0])
female_count = len(sequence[sequence == 1])
This counts the total number of men and of women.
male_diff = (male_count - 16280.5) ** 2 / 16280.5
female_diff = (female_count - 16280.5) ** 2 / 16280.5
chi_squared = male_diff + female_diff
This calculates the chi-squared value for this one simulated population.
chi_squared_values.append(chi_squared)
This adds that chi-squared value to a list that will hold 1,000 values when the loop is done.

I think the confusing part here is how the random number generation works. We haven’t gone over random number generation specifically, but we have seen it a few times now and each time it seems to be used a little differently.
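Here is a rough sketch of just the random number part, in case it helps. It uses numpy's default_rng with a fixed seed (my choice, so the run is repeatable) rather than the random function from the exercise, but the idea is the same: draw uniform decimals, then split them at 0.5.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded only so the sketch is repeatable
sequence = rng.random((32561,))   # 32,561 floats drawn uniformly from [0, 1)

print(sequence[:5])               # a peek at the raw decimals

# Comparing an array to a number gives an array of True/False values.
is_female = sequence >= 0.5       # same split as the exercise: >= 0.5 becomes 1
female_count = int(is_female.sum())        # True counts as 1 when summed
male_count = sequence.size - female_count

share_female = female_count / sequence.size
print(male_count, female_count, share_female)
```

Because the numbers are uniform on [0, 1), about half land on each side of 0.5, so the two counts come out close to 16280.5 each but rarely exactly equal. That small random gap is what the chi-squared value measures.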