Hi! I am confused about the chi-squared test. I believe I did something wrong in my code, and that is why my final answers (chi_squared and p-value) are the opposite of the solution's. The 10 chi-squared values I got are really close to each other, and the p-values I got are all 0.
I am not sure that I understand the logic of the code and the process of getting the chi-squared value. In the previous course, we learned the chi-squared test for categorical data. This test enables us to determine the statistical significance of observing a set of categorical values. In this project, we try to determine whether there is a significant relationship between terms_used and question value, but how does that help to answer “How often new questions are repeats of older questions?”
What guesses do you have?
When you see something getting repeated, and you have a loop, one guess is the loop is seeing the same things again and again.
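You are using list as a variable name. list is a Python built-in (which is why it is colored), not a keyword, but built-in names should still not be reused as variables, because doing so shadows the built-in.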
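To show why reusing list as a loop variable is risky, here is a minimal sketch. The function names and the sample terms are made up for illustration and are not from the project or its solution.

```python
def loop_with_shadowing(terms):
    """Reuse the name `list` as the loop variable (hypothetical example)."""
    for list in terms:
        pass
    try:
        # `list` now refers to the last term (a str), not the built-in type.
        return list("abc")
    except TypeError as exc:
        return f"TypeError: {exc}"


def loop_without_shadowing(terms):
    """Same loop with a neutral variable name, so the built-in stays usable."""
    for term in terms:
        pass
    return list("abc")


sample_terms = ["alpha", "beta"]  # made-up terms
print(loop_with_shadowing(sample_terms))
print(loop_without_shadowing(sample_terms))
```

Shadowing does not change any chi-squared result by itself. It only breaks later calls to the built-in within the same scope.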
Why would you expect this section of the exercise to answer that?
I looked through the mission and that question was asked on page 5. Recycled Questions.
On page 6, right before we do the chi-sq test of independence between the value of a question and whether a word appears in it, a new question was posed: “Let's say you only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.”
This is a new question that has nothing to do with repeats.
The results of the chi-sq test have to be interpreted with care. A low p-value just means the observed counts deviate from the expected ones. They could deviate in 2 directions: a term appearing in high value questions more often than expected, or appearing in low value questions more often than expected. Although both are useful for decision making, I believe the author wants the students to look for the “appears in high value more often than expected” case, since that is the goal of the study. To distinguish these two, further processing is needed.
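One possible form of that further processing is to compare the observed high value count with its expected value. This is only a sketch: deviation_direction is a made-up helper, not part of the project solution. The first call uses the numbers from the 2x2 example table in this thread, and the second call uses invented counts.

```python
def deviation_direction(high_count, low_count, high_value_count, low_value_count):
    """Report which way a term's usage deviates from its expected counts."""
    total = high_value_count + low_value_count
    # Share of all questions that contain the term.
    total_prop = (high_count + low_count) / total
    expected_high = total_prop * high_value_count
    if high_count > expected_high:
        return "more often in high value questions than expected"
    if high_count < expected_high:
        return "more often in low value questions than expected"
    return "exactly as expected"


# Numbers from the example table used in this thread.
print(deviation_direction(6662, 15128, 7841, 24720))
# Invented counts for a term that leans towards low value questions.
print(deviation_direction(1, 9, 100, 300))
```

The direction only matters for terms whose chi-sq test already shows a significant deviation.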
I expected the p-values to be higher than 0.05, just like in the solution.
I changed list to i in the for loop, but the result has not changed. The p-values are still 0.0.
What does the chi-squared value mean in this project? We are trying to use the chi-squared value to answer whether those 10 terms have a significant relationship with high-value and low-value questions. Did I understand it right?
Is the problem that you did exactly the same thing as the solution and got the same intermediate output all the way until the last chi-sq step? Did you check whether everything you did matched the solution up to that point? What if you copy-paste the solution and run it? That will show whether it’s your particular software version that’s causing the differences.
Now I looked again and see the statistic values in your 1st post are not all the same, so my previous point about “something getting repeated” is wrong.
Looking through the course again, I feel that I was a code-running robot and that there was a lot of missing stats knowledge to fill in from other sources, like https://philschatz.com/statistics-book/contents/m47082.html. I also wish the author had explained more clearly, with diagrams and links back to concepts from previous lessons, why we’re following the instructions.
I suggest that for such important fundamental statistics knowledge you consult more sources to make sure you understand the theory, rather than just being satisfied with successfully running scipy.stats.chisquare or scipy.stats.chi2_contingency.
Yes, it’s something like that. For this part of the project, the key is finding the expected values before doing the chi-square goodness-of-fit test using scipy.stats.chisquare. Note that this is not a test of independence. I’m only using a 2x2 matrix to help us find the expected values and to explain the project’s instructions and solution, even though this exercise has nothing to do with a test of independence or scipy.stats.chi2_contingency.
Imagine a 2x2 table FOR EACH TERM. What you see here are observed values, and the job is to calculate the expected values for the cells of 6662 and 15128.
In the columns (or rows; whether you arrange it transposed or not doesn’t matter), substitute male with “this term is used in a question” and female with “this term is not used in a question”. In the rows, substitute >50k with “is a high value question” and <=50k with “is a low value question”.
From preprocessing, you already know high_value_count (analogous to the 7841 marginal cell) and low_value_count (analogous to the 24720 marginal cell). Together they let you calculate the grand total in the bottom right cell (done through jeopardy.shape in the solution).
In the solution, the high_count and low_count returned by count_usage(term) correspond to the Male >50k cell of 6662 (meaning the intersection of 2 events: 1. term appears in question, 2. question is high value) and the Male <=50k cell of 15128 (meaning 1. term appears in question, 2. question is low value). Adding high_count + low_count gives the total analogous to 21790, which the instructions ask you to divide by the bottom right 32561 to get total_prop.
Later, this total_prop is multiplied by high_value_count (7841) and low_value_count (24720) to get the expected values for the observed values of 6662 and 15128 respectively. Finally, the chi-square goodness-of-fit test is run for each pair of expected values calculated through the procedure above, together with the matching pair of observed values.
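The procedure above can be sketched as follows. This is not the solution code: chi_squared_for_term and the term names are made up, the first pair of counts comes from the example table in this thread, and the second pair is invented to show a small-count term.

```python
from scipy.stats import chisquare

# Marginal totals, borrowed from the example table in this thread.
high_value_count = 7841
low_value_count = 24720


def chi_squared_for_term(high_count, low_count):
    """Goodness-of-fit test for one term's (high, low) observed counts."""
    total = high_value_count + low_value_count
    total_prop = (high_count + low_count) / total
    expected = [total_prop * high_value_count, total_prop * low_value_count]
    observed = [high_count, low_count]
    statistic, p_value = chisquare(observed, expected)
    return statistic, p_value


# Hypothetical terms mapped to (high_count, low_count).
observed_counts = {
    "term_a": (6662, 15128),  # cells from the example table
    "term_b": (1, 2),         # invented small counts
}

results = {term: chi_squared_for_term(*counts) for term, counts in observed_counts.items()}
for term, (statistic, p_value) in results.items():
    print(term, statistic, p_value)
```

Note that the two expected values always add up to high_count + low_count, which is what scipy.stats.chisquare needs for the observed and expected totals to agree.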
To understand why multiplying total_prop by the marginals gives the expected value, think about probability theory. If events A and B are independent, P(A and B) = P(A) x P(B). For the top left cell, this translates to: expected number of high value questions the term appears in / total number of questions = (observed number of high value questions / total number of questions) x (observed number of questions the term appears in / total number of questions). The other 3 cells have their own physical meanings. Calculating the right-hand side and multiplying it by the denominator of the left-hand side gives you the expected value (not observed!) of one of the 4 cells in the 2x2 table. For the top left cell, the left-hand side is the expected number of high value questions the term appears in / total number of questions (x/32561). Since we want the numerator (just the counts to feed into scipy.stats.chisquare, not the probability), we essentially want the product of 3 terms: P(A) x P(B) x total number of questions. A and B correspond to the row and column respectively, and P(A) or P(B) is the probability of that particular event.
If we treat event A as “high value question” (not A means low value question) and event B as “term appears in question” (not B means the term does not appear in the question), then P(B) = number of questions the term appears in / total number of questions, which is the total_prop in this exercise (21790/32561). P(A) is the number of high value questions / total number of questions. Note that this denominator is exactly the 3rd term in P(A) x P(B) x total number of questions, so we can just multiply them together, which gives the number of high value questions (7841/32561 * 32561 = 7841). This is exactly high_value_count. Finally, high_value_count (7841) x total_prop (21790/32561) gives the expected number of high value questions the term appears in (the top left cell with 6662 observed). You can repeat the above analysis to find the expected values of the other 3 cells (1179, 15128, 9592) if you want to do a test of independence. In fact, scipy.stats.chi2_contingency returns the expected values given the observed ones, so you don’t even need to follow the instructions in this mission and do all the tedious loops. But this question asked for a goodness-of-fit test that uses scipy.stats.chisquare, which only needs the expected values of the left 2 cells (6662, 15128), meaning “term appears in high value question” and “term appears in low value question”.
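As a quick check of the point about scipy.stats.chi2_contingency, the sketch below feeds it the four observed cells from the example table and compares the top left expected value with the hand calculation. The variable names are my own.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed 2x2 table from the example in this thread.
# Rows: high value, low value. Columns: term used, term not used.
observed = np.array([
    [6662, 1179],
    [15128, 9592],
])

# For a 2x2 table scipy applies Yates' continuity correction by default.
# That changes the statistic, but not the expected frequencies.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Hand calculation for the top left cell: high_value_count x total_prop.
by_hand_top_left = 7841 * (21790 / 32561)

print(expected)
print(by_hand_top_left)
```

The left column of expected holds the two values that the goodness-of-fit version of the exercise passes to scipy.stats.chisquare.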
With these 2 expected values, the chi-square goodness-of-fit test tells you how much the observed values deviate from the expected ones, summarized as a single chi-sq value. You then compare this value against a chi-sq distribution with the correct degrees of freedom to find the p-value and draw the conclusions I mentioned in my 1st reply.
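To make that last step concrete, here is a rough sketch that computes the chi-sq value by hand, looks up the p-value with scipy.stats.chi2.sf, and compares both with scipy.stats.chisquare. It assumes the usual goodness-of-fit degrees of freedom (number of categories minus 1) and reuses the numbers from the example table.

```python
from scipy.stats import chi2, chisquare

observed = [6662, 15128]
total_prop = 21790 / 32561
expected = [total_prop * 7841, total_prop * 24720]

# Chi-sq statistic: sum of (observed - expected)^2 / expected over the cells.
manual_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 2 categories, so 1 degree of freedom.
dof = len(observed) - 1

# Survival function = probability of a value at least this large.
manual_p = chi2.sf(manual_stat, dof)

scipy_stat, scipy_p = chisquare(observed, expected)
print(manual_stat, manual_p)
print(scipy_stat, scipy_p)
```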
Notice that in this guided project, P(B) (total_prop) was calculated first and then multiplied by high_value_count or low_value_count (which are P(A) with the denominator canceled out by multiplying by the total number of questions). You could equally have calculated P(A) first and used the total number of questions to cancel the denominator in P(B). That is analogous to (7841/32561 * 21790), which is exactly the same as (21790/32561 * 7841), which is what the guided project solution does. Maybe due to experiment design (the chi-sq test of homogeneity is almost the same as the test of independence, differing only in set-up and interpretation), one way is easier to calculate or makes more sense as a metric than the other. Some people teach the expected value calculation as (Row Total x Column Total) / Grand Total. You can think about how this is the same as the 2 ways mentioned above, as long as you keep in mind the fundamental P(A and B) = P(A) x P(B).
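The equivalence of the three ways of writing the expected value can be checked with exact fractions. This is only a small sketch with my own variable names, using the numbers from the example table.

```python
from fractions import Fraction

row_total = 7841     # high value questions (high_value_count)
col_total = 21790    # questions the term appears in
grand_total = 32561  # all questions

# Way 1 (the guided project): total_prop x high_value_count.
project_way = Fraction(col_total, grand_total) * row_total
# Way 2: P(A) first, then multiply by the column total.
other_way = Fraction(row_total, grand_total) * col_total
# Way 3: (Row Total x Column Total) / Grand Total.
textbook_way = Fraction(row_total * col_total, grand_total)

print(project_way == other_way == textbook_way)
print(float(textbook_way))
```

Fractions are used here only to avoid floating-point rounding in the comparison.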
For learning, besides reading theory, reading the source code of scipy.stats.chi2_contingency and scipy.stats.chisquare is a good way to see that both call the power_divergence function, so they’re not that different.
@hanqi Thank you for a super detailed explanation.
The chi-sq values in my first post are not exactly the same, but the numbers are very close. I copied and pasted the code from the solution, and the result I got is the answer I expected (big differences between the chi-sq values, and high p-values).
I have to digest your answer about the chi-square part. Thank you!