GP Winning Jeopardy - Confusion with calculation of proportions and expected values - Chi-Squared test

Screen Link:
https://app.dataquest.io/m/210/guided-project%3A-winning-jeopardy/7/applying-the-chi-squared-test

My Code:

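import numpy as np
from scipy.stats import chisquare

# observed_expected, jeopardy, high_value_count and low_value_count
# come from the earlier steps of the guided project.
chi_squared = []
for obs in observed_expected:
    # obs = [count in high-value questions, count in low-value questions] for one word
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared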

Hi I hope someone can offer some guidance on this!

In this loop I am of the mind that the following is what we are doing:

  1. Finding the sum of each entry in observed_expected, which is a list of lists holding the number of times a word is repeated in high-value questions and the number of times it is repeated in low_value questions.

  2. If we sum these two together and divide by the total amount of questions in the data-set, we get the proportion of times the word was used in other questions in the dataset.

  3. This is where I get confused: if we are looking at the number of times a word is used across all questions (total_prop in the code), it confuses me that we would use this total proportion to calculate the expected number of times a word is used in high-value or low_value questions individually.

To put it more simply, how can we use the total proportion of a word being used in other questions to calculate the expected number of times this word would be used specifically in high or low_value questions?

Would it not make more sense to calculate a total proportion for words being reused for each category (high and low value questions) individually, and then applying this proportion to known high and low value counts to get the expected values?

If I am completely off the mark here please let me know! Not too sure if there is a concept I haven’t fully grasped or something like that.

Thanks for reading and for your time!

John

hi @johnedwardferreira5

Thank you for this detailed question! :+1:

Before I attempt to discuss your doubts, please clarify this for me.

Kindly explain the calculation of the individual proportions using the example below. There are two tables, one with observed and one with expected values.

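Observed Values

| Flavor | ChocoChips | DryFruits | Row-Totals |
|---|---|---|---|
| Chocolate | 25 | 15 | 40 |
| Vanilla | 13 | 20 | 33 |
| Column-Totals | 38 | 35 | 73 |

Expected Values

| Flavor | ChocoChips | DryFruits | Row-Totals |
|---|---|---|---|
| Chocolate | 21 | 19 | 40 |
| Vanilla | 17 | 16 | 33 |
| Column-Totals | 38 | 35 | 73 |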

Hi Rucha,

Thanks for the response!

For the expected values the individual proportions are as follows:

Chocolate: 40/73
Vanilla: 33/73

Chocolate (choco-chips) = 25/73
Chocolate (Dry Fruits) = 15/73

Vanilla (Choco-chips) = 13/73
Vanilla (Dry Fruits) = 20/73

We could obviously go into more depth regarding the narrower categories such as the expected proportion of chocochip chocolate against only chocochip cookies (I am assuming they are cookies lol :slight_smile: ).

I would apply the same methodology as above to the observed values and in doing so would be able to get the chi squared value as well as the p_value.

Am I doing this correctly?

In terms of the original question, I felt we grouped the probability of the repetition of words into one figure, which we then applied to both the high and low value questions, when in fact we should have calculated their proportions individually and then applied each proportion to its relative category of question?

I am slightly nervous there is something I have not grasped, if that is the case please do not hold back. Rather a fool for a second!

Regards,
John

hi @johnedwardferreira5

Nope, we are not talking about cookies, we are talking about ice-creams and toppings!
Let’s just ignore the GP question for now, and only focus on this example.

Frank answer: nope, this is incorrect. You just gave the observed values as the expected values.

First, the row totals and the column totals are what we call the marginal distribution. If we break our observed values down by the marginal distribution, we can simplify like this:

  • Total no. of ice-creams = 73
  • Total no. of chocolate ice-creams (out of 73) = 40 (regardless of toppings)
  • Total no. of vanilla ice-creams (out of 73) = 33 (regardless of toppings)
  • Total no. of ice-creams with choco-chips as toppings = 38 (regardless of flavor)
  • Total no. of ice-creams with dry-fruits as toppings = 35 (regardless of flavor)

Second, the cross-section of flavor with toppings is what we call the joint distribution. Again breaking down the observed values, we have:

  • Total no. of ice-creams = 73
  • Total no. of Chocolate Flavored ice-creams with Choco-chips as toppings = 25
  • Total no. of Chocolate Flavored ice-creams with dry-fruits as toppings = 15
  • and so on

When we talk about the Chi-squared test, what we want to know is: given an observed marginal distribution, what is the expected joint distribution? (Right now we are only talking about using the Chi-squared test for homogeneity. If this word is foreign to you, please ignore it; we can take it up in a later post!)

Allow me to shorten the names before I try to simplify this.

Chocolate = CH
Vanilla = VA
Choco-Chip = CC
Dry-Fruits = DF

If you look at the observed values, we are calculating the overall probability of having CH out of all ice-creams, and likewise for CC. So let’s take it this way:

  • P(CH) = 40/73 = 54.8%
  • P(CC) = 38/73 = 52.1%

If we expect our distribution to be homogeneous, what we mean is that 52.1% of the 54.8% of all ice-creams should be CH + CC (Chocolate flavor and Choco-chip toppings).

This translates to: the expected no. of CH and CC ice-creams out of the total is

$52.1\% \times 54.8\% \times 73$

or

$\frac{38}{73} \times \frac{40}{73} \times 73$

which reduces to the Expected Value for Chocolate ice-cream and Choco-Chip: $\frac{38 \times 40}{73} \approx 20.82$ (I rounded it to 21).

Similarly, the Expected Value for Vanilla and Choco-chip is

$52.1\% \times (100 - 54.8)\% \times 73 = \frac{38}{73} \times \frac{33}{73} \times 73 = \frac{38 \times 33}{73} \approx 17.178$ (I rounded it to 17).

Note that, total ice-creams with Choco-chips still remain 38 (21 + 17)! Our marginal distribution for both observed and expected values is always the same.
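
The calculation above can be sketched in a few lines of plain Python. This is only an illustration of the row total times column total over grand total rule; the function name `expected_counts` is made up for this example and is not part of the guided project.

```python
# Observed counts from the ice-cream example: rows are flavors, columns are toppings.
observed = [
    [25, 15],  # Chocolate: choco-chips, dry-fruits
    [13, 20],  # Vanilla:   choco-chips, dry-fruits
]


def expected_counts(table):
    """Expected count per cell: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for flavor, row in zip(["Chocolate", "Vanilla"], expected):
    print(flavor, [round(value, 3) for value in row])
```

The unrounded expected values keep exactly the same row and column totals as the observed table, which is the point made above about the margins.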

If you have understood the essence of expected values and can proceed on your own, great! You may ignore the section below.

Following this calculation, we calculate the expected values for every combination of variable (flavor) and attribute (topping) to get the Expected Values table.

Once we complete that, we can then use a $\chi^2$ test to understand how our observed values have fared compared to the expected values.

$\chi_c^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

$c$ = degrees of freedom
$O_i$ = observed values
$E_i$ = expected values

Degrees of freedom for a Chi-Squared test are calculated as:

Total no. of columns without margins (Column-Total) = 2 (Col)
Total no. of rows without margins (Row-Total) = 2 (Row)
c = (Col - 1) * (Row - 1) = (2 - 1) * (2 - 1) = 1
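
As a rough sketch, here is how the statistic and the degrees of freedom could be computed for the ice-cream table using only the standard library. It uses the unrounded expected values and no continuity correction, so a library routine with different defaults may give a slightly different number. The p-value line relies on a closed form that only holds for exactly 1 degree of freedom.

```python
import math

observed = [
    [25, 15],  # Chocolate: choco-chips, dry-fruits
    [13, 20],  # Vanilla:   choco-chips, dry-fruits
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Sum (O - E)^2 / E over every cell of the table.
chi_squared = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total
        chi_squared += (o - e) ** 2 / e

# (Row - 1) * (Col - 1), counting rows and columns without the margins.
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Tail probability of a chi-squared variable; this formula is valid for dof == 1 only.
p_value = math.erfc(math.sqrt(chi_squared / 2))

print(round(chi_squared, 3), dof, round(p_value, 3))
```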

This 1 degree of freedom implies that, if we have the marginal distribution and only one joint observation, we can still derive the entire observed and expected values tables. To elaborate:

Observed Table with Margins and One observation:

| Flavor\Toppings | Choco-chip | Dry-fruits | Row-Total |
|---|---|---|---|
| Chocolate | 25 | ? | 40 |
| Vanilla | ? | ? | 33 |
| Col-Total | 38 | 35 | 73 |

For a 3 column and 2 row table, c = (3 - 1) * (2 - 1) = 2

Observed Table with Margins and Two observations:

| Flavor\Toppings | Choco-chip | Dry-fruits | Oreo-cookie | Row-Total |
|---|---|---|---|---|
| Chocolate | 25 | 15 | ? | 60 |
| Vanilla | ? | ? | ? | 45 |
| Col-Total | 38 | 35 | 32 | 105 |
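
To see why the known cells are enough, here is a small sketch that fills in the question marks from the margins. The helper `complete_table` is hypothetical (written just for this post); it assumes the known cells are the top-left (Row - 1) x (Col - 1) block of the table.

```python
def complete_table(known_block, row_totals, col_totals):
    """Fill in a table from its margins and its free cells.

    known_block holds the first (rows - 1) x (cols - 1) observed cells;
    every remaining cell is forced by the row and column totals.
    """
    # Last column of each known row: row total minus the cells already known.
    rows = [
        list(row) + [row_total - sum(row)]
        for row, row_total in zip(known_block, row_totals)
    ]
    # Last row: each column total minus what the rows above already use.
    last_row = [
        col_total - sum(col)
        for col_total, col in zip(col_totals, zip(*rows))
    ]
    rows.append(last_row)
    return rows


# 2 x 2 table: one known cell (1 degree of freedom).
print(complete_table([[25]], [40, 33], [38, 35]))

# 2 x 3 table: two known cells (2 degrees of freedom).
print(complete_table([[25, 15]], [60, 45], [38, 35, 32]))
```

The number of cells you have to supply before everything else is determined is exactly the degrees of freedom.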

I hope this helps you somewhat. Do let me know in case you want me to complete the $\chi^2$ test calculation as well. And in case of newer or further doubts - 404 error :stuck_out_tongue_closed_eyes: Just kidding! Do let me know your thoughts.


Hi @Rucha,

Awesome! Thanks for your response!

I think I’ve got an idea of what is going on here! So if we apply what you have just said to the original case that I was struggling with we could say the following:

  1. By summing the amount of repetitions in high_value questions with the repetitions in low_value questions we get the expected proportion of the specific word we’re dealing with being repeated in other questions in the dataset.

  2. By looking at high and low value count (see code below), we are actually looking at the ‘proportion’/instances of low and high value questions in the data-set:

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

  3. To then calculate the expected figure for high and low value questions, we multiply the proportion of a specific word being repeated in the data-set by the count of a question being high or low value to determine the expected proportions, both, of a word being repeated in low and high value questions respectively?

Please let me know if this makes sense!

Thank you very much for your help!

John


Hi @johnedwardferreira5

I didn’t quite get this part.

If you mean this, then Yup you got it!

exp_high_val = total_prop_of_word * high_value_count
exp_low_val = total_prop_of_word * low_value_count
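
To make this concrete, here is a minimal sketch for a single word, in plain Python. All the numbers are hypothetical, and the function name `word_chi_squared` is made up for illustration. It assumes the total number of questions equals high_value_count + low_value_count (which is what jeopardy.shape[0] would be), and it uses a p-value formula that only applies to 1 degree of freedom, i.e. two categories.

```python
import math


def word_chi_squared(high_obs, low_obs, high_value_count, low_value_count):
    """Expected counts, chi-squared statistic and p-value for one word."""
    total_questions = high_value_count + low_value_count
    # Share of all questions that contain the word, regardless of value.
    total_prop_of_word = (high_obs + low_obs) / total_questions

    # If the word ignores question value, it should show up in each group
    # in proportion to the size of that group.
    exp_high_val = total_prop_of_word * high_value_count
    exp_low_val = total_prop_of_word * low_value_count

    observed = [high_obs, low_obs]
    expected = [exp_high_val, exp_low_val]
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi_sq / 2))  # valid for 1 degree of freedom
    return expected, chi_sq, p_value


# Hypothetical counts: the word appears in 3 high-value and 1 low-value question.
expected, chi_sq, p_value = word_chi_squared(
    high_obs=3, low_obs=1, high_value_count=6000, low_value_count=14000
)
print(expected, round(chi_sq, 3), round(p_value, 3))
```

Note that the two expected counts add up to the same total as the two observed counts, just like the margins in the ice-cream tables.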


There is nothing about repeating in this part of the exercise.
The recycling analysis on the previous page was done solely to create a single column question_overlap which has got nothing to do with this chi-sq analysis.

The previous analysis does, however, create a side product, terms_used, which is used in this analysis. Look at the data type of terms_used: it is a set. That means it contains no repeats and no information about how many times each word appears, just a vocabulary of what has appeared.
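
A tiny illustration of that point, with made-up questions rather than the real Jeopardy data (the project's own word filtering is left out here):

```python
# Hypothetical cleaned questions, not taken from the dataset.
questions = ["the river nile", "the longest river"]

terms_used = set()
for question in questions:
    for word in question.split(" "):
        terms_used.add(word)  # adding a word that is already present changes nothing

print(sorted(terms_used))
print(len(terms_used))
```

"the" and "river" each occur twice in the questions but only once in the set, so the counts for the chi-squared test have to be collected separately.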

Here is my explanation of what’s going on in this question: Chi_squared and p-value result are different from solution answer

@Rucha

Are you referring to the flavour and toppings example? That looks more like a test of independence, since the test of homogeneity is for comparing 2 groups on a single variable according to http://inspire.stat.ucla.edu/unit_13/#:~:text=In%20the%20test%20of%20independence,from%20each%20sub-group%20separately.

Hey @hanqi

Thanks for pointing this out.

At that time I may not have accounted for Ice-cream as a sample, and flavors being two groups from the same sample. I kind of made the two flavors as two distinct groups.
Perhaps Men & Women - Smoker and Non-Smokers would have been a better example?

So, in that sense, yes this test is more for association/independence testing rather than homogeneity.

Will try to work and improve on this for future reference.