Clarification required - Multi category chi-squared tests : 2. Calculating expected values


This screen explains how to calculate expected values. We are referring to a table whose cells have been replaced with the corresponding proportions. According to the explanation, to calculate the proportion of females with a salary above 50k, we multiply the overall proportion of people who earn more than 50k by the overall proportion of females.

That is, the proportion of people who are female and earn >50k = 0.241 * 0.33

How is this correct? This calculation seems to assume that the entire population of females earns more than 50k. How can we make such an assumption? In reality, the total proportion of females, 0.33, includes both females who earn more than 50k and females who earn less than 50k. How can we use this method to calculate the expected value? Please provide a detailed explanation.


Hi @sreekanthac,

To keep it simple, suppose there are 200 observations.

50% are males (100)
50% are females (100)
50% earn more than 50k (100)
50% earn less than 50k (100)

If we are looking for the percentage of people who are both female and earning more than 50k, then we only need this info:

50% are females (100)
50% earn more than 50k (100)

And we can calculate it by multiplying them: 0.5 * 0.5 = 0.25 (25% of 200 = 50). Why multiply? We are basically using this concept:

A and B are two events. If A and B are independent, then the probability that events A and B both occur is:

p(A and B) = p(A) x p(B).

In other words, the probability of A and B both occurring is the product of the probability of A and the probability of B.
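
To put numbers on it, here is a minimal Python sketch of the product rule applied to the 200-person example. The variable names are only illustrative, and the result is valid only under the assumption that sex and income are independent, which is exactly the assumption the expected value is built on.

```python
# Toy example from this post: 200 people, 50% female, 50% earning >50k.
total = 200
p_female = 0.5
p_over_50k = 0.5

# Under independence, p(female and >50k) = p(female) * p(>50k).
p_female_over_50k = p_female * p_over_50k

# Convert the proportion back into an expected count.
expected_female_over_50k = p_female_over_50k * total

print(p_female_over_50k)         # 0.25
print(expected_female_over_50k)  # 50.0
```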

Now, how do we verify that?

50% earn more than 50k (100)
And if income does not depend on sex, 50% of those 100 are males, that is 50.

So if we subtract the number of males from it, we have 50 females who earn more than 50k.
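
The same check can be sketched in Python by building every expected cell from the totals alone. This is only an illustration on the toy numbers in this post; the dictionary layout and names are my own choice, not something from the lesson.

```python
total = 200
sex_totals = {"male": 100, "female": 100}
income_totals = {">50k": 100, "<50k": 100}

# Expected count for each cell = (sex total * income total) / grand total.
expected = {
    (sex, income): sex_n * income_n / total
    for sex, sex_n in sex_totals.items()
    for income, income_n in income_totals.items()
}

# Subtracting the expected males from the >50k total leaves the expected females.
females_by_subtraction = income_totals[">50k"] - expected[("male", ">50k")]

print(expected[("female", ">50k")])                            # 50.0
print(females_by_subtraction == expected[("female", ">50k")])  # True
```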

Hope this helps :slightly_smiling_face:



Hi @Sahil,
I have not understood the verification part.

You said that out of 200, 50% are male (100) and 50% earn more than 50K (100).
But then you wrote that 50% earn more than 50K (100) and, out of those 100, 50% are males, that is 50. Why out of 100? Isn’t it out of 200?

Total: 200
50% are males
Total Males: 100
50% earn more than 50k
Total >50k earners: 100
What we don’t know is what percentage of males are >50k earners and what percentage are not. It could be 50/50, 40/60, 30/70… and so on. Since that data is not available, we assume it to be 50% (the same as the percentage among all individuals) to generate the expected value.
Please note that at this point we are aware that the expected value can differ from the actual value. That is why we call it the expected value instead of the actual value. It is the best we can predict with the available data.
So now that we have settled on 50%, 50% of the total males (100) would be 50.
So the expected number of males who earn more than 50k is 50.
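
A small Python sketch may make the "expected vs. actual" distinction concrete. The helper `expected_count` and the three observed counts are hypothetical, chosen only to match the 50/50, 40/60 and 30/70 splits mentioned above. The expected value depends only on the totals, so it stays at 50 whatever the real split turns out to be.

```python
def expected_count(row_total, col_total, grand_total):
    """Expected cell count if the row and column variables are independent."""
    return row_total * col_total / grand_total

# Hypothetical observed counts of males earning >50k. Each one is compatible
# with 100 males and 100 >50k earners out of 200 people.
for observed_males_over_50k in (50, 40, 30):
    expected = expected_count(row_total=100, col_total=100, grand_total=200)
    difference = observed_males_over_50k - expected
    print(observed_males_over_50k, expected, difference)
```

The gap between the observed and expected counts is what the chi-squared test goes on to measure.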


The way this calculation was derived was tough for me to follow as well. Here is how I found it helpful to visualize the process. Imagine that instead of 9 boxes, we are just given a chart with 3 boxes:


Then we are told that a great plague struck this population at random and the Total number has dropped to 7841. To estimate the numbers of males and females left, we first figure the percentages of the original population,


and then calculate our new M/F distribution.


The same math applies if we use proportions. If we were told that only .241 of the population survived the plague, we could multiply our original ratios by the ratio of those that survived, and guess what proportion of the original population is alive and either male or female.
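
Here is a rough Python sketch of that scaling. It is an illustration under assumptions: the 0.33 female share is taken from the question at the top of the thread, the 0.67 male share is simply 1 - 0.33, and the variable names are made up for this example.

```python
p_survived = 0.241  # share of the original population still alive
survivors = 7841    # survivor count quoted above

# Assumed original shares: 0.33 female (from the question), so 0.67 male.
original_share = {"male": 0.67, "female": 0.33}

alive_share = {}         # share of the ORIGINAL population, alive and of this sex
expected_survivors = {}  # expected number of survivors of this sex

for sex, share in original_share.items():
    alive_share[sex] = share * p_survived
    expected_survivors[sex] = share * survivors
    print(sex, round(alive_share[sex], 3), round(expected_survivors[sex]))
```

Because the plague is assumed to strike at random, the male/female split among survivors is expected to match the split in the original population, which is the same independence assumption used for the expected values.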


Hi Sahil,
Thanks a million for your comment. I’d have been at a loss without it! With your comment, I can see where the whole idea is heading.