Possible wrong answer in mission 254 screen 5: Average of ratios is not the same as ratios of sums

In the mission 254 screen 5 (https://app.dataquest.io/m/254/group-summary-statistics/5/group-by-visual-breakdown )the task is:

Write a SQL query that, for each major category, displays:

  • The major category
  • The average share of women with the alias Average_women .

This means the average share of Woemn in each major group
which is (total number of Women)/(Total number of Men and Women)
the suggested Solution is :

SELECT Major_category,
       AVG(ShareWomen) AS Average_women
  FROM recent_grads
 GROUP BY Major_category;

This code computes the average of ratio of women in all Majors of Major_category
This gives the same weight to every Major neglecting the number of participants.

A possible solution is :

SELECT Major_category, 
    ((SUM(Women)+.0)/(SUM(Women)+SUM(Men))) AS Average_women
    FROM recent_grads
    GROUP BY Major_category

hi @mehranmo

Could you please attach the mission screen-link as well in your original post. It will help the community members to help figure out, what could be the difference/ mismatch here.

1 Like

No, that’s not what it computes.

It computes the average of ShareWomen in each Major Category because you are grouping by the Major Category.

If you consider one of the Major Categories -

What’s calculated is the average of that ShareWomen column which corresponds to the Law & Public Policy Major_category.

Maybe I was not clear:
ShareWomen is defined as (Women/Women+Men) in each Major.
You can check this by this code:

SELECT Men,Women,ShareWomen,(Women+.0)/(Women+Men)
    FROM recent_grads
    Where Major_category = 'Law & Public Policy'

Unbenannt
When we group by category we cannot take the average of ShareWomen of all member of category as the average of Women share in that category.
ShareWomen of Major Category = (All women in that Category / All Men and Women in that category)

One can check the difference with the following code:

SELECT SUM(Men),SUM(Women),AVG(ShareWomen),(SUM(Women)+.0)/(SUM(Women+Men))
    FROM recent_grads
    Where Major_category = 'Law & Public Policy'
    Group by Major_category

Here the AVG(ShareWomen) is not equal to SUM(W)/SUM(W+M)
Unbenannt

The quantity AVG(ShareWomen) has no statistical meaning. and is not an answer to the question that was asked in the task:

  • The average share of women with the alias Average_women .

I think it does.

Because almost each Major Category has more than 1 Majors. When you calculated the average percentage of women in a particular category, you are estimating the average percentage of women for each of those Majors.

Law & Public Policy as a Category, has 5 Majors. We are essentially saying that the Major Public Policy , on average, has ~33.5% of female students considering the entire Category. We can also say the same of any of the other Majors within that category.

When you calculate the following -

(SUM(Women)+.0)/(SUM(Women+Men))

You are calculating the share of women for that category. You can say that the share of women for that Category is ~0.34 or that Category has ~34% female students. Of course, this is ignoring that there can be an overlap of some women taking multiple majors from that category. So, the actual percentage might be lower. If you consider the average of the category, it acts as a potentially reasonably estimate on the average percentage of women for each major within the category. I think it makes sense given that context.

With your code above, you are not calculating the average share of women for that category. This is required to answer that question and as per me holds meaning given the context of the Majors within a Category. When the number of Majors in a Category will be 1, the average shareWomen and shareWomen will be the same, which is the case with the Interdisciplinary category.

1 Like

Lets see if AVG(ShareWoman) has any statistical meaning with an example:
We have a category with 2 majors: Major A and B.
Major A has 1000 Woman and 0 Men and its ShareWoman is 1 = (1000/(1000+0))
Major B has 0 Woman and 1 Man and its ShareWoman is 0 = (0/(0+1))

Then if we ask the same question as in the task :

average share of women with the alias Average_women in that category .

When we group by category it evidently implies that we are interested in the statistics of that category.
Per suggested solution the AVG(ShareWoman) = (SUM(ShareWomen)/Count(ShareWomen)) = (1+0)/2 = .5 , estimates that in this category with 1001 students (1000 woman and 1 man) on average 50% are women which is clearly wrong.
But if we calculate SUM(Women)/SUM(Men+Women)=1000/1001 = 99.9% we get the correct estimate of average share of women in the category.

You are now manipulating the data to fit your narrative. Which changes the discussion beyond my current level of understanding/knowledge. Statistics and its application to data analysis/science is not my current strong suite.

It’s still a good discussion so more people could help perhaps. Maybe @Rucha, @otavios.s, could pitch in?

If not, perhaps @Sahil could help connect with the Content Author to explain their thinking process for this particular content.

I will try to think on this more as well.

2 Likes

Thanks for helping me explain better. I guess the original Author also was questioning group statistics.

1 Like

hi @mehranmo

I still don’t have the mission link yet, but I am ignoring that for now. Let’s just focus first on your example. I will tweak it a bit:

Category C : Major A: W = 1000, M = 0 \implies Share of W = 1
Category C : Major B: W = 0, M = 1000 \implies Share of W = 0

1st case: avg.(Share of W) for C = \frac {1 + 0}{2} = 0.5 or 50% or probability of Women in any major belonging to Category C is 50%

2nd case (your suggestion): \frac {Tot W}{Tot W + Tot M} = \frac {1000 + 0}{1000 + 0 + 1000 + 0} = 0.5 or 50%. Its is still the same as 1st case, but I have changed the number of men in Major B.

Are you saying you have a scenario that no matter how much I change the denominator of the second case, I will still get the correct estimate?

1 Like

But this is how the average is calculated. I agree that in this (extreme) example it is not the best statistic to use, but it is still the average.

But this is not the average, this is the proportion of women in the example. Is it a better statistic for this situation? Yes, but it is still not the average.

Regarding the code, the following…

SELECT Major_category, AVG(ShareWomen) AS Average_women FROM recent_grads
GROUP BY Major_category;

…does calculate the average share of women for each major, just like the code below…

SELECT Major_category, SUM(Total) AS Total_graduates FROM recent_grads
GROUP BY Major_category;

…calculates the total number of graduates for each major in step 4 of this mission.

The logic in both codes is the same. If the second is fine, then the first should be fine too.

2 Likes

Link: https://app.dataquest.io/m/254/group-summary-statistics/5/group-by-visual-breakdown

case 2 is : 1000 / (1000 +1) = 99.9%
I think that the Original Author has clearly stated that it is interested in whole group statistics :

average share of women with the alias Average_women in that category .
which is case 2

Mybe one can forward it to them.

@mehranmo

I modified your case 2. I want to understand how are we to conclude that 99.9% is the correct estimated value. and it can’t be anything else. How have you considered the weightage here as compared to 1st case.

Ah I see.
The average of ratios is just an estimation of the ratio of sums. In my extreme example it is a very bad estimation and your extreme example it is a perfect estimate.
We may use this estimation if we don’t have access to original data or weights of each sub category or to reduce calculations cost. One can check the quality of this estimation by checking it on half of the samples and see how it changes.

Yup. Avg. of ratios is the estimation for each category which has already taken into account participation of W and M in each of the major belonging to that given category.
@the_doctor response explains this pretty well within the context of data.

If you would like to continue a few experiments and have further thoughts/doubts, please go ahead and raise them. And/or if any of the responses above have answered your question, please mark this topic as solved.