Maybe I was not clear:
ShareWomen is defined as (Women/Women+Men) in each Major.
You can check this by this code:
Where Major_category = 'Law & Public Policy'
When we group by category we cannot take the average of ShareWomen of all member of category as the average of Women share in that category.
ShareWomen of Major Category = (All women in that Category / All Men and Women in that category)
One can check the difference with the following code:
Where Major_category = 'Law & Public Policy'
Group by Major_category
Here the AVG(ShareWomen) is not equal to SUM(W)/SUM(W+M)
The quantity AVG(ShareWomen) has no statistical meaning. and is not an answer to the question that was asked in the task:
The average share of women with the alias Average_women .
Because almost each Major Category has more than 1 Majors. When you calculated the average percentage of women in a particular category, you are estimating the average percentage of women for each of those Majors.
Law & Public Policy as a Category, has 5 Majors. We are essentially saying that the Major Public Policy , on average, has ~33.5% of female students considering the entire Category. We can also say the same of any of the other Majors within that category.
When you calculate the following -
You are calculating the share of women for that category. You can say that the share of women for that Category is ~0.34 or that Category has ~34% female students. Of course, this is ignoring that there can be an overlap of some women taking multiple majors from that category. So, the actual percentage might be lower. If you consider the average of the category, it acts as a potentially reasonably estimate on the average percentage of women for each major within the category. I think it makes sense given that context.
With your code above, you are not calculating the average share of women for that category. This is required to answer that question and as per me holds meaning given the context of the Majors within a Category. When the number of Majors in a Category will be 1, the average shareWomen and shareWomen will be the same, which is the case with the Interdisciplinary category.
Lets see if AVG(ShareWoman) has any statistical meaning with an example:
We have a category with 2 majors: Major A and B.
Major A has 1000 Woman and 0 Men and its ShareWoman is 1 = (1000/(1000+0))
Major B has 0 Woman and 1 Man and its ShareWoman is 0 = (0/(0+1))
Then if we ask the same question as in the task :
average share of women with the alias Average_women in that category .
When we group by category it evidently implies that we are interested in the statistics of that category.
Per suggested solution the AVG(ShareWoman) = (SUM(ShareWomen)/Count(ShareWomen)) = (1+0)/2 = .5 , estimates that in this category with 1001 students (1000 woman and 1 man) on average 50% are women which is clearly wrong.
But if we calculate SUM(Women)/SUM(Men+Women)=1000/1001 = 99.9% we get the correct estimate of average share of women in the category.
You are now manipulating the data to fit your narrative. Which changes the discussion beyond my current level of understanding/knowledge. Statistics and its application to data analysis/science is not my current strong suite.
It’s still a good discussion so more people could help perhaps. Maybe @Rucha, @otavios.s, could pitch in?
If not, perhaps @Sahil could help connect with the Content Author to explain their thinking process for this particular content.
I modified your case 2. I want to understand how are we to conclude that 99.9% is the correct estimated value. and it can’t be anything else. How have you considered the weightage here as compared to 1st case.
Ah I see.
The average of ratios is just an estimation of the ratio of sums. In my extreme example it is a very bad estimation and your extreme example it is a perfect estimate.
We may use this estimation if we don’t have access to original data or weights of each sub category or to reduce calculations cost. One can check the quality of this estimation by checking it on half of the samples and see how it changes.
Yup. Avg. of ratios is the estimation for each category which has already taken into account participation of W and M in each of the major belonging to that given category. @the_doctor response explains this pretty well within the context of data.
If you would like to continue a few experiments and have further thoughts/doubts, please go ahead and raise them. And/or if any of the responses above have answered your question, please mark this topic as solved.