Question about SQL Dataset: 254-6

In the SQL Fundamentals course, the dataset used (jobs.db/recent_grads) has some data that doesn’t really make sense. Perhaps this is an artifact of this being a learning tool? For example, only a few of the rows have consistent values for Men, Women, and Total. This is especially apparent in the following lesson:

https://app.dataquest.io/c/43/m/254/group-summary-statistics/6/multiple-summary-statistics-by-group

None of the values in the columns make any sense.

Which columns do you mean? And why do you think they don’t make any sense?

Specifically, the Total column and the Estimate_women column. The Total column, which should be the sum of the Men and Women columns, is typically much larger or much smaller than that sum. Additionally, the Estimate_women column is often an order of magnitude larger than the number of women in a given category. For example, the Agriculture & Natural Resources Major_category has the following aggregate values:
SUM(Women): 249812
SUM(Men): 197875
SUM(Total): 79981
AVG(ShareWomen): 0.6179384232
Estimate_women: 49423.333025959204

The Estimate_women column is the product of AVG(ShareWomen) and SUM(Total), but it’s nowhere close to the number of women reported in that category, and the total number of women in that category appears to be larger than the total number of students.
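
For anyone who wants to reproduce this kind of summary, here is a minimal sketch of the grouped query. It uses Python's built-in sqlite3 module with a tiny in-memory table so that it runs on its own. The rows and the category names "Category A" and "Category B" are made up for illustration and are not taken from jobs.db; only the column names follow recent_grads. Against the real data you would presumably connect to jobs.db instead of ":memory:".

```python
import sqlite3

# Hypothetical sample rows, NOT taken from jobs.db.
# Columns: Major_category, Total, Men, Women, ShareWomen
rows = [
    ("Category A", 100, 60, 40, 0.4),
    ("Category A", 200, 50, 150, 0.75),
    ("Category B", 300, 120, 180, 0.6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads ("
    "Major_category TEXT, Total INTEGER, Men INTEGER, "
    "Women INTEGER, ShareWomen REAL)"
)
conn.executemany("INSERT INTO recent_grads VALUES (?, ?, ?, ?, ?)", rows)

# Same shape as the aggregates quoted above: per-category sums,
# the average share, and Estimate_women = AVG(ShareWomen) * SUM(Total).
query = """
SELECT Major_category,
       SUM(Women)                   AS sum_women,
       SUM(Men)                     AS sum_men,
       SUM(Total)                   AS sum_total,
       AVG(ShareWomen)              AS avg_share_women,
       AVG(ShareWomen) * SUM(Total) AS Estimate_women
FROM recent_grads
GROUP BY Major_category
ORDER BY Major_category;
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

One thing the sketch makes visible: even when every row is internally consistent (Total = Men + Women, as in the sample rows), AVG(ShareWomen) * SUM(Total) only approximates SUM(Women), because the unweighted average counts every major equally regardless of size. In the sample, "Category A" has 190 women but an estimate of about 172.5. That effect alone would not explain SUM(Women) exceeding SUM(Total), though.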

I checked the Total column and you are right! A total larger than the sum of men and women might make sense because of people not reporting, but a total smaller than that sum is quite weird.
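
A row-level check along these lines could show how many rows fall into each case. This is only a sketch: it runs against a small in-memory table with made-up rows (the "Major 1" to "Major 3" names and all numbers are hypothetical, not from jobs.db), and it assumes the Total, Men, and Women column names used in the lesson.

```python
import sqlite3

# Hypothetical rows, NOT from jobs.db. Columns: Major, Total, Men, Women
rows = [
    ("Major 1", 100, 60, 40),    # Total equals Men + Women
    ("Major 2", 500, 50, 150),   # Total larger than Men + Women
    ("Major 3", 100, 120, 180),  # Total smaller than Men + Women
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads ("
    "Major TEXT, Total INTEGER, Men INTEGER, Women INTEGER)"
)
conn.executemany("INSERT INTO recent_grads VALUES (?, ?, ?, ?)", rows)

# Classify each row by how Total compares with Men + Women,
# skipping rows where any of the three values is missing.
query = """
SELECT CASE
         WHEN Total = Men + Women THEN 'equal'
         WHEN Total > Men + Women THEN 'total_larger'
         ELSE 'total_smaller'
       END      AS comparison,
       COUNT(*) AS n
FROM recent_grads
WHERE Total IS NOT NULL AND Men IS NOT NULL AND Women IS NOT NULL
GROUP BY comparison
ORDER BY comparison;
"""
counts = dict(conn.execute(query).fetchall())
print(counts)
```

Run against the real table, the 'total_smaller' count would be the number of rows in the "quite weird" case.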