CYBER WEEK - EXTRA SAVINGS EVENT
TRY A FREE LESSON

Question W/ My Code and Given Solution

Screen Link: https://app.dataquest.io/m/255/subqueries/4/returning-multiple-results-in-subqueries

My Code:
SELECT Major, Major_category
FROM recent_grads
WHERE Major_category IN (SELECT Major_category FROM recent_grads)
GROUP BY Major_category
ORDER BY SUM(Total) DESC
LIMIT 5;

The Solution:
SELECT Major, Major_category FROM recent_grads
WHERE Major_category IN (select Major_category from recent_grads
group by Major_category
order by SUM(Total) DESC
limit 5)

My Question:
Why do we need to include the group by and order by in the parentheses of the subquery? Also in the solution they do not end the code with “;” . It is different from what we have been learning and makes me wonder why its not needed.

hey there @r.hosein96

I’m going to do my best to answer your question but hopefully someone a little more skilled will come along and really knock it out of the park.

I see subqueries as a sort of filtering mechanism that helps you cross select more specific data.

In this particular problem, the filter we want to create for the overall solution is one that includes only the top five results as defined by their Total sum of men and women . Before we make a query, we want the filter to be organized in such a way that it will return our whole query in that order as well.

If we run the subquery by itself and also selecting SUM(Total) we can see how we are ordering the rest of our query:

select Major_category, sum(total) from recent_grads
group by Major_category
order by sum(total) desc
limit 5;

which returns:

Selection_883

These top five results is the filter we are looking to organize the rest of our query from. We are only looking for the top five totals, grouped by those five major’s categories, and ordered by the total sum of people in those categories. In this output you can see that I included SUM(Total) so I can see what I am actually ordering by.

Then, we’re looking for a breakdown of majors within those top five categories WITHIN the confines of the subquery.

Your subquery returns this:

Selection_884

It returns all of the rows, not filtering for the total number of people in each category but returning all categories, regardless of totals. Because of this, your subquery does not actually filter for the results the question is asking for and your query and your query without your subquery is this:

Selection_891

Nothing is filtered and in oddly enough, SQL picks the last possible major (highest index and Rank numbers for their specific major_category group) in all of those categories to represent as I found out after investigating why these particular majors were being picked. You can see my results here:

(I’m not sure why SQL does this)

To answer your second question, the semicolon at the end of statements is to separate out each SQL statement in (larger) SQL queries and database systems. However, since our queries at this stage are quite simple in comparison to the queries of an SQL industry expert, placing semicolons at the end of statements is more for good practice rather than function.

Just like how I’m too lazy to capitalize my SQL keywords in these examples :stuck_out_tongue_winking_eye:

I hope I answered your question!

@r.hosein96, I think @mctopherganesh did a good job explaining it, but I’m going to tl;dr the post just a smidgen.
GROUP BY is needed whenever you’re using an aggregating function such as SUM and you want to split that aggregation up. In this case, your subquery includes
ORDER BY SUM(Total) DESC
You don’t want the sum of the Total column for all rows, you want the sum of Total for each unique Major_category.

1 Like