CYBER WEEK - EXTRA SAVINGS EVENT
TRY A FREE LESSON

GROUP BY question

Hello,
I had two questions on the “Subqueries” Mission on Page 5
URL:Learn data science with Python and R projects

Question 1: This is the code that is on Page 5:

SELECT Major_category, SUM(TOTAL)
  FROM recent_grads
 GROUP BY Major_category
 ORDER BY SUM(TOTAL) DESC;

I tried this just to see what would happen (I took the GROUP BY out):

SELECT Major_category, SUM(TOTAL)
    FROM recent_grads
   ORDER BY SUM(TOTAL) DESC

My code only returns “Education” for the Major_category. I was wondering if someone could tell me why? Why “Education” and not something else?

Question 2:
The correct answer for this page is:

SELECT Major_category, Major
  FROM recent_grads
 WHERE Major_category IN (SELECT Major_category
                            FROM recent_grads
                           GROUP BY Major_category
                           ORDER BY SUM(TOTAL) DESC
                           LIMIT 3
                         );

I tried:

SELECT Major_category, Major
  FROM recent_grads
 WHERE Major_category IN (SELECT Major_category
                            FROM recent_grads
                           ORDER BY SUM(TOTAL) DESC
                           LIMIT 3
                         );

And this produced an error. Why do we need the GROUP BY clause here?

First of all, SUM() is an aggregate function and it doesn’t make sense not to use it as such.

I guess this is some random data you have got.

And the error clearly states:

misuse of aggregate: SUM()

Since SUM() is an aggregate function, we should perform an aggregation first.

@dash.debasmita,
The error that gets produced, does that mean you always have to use the “GROUP BY” clause first? I’m confused about the error because I thought ORDER BY SUM(Total) is an aggregation but it isn’t. It seems like you have to have something before it.

We can aggregate without using a GROUP BY, but in that case, we’ll be aggregating the full table. Refer here

You’re getting this error while running

SELECT Major_category, Major
  FROM recent_grads
 WHERE Major_category IN (SELECT Major_category
                            FROM recent_grads
                           ORDER BY SUM(TOTAL) DESC
                           LIMIT 3
                         );

The problem here is with the subquery,

SELECT Major_category
                            FROM recent_grads
                           ORDER BY SUM(TOTAL) DESC
                           LIMIT 3

You don’t have SUM(TOTAL) in your SELECT clause, in that case, how can you expect the result to get sorted by the SUM(TOTAL) column?

Does this answer your question?

For the correct subquery:

WHERE Major_category IN (SELECT Major_category
                            FROM recent_grads
                           GROUP BY Major_category
                           ORDER BY SUM(TOTAL) DESC
                           LIMIT 3
                         );

Why is it when you have the GROUP BY clause you don’t need to have the SUM(Total) in the SELECT clause here?

Also, how do you aggregate a full table? I thought you aggregate a column and it returns one value. I read the article and the implicit GROUP BY.

I think you got confused.
What do you get when you run

SELECT SUM(Total)
    FROM recent_grads;

So what are we trying to achieve by:

SELECT Major_category FROM recent_grads
ORDER BY SUM(Total);

I’m going to take a break from SQL. Something just isn’t clicking with me on the topic. Thank you for answering my questions though.