SQL Intermediate Joins - Combining Multiple Joins w/ Subqueries

Hi all. I’d like to see if there is any opinion in terms of the difference between these two queries. One is the one I turned in, and was marked correct, and the system’s proper answer. Thanks!

My answer:

select alb.title album, art.name artist, sum(il.quantity) tracks_purchased
from invoice_line il
inner join track t on il.track_id = t.track_id
inner join album alb on t.album_id = alb.album_id
inner join artist art on art.artist_id = alb.artist_id
group by album
order by tracks_purchased desc
limit 5;

Dataquest Answer:

SELECT
    ta.album_title album,
    ta.artist_name artist,
    COUNT(*) tracks_purchased
FROM invoice_line il
INNER JOIN (
            SELECT
                t.track_id,
                al.title album_title,
                ar.name artist_name
            FROM track t
            INNER JOIN album al ON al.album_id = t.album_id
            INNER JOIN artist ar ON ar.artist_id = al.artist_id
           ) ta
           ON ta.track_id = il.track_id
GROUP BY 1, 2
ORDER BY 3 DESC LIMIT 5;
1 Like

Your answer did not group by artist, what happens when there is an album with 2 artists? Which artist is shown? (logically not an issue in this example since album artist usually 1:1 relationship, but may not be so lucky in other cases, also some sql dialects will not allow you to have columns in select that neither got aggregated nor grouped.)

Some other formatting issues making it a little harder to read are that your first two inner joins and the third inner join have the tables in the on condition in different orders (matter of readability only).

Just a note: sum() you used and count() answer used both give same answer because the sum is summing all 1. Good to check if really all 1 before using sum().

1 Like

Great answer @hanqi thank you.

Can you elaborate a bit on this quoted line please?

Here is a short intro to that warning i mentioned.

1 Like

You responded in my favorite form. Lol. Thank you hanqi!

It seems from the linked solution that grouping eugeniosp3’s code on alb.album_id would solve the issue if there were albums with the same name right?

Also, if these databases were huge, which version of the code would be more efficient in terms of resources (assuming the GROUP BY were using alb.album_id)?

Working under the assumption the album_id’s are unique that seems to make sense to me.

I’m curious on this second question also. @Sahil / @hanqi - Do you have any resources explaining performance and other things affected by queries?

@david.james.hewlett @eugeniosp3
I never had any issue with “albums with the same name”, what I was pointing out in my 1st reply is the artist field, the 2nd field in SELECT. It is not in a GROUP BY and not in any aggregation function. That is the issue I raised and explained through an article linked.

I don’t understand how would “albums with the same name” become an issue. Doesn’t same name imply same album, and different name imply different album? I mean a normal person differentiates albums by their names. If there is indeed the problem of same album having different name or different album having same name, the album name field should be cleaned (preprocessing). If GROUPBY turned out too granular for “same album having different name”, then just post-process it. Never bad to be too granular, but you can’t go in reverse from more aggregated to less aggregated.

For performance, practically nobody sits around reasoning with theory except in interviews, the time spent reasoning is better spent just running both versions and timing them. I’m not familiar enough with SQL but one way is to print the execution plan which an optimised preview of how the engine will run the query. However, same SQL you wrote can become different execution plans on different servers or different versions of the same sql engine and likely among different sql engines too. Whether different SQL can become same execution plan I’ve yet to find.

Some heuristics in tools like pandas that do not have the “rewriting your code to make it faster” capabilities:

  1. Groupby 1 col is faster than groupby 2 or more columns
  2. Row filter before joining (analogous to constraining row growth as much as possible with subqueries/CTE in SQL) better than joining before row filter
1 Like

Thanks for that reply very thorough. Most of us will be looking to interview at some point soon right? So the answer is pretty relevant. Again, thanks for the info!