SELECT al.title AS "album",
ar.name AS "artist",
COUNT(t.track_id) AS "tracks_purchased"
FROM invoice_line AS il
JOIN track AS t ON t.track_id = il.track_id
JOIN album AS al ON al.album_id = t.album_id
JOIN artist AS ar ON al.artist_id = ar.artist_id
GROUP BY 1
ORDER BY 3 DESC
Following is the official Solution:
FROM invoice_line il
INNER JOIN (
FROM track t
INNER JOIN album al ON al.album_id = t.album_id
INNER JOIN artist ar ON ar.artist_id = al.artist_id
ON ta.track_id = il.track_id
GROUP BY 1, 2
ORDER BY 3 DESC LIMIT 5;
Both of the above code produce the same output. But using subqueries just make it difficult to read and understand.
Any specific reason subqueries are preferable than a more straightforward solution like mine for this particular example?
The only reason I can think of is that for extremely large tables, subqueries might be faster because you end up joining on fewer columns using subqueries. I am not sure whether that’s something preferred in the industry or not, though.
I guess I am finding it difficult to appreciate subqueries for this particular question because it’s simpler/easier to work without them.
Did you use INNER JOIN or LEFT JOIN? I think the only difference is that you return only the relevant columns using subquery for visualization purposes. You may want to see how things work as you write the query. I noticed that is becomes very hard to display the results as you work when the number of columns get very large.
This exact point has been brought up by others in the past too! While subqueries are invaluable for a whole host of reasons, the particular use-case presented in the situation you raise does not quite demonstrate the full extent of the usefulness of subqueries. Rather, it’s just meant to impart the conceptual understanding of how subqueries work i.e. that you can use subqueries to turn an SQL query into a mini-table, thereby allowing you to do things like join operations on the results of that query.
I’d definitely agree with you that it would probably feel more intuitive for a lot of people to proceed on this particular question without the use of a subquery.
Hey. Sorry for the delayed response, this escaped me.
I don’t have the time to really look at this problem right now, but I did notice an issue with your solution. Because of my lack of time at the moment, I don’t know if the issue (which I’ll mention below) destroys your solution completely or if it is easily fixable while keeping the spirit of your answer.
The problem with your code is that you’re selecting columns on which you’re not using any aggregate function, nor are they part of the GROUP BY clause. This is conceptually incorrect. Please read this answer for more details.
I recently finished writing a blog post (edit: not yet publishedhere it is) that dives deeper into this, but the gist of it is in the link above.
Hi Bruno. Thanks for the response. Yes, I have seen your answers by now.
Dataquest should really stress upon this when actually teaching this topic in the Content so that we develop a better, intuitive understanding of this when trying to implement this throughout the several exercises. Please correct me if I’m mistaken, but I don’t think that’s been done so far.
I still don’t grasp why it’s a problem if the query works but is not conceptually correct.
If we are focusing on a particular question to answer, and the above query runs and works, then why should it have to be conceptually correct? It still answers the question, right? It might not work for a different question, or on a different system with a different flavor of SQL. But it still works as of now. So, there isn’t as such a problem with it, correct?
If we are required to have the query to be structured the way you point out because otherwise it might lead to potential problems, how do we develop a better understanding of structuring queries to avoid those problems? Just through practicing?
You mention that It makes your code more intuitive in that linked post. I am not entirely sure if I would agree. Because not working with this concept allowed me to answer a particular [business] question much more intuitively rather than having to structure the query in a more complicated manner. Would it be possible for you to clarify this a bit more?
I will keep going through those posts of yours again, to develop a better understanding of this. And I have been practicing with those posts in mind through the rest of the exercises related to SQL.
But, it just ends up with me trying to stumble through a lot of attempts and not a more methodical/logical approach, and I am unsure if I am actually developing a better, conceptual understanding of using SQL or not. I hope you understand my point.
It depends on the measure of success. If it is to answer the business question, then yes. But from my point of view what we want is for learners to learn SQL (not just SQLite, but SQL in general), not just answer business questions correctly.
It’s fine to use that answer if you understand what it entails, but we as educators need to make sure learners knows what it entails. We have’t done a good job at that.
I’d say for this particular instance what is needed is a better understanding of what GROUP BY does. This is lacking at the moment.
Intuition is always relative. I think my statement applies if we have a correct understanding of GROUP BY, otherwise it doesn’t.
I understand it, I just don’t have a good solution. The one thing I’ll say it’s important for people to keep in mind is that SQL handles tables, so we need to find away to accomplish our goals by doing table “operations” (whatever SQL allows).
I have read this discussion and @Bruno’s previous post regarding this same issue. However, I still have some questions.
I understand Bruno’s point about selecting fields not used in an aggregation or in the group by not being intuitive/correct. In the case which @the_doctor brings up in this discussion, the query provided the correct output because each album is associated with only one artist. To check that this was the case I performed two checks.
In the screenshot above, the first query is my answer to the exercise on this page. The following two queries are the two checks that I performed:
Calculate the number of unique albums
Calculate the sum of albums associated with each artist
Since the result of these two queries is 347, we know that each album is associated with a maximum of 1 artist. This is why not including field 2 (artist) in the group by still results in the correct answer. Please let me know if my rationale here is flawed.
That aside, why does the solution provided use a subquery? The only explanation that I can devise is that the table created by the series of joins has less fields with the suggested answer’s method as opposed to my (or the_doctor’s) output for the exercise.
I can imagine that as the tables being joined increase horizontally in size, my method will have significantly more data and will take longer to produce (on this latter point, I don’t know exactly how the size of a table or tables affects time performance but it makes sense that the larger the tables in question are, the longer a query might take). Is this the only explanation or is there something else to why the suggested solution was selected?
I took another SQL course and the instructor kept selecting use cases which were not entirely appropriate and not a best practice. He warned us as such but proceeded anyway to use his “example.” IMHO, that is not teaching real world scenarios. I’m a pragmatist and in 100% of the cases if a database is stored in SQLite and the same problem is presented the simpler solution will be a best practice and encouraged. I gave that other course a poor review because of that reason. So, please select a better example to use sub-queries. We’re simply not learning the correct way. (Would you use new math techniques only to teach the times table, or just have 6 year old kids suck it up and memorize the times table so they have a skill set for the rest of their life?)