Intermediate Joins in SQL - Combining Multiple Joins with Subqueries

I would like to know when to use the inner join for one table to connect to a select statement I just made up, regardless if the select statement is complete with multiple inner joins. Google and Youtube is not helping :sweat: :tired_face: :exploding_head:. I can see the answer and then understand the logic in what the end result would be. However, I have a no understanding of how to come up with the answer.

I need to understand what is situation do I have a INNER JOIN to select statement. And what are the steps to solved to this problem. I appreciate an explanation, link, article, etc. I spent a large amount of time trying to dissect this.

Write a query that returns the top 5 albums, as calculated by the number of times a track from that album has been purchased. Your query should be sorted from most tracks purchased to least tracks purchased and return the following columns, in order:

  • album , the title of the album
  • artist , the artist who produced the album
  • tracks_purchased the total number of tracks purchased from that album

My wrong code

SELECT album.title, artist.name, COUNT(track.track_id) as tracks_purchased
FROM artist 
INNER JOIN album on artist.artist_id = album.artist_id
INNER JOIN track on album.album_id = track.album_id
Group by album.album_id 

ORDER BY tracks_purchased Desc LIMIT 5;
``

The correct code 

SELECT
ta.album_title album,
ta.artist_name artist,
COUNT(*) tracks_purchased
FROM invoice_line il
INNER JOIN

(
SELECT
t.track_id,
al.title album_title,
ar.name artist_name
FROM track t
INNER JOIN album al ON al.album_id = t.album_id
INNER JOIN artist ar ON ar.artist_id = al.artist_id
) ta

       ON ta.track_id = il.track_id

GROUP BY 1, 2
ORDER BY 3 DESC LIMIT 5;

Hi eferistin, would you edit your question so it is formatted better? Helps others read the question and show that you are serious about doing well and helping others help you.

Seems like you are asking about the syntax INNER JOIN (SELECT)? If yes and you understood the logic of the correct answer already, can you edit your wrong code so it is logically correct and the only issue left would be syntactic differences?

Currently, the correct code you showed is selecting from invoice_line table, but your code is not selecting from that table. I suspect you may not be counting purchases at all.

I am unable to edit/delete my code shown above. Please, everyone, ignore my code. I am unable to articulate my confusion and request for help to gain some form of clarity does not mean that I take this course lightly. I am devoting so much time and effort to study and fully comprehend.

Can someone in the community kindly provide a detail explaination? Guidance in their train of thought of how it leads to the answer.

For me,

  • there is a distinction between reading the SQL statement or logical operation and following it and understanding it result. (This is like looking at an equation and solving it, which I know how, algebra)

  • It is the thought process that leads to setting up the correct query and knowing why. (This is like being given a problem and I am not sure which equation to use, how and why it is set up.)

-I see the logic of multiple INNER JOINS. But not INNER JOIN one query with a Select statement.

From the course. This is part of the SQL Intermediate: Table Relations and Joins course

, Intermediate Joins in SQL chapter, and 4. Combining Multiple Joins with Subqueries. I get this

SELECT
    t.track_id,
    ar.name artist_name
FROM track t
INNER JOIN album al ON al.album_id = t.album_id
INNER JOIN artist ar ON ar.artist_id = al.artist_id
ORDER BY 1 LIMIT 5;

Not sure when to set this up

SELECT
    il.invoice_line_id,
    il.track_id,
    ta.artist_name
FROM invoice_line il
INNER JOIN (
            SELECT
                t.track_id,
                ar.name artist_name
            FROM track t
            INNER JOIN album al ON al.album_id = t.album_id
            INNER JOIN artist ar ON ar.artist_id = al.artist_id
           ) ta
           ON ta.track_id = il.track_id
ORDER BY 1 LIMIT 5;

I am frustrated with my inability to grasp this concept. Thank you all for your help.

https://sqlbolt.com/lesson/select_queries_order_of_execution
This article illustrates how sql executes the keywords/clauses in a query. Understanding this is also useful for shaping the thought process in building queries.

We usually start thinking about what information (column) we want. Then we look at which table that column exist in, and start building the FROM table of the query. Later we may want to filter the rows using WHERE. During filtering, it may be filtered against a single number/string or a subquery that generates a list of possible values (for membership checking). After WHERE, assuming no GROUP BY aggregations and HAVING filtering of groups, we think about what to column to SELECT, whether to express if-else statements using CASE in SELECT, or to find uniques/duplicates using DISTINCT, or to calculate aggregations in windows using OVER (PARTITION BY “column”). Some columns selected may not be needed for interpretation of values, but may still be needed for the purpose of joining another column in another table in an outer query. For eg. in your INNER JOIN (SELECT) code from the lesson above, track_id from t.track_id has to be selected so that it can be used in ta.track_id = il.track_id for further joining in an outer level.

So above describes single table query thought process.
If we need more than 1 column of information, we may not be able to find the information in 1 table, so start looking at in which tables do every column of information desired exist, and mentally prepare to do JOIN. Sometimes, table A is simply a derived aggregation of another table B, so you could have generated the table A on the fly using B. This is a use case of doing INNER JOIN (“temporary table created on the fly here”) AS “temporary_name” here. The unnested way of joining (with no subqueries)
requires that the table exists already, either in the database as a table, or as a view with some CREATE VIEW AS code that you wrote above. If not, a yet non-existent table or view requires a subquery to create that temporary table first for joining to outer nest levels. You will learn CREATE VIEW AS in dataquest lessons as a cleaner way to substitute subqueries when temporary tables need to be created.

To summarize, available information in available table does not need subquery, yet non-existent tables must be creating in subquery/CREATE VIEW AS before joining. What happens when you unnest the subquery from the lesson answer and do all joins (invoice_line,track,album,artist) on the same level without subqueries?

Now we got the logical motivation out of the way, we can discuss query performance motivations for subqueries. Imagine 2 tables, shop_details, and shop_sales. shop_details have columns shop_id, location, opening_hours,…anything you want. shop_sales contains the shop_id, daily_sales. The resulting table we want is to enrich shop_details with total sales (aggregation of all daily_sales for each shop_id).

How to generate that? 1 way is to take shop_sales, groupby shop_id, sum(daily_sales) to get total_sales for each shop to create a temporary table in a subquery, then join this to shop_details on shop_id. You can see that the temporary table is like a python dictionary with key-value pair and the purpose of join was just to fill in the value (total sales) into a new column in shop_details using key(shop_id).

Another way is to join the whole shop_details into shop_sales first, then groupby later. This 2nd method does not use a subquery. You can time these 2 methods, or use EXPLAIN ANALYZE (taught in Data Engineering path on dataquest) to think through their difference if interested.

In all, both subquery and no subquery can give the same output. What could be different is execution time, maintainability, extensibility, ease of modularization/refactoring. These are just some of the common considerations a coder should keep in mind. Which of these factors is important depends on the strengths/weaknesses of the team you’re in and the stage/type of project/problem. Just being aware of the trade offs is hard enough and will come with more exposure.

1 Like

Thank you for responding. Before starting the course on Dataquest, I completed all the lessons found on Sqlbolt.com, I do understand your explanation of the example problem you provided. However, I think no what is lost is what I am trying to convey where I am lost.

However, I think no one knows is what am I am trying to explain is where I am lost.

I feel that I am close to getting it. I have or understand all the moving parts but need someone to draw the obvious arrow.

Are you asking about a general concept of INNER JOIN (SELECT …) or a specific DQ exercise?
I see you provided 2 different examples from DQ so i’m not sure which 1 of the 2 you are asking about.

For the general concept, my previous reply was addressing that, are my examples insufficient?

I’ll try again, but first i have to be clear of your emphasis and break down the concepts.
Are you asking about subquery vs no subquery? My previous reply was answering that.
Are you asking about JOIN OR INNER JOIN specifically? A minimum viable example way to ask the question would be to just use the word JOIN, if you know the type of JOIN does not affect the question.

In the shortest reply i can give: JOIN exist to row-align information from different tables. One of these joined tables in their raw form may not be desired, so a subquery is used to shape tables/columns into the form you want first before the final outermost JOIN.

Given table A,B,C,D. You can

  1. join all 4 at once
  2. A JOIN subquery(B join C join D)
  3. A JOIN subquery(B JOIN subquery(C JOIN D))

They give the same results, just worsening readability.

Hi,

If I’m able to understand your confusion, it is related to the use of the following code:

SELECT
t.track_id,
al.title album_title,
ar.name artist_name
FROM track t
INNER JOIN album al ON al.album_id = t.album_id
INNER JOIN artist ar ON ar.artist_id = al.artist_id
           

Now your exercise question is

Write a query that returns the top 5 albums, as calculated by the number of times a track from that album has been purchased. Your query should be sorted from most tracks purchased to least tracks purchased and return the following columns, in order
album , the title of the album
artist , the artist who produced the album
tracks_purchased the total number of tracks purchased from that album

and your confusion is around the point:

-I see the logic of multiple INNER JOINS. But not INNER JOIN one query with a Select statement.

To answer this exercise we have to first get the data for the

top 5 albums, as calculated by the number of times a track from that album has been purchased.

Since this data is not available directly, we first have to find it.The above code does that exactly.

It creates the data which has the information about tracks, the albums they belong to and the artist it belongs to.
This data can be queried upon further with invoice_line table to get the result, since it has sales data.

In essence, this is like using inner join on invoice_line table and a table which has the data about
tracks, the albums they belong to and the artist.

Another point of confusion can be, why not use inner joins over all, rather than the solution that has been provided.
This is because of simplicity; for the ease of reading as has been mentioned above by @hanqi.

In case you think this answer doesn’t clear your confusion, let know.
Also if English isn’t your first language and you’re comfortable in any other language, post in that, will try to understand and answer.