This article illustrates how sql executes the keywords/clauses in a query. Understanding this is also useful for shaping the thought process in building queries.
We usually start thinking about what information (column) we want. Then we look at which table that column exist in, and start building the
FROM table of the query. Later we may want to filter the rows using
WHERE. During filtering, it may be filtered against a single number/string or a subquery that generates a list of possible values (for membership checking). After
WHERE, assuming no
GROUP BY aggregations and
HAVING filtering of groups, we think about what to column to SELECT, whether to express if-else statements using CASE in SELECT, or to find uniques/duplicates using DISTINCT, or to calculate aggregations in windows using OVER (PARTITION BY “column”). Some columns selected may not be needed for interpretation of values, but may still be needed for the purpose of joining another column in another table in an outer query. For eg. in your INNER JOIN (SELECT) code from the lesson above,
t.track_id has to be selected so that it can be used in
ta.track_id = il.track_id for further joining in an outer level.
So above describes single table query thought process.
If we need more than 1 column of information, we may not be able to find the information in 1 table, so start looking at in which tables do every column of information desired exist, and mentally prepare to do
JOIN. Sometimes, table A is simply a derived aggregation of another table B, so you could have generated the table A on the fly using B. This is a use case of doing INNER JOIN (“temporary table created on the fly here”) AS “temporary_name” here. The unnested way of joining (with no subqueries)
requires that the table exists already, either in the database as a table, or as a view with some
CREATE VIEW AS code that you wrote above. If not, a yet non-existent table or view requires a subquery to create that temporary table first for joining to outer nest levels. You will learn
CREATE VIEW AS in dataquest lessons as a cleaner way to substitute subqueries when temporary tables need to be created.
To summarize, available information in available table does not need subquery, yet non-existent tables must be creating in subquery/CREATE VIEW AS before joining. What happens when you unnest the subquery from the lesson answer and do all joins (invoice_line,track,album,artist) on the same level without subqueries?
Now we got the logical motivation out of the way, we can discuss query performance motivations for subqueries. Imagine 2 tables,
shop_details have columns shop_id, location, opening_hours,…anything you want.
shop_sales contains the shop_id, daily_sales. The resulting table we want is to enrich
shop_details with total sales (aggregation of all daily_sales for each shop_id).
How to generate that? 1 way is to take
shop_sales, groupby shop_id, sum(daily_sales) to get total_sales for each shop to create a temporary table in a subquery, then join this to shop_details on shop_id. You can see that the temporary table is like a python dictionary with key-value pair and the purpose of join was just to fill in the value (total sales) into a new column in shop_details using key(shop_id).
Another way is to join the whole shop_details into shop_sales first, then groupby later. This 2nd method does not use a subquery. You can time these 2 methods, or use EXPLAIN ANALYZE (taught in Data Engineering path on dataquest) to think through their difference if interested.
In all, both subquery and no subquery can give the same output. What could be different is execution time, maintainability, extensibility, ease of modularization/refactoring. These are just some of the common considerations a coder should keep in mind. Which of these factors is important depends on the strengths/weaknesses of the team you’re in and the stage/type of project/problem. Just being aware of the trade offs is hard enough and will come with more exposure.