Challenge: Complex Query with Joins and Subqueries

Screen Link:
https://app.dataquest.io/m/179/joining-data-in-sql/8/challenge-complex-query-with-joins-and-subqueries

My Code:

SELECT
    f.name AS country,
    c.population AS urban_pop,
    f.population AS total_pop,
    CAST(c.population AS FLOAT) / CAST(f.population AS FLOAT) AS urban_pct
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
WHERE urban_pct > 0.5
ORDER BY 4;

Solution Code:

SELECT
    f.name country,
    c.urban_pop,
    f.population total_pop,
    (c.urban_pop / CAST(f.population AS FLOAT)) urban_pct
FROM facts f
INNER JOIN (
    SELECT facts_id, SUM(population) urban_pop
    FROM cities
    GROUP BY 1
    ) c ON c.facts_id = f.id
WHERE urban_pct > 0.5
ORDER BY 4 ASC;


Hello everyone. I don't understand what is wrong with my logic; I'm aware it differs greatly from the approach used in the solution. Any help would be greatly appreciated.

To answer in one line: you are analysing urban percentage at the city level, while the question is asking for urban percentage at the country level. A country is made up of a group of cities.

You can download the db from the exercise (in the 2nd tab), upload it to https://sqliteonline.com, and run your queries faster there (it gives you autocomplete too). There are also fewer bugs (occasionally a query errors in the DQ environment but runs normally there, though nothing critical enough to impede your learning). Use DBeaver if you want to do production-grade SQL.

Another point to note is that putting urban_pct (a column alias) in the WHERE clause is not standard SQL: https://stackoverflow.com/questions/10923107/using-a-column-alias-in-sqlite3-where-clause. SQL has an order of execution of its clauses (https://www.designcise.com/web/tutorial/what-is-the-order-of-execution-of-an-sql-query), and WHERE is evaluated before SELECT, so the engine should not even know what urban_pct means inside WHERE; SQLite just happens to accept it, which is why your query runs at all. Usually people repeat the whole CAST… calculation inside WHERE if they want to filter on urban_pct.
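For example, a portable version of your filter would repeat the calculation instead of referencing the alias (a minimal sketch against the same facts and cities tables; whether you still want this city-level filter is a separate issue, covered below):

SELECT
    f.name AS country,
    c.population AS urban_pop,
    f.population AS total_pop,
    CAST(c.population AS FLOAT) / CAST(f.population AS FLOAT) AS urban_pct
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
-- Repeat the expression here rather than referencing the alias urban_pct
WHERE CAST(c.population AS FLOAT) / CAST(f.population AS FLOAT) > 0.5
ORDER BY 4;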

See the GROUP BY in the answer? Whenever you see this, it means some sort of aggregation is going on in the non-grouped columns, and the unit/object of analysis is the grouped column(s). Here the stats for cities are grouped into their own countries (represented by cities.facts_id). This creates a key-value mapping showing, for every country, the total urban population summed from all its cities. This mapping is later joined to the table with country-level information to produce the final 4 columns at the country level. Creating this mapping first (by group by + aggregation) and then joining can save time and memory compared to joining first and then aggregating. You can try to achieve this same analysis pattern in pandas in the future; it is very common.
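Run that mapping step on its own and the key-value idea becomes obvious (one row per facts_id, lifted straight from the solution's subquery):

-- One row per country: facts_id -> total urban population of its cities
SELECT facts_id, SUM(population) AS urban_pop
FROM cities
GROUP BY facts_id;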

What you are missing is the aggregation-of-cities-to-country step. Your result set is actually shorter than the answer's because it is harder for your rows to clear the urban_pct > 0.5 threshold: you are calculating which cities (NOT countries) have a population greater than half the total population of the country they are in, while the answer first sums the urban population of all the cities in each country (which makes the numerator bigger), before dividing by the country's total population to check if it is > 0.5.
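If it helps, here is the equivalent join-first-then-aggregate version of the same country-level logic (a sketch, not the official solution; note the aggregate filter has to move into HAVING, since WHERE cannot see aggregates):

SELECT
    f.name AS country,
    SUM(c.population) AS urban_pop,
    f.population AS total_pop,
    SUM(c.population) / CAST(f.population AS FLOAT) AS urban_pct
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
-- Collapse the joined city rows back down to one row per country
GROUP BY f.id, f.name, f.population
-- Aggregate conditions belong in HAVING, not WHERE
HAVING SUM(c.population) / CAST(f.population AS FLOAT) > 0.5
ORDER BY 4;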

So, the advice is to understand the entities you are analysing first, before writing any SQL at all. Example questions to ask yourself (a quick illustration of the last point follows this list):

- What is the level of analysis?
- Do I need a GROUP BY on one or more columns?
- Which tables contain which important columns?
- Do I need to merge/split columns to get joinable information?
- Are there duplicates in the rows that will damage join results?
- When counting, are there nulls that cause wrong count results?
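On the nulls point, for instance, COUNT(*) counts rows while COUNT(column) skips NULLs, so the two can silently disagree (a hypothetical illustration on the exercise's cities table):

-- COUNT(*) counts every row; COUNT(population) skips rows where population IS NULL
SELECT COUNT(*) AS all_rows, COUNT(population) AS rows_with_population
FROM cities;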

Eventually, you should get better at explaining the results of a SQL query without even running it, e.g. when dealing with whiteboard SQL interviews.


Holy jesus this is an extremely complete answer. Thank you so much for taking your time to write this, I greatly appreciate it. I hope your hard work pays off. Thanks again =)))