Not understanding the code in Challenge: Complex Query with Joins and Subqueries

Screen Link: https://app.dataquest.io/m/179/joining-data-in-sql/8/challenge-complex-query-with-joins-and-subqueries

SELECT
   f.name country,
   c.urban_pop,
   f.population total_pop,
   (c.urban_pop / CAST(f.population AS FLOAT)) urban_pct
FROM facts f
INNER JOIN (
           SELECT
               facts_id,
               SUM(population) urban_pop
           FROM cities
           GROUP BY 1
          ) c ON c.facts_id = f.id
WHERE urban_pct > .5
ORDER BY 4 ASC

Hi, would like to understand why is there a need in the subquery

  1. SELECT facts_id
  2. GROUP BY 1

Truly much appreciate if anyone able to explain that to me :slight_smile:

Hey there @willx!

Think about how the subquery returns the information we need for the outer query. Break down the code and run it without what you have said.

SELECT SUM(population) urban_pop
FROM cities

What do you get? Are the values you are looking for make sense?

Also, GROUP 1 is saying the same thing as GROUP BY facts_id within that query.

Hi @emchu16

i now understand the GROUP BY 1.

let me try to understand this -
we need the SELECT facts_id so that we can group the cities by countries? is that the reason?

can i tail this thread ans ask you some more questions?

c.urban_pop

where is this coming from? i understand it’s defined here but i’m a little bit confused. (this was covered in the previous page).

INNER JOIN (SELECT facts_id, 
                   SUM(population) urban_pop
            FROM cities c
            GROUP BY 1
            ) c ON c.facts_id = f.id

My other question is this:

 (c.urban_pop / CAST(f.population AS FLOAT))

why do we have ot use CAST AS FLOAT here for f.population and not c.urban_pop?

@willx correct! it allows it to then be used to join the table where facts_id = id when it is returned from the subquery

@scchoi31

So, from the INNER JOIN, we are returning two columns in a query called ‘c’. Those columns are urban_pop and facts_id.

In the outside main query, we will use the returned values from these columns to perform another aggregate function that allows us to calculate urban_pct. It’s a bit like a temp table where you can manipulate new found information. In practice, if you would need this inner function all the time, it would be best to set up a temp table, and run from this table as opposed to having to write an inner join each time for further analysis.

As for your CAST question, as long as one of the numbers is a float, it will return a float in general. You could do both and it is the same thing. All about preference.