Defining urban_pop outside of the subquery

Screen Link: Learn data science with Python and R projects

Why is it required to define urban_pop inside of the subquery? I notice that if I define it outside of the subquery some countries wind up missing.

To be more specific, why is this incorrect code, which does not include countries such as The Congo:

SELECT DISTINCT(facts.name), SUM(cities.population), facts.population,
CAST(cities.population AS FLOAT)/CAST(facts.population AS FLOAT) AS urban_pct FROM facts
INNER JOIN cities on cities.facts_id = facts.id
WHERE urban_pct > 0.5
GROUP BY 1
ORDER BY urban_pct ASC

And this correct code:

SELECT DISTINCT(f.name) AS country, urban_pop, f.population AS total_pop,
CAST(urban_pop AS FLOAT)/CAST(f.population AS FLOAT) AS urban_pct FROM facts AS f
INNER JOIN (SELECT facts_id, SUM(population) AS urban_pop FROM cities GROUP BY 1) AS c
ON c.facts_id = f.id
WHERE urban_pct > 0.5
GROUP BY country
ORDER BY urban_pct ASC

Click here to open the screen in a new tab.

Check the output for the following -

SELECT DISTINCT(facts.name), SUM(cities.population), facts.population,
CAST(cities.population AS FLOAT)/CAST(facts.population AS FLOAT) AS urban_pct FROM facts
INNER JOIN cities on cities.facts_id = facts.id
GROUP BY 1

That’s part of your code. Specifically, look at the table row for Congo, Republic of the. You will see that the urban_pct value for that is 0.175390.

Now, take part of the code from the solution -

SELECT DISTINCT(f.name) AS country, urban_pop, f.population AS total_pop,
CAST(urban_pop AS FLOAT)/CAST(f.population AS FLOAT) AS urban_pct FROM facts AS f
INNER JOIN (SELECT facts_id, SUM(population) AS urban_pop FROM cities GROUP BY 1) AS c
ON c.facts_id = f.id
GROUP BY country

For Congo, Republic of the urban_pctvalue is0.514185`.

They should be the same. The reasons they are not is because of how you are calculating urban_pct. You calculate urban_pct across the entire population.

CAST(cities.population AS FLOAT)/CAST(facts.population AS FLOAT) AS urban_pct 

The above just calculates the ratio of the two. That’s not what urban_pct is required to be -

percentage of the population within urban areas, calculated by dividing urban_pop by total_pop

You need to divide urban_pop by total_pop. And urban_pop is calculated specifically as -

sum of the population in major urban areas belonging to that country.

You are not calculating the above.

1 Like

I see three issues with your query

  1. There is no need for DISTINCT
  2. In the calculation for urban_pct you should take the sum of cities.population
  3. the check for urban_pct > 0.5 should be in the HAVING clause, not where

This query passed, and it doesn’t use a sub query

select f.name as country, sum(c.population) as urban_pop,
f.population as total_pop,
CAST(sum(c.population) as FLOAT) /
f.population as urban_pct
FROM
facts f inner join cities c on c.facts_id = f.id
GROUP BY country
HAVING urban_pct > 0.5
ORDER BY urban_pct;

I see now that the problem wasn’t the presence of a sub-query, but the need for using SUM. Thank you both for your help

1 Like