BLACK FRIDAY EXTRA SAVINGS EVENT - EXTENDED
START FREE

CASE and Subqueries

,

Hello Dataquest Community,
For the SQL Intermediate Table Relations and Join Course — Guided Project: Answering Business Questions using SQL — Step 5.
I am have trouble understanding the case statement. I do get how a case statement work, but for this question, syntax where country = c.country is where I cannot connect the dot. Counting all the rows in the customer table if country = country and equals 1?

SELECT
CASE
WHEN ( SELECT count(*) FROM customer
where country = c.country
) = 1 THEN “Other”
ELSE c.country
END AS country,

Mission link: https://app.dataquest.io/m/191/guided-project%3A-answering-business-questions-using-sql/5/analyzing-sales-by-country

Hi there @lingkans, welcome to the community!

I haven’t done this mission in Python, but I did do the one in R, where some approaches were different, this being one of them. So I’d love to hear the input of someone who’s actually done it.

From looking at the solutions key, it seems like the code you just highlighted is a neat trick to identify the countries with only 1 customer.

From reading the text in the Guided Project screen, we see the following:

So this gives you an idea that the equating of the sub-query to 1 must have been to identify the countries with only 1 customer.

I’ll try to explain the query from the inside out:

     SELECT
       CASE
           WHEN (
                 SELECT count(*)
                 FROM customer
                 where country = c.country
                ) = 1 THEN "Other"
           ELSE c.country
       END AS country,

1. The WHERE clause.

First, we’ll look at the WHERE clause.

This is being used of course to filter individual rows. Let’s say there were 5 different customers in a particular country. These 5 rows would be pinpointed by that WHERE clause. Likewise if there was only 1 country - that 1 country would be singled out as its own row.

2. SELECT COUNT(*) FROM customer

So what we’re doing here is we’re counting the number of rows left after the WHERE clause we ran. If a given country had 5 customers, SELECT COUNT(*) FROM customer would return 5. If there was only 1 customer in a country, it would return 1.

3. Setting value as “Other”.

This is where your counting of the rows ties back in to what the mission outlined in the quoted text I posted above.

You know that if a given country had only 1 customer, that the following chain of events happen:

  1. The where country = c.country clause results in only 1 row remaining,
  2. The SELECT COUNT(*) FROM customer line only counts that 1 row, thus returning a value of 1,
  3. Since the WHEN (truncated for simplicity) = 1 THEN "Other" condition is now fulfilled, the Other value is assigned.

I remember having no idea how to answer this 1.5 years ago.

There isn’t this concept of having 2 conditions for counting as you mentioned above. 1 comes from one of the two possible outputs from CASE. Whether the country from customer table maps to 1 depends on the result of COUNT(*). The result of COUNT(*) depends on the number of countries in customer where it is equal to c.country.

Here’s the missing magic which may not have been explained by Dataquest.

A correlated subquery is a inner query using values from outer query.
For every row in outer query, the inner query is calculated again based on what value the parameter from the outer query takes during that iteration.
In this solution, The inner query in brackets after WHEN is always looking at a different c.country while working down the rows of the outer query. The same c.country can appear multiple times to the inner query, and this solution could be wasteful.

The wiki above gives a beautiful example of removing correlated subqueries.
The trick is to generate a dictionary mapping unique keys (department in wiki analogous to c.country here) to a single statistic (AVG(salary) in wiki analogous to COUNT(*) here) that you want to know for each unique key. Intuitively this is like substituting multiple executions of the inner subquery with a single dictionary calculated once.

Next, it joins this dictionary into the table (employees in wiki analogous to il+i+c here) that you want to enrich information to. (like doing pd.series.map(dict)). This table used to be the outer query table.

The wiki also teaches you how to create a view and use it to solve both correlated subquery examples there. Now i encourage you to solve this question without correlated subqueries using what you learn from that wiki and share it here before you consider it solved :slight_smile:

1 Like

I studied your whole post. I still do not understand two parts of this code posted on top:

First: how where statement is processed when SELECT count(*) outputs just a single number of all rows. I’m unable to detect all steps “behind the scene” because it’s impossible to “output” them.
Second:

where country = c.country

What’s the country?
Is it a name of a new column, or something else? Based on stuff I learned it seems that country refers to c.country. But isn’t it the same column, from the same dataframe/dataset: customer?

your previous answer unfold some of the magic inside this code, but there is more…Can you explain it, please?

WHERE does not depend on anything that happens in a SELECT statement at the same query level. Maybe you’re not aware of this: SQLBolt - Learn SQL - SQL Lesson 12: Order of execution of a Query

country on the left side is from the inner query, c.country on the right is from the outer query. It’s like there are 2 copies (1 of which is among the joins in FROM of outer query) of the same customer table held in memory, like what happens when you do recursive join. They refer to the same single column in both copies but use them differently.
c.country on the right is a variable that is evaluated for every row’s country value in the outer query. It can be the same as the previous row (that’s why correlated subquery is computationally wasteful compared to creating map then join), or different. c is the alias for the table from outer query.

For each c.country value, an inner query is run on the same table (think of this as a temporarily created 2nd instance of customers table) to see how many customers belong to that country (that is varying for every row). The inner query is run as many times as there are rows in the outer c.customers.

Hi. It’s not what I meant. I meant that I didn’t understand the relation between inner and outer queries in this part of the code. But, now I think I do :slight_smile:

This was the core misunderstanding that you well explained:

Thank you :slight_smile: