191-5 Answering Business Questions using SQL

hi guys,

I am doing the project with SQL - analyze the total purchases per each country for Chinook store.
Link: https://app.dataquest.io/m/191/guided-project%3A-answering-business-questions-using-sql/5/analyzing-sales-by-country

sales_by_country = """
WITH country_or_other AS
    (
    SELECT
        CASE
            WHEN
            (
            SELECT 
                COUNT(*) FROM customer
                ***WHERE country = c.country***
            ) = 1
            THEN "Other"
            ELSE c.country
        END AS country,
        il.*,
        c.customer_id
    FROM invoice_line il
    INNER JOIN invoice i ON i.invoice_id = il.invoice_id
    INNER JOIN customer c ON c.customer_id = i.customer_id
    )
SELECT 
    country,
    COUNT(customer_id) customers,
    SUM(unit_price) total_sales,
    SUM(unit_price) / COUNT(DISTINCT customer_id) customer_lifetime_value,
    SUM(unit_price) / COUNT(DISTINCT invoice_id) avg_order_value
FROM country_or_other
GROUP BY 1
ORDER BY 2 DESC

Could some please help me to understand why we need the row country = c.country in CASE statement but not GROUP BY country instead?

3 Likes

The word in CASE statement literally means,
“Should there be BUYING and SELLING in ‘Holy Sea’ (a unrecognized country, with only transactions with other world which makes them recognized somewhat as a ‘country’ within a country).”

Those I believe, the statement means. Having read through your program, the validity of the program would be tested with the use of in CASE or ORDER BY.

Did you ever figure out what the WHERE country = c.country means?
I got stuck at this exact spot. I can see the answer in the answer key, and I can see that I need the line for the code to run properly. Thankfully StackOverflow had an answer.

(I know that you posted this several months ago, so you probably don’t need this information anymore. However, in an effort to minimize duplicate posts about this question, I’m responding to your post, rather than creating my own.)

It seems as though, somehow, this inner query is referencing the outer query. I thought that the subquery always executed first, but it turns out that this is not the case. I found this post on StackOverflow which explains that if a subquery is correlated to the outer query, it will run the outer query first so that the entire query will run.

So, I think the WHERE country = c.country tells the SQL engine that it needs to look at lines where the country of the subquery is equal to the country in customer of the outer query’s second INNER JOIN.

I hope this helps someone, because I got stuck here for a LONG time, stubbornly refusing to look at the answer key. Maybe @Bruno or @nityesh can add the 191-5 tag so that it pops up when searching by tags?

9 Likes

hey @ngokieuphu, @latoadeoye, @ChrisMatsuoka

this is very late reply I guess but have you guys check oud this post and Bruno’s answer for the part of code in concern.

Apologies if you have.

1 Like

Thank you for this, Ruchi.

I remember noticing Chris pinging me about this, but I guess it then slipped me and I ended up not replying.

1 Like

hey @Bruno

guess what if we ever make a DQ holy book, I would definitely recommend the post as the first verse. I have already mentioned your solution 3 times :smile: :grimacing:

1 Like

I have been puzzling over this question for a bit now too and am going to try to explain it as a way of figuring it out for myself as well. disclaimer: this is just my best guess, Im not a pro.

Short answer: You cant compress one column with grouping, unless you are also compressing the rest of the columns with either grouping or aggregate functions. This code produces a table with 4757 ‘rows’ but if we were to group the table by country, it would be only 24 ‘rows’.

Longer answer: We dont actually need this specific line of code, but using a CASE statement is a handy way to fulfill the task of converting country names that only have one customer to ‘Other’. I did it a slightly different way, but the beauty of the way this case statement works is that it doesnt require aggregation to work, so they can produce a table with all of the data needed for the challenge with one subquery. Here is how this case statement works. The query inside the case statement
SELECT COUNT(*) FROM customer
produces a table that just counts all the rows on the customer table, 59. Adding the Where clause
WHERE country = c.country
creates a table with a single column, 59 rows long, with each ‘row’ of the column being the count of the number of times that each Country is in the table. Try running this

%%sql
SELECT
       c.country,
       (
             SELECT count(*)
             FROM customer
             WHERE country = c.country
                ) count
     FROM customer c

I honestly dont know how or why it does this, or if we have been shown this technique, but this is what they are doing. Then the case statement turns the country name of rows that have a 1 value, meaning there was only one customer, into ‘Other’. Hopefully someone smarter can shed more light on this, and someone newer can get some value from it as well