BLACK FRIDAY EXTRA SAVINGS EVENT - EXTENDED
START FREE

Looking for insights to approaching my SQL queries

This is not really a technical question, since my code provides the correct solution table, but I am looking for input on how to better approach formulating my queries in general by looking at this one in specific. The code provided in the solution set seems to have started by producing a single, non-grouped, table with columns to provide the answers by aggregating in the main query. My approach was based on another posting that made sense to me, but basically, I never felt like I had a plan to get the desired result, I just kept fumbling ahead. But here is roughly my thought process:

  • all of the final questions are grouped by country, so my first step was to create a table grouped by country, and create the ‘Other’ category. I will be able to later join this table based on the original Country column

  • I created another table to do the aggregate functions and grouped them by the original Country column as well.

  • so now I had all the data I need, but I still have to group the ‘Other’ countries together so I used the ‘trick’ shown on this challenge and created a table with the country name or word ‘Other’ and created a column where countries get a 0 and Other gets a 1. Then I grouped the ‘Other’ countries together.

  • Lastly, I just joined all my tables together and selected the requested columns.

So, the code looks disjointed because my thought process was disjointed. Also when putting this together I had a hard time testing each subquery individually since they were linked to each other. Because this project was in a Jupyter notebook, I just opened a new cell, pasted the existing code and ran it with the main query showing only all columns of the subquery I was working on. Wondering if there is a way like in python you can just hide everything you arent working on with hastags? Or anything better than what I’m doing? Any insights welcome, Thanks!

Screen Link:
https://app.dataquest.io/m/191/guided-project%3A-answering-business-questions-using-sql/5/analyzing-sales-by-country

My Code:

%%sql
WITH 

a AS (
                SELECT   c.country country,
                        CASE  
                            WHEN count(c.customer_id) = 1 THEN 'Other'
                            ELSE c.country END AS final_list,
                        count(c.customer_id) customers
                FROM customer c
                GROUP BY 1
                ),

 b AS (
                SELECT c.country country,
                        SUM(i.total) total,
                        SUM(i.total) / COUNT(DISTINCT(i.customer_id)) total_per_cust,
                        SUM(i.total) / COUNT(i.invoice_id) total_per_invoice
                 FROM customer c
                LEFT JOIN invoice i on i.customer_id = c.customer_id
                 GROUP BY 1                        
                ),

c AS (          SELECT a.final_list,
                            CASE
                                WHEN a.final_list == "Other" THEN 1
                                ELSE 0 END AS sort
              FROM a
              GROUP BY 1,2
              )

SELECT a.final_list,
        SUM(a.customers) customers,
        ROUND(b.total, 2) total,
        ROUND(b.total_per_cust, 2) avg_sales_per_customer,
        ROUND(b.total_per_invoice, 2) avg_order_value

FROM a 
LEFT JOIN b ON a.country = b.country
LEFT JOIN c ON c.final_list = a.final_list
GROUP BY 1
ORDER BY c.sort, 3 DESC;

You don’t need to create the c CTE just to add a sorting column and join back. Just create the sorting column on the fly inside ORDER BY.

%%sql
WITH 

a AS (
                SELECT   c.country country,
                        CASE  
                            WHEN count(c.customer_id) = 1 THEN 'Other'
                            ELSE c.country END AS final_list,
                        count(c.customer_id) customers
                FROM customer c
                GROUP BY 1
                ),

 b AS (
                SELECT c.country country,
                        SUM(i.total) total,
                        SUM(i.total) / COUNT(DISTINCT(i.customer_id)) total_per_cust,
                        SUM(i.total) / COUNT(i.invoice_id) total_per_invoice
                 FROM customer c
                LEFT JOIN invoice i on i.customer_id = c.customer_id
                 GROUP BY 1                        
                )



SELECT --a.final_list,
       -- SUM(a.customers) customers,
       -- ROUND(b.total, 2) total,
       -- ROUND(b.total_per_cust, 2) avg_sales_per_customer,
       -- ROUND(b.total_per_invoice, 2) avg_order_value
        *
FROM a 
LEFT JOIN b ON a.country = b.country
ORDER BY (CASE 
              WHEN a.final_list == "Other" THEN 1
              ELSE 0 
          END), 3 DESC;

Less tables more clarity.
For good practice never GROUP BY numbers. If the columns selected changed, the number will refer to another column.

In your final SELECT, you did GROUP BY without aggregating the columns that don’t appear in GROUP BY. This will cause errors in databases that are not Sqlite. Since every group has 1 line only, you can just use SUM() aggregation to get the same result to get around this error. A cleaner solution is just to throw away the group. You will still get the same result because you have already grouped both sides before joining, so both the joining columns are unique, and 1:1 cardinality. There is no way that JOIN will generate duplicate values on the joining columns and so no need to groupby-aggregate after join.

Your way of thinking is clearer than the answer using correlated subqueries, I didn’t check but suspect yours is faster too, so don’t worry there, but learn the answer anyway to enrich your toolbox.
Your’s had multiple groupby, the disadvantage of this is of course doing it repeatedly rather than joining information together first then 1 final groupby to give the answer. The latter way is more debugable as you can see raw information further into the pipeline.

In sql commenting can be done by – as i shown above. I don’t know how to block comment though like in python you can highlight multiple lines and ctrl + / to comment all and press again to uncomment all.

Which part is hard? You can start from the raw unlinked version before trying joins, don’t have to only print output (SELECT * FROM whatever) only after all 3 CTEs are written, you can write 1 cte and immediately print. In automated coding platforms like codility, this is good to show your interviewer you know how to break down a problem into intermediate output(Analogous to writing the general algorithm with unimplemented functions first in Data Structures + algorithms interviews). It also shows you know what sorts of intermediate output is useful to save, not only for your task, but for answering other yet unknown questions in future.

1 Like

A lot of great input here, thank you so much!

  1. Good call moving the c CTE inside the ORDER BY. Just that move makes it much cleaner visually.
  2. Good idea not to GROUP BY numbers. DQ introduced that idea so for so I assumed it was best practice. Makes sense why it is not.
  3. Honestly, Im still struggling with visualizing what is happening with each table here. I see the error that you pointed out with the GROUP BY and only one column aggregated and understand why it is an error. I also understand your first workaround, keep the grouping and just SUM every column. But I don’t understand how to just throw away the group. I was using that GROUP BY only to combine the countries that were being combined into an ‘Other’ category. Maybe I could perform that grouping in the b CTE? The DQ answer used a CASE statement to change the country names of single client countries into ‘Other’ in a single step, but I cant understand how it works so I tried my own way. ha!
  4. Cool trick with – !
  5. I think I’m just not super comfortable with how SQL works yet. Since this project was done in a Jupyter Notebook, I started copy/pasting my test subqueries in a different cell, and then putting it back into my query once it did what I wanted it to do. Just curious if there is a better way to workshop, or if its simply a matter of me needing to get better :slight_smile:

Thanks again!

By subquery i assume you mean the CTE, because i don’t see subqueries in your answer for this exercise.
There’s no need to copy else if you remember the output or just want to check it matches your expectations.

%%sql
WITH 

a AS (
                SELECT   c.country country,
                        CASE  
                            WHEN count(c.customer_id) = 1 THEN 'Other'
                            ELSE c.country END AS final_list,
                        count(c.customer_id) customers
                FROM customer c
                GROUP BY 1
                )
SELECT * FROM a

This is what i mean, you can keep the pipeline of CTE as they are and print any intermediate output by selecting from the CTE. I have shown only CTE a written here but you can leave all a b c defined but select only from a or b or c. If you really mean subqueries then yes writing a separate one in another cell is what i would do too.

You had 4 GROUP BY in a, b, c, final select. See how my solution has only 2 GROUP for a, b. GROUP in c disappeared because it went into ORDER BY and final select GROUP BY is removed. I am talking about this final GROUP BY. You can print the intermediate tables for more clarity to understand what i mean by

You can judge your own familiarity by being able to reason about output just by reading sql without running it, or knowing for what input the query can fail. (i don’t mean syntax/logic errors like GROUP BY without aggregation), but proper working queries that don’t handle edge cases. The reading without running skill is needed in time-limited questions in interview scenario. Like the following allows 1.5 minutes, no way you have time to create tables and insert data to simulate and test.

You were correct, I said subquery but was trying to refer to the CTEs. I actually did start doing the method you showed, and it did work well. My problem was that sometimes I would modify a while building b and then get an error or the wrong result and not know which CTE I had messed up. Like you mentioned, I think that my reading comprehension, as well as syntax, will develop over time. This is just the struggle of learning.

second, when I ran the code that as you posted it, I get a table with all the individual countries (24) listed, instead of 9 named countries and 1 group named ‘Other’ that has the aggregated data of 15 countries that each only had one customer (sorry, I dont know how to post those tables here). As you recommended, I can get that table by grouping by the final list, AND using SUM for each additional column. It seems to me that there are two ‘levels’ of grouping that my approach requires, first, aggregating the data per country, and second, aggregating the ‘Other’ country data together. The DQ solution did both in one step, so maybe there is a way that I can as well but Im not seeing it yet. Still tinkering though :slight_smile:

Two other quick questions: I have been trying for a couple hours to figure out how to put the output table from the Jupyter notebook into a message like this and but I cant get it to show up in a table form. Any recommendations?

Also, I am still struggling to understand this bit of code from the DQ answer. I dont understand what the WHERE clause in the CASE statement does. It seems to be comparing two columns of different lengths and I guess I’m just not sure what is being performed here.

SELECT
       CASE
           WHEN (
                 SELECT count(*)
                 FROM customer
                 where country = c.country
                ) = 1 THEN "Other"
           ELSE c.country
       END AS country,
       c.customer_id,
       il.*
     FROM invoice_line il
     INNER JOIN invoice i ON i.invoice_id = il.invoice_id
     INNER JOIN customer c ON c.customer_id = i.customer_id

again, Thank you so much for your insights,
Nate

This is a general problem in sql for me too. I can miss out a table alias when using table-value-constructors (super powerful tool, moves the dynamic table generation from SELECT back upstream into FROM clause to leave space for more options and complexity in SELECT, you can see an example of a long discussion with use of TVC How to split a table column which contains a list and put it in a separate normalised table?) in postgres or sql server and get the meaningless error message Incorrect syntax near ). It comes with experience to know what types of queries need what syntax. Also some sql theory plus common sense helps.

Yes i realized now i messed up my answer and did not make use of the editted country metadata (other) after removing CTE c and did not use a.final_list.

%%sql

SELECT 
         CASE WHEN 
             (SELECT count(*)
             FROM customer
             where country = c.country
             ) = 1 THEN "Other"
             ELSE c.country
         END
     as editted_country,
        COUNT(DISTINCT c.customer_id) customers,
        SUM(i.total) total,
        SUM(i.total) / COUNT(DISTINCT(i.customer_id)) as avg_sales_per_customer,
        SUM(i.total) / COUNT(i.invoice_id) as avg_order_value
     FROM customer c
    LEFT JOIN invoice i on i.customer_id = c.customer_id
    GROUP BY 
        editted_country   -- column alias in GROUP BY not allowed in sqlserver (allowed in postgres), copy paste whole CASE then
    ORDER BY 
        CASE WHEN editted_country = "Other" THEN 1 ELSE 0 END
        ,total DESC

This is the all in 1 solution. Nice that you used total from invoice rather than unit_price from invoice_line so you had to join 1 less table than the solution.
In the above, your CTE a went entirely into the correlated subquery, your CTE c went into ORDER BY CASE on the fly, the aliases from final SELECT were used directly in place of the useless aliases from CTE b. Side note, in your original query, there was no need to GROUP BY 2 in CTE c because each country is either a 1 or a 0, there will be no way a country can multiple rows of 1 or 0 or a mix generated from CTE c other than the Other countries, so GROUP BY 1 is still necessary there to prevent inflating each of the final stats of Other countries by 15x after LEFT JOIN.

Using column alias from SELECT in GROUP BY will work for sqlite and postgres but not sql server, so for that you need

%%sql

SELECT 
         CASE WHEN 
             (SELECT count(*)
             FROM customer
             where country = c.country
             ) = 1 THEN "Other"
             ELSE c.country
         END
     as editted_country,
        COUNT(DISTINCT c.customer_id) customers,
        SUM(i.total) total,
        SUM(i.total) / COUNT(DISTINCT(i.customer_id)) as avg_sales_per_customer,
        SUM(i.total) / COUNT(i.invoice_id) as avg_order_value
     FROM customer c
    LEFT JOIN invoice i on i.customer_id = c.customer_id
    GROUP BY 
        CASE WHEN 
             (SELECT count(*)
             FROM customer
             where country = c.country
             ) = 1 THEN "Other"
             ELSE c.country
        END
    ORDER BY 
        CASE WHEN editted_country = "Other" THEN 1 ELSE 0 END
        ,total DESC

It is important to give the CASE in SELECT a new name (editted_country above) and not name it back to country. You can try replacing the 2 appearances of editted_country with country to see that Other do not appear last anymore. I don’t understand what is going on in the backend and can’t find in sqlite docs but this has something to do with input column name clashing with output column. Here is a discussion on that issue: https://stackoverflow.com/questions/19848930/group-by-case-statement.

In this case, sqlite must have used the raw country rather than countries with some converted to Other, which still does not make sense because how can it order grouped data using a column that is ungrouped? If you wanted to do it the wrong way and ORDER BY country, but change the RHS from Other to some other country from the final result, you can see that country correctly moves to last position so this is confusing why it works too. You can even choose a country that got condensed into Other (eg. Argentina) and see that it has no effect on the output order.

Yes there must be these 2 levels at least, but in reversed order as you described above. In SQL, what is a step is really hard to define because engines break down queries and combine parts together during optimization, so i won’t say DQ did it in 1 step. I would say my solution above is 1 step since it has a single SELECT.

To create tables, you can use online generators for convenience.
https://www.tablesgenerator.com/markdown_tables
Wrapped in triple backtick to show you how the raw markdown looks.

|   | 0 |    1 |    2 | 3    |
|--:|--:|-----:|-----:|------|
| 0 | 1 |    2 | None | None |
| 1 | 2 |    3 |    4 | None |
| 2 | 1 | None | None | None |
| 3 | 1 |    2 |    3 |    4 |
0 1 2 3
0 1 2 None None
1 2 3 4 None
2 1 None None None
3 1 2 3 4

Correlated subqueries (not taught in mission): CASE and Subqueries. For every row in outer loop, run inner subquery. This could be slow compared to join solutions which that wiki link shows you how. Similarly, for this question, the correlated subquery method in DQ and my answer may not be the fastest.

After some further investigation, I partially understand this behaviour. By copying the CASE in ORDER BY to SELECT too, I saw which rows were getting converted to 1 and 0 to help me judge the sorting behaviour.

By not selecting the CASE with name clashing alias country, but country alone, I see that the Other row is actually Chile under the hood. If you tried changing the correlated subquery from searching for =1 to =5, Brazil and France which both have 5 customers will become the new Other, and France will be the 1 of the 2 selected when name clashing column alias country is referenced. I have no idea what’s going on here and why the country referenced in ORDER BY and SELECT are not using the output (Other) of CASE in SELECT. According to my study here it should: https://hanqi01.medium.com/how-badly-named-sql-column-aliases-confuse-when-using-group-by-and-order-by-261541f86f9a.

The 2 differences I can see are that we’re using CASE in GROUP BY and ORDER BY here, but not in my article, not sure why/how that messes up the country references in SELECT and ORDER BY.

My conclusion is just alias things properly or use subqueries and wrap more layers to avoid all these troubles.