Solution for 190-8 "Challenge: Each Country's Best Customer" seems overly complicated?

The challenge hints that the solution needs

  • One or more named subqueries defined in a WITH clause
  • Aggregate functions like SUM() and MAX()
  • Several INNER JOIN s
  • A subquery to define a column
  • GROUP BY and ORDER BY clauses

and that the solution is

WITH
    customer_country_purchases AS
        (
         SELECT
             i.customer_id,
             c.country,
             SUM(i.total) total_purchases
         FROM invoice i
         INNER JOIN customer c ON i.customer_id = c.customer_id
         GROUP BY 1, 2
        ),
    country_max_purchase AS
        (
         SELECT
             country,
             MAX(total_purchases) max_purchase
         FROM customer_country_purchases
         GROUP BY 1
        ),
    country_best_customer AS
        (
         SELECT
            cmp.country,
            cmp.max_purchase,
            (
             SELECT ccp.customer_id
             FROM customer_country_purchases ccp
             WHERE ccp.country = cmp.country AND cmp.max_purchase = ccp.total_purchases
            ) customer_id
         FROM country_max_purchase cmp
        )
SELECT
    cbc.country country,
    c.first_name || " " || c.last_name customer_name,
    cbc.max_purchase total_purchased
FROM customer c
INNER JOIN country_best_customer cbc ON cbc.customer_id = c.customer_id
ORDER BY 1 ASC

However, I was able to achieve the correct answer in a much much much simpler way:

with c_total_sales as (
    select
        customer.customer_id as id,
        customer.first_name || ' ' || customer.last_name as customer_name,
        customer.country as country,
        sum(invoice.total) as total_purchased
    from customer
    inner join invoice
        on customer.customer_id = invoice.customer_id
    group by id
    order by country
)

select
    country,
    customer_name,
    max(total_purchased) as total_purchased
from c_total_sales
group by country

Am I missing something? Or is the provided solution actually more complicated than needed to be?

1 Like

Hey.

Your solution has insidious issues that aren’t easy to spot, it works here because of some specific circumstances.

The issue is that you’re selecting columns without grouping by them. In a query where you use GROUP BY, every column must either by in the GROUP BY clause or an aggregate function must be used on it.

Please read this for more details. After reading this answer, feel free to read this similar question.

I hope this helps.

8 Likes

Hey @Bruno and @siuangie91

The solution also seemed to be overly complicated to me.

My solution was different again but also shorter than the solution. In brief, my logic was to:

  • Return a table ST of customer_id’s with the sum_total of the purchases for that customer.

  • Join the sum_totals table ST to the customer table and display the best customer by finding the max(total_purchased) grouped by country.

My understanding of the logic in the given solution is:

  • Return table customer_country_purchases with customer_id, country field from customer table and total_purchases

  • Create another table country_max_purchase of maximum purchases by country using customer_country_purchases

  • Create a third table country_best_customer that matches the country and max purchase from each country to a customer_id

  • Join country_best_customer back to the customer table to match the customer_name

Do you see any issues or reasons why my solution below may work here but not in general? I suspect the type of problem you are talking about relates to this GROUP BY lesson where the last value of a column that hasn’t been aggregated is displayed in the resultant table. In my solution below the aggregation is being performed on the inner joined table.

The columns selected are in each of the GROUP BY clauses I’ve used.

WITH
    st AS
            (
            SELECT 
                customer_id,
                SUM(total) AS sum_total
            FROM invoice
            GROUP BY customer_id
            )      
                        
SELECT
    country,
    first_name || " " || last_name customer_name,
    MAX(st.sum_total) total_purchased
FROM customer c
INNER JOIN st ON st.customer_id = c.customer_id
GROUP BY country             
ORDER BY 1
4 Likes

Hey, Mattrosinski.

Your understanding is great, you got the spirit right.

Not so. In the last part of the query you’re selecting three columns, one has an aggregate function, the other two don’t, but you’re only grouping by one of the remaining ones.

Some database systems won’t accept this as valid code, it will just yield an error. And some will do something that isn’t what you expect. Details can be read in the first link in my reply above.

4 Likes

I noticed that the given solution seemed overly complicated as well. My solution involved making a single view for my initial query, and then getting a query from that using an aggregate function. I did not use ORDER BY because I assumed that GROUP BY already covered the country ordering… Is this not correct thinking?

My solution:

CREATE VIEW customer_total AS
SELECT c.first_name || ’ ’ || c.last_name customer_name,
c.country,
SUM(i.total) total_purchases
FROM customer c
INNER JOIN invoice i ON i.customer_id = c.customer_id
GROUP BY 1;

SELECT customer_name,
country,
ROUND(MAX(total_purchases),2) total_purchased
FROM customer_total
GROUP BY country;

Regarding your view, I repeat my words from above.

Hi everyone,

Could someone tell if this code is potentially correct for non-educational purposes and working with other DBs.

It worked fine as a solution.

WITH sq1 AS (
select c.country country, c.customer_id, c.first_name || " " || c.last_name customer_name, SUM(i.total) total_purchased
from customer c
inner join invoice i ON c.customer_id = i.customer_id
group by c.customer_id
)

SELECT sq1.country, sq1.customer_name, max(sq1.total_purchased) total_purchased from sq1
INNER JOIN customer c ON c.customer_id = sq1.customer_id
GROUP BY sq1.country
ORDER BY sq1.country

Hi @1116, as @Bruno has well explained in his responses, your query will not be accepted in DBMSs (e.g., SQL Server) which require that when an aggregate function is used in a SELECT clause, all other columns used in the SELECT clause must be used in the GROUP BY clause.

3 Likes

Hi @Bruno and everyone,

I hope that you guys can check my solution.
It has the same result but is more simple than the solution mentioned in the exercise.
Thank you.

WITH country_customer AS
(
SELECT c.country,
c.first_name || " " || c.last_name AS customer_name,
SUM(i.total) AS total_purchased
FROM customer c
INNER JOIN invoice i ON c.customer_id = i.customer_id
GROUP BY 1, 2
ORDER BY 1
)

SELECT c1.country, c2.customer_name, MAX (c1.total_purchased) AS total_purchased
FROM country_customer c1
INNER JOIN country_customer c2 ON c1.customer_name = c2.customer_name
GROUP BY 1
ORDER BY 1

Hi everyone , please this is my own solution to the challenge, if you could help me check if there is anything to improve
@Bruno i will be glad if you could too.
thanks

Interesting. I was not aware of this. I think this needs to be stated and/or covered in the Introduction to SQL Subqueries course.

On reviewing some of my previous SQL solutions I see I had not included ALL of the SELECT columns in the corresponding GROUP BY yet my solution was accepted when I submitted it.

I too arrived here because of my “simple” answer to this exercise.


– Edit

I did some more research on this. According to the sqlite specification the following happens.

Side note: Bare columns in an aggregate queries. The usual case is that all column names in an aggregate query are either arguments to aggregate functions or else appear in the GROUP BY clause. A result column which contains a column name that is not within an aggregate function and that does not appear in the GROUP BY clause (if one exists) is called a “bare” column. Example:

SELECT a, b, sum ( c ) FROM tab1 GROUP BY a;

In the query above, the “a” column is part of the GROUP BY clause and so each row of the output contains one of the distinct values for “a”. The “c” column is contained within the sum()aggregate function and so that output column is the sum of all “c” values in rows that have the same value for “a”. But what is the result of the bare column “b”? The answer is that the “b” result will be the value for “b” in one of the input rows that form the aggregate. The problem is that you usually do not know which input row is used to compute “b”, and so in many cases the value for “b” is undefined.

Special processing occurs when the aggregate function is either min() or max(). Example:

SELECT a, b, max( c ) FROM tab1 GROUP BY a;

When the min() or max() aggregate functions are used in an aggregate query, all bare columns in the result set take values from the input row which also contains the minimum or maximum. So in the query above, the value of the “b” column in the output will be the value of the “b” column in the input row that has the largest “c” value. There is still an ambiguity if two or more of the input rows have the same minimum or maximum value or if the query contains more than one min() and/or max() aggregate function. Only the built-in min() and max()functions work this way

So while bare columns sometimes work for Min Max aggregate functions I guess the moral of the story would be to avoid use of bare columns in sqlite GROUP BY calls. :slight_smile:

2 Likes

Hi All -
Just chiming in as fellow Dataquest student. @Bruno 's response was great as I don’t recall the Dataquest’s intermediate SQL course mentioning that when an aggregate function is included in the SELECT clause, all of the non-aggregated columns - aka columns without a function applied - need to be included in the GROUP BY clause.

I found this link on stackoverflow explaining how GROUP BY can be interpreted as “for each”. So, “for each” country, customer, and purchase amount, return X, which helped me in my understanding.

Also helpful for this challenge was recalling the order of operations that SQL executes, which is:

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. ORDER BY
  7. LIMIT

Here’s my code answer for this below. Open to feedback!

 WITH customer_country_purchases AS
  (
  SELECT invoice.customer_id,
         customer.first_name,
         customer.last_name,
         customer.country,
         SUM(invoice.total) AS total_purchases

     FROM invoice
          INNER JOIN customer
                  ON customer.customer_id = invoice.customer_id 

    GROUP BY invoice.customer_id,
             customer.first_name,
             customer.last_name,
             customer.country
   )
   
/* 
   below query states:
   for each country where the sum of the total purchased of a customer is the highest amongst
   other customers from the same country, return the column with the countries' name, the
   customers' names, and the accompanying purchased amounts
*/

SELECT country,
       first_name || " " || last_name AS customer_name,
       total_purchases AS total_purchased
  FROM customer_country_purchases
 GROUP BY country
HAVING total_purchases = MAX(total_purchases);
1 Like

There is no aggregate function here, and GROUP BY was only applied to country

@semenchuk.ivan is it required to only use GROUP BY where there is an aggregate function?

Also - should GROUP BY have been applied to more than one column?

@orichh, my understanding after reading Bruno’s comments is that not applying GROUP BY to customer_name in your proposed solution is not advised and could cause problems with some other flavors of SQL.

In this mission this is not a problem because we are using SQLite.
I am glad I came across this thread to learn this potential issue with other SQL flavors.

when I saw my output for an sql i made it was:

WITH customerlist AS
(SELECT *
FROM customer),
billing AS
(SELECT customer_id, SUM(total) total_purchased
FROM invoice i
GROUP BY 1)

SELECT cl.country, cl.first_name ||" "|| cl.last_name customer_name, MAX(b.total_purchased) total_purchased
FROM customerlist cl INNER JOIN billing b ON cl.customer_id = b.customer_id
GROUP BY 1
ORDER BY 1 ASC

Even though the output is identical to the table with 3 columns dataquest refuses to acknowledge it as being correctis it a glitch, as when I use dataquest’s solution it automatically says correct solution even though the output are identical.