190-8 Feedback - Building and Organizing Complex Queries - Challenge

The published solution to this step feels a bit overly complicated. I completely recognize that there are many ways to approach this problem. It appears the solution given is written to encourage using the WITH … AS … workflow taught earlier in the mission. However, in my view, the prompt does not justify the complexity demonstrated in the solution.

Before looking at the solution, I passed the prompt with the following query:

WITH customer_total AS
(
    SELECT 
    c.country country,
    c.first_name || " " || c.last_name customer_name, 
    SUM(i.total) total
    FROM customer c
    INNER JOIN invoice i ON i.customer_id = c.customer_id
    GROUP BY customer_name
)

SELECT ct.country, ct.customer_name, MAX(ct.total) total_purchased 
FROM customer_total ct
GROUP BY 1 ORDER BY 1;

The crux of the solution appeared to be

  1. Find the total amount customers purchased.
  2. Take the maximum customer purchase amount when grouped by country.

When I went to look at the solution, I was at first surprised by the length of the query, and then confused.
I’m curious if there’s a strategic benefit to the level of complexity shown in the solution? Perhaps I’ve missed the point on when and why to use chained WITH clauses?

Again, completely recognize there are numerous ways to approach this, don’t intend to be difficult. Just interested to talk through it a bit more if someone is game.

1 Like

Hey, Wolf. You did well in asking this question. Thanks!

I think the solution doesn’t need to be as long-winded as Dataquest’s is. But yours has issues that are swept under the rug; it ends up working out because of very particular circumstances of the data and the database.

There’s one common thread to all the issues: Including non-aggregated columns in SELECT, but not in GROUP BY and the way SQLite handles this.

For example, in customer_total you want to query for the total by customer, and then fetch the customer’s country. This is a good plan, but your execution only works because each customer only has one country (I think), but this could easily not be the case (the customer can move, change their nationality, or the database could simply have erroneous data) — it’s better to shield the queries against issues like this by assuming as less as possible about the data;

In the final bit, where you get the results, you end up getting correct results because there aren’t any ties by country (if I recall correctly). But what if there were?

Because you’re grouping by only one of the non-aggregated columns, the value (read name) that is picked is the last, since there no ties it’s both the last and the first and thus there are no problems here, but if there were ties, you’d be missing out on all except the last occurrence of them.

2 Likes

I see your point, thank you for the reply!

Just to ponder a bit further, on your first point regarding a customer moving or having erroneous data: I suppose this would depend on the type of question being asked. Are we interested in the total purchases based on where the customer was when buying a track, or by the location of the customer on record?

In the first case, we would want to take the billing_country value from the invoice table. This would create unique rows for any customer that had made purchases with multiple billing addresses.

In the second case, I believe grouping by the customer_id value would yield the same result and grouping by customer_name and country, assuming we are taking country from the customer table. Each customer must have only one country, unless they had two customer records, i.e. purchased under two different accounts. But we would have no way of being sure this is the same person. Still, to your point taking only customer_name definitely lends itself to errors should two different customers share the same name.

Thank you for the attention here, I appreciate your help in improving my understanding.

For sure! It would be more precise to use customer ids throughout and fetch the names at the very end. Having said this, I think I remember there being an observation somewhere stating that each name is unique in this database, and so it ends up working out.

We’re interested in the customer.country, I believe.

Valid points from a business perspective, but the SQL issues are still valid.

I edited my answer to you in another recent post with an extra example where you’ll be able to experiment with some things and better understand the importance of grouping by all the non-aggregated columns.