Alternative solution for a review

Screen Link: Learn data science with Python and R projects

Here´s a short story of how I came up with the solution. First I wrote a solution that gave the right output. Then I checked the DQ answer and found it way more complicated than my first solution. So, I went to the community to check what other learners thought about the DQ solution and such. Looking through the topics I realized that I made quite a common mistake of leaving one of the non-aggregate columns from the SELECT clause outside the GROUP BY clause. So, I came back to the DQ app to try to find another way around but still somewhat simpler than the one offered by the DQ .

So, here´s my solution:

WITH 
    customer_purchases AS
                    (
                     SELECT i.customer_id,
                            c.country,
                            SUM(i.total) total_purchases
                       FROM invoice i
                      INNER JOIN customer c ON c.customer_id = i.customer_id
                      GROUP BY i.customer_id, c.country
                    ),
    max_purchase AS
                    (
                     SELECT country,
                            MAX(total_purchases) country_max_purchase
                       FROM customer_purchases 
                      GROUP BY country
                     )
                     
SELECT mp.country country,
       c.first_name || " " || c.last_name customer_name,
       cp.total_purchases total_purchased
  FROM customer c
 INNER JOIN customer_purchases cp ON c.customer_id = cp.customer_id
 INNER JOIN max_purchase mp ON mp.country = cp.country AND mp.country_max_purchase = cp.total_purchases 
 ORDER BY country;

What do you think about it? Am I missing some possible pitfalls?

1 Like

Hello @ksenia.kustanovich,

I agree with you. When I faced this task I think the way DQ suggested was safer because there is also 1 thing we need to pay attention to when we use ON…AND with JOIN which is explained in detail here especially the answer given by John Woo.

BUT when I did the Guide Project: Answering Business Questions Using SQL, I came up with 1 question: find the best-selling album in each country based on the number of tracks sold from this album, I realize there is a problem if I use the same way that DQ suggested. The problem is that in USA, there are 2 albums that has the same number of tracks sold and both are the highest (i.e. there are 2 top albums for the US).

From my point of view I explain this problem like this:
when you use a subquery in a SELECT clause, it’s critical that subquery returns exactly 1 value (0 dimension) because the result of that being used to fill in just 1 cell of result. In my work I try to find the album that has the highest number of tracks sold but 2 or 3 albums can have the same number of tracks sold, which are the highest too. => the result of SELECT subquery is 1 dimension, so the engine should return error BUT sqlite doesn’t work like that, it still returns a number and miss other values.

So to solve this problem I try to JOIN 2 tables based on 2 columns and there is an interesting link too.

I am on my way to completing the data science path too, so this is just my thought. I am very please to discuss with others on this and maybe @Bruno and @hanqi can help us to be clear about this? :))
Thank you in advance.

I’m too lazy and too busy at the moment to gather all the context to discuss this, however. . .

. . . this is probably right. SQLite is often very permissive in ways that can be harmful not only for learning, but also for getting the results we actually want.

1 Like

Hi! I too am looking for feedback, and hoping that my “more simple” solution is not a fluke:

WITH sumer AS
    (
     SELECT 
        i.customer_id,
        c.first_name||" "||c.last_name AS customer_name,
        c.country,
        SUM(i.total) AS total_purchased
       FROM customer AS c
      INNER JOIN invoice AS i ON i.customer_id = c.customer_id
      GROUP BY i.customer_id, c.country
    ),
    
    maxer AS
    (
     SELECT *, MAX(total_purchased) AS maxed
       FROM sumer
      GROUP BY country
    )

SELECT country, customer_name, total_purchased FROM maxer;

Thanks!

You can actually get the same answer with even less code:

WITH sumer AS
(
SELECT
i.customer_id,
c.first_name||" "||c.last_name AS customer_name,
c.country,
SUM(i.total) AS total_purchased
FROM customer AS c
INNER JOIN invoice AS i ON i.customer_id = c.customer_id
GROUP BY i.customer_id, c.country
)

SELECT
country,
customer_name,
MAX(total_purchased) AS total_purchased
FROM sumer
GROUP BY country

1 Like

I also wish to submit my solution which is quite simple, I would like feedback on my solution for any pointers. Kind regards.

alternative_solution.ipynb (8.5 KB)

Click here to view the jupyter notebook file in a new tab

Good and simple solution but I’m not sure if this works in real life.

At work I use postgreSQL and everytime I use an aggregate function such as MAX( ) I MUST include all the non-aggregate columns in the GROUP BY clause. So if I ran your code, the debugger will throw an error and tell me “customer_name must be in GROUP BY clause”

Maybe other dialects of SQL such as SQLite and T-SQL allow you to do that.

Could someone please have a look at my solution and see if it’s correct or needs to be imrpoved.

Hi @NickKobets,

(FINALLY found the exact same solution as mine…!)
I was too surprised with DQ’s suggested answer too;;;;
Though I’m not a guru in data science, below is what I figured out from reading through DQ’s codes several times, and I am waiting for other members’ opinions as well.

Honestly, it took a while for me to draw my work flows on a paper first (first this…then this…) and coded the solution like yours. However, I think DQ CODED their logical flow instead of scribbling, and, in the mean time, they did not hesitate to make as many new tables as they need. (Kinda like they suggested in basic python sessions that we should utilize customized functions when our codes are about to get messy)

DQ’s solution can be easier for other members/readers to follow how DQ(the first drafter) thought through and, therefore, be helpful with teamwork. But then, here comes my question-- is DQ suggesting that “comprehensibility in terms of coding logic” outweighs the “readability in terms of simplicity”? I want to know the practical norm in data science field.

Look forward to comments from other members…!