191-6: Cheating: SQL is hard ;)

Screen Link:
https://app.dataquest.io/m/191/guided-project%3A-answering-business-questions-using-sql/6/albums-vs-individual-tracks

I have cheated a little bit on this project and I got a different result than the answer, so maybe my cheat was not that great. In general I’ve been banging my head against the wall with SQL, I get the syntax, but I don’t know how to use it (yet!). I wonder how you would fix my ■■■■■■ solution.

WITH albums_invoice AS
    (
    SELECT il.invoice_id,
           COUNT(DISTINCT t.album_id) different_albums
      FROM invoice_line il
      JOIN track t ON t.track_id = il.track_id
     GROUP BY 1
    ),
    
    albums_invoice_def AS
    (
    SELECT *,
           CASE
               WHEN different_albums == 1 THEN "Album"
               ELSE "Singles"
               END AS album_singles
      FROM albums_invoice
    )
    
SELECT album_singles,
       COUNT(invoice_id) count,
       COUNT(invoice_id) * 100.0 / SUM(COUNT(invoice_id)) OVER () AS Percentage
  FROM albums_invoice_def
 GROUP BY 1
album_singles	count	Percentage
Album	        171	        27.850162866449512
Singles	        443	        72.14983713355049

Unfortunately there are some albums with one or two songs on it, which are basically singles as well. The total number of those is 90. But of course not every one of those albums has been purchased, so I can’t hack by simply deducting from the histogram. I wrote this code, but I can’t seem to integrate it anywhere in my original. Do I really have to work with EXCEPT then?

WITH tracks_per_album AS
    (
    SELECT album_id,
      COUNT(track_id) tracks   
      FROM track
     GROUP BY 1
    )
    
SELECT tracks,
       COUNT(album_id)
  FROM tracks_per_album
 GROUP BY 1
tracks	COUNT(album_id)
1	82
2	8
3	3
4	2
5	2
6	3
7	8
8	14
9	15
10	27
11	25
12	29
13	16
14	34
15	18
16	13
17	15
18	8
19	3
20	5
21	2
22	3
23	3
24	3
25	2
26	1
30	1
34	1
57	1

Sorry for the long post :blush:

Hi @DavidMiedema

Apologies if I have completely misunderstood your post. I am confused about the below query.

This query might result in a scenario like this:

  • Album X has 10 tracks - “Customer A” purchases album X only - so complete album for that invoice - classified as “album”

  • Album Y & Z have 4 & 5 tracks respectively - “Customer B” purchases 3 tracks from Y and all the 5 tracks from Z - but this will get classified as “singles”, since for this invoice Count(Distinct(album_id)) > 1 (one for Y and one for Z) even though Z album has been purchased as a full album.

Perhaps why your results are so off from the solution.

Try to tackle this query this way:

  1. Count the no. of tracks in each album. At this point, we don’t care if they are purchased or not-purchased. We just want to know how many tracks are associated with a given album (and this data is best not taken from invoice_line table)

  2. Decide if you want to include the albums which only have 1 or 2 songs with them. Having or Where Clauses (or solution has MIN() applied) are options here

  3. Count the tracks purchased per album per invoice. For example:

    invoice_id album_id track_count
    A X 10
    B Y 3
    B Z 5

Match the count of tracks in 3rd with that of 1. based on your decision in 2. If the count of tracks in 1. and 3. match - full album; if not - single tracks were selected.

The 3. of course may not be this straightforward like it looks here (I wish!). :stuck_out_tongue_winking_eye:

If you didn’t get any of this no worries, I took my time to break this code too and I will try to respond in a better way.

this post has an answer too a bit different from solution so only take a look if you wish to - here

1 Like

Did the exact same thing, same result. :handshake:

After @Rucha’s explanation, it’s clearer. But during this whole section, I feel like all SQL missions that have more than 3 tables need a zoom meeting, texts are just so inefficient. :joy:

While the DQ class form is awesome, I personally think videos may help greatly in introducing SQL missions specifically.

-----------------------------------------------------------edits below------------------------------------------------------------
Just finished the project, managed to ‘cheat’ around EXCEPT, with a side-effect of hair loss… :crazy_face:

There might be ways to clean it up a bit but I can’t be bothered for now, here’s my code:

%%sql
WITH 
track_album AS 
(SELECT t.album_id,
        COUNT(t.track_id) track_num      
FROM track t
INNER JOIN album a ON a.album_id = t.album_id
GROUP BY 1),

orders AS 
(SELECT i.invoice_id,
        a.album_id,
        COUNT(il.track_id) track_num
FROM invoice i
INNER JOIN invoice_line il ON il.invoice_id = i.invoice_id
INNER JOIN track t ON t.track_id = il.track_id
INNER JOIN album a ON a.album_id = t.album_id
GROUP BY 1,2),

purchase_cat AS 
(SELECT CASE 
       WHEN o.track_num = ta.track_num THEN 'album' 
       ELSE 'single'
       END AS purchase_cat,
       o.invoice_id
FROM orders o
INNER JOIN track_album ta ON ta.album_id = o.album_id),

edge_free_cat AS 
(SELECT CASE
       WHEN COUNT(DISTINCT(purchase_cat))= 2 THEN 'single'
       ELSE purchase_cat
       END AS purchase_cat,
       invoice_id
FROM purchase_cat
GROUP BY invoice_id)

SELECT purchase_cat,
       COUNT(invoice_id) amount,
       ROUND(CAST(COUNT(invoice_id) AS FLOAT)/ (SELECT COUNT(invoice_id) FROM edge_free_cat),2) percentage
FROM edge_free_cat
GROUP BY purchase_cat

result:

purchase_cat amount percentage
album 114 0.19
single 500 0.81

I know it’s not as neat as the solution, but I came up with this and it makes a bit more sense to me personally.

1 Like

Hi @veratsien,

In the meantime while preparing my portfolio I actually manage to fix it!

%%sql
WITH albums_invoice AS
    (
    SELECT il.invoice_id,
       COUNT(DISTINCT t.album_id) different_albums,
       COUNT(DISTINCT t.track_id) amount_tracks
      FROM invoice_line il
      JOIN track t ON t.track_id = il.track_id
     GROUP BY 1
    ),
    
    albums_invoice_def AS
    (
    SELECT *,
       CASE
       WHEN different_albums == 1 AND amount_tracks > 5 THEN "Album" 
       ELSE "Singles" -- #The line above filters for invoices containing only 1 album id and at least 6 tracks
       END AS album_singles
        FROM albums_invoice
    )
    
SELECT album_singles,
       COUNT(invoice_id) count,
       COUNT(invoice_id) * 100.0 / SUM(COUNT(invoice_id)) OVER () AS percentage
  FROM albums_invoice_def
 GROUP BY 1

Like I was so close all the time, and I must say, after a lot of python, you also get the hang of SQL. Just go on that steep learning curve, it’s okay not to get everything rightaway!

Even though I am still ‘assuming’ that 5 tracks with the same album ID are albums. They actually say this in the exercise, that these edge cases of people buying a lot of tracks from the same album instead of buying the album hardly happens.

Well now I know, it happened 17 times.

Congrats!

2 Likes

Great work! Your code is so much more concise than mine! I just realized there’s absolutely no need to join the invoice table after reading your first temp table. Duh. I will take that as a sign to take a break. :joy:

And thanks for the encouragement! Makes me feel better about the spaghetti both in my code and head. :crazy_face:

Btw, I can totally see this post the start of EXCEPT haters club… lol

1 Like

You know, this was as hard as it gets for SQL, there is nothing harder than this question in this website :wink:

EXCEPT

No, no, we don’t go there :muscle:

By the way my code does miss the mark by 17 albums compared to the answer, yours is the answer, so I’m not certain who wins, you worked harder. That’s for sure.

EXCEPT

No, still no.

1 Like

■■■■. As far as I’m concerned, we both did an EXCEPTIONAL job just for EXCEPTING EXCEPT :wink:

I also left the EXCEPT aside because it feels more natural to me. I spent two days on this exercise, but I’m glad I finally found a great solution.

I’ll share my AnsweringBusinessQuestionsUsingSQL.ipynb (33.8 KB) for the ones who are interested in a different approach