Clarification on the CTE - Answering Business Questions using SQL

The code below is from the solution file for step 6 under Guided Project: Answering Business Questions using SQL.

WITH invoice_first_track AS
    (
     SELECT
         il.invoice_id invoice_id,
         MIN(il.track_id) first_track_id
     FROM invoice_line il
     GROUP BY 1
    )

SELECT
    album_purchase,
    COUNT(invoice_id) number_of_invoices,
    CAST(count(invoice_id) AS FLOAT) / (
                                         SELECT COUNT(*) FROM invoice
                                      ) percent
FROM
    (
    SELECT
        ifs.*,
        CASE
            WHEN
                 (
                  SELECT t.track_id FROM track t
                  WHERE t.album_id = (
                                      SELECT t2.album_id FROM track t2
                                      WHERE t2.track_id = ifs.first_track_id
                                     ) 

                  EXCEPT 

                  SELECT il2.track_id FROM invoice_line il2
                  WHERE il2.invoice_id = ifs.invoice_id
                 ) IS NULL
             AND
                 (
                  SELECT il2.track_id FROM invoice_line il2
                  WHERE il2.invoice_id = ifs.invoice_id

                  EXCEPT 

                  SELECT t.track_id FROM track t
                  WHERE t.album_id = (
                                      SELECT t2.album_id FROM track t2
                                      WHERE t2.track_id = ifs.first_track_id
                                     ) 
                 ) IS NULL
             THEN "yes"
             ELSE "no"
         END AS "album_purchase"
     FROM invoice_first_track ifs
    )
GROUP BY album_purchase;

Can somebody please throw some light on the logic behind query in the CTE is querying out the MIN of the track_id for the calculation in question?

I am having A LOT of difficulty with this same problem. Can you suggest any other resources to practice more complex SQL queries? It feels like there is a really big jump between this question, as well as the question where you have to group countries into ‘Other’, and other SQL practice questions in the courses and projects. I am having trouble bridging this gap and really want to increase my skills.

6 Likes

Could you actually walk through this solution in chunks, and explain each chunk? I am having trouble keeping track of all the queries within queries and how they fit together.

1 Like

I’m no expert either, I think @Bruno will be able to help us here.

Please see this post.

Also take a look at this, this, and this as this question has been asked multiple times in the past.

@Bruno maybe I am missing something, but none of the referenced posts contain a clear step-by-step breakdown of the logic behind the solution. It seems that the question keeps being asked, in part, because no clear answer has been provided. Are you capable of doing that?

2 Likes

Hey, Joseph.

I honestly didn’t read Sahil breakdown which he provided here (and that I linked to above), but I trust my colleague’s work.

If the community judges it insufficient, I will at least expect a more focused question regarding the explanation, before thinking the explanation is unclear. Have you read it?

1 Like

Thank you! I will check these out.

2 Likes

Thank you @Bruno. Sahil’s breakdown of the query is very helpful.

2 Likes

I also had a very hard time with this one. I took a different approach from the one presented on the guided project solution but still managed to the get to the same results: 18,57% of the purchases are only of entire albums.

I’m posting the solutions bellow, hope it helps:

%%sql
with track_per_album as

(
select
    track.album_id,
    count(track.track_id) number_of_tracks_album
    from track
    group by 1
),

track_per_invoice_per_album as 

(
  select
  il.invoice_id,
  t.album_id,  
  count(t.track_id) tracks_bought
from invoice_line il
inner join track t on il.track_id = t.track_id
group by il.invoice_id,t.album_id),

albuns_per_invoice as

(select 
     tia.*,
     ta.number_of_tracks_album,
     case 
       when tia.tracks_bought = ta.number_of_tracks_album then 1
        else 0
       end as full_album_check       
    from track_per_invoice_per_album tia
    inner join track_per_album ta on tia.album_id = ta.album_id),

full_albuns_per_invoice as

(select
  invoice_id invoice,
  count(album_id) albuns_per_invoice,
  sum(full_album_check) full_albuns_per_invoice,
  case 
    when count(album_id) = sum(full_album_check) then 1
      else 0
     end as album_purchase
  from albuns_per_invoice
  group by invoice_id)

select
   count(invoice) total_purchases,
   sum(album_purchase) album_purchases,
   Round(cast(sum(album_purchase) as float)/cast(count(invoice) as float),4) percentage_album_purchases,
   1 - Round(cast(sum(album_purchase) as float)/cast(count(invoice) as float),4) percentage_non_album_purchases
   from full_albuns_per_invoice```
1 Like

Moving from the previous module (Organizing Complex Queries) to this Guided Project was like jumping from First Grader Math problem to GMAT Exam.
It’s a relief that somebody feel the same :cold_sweat:

3 Likes

Bruno,

I’m going to second Joseph’s question. I read through all the posts you linked to and none provided an explanation of how the solution to the last problem was put together, just a repeat of the instructions.

A more focused question regarding the explanation is that there is no explanation in the solution file so any explanation would be good. I would recommend that you read it.

2 Likes

@Aaronask The following explanation is taken from Sahil’s post here. Hope this helps.

Query Explanation

Summary

We created a subquery invoice_first_track that finds the first_track_id from the invoice_line which we use to determine whether an invoice was album purchase or individual track purchases. Then we performed aggregations on the result to find out the number of invoices and the percentage of them separately for album purchase as well as individual track purchases.

Granular View

SELECT 
  * 
FROM 
  invoice_line

(4757 Rows)

SELECT 
  il.invoice_id invoice_id, -- 23 (First Value)
  MIN(il.track_id) first_track_id -- 1 (First Value)
FROM 
  invoice_line il

(1 Row)

SELECT 
  il.invoice_id invoice_id, 
  MIN(il.track_id) first_track_id 
FROM 
  invoice_line il 
GROUP BY 
  1

(614 Rows)

The above query finds the track_id with the lowest number in the invoice_line table for each invoice_id. However since an invoice can either be an album purchase or individual track purchases, we cannot say that every track_id we have is the first track of the album. It is only first, if the invoice was an album purchase. However, this track_id will help us to identify whether it was an album purchase or individual track purchases. And we can do so by getting all the tracks in an album which contains our first track_id and comparing it with all the tracks in the invoice, if both are returning the same set of tracks then it is an album purchase, otherwise it is individual track purchases. So let’s save this as a subquery with the name invoice_first_track.

WITH invoice_first_track AS (
  SELECT 
    il.invoice_id invoice_id, 
    MIN(il.track_id) first_track_id 
  FROM 
    invoice_line il 
  GROUP BY 
    1
)

Since this is bit complicated query, we will separate the complicated part and use some abstraction to make it appear simple.

SELECT 
  ifs.*, 
  CASE WHEN (
    SELECT 
      t.track_id 
    FROM 
      track t 
    WHERE 
      t.album_id = (
        SELECT 
          t2.album_id 
        FROM 
          track t2 
        WHERE 
          t2.track_id = ifs.first_track_id
      ) 
    EXCEPT 
    SELECT 
      il2.track_id 
    FROM 
      invoice_line il2 
    WHERE 
      il2.invoice_id = ifs.invoice_id
  ) IS NULL 
  AND (
    SELECT 
      il2.track_id 
    FROM 
      invoice_line il2 
    WHERE 
      il2.invoice_id = ifs.invoice_id 
    EXCEPT 
    SELECT 
      t.track_id 
    FROM 
      track t 
    WHERE 
      t.album_id = (
        SELECT 
          t2.album_id 
        FROM 
          track t2 
        WHERE 
          t2.track_id = ifs.first_track_id
      )
  ) IS NULL THEN "yes" ELSE "no" END AS "album_purchase" 
FROM 
  invoice_first_track ifs

(614 Rows)

If we replace the query that identifies whether the tracks in the invoice and tracks in the album were same or not with an abstract name called remaining_tracks then the above query is very easy to understand and it immediately becomes too short.

SELECT 
  ifs.*, 
  CASE WHEN remaining_tracks IS NULL THEN "yes" ELSE "no" END AS "album_purchase" 
FROM 
  invoice_first_track ifs

(614 Rows)

Now let’s leave the remaining_tracks query explanation there until we check it’s outer query.

SELECT 
  * 
FROM 
  invoice

(614 Rows)

SELECT 
  album_purchase, -- yes
  COUNT(invoice_id) number_of_invoices, -- 614
  CAST(
    count(invoice_id) AS FLOAT
  ) / (
    SELECT 
      COUNT(*) 
    FROM 
      invoice
  ) percent -- 1.0
FROM 
  (
    SELECT 
      ifs.*, 
      CASE WHEN remaining_tracks IS NULL THEN "yes" ELSE "no" END AS "album_purchase" 
    FROM 
      invoice_first_track ifs
  )

(1 Row)

SELECT 
  album_purchase, 
  COUNT(invoice_id) number_of_invoices, 
  CAST(
    count(invoice_id) AS FLOAT
  ) / (
    SELECT 
      COUNT(*) 
    FROM 
      invoice
  ) percent 
FROM 
  (
    SELECT 
      ifs.*, 
      CASE WHEN remaining_tracks IS NULL THEN "yes" ELSE "no" END AS "album_purchase" 
    FROM 
      invoice_first_track ifs
  ) 
GROUP BY 
  album_purchase;

(2 Rows)
Select all rows from our invoice_first_track subquery and if the result of remaining_tracks is NULL then assign "yes" to album_purchase else assign "no" to album_purchase. Find the number of invoices and there percentage of invoices grouped by album_purchase (yes and no). Yes, the entire query was as simple as that. So let’s move on to the query that appears to be a complicated one. But before that, we need to make sure that we understand how EXCEPT works. EXCEPT is basically used to subtract one set of rows from another. Let’s say a query A generated 5 rows (1, 2, 3, 4, 5) and query B generated 4 rows (1, 2, 3, 4). If we subtract B from A, we will get (5). That is, we will get rows that are in A but not in B. Now let’s subtract A from B. We will get NULL because there are no rows in B that are not in a A. So let’s say query A is album_tracks and B is invoice_tracks. So to be considered as album_purchase, (A - B) AND (B - A) should evaluate to True. However, as there is difference between the results of A and B, (A - B) AND (B - A) will become 5 AND NULL. And this will evaluate to False. Now that we understand the concept, let’s dive into understand remaining_tracks. We will explore from outer query to inner query by using abstract names.

SELECT 
  ifs.*, 
  CASE WHEN remaining_tracks IS NULL THEN "yes" ELSE "no" END AS "album_purchase" 
FROM 
  invoice_first_track ifs
SELECT 
  ifs.*, 
  CASE WHEN (
    album_tracks -- A
    EXCEPT -- -
    invoice_tracks -- B
  ) IS NULL 
  AND (
    invoice_tracks -- B
    EXCEPT -- -
    album_tracks -- A
  ) IS NULL THEN "yes" ELSE "no" END AS "album_purchase" 
FROM 
  invoice_first_track ifs
SELECT 
  ifs.*, 
  CASE WHEN (
    SELECT 
      t.track_id 
    FROM 
      track t 
    WHERE 
      t.album_id = album_id_first_track 
    EXCEPT 
    SELECT 
      il2.track_id 
    FROM 
      invoice_line il2 
    WHERE 
      il2.invoice_id = ifs.invoice_id
  ) IS NULL 
  AND (
    SELECT 
      il2.track_id 
    FROM 
      invoice_line il2 
    WHERE 
      il2.invoice_id = ifs.invoice_id 
    EXCEPT 
    SELECT 
      t.track_id 
    FROM 
      track t 
    WHERE 
      t.album_id = album_id_first_track
  ) IS NULL THEN "yes" ELSE "no" END AS "album_purchase" 
FROM 
  invoice_first_track ifs
SELECT 
  ifs.*, 
  CASE WHEN (
    SELECT 
      t.track_id 
    FROM 
      track t 
    WHERE 
      t.album_id = (
        SELECT 
          t2.album_id 
        FROM 
          track t2 
        WHERE 
          t2.track_id = ifs.first_track_id
      ) 
    EXCEPT 
    SELECT 
      il2.track_id 
    FROM 
      invoice_line il2 
    WHERE 
      il2.invoice_id = ifs.invoice_id
  ) IS NULL 
  AND (
    SELECT 
      il2.track_id 
    FROM 
      invoice_line il2 
    WHERE 
      il2.invoice_id = ifs.invoice_id 
    EXCEPT 
    SELECT 
      t.track_id 
    FROM 
      track t 
    WHERE 
      t.album_id = (
        SELECT 
          t2.album_id 
        FROM 
          track t2 
        WHERE 
          t2.track_id = ifs.first_track_id
      )
  ) IS NULL THEN "yes" ELSE "no" END AS "album_purchase" 
FROM 
  invoice_first_track ifs

Now that we have successfully understood, how that query works, let’s do some number crunching. One thing to note is that, we won’t be able to understand the output of the current query because the query does the same process for all the invoices. So let’s pick a first_track_id to make it easier.

SELECT 
  t2.album_id 
FROM 
  track t2 
WHERE 
  t2.track_id = 1158 -- 91

(1 Row)

SELECT 
  t.track_id 
FROM 
  track t 
WHERE 
  t.album_id = (
    SELECT 
      t2.album_id 
    FROM 
      track t2 
    WHERE 
      t2.track_id = 1158
  )

(16 Rows)

SELECT 
  il2.track_id 
FROM 
  invoice_line il2 
WHERE 
  il2.invoice_id = 1

(16 Rows)

SELECT 
  t.track_id 
FROM 
  track t 
WHERE 
  t.album_id = (
    SELECT 
      t2.album_id 
    FROM 
      track t2 
    WHERE 
      t2.track_id = 1158
  ) 
EXCEPT 
SELECT 
  il2.track_id 
FROM 
  invoice_line il2 
WHERE 
  il2.invoice_id = 1

(0 Rows) - NULL

Now let’s do the reverse (B - A).

SELECT 
  il2.track_id 
FROM 
  invoice_line il2 
WHERE 
  il2.invoice_id = 1 
EXCEPT 
SELECT 
  t.track_id 
FROM 
  track t 
WHERE 
  t.album_id = (
    SELECT 
      t2.album_id 
    FROM 
      track t2 
    WHERE 
      t2.track_id = 1158
  )

(0 Rows) - NULL

This is definitely an album purchase. However, let’s proceed and verify that.

SELECT 
  ifs.*, 
  CASE WHEN (
    SELECT 
      t.track_id 
    FROM 
      track t 
    WHERE 
      t.album_id = (
        SELECT 
          t2.album_id 
        FROM 
          track t2 
        WHERE 
          t2.track_id = ifs.first_track_id
      ) 
    EXCEPT 
    SELECT 
      il2.track_id 
    FROM 
      invoice_line il2 
    WHERE 
      il2.invoice_id = ifs.invoice_id
  ) IS NULL 
  AND (
    SELECT 
      il2.track_id 
    FROM 
      invoice_line il2 
    WHERE 
      il2.invoice_id = ifs.invoice_id 
    EXCEPT 
    SELECT 
      t.track_id 
    FROM 
      track t 
    WHERE 
      t.album_id = (
        SELECT 
          t2.album_id 
        FROM 
          track t2 
        WHERE 
          t2.track_id = ifs.first_track_id
      )
  ) IS NULL THEN "yes" ELSE "no" END AS "album_purchase" -- yes
FROM 
  invoice_first_track ifs 
WHERE 
  ifs.invoice_id = 1

(1 Row)

And the above process happens with every single invoices so finding the rows of each tiny query is not easy for this one.

3 Likes

Super !!!

After reading multiple answers, this i think is the best which has given right direction to learn and understand

Hi menon,

I do not get how to see whether it is an album or track from the query output please explain?