GP: Answering Business Questions using SQL

Hello there!
This specific SQL exercise is haunting me all the time. I spend more than 10 hours trying to solve it but I can not come up with the solution. My SQL code is:

/* selecting all existing albums and their songs */

WITH albums AS (
SELECT a.title, t.track_id from album a
INNER JOIN track t ON t.album_id=a.album_id
),

/* selecting all customer purchases with their songs and album names */

customer_purchases AS (
SELECT c.first_name|| ’ '|| c.last_name customer_name, i.invoice_id, a.track_id, a.title
FROM CUSTOMER c
INNER JOIN invoice i ON c.customer_id=i.customer_id
INNER JOIN invoice_line il ON i.invoice_id=il.invoice_id
INNER JOIN albums a ON il.track_id=a.track_id
)

I have two tables, the first one with all songs in each album and the next table returns all the purchases of customers with an album name. I am using customer name and album names to easily understand the result. If I perform albums EXCEPT customer_purchases, I end up with only those songs/tracks that have never been purchased. How do I create a table that shows all tracks, whether bought or not, of each album for each customer?
If I am on the wrong track, what type of two tables should I create to be able to use EXCEPT to get the number of album purchases?
I checked the answer and it has a different approach. I think my approach should also work if I only could understand how to combine two tables when there is no common key that would indicate whether this albums all songs were purchased by a customer.
Thanks for your help!

Hey, saidakbarp.

A caveat: I didn’t run your query and I’m basing my comments on my memory of this database.

Seems like you want to join these tables on track_id.

To get a table as you described, I don’t see a way to do an EXCEPT that doesn’t involve some sort of join that would make EXCEPT useless.

HI Bruno,
This is a bit difficult exercise. Please, check out the mission question first. Joining with track_id does not help because we have two levels of ids to consider: if customer (customer_id) purchased any of the tracks, then pull all the tracks (track_id) from the album whether the customer (customer_id) purchased all the track or not. Because album table has many albums with each track and customer buys only several tracks, I will need to create a table which has each customer and albums (with all track) the customers’ tracks belong to. This has been a difficult problem. I hope I will get a hint how to proceed. Thanks!

I think I now understand what you want. If I understood you correctly, this isn’t going in the right direction because you’re assuming that a customer can’t buy an album little by little in different purchases.

Anyway, what you want can be obtained with the following query (slightly adjusted from yours):

  WITH customer_purchases AS (
SELECT DISTINCT c.first_name || ' ' || c.last_name AS customer_name, c.customer_id,
       i.invoice_id, il.track_id
  FROM CUSTOMER AS c
 INNER JOIN invoice i ON c.customer_id=i.customer_id
 INNER JOIN invoice_line il ON i.invoice_id=il.invoice_id
),

       all_tracks_customer AS (
SELECT t.album_id, t.track_id, c.customer_id
  FROM track AS t
 CROSS JOIN customer as c
)

SELECT atc.*, cp.customer_name
  FROM all_tracks_customer AS atc
  LEFT JOIN customer_purchases AS cp
       ON atc.track_id = cp.track_id AND atc.customer_id = cp.customer_id
;

The table all_tracks_customer has every possible combinations of track and customer. It doesn’t tells us which tracks the customer purchased.

The table customer_purchases is one you’re familiar with. I just made an adjustment and removed duplicates. That’s because there are invoices with duplicate tracks (i.e. the customer bought the same track in the same purchase). If you care who they are, the customers who did this are those whose id is in the list [3, 10, 12, 13, 28, 34, 37, 43, 46, 48, 49].

Check what this table looks like:

album_id track_id customer_id customer_name
7 1 1 29 None
66 1 6 29 None
125 1 7 29 None
184 1 8 29 None
243 1 9 29 None
302 1 10 29 None
361 1 11 29 None
420 1 12 29 None
479 1 13 29 None
538 1 14 29 None
597 2 2 29 None
656 3 3 29 None
715 3 4 29 None
774 3 5 29 None
833 4 15 29 None
892 4 16 29 None
951 4 17 29 Robert Brown
1010 4 18 29 None
1069 4 19 29 None
1128 4 20 29 None
1187 4 21 29 None
1246 4 22 29 None
1305 5 23 29 None
1364 5 24 29 None
1423 5 25 29 None
1482 5 26 29 None
1541 5 27 29 None
1600 5 28 29 None
1659 5 29 29 None
1718 5 30 29 None

You can improve on this by using a CASE statement that assigns the value of 0 when there’s no customer_name and 1 otherwise, or something like this.

But as I said, this isn’t going in the right direction. Does this help?

1 Like

Thanks, Bruno. This is exactly what I wanted.
Yes, you are correct. This is not going in the right direction. Can you, please, give me a hint on what type of tables should create to use with EXCEPT statement to get a correct answer?
Thanks!

For a solution that uses EXCEPT you can check the solution notebook. If you have something else in mind that uses EXCEPT, please create a minimal working example showcasing what you intend with small tables.

I would use a different strategy, to be honest. On a high level I would do something like:

  • Create a table where each row is an album and the columns are album_id and the number of tracks it has.
  • Create a table where each line is a combination of invoice and album. The columns can be invoice_id, album_id and number of tracks purchased in each album in each invoice.
  • Join the two tables above on album_id and check which rows are such that the “number of tracks” of each of them matches, those are your album purchases.
2 Likes

Thanks Bruno.
I came up with this solution. I am almost there but I don’t know why my solution is different from the DataQuest solution:

returns track count for each album
WITH albums AS (
SELECT album_id, COUNT(*) tracks
FROM track
GROUP BY album_id
),

returns track counts for each album in each invoice
tracks_purchased AS (
SELECT il.invoice_id, t.album_id, COUNT(*) tracks_p
FROM track t INNER JOIN invoice_line il ON t.track_id=il.track_id
GROUP BY 1,2
),

returns only full album purchases
full_album_p AS (
SELECT p.invoice_id
FROM tracks_purchased p
INNER JOIN albums a ON a.album_id=p.album_id where tracks=tracks_p AND tracks >= 2
)
returns the counts of album and non-album purchases
SELECT count(*) full_album_purchases,
(SELECT count(*) FROM tracks_purchased) - COUNT(*) individual_purchases
FROM full_album_p;

but the results are:

full_album_purchases: 111 individual_purchases: 2966

Even after you fix your solution, it’s going to end up being different because the proposed solution is using a different definition of what an album purchase is.

For example the following invoice ids are purchases that include full albums with only one track that the given solution ignores: {4, 517, 7, 11, 15, 527, 20, 27, 548, 37, 38, 39, 41, 565, 59, 577, 579, 67, 586, 78, 597, 87, 91, 94, 97, 101, 103, 106, 111, 121, 127, 134, 137, 164, 168, 176, 180, 189, 190, 195, 198, 210, 216, 218, 227, 242, 250, 256, 260, 265, 270, 272, 275, 285, 288, 306, 311, 313, 316, 324, 325, 338, 345, 360, 365, 368, 369, 384, 395, 402, 405, 415, 421, 423, 424, 428, 434, 437, 441, 445, 450, 452, 456, 462, 467, 486, 509}.

You can take a look at, for instance, invoice id 597. The track id 2093 is part of an album with only one track. To see Dataquest’s invoice classification, you can run the following query:

WITH invoice_first_track AS
    (
     SELECT
         il.invoice_id invoice_id,
         MIN(il.track_id) first_track_id
     FROM invoice_line il
     GROUP BY 1
    )

SELECT
    ifs.*,
    CASE
        WHEN
             (
              SELECT t.track_id FROM track t
              WHERE t.album_id = (
                                  SELECT t2.album_id FROM track t2
                                  WHERE t2.track_id = ifs.first_track_id
                                 ) 

              EXCEPT 

              SELECT il2.track_id FROM invoice_line il2
              WHERE il2.invoice_id = ifs.invoice_id
             ) IS NULL
         AND
             (
              SELECT il2.track_id FROM invoice_line il2
              WHERE il2.invoice_id = ifs.invoice_id

              EXCEPT 

              SELECT t.track_id FROM track t
              WHERE t.album_id = (
                                  SELECT t2.album_id FROM track t2
                                  WHERE t2.track_id = ifs.first_track_id
                                 ) 
             ) IS NULL
         THEN "yes"
         ELSE "no"
     END AS "album_purchase"
 FROM invoice_first_track ifs
;

Problems with your solution are:

  • Filtering for albums with more than one track. There are a ton of albums with only one track.

  • SELECT count(*) FROM tracks_purchased is a table comprised of tracks, basically. You’re supposed to count invoices, hence the huge number 2966, despite the fact that the invoice table only has 614 rows.

2 Likes

Hey @saidakbarp,

We have a solved feature that allows you the ability to mark something as the “correct” answer, which helps future students with the same question quickly find the solution they’re looking for.

Here’s an article on how to mark posts as solved - I don’t want to do this for you until I know that solution/explanation works.

Best,
Alvin.

2 Likes

@saidakbarp I don’t know if this is still relevant to you. But a few solutions to this were given here.

Hi @Bruno ,

I’m not sure why we have to find the invoice_first_track in order to find if the customer puchased an album or just tracks.

please enlighten me

Hey. @ngokieuphu

We don’t need to find invoice_first_track. That was used in Dataquest’s solution, but it’s not necessary. I even suggested a strategy that doesn’t use that table.

When i run my code my answer differs from the solution. Did i mess up something logically because I don’t see what I did wrong. My logic being that if there are any invoices with more than one distinct album_id than it is not an album and if the album has less than a few tracks than it does not count as an album. The solution from this code is 132 albums and 482 individual purchases.

WITH album_indiv AS
    (
    SELECT
        CASE
            WHEN COUNT(DISTINCT album_id)=1 and COUNT(il.track_id)>4 THEN "album"
            ELSE "individual"
        END purchase_type,
        t.album_id,
        t.name,
        il.invoice_id
    FROM invoice_line il
    INNER JOIN track t ON t.track_id=il.track_id
    GROUP BY invoice_id
    )

SELECT
    CASE
        WHEN purchase_type="album" THEN "yes"
        WHEN purchase_type="individual" THEN "no"
        END album_purchase,
    COUNT(*) total_type
    
FROM album_indiv ai
GROUP BY purchase_type

Hi Bruno. I tried using the method you mentioned and I get a different answer than the solution. I was wondering if you know what mistake I made. Thanks!

WITH 
    album_title AS
        (
        SELECT
            a.title,
            a.album_id,
            COUNT(t.name) total_tracks
        FROM album a
        INNER JOIN track t ON t.album_id=a.album_id
        GROUP BY a.title
        ),
    
    invoice_table AS
        (
        SELECT
            invoice_id,
            album_id,
            COUNT(il.track_id) tracks_purchased
        FROM invoice_line il
        INNER JOIN track t ON t.track_id=il.track_id
        GROUP BY il.invoice_id
        )
SELECT 
    CASE 
        WHEN total_tracks=tracks_purchased THEN "YES"
        ELSE "NO"
    END album_purchase,
    COUNT(*) total_type,
    CAST(COUNT(*) as float)/CAST(MAX(invoice_id) as float) percentage
FROM album_title at
INNER JOIN invoice_table it ON it.album_id=at.album_id
GROUP BY album_purchase

Hey, Ja.

What I mentioned was in another answer, so please ask it there (or ask in a separate question with the appropriate links to the sources).

In any case, in that answer you mention, I provided a file which classifies invoices as to whether they are album purchases or not. You can use that to investigate the differences and try to figure it out yourself — it’s a good exercise.

Hi @Bruno,

I am also having trouble with this exercise. I want to understand Dataquest’s solution since it shows something I have not seen in past exercises.

“…SELECT t2.album_id FROM track t2
WHERE t2.track_id = ifs.first_track_id…”

In past execises, I remember seeing a number or string in the WHERE clause, for instance:
WHERE t2.track_id = 2
or a subquery that results in a number or string. In this exercise, we are comparing two columns in the WHERE clause, similarly to when we use the JOIN ON clause.

I do not understand how WHERE in this situation works, specially considering that the ifs table was not joined in this subquery.

Can you help me to understand this, please?

I do not want to advance with an alternate solution because I fear I might need to use this WHERE condition again in future exercises.

Thank you very much for your help :slight_smile:

Hey, Rodrigo.

Since this is an entirely different question, I won’t go into too much detail. Feel free to ask your questions in a new post if this doesn’t suffice.

This is great feedback, thanks. @rodrigoc.lomonaco

A clause like WHERE a_column = another_column is filtering for rows in which the values of the columns match.

For instance, say you have a table called a_table that looks like this:

a_column another_column yet_another_column
1 1 “Whateva!”
2 5 “I do what I want.”
3 3 “— Eric Cartman”

The query SELECT * FROM a_table WHERE a_column = another_column; yields

a_column another_column yet_another_column
1 1 “Whateva!”
3 3 “— Eric Cartman”
1 Like

Hi @Bruno, thank you for your reply. It was very enlightening.

When you meant that I should ask my questions in a new post, I am not sure if you meant I should create a new topic or if I should not reply the post I have previously replied. If you meant the former, I am sorry :confused:

I still have one question. In the same subquery:

a column of “ifs” table is used but “ifs” table is not in a FROM clause nor in a JOIN clause. I thought that if you are going to use a column from a table, this table had to be in a FROM clause or in a JOIN clause. Was that assumption incorrect?

Thanks again and sorry if I did a mistake.

Sorry, I used the wrong term, I meant a new topic. The one we’re using isn’t meant to encompass all questions regarding 191-7. The 191-7 tag only serves to inform what the topic pertains to.

It is:

image

1 Like

I think you are asking about “within the same SELECT statement, how did a new table appear?”

This was the concept where i remember the lessons missed totally.
Here is a clear example.

2 Likes