CYBER WEEK - EXTRA SAVINGS EVENT
TRY A FREE LESSON

Albums vs Individual Tracks: my `EXCEPT` statement brought me the unexpected output

Hello,

I’m finishing my 6. Albums vs Individual Tracks. But I encounter some logical errors in my code. The problem is, my EXCEPT statement gave me an unexpected output.

I created two columns that should be compared :
-album_track_count
-invoice_track_count

In theory, when both have the same value in a row, that means we have an “album” string in the purchase column. But in the very first row, the output is wrong: “tracks”

My code:

%%sql
DROP VIEW IF EXISTS invoice_album_track_count;
CREATE VIEW invoice_album_track_count AS
                                        SELECT 
                                               il.invoice_id,
                                               t.album_id AS invoice_album_id,
                                               COUNT(il.track_id) AS invoice_track_count
                                          FROM track AS t
                                         INNER JOIN invoice_line AS il ON il.track_id = t.track_id
                                         GROUP BY 1, 2;

DROP VIEW IF EXISTS all_tracks_from_album;
CREATE VIEW all_tracks_from_album AS
                                    SELECT 
                                           album_id,
                                           COUNT(track_id) AS album_track_count
                                      FROM track AS t
                                     GROUP BY 1;
                    
DROP VIEW IF EXISTS compare_purchase_to_album;
CREATE VIEW compare_purchase_to_album AS
                                        SELECT *
                                          FROM all_tracks_from_album AS atfa
                                         INNER JOIN invoice_album_track_count AS iatc ON iatc.invoice_album_id = atfa.album_id
                                         ORDER BY invoice_id;

SELECT cpta.*,
        CASE                            
             WHEN (
                    SELECT cpta.album_track_count 
                      FROM compare_purchase_to_album AS cpta
                 
                    EXCEPT
                 
                    SELECT cpta.invoice_track_count 
                      FROM compare_purchase_to_album AS cpta
                  ) IS NULL
        
                  AND
            
                  (
                    SELECT cpta.invoice_track_count 
                      FROM compare_purchase_to_album AS cpta
                      
                    EXCEPT
                      
                    SELECT cpta.album_track_count 
                      FROM compare_purchase_to_album AS cpta
                  ) IS NULL
            THEN "albums"
            ELSE "tracks"
            END AS purchase
  FROM compare_purchase_to_album AS cpta
              

My output:

Why use that whole convoluted A except B IS NULL AND B EXCEPT A IS NULL just to check if 2 columns have equal values? (How do I compare two columns for equality in SQL Server? - Stack Overflow).

You may not be familiar enough with how except works, go to https://sqliteonline.com/, connect to SQL Server (Not the default sqlite because parenthesis don’t work) and explore

(SELECT 1 UNION SELECT 2)
EXCEPT 
(SELECT 2 UNION SELECT 0)

swapping 0,1,2 or any other number.
My point is EXCEPT completely does not do what you are trying to do. Except is set-based operation. There is no concept of row alignment. It compares 2 sets of rows from 2 SELECT statements, with rows within each SELECT in any order. You are not trying to compare across rows, but across columns instead. Your logic is to use the count integer to decide if a purchase is an album purchase. If using EXCEPT which compares across rows, it makes more sense to SELECT 1 set of unaggregated tracks (not their counts, but their individual track_id), then EXCEPT another set of tracks. EXCEPTing aggregated integers are weird.

What do you do when a query doesn’t work? You can break it down, go back step by step and observe the individual behaviours.

Now your outermost if-else logic is failing, so you can check the output of A EXCEPT B, then check the output of B EXCEPT A, check what they contain and why they contain that, then you will understand why you get all 'tracks' under purchase.

Besides this, the rest of the problem solving strategy is right. Even if you get purchase column right, you still need to think about how to interpret if an invoice_id contains tracks from multiple album_id. For example 11 tracks are bought and 10 of them belong to an album while last 1 doesn’t, do you call this an album purchase? If someone is strict, he may require the purchase to be just albums and no single tracks, meaning if more than 1 album, the leftover tracks must also make up an album, else not count as album purchase. To fulfil the latter stricter logic, you would then have to groupby invoice_id and check if all entries in purchase for a particular invoice_id group contains 'albums'. To fulfil to former lax definition, there must still be groupby invoice_id, but the check is using “any” instead of “all” logic.

The whole previous paragraph is just something to discuss in interviews during question clarification before you even start working on the problem. For the purpose of this guided project i believe you can ignore it entirely because the dataset is simple and none of the 2 cases i said will come up. If I remember right, if it’s an album purchase, it will be a single album with no extra singles, and nobody purchased 2 albums in 1 invoice, would appreciate if you could confirm this.

2 Likes

Hi,
thank you for this broad reply!

Yes, I misused EXCEPT

This part of the correct code should look like the last block below:

%%sql
DROP VIEW IF EXISTS invoice_album_track_count;
CREATE VIEW invoice_album_track_count AS
                                        SELECT 
                                               il.invoice_id,
                                               t.album_id AS invoice_album_id,
                                               COUNT(il.track_id) AS invoice_track_count
                                          FROM track AS t
                                         INNER JOIN invoice_line AS il ON il.track_id = t.track_id
                                         GROUP BY 1, 2;

DROP VIEW IF EXISTS all_tracks_from_album;
CREATE VIEW all_tracks_from_album AS
                                        SELECT 
                                               album_id,
                                               COUNT(track_id) AS album_track_count
                                          FROM track AS t
                                         GROUP BY 1;
                    
DROP VIEW IF EXISTS compare_purchase_to_album;
CREATE VIEW compare_purchase_to_album AS
                                        SELECT *
                                          FROM all_tracks_from_album AS atfa
                                         INNER JOIN invoice_album_track_count AS iatc ON iatc.invoice_album_id = atfa.album_id
                                         ORDER BY invoice_id;

DROP VIEW IF EXISTS final_compare_db;
CREATE VIEW final_compare_db AS 
                                        SELECT cpta.*,
                                               CASE                            
                                                    WHEN cpta.album_track_count = cpta.invoice_track_count 
                                                    THEN "albums"
                                                    ELSE "tracks"
                                                    END AS purchase
                                          FROM compare_purchase_to_album AS cpta;

BUT, later on, there is another problem which I failed to solve and you predicted it:

yes… there was a case when someone bought 2 albums on 1 invoice and because of that the output is incorrect:

image

Proper output is:

image

So I failed to use this approach. The full code that generates incorrect output is below:

%%sql
DROP VIEW IF EXISTS invoice_album_track_count;
CREATE VIEW invoice_album_track_count AS
                                        SELECT 
                                               il.invoice_id,
                                               t.album_id AS invoice_album_id,
                                               COUNT(il.track_id) AS invoice_track_count
                                          FROM track AS t
                                         INNER JOIN invoice_line AS il ON il.track_id = t.track_id
                                         GROUP BY 1, 2;

DROP VIEW IF EXISTS all_tracks_from_album;
CREATE VIEW all_tracks_from_album AS
                                        SELECT 
                                               album_id,
                                               COUNT(track_id) AS album_track_count
                                          FROM track AS t
                                         GROUP BY 1;
                    
DROP VIEW IF EXISTS compare_purchase_to_album;
CREATE VIEW compare_purchase_to_album AS
                                        SELECT *
                                          FROM all_tracks_from_album AS atfa
                                         INNER JOIN invoice_album_track_count AS iatc ON iatc.invoice_album_id = atfa.album_id
                                         ORDER BY invoice_id;

DROP VIEW IF EXISTS final_compare_db;
CREATE VIEW final_compare_db AS 
                                        SELECT cpta.*,
                                               CASE                            
                                                    WHEN cpta.album_track_count = cpta.invoice_track_count 
                                                    THEN "albums"
                                                    ELSE "tracks"
                                                    END AS purchase
                                          FROM compare_purchase_to_album AS cpta;

SELECT purchase,
       COUNT(DISTINCT(invoice_id)) AS num_invoices,
       ROUND(COUNT(DISTINCT(invoice_id)) * 100.0 /(SELECT COUNT(*)
                                  FROM invoice), 2) AS pct
  FROM final_compare_db
 GROUP BY 1
 ORDER BY 2 DESC;

I don’t know how to solve it without writing many more VIEWs . I have just rebuilt the proper query using EXCEPT. I’m a little curious about how to solve it without EXCEPT, but on the other hand, even if I succeed, the code will be very long and will be processing itself much longer, so it’s pointless. Anyway, it teaches me a lot.

All these incorrectly performed tasks made by me, taught me, that it is necessary to check in the side queries: whether the quantities (in this case, invoices) agree with the actual state. It’s another reason that it was worth trying this - incorrect - way.

Thank you for your effort. I appreciate that !!! :smiley:

1 Like

Why not write more? There’s nothing bad with doing something with more code. It usually ends up with modular results that are more readable and debuggable.

Let’s drive on from your final_compare_db result set.
image

This table shows a few possible scenarios.

  1. invoice_id with only tracks in COUNT (can be any integer)
  2. invoice_id with only albums in COUNT (i checked there is only {1})
  3. invoice_id with mix of albums and tracks, each of which can be any integer (eg. invoice_id=4)

From this table we dive deeper.

1. Pure album invoices (nevermind how many albums, as long as no tracks appear in group)
image
HAVING MAX(CASE is a trick to implement the ALL concept, by reversing the logic to check that if any track appeared, it will make the whole group sum to 1 (any integer here is fine, just a non-zero flag)

The solution identified this case, then did 614-114 to get 500, which includes mix of album+track

2. Pure track invoices
image

3. Mix (can be lazy by doing 614-114-414 = 86, but we want to practice SQL)
Using CASE here and checking if it sums to 2 (or 2x any integer you define inside THEN) works too but is just unnecessary.

image

Generally, I still prefer comparing individual track_id instead of aggregated counts to identify album purchases. If someone bought 2x same single in an invoice, it will mess up the count, but comparing unaggregated track_id will still work.

By incorrect, I assume you mean your albums row gave 200 while answer gave 114, and that your percentages do not sum to 100. If so, your answer is not wrong because of that quoted statement above, it’s because you didn’t count invoices of mixed purchases properly. Someone does not have to buy 2 albums to cause your error, it just needs to mix at least 1 album with at least 1 track. Anyway, i prefer you practice querying for each group explicitly then counting instead of doing the lazy way of count A, then B = 1-A kind of thing. (although that’s exactly what CASE does but hope you get my point).

Incorrect way to check purchases of more than 1 album, not identifying whether it’s a mix or pure album invoice:image

Confirming that they are really mixed (by observing that every of these multi-album Invoice_id has 2 rows in result)

Proof that for pure album purchases, they are all only 1 album

If multi-pure-album, any number > 1 should appear.

To expand your learning, look at this student’s answer who also didn’t do EXCEPT but use aggregations like you: My code without IS NULL and EXCEPT from last step
Limitation of that answer is it can only identify album purchases when there is a pure album with no mix of extra singles, plus there can only be 1 album, which coincidentally the dataset is. I wouldn’t recommend that solution in practice because of its limitations. The solution here is more general, though still not ideal (checking with aggregates instead of track_id).

Unrelated note: Please do not use numbers in GROUP BY 1,2 but list out the column names explicitly. Numbers are unreadable and if you add a new column to beginning of SELECT, numbers start refering to different things and everything breaks.

1 Like