My code without IS NULL and EXCEPT from last step

Hey!

Just want to share my code from last step of Guided Project: Answering Business Questions using SQL.

I did it without IS NULL and EXCEPT.

First I checked if the number of album titles equal 1.
If not, this means that its an individual purchase.

If yes, then I checked if the number of tracks in each invoice equal to number of tracks in the album that purchased.

If not, this means that its an individual purchase.

If yes, the whole album is purchased.

%%sql

WITH number_of_tracks_in_album AS
    (
    SELECT a.title,
           COUNT(t.track_id) AS number_of_tracks_in_album
    FROM album a
    INNER JOIN track t on t.album_id = a.album_id
    GROUP BY 1
    ),

table_1 AS
    (
    SELECT il.invoice_id,
       COUNT(DISTINCT(a.title)) AS number_of_albums_in_invoice,
       COUNT(il.track_id) AS number_of_tracks_in_invoice,
       a.title
    FROM invoice_line il
    INNER JOIN track ON track.track_id = il.track_id
    INNER JOIN album a ON a.album_id = track.album_id
    GROUP BY 1
    ),
    
table_2 AS
    (
    SELECT        
        table_1.*,
        notia.number_of_tracks_in_album,
        CASE 
            WHEN (table_1.number_of_albums_in_invoice = 1 AND notia.number_of_tracks_in_album  = table_1.number_of_tracks_in_invoice)
            THEN 'whole_album_purshase'
            ELSE 'individuals_track_purshase'
            END AS 'result'
            FROM table_1
    INNER JOIN number_of_tracks_in_album notia ON notia.title = table_1.title
    )
    
SELECT 
    result,
    COUNT(invoice_id) AS number_of_invoices,
    COUNT(invoice_id) / (SELECT CAST(COUNT(invoice_id) AS FLOAT) FROM table_2) AS percent
FROM table_2
GROUP BY result;
    result          number_of_invoices	percent
individuals_track_purshase	500	       0.8143322475570033
whole_album_purshase	    114  	   0.18566775244299674

Had to think a couple of days to get simple solution :slight_smile:

3 Likes

Hi @7933509

Very interesting solution. although i’m very uncomfortable with your SELECT a.title (4th column) in table_1, but this solution does solve the problem.

I have some issues with this solution.

a.title in table_1 is not a GROUP BY column and is not aggregated. sqlite may accept this and return you the 1st appearing string in each group but stricter engines like Postgres/MSSQL,Mysql(with flag) will throw an error: ERROR: column “a.title” must appear in the GROUP BY clause or be used in an aggregate function
You can read about this issue, and functional dependency here: https://medium.com/@riccardoodone/the-love-hate-relationship-between-select-and-group-by-in-sql-4957b2a70229

The issue is if an invoice_id contains tracks from more than 1 album, and you SELECT a.title like that, the engine has to randomly select 1 title to show. It does not matter in this question + your implementation because your CASE classifier never depended (because of your logic short-circuiting) on this album title when table_1.number_of_albums_in_invoice > 1, your CASE classifier only depended on this album title only when table_1.number_of_albums_in_invoice = 1. This album title was used to join in notia to find # tracks in album to use in your WHEN condition.

In other SQL engines, you could have fixed this by wrapping any random aggregation like MAX/MIN, it will not affect the result.

Anyway, great idea to use statistics/heuristics to build a classifier, you have some feature engineering talent for machine learning.

I took awhile to understand why you have notia.number_of_tracks_in_album = table_1.number_of_tracks_in_invoice) , so it is because you want to ensure if the album has 10 tracks, the invoice has not less than 10. So there is an assumption that nobody buys 2x same album (which gives table_1.number_of_albums_in_invoice = 1), leading to 10 < invoice track count of 20, how would you handle this?

2 Likes

Thanks a lot for the feedback!

When an invoice_id contains tracks from more than 1 album I dont care about which a.title is selected because on this step I can assume that the purchase is made as individual tracks.

In other SQL engines, you could have fixed this by wrapping any random aggregation like MAX/MIN, it will not affect the result.

I tried MIN(a.title) and it works. Thanks!

I think that nobody purchases one album 2 times within one invoice. It doesn’t make sense. But any way if someone does this my code doesn’t reckon for this.

Maybe someone bought extra copies as a gift/on request for another, like how budding authors buy their own books to give away for free for marketing.

Also, there can be data loading scripts that get triggered 2/more times for 1 real life action due to bad programming logic, leading to duplicate information. For example, when data loading depends on time from time.time(), and the coder thought time should be the same for every item looped in a collection, but later found out he wrote the time.time() inside the loop instead of right before entering, leading to every loop iteration having a different time and seen as distinct (if coded to load once per new timestamp), and so getting loaded every iteration.

1 Like

Thank you.

Its not easy to me imagine that in practicing due to my lack of experience. I would appreciate if you give a link i can dive deeply.

There is no link. Idea is this

last_loaded_time = 0 # random init

for i in collection:
    process(i)  # imagine this is some random check like fillna
last_loaded_time = time.time()

vs

last_loaded_time = 0 # random init

for i in collection:
    process(i)  # imagine this is some random check like fillna
    last_loaded_time = time.time()

# somewhere else in another file with code that gets event-triggered by last_loaded_time value changing
load_data()  # could be loading the collection processed in other file, or something else