191-7 Sql business analysis project

I have been stuck to this step for quite few days.
i have read the suggested solution and havent quite catch the logic behind it.
so after days of thinking i came with my solution that outputs a result quite similar with the suggested one.
could you pls give it a eye to see ιf my logic is right in answering the question?
thank you very much

For convenience, find the code in text format by expanding here
query = '''

with 
    
    a as 
    (
    select invoice_id, track_id
    from invoice_line
    ),
    
    b as 
    (
    select track_id, album_id
    from track
    ),
    
    c as 
    (
    select invoice_id, min(track_id) track_example
    from a
    group by invoice_id
    ),
    
    d as 
    (
    select invoice_id, track_id,album_id
    from c
    inner join b
    on c.track_example = b.track_id
    )
    
  
    
select *
from d
'''

run_query(query)

Hey, Kostasmandilass.

In the mean time, please use the search function to search for 191-7. There has been another question about this exercise in the past and that can be helpful to you.

At any rate, I did read your code (but I only ran it in my head and the database can be corrupted there), but I think what you’re doing is wrong and you only obtain similar results by coincidence.

You’re basically using the ratio count(distinct album_id) / count(*) to classify whether or not an invoice should be an album purchase, and the criterion is that the number of albums has to be smaller than the number of tracks by a factor of 10.

I just don’t understand the logic. Let’s say we have an invoice in which two tracks from a single album were purchased. Something like the table below (I just made up the data):

| invoice_id | track_id | album_id |
|------------|----------|----------|
| 1000000    | 9998     | 7331     |
| 1000000    | 9999     | 7331     |

In this case, we’d get:

  • count(distinct album_id) equal to 1 (there is only one album);
  • count(*) equal to 2, as there are two rows;
  • And consequently cast(count(distinct album_id) as float) / cast(count(*) as float) would be 0.5 which is not smaller than 0.1.
  • To conclude, this invoice would not be tagged as an album purchase.

Now note this: you know nothing about this album other than it has these two tracks. It could very well be the case that this album only has two tracks, in which case this would be an album purchase.

at first i had written only this :

count(distinct album_id) = 1

but i thought about the edge cases that the mission refers
so for example if an invoice has 17 songs and the 16 of them are of the same album, then according to my previous code it wouldnt count it as an album purchase while it actually is.
so i created a ratio that compares the number of albums to the number of songs contained in an invoice.

But as i can see, you are right about when an invoice contains few songs the logic isnt right

In the question I alluded to above, I suggested the following strategy:

Alternatively, feel free to continue to pursue your idea and share your thoughts, or take a look at the solution notebook and try to make sense of it.

Hi everyone, I’m jumping in the zone here - I wanted to give a shot at a python-based solution rather than a SQL one for this problem.
The approach I took was:

  • Define a dictionary of all albums and tracks
#build a mapping dictionary with the following structure:
# {albumId: [tracks]}
album_tracks_map = dict()
for i in albums['album_id'].unique().tolist():   #generate list of unique albumIds and iterate through them
    bool_filter = albums['album_id'] == i        #for each albumId generate a boolean filter evaluating each row (song)
    alb_slice = albums[bool_filter]             #generate a DF containing list of songs for each album
    album_tracks_map[i] = alb_slice['track_id'].tolist()      #add to dictionary list of tracks as VALUES per each KEY (albumId)
    
album_tracks_map
  • As a second step, starting from the full list of purchases, I built a nested dictionary with the following structure:
    {invoiceid: {albumId: [track_id1, track_id2 … track_idN]}
    This allowed me to understand for each invoice which albums were purchased and which tracks for each of them.
master_dict={}
unique_invoices = df["invoice_id"].unique().tolist()

#build a dict of dicts with the following structure:
# {invoiceId: {albumId:[trackid1, trackid_n]  }  }

for i in unique_invoices:
    album_dict = {}                  #for each unique invoice generate an empty album dictionary... 
    master_dict[i] = album_dict      #...and push it into the master dict, creating a nested structure
    ref_df = df[df["invoice_id"] == i]                #generate subset of data for each invoice
    album_list = ref_df["album_id"].unique().tolist()     #select unique albums purchased per invoice

    for alb_id in album_list:                     #for each album puchased (in each invoice - nested loop)
        sub_ref = ref_df[ref_df["album_id"] == alb_id] #generate subset of data
        track_list = sub_ref["track_id"].tolist()  #identify tracks per album and reduce to list
        album_dict[alb_id] = track_list           #access nested dictionary. Key:value --> Album_id: Track_list

master_dict

As a last step, i pulled the two dictionaries together, running a check for each invoice, if the list of tracks per album matched the mapping dictionary. When the two lists were equal, I counted a True.

full_albums_purchased = []

for invoice, album in master_dict.items():
    #album_purchase = master_dict[invoice]  -->   values, that is, album
    for album_id, track_id in album.items():      #access album - the nested dictionary containing purchased albums and tracks
        purchased_tracks = track_id               #pick the track_ids for each purchased album
        comparison = album_tracks_map[album_id]   #access the list of tracks in each album full version (from the map dict)
        if purchased_tracks == comparison:
            full_albums_purchased.append(True)
        else:
            full_albums_purchased.append(False)

pd.Series(full_albums_purchased).value_counts()

This will indeed count albums with a single tracks, but still I wanted to take a python approach.
@Bruno would you mind giving me your opinion on the approach above?

As a beginner I am really curious to get feedback on my choices.
Thanks,
Niccolò

2 Likes

Hey, Niccolò. Nice effort!

I didn’t run your code because I’m missing context for it. For instance, what is albums? I thought it would be the dataframe returned by run_query("SELECT * FROM album;"), but it’s not (for instance, this table doesn’t have a track_id column). Can you provide a version of your code that is self-contained (apart from input files)?

Due to the above, I just focused on the intention of your code cells (using a mix of comment and code reading — and not 100% of the code). It mostly looks good, I would just be wary of a couple of things:

  • This isn’t necessarily wrong, but I don’t know what’s going on there, you’ll have to check.

    In order to compare the contents of lists for membership, you will be comparing the lists themselves. But let’s not forget that [0, 1] is not the same as [1, 0], the order matters. When you use alb_slice['track_id'].tolist(), you need to make sure that you understand what’s going on there with respect to the order and tweak your code appropriately.

    • It’s probably better to use sets for the tracks instead.
  • It looks silly to use the following assignment:

    for album_id, track_id in album.items():
            purchased_tracks = track_id 
    

    Why not just name it purchased_tracks in the for line? I don’t like the idea of naming it track_id to start with, it’s not a track ID, it’s a list of ids of tracks.


Edit: Feel free to compare your results with the ones I shared here in this reply.

Here’s a query that solves this exercise.

N.B.: My definition of what an album purchase is is different from the one used in the learn section. The definition of album purchase I’m using is: an invoice that contains at least a full album.

This means that invoices that contain ten individual tracks from an album with ten tracks, and then some more tracks from some other album, are album purchases.

Anyway, here's the query: (expand to see it)
/* Table that counts the number of tracks each album has */      
  WITH tr_per_al AS (
SELECT album_id, COUNT(DISTINCT track_id) AS nr_of_tracks
  FROM track
 GROUP BY album_id
 ),

/* Table that associates to each combination of invoice and track
the album to which the track belongs */
       inv_tr_al AS (
SELECT il.invoice_id,
       il.track_id,
       tr.album_id
  FROM invoice_line as il
  LEFT JOIN track AS tr
       ON il.track_id = tr.track_id
       ),

/* Table that associates to each combination of invoice and album
the number of purchased tracks from that album */
       inv_al AS (
SELECT invoice_id, album_id, COUNT(DISTINCT track_id) AS nr_purchased_tracks
  FROM inv_tr_al
 GROUP BY invoice_id, album_id
 ),

/* For each combination of invoice and album,
this table classifies whether the full album was purchased */
       class_inv_al AS (
SELECT ia.*, ta.nr_of_tracks,
       (
       CASE
         WHEN ia.nr_purchased_tracks = nr_of_tracks THEN 1
         WHEN ia.nr_purchased_tracks <> nr_of_tracks THEN 0
         ELSE -9999 /* Never happens. The point is to capture errors */
       END
       ) AS full_album
  FROM inv_al AS ia
 INNER JOIN tr_per_al AS ta
       ON ia.album_id = ta.album_id
       ),

/* Table that classifies each invoice
with respect to the feature 'It contains a full album' */
       inv_class AS (
SELECT invoice_id, MAX(full_album) AS album_purchase
  FROM class_inv_al
 GROUP BY invoice_id
 )
     
SELECT album_purchase,
       COUNT(*) as nr_invoices,
       ROUND(
         100*CAST(COUNT(*) as FLOAT)/(SELECT COUNT(*) FROM inv_class),
         2) AS pct_of_invoices
  FROM inv_class
 GROUP BY album_purchase
;

And here is the resulting dataframe:

album_purchase nr_invoices pct_of_invoices
0 0 413 67.26
1 1 201 32.74

The CTE inv_class in the query, is the table with the information that we really want, what follows is just summarizing this data. I’m attaching it as a CSV file in case anyone wants to compare their invoice classification with this one.

invoice_classification.csv (3.5 KB)

2 Likes

Hi @Bruno, thanks for your answer and the comments.
You are right, without explicitly mentioning albums, it was hard for you to get a grasp on the content.

Albums -> Basically a df containing all album_ids and related track_ids
#retrieve full list of tracks per album
q = '''
SELECT
    al.album_id,
    al.title,
    t.track_id,
    t.name
FROM album al 
    INNER JOIN track t on t.album_id = al.album_id
ORDER BY 1,3
'''
albums = run_query(q)
run_query(q)
  • It was a VERY good point raiing the fact that [0,1] is different from [1,0]. This should be mitigated by the way I ordered data on the dataframe, but happened totally by chance.
    If I got it right, by applying set(my_list) I make the list object capable of being compared to any other list, not considering the order of items within it?

  • And also I agree it was silly iterating through an object and then changing the variable name withing the loop. Also, the variable name doesn’t make much sense, being track_id a list of of track ids.

Overall I think that going through the dictionary approach was not as easy as I would have hoped in my mind, both in terms of building it up, and also because it made more difficult any extra analysis (e.g. considerations of revenue generated by full albums vs non-full albums).

Do you think the approach was unnecessarily complicated or it is just simpler to go through SQL?

In light of the above the (probably) easiest solution was this one on my end - it doesn’t match yours in the way you have chosen a more inclusive approach as to what should be considered as an album purchase:

Solution to 191-7
sql='''
/* Build a table containing all album ids and the related amount of tracks */

WITH albums_and_tracks AS
    (SELECT
        a.album_id,
        COUNT(DISTINCT t.track_id) as album_tracks
        FROM album a
            JOIN track t ON a.album_id = t.album_id
        GROUP BY 1
        ),


/* Second table built up on invoices and containing albums and tracks purchased, as well as generated revenue*/

    invoice_and_albums AS
    (SELECT
        il.invoice_id,
        t.album_id,
        COUNT(DISTINCT il.track_id) as purchased_tracks,
        SUM(il.unit_price) as revenue
        FROM invoice_line il
            JOIN track t ON il.track_id = t.track_id
        GROUP BY 1,2
        
    )

/* Put subqueries together: the result is a dataframe that addresses each row as an album part of an invoice.
This allows to infer whether the album was a full purchase or not, and understand its %weight in generating revenue */

        SELECT 
        iaa.invoice_id,
        iaa.album_id,
        iaa.purchased_tracks,
        iaa.revenue,
        aat.album_tracks,
        
        CASE WHEN iaa.purchased_tracks = aat.album_tracks AND aat.album_tracks>1 THEN 1 ELSE 0 END as full_album_purchased
            FROM invoice_and_albums iaa
            INNER JOIN albums_and_tracks aat ON aat.album_id = iaa.album_id
        ORDER BY 1 ASC
    '''

answer = run_query(sql)
answer
1 Like

You’re welcome :slight_smile:

It’s not just albums that’s missing, though. You’re also using this df object. That’s all I caught that was missing.

If with “any other list” you mean “any other set” then yes. Sets in Python mimic sets in mathematics (non-ordered collections of distinct objects).

I don’t think it is unnecessarily complicated (although, to be honest, I didn’t think how to make it simpler using Python, so maybe there’s room for improvement there).

I actually find SQL to be less intuitive here because by its nature it doesn’t allow you to easily code some concepts directly and we end up using proxies (for instance both me and you ended up using the number of tracks to model the concept of “purchase is a full album”.

This looks good. As you mentioned, different definitions, different results. From reading the code it looks good. You’re not using the same definition in your Python solution though.

Good lord you’re right the query generating df was missing.
I’ve published the yet not complete project on my github repo, if you wish to take a look.

Why do you say that the two definitions mismatch between the SQL and Python solutions?
I find that the two solutions deliver almost identical results (a single record mismatches…).

If I remove the clauses that filter out albums with a single row, they actually match perfectly, so I guess something stands in there. Might be overkilling this though, more value in jumping on another project I guess :slight_smile:

Because I don’t see in your code above anything that makes it filter out one-track albums. But I see you included a condition in the repository, hence the mismatch :slight_smile:


Edit: By the way, instead of appending True and False, I’d consider appending invoice_id when it’s a full album purchase and do nothing otherwise. What you did makes it not immediate to identify what invoices are full album purchases.

I’m laughing my ass off, my query is actually wrong because there’s a customer that bought the same one-track album twice in the same invoice (invoice_id 456), hahahaha. This is such a real life example, stuff like this really happens.

You shielded yourself from stuff like this by using COUNT(DISTINCT t.track_id) instead of COUNT(*). Nice job! I have the habit of doing the same. Assume nothing about the correctness of the data.

I’ll be updating my query above shortly. Well done :slight_smile:

Ahhh nice catch Bruno!
I didn’t think at all about the option for someone to buy the same album twice in the same invoice - If I had followed your solution in appending invoice_id , I would have probably spotted it. :slight_smile:

Thanks a lot for the support on the project, loved the debate :smiley:

2 Likes

Hey Bruno. I actually approached the problem as you did by first creating a table of album_id and number of tracks.

Then I joined it to a table of invoice_id, album_id, tracks purchase (grouped by invoice_id and then album id), and assigned a 1 if the num tracks purchased of each album equaled the number of tracks in that album.

Then, since I have multiple 0 or 1 values for each invoice_id (because someone might purchase an entire album worth of tracks from once album, but only 1 from another), I created a new table where I grouped by invoice_id again, but returned the max value in my column containing the 0/1.

Finally I summed up the 0’s and 1’s. I got 111 album sales (like that other person did) and 503 non-album sales.

All the logic in my statements seems to work ok. I accounted in my first table to only include albums that have over 2 songs.

The github link here located in run line 62.

Was wondering if you could see any flaws in it.

Looks good! Thank you for sharing.