Why is the Solution Ignoring 'Quantity' in invoice_line when Calculating Sold Tracks per Genre per Country?

Hi! This is my first post.

I’m working on the Guided Project ‘Answering Business Questions using SQL’ and just completed the exercise from screen 3 (finding out which music genres are sold the most in the USA).

Screen Link:
https://app.dataquest.io/m/191/guided-project%3A-answering-business-questions-using-sql/3/selecting-albums-to-purchase

Looking at the solution for this guided project, the solution did not have into account ‘quantity’ from the invoice_line table, which I did. Am I wrong in my reasoning? As I understood it, the fact that there is a ‘quantity’ field in invoice_line means that a customer can buy multiple ‘copies’ of a track, therefore a ‘quantity’ field needs to exist.

Then, my reasoning was: to calculate how much a genre sells, we need to sum each track quantity for each invoice_line, and then sum the quantities of all tracks for each genre. But the solution for this guided project merely counts Invoice Line IDs for each genre to know how ‘sold’ each genre is.

Comparing my results and the results from the solution, and considering what we know from music genres popularity, it makes more sense to me that for example Pop is more popular than Blues, which is not reflected in the solution’s results.

Any feedback on this?

My Code:

# View 1: Genre Name for Each Track
CREATE VIEW track_id_genre AS
    SELECT 
        tr.track_id AS track_id,
        gr.name AS genre
    FROM
        track AS tr
    INNER JOIN
        genre AS gr ON gr.genre_id = tr.genre_id;

# View 2: Quantity, Genre and Country for Each Track
CREATE VIEW tqgc AS
    SELECT 
        il.track_id AS trackid,
        SUM(il.quantity) AS sum_qua,
        tig.genre AS genre,
        il.invoice_id AS invoiceid,
        cu.country AS country
    FROM 
        invoice_line AS il
    INNER JOIN
        track_id_genre AS tig ON tig.track_id = il.track_id
    LEFT JOIN
        invoice AS i ON i.invoice_id = il.invoice_id
    INNER JOIN
        customer AS cu ON cu.customer_id = i.customer_id
    GROUP BY 
        trackid;

# Query: Quantity of Sold Tracks per Genre in the USA¶
SELECT
    country,
    genre,
    SUM(sum_qua) AS tracks_total
FROM tqgc
WHERE country = "USA"
GROUP BY country, genre
ORDER BY 3 DESC;

My Results:

country genre tracks_total
USA Rock 588
USA Metal 175
USA Alternative & Punk 133
USA Pop 49
USA R&B/Soul 49
USA Alternative 43
USA Hip Hop/Rap 20
USA Jazz 20
USA Easy Listening 19
USA Latin 17

Results in Solution:

genre tracks_sold
Rock 561
Alternative & Punk 130
Metal 124
R&B/Soul 53
Blues 36
Alternative 35
Latin 22
Pop 22
Hip Hop/Rap 20
Jazz 14
1 Like

HI @fredericoferreira

In my understanding invoice line is created for each track sold within an invoice.
And if you try to order it be quantity asc and descending(one by one)

SELECT * FROM invoice_line
order by quantity asc   (/desc)

you will find all the quantities at the top and bottom are 1 which means they only have one value which is 1.

So invoice line id is basically equals to each track sold. Probably that is why solution used invoiceline id

Also, I believe the data is a fictitious dataset. So might not follow what we know. That is what I think. Though I’m not sure if I did answer your question satisfactorily.

Hey @jithins123,

Thanks for your answer!

Indeed, I should have inspected ‘quantity’ values and come to that conclusion :smile:

This means that the difference in results between my work and the solution comes from something else. I will take a look at it again.

Cheers!

1 Like

It could be the joins. Also have a look at the customers table. Do you think it will be needed for this given question in focus?

Very interesting results which on a brief look makes you think the solution undercounted.

If your hypothesis is correct, that quantity should be counted rather than invoice_line_id, that when we look at an invoice_line item, it’s quantity can show 1 or more than 1, and you think the solution assumed quantity to be always 1, then we should expect your answer to have a >= number of tracks_sold in every genre than the solution.

Now, look at the R&B/Soul category. Solution is 53, your answer is 49. How can quantity be less than invoice_line_id when the minimum quantity must be 1, else it won’t even appear in any invoice as an invoice_line!

Also, look at the genres in the results. The solution has Latin and Blues in top 10 but they are not in yours because the count was so low (i ran your solution and see they’re 17 and 11 respectively).

So above reasoning was to show how a sanity check can identify issues with your assumptions.

Now back to what I mentioned above, you seem to think counting invoice_line_id is not correct because quantity is the more accurate indicator of sales, because an invoice_line_id can have more than 1 quantity right? (like someone buying the same track more than once for friends).

Did you check to confirm this? Given that the data has no cardinality (1:1, 1:m, m:1, m:m) information provided to you and no documentation of relationships of physical items within a table, how do you know if 1 invoice_line_id can have quantity more than 1? The best and fastest way in this case is just write an SQL to check it. Note that in real life, it is still advised to check with the designer, to understand whether you don’t see something because

  1. it’s impossible in real life
  2. it’s possible and disallowed by the DB through constraints
  3. it’s possible and allowed, but just not happened yet so you don’t see it in data.

Good thinking here on inferring values from DB design, the 2nd half of case 3 above addresses this issue. If quantity should always be 1 as in case 1, the DB should be redesigned to remove that quantity column.

Now let’s come to to core bug of your code.

In the tqgc view, you did a GROUP BY. When you SELECT columns, some of these will appear in the GROUP BY, but for the others who don’t appear, you have to write an aggregation function on them, or else how does SQL know how to condense a group of values into a single value? There is a table of aggregation functions at the bottom of this thread from an article (linked there too) teaching Window functions: When is SUM Aggregation Function or Window function, and column order in SELECT

In your SELECT, you have 5 columns. 1 of them (trackid) is in GROUP BY, leaving 4 left to aggregate. You aggregated il.quantity with SUM, but what about the other 3? How does SQL know which value among the group to show you in the result?

In this exercise, you are using sqlite which is a very lax engine. On stricter systems like sql server, postgres, you will see the error: ERROR: column “MY_TABLE.MY_COLUMN” must appear in the GROUP BY clause or be used in an aggregate function.

Suggested reading (includes harder concepts like functional dependency) after you get this basic concept: https://medium.com/@riccardoodone/the-love-hate-relationship-between-select-and-group-by-in-sql-4957b2a70229

In this case, sqlite is not giving you error, because it is simply giving you value of the 1st row in the group. Try running the following test in https://sqliteonline.com/ (good place for your other small experiments too)

CREATE TABLE test(track_id INT, customer_id int, country text);
INSERT INTO test VALUES
 (1,10,'USA'),
 (1,11,'Canada'),
 (2,12,'Canada'),
 (2,13,'Italy');
 
SELECT track_id, COUNT(customer_id), country -- No aggregation on country, first row in group displayed
FROM test
GROUP BY track_id

Firstly, it’s important to note the same track can be bought by people from different countries.

%%sql
SELECT t.track_id, COUNT(DISTINCT c.country)
FROM track t
INNER JOIN invoice_line il ON t.track_id = il.track_id
INNER JOIN invoice i ON i.invoice_id = il.invoice_id
INNER JOIN customer c ON c.customer_id = i.customer_id
GROUP BY t.track_id

image

Practically, this means if a track was bought by customers from x different countries, and coincidentally, the first customer in row order that bought a track came from US, all x sales will be attributed to US, inflating US counts (same inflation possible for any other country in any row). Of course, if the row orders were swapped, US will be undercounted for that track, but if you look at the distribution of countries, US occupies the majority of the dataset, and it is much more likely overcounting happens than undercounting.
image

In fact, I checked it to see your solution has 1149 US tracks while the answer has 1051 US tracks. (You can try to prove it yourself once you understand).

Below is the intermediate table from your solution, 1 step before the final GROUP BY (no problem here).
image
The values you see under sum_qua are not reliable. trackid 1 not necessarily has 8 sales from USA, could be less than or equal to, and USA sales of trackid 4 could be buried in the 4 sales all attributed to Canada.

SELECT SUM(tracks_total)
FROM (
SELECT
    country,
    genre,
    SUM(sum_qua) AS tracks_total
FROM tqgc
--WHERE country = "USA"
GROUP BY country, genre
ORDER BY 3 DESC
)

Wrapping your final SELECT, we can see the total number of tracks sold from your solution is still correct at 4757 though, so just the country attributions in tqgc are wrong.

THE HACKY FIX THAT WORKS ONLY FOR SQLITE (does not address wrong GROUP BY mentioned above)
I’ll leave you to fix the GROUP BY issue, you may have to rethink and rewrite the query entirely.

%%sql
WITH track_id_genre AS
    (SELECT 
        tr.track_id AS track_id,
        gr.name AS genre
    FROM
        track AS tr
    INNER JOIN
        genre AS gr ON gr.genre_id = tr.genre_id
    ),

tqgc AS
    (SELECT 
        il.track_id AS trackid,
        COUNT(il.invoice_line_id) AS sum_qua,
        tig.genre AS genre,
        il.invoice_id AS invoiceid,
        cu.country AS country
    FROM 
        invoice_line AS il
    INNER JOIN
        track_id_genre AS tig ON tig.track_id = il.track_id
    LEFT JOIN
        invoice AS i ON i.invoice_id = il.invoice_id
    INNER JOIN
        customer AS cu ON cu.customer_id = i.customer_id AND cu.country = "USA"
    GROUP BY 
        trackid
    )

SELECT
    country,
    genre,
    SUM(sum_qua) AS tracks_total
FROM tqgc
GROUP BY country, genre
ORDER BY 3 DESC;

image

Note how I moved the USA filtering from WHERE clause in your final SELECT into the INNER JOIN in the tqgc CTE by adding an additional join condition AND cu.country = "USA",
because filtering can be done at both places to the same effect for INNER JOIN (not LEFT/RIGHT JOIN though: https://www.pluralsight.com/guides/using-on-versus-where-clauses-to-combine-and-filter-data-in-postgresql-joins)

What this does is indirectly fix the problem of selecting columns without aggregation, because at table joining stage, you have already filtered all the USA only rows, so the 1st row that sqlite conveniently selected will definitely be from USA, and there can’t be the previous problem of sum_qua value being inflated by customers from other countries, or USA sales of tracks being buried in sum_qua of other countries.

PERFORMANCE AND CLARITY OF CODE

Note that the solution filtered as early as possible on USA. It started by ascertaining that quantity are all 1, so invoice_line is the key unit of analysis, and joined the required tables to get the country information to filter.
Filtering early makes future JOINs work on less rows so is faster. The benefit of your solution filtering late is that you get an informative big intermediate table, that if you saved as a new table in the DB for future filtering on other countries, saves time running more joins as the solution must if someone wanted to get information from another country based on solution code.

Another point is you used LEFT JOIN to add invoice table to the combination of invoice_line and track_id_genre, why not INNER JOIN?

Normally people use LEFT JOIN because they think some values on the joining column in the left table could not be found in the right table (eg. because customers have not yet ordered products). In this case, the left table (invoice_line) joining column is an invoice_id. If there was an invoice_line that refered to an invoice_id that did not exist in the right table (invoice), something is wrong with the right table.

Given that the solution is provided, this is still a protected environment where we can do the sanity checks mentioned above. However, when no solution is given, we would have to be extra careful when writing SQL which comes with exposure and practice.

@jithins123 Both solution and @fredericoferreira used the same 5 tables, and I believe all 5 are necessary, because genre, and country information are spread 5 tables apart (see from schema diagram on page 1 of mission) with 3 intermediate tables required to link them.

2 Likes

Thanks a lot for the very detailed explanation.

Regarding this…

Do you think it is possible to use just invoice, invoice_line, track and genre tables to find the number of tracks sold and its percentage?

If you don’t care about country, those 4 tables are enough.
Every new table used should serve a purpose, to get access to certain column(s) for calculating statistics, grouping on, or for joining to another table.

1 Like

Oh sorry, We can’t use the billing country! Yes, it makes sense now. Thank you for your help.