SQL produces different output please explain?

The above link has been edited to the correct mission now.

I am doing above mission and I looked at someone else’s query to see the total money made by each sales agent was different to my result as their query is slightly different but do not understand why please explain?

here is link to ther person’s query and they get 18246 for Jane I get different number (less):

If I include the extra table invoice line I get same output , but do not understand why this was making a difference please advise?

SELECT
        e.first_name || " " || e.last_name AS employee_name,
        cast(SUM(i.total)as int) as sales
            FROM 
             invoice_line il ,
                customer c,
                employee e,
                invoice i
            WHERE
                e.employee_id = c.support_rep_id
                AND i.customer_id = c.customer_id
                and i.invoice_id = il.invoice_id 
            GROUP BY e.employee_Id

--seems to give wrong output below
%%sql
SELECT
        e.first_name || " " || e.last_name AS employee_name,
        cast(SUM(i.total)as int) as sales
            FROM 
                customer c,
                employee e,
                invoice i
            WHERE
                e.employee_id = c.support_rep_id
                AND i.customer_id = c.customer_id
            GROUP BY e.employee_Id
1 Like

This mission opens up to 6. Pattern Matching Using Like, how is this page related to the question?

A way to see differences is to move backwards in the SQL processing pipeline. Since the last step is groupby, remove the groupby, print out all the rows and compare differences. For larger datasets, if not eyeballing, then invent your own metrics to study the stats and search for differences.

The superficial answer is 1 solution joined an extra table, 1 didn’t, of course they are different.
What happens when tables are joined? Rows can increase/decrease/stay same, you never know.
Print row counts for fast debugging. You don’t even have to see the values to have a sense what’s going on. If row count show no difference, then look into the values.

So who’s answer do you think is correct? The one with 4 tables joined or 3 tables?
Then you have to ask why is he joining an extra table, what does it mean?

Don’t fall into the trap of hoping things work because you guess your reasoning of the sql theory is right. DQ does not really emphasize how to break down a long query into smaller parts, so learners have to fill in this gap themselves to avoid being left helpless when the final result is unexpected.

2 Likes

Hi,
A>
sorry about the link ill correct it in a minute.
B>
removing ‘group by’ does not show all the rows but just a grand total
for all 3 employees which is wrong in the context here since it is
next to the name of one employee - Jane

SELECT
        e.first_name || " " || e.last_name AS employee_name,
        cast(SUM(i.total)as int) as sales
            FROM 
             invoice_line il ,
                customer c,
                employee e,
                invoice i
            WHERE
                e.employee_id = c.support_rep_id
                AND i.customer_id = c.customer_id
                and i.invoice_id = il.invoice_id 
            GROUP BY e.employee_Id

Thanks for the tip about row counts.
I see there is a difference here.


I am not sure why there are more rows when including the invoice line table.
When looking at the invoice line table – all it tells me is that it is a table with details about the invoice…If I compare this table to invoice table which shows less detail/breakdown I do not see why invoice_line table gives different total to the total column on the far right of the invoice table. …an invoice_id in the invoice table is just a sum of all the invoice_line_ids for one particular invoice_id in the invoice_line table ,so I still cannot see why I am getting a different total.


1 Like

Because that is not how a group by should be removed.

A GROUP BY is not just a GROUP BY statement. It involves

  1. Choosing what column(s) to group on
  2. Choosing what aggregation functions to use for the columns in SELECT that are not grouped

You simply deleted the GROUP BY line rather than reverse step 2 too. So you did not really go back 1 step up the SQL processing pipeline. Deleting 1 without doing 2 makes no sense.

Going back is so you can see what are the values before grouping to help debugging. Anyway, it is not necessary, since you have shown for yourself joining 4 tables creates 4757 rows, way more than joining 3 tables which shows 614 rows.

1. Why does joining more tables create more rows?

As I said above it is uncertain how many rows the result of an inner join will produce. Create your own 2 tables, and synthesize data to show for yourself what sorts of data patterns create more/equal/less rows after join and why it is unpredictable. Without understanding joins, it’s hard to do any SQL.

2. Why is creating more rows a problem for GROUP BY?

Assuming you skip exercise 1 above, we already know joining 4 tables creates more rows. Why does creating more rows give a larger total? Because you are adding more values into each group. It’s like 1+1 (2 members in group) < 1+1+1 (3 members in group).
You can argue what if the smaller group size has larger values? Like 2+3 > 1+1+1+1, that is why I ask you to remove the groupby and investigate the values for yourself.
You can set EXCEPT, UNION, INTERSECT to do set operations in SQL and find out what’s going on.

Think about what does it mean when an extra invoice_line table gets joined to the group of 3 tables.
It means for every invoice_id in the invoice_line table, the total from invoice table gets duplicated once after join. invoice_id = 1 in invoice table has 16 invoice_id = 1 in invoice_line table, meaning the total of 15.84 of invoice_id = 1 from invoice table got repeated 16x after inner joining to invoice_line, which later got summed 16x in group by.
I may be wrong. You can verify it by counting how many duplicated counts does each invoice_id get after joining, and compare the numbers yourself.

You should be able to answer whether join 3 or 4 tables is correct now.

Another thing to investigate is why 18246 + 15923 + 13333 != 47503 (same for the 3 table edition)

Extra readings

Dangerous Subtleties of Joins: https://alexpetralia.com/posts/2017/7/19/more-dangerous-subtleties-of-joins-in-sql

ERD/Cardinality lesson (DQ did not show in database diagram): https://www.youtube.com/watch?v=QpdhBUYk7Kk&ab_channel=Lucidchart

Understanding cardinality helps you think about physical meanings, what’s possible/probable/impossible and helps debugging queries

2 Likes

[quote=“hanqi, post:4, topic:546289”]
1.
Assuming you skip exercise 1 above, we already know joining 4 tables creates more rows.

Could you please demonstrate by editing my query how you would show all rows?

2.
Why does creating more rows give a larger total?
Because you are adding more values into each group. It’s like 1+1 (2 members in group) < 1+1+1 (3 members in group).

[/quote]

I understand the adding more rows part, but not why you get a larger total.

You see the invoice_id in the second column here?
It’s 1 and has 5 invoice lines associated when we sum the unit_price it should equal the total in the invoice table which shows the invoice_id as an aggregate in total. In other words the totals should be the same for invoice_id 1,2,3…101,102 etc there are just less lines in the invoice table because each invoice is not broken down into its invoice-lines.

Think about what does it mean when an extra invoice_line table gets joined to the group of 3 tables.
It means for every invoice_id in the invoice_line table, the total from invoice table gets duplicated once after join. invoice_id = 1 in invoice table has 16 invoice_id = 1 in invoice_line table, meaning the total of 15.84 of invoice_id = 1 from invoice table got repeated 16x after inner joining to invoice_line, which later got summed 16x in group by.
I may be wrong. You can verify it by counting how many duplicated counts does each invoice_id get after joining, and compare the numbers yourself.

Could you please provide some example code on how to do this?
Comparing before and after totals from above, if it was duplicated then total should be double , but it is not, it is 10x more for each sales agent.

1 Like

Figure 1: when remove GROUP BY.

Figure 2: how many times total got overcounted for each invoice_id for each employee (not important to this explanation)

Figure 3: 3 table join result which has no overcounting.

Minimal Reproducible Example: https://stackoverflow.com/help/minimal-reproducible-example

invoice_line = pd.DataFrame({'invoice_line_id':range(1,4),'invoice_id':[1]*2+[2]*1})
invoice_line

invoice = pd.DataFrame({'invoice_id':range(1,3),'total':[15.84,9.9]})
invoice

merged = invoice.merge(invoice_line)
merged

merged.groupby('invoice_id').sum('total')

Above shows the exact same problem.
Note that I want to avoid creating 2 more useless tables and joining them, so the groupby is a bit different, but this example is sufficient to describe the issue.

In this example, if invoice_line was not joined in, 2nd table correctly shows invoice_id = 1 earned $15.84. If you join invoice_line, 3rd table shows there are 2 invoice_id = 1 inside invoice (caused by the 2 invoice_id = 1 in 1st table) in the joined result before groupby. This is where the number of rows wrongly grew. 4th table shows how more rows caused a larger result, because 15.84 < 15.84 + 15.84.

In the original question. The difference is 15.84 < 16 x 15.84. Repeat this row overcount for every single invoice_id (figure 2 shows how many repeats for each invoice_id) and that’s why there is a difference in total or total per employee.

The totals are not the same. The totals are shown in Figure 3. You are correct that invoice_line breaks down an invoice, and unit_price summed together give the value of total in invoice. My point is invoice_line or what the unit_price is, is completely irrelevant to this analysis.

As an analogy, if I want to find weekly sales per employee and it is provided in a table already, I don’t need to know the sales on mon, tue, wed … per employee. Because the weekly data I want (the total from invoice) is already given, and doing extra joins to include daily data (number of invoice_line per invoice or what the unit_price is) is just dirtying the analysis.


This is the syntax difference between no groupby and groupby. See how there are 2 changes as mentioned previously

Further discussion

image
These counts match the counts you got after joining 3 or 4 tables. 614 remaining 614 after 2 more joins proves that invoice_id : customer_id is 1:1 and customer_id : employee_id is 1:1.

The expansion of 614 to 4757 happens because invoice_id : invoice_line_id is 1:m cardinality.

Let me prove this for you by multiplying COUNT(*) from Figure 2 by sales from Figure 3 and sum, result is 47503.

import sqlite3
import pandas as pd

conn = 'sqlite:///chinook.db'

query = """
SELECT
    e.first_name || " " || e.last_name AS employee_name,
    i.total as sales
    FROM 
        customer c,
        employee e,
        invoice i
    WHERE
        e.employee_id = c.support_rep_id
        AND i.customer_id = c.customer_id     
"""
df = pd.read_sql(query,con=conn)

query2 = """
SELECT
        e.first_name || " " || e.last_name AS employee_name,
        COUNT(*) as overcount
            FROM 
             invoice_line il ,
                customer c,
                employee e,
                invoice i
            WHERE
                e.employee_id = c.support_rep_id
                AND i.customer_id = c.customer_id
                and i.invoice_id = il.invoice_id 
            GROUP BY il.invoice_id, e.employee_Id
"""

df2 = pd.read_sql(query2,con=conn)

sum(df.sales * df2.overcount)

Rounding by int CAST

2 Likes

Hi hanqi,

Many thanks.

It clicked when I saw this line in your explanation–

because invoice_id : invoice_line_id is 1:m cardinality.

one-to-many

invoice_id = 1 in invoice table has 16 invoice_id = 1 in invoice_line table, meaning the total of 15.84 of invoice_id = 1 from invoice table got repeated 16x after inner joining to invoice_line, which later got summed 16x in group by.

so because of joining there , but how do I know it is not other way around—
does it depend on the order of whether it is

(one to many)

FROM invoice i
JOIN invoice_line il ON i.invoice_id =il.invoice_id

vs
(many to one)

FROM invoice_line il
JOIN invoice i ON i.invoice_id =il.invoice_id

if invoice_id #8 had a total of $50 in the invoice table because of the join and one -to-many relationship to the invoice_line table if there were 3 invoice_lines for invoice_id #8 teach with a unit_price of $0.99 the total would become 3*$0.99 *$50 = $148?

1 Like

When presented with any question, before writing any query, you have to understand the ERD and their cardinalities. Without knowing the cardinality and object relationships there’s no way you can write correct queries. Running without error does not mean the result answers the question. You may have to run EDA queries in this case to find out the cardinalities.

Cardinalities are logically designed facts, can’t just switch them around without huge redesign to DB. A bicycle has 2 wheels. A wheel does not have 2 bicycles (unless you swap them around).

invoice_line top few lines already shows there are invoice_line_id 1,2,3… for invoice_id = 1. The to-many is proven here. You don’t need to see whether invoice_id = 2/3/4… maps to multiple invoice_line_id anymore because that doesn’t change the to-many fact since invoice_id = 1 is already to-many invoice_line_id. Then you do a groupby invoice_line_id count invoice_id to see if it’s many-to-many. Which you will see as result that every invoice_line_id only comes from 1 invoice_id which proves 1:m.

The resultant number of rows of a join can be reasoned from knowledge of cardinality. If you know it’s 1:1, then it’s very easy to expect the output to have same number of rows (not always) as input, when it becomes m:1, 1:m, m:m then row counts are more unpredictable. The order does not matter for any number of tables when only INNER JOIN is used, like A * B * C = A * C * B = … (6 permutations). Also, INNER JOIN type is easier to predict output row count than LEFT,RIGHT,OUTER.

You haven’t understood the relationship between unit_price from invoice_line and total from invoice. unit_price is price of apple ($2), orange ($3), if i go to the supermarket to buy 2 apple + 1 orange, total = $7. The joining caused duplicates of this $7 to become 3*7 = 21 since there were 2+1 = 3 items bought. There is no sense in multiplying unit_price by total

Anyway to get better with these, go through the Minimal Reproducible Example above, set up your own small tables, and do some joining experiments with pd.merge/df.merge and see how row counts change.

2 Likes

but i did not only multiply unit price by total it was quantity * unit price * total. By associative property of multiplication it looks like the same computation as you had done 3*7 = 21 in your example.

Thanks for explanation on cardinality I had read it , but only when I came across this problem did I realise its usage.

Anyway to get better with these, go through the Minimal Reproducible Example above, set up your own small tables, and do some joining experiments with pd.merge/df.merge and see how row counts change.

Good idea.I shall try the pd.merge/df.merge.

1 Like

Yes i understand you are asking about 3*$0.99 *$50 which is number of invoice_line_id * unit_price * total.

Let me just stop here and ask, what is the physical meaning of this to you?

My point is just the last 2 terms multiplied together make no sense already, nevermind whether the first term 3 is multiplied in or not.

No my example of 3 x 7 = 21 is 3 (number of invoice_line_id) x 7 (total). There is no multiplication of total and unit_price. The unit_price are part of the 7. $7 (total) = 2 * $2 (unit_price for apple) + 1 * $3 (unit_price for orange)

Do you get this point? This was the 1st analogy before my supermarket basket example.

2 Likes

I think you are asking what I understand is happening in the db here
It is is quantity 3) multiplied by unit price $0.99 in invoice line table then again multiplied /duplicated by the total (which is unit price * invoice line/quantity) in the invoice table which is $50.

Sorry but I am unsure what you mean by last terms making no sense.
Could you say why please?

I understand how the total is derived in invoice table. I think I see what you mean which is that even though answer comes to 21 how it is derived is fundamentally different? I think your point is that the invoice lines how many there are for a particular invoice is multiplied by the total for that invoice in the invoice table. I thikn I get it now. Thanks.

1 Like

Let’s stop going deeper about this 3x7=21 or 3*$0.99*$50=$148 discussion.
It does not contribute much to the problem in this thread.
Let me summarize your whole issue in simple english.

  1. You joined 3 tables. You think someone else who joined the same 3 tables + 1 more table has the correct answer.
  2. The fact is joining that extra table is wrong. Explained below
  3. 3 tables alone were sufficient to fully answer the question, joining 4th table created extra rows because of 1:m relationship.
  4. These extra rows created a larger groupby-sum value for each group, thus it gave wrong values for each group.

magnifique. thanks for your explanation.

Hi hanqi,

I am trying out the above but get an error on this line please advise?

merged.groupby('invoice_id').sum('total')

I split your code over different lines so I could see result of each operation.

I tried looking here and other sources, but could not work out error

Edit:

I got this line to work taking a line from stackoverflow link,
but it would be nice to know why what I had was not working?

merged.groupby(['invoice_id', 'total']).agg('sum')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-146e6c5d5a45> in <module>
----> 1 merged.groupby('invoice_id','total').sum()

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, **kwargs)
   7881         if level is None and by is None:
   7882             raise TypeError("You have to supply one of 'by' and 'level'")
-> 7883         axis = self._get_axis_number(axis)
   7884         return groupby(
   7885             self,

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    409             except KeyError:
    410                 pass
--> 411         raise ValueError("No axis named {0} for object type {1}".format(axis, cls))
    412 
    413     @classmethod

ValueError: No axis named total for object type <class 'pandas.core.frame.DataFrame'>

Have you wondered if you can see variables without wrapping print()? If you worked in jupyter notebook, you can see all output at once.

I’ve no idea why that TypeError appeared, my pandas version is 1.0.5, that code looks correct. You can try updating your local pandas to latest version 1.13 and paste my code again. (answer below)

Please always ask why you are doing what you are doing rather than just running error free code. Again I must emphasize, error-free code does not necessaily mean it correctly answers a question.

What you did was groupby on 2 columns, which is a totally different analysis from the erroring code, that groups by 1 column. Fyi, df[col].value_counts() is almost like groupby(col).size() or groupby(col).count(). People use groupby rather than value_counts() for its flexibility to groupby on more than 1 column, and ability to explore individual groups and their indexes.

This looks like you’re trying random syntax and hoping a miracle happens/that it doesn’t error. That’s not a very systematic way of learning.
1st open the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html, look at the arguments of groupby, and how what you are feeding it is being interpreted by the method. The error above means your positional argument got matched to the axis parameter in groupby, but axis is not defined to accept anything other than 0/1/‘index’/‘column’ as mentioned in docs.

If you don’t understand all these keyword argument/ positional argument stuff, better to detour awhile to study them. They are fundamental python concepts, primarily needed for good function api design. You won’t need to know the api design part, but its also for debugging in cases like this.

3 Likes

@jamesberentsen

My bad, that last line of merged.groupby('invoice_id').sum('total') is wrong syntax.
It should be merged.groupby('invoice_id').sum(), and what i mean’t to show was merged.groupby('invoice_id')['total'].sum() the aggregation only on total column because summing invoice_line_id is meaningless.

The TypeError: f() takes 1 positional argument but 2 were given is refering to groupby.sum() expecting 1 argument (self) , but it got 2, (self + ‘total’). It is expecting self, because it is a method of the pandas.core.groupby.generic.DataFrameGroupBy object, or the pandas.core.groupby.generic.SeriesGroupBy object if you had filtered [‘total’] before .sum()

Thanks for the link.

This looks like you’re trying random syntax and hoping a miracle happens/that it doesn’t error. That’s not a very systematic way of learning.

You’re right I was. I now see importance of reading the documentation.

I looked up youtube video for groupby, but that indeed was not enough.

merged.groupby('invoice_id')['total'].sum()
Nice output now. I see what was wrong by joining extra table.

1 Like