Figure 1: the result when the GROUP BY is removed.
Figure 2: how many times total got overcounted for each invoice_id, per employee (not important to this explanation).
Figure 3: the 3-table join result, which has no overcounting.
Minimal Reproducible Example: https://stackoverflow.com/help/minimal-reproducible-example
import pandas as pd

# 3 invoice lines: invoice 1 has 2 lines, invoice 2 has 1 line
invoice_line = pd.DataFrame({'invoice_line_id': range(1, 4), 'invoice_id': [1]*2 + [2]*1})
invoice_line
# one total per invoice
invoice = pd.DataFrame({'invoice_id': range(1, 3), 'total': [15.84, 9.9]})
invoice
# joining on invoice_id duplicates each invoice's total once per invoice line
merged = invoice.merge(invoice_line)
merged
# summing after the join counts each total as many times as it was duplicated
merged.groupby('invoice_id')['total'].sum()
The above shows the exact same problem.
Note that I wanted to avoid creating 2 more tables that are useless here and joining them, so the groupby is a bit different, but this example is sufficient to describe the issue.
In this example, if invoice_line is not joined in, the 2nd table (invoice) correctly shows that invoice_id = 1 earned $15.84. If you join invoice_line, the 3rd table (merged) shows two rows with invoice_id = 1 (caused by the 2 invoice_id = 1 rows in the 1st table) in the joined result before the groupby. This is where the number of rows wrongly grew. The 4th table shows how the extra rows inflate the result, because 15.84 < 15.84 + 15.84.
In the original question the difference is 15.84 < 16 x 15.84. Repeat this row overcount for every single invoice_id (Figure 2 shows how many repeats there are for each invoice_id) and that is why there is a difference in total, or in total per employee.
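In the MRE the same per-invoice repeat count (the analogue of Figure 2) can be read straight off the joined frame; a quick sketch using the merged frame defined above:

# how many rows each invoice contributes to the joined result,
# i.e. how many times its total will be counted by the sum
merged['invoice_id'].value_counts()
# invoice_id 1 -> 2 rows, invoice_id 2 -> 1 row in this toy example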
The totals are not the same; they are shown in Figure 3. You are correct that invoice_line breaks down an invoice, and that its unit_price values summed together give the value of total in invoice. My point is that invoice_line, and what the unit_price is, are completely irrelevant to this analysis.
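If you want to verify that relationship yourself, here is a small sketch (assuming the usual Chinook invoice_line columns unit_price and quantity, and the same chinook.db file used in the code further down):

import sqlite3
import pandas as pd

conn = sqlite3.connect('chinook.db')

# per invoice, rebuilding the total from its lines should match invoice.total
# (quantity is included in case it is not always 1)
pd.read_sql("""
SELECT i.invoice_id,
       i.total,
       SUM(il.unit_price * il.quantity) AS total_from_lines
FROM invoice i
JOIN invoice_line il ON il.invoice_id = i.invoice_id
GROUP BY i.invoice_id
""", conn).head()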
As an analogy, if I want to find weekly sales per employee and that figure is already provided in a table, I don't need to know the sales on Mon, Tue, Wed, … per employee, because the weekly data I want (the total from invoice) is already given. Doing extra joins to pull in daily data (the number of invoice_line rows per invoice, or what the unit_price is) just dirties the analysis.
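To make that concrete in the MRE, here is a sketch of two ways to get the correct per-invoice numbers, using the frames defined above (just an illustration of the idea):

# simplest: never join invoice_line, the answer is already in invoice
invoice.set_index('invoice_id')['total']

# or, if the join has already happened, keep one row per invoice before summing
merged.drop_duplicates('invoice_id').groupby('invoice_id')['total'].sum()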
This is the syntax difference between no groupby and groupby. Notice the 2 changes, as mentioned previously.
Further discussion

These counts match the counts you got after joining 3 or 4 tables. The 614 rows staying at 614 after 2 more joins shows that each invoice_id matches exactly one customer_id and each customer_id matches exactly one employee_id (support_rep_id), so those joins do not add rows.
The expansion from 614 to 4757 happens because invoice_id : invoice_line_id has 1:m cardinality.
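These cardinalities can be sanity-checked with plain row counts; a sketch against the same database (the 614 and 4757 figures are the ones discussed above):

import sqlite3
import pandas as pd

conn = sqlite3.connect('chinook.db')  # same file as in the code below

# invoice alone: 614 rows
pd.read_sql("SELECT COUNT(*) AS n FROM invoice", conn)

# joining customer and employee does not change the row count (still 614)
pd.read_sql("""
SELECT COUNT(*) AS n
FROM invoice i
JOIN customer c ON c.customer_id = i.customer_id
JOIN employee e ON e.employee_id = c.support_rep_id
""", conn)

# joining invoice_line expands the rows (614 -> 4757), one row per invoice line
pd.read_sql("""
SELECT COUNT(*) AS n
FROM invoice i
JOIN invoice_line il ON il.invoice_id = i.invoice_id
""", conn)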
Let me prove this for you by multiplying COUNT(*) from Figure 2 by sales from Figure 3 and summing; the result is 47503.
import sqlite3
import pandas as pd

# connect to the Chinook SQLite database
conn = sqlite3.connect('chinook.db')
query = """
SELECT
e.first_name || ' ' || e.last_name AS employee_name,
i.total AS sales
FROM
customer c,
employee e,
invoice i
WHERE
e.employee_id = c.support_rep_id
AND i.customer_id = c.customer_id
"""
df = pd.read_sql(query,con=conn)
query2 = """
SELECT
e.first_name || ' ' || e.last_name AS employee_name,
COUNT(*) AS overcount
FROM
invoice_line il ,
customer c,
employee e,
invoice i
WHERE
e.employee_id = c.support_rep_id
AND i.customer_id = c.customer_id
AND i.invoice_id = il.invoice_id
GROUP BY il.invoice_id, e.employee_id
"""
df2 = pd.read_sql(query2,con=conn)
# relies on df and df2 coming back in the same row order (neither query has an ORDER BY)
(df.sales * df2.overcount).sum()
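Equivalently, the whole check can be written as a single query: summing i.total over the 4-table join counts each invoice's total once per invoice line, which is exactly the overcount described above (a sketch reusing conn from the code above):

# should reproduce the same overcounted figure as (df.sales * df2.overcount).sum()
pd.read_sql("""
SELECT SUM(i.total) AS overcounted_total
FROM invoice i
JOIN invoice_line il ON il.invoice_id = i.invoice_id
JOIN customer c ON c.customer_id = i.customer_id
JOIN employee e ON e.employee_id = c.support_rep_id
""", conn)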
Rounding by int CAST