BLACK FRIDAY EXTRA SAVINGS EVENT - EXTENDED
START FREE

Calculating variance in SQL

Screen Link:

https://app.dataquest.io/c/46/m/191/guided-project%3A-answering-business-questions-using-sql/4/analyzing-employee-sales-performance

My Code:

SELECT 
((total - AVG(total)) * (total - AVG(total))) 
FROM invoice;

What I expected to happen:

I’m trying to calculate the variance of invoice data from the SQL Business Analysis Project. There was a line from the slide: “You might like to consider whether any extra columns from the employee table explain any variance you see, or whether the variance might instead be indicative of employee performance.” I realize now that “variance” wasn’t referring a mathematical definition of variance, but rather why some employees performed better than others. However, I’m now really curious as to how I would calculate variance in SQL, i.e. sum (xi - mean)^2 / (n-1), and how you would calculate it for grouped data. In the code block above, I calculated (xi - mean)^2 and would expect to see a table with the same number of rows as the original invoice table.

What actually happened:

((total - AVG(total)) * (total - AVG(total)))
27.03915310109401

However, as you can see, SQL only calculated (xi - mean) * (xi - mean) for the first entry.

Another note, there looks like there is a built-in variance function VAR in SQL, but it’s not available on the version of SQL on the Dataquest platform.

Any guidance on this would be helpful, thank you!

Ooh, interesting question!

You will have to calculate the average for each row using a subquery. Something like -

(total - (SELECT AVG(total) FROM invoice))

I am not sure if there’s an easier/simpler/more efficient way of doing this, though.

That would depend on the variant of SQL, yes. Dataquest uses SQLite, I believe. Other variants like MySQL etc. might have such built-in functions.