Unnecessary additional GROUP BY in 190-3?

Screen Link: https://app.dataquest.io/m/190/building-and-organizing-complex-queries/3/the-with-clause

The answer code includes
GROUP BY 1, 2

However, in my own solution, I only did GROUP BY 1. What does adding the , 2 do to this query? It seems unnecessary.

1 Like

1,2 refers to the first, second column in the SELECT statement. Really bad practice to be using positional indexing rather than explicitly writing the names of columns. Note groupby 1,2 is not the same as groupby 2,1. Pandas has more examples on df.groupby for you to get familiar.

1 Like

Hey, Chris.

I answered this question before here. I’ll copy and paste it here for self-containment.

Every non-aggregated column should be included in the GROUP BY clause, not doing so can lead to errors and mistakes in calculations. Let’s explore this with the table elite_agent where each row represents a person.

id city gender age
1 Lisbon M 21
2 Chicago F 20
3 New York F 20
4 Chicago M 27
5 Lisbon F 27
6 Lisbon M 19
7 Lisbon F 23
8 Chicago F 24
9 Chicago M 21

If you wish to experiment, feel free to use the agents database attached here: agents.db (12 KB). You can even do it online, for instance on sqliteonline.com.

Each person has an assigned city and a gender. On this table, it makes perfect sense to ask what is the mean age, the mean age by gender, the mean age by city, or the mean age by city and gender. The last question can be answered by the following query:

SELECT city, gender, AVG(age) as mean_age
  FROM elite_agent
 GROUP BY city, gender /* alternatively GROUP BY 1, 2 */
;
If you're curious, expand here to see the results
city gender age
Chicago F 22.0
Chicago M 24.0
Lisbon F 25.0
Lisbon M 20.0
New York F 20.0

Now let’s think about what’s happening when we drop the gender from the GROUP BY clause. You’d be asking for the mean age by city. This is fine, but what could it possibly mean to also include gender in the SELECT statement? Let’s take a look at the results of the mean age by city, without including gender:

Expand to see query and results
SELECT city,AVG(age) as mean_age
  FROM elite_agent
 GROUP BY city
;
city mean_age
Chicago 23.0
Lisbon 22.5
New York 20.0

Now I ask again, what could it possibly mean to include the gender column, but not have it in the GROUP BY statement? Cities don’t have genders, that’s one way to convince oneself that it’s not meaningful to include gender.

But it’s actually a wrong way to think about this. The database doesn’t know that cities don’t have genders, it’s not something that’s captured by a database model. In this case, it is gender, but could be something like public transportation companies. And then the question would remain: what would it mean to include public transportation companies when grouping by mean age? Presumably, each city has more than one public transportation company, how would we choose which to include?

Let’s take a look at the results of including the gender in the last query:

Expand to see query and results
SELECT city, gender, AVG(age) as mean_age
  FROM elite_agent
 GROUP BY city
;
city gender age
Chicago M 23.0
Lisbon F 22.5
New York F 20.0

In this case, it didn’t yield an error because that’s how SQLite developers implemented this particular scenario — it seems the SQLite engine picks the last occurrence for the non-aggregate column that isn’t in the GROUP BY clause. Some SQL engines will yield an error, some will behave just like SQLite, and others behave even differently.

Expand to see different behaviors

The SQL engine in SAS, for instance, repeats each value:

city gender age
Chicago F 23.0
Chicago M 23.0
Lisbon F 22.5
Lisbon M 22.5
New York F 20.0

In SQL Server you’d just get an error. You can experiment this yourself in Stack Exchange Data Explorer — the public database provided by Stack Exchange.

If you run query below on the Stack Overflow database here, you’ll get an error. If you remove the comment marker (--), it will run fine. Note that not including it in GROUP BY is the same type of error as not including gender in the example above.

SELECT YEAR(creationdate) AS creation_year, PostTypeId, COUNT(*)
  FROM posts
 GROUP BY YEAR(creationdate)--, PostTypeId
;

I don’t know what are the motivations for some of these databases not making it yield an error. In my opinion, it should. And I suggest you act as if it does because:

  1. It makes your code more intuitive;
  2. It makes your code easier to migrate into other databases;

I hope this helps.

11 Likes

Hi Bruno,

Sorry, I didn’t mean to post again. I had searched for any threads tagged on this lesson, but didn’t see one that addressed this topic.

Thanks for your explanation. I does make sense that it should yield an error. And it’s worth practicing as though it does yield an error. However, it can be tough to know when it’s supposed to yield an error since I’m learning still.

1 Like

Interesting point, and I realise it doesn’t just repeat the 1st seen value, but looks like it calculates an average of the group and repeats that.

1 Like

Thanks for the good explanation!

2 Likes