I ran the GROUP BY clause within the subquery and got near identical results yet slightly off and do not understand why

Screen Link:
Building And Organizing Complex Queries | Dataquest

My Code:

WITH playlist_info AS
        p.name AS playlist_name,
        COUNT(pt.track_id) AS number_of_tracks,
        SUM(t.milliseconds)/1000 AS length_seconds
      FROM playlist AS p 
     LEFT JOIN playlist_track AS pt ON pt.playlist_id = p.playlist_id
     LEFT JOIN track AS t ON t.track_id = pt.track_id
     GROUP BY p.playlist_id
     ORDER BY p.playlist_id
  FROM playlist_info

What I expected to happen:

I expected the returned table to properly sum up the seconds of each playlist.

What actually happened:

The length_total column was marginally off by less than 100 seconds for all playlists that had more than one track as you can see I performed the GROUP BY and all aggregations within the subquery. I am not entirely sure why this is the wrong approach and am very confused as to why my output is so similar to the solution yet just slightly off.

Replace this line with the output/error

Hi jordan.z.raport

LEFT JOIN returns all the records from the left table. So the JOIN conditions should have been
LEFT JOIN playlist_track pt ON p.playlist_id = pt.playlist_id
LEFT JOIN track t ON pt.track_id = t.track_id

Also, while calculating the length_seconds, you’ll have to use
SUM(t.milliseconds/1000) AS length_seconds
instead of SUM(t.milliseconds)/1000 AS length_seconds

Hope its clear now.

Is this syntax required or is it just a style guide? I wrote my solution as you have here but to test it out, I switched the JOIN conditions around and still got the same results.

Is this why @jordan.z.raport got slightly different results in their length_seconds column…essentially due to rounding errors? If so, I believe that converting to seconds (i.e. dividing by 1000) should be done after summing up the milliseconds rather than before the summation since this leads to less rounding error accumulation. I realize the solution requires it to be done before but logically, wouldn’t it make sense to do it afterwards?

As per the question, we have to select

  • The name of each track from the playlist.

So, we’ll have to select all the track_id values from playlist_track table, JOIN the entire row from track if its track_id value matches.

It’s working here, but the correct order should be

LEFT JOIN playlist_track pt ON pt.playlist_id = p.playlist_id
LEFT JOIN track t ON t.track_id = pt.track_id

To answer your second question, as per the instructions

  • length_seconds - The sum of the length of the playlist in seconds. This column should be an integer.

the length of the playlist in seconds - So, first we have to find the length of the playlist in seconds, and then sum of that length (in seconds).

Is it clear now?

Thank you for the response however I think I’m more confused now regarding the order of the JOIN conditions. In your original post you advised:

But this seems to contradict your most recent reply:

The question remains: in what order should we place our JOIN conditions? To me, it makes sense to put the “left table” first and the “right table” second…if for no other reason than it seems more intuitive. I noticed that the missions tend to put the “right table” first and the “left table” second. I guess my real question is, does it make a difference with other flavours of SQL? It doesn’t seem to matter so much for sqlite since changing this order in the missions does not seem to change the result of the queries.

For my second question I was wondering which way you think we should be trying to calculate “The sum of the length of the playlist in seconds” because there is more than one way to go about finding this length. The two most obvious ways are:

  1. Sum the lengths of each track in milliseconds (per playlist) and then convert to seconds
  2. Convert the length in milliseconds to seconds (per track) and then sum up all the tracks per playlist

These strategies are mathematically equivalent. However, the system only accepts one of these strategies because their results differ slightly due to rounding errors. I was just curious which strategy you thought was better?

Left join is not commutative,
*table1.column_name* = *table2.column_name* is not the same as table2.column_name = table2.column_name.
image is not the same as image

For INNER joins, the order doesn’t matter. But for LEFT joins, it does.

A left join keeps all rows in the first (left) table plus all rows in the second (right) table, when the on clause evaluates to “true”.
When the on clause evaluates to “false” or NULL, the left join still keeps all rows in the first table with NULL values for the second table.

In this DQ exercise, although we’re getting the same result when we swap the order in LEFT JOIN, the correct order should be

LEFT JOIN playlist_track pt ON pt.playlist_id = p.playlist_id
LEFT JOIN track t ON t.track_id = pt.track_id

The instructions are very clear.

Hence, we go with the second strategy.

Does it make sense now?

The order for the JOIN conditions (ie ON keys) was still unclear for me. It was always clear to me that A LEFT JOIN B is not equivalent to B LEFT JOIN A because LEFT JOINs are not commutative…but INNER JOINs are :sunglasses: I think you may have misunderstood what I was asking which was: is the ON clause of a JOIN commutative? Specifically: is A.key = B.key logically equivalent to B.key = A.key in an ON clause? (…the answer is yes!)

I read this post on SO earlier today. The accepted answer seems to indicate that order of keys doesn’t matter but that last answer by Nashev seems quite adamant that we should be ordering as: A LEFT JOIN B ON B.key=A.key for readability reasons. I also read this SO post as well which also confirms that order of keys does not matter. So now, no more confusion: the ON clause is definitely commutative!

For the second part, I think you may have misunderstood my original question…which all boiled down to:

I understood the question, I wrote the code as per the instructions, submitted my code and it was accepted. But after reading this post, I went back to the mission and changed where I converted to seconds and saw there was a slight difference in the final results. I was just curious which strategy you thought was better, logically speaking?

You’re right about the order of keys in LEFT JOIN.
Regarding the second question, it’s best practice to do most of your complicated stuff in the subqueries, hence the conversion can be done there, then the summation in the outer query.

1 Like

That is some good advise, thank you!