Window Functions vs DISTINCT vs GROUP BY relationship

I’ve no idea why sqlfiddle is having all sorts of TypeError: Cannot read property 'substring' of undefined when query runs fine on https://sqliteonline.com/ so i’ll just paste the scripts for reproducibility. Database used is sqlite, the default on that website.

Creating sample data

DROP TABLE IF EXISTS twitter; 
CREATE TABLE IF NOT EXISTS twitter (session_id INT,session_type VARCHAR, user_id INT,duration INT);
INSERT INTO twitter VALUES
(1,'A',10,101),
(2,'A',10,102),
(3,'A',11,103),
(4,'A',11,104),
(5,'A',11,105),
(6,'B',10,201),
(7,'B',10,202),
(8,'B',11,203),
(9,'B',11,204),
(10,'B',11,205)

Questions on DISTINCT

SELECT distinct session_type, min(duration)over(partition by session_type) 
FROM twitter
--SELECT session_type, min(duration)
--FROM twitter
--GROUP BY session_type
session_type min(duration)over(partition by session_type
A 101
B 201
  1. Can DISTINCT be used as a GROUP BY tool? Both the active and commented queries are returning the same results (ignoring col names). My first exposure of this idea from https://stackoverflow.com/a/40594647/8621823

  2. How does the window function correctly align the results to the distinct session_types?
    When googling “multiple columns with distinct”, results show examples of select distinct A,B., but i’m looking for select distinct A, some_other_operation in a single select. I’ve never seen other columns being added to SELECT distinct . What is the alignment mechanism here?

Questions on GROUP BY
I want to rank the users based on total duration for each user for each session_type.
Note: I may have wrote the query wrongly even though it executes because i don’t know how window functions interact with GROUP BY

select session_type,user_id, sum(duration),rank()over(partition by session_type order by sum(duration)) AS rank
from twitter
GROUP BY session_type,user_id
session_type user_id sum(duration) rank
A 10 203 1
A 11 312 2
B 10 403 1
B 11 612 2
  1. The window function is not the usual aggregate function (eg. COUNT, AVG) that i see (or is it actually an aggregate?), and not in one of the GROUP BY columns. Would this have errored (must appear in the GROUP BY clause or be used in an aggregate function) in a stricter database engine than sqlite or is a perfectly good query?

EDIT: My own answer after tests on postgres is it doesn’t!

  1. How did the window function interact with the GROUP BY to return correct ranks? (I don’t want to wrongly learn something that works only because of the lax rules of sqlite). The GROUP BY was grouping on 2 columns while the window was only looking at 1 column sum(duration), something feels wrong.

  2. Was the window function independent of the GROUP BY’s operations? Even changing sum(duration) to duration in the partition by returns the same ranks without erroring.

EDIT: Removing sum errors on postgres with 42803 column "twitter.duration" must appear in the GROUP BY clause or be used in an aggregate function. What is going on? Does this imply window functions are not aggregate functions? Or just not the specific aggregate functions that GROUP BY wants?

  1. What is the correct way of doing the task? (if i still prefer to use window functions rather than subqueries?)

If anyone has solid ways of explaining, like debugging through query plans, even better

1 Like