A Google search of “tabibitosan sql” surprisingly returned only links to articles from years ago. Even though
match_recognize has been available since Oracle 12c to replace the tabibitosan method, I felt it is still valuable to raise awareness of this technique, both to solve problems in non-Oracle SQL flavours and, more importantly, to develop general problem-solving skills.
The data shows that we have 3 people, tom, jerry and nibbles, who attended an event sometime around late January 2020.
The goal is to be able to identify that tom (20–22 Jan) and jerry (23–26 Jan) attended the event for at least 3 consecutive days while nibbles did not.
The technique aims to create a new grouping column (grp) that we can use to answer the problem. Looking at grp, we see that it contains 4 duplicates of 2020–01–21, marking the 4 consecutive days jerry attended. We also see 3 duplicates of 2020–01–19, marking the 3 consecutive days tom attended. Jerry's non-consecutive visit on 2020–01–21 gets its own grp value, and similarly for nibbles' only visit and tom's non-consecutive visit on 2020–01–24.
The exact date values in the derived grp column are not important at all. The goal of the technique is to create a column of values (of any type) that contains the same value for the rows we care about (consecutive event visits, for this problem) and different values at the boundaries we want to demarcate (non-consecutive event visits).
The entire technique boils down to this line
joindate - row_number() over(partition by name order by joindate)::int grp in SELECT.
There are 2 operands to the subtraction. The left operand is a value that sometimes increases by 1 unit and sometimes by more than 1 unit. The right operand is an auto-generated row number that, within each of its own partitions, consistently increases by 1 unit. Overall, the left operand has an opportunity to travel faster than the right operand.
Going down the rows in the grp column (the result of the difference), when the days are consecutive, both operands increase at the same speed of 1 unit, so their difference remains the same. However, when the days are not consecutive, the left operand (temporarily) travels faster than the right operand, causing their difference to jump to a new, larger value.
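As a quick sanity check of this arithmetic, here is a minimal Python sketch (a hypothetical re-creation outside the database, using tom's dates from the sample data) that mimics grp = joindate - row_number():

```python
from datetime import date

# tom's visits from the sample data: three consecutive days, then a gap.
visits = [date(2020, 1, 20), date(2020, 1, 21),
          date(2020, 1, 22), date(2020, 1, 24)]

# grp = joindate - row_number(): dates are converted to day numbers
# (toordinal) so the subtraction mirrors the SQL date - int arithmetic.
grps = [d.toordinal() - rn for rn, d in enumerate(sorted(visits), start=1)]

# The three consecutive days collapse to one shared grp value;
# the visit after the gap jumps to a new, larger value.
print(len(set(grps[:3])), grps[3] - grps[0])  # → 1 1
```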
Again for emphasis, what this value is does not matter at all. The query above did the
::int integer casting of the generated row_number() solely to handle
operator does not exist: date - bigint so that the subtraction is possible. You may need to do other conversions as you see fit to make this technique work.
A caveat is that when this technique is used with partitioning, the grp column does not necessarily increase monotonically, and the same grp value can appear in different partitions (2020–01–20 appears in both the 1st row for jerry and the last row for tom). If we want to do downstream analysis, we must be careful to GROUP BY not just the grp column but usually also the partitioning column (name in this case).
We can simply wrap the previous query in a CTE and query for groups HAVING at least 3 rows. SELECT DISTINCT name is important to deduplicate names occurring when the same person has multiple consecutive streaks satisfying the HAVING. Another way to see this is that the final GROUP BY works on (name, grp), so duplicates are possible when looking at the name column only.
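The end-to-end query can be sketched outside Postgres too. Here is an illustrative re-creation using Python's bundled sqlite3 (window functions need SQLite 3.25+), where julianday() plays the role of the ::int cast so the date can be subtracted from the row number:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event(name TEXT, joindate TEXT);
INSERT INTO event(name, joindate) VALUES
 ('tom','2020-01-20'),('tom','2020-01-21'),('tom','2020-01-22'),('tom','2020-01-24'),
 ('jerry','2020-01-21'),('jerry','2020-01-23'),('jerry','2020-01-24'),
 ('jerry','2020-01-25'),('jerry','2020-01-26'),
 ('nibbles','2020-01-26');
""")
rows = con.execute("""
WITH tbb AS (
  SELECT name,
         -- julianday() turns the text date into a number, standing in
         -- for the Postgres ::int cast, so the subtraction is possible
         julianday(joindate)
           - ROW_NUMBER() OVER (PARTITION BY name ORDER BY joindate) AS grp
  FROM event
)
SELECT DISTINCT name FROM tbb GROUP BY name, grp HAVING COUNT(*) >= 3
""").fetchall()
print(sorted(r[0] for r in rows))  # → ['jerry', 'tom']
```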
Sharp readers would realize by now that this solution is not only limited to solving questions of 3, and not only works for the
No pain no gain, previous question was too easy so let’s extend it.
To be able to filter the rows with the longest streak, it’s helpful to first calculate some statistics on which rows are part of the longest streak.
The key technique of the above query is to expand the frame of the window function from default
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW to being unbounded in both directions so we can calculate statistics on the whole group. The partition here is multi-level, defined by both the name and grp columns.
For people unfamiliar with window functions, this step can also be achieved by the usual
GROUP BY name, grp, then JOIN the
cnt statistics back into this table. I’m showing unbounded window functions just to expose readers to alternatives.
We can get to the answer with 2 more steps, by calculating the longest_streak per name to generate
max_cnt , then using a join to filter the data from the previous step.
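These steps can be sketched in sqlite3 again (an illustrative stand-in for the Postgres query; note that COUNT(*) OVER (PARTITION BY name, grp) with no ORDER BY defaults to a whole-partition frame, which is equivalent to the explicit unbounded frame):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event(name TEXT, joindate TEXT);
INSERT INTO event(name, joindate) VALUES
 ('tom','2020-01-20'),('tom','2020-01-21'),('tom','2020-01-22'),('tom','2020-01-24'),
 ('jerry','2020-01-21'),('jerry','2020-01-23'),('jerry','2020-01-24'),
 ('jerry','2020-01-25'),('jerry','2020-01-26'),('nibbles','2020-01-26');
""")
rows = con.execute("""
WITH tbb AS (
  SELECT name, joindate,
         julianday(joindate)
           - ROW_NUMBER() OVER (PARTITION BY name ORDER BY joindate) AS grp
  FROM event
),
count_stats AS (  -- size of each (name, grp) streak, attached to every row
  SELECT *, COUNT(*) OVER (PARTITION BY name, grp) AS cnt
  FROM tbb
),
longest_streak AS (  -- longest streak per person
  SELECT name, MAX(cnt) AS max_cnt FROM count_stats GROUP BY name
)
SELECT cs.name, cs.joindate
FROM count_stats cs
JOIN longest_streak ls ON cs.name = ls.name AND cs.cnt = ls.max_cnt
ORDER BY cs.name, cs.joindate
""").fetchall()
print(rows)
```

Note that nibbles' single visit survives the join: a streak of length 1 is still that person's longest streak.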
This solution finds all the rows contributing to the max streak per person but does not identify which streak each row belongs to. The question can be made even harder by looking for the rows from the first, last or n-th max streak, assuming the same max streak length occurred multiple times.
- First and last day of each group
- How many consecutive runs are there
Question: Who attended the event for at least 3 consecutive days?
Here we use the RANGE keyword to look back 2 days before the current row, up to and including the current row. If we count 3, that means there is a streak of 3. As you can see, this solution has to hardcode the value of 2 in the query to answer for 3 consecutive days, so it is not as flexible as tabibitosan.
If you tried to use ROWS for the windowing in the previous example, it would give false positives. ROWS just counts physical rows with no regard for the values in them. RANGE looks at the cell values, extending the window frame backwards and forwards while the values in the ORDER BY column of OVER are still within range relative to the current row's value.
joindate-'2020-01-01' is done only to handle
RANGE with offset PRECEDING/FOLLOWING is not supported for column type date and offset type integer HINT: Cast the offset value to an appropriate type. I was pleasantly surprised that Postgres actually tells you how to hack the datatypes to make it work. The choice of
2020-01-01 is completely arbitrary. Any date is fine, goal being to convert date to integer so the defined offsets (2 values backwards and 0 forward in this case) can work on an integer window ordering column. Other databases may be able to handle date offsets directly.
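The same trick can be sketched in SQLite through Python (RANGE frames with offsets need SQLite 3.28+). Here a derived integer day number, julianday(joindate) - julianday('2020-01-01'), plays the same role as the Postgres joindate - '2020-01-01' hack:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event(name TEXT, joindate TEXT);
INSERT INTO event(name, joindate) VALUES
 ('tom','2020-01-20'),('tom','2020-01-21'),('tom','2020-01-22'),('tom','2020-01-24'),
 ('jerry','2020-01-21'),('jerry','2020-01-23'),('jerry','2020-01-24'),
 ('jerry','2020-01-25'),('jerry','2020-01-26'),('nibbles','2020-01-26');
""")
rows = con.execute("""
SELECT name, joindate,
       COUNT(*) OVER (
         PARTITION BY name
         -- ORDER BY an integer day number so the RANGE offsets work
         ORDER BY CAST(julianday(joindate) - julianday('2020-01-01') AS INTEGER)
         RANGE BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS cnt
FROM event
""").fetchall()
# A cnt of 3 marks the last day of a 3-day streak.
streak_names = sorted({n for n, d, c in rows if c >= 3})
print(streak_names)  # → ['jerry', 'tom']
```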
Either solution would be just as static as the previous non-tabibitosan window-function solution, but far more tedious to write:
- Self join 3 tables and write the conditions to define consecutiveness
- Generate 4 additional columns, (lag2,lag1,lead1,lead2) and filter based on them
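The second alternative can be sketched as follows (again in sqlite3 for illustration; the column names lag2/lag1/lead1/lead2 are my own). A row belongs to a 3-day streak if it is the end, middle or start of one; since dates within a name are distinct and ordered, d - lag2 = 2 already forces the row in between to be the missing middle day:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event(name TEXT, joindate TEXT);
INSERT INTO event(name, joindate) VALUES
 ('tom','2020-01-20'),('tom','2020-01-21'),('tom','2020-01-22'),('tom','2020-01-24'),
 ('jerry','2020-01-21'),('jerry','2020-01-23'),('jerry','2020-01-24'),
 ('jerry','2020-01-25'),('jerry','2020-01-26'),('nibbles','2020-01-26');
""")
rows = con.execute("""
WITH t AS (
  SELECT name,
         julianday(joindate) AS d,
         LAG(julianday(joindate), 2)  OVER w AS lag2,
         LAG(julianday(joindate), 1)  OVER w AS lag1,
         LEAD(julianday(joindate), 1) OVER w AS lead1,
         LEAD(julianday(joindate), 2) OVER w AS lead2
  FROM event
  WINDOW w AS (PARTITION BY name ORDER BY joindate)
)
SELECT DISTINCT name
FROM t
WHERE d - lag2 = 2                      -- a 3-day streak ends here
   OR (d - lag1 = 1 AND lead1 - d = 1)  -- ... is centred here
   OR lead2 - d = 2                     -- ... starts here
""").fetchall()
print(sorted(n for n, in rows))  # → ['jerry', 'tom']
```

Comparisons against the NULL lag/lead values at partition edges evaluate to NULL and are filtered out, which is exactly what we want here.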
I have only shown the tip of the iceberg, below are some helpful links:
- Top SQL Interview Test Questions & Techniques (Part 1) | by Ian Ho | Towards Data Science (cross-join and lag,lead patterns, valuable for practicing join conditions)
- About Oracle: Tabibitosan (swapping which is faster/slower runner in tabibitosan, changing granularities of timeseries, forward/backward filling nulls)
- PL/SQL 101 : Grouping Sequence Ranges (Tabibitosan Method) — oracle-tech (more tabibitosan, handling duplicates with
- https://social.technet.microsoft.com/wiki/contents/articles/51523.transactsql-simulating-ignore-nulls-functionality-on-first-value-last-value-functions.aspx (general sql data manipulation skills)
- 10 SQL Tricks That You Didn’t Think Were Possible – Java, SQL and jOOQ. (more sql manipulation skills)
- The RANGE Clause in SQL Window Functions: 5 Practical Examples | LearnSQL.com (Practical uses of RANGE, linked cheatsheet very good)
- Window Function ROWS and RANGE on Redshift and BigQuery - Sonra (More RANGE)
- DB2 SQL Finding rows with a gap of 1 minute with other rows - Stack Overflow (new challenge to test yourself, comparing not just across rows but also columns)
CREATE TABLE event(name varchar, joindate date);

INSERT INTO event(name, joindate) VALUES
  ('tom','2020-01-20'),
  ('tom','2020-01-21'),
  ('tom','2020-01-22'),
  ('tom','2020-01-24'),
  ('jerry','2020-01-21'),
  ('jerry','2020-01-23'),
  ('jerry','2020-01-24'),
  ('jerry','2020-01-25'),
  ('jerry','2020-01-26'),
  ('nibbles','2020-01-26');
WITH tbb AS (
  SELECT name,
         joindate,
         row_number() OVER (PARTITION BY name ORDER BY joindate),
         joindate - row_number() OVER (PARTITION BY name ORDER BY joindate)::int grp
  FROM event
)
SELECT DISTINCT name
FROM tbb
GROUP BY name, grp
HAVING COUNT(*) >= 3;
WITH tbb AS (
  SELECT name,
         joindate,
         row_number() OVER (PARTITION BY name ORDER BY joindate),
         joindate - row_number() OVER (PARTITION BY name ORDER BY joindate)::int grp
  FROM event
),
count_stats AS (
  SELECT *,
         COUNT(*) OVER (PARTITION BY name, grp ORDER BY joindate
                        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) cnt
  FROM tbb
),
longest_streak AS (
  SELECT name, MAX(cnt) max_cnt
  FROM count_stats
  GROUP BY name
)
SELECT *
FROM count_stats cs
JOIN longest_streak ls
  ON cs.name = ls.name AND cs.cnt = ls.max_cnt;
SELECT name,
       joindate,
       joindate - '2020-01-01' for_making_range_work,
       COUNT(*) OVER (PARTITION BY name
                      ORDER BY joindate - '2020-01-01'
                      RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM event;