31 Years of Python | 48 Hour Sale Extension!!!
days
hours
minutes
seconds

Tabibitosan for Consecutive SQL problems

A google search of “tabibitosan sql” surprisingly returned only links from articles years ago. Even though since Oracle 12c, match_recognize is available to replace tabibitosan method, I felt that it’s still valuable to raise awareness of this technique both to solve problems in non-Oracle SQL flavours, and more importantly to develop general problem solving skills.

Who attended the event for at least 3 consecutive days?

image

The data shows that we have 3 people tom, jerry, nibbles who attended an event some time around late January 2020.

The goal is to be able to identify that tom (20–22 Jan) and jerry (23–26 Jan) attended the event for at least 3 consecutive days while nibbles did not.

The Technique

The technique aims to create a new grouping column (grp) that we can use to answer the problem. Looking at grp, we see that it contains 4 duplicates of 2020–01–21 marking the 4 consecutive days jerry attended. We also see 3 duplicates of 2020–01–19 marking the 3 consecutive days tom attended. For the non-consecutive visit on 2020–01–21 by jerry, it has it’s own grp value. Similarly, for nibbles’ only visit and tom’s non-consecutive visit on 2020–01–24.

The exact values in the dates in the derived grp column are not important at all. The goal of the technique is to create a column of values (any type) that contains the same value for what we care about (consecutive event visits for this problem) and different values for the boundaries we want to demarcate (non-consecutive event visits).

What sorcery is this?

https://images.unsplash.com/photo-1551029506-0807df4e2031?ixlib=rb-1.2.1&q=85&fm=jpg&crop=entropy&cs=srgb

The entire technique boils down to this line joindate - row_number() over(partition by name order by joindate)::int grp in SELECT.

There are 2 operands to a minus operation. The left operand is a value increasing by 1 unit some times and more than 1 unit other times. The right operand is an auto-generated row number that (within each of it’s own partition), increases by 1 unit consistently. Overall, the left operand has an opportunity to travel faster than the right operand.

Going down the rows in the grp column (result of the difference), when the days are consecutive, both operands increase at the same speed of 1 unit, so their difference remains the same. However when the days are not consecutive, the left operand (temporarily) travels faster than the right operand, causing their difference change to a new increased value.

Again for emphasis, what this value is, is not important at all. The query above did ::int integer casting of the generated row_number() solely to handle operator does not exist: date - bigint so the differencing is possible. You may need to do other conversions as you see fit to make this technique work.

A caveat is that when this technique is used with partitioning, the grp column not necessarily increases monotonically, and that the same grp value can appear in different partitions. (2020–01–20 appears in both 1st and last row for jerry and tom respectively). If we wanted to do downstream analysis, we must be careful to GROUP BY with not just grp column but also the partitioning column usually (name in this case).

image

We can simply wrap the previous query in a CTE and query for groups HAVING more than or equals to 3 rows. SELECT DISTINCT name is important to deduplicate names occuring due to same person having multiple consecutive streaks satisfying the HAVING. Another view to this is that the final GROUP BY works on (name,grp) so it is possible to have duplicates if looking at name column only.

Sharp readers would realize by now that this solution is not only limited to solving questions of 3, and not only works for the >= operator.

Question: List the rows of the longest consecutive visit streak for each person

https://images.unsplash.com/photo-1517232115160-ff93364542dd?ixlib=rb-1.2.1&q=85&fm=jpg&crop=entropy&cs=srgb

No pain no gain, previous question was too easy so let’s extend it.

Generating some statistics

To be able to filter the rows with the longest streak, it’s helpful to first calculate some statistics on which rows are part of the longest streak.

The key technique of the above query is to expand the frame of the window function from default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW to being unbounded in both directions so we can calculate statistics on the whole group. The partition here is multi-level, defined by both the name and grp columns.

For people unfamiliar with window functions, this step can also be achieved by the usual GROUP BY name, grp, then JOIN the cnt statistics back into this table. I’m showing unbounded window functions just to expose readers to alternatives.

Filtering the rows

We can get to the answer with 2 more steps, by calculating the longest_streak per name to generate max_cnt , then using a join to filter the data from the previous step.

This solution finds all the rows contributing to the max streak per person but does not identify which of the streaks it’s in. The question can be made even harder by looking for the rows from the first, last or n-th max streak, assuming the same max streak occurred multiple times.

What else you can solve with this superpower

  1. First and last day of each group
  2. How many consecutive runs are there

I don’t know Tabibitosan, how do I solve the original question?

https://images.unsplash.com/photo-1566004100631-35d015d6a491?ixlib=rb-1.2.1&q=85&fm=jpg&crop=entropy&cs=srgb

Question: Who attended the event for at least 3 consecutive days?

Here we use the RANGE keyword to look back 2 days from current row up to current row inclusive. If we count 3, that means there is a streak of 3. As you can see, this solution has to hardcode the value of 2 in the query to answer for 3 consecutive days, so not as flexible as tabibitosan.

ROWS vs RANGE

If you tried to use ROWS for the windowing in the previous example, it will give False Positives. ROWS just counts physical rows with no care about the values in the rows. RANGE looks into the cell values and looks backwards and forwards, extending the window frame in both directions while the values in column in the ORDER BY of OVER are still within range relative to the current row’s value.

Minutiae

The joindate-'2020-01-01' is done only to handle RANGE with offset PRECEDING/FOLLOWING is not supported for column type date and offset type integer HINT: Cast the offset value to an appropriate type . I was pleasantly surprised Postgres actually tells you how to hack datatypes to make it work. The choice of 2020-01-01 is completely arbitrary. Any date is fine, goal being to convert date to integer so the defined offsets (2 values backwards and 0 forward in this case) can work on an integer window ordering column. Other databases may be able to handle date offsets directly.

I don’t know Tabibitosan or Window Functions, how do I solve the original question?

The solution would be just as static as the previous solution using no Tabibitosan but with window function, but way more tedious to write.

  1. Self join 3 tables and write the conditions to define consecutiveness
  2. Generate 4 additional columns, (lag2,lag1,lead1,lead2) and filter based on them

Relevant Links

I have only shown the tip of the iceberg, below are some helpful links:

  1. Top SQL Interview Test Questions & Techniques (Part 1) | by Ian Ho | Towards Data Science (cross-join and lag,lead patterns, valuable for practicing join conditions)
  2. About Oracle: Tabibitosan (swapping which is faster/slower runner in tabibitosan, changing granularities of timeseries, forward/backward filling nulls)
  3. PL/SQL 101 : Grouping Sequence Ranges (Tabibitosan Method) — oracle-tech (more tabibitosan, handling duplicates with dense_rank() instead of row_number())
  4. https://social.technet.microsoft.com/wiki/contents/articles/51523.transactsql-simulating-ignore-nulls-functionality-on-first-value-last-value-functions.aspx (general sql data manipulation skills)
  5. 10 SQL Tricks That You Didn’t Think Were Possible – Java, SQL and jOOQ. (more sql manipulation skills)
  6. The RANGE Clause in SQL Window Functions: 5 Practical Examples | LearnSQL.com (Practical uses of RANGE, linked cheatsheet very good)
  7. Window Function ROWS and RANGE on Redshift and BigQuery - Sonra (More RANGE)
  8. DB2 SQL Finding rows with a gap of 1 minute with other rows - Stack Overflow (new challenge to test yourself, comparing not just across rows but also columns)

Code for this article:

Examples created in postgres 14.1 on www.sqliteonline.com. sqlfiddle.com would be easier to share sql snippets but their sql engines are too old to handle variable window frames.

Table set up

CREATE TABLE event(name varchar, joindate date);
INSERT into event(name, joindate) values 
('tom','2020-01-20'),
('tom','2020-01-21'),
('tom','2020-01-22'),
('tom','2020-01-24'),
('jerry','2020-01-21'),
('jerry','2020-01-23'),
('jerry','2020-01-24'),
('jerry','2020-01-25'),
('jerry','2020-01-26'),
('nibbles','2020-01-26')

Who attended the event for at least 3 consecutive days?

WITH tbb as ( 
  select 
      name,
      joindate,
      row_number() over(partition by name order by joindate),
      joindate - row_number() over(partition by name order by joindate)::int grp
  from event
)
SELECT 
 DISTINCT name
FROM tbb
GROUP BY name,grp
HAVING COUNT(*) >= 3

List the rows of the longest consecutive visit streak for each person

WITH tbb as (
  select 
      name,
      joindate,
      row_number() over(partition by name order by joindate),
      joindate - row_number() over(partition by name order by joindate)::int grp
  from event
),
count_stats AS ( 
  select 
    *,
    COUNT(*) over(partition by name,grp order by joindate rows BETWEEN unbounded preceding and unbounded following) cnt
  from tbb 
),
longest_streak AS (
  SELECT name, max(cnt) max_cnt
  FROM count_stats
  GROUP BY name
)
SELECT * FRom count_stats cs join longest_streak ls on cs.name = ls.name and cs.cnt = ls.max_cnt

No tabibitosan, hardcode window size

SELECT 
name,
    joindate,
    joindate-'2020-01-01' for_making_range_work,
    COUNT(*) over(partition by name order by joindate-'2020-01-01' range between 2 preceding and current row)
FROM event
1 Like