Skip to main content

Gaps and Islands in SQL

Problem Statement

This problem is a variation of the famous Gaps and Islands problem, and the idea behind solving it will be really helpful in your SQL interviews.

Given a table with event logs, find the top five users with the longest continuous streak of visiting the platform.

Note: A continuous streak counts if the user visits the platform at least once per day on consecutive days.

Example:

Input:

events table

Column Type
user_id INTEGER
created_at DATETIME
url VARCHAR

Output:

Column Type
user_id INTEGER
streak_length INTEGER

Solution:

WITH grouped AS (
    SELECT 
        DATE(DATE_ADD(created_at, INTERVAL - ROW_NUMBER() 
        OVER (PARTITION BY user_id ORDER BY created_at) DAY)) 
        AS grp,
        user_id, 
        created_at 
    FROM (
        SELECT * 
        FROM events 
        GROUP BY created_at, user_id) dates
)
SELECT 
    user_id, streak_length 
FROM (
    SELECT user_id, COUNT(*) as streak_length
    FROM grouped
    GROUP BY user_id, grp
    ORDER BY COUNT(*) desc) c
GROUP BY user_id
LIMIT 5

Explanation:

We need to find the top five users with the longest continuous streak of visiting the platform. Before anything else, let’s make sure we are selecting only distinct dates from the created_at column for each user so that the streaks aren’t incorrectly interrupted by duplicate dates.

SELECT *
        FROM events
        GROUP BY created_at, user_id) dates

After that, the first step is to find a method of calculating the “streaks” of each user from the created_at column. This is a “gaps and islands” problem, in which the data is split into “islands” of consecutive values, separated by “gaps” (i.e. 1-2-3, 5-6, 9-10).

A clever trick which will help us group consecutive values is taking advantage of the fact that subtracting two equally incrementing sequences will produce the same difference for each pair of values.

For example, [1, 2, 3, 5, 6] - [0, 1, 2, 3, 4] = [1, 1, 1, 2, 2].

By creating a new column containing the result of such a subtraction, we can then group and count the streaks for each user. For our incremental sequence, we can use the row number of each event, obtainable with either window functions: ROW_NUMBER() or DENSE_RANK().

The difference between these two functions lies in how they deal with duplicate values, but since we need to remove duplicate values either way to accurately count the streaks, it doesn’t make a difference.

SELECT 
        DATE(DATE_ADD(created_at, INTERVAL -ROW_NUMBER() 
        OVER (PARTITION BY user_id ORDER BY created_at) DAY)) 
        AS grp,
        user_id, 
        created_at 
    FROM (
        SELECT * 
        FROM events 
        GROUP BY created_at, user_id) dates

With the events categorized into consecutive streaks, it is simply a matter of grouping by the streaks, counting each group, selecting the highest streak for each user, and ranking the top 5 users.

WITH grouped AS (
    SELECT 
        DATE(DATE_ADD(created_at, INTERVAL -ROW_NUMBER() 
        OVER (PARTITION BY user_id ORDER BY created_at) DAY)) 
        AS grp,
        user_id, 
        created_at 
    FROM (
        SELECT * 
        FROM events 
        GROUP BY created_at, user_id) dates
)
SELECT 
    user_id, streak_length 
FROM (
    SELECT user_id, COUNT(*) as streak_length
    FROM grouped
    GROUP BY user_id, grp
    ORDER BY COUNT(*) desc) c
GROUP BY user_id
LIMIT 5

Note that the second subquery was necessary in order to get the streak_length (count) as a column in our final selection, as it involves multiple groupings.


Post Tags: