SQL: LAG() - Finding maximum concecutive streak
The LAG() function retrieves a value from a previous row, while LEAD() retrieves a value from a subsequent row, both based on a specified ordering within a result set.
The LAG()
function in SQL provides access to a value from a previous row in the same result set, based on a specified ordering. It is commonly used to compare the current row with the preceding one, allowing for calculations like differences or detecting changes in sequences.
Problem Statement:
Determine each user's longest consecutive streak of days watching TV shows on Netflix in January 2023. The result should include the user ID and the length of the longest streak in days, with the output ordered by the streak length.
INPUT
user_streaming_activity
user_id | watch_date | hours_streamed 1 | 2023-01-01 | 2.5 1 | 2023-01-02 | 1.7 1 | 2023-01-03 | 1.9 1 | 2023-01-05 | 1.5 2 | 2023-01-01 | 2.0 2 | 2023-01-02 | 2.1 2 | 2023-01-04 | 2.6 2 | 2023-01-05 | 3.0 3 | 2023-01-01 | 1.5 3 | 2023-01-07 | 2.4 3 | 2023-01-08 | 2.1
SOLUTION
WITH streak_days AS (
SELECT user_id,
watch_date,
number_hours_streamed,
CASE
WHEN watch_date - LAG(watch_date) OVER(PARTITION BY user_id ORDER BY watch_date) = 1
THEN 0
ELSE 1
END AS streak
FROM user_streaming_activity
WHERE watch_date >= '2023-01-01'
),
Streak_per_user AS (
SELECT user_id,
watch_date,
streak,
SUM(streak) OVER(PARTITION BY user_id ORDER BY watch_date) AS streak_total
FROM streak_days
),
Streak_length AS (
SELECT user_id,
streak_total,
COUNT(streak_total) AS streak_length
FROM Streak_per_user
GROUP BY user_id, streak_total
)
SELECT user_id,
MAX(streak_length) AS max_streak_length
FROM Streak_length
GROUP BY user_id
ORDER BY max_streak_length DESC;
EXPLANATION:
WITH streak_days AS
:SELECT user_id, watch_date, number_hours_streamed, CASE WHEN watch_date - LAG(watch_date) OVER (PARTITION BY user_id ORDER BY watch_date) = 1 THEN 0 ELSE 1 END AS streak FROM user_streaming_activity WHERE watch_date >= '2023-01-01'
Purpose: This block calculates whether a new streak begins for a user.
Explanation:
It partitions the data by
user_id
so the calculation is done separately for each user.The
LAG()
function compares the currentwatch_date
with the previous date in the sequence.If the difference is 1 day, it sets
streak = 0
(indicating the streak continues), otherwisestreak = 1
(indicating a new streak).
Output: A new column
streak
is added, marking where new streaks begin.
WITH Streak_per_user AS
:SELECT user_id, watch_date, streak, SUM(streak) OVER(PARTITION BY user_id ORDER BY watch_date) AS streak_total FROM streak_days
Purpose: This block calculates a running total (
streak_total
) of streaks for each user.Explanation:
The
SUM(streak)
window function cumulatively adds thestreak
values for each user.Each time a new streak starts (
streak = 1
), thestreak_total
increases, marking the beginning of a new streak group.
Output: A new column
streak_total
is added, which tracks the streak number for each user.
WITH Streak_length AS
:SELECT user_id, streak_total, COUNT(streak_total) AS streak_length FROM Streak_per_user GROUP BY user_id, streak_total
Purpose: This block calculates the length of each streak for every user.
Explanation:
It groups the data by
user_id
andstreak_total
, so each unique streak gets grouped together.The
COUNT(streak_total)
function counts how many consecutive days belong to each streak, effectively calculating the length of the streak.
Output: A table that shows the length of each streak for each user.
Final
SELECT
:SELECT user_id, MAX(streak_length) AS max_streak_length FROM Streak_length GROUP BY user_id ORDER BY max_streak_length DESC;
Purpose: This block selects the maximum streak length for each user.
Explanation:
The
MAX(streak_length)
function finds the longest streak for each user.The results are ordered by
max_streak_length
in descending order, so users with the longest streaks appear first.
Output: The final output table shows the user ID and their maximum streak length.
OUTPUT:
user_id | max_streak_length 1 | 3 2 | 2 3 | 2