avatarThang Vu

Summary

This tutorial provides a way to query a database to generate a cohort report using SQL in PostgreSQL.

Abstract

The tutorial provides a step-by-step guide on how to query a database to generate a cohort report using SQL in PostgreSQL. The pre-conditions include having a histories table and wanting to query daily retention from 2021-05-22 to 2021-05-26. The tutorial provides SQL queries for selecting all activities, getting user's first visit time, generating all user activities with period number, calculating how many users in each group, and calculating retention table. The final calculation is to construct the final cohort table. The tutorial also provides references and a note on user 1 being calculated as a new user in the period.

Opinions

  • The tutorial provides a comprehensive guide on how to query a database to generate a cohort report using SQL in PostgreSQL.
  • The tutorial provides SQL queries that are easy to follow and implement.
  • The tutorial provides a note on user 1 being calculated as a new user in the period, which is a useful reminder for users to adjust the query accordingly.
  • The tutorial provides references for further reading and learning.

How to write SQL to calculate user cohort retention

Photo by Stephen Dawson on Unsplash

User retention is a meaningful metric to measure how long users come back to use our product. Measure this metric can help understand how products fit and can help you analyze behaviors of each user group by location, gender, or age.

This tutorial provides a way to query the database to generate a cohort report. In this tutorial, I’m using PostgreSQL. It should be similar to other SQL database engines.

Pre-conditions

Assume that we have histories table.

We want to query DAILY retention from 2021–05–22T00:00:00 to 2021–05–26T23:59:59.999Z

+-----------+-------------------------+
| person_id |          start          |
+-----------+-------------------------+
|         1 | 2021–04–30 00:43:17.927 |
|         1 | 2021–05–22 00:43:17.927 |
|         1 | 2021–05–23 00:43:17.927 |
|         1 | 2021–05–25 00:43:17.927 |
|         2 | 2021–05–23 00:43:17.927 |
|         2 | 2021–05–24 00:43:17.927 |
|         2 | 2021–05–26 00:43:17.927 |
|         2 | 2021–05–27 00:43:17.927 |
|         3 | 2021–05–19 00:43:17.927 |
|         3 | 2021–05–23 00:43:17.927 |
+-----------+-------------------------+

Build SQL query

1. Select all activities from 2021/05/22 to 2021/05/26

Result is the activities :

+-----------+-------------------------+
| person_id |          start          |
+-----------+-------------------------+
|         1 | 2021–05–22 00:43:17.927 |
|         1 | 2021–05–23 00:43:17.927 |
|         1 | 2021–05–25 00:43:17.927 |
|         2 | 2021–05–23 00:43:17.927 |
|         2 | 2021–05–24 00:43:17.927 |
|         2 | 2021–05–26 00:43:17.927 |
|         3 | 2021–05–23 00:43:17.927 |
+-----------+-------------------------+

2. Get user’s first visit time

Result is new_users

+-----------+------------+
| person_id | start_time |
+-----------+------------+
|         1 | 2021–05–22 |
|         2 | 2021–05–23 |
|         3 | 2021–05–23 |
+-----------+------------+

3. Generate all user_activities with period_number

Results: user_activities

+-----------+---------------+
| person_id | period_number |
+-----------+---------------+
|         1 |             0 |
|         1 |             1 |
|         1 |             3 |
|         2 |             0 |
|         2 |             1 |
|         2 |             3 |
|         3 |             0 |
+-----------+---------------+

period_number indicates if a user is active after x days after their first visit

4. Calculate how many users in each group

Results: cohort_size

+------------+-----------+
| start_time | num_users |
+------------+-----------+
| 2021–05–22 |         1 |
| 2021–05–23 |         2 |
+------------+-----------+

5. Calculate retention table

Results: retention_table

+------------+---------------+-----------+
| start_time | period_number | num_users |
+------------+---------------+-----------+
| 2021–05–22 |             0 |         1 |
| 2021–05–22 |             1 |         1 |
| 2021–05–22 |             3 |         1 |
| 2021–05–23 |             0 |         2 |
| 2021–05–23 |             1 |         1 |
| 2021–05–23 |             3 |         1 |
+------------+---------------+-----------+

6. Final calculation

+------------+-------------+---------------+-----------------+
| start_time | total_users | period_number | total_repeators |
+------------+-------------+---------------+-----------------+
| 2021–05–22 |           1 |             0 |               1 |
| 2021–05–22 |           1 |             1 |               1 |
| 2021–05–22 |           1 |             3 |               1 |
| 2021–05–23 |           2 |             0 |               2 |
| 2021–05–23 |           2 |             1 |               1 |
| 2021–05–23 |           2 |             3 |               1 |
+------------+-------------+---------------+-----------------+

From this table, we can construct the final cohort table

+------------+-------+---------+------+------+------+------+------+
|    Date    | Total | NewUser | Day0 | Day1 | Day2 | Day3 | Day4 |
+------------+-------+---------+------+------+------+------+------+
| 2021–05–22 |     1 |       1 |    1 |    0 |    1 |    0 |      |
| 2021–05–23 |     2 |       2 |    1 |    0 |    1 |      |      |
| 2021–05–24 |     0 |       0 |    0 |    0 |      |      |      |
| 2021–05–25 |     0 |       0 |    0 |      |      |      |      |
| 2021–05–26 |     0 |       0 |      |      |      |      |      |
+------------+-------+---------+------+------+------+------+------+

Final SQL Query

Note: User 1 joined at 2021–04–30, but he is calculated as a new user in the period 2021–05–22T00:00:00 to 2021–05–26T23:59:59.999Z

If we don’t want to count him as a new user in the above period, we need to change the query a bit:

Or

Reference - https://blog.statsbot.co/customer-retention-analysis-93af9daee46b - https://www.periscopedata.com/blog/how-to-calculate-cohort-retention-in-sql - https://www.holistics.io/blog/calculate-cohort-retention-analysis-with-sql/ - https://mherman.org/blog/2013/07/19/cohort-analysis-data-sourcing-with-sql/

Cohort
Retention
Sql Report
Sql Query
Postgresql
Recommended from ReadMedium