Recursive CTE in Snowflake ❄️

Heard about CTE in SQL but didn’t hear about recursive CTE. The same was with me till last night when I was trying to solve an SQL problem. Don't worry this article is about the recursive CTE in Snowflake.
Let's understand what is CTE in SQL. CTE is an acronym for Common Table Expression in SQL, which provides a way to create temporary result sets within SQL queries.
Understanding Recursive CTEs
A recursive CTE is a special type of CTE used to work with hierarchical or tree-structured data within a relational database. Hierarchical data is data that is organized in a parent-child relationship, such as an organizational chart, file system structure, or product category hierarchy.
The recursive CTE allows you to perform operations on hierarchical data by repeatedly querying and joining a table with itself. This recursion continues until a specified condition is met, such as reaching the root node or a specific depth in the hierarchy.
The Snowflake Database
Before we dive into a real-world use case of Snowflake Recursive CTE, let’s briefly introduce Snowflake, a cloud-based data warehousing platform. Snowflake is known for its scalability, flexibility, and performance, making it a popular choice for data professionals. It supports ANSI SQL and offers various advanced features, including support for recursive CTEs.
Use Case: Organizational Hierarchy
Imagine you are working with a large organization’s employee data stored in a Snowflake database. The employee data is organized hierarchically, with each employee having a manager who, in turn, reports to another manager. Your task is to retrieve the reporting chain for a specific employee, starting from the employee and going up to the CEO of the company.
This is a classic use case for a Snowflake Recursive CTE. Here’s how you can use it to solve this problem:
First, let’s prepare an employee table and a few records for our use case.
CREATE DATABASE IF NOT EXISTS DS_EMPLOYEE;
CREATE OR REPLACE TABLE DS_EMPLOYEE.PUBLIC.EMPLOYEE
(
EMPLOYEE_NAME VARCHAR,
MANAGER_NAME VARCHAR
);
INSERT INTO DS_EMPLOYEE.PUBLIC.EMPLOYEE
VALUES ('AP','TP'),
('KP','TP'),
('TP','GP'),
('KI','GP'),
('GP','LK'),
('LK','OI'),
('KJ','OI');Now it’s time to find the employee hierarchy for the employee ‘AP’ from the data generated in the above example.
-- DS_EMPLOYEE.PUBLIC.EMPLOYEE
WITH RECURSIVE EmployeeHierarchy AS (
SELECT
EMPLOYEE_NAME,
MANAGER_NAME
FROM
DS_EMPLOYEE.PUBLIC.EMPLOYEE
WHERE
EMPLOYEE_NAME = 'AP'
UNION ALL
SELECT
e.EMPLOYEE_NAME,
e.MANAGER_NAME
FROM
DS_EMPLOYEE.PUBLIC.EMPLOYEE AS e
JOIN
EmployeeHierarchy AS eh
ON
e.EMPLOYEE_NAME = eh.MANAGER_NAME
)
SELECT
EMPLOYEE_NAME,
MANAGER_NAME
FROM
EmployeeHierarchy;In this query:
- We start with the base case by selecting the employee whose hierarchy we want to retrieve (in this case, ‘AP’).
- We use the UNION ALL clause to combine the base case with the recursive case. In the recursive case, we join the “employees” table with the “EmployeeHierarchy” CTE using the manager’s ID to find the next level of employees in the hierarchy.
- The recursion continues until there are no more managers to traverse, effectively building the entire reporting chain.
- Finally, we select the employee_id and employee_name from the “EmployeeHierarchy” CTE, which contains the entire reporting chain.
- By using a Snowflake Recursive CTE, you can efficiently retrieve the organizational hierarchy for any employee in the organization, regardless of how deep or complex the hierarchy is.
My Use Case
I have been tasked with establishing column lineage within Snowflake, which involves identifying the source tables for populating data in a specific column. While I managed to gather this information from Snowflake’s usage tables, creating a hierarchy for a given column based on the available data proved to be a challenge. In order to tackle this issue, I employed a recursive Common Table Expression (CTE) to resolve the problem.
Here is code for the creation of column-level lineage in the Snowflake by using recursive CTE.
CREATE OR REPLACE TEMPORARY TABLE DATA_LINAGE
AS
SELECT
SOURCE,
TARGET
FROM (
SELECT
QH.QUERY_TEXT,
AH.OBJECTS_MODIFIED,
TRIM(
COALESCE(SRC.VALUE:objectName::STRING, '')
|| '.' || COALESCE(SRC.VALUE:columnName::STRING, ''), '.'
) AS SOURCE,
TRIM(
COALESCE(OM.VALUE:objectName::STRING, '')
|| '.' || COALESCE(COL.VALUE:columnName::STRING, ''), '.'
) AS TARGET
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY AS AH
LEFT JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY AS QH
ON AH.QUERY_ID = QH.QUERY_ID,
LATERAL FLATTEN(input => OBJECTS_MODIFIED) AS OM,
LATERAL FLATTEN(input => OM.VALUE:"columns", outer => TRUE) AS COL,
LATERAL FLATTEN(input => COL.VALUE:directSources, outer => TRUE) AS SRC
WHERE AH.QUERY_START_TIME > CURRENT_DATE - 2
ORDER BY AH.QUERY_START_TIME
) AS MAIN;
WITH RECURSIVE DLHIERARCHY AS (
SELECT
SOURCE,
TARGET
FROM
DATA_LINAGE
WHERE
SOURCE = 'AYUSH_STG.PUBLIC.TEST1.A'
UNION ALL
SELECT
D.SOURCE,
D.TARGET
FROM
DATA_LINAGE AS D
INNER JOIN
DLHIERARCHY AS DH
ON
D.SOURCE = DH.TARGET
)
SELECT
SOURCE,
TARGET
FROM
DLHIERARCHY;Summary
Snowflake Recursive CTEs are a powerful tool for working with hierarchical data in Snowflake databases. They allow you to solve complex organizational hierarchy problems, analyze file systems, model product category structures, and more. Understanding and mastering recursive CTEs can greatly enhance your ability to work with hierarchical data efficiently and effectively. As you explore the capabilities of Snowflake and SQL, keep in mind the valuable role that recursive CTEs can play in your data analysis and reporting tasks.
If you think that this article is informative and helped you with what you are looking then give a clap and follow my medium account( datageeks.medium.com ) and feel free to write in the comments if you have any doubts about this topic.
By signing up as a member (https://datageeks.medium.com/membership), you can read every story and help the authors on Medium.
