avatarDataGeeks

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

3798

Abstract

n class="hljs-keyword">AS</span> ( <span class="hljs-keyword">SELECT</span> EMPLOYEE_NAME, MANAGER_NAME <span class="hljs-keyword">FROM</span> DS_EMPLOYEE.<span class="hljs-keyword">PUBLIC</span>.EMPLOYEE <span class="hljs-keyword">WHERE</span> EMPLOYEE_NAME = <span class="hljs-comment">'AP'</span> UNION ALL <span class="hljs-keyword">SELECT</span> e.EMPLOYEE_NAME, e.MANAGER_NAME <span class="hljs-keyword">FROM</span> DS_EMPLOYEE.<span class="hljs-keyword">PUBLIC</span>.EMPLOYEE <span class="hljs-keyword">AS</span> e <span class="hljs-keyword">JOIN</span> EmployeeHierarchy <span class="hljs-keyword">AS</span> eh <span class="hljs-keyword">ON</span> e.EMPLOYEE_NAME = eh.MANAGER_NAME ) <span class="hljs-keyword">SELECT</span> EMPLOYEE_NAME, MANAGER_NAME <span class="hljs-keyword">FROM</span> EmployeeHierarchy;</pre></div><p id="bbe7">In this query:</p><ol><li>We start with the base case by selecting the employee whose hierarchy we want to retrieve (in this case, ‘AP’).</li><li>We use the UNION ALL clause to combine the base case with the recursive case. In the recursive case, we join the “employees” table with the “EmployeeHierarchy” CTE using the manager’s ID to find the next level of employees in the hierarchy.</li><li>The recursion continues until there are no more managers to traverse, effectively building the entire reporting chain.</li><li>Finally, we select the employee_id and employee_name from the “EmployeeHierarchy” CTE, which contains the entire reporting chain.</li><li>By using a Snowflake Recursive CTE, you can efficiently retrieve the organizational hierarchy for any employee in the organization, regardless of how deep or complex the hierarchy is.</li></ol><h2 id="474a">My Use Case</h2><p id="02e9">I have been tasked with establishing column lineage within Snowflake, which involves identifying the source tables for populating data in a specific column. While I managed to gather this information from Snowflake’s usage tables, creating a hierarchy for a given column based on the available data proved to be a challenge. In order to tackle this issue, I employed a recursive Common Table Expression (CTE) to resolve the problem.</p><p id="d30b">Here is code for the creation of column-level lineage in the Snowflake by using recursive CTE.</p><div id="aad2"><pre><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> REPLACE TEMPORARY <span class="hljs-keyword">TABLE</span> DATA_LINAGE <span class="hljs-keyword">AS</span> <span class="hljs-keyword">SELECT</span> SOURCE, TARGET <span class="hljs-keyword">FROM</span> ( <span class="hljs-keyword">SELECT</span> QH.QUERY_TEXT, AH.OBJECTS_MODIFIED, <span class="hljs-built_in">TRIM</span>( <span class="hljs-built_in">COALESCE</span>(SRC.VALUE:objectName::STRING, <span class="hljs-string">''</span>) <span class="hljs-operator">||</span> <span class="hljs-string">'.'</span> <span class="hljs-operator">||</span> <span class="hljs-built_in">COALESCE</span>(SRC.VALUE:columnName::STRING, <span class="hljs-string">''</span>), <span class="hljs-string">'.'</span> ) <span class="hljs-keyword">AS</span> SOURCE, <span class="hljs-built_in">TRIM</span>( <span class="hljs-built_in">COALESCE</span>(OM.VALUE:objectName::STRING, <span class="hljs-string">''</span>) <span class="hljs-operator">||</span> <span class="hljs-string">'.'</span> <span class="hljs-operator">||</span> <span class="hljs-built_in">COALESCE</span>(COL.VALUE:columnName::STRING, <span class="hljs-string">''</span>), <span class="hljs-string">'.'</span> ) <span class="hljs-keyword">AS</span> TARGET <span class="hljs-keyword">FROM</span> SNO

Options

WFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY <span class="hljs-keyword">AS</span> AH <span class="hljs-keyword">LEFT</span> <span class="hljs-keyword">JOIN</span> SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY <span class="hljs-keyword">AS</span> QH <span class="hljs-keyword">ON</span> AH.QUERY_ID <span class="hljs-operator">=</span> QH.QUERY_ID, <span class="hljs-keyword">LATERAL</span> FLATTEN(input <span class="hljs-operator">=</span><span class="hljs-operator">></span> OBJECTS_MODIFIED) <span class="hljs-keyword">AS</span> OM, <span class="hljs-keyword">LATERAL</span> FLATTEN(input <span class="hljs-operator">=</span><span class="hljs-operator">></span> OM.VALUE:"columns", <span class="hljs-keyword">outer</span> <span class="hljs-operator">=</span><span class="hljs-operator">></span> <span class="hljs-literal">TRUE</span>) <span class="hljs-keyword">AS</span> COL, <span class="hljs-keyword">LATERAL</span> FLATTEN(input <span class="hljs-operator">=</span><span class="hljs-operator">></span> COL.VALUE:directSources, <span class="hljs-keyword">outer</span> <span class="hljs-operator">=</span><span class="hljs-operator">></span> <span class="hljs-literal">TRUE</span>) <span class="hljs-keyword">AS</span> SRC <span class="hljs-keyword">WHERE</span> AH.QUERY_START_TIME <span class="hljs-operator">></span> <span class="hljs-built_in">CURRENT_DATE</span> <span class="hljs-operator">-</span> <span class="hljs-number">2</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> AH.QUERY_START_TIME ) <span class="hljs-keyword">AS</span> MAIN;

<span class="hljs-keyword">WITH</span> <span class="hljs-keyword">RECURSIVE</span> DLHIERARCHY <span class="hljs-keyword">AS</span> ( <span class="hljs-keyword">SELECT</span> SOURCE, TARGET <span class="hljs-keyword">FROM</span> DATA_LINAGE <span class="hljs-keyword">WHERE</span> SOURCE <span class="hljs-operator">=</span> <span class="hljs-string">'AYUSH_STG.PUBLIC.TEST1.A'</span> <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span> <span class="hljs-keyword">SELECT</span> D.SOURCE, D.TARGET <span class="hljs-keyword">FROM</span> DATA_LINAGE <span class="hljs-keyword">AS</span> D <span class="hljs-keyword">INNER</span> <span class="hljs-keyword">JOIN</span> DLHIERARCHY <span class="hljs-keyword">AS</span> DH <span class="hljs-keyword">ON</span> D.SOURCE <span class="hljs-operator">=</span> DH.TARGET )

<span class="hljs-keyword">SELECT</span> SOURCE, TARGET <span class="hljs-keyword">FROM</span> DLHIERARCHY;</pre></div><h2 id="7c66">Summary</h2><p id="70f3">Snowflake Recursive CTEs are a powerful tool for working with hierarchical data in Snowflake databases. They allow you to solve complex organizational hierarchy problems, analyze file systems, model product category structures, and more. Understanding and mastering recursive CTEs can greatly enhance your ability to work with hierarchical data efficiently and effectively. As you explore the capabilities of Snowflake and SQL, keep in mind the valuable role that recursive CTEs can play in your data analysis and reporting tasks.</p><p id="f237">If you think that this article is informative and helped you with what you are looking then give a clap and follow my medium account( <a href="https://datageeks.medium.com/">datageeks.medium.com</a> ) and feel free to write in the comments if you have any doubts about this topic.</p><p id="1b69">By signing up as a member (<a href="https://datageeks.medium.com/membership">https://datageeks.medium.com/membership</a>), you can read every story and help the authors on Medium.</p></article></body>

Recursive CTE in Snowflake ❄️

Heard about CTE in SQL but didn’t hear about recursive CTE. The same was with me till last night when I was trying to solve an SQL problem. Don't worry this article is about the recursive CTE in Snowflake.

Let's understand what is CTE in SQL. CTE is an acronym for Common Table Expression in SQL, which provides a way to create temporary result sets within SQL queries.

Understanding Recursive CTEs

A recursive CTE is a special type of CTE used to work with hierarchical or tree-structured data within a relational database. Hierarchical data is data that is organized in a parent-child relationship, such as an organizational chart, file system structure, or product category hierarchy.

The recursive CTE allows you to perform operations on hierarchical data by repeatedly querying and joining a table with itself. This recursion continues until a specified condition is met, such as reaching the root node or a specific depth in the hierarchy.

The Snowflake Database

Before we dive into a real-world use case of Snowflake Recursive CTE, let’s briefly introduce Snowflake, a cloud-based data warehousing platform. Snowflake is known for its scalability, flexibility, and performance, making it a popular choice for data professionals. It supports ANSI SQL and offers various advanced features, including support for recursive CTEs.

Use Case: Organizational Hierarchy

Imagine you are working with a large organization’s employee data stored in a Snowflake database. The employee data is organized hierarchically, with each employee having a manager who, in turn, reports to another manager. Your task is to retrieve the reporting chain for a specific employee, starting from the employee and going up to the CEO of the company.

This is a classic use case for a Snowflake Recursive CTE. Here’s how you can use it to solve this problem:

First, let’s prepare an employee table and a few records for our use case.

CREATE DATABASE IF NOT EXISTS DS_EMPLOYEE;
CREATE OR REPLACE TABLE DS_EMPLOYEE.PUBLIC.EMPLOYEE
(
EMPLOYEE_NAME VARCHAR,
MANAGER_NAME VARCHAR
);

INSERT INTO DS_EMPLOYEE.PUBLIC.EMPLOYEE
VALUES ('AP','TP'),
       ('KP','TP'),
       ('TP','GP'),
       ('KI','GP'),
       ('GP','LK'),
       ('LK','OI'),
       ('KJ','OI');

Now it’s time to find the employee hierarchy for the employee ‘AP’ from the data generated in the above example.

-- DS_EMPLOYEE.PUBLIC.EMPLOYEE

WITH RECURSIVE EmployeeHierarchy AS (
  SELECT
    EMPLOYEE_NAME,
    MANAGER_NAME
  FROM
    DS_EMPLOYEE.PUBLIC.EMPLOYEE
  WHERE
    EMPLOYEE_NAME = 'AP'
  UNION ALL
  SELECT
    e.EMPLOYEE_NAME,
    e.MANAGER_NAME
  FROM
    DS_EMPLOYEE.PUBLIC.EMPLOYEE AS e
  JOIN
    EmployeeHierarchy AS eh
  ON
    e.EMPLOYEE_NAME = eh.MANAGER_NAME
)
SELECT
  EMPLOYEE_NAME,
  MANAGER_NAME
FROM
  EmployeeHierarchy;

In this query:

  1. We start with the base case by selecting the employee whose hierarchy we want to retrieve (in this case, ‘AP’).
  2. We use the UNION ALL clause to combine the base case with the recursive case. In the recursive case, we join the “employees” table with the “EmployeeHierarchy” CTE using the manager’s ID to find the next level of employees in the hierarchy.
  3. The recursion continues until there are no more managers to traverse, effectively building the entire reporting chain.
  4. Finally, we select the employee_id and employee_name from the “EmployeeHierarchy” CTE, which contains the entire reporting chain.
  5. By using a Snowflake Recursive CTE, you can efficiently retrieve the organizational hierarchy for any employee in the organization, regardless of how deep or complex the hierarchy is.

My Use Case

I have been tasked with establishing column lineage within Snowflake, which involves identifying the source tables for populating data in a specific column. While I managed to gather this information from Snowflake’s usage tables, creating a hierarchy for a given column based on the available data proved to be a challenge. In order to tackle this issue, I employed a recursive Common Table Expression (CTE) to resolve the problem.

Here is code for the creation of column-level lineage in the Snowflake by using recursive CTE.

CREATE OR REPLACE TEMPORARY TABLE DATA_LINAGE
AS
SELECT
    SOURCE,
    TARGET
FROM (
    SELECT
        QH.QUERY_TEXT,
        AH.OBJECTS_MODIFIED,
        TRIM(
            COALESCE(SRC.VALUE:objectName::STRING, '')
            || '.' || COALESCE(SRC.VALUE:columnName::STRING, ''), '.'
        ) AS SOURCE,
        TRIM(
            COALESCE(OM.VALUE:objectName::STRING, '')
            || '.' || COALESCE(COL.VALUE:columnName::STRING, ''), '.'
        ) AS TARGET
    FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY AS AH
    LEFT JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY AS QH
        ON AH.QUERY_ID = QH.QUERY_ID,
        LATERAL FLATTEN(input => OBJECTS_MODIFIED) AS OM,
        LATERAL FLATTEN(input => OM.VALUE:"columns", outer => TRUE) AS COL,
        LATERAL FLATTEN(input => COL.VALUE:directSources, outer => TRUE) AS SRC
    WHERE AH.QUERY_START_TIME > CURRENT_DATE - 2
    ORDER BY AH.QUERY_START_TIME
) AS MAIN;

WITH RECURSIVE DLHIERARCHY AS (
    SELECT
        SOURCE,
        TARGET
    FROM
        DATA_LINAGE
    WHERE
        SOURCE = 'AYUSH_STG.PUBLIC.TEST1.A'
    UNION ALL
    SELECT
        D.SOURCE,
        D.TARGET
    FROM
        DATA_LINAGE AS D
    INNER JOIN
        DLHIERARCHY AS DH
        ON
            D.SOURCE = DH.TARGET
)

SELECT
    SOURCE,
    TARGET
FROM
    DLHIERARCHY;

Summary

Snowflake Recursive CTEs are a powerful tool for working with hierarchical data in Snowflake databases. They allow you to solve complex organizational hierarchy problems, analyze file systems, model product category structures, and more. Understanding and mastering recursive CTEs can greatly enhance your ability to work with hierarchical data efficiently and effectively. As you explore the capabilities of Snowflake and SQL, keep in mind the valuable role that recursive CTEs can play in your data analysis and reporting tasks.

If you think that this article is informative and helped you with what you are looking then give a clap and follow my medium account( datageeks.medium.com ) and feel free to write in the comments if you have any doubts about this topic.

By signing up as a member (https://datageeks.medium.com/membership), you can read every story and help the authors on Medium.

Snowflake
Database
Sql
Data Engineering
Data Visualization
Recommended from ReadMedium