avatarLouis G

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

3139

Abstract

e statement, telling Snowflake we are creating a function, provide the name of our function (keyword_filter), and define the inputs it will take, which in this case, is a string and an array (list)</p><div id="ff92"><pre><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> REPLACE <span class="hljs-keyword">FUNCTION</span> keyword_filter(input_text STRING, search_terms <span class="hljs-keyword">ARRAY</span>)</pre></div><p id="0986">Then we define what data type we are returning, what language we are using and which runtime version for that language</p><div id="0a0e"><pre><span class="hljs-keyword">RETURNS</span> <span class="hljs-keyword">ARRAY</span> <span class="hljs-keyword">LANGUAGE</span> python RUNTIME_VERSION <span class="hljs-operator">=</span> <span class="hljs-string">'3.8'</span></pre></div><p id="dfbb">Then we define the ‘handler’ (the logic), using ‘’ to bookend our code. Everything within the ‘’ symbols is Python code.</p><div id="5f0f"><pre>HANDLER <span class="hljs-operator">=</span> <span class="hljs-string">'keyword_filter'</span> <span class="hljs-keyword">AS</span>

def keyword_filter(input_text, search_terms):
    split_text <span class="hljs-operator">=</span> [
        word.replace(",", "").replace(".", "") <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> input_text.split(" ")
    ]

    filtered_list <span class="hljs-operator">=</span> list(
        <span class="hljs-keyword">set</span>(
            [
                word.<span class="hljs-built_in">lower</span>()
                <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> split_text
                if word.<span class="hljs-built_in">lower</span>() <span class="hljs-keyword">in</span> search_terms
            ]
        )
    )
    <span class="hljs-keyword">return</span> filtered_list
$$</pre></div><p id="8b5a">After executing this create statement Snowflake, our Python UDF is available to use in SQL scripts. For example:</p><div id="0659"><pre><span class="hljs-keyword">SELECT</span> 
    jobs_id
    , keyword_filter(description, STRTOK_TO_ARRAY(<span class="hljs-string">'python,sql'</span>, <span class="hljs-string">','</span>))
    , job_title
    , company_name
    , job_type
    , load_timestamp
<span class="hljs-keyword">FROM</span>   
    jobs_raw</pre></div><p id="a5cf">If you had a much larger list of keywords, it might make sense to store these in a table, and then query that table as an array using Snowflake’s ARRAY_AGG() to use as input to our keyword_filter() UDF:</p><p id="8542">First, we create a table to hold our search_terms:</p><div id="5dc4"><pre><span class="hljs-keyword">CREATE</span> SEQUENCE IF <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> MY_KEYWORD_SEQ;

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> IF <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> search_terms (
  keyword_id <span class="hljs-type">INTEGER</span> <span class="hljs-keyword">DEFAULT</span> MY_KEYWORD_SEQ.NEXTVAL <span clas
# Options
s="hljs-keyword">PRIMARY</span> KEY,
  keyword <span class="hljs-type">VARCHAR</span>(<span class="hljs-number">50</span>)
);</pre></div><p id="a71c">Then we define a list of keywords we want to search for and load them to our search_terms table</p><div id="9789"><pre><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> search_terms (keyword)
<span class="hljs-keyword">VALUES</span> 
    (<span class="hljs-string">'sql'</span>),
    (<span class="hljs-string">'nosql'</span>),
    (<span class="hljs-string">'mysql'</span>),
    (<span class="hljs-string">'postgres'</span>),
    (<span class="hljs-string">'mongodb'</span>),
    (<span class="hljs-string">'snowflake'</span>),
    (<span class="hljs-string">'databricks'</span>),
    (<span class="hljs-string">'airflow'</span>),
    (<span class="hljs-string">'prefect'</span>),
    (<span class="hljs-string">'java'</span>),
    (<span class="hljs-string">'dbt'</span>),
    (<span class="hljs-string">'python'</span>),
    (<span class="hljs-string">'pytorch'</span>),
    (<span class="hljs-string">'dax'</span>),
    (<span class="hljs-string">'bash'</span>),
    (<span class="hljs-string">'shell'</span>),
    (<span class="hljs-string">'aws'</span>),
    (<span class="hljs-string">'azure'</span>),
    (<span class="hljs-string">'tensorflow'</span>),</pre></div><p id="9f3e">Then we can query this table, using ARRAY_AGG() to return an array, and feed this to our UDF:</p><div id="aa54"><pre><span class="hljs-keyword">WITH</span> search_terms <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span> keyword <span class="hljs-keyword">FROM</span> my_db.my_schema.search_terms
)

<span class="hljs-keyword">SELECT</span> 
    jobs_id
    , keyword_filter(description, (<span class="hljs-keyword">SELECT</span> <span class="hljs-built_in">ARRAY_AGG</span>(keyword) <span class="hljs-keyword">from</span> search_terms))
    , job_title
    , company_name
    , job_type
    , load_timestamp
<span class="hljs-keyword">FROM</span>   
    jobs_raw</pre></div><p id="acb3">Or you could take this a bit further and flatten the list into one row per keyword:</p><div id="fc0c"><pre><span class="hljs-keyword">WITH</span> search_terms <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span> keyword <span class="hljs-keyword">FROM</span> my_db.my_schema.search_terms
)

<span class="hljs-keyword">SELECT</span> 
    jobskill_id
    , F.value <span class="hljs-keyword">AS</span> jobskill 
    , job_title
    , company_name
    , job_location
    , job_type
    , posted_via
    , load_timestamp
<span class="hljs-keyword">FROM</span>   
    my_raw_data, 
    <span class="hljs-keyword">Table</span>(
        Flatten(
            keyword_filter(
                description,
                (<span class="hljs-keyword">SELECT</span> <span class="hljs-built_in">ARRAY_AGG</span>(keyword) <span class="hljs-keyword">from</span> search_terms)
            )
         )
     ) F</pre></div><p id="48ae">And that's it. Snowflake makes it super easy to integrate Python into a SQL script!</p></article></body>

How to integrate Python into your SQL scripts in Snowflake

How to use Snowflake UDFs to combine SQL and Python

Snowflake is a cloud-based data warehousing platform designed to handle large-scale data analytics workloads. It is known for its elasticity, scalability, and performance, providing organizations with a powerful solution for storing, processing, and analyzing vast amounts of data.

One of its many benefits is the ease at which you can integrate other programming languages alongside your staple SQL. Python, for example, can be used within the Snowflake ecosystem in a number of ways, such as Snowpark or the Snowflake Connector for Python.

One particularly fast and easy method is to write a Python UDF. UDF stands for User-Defined Function. It is a feature that allows you to define custom functions within the Snowflake data warehouse platform. UDFs in Snowflake enable you to extend the built-in SQL functionality and create custom logic for data manipulation, calculations, transformations, and aggregations. Snowflake supports different types of UDFs, such as those that return scalar values or a table.

Snowflake UDFs are a convenient way to integrate Python logic into a SQL script. For example, you could search a string for keywords using Python’s list comprehension. The following example will take a list of strings as input and filter that list to return a list of just the keywords present.

CREATE OR REPLACE FUNCTION keyword_filter(input_text STRING, search_terms ARRAY)
RETURNS ARRAY
LANGUAGE python
RUNTIME_VERSION = '3.8'
HANDLER = 'keyword_filter'
AS
$$
def keyword_filter(input_text, search_terms):
    split_text = [
        word.replace(",", "").replace(".", "") for word in input_text.split(" ")
    ]

    filtered_list = list(
        set(
            [
                word.lower()
                for word in split_text
                if word.lower() in search_terms
            ]
        )
    )
    return filtered_list
$$

Let's break this down and look at each part.

First, we write a create statement, telling Snowflake we are creating a function, provide the name of our function (keyword_filter), and define the inputs it will take, which in this case, is a string and an array (list)

CREATE OR REPLACE FUNCTION keyword_filter(input_text STRING, search_terms ARRAY)

Then we define what data type we are returning, what language we are using and which runtime version for that language

RETURNS ARRAY
LANGUAGE python
RUNTIME_VERSION = '3.8'

Then we define the ‘handler’ (the logic), using ‘$$’ to bookend our code. Everything within the ‘$$’ symbols is Python code.

HANDLER = 'keyword_filter'
AS
$$
def keyword_filter(input_text, search_terms):
    split_text = [
        word.replace(",", "").replace(".", "") for word in input_text.split(" ")
    ]

    filtered_list = list(
        set(
            [
                word.lower()
                for word in split_text
                if word.lower() in search_terms
            ]
        )
    )
    return filtered_list
$$

After executing this create statement Snowflake, our Python UDF is available to use in SQL scripts. For example:

SELECT 
    jobs_id
    , keyword_filter(description, STRTOK_TO_ARRAY('python,sql', ','))
    , job_title
    , company_name
    , job_type
    , load_timestamp
FROM   
    jobs_raw

If you had a much larger list of keywords, it might make sense to store these in a table, and then query that table as an array using Snowflake’s ARRAY_AGG() to use as input to our keyword_filter() UDF:

First, we create a table to hold our search_terms:

CREATE SEQUENCE IF NOT EXISTS MY_KEYWORD_SEQ;

CREATE TABLE IF NOT EXISTS search_terms (
  keyword_id INTEGER DEFAULT MY_KEYWORD_SEQ.NEXTVAL PRIMARY KEY,
  keyword VARCHAR(50)
);

Then we define a list of keywords we want to search for and load them to our search_terms table

INSERT INTO search_terms (keyword)
VALUES 
    ('sql'),
    ('nosql'),
    ('mysql'),
    ('postgres'),
    ('mongodb'),
    ('snowflake'),
    ('databricks'),
    ('airflow'),
    ('prefect'),
    ('java'),
    ('dbt'),
    ('python'),
    ('pytorch'),
    ('dax'),
    ('bash'),
    ('shell'),
    ('aws'),
    ('azure'),
    ('tensorflow'),

Then we can query this table, using ARRAY_AGG() to return an array, and feed this to our UDF:

WITH search_terms AS (
    SELECT keyword FROM my_db.my_schema.search_terms
)

SELECT 
    jobs_id
    , keyword_filter(description, (SELECT ARRAY_AGG(keyword) from search_terms))
    , job_title
    , company_name
    , job_type
    , load_timestamp
FROM   
    jobs_raw

Or you could take this a bit further and flatten the list into one row per keyword:

WITH search_terms AS (
    SELECT keyword FROM my_db.my_schema.search_terms
)

SELECT 
    jobskill_id
    , F.value AS jobskill 
    , job_title
    , company_name
    , job_location
    , job_type
    , posted_via
    , load_timestamp
FROM   
    my_raw_data, 
    Table(
        Flatten(
            keyword_filter(
                description,
                (SELECT ARRAY_AGG(keyword) from search_terms)
            )
         )
     ) F

And that's it. Snowflake makes it super easy to integrate Python into a SQL script!

Snowflake
Snowflake Data Warehouse
Python
Snowflake Data Cloud
Recommended from ReadMedium