avatarMatteo Bennio

Summary

The QUALIFY clause in SQL, particularly in Google's BigQuery, enhances query efficiency and code readability by enabling the filtering of results based on window functions, thereby simplifying complex queries and improving the maintainability of SQL code.

Abstract

The article discusses the QUALIFY clause in SQL, emphasizing its benefits for writing cleaner and more advanced SQL queries. The author, a proponent of clean coding practices, advocates for the QUALIFY clause as a means to shorten and clarify SQL code. This clause is particularly useful for filtering data based on the results of window functions, which are analytic functions that operate on a set of rows. The QUALIFY clause is positioned after HAVING and before the WINDOW clause in the SQL command sequence, providing an additional layer of data filtration. The article illustrates the practical use of QUALIFY through a detailed example involving a dataset of pizza delivery interactions by the Teenage Mutant Ninja Turtles, demonstrating how QUALIFY simplifies the process of identifying the first pizza page each turtle visited compared to traditional methods involving GROUP BY, MIN(), and JOIN. Performance considerations reveal that using QUALIFY with window functions is more efficient than relying on GROUP BY and JOIN. The article concludes by encouraging SQL users to adopt the QUALIFY clause to streamline their query writing and improve their data analysis workflow.

Opinions

  • The author highly recommends the QUALIFY clause for its ability to prettify SQL code and ensure clarity and maintainability.
  • QUALIFY is seen as a significant step toward more advanced query writing and is valued for its versatility across various SQL dialects and tools, not just BigQuery.
  • The author believes that the QUALIFY clause has firmly established its place in their repertoire of favorite SQL expressions due to its effectiveness in simplifying complex queries.
  • The article suggests that using QUALIFY adds a layer of syntactic sugar to SQL queries, making them more enjoyable to write and easier to understand.
  • The author expresses a preference for using QUALIFY with window functions over traditional GROUP BY and JOIN operations for performance reasons, as it results in less complex query execution.

BigQuery QUALIFY Clause: Towards Cleaner SQL Queries

Illustration for the article created using Midjourney

I’ve been using the QUALIFY clause for a while now, and all I can say is that I love it for its simplicity and how it can prettify SQL code. It’s one of those SQL statements that I believe is a significant step toward more advanced query writing. As someone who always strives to optimize and clean my code, I’ve found that using QUALIFY not only makes your code shorter but also more readable and understandable.

Besides good commenting practices and the right structuring of your query, QUALIFY helps ensure that your code remains clear and maintainable. For me, the QUALIFY clause has firmly qualified itself into my repertoire of favorite SQL expressions:)

Note: QUALIFYis not only available in Google’s BigQuery but also in other SQL dialects and tools like Snowflake, Databricks, DuckDB, and more. Each platform has its own version, making QUALIFY a versatile and valuable tool for any data professional.

Understanding the QUALIFY Clause

If you are here, you are probably familiar with window, also known as analytic functions. We won’t delve deep into that topic now, but essentially, QUALIFY is used to filter your data based on the results of window functions.

In other words, it is to window functions what HAVINGis to GROUP BYand WHERE is to FROM.

SQL filtering tools: WHERE, HAVING, and QUALIFY

These are all tools to filter your data based on specific criteria. To give you a clearer picture, here are descriptions from the BigQuery documentation for these clauses:

The WHERE clause filters the results of the FROM clause. The HAVING clause filters the results produced by GROUP BY or aggregation. The QUALIFY clause filters the results of window functions.

Thus, QUALIFY adds another layer to the filtering process.

Usage of the QUALIFY Clause

To effectively use the QUALIFY clause, it’s essential to understand its place within the overall structure of a SQL query in BigQuery. Here’s the order in which the different clauses should be used:

SQL command sequence from BigQuery documentation

This syntax structure, taken directly from the BigQuery documentation, shows where the QUALIFY clause fits in the sequence. You can see that QUALIFY comes after HAVING and before the WINDOW clause, highlighting its role in filtering the results of window functions.

Here’s a simplified pseudocode example to demonstrate the general usage of the QUALIFY clause:

SELECT
  first_column,
  second_column,
  some_window_function() OVER (PARTITION BY first_column ORDER BY second_column) AS window_result
FROM
  table_name
QUALIFY
  condition_on_window_result

Practical Use Case | Example

To illustrate the concept of the QUALIFY clause, let’s use an engaging example involving our old friends, the Teenage Mutant Ninja Turtles (TMNT)🐢. Imagine we have a dataset that tracks their interactions on a pizza delivery platform, where they can view and purchase different types of pizzas 🍕.

Let’s say we want to determine the first pizza page each turtle visited on the platform. There are several ways to achieve this.

                                Source table

+--------------+--------------------------------+-------------+------------+
|    ninja     |        event_timestamp         |   action    | pizza_name |
+--------------+--------------------------------+-------------+------------+
|   Leonardo   | 2024-06-01 10:00:00.000000 UTC | item_viewed | Margherita |
| Michelangelo | 2024-06-01 10:05:00.000000 UTC | item_viewed | Pepperoni  |
|   Leonardo   | 2024-06-01 10:10:00.000000 UTC | item_viewed | Pepperoni  |
| Michelangelo | 2024-06-01 10:15:00.000000 UTC | item_viewed | Margherita |
|   Leonardo   | 2024-06-01 10:20:00.000000 UTC |  purchased  | Margherita |
| Michelangelo | 2024-06-01 10:25:00.000000 UTC |  purchased  | Margherita |
+--------------+--------------------------------+-------------+------------+

                               Expected result

        +--------------+------------+--------------------------------+
        |    ninja     | pizza_name |        first_timestamp         |
        +--------------+------------+--------------------------------+
        |   Leonardo   | Margherita | 2024-06-01 10:00:00.000000 UTC |
        | Michelangelo | Pepperoni  | 2024-06-01 10:05:00.000000 UTC |
        +--------------+------------+--------------------------------+

Solution 1: Naive Approach

  1. Identify the earliest timestamp for each turtle (using GROUP BYand MIN())
  2. Use this minimum timestamp to find the corresponding pizza (using JOIN)
WITH first_interaction AS (
  SELECT
    ninja,
    MIN(event_timestamp) AS first_timestamp
  FROM
    pizza_events
  WHERE
    action = 'item_viewed'
  GROUP BY
    ninja
)

SELECT
  e.ninja,
  e.pizza_name,
  f.first_timestamp
FROM
  pizza_events e
JOIN
  first_interaction f
ON
  e.ninja = f.ninja AND e.event_timestamp = f.first_timestamp
WHERE
  e.action = 'item_viewed';

Solution 2: A Better Alternative

  1. Assign a sequential number to each pizza view event in a CTE (using the ROW_NUMBER() window function)
  2. Select the first row in an other query (using WHERE)
WITH numbered_interactions AS (
  SELECT
    ninja,
    pizza_name,
    event_timestamp,
    ROW_NUMBER() OVER(PARTITION BY ninja ORDER BY event_timestamp) AS row_num
  FROM
    pizza_events
  WHERE
    action = 'item_viewed'
)

SELECT
  ninja,
  pizza_name,
  event_timestamp
FROM
  numbered_interactions
WHERE
  row_num = 1;

Solution 3: We’re Almost There

  1. Assign a sequential number to each pizza view event (using the ROW_NUMBER() window function).
  2. Filter to retain only the first row (using QUALIFY).
SELECT
  ninja,
  pizza_name,
  event_timestamp,
  ROW_NUMBER() OVER(PARTITION BY ninja ORDER BY event_timestamp) AS row_num
FROM
  pizza_events
WHERE
  action = 'item_viewed'
QUALIFY
  row_num = 1;

Solution 4: Here We Go

Similar to Solution 3, but the row number is used directly within the QUALIFY clause, avoiding its inclusion in the SELECT statement.

Optimized code using QUALIFY and inline window functions

If you want to explore this toy example further, you can find the full source as a gist here.

Performance considerations

We didn’t dive into the performance details of each query, but I encourage you to check it out yourself. Generally, using window function (as in Solution 4) is better than relying on GROUP BY and JOIN (like in Solution 1) because the execution is less complex.

Execution graph comparison in BigQuery: Solution 1 vs. others

When comparing the three versions with QUALIFY (Solutions 2–4), you'll see that the execution graph is the same in BigQuery under the hood.

So, using QUALIFY is like adding some syntactic sugar to your SQL — and who doesn’t like a bit of sugar?

Summary

Using QUALIFY makes your SQL queries shorter and easier to read if you have to limit the output based on a window function result. It’s a small change that can make a big difference in your workflow. So next time you’re writing a query, give QUALIFY a try and see how it can simplify your code.

Further readings

Gist for the example showed in the article Google BigQuery documentation A good read about window functions from Axel Thevenot 🐣

Data
Sql
Google Cloud Platform
Bigquery
Sql Optimization
Recommended from ReadMedium