avatarWei-Meng Lee

Summary

The web content provides a comprehensive guide on querying Polars DataFrames using SQL through the SQLContext object.

Abstract

The article discusses the use of SQLContext in Polars to execute SQL queries directly on DataFrames or LazyFrames. It begins by demonstrating the creation of a sample Polars DataFrame and then explains how to create a SQLContext to run SQL queries. The author illustrates various SQL operations, such as SELECT, WHERE, and JOIN, and shows how to register multiple DataFrames with the SQLContext. The article also explores the use of SQLContext with the 2015 Flights Delay dataset, highlighting the process of joining two LazyFrames and performing aggregations to analyze flight delays by airline. The author concludes by mentioning the potential of this feature for developers and encourages readers to support the writer by becoming Medium members.

Opinions

  • The author suggests that using SQL to query Polars DataFrames is a time-saving feature for developers.
  • They imply that the SQLContext feature in Polars is promising, although it may not yet support all SQL queries.
  • The author encourages readers to engage with their content by clapping for the story, following the author, and considering a Medium membership to support their work.
  • The author provides a referral link for readers to join Medium, indicating a desire to grow their audience and readership.
  • They express enthusiasm about the Level Up Coding community, inviting readers to join the talent collective and explore additional resources such as a coding interview course and social media channels.

Querying Polars DataFrames using SQL

Learn how to use the SQLContext object to query your Polars DataFrame/LazyFrame directly using SQL

Photo by Hans-Jurgen Mager on Unsplash

In my previous article on Polars, I discussed how you can query your Polars dataframes using SQL through the use of DuckDB:

In this article, I will show you another approach to querying Polars dataframes using SQL— directly on the Polars DataFrame.

If you have installed Polars some time back, be sure to install the latest version of Polars using the following command: !pip install polars --upgrade

Creating a Sample Polars DataFrame

For the first example in this article, I will create a Polars DataFrame manually:

import polars as pl

df = pl.DataFrame(
     {
         'Model': ['iPhone X','iPhone XS','iPhone 12',
                   'iPhone 13','Samsung S11',
                   'Samsung S12','Mi A1','Mi A2'],
         'Sales': [80,170,130,205,400,30,14,8],     
         'Company': ['Apple','Apple','Apple','Apple',
                     'Samsung','Samsung','Xiao Mi',
                     'Xiao Mi'],
     })
df

The df dataframe looks like this:

Creating a Context for SQL

To use SQL to query a Polars DataFrame/LazyFrame, you need to create a polars.sql.context.SQLContext object, which will allow you to run SQL queries against a DataFrame/LazyFrame.

You can create a SQLContext object using the following statement:

ctx = pl.SQLContext(frame = df)

In the above statement, your SQLContext object now allows you to query against the df dataframe directly using SQL.

Performing SQL Queries

Let’s try a few SQL queries on our df dataframe. For this, you use the execute() function, and pass in your SQL statement:

ctx.execute(
    "SELECT * from frame"
).collect()

The above returns all the rows from the frame (df) that is associated with the SQLContext object:

As with SQL, you can do filtering using the WHERE clause:

ctx.execute(
    "SELECT * from frame WHERE Company = 'Apple'"
).collect()

And now the results are those rows containing products from Apple:

Registering Multiple DataFrames with the SQLContext Object

Another way to associate a dataframe (or lazyframe) with the SQLContext object is to create an instance of it and then use the register() function to associate the SQLContext object with the dataframe:

ctx = pl.SQLContext()       # create an instance of the SQLContext class
ctx.register('df', df)      # name of dataframe, dataframe

The advantage of this approach is that it allows you to register multiple dataframes with the SQLContext object so that you can perform more sophisticated SQL query. If you use this approach, your SQL statement must now reference the name of the dataframe rather than the frame keyword:

ctx.execute(
    "SELECT * from df"
).collect()

Using the Flights Delay Dataset

Let me now illustrate using a more interesting example — the 2015 Flights Delay dataset:

2015 Flights Delay datasethttps://www.kaggle.com/datasets/usdot/flight-delays. Licensing — CC0: Public Domain

For this example, I am going to use the following CSV files:

  • flights.csv
  • airlines.csv

First load up the CSV files as LazyFrame:

import polars as pl

df_flights = pl.scan_csv('flights.csv')
df_airlines = pl.scan_csv('airlines.csv')

Note that df_flights anddf_airlines are all PolarsLazyFrames.

Let’s have a quick view of how the dataframes look like:

display(df_flights.collect().head())
display(df_airlines.collect().head())

You can associate the various lazyframe with the SQLContext object using the register() function:

ctx = pl.SQLContext()
ctx.register('df_flights',  df_flights)
ctx.register('df_airlines', df_airlines)

Alternatively, you can also use the register_many() function and pass it a dictionary containing the name and the actual lazyframe/dataframe:

ctx.register_many(
    {
        'df_flights': df_flights,
        'df_airlines': df_airlines 
    }
)

To print out the tables that are associated with the SQLContext object, use the tables() function:

print(ctx.tables())

You should see the names of the two lazyframe that you had registered earlier:

['df_airlines', 'df_flights']

We can now use SQL to perform an inner join of two lazyframe — df_flights and df_airlines:

df_result = ctx.execute('''
    SELECT *
    FROM df_flights
    INNER JOIN df_airlines
    ON df_flights.AIRLINE = df_airlines.IATA_CODE
''').collect()

df_result

The above SQL joins the two lazyframe based on the IATA code of the airlines. The result is as shown:

Since there are quite a number of columns, let’s view the column names using the columns attribute:

df_result.columns

You should see the following (those in bold are those that I want to print next):

['YEAR',
 'MONTH',
 'DAY',
 'DAY_OF_WEEK',
 'AIRLINE',
 'FLIGHT_NUMBER',
 'TAIL_NUMBER',
 'ORIGIN_AIRPORT',
 'DESTINATION_AIRPORT',
 'SCHEDULED_DEPARTURE',
 'DEPARTURE_TIME',
 'DEPARTURE_DELAY',
 'TAXI_OUT',
 'WHEELS_OFF',
 'SCHEDULED_TIME',
 'ELAPSED_TIME',
 'AIR_TIME',
 'DISTANCE',
 'WHEELS_ON',
 'TAXI_IN',
 'SCHEDULED_ARRIVAL',
 'ARRIVAL_TIME',
 'ARRIVAL_DELAY',
 'DIVERTED',
 'CANCELLED',
 'CANCELLATION_REASON',
 'AIR_SYSTEM_DELAY',
 'SECURITY_DELAY',
 'AIRLINE_DELAY',
 'LATE_AIRCRAFT_DELAY',
 'WEATHER_DELAY',
 'AIRLINE_right']

I can now modify my SQL statement to print out only the columns that I want:

df_result = ctx.execute('''
    SELECT AIRLINE_right, FLIGHT_NUMBER, YEAR, MONTH, DAY, DEPARTURE_TIME
    FROM df_flights
    INNER JOIN df_airlines
    ON df_flights.AIRLINE = df_airlines.IATA_CODE
''').collect()

df_result

Here’s the result:

Next, I want to know which airlines has the most delay. Here’s the SQL:

df_delays = ctx.execute('''
    SELECT 
        AIRLINE, count(AIRLINE) as Count,
    FROM df_flights
    WHERE ARRIVAL_DELAY > 0
    GROUP BY AIRLINE
    ORDER BY Count DESC
''').collect()

df_delays

You can see that the airline WN has the most delay and VX has the least delay:

But I want to display the airline name on each row. Ideally, I can use the following SQL statement:

ctx.execute('''
     SELECT * 
     FROM
         (SELECT 
             AIRLINE, count(AIRLINE) as Count,
         FROM df_flights
         WHERE ARRIVAL_DELAY > 0
         GROUP BY AIRLINE
         ORDER BY Count DESC) 
     AS result1
     INNER JOIN 
         (SELECT * 
          FROM df_airlines) 
     AS result2
     ON result1.AIRLINE = result2.IATA_CODE
''').collect()

However, the above code snippet returned a “Not implemented error” error. To resolve this, I first count the delays for each airline and then register the result with the SQLContext object:

df_delays = ctx.execute('''
    SELECT 
        AIRLINE, count(AIRLINE) as Count,
    FROM df_flights
    WHERE ARRIVAL_DELAY > 0
    GROUP BY AIRLINE
    ORDER BY Count DESC
''').collect()

ctx.register('df_delays',  df_delays)

Then, I use an INNER JOIN to join the two dataframe/lazyframe:

ctx.execute('''
    SELECT *
    FROM df_delays
    INNER JOIN df_airlines
    ON df_delays.AIRLINE = df_airlines.IATA_CODE
''').collect()

Finally I get what I want:

Unregistering Tables

If you want to unregister a table from the SQLContext object, you can use the unregister() function and pass it the name of the dataframe/lazyframe to unregister:

ctx.unregister('df_flights')
ctx.tables()

The output below shows that the df_flights lazyframe was unregistered:

['df_airlines']

If you like reading my articles and that it helped your career/study, please consider signing up as a Medium member. It is $5 a month, and it gives you unlimited access to all the articles (including mine) on Medium. If you sign up using the following link, I will earn a small commission (at no additional cost to you). Your support means that I will be able to devote more time on writing articles like this.

Summary

In this article, you have learned how to use the SQLContext object to query a Polars dataframe/lazyframe directly using SQL. While not all the SQL queries are supported at the moment, this feature is definitely very promising. Using SQL for querying is definitely a time saver for a lot of developers.

Level Up Coding

Thanks for being a part of our community! Before you go:

🚀👉 Join the Level Up talent collective and find an amazing job

Polars
Lazyframe
Sqlcontext
Sql
Dataframe
Recommended from ReadMedium