Querying Polars DataFrames using SQL
Learn how to use the SQLContext object to query your Polars DataFrame/LazyFrame directly using SQL
In my previous article on Polars, I discussed how you can query your Polars dataframes using SQL through the use of DuckDB:
Using DuckDB with Polars
Learn how to use SQL to query your Polars DataFrames
towardsdatascience.com
In this article, I will show you another approach to querying Polars dataframes using SQL— directly on the Polars DataFrame.
If you have installed Polars some time back, be sure to install the latest version of Polars using the following command:
!pip install polars --upgrade
Creating a Sample Polars DataFrame
For the first example in this article, I will create a Polars DataFrame manually:
import polars as pl
df = pl.DataFrame(
{
'Model': ['iPhone X','iPhone XS','iPhone 12',
'iPhone 13','Samsung S11',
'Samsung S12','Mi A1','Mi A2'],
'Sales': [80,170,130,205,400,30,14,8],
'Company': ['Apple','Apple','Apple','Apple',
'Samsung','Samsung','Xiao Mi',
'Xiao Mi'],
})
df
The df
dataframe looks like this:

Creating a Context for SQL
To use SQL to query a Polars DataFrame/LazyFrame, you need to create a polars.sql.context.SQLContext
object, which will allow you to run SQL queries against a DataFrame/LazyFrame.
You can create a SQLContext
object using the following statement:
ctx = pl.SQLContext(frame = df)
In the above statement, your SQLContext
object now allows you to query against the df
dataframe directly using SQL.
Performing SQL Queries
Let’s try a few SQL queries on our df
dataframe. For this, you use the execute()
function, and pass in your SQL statement:
ctx.execute(
"SELECT * from frame"
).collect()
The above returns all the rows from the frame (df
) that is associated with the SQLContext
object:

As with SQL, you can do filtering using the WHERE clause:
ctx.execute(
"SELECT * from frame WHERE Company = 'Apple'"
).collect()
And now the results are those rows containing products from Apple:

Registering Multiple DataFrames with the SQLContext Object
Another way to associate a dataframe (or lazyframe) with the SQLContext
object is to create an instance of it and then use the register()
function to associate the SQLContext
object with the dataframe:
ctx = pl.SQLContext() # create an instance of the SQLContext class
ctx.register('df', df) # name of dataframe, dataframe
The advantage of this approach is that it allows you to register multiple dataframes with the SQLContext
object so that you can perform more sophisticated SQL query. If you use this approach, your SQL statement must now reference the name of the dataframe rather than the frame
keyword:
ctx.execute(
"SELECT * from df"
).collect()
Using the Flights Delay Dataset
Let me now illustrate using a more interesting example — the 2015 Flights Delay dataset:
2015 Flights Delay dataset — https://www.kaggle.com/datasets/usdot/flight-delays. Licensing — CC0: Public Domain
For this example, I am going to use the following CSV files:
- flights.csv
- airlines.csv
First load up the CSV files as LazyFrame
:
import polars as pl
df_flights = pl.scan_csv('flights.csv')
df_airlines = pl.scan_csv('airlines.csv')
Note that
df_flights
anddf_airlines
are all PolarsLazyFrame
s.
Let’s have a quick view of how the dataframes look like:
display(df_flights.collect().head()) display(df_airlines.collect().head())

You can associate the various lazyframe with the SQLContext
object using the register()
function:
ctx = pl.SQLContext()
ctx.register('df_flights', df_flights)
ctx.register('df_airlines', df_airlines)
Alternatively, you can also use the register_many()
function and pass it a dictionary containing the name and the actual lazyframe/dataframe:
ctx.register_many(
{
'df_flights': df_flights,
'df_airlines': df_airlines
}
)
To print out the tables that are associated with the SQLContext
object, use the tables()
function:
print(ctx.tables())
You should see the names of the two lazyframe that you had registered earlier:
['df_airlines', 'df_flights']
We can now use SQL to perform an inner join of two lazyframe — df_flights
and df_airlines
:
df_result = ctx.execute('''
SELECT *
FROM df_flights
INNER JOIN df_airlines
ON df_flights.AIRLINE = df_airlines.IATA_CODE
''').collect()
df_result
The above SQL joins the two lazyframe based on the IATA code of the airlines. The result is as shown:

Since there are quite a number of columns, let’s view the column names using the columns
attribute:
df_result.columns
You should see the following (those in bold are those that I want to print next):
['YEAR',
'MONTH',
'DAY',
'DAY_OF_WEEK',
'AIRLINE',
'FLIGHT_NUMBER',
'TAIL_NUMBER',
'ORIGIN_AIRPORT',
'DESTINATION_AIRPORT',
'SCHEDULED_DEPARTURE',
'DEPARTURE_TIME',
'DEPARTURE_DELAY',
'TAXI_OUT',
'WHEELS_OFF',
'SCHEDULED_TIME',
'ELAPSED_TIME',
'AIR_TIME',
'DISTANCE',
'WHEELS_ON',
'TAXI_IN',
'SCHEDULED_ARRIVAL',
'ARRIVAL_TIME',
'ARRIVAL_DELAY',
'DIVERTED',
'CANCELLED',
'CANCELLATION_REASON',
'AIR_SYSTEM_DELAY',
'SECURITY_DELAY',
'AIRLINE_DELAY',
'LATE_AIRCRAFT_DELAY',
'WEATHER_DELAY',
'AIRLINE_right']
I can now modify my SQL statement to print out only the columns that I want:
df_result = ctx.execute('''
SELECT AIRLINE_right, FLIGHT_NUMBER, YEAR, MONTH, DAY, DEPARTURE_TIME
FROM df_flights
INNER JOIN df_airlines
ON df_flights.AIRLINE = df_airlines.IATA_CODE
''').collect()
df_result
Here’s the result:

Next, I want to know which airlines has the most delay. Here’s the SQL:
df_delays = ctx.execute('''
SELECT
AIRLINE, count(AIRLINE) as Count,
FROM df_flights
WHERE ARRIVAL_DELAY > 0
GROUP BY AIRLINE
ORDER BY Count DESC
''').collect()
df_delays
You can see that the airline WN has the most delay and VX has the least delay:

But I want to display the airline name on each row. Ideally, I can use the following SQL statement:
ctx.execute('''
SELECT *
FROM
(SELECT
AIRLINE, count(AIRLINE) as Count,
FROM df_flights
WHERE ARRIVAL_DELAY > 0
GROUP BY AIRLINE
ORDER BY Count DESC)
AS result1
INNER JOIN
(SELECT *
FROM df_airlines)
AS result2
ON result1.AIRLINE = result2.IATA_CODE
''').collect()
However, the above code snippet returned a “Not implemented error” error. To resolve this, I first count the delays for each airline and then register the result with the SQLContext
object:
df_delays = ctx.execute('''
SELECT
AIRLINE, count(AIRLINE) as Count,
FROM df_flights
WHERE ARRIVAL_DELAY > 0
GROUP BY AIRLINE
ORDER BY Count DESC
''').collect()
ctx.register('df_delays', df_delays)
Then, I use an INNER JOIN to join the two dataframe/lazyframe:
ctx.execute('''
SELECT *
FROM df_delays
INNER JOIN df_airlines
ON df_delays.AIRLINE = df_airlines.IATA_CODE
''').collect()
Finally I get what I want:

Unregistering Tables
If you want to unregister a table from the SQLContext
object, you can use the unregister()
function and pass it the name of the dataframe/lazyframe to unregister:
ctx.unregister('df_flights')
ctx.tables()
The output below shows that the df_flights
lazyframe was unregistered:
['df_airlines']
If you like reading my articles and that it helped your career/study, please consider signing up as a Medium member. It is $5 a month, and it gives you unlimited access to all the articles (including mine) on Medium. If you sign up using the following link, I will earn a small commission (at no additional cost to you). Your support means that I will be able to devote more time on writing articles like this.
Summary
In this article, you have learned how to use the SQLContext
object to query a Polars dataframe/lazyframe directly using SQL. While not all the SQL queries are supported at the moment, this feature is definitely very promising. Using SQL for querying is definitely a time saver for a lot of developers.
Level Up Coding
Thanks for being a part of our community! Before you go:
- 👏 Clap for the story and follow the author 👉
- 📰 View more content in the Level Up Coding publication
- 💰 Free coding interview course ⇒ View Course
- 🔔 Follow us: Twitter | LinkedIn | Newsletter
🚀👉 Join the Level Up talent collective and find an amazing job