4 Functions to Know If You Are Planning to Switch from Pandas to Polars

Both Pandas and Polars code are included

Pandas can sometimes be difficult to work with when data size is large. Two main issues associated with large datasets are Pandas doing in-memory analytics and creating intermediate copies.

On the other hand, Pandas’ user-friendly API and rich selection of flexible functions make it one of most popular data analysis and manipulation libraries.

Polars is a great alternative to Pandas especially when the data size becomes too large for Pandas to handle easily. The syntax of Polars is somewhere between Pandas and PySpark.

In this article, we’ll go over four must-know functions for data cleaning, processing, and analysis with both Pandas and Polars.

Data

First things first. We, of course, need data to learn how these functions work. I prepared sample data, which you can download in my datasets repository. The dataset we’ll use in this article is called “data_polars_practicing.csv”.

Let’s start by reading the dataset into a DataFrame, which is the two-dimensional data structure of both Polars and Pandas libraries.

import polars as pl

df_pl = pl.read_csv("data_polars_practicing.csv")

df_pl.head()

import pandas as pd

df_pd = pd.read_csv("data_polars_practicing.csv")

df_pd.head()

As we see in the code snippets above, the head method displays the first five rows of the DataFrame in both Polars and Pandas. One important difference is that Polars show the data types of columns but Pandas doesn’t. We can also use the dtypes method to see column data types.

We now have a Polars DataFrame called df_pl and a Pandas DataFrame called df_pd.

1. Filter

The first Polars function we’ll cover is filter. As its name suggests, it can be used for filtering DataFrame rows.

Both libraries are highly flexible in terms of the conditions we can use for filtering. Let’s go over some examples.

# Polars

# sales quantity is more than 0
df_pl.filter(pl.col("sales_qty") > 0)

# store code is B1
df_pl.filter(pl.col("store_code") == "B1")

# sales quantity is more than 0 and store code is A2
df_pl.filter((pl.col("store_code") == "A2") & (pl.col("sales_qty") > 0))

# product code is one of the following: 89909, 89912, 89915, 89918
df_pl.filter(pl.col("product_code").is_in([89909, 89912, 89915, 89918]))

The column to apply the condition is selected with the col method.

Let’s see how these operations are done with Pandas.

# Pandas

# sales quantity is more than 0
df_pd[df_pd["sales_qty"] > 0]

# store code is B1
df_pd[df_pd["store_code"] == "B1"]

# sales quantity is more than 0 and store code is A2
df_pd[(df_pd["store_code"] == "A2") & (df_pd["sales_qty"] > 0)]

# product code is one of the following: 89909, 89912, 89915, 89918
df_pd[df_pd["product_code"].isin([89909, 89912, 89915, 89918])]

The operations are quite similar except for a few small syntactic differences.

2. with_columns

The with_columns function creates a new column in Polars DataFrames. The new column can be derived from other columns such as extracting the year from a date value. We can do arithmetic operations including multiple columns, or simply create a column with a constant.

One thing to note here is that with_columns function can also be used for changing column data types.

Following code snippet shows how all these operations are done.

# Polars

# change the data type of sales date from string to date
df_pl = df_pl.with_columns(pl.col("sales_date").str.to_date())

# create year column by extracting year from date column
df_pl = df_pl.with_columns(pl.col("sales_date").dt.year().alias("year"))

# create price column by dividing sales revenue by sales quantity
df_pl = df_pl.with_columns((pl.col("sales_rev") / pl.col("sales_qty")).alias("price"))

# create a column with a constant value
df_pl = df_pl.with_columns(pl.lit(0).alias("dummy_column"))

The alias method is used for naming the new column. For instance, the last operation creates a column called “dummy_column” that contains the value 0 in all the rows.

Let’s see how we can do the same operations with Pandas.

# Pandas

# change the data type of sales date from string to date
df_pd = df_pd.astype({"sales_date": "datetime64[ns]"})

# create year column by extracting year from date column
df_pd.loc[:, "year"] = df_pd.loc[:, "sales_date"].dt.year

# create price column by dividing sales revenue by sales quantity
df_pd.loc[:, "price"] = df_pd.loc[:, "sales_rev"] / df_pd.loc[:, "sales_rev"]

# create a column with a constant value
df_pd.loc[:, "dummy_column"] = 0

3. group_by

The group_by function groups the rows based on the distinct values in a given column or columns. Then, we can calculate several different aggregations on each group such as mean, max, min, sum, and so on.

Following code snippet shows some different groupings and aggregations done on our DataFrame.

# Polars

# calculate total and average sales for each store
df_pl.group_by(["store_code"]).agg(
    pl.sum("sales_qty").alias("total_sales"),
    pl.mean("sales_qty").alias("avg_sales")
)

# calculate total and average sales for each store-year pair
df_pl.group_by(["store_code", "year"]).agg(
    pl.sum("sales_qty").alias("total_sales"),
    pl.mean("sales_qty").alias("avg_sales")
)

# create product lifetime and unique day count for each product
df_pl.group_by(["product_code"]).agg(
    [
        pl.n_unique("sales_date").alias("unique_day_count"),
        ((pl.max("sales_date") - pl.min("sales_date")).dt.total_days() + 1).alias("lifetime")
    ]
)

In the third operation, we calculate the product lifetime by grouping the rows by product and finding the difference between the minimum and maximum dates for each group (i.e. product).

The Pandas equivalent of these operations are as follows:

# Pandas

# calculate total and average sales for each store
df_pd.groupby(["store_code"], as_index=False).agg(
    total_sales = ("sales_qty", "sum"),
    avg_sales = ("sales_qty", "mean")
)

# calculate total and average sales for each store-year pair
df_pd.groupby(["store_code","year"], as_index=False).agg(
    total_sales = ("sales_qty", "sum"),
    avg_sales = ("sales_qty", "mean")
)

# create product lifetime and unique day count for each product
df_pd.groupby(["product_code"], as_index=False).agg(
    unique_day_count = ("sales_date", "nunique"),
    lifetime = ("sales_date", lambda x: (x.max() - x.min()).days + 1)
)

In Pandas, we can pass the aggregate function names as string if it’s a built-in function. Or, we can write a custom aggregation (e.g. calculating product lifetime) using lambda functions.

4. when

We can use the when function along with the with_columns function for creating conditional columns. It works as shown in the drawing below. If the condition is met, then the column takes corresponding value. We can provide multiple conditions by chaining when-then pairs. Finally, the otherwise part is for the remaining rows that don’t fit any of the given conditions.

Let’s do a couple of examples to see how these work in the code.

# Polars

# create has_value column that takes the value 1 if 
# sales quantity is higher than 0 and the value 0 otherwise
df_pl = df_pl.with_columns(
    pl.when(pl.col("sales_qty") > 0).then(1).otherwise(0).alias("has_sales")
)

# create sales_group column that takes the 
# value low if sales quantity is less than 5
# value medium if sales quantity is between 5 and 20
# value high otherwise (i.e. sales is more than 20)
df_pl = df_pl.with_columns(
    pl.when(pl.col("sales_qty") < 5).then(pl.lit("low")).\
    when((pl.col("sales_qty") >= 5) & (pl.col("sales_qty") < 20)).then(pl.lit("medium")).\
    otherwise(pl.lit("high")).alias("sales_group")
)

Let’s also see how we can do these operations with Pandas. We’ll use two functions from the NumPy library, which are the where and select functions.

# Pandas
import numpy as np

# create has_value column that takes the value 1 if 
# sales quantity is higher than 0 and the value 0 otherwise
df_pd.loc[:, "has_sale"] = np.where(df_pd["sales_qty"] > 0, 1, 0)

# create sales_group column that takes the 
# value low if sales quantity is less than 5
# value medium if sales quantity is between 5 and 20
# value high otherwise (i.e. sales is more than 20)
conditions = [
    df_pd["sales_qty"] < 5,
    (df_pd["sales_qty"] >= 5) & (df_pd["sales_qty"] < 20)
]

values = ["low", "medium"]

df_pd.loc[:, "sales_group"] = np.select(conditions, values, default="high")

The where function takes a condition and allows for assigning separate values to rows that fit and don’t fit the condition.

The select function accepts multiple conditions and their corresponding values. Then, it assigns the values accordingly. It’s quite similar to the when-then structure in Polars. For the rows that don’t fit any of the given conditions, we can use the default parameter (similar to the otherwise in Polars).

Final words

I think of Polars library as an intermediate step between Pandas and Spark. It works quite well with datasets that Pandas struggle with. I haven’t tested Polars with much larger datasets (i.e. billions of rows) but I don’t think it can be a replacement for Spark. With that being said, the syntax of Polars is very intuitive. It’s similar to both Pandas and PySpark SQL syntax. I think this also indicates that Polars is kind of a transition step from Pandas to PySpark (my subjective opinion).

Thank you for reading. Please let me know if you have any feedback.