avatarSoner Yıldırım

Summary

The article provides an overview of four essential functions in Polars, a data manipulation library that serves as an efficient alternative to Pandas for handling large datasets.

Abstract

The article introduces Polars, a data analysis library designed for performance with large datasets where Pandas may fall short due to memory constraints and slow processing. It highlights four key functions—filter, with_columns, group_by, and when—and compares their usage in Polars with equivalent operations in Pandas. The author demonstrates how to perform data filtering, column creation, data grouping, and conditional column generation in both libraries, noting that Polars' syntax is reminiscent of a blend between Pandas and PySpark, making it an intuitive transition for users familiar with these tools. The article also provides practical code examples and emphasizes Polars' ability to handle datasets that are too large for Pandas, positioning it as a middle ground between Pandas and Apache Spark for big data processing.

Opinions

  • The author suggests that Polars is particularly advantageous for large datasets due to its efficient in-memory analytics and avoidance of intermediate copies, unlike Pandas.
  • Polars is described as having an intuitive syntax, combining elements of both Pandas and PySpark, which may ease the transition for users looking to scale up from Pandas.
  • The author implies that while Polars is suitable for larger datasets than Pandas can handle, it may not replace Apache Spark for processing extremely large datasets (billions of rows).
  • The author expresses a subjective opinion that Polars serves as a stepping stone from Pandas to PySpark, indicating a progressive learning path for data analysts and scientists dealing with increasing data sizes.
  • A recommendation is made for an AI service, ZAI.chat, which is presented as a cost-effective alternative to ChatGPT Plus (GPT-4), suggesting the author's endorsement of the service based on performance and value.

4 Functions to Know If You Are Planning to Switch from Pandas to Polars

Both Pandas and Polars code are included

Photo by israel palacio on Unsplash

Pandas can sometimes be difficult to work with when data size is large. Two main issues associated with large datasets are Pandas doing in-memory analytics and creating intermediate copies.

On the other hand, Pandas’ user-friendly API and rich selection of flexible functions make it one of most popular data analysis and manipulation libraries.

Polars is a great alternative to Pandas especially when the data size becomes too large for Pandas to handle easily. The syntax of Polars is somewhere between Pandas and PySpark.

In this article, we’ll go over four must-know functions for data cleaning, processing, and analysis with both Pandas and Polars.

Data

First things first. We, of course, need data to learn how these functions work. I prepared sample data, which you can download in my datasets repository. The dataset we’ll use in this article is called “data_polars_practicing.csv”.

Let’s start by reading the dataset into a DataFrame, which is the two-dimensional data structure of both Polars and Pandas libraries.

import polars as pl

df_pl = pl.read_csv("data_polars_practicing.csv")

df_pl.head()
(image by author)
import pandas as pd

df_pd = pd.read_csv("data_polars_practicing.csv")

df_pd.head()
(image by author)

As we see in the code snippets above, the head method displays the first five rows of the DataFrame in both Polars and Pandas. One important difference is that Polars show the data types of columns but Pandas doesn’t. We can also use the dtypes method to see column data types.

We now have a Polars DataFrame called df_pl and a Pandas DataFrame called df_pd.

1. Filter

The first Polars function we’ll cover is filter. As its name suggests, it can be used for filtering DataFrame rows.

Both libraries are highly flexible in terms of the conditions we can use for filtering. Let’s go over some examples.

# Polars

# sales quantity is more than 0
df_pl.filter(pl.col("sales_qty") > 0)

# store code is B1
df_pl.filter(pl.col("store_code") == "B1")

# sales quantity is more than 0 and store code is A2
df_pl.filter((pl.col("store_code") == "A2") & (pl.col("sales_qty") > 0))

# product code is one of the following: 89909, 89912, 89915, 89918
df_pl.filter(pl.col("product_code").is_in([89909, 89912, 89915, 89918]))

The column to apply the condition is selected with the col method.

Let’s see how these operations are done with Pandas.

# Pandas

# sales quantity is more than 0
df_pd[df_pd["sales_qty"] > 0]

# store code is B1
df_pd[df_pd["store_code"] == "B1"]

# sales quantity is more than 0 and store code is A2
df_pd[(df_pd["store_code"] == "A2") & (df_pd["sales_qty"] > 0)]

# product code is one of the following: 89909, 89912, 89915, 89918
df_pd[df_pd["product_code"].isin([89909, 89912, 89915, 89918])]

The operations are quite similar except for a few small syntactic differences.

2. with_columns

The with_columns function creates a new column in Polars DataFrames. The new column can be derived from other columns such as extracting the year from a date value. We can do arithmetic operations including multiple columns, or simply create a column with a constant.

One thing to note here is that with_columns function can also be used for changing column data types.

Following code snippet shows how all these operations are done.

# Polars

# change the data type of sales date from string to date
df_pl = df_pl.with_columns(pl.col("sales_date").str.to_date())

# create year column by extracting year from date column
df_pl = df_pl.with_columns(pl.col("sales_date").dt.year().alias("year"))

# create price column by dividing sales revenue by sales quantity
df_pl = df_pl.with_columns((pl.col("sales_rev") / pl.col("sales_qty")).alias("price"))

# create a column with a constant value
df_pl = df_pl.with_columns(pl.lit(0).alias("dummy_column"))

The alias method is used for naming the new column. For instance, the last operation creates a column called “dummy_column” that contains the value 0 in all the rows.

Let’s see how we can do the same operations with Pandas.

# Pandas

# change the data type of sales date from string to date
df_pd = df_pd.astype({"sales_date": "datetime64[ns]"})

# create year column by extracting year from date column
df_pd.loc[:, "year"] = df_pd.loc[:, "sales_date"].dt.year

# create price column by dividing sales revenue by sales quantity
df_pd.loc[:, "price"] = df_pd.loc[:, "sales_rev"] / df_pd.loc[:, "sales_rev"]

# create a column with a constant value
df_pd.loc[:, "dummy_column"] = 0

3. group_by

The group_by function groups the rows based on the distinct values in a given column or columns. Then, we can calculate several different aggregations on each group such as mean, max, min, sum, and so on.

Following code snippet shows some different groupings and aggregations done on our DataFrame.

# Polars

# calculate total and average sales for each store
df_pl.group_by(["store_code"]).agg(
    pl.sum("sales_qty").alias("total_sales"),
    pl.mean("sales_qty").alias("avg_sales")
)

# calculate total and average sales for each store-year pair
df_pl.group_by(["store_code", "year"]).agg(
    pl.sum("sales_qty").alias("total_sales"),
    pl.mean("sales_qty").alias("avg_sales")
)

# create product lifetime and unique day count for each product
df_pl.group_by(["product_code"]).agg(
    [
        pl.n_unique("sales_date").alias("unique_day_count"),
        ((pl.max("sales_date") - pl.min("sales_date")).dt.total_days() + 1).alias("lifetime")
    ]
)

In the third operation, we calculate the product lifetime by grouping the rows by product and finding the difference between the minimum and maximum dates for each group (i.e. product).

The Pandas equivalent of these operations are as follows:

# Pandas

# calculate total and average sales for each store
df_pd.groupby(["store_code"], as_index=False).agg(
    total_sales = ("sales_qty", "sum"),
    avg_sales = ("sales_qty", "mean")
)

# calculate total and average sales for each store-year pair
df_pd.groupby(["store_code","year"], as_index=False).agg(
    total_sales = ("sales_qty", "sum"),
    avg_sales = ("sales_qty", "mean")
)

# create product lifetime and unique day count for each product
df_pd.groupby(["product_code"], as_index=False).agg(
    unique_day_count = ("sales_date", "nunique"),
    lifetime = ("sales_date", lambda x: (x.max() - x.min()).days + 1)
)

In Pandas, we can pass the aggregate function names as string if it’s a built-in function. Or, we can write a custom aggregation (e.g. calculating product lifetime) using lambda functions.

4. when

We can use the when function along with the with_columns function for creating conditional columns. It works as shown in the drawing below. If the condition is met, then the column takes corresponding value. We can provide multiple conditions by chaining when-then pairs. Finally, the otherwise part is for the remaining rows that don’t fit any of the given conditions.

when-then-otherwise (image by author)

Let’s do a couple of examples to see how these work in the code.

# Polars

# create has_value column that takes the value 1 if 
# sales quantity is higher than 0 and the value 0 otherwise
df_pl = df_pl.with_columns(
    pl.when(pl.col("sales_qty") > 0).then(1).otherwise(0).alias("has_sales")
)

# create sales_group column that takes the 
# value low if sales quantity is less than 5
# value medium if sales quantity is between 5 and 20
# value high otherwise (i.e. sales is more than 20)
df_pl = df_pl.with_columns(
    pl.when(pl.col("sales_qty") < 5).then(pl.lit("low")).\
    when((pl.col("sales_qty") >= 5) & (pl.col("sales_qty") < 20)).then(pl.lit("medium")).\
    otherwise(pl.lit("high")).alias("sales_group")
)

Let’s also see how we can do these operations with Pandas. We’ll use two functions from the NumPy library, which are the where and select functions.

# Pandas
import numpy as np

# create has_value column that takes the value 1 if 
# sales quantity is higher than 0 and the value 0 otherwise
df_pd.loc[:, "has_sale"] = np.where(df_pd["sales_qty"] > 0, 1, 0)

# create sales_group column that takes the 
# value low if sales quantity is less than 5
# value medium if sales quantity is between 5 and 20
# value high otherwise (i.e. sales is more than 20)
conditions = [
    df_pd["sales_qty"] < 5,
    (df_pd["sales_qty"] >= 5) & (df_pd["sales_qty"] < 20)
]

values = ["low", "medium"]

df_pd.loc[:, "sales_group"] = np.select(conditions, values, default="high")

The where function takes a condition and allows for assigning separate values to rows that fit and don’t fit the condition.

The select function accepts multiple conditions and their corresponding values. Then, it assigns the values accordingly. It’s quite similar to the when-then structure in Polars. For the rows that don’t fit any of the given conditions, we can use the default parameter (similar to the otherwise in Polars).

Final words

I think of Polars library as an intermediate step between Pandas and Spark. It works quite well with datasets that Pandas struggle with. I haven’t tested Polars with much larger datasets (i.e. billions of rows) but I don’t think it can be a replacement for Spark. With that being said, the syntax of Polars is very intuitive. It’s similar to both Pandas and PySpark SQL syntax. I think this also indicates that Polars is kind of a transition step from Pandas to PySpark (my subjective opinion).

Thank you for reading. Please let me know if you have any feedback.

Data Science
Data Analysis
Python
Programming
Pandas
Recommended from ReadMedium