avatarBex T.

Summary

The article provides a guide for transitioning from Pandas to Polars for data manipulation, emphasizing Polars' performance and memory efficiency for handling large datasets.

Abstract

The article "7 Easy Steps To Switch From Pandas to Lightning Fast Polars And Never Return" serves as a comprehensive cheat sheet for data scientists and analysts looking to migrate from Pandas to Polars. It outlines the limitations of Pandas in handling large datasets due to memory constraints and introduces Polars as a Rust-based alternative that offers superior speed and scalability. The author breaks down the transition process into seven steps, covering reading/writing data, creating Series and DataFrames, understanding expressions, selecting and filtering data, creating new columns, groupby operations, and utilizing the lazy API for optimized performance. The article highlights Polars' query engine and expressions, which enable fast and parallel data processing, and its growing popularity in the open-source community, as evidenced by its rapidly increasing GitHub stars.

Opinions

  • The author suggests that Pandas, while versatile, falls short in handling large datasets efficiently, especially with the increasing size and complexity of modern datasets.
  • Wes McKinney's rule of thumb for Pandas, requiring 5 to 10 times as much RAM as the dataset size, is impractical for today's large datasets, according to the author.
  • Dask is criticized for not addressing the underlying performance and memory issues of Pandas, treating it as a black box.
  • Polars is praised for its speed advantages over Pandas, including the upcoming Pandas 2.0 with PyArrow backend, and for its syntax and functionality that closely resemble SQL.
  • The author expresses enthusiasm about Polars' query engine and expressions, which are key to its "embarrassingly parallel" performance capabilities.
  • The article promotes the use of Polars' lazy API for its query optimization and parallel processing benefits, suggesting it as a preferred mode for data manipulation.
  • The author is optimistic about the future of Polars, indicating its rising popularity and potential to surpass competitors in the data processing landscape.

7 Easy Steps To Switch From Pandas to Lightning Fast Polars And Never Return

A cheat sheet of the most common Pandas operations translated into Polars

Image by author via Midjourney

Time for goodbyes!

Pandas can do anything. Virtually anything. But (and this is an I-wish-a-million-times-it-was-any-other-way but) it lacks speed. Pandas just can't keep up with the pace at which the size and complexity of today's datasets are growing.

Pandas author, Wes McKinney, states that when he wrote Pandas, he had this rule of thumb in mind for his library:

Have 5 to 10 times as much RAM as the size of your dataset.

Maybe you could've snoozed past this rule when the Iris dataset was first introduced, but it is different today. You simply can't load a 100GB dataset (already common in modern times) when your RAM is resolutely stuck at 64 GBs.

Sure, there are great alternatives like Dask. But Dask doesn’t implement new functionality. It stretches existing Pandas syntax over multiple processes (threads) and ignores the underlying performance and memory issues.

It treats Pandas as a black box. Forgive me for saying this, but it is almost like putting lipstick on a pig.

Polars, the focus of our article, was written in Rust from the ground up to fix all the shortcomings of Pandas. In my last article, we've seen how it is already faster than the upcoming Pandas 2.0 with PyArrow backend.

While that article focused on speed advantages, this one focuses more on the syntax and functionality and shows you how to switch from Pandas to Polars in seven easy steps and, maybe, never switch back.

0. Reading/writing data

Even though it is painfully slow, CSV is still one of the most popular file formats to store data. So, let's start with the read_csv function of Polars.

Apart from the apparent speed benefits, it only differs from its Pandas namesake in terms of the number of parameters (Pandas read_csv has 49. Yep, I counted) and syntax.

Image by author

The confusing parameter names shouldn't be a problem as most modern IDEs have tab-completion or pop-up documentation features (Shift + Tab on JupyterLab, thank you 🙏).

If you were unaware, the dtype parameter prevents Pandas from setting automatic datatypes and allows the user to set custom ones like string type for cut or datetime for date-type columns.

You can use the same behavior in Polars with dtypes (mind the ‘s’) though it doesn’t allow the types to be set via strings. You have to provide either Python built-in types or through Polars’, like pl.Boolean, pl.Categorical, pl.DateTime, pl.Int64 or pl.Null for missing values. You can see the complete list by calling dir(pl).

Reading and writing Parquet files, which are much faster and more memory-efficient than CSVs, are also supported in Polars through read_parquet and write_parquet functions.

Image by author

1. Creating Series and DataFrames

You don't always read data from files. Like in Pandas, you can create DataFrames and Series from scratch, and the syntaxes are almost identical:

Images by author

There are also many name and behavior-(almost)-identical methods of Polars DataFrames to Pandas. Say hello to:

  • apply - Apply custom user-defined functions on each row of the DataFrame
  • corr - correlation matrix
  • describe - Summary statistics, 5-number summary
  • drop - Remove columns from a DataFrame
  • explode - Unpacking the given column to long format (when cells contain multiple values like [1, 2, 3])
  • head, tail and sample(n) - get different views of the DataFrame (top, bottom, random)
  • iter_rows - returns an iterator of DataFrame rows of Python-native values
  • max, mean, median, sum, std and the usual gang of common math and statistical functions.

and so on. Check out this page of the docs for the full list of DataFrame methods in Polars.

2. Understanding expressions in Polars

This is a query engine, human. — Midjourney

At the core of Polars is its query engine, which runs user-defined expressions. The query engine and expressions are two critical components for Polars' blazing fast and, as the Polars user guide puts it, "embarrassingly parallel" performance.

You may be shocked at how closely Polars expressions resemble SQL while keeping firm ties to the familiar Pandas syntax.

Like SQL queries, you can write expressions for:

  • Creating new columns from existing ones
  • Getting views of the data after some transformation
  • Summary statistics
  • Processing and cleaning data
  • Groupby statements

and so on.

df.filter(pl.col('column') == 'some_value')

In the query above, the expression is pl.col('column)' == 'some_value', which, as you guess, filters the DataFrame for rows where column is equal to some_value.

When you run the expression on its own, you won't get a boolean Series as you would in Pandas:

type(pl.col("column") == "some_value")
polars.expr.expr.Expr

That's because expressions are only evaluated under contexts. There are three broad contexts in Polars:

  1. Selecting data — in the select context, expressions are applied over columns and must produce columns of the same length in the result. This behavior must be familiar from your SQL days. The filter function is also tied into this context.
  2. Grouping data — in the groupby context, expressions work on groups, and the results may have any length as a group can have many members.
  3. Adding new columns — in this context, expressions are used to create new columns from scratch or existing ones.

Let's see each context in detail.

3. Selecting data

The Pandas' brackets notation gives way to expressions in Polars for selecting columns.

To select a single column, you can use their literal names inside select or use the recommended pl.col function to reference columns.

To select multiple, you can list the column names with commas inside pl.col or as a list of pl.col references inside select. We will see the differences between these syntaxes in a bit.

Image by author

Polars includes certain functionality not fully available in Pandas for selecting data. For example, you can exclude columns in your selection with exclude:

df.select(pl.exclude("price")).head()
Image by author

Or use regular expressions between ^ and $ characters. Below, we are choosing all columns that start with the letter c:

df.select(pl.col("^c.+$")).head()
Image by author

You can also subset the DataFrame based on data type, which might remind you of select_dtypes of Pandas (on the left):

Image by author

To select all numeric columns, we are using both Int64 and Float64 types inside pl.col.

4. Filtering data

You can use the filter function to subset DataFrames with boolean indexing. For example, using is_between function on a column creates an expression to filter numeric columns within a range.

Image by author

You can combine multiple conditional expressions with the familiar boolean operators & (AND), | (OR). In the example below, we are choosing rows where the color column is either 'E' or 'J' AND the price of diamonds are below 500:

Image by author

Also, notice how we are using is_in in Polars on the right.

5. Creating new columns

You can create new columns under the with_columns context. In the example below, new_col is defined using pl.col('price') ** 2 and aliased to give the new column a name, just like as keyword in SQL.

Image by author

In the second example, we combine two columns (even though it doesn't make sense), demonstrating how integer and string columns can be combined with Polars. You can use any native Python or third-party function and operator on columns referenced with pl.col.

If you want the new column to be inserted into the DataFrame, you have to override the original df variable.

While we are at it, string columns in Polars have the familiar .str interface for special text manipulation functions like contains or lengths. See the full list here. There are also .cat, .dt and .arr interfaces for specialized categorical, temporal and array functions.

6. Groupby

I don't think we can leave without mentioning Groupby operations:

Image by author

When using the groupby function in Polars, be sure to include maintain_order=True so that groups are not displayed randomly. Also, unlike Pandas, groupby(col_name) expression only works on the given column. To group all columns based on col_name, you have to use an aggregation context. Here is its syntax:

df.groupby(
    "cut", maintain_order=True
).agg(pl.col("*").count())

After the groupby context, you chain the aggregation context and specify which columns the context affects. And chain any function on the result like count.

Here is another example that groups by the diamond cut quality and returns the average numeric value for each group:

To learn more advanced groupby expressions in Polars, read here.

7. The lazy API in Polars

One of the best features of Polars is its lazy API. In it, queries aren't run line-by-line but processed end-to-end by the query engine.

This is where query optimization and the embarrassingly parallel magic happen. You can turn any expression written in eager mode into the lazy mode with only two keywords:

import polars as pl

df = pl.read_csv("data/diamonds.csv")

query = df.lazy().filter(
    pl.col("cut") == "Ideal"
)

type(query)
polars.lazyframe.frame.LazyFrame

When you add the lazy() function before an expression is chained, the DataFrame becomes a LazyFrame. At this point, the query isn't executed, and you can chain more expressions. Once ready, you call collect() to get the result:

query.collect().head()
Image by author

While it is already fast in eager mode, the lazy mode adds exxxtra (yes, triple x) fuel to the query engine.

If you want to make the lazy API your default, you can use scan_* functions when reading data instead of read_*:

df = pl.scan_csv("data/diamonds.csv")

q1 = df.filter(
    pl.col("cut") == "Ideal"
)

q1.collect().head()

This way, you avoid writing the lazy() function every time.

If you are dealing with datasets that are intimidating to your RAM, you can use streaming so that Polars processes your data in batches. This feature is enabled in the lazy API by setting streaming=True inside collect. Find out more about this excellent feature from this page.

Conclusion

Polars might be new (I mean, it is fresh out of its crib), but it is already very popular. In the open-source standards, it is a rock star. Just look at its competitors:

  1. Pandas, released in 2011, has 37.5k GitHub stars.
  2. Apache Spark, released in 2014, has 26.8k stars.
  3. Vaex, released in 2017, has 7.9k GitHub stars.
  4. Dask, released in 2015, has 10.9k stars.
  5. Apache Arrow, released in 2016, has 11.4k stars.

In comparison, Polars was released in 2020 and already amassed 15.9k stars, halfway to its long-standing top competitors.

This should give you a rough picture of where the wind is blowing. Things might change when Pandas 2.0 is released, but I think Polars is already giving Pandas a run for its money.

Loved this article and, let’s face it, its bizarre writing style? Imagine having access to dozens more just like it, all written by a brilliant, charming, witty author (that’s me, by the way :).

For only 4.99$ membership, you will get access to not just my stories, but a treasure trove of knowledge from the best and brightest minds on Medium. And if you use my referral link, you will earn my supernova of gratitude and a virtual high-five for supporting my work.

Artificial Intelligence
Data Science
Programming
Machine Learning
Python
Recommended from ReadMedium