3x Faster Pandas with PyPolars

Summary

PyPolars is a Python data frame library that utilizes all available CPU cores to perform computations faster than Pandas, with a similar API for easier developer transition.

Abstract

PyPolars is an open-source Python data frame library designed to speed up computations by utilizing all available CPU cores. It offers an API similar to Pandas, making it easier for developers to transition. PyPolars has two APIs: Eager API, which produces results immediately after execution like Pandas, and Lazy API, which forms a map or plan upon execution and then executes it in parallel across all CPU cores. Although PyPolars does not cover all Pandas functions, it is a memory-efficient library due to its immutable memory backing. Benchmark tests show that PyPolars is 2x to 3x faster than Pandas for basic operations.

Bullet points

PyPolars is an open-source Python data frame library that performs computations faster than Pandas by utilizing all available CPU cores.
PyPolars has an API similar to Pandas, making it easier for developers to transition.
PyPolars has two APIs: Eager API and Lazy API. Eager API produces results immediately after execution, while Lazy API forms a map or plan upon execution and then executes it in parallel across all CPU cores.
PyPolars is a memory-efficient library due to its immutable memory backing.
Benchmark tests show that PyPolars is 2x to 3x faster than Pandas for basic operations.
PyPolars does not cover all Pandas functions, but it can be used when the data is too big for Pandas and too small for Spark.
PyPolars can be installed using pip and imported using import pypolars as pl.

Speed up your Pandas workflow using the PyPolars library

Pandas is one of the most important Python packages among data scientists for working with data. The Pandas library is used mostly for data exploration and visualization, as it comes with many built-in functions. However, Pandas struggles with large datasets because it does not scale or distribute its work across all the cores of the CPU.

To speed up computations, one can utilize all the cores of the CPU. Various open-source libraries, including Dask, Vaex, Modin, Pandarallel, and PyPolars, parallelize computations across multiple CPU cores. In this article, we will discuss the implementation and usage of the PyPolars library and compare its performance with the Pandas library.

What is PyPolars?

PyPolars is an open-source Python data frame library similar to Pandas. PyPolars utilizes all the available cores of the CPU and hence performs computations faster than Pandas. PyPolars has an API similar to that of Pandas. It is written in Rust with Python wrappers.

Ideally, PyPolars is used when the data is too big for Pandas and too small for Spark.
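
The multi-core idea described above can be sketched in plain Python as a split, apply, combine pattern: cut a column into chunks, reduce each chunk on its own worker, then merge the partial results. This is only an illustration and not PyPolars code; the helper names `split_into_chunks` and `parallel_sum` are hypothetical. Threads are used to keep the sketch portable, and because of CPython's global interpreter lock they will not give a real speedup for pure-Python work, which is one reason libraries such as PyPolars do the heavy lifting in a compiled language.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal slices."""
    n_chunks = max(1, min(n_chunks, len(values)))
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def parallel_sum(values, n_workers=None):
    """Sum a column by reducing each chunk on a worker, then combining."""
    if not values:
        return 0
    n_workers = n_workers or os.cpu_count() or 1
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # apply step, one chunk per task
    return sum(partial_sums)  # combine step


column = list(range(1_000_000))
total = parallel_sum(column)
```

The result is the same as calling `sum(column)` directly; only the way the work is divided changes.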

How Does PyPolars Work?

The PyPolars library has two APIs: the Eager API and the Lazy API. The Eager API is very similar to Pandas: results are produced as soon as each operation finishes executing. The Lazy API is very similar to Spark: running a query first builds a map, or plan, of the operations, and that plan is then executed in parallel across all the cores of the CPU.

(Image by Author), PyPolars APIs

PyPolars is essentially a Python binding to the Polars library. The best part of the PyPolars library is its API similarity to Pandas, which makes the transition easier for developers.
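
The difference between eager and lazy execution can be illustrated without the library itself. The sketch below is a toy model in plain Python, not the PyPolars API: `LazyPlan` and its methods are hypothetical names chosen for illustration. The eager version computes its result on the spot, while the lazy version only records the steps and runs them when `collect()` is called. The toy executes its plan sequentially; it does not attempt the parallel execution described above.

```python
class LazyPlan:
    """Hypothetical stand-in for a lazy query; not part of PyPolars."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Record the step and return a new plan; nothing is computed yet.
        return LazyPlan(self._rows, self._steps + (("filter", predicate),))

    def with_column(self, name, fn):
        return LazyPlan(self._rows, self._steps + (("with_column", (name, fn)),))

    def describe_plan(self):
        """List the recorded step kinds in execution order."""
        return [kind for kind, _ in self._steps]

    def collect(self):
        """Run every recorded step on a copy of the rows and return the result."""
        rows = [dict(r) for r in self._rows]
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                name, fn = arg
                for r in rows:
                    r[name] = fn(r)
        return rows


rows = [{"price": 5}, {"price": 20}, {"price": 50}]

# Eager style: the filtered list exists as soon as this line finishes.
eager_result = [r for r in rows if r["price"] > 10]

# Lazy style: building the plan does no work on the data.
plan = (
    LazyPlan(rows)
    .filter(lambda r: r["price"] > 10)
    .with_column("doubled", lambda r: r["price"] * 2)
)
lazy_result = plan.collect()  # the work happens here
```

Because the whole plan is known before anything runs, a real lazy engine has the chance to reorder or combine steps and to spread them across cores, which the eager style cannot do.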

Benchmark Time Constraints:

For demonstration, I used a large dataset (~6.4 GB) with 25 million instances.

(Image by Author), Benchmark times for Pandas and PyPolars basic operations

From the above benchmark times for some basic operations using the Pandas and PyPolars libraries, we can observe that PyPolars is almost 2x to 3x faster than Pandas.
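
A comparison like the one above can be reproduced with a small timing harness. The sketch below is not the benchmark used for this article; `best_time` and `speedup` are hypothetical helpers, and the two workloads are plain-Python stand-ins for "the same operation in two libraries". Taking the best of several runs is one common way to reduce noise from other processes.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time, in seconds, over several runs of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def loop_sum(values):
    """Deliberately slow baseline: accumulate in a Python-level loop."""
    total = 0
    for v in values:
        total += v
    return total


data = list(range(200_000))
baseline = best_time(lambda: loop_sum(data))
candidate = best_time(lambda: sum(data))
ratio = speedup(baseline, candidate)
print(f"baseline {baseline:.4f}s, candidate {candidate:.4f}s, speedup {ratio:.1f}x")
```

To compare two data frame libraries, replace the two lambdas with the same operation written for each library, and run both on the same data and the same machine.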

We now know that PyPolars has an API very similar to that of Pandas, but it does not cover all of Pandas' functions. For example, PyPolars has no .describe() function; instead, we can use df_pypolars.to_pandas().describe().
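
If converting to Pandas is not convenient, the same kind of summary can be computed by hand for a single numeric column. The sketch below uses only the Python standard library; the `describe` function here is a hypothetical helper, not a PyPolars or Pandas function. It assumes that the statistics of interest are count, mean, sample standard deviation, minimum, quartiles, and maximum, and it needs at least two values.

```python
import statistics


def describe(values):
    """Summarize one numeric column (requires at least two values)."""
    data = sorted(values)
    # "inclusive" treats the data as containing its own min and max,
    # so quartiles are linearly interpolated between observed points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.fmean(data),
        "std": statistics.stdev(data),  # sample standard deviation
        "min": data[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": data[-1],
    }


summary = describe([1, 2, 3, 4, 5])
for name, value in summary.items():
    print(f"{name:>5}: {value}")
```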

Conclusion:

In this article, we have covered a short introduction to the PyPolars library, including its implementation and usage, and compared its benchmark times with Pandas for some basic operations. Note that PyPolars works very similarly to Pandas, and that PyPolars is a memory-efficient library because the memory backing it is immutable.

One can go through the documentation to get a detailed understanding of the library. There are various other open-source libraries that can parallelize Pandas operations and speed up the process. Read the article mentioned below to learn about 4 such libraries:

4 Libraries that can parallelize the existing Pandas ecosystem

Distribute Python workload by parallel processing using these frameworks
