The article presents three Python libraries that serve as alternatives to Pandas, Polars, and PySpark for handling dataframes, and provides a tutorial for each: Dask, Modin, and Koalas.
Abstract
The article "Three Alternatives to Pandas, Polars and PySpark to Work With Data In Python" delves into the capabilities and usage of Dask, Modin, and Koalas as efficient alternatives for dataframe manipulation in Python. It outlines the benefits of these libraries, such as Dask's parallel computing for large datasets, Modin's seamless integration with existing pandas code, and Koalas' compatibility with Apache Arrow and Apache Spark. The author includes detailed instructions on installing each library, creating dataframes, performing transformations, and executing computations, along with the libraries' integration with distributed computing systems. The article emphasizes the ease of transition for users already familiar with Pandas and provides resources for further exploration, such as official documentation and tutorials.
Opinions
The author suggests that Dask is an ideal choice for working with dataframes that exceed the system's memory capacity.
Modin is highlighted for its ability to accelerate Pandas' performance without significant changes to existing codebases, leveraging Ray, Dask, or Unidist.
Koalas is praised for its Pandas-like API, which facilitates a smooth learning curve for those transitioning from Pandas to distributed computing with Apache Spark.
The article indicates a preference for these alternative libraries due to their specialized features for handling big data and their potential to enhance productivity for data scientists.
The author hints at the deprecation of the Koalas library, suggesting consideration of other options like DuckDB for future data handling needs.
Three Alternatives to Pandas, Polars and PySpark to Work With Data In Python
When it comes to working with data in Python, Pandas, Polars, and PySpark have long been popular choices.
If you read my previous articles, you may have seen some performance comparisons between these three libraries.
However, there are several alternative libraries available that offer unique features and capabilities for handling dataframes.
In this article, I will:
Explore three alternatives: Dask, Modin, and Koalas;
Provide a quick tutorial on how to start working with them.
1. Dask
Dask is a library that brings parallel computing and out-of-memory execution to the world of data analysis.
It allows you to efficiently handle large datasets that don’t fit into memory by breaking them into smaller partitions and executing computations in parallel.
With a Pandas-like API, Dask seamlessly integrates into your existing data processing workflows, making it a powerful choice for scaling your computations.
“A Dask DataFrame is a large parallel DataFrame composed of many smaller pandas DataFrames, split along the index.
These pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster.
One Dask DataFrame operation triggers many operations on the constituent pandas DataFrames.”
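To make the idea of partitions concrete, here is a minimal sketch (the dataframe and the partition count are illustrative) of splitting an in-memory pandas DataFrame into a Dask DataFrame:

import pandas as pd
import dask.dataframe as dd

# An ordinary in-memory pandas DataFrame
df = pd.DataFrame({"x": range(1000)})

# Split it along the index into 4 smaller pandas DataFrames (partitions)
ddf = dd.from_pandas(df, npartitions=4)
print(ddf.npartitions)  # 4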
According to the official website, “Dask is included by default in Anaconda.
You can also install Dask with Pip, or you have several options for installing from source. You can also use Conda to update Dask or to do a minimal Dask install.”
Here’s a step-by-step tutorial on working with dataframes using Dask, adapted from the official website:
Install Dask
You can install Dask using pip:
pip install dask
pip install "dask[distributed]"
Import Dask and Create Dataframe
Start by importing the necessary modules and creating a Dask dataframe.
import dask.dataframe as dd
from dask.distributed import Client

# Start a local cluster and connect a client to it (optional, but gives you the dashboard)
client = Client()

def read_dask_parquet():
    # Lazily read the Parquet files, loading only the two columns we need;
    # storage_options only applies to remote filesystems (e.g. anonymous S3 access)
    ddf = dd.read_parquet(
        "yellow_tripdata/yellow_tripdata_2014*.parquet",
        columns=["passenger_count", "trip_distance"],
        storage_options={"anon": True},
    )
    return ddf
Perform Operations
You can now perform various operations on the Dask dataframe, similar to Pandas.
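As a minimal sketch, reusing the read_dask_parquet function defined above: the groupby below only builds a lazy task graph, and nothing is actually read or computed until .compute() is called.

ddf = read_dask_parquet()

# Build a lazy task graph: average trip distance per passenger count
mean_distance = ddf.groupby("passenger_count")["trip_distance"].mean()

# Trigger the parallel computation and collect the result as a pandas Series
result = mean_distance.compute()
print(result)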
2. Modin
Modin is a library designed to speed up Pandas by utilizing parallel and distributed processing.
According to the official website, “Modin uses Ray, Dask or Unidist to provide an effortless way to speed up your pandas notebooks, scripts, and libraries.
Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical.”
This means that we can use Modin and Dask together.
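As a minimal sketch, assuming Modin was installed with the Ray engine (pip install "modin[ray]"; there is also "modin[dask]"), speeding up existing pandas code is typically just a change of the import line. The data below is illustrative:

import modin.pandas as pd  # drop-in replacement for: import pandas as pd

# The DataFrame constructor is identical to pandas
df = pd.DataFrame({"passenger_count": [1, 2, 1], "trip_distance": [2.5, 1.1, 3.7]})

# Familiar pandas operations now run on the configured engine (Ray, Dask or Unidist)
print(df.groupby("passenger_count")["trip_distance"].mean())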
3. Koalas
Koalas is a library that provides a Pandas-like API on top of Apache Arrow and Apache Spark.
According to the official website, “The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.
Pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With this package, you can:
Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).”
Here's a tutorial on working with dataframes using Koalas:
Install Koalas
You can install Koalas using pip:
pip install koalas
Be aware that Koalas requires PySpark, so please make sure PySpark is available.
Import Koalas and Create Dataframe
Start by importing Koalas and creating a Koalas dataframe.
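A minimal sketch with illustrative data, using the standard Koalas import:

import databricks.koalas as ks

# Create a Koalas dataframe; under the hood it is backed by a Spark DataFrame
kdf = ks.DataFrame({
    "passenger_count": [1, 2, 1],
    "trip_distance": [2.5, 1.1, 3.7],
})

# Pandas-like operations are translated into Spark jobs
print(kdf.groupby("passenger_count")["trip_distance"].mean())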
UPDATE 2023/06/21: According to some followers, the Koalas library has been deprecated. I have also received some comments about DuckDB; I will do some analysis :)