Luís Oliveira

Summary

The article presents three alternatives to Pandas, Polars, and PySpark for handling dataframes in Python: Dask, Modin, and Koalas. It provides a short tutorial for each.

Abstract

The article "Three Alternatives to Pandas, Polars and PySpark to Work With Data In Python" delves into the capabilities and usage of Dask, Modin, and Koalas as efficient alternatives for dataframe manipulation in Python. It outlines the benefits of these libraries, such as Dask's parallel computing for large datasets, Modin's seamless integration with existing pandas code, and Koalas' compatibility with Apache Arrow and Apache Spark. The author includes detailed instructions on installing each library, creating dataframes, performing transformations, and executing computations, along with the libraries' integration with distributed computing systems. The article emphasizes the ease of transition for users already familiar with Pandas and provides resources for further exploration, such as official documentation and tutorials.

Opinions

  • The author suggests that Dask is an ideal choice for working with dataframes that exceed the system's memory capacity.
  • Modin is highlighted for its ability to accelerate Pandas' performance without significant changes to existing codebases, leveraging Ray, Dask, or Unidist.
  • Koalas is praised for its Pandas-like API, which facilitates a smooth learning curve for those transitioning from Pandas to distributed computing with Apache Spark.
  • The article indicates a preference for these alternative libraries due to their specialized features for handling big data and their potential to enhance productivity for data scientists.
  • The author hints at the deprecation of the Koalas library, suggesting consideration of other options like DuckDB for future data handling needs.

Three Alternatives to Pandas, Polars and PySpark to Work With Data In Python

Source: David Clode

When it comes to working with data in Python, Pandas, Polars, and PySpark have long been popular choices.

If you read my previous articles, you saw some performance comparisons between these three libraries:

Between Pandas and Polars

Between Polars and PySpark

However, there are several alternative libraries available that offer unique features and capabilities for handling dataframes.

In this article, I will:

  • Explore three alternatives: Dask, Modin, and Koalas;
  • Provide a quick tutorial on how to start working with each of them.

1. Dask

Dask is a library that brings parallel computing and out-of-memory execution to the world of data analysis.

It allows you to efficiently handle large datasets that don’t fit into memory by breaking them into smaller partitions and executing computations in parallel.

With a Pandas-like API, Dask seamlessly integrates into your existing data processing workflows, making it a powerful choice for scaling your computations.

“A Dask DataFrame is a large parallel DataFrame composed of many smaller pandas DataFrames, split along the index.

These pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster.

One Dask DataFrame operation triggers many operations on the constituent pandas DataFrames.”

According to the official website: “Dask is included by default in Anaconda.

You can also install Dask with Pip, or you have several options for installing from source. You can also use Conda to update Dask or to do a minimal Dask install.”

Here’s a step-by-step tutorial on working with dataframes using Dask, adapted from the official website:

Install Dask

You can install Dask using pip:

pip install dask
pip install "dask[distributed]"

Import Dask and Create Dataframe

Start by importing the necessary modules and creating a Dask dataframe.

import dask.dataframe as dd
from dask.distributed import Client

def read_dask_parquet():
    # Lazily read only the needed columns from every matching file
    ddf = dd.read_parquet(
        "yellow_tripdata/yellow_tripdata_2014*.parquet",
        columns=["passenger_count", "trip_distance"],
        storage_options={"anon": True},  # anonymous access when reading from remote storage
    )
    return ddf

Perform Operations

You can now perform various operations on the Dask dataframe, similar to Pandas.

def transformation(ddf):
    # Lazy: this only builds a task graph; nothing runs until .compute()
    ddf = ddf.groupby("passenger_count").trip_distance.mean()
    return ddf

Execute Computations

To execute computations and get the results, use the .compute() method.

final_df = transformation(read_dask_parquet())
print(final_df.compute())

Distributed Computing

Dask supports distributed computing, enabling parallel execution on multiple cores or a cluster.

You can set up a Dask cluster and scale your computations accordingly.

This is a brief overview of using Dask for working with dataframes.

You can check the official Dask documentation for more detailed tutorials and examples.

2. Modin

Modin is a library designed to speed up Pandas by utilizing parallel and distributed processing.

According to the official website: “Modin uses Ray, Dask or Unidist to provide an effortless way to speed up your pandas notebooks, scripts, and libraries.

Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical.”

This means that we can use Modin and Dask together.
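Picking the engine happens through Modin's `MODIN_ENGINE` environment variable, set before the first `import modin.pandas`; a minimal sketch (the actual import is left commented so it only runs where Modin is installed):

```python
import os

# Choose the execution backend before modin.pandas is imported.
# "ray", "dask" and "unidist" are the engines Modin supports.
os.environ["MODIN_ENGINE"] = "dask"

# From this point on, modin.pandas operations run on the Dask engine:
# import modin.pandas as pd
```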

Here’s a quick tutorial on working with dataframes using Modin, based on https://modin.readthedocs.io/en/stable/getting_started/quickstart.html:

Install Modin

You can install Modin using pip:

pip install modin

Import Modin and Create Dataframe

Start by importing Modin and creating a Modin dataframe.

import modin.pandas as pd

def read_csv(s3_path):
    modin_df = pd.read_csv(
        s3_path,
        parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
        quoting=3,
    )
    return modin_df

Transformations

You can perform various operations on the Modin dataframe, just like you would with Pandas.

def transformation(big_modin_df):
    rounded_trip_distance_modin = big_modin_df["trip_distance"].apply(round)
    return rounded_trip_distance_modin

These are the basic steps involved in working with dataframes using Modin.

For more detailed tutorials and examples, you can check the Modin documentation.

3. Koalas

Koalas is a library that provides a Pandas-like API on top of Apache Arrow and Apache Spark.

According to the official website: “The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.

Pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With this package, you can:

  • Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
  • Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).”
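That "single codebase" point can be illustrated with a function that never mentions which backend it runs on; only the pandas call executes here, and the Koalas line is left commented since it needs a Spark installation:

```python
import pandas as pd

def average_distance(df):
    # Identical code path for pandas and Koalas, since Koalas mirrors the pandas API
    return df.groupby("passenger_count")["trip_distance"].mean()

pdf = pd.DataFrame({
    "passenger_count": [1, 1, 2],
    "trip_distance": [2.0, 4.0, 1.0],
})
print(average_distance(pdf))  # passenger_count 1 -> 3.0, 2 -> 1.0

# The same function on a distributed Koalas DataFrame:
# import databricks.koalas as ks
# print(average_distance(ks.from_pandas(pdf)))
```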

Here's a tutorial on working with dataframes using Koalas:

Install Koalas

You can install Koalas using pip:

pip install koalas

Be aware that Koalas requires PySpark, so make sure PySpark is available in your environment.

Import Koalas and Create Dataframe

Start by importing Koalas and creating a Koalas dataframe.

import databricks.koalas as ks


def read_koalas_parquet():
    df = ks.read_parquet(
        "yellow_tripdata/yellow_tripdata_2014*.parquet",
        columns=["passenger_count", "trip_distance"],
    )
    return df

Transformations

You can perform various operations on the Koalas dataframe, similar to Pandas.

def transformation(df):
    average_distance = df.groupby("passenger_count").mean()["trip_distance"]
    return average_distance

Compatibility with Spark Ecosystem

Koalas seamlessly integrates with the Spark ecosystem.

You can convert Koalas dataframes to Spark DataFrames to leverage the full power of Spark for distributed computing and scale your computations.

This is a brief overview of working with dataframes using Koalas.

You can explore the Koalas documentation for more detailed tutorials and examples.

UPDATE 2023/06/21: According to some followers, the Koalas library has been deprecated. I have also received some comments about DuckDB; I will do some analysis :)

Extra Resources:

DASK

MODIN

KOALAS

Did you like this article? Follow me for more articles on Medium.

