Putting Pandas (Python) on Fire: Get a 25x Performance Improvement

Data analysis is a fundamental aspect of any data-driven project. As datasets grow larger and more complex, the performance of our data analysis tools becomes crucial.
Pandas, a popular Python library, has long been a go-to choice for data manipulation and analysis. However, as the size of datasets expands, the limitations of Pandas in terms of speed and memory usage become apparent.
Fortunately, there are alternative libraries that address these challenges head-on. Vaex, Modin (running on Ray), and Dask are emerging tools in the data analysis landscape that offer high-performance solutions for large-scale data processing. In this blog, we will dive into a comparative analysis of Pandas, Vaex, Modin/Ray, and Dask, focusing on their execution time for two common data manipulation tasks: group by and apply.
So, fasten your seatbelts as we embark on a journey to uncover the speed champions of data analysis in Python.
Creating the dataset
We will use the following synthetic dataset, 5 million rows and four integer columns with random values from 0 to 49, to compare the performance of Pandas, Vaex, Modin/Ray, and Dask.
import pandas as pd
import timeit
import numpy as np
df = pd.DataFrame(np.random.randint(0, 50, size=(5_000_000, 4)), columns=('a','b','c','d'))

We will be comparing the run time for aggregation (group by) and the apply function.
The apply function will be used to call the following sum function:
def sum_row(row):
    return row["a"] + row["b"] + row["c"] + row["d"]
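Before timing the function on 5 million rows, it can help to check it on a tiny frame first. The sketch below is only a sanity check: the two-row frame `small` and its values are made up for illustration and are not part of the benchmark.

```python
import pandas as pd

def sum_row(row):
    # Same row-wise sum as in the article
    return row["a"] + row["b"] + row["c"] + row["d"]

# Tiny hypothetical frame, used only to check the function
small = pd.DataFrame({"a": [1, 5], "b": [2, 6], "c": [3, 7], "d": [4, 8]})

# axis=1 passes one row at a time to sum_row
row_sums = small.apply(sum_row, axis=1)
print(row_sums.tolist())  # [10, 26]
```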
Now, let’s start with the comparison:
Pandas
In this section, we will look at the time taken by pandas to perform group-by and apply operations on our dataset.
Group by
%%time
output = df.groupby("a").count()
# CPU times: user 128 ms, sys: 81.7 ms, total: 209 ms
# Wall time: 254 ms
The group by operation in pandas took a total time of 254ms.
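A note on `%%time`: it is an IPython/Jupyter cell magic, so it will not work in a plain Python script. If you want to reproduce the measurements outside a notebook, a rough stand-in could look like the sketch below. The `timed` helper is a hypothetical name, not part of any library, and it reports wall time only, not the CPU times that `%%time` also prints.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Record wall-clock time around the enclosed block
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} s")

# Placeholder workload; swap in df.groupby("a").count() or any other call
with timed("sum of 1M integers") as t:
    total = sum(range(1_000_000))
```

After the block finishes, `t["seconds"]` holds the elapsed wall time, so the numbers can be collected and compared across libraries.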
Apply function
%%time
output = df.apply(sum_row, axis = 1)
# CPU times: user 46.5 s, sys: 797 ms, total: 47.3 s
# Wall time: 47.5 s
The apply function in pandas took a total of 47.5 seconds.
Modin/Ray
Modin is a drop-in replacement for the pandas API that runs on a parallel execution engine such as Ray, a distributed computing framework.
Modin partitions the dataframe and uses Ray to distribute the work across multiple cores, or even a cluster, which can reduce processing time without changing much of your pandas code.
We will start by converting the pandas dataframe into the Modin dataframe.
# !pip install "modin[ray]"
import modin.pandas as mpd  # separate alias so the original pandas `pd` import is not shadowed
import ray
ray.init()
df_ray = mpd.DataFrame(df)
Group by
%%time
output = df_ray.groupby("a").count()
# CPU times: user 32.6 ms, sys: 14.3 ms, total: 46.9 ms
# Wall time: 179 ms
The group by operation with Modin took 179 ms of wall time, about 1.4x faster than pandas.
Apply
%%time
output = df_ray.apply(sum_row, axis = 1)
# CPU times: user 88.3 ms, sys: 49.9 ms, total: 138 ms
# Wall time: 25.8 s
The apply function with Modin took a total of 25.8 seconds, which is about 1.8x faster than the pandas apply function.
Vaex
Vaex handles big datasets by utilizing memory-mapping and lazy computing techniques, which minimize the need for data movement and computation.
With Vaex, you can perform a wide range of operations on large datasets, including filtering, aggregation, visualization, and machine learning, all while maintaining excellent performance.
Let’s start by creating a vaex df.
import time
# Using Vaex
import vaex
df_vaex = vaex.from_pandas(df)
Groupby
%%time
output = df_vaex.groupby(by = "a").agg({'a':'count'})
# CPU times: user 139 ms, sys: 17.1 ms, total: 156 ms
# Wall time: 44.9 ms
The group by operation with Vaex took ~45 ms of wall time, about 5.7x faster than pandas.
Apply
To use apply() in Vaex, we define a custom function (or a lambda) that takes the column values as separate arguments, and then pass the columns to the apply() method. Here's an example:
def sum_row2(a, b, c, d):
    return a + b + c + d
Calling the function in the code below:
%%time
output_vaex = df_vaex.apply(sum_row2, arguments = [df_vaex.a,df_vaex.b,df_vaex.c,df_vaex.d])
output_vaex = output_vaex.evaluate()
# CPU times: user 152 ms, sys: 175 ms, total: 327 ms
# Wall time: 1.84 s
The apply function with Vaex took about 1.8 seconds, which is roughly 25x faster than pandas. This is super duper fast.
Dask
Dask also implements parallel processing. It splits a dataframe into partitions that can be processed in parallel, and it uses out-of-core techniques to handle datasets that are larger than memory.
Let’s start by converting our pandas dataframe into a Dask dataframe and setting the number of partitions to 6. You can set this argument based on the number of cores in your machine.
import dask.dataframe as dd
df_dask = dd.from_pandas(df, npartitions=6)
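If you would rather derive the partition count from the machine than hard-code 6, a minimal sketch using only the standard library is shown below. One partition per logical core is just an assumed starting point here, not a Dask recommendation; the best value depends on data size and workload.

```python
import os

# os.cpu_count() can return None on some platforms, so fall back to 1
n_cores = os.cpu_count() or 1

# Assumption for illustration: one partition per logical core
npartitions = n_cores
print(f"{n_cores} logical cores -> npartitions={npartitions}")
```

The resulting value can then be passed as `dd.from_pandas(df, npartitions=npartitions)`.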
Groupby
%%time
ag_dask = df_dask.groupby("a").count().compute()
# CPU times: user 138 ms, sys: 79.8 ms, total: 218 ms
# Wall time: 153 ms
The group by operation with Dask took ~153 ms of wall time, about 1.7x faster than the pandas dataframe.
Apply
When calling apply on a Dask dataframe, we need to tell Dask the name and datatype of the output through the meta argument. The code below returns a series named “x” declared with datatype “float” (f8). Dask evaluates lazily, so the call to compute() is what actually runs the apply.
%%time
output_dask = df_dask.apply(sum_row, axis = 1, meta=('x', 'f8')).compute()
# CPU times: user 5.75 s, sys: 116 ms, total: 5.87 s
# Wall time: 5.99 s
The apply function with Dask took a total of ~6 seconds, which is about 8x faster than pandas.

Conclusion
Parallel processing libraries such as Vaex, Modin (on Ray), and Dask can be a great alternative to pandas if you are working on huge datasets.
From the above comparison, we can conclude that Vaex was the fastest of the 3 alternatives on this dataset (at least for the group by and apply operations): about 5.7x faster than pandas for group by and about 25x faster for apply.
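The speedups quoted in each section are simply the pandas wall time divided by the other library's wall time. The short script below recomputes them from the measurements reported in this post, so the rounding is easy to check. The dictionary layout and the names `wall_times` and `speedups` are my own choices for illustration.

```python
# Wall times in seconds, copied from the measurements above
wall_times = {
    "groupby": {"pandas": 0.254, "modin": 0.179, "vaex": 0.0449, "dask": 0.153},
    "apply": {"pandas": 47.5, "modin": 25.8, "vaex": 1.84, "dask": 5.99},
}

# Speedup = pandas wall time / library wall time
speedups = {
    op: {lib: times["pandas"] / t for lib, t in times.items() if lib != "pandas"}
    for op, times in wall_times.items()
}

for op, ratios in speedups.items():
    for lib, ratio in ratios.items():
        print(f"{op:8s} {lib:6s} {ratio:5.1f}x")
```

Keep in mind that these are single runs on one machine, so the ratios are indicative rather than exact.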
Try to use these libraries if you are working on big datasets and share your experiences in the comments.
Thank You!