avatarSoner Yıldırım

Summary

The article compares the performance of Pandas and Vaex for large dataset operations, demonstrating Vaex's superior speed in reading data, filtering, sorting, and creating new columns.

Abstract

The article "4 Examples to Compare the Speed of Pandas and Vaex" illustrates the performance limitations of Pandas when handling very large datasets due to its in-memory analytics approach. It introduces Vaex, a Python library designed for efficient data manipulation and analysis with features such as memory-mapping, lazy execution, virtual columns, and non-copying filtering and selection operations. Through a series of examples using a 10 million row dataset, the author showcases Vaex's significantly faster performance in data reading, filtering by category, sorting by single and multiple columns, and adding new columns based on calculations. The comparison reveals that Vaex consistently outperforms Pandas, completing tasks in mere milliseconds or seconds that take Pandas several seconds to minutes. The article concludes by emphasizing Vaex's potential for even larger datasets and its capability to handle complex operations and data visualizations, suggesting that data scientists should consider Vaex as a powerful alternative to Pandas for big data applications.

Opinions

  • The author expresses admiration for Pandas but acknowledges its limitations with very large datasets.
  • Vaex is presented as a highly efficient alternative to Pandas, particularly for large datasets that exceed available memory.
  • The syntax similarity between Pandas and Vaex is noted as a benefit for users transitioning to Vaex.
  • The author provides evidence that Vaex's performance advantages are substantial, with speed improvements ranging from 10x to over 100x in the examples provided.
  • The author encourages readers to experiment with the dataset used in the examples and suggests that Vaex is capable of handling even larger datasets, up to 1 billion rows.
  • The article hints at future content from the author on Vaex, indicating a commitment to educating readers on leveraging this tool for big data challenges.

4 Examples to Compare the Speed of Pandas and Vaex

Performance matters more as the data size increases.

Photo by Ross Sneddon on Unsplash

I admire Pandas. It’s one of the first tools I learned in my data science journey and I have been using it frequently ever since.

Pandas was the only tool I needed to do data cleaning, manipulation, and analysis until I had to work with very large datasets.

Pandas starts to slow down when the data size becomes very large because it does in-memory analytics. Hence, if the dataset is larger than memory, it becomes very difficult, or impossible, to use Pandas.

I wrote about some tips for making Pandas more efficient when working with large datasets. In this article, we will learn about an alternative tool to Pandas when working with large datasets: Vaex.

What is Vaex?

Vaex is also a Python library and can be used for data analysis and manipulation. The key features of Vaex makes it outperform Pandas when working with large datasets.

Some of these key features:

  • Memory-mapping
  • With lazy execution, calculations are performed only when needed. Vaxes uses expression objects to keep track of the executions.
  • Virtual columns which are treated as regular columns but do not occupy memory
  • Filtering and selection operations do not make a copy of data.

These features come together to bring a performance-boost as we will see in the following examples.

Reading data

I prepared a dataset with 10 million rows and 51 columns, which is a little over 1 GB of data. It is not too much and the files can easily go up to 1 billion rows but 10 million rows are enough to demonstrate our case.

Let’s start with reading the dataset into Pandas and Vaex DataFrames. Both libraries have the read_csv function to create a DataFrame from a CSV file.

import vaex
vdf = vaex.read_csv("very_large_dataset.csv")
vdf.shape
# output
(10000000, 51)

import pandas as pd
df = pd.read_csv("very_large_dataset.csv")
df.shape
# output
(10000000, 51)

As we see in the output, the dataset has 10 million rows and 51 columns.

I think here is a good place to mention that Vaex’s syntax is quite similar to that of Pandas, which makes it easier to adopt for Pandas users.

By the way, if you’d like to generate this dataset and play around with it, here is the code to do it:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 50)))
df = df.rename(columns={i:f"x_{i}" for i in range(50)})
df["category"] = ["A", "B", "C", "D"] * 2500000

The first 50 columns contain integers between 0 and 100 and the last column contains string values of A, B, C, and D.

Example 1

Let’s do a filtering operation by selecting rows that belong to category A. We will be timing the operations so that we can compare the performance of Vaex and Pandas.

# Pandas
%time df_a = df[df["category"]=="A"]
CPU times: user 518 ms, sys: 826 ms, total: 1.34 s

# Vaex
%time vdf_a = vdf[vdf["category"]=="A"]
CPU times: user 4.03 ms, sys: 6.55 ms, total: 10.6 ms

It took Pandas 1.3 seconds whereas Vaex performs the same operation in 10 milliseconds. This is a highly significant difference, which will even be higher when working with datasets of bigger size.

Example 2

In the second example, we will sort the DataFrame by the values in a column.

# Pandas
%time df_sorted = df.sort_values(by="x_1", ascending=False)
CPU times: user 2.2 s, sys: 1.3 s, total: 3.5 s

# Vaex
%time vdf_sorted = vdf.sort(by=["x_1"], ascending=["False"])
CPU times: user 110 ms, sys: 57.7 ms, total: 168 ms

Vaex was able to sort the rows in 168 milliseconds whereas it took Pandas over 3.5 seconds. Again, this is a significant difference.

Example 3

Let’s take the previous example one step further and sort the rows by two columns.

# Pandas
%time df_sorted = df.sort_values(by=["x_1", "x_2"], ascending=[False, True])
CPU times: user 2.92 s, sys: 1.33 s, total: 4.25 s

# Vaex
%time vdf_sorted = vdf.sort(by=["x_1", "x_2"], ascending=["False", "True"])
CPU times: user 1.14 s, sys: 90.1 ms, total: 1.23 s

Vaex still outperforms Pandas with 1.23 seconds compared to 4.25 seconds.

Example 4

In the last example, we will create new columns by using other columns.

# Pandas
%time df["new"] = df["x_1"]**2 + df["x_2"]**2
CPU times: user 34.8 ms, sys: 114 ms, total: 149 ms

# Vaex
%time vdf["new"] = vdf["x_1"]**2 + vdf["x_2"]**2
CPU times: user 3.25 ms, sys: 5.97 ms, total: 9.22 ms

The new column contains the sum of the square of values in the x_1 and x_2 columns. Vaex did much better than Pandas in this operation too.

Conclusion

We did 4 examples and Vaex was much faster than Pandas in all of them:

Pandas vs Vaex (image by author)

The difference in performance will be more when we work with larger datasets (e.g. 1 billion rows).

The examples we did cover only a simple set of operations. Just like Pandas, Vaex is capable of performing more complex operations. It’s also quite simple to create data visualizations with Vaex, which comes in handy for analyzing large tabular datasets.

I will write more articles and tutorials on Vaex so stay tuned if you want to learn more about handling big data.

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.

Thank you for reading. Please let me know if you have any feedback.

Data Science
Python
Big Data
Programming
Machine Learning
Recommended from ReadMedium