avatarXiaoxu Gao

Summary

This article compares the performance of three Python libraries, Pandas, Dask, and Datatable, for processing CSV files, focusing on reading, combining, and writing multiple CSV files, as well as computing aggregations.

Abstract

The article discusses the limitations of using Pandas for processing large CSV files and introduces two alternatives, Dask and Datatable, which provide better performance and scalability. The author shares their experience with combining 30K+ CSV files and presents a comparison of the three libraries based on their performance in reading, combining, and writing multiple CSV files, as well as computing aggregations. The results show that Dask outperforms Pandas and Datatable in most cases, while Datatable provides a safer approach when dealing with different CSV file formats. The article concludes with a discussion of the advantages and disadvantages of each library and encourages readers to share their experiences with the community.

Bullet points

  • The author encountered performance issues when using Pandas to combine 30K+ CSV files.
  • Dask and Datatable are introduced as alternatives to Pandas for processing large CSV files.
  • Dask is designed to scale analytics packages to multi-core machines and distributed clusters.
  • Datatable aims to perform large data processing on a single-node machine at maximum speed.
  • The author compares the performance of the three libraries in reading, combining, and writing multiple CSV files, as well as computing aggregations.
  • Dask outperforms Pandas and Datatable in most cases, while Datatable provides a safer approach when dealing with different CSV file formats.
  • Dask supports Apache Parquet format, which provides efficient data compression and encoding schemes with enhanced performance.
  • The article concludes with a discussion of the advantages and disadvantages of each library and encourages readers to share their experiences with the community.

Pandas vs Dask vs Datatable: A Performance Comparison for processing CSV files

Pandas might not be the best option anymore

Photo by Martin Reisch on Unsplash

When it comes to processing CSV files, the first tool that appears in everyone’s mind is pandas. There is no doubt that pandas is a great framework, the dataframe provides an extremely streamlined form of data representation that helps us to analyze and understand data better.

Recently, I did a task that required me to combine 30K+ CSV files. My god feeling was to use pandas, but it didn’t go well because of the performance of certain file operations. In this article, I want to share with you 2 other alternatives and compare their behavior and performance with pandas. By the end, you will understand the tradeoff of each library and be able to make the right choice.

I’m very surprised about some of the results coming out from the experiments as there is no way to predict it unless you actually try it. I hope this article can save your trial-and-error time and help you make a better decision on choosing the library.

Dask

A problem with most data analytics Python libraries like Numpy, pandas, and scikit-learn is that they are not designed to scale beyond a single machine. Dask is an open-source library that provides advanced parallelization for analytics when you work with large data. It natively scales these analytics packages to multi-core machines and distributed clusters whenever needed. It provides a similar API interface as pandas to ensure consistency and minimize friction.

Datatable

Datatable is another Python library with performance in mind. Unlike dask, the goal of datatable is to perform large data processing on a single-node machine at the maximum speed possible. Meanwhile, its interoperability with pandas provides the ability to convert to another data-processing framework with ease.

Both of these 2 libraries have the intention of improving pandas’ performance and keeping similar interfaces as pandas for ease of use. In the following sections, I ran experiments on my Macbook Pro (2,6 GHz 6-Core Intel Core i7, 16GB memory), you might get different results when running in a different environment.

Read a single CSV file

Let’s start with the simplest operation — read a single CSV file. To my surprise, we can already see a huge difference in the most basic operation. Datatable is 70% faster than pandas while dask is 500% faster! The outcomes are all sorts of DataFrame objects which have very identical interfaces.

Read multiple CSV files

This is where I got stuck when I tried to accomplish this task using pandas. pandas doesn’t provide an interface that can read multiple CSV files in one line. The only way is to do a for loop and append each dataframe into a list, in the end, using pd.concat to combine all these dataframes. But this is quite inefficient. Let’s checkout the following:

This result is also interesting because it turns out that datatable performs worse than pandas in terms of reading multiple CSV files which is the opposite of my expectations. Nevertheless, dask still wins the game and performs 4 times better than pandas.

Caveat — CSV files have different formats

When combining multiple CSV files, there is a chance that CSV files have different formats because of different versions or corrupted data files. In that case, we should be careful to not unintentionally mix up different CSV files.

New column

In this example, I prepared 2 files and 1 of them has an extra column. I’m trying to mimic a real-world scenario where there is a minor schema change in the source file.

Is this what you expected?

Pandas will use the new schema as the target schema and the new column will be backfilled as NaN in the old data. The order in the list files doesn’t matter because pandas will select the schema with more columns regardless of the reading sequence.

Dask, on the other hand, surprised me a lot. It uses the schema of the first file in the list as the target schema and ignores the files which don’t match. When I reverse files , I got completely different results because a different file was read first. It’s kind of risky as it will potentially ignore many rows without notifying you. A workaround is to read the header of every file and compare them before combining. Although this creates a little bit of overhead, the total processing time is still faster than pandas given the huge performance gain in the ‘combine’ stage.

Compared to the others, datatable is rather on the safe side. It will raise an exception if it finds a difference in the schema. You can add force=True to rbind function, so it will have the same behavior as pandas.

Rename an existing column

Another common schema change is to rename an existing column which is a breaking change. In this example, data.csv contains only gross_amount and data_rename_col.csv contains only net_amount. After seeing the previous example, let’s have a guess before checking out the results.

So, the outcome of pandas contains both gross_amount and net_amount, and it fills the missing value with NaN. Dask gives the same result as before which depends on the file it reads first. Datatable, in this case, throws a different exception with the name of the breaking column. This is helpful during the debug. In general, they all inherit the same behavior from the previous example.

Computing Aggregations

An essential piece of analysis is computing aggregations like min, max, sum, mean, and median in which a single number gives insight into the entire (partial) dataset. So how is the performance for these computations?

According to dask documentation: Generally speaking, Dask.dataframe groupby-aggregations are roughly same performance as pandas groupby-aggregations, just more scalable.

The performance for computing aggregations is equally the same. But compared to pandas, dask is able to scale the solution in a cluster.

Writing to CSV files

The last part is the off-loading. All of these 3 libraries have the same interface .to_csv() to save a dataframe into a CSV. Other than that, dask also supports Apache Parquet format which is a popular, columnar binary format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

As you can see, the winner is dask with Parquet format. Pandas and datatable perform equally the same. Dask with regular CSV format performs the worst which is quite opposite to the performance for reading CSV files. The high performance of parquet is due to the fact that data is split into several partitions. By default, dask will load each parquet file individually as a partition in the dataframe which is easier for parallel loading.

Besides, dask writes out parquet files with snappy compression by default. Snappy compression is usually the best for files in distributed computation. Although it doesn’t compress files as much as other compression algorithms like gzip, it’s faster when decompressing files. You can overwrite it with other compression algorithms as well.

Conclusion

I hope this article can give you a holistic view of different aspects of pandas, dask, and datatable. It turns out that no library is perfect. Dask is good at reading and writing file(s), especially using its parquet format. And it’s able to distribute your solution to a cluster. Datatable tries to mimic pandas' behavior with slightly better performance. Pandas is the core of the other 2 libraries and offers the most complete computation methods.

It’s always interesting to compare different tools and see their advantages and disadvantages. I would encourage you to do such exercises in your interested area as well and share it with the community. It will benefit so many people.

Let me know what you think about these 3 libraries! Cheers!

Reference

Python
Programming
Software Development
Data Science
Editors Pick
Recommended from ReadMedium