Satyam Kumar

Summary

This article compares the performance of various data formats, including CSV, JSON, Pickle, Parquet, and Feather, in terms of reading time, saving time, and memory consumption in Pandas.

Abstract

The article titled "Stop saving your Data frame in CSV format" discusses the limitations of using CSV, Excel, or other text file formats for working with large datasets or frequent reading and saving operations. It introduces eight data formats (CSV, Compressed CSV, JSON, Pickle, Compressed Pickle, Parquet, HDF, and Feather) and compares their performance in terms of reading time, saving time, and memory consumption. The author uses the New York City Taxi Duration dataset from Kaggle to compare the benchmark numbers. The article provides recommendations for different use cases, such as using Pickle, Feather, or Parquet formats to save data between sessions or as intermediate files, and using compressed pickle, parquet, or compressed CSV formats to optimize memory consumption. The article concludes that there is no rule of thumb for choosing a particular data format; the choice depends on the use case.

Bullet points

  • The article discusses the limitations of using CSV, Excel, or other text file formats for working with large datasets or frequent reading and saving operations.
  • The author introduces eight data formats, including CSV, Compressed CSV, JSON, Pickle, Compressed Pickle, Parquet, HDF, and Feather, and compares their performance in terms of reading time, saving time, and memory consumption.
  • The author uses the New York City Taxi Duration dataset from Kaggle to compare the benchmark numbers.
  • The article provides recommendations for different use-cases, such as using Pickle, Feather, or Parquet formats to save data between sessions or intermediate files, and using compressed pickle, parquet, or compressed CSV formats to optimize memory consumption.
  • The article concludes that there is no rule of thumb for choosing a particular data format; the choice depends on the use case.

Stop saving your Data frame in CSV format

Benchmark time comparison of using various data formats for reading and saving operations

Image by Pexels from Pixabay

Data science is all about working with data. The model development pipeline involves data wrangling, exploratory data analysis, feature engineering, and modeling, and reading and saving intermediate files is a common task along the way. Data scientists often read and save Pandas data frames in CSV format. With small or moderately sized data this is easy and adds little overhead, but with a large dataset the workflow slows down because of limited resources.

CSV, Excel, and other text file formats lose their appeal when the data is large or the workflow involves frequent reading and saving operations. Several binary data formats can be preferred over CSV, and the Pandas package supports them. In this article, we compare the memory consumption, saving time, and reading time of various data formats, and conclude with the formats suited to each use case.

Data formats to compare:

In this article, we compare the following 8 data formats on three metrics: reading time with Pandas, saving time with Pandas, and memory consumption on disk.

  • CSV (Comma Separated Value)
  • Compressed CSV
  • JSON
  • Pickle
  • Compressed Pickle
  • Parquet
  • HDF
  • Feather
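As a rough sketch of how each of these formats can be written and read back with Pandas, the snippet below uses a tiny made-up data frame. The file names, the `path` helper, and the choice of gzip for the compressed variants are assumptions for illustration, not details taken from the benchmark. Parquet and Feather need an optional engine such as pyarrow, and HDF needs PyTables, so those three are skipped when the dependency is missing.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative frame (hypothetical values, not the article's dataset).
df = pd.DataFrame({"id": [1, 2, 3], "duration": [455.0, 663.5, 2124.0]})

out_dir = tempfile.mkdtemp()


def path(name):
    """Build a file path inside the temporary output directory."""
    return os.path.join(out_dir, name)


# Formats that need only pandas itself.
df.to_csv(path("data.csv"), index=False)
df.to_csv(path("data.csv.gz"), index=False, compression="gzip")  # gzip is an assumption
df.to_json(path("data.json"))
df.to_pickle(path("data.pkl"))
df.to_pickle(path("data.pkl.gz"), compression="gzip")  # gzip is an assumption

loaded = {
    "csv": pd.read_csv(path("data.csv")),
    "compressed csv": pd.read_csv(path("data.csv.gz")),
    "json": pd.read_json(path("data.json")),
    "pickle": pd.read_pickle(path("data.pkl")),
    "compressed pickle": pd.read_pickle(path("data.pkl.gz")),
}

# Formats that rely on optional dependencies:
# pyarrow (or fastparquet) for Parquet, pyarrow for Feather, PyTables for HDF.
optional = {
    "parquet": (lambda p: df.to_parquet(p), pd.read_parquet, "data.parquet"),
    "feather": (lambda p: df.to_feather(p), pd.read_feather, "data.feather"),
    "hdf": (
        lambda p: df.to_hdf(p, key="df", mode="w"),
        lambda p: pd.read_hdf(p, key="df"),
        "data.h5",
    ),
}
for name, (write, read, filename) in optional.items():
    try:
        write(path(filename))
        loaded[name] = read(path(filename))
    except ImportError as exc:
        print(f"skipping {name}: {exc}")

for name, frame in loaded.items():
    print(name, frame.shape)
```

Pandas infers the compression from the `.gz` extension when reading, so the readers need no extra argument here.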

Data:

I use the New York City Taxi Duration dataset from Kaggle, which has 1,458,644 records and 12 features, to produce the benchmark numbers.

Benchmark:

(Image by Author), Left: Reading and Saving Time Comparison (seconds), Right: Memory Consumption (MB)
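Numbers like these could be collected with a small timing harness. The sketch below is not the code behind the figure: the `benchmark` function, the synthetic data frame, and its column names are hypothetical stand-ins, and only three of the eight formats are included to keep it short. Timings will vary by machine, so no expected values are shown.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def benchmark(df, writer, reader, file_path):
    """Return (save_seconds, read_seconds, size_mb) for one format."""
    start = time.perf_counter()
    writer(df, file_path)
    save_s = time.perf_counter() - start

    start = time.perf_counter()
    reader(file_path)
    read_s = time.perf_counter() - start

    size_mb = os.path.getsize(file_path) / 1e6
    return save_s, read_s, size_mb


# Synthetic stand-in data (assumption: NOT the Kaggle dataset used in the article).
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "vendor": rng.integers(1, 3, size=n_rows),
    "duration": rng.integers(60, 3600, size=n_rows),
    "longitude": rng.normal(-73.97, 0.05, size=n_rows),
})

# name -> (writer, reader, file name)
formats = {
    "csv": (lambda d, p: d.to_csv(p, index=False), pd.read_csv, "bench.csv"),
    "pickle": (lambda d, p: d.to_pickle(p), pd.read_pickle, "bench.pkl"),
    "compressed pickle": (
        lambda d, p: d.to_pickle(p, compression="gzip"),
        pd.read_pickle,
        "bench.pkl.gz",
    ),
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, (writer, reader, filename) in formats.items():
        results[name] = benchmark(df, writer, reader, os.path.join(tmp, filename))

for name, (save_s, read_s, size_mb) in results.items():
    print(f"{name:18s} save={save_s:.3f}s read={read_s:.3f}s size={size_mb:.2f} MB")
```

A single run per format is noisy; repeating each measurement several times and averaging gives steadier numbers.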

Recommendations:

Based on the benchmark numbers for reading time, saving time, and memory consumption of the file formats discussed above, the following recommendations apply to different use cases:

  • To save data between sessions or as intermediate files, use pickle, feather, or parquet formats. Pickle is the most preferred and recommended, as it has the lowest reading and saving time.
  • To save data in the smallest size, that is, with the lowest memory consumption on disk, use compressed pickle, parquet, or compressed CSV formats. Compressed pickle is the most recommended, as it has the lowest memory consumption (78% smaller than the CSV format).
  • The compressed pickle format has the lowest memory consumption but takes a long time to save the data. Parquet is an alternative with comparatively faster reading and saving times, and it is 70% smaller than the standard CSV format.
  • To save a very large data frame, use HDF format.
  • To read the data on another platform (not Python) that doesn't support other formats, use CSV or compressed CSV formats.
  • To review and inspect the data using Excel, Google Sheets, or Notepad, use CSV format.
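The percentages in the list above read most naturally as a reduction in file size relative to CSV. A minimal sketch of that arithmetic follows; the function name `size_reduction_pct` and the byte counts are hypothetical, chosen only to illustrate the formula.

```python
def size_reduction_pct(csv_bytes, other_bytes):
    """Percentage by which a file is smaller than the CSV baseline."""
    if csv_bytes <= 0:
        raise ValueError("csv_bytes must be positive")
    return (csv_bytes - other_bytes) / csv_bytes * 100


# Hypothetical sizes for illustration only (not the article's measurements).
csv_size = 200_000_000
compressed_pickle_size = 44_000_000

reduction = size_reduction_pct(csv_size, compressed_pickle_size)
print(f"compressed pickle is {reduction:.0f}% smaller than CSV")
```

On real files, `os.path.getsize` gives the byte counts to pass in.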

Conclusion:

In this article, we discussed 8 data formats that can be used to save raw and intermediate data, and compared their benchmark times and memory consumption. The recommendations above are drawn from the benchmark numbers and domain knowledge. There is no rule of thumb for choosing a particular data format; it depends on the use case.

Parquet, compressed Pickle, compressed CSV, and Feather formats can be used to reduce the memory consumption of the data on disk. These file formats can reduce it by up to 78% compared to the standard CSV format.

Pickle, Parquet, and Feather file formats can be preferred over the standard CSV format because they read and save faster.

After reading the data into a Pandas data frame, one can reduce memory usage by downcasting the datatype of each feature. The implementation and usage are covered in a separate article.
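One possible way to downcast numeric columns is `pd.to_numeric` with its `downcast` argument. This is a minimal sketch under that assumption, not necessarily the approach the author has in mind; the helper name `downcast_numeric` and the sample columns are made up for illustration.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy with integer and float columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # Downcasting floats (to float32 at the smallest) can lose precision.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "passenger_count": np.array([1, 2, 1, 5], dtype=np.int64),
    "distance_km": np.array([1.5, 3.25, 0.75, 12.0], dtype=np.float64),
})

small = downcast_numeric(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"memory: {before} bytes -> {after} bytes")
```

Integer downcasting is lossless, while float downcasting trades precision for size, so it is worth checking whether the reduced precision is acceptable for columns such as coordinates.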

References:

[1] Pandas Documentation: https://pandas.pydata.org/docs/reference/io.html

Thank You for Reading
