Stop Using CSVs for Storage — This File Format Is 150 Times Faster
CSVs are costing you time, disk space, and money. It’s time to end it.
CSV is not the only data storage format out there. In fact, it’s likely the last one you should consider. If you don’t plan to edit the saved data manually, you’re wasting both time and money by sticking to it.
Picture this — you collect large volumes of data and store them in the cloud. You didn’t do much research on file formats, so you opt for CSVs. Your expenses are through the roof! A simple tweak can reduce them by half, if not more. That tweak is — you’ve guessed it — choosing a different file format.
Today you’ll learn the ins and outs of the Feather data format — a fast and lightweight binary format for storing data frames.
What exactly is Feather?
Put simply, it’s a data format for storing data frames (think Pandas). It’s designed around a simple premise — to push data frames in and out of memory as efficiently as possible. It was initially designed for fast communication between Python and R, but you’re not limited to this use case.
So, no, Feather isn’t limited to Python and R: most major programming languages have a library that can work with Feather files.
The data format was not originally designed for long-term storage. The original intention was quick exchange between R and Python programs, and short-term storage in general. No one can stop you from dumping Feather files to disk and leaving them for years, but there are more efficient formats.
In Python, you can work with Feather through Pandas or a dedicated library. The article will show you how to work with both. You’ll need to install feather-format to follow along. Here are the Terminal commands:
# Pip
pip install feather-format
# Anaconda
conda install -c conda-forge feather-format
That’s all you need to get started. Open up JupyterLab or any other data science IDE, as the next section covers the basics of Feather.
How to work with Feather in Python?
Let’s start simple by importing libraries and creating a relatively large dataset. You’ll need Feather, NumPy, and Pandas to follow along. The dataset will have five columns and 10M rows of random numbers:
import feather
import numpy as np
import pandas as pd

# Seed NumPy's global generator so the random columns are reproducible.
# np.random.seed is a function and must be called, not assigned to.
np.random.seed(42)

df_size = 10_000_000

# Five columns of uniform random floats in [0, 1)
df = pd.DataFrame({
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size)
})
df.head()
Here’s what the dataset looks like:
Let’s save it locally next. You can use the following command to save the DataFrame in the Feather format with Pandas:
df.to_feather('1M.feather')
And here’s how to do the same with the Feather library:
feather.write_dataframe(df, '1M.feather')
Not much of a difference. Either call saves the file locally. You can read it back either with Pandas or with the dedicated library. Here’s the syntax for Pandas first:
df = pd.read_feather('1M.feather')
Change it to the following if you’re using the Feather library:
df = feather.read_dataframe('1M.feather')
And that covers everything you should know. The following section compares Feather with the CSV file format in file size, read time, and write time.
CSV vs. Feather — Which one should you use?
If you don’t need to change the data on the fly, the answer is simple — you should use Feather over CSV. Still, let’s do some testing.
The following chart shows the time needed to save the DataFrame from the last section locally:
That’s a drastic difference — native Feather is around 150 times faster than CSV. It doesn’t matter too much if you use Pandas to work with Feather files, but the speed increase when compared to CSV is significant.
Next, let’s compare the read times — how long does it take to read identical datasets in different formats:
Once again, significant differences. CSVs are much slower to read. Sure, they take more disk space, but how much more exactly?
That’s what the next visualization answers:
As you can see, CSV files take more than double the space Feather files take.
If you store gigabytes of data daily, choosing the correct file format is crucial. Feather demolishes CSVs in that regard. If you need even more compression, you should try out Parquet. I found it to be the best format yet.
To summarize, changing to_csv() to to_feather() and read_csv() to read_feather() can save you a lot of time and disk space. Take that into account on your next big data project.