Ritam Mukherjee

Summary

Zstd is emerging as the preferred compression algorithm for Parquet files, challenging the long-standing dominance of Snappy due to its superior compression ratios and good performance, while Gzip remains a viable option for scenarios prioritizing maximum storage efficiency over speed.

Abstract

In the realm of data engineering, the choice of compression algorithm for Parquet files is pivotal for storage savings, faster I/O, and cost optimization. Traditionally, Snappy has been favored for its quick compression and decompression speeds, but at the cost of lower compression efficiency. The introduction of Zstd (Zstandard) presents a compelling alternative, offering excellent compression ratios akin to Gzip without significantly compromising speed, thus making it suitable for a wide range of use cases, from batch processing to cloud storage. While Snappy maintains its edge in real-time data pipelines due to its speed, and Gzip in archival storage due to its high compression ratio, Zstd's balanced approach positions it as the new king of compression for Parquet, reflecting its adoption by many organizations for their data storage needs.

Opinions

  • The author suggests that Zstd is creating a buzz in the data engineering world, implying it is a significant advancement in compression technology.
  • Snappy is deemed the "Speed Demon" for its blazing-fast compression and decompression, ideal for scenarios where speed is paramount.
  • Gzip is considered the "Mightiest but Slowest," offering the best compression ratios but at the expense of speed and higher CPU usage.
  • Zstd is hailed as the "Balanced Warrior," providing a tunable compression level that balances speed and efficiency, making it versatile for various use cases.
  • The author indicates that Zstd's CPU usage is higher than Snappy's but manageable with modern processors, and its tunability allows for performance adjustments.
  • The author believes that while Snappy and Gzip still have their specific use cases, Zstd's overall performance and efficiency make it the superior choice for most Parquet compression needs.
  • The article implies that the choice of compression algorithm should be based on the specific requirements of the workload, with Zstd being the recommended default choice for its balance of speed and compression.

Zstd vs Snappy vs Gzip: The Compression King for Parquet Has Arrived

For years, Snappy has been the go-to choice, but its dominance is being challenged

If you’ve been working with Parquet files, chances are you’ve debated which compression algorithm to use. For years, Snappy has been the go-to choice, offering quick compression and decompression at the cost of some compression efficiency. But hold onto your seats, because Zstandard (Zstd) is here to challenge the throne. And let’s not forget the veteran, Gzip, a compression powerhouse with its own strengths. Spoiler alert: Snappy might have to take a backseat!

In this article, I’ll break down the differences between Zstd, Snappy and Gzip, look at why Zstd is creating a buzz in the data engineering world, and help you decide which one’s right for your use case. Let’s dive in!

Why Compression Matters for Parquet

Before we pit these three algorithms against each other, let’s quickly revisit why compression is critical for Parquet.

  1. Storage Savings: Parquet already gives you columnar storage efficiency, but compression squeezes even more juice out of your data.
  2. Faster I/O: Smaller, compressed files mean faster reads and writes, which is a big deal when dealing with massive datasets.
  3. Cost Optimization: Whether on cloud storage or in your data lake, compression directly translates to cost savings.

Snappy was the reigning champion because it offered fast compression and decompression, perfect for scenarios where speed trumps everything else. But what if you could get better compression without sacrificing much speed? Enter Zstd.

Meet the Contenders for Parquet Compression

Snappy: The Speed Demon

  • Strengths: Blazing-fast compression and decompression, very lightweight, and easy on your CPU.
  • Weaknesses: Mediocre compression ratios. It’s more about speed than storage savings.
  • Use Case: Great for real-time pipelines and situations where you need to trade compression efficiency for processing speed.

Gzip: The Mightiest but Slowest

  • Strengths: High compression ratio, widely compatible, and great for read-once scenarios.
  • Weaknesses: Slow compression/decompression, high CPU usage, and limited tunability.
  • Use Case: Best for archival storage, static data distribution, and cost-sensitive storage.

Zstandard (Zstd): The Balanced Warrior

  • Strengths: Excellent compression ratios without drastically slowing down speed. Highly tunable to balance speed and compression.
  • Weaknesses: Marginally slower than Snappy in some cases (but faster than you’d expect for its efficiency).
  • Use Case: Ideal for batch processing, archival storage, or anytime you need better storage efficiency without crippling performance.

Parquet also supports other compression codecs, such as Brotli, LZ4, LZO, and LZ4_RAW.

Head-to-Head: Snappy vs Zstd vs Gzip

Let’s get to the fun part: how do these three stack up against each other in the context of Parquet? The figures below are general estimates and can vary based on the specific dataset and its characteristics.

+---------------------+---------------+--------------------+------------+
|       Metric        |    Snappy     |        Zstd        |    Gzip    |
+---------------------+---------------+--------------------+------------+
| Compression Ratio   | 2:1 to 3:1    | 3:1 to 5:1         | 3:1 to 6:1 |
| Compression Speed   | 🏎️ Very Fast  | 🚗 Fast             |🐢 Slow     |
| Decompression Speed | 🏎️ Very Fast  | 🚗 Slightly Slower  |🐢 Slow     |
| CPU Usage           | Low           | Moderate           | High       |
| File Size           | Larger        | Smaller            | Smallest   |
+---------------------+---------------+--------------------+------------+

Compression Ratio

Zstd and Gzip crush Snappy here. For datasets like logs, metrics, or JSON-like structures, Zstd and Gzip can deliver nearly twice the compression efficiency, if not more. In cloud environments where storage is at a premium, this can lead to substantial cost savings.

Moreover, Zstd beats Gzip hands down when it comes to speed and CPU usage. It strikes a great balance between getting the job done fast and squeezing your data down efficiently.

Speed

Snappy is slightly faster than Zstd when it comes to compression and decompression, but the gap isn’t as wide as you’d expect. For most workloads, the difference is negligible unless you’re running ultra-latency-sensitive jobs.

Resource Usage

Zstd’s CPU usage is higher than Snappy’s, but modern processors can handle the load without breaking a sweat. Plus, Zstd is tunable, meaning you can dial the compression level up or down depending on your requirements.

Real-World Scenarios

Let’s break this down into practical use cases:

  1. Streaming Pipelines (Snappy Wins): If you’re working with real-time data pipelines (Kafka, Spark Streaming), where milliseconds matter, Snappy is still a solid choice.
  2. Batch Processing (Zstd Wins): For batch jobs in Spark or Hive that process large datasets, Zstd’s smaller file size and efficient decompression offer a clear advantage.
  3. Cloud Storage (Zstd or Gzip Wins): Storing Parquet files on AWS S3 or Azure Blob? Zstd works well when you need efficient storage with reasonable speed, while Gzip is the best choice for maximum compression efficiency.
  4. Data Archival (Gzip Wins): For archiving historical data, Gzip’s superior compression ratio makes it the obvious choice.

The Final Verdict

Each algorithm has its strengths, and the right choice depends on your workload:

  • Snappy: Choose this if speed is critical and storage isn’t a concern, i.e., low-latency workloads.
  • Zstd: The all-rounder that balances compression efficiency and speed, making it ideal for most modern workloads.
  • Gzip: The heavyweight champ for scenarios where storage savings are more important than speed.

While Snappy and Gzip still have their niches, Zstd’s better compression ratios and good performance make it the compression king for Parquet files. This is why many organisations have already moved to Zstd for their Parquet datasets. Try them out in your workflow and see who wins your crown! 👑

Addendum: Some Interesting Points

  • Snappy, previously known as Zippy, is a fast data compression and decompression library written in C++ by Google based on ideas from LZ77 and open-sourced in 2011.
  • Gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. DEFLATE was designed to replace LZW and other data compression algorithms that were restricted by patents.
  • Zstandard (Zstd) was designed to give a compression ratio comparable to that of the DEFLATE algorithm but faster, especially for decompression.
  • Zstd is tunable and uses a compression level scale from -7 to 22, with lower levels prioritizing speed and higher levels prioritizing compression.
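
The level trade-off from the last bullet is easy to see with Gzip as well, which exposes a narrower scale (1 to 9 for actual compression in Python's standard `gzip` module). The sketch below uses only the standard library. The log lines are synthetic, so the ratios it prints say nothing about your own data.

```python
import gzip

# Synthetic, repetitive log lines (made up for this sketch).
raw = "".join(
    f"2024-01-01T00:00:{i % 60:02d} INFO user={i % 97} action=view\n"
    for i in range(20_000)
).encode("utf-8")

gzip_sizes = {}
for level in (1, 6, 9):
    compressed = gzip.compress(raw, compresslevel=level)
    # DEFLATE is lossless at every level.
    assert gzip.decompress(compressed) == raw
    gzip_sizes[level] = len(compressed)

for level, size in gzip_sizes.items():
    print(f"gzip level {level}: {size:,} bytes ({len(raw) / size:.1f}:1)")
```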