Zstd vs Snappy vs Gzip: The Compression King for Parquet Has Arrived
For years, Snappy has been the go-to choice, but its dominance is being challenged
If you’ve been working with Parquet files, chances are you’ve debated which compression algorithm to use. For years, Snappy has been the go-to choice, offering quick compression and decompression at the cost of some compression efficiency. But hold onto your seats, because Zstandard (Zstd) is here to challenge the throne. And let’s not forget the veteran, Gzip, a compression powerhouse with its own strengths. Spoiler alert: Snappy might have to take a backseat!
In this article, I’ll break down the differences between Zstd, Snappy and Gzip, look at why Zstd is creating a buzz in the data engineering world, and help you decide which one’s right for your use case. Let’s dive in!
Why Compression Matters for Parquet
Before we pit these three algorithms against each other, let’s quickly revisit why compression is critical for Parquet.
- Storage Savings: Parquet already gives you columnar storage efficiency, but compression squeezes even more juice out of your data.
- Faster I/O: Smaller, compressed files mean faster reads and writes, which is a big deal when dealing with massive datasets.
- Cost Optimization: Whether on cloud storage or in your data lake, compression directly translates to cost savings.
Snappy was the reigning champion because it offered fast compression and decompression, perfect for scenarios where speed trumps everything else. But what if you could get better compression without sacrificing much speed? Enter Zstd.
Meet the Contenders for Parquet Compression
Snappy: The Speed Demon
- Strengths: Blazing-fast compression and decompression, very lightweight, and easy on your CPU.
- Weaknesses: Mediocre compression ratios. It’s more about speed than storage savings.
- Use Case: Great for real-time pipelines and situations where you need to trade compression efficiency for processing speed.
Gzip: The Mightiest but Slowest
- Strengths: High compression ratio, widely compatible, and great for read-once scenarios.
- Weaknesses: Slow compression/decompression, high CPU usage, and limited tunability.
- Use Case: Best for archival storage, static data distribution, and cost-sensitive storage.
Zstandard (Zstd): The Balanced Warrior
- Strengths: Excellent compression ratios without drastically slowing down speed. Highly tunable to balance speed and compression.
- Weaknesses: Marginally slower than Snappy in some cases (but faster than you’d expect for its efficiency).
- Use Case: Ideal for batch processing, archival storage, or anytime you need better storage efficiency without crippling performance.
Parquet also supports other compression codecs, such as Brotli, LZ4, LZO, and LZ4_RAW.
Head-to-Head: Snappy vs Zstd vs Gzip
Let’s get to the fun part: how do these three stack up against each other in the context of Parquet? The figures below are general estimates and will vary with the specific dataset and its characteristics.
+---------------------+---------------+--------------------+-------------+
| Metric              | Snappy        | Zstd               | Gzip        |
+---------------------+---------------+--------------------+-------------+
| Compression Ratio   | 2:1 to 3:1    | 3:1 to 5:1         | 3:1 to 6:1  |
| Compression Speed   | 🏎️ Very Fast  | 🚗 Fast            | 🐢 Slow     |
| Decompression Speed | 🏎️ Very Fast  | 🚗 Slightly Slower | 🐢 Slow     |
| CPU Usage           | Low           | Moderate           | High        |
| File Size           | Larger        | Smaller            | Smallest    |
+---------------------+---------------+--------------------+-------------+
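To see what those ratio ranges mean in practice, here is a small back-of-the-envelope sketch. The ratios are the general estimates from the table; the 1,200 GB dataset size and the function names are hypothetical, picked only to keep the arithmetic easy to follow.

```python
# Compression ratio ranges (N:1) taken from the table above; general estimates only.
RATIO_RANGES = {
    "snappy": (2.0, 3.0),
    "zstd": (3.0, 5.0),
    "gzip": (3.0, 6.0),
}


def stored_size_gb(raw_gb, ratio):
    """Size on disk for `raw_gb` of uncompressed data at an N:1 ratio."""
    return raw_gb / ratio


def size_range_gb(raw_gb, codec):
    """Return (best case, worst case) stored size in GB for a codec."""
    low_ratio, high_ratio = RATIO_RANGES[codec]
    # The highest ratio gives the smallest file, so it is the best case.
    return stored_size_gb(raw_gb, high_ratio), stored_size_gb(raw_gb, low_ratio)


raw_gb = 1200  # hypothetical uncompressed dataset size
for codec in RATIO_RANGES:
    best, worst = size_range_gb(raw_gb, codec)
    print(f"{codec:>6}: {best:.0f} to {worst:.0f} GB")
```

With these assumed numbers, Snappy lands between 400 and 600 GB, Zstd between 240 and 400 GB, and Gzip between 200 and 400 GB. Multiply by your storage price per GB to turn that into a monthly bill.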
Compression Ratio
Zstd and Gzip crush Snappy here. For datasets like logs, metrics, or JSON-like structures, Zstd and Gzip can deliver nearly twice the compression efficiency, if not more. In cloud environments where storage comes at a premium, this can lead to substantial cost savings.
Moreover, Zstd beats Gzip hands down when it comes to speed and using less CPU power. It strikes a great balance between getting the job done fast and squeezing your data down efficiently.
Speed
Snappy is slightly faster when it comes to compression and decompression, but the gap isn’t as wide as you’d expect. For most workloads, the difference is negligible unless you’re running ultra-latency-sensitive jobs.
Resource Usage
Zstd’s CPU usage is higher, but modern processors can handle the load without breaking a sweat. Plus, Zstd is tunable — meaning you can dial it up or down depending on your requirements.

Real-World Scenarios
Let’s break this down into practical use cases:
- Streaming Pipelines (Snappy Wins): If you’re working with real-time data pipelines (Kafka, Spark Streaming), where milliseconds matter, Snappy is still a solid choice.
- Batch Processing (Zstd Wins): For batch jobs in Spark or Hive that process large datasets, Zstd’s smaller file size and efficient decompression offer a clear advantage.
- Cloud Storage (Zstd or Gzip Wins): Storing Parquet files on AWS S3 or Azure Blob? Zstd works well when you need efficient storage with reasonable speed, while Gzip is the best choice for maximum compression efficiency.
- Data Archival (Gzip Wins): For archiving historical data, Gzip’s superior compression ratio makes it the obvious choice.
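The scenarios above boil down to a simple rule of thumb, which could be sketched as a tiny helper. The function name, the workload labels, and the `maximize_compression` flag are all hypothetical; this only encodes the recommendations from the list, not a measured policy.

```python
def suggest_codec(workload, maximize_compression=False):
    """Map a workload label to the codec recommended in the list above."""
    workload = workload.lower()
    if workload == "streaming":
        return "snappy"  # milliseconds matter
    if workload == "batch":
        return "zstd"  # smaller files, efficient decompression
    if workload == "cloud_storage":
        # Zstd for a balance of size and speed, Gzip for the smallest files.
        return "gzip" if maximize_compression else "zstd"
    if workload == "archival":
        return "gzip"  # compression ratio matters most
    raise ValueError(f"unknown workload: {workload!r}")


print(suggest_codec("streaming"))
print(suggest_codec("cloud_storage", maximize_compression=True))
```

In a real pipeline you would still benchmark on your own data before hard-coding a choice like this.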
The Final Verdict
Each algorithm has its strengths, and the right choice depends on your workload:
- Snappy: Choose this if speed is critical and storage isn’t a concern, i.e. low-latency workloads.
- Zstd: The all-rounder that balances compression efficiency and speed, making it ideal for most modern workloads.
- Gzip: The heavyweight champ for scenarios where storage savings are more important than speed.
While Snappy and Gzip still have their niches, Zstd’s better compression ratios and good performance make it the compression king for Parquet files. This is why many organisations have already moved to Zstd for their Parquet datasets. Try them out in your workflow and see who wins your crown! 👑
Addendum: Some Interesting Points
- Snappy, previously known as Zippy, is a fast data compression and decompression library written in C++ by Google based on ideas from LZ77 and open-sourced in 2011.
- Gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. DEFLATE was designed to replace LZW and other data compression algorithms that were restricted by patents.
- Zstandard (Zstd) was designed to give a compression ratio comparable to that of the DEFLATE algorithm, but faster, especially for decompression.
- Zstd is tunable: it uses a compression level scale from -7 to 22, with lower levels prioritizing speed and higher levels prioritizing compression.






