avatarRitam Mukherjee

Summary

Parquet is an efficient storage format for OLAP systems due to its columnar storage and compression, but it is not suitable for OLTP use cases because of its lack of in-place updates, inefficiency in row-level operations, lack of ACID compliance, and compression overhead for small transactions.

Abstract

Parquet excels in OLAP scenarios, offering efficient data compression and fast query performance due to its columnar storage format, which is ideal for read-heavy analytical workloads. It allows for selective column access and significant storage savings. However, its design is not conducive to OLTP tasks, which require frequent row-level updates, transactional support, and real-time data handling. Parquet's inability to perform in-place updates necessitates rewriting entire files for even minor changes, leading to performance issues. Additionally, the overhead from compression algorithms during write operations makes it unsuitable for transactional systems that demand low latency and high throughput for small, frequent writes.

Opinions

  • The author believes that Parquet's columnar storage is a key strength for OLAP, as it enables efficient compression and querying of large datasets.
  • The author points out that Parquet's lack of support for in-place updates is a significant drawback for OLTP, as it leads to unnecessary rewrites of large data chunks.
  • The author suggests that Parquet's design, which is optimized for batch operations, is not suitable for the row-level mutations typical in OLTP scenarios.
  • The author emphasizes that the absence of native ACID compliance in Parquet complicates the handling of concurrent writes and data consistency, which is critical for transactional systems.
  • The author notes that the compression overhead in Parquet negatively impacts performance in OLTP scenarios, where small and frequent transactions are common.
  • The author concludes that for OLTP applications, row-oriented databases like PostgreSQL or MySQL are preferable due to their support for efficient row-level operations, transaction support, and indexing capabilities.

Parquet is Good for OLAP but Not for OLTP Use Cases. But Why?

Many engineers and data scientists praise Parquet for its efficient compression and fast query performance. While it’s a highly valued storage format for OLAP (Online Analytical Processing) workloads, it does have its own limitations. Unfortunately, Parquet falls short when it comes to OLTP (Online Transactional Processing) use cases. Let me share why, along with some examples from my experience.

Photo by Joshua Hoehne on Unsplash

Parquet’s Strengths in OLAP

Columnar Storage and Compression

  • Parquet is a columnar storage format, which means it stores data column-wise rather than row-wise.
  • This design allows for excellent compression (Why ? because row wise storage has different data formats for different columns, whereas column wise storage has same datatype through-out that column; this results in efficient application of compression algorithms)
  • Hence it is well suited for querying of large datasets.

Example:

I was working on a project that involved analysing billions of records to generate monthly reports. Using Parquet, I could read just the necessary columns (like product_id, sale_amount) without scanning the entire dataset. The compression reduced storage costs and improved I/O performance, making queries run faster, saving time and compute cost.

The Challenges with OLTP scenarios in Parquet

1. No In-Place Update

Lets learn this from an example code :

// Sample DataFrame
val df = spark.read.parquet("employees.parquet")

// Update salary by 20% for Top employee in the "Engineering" department
val updatedDF = df.withColumn("salary", 
                   when(col("department") === "Engineering" &&  col("rank") === 1, col("salary") * 1.20)
                   .otherwise(col("salary")))

// Write the updated DataFrame back to Parquet (overwriting)
updatedDF.write.mode("overwrite").parquet("employees_updated.parquet")

Here we want to update salary by 20% for Top employee in the “Engineering” department.

Issue: Unlike row-based storage formats (e.g., relational databases), Parquet cannot modify data in place.

To apply the update in the above example, Spark reads the entire Parquet data, updates the salary column, and then rewrites the entire DataFrame back to Parquet. This means that even if you only update a few records, the entire file or partition is rewritten.

2. Inefficient Row-Level Operations

Parquet isn’t designed for row-level mutations, it is optimised for batch operations

Issue: Each small write may incurs significant overhead due to compression and file system operations. This makes it unsuitable for real-time transactional applications.

Example: In a user management system, new user records are inserted as they sign up. If Parquet was used in the case, then each insert would require rewriting entire columns, introducing latency and slowing down the signup process eventually resulting in bad User experience.

3. Lack of ACID Compliance

Transactional systems require ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure data integrity.

Issue: Parquet doesn’t natively support ACID transactions. Handling concurrent writes and ensuring consistency becomes complex and error-prone(Delta Lake can be used to add ACID support — but once again that is not native).

4. Compression Overhead for Small Transactions

While Parquet’s compression is great for reducing storage, it adds overhead to write operations.

Issue: Compression algorithms like Snappy, Gzip, or ZSTD are optimized for large blocks of data. In OLTP scenarios with small, frequent writes, the compression overhead negatively impacts performance.

Why Parquet Excels in OLAP ?

Read-Optimised Design

Parquet’s architecture is optimised for read-heavy operations, where you need to scan large datasets but only a few columns.

Example:

For generating analytics dashboards that aggregate user behaviour over time, Parquet allows me to read only the relevant columns in batches, making the process efficient and fast.

Write Limitations

Parquet’s write inefficiencies become apparent in OLTP workloads that require:

  • Frequent Updates: Modifying data requires rewriting.
  • Transaction Support: Lack of ACID compliance makes it unsuitable for operations requiring strict consistency.
  • Real-Time Data Handling: Compression and write overhead introduce latency.

The Better Fit for OLTP: Row-Oriented Databases

For OLTP applications, row-oriented databases like PostgreSQL or MySQL are more suitable.

Advantages:

  • Efficient Row-Level Operations: Quick inserts, updates, and deletes.
  • Transaction Support: ACID compliance ensures data integrity.
  • Indexing: Faster query performance on specific rows.

Conclusion

Parquet is a powerful tool for data — analytics but falls short in transactional applications due to its columnar design and write overhead.

Key Takeaways:

  • Use Parquet for: Read-heavy analytical workloads where you need to process large volumes of data efficiently.
  • Avoid Parquet for: Write-heavy transactional systems that requre frequent row-level updates and low latency.

If you found this article useful, please like and share your feedback.

Follow me on Medium and LinkedIn for more such insightful contents: https://www.linkedin.com/in/ritam378

#ApacheSpark #Spark #DataEngineering #Optimisation #SparkTuning

Apache Spark
Spark
Data Engineering
Data
Technology
Recommended from ReadMedium