Parquet is Good for OLAP but Not for OLTP Use Cases. But Why?
Many engineers and data scientists praise Parquet for its efficient compression and fast query performance. While it’s a highly valued storage format for OLAP (Online Analytical Processing) workloads, it does have its own limitations. Unfortunately, Parquet falls short when it comes to OLTP (Online Transactional Processing) use cases. Let me share why, along with some examples from my experience.
Parquet’s Strengths in OLAP
Columnar Storage and Compression
- Parquet is a columnar storage format, which means it stores data column-wise rather than row-wise.
- This design allows for excellent compression (Why ? because row wise storage has different data formats for different columns, whereas column wise storage has same datatype through-out that column; this results in efficient application of compression algorithms)
- Hence it is well suited for querying of large datasets.
Example:
I was working on a project that involved analysing billions of records to generate monthly reports. Using Parquet, I could read just the necessary columns (like product_id, sale_amount) without scanning the entire dataset. The compression reduced storage costs and improved I/O performance, making queries run faster, saving time and compute cost.
The Challenges with OLTP scenarios in Parquet
1. No In-Place Update
Lets learn this from an example code :
// Sample DataFrame
val df = spark.read.parquet("employees.parquet")
// Update salary by 20% for Top employee in the "Engineering" department
val updatedDF = df.withColumn("salary",
when(col("department") === "Engineering" && col("rank") === 1, col("salary") * 1.20)
.otherwise(col("salary")))
// Write the updated DataFrame back to Parquet (overwriting)
updatedDF.write.mode("overwrite").parquet("employees_updated.parquet")Here we want to update salary by 20% for Top employee in the “Engineering” department.
Issue: Unlike row-based storage formats (e.g., relational databases), Parquet cannot modify data in place.
To apply the update in the above example, Spark reads the entire Parquet data, updates the salary column, and then rewrites the entire DataFrame back to Parquet. This means that even if you only update a few records, the entire file or partition is rewritten.
2. Inefficient Row-Level Operations
Parquet isn’t designed for row-level mutations, it is optimised for batch operations
Issue: Each small write may incurs significant overhead due to compression and file system operations. This makes it unsuitable for real-time transactional applications.
Example: In a user management system, new user records are inserted as they sign up. If Parquet was used in the case, then each insert would require rewriting entire columns, introducing latency and slowing down the signup process eventually resulting in bad User experience.
3. Lack of ACID Compliance
Transactional systems require ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure data integrity.
Issue: Parquet doesn’t natively support ACID transactions. Handling concurrent writes and ensuring consistency becomes complex and error-prone(Delta Lake can be used to add ACID support — but once again that is not native).
4. Compression Overhead for Small Transactions
While Parquet’s compression is great for reducing storage, it adds overhead to write operations.
Issue: Compression algorithms like Snappy, Gzip, or ZSTD are optimized for large blocks of data. In OLTP scenarios with small, frequent writes, the compression overhead negatively impacts performance.
Why Parquet Excels in OLAP ?
Read-Optimised Design
Parquet’s architecture is optimised for read-heavy operations, where you need to scan large datasets but only a few columns.
Example:
For generating analytics dashboards that aggregate user behaviour over time, Parquet allows me to read only the relevant columns in batches, making the process efficient and fast.
Write Limitations
Parquet’s write inefficiencies become apparent in OLTP workloads that require:
- Frequent Updates: Modifying data requires rewriting.
- Transaction Support: Lack of ACID compliance makes it unsuitable for operations requiring strict consistency.
- Real-Time Data Handling: Compression and write overhead introduce latency.
The Better Fit for OLTP: Row-Oriented Databases
For OLTP applications, row-oriented databases like PostgreSQL or MySQL are more suitable.
Advantages:
- Efficient Row-Level Operations: Quick inserts, updates, and deletes.
- Transaction Support: ACID compliance ensures data integrity.
- Indexing: Faster query performance on specific rows.
Conclusion
Parquet is a powerful tool for data — analytics but falls short in transactional applications due to its columnar design and write overhead.
Key Takeaways:
- Use Parquet for: Read-heavy analytical workloads where you need to process large volumes of data efficiently.
- Avoid Parquet for: Write-heavy transactional systems that requre frequent row-level updates and low latency.
If you found this article useful, please like and share your feedback.
Follow me on Medium and LinkedIn for more such insightful contents: https://www.linkedin.com/in/ritam378
#ApacheSpark #Spark #DataEngineering #Optimisation #SparkTuning






