Skewed Data in Spark? Add SALT to Compensate
A step-by-step guide to handling skewed data with the SALT technique
If you have been working with Apache Spark for a while, you must have seen the following error:

java.lang.OutOfMemoryError: Java heap space
The out-of-memory (OOM) error is one of the most common errors preventing Spark jobs from completing successfully. Unevenly distributed keys, known as data skew, are a frequent cause of this issue.
There are many ways to solve the out-of-memory issue in Spark. The brute-force way is to add more memory to the executors and hope it works. Spark also has many tuning parameters to rebalance memory. But data skew is a property of the dataset, not of Spark: besides optimizing Spark parameters, it is usually the user's responsibility to work with the data itself to solve the issue. This article will use SALT to crack the data skew issue.
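For context, the brute-force route might look something like the sketch below. The application name and values are illustrative, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of the "brute force" approach: give executors more memory and
// tune how memory is split. The values here are illustrative only.
val spark = SparkSession.builder()
  .appName("skew-demo")
  .config("spark.executor.memory", "8g")          // more heap per executor
  .config("spark.memory.fraction", "0.6")         // share of heap for execution and storage
  .config("spark.sql.shuffle.partitions", "400")  // more, smaller shuffle partitions
  .getOrCreate()
```

Throwing memory at the problem can keep a job alive, but it doesn't change the fact that one partition is doing most of the work.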
Understand Data Skew
In Spark, wide transformations involve shuffling data between partitions. Shuffling is the process that transfers data from the mappers to the reducers.
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. — Spark Documentation
Shuffling is an expensive operation. As you can see in the image on the left, shuffling isn't usually a 1:1 copy: hash keys are typically used to determine how data is grouped and where it is copied. This process usually means data is copied across numerous executors and machines. If one key has a considerably larger volume than the others, this "hot key" causes data skew.
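To see why a hot key piles into a single partition, here is a rough sketch of the idea behind hash partitioning. It is a simplification for illustration, not Spark's exact implementation.

```scala
// Roughly: all rows for a key go to the partition its hash maps to,
// so every row of a hot key lands in the same partition.
val numPartitions = 200 // hypothetical shuffle partition count

def partitionFor(key: String): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod // keep the result non-negative
}

partitionFor("New York") // every "New York" row ends up in this one partition
```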
Data skew has nothing to do with Spark itself; it is in the nature of the dataset. For example, suppose we perform a sales analysis that requires a breakdown by city. Cities with larger populations, like New York, Chicago, and San Francisco, have a higher chance of causing data skew problems.
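A quick way to confirm the suspicion is to count rows per key and look for keys that dwarf the rest. The sketch below assumes a hypothetical salesDf DataFrame with a city column.

```scala
import org.apache.spark.sql.functions._

// Count rows per city and surface the heaviest keys first.
salesDf
  .groupBy("city")
  .count()
  .orderBy(desc("count"))
  .show(10, truncate = false)
```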
And the next question becomes,
How could we balance the key distribution so some keys won’t be hot spots?
The answer is to make the existing keys slightly different so that they can be processed evenly. One option is to find another field and add it as part of a composite key, or to hash the entire keyset, as the sketch below illustrates. However, this only works if the new field we choose makes the composite key distribute evenly. If that doesn't work, you need some randomness to help, and this is where SALT comes in.
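Here is a minimal sketch of the composite-key option, assuming a hypothetical salesDf with city and store_id columns.

```scala
import org.apache.spark.sql.functions._

// Combine the skewed key with another column, hoping the pair spreads evenly.
val withCompositeKey = salesDf
  .withColumn("composite_key", concat_ws("_", col("city"), col("store_id")))

// This only helps if store_id actually breaks up the hot cities; otherwise
// the composite key is just as skewed as the original key.
```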
What is SALT?
In cryptography, a salt is random data that is used as an additional input to a one-way function that hashes data, a password or passphrase — Wikipedia
The SALT idea from cryptography introduces randomness into the key without requiring any knowledge of the dataset. The idea is that, for a given hot key, if we combine it with different random numbers, we no longer end up with all of that key's data processed in a single partition. A significant benefit of SALT is that it is unrelated to the keys themselves, so you don't have to worry about keys with similar content hashing to the same value again.
In Spark, SALT is a technique that adds random values to the keys so that Spark distributes data evenly across partitions. It is usually a good fit for wide transformations that require shuffling, such as join operations.
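For example, here is a minimal sketch of salting a skewed join, assuming a large, skewed salesDf and a smaller cityDf that share a city column; the DataFrame names and the salt count are illustrative.

```scala
import org.apache.spark.sql.functions._

val saltBuckets = 3 // tuning knob: how many pieces to split each hot key into

// Large, skewed side: tag each row with a random salt in [0, saltBuckets).
val saltedSales = salesDf
  .withColumn("salt", (rand() * saltBuckets).cast("int"))

// Small side: replicate each row once per salt value so every salted key finds a match.
val saltedCities = cityDf
  .withColumn("salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

// Join on the composite (city, salt) key; the hot city is now spread
// across up to saltBuckets partitions instead of one.
val joined = saltedSales.join(saltedCities, Seq("city", "salt"))
```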
The following image visualizes how SALT changes the key distribution. Key 1 (light green) is the hot key that causes skewed data in a single partition. After applying SALT, the original key is split into 3 parts, and the new keys are shuffled to different partitions than before. In this case, Key 1 goes to 3 different partitions, and the work that used to sit in a single partition can now be processed in parallel across those 3 partitions.
How to use SALT in Spark
The process of using SALT in Spark can be broken down into:
- Add a new field and populate it with random numbers.
- Combine this new field and the existing keys as a composite key, then perform any transformations.
- Once the processing is done, combine the final results.
We can sketch the Spark code for these steps as follows.
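This is a minimal sketch, assuming a hypothetical salesDf with city and amount columns and a sum aggregation; the salt count and column names are illustrative.

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

val saltBuckets = 3

val result = salesDf
  // Step 1: add a new field populated with random numbers in [0, saltBuckets).
  .withColumn("salt", (rand() * saltBuckets).cast(IntegerType))
  // Step 2: aggregate on the composite (city, salt) key, so the hot city's rows
  // are spread across several partitions during the shuffle.
  .groupBy("city", "salt")
  .agg(sum("amount").as("partial_amount"))
  // Step 3: drop the salt and combine the partial results back into one row per city.
  .groupBy("city")
  .agg(sum("partial_amount").as("total_amount"))
```

The extra aggregation step is the price of splitting the hot key, but for skewed workloads it is usually far cheaper than having one straggler partition do all the work.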