Chengzhi Zhao

Summary

The article provides a solution for handling skewed data in Apache Spark using the SALT technique to prevent out-of-memory errors and improve data processing efficiency.

Abstract

The article discusses the common issue of data skew in Apache Spark, which leads to out-of-memory errors and hinders the completion of Spark jobs. It explains that data skew occurs due to unevenly distributed keys during shuffling operations. The author introduces the SALT technique, borrowed from the salting idea in cryptography, which adds randomness to keys to distribute data more evenly across partitions, thus preventing hot keys and enabling parallel processing. The SALT method is presented as a way to address data skew without requiring additional memory or extensive tuning of Spark parameters, and it is particularly recommended for wide transformations like join operations. The article also includes a step-by-step guide on implementing SALT in Spark and concludes by emphasizing the importance of user intervention in balancing skewed data for efficient data processing.

Opinions

  • The author suggests that simply adding more memory to executors is a brute force and potentially ineffective method for handling out-of-memory issues in Spark.
  • The article conveys that data skew is an inherent dataset problem rather than a Spark-specific issue, and it requires user involvement to resolve.
  • The author opines that using a composite key or hashing the entire keyset may not always solve data skew problems, highlighting the need for a more robust solution like SALT.
  • The article emphasizes the benefits of SALT, such as its ease of implementation and the fact that it is unrelated to the dataset's keys, thus providing a generic approach to handling skewed data.
  • The author believes that SALT is a valuable technique for data engineers and data scientists dealing with skewed data, as it introduces randomness to key distribution, which helps in achieving a more balanced data processing workload.

Skewed Data in Spark? Add SALT to Compensate

A step-by-step guide to handling skewed data with the SALT technique

Image: @tangerinenewt on Unsplash

If you have been working with Apache Spark for a while, you must have seen the following error:

    java.lang.OutOfMemoryError: Java heap space

The out-of-memory (OOM) error is one of the most recurring errors preventing Spark jobs from completing successfully. Unevenly distributed keys, known as data skew, commonly cause this issue.

There are many ways to solve the out-of-memory issue in Spark. The brute-force way is to add more memory to the executors and hope it works. Spark also has many tuning parameters to rebalance memory. But skewed data is a dataset problem: besides optimizing Spark parameters, it is usually the user's responsibility to use the data itself to solve the issue. This article will use SALT to crack the data skew issue.
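As a point of reference, a minimal sketch of that brute-force route might look like the following; the memory and partition values are illustrative, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Throw resources at the problem: more heap per executor and more,
    // smaller shuffle partitions.
    val spark = SparkSession.builder()
      .appName("sales-analysis")
      .config("spark.executor.memory", "8g")
      .config("spark.sql.shuffle.partitions", "400")
      .getOrCreate()

Neither setting rebalances a single hot key that lands entirely in one partition, which is why the rest of this article focuses on fixing the data itself.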

Understand Data Skew

In Spark, wide transformations involve a shuffle of the data between partitions. Shuffling is the process that transfers data from the mapper to the reducer.

The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. — Spark Documentation

Image by Author

Shuffling is an expensive operation. As the image above shows, shuffling usually isn't a 1:1 copy, since hash keys are typically used to determine how data is grouped and where it is copied. This process usually means data is copied across numerous executors and machines. If one key has a considerably larger volume than the others, this “HOT KEY” causes a data skew.

Data skew has nothing to do with Spark; it's the nature of the dataset. For example, suppose we perform a sales analysis that requires a breakdown by city. Cities with larger populations, like New York, Chicago, and San Francisco, have a higher chance of causing data skew problems.
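Before fixing the skew, it helps to confirm it. Here is a quick sketch, assuming a sales DataFrame with a city column (both names are hypothetical):

    import org.apache.spark.sql.functions.{col, desc, spark_partition_id}

    // Rows per key: a hot city like New York dominates the top of this list.
    sales.groupBy("city").count().orderBy(desc("count")).show(10)

    // Rows per partition after shuffling by city: with skew, one partition
    // holds far more rows than the rest.
    sales.repartition(col("city"))
      .groupBy(spark_partition_id().as("pid"))
      .count()
      .orderBy(desc("count"))
      .show(10)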

And the next question becomes,

How could we balance the key distribution so some keys won’t be hot spots?

The answer to this is to make the existing keys slightly different so they can be processed evenly. One option is to find another field and add it as a composite key, or to hash the entire keyset. Again, this only works if the new field we choose makes the composite key distribute evenly. If that doesn't work, you'd need some randomness to help, and this introduces SALT.

What is SALT?

In cryptography, a salt is random data that is used as an additional input to a one-way function that hashes data, a password or passphrase — Wikipedia

The SALT idea from cryptography introduces randomness to the key without requiring any context about the dataset. The idea is that if a given hot key is combined with different random numbers, we won't end up with all the data for that key processed in a single partition. A significant benefit of SALT is that it is unrelated to any of the keys, so you don't have to worry about keys with similar context hashing to the same value again.

In Spark, SALT is a technique that adds random values to keys so Spark can distribute partition data evenly. It's usually worth adopting for wide transformations that require shuffling, such as join operations, as the sketch below shows.
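To make the join case concrete, here is one common way to salt a join, sketched under assumptions: a large, skewed orders DataFrame, a small cities DataFrame, a shared city key, and an illustrative saltFactor of 3:

    import org.apache.spark.sql.functions.{array, col, explode, floor, lit, rand}

    val saltFactor = 3

    // Salt the large, skewed side with a random component in [0, saltFactor).
    val ordersSalted = orders
      .withColumn("salt", floor(rand() * saltFactor).cast("int"))

    // Replicate each row of the small side once per salt value so every
    // (city, salt) combination on the large side finds its match.
    val citiesSalted = cities
      .withColumn("salt", explode(array((0 until saltFactor).map(lit(_)): _*)))

    // Join on the composite key; the hot city now spreads across
    // up to saltFactor partitions instead of one.
    val joined = ordersSalted.join(citiesSalted, Seq("city", "salt"))

The trade-off is that the small side grows by a factor of saltFactor, so this works best when one side is much smaller than the other.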

The following image visualizes how SALT changes the key distribution. Key 1 (light green) is the hot key that causes skewed data in a single partition. After applying SALT, the original key is split into 3 parts, and the new keys shuffle to different partitions than before. In this case, Key 1 goes to 3 different partitions, and the original partition's data can be processed in parallel across those 3 partitions.

Image by Author

How to use SALT in Spark

The process of using SALT in Spark can be broken down into:

  1. Add a new field and populate it with random numbers.
  2. Combine this new field and the existing keys as a composite key, then perform any transformation.
  3. Once the processing is done, combine the partial results into the final result.

We can write Spark code for these steps like the following:
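Here is a minimal sketch, assuming a DataFrame df with a skewed city key and a numeric quantity column; saltFactor and all names are illustrative:

    import org.apache.spark.sql.functions.{col, floor, rand, sum}

    val saltFactor = 3

    // 1. Add a new field and populate it with random numbers in [0, saltFactor).
    val salted = df.withColumn("salt", floor(rand() * saltFactor).cast("int"))

    // 2. Aggregate on the composite (city, salt) key so the hot key's rows
    //    are reduced in parallel across up to saltFactor partitions.
    val partial = salted
      .groupBy(col("city"), col("salt"))
      .agg(sum("quantity").as("partial_sum"))

    // 3. Combine the partial results back into one row per original key.
    val result = partial
      .groupBy(col("city"))
      .agg(sum("partial_sum").as("total_quantity"))

This two-stage pattern works directly for associative aggregations like sum and count; for something like an average, carry both a sum and a count through the first stage and divide at the end. A larger saltFactor spreads a hot key across more partitions but multiplies the number of groups in the first stage, so it's a trade-off worth tuning per dataset.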

Final Thought

Today, most data processing won't automatically offset highly skewed data; it's up to the user to apply certain logic to balance it. SALT is a generic way of solving the skewed data issue without knowing the context of the data.

I hope this story is helpful to you. This article is part of a series of my engineering & data science stories.

You can also subscribe to my new articles or become a referred Medium member who gets unlimited access to all the stories on Medium.

In case of questions/comments, do not hesitate to write in the comments of this story or reach me directly through LinkedIn or Twitter.
