Apache Spark Partitioning and Bucketing
![](https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*x5D28_hODTUUPq0gXcMtsg.png)
One of Apache Spark’s key features is its ability to efficiently distribute data across a cluster of machines and process it in parallel. This parallelism is crucial for achieving high performance in big data workloads.
In this blog post, we will explore two techniques employed by Spark to effectively distribute data: Partitioning and Bucketing.
Create a DataFrame
To illustrate partitioning and bucketing in PySpark, we will use the DataFrame defined in the following code. You can find the complete source code for this article in my GitHub repository.
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for this example
spark = SparkSession.builder.appName('partition-bucket-app').getOrCreate()

# Small sample DataFrame with name, age and sex columns
df = spark.createDataFrame(
    [("person_1", "20", "M"), ("person_2", "56", "F"), ("person_3", "89", "F"), ("person_4", "40", "M")],
    ["name", "age", "sex"]
)
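For a quick sanity check, we can print the schema and display the rows:
df.printSchema()
df.show()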
Spark Partitioning
Partitioning in Spark refers to the division of data into smaller, more manageable chunks known as partitions. Partitions are the basic units of parallelism in Spark, and they allow the framework to process different portions of the data simultaneously on different nodes in a cluster.
Why Partition Data?
Partitioning serves several essential purposes in Spark:
- Parallelism: By dividing data into partitions, Spark can distribute these partitions across multiple nodes in a cluster. This enables parallel processing, significantly improving the performance of data operations.
- Efficient Data Processing: Smaller partitions are easier to manage and manipulate. When a specific operation is performed on a partition, it affects a smaller subset of the data, reducing memory overhead.
- Data Locality: Spark aims to process data where it resides. By creating partitions that align with the distribution of data across nodes, Spark can optimize data locality, minimizing data transfer over the network.
How to Control Partitioning
Spark uses two types of partitioning:
- Partitioning in memory
- Partitioning on disk
Partitioning in memory
While transforming data, Spark allows users to control partitioning explicitly using repartition or coalesce.
Repartition
- It lets you specify the desired number of partitions and, optionally, the column(s) to partition by.
- It shuffles the data to create the specified number of partitions.
Coalesce
- It reduces the number of partitions by merging them, without performing a full shuffle.
- It’s useful when you want to decrease the number of partitions for efficiency.
By default, Spark creates as many partitions as there are CPU cores available to the application.
Here’s an example of using repartition. We repartition data into 4 partitions based on a column named sex.
df = df.repartition(4, "sex")
print(df.rdd.getNumPartitions())
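Coalesce is used in the same way when we only need to reduce the number of partitions. A minimal sketch, continuing from the DataFrame above:
# Merge the 4 partitions down to 2 without triggering a full shuffle
df_coalesced = df.coalesce(2)
print(df_coalesced.rdd.getNumPartitions())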
Partitioning on disk
When writing data in Spark, the partitionBy() method is used to partition the output on the file system, creating multiple sub-directories. This partitioning enhances read performance for downstream systems.
The method can be applied to one or more columns when writing a DataFrame to disk. Spark splits the records according to the partition column values and stores the data for each partition in its own sub-directory.
In the code below, we write the data to the file system in Parquet format, partitioning it by the sex column.
df.write.mode("overwrite").partitionBy("sex").parquet("data/output")
In the data/output directory, we notice that the data has been organized into partitions, resulting in the creation of two distinct sub-folders. This partitioning is based on the unique sex values present in our DataFrame, which are F and M.
![](https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*fplIY2IOyPvlODdDLFq76g.png)
Below is how the data gets written to the disk.
![](https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*n853xsKLRFCqxd3FyMyMdg.png)
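To see the benefit on the read side, a downstream job can filter on the partition column and Spark will prune the non-matching sub-directories. A minimal sketch, reusing the data/output path written above:
# Read the partitioned Parquet data back
df_read = spark.read.parquet("data/output")

# Filtering on the partition column lets Spark skip the sex=M sub-directory
df_read.filter(df_read.sex == "F").show()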
Spark Bucketing
Bucketing is a technique used in Spark for optimizing data storage and querying performance, especially when dealing with large datasets. It involves dividing data into a fixed number of buckets and storing each bucket as a separate file.
Why Bucket Data?
Bucketing provides several advantages:
- Efficient Data Retrieval: When querying data, Spark can narrow down the search by reading only specific buckets, reducing the amount of data to scan. This results in faster query performance.
- Uniform Bucket Sizes: Records are assigned to buckets by hashing the bucket column, so each bucket holds roughly the same number of records when the column’s values are well distributed, which helps mitigate data skew.
How to Bucket Data
Bucketing is typically applied to DataFrames or tables using the bucketBy method in Spark SQL. You specify the number of buckets and the column to bucket by. Here’s an example where we create 5 buckets based on the age column and write the data to a table named bucketed_table.
df.write.bucketBy(5, "age").saveAsTable("bucketed_table")
You can also specify sorting within buckets, which can further optimize certain query types:
df.write.bucketBy(10, "age") \
    .sortBy("name") \
    .saveAsTable("sorted_bucketed_table")
Choosing Between Partitioning and Bucketing
The decision of whether to use partitioning or bucketing (or both) in your Spark application depends on the specific use case and query patterns. Here are some guidelines:
- Partitioning: Use partitioning when you need to distribute data for parallel processing and optimize data locality. It’s beneficial for filtering and joins where the filter condition aligns with the partition key.
- Bucketing: Use bucketing when you have large datasets and want to optimize query performance, especially for equality-based filters. It’s beneficial for scenarios where data is frequently queried with specific criteria.
- Combining Both: In some cases, combining partitioning and bucketing can yield the best results. You can partition data by a high-level category and then bucket it within each partition, as sketched below.
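As a rough sketch of that last point (the table name partitioned_bucketed_table is just illustrative), the two techniques can be combined in a single write:
# Partition on disk by the low-cardinality sex column,
# then bucket the rows within each partition by age
df.write \
    .partitionBy("sex") \
    .bucketBy(5, "age") \
    .saveAsTable("partitioned_bucketed_table")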
Conclusion
Spark partitioning and bucketing are essential techniques for optimizing data processing and query performance in large-scale data analytics. By understanding when and how to use these techniques, you can make the most of Apache Spark’s capabilities and efficiently handle your big data workloads.
The choice between partitioning and bucketing depends on your specific use case, query patterns, and data size, so consider your requirements carefully when designing your Spark applications.
Thank you for reading 🙌🏻 😁 I hope you found it helpful. If so, feel free to hit the button 👏 . If you have any other tips or tricks that you would like to share, please leave a comment below.