How/when does repartitioning in Spark help to improve performance?

What is a partition → In Spark, an RDD is a data structure that holds data. Generally the data is too large to fit on a single node, so it has to be split and placed (partitioned) across various nodes. In short, a partition is an atomic chunk of data stored on a given node in the cluster, and an RDD is a collection of those partitions.
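As a toy illustration only (a minimal sketch, assuming an existing SparkSession named spark, unlike the real-world example further below):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Toy data split into 4 partitions across the cluster.
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaRDD<String> rdd = jsc.parallelize(Arrays.asList("o1", "o2", "o3", "o4", "o5", "o6"), 4);
System.out.println(rdd.getNumPartitions()); // 4: the RDD is just a collection of these chunks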

So how many partitions should the data be chopped into → Having too few partitions causes less concurrency, data skew, and poor resource utilization, while having too many partitions causes task scheduling to take more time than the actual execution.

So what is repartition → It is a transformation in Spark that changes the number of partitions and rebalances the data. It can be used to increase or decrease the number of partitions, and it always shuffles all the data over the network, so it is a fairly expensive operation.
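For example, a minimal sketch (the input path is hypothetical, and an existing SparkSession named spark is assumed):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> orders = spark.read().parquet("/some/hypothetical/path/");
Dataset<Row> upsized = orders.repartition(100);  // increase to 100 partitions; full shuffle
Dataset<Row> downsized = orders.repartition(10); // decrease to 10 partitions; still a full shuffle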

Spark also has an optimized version of repartition() called coalesce(), which avoids a full shuffle by merging existing partitions and can only be used to decrease the number of partitions.
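A sketch of the difference, assuming the same hypothetical Dataset<Row> named orders as above:

// coalesce can only lower the partition count; it merges existing partitions
// without a full shuffle over the network.
Dataset<Row> merged = orders.coalesce(10);
// repartition reaches the same count, but reshuffles every row.
Dataset<Row> reshuffled = orders.repartition(10);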

So in which scenarios does repartition actually help to improve performance, if it is such an expensive operation?

Consider a real-world example (instead of sample code like sc.parallelize(List("this is","an example"))). Suppose there is a normalizer Spark job that gathers order data for a retail store like Home Depot from order-data source systems (databases, topics, or messaging queues) and dumps the data onto a data lake as parquet, at a location like the one below, every hour (two files per disk partition; in this case the disk partition is by the hour, H00).

The size of a.parquet and b.parquet is 200 MB:
/retail/homedepot/orders/p_day=2020-4-14/p_time=H00/a.parquet
/retail/homedepot/orders/p_day=2020-4-14/p_time=H00/b.parquet

We have a couple of summarizer Spark jobs like the ones below that feed on the order data:
  • Customer behaviors → Read order data and summarize customer data.
  • Vendor behaviors → Read order data and summarize vendor data.

The normalizer job writes the order data as 2 parquet files of 200 MB, and say we have a requirement that the normalizer must not change the number of disk-partition files it writes just to suit summarizer needs, i.e. the normalizer's design and requirements are orthogonal to the summarizer jobs that feed on the normalized order data.

So if we assign 40 cores to the customer-behaviors Spark job, how many cores will it use to crunch the normalized data from H00?

It will use 2 cores.

Why → Even though the summarizer job for customer behaviors is assigned 40 cores, it cannot use all of them; it uses only 2 cores because there are only two partition files of data (a and b) for the given hour.
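One way to confirm this is to check the partition count right after the read (a sketch; the exact number Spark reports also depends on file sizes and spark.sql.files.maxPartitionBytes):

Dataset<Row> orders = spark.read()
        .parquet("/retail/homedepot/orders/p_day=2020-4-14/p_time=H00/");
// With only two input files for the hour, the read yields only a handful of partitions,
// so only that many tasks (and hence cores) can work on it at once.
System.out.println(orders.rdd().getNumPartitions());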

Solution → The solution is to repartition the data to 40 (matching the number of cores), so that 40 data chunks are processed in parallel, i.e. instead of using only 2 cores, the summarizer job now leverages all 40 cores assigned to it.

  • Using this approach, we don't have to change the way the normalizer writes data (needed in so many real-world scenarios).
  • Summarizer Spark jobs read data from the partition files and use repartition in the code to operate on that dataset in memory, without even persisting the repartitioned data physically. We can even repartition by column, e.g. on a "day" column (see the sketch after the example below).
E.g.
Dataset<Row> ds = spark.read().parquet("/retail/homedepot/orders/p_day=2020-4-14/p_time=H00/");
ds = ds.repartition(40); // 40 in-memory partitions, so up to 40 tasks can run in parallel
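And, as mentioned in the second bullet above, a sketch of repartitioning by a column instead of (or together with) a fixed number; the day column is hypothetical here:

import static org.apache.spark.sql.functions.col;

Dataset<Row> byDay = ds.repartition(col("day"));       // hash-partition rows by the "day" column
Dataset<Row> byDay40 = ds.repartition(40, col("day")); // 40 partitions, hash-partitioned by "day"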