Dave Canton



Wide and Narrow dependencies in Apache Spark

Indeed, not all transformations are born equal. Some are more expensive than others, and if you shuffle data all around your cluster network, your performance will surely take a hit! To understand why some transformations have this impact on execution time, we need to understand the basic difference between narrow and wide dependencies in Apache Spark.

Computations are represented in Spark as a DAG (Directed Acyclic Graph), officially described as a lineage graph, over RDDs, which represent data distributed across different nodes.

I won’t dive into details about RDDs here (Jacek Laskowski has a nicely written description about them!), but it is good to understand that RDDs are actually made up of four main parts:

  1. Partitions
  2. Dependencies (which model the relationship between an RDD and its partitions and the partitions they were derived from)
  3. Function: for computing the dataset based on its parent RDD
  4. Metadata about its partitioning scheme and data placement

Therefore, each partition can depend on one or more partitions from its parent RDD.
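
To make the four parts a bit more concrete, here is a minimal pure-Python sketch. It is only a toy model: `ToyRDD`, `parents_of` and every field name are hypothetical names made up for illustration, not Spark's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD. All names are hypothetical, not Spark's API."""
    partitions: List[int]                                  # 1. partition indices
    dependencies: Dict[int, List[int]]                     # 2. child partition -> parent partitions it reads
    compute: Callable[[List[list]], list]                  # 3. builds one partition from its parents' data
    partitioner: Optional[str] = None                      # 4. metadata: partitioning scheme
    preferred_locations: Optional[Dict[int, str]] = None   # 4. metadata: data placement


def parents_of(rdd: ToyRDD, partition: int) -> List[int]:
    """Return the parent partitions that one child partition depends on."""
    return rdd.dependencies[partition]


# A child RDD with two partitions, each reading exactly one parent partition.
doubled = ToyRDD(
    partitions=[0, 1],
    dependencies={0: [0], 1: [1]},
    compute=lambda parent_data: [x * 2 for chunk in parent_data for x in chunk],
)

print(parents_of(doubled, 1))        # [1]
print(doubled.compute([[1, 2, 3]]))  # [2, 4, 6]
```

The `dependencies` mapping is the part that matters for the rest of this post: its shape decides whether a dependency is narrow or wide.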

Narrow dependencies

When each partition of the parent RDD is used by at most one partition of the child RDD, we have a narrow dependency. Transformations with this kind of dependency are rather fast, as they do not require any data shuffling over the cluster network. In addition, optimizations such as pipelining are also possible.

Example: map, filter and union transformations
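
A rough pure-Python illustration of why these are cheap, assuming partitions are modelled as plain lists (the helper names `narrow_map`, `narrow_filter` and `pipelined` are invented for this sketch and are not Spark functions): each child partition is built from a single parent partition, so nothing has to move between partitions, and the two steps can be fused into one pass.

```python
from typing import Callable, List

Partition = List[int]


def narrow_map(parts: List[Partition], f: Callable[[int], int]) -> List[Partition]:
    # Child partition i is built only from parent partition i.
    return [[f(x) for x in part] for part in parts]


def narrow_filter(parts: List[Partition], pred: Callable[[int], bool]) -> List[Partition]:
    # Same one-to-one partition relationship as narrow_map.
    return [[x for x in part if pred(x)] for part in parts]


def pipelined(parts: List[Partition],
              f: Callable[[int], int],
              pred: Callable[[int], bool]) -> List[Partition]:
    # Pipelining: map and filter are applied in a single pass over each
    # partition, without materializing the intermediate mapped dataset.
    return [[y for y in (f(x) for x in part) if pred(y)] for part in parts]


parent = [[1, 2, 3], [4, 5, 6]]
step_by_step = narrow_filter(narrow_map(parent, lambda x: x * 10), lambda x: x > 20)
one_pass = pipelined(parent, lambda x: x * 10, lambda x: x > 20)

print(step_by_step)              # [[30], [40, 50, 60]]
print(one_pass == step_by_step)  # True
```

Note that the number of partitions never changes and no element crosses from one partition to another, which is exactly what makes the dependency narrow.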


Wide dependencies

When each partition of the parent RDD may be depended on by multiple child partitions, we have a wide dependency. Computation speed might then be significantly affected, as we might need to shuffle data across different nodes when creating the new partitions.

Example: groupByKey operations and join operations whose inputs are not co-partitioned
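
The following toy sketch mimics a group-by-key in plain Python to show where the shuffle comes from. It is not how Spark implements `groupByKey`: the function `toy_group_by_key` is hypothetical, and `key % num_out` is just an assumed stand-in for a hash partitioner.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Tuple[int, str]


def toy_group_by_key(parts: List[List[Record]], num_out: int):
    """Return (child partitions, child index -> set of parent indices it read)."""
    children: List[Dict[int, List[str]]] = [defaultdict(list) for _ in range(num_out)]
    reads: Dict[int, Set[int]] = defaultdict(set)
    for parent_idx, part in enumerate(parts):
        for key, value in part:
            target = key % num_out  # assumed stand-in for a hash partitioner
            children[target][key].append(value)
            reads[target].add(parent_idx)  # record which parent fed this child
    return [dict(c) for c in children], dict(reads)


parent = [[(1, "a"), (2, "b")], [(1, "c"), (4, "d")]]
children, reads = toy_group_by_key(parent, num_out=2)

print(children)          # [{2: ['b'], 4: ['d']}, {1: ['a', 'c']}]
print(sorted(reads[0]))  # [0, 1]
print(sorted(reads[1]))  # [0, 1]
```

Both child partitions read from both parent partitions, so every parent partition is used by more than one child. On a real cluster those parent partitions may live on different nodes, which is why the records have to travel over the network.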

When designing algorithms, it is good to bear these definitions in mind and always try to minimize the number of transformations that lead to RDDs with wide dependencies and data shuffling.
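
As a memory aid, the two definitions can be written down as a small check. This is only a sketch under the assumption that a dependency is described as a mapping from child partition index to the parent partition indices it reads; `classify` is a hypothetical helper, not something Spark exposes.

```python
from collections import Counter
from typing import Dict, List


def classify(dependencies: Dict[int, List[int]]) -> str:
    """Classify a child -> parents mapping as 'narrow' or 'wide'."""
    # Count how many child partitions use each parent partition.
    uses = Counter(p for parents in dependencies.values() for p in set(parents))
    # Narrow: every parent partition is used by at most one child partition.
    return "narrow" if all(n <= 1 for n in uses.values()) else "wide"


map_like = {0: [0], 1: [1]}            # one parent partition per child
merge_like = {0: [0, 1], 1: [2, 3]}    # several parents per child, each used once
shuffle_like = {0: [0, 1], 1: [0, 1]}  # every parent feeds every child

print(classify(map_like))      # narrow
print(classify(merge_like))    # narrow
print(classify(shuffle_like))  # wide
```

The `merge_like` case is worth a second look: a child partition may depend on several parent partitions and the dependency is still narrow, as long as no parent partition is shared between children.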

If you need help with your Spark application, get in touch. Besides Twitter (https://twitter.com/dvcanton), you can also reach me on StackOverflow (https://stackoverflow.com/users/7082230/dave-canton).
