High Performance Spark Chapter 2 (How Spark Works) Notes

Main Concepts: RDD, transformation/action functions, dependencies, SparkContext, jobs, stages, tasks

Transformations and Actions

Spark abstracts large datasets as immutable, distributed objects called RDDs. There are two types of functions defined on RDDs: actions and transformations. Transformations only set up the computational graph: they take in RDDs and return new RDDs for further processing. The actual computation is performed only when an action is applied to an RDD. This mechanism is called lazy evaluation.
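As a minimal sketch of lazy evaluation (assuming a local SparkContext; the object name and data are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  def main(args: Array[String]): Unit = {
    // Local mode is assumed here just to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("lazy-eval-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Transformations: nothing runs yet, Spark only records the lineage graph.
    val numbers = sc.parallelize(1 to 1000000)
    val evens   = numbers.filter(_ % 2 == 0)    // transformation -> new RDD
    val squares = evens.map(n => n.toLong * n)  // transformation -> new RDD

    // Action: only now is the whole chain actually computed.
    println(s"Sum of squared evens: ${squares.reduce(_ + _)}")

    sc.stop()
  }
}
```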

Dependencies

All transformations fall into two categories: transformations with narrow dependencies and transformations with wide dependencies. What do these mean? For a given transformation, we call the partitions of the RDD it is applied to the parent partitions, and the partitions of the resulting RDD the child partitions.

In the book, the definition of narrow dependencies is that “dependencies are only narrow if they can be determined at design time, irrespective of the value of the records in the parent partitions, and if each parent has at most one child partition”. For example, “filter” and “map” are transformations with narrow dependencies, since the transformation of each record depends only on the information within the record itself. By contrast, transformations like “sort” have wide dependencies, since we need to know the data in all of the partitions in order to sort.
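A small sketch of the contrast (assuming an existing SparkContext `sc`, e.g. from the sketch above; the data is made up), using `toDebugString` to reveal the shuffle that the wide transformation introduces:

```scala
// Assuming `sc` is an existing SparkContext.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

// Narrow dependencies: each child partition depends on at most one parent
// partition, and the mapping is known without looking at the data itself.
val narrow = pairs.filter(_._2 > 1).mapValues(_ * 10)

// Wide dependency: sorting needs to see the data in every parent partition
// to decide where each record goes, so Spark has to shuffle.
val wide = narrow.sortByKey()

// The lineage printout shows the shuffle introduced by sortByKey.
println(wide.toDebugString)
```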

For transformations with wide dependencies, the data has to be repartitioned in a particular way. To do that, Spark adds a “ShuffleDependency” object to the dependency list of the resulting RDD. Shuffles are expensive, however, especially when working with large datasets where a large portion of the data has to be moved, so we should avoid unnecessary shuffles when writing Spark programs.
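One way to see this (a sketch, again assuming an existing SparkContext `sc`; `reduceByKey` is used here just as an example of a shuffling transformation):

```scala
import org.apache.spark.ShuffleDependency

// Assuming `sc` is an existing SparkContext.
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

val mapped  = words.mapValues(identity) // narrow: no shuffle needed
val reduced = words.reduceByKey(_ + _)  // wide: requires a shuffle

// Every RDD carries a list of dependencies on its parents; a ShuffleDependency
// in that list is what marks a shuffle (and therefore a stage boundary).
println(mapped.dependencies.map(_.getClass.getSimpleName))
println(reduced.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])) // true

// reduceByKey still shuffles, but it combines values on the map side first,
// so far less data moves than with groupByKey followed by a local sum.
```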

These concepts are important for the definitions of Spark jobs, stages, and tasks, discussed below.

What Happens When We Start a SparkContext

  1. The Spark driver program contacts the cluster manager;
  2. The cluster manager launches a set of Spark executors (JVMs) on worker nodes (see the configuration sketch after this list);
    • Partitions of an RDD are distributed across the executors;
    • One executor can hold multiple partitions;
    • A single partition cannot be spread across multiple executors.
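A sketch of this setup in code (the master URL and executor sizes below are illustrative assumptions; the right values depend on your cluster manager):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical cluster settings; adjust the master URL and sizes for your cluster.
val conf = new SparkConf()
  .setAppName("how-spark-works-notes")
  .setMaster("spark://master-host:7077") // the cluster manager the driver contacts
  .set("spark.executor.memory", "2g")    // memory per executor JVM
  .set("spark.executor.cores", "2")      // cores per executor JVM

val sc = new SparkContext(conf)

// The 8 partitions of this RDD are distributed across the executors: one
// executor may hold several partitions, but a partition never spans executors.
val data = sc.parallelize(1 to 100, numSlices = 8)
println(data.getNumPartitions) // 8
```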

Jobs, Stages, and Tasks

  • Job: Spark jobs are the highest-level elements of Spark’s execution hierarchy. Each job corresponds to one action.
  • Stage: Each action is preceded by a chain of transformations. As mentioned above, transformations with wide dependencies require a shuffle before further computation can proceed, so each shuffle marks a stage boundary, and the transformations between two boundaries form a stage (see the lineage sketch after this list).
  • Task: A task is the smallest unit in the execution hierarchy. Since a stage contains only transformations with narrow dependencies, each task performs those transformations on one piece of the data, and all of the tasks in a stage execute the same code on different pieces.
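To tie these together, a small lineage sketch (assuming an existing SparkContext `sc`; the data and names are made up):

```scala
// Assuming `sc` is an existing SparkContext.
val lines = sc.parallelize(Seq("INFO a", "WARN b", "INFO c", "ERROR d"), numSlices = 4)

val levelCounts = lines
  .map(line => (line.split(" ")(0), 1)) // narrow: same stage as the parent RDD
  .reduceByKey(_ + _)                   // wide: shuffle -> stage boundary
  .filter(_._2 > 0)                     // narrow: same stage as reduceByKey

// toDebugString indents the lineage at each shuffle; the two indentation
// levels here correspond to the two stages of the job below.
println(levelCounts.toDebugString)

// The action launches one job with two stages; each stage runs roughly one
// task per partition, and all tasks in a stage run the same code.
levelCounts.collect().foreach(println)
```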