1. RDD

  • The official explanation: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel

As the RDD source code shows, RDD not only provides basic methods such as map and filter, but also gains additional operations through implicit conversions.

  • Operations on key-value (K-V) RDDs are encapsulated in org.apache.spark.rdd.PairRDDFunctions, for example: combineByKey, reduceByKey, etc.
  • Operations on RDDs of Double values are encapsulated in org.apache.spark.rdd.DoubleRDDFunctions, for example: sum, etc.
  • Operations for saving RDDs as SequenceFile files are encapsulated in org.apache.spark.rdd.SequenceFileRDDFunctions, for example: saveAsSequenceFile, etc.
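The implicit-enrichment mechanism behind PairRDDFunctions can be illustrated with plain Scala collections. The sketch below is not Spark code; PairFunctionsDemo and PairOps are hypothetical names, and a Seq of pairs stands in for a real RDD[(K, V)]:

```scala
// Sketch of Spark's implicit-enrichment pattern, using a plain Scala
// Seq[(K, V)] in place of RDD[(K, V)] (no Spark dependency needed).
object PairFunctionsDemo {
  // Analogous to org.apache.spark.rdd.PairRDDFunctions: the extra method
  // becomes available only when the element type is a key-value pair.
  implicit class PairOps[K, V](self: Seq[(K, V)]) {
    def reduceByKey(f: (V, V) => V): Map[K, V] =
      self.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))
    // reduceByKey is found via the implicit class, just as Spark finds
    // PairRDDFunctions for an RDD whose elements are pairs.
    val sums = pairs.reduceByKey(_ + _)
    assert(sums == Map("a" -> 4, "b" -> 2))
    println(sums)
  }
}
```

Calling reduceByKey on a plain Seq compiles only because the compiler inserts the PairOps wrapper; Spark uses the same trick so that reduceByKey appears on RDD[(K, V)] but not on, say, RDD[String].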

2. Features of RDD

  • A list of partitions: an RDD is divided into one or more partitions
  • A function for computing each split: the computation on an RDD is expressed as a function applied to each split (partition)
  • A list of dependencies on other RDDs: an RDD records the parent RDDs it was derived from
  • A Partitioner for key-value RDDs: If the data stored in the RDD is in key-value form, a custom Partitioner can be passed for repartitioning
  • Optionally, a list of preferred locations to compute each split on: for example, the block locations of an HDFS file, which enables data locality
  • From these five properties, an RDD can be summarized as:
    1. An immutable, partitioned collection object
    2. Created through parallel transformations such as map and filter
    3. Automatically rebuilt on failure
    4. Controllable storage levels (memory, disk, etc.), with fault tolerance achieved through recomputation
    5. Must be serializable
    6. Statically typed
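The five properties above can be sketched as a Scala trait. This is a simplified toy, not Spark's actual org.apache.spark.rdd.RDD; the names MiniRDD and ParallelMiniRDD are made up for illustration, though the method names mirror Spark's internal API:

```scala
// Toy illustration of the five RDD properties as a Scala trait.
trait MiniRDD[T] {
  def getPartitions: Seq[Int]                           // 1. a list of partitions
  def compute(split: Int): Iterator[T]                  // 2. a function to compute each split
  def dependencies: Seq[MiniRDD[_]]                     // 3. dependencies on other RDDs
  def partitioner: Option[Int => Int] = None            // 4. optional partitioner (key-value RDDs)
  def preferredLocations(split: Int): Seq[String] = Nil // 5. preferred locations for each split

  // collect() simply concatenates every partition's iterator.
  def collect(): Seq[T] = getPartitions.flatMap(compute(_).toSeq)
}

// A source RDD over an in-memory sequence, split into `numSlices` partitions
// (loosely analogous to SparkContext.parallelize).
class ParallelMiniRDD[T](data: Seq[T], numSlices: Int) extends MiniRDD[T] {
  def getPartitions: Seq[Int] = 0 until numSlices
  def compute(split: Int): Iterator[T] = {
    val start = split * data.length / numSlices
    val end   = (split + 1) * data.length / numSlices
    data.slice(start, end).iterator
  }
  def dependencies: Seq[MiniRDD[_]] = Nil // a source RDD has no parents
}
```

For example, `new ParallelMiniRDD((1 to 10).toSeq, 3).collect()` walks all three splits and yields the original ten elements, which is essentially what a real collect does across executors.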

3. Narrow and wide dependencies

  • NarrowDependency

    • The official explanation: Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution
    • The same package contains two concrete narrow-dependency subclasses: OneToOneDependency and RangeDependency
  • Wide dependency (ShuffleDependency)

    • The official explanation: Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle, the RDD is transient since we don’t need it on the executor side

  • Conclusion:
    • Spark dependencies are controlled by org.apache.spark.Dependency, which has two subclasses corresponding to narrow dependencies (NarrowDependency) and wide dependencies (ShuffleDependency)
    • NarrowDependency is an abstract class, and the concrete implementation classes are OneToOneDependency and RangeDependency
      • OneToOneDependency: each partition of the child RDD depends on exactly one partition of the parent RDD
      • RangeDependency: a range of child partitions depends one-to-one on a range of parent partitions (for example, as produced by union)
    • ShuffleDependency is what other articles call a wide dependency; when describing Spark dependencies, I recommend calling it ShuffleDependency
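The dependency hierarchy above can be sketched in a few lines of plain Scala. The class names Dep, NarrowDep, OneToOneDep, RangeDep, and ShuffleDep are simplified stand-ins for Spark's real classes, but the partition-mapping logic follows what OneToOneDependency and RangeDependency actually do:

```scala
// Simplified sketch of Spark's Dependency hierarchy.
abstract class Dep {
  // Which parent partitions does child partition `partitionId` depend on?
  def getParents(partitionId: Int): Seq[Int]
}

// Narrow: each child partition depends on a small number of parent partitions.
abstract class NarrowDep extends Dep

// Each child partition depends on exactly one parent partition with the same
// index, as in map and filter.
class OneToOneDep extends NarrowDep {
  def getParents(partitionId: Int): Seq[Int] = List(partitionId)
}

// A range of child partitions maps one-to-one onto a range of parent
// partitions, as produced by union: `inStart`/`outStart` are where the range
// begins in the parent/child, `length` is the size of the range.
class RangeDep(inStart: Int, outStart: Int, length: Int) extends NarrowDep {
  def getParents(partitionId: Int): Seq[Int] =
    if (partitionId >= outStart && partitionId < outStart + length)
      List(partitionId - outStart + inStart)
    else Nil
}

// Wide: any child partition may read from any parent partition, so there is
// no per-partition parent mapping and a shuffle is required.
class ShuffleDep extends Dep {
  def getParents(partitionId: Int): Seq[Int] =
    sys.error("wide dependency: requires a shuffle, no 1:1 parent mapping")
}
```

The narrow cases answer getParents with a constant-size list, which is exactly what makes pipelined execution possible; the wide case cannot, which is why a shuffle stage boundary is inserted there.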