1. RDD

  • The official explanation: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel

As the RDD source code shows, RDD not only provides basic methods such as map and filter, but also gains additional operations through implicit conversions.

  • Operations on key-value (K-V) RDDs are encapsulated in org.apache.spark.rdd.PairRDDFunctions, for example: combineByKey, reduceByKey, etc.
  • Operations on RDDs of Double values are encapsulated in org.apache.spark.rdd.DoubleRDDFunctions, for example: sum, etc.
  • Operations for saving RDDs as SequenceFile files are encapsulated in org.apache.spark.rdd.SequenceFileRDDFunctions, for example: saveAsSequenceFile, etc.
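The implicit-enrichment mechanism behind PairRDDFunctions can be illustrated with plain Scala collections. The sketch below is not Spark code; PairFunctionsDemo and PairOps are hypothetical names, and a Seq of pairs stands in for a real RDD[(K, V)]:

```scala
// Sketch of Spark's implicit-enrichment pattern, using a plain Scala
// Seq[(K, V)] in place of RDD[(K, V)] (no Spark dependency needed).
object PairFunctionsDemo {
  // Analogous to org.apache.spark.rdd.PairRDDFunctions: the extra method
  // becomes available only when the element type is a key-value pair.
  implicit class PairOps[K, V](self: Seq[(K, V)]) {
    def reduceByKey(f: (V, V) => V): Map[K, V] =
      self.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))
    // reduceByKey is found via the implicit class, just as Spark finds
    // PairRDDFunctions for an RDD whose elements are pairs.
    val sums = pairs.reduceByKey(_ + _)
    assert(sums == Map("a" -> 4, "b" -> 2))
    println(sums)
  }
}
```

Calling reduceByKey on a plain Seq compiles only because the compiler inserts the PairOps wrapper; Spark uses the same trick so that reduceByKey appears on RDD[(K, V)] but not on, say, RDD[String].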

2. Features of RDD

  • A list of partitions: an RDD is divided into one or more partitions
  • A function for computing each split: the computation on an RDD is expressed as a function applied to each split (partition)
  • A list of dependencies on other RDDs: an RDD records the parent RDDs it was derived from
  • A Partitioner for key-value RDDs: If the data stored in the RDD is in key-value form, a custom Partitioner can be passed for repartitioning
  • Optionally, a list of preferred locations to compute each split on: for example, the block locations of an HDFS file, which enables data locality
  • From these five properties, an RDD can be summarized as:
    1. An immutable, partitioned collection object
    2. Created through parallel transformations such as map and filter
    3. Automatically rebuilt on failure
    4. Controllable storage levels (memory, disk, etc.), with fault tolerance achieved through recomputation
    5. Must be serializable
    6. Statically typed
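The five properties above can be sketched as a Scala trait. This is a simplified toy, not Spark's actual org.apache.spark.rdd.RDD; the names MiniRDD and ParallelMiniRDD are made up for illustration, though the method names mirror Spark's internal API:

```scala
// Toy illustration of the five RDD properties as a Scala trait.
trait MiniRDD[T] {
  def getPartitions: Seq[Int]                           // 1. a list of partitions
  def compute(split: Int): Iterator[T]                  // 2. a function to compute each split
  def dependencies: Seq[MiniRDD[_]]                     // 3. dependencies on other RDDs
  def partitioner: Option[Int => Int] = None            // 4. optional partitioner (key-value RDDs)
  def preferredLocations(split: Int): Seq[String] = Nil // 5. preferred locations for each split

  // collect() simply concatenates every partition's iterator.
  def collect(): Seq[T] = getPartitions.flatMap(compute(_).toSeq)
}

// A source RDD over an in-memory sequence, split into `numSlices` partitions
// (loosely analogous to SparkContext.parallelize).
class ParallelMiniRDD[T](data: Seq[T], numSlices: Int) extends MiniRDD[T] {
  def getPartitions: Seq[Int] = 0 until numSlices
  def compute(split: Int): Iterator[T] = {
    val start = split * data.length / numSlices
    val end   = (split + 1) * data.length / numSlices
    data.slice(start, end).iterator
  }
  def dependencies: Seq[MiniRDD[_]] = Nil // a source RDD has no parents
}
```

For example, `new ParallelMiniRDD((1 to 10).toSeq, 3).collect()` walks all three splits and yields the original ten elements, which is essentially what a real collect does across executors.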

3. Narrow and wide dependencies

  • NarrowDependency

    • The official explanation: Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution
    • The same package contains two concrete narrow-dependency subclasses: OneToOneDependency and RangeDependency
  • Wide dependency (ShuffleDependency)

    • The official explanation: Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle, the RDD is transient since we don’t need it on the executor side

  • Conclusion:
    • Spark dependencies are controlled by org.apache.spark.Dependency, which has two subclasses corresponding to narrow dependencies (NarrowDependency) and wide dependencies (ShuffleDependency)
    • NarrowDependency is an abstract class, and the concrete implementation classes are OneToOneDependency and RangeDependency
      • OneToOneDependency: each partition of the child RDD depends on exactly one partition of the parent RDD
      • RangeDependency: a range of child partitions depends one-to-one on a range of parent partitions (for example, as produced by union)
    • ShuffleDependency is what other articles call a wide dependency; when describing Spark dependencies, I recommend calling it ShuffleDependency
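The dependency hierarchy above can be sketched in a few lines of plain Scala. The class names Dep, NarrowDep, OneToOneDep, RangeDep, and ShuffleDep are simplified stand-ins for Spark's real classes, but the partition-mapping logic follows what OneToOneDependency and RangeDependency actually do:

```scala
// Simplified sketch of Spark's Dependency hierarchy.
abstract class Dep {
  // Which parent partitions does child partition `partitionId` depend on?
  def getParents(partitionId: Int): Seq[Int]
}

// Narrow: each child partition depends on a small number of parent partitions.
abstract class NarrowDep extends Dep

// Each child partition depends on exactly one parent partition with the same
// index, as in map and filter.
class OneToOneDep extends NarrowDep {
  def getParents(partitionId: Int): Seq[Int] = List(partitionId)
}

// A range of child partitions maps one-to-one onto a range of parent
// partitions, as produced by union: `inStart`/`outStart` are where the range
// begins in the parent/child, `length` is the size of the range.
class RangeDep(inStart: Int, outStart: Int, length: Int) extends NarrowDep {
  def getParents(partitionId: Int): Seq[Int] =
    if (partitionId >= outStart && partitionId < outStart + length)
      List(partitionId - outStart + inStart)
    else Nil
}

// Wide: any child partition may read from any parent partition, so there is
// no per-partition parent mapping and a shuffle is required.
class ShuffleDep extends Dep {
  def getParents(partitionId: Int): Seq[Int] =
    sys.error("wide dependency: requires a shuffle, no 1:1 parent mapping")
}
```

The narrow cases answer getParents with a constant-size list, which is exactly what makes pipelined execution possible; the wide case cannot, which is why a shuffle stage boundary is inserted there.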