The most basic data abstraction in Spark is the RDD.
RDD: Resilient Distributed Dataset.
1. RDD has three basic features
These three features are: partitioning, immutability, and parallel operation.
A, Partitioning
The data contained in an RDD is stored across different nodes of the system. Logically, we can think of an RDD as a large array in which each element represents a Partition.
Physically, each partition points to a block of data stored in memory or on disk. This block is the unit of data computed by each task, and the blocks can be distributed across different nodes.
Therefore, an RDD is only an abstract dataset: a partition does not store the data itself, only its index within the RDD. The RDD's ID together with the partition's index uniquely identifies the corresponding data block, and the data is then fetched for processing through the interface of the underlying storage layer.
In a cluster, the data blocks on each node are kept in memory as much as possible; only when memory is insufficient are they written to disk, which minimizes disk I/O overhead.
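For illustration, here is a minimal sketch of how partitioning surfaces in the API, written for a spark-shell session (where `sc`, a SparkContext, is predefined); the numbers are placeholders:

```scala
// Ask for 4 partitions explicitly; each partition maps to one data block,
// and each partition is processed by one task.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
println(rdd.getNumPartitions)  // 4

// Tag each element with the index of the partition that holds it.
rdd.mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x)))
   .take(8)
   .foreach(println)
```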
B, Immutability
Immutability means that each RDD is read-only, and the partition information it contains cannot be changed. Since an existing RDD cannot be modified, we can only obtain new RDDs by applying transformations to existing ones, computing the desired result step by step.
In this way, we do not need to materialize the intermediate data of each step; we only need to record the transformation that produced each RDD, i.e., its dependency relationship. This improves computation efficiency on the one hand and makes error recovery easier on the other. If the node holding the output of step n fails and that data is lost, the RDD can be recomputed from step n-1 according to its dependencies, which is one reason why the RDD is called a **"resilient"** distributed dataset.
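A minimal spark-shell sketch (`sc` predefined) of this behavior: each transformation returns a new, read-only RDD, nothing is computed until an action runs, and the lineage is recorded:

```scala
val base     = sc.parallelize(1 to 10)
val doubled  = base.map(_ * 2)        // new RDD; `base` is untouched
val filtered = doubled.filter(_ > 5)  // another new RDD

// toDebugString prints the recorded dependency chain (the lineage),
// which Spark replays to recompute lost partitions.
println(filtered.toDebugString)
println(filtered.collect().mkString(", "))  // action: triggers the computation
```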
C, Parallel operation
Because an RDD is partitioned, it naturally supports parallel processing. That is, the data on different nodes can be processed independently to generate a new RDD.
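A small spark-shell sketch (`sc` predefined) of per-partition parallelism; the element counts are placeholders:

```scala
val nums = sc.parallelize(1 to 1000000, 8)

// map is applied partition by partition, so the 8 partitions
// are processed in parallel across the available cores/nodes.
val squares = nums.map(x => x.toLong * x)

// mapPartitions hands one whole partition to each task:
val partialSums = nums.mapPartitions(it => Iterator(it.map(_.toLong).sum))
println(partialSums.collect().mkString(", "))  // one partial sum per partition
```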
2. Structure of RDD
Each RDD contains partition information, dependencies, and so on, as shown in the following figure:
A, Partitions
Partitions represent the logical structure of the data in the RDD; each partition maps to a data block in some node's memory or on its disk.
B, SparkContext
SparkContext is the entry point to all Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs and broadcast variables on the nodes, among other things. Only one SparkContext should be active per JVM.
C, SparkConf
SparkConf holds the configuration parameters (such as the master URL and the application name) used to create a SparkContext.
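A minimal sketch of the usual entry-point setup; the application name and the `local[*]` master are placeholder values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("rdd-demo")
  .setMaster("local[*]")          // run locally, using all available cores

val sc = new SparkContext(conf)   // only one active SparkContext per JVM

// SparkContext creates RDDs and broadcast variables:
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
val named  = sc.parallelize(Seq(1, 2)).map(i => lookup.value.getOrElse(i, "?"))
println(named.collect().mkString(", "))  // one, two

sc.stop()
```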
D, Partitioner
The Partitioner determines how the RDD is partitioned. There are currently two main built-in partitioners: HashPartitioner and RangePartitioner. HashPartitioner assigns a record to a partition by hashing its key, while RangePartitioner assigns sorted ranges of keys to partitions. You can also implement a custom Partitioner.
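A spark-shell sketch (`sc` predefined) of repartitioning a pair RDD with the two built-in partitioners:

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))

// HashPartitioner: partition chosen from the hash of the key
val byHash = pairs.partitionBy(new HashPartitioner(2))

// RangePartitioner: samples the keys, then assigns contiguous key ranges
val byRange = pairs.partitionBy(new RangePartitioner(2, pairs))

println(byHash.partitioner)   // Some(HashPartitioner)
println(byRange.partitioner)  // Some(RangePartitioner)
```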
E, Dependencies
Dependencies record how the RDD was computed, that is, which transformation operation produced it.
Depending on how the partitions of the parent RDD map to the partitions of the new RDD produced by a transformation, dependencies are divided into narrow and wide dependencies.
A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD (for example, map or filter). A wide dependency means that a single partition of the parent RDD can be used by multiple child partitions (for example, groupByKey). As shown in the figure:
Narrow dependencies allow each partition of the child RDD to be generated in parallel, and allow multiple operations to be executed as a chain on the same node, without waiting for the partition computations of other parent RDDs.
Spark distinguishes between narrow and wide dependencies for two main reasons (see the sketch after this list):
- Narrow dependencies allow operations to be pipelined on the same node, such as a filter immediately after a map. In contrast, a wide dependency requires all parent partitions to be available before the computation can proceed.
- From a failure-recovery perspective, recovery under a narrow dependency is more efficient, because only the missing parent partitions need to be recomputed, whereas recovery under a wide dependency can involve multiple parent partitions across several levels of the lineage.
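A spark-shell sketch (`sc` predefined) that makes the distinction visible in the lineage:

```scala
val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)

// map and filter are narrow: each child partition depends on exactly one
// parent partition, so the two steps are pipelined within one stage.
val narrow = words.map(w => (w, 1)).filter(_._1 != "c")

// reduceByKey is wide: a child partition may need data from every parent
// partition, which forces a shuffle and starts a new stage.
val wide = narrow.reduceByKey(_ + _)

// The printed lineage shows the stage boundary introduced by the shuffle.
println(wide.toDebugString)
```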
F, Checkpoint
The checkpoint mechanism: during the computation, for RDDs that are expensive to recompute, we can cache the data to disk or HDFS, mark the RDD as checkpointed, and clear all of its dependencies. At the same time, a dependency on a CheckpointRDD is created, which reads the RDD back from disk and generates new partition information.
After this is done, when an RDD needs error recovery and the lookup walks back to this RDD, it finds that the RDD was checkpointed and reads it directly from disk, without recomputing it.
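A spark-shell sketch (`sc` predefined) of checkpointing; the directory is a placeholder, and on a cluster it would typically be an HDFS path:

```scala
sc.setCheckpointDir("/tmp/spark-checkpoints")

val expensive = sc.parallelize(1 to 1000).map(x => x * x)  // stands in for a costly chain
expensive.cache()        // recommended, so the data is not computed twice
expensive.checkpoint()   // marks the RDD; it is written out on the next action

expensive.count()                  // action: triggers the checkpoint write
println(expensive.isCheckpointed)  // true: the lineage is now truncated
```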
G, Preferred Location
For each partition, a preferred location is chosen for the computation, following the principle that the data does not move; the code moves to the data.
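As a small illustration, SparkContext's makeRDD lets you attach location preferences to each partition; this spark-shell sketch (`sc` predefined) uses placeholder hostnames:

```scala
// Each element of the sequence becomes one partition, paired with the
// hostnames it prefers to be computed on.
val withPrefs = sc.makeRDD(Seq(
  (Seq(1, 2, 3), Seq("host1")),   // partition 0 prefers host1
  (Seq(4, 5, 6), Seq("host2"))    // partition 1 prefers host2
))

// The scheduler tries to run each task on its preferred host.
println(withPrefs.preferredLocations(withPrefs.partitions(0)))  // e.g. List(host1)
```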
H, Storage Level
Storage Level records the storage level used when persisting the RDD (see the sketch after this list):
- MEMORY_ONLY: stored only in memory; if memory is insufficient, the partitions that do not fit are simply not cached (and are recomputed when needed). This is the default storage level for RDDs.
- MEMORY_AND_DISK: cached in memory first; partitions that do not fit in memory are spilled to disk.
- DISK_ONLY: stored only on disk.
- MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same levels and behavior as above, except that each partition is replicated on two nodes of the cluster.
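A spark-shell sketch (`sc` predefined) of choosing a storage level:

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 100)

data.cache()                                // shorthand for persist(StorageLevel.MEMORY_ONLY)
data.unpersist()                            // a storage level can only be set once, so drop it first
data.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk when memory runs short
println(data.getStorageLevel)
```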
I, Iterator
The iterator function and the compute function describe how an RDD is computed from its parent RDD.
The iterator function first checks whether the partition to be computed is already in the cache; if so, it is read directly. If not, it checks whether the RDD has been checkpointed; if so, the data is read directly from there; otherwise, the compute function is called, which recurses upward to the parent RDD to carry out the computation.
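To make this lookup order concrete, here is a toy model in plain Scala. It is not Spark's actual source (the real methods are RDD.iterator and RDD.compute); the names ToyRDD, cached, and checkpointed are invented for illustration:

```scala
// Toy model of the lookup order: cache -> checkpoint -> recursive compute.
abstract class ToyRDD[T](val parent: Option[ToyRDD[_]]) {
  var cached: Option[Seq[T]] = None        // stands in for the block manager
  var checkpointed: Option[Seq[T]] = None  // stands in for data on disk/HDFS

  def compute(): Seq[T]  // how to derive this RDD's data from its parent

  def iterator(): Seq[T] =
    cached                                 // 1) read from the cache if present
      .orElse(checkpointed)                // 2) else read the checkpointed data
      .getOrElse(compute())                // 3) else recurse via the parent
}

class Source(data: Seq[Int]) extends ToyRDD[Int](None) {
  def compute(): Seq[Int] = data
}
class Mapped(p: ToyRDD[Int], f: Int => Int) extends ToyRDD[Int](Some(p)) {
  def compute(): Seq[Int] = p.iterator().map(f)  // pulls from the parent recursively
}

val src = new Source(Seq(1, 2, 3))
val dbl = new Mapped(src, _ * 2)
println(dbl.iterator())  // List(2, 4, 6), computed via the parent chain
```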