1. Five Basic Properties

  • A list of partitions

  • A function for computing each split

  • A list of dependencies on other RDDs

  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

These are the comments at the top of the RDD source code. The five characteristic attributes are detailed below

1.1 Partition

A Partition is the basic constituent unit of a dataset. Each partition of an RDD is processed by one computation task, so the number of partitions determines the granularity of parallel computation. The user can specify the number of partitions when creating an RDD; if not specified, a default value is used
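
A minimal spark-shell sketch (assuming an existing SparkContext sc):

// Create an RDD with an explicit number of partitions (4 here)
val rdd = sc.parallelize(1 to 100, 4)
// Check how many partitions the RDD actually has
println(rdd.getNumPartitions) // 4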

1.2 Compute Function

A function that computes the data of each partition. RDDs in Spark are computed partition by partition, and each RDD implements a compute function for this purpose. compute composes iterators, so it does not need to save the result of each intermediate calculation
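
For reference, the abstract method declared in RDD.scala looks like this (the exact signature may differ slightly across Spark versions):

def compute(split: Partition, context: TaskContext): Iterator[T]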

1.3 Dependency

There are dependencies between RDDs. Each transformation generates a new RDD, so a lineage forms between RDDs. If the data of some partitions is lost, Spark can recompute just the lost partitions through this dependency relationship instead of recomputing all partitions of the RDD
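
A minimal sketch of inspecting lineage in the spark-shell:

// Build a small pipeline, then print its chain of parent RDDs
val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
println(rdd.toDebugString) // the lineage, one parent RDD per line
println(rdd.dependencies)  // the direct dependencies of the last RDD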

1.4 Partitioner

A key-value RDD may have a Partitioner. Spark implements two kinds of partitioning functions: the hash-based HashPartitioner and the range-based RangePartitioner. Only a key-value RDD can have a Partitioner; for a non-key-value RDD the Partitioner is None. The Partitioner determines the number of partitions of the RDD itself, as well as the number of partitions in the shuffle output of the parent RDD
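
A minimal sketch, assuming a spark-shell session:

import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
println(pairs.partitioner) // None: no Partitioner yet
// Explicitly hash-partition the key-value RDD into 4 partitions
val partitioned = pairs.partitionBy(new HashPartitioner(4))
println(partitioned.partitioner) // Some(HashPartitioner)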

1.5 Preferred Storage Location

A list that stores the preferred location for computing each Partition. For an HDFS file, this list holds the block locations of each Partition. Following the principle of "move the computation, not the data", Spark schedules computing tasks as close as possible to the storage locations of the data blocks they process
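
A minimal sketch; the HDFS path is a hypothetical placeholder:

val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical path
// Ask where Spark would prefer to compute the first partition
println(lines.preferredLocations(lines.partitions(0)))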

2. Common RDD operators

Starting from the basic properties of RDDs above, the programs I typically write at work consist of creating RDDs, transforming RDDs, and executing operators on RDDs; creating an RDD is the mandatory step by which data from external systems flows into the Spark cluster. Creating RDDs from in-memory collections is generally only done in testing, so I will not go into detail. RDD transformations correspond to a special kind of operator, called a Transformation, which is lazily evaluated; the operations that trigger a Transformation to actually execute (typically collecting results, printing, returning a value, or writing from the cluster to another system) have a technical name: Action.
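
A minimal sketch of the lazy-evaluation split between Transformation and Action:

val words = sc.parallelize(Seq("spark", "rdd", "lineage"))
val upper = words.map(_.toUpperCase) // Transformation: nothing runs yet
println(upper.collect().mkString(", ")) // Action: triggers the actual Job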

2.1 Common transformation operators

Transformation operators, that is, operations that convert one RDD into another RDD, correspond to built-in compute functions. These operators are classified as wide-dependency or narrow-dependency operators according to whether they involve a shuffle

2.1.1 Differences between wide and narrow dependencies

Online articles generally give two kinds of definitions. The first is the textbook definition: whether a single partition of the parent RDD is depended on by multiple partitions of child RDDs. The second is to check for a Shuffle: a Shuffle means a wide dependency, no Shuffle means a narrow dependency. The first is reliable; the second defines the thing in terms of itself, so it has no reference value.
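
A minimal sketch of checking the first definition directly, by printing the dependency type:

val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
println(pairs.map(identity).dependencies)    // OneToOneDependency: narrow
println(pairs.reduceByKey(_ + _).dependencies) // ShuffleDependency: wide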

2.1.2 Common operators of wide and narrow dependencies

Common narrow-dependency operators (a combined usage sketch follows the list)

map(func): applies func to every element in the dataset and returns a new RDD

filter(func): applies func to every element in the dataset and returns an RDD containing the elements for which func returns true

flatMap(func): like map, but each input element is mapped to zero or more output elements

mapPartitions(func): much like map, but map applies func to each element, while mapPartitions applies func to an entire partition. Given an RDD with N elements and M partitions (N >> M), map's function is called N times, while mapPartitions' function is called only M times, processing all elements of one partition at a time

mapPartitionsWithIndex(func): similar to mapPartitions, but func additionally receives the index of the partition

glom(): turns each partition into an Array, forming a new RDD of type RDD[Array[T]]

sample(withReplacement, fraction, seed): a sampling operator that randomly samples a fraction of the data using the given random seed. withReplacement indicates whether sampled data is put back: true means sampling with replacement, false means sampling without replacement

coalesce(numPartitions, false): no shuffle; used to reduce the number of partitions

union(otherRDD): returns the union of two RDDs

cartesian(otherRDD): the Cartesian product of two RDDs

zip(otherRDD): combines two RDDs into one key-value RDD. The two RDDs must have the same number of partitions and the same number of elements; otherwise an exception is thrown.

A note on mapPartitions: it processes one partition of data at a time, and the memory for a partition can be released only after the whole partition has been processed, which can cause OOM. Best practice: when memory resources are sufficient, prefer mapPartitions for better processing efficiency
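
A minimal sketch exercising several of the narrow-dependency operators above:

val nums = sc.parallelize(1 to 10, 2)
val doubled = nums.map(_ * 2) // map: element by element
println(doubled.filter(_ % 4 == 0).collect().mkString(", "))
// mapPartitions: the function receives a whole partition as an iterator
println(nums.mapPartitions(iter => Iterator(iter.sum)).collect().mkString(", ")) // 15, 40
println(nums.glom().collect().map(_.mkString("[", ",", "]")).mkString(" "))
println(nums.zip(doubled).collect().mkString(", ")) // same partition/element counts, so no exception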

Common wide-dependency operators (a combined usage sketch follows the list)

groupBy(func): groups elements by the return value of the given function; elements with the same key are put into the same iterator

distinct([numTasks]): returns a new RDD with duplicate elements removed; the optional numTasks parameter changes the number of partitions of the new RDD

coalesce(numPartitions, true): shuffle enabled; equivalent to repartition, and can either increase or decrease the number of partitions

repartition(numPartitions): increases or decreases the number of partitions, always with a shuffle

sortBy(func, [ascending], [numTasks]): applies func to the data and sorts by the result

intersection(otherRDD): computes the intersection of two RDDs

subtract(otherRDD): computes the difference of two RDDs
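
A minimal sketch exercising several of the wide-dependency operators above:

val nums = sc.parallelize(Seq(3, 1, 2, 3, 2), 2)
println(nums.distinct().collect().mkString(", "))
println(nums.groupBy(_ % 2 == 0).collect().mkString(", "))
println(nums.sortBy(identity).collect().mkString(", ")) // 1, 2, 2, 3, 3
val other = sc.parallelize(Seq(2, 3, 4))
println(nums.intersection(other).collect().mkString(", "))
println(nums.subtract(other).collect().mkString(", ")) // 1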

2.1.3 How to distinguish wide dependencies from narrow dependencies

For operators that are hard to reason about, I suggest looking directly at the dependency graph in the Spark history server: if the graph is divided into stages at the operator, it is a wide dependency; if there is no division, it is a narrow dependency. That, of course, is the practitioner's approach. As someone who walks both theory and practice, let me also offer a discriminating method that starts from the definition: a wide dependency means a single parent RDD partition is depended on by multiple child partitions, so ask from that angle whether the data of a single parent partition could flow to different child RDD partitions. Take the distinct operator, or the sortBy operator: they perform global deduplication and global sorting. Suppose a partition starts with 1, 2, and 3, and distinct runs as map(x => (x, null)).reduceByKey((x, y) => x).map(_._1). Although the number of partitions may stay the same, the data of each partition must be checked against the data of other partitions to decide which elements to keep; input partitions must be merged and reorganized into the output partitions, so a shuffle is inevitable. The same reasoning applies to sortBy.
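
A minimal sketch: hand-rolling distinct as above and letting the lineage reveal the shuffle:

val nums = sc.parallelize(Seq(1, 2, 2, 3, 3), 3)
// distinct, following the implementation quoted above
val dedup = nums.map(x => (x, null)).reduceByKey((x, y) => x).map(_._1)
println(dedup.toDebugString) // the indentation step marks the stage (shuffle) boundary
println(dedup.collect().mkString(", "))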

2.2 Common action operators

An Action triggers a Job. A Spark program (Driver program) has as many Jobs as it has Action operators. Typical Action operators: collect / count. The trigger chain: collect() => sc.runJob() => ... => DAGScheduler.runJob() => the Job is triggered. The common Action operators are listed below, with a combined sketch after the list.

collect() / collectAsMap()

stats / count / mean / stdev / max / min

reduce(func) / fold(func) / aggregate(func)

first(): returns the first element of the RDD

take(n): returns the first n elements of the RDD

top(n): returns the first n elements, in descending order by default or according to the specified ordering

takeSample(withReplacement, num, [seed]): returns num sampled elements

foreach(func) / foreachPartition(func): similar to map and mapPartitions, but foreach is an Action

saveAsTextFile(path) / saveAsSequenceFile(path) / saveAsObjectFile(path): save the RDD to external storage
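
A minimal sketch exercising a few of the actions above:

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))
println(nums.count()) // 5
println(nums.first()) // 5
println(nums.take(2).mkString(", ")) // 5, 1
println(nums.top(2).mkString(", ")) // 5, 4
println(nums.reduce(_ + _)) // 15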

3. Common PairRDD operations

RDDs are divided into the Value type and the key-value type. The previous section covered operations on Value RDDs; in practice, key-value RDDs, also known as PairRDDs, are used heavily. Operations on Value RDDs are mostly concentrated in RDD.scala, while operations on key-value RDDs are concentrated in PairRDDFunctions.scala.

Most of the operators described above also work on PairRDDs. When the elements of an RDD are key-value pairs, it can be implicitly converted to a PairRDD. PairRDDs additionally have their own Transformation and Action operators.

3.1 PairRDD Transformation operations

3.1.1 Operations Similar to Map Operations

mapValues, flatMapValues, keys, and values are simplified operations that could equally be implemented with map.
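
A minimal sketch of these map-like PairRDD operations:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
println(pairs.mapValues(_ * 10).collect().mkString(", ")) // (a,10), (b,20)
println(pairs.keys.collect().mkString(", "))   // a, b
println(pairs.values.collect().mkString(", ")) // 1, 2
println(pairs.flatMapValues(v => Seq(v, v + 100)).collect().mkString(", "))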

3.1.2 Aggregation operations [important and difficult]

PairRDD(k, v) aggregations are widely used:

groupByKey / reduceByKey / foldByKey / aggregateByKey

combineByKey (old) / combineByKeyWithClassTag (new): the underlying implementation behind the operators above

subtractByKey: similar to subtract; removes elements whose keys also appear in the other RDD

Conclusion: when efficiency is comparable, use the method you are most familiar with; in general, groupByKey is inefficient, so use it sparingly
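
A minimal sketch computing the same per-key sum four ways:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(pairs.reduceByKey(_ + _).collect().mkString(", ")) // (a,4), (b,2)
println(pairs.foldByKey(0)(_ + _).collect().mkString(", "))
// aggregateByKey: separate functions for in-partition and cross-partition merging
println(pairs.aggregateByKey(0)(_ + _, _ + _).collect().mkString(", "))
// groupByKey ships all values across the network first; usually the slowest
println(pairs.groupByKey().mapValues(_.sum).collect().mkString(", "))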

3.1.3 Sorting Operations

sortByKey: acts on a PairRDD and sorts by key
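
A minimal sketch:

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
println(pairs.sortByKey().collect().mkString(", ")) // (1,a), (2,b), (3,c)
println(pairs.sortByKey(ascending = false).collect().mkString(", "))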

3.1.4 join operation

cogroup / join / leftOuterJoin / rightOuterJoin / fullOuterJoin

val rdd1 = sc.makeRDD(Array((1, "Spark"), (2, "Hadoop"), (3, "Kylin"), (4, "Flink")))
val rdd2 = sc.makeRDD(Array((3, "Bill"), (4, "Fifty"), (5, "Daisy"), (6, "Feng qi")))
val rdd3 = rdd1.cogroup(rdd2)
rdd3.collect.foreach(println)
// Keep only the keys present in both RDDs
rdd3.filter { case (_, (v1, v2)) => v1.nonEmpty && v2.nonEmpty }.collect
// Implement the join operation by hand, following join's source code
rdd3.flatMapValues( pair =>
  for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
)
val rdd1 = sc.makeRDD(Array(("1", "Spark"), ("2", "Hadoop"), ("3", "Scala"), ("4", "Java")))
val rdd2 = sc.makeRDD(Array(("3", "20K"), ("4", "18K"), ("5", "25K"), ("6", "10K")))
rdd1.join(rdd2).collect
rdd1.leftOuterJoin(rdd2).collect
rdd1.rightOuterJoin(rdd2).collect
rdd1.fullOuterJoin(rdd2).collect

3.1.5 Action operation

collectAsMap / countByKey / lookup(key)

lookup(key): an efficient lookup method that only scans the partition the key belongs to (if the RDD has a Partitioner)
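
A minimal sketch of these PairRDD actions:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(pairs.collectAsMap()) // Map(a -> 3, b -> 2): one value per key
println(pairs.countByKey())   // Map(a -> 2, b -> 1)
println(pairs.lookup("a"))    // all values for key "a": 1, 3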

4. Note

Real knowledge comes from practice. If you want to achieve a particular result, think of an operator and go use it; where something is unclear, read the source code and it will become clear. Check my profile for more.