Hello everyone~

Five hours to open the door to Spark, and in this final hour we talk about RDD.

Without further ado, let's get to work!

What is an RDD

Spark encapsulates three data structures for processing data with high concurrency and high throughput, each aimed at a different application scenario (a small sketch of the latter two follows the list):

  • RDD: resilient distributed dataset

  • Accumulator: distributed shared write-only variable

  • Broadcast variable: distributed shared read-only variable
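This article focuses on RDD, but here is a minimal sketch of the other two, just to give them a face (the names errorCount and blacklist are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariables {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("shared"))
    // Accumulator: tasks only write (add) to it, the driver reads the final value
    val errorCount = sc.longAccumulator("errorCount")
    // Broadcast variable: a read-only copy shipped to every executor
    val blacklist = sc.broadcast(Set("spam", "ad"))
    sc.makeRDD(List("spam", "hello", "ad"))
      .foreach(word => if (blacklist.value.contains(word)) errorCount.add(1))
    println(errorCount.value) // 2
    sc.stop()
  }
}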

Many existing frameworks perform poorly on iterative algorithms and interactive data mining, and that is the motivation behind RDD.

Let’s focus on how RDD is used in data processing.

An RDD represents an immutable, read-only, partitioned dataset. Working with an RDD feels like working with a local collection: it offers many methods you can call without worrying about the underlying scheduling details.

The five characteristics

An RDD has five characteristics: three basic ones and two optional ones (a small sketch showing how they surface on an RDD follows the list).

  • Partition: a list of partitions. The data can be split into partitions that are computed in parallel; a partition is the atomic piece of the dataset.
  • Compute function: each partition has a function that is applied to compute it.
  • Dependency: every RDD depends on its parent RDD(s), while a source RDD has no dependency; the relationships between RDDs are recorded as a lineage of dependencies.
  • Preferred location (optional): each partition has preferred locations, i.e. the machines on which the task would run best (data locality).
  • Partitioning policy (optional): a key-value RDD can carry a Partitioner that tells Spark how to shard its data; it can be set with partitionBy (repartition only changes the number of partitions).
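Here is a minimal sketch of how most of these surface on a real RDD (the compute function stays internal to Spark; the object name and numbers are made up for illustration):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object FiveCharacteristics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("five"))
    val rdd = sc.makeRDD(List(("a", 1), ("b", 2), ("c", 3)), 2)  // 2 partitions
    val shuffled = rdd.partitionBy(new HashPartitioner(2))       // key-value partitioning policy
    println(shuffled.getNumPartitions)                            // the partition list
    println(shuffled.dependencies)                                // dependency on the parent RDD
    println(shuffled.partitioner)                                 // Some(HashPartitioner) - optional
    println(shuffled.preferredLocations(shuffled.partitions(0)))  // data locality hints - optional
    sc.stop()
  }
}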

How it works

From a computing perspective, data processing needs computing resources (memory & CPU) and a computing model (logic), and execution requires coordinating and combining the two.

The Spark framework first applies for resources and then decomposes the application's data-processing logic into individual computing tasks. The tasks are sent to the compute nodes that were allocated resources, the data is processed according to the specified computing model, and finally the results are collected.

Creating an RDD

There are four ways to create an RDD in Spark.

Open IDEA and create a Scala Class.

1. Create an RDD from memory (a collection)

Spark provides two main methods for this: parallelize and makeRDD.

import org.apache.spark.{SparkConf, SparkContext}

object Rdd {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    val sparkContext = new SparkContext(sparkConf)
    // Both methods turn a local collection into an RDD
    val rdd1 = sparkContext.parallelize(List(1, 2, 3, 4))
    val rdd2 = sparkContext.makeRDD(List(1, 2, 3, 4))
    rdd1.collect().foreach(println)
    rdd2.collect().foreach(println)
    sparkContext.stop()
  }
}

The output

Looking at the underlying implementation, the makeRDD method is really just a wrapper around parallelize:

def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}
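As the signature shows, both methods also accept an optional numSlices argument (defaulting to defaultParallelism) that sets the number of partitions. A quick sketch, reusing the sparkContext from the example above:

val rdd3 = sparkContext.makeRDD(List(1, 2, 3, 4), 2) // ask for 2 partitions
println(rdd3.getNumPartitions)                       // 2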

2. Create an RDD from external storage (a file)

An RDD can be created from datasets in external storage systems, including the local file system and any dataset supported by Hadoop, such as HDFS and HBase.

Reading the file (and counting its lines) works the same way as in section 2. If you have problems executing the following code on Windows, refer to section 2 on setting up the Spark shell.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Rdd {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    val sparkContext = new SparkContext(sparkConf)
    // textFile reads the file line by line and returns an RDD[String]
    val fileRDD: RDD[String] = sparkContext.textFile("src/main/java/test.txt")
    fileRDD.collect().foreach(println)
    sparkContext.stop()
  }
}

The output
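And counting the lines mentioned above is just one more call on the same RDD (a small sketch reusing fileRDD from the code above):

println(fileRDD.count()) // number of lines in test.txt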

3. Create an RDD from another RDD

When a computation on an existing RDD completes, it produces a new RDD.
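For example (a minimal sketch reusing rdd1 from the first example), every transformation such as map or filter leaves the original RDD untouched and returns a new one:

val doubled = rdd1.map(_ * 2)       // a new RDD derived from rdd1
val evens = doubled.filter(_ > 4)   // another new RDD derived from doubled
evens.collect().foreach(println)    // 6, 8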

4. Create an RDD directly (new)

An RDD can also be constructed directly with new; this approach is generally used only inside the Spark framework itself.

Finally

Congratulations to everyone who has persevered this far. Over 5 days and 5 hours of study, you have gained a basic understanding of Spark and completed the classic introductory big data case: WordCount.

However, there is still a long way to go to learn Spark well. Here is one of my personal favorite sayings:

The road is long and hard, but keep walking and you will arrive.

Water does not strive to be first; what it strives for is an endless flow.

Thank you for your 5 days of support. Thank you! Finally, I wish you a happy New Year!