“This is the 11th day of my participation in the First Challenge 2022. For details: First Challenge 2022”
Hello everyone ~
Five hours to open the door to Spark, and in this final hour we'll talk about RDDs.
Without further ado, let's get to work!
What is an RDD
To process data with high concurrency and high throughput, Spark encapsulates three data structures, each for a different application scenario:
- RDD: resilient distributed dataset
- Accumulator: distributed shared write-only variable
- Broadcast variable: distributed shared read-only variable
Many existing frameworks perform poorly on iterative algorithms and interactive data mining, and that is the motivation behind RDD.
Let's focus on how RDDs are used in data processing.
An RDD represents an immutable, read-only, partitioned dataset. Working with an RDD feels like working with a local collection: it offers many methods you can call without worrying about the underlying scheduling details.
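A minimal sketch, assuming a SparkContext named sc is already available (creating one is shown later in this post): RDD methods read like ordinary collection operations.
val numbers = sc.makeRDD(List(1, 2, 3, 4, 5))
val doubledEvens = numbers.filter(_ % 2 == 0).map(_ * 2) // keep the even numbers, then double them
doubledEvens.collect().foreach(println)                  // prints 4 and 8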
The five characteristics
An RDD has five characteristics: three basic and two optional. A short sketch after the list shows how to inspect them on a concrete RDD.
- Partitions: a list of partitions into which the data is divided. Partitioned data can be computed in parallel; a partition is the atomic unit of the dataset.
- Compute function: each partition has a function that is applied to compute/iterate over it.
- Dependencies: each RDD records its dependencies on its parent RDDs, while a source RDD has no dependencies. The relationship between RDDs is captured by these dependencies.
- Preferred locations (optional): each partition has preferred locations, i.e. the machines on which its task runs best (data locality).
- Partitioner (optional): an RDD of key-value pairs can specify how its data is partitioned, for example with partitionBy.
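To make these characteristics concrete, here is a minimal sketch (again assuming a SparkContext named sc) that inspects them on a key-value RDD; partitionBy and HashPartitioner are standard Spark APIs.
import org.apache.spark.HashPartitioner

val pairs = sc.makeRDD(List(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)

println(pairs.getNumPartitions)                        // partition list: 2 partitions
println(pairs.dependencies)                            // dependencies: empty for a source RDD
println(pairs.preferredLocations(pairs.partitions(0))) // preferred locations of the first partition
println(pairs.partitioner)                             // None until a partitioner is set

// a key-value RDD can be given a partitioner, e.g. with partitionBy
val byKey = pairs.partitionBy(new HashPartitioner(2))
println(byKey.partitioner)                             // Some(HashPartitioner)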
How it works
From a computing perspective, data processing requires both computing resources (memory & CPU) and a computing model (the logic); execution requires coordinating and integrating the two.
The Spark framework first applies for resources, then decomposes the application's data-processing logic into individual computing tasks. Each task is sent to a computing node that has been allocated resources and computes its data according to the specified computing model; finally, the results are collected.
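As a rough illustration (not the framework's internal code, and again assuming a SparkContext named sc with local[*] as the master so the local cores are the computing resources): the number of partitions of an RDD determines how many tasks a stage is decomposed into.
val data = sc.makeRDD(1 to 8, numSlices = 4)
println(data.getNumPartitions)         // 4 partitions -> 4 tasks per stage
println(data.map(_ * 2).reduce(_ + _)) // 72: each task computes its own partition, results are combined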
Creating an RDD
There are four ways to create an RDD in Spark.
Open IDEA and create a Scala Class.
1. Create an RDD from memory (a collection)
Spark provides two main methods: parallelize and makeRDD.
import org.apache.spark.{SparkConf, SparkContext}

object Rdd {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    val sparkContext = new SparkContext(sparkConf)
    // create RDDs from an in-memory collection
    val rdd1 = sparkContext.parallelize(List(1, 2, 3, 4))
    val rdd2 = sparkContext.makeRDD(List(1, 2, 3, 4))
    rdd1.collect().foreach(println)
    rdd2.collect().foreach(println)
    sparkContext.stop()
  }
}
The output
In terms of the underlying implementation, the makeRDD method simply calls the parallelize method:
def makeRDD[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
parallelize(seq, numSlices)
}
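Since makeRDD just delegates to parallelize, the two calls below are interchangeable; the second argument (numSlices) controls how many partitions are created. This snippet assumes the sparkContext defined in the example above.
val a = sparkContext.parallelize(List(1, 2, 3, 4), 2) // 2 partitions
val b = sparkContext.makeRDD(List(1, 2, 3, 4), 2)     // equivalent, also 2 partitions
println(a.getNumPartitions == b.getNumPartitions)     // true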
2. Create an RDD from external storage (a file)
An RDD can be created from datasets in external storage systems, including the local file system and any dataset supported by Hadoop, such as HDFS and HBase.
Reading the file (and counting its lines) works the same way as in section 2. If you have problems executing the following code on Windows, refer to section 2 on setting up the Spark shell.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Rdd {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
    val sparkContext = new SparkContext(sparkConf)
    // create an RDD from a file in external storage
    val fileRDD: RDD[String] = sparkContext.textFile("src/main/java/test.txt")
    fileRDD.collect().foreach(println)
    sparkContext.stop()
  }
}
The output
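Besides a single local file, textFile also accepts a directory, a wildcard pattern, or an HDFS URI. The paths below are hypothetical placeholders, not files from this project.
val wholeDir = sparkContext.textFile("src/main/java/")                     // every file in a directory
val matched  = sparkContext.textFile("src/main/java/*.txt")                // wildcard pattern
val onHdfs   = sparkContext.textFile("hdfs://namenode:8020/data/test.txt") // file on HDFS (hypothetical address)
println(wholeDir.count())                                                  // number of lines read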
3. Create an RDD from another RDD
After a computation on an existing RDD completes, a new RDD is produced, as the sketch below shows.
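A small sketch, reusing the sparkContext from the examples above: each transformation on an existing RDD yields a new RDD, and the lineage between them is recorded.
val words  = sparkContext.makeRDD(List("spark", "rdd", "spark"))
val pairs  = words.map(word => (word, 1)) // new RDD derived from words
val counts = pairs.reduceByKey(_ + _)     // new RDD derived from pairs
println(counts.toDebugString)             // shows the chain of parent RDDs
counts.collect().foreach(println)         // (spark,2), (rdd,1)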
4. Create an RDD directly (new)
The RDD is constructed directly with new; this approach is generally used by the Spark framework itself.
Finally
Congratulations to everyone who has persevered this far. Over 5 days and 5 hours of study, you have gained a basic understanding of Spark and completed the classic introductory big data case: WordCount.
However, there is still a long way to go to learn Spark well. Here is one of my personal favorite sayings:
The road ahead is long and hard, but keep walking and you will arrive.
Water does not race to be first; it strives to keep flowing without end.
Thank you for your support over these 5 days. Thank you! Finally, I wish you a happy New Year!