This article first appeared on my personal public account, TechFlow. Original writing isn't easy, so please consider following.


Today, in the second article of the Spark series, we look at one of Spark's most important concepts: the RDD.

In the last article, we installed Spark locally. Although we only have a local setup rather than a real cluster, that doesn't stop us from experimenting. One of Spark's biggest strengths is that the computing code is the same regardless of the cluster resources behind it; Spark handles the distributed scheduling for us automatically.

RDD concept

RDD appears in every introduction to Spark, yet many beginners don't really understand what it is, and for a long time neither did I. Before properly learning Spark I had written plenty of code, but I still couldn't say what an RDD actually was.

RDD stands for Resilient Distributed Dataset. Spelling out the full English name already helps a lot: even if the first word is unfamiliar, you can at least tell it is a distributed dataset. "Resilient" means elastic, so a literal translation is "elastic distributed dataset". That still doesn't fully explain it, but it is much clearer than the bare acronym.

An RDD is an immutable, distributed collection of objects. Each RDD is divided into partitions, and these partitions can be computed on different nodes in the cluster.

Many sources stop at this rather superficial explanation; it sounds like a lot, but it doesn't really sink in. I finally found a more detailed explanation on Daesong's blog. Daesong dug into the Spark source code and found the definition of an RDD. An RDD contains the following:

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

Let’s look at them one by one:

  1. A set of partitions. A partition is the smallest unit of data in Spark: data is stored partition by partition, and different partitions can live on different nodes. This is the basis of distributed computing.
  2. A function for computing each partition. In Spark, data and computation are separated, and Spark computes lazily: it only records which calculations should be performed on which data, and does not run them until an action actually triggers the computation. This mapping between data and computation is stored in the RDD.
  3. Dependencies on other RDDs. One RDD can be converted into another through transformation operations, and these transformations are recorded. When some data is lost, Spark recomputes just the lost data from the recorded dependencies instead of recomputing everything.
  4. An optional partitioner for key-value RDDs, i.e. the function that decides which partition each record goes to. Spark ships with a hash partitioner and a range partitioner.
  5. An optional list of preferred locations for computing each partition, e.g. the locations of the HDFS blocks that hold the data.

These five points reveal an important idea in Spark: moving data is more expensive than moving computation. In other words, when Spark schedules work, it prefers to send the computation to the nodes that hold the data rather than pull the data to the nodes doing the computation. The RDD is built around exactly this idea.
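To make points 1 and 4 a little more concrete, here is a minimal sketch you can try in the pyspark shell (the variable names are chosen purely for illustration; sc is the SparkContext the shell provides):

# Point 1: an RDD is a set of partitions.
nums = sc.parallelize(range(10), 4)   # ask for 4 partitions explicitly
print(nums.getNumPartitions())        # 4
print(nums.glom().collect())          # the elements grouped by partition
# Point 4: key-value RDDs can carry a partitioner.
pairs = nums.map(lambda x: (x % 2, x)).partitionBy(2)
print(pairs.partitioner)              # a Partitioner object; plain RDDs have None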

Create RDD

Spark provides two ways to create an RDD: reading an external data set, or parallelizing a collection that already lives in memory.

Let's look at them one by one. The easiest, of course, is parallelization, because it doesn't require an external data set; we can try it right away.

Before we do that, let's look at the concept of SparkContext. SparkContext is the entry point of a Spark program, much like a main function. When we start the Spark shell, it creates a SparkContext instance for us named sc, which we can use directly.
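If you are not in the interactive shell (for example, in a standalone script), you have to build the SparkContext yourself. Here is a minimal sketch; the app name is chosen just for illustration:

from pyspark import SparkConf, SparkContext

# In the pyspark shell, sc already exists; in a script we create it ourselves.
conf = SparkConf().setAppName('rdd-demo').setMaster('local[*]')
sc = SparkContext(conf=conf)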


We create RDDs through sc. For example, suppose I want to create an RDD from a couple of strings:

texts = sc.parallelize(['now test', 'spark rdd'])


The returned texts object is an RDD:
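A quick sanity-check sketch (the exact repr differs between Spark versions):

print(type(texts))       # <class 'pyspark.rdd.RDD'>
print(texts.collect())   # collect() pulls the data back into a local list: ['now test', 'spark rdd']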


In addition to parallelize, we can also build an RDD from external data. For example, to read from a file, we can use sc's textFile method:

text = sc.textFile('/path/path/data.txt')


In practice, apart from local debugging, we rarely use parallelize to create RDDs, because it requires the data to already fit in memory, and that limitation makes it hard to exploit Spark's real power.

Transformation operations and action operations

As we mentioned earlier in the introduction to RDDs, RDDs support two types of operations: transformations and actions.

As the name implies, a transformation converts one RDD into another. The new RDD records the transformation we applied but does not perform the computation, so what we get back is still an RDD rather than a result.

For example, after creating the texts RDD, suppose we want to filter its contents and keep only the strings longer than 8 characters. We can use filter to transform texts:

textAfterFilter = texts.filter(lambda x: len(x) > 8)


The result is again an RDD. As we said earlier, since filter is a transformation, Spark only records it and does not actually execute anything.

A transformation can involve any number of RDDs. For example, the following code produces four RDDs in total:

inputRDD = sc.textFile('path/path/log.txt')
lengthRDD = inputRDD.filter(lambda x: len(x) > 10)
errorRDD = inputRDD.filter(lambda x: 'error' in x)
unionRDD = errorRDD.union(lengthRDD)


The final union combines the results of the two RDDs. If we run the code above, Spark records the dependency information between these RDDs, which we can draw as a dependency graph:
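Spark can also print this lineage as text. Here is a small sketch; note that toDebugString may return bytes in PySpark, and the exact output format varies between versions:

# Print the recorded dependency chain of unionRDD.
lineage = unionRDD.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)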


No matter how many transformations we apply, Spark does not actually run any of them; the recorded transformations are only executed when we perform an action. first(), take(), and count() are all actions, and calling them makes Spark return a result.


Here first() returns the first element, take() accepts an argument specifying how many elements to return, and count() returns the number of elements. As we would expect, Spark returns the results once we perform these actions.
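For example, applying these actions to the filtered RDD from earlier (a small sketch; the commented results assume the two-string texts RDD we built with parallelize):

print(textAfterFilter.count())   # 1, since only 'spark rdd' is longer than 8 characters
print(textAfterFilter.first())   # 'spark rdd'
print(textAfterFilter.take(2))   # ['spark rdd'], as there is only one element left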

This article focused on the concept of the RDD; the next article will dig into transformation and action operations in more detail. If you are interested, stay tuned.

That's all for today's article. If you found it helpful, please give it a like or share it; your support means a lot to me.