Before I start talking about Spark, a suggestion: if you want to learn and use Spark well, the Spark website is a great resource for almost all of your needs. At the same time, I recommend learning the Scala language, for two main reasons: 1. Spark is written in Scala, and to learn Spark well you will need to read and analyze its source code, just as with other technologies. 2. Writing Spark programs in Scala is more convenient, concise, and efficient than in Java. (I will cover the Scala language separately later.) Back to the topic: here is an overview of the Spark ecosystem.

Apache Spark is a fast, versatile, scalable, and fault-tolerant big data analysis engine based on in-memory iterative computation. First of all, note that Spark is a computing engine: it processes data but does not store it. Let's start by looking at the core components of the Spark ecosystem:

This article will first briefly introduce the usage scenarios of each component, and then I will explain the core components separately in later posts. The following is based on Spark 2.x.

Spark RDD and Spark SQL

Spark RDD and Spark SQL are mostly used in offline (batch) scenarios. Spark RDD can process both structured and unstructured data, while Spark SQL processes structured data and internally handles distributed data sets through the Dataset (and DataFrame) abstraction.
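As a rough illustration of the difference, here is a minimal Scala sketch (the comma-separated people.txt input file is a hypothetical placeholder): the RDD API works on raw records, while Spark SQL attaches a schema and queries the same data as a Dataset/DataFrame.

```scala
import org.apache.spark.sql.SparkSession

object RddVsSparkSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddVsSparkSql")
      .master("local[*]") // local mode, just for the sketch
      .getOrCreate()
    import spark.implicits._

    // RDD API: raw text with no schema; imposing structure is the programmer's job
    val lines = spark.sparkContext.textFile("people.txt") // hypothetical "name,age" lines
    val people = lines.map(_.split(","))
      .map(fields => (fields(0), fields(1).trim.toInt))

    // Spark SQL: the same data as a Dataset/DataFrame with an explicit schema
    val df = people.toDF("name", "age")
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```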

Spark Streaming and Structured Streaming

Spark Streaming is used for stream processing, but it should be emphasized that it is based on micro-batch processing. Even though Structured Streaming improves latency, Spark's streaming is still quasi-real-time compared with Flink and Storm.
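A minimal Structured Streaming sketch in the spirit of the classic word count: it assumes a socket source on localhost:9999 (e.g. fed with `nc -lk 9999`) and prints the running counts, so the micro-batch behavior is visible as the console output updates batch by batch.

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingWordCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Treat the socket stream as an unbounded table of lines
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split lines into words and maintain a running count per word
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Each micro-batch prints the updated counts to the console
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```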

MLlib

Used for machine learning. PySpark is also widely used for Python-based data processing on top of Spark.
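As a taste of MLlib's DataFrame-based API, here is a minimal Scala sketch that clusters a tiny made-up data set with K-means; the data points and parameters are placeholders.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MLlibSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up data set; the DataFrame-based API expects a "features" column
    val df = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
      Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
    ).map(Tuple1.apply).toDF("features")

    // Cluster into two groups and print the learned centers
    val model = new KMeans().setK(2).setSeed(1L).fit(df)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```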

GraphX

Used for graph computation.
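A minimal GraphX sketch: it builds a tiny property graph from made-up vertices and edges and runs the built-in PageRank algorithm.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GraphXSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A tiny property graph: vertices carry names, edges carry a relationship label
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph = Graph(vertices, edges)

    // PageRank is one of GraphX's built-in graph algorithms
    graph.pageRank(0.001).vertices.collect().foreach(println)

    spark.stop()
  }
}
```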

SparkR

Used for data processing and statistical analysis based on the R language.

The following describes the main features of Spark:

  • Fast

    Spark implements a DAG execution engine and processes data iteratively in memory. Spark keeps the intermediate results of data analysis in memory, eliminating the need to repeatedly read and write data from external storage systems. Compared with MapReduce, Spark is therefore better suited to scenarios requiring iterative operations, such as machine learning and data mining (see the first sketch after this list).

  • Easy to use

    Supports the Scala, Java, Python, and R languages; provides a rich set of high-level operators (currently more than 80) with which users can quickly build different applications (see the second sketch after this list); and supports interactive shell queries in Scala and Python.

  • General

    Spark offers a one-stop solution that integrates batch processing, stream processing, interactive queries, machine learning, and graph computation, avoiding the resource waste of deploying separate clusters for different computing scenarios.

  • Good fault tolerance

    Fault tolerance in distributed data set computation is achieved through RDD lineage and checkpointing: when an operation fails, lost partitions can be recomputed from their lineage, and checkpoints mean the calculation does not have to start again from the very beginning (see the third sketch after this list).

  • Strong compatibility

    Spark runs on resource managers such as YARN, Kubernetes, and Mesos, while Standalone mode provides a built-in resource scheduler; it also supports a wide range of data sources (see the last sketch after this list).
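To illustrate the "fast" point: a minimal sketch of iterative in-memory computation. cache() keeps the data set in memory so each pass avoids re-reading it from external storage; the data and the iteration itself are contrived placeholders.

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // cache() pins the data set in memory across iterations
    val nums = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

    // Each pass reuses the in-memory partitions instead of recomputing them
    var threshold = 0.0
    for (_ <- 1 to 10) {
      threshold = nums.filter(_ > threshold).mean()
    }
    println(s"threshold after 10 iterations: $threshold")

    spark.stop()
  }
}
```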
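To illustrate ease of use: a spark-shell sketch that chains several of the built-in operators into a top-10 word count. `sc` is predefined in the shell, and README.md stands in for any text file.

```scala
// In spark-shell (Scala REPL), `sc` and `spark` are predefined
val words = sc.textFile("README.md") // any text file
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)

// A handful of the 80+ operators, chained fluently
words.map((_, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)
  .foreach(println)
```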
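To illustrate fault tolerance: a minimal checkpoint sketch. Checkpointing writes an RDD to reliable storage and truncates its lineage, so recovery resumes from the checkpoint rather than from the very beginning. The checkpoint directory is a placeholder; on a real cluster it would point to reliable storage such as HDFS.

```scala
import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CheckpointSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Placeholder directory; use reliable storage (e.g. HDFS) on a cluster
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val base = sc.parallelize(1 to 1000)
    val derived = base.map(_ * 2).filter(_ % 3 == 0)

    // Truncate the lineage: on failure, recovery resumes from this point
    derived.checkpoint()
    derived.count() // materializes the RDD and writes the checkpoint

    spark.stop()
  }
}
```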
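To illustrate compatibility: the same application code runs unchanged on any of the supported cluster managers; only the master URL differs, and in practice it is usually supplied via spark-submit --master rather than hard-coded. Host names and ports below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object AnyClusterManager {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AnyClusterManager")
      .master("local[*]")                       // local development
      // .master("spark://master-host:7077")    // Standalone, the built-in scheduler
      // .master("yarn")                        // Hadoop YARN
      // .master("mesos://master-host:5050")    // Apache Mesos
      // .master("k8s://https://api-host:6443") // Kubernetes (Spark 2.3+)
      .getOrCreate()

    spark.stop()
  }
}
```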