With everyone talking about Spark these days, I feel compelled to publish a technical article that helps you understand and master Spark from the ground up, from concepts to programming, and appreciate what makes it so appealing.
1. Why Spark
First, Spark is one of the most efficient open source data processing engines, designed for fast processing, ease of use, and advanced analytics, with contributors from more than 250 organizations and a growing community of developers and users.
Second, as a general-purpose computing engine designed for large-scale distributed data processing, Spark supports multiple workloads through a unified engine made up of Spark components and libraries with APIs for popular programming languages, including Scala, Java, Python, and R.
Finally, it can be deployed in different environments, read data from a variety of data sources, and interact with numerous applications. At the same time, this unified computing engine means that different workloads all run on the same engine: ETL, interactive queries (Spark SQL), advanced analytics (machine learning), graph processing (GraphX/GraphFrames), and stream processing (Spark Streaming). You'll get an introduction to some of these components in subsequent steps, but first let's introduce the key concepts and terminology.
2. Concepts, key terms and keywords of Apache Spark
In June this year, KDnuggets published an explanation of the key Apache Spark terms (www.kdnuggets.com/2016/06/spa…).
Spark Cluster A group of machines, or nodes, in the cloud or in a data center with Spark preinstalled and preconfigured. Those machines are the Spark Workers, the Spark Master (the cluster manager in standalone mode), and at least one Spark Driver.
Spark Master As the name implies, the Spark Master JVM acts as the cluster manager in standalone deployment mode, and Spark Workers register themselves with it as part of the cluster. Depending on the deployment mode, it acts as a resource manager and decides how many executors to launch, and on which machines in the cluster.
Spark Worker The Spark Worker JVM launches executors on behalf of the Spark Driver after receiving instructions from the Spark Master. A Spark application is decomposed into units of work called tasks, which are executed by the executors on each Worker. In short, the Worker's job is to launch executors on behalf of the Master.
Spark Executor A JVM container with an allocated amount of cores and memory on which Spark runs its tasks. Each Worker node launches its own Spark Executor with a configurable number of cores (or threads). Besides executing Spark tasks, each Executor also stores and caches data partitions in memory.
Spark Driver Once it gets information about all the Workers in the cluster from the Spark Master, the Driver assigns Spark tasks to each Worker's executors. The Driver also receives the computed results from each executor's tasks.
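As a rough illustration of how these terms map onto configuration, here is a minimal sketch using standard Spark properties; the Master URL and resource sizes are made up for the example:

```scala
import org.apache.spark.SparkConf

// The driver describes the resources it wants; the cluster manager (Master)
// then launches executors of this size on the Workers.
val conf = new SparkConf()
  .setAppName("cluster-terms-demo")
  .setMaster("spark://master-host:7077") // hypothetical standalone Master URL
  .set("spark.executor.memory", "2g")    // memory allocated to each executor JVM
  .set("spark.executor.cores", "2")      // cores (threads) per executor
```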
SparkSession and SparkContext
SparkContext is the channel through which all of Spark's functionality is accessed; there is only one SparkContext per JVM. The Spark Driver uses it to connect to the cluster manager, communicate with it, and submit Spark jobs. It also lets you configure Spark parameters. Through the SparkContext, the Driver can instantiate other contexts, such as SQLContext, HiveContext, and StreamingContext.
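A minimal sketch of this pre-2.0 pattern, assuming local mode; the application name is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// One SparkContext per JVM: configure it, then use it as the entry point.
val conf = new SparkConf()
  .setAppName("spark-context-demo")
  .setMaster("local[*]") // local mode, using all available cores

val sc = new SparkContext(conf)

// Other contexts are instantiated on top of the SparkContext.
val sqlContext = new SQLContext(sc)
```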
With Apache Spark 2.0, SparkSession provides a single entry point to all of the functionality mentioned above, making it easier to access Spark's features and to work with data consistently across the underlying contexts.
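A minimal sketch of the Spark 2.0 entry point, again assuming local mode; the input file name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.0+: one builder-created entry point that wraps SparkContext,
// SQLContext and HiveContext behind a single object.
val spark = SparkSession.builder()
  .appName("spark-session-demo")
  .master("local[*]")
  .getOrCreate()

// The underlying SparkContext is still reachable when you need it.
val sc = spark.sparkContext

// Working with data through the unified entry point.
val df = spark.read.json("people.json") // hypothetical input file
df.show()
```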
Spark deployment modes
Spark supports four cluster deployment modes, each with its own characteristics regarding where the Spark components run within the cluster. Of all the modes, local mode, which runs everything on a single host, is by far the easiest.
As a beginner or intermediate developer, you don't need to memorize these details; they are here for your reference. In addition, step 5 of this article delves into various aspects of the Spark architecture.
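As a rough guide, the deployment mode is usually selected through the master URL passed to the application; the host names and ports below are illustrative only:

```scala
// Local mode: driver and executors all run inside one JVM on a single host.
val localMaster = "local[*]"

// Standalone mode: connect to a Spark Master that manages the Workers.
val standaloneMaster = "spark://master-host:7077" // hypothetical host

// YARN and Mesos modes: an external cluster manager allocates the resources.
val yarnMaster  = "yarn"
val mesosMaster = "mesos://mesos-host:5050"       // hypothetical host
```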
Spark Apps, Jobs, Stages and Tasks
A Spark application usually consists of several Spark operations, which can be decomposed into transformations or actions on your data using Spark RDDs, DataFrames, or Datasets. In Spark, calling an action generates a job. A job can be decomposed into one or more stages, and stages are further split into individual tasks; a task is the unit of execution. The Spark Driver's scheduler sends tasks to the Spark Executors on the Spark Worker nodes for execution. Typically, multiple tasks run in parallel on the same executor, each processing its own partition of the in-memory dataset.
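To make this concrete, here is a minimal sketch in which lazy transformations are followed by a single action that triggers a job; the numbers and partition count are arbitrary:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("job-stage-task-demo")
  .master("local[*]")
  .getOrCreate()

// Transformations (filter, map) are lazy: they only describe the computation.
val numbers = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// The action (reduce) triggers a job. Spark breaks the job into stages
// (one here, since nothing forces a shuffle) and each stage into one task
// per partition, which the driver's scheduler sends to the executors.
val total = squares.reduce(_ + _)
println(total)
```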
Article source: www.kdnuggets.com/2016/09/7-s…