Preface
As 2019 draws to a close, it is time to look back on the past year and ahead to the next. Recently, my friends and I have been sorting out Spark's feature points as a basis for further study.
Main text
Wouldn't you want to bookmark a picture this big?
Spark Function Points (text version)
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and complex algorithms can be expressed with high-level operations like map, reduce, join, and window. Finally, processed data can be pushed out to file systems, databases, and live dashboards. In fact, Spark's machine learning and graph processing algorithms can also be applied to data streams.
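As a minimal illustration of the pieces named above (not part of the original mind map), here is a sketch of the classic network word count; it assumes a text stream arriving on localhost:9999 and a local master:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Batch interval of 1 second: each interval of received data becomes one RDD
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999)       // input DStream
    val words  = lines.flatMap(_.split(" "))                   // transformation
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _) // transformation
    counts.print()                                             // output operation

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // wait for the computation to stop
  }
}
```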
DStreams
Discretized Streams
A discretized stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input stream received from a source or the processed stream produced by transforming an input stream. Internally, a DStream is represented by a sequence of RDDs, Spark's abstraction of an immutable distributed dataset.
Each RDD in a DStream contains the data from one specific interval.
Input DStreams and Receivers
Receiver
Each input DStream is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing (see the custom receiver sketch after this list).
- reliable
- unreliable
Basic sources (Receivers provided by the StreamingContext API)
- File systems
- Socket connections
Advanced sources (custom Receivers)
- Kafka
- Flume
- Kinesis
Custom data sources
Queue of RDDs as a Stream
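The custom receiver sketch referenced above: a receiver only has to implement onStart/onStop and hand records to Spark via store(). This is a minimal, unreliable (no acknowledgement) example that assumes a plain text socket; the class name is illustrative.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a background thread that connects to the source and pushes data into Spark
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // The receiving thread stops itself once isStopped() returns true
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)            // hand the record over to Spark's memory
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Usage: val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))
```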
Output Operations on DStreams
HDFS
Databases
- HBase
- MySQL
Dashboards
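Writing to an external system such as HBase or MySQL is usually done through foreachRDD. A minimal sketch of the recommended pattern, continuing the word-count example above (ConnectionPool and its send method are hypothetical helpers, not Spark APIs): create connections per partition on the executors, never per record and never on the driver.

```scala
counts.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // One connection per partition, created on the executor
    val connection = ConnectionPool.getConnection()   // hypothetical pooling helper
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)       // return to the pool for reuse
  }
}
```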
DataFrame and SQL Operations
Transformations on DStreams
RDD-to-RDD transformations
Join operations
- Stream-stream joins
- Stream-dataset joins
Window operations
- reduceByWindow
- reduceByKeyAndWindow
- countByValueAndWindow
- ...
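A sketch of a windowed transformation, continuing the word-count example: count word occurrences over the last 30 seconds, recomputed every 10 seconds (both durations must be multiples of the batch interval).

```scala
import org.apache.spark.streaming.Seconds

val windowedCounts = words.map(word => (word, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // reduce function applied within the window
  Seconds(30),                // window length
  Seconds(10))                // slide interval
windowedCounts.print()
```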
Graph processing algorithms
MLlib
Streaming ML algorithms
- Streaming linear regression
- Streaming k-means
- ...
Training a model offline, applying it online to the data stream
Training a model on the data stream
- The continuously trained model is applied online to the stream
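A sketch of MLlib's streaming linear regression: the model is updated continuously on a training stream and applied to a test stream. The paths and the feature dimension (3) are placeholders.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

val trainingData = ssc.textFileStream("/path/to/training").map(LabeledPoint.parse)
val testData     = ssc.textFileStream("/path/to/test").map(LabeledPoint.parse)

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))   // assumes 3 features per record

model.trainOn(trainingData)  // keep updating the model as new batches arrive
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
```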
Fault-tolerance mechanisms
Caching / Persistence
- Persist (cache) to memory
- Setting the persistence level
Fault-tolerance semantics
Three kinds of guarantees
- At most once
- Exactly once
- At least once
Checkpointing
Metadata checkpointing
Saves the information that defines the streaming computation to fault-tolerant storage such as HDFS, so the application can recover from a failure of the node running the driver.
Data checkpointing
This is necessary for stateful transformations that combine data across multiple batches. In such transformations, each generated RDD depends on RDDs of previous batches, so the dependency chain keeps growing over time. To avoid an unbounded increase in recovery time (proportional to the length of the chain), the intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (such as HDFS) to cut the dependency chain.
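A sketch of how checkpointing is typically wired up so the driver can recover: the streaming context is built inside a factory function, and StreamingContext.getOrCreate rebuilds it from the checkpoint directory after a failure (the HDFS path is a placeholder).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-streaming-app"   // fault-tolerant storage

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc  = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)   // enable metadata (and data) checkpointing
  // ... define the input DStreams and transformations here ...
  ssc
}

// On a fresh start this calls createContext(); after a driver failure it
// reconstructs the context from the checkpoint data instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```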
Accumulators, Broadcast Variables, and Checkpoints
Accumulators and broadcast variables cannot be recovered from checkpoints in Spark Streaming. If checkpointing is enabled and accumulators or broadcast variables are used as well, lazily instantiated singleton instances of them must be created so that they can be re-instantiated after the driver restarts on failure.
- Create lazily instantiated singleton instances (see the sketch below)
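A sketch of the lazily instantiated singleton pattern for a broadcast variable (the object name and the broadcast contents are illustrative); an accumulator can be wrapped the same way.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object ExcludedWords {
  @volatile private var instance: Broadcast[Seq[String]] = null

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Seq("a", "an", "the"))
        }
      }
    }
    instance
  }
}

// Inside an output operation, fetch it through the RDD's SparkContext so it is
// re-created after the driver restarts from a checkpoint:
// wordCounts.foreachRDD { rdd =>
//   val excluded = ExcludedWords.getInstance(rdd.sparkContext)
//   ...
// }
```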
Core
RDDs
RDD operators
- Transformations
- Actions
Parallelized collections
External datasets
- Local files
- HDFS
- Cassandra
- HBase
- S3
- Other Hadoop InputFormats
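A sketch of the two ways to create RDDs and of the transformation/action split, assuming an existing SparkSession named spark (paths are placeholders):

```scala
val sc = spark.sparkContext

// Parallelized collection: distribute a local collection across 4 partitions
val nums    = sc.parallelize(1 to 100, 4)
val squares = nums.map(x => x * x)        // transformation (lazy)
println(squares.reduce(_ + _))            // action (triggers execution)

// External dataset: any Hadoop-supported storage (local file, HDFS, S3, ...)
val lines = sc.textFile("hdfs:///data/input.txt")
println(lines.count())
```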
Persistence
Cache
Storage levels
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER (Java and Scala)
- MEMORY_AND_DISK_SER (Java and Scala)
- DISK_ONLY
- MEMORY_ONLY_2, etc.
- OFF_HEAP (off-heap memory)
Fault tolerance
- Lineage
LRU
- Removing cached data
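A sketch of explicit persistence with a chosen storage level, reuse across actions, and manual removal instead of waiting for LRU eviction (continuing from the RDD sketch above):

```scala
import org.apache.spark.storage.StorageLevel

val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spill to disk
// words.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)

words.count()      // first action materializes and caches the RDD
words.count()      // later actions reuse the cached partitions
words.unpersist()  // drop the cached data explicitly
```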
Closures
- Local vs. cluster modes
Shuffle operations
- Sort-based shuffle
Shared variables
Broadcast variables
Accumulators
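A sketch of the two kinds of shared variables: a read-only broadcast value shipped once to each executor, and an accumulator that tasks can only add to while the driver reads the total.

```scala
// Broadcast variable
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value.getOrElse(k, 0))
println(mapped.collect().mkString(","))

// Accumulator
val errorCount = sc.longAccumulator("errors")
sc.parallelize(1 to 10).foreach(i => if (i % 3 == 0) errorCount.add(1))
println(errorCount.value)   // read on the driver only
```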
Spark SQL
SparkSession
- Spark application configuration
- Creating DataSet/DataFrame
- SQL API
- SparkContext
- Catalog metadata
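A sketch of a SparkSession covering the points above: application configuration, creating a DataFrame, the SQL API, the underlying SparkContext, and the catalog metadata.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .master("local[*]")                              // application configuration
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()

import spark.implicits._                           // enables toDF/toDS and $"col" syntax

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")  // create a DataFrame
df.createOrReplaceTempView("people")               // register for the SQL API
spark.sql("SELECT name FROM people WHERE id = 1").show()

println(spark.sparkContext.applicationId)          // the underlying SparkContext
spark.catalog.listTables().show()                  // catalog metadata
```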
Dataset/DataFrame
Creating Datasets/DataFrames
- From an RDD
- From external data sources (e.g. HDFS)
Data Sources
LOAD/SAVE operations
Columnar storage file formats
- Parquet files
- ORC files
Unstructured data files
- CSV files
- Text files
Semi-structured data files
- JSON files
Unstructured databases
- HBase
Structured databases
MySQL
- JDBC/ODBC
- Spark SQL CLI
Hive
- SparkSession.enableHiveSupport
- Spark SQL CLI
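A sketch of the generic load/save API across the source types listed above (all paths and connection details are placeholders):

```scala
// Columnar and text-based files
val parquetDF = spark.read.parquet("hdfs:///warehouse/events.parquet")
val jsonDF    = spark.read.format("json").load("hdfs:///raw/events.json")
val csvDF     = spark.read.option("header", "true").csv("hdfs:///raw/events.csv")

parquetDF.write.mode("overwrite").orc("hdfs:///warehouse/events_orc")

// Structured database over JDBC (MySQL shown; URL and credentials are placeholders)
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/test")
  .option("dbtable", "users")
  .option("user", "root")
  .option("password", "secret")
  .load()
```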
Dataset/DataFrame/RDD conversions
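A sketch of converting between the three abstractions, assuming the SparkSession from above:

```scala
import spark.implicits._

case class Person(id: Long, name: String)

val rdd = spark.sparkContext.parallelize(Seq(Person(1, "alice"), Person(2, "bob")))
val df  = rdd.toDF()                                   // RDD -> DataFrame (schema by reflection)
val ds  = df.as[Person]                                // DataFrame -> Dataset[Person]
val back: org.apache.spark.rdd.RDD[Person] = ds.rdd    // Dataset -> RDD
```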
Dataset/DataFrame API
SQL statements
Catalog metadata
- Hive metastore
- In-memory
SQL language categories
- DDL
- DML
- DQL
- DCL
Built-in functions
- Single-row functions
- Aggregate functions
DSL statements (domain-specific language)
- .filter
- .select
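A sketch mixing DSL operators with built-in single-row and aggregate functions on the DataFrame created earlier:

```scala
import org.apache.spark.sql.functions._

df.filter($"id" > 0)
  .select($"name", upper($"name").as("upper_name"))   // single-row built-in function
  .groupBy($"name")
  .agg(count("*").as("cnt"), max($"id").as("max_id")) // aggregate functions
  .show()
```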
Schema API
- Showing the schema
- Defining a schema
UDF
UDAF
- Untyped UDAF
- Typed UDAF
UDTF
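A sketch of a simple UDF registered both for the DSL and for SQL (typed/untyped UDAFs and UDTFs take more code and are omitted here):

```scala
import org.apache.spark.sql.functions.udf

// DSL-side UDF
val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
df.select(toUpper($"name").as("upper_name")).show()

// SQL-side UDF
spark.udf.register("to_upper", (s: String) => if (s == null) null else s.toUpperCase)
spark.sql("SELECT to_upper(name) FROM people").show()
```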
Spark SQL on Hive
- Hive is already deployed
- Hive is not deployed
Performance Tuning
- Caching Data In Memory
- Other Configuration Options
- Broadcast Hint for SQL Queries
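A sketch of two of the tuning points above: caching a table in memory and hinting a broadcast join so the small side avoids a shuffle (largeDF/smallDF are placeholders):

```scala
import org.apache.spark.sql.functions.broadcast

spark.catalog.cacheTable("people")      // Caching Data In Memory
// spark.catalog.uncacheTable("people") // release it when no longer needed

// Broadcast hint: ship smallDF to every executor instead of shuffling both sides
val joined = largeDF.join(broadcast(smallDF), Seq("key"))
```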
Spark SQL technical points
Execution process
Logical Plan
SQL Parse (UnresolvedLogicalPlan)
- DDLParser
- SqlParser
- SparkSQLParser
Catalyst Analyzer
- ResolvedLogicalPlan
Catalyst Optimizer
- OptimizedLogicalPlan
SparkPlanner
- Parent class: QueryPlanner
Strategies
- LeftSemiJoin
- HashJoin
- PartialAggregation
- BroadcastNestedLoopJoin
- TakeOrdered
Physical Plan
prepareForExecution
- Executable PhysicalPlan
execute
- SchemaRDD
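The phases above can be inspected directly: explain(extended = true) prints the parsed (unresolved) logical plan, the analyzed logical plan, the optimized logical plan, and the selected physical plan.

```scala
val q = spark.sql("SELECT name, count(*) AS cnt FROM people GROUP BY name")
q.explain(extended = true)   // shows each Catalyst phase for this query
```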
Performance optimization
Reduce or avoid shuffles
- Unified partitioning strategy
- Broadcast mechanism
Data skew
- Custom Partitioner
Caching data in memory
Core configuration properties
- spark.driver.cores
- spark.driver.memory
- spark.executor.cores
- spark.executor.memory
- spark.sql.shuffle.partitions
Delta Lake
Data lake
MLlib
Spark AI
GraphX
Spark Graph
Structured Streaming
Streaming Dataset/Streaming DataFrame
Creating streaming Datasets/DataFrames
File source
- Generating a stream from files
- Automatic partition discovery from the file/folder layout
Kafka source
- Creating a streaming query
- Creating a batch query
Socket source
Rate source
SQL query
Window Operations on Event Time
- Window generation rules
- How windows work
Watermark
- Update mode
- Append mode
- Complete mode (requires an aggregation)
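A self-contained sketch of an event-time window with a watermark, using the built-in rate source so it runs without external systems (the window sizes, watermark delay, and checkpoint path are placeholders):

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

val events = spark.readStream
  .format("rate")                  // test source producing (timestamp, value) rows
  .option("rowsPerSecond", "10")
  .load()

val counts = events
  .withWatermark("timestamp", "10 minutes")                // how late data may arrive
  .groupBy(window($"timestamp", "5 minutes", "1 minute"))  // sliding event-time window
  .count()

val query = counts.writeStream
  .outputMode("update")            // Update mode: only changed window counts are emitted
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/rate-window-demo")
  .start()
query.awaitTermination()
```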
Streaming Deduplication
Join Operations
- Stream-static joins
- Stream-stream joins
Triggers
- Trigger.ProcessingTime(time interval)
- Trigger.Once()
- Trigger.Continuous(checkpoint interval)
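A sketch of setting a trigger on the streaming aggregation from the previous sketch (the other two trigger types are shown as comments):

```scala
import org.apache.spark.sql.streaming.Trigger

val triggeredQuery = counts.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))   // micro-batch every 10 seconds
  // .trigger(Trigger.Once())                      // one micro-batch, then stop
  // .trigger(Trigger.Continuous("1 second"))      // continuous mode, 1 s checkpoints
  .format("console")
  .start()
```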
Managing and monitoring streaming queries
- Managing Streaming Queries
- Monitoring Streaming Queries
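A sketch of the query-management handles: each started query exposes an id, status, and progress, and all active queries are reachable through spark.streams.

```scala
val query = counts.writeStream.format("console").start()

println(query.id)            // unique identifier of this query
println(query.status)        // whether the query is active, waiting for data, ...
println(query.lastProgress)  // most recent progress: input rate, batch duration, ...

spark.streams.active.foreach(q => println(s"${q.id} -> ${q.name}"))  // all running queries
query.stop()                 // stop the query gracefully
```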
Checkpoint
Continuous Processing
Sources
- Kafka source
- Rate source
Sinks
- Kafka sink
- Memory sink
- Console sink
Monitoring
Monitoring via the StreamingListener interface
- Receiver status
- Processing time
Job Web UI
Streaming tab
When a StreamingContext is used, the Spark Web UI shows an additional Streaming tab.
Running receivers
- Receiver status
- Number of records received
- Receiver errors
Completed batches
- Batch processing time
- Queueing delay
Port 4040, etc.
The UI is served on port 4040 while the application is running; if 4040 is already in use, the port number is incremented (4041, 4042, 4043, ...).
- Stages and tasks
- RDD sizes and memory usage
- Environment information
- Running executors
spark.eventLog.enabled
- true
Spark’s History Server
Port
- 18080 (default)
Kerberos
Access control
- ACLs
REST API
ip:18080/api/v1
- History service
- app-id
ip:4040/api/v1
- Running tasks
Metrics system
Dropwizard Metrics library
- metrics.dropwizard.io/4.4.1
Sinks
- HTTP
- JMX
- CSV
- Ganglia (licensing restrictions: LGPL license, requires a custom build)
Advanced analysis
OS profiling tools
- dstat
- iostat
- iotop
JVM tools
- jstack
- jmap
- jstat
- jconsole
Deployment modes
Local
YARN
Mesos
Standalone
Kubernetes
Programming interface (API)
Scala
Java
Python
R
SQL
SQL here does not refer to a specific programming language. Spark SQL is the Spark module for structured data processing; Spark 2.x introduced an ANSI SQL parser to provide parsing of standardized SQL.
Built-in Funcs
- spark.apache.org/docs/latest…
Shell
P.S.: The GraphX and MLlib components are not expanded in the figure.
Note: the original mind-map image is too large to stay legible when uploaded, so the text version is listed instead.