Hadoop
YARN, a mature resource management framework for big data processing, solves the resource bottleneck of processing a large number of tasks in parallel.
What exactly does YARN do?
In Hadoop 1.0, a MapReduce job's entire data processing is submitted to the JobTracker, which is responsible for resource management as well as job scheduling and monitoring.
Disadvantages:
1. It is fine for small-scale data processing, but in big data scenarios where a large number of DataNodes are processed at the same time, the JobTracker easily becomes a system bottleneck because of unbalanced resource allocation.
2. Only the MapReduce model is supported, and the technology stack is limited to Java.
YARN as the solution
Hadoop 2.0 introduces YARN (Yet Another Resource Negotiator), which separates resource management from job scheduling and monitoring.
YARN core components
Global components
- Resource Manager: made up of the Scheduler and the Application Manager, it manages, schedules, and allocates resources globally.
  - The Scheduler allocates resources to applications based on node capacity and queues.
  - The Application Manager accepts requests submitted by users, starts the Application Master on a node, monitors its status, and restarts it when necessary.
- Node Manager: an agent running on each node that monitors the node and reports resource usage to the Resource Manager.
Per-application components
- The Application Master communicates with the Resource Manager to obtain resources for computation. After obtaining them, it communicates with the Node Managers of the assigned nodes, runs the tasks in the allocated Containers, and monitors task execution.
- **Container** (resource abstraction): memory, CPU, disk, network, and so on. The resource that the Resource Manager returns to the Application Master is a Container.
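To make these roles concrete, below is a minimal sketch in Scala using Hadoop's YARN client API (YarnClient, ContainerLaunchContext, Resource). The application name, the echo command, and the 1 GB / 1 vcore sizes are illustrative placeholders, not anything taken from the text above.

```scala
import java.util.Collections

import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, LocalResource, Resource}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object YarnSubmitSketch {
  def main(args: Array[String]): Unit = {
    // Connect to the Resource Manager.
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Ask the Resource Manager for a new application.
    val app = yarnClient.createApplication()
    val ctx = app.getApplicationSubmissionContext
    ctx.setApplicationName("yarn-demo") // illustrative name

    // Describe the Application Master's container: the command it runs and the
    // resources (memory in MB, vcores) it needs; the Container is exactly this
    // resource abstraction.
    val amContainer = ContainerLaunchContext.newInstance(
      Collections.emptyMap[String, LocalResource](),       // local resources (jars, files)
      Collections.emptyMap[String, String](),               // environment variables
      Collections.singletonList("echo hello from the AM"),  // placeholder AM command
      null, null, null)
    ctx.setAMContainerSpec(amContainer)
    ctx.setResource(Resource.newInstance(1024, 1))           // 1 GB, 1 vcore for the AM

    // The Scheduler picks a Node Manager, which launches the AM container.
    val appId = yarnClient.submitApplication(ctx)
    println(s"Submitted application $appId")

    yarnClient.stop()
  }
}
```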
Pain points that YARN addresses
- The Application Master removes the JobTracker bottleneck: after a new job is submitted, the Resource Manager starts a new Application Master on a suitable node, so no single JobTracker becomes a bottleneck.
- More efficient resource scheduling
- Supports data processing methods other than MapReduce, such as Spark
Problems with YARN
After a large number of tasks are submitted, computing resources can be exhausted; as a result, newly submitted high-priority jobs may wait a long time before being processed. Configuring different resource scheduling rules (priorities, queues) can alleviate this problem.
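One application-side way to make use of such rules is to submit the job to a dedicated scheduler queue. The sketch below assumes Spark running on YARN and a hypothetical queue named high-priority that the cluster's scheduler configuration already defines.

```scala
import org.apache.spark.sql.SparkSession

object PriorityJob {
  def main(args: Array[String]): Unit = {
    // Route this job to a dedicated YARN queue so it is not starved by bulk work.
    // The queue name "high-priority" is hypothetical; it must already be defined
    // in the cluster's scheduler configuration.
    val spark = SparkSession.builder()
      .appName("priority-job")
      .config("spark.yarn.queue", "high-priority")
      .getOrCreate()

    spark.range(10).count() // trivial placeholder work
    spark.stop()
  }
}
```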
The Spark framework
Spark SQL and Hive do not perform the computation themselves; they tell each node which tasks to compute and then aggregate the computed results.
Introduction: Spark and MapReduce sit at the same layer; both solve the problem of distributed computing.
Roles: Driver, Master, Worker, Executor
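A minimal sketch of how these roles cooperate: the Driver defines the job, the Executors (started on the Workers) run the map tasks, and collect() brings the result back to the Driver. The numbers are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DriverExecutorDemo {
  def main(args: Array[String]): Unit = {
    // The Driver: builds the SparkSession and defines the job.
    val spark = SparkSession.builder().appName("driver-executor-demo").getOrCreate()
    val sc = spark.sparkContext

    val squares = sc.parallelize(1 to 10) // data is partitioned across Executors
      .map(x => x * x)                    // map tasks run on the Executors
      .collect()                          // results are gathered back on the Driver

    squares.foreach(println)
    spark.stop()
  }
}
```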
Features: can be deployed on YARN, works natively with HDFS, and is developed in Scala.
Deployment modes: single-machine mode, pseudo-cluster mode, standalone cluster (also called native cluster mode), and YARN cluster.
YARN cluster: the Application Master role in the YARN ecosystem is taken by the Spark Application Master developed by Apache Spark. Each Node Manager in the YARN ecosystem corresponds to a Worker in the Spark ecosystem; the Node Manager is responsible for starting Executors.
About Spark SQL
Introduction
There are two core components: SQLContext and DataFrame.
It is used for structured data processing (JSON, Parquet, Carbon (Huawei), databases), for performing ETL operations, and finally for running specific queries. Generally, whenever Spark supports a new kind of application development, it introduces a new Context and the corresponding RDD.
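As a rough illustration of that SQLContext/DataFrame workflow (written against SparkSession, which wraps SQLContext in newer Spark releases), the sketch below reads a JSON file into a DataFrame and queries it; the file name events.json and the event_type field are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparksql-demo").getOrCreate()

    // Read structured data (JSON here; Parquet etc. work the same way) into a DataFrame.
    val events = spark.read.json("events.json") // placeholder path

    // Expose the DataFrame to SQL and run a query against it.
    events.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, count(*) AS cnt FROM events GROUP BY event_type").show()

    spark.stop()
  }
}
```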
SQL support
Parser, Optimizer, execution
Processing order (a runnable sketch follows the list):
- SqlParser generates a LogicalPlan tree;
- the Analyzer and Optimizer apply various rules to the LogicalPlan tree;
- a Spark RDD is generated from the LogicalPlan;
- finally, the generated RDD is handed to Spark for execution.
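To observe these stages, the self-contained sketch below registers a toy item table and calls explain(true), which prints the parsed, analyzed, and optimized logical plans along with the physical plan that is finally executed; the rows are made up.

```scala
import org.apache.spark.sql.SparkSession

object PlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("plan-demo").getOrCreate()
    import spark.implicits._

    // A toy "item" table with made-up rows.
    Seq(("book", 10.0), ("book", 5.0), ("pen", 2.0))
      .toDF("item_type", "price")
      .createOrReplaceTempView("item")

    val query = spark.sql("SELECT item_type, sum(price) FROM item GROUP BY item_type")
    query.explain(true) // parsed -> analyzed -> optimized logical plan -> physical plan
    query.show()

    spark.stop()
  }
}
```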
Hive On Spark
Introduction: evolved from Hive on MapReduce, with Spark serving as Hive's compute engine; queries are submitted to the Spark cluster for computation to improve Hive query performance.
Spark SQL is a solution that implements SQL on Spark. The SQL engine is a translation layer that translates a SQL statement into a distributed, executable Spark application.
SELECT item_type, sum(price)
FROM item
GROUP BY item_type;
Steps:
This SQL script, handed to Hive or a similar SQL engine, “tells” the compute engine to do two steps: first, read the item table and extract the item_type and price fields; second, shuffle by the key (item_type) so that rows with the same item_type are gathered on the same aggregation node, which then adds together the partial sums of each group to get the final result. Both Hive and SparkSQL generally work this way.
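To make those two steps explicit, here is a hand-written RDD version of the same aggregation over made-up item data: a map stage that yields (item_type, price) pairs, then a shuffle (reduceByKey) that gathers identical keys on one node and adds the partial sums. Conceptually, this is the kind of plan the SQL engines above generate.

```scala
import org.apache.spark.sql.SparkSession

object ManualGroupBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("manual-groupby").getOrCreate()
    val sc = spark.sparkContext

    // Step 1: "read the item table and extract item_type and price" (made-up rows).
    val items = sc.parallelize(Seq(("book", 10.0), ("book", 5.0), ("pen", 2.0)))

    // Step 2: shuffle by item_type and add up the partial sums per group.
    val totals = items.reduceByKey(_ + _)

    totals.collect().foreach { case (itemType, total) => println(s"$itemType -> $total") }
    spark.stop()
  }
}
```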