Hadoop

Yarn, a mature resource-management framework for big data processing, solves the resource bottleneck that appears when a large number of tasks are processed in parallel

What exactly does Yarn do

In Hadoop 1.0, all of MapReduce’s data processing jobs are submitted to the Job Tracker, which is responsible for both resource management and job scheduling/monitoring.

Disadvantages: it is fine for small-scale data processing, but in a big data scenario where a large number of DataNodes work at the same time, the Job Tracker easily becomes a system bottleneck because resources are allocated unevenly. In addition, it only accepts the MapReduce model, and the technology stack is limited to Java.

Yarn – the solution

Hadoop 2.0 uses Yarn (Yet Another Resource Negotiator), which aims to separate resource management from job scheduling and monitoring

Yarn Core Components

Global components

  • The Scheduler and the Applications Manager (the two parts of the Resource Manager) manage, schedule, and allocate resources globally.

    The Scheduler allocates resources to applications based on node capacity and queue configuration.

    The Applications Manager accepts requests submitted by users, starts the Application Master on a node, monitors its status, and restarts it when necessary.

  • The Node Manager is a per-node agent that monitors its node and reports resource usage to the Resource Manager.

Per-application components

  • The Application Master communicates with the Resource Manager to obtain resources for computing. After the resources are obtained, it communicates with the Node Manager on each node, runs the tasks inside the assigned Containers, and monitors their execution.
  • **Container**: the resource abstraction covering memory, CPU, disk, network, and so on. The resource that the Resource Manager grants to the Application Master is a Container (see the sketch after this list).
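
As a rough illustration of how an Application Master asks the Resource Manager for Containers, here is a minimal sketch using Hadoop's YARN client API. It assumes the code runs inside the process that YARN launched as the Application Master, and the memory/vCPU figures are illustrative assumptions, not values from this article:

```scala
import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration

object ContainerRequestSketch {
  def main(args: Array[String]): Unit = {
    // Register this process with the Resource Manager as the Application Master.
    val amRMClient = AMRMClient.createAMRMClient[ContainerRequest]()
    amRMClient.init(new YarnConfiguration())
    amRMClient.start()
    amRMClient.registerApplicationMaster("", 0, "")

    // A Container request is just a bundle of resources plus a priority;
    // 1024 MB / 1 vCore here are illustrative numbers.
    val capability = Resource.newInstance(1024, 1)
    amRMClient.addContainerRequest(
      new ContainerRequest(capability, null, null, Priority.newInstance(0)))

    // The AM polls allocate(); the Resource Manager answers with granted Containers,
    // which a real AM would then start on the Node Managers via NMClient.
    val response = amRMClient.allocate(0.1f)
    println(s"Containers granted so far: ${response.getAllocatedContainers.size()}")

    amRMClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
    amRMClient.stop()
  }
}
```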

Pain points that Yarn addresses

  • Solves the Job Tracker bottleneck by using the Application Master: after a new job is submitted, the Resource Manager starts a new Application Master on an appropriate node, so no single Job Tracker has to track everything
  • More efficient resource scheduling
  • Supports data processing methods other than MapReduce, such as Spark

The problem of Yarn

After a large number of jobs are submitted, computing resources can be exhausted, and a newly submitted high-priority job then has to wait a long time before it is processed. Configuring different resource scheduling rules (queues and priorities) alleviates this problem; a small sketch follows.
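
As a hedged illustration, the snippet below uses Hadoop's YarnClient API to route a job to a dedicated queue and raise its priority; the queue name `high-priority` is a hypothetical example and must already exist in the scheduler configuration:

```scala
import org.apache.hadoop.yarn.api.records.Priority
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object QueueSubmissionSketch {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Ask the Resource Manager for a new application id and a blank submission context.
    val app = yarnClient.createApplication()
    val ctx = app.getApplicationSubmissionContext

    // Route the job to a dedicated queue and raise its priority so it is not
    // starved by the bulk workload. The queue name is hypothetical and must
    // exist in the scheduler configuration (capacity- or fair-scheduler).
    ctx.setQueue("high-priority")
    ctx.setPriority(Priority.newInstance(10))

    // A real submission would also set the application name, resource needs and a
    // ContainerLaunchContext (the command that starts the Application Master)
    // before calling yarnClient.submitApplication(ctx).
    yarnClient.stop()
  }
}
```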

The Spark framework

Spark SQL and Hive do not do the calculation directly themselves; they tell each node which task to compute and then aggregate the calculation results

Introduction: Spark and MapReduce work at the same level, both solving the problem of distributed computing

Core roles: Driver, Master, Worker, Executor

Features: can be deployed on Yarn, works natively with HDFS, and is written in Scala

Deployment modes: single-machine mode, pseudo-cluster mode, standalone cluster (also called native cluster mode), and YARN cluster

YARN cluster: the ApplicationMaster role in the YARN ecosystem is filled by Spark's own ApplicationMaster, and each NodeManager in the YARN ecosystem corresponds to a Worker in the Spark ecosystem; the NodeManager is responsible for starting Executors, as in the sketch below.
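
To make the YARN cluster mode concrete, here is a minimal Spark application and the kind of spark-submit command used to run it in cluster mode; the class name, jar name, and input path are hypothetical:

```scala
// Hypothetical class and jar names; submitted to a YARN cluster with something like:
//   spark-submit --master yarn --deploy-mode cluster \
//     --class example.WordCount wordcount.jar hdfs:///input/logs
package example

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // In cluster mode the Driver runs inside Spark's ApplicationMaster, and the
    // Executors run in Containers that the NodeManagers start.
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()

    val counts = spark.read.textFile(args(0))
      .selectExpr("explode(split(value, ' ')) AS word")
      .groupBy("word")
      .count()

    counts.show(20)
    spark.stop()
  }
}
```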

About Spark SQL

Introduction

There are two core components: SQLContext and DataFrame

It is used for structured data processing (JSON, Parquet, Carbon (Huawei), databases), for ETL operations, and finally for running specific queries. Generally, whenever Spark adds support for a new kind of application, it introduces a new Context and a corresponding RDD; for Spark SQL these are the SQLContext and the DataFrame (a sketch follows).
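
A minimal sketch of these two components in action, using SparkSession (which wraps SQLContext in Spark 2.x and later); the input file `items.json` and its columns are hypothetical examples:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]") // local mode, just for the sketch
      .getOrCreate()

    // Load structured data into a DataFrame (JSON here; Parquet etc. work the same way).
    val items = spark.read.json("items.json") // hypothetical input file

    // A DataFrame can be queried through the DataFrame API or through SQL.
    items.createOrReplaceTempView("item")
    spark.sql("SELECT item_type, price FROM item WHERE price > 100").show()

    spark.stop()
  }
}
```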

SQL support

Parser, Optimizer, execution

Processing order:

  1. SqlParser generates a LogicalPlan tree.

  2. The Analyzer and Optimizer apply various rules to the LogicalPlan tree;

  3. Spark RDDs are generated from the LogicalPlan;

  4. Finally, the generated RDDs are handed to Spark for execution (the sketch below prints the plans produced along the way).
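
To observe these stages, Spark's extended explain output prints the parsed, analyzed, and optimized logical plans followed by the physical plan; a minimal sketch with a made-up in-memory `item` table:

```scala
import org.apache.spark.sql.SparkSession

object ExplainPlanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExplainPlanSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory table, just so there is something to plan against.
    Seq(("food", 10.0), ("book", 25.0), ("food", 4.5))
      .toDF("item_type", "price")
      .createOrReplaceTempView("item")

    // extended = true prints the Parsed, Analyzed and Optimized Logical Plans,
    // followed by the Physical Plan that is turned into RDD operations.
    spark.sql("SELECT item_type, SUM(price) FROM item GROUP BY item_type")
      .explain(true)

    spark.stop()
  }
}
```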

Hive On Spark

Introduction: evolved from Hive on MapReduce. Spark serves as Hive's computing engine, and queries are submitted to the Spark cluster for computation to improve Hive query performance

Spark SQL, by contrast, is a solution that implements SQL directly on Spark: its SQL engine is a translation layer that translates a SQL statement into a distributed Spark application that can be executed

```sql
SELECT item_type, sum(price)
FROM item
GROUP BY item_type;
```

Steps:

This SQL script, handed to Hive or a similar SQL engine, “tells” the computing engine to perform two steps: first, read the item table and extract the item_type and price fields; second, shuffle on the key item_type so that rows with the same item_type end up on the same aggregation node, which then adds together the partial sum of each group to get the final result. Both Hive and Spark SQL generally work this way; a DataFrame-API equivalent is sketched below.
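
Expressed through Spark's DataFrame API, the same two steps look roughly like this (a sketch; the Parquet path standing in for the item table is hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object ItemSumSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ItemSumSketch").getOrCreate()

    // Step 1: read the item table and keep only the two columns the query needs.
    val items = spark.read.parquet("hdfs:///warehouse/item") // hypothetical path
      .select("item_type", "price")

    // Step 2: groupBy shuffles on item_type, so rows with the same key land on the
    // same node, where the partial sums of each group are added into the final sum.
    val totals = items.groupBy("item_type").agg(sum("price").alias("total_price"))

    totals.show()
    spark.stop()
  }
}
```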