Abstract: This blog describes how MapReduce-type job logs are generated in Hadoop, covering the key stages of log generation without going into excessive detail.

This post is shared from the Huawei Cloud community article "How to Generate MapReduce Job Logs in Hadoop" by MXG.

We know that Hadoop is divided into three parts: HDFS, YARN, and MapReduce. The MapReduce-related core code all lives in the hadoop-mapreduce-project sub-project.

Among them, the more important functional modules are MRAppMaster, JobHistory, and the MapReduce client, which correspond to the app, hs, and jobclient sub-modules respectively. There are also some common utility classes that are not detailed here.

MRAppMaster: runs in the first container of a YARN application. It requests task resources for the MapReduce job and monitors task running status.

Note the two nouns introduced here: YARN application and MapReduce job. They are the same thing viewed at two different levels. Everything that runs on YARN is called an application; "job" is what a MapReduce application is called inside the MapReduce framework, and other computing frameworks may use their own names. In short, whatever a framework calls it internally, in YARN it is always a YARN application.

Remembering the terminology established by the open-source community helps us understand the code. For example, when we see a job-related interface, we should instinctively think of JobHistory or MRAppMaster; if it is an application-related interface, it must be a ResourceManager interface.

HistoryServer: While a YARN application runs in the AM, the run logs of the job are uploaded to HDFS, by default under /tmp/hadoop-yarn/staging. You can also configure the parameter yarn.app.mapreduce.am.staging-dir to any desired path, even one on a different file system (that is, not on HDFS). The AM logs are eventually organized into a specific format, JHist, which the HistoryServer parses and presents as a friendly web page.
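
As a minimal sketch (the override path shown here is a hypothetical example, not a recommendation), this property can be set on the client's Configuration before a job is submitted:

```java
import org.apache.hadoop.conf.Configuration;

public class StagingDirExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hypothetical override: point the AM staging dir somewhere other
        // than the default /tmp/hadoop-yarn/staging; the value may even
        // reference a different filesystem than HDFS.
        conf.set("yarn.app.mapreduce.am.staging-dir", "/user/history/staging");
        System.out.println(conf.get("yarn.app.mapreduce.am.staging-dir"));
    }
}
```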

JobClient: provides interfaces for job management, such as job submission.
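
For context, here is a minimal job-submission sketch using the standard org.apache.hadoop.mapreduce.Job API; the class name and input/output paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "log-demo"); // job name is arbitrary
        job.setJarByClass(SubmitExample.class);
        // Mapper/Reducer classes omitted; the defaults pass records through.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion(true) submits the job and polls until it finishes;
        // this goes through the client-side submission path described above.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```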

The following describes the stages of AM log generation for a MapReduce job. Three important parameters are involved (a sketch that reads them follows the list):

1- Temporary log directory for MR jobs that have finished running:

mapreduce.jobhistory.intermediate-done-dir: /mr-history/tmp

2- Final directory for finished MR jobs:

mapreduce.jobhistory.done-dir: /mr-history/done

3- AM log staging directory while the job is running:

yarn.app.mapreduce.am.staging-dir: /tmp/hadoop-yarn/staging
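
A small sketch, assuming a client with the cluster's configuration on its classpath, that prints the effective values of these three properties (the fallback defaults mirror the values listed above; your mapred-site.xml may override them):

```java
import org.apache.hadoop.conf.Configuration;

public class HistoryDirsExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Fallback values match those discussed in this post.
        System.out.println(conf.get(
            "mapreduce.jobhistory.intermediate-done-dir", "/mr-history/tmp"));
        System.out.println(conf.get(
            "mapreduce.jobhistory.done-dir", "/mr-history/done"));
        System.out.println(conf.get(
            "yarn.app.mapreduce.am.staging-dir", "/tmp/hadoop-yarn/staging"));
    }
}
```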

The following describes how AM logs are moved between the three directories at different stages of a job.

Phase 1

AM run logs are stored in the HDFS path /tmp/hadoop-yarn/staging and are dynamically updated while the job runs.
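
One way to watch these files update is to list the staging directory with the Hadoop FileSystem API; a sketch assuming the default staging path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListStagingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Inspect the staging root; the per-job files live in per-user
        // subdirectories underneath it and grow as the AM writes events.
        for (FileStatus st : fs.listStatus(new Path("/tmp/hadoop-yarn/staging"))) {
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }
    }
}
```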

Phase 2

After the job finishes, the AM copies the job logs from the /tmp/hadoop-yarn/staging path to the /mr-history/tmp path (the .jhist file, the summary file, and the conf.xml file), all of which initially carry a .tmp suffix.

Files with the .tmp suffix are renamed to normal files only after they have been copied successfully.
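
This copy-then-rename pattern can be sketched with the FileSystem API as below. This illustrates the technique only, not Hadoop's actual implementation, and the file names are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class TmpRenameExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("/tmp/hadoop-yarn/staging/job_0001.jhist"); // hypothetical
        Path tmp = new Path("/mr-history/tmp/job_0001.jhist.tmp");
        Path dst = new Path("/mr-history/tmp/job_0001.jhist");
        // Copy into a .tmp file first so readers never see a half-written file...
        FileUtil.copy(fs, src, fs, tmp, false /* keep source */, conf);
        // ...then expose it with a rename once the copy has succeeded.
        if (!fs.rename(tmp, dst)) {
            throw new java.io.IOException("rename failed: " + tmp);
        }
    }
}
```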

Phase 3

The JobHistoryServer process has a Service of type JobHistory (see the JHS initialization process and Service introduction section).

What the JobHistory service does is simple (a simplified sketch follows this list):

1- Periodically copies the job-completion logs generated in Phase 2 from /mr-history/tmp to /mr-history/done. After the copy completes, it deletes the log files in /mr-history/tmp.

2- Periodically scans the job log files in the /mr-history/done directory and deletes all job log files that have exceeded their retention period. Once deleted, a job's information can no longer be seen on the JobHistoryServer web page.
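
A drastically simplified sketch of those two duties follows. The real JobHistoryServer code handles a structured done-dir layout, timers, and error handling; the one-week retention constant here is made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HistoryCycleSketch {
    static final long MAX_AGE_MS = 7L * 24 * 60 * 60 * 1000; // illustrative retention

    public static void run(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path intermediate = new Path("/mr-history/tmp");
        Path done = new Path("/mr-history/done");
        // 1) Move finished-job logs from the intermediate dir to the done dir
        //    (deleteSource=true removes them from /mr-history/tmp).
        for (FileStatus st : fs.listStatus(intermediate)) {
            Path target = new Path(done, st.getPath().getName());
            FileUtil.copy(fs, st.getPath(), fs, target, true, conf);
        }
        // 2) Purge done-dir entries older than the retention period.
        long cutoff = System.currentTimeMillis() - MAX_AGE_MS;
        for (FileStatus st : fs.listStatus(done)) {
            if (st.getModificationTime() < cutoff) {
                fs.delete(st.getPath(), false);
            }
        }
    }
}
```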

Therefore, for a finished MapReduce job, the job logs and the logs of each task can be accessed on the JobHistoryServer web page. Two directories back this: /mr-history/done and /tmp/logs.

/mr-history/done records job configurations and task overview information (how many maps, how many reduces, how many succeeded, and how many failed).

/tmp/logs records detailed logs for all containers (tasks) of the application. The data shown when jumping from the job page to a task page is obtained from these logs.

As you may have noticed in the Phase 2 AM log screenshot, the link to the job log is set after the job finishes but before the log is copied to the intermediate-done dir. This link is actually the address of the JobHistoryServer web service.

The following figure shows the information for a typical running application on YARN's native page.

Its link looks like this: https://{RESOURCEMANAGER_IP}:{PORT}/Yarn/ResourceManager/45/proxy/application_1636508815320_0003/

Clearly, the ResourceManager is still being accessed at this point; that is, while the application is running, the process has nothing to do with the JobHistoryServer.

Similarly, if we want to view the log details of tasks that have already finished within a still-running job, we access the corresponding NodeManager to get them, as shown in the following figure.

It is also clear from the link information that the NodeManager is being accessed, and that this, too, has nothing to do with the JobHistoryServer.

However, after a YARN application completes, the History link on the application overview page points to the JobHistoryServer instead of the ResourceManager.

That is, for a running job, all log access requests on the page are handled by YARN. For a finished job, all requests except the YARN application overview page are handled by the JobHistoryServer.

The link for a job that has finished running may look like this: https://{RESOURCEMANAGER_IP}:{PORT}/Yarn/ResourceManager/45/proxy/application_1636508815320_0003/

Similarly, the JobHistoryServer serves the task/container logs of a job that has finished running.
