Flink on YARN (Part 2): FAQs and Troubleshooting Methods

Yang Tao (Bo Yuan)

Flink supports Standalone deployment as well as cluster deployment on YARN, Kubernetes, and Mesos. Of these, YARN deployment is widely used in China. The Flink community is publishing a two-part series on Flink on YARN applications. Part 1 covered the resource scheduling model refactored under FLIP-6 and walked through the whole startup process of a Flink on YARN application. This part answers common client and Flink cluster questions raised by the community and shares ways to troubleshoot them.

Common Client Problems and Troubleshooting Methods

Console Exception When Submitting an Application: Could not build the program from JAR file.

This message often hides the real problem: in many cases the exception is raised during submission and has nothing to do with the specified JAR file. Troubleshoot it from the log information. The most common cause is that the Hadoop JAR files Flink depends on were not added to the CLASSPATH, so a dependent class cannot be found (for example, ClassNotFoundException: org.apache.hadoop.yarn.exceptions.YarnException) and the client entry class (FlinkYarnSessionCli) fails to load.

How to Associate Flink on YARN With a Specified YARN Cluster When Submitting an Application?

The Flink on YARN client usually needs two environment variables, HADOOP_CONF_DIR and HADOOP_CLASSPATH, so that it can load the Hadoop configuration and the dependent JAR files. Example (assuming the existing environment variable HADOOP_HOME points to the Hadoop deployment directory):

export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath`
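Before submitting, it can help to confirm that both variables are actually usable. The following is a minimal sketch for a Bash shell; check_hadoop_env is a hypothetical helper name and is not part of Flink or Hadoop.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper, not shipped with Flink or Hadoop.
# Returns 0 when both variables look usable, 1 otherwise.
check_hadoop_env() {
  local status=0
  # HADOOP_CONF_DIR must be set and point to an existing directory.
  if [ -z "${HADOOP_CONF_DIR:-}" ] || [ ! -d "${HADOOP_CONF_DIR}" ]; then
    echo "HADOOP_CONF_DIR is unset or not a directory" >&2
    status=1
  fi
  # HADOOP_CLASSPATH must be non-empty; its entries are not validated here.
  if [ -z "${HADOOP_CLASSPATH:-}" ]; then
    echo "HADOOP_CLASSPATH is empty" >&2
    status=1
  fi
  return "${status}"
}
```

One possible use is to call `check_hadoop_env || exit 1` at the top of a submission script, so that a missing variable is reported before the client fails with a less obvious class-loading error.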

Where and How to Configure Client Logs?

Client logs are normally in the log folder of the Flink deployment directory, ${FLINK_HOME}/log/flink-${USER}-client-<hostname>.log, and are configured with log4j in ${FLINK_HOME}/conf/log4j-cli.properties.

If logs are hard to locate or configure in a complex client environment, set the following environment variable to turn on log4j's own DEBUG output and trace how log4j is initialized and loaded:

export JVM_ARGS="-Dlog4j.debug=true"
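To pick out the newest client log, a small sketch such as the following may be enough. It assumes the default file naming described above; latest_client_log is a hypothetical helper, not a Flink command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recently modified client log.
# Assumes files named flink-<user>-client-<hostname>.log; the directory
# defaults to ${FLINK_HOME}/log and can be overridden by the first argument.
latest_client_log() {
  local log_dir="${1:-${FLINK_HOME}/log}"
  # ls -t sorts by modification time, newest first.
  ls -t "${log_dir}"/flink-*-client-*.log 2>/dev/null | head -n 1
}
```

For example, `tail -n 100 "$(latest_client_log)"` shows the end of the latest client log.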

Troubleshooting Client Problems

If the client logs do not reveal the problem, change the log level in the log4j configuration file from INFO to DEBUG and run again to see whether the DEBUG logs help. For problems that produce no logs or incomplete logs, code-level debugging may be required. Modifying the source code and repackaging it is too cumbersome, so Java Byteman is recommended (see the Byteman Documents for the detailed syntax).

(1) Write a debugging script. For example, to print the client class that Flink actually uses, the following script prints the return value when CliFrontend#getActiveCustomCommandLine exits:

RULE test
CLASS org.apache.flink.client.cli.CliFrontend
METHOD getActiveCustomCommandLine
AT EXIT
IF TRUE
DO traceln("------->CliFrontend#getActiveCustomCommandLine return: " + $!)
ENDRULE

(2) Set the environment variables so that the client JVM loads the Byteman Java agent:

export BYTEMAN_HOME=/path/to/byte-home
export TRACE_SCRIPT=/path/to/script
export JVM_ARGS="-javaagent:${BYTEMAN_HOME}/lib/byteman.jar=script:${TRACE_SCRIPT}"

(3) Run the test command bin/flink run -m yarn-cluster -p 1 ./examples/streaming/WordCount.jar. The console output includes:

------->CliFrontend#getActiveCustomCommandLine return: org.apache.flink.yarn.cli.FlinkYarnSessionCli@25ce9dc4

Flink Cluster FAQs and Troubleshooting Methods

Version Conflicts Between User Application and Framework JAR Packages

This problem usually surfaces as a NoSuchMethodError, ClassNotFoundException, or IncompatibleClassChangeError. To resolve it:

1. Locate the conflicting libraries. Run mvn dependency:tree to display all dependency chains as a tree. You can add the -Dincludes parameter to show only specific packages, in the format [groupId]:[artifactId]:[type]:[version], which supports pattern matching. Separate multiple patterns with commas, for example: mvn dependency:tree -Dincludes=power,javaassist

2. After locating the conflicting packages, decide how to exclude them. The simple solution is to use an exclusion to drop the dependency passed in transitively from another project. Where an exclusion is not enough, refer to the Maven Shade Plugin.

How to Determine Which JAR a Class Comes From When Multiple Versions of a Dependency Coexist?

The runtime CLASSPATH of many applications contains several versions of the same dependency JAR, so the version actually used depends on the loading order, and troubleshooting often requires finding the source JAR of a class. Flink supports configuring JVM parameters for the JM/TM processes, so you can print each loaded class and its source (to the .out log) with one of the following three configuration items, depending on your needs:

env.java.opts=-verbose:class // configure JobManager & TaskManager
env.java.opts.jobmanager=-verbose:class // configure JobManager
env.java.opts.taskmanager=-verbose:class // configure TaskManager
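Once the option is active, the .out log can be searched for a single class. The sketch below assumes the usual HotSpot line formats, "[Loaded <class> from <source>]" on JDK 8 and "[...][class,load] <class> source: <source>" on JDK 9 and later; class_source is a hypothetical helper name, and the exact format may differ on other JVMs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the class-loading records of one class from a
# .out log written with -verbose:class.
#   $1  fully qualified class name
#   $2  path to the JobManager or TaskManager .out log
class_source() {
  local class_name="$1" out_log="$2"
  # -F treats both patterns as fixed strings, so dots and brackets are literal.
  grep -F -e "[Loaded ${class_name} from " \
          -e "] ${class_name} source: " "${out_log}"
}
```

If the same class name shows up with an unexpected JAR as its source, that JAR is the one winning the loading order.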

How to View the Complete Logs of a Flink Application?

You can view the JM/TM logs of a running Flink application in the WebUI. Troubleshooting, however, needs the complete logs, so you need to understand how YARN stores logs:

1. If the application has not finished, the Container logs stay on the node where the Container runs. They can be found under the configured directory ${yarn.nodemanager.log-dirs}/<application-id>/<container-id>, or accessed directly from the WebUI: http://<nm-address>/node/containerlogs/<container-id>/<user>

2. If the application has finished and log aggregation is enabled (yarn.log-aggregation-enable=true), the NM usually uploads all of the application's logs to distributed storage (usually HDFS) after the application finishes and deletes the local files. Run yarn logs -applicationId <application-id> -appOwner <user> to view all logs of the application, or add -containerId <container-id> -nodeAddress <node-address> to view the logs of a single Container. You can also access the distributed storage directory directly: ${yarn.nodemanager.remote-app-log-dir}/${user}/${yarn.nodemanager.remote-app-log-dir-suffix}/<application-id>
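The yarn logs invocation can be wrapped so that the optional Container arguments are only added when given. This is a sketch only: yarn_logs_cmd is a hypothetical helper that prints the command for review instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: build (but do not run) a yarn logs command.
#   $1  application ID          $2  application owner
#   $3  container ID (optional) $4  node address (optional)
yarn_logs_cmd() {
  local app_id="$1" owner="$2" container_id="${3:-}" node_address="${4:-}"
  local cmd="yarn logs -applicationId ${app_id} -appOwner ${owner}"
  # Narrow the output to a single Container only when one is given.
  if [ -n "${container_id}" ]; then
    cmd="${cmd} -containerId ${container_id}"
  fi
  if [ -n "${node_address}" ]; then
    cmd="${cmd} -nodeAddress ${node_address}"
  fi
  echo "${cmd}"
}
```

For example, `yarn_logs_cmd application_1_0001 alice` prints the command for all logs of that application; the ID and user name here are made up for illustration.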

Fault Diagnosis of Flink Application Resource Allocation

If a Flink application cannot start normally and reach the RUNNING state, troubleshoot it with the following steps:

1. Check the current state of the application first. From the startup process described earlier, we know:

If it is in the NEW_SAVING state, the application information is being persisted. Check whether the RM state store (usually a ZooKeeper cluster) is running properly.

If it is in the SUBMITTED state, time-consuming operations that hold the read/write lock may be running in the RM, causing events to pile up. Locate the problem from the YARN cluster logs.

If it is in the ACCEPTED state, check whether the AM is normal: go to Step 2.

If it is in the RUNNING state but has not obtained all of its resources, go to Step 3.

2. Check whether the AM is normal. Look at the diagnostics on the YARN application page (http://<rm-address>/cluster/app/<application-id>) or through the YARN application REST API (http://<rm-address>/ws/v1/cluster/apps/<application-id>), then identify the cause and solution from the keywords:

– Queue's AM resource limit exceeded. The queue's AM resource limit has been reached: the AM resources already used in the queue plus the resources requested for the new AM exceed the queue's AM resource limit. You can raise the percentage of queue resources available to AMs with the configuration item yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent.

– User's AM resource limit exceeded. The user's AM resource limit has been reached: the AM resources already used by the application's user in the queue plus the resources requested for the new AM exceed that user's AM resource limit in the queue. You can raise the user's share of AM resources. Related configuration items: yarn.scheduler.capacity.<queue-path>.user-limit-factor and yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent.

– AM container is launched, waiting for AM container to Register with RM. The AM has probably started but has not finished its internal initialization, for example because a ZooKeeper connection timed out. Check the AM logs and fix the specific problem.

– Application is Activated, waiting for resources to be assigned for AM. The AM resource checks have passed and the application is waiting for the scheduler to allocate resources. Check resources at the scheduler level: go to Step 4.

3. Confirm whether the application really has resource requests that YARN has not satisfied. Check whether the Total Outstanding Resource Requests list contains any pending resources. If it does not, YARN has finished allocating: leave this checklist and examine the AM instead. If it does, the scheduler has not completed allocation: go to Step 4.

4. Troubleshoot scheduler allocation. YARN-9050 supports automatic diagnosis of application problems on the WebUI or through the REST API and will be released in Hadoop 3.3.0. Earlier versions require the following manual checks:

– Check cluster or queue resources. In the tree on the scheduler page, look at Effective Max Resource and Used Resources: (1) check whether the resources of the cluster, the queue, or its parent queues are used up; (2) check whether any resource dimension of the leaf queue is close to or at its upper limit.

– Check for resource fragmentation: (1) Check what share of the cluster's total resources is taken by Used plus Reserved resources. When the cluster is nearly full (for example, above 90%), fragmentation may slow down allocation, because most machines have no free resources and machines whose free resources are insufficient get reserved. Once reservations reach a certain scale, most machine resources may be locked and later allocations slow down. (2) Check how available resources are distributed across NMs. Even when cluster utilization is low, resources may be unevenly distributed across dimensions. For example, memory is nearly used up on half of the nodes and CPU is nearly used up on the other half. A request whose value in one dimension is too large may also fail to obtain resources.

– Check for high-priority applications that frequently request resources and release them immediately. In that case the scheduler is too busy satisfying that application's requests to serve other applications.

– Check whether Containers fail to start or exit on their own right after starting. See the Container logs (including localize logs and launch logs), the YARN NM logs, or the YARN RM logs.
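The first two steps of the checklist above can be condensed into a lookup, sketched below for a Bash shell. The function names are hypothetical and the output strings are shorthand for the steps in this article, not YARN output. Only the matched state names and diagnostics keywords come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical triage helpers mirroring Steps 1 and 2; not a YARN API.

# Map a YARN application state to the next check.
next_step_for_state() {
  case "$1" in
    NEW_SAVING) echo "check the RM state store (usually ZooKeeper)" ;;
    SUBMITTED)  echo "check the YARN cluster logs for event accumulation in the RM" ;;
    ACCEPTED)   echo "step 2: check the AM diagnostics" ;;
    RUNNING)    echo "step 3: check for outstanding resource requests" ;;
    *)          echo "state not covered by this checklist" ;;
  esac
}

# Map a diagnostics message to the likely cause or next step.
cause_for_diagnostics() {
  case "$1" in
    *"Queue's AM resource limit exceeded"*)
      echo "queue AM resource limit reached" ;;
    *"User's AM resource limit exceeded"*)
      echo "user AM resource limit reached" ;;
    *"waiting for AM container to Register with RM"*)
      echo "AM started but not initialized: check the AM logs" ;;
    *"waiting for resources to be assigned for AM"*)
      echo "step 4: check scheduler resources" ;;
    *)
      echo "unrecognized diagnostics: read the full message" ;;
  esac
}
```

The state and diagnostics strings would typically be copied from the application page or the REST API response; fetching them is left out of this sketch.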

TaskManager Startup Exception:

org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is … found …

This exception is thrown when the Flink AM asks a YARN NM to start a Container whose token has expired. The usual cause is that the Flink AM starts the Container long after receiving it from the YARN RM, later than the Container's expiry time (by default, a Container that has not been started is released after 10 minutes). The underlying reason is that Flink starts Containers serially after receiving them from the YARN RM.

When many Containers are waiting to be started and distributed file storage such as HDFS is slow (the TaskManager configuration must be uploaded before a Container is started), Container start requests tend to pile up inside Flink. FLINK-13184 optimizes this. First, a validity check is added before startup to avoid uploading configuration pointlessly. Second, startup is made asynchronous and multithreaded to speed it up.

Failover Exception 1:

java.util.concurrent.TimeoutException: Slot allocation request timed out for …

The requested TaskManager resources cannot be allocated normally. Troubleshoot by following Step 4 of the Flink application resource allocation steps above.

Failover Exception 2:

java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id timed out.

The direct cause is that the TaskManager heartbeat timed out. Possible causes are as follows:

The process has exited, possibly because of an error of its own or because of the preemption mechanism on the YARN RM or NM. Trace the cause through the TaskManager logs or the YARN RM/NM logs.

The process is still running but lost contact because of a cluster network fault. The TaskManager exits on its own once its connection times out, and the JobManager recovers from the exception through failover (requesting resources again and starting a new TaskManager).

The process spends too long in GC, possibly because of a memory leak or improperly configured memory. Locate the cause through logs or memory analysis.

Failover Exception 3:

java.lang.Exception: Container released on a lost node

The node where the Container runs has been marked LOST in the YARN cluster. The YARN RM releases all Containers on that node and notifies the AM. On receiving this exception, the JobManager recovers through failover (requesting resources again and starting a new TaskManager), and the leftover TaskManager processes exit on their own after timing out.

Fault Diagnosis of Flink Cluster

Analyze and locate faults based on the JobManager/TaskManager logs. To get the complete logs, see the section on viewing the complete logs of a Flink application. To obtain DEBUG information, modify the JobManager/TaskManager log4j configuration (${FLINK_HOME}/conf/log4j.properties). To inspect the internals of a running process, Java Byteman is recommended. For details, see: How Do I Install The Agent Into A Running Program?

References

The resources mentioned in this article are listed below for detailed reference:

Byteman Documents

Maven Shade Plugin

YARN-9050

FLINK-13184

How Do I Install The Agent Into A Running Program?

Together, the two parts of this Flink on YARN series describe the whole startup process of Flink on YARN applications and offer troubleshooting approaches for common client and Flink cluster problems. We hope they help you in practice.


Author: Ba Shu Zhen

Original link: yq.aliyun.com/articles/71…

This article is original content of the Yunqi Community and may not be reproduced without permission.
