Author: Yang Tao (Boyuan)
Flink supports the Standalone deployment mode as well as the YARN, Kubernetes, and Mesos cluster deployment modes, and the YARN mode is the most widely used in China. The Flink community is publishing a two-part series on Flink on YARN applications: the first part walked through the startup process of a Flink on YARN application, and this second part answers frequently asked questions about the client and the Flink Cluster based on community feedback, and shares the corresponding troubleshooting ideas.
Common Client Problems and troubleshooting ideas
Could not build the program from JAR file
This error is misleading: in many cases the problem is not the JAR file specified on the command line, but an exception thrown during submission, and the log must be examined to investigate further. The most common cause is that the Hadoop JARs the client depends on were not added to the CLASSPATH, so a required class could not be found (for example, ClassNotFoundException: org.apache.hadoop.yarn.exceptions.YarnException), which in turn prevents the client entry class (FlinkYarnSessionCli) from being loaded.
How do I associate a Flink on YARN application with a specific YARN cluster?
Configure the HADOOP_CONF_DIR and HADOOP_CLASSPATH environment variables on the Flink client so that it can load the Hadoop configuration and the dependent Hadoop JARs. Example (assuming the existing environment variable HADOOP_HOME points to the Hadoop deployment directory):
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath`
Where and how to configure client logs?
Client logs are usually written to the log folder under the Flink deployment directory, in a file named like flink-${USER}-client-<hostname>.log, and they use the log4j configuration ${FLINK_HOME}/conf/log4j-cli.properties.
To troubleshoot problems with the log4j configuration itself, you can additionally enable log4j's internal debug output before running the client: export JVM_ARGS="-Dlog4j.debug=true"
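For quick orientation, the commands below locate and follow the newest client log under the default deployment layout; the file-name pattern is an assumption based on Flink's usual naming and may differ slightly between versions.
# list client logs, newest first
ls -lt ${FLINK_HOME}/log/*-client-*.log
# follow the newest one while re-running the submit command
tail -f $(ls -t ${FLINK_HOME}/log/*-client-*.log | head -n 1)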
Troubleshooting faults on the client
If the client log does not pinpoint the problem, change the log level in the log4j configuration file from INFO to DEBUG and rerun the command to see whether any DEBUG output helps locate the fault. For problems with no log output or incomplete log information, code-level debugging may be needed; modifying the source code, repackaging, and replacing JARs is too tedious, so the Java bytecode injection tool Byteman is recommended instead (for detailed syntax, see the Byteman documentation).
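To illustrate the log-level change just mentioned, a minimal sketch assuming the log4j 1.x style properties file shipped with Flink releases of this era (log4j2-based versions use a different property layout):
sed -i 's/^log4j.rootLogger=INFO/log4j.rootLogger=DEBUG/' ${FLINK_HOME}/conf/log4j-cli.properties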
Using Byteman, the debugging workflow looks like this: (1) Write a debugging script. For example, to print the client class Flink actually uses, the following script prints the return value when the CliFrontend#getActiveCustomCommandLine method exits:
RULE test
CLASS org.apache.flink.client.cli.CliFrontend
METHOD getActiveCustomCommandLine
AT EXIT
IF TRUE
DO traceln("------->CliFrontend#getActiveCustomCommandLine return: " + $!)
ENDRULE
(2) Set the environment variables so that the client runs with the Byteman JavaAgent:
export BYTEMAN_HOME=/path/to/byte-home
export TRACE_SCRIPT=/path/to/script
export JVM_ARGS="-javaagent:${BYTEMAN_HOME}/lib/byteman.jar=script:${TRACE_SCRIPT}"
(3) Run the test command bin/flink run -m yarn-cluster -p 1 ./examples/streaming/WordCount.jar; the console prints:
——->CliFrontend#getActiveCustomCommandLine return: org.apache.flink.yarn.cli.FlinkYarnSessionCli@25ce9dc4
Flink Cluster FAQ and troubleshooting ideas
Version conflicts between user applications and framework JAR packages
This problem usually shows up as NoSuchMethodError, ClassNotFoundException, or IncompatibleClassChangeError exceptions. To resolve it:
1. Locate the dependency library from the exception class, then run mvn dependency:tree in the project to inspect the dependency tree and find where the conflicting versions come from. You can narrow the output with -Dincludes, whose value takes the form [groupId]:[artifactId]:[type]:[version], supports the '*' wildcard, and accepts multiple patterns separated by commas, for example mvn dependency:tree -Dincludes=power,javaassist (see the sketch after this list);
2. After locating the conflicting package, decide how to resolve the conflict. The simple approach is to use an exclusion to remove the transitive dependency pulled in by the dependent project; if both versions are really needed, the classes can be relocated instead. For details, see the Maven Shade Plugin.
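A hedged sketch of the diagnosis step above: -Dverbose makes the plugin also print versions that Maven omitted because of conflicts, and the artifact pattern is a placeholder to replace with the library you are chasing.
# show the dependency tree, including conflict-omitted versions, filtered to suspect artifacts
mvn dependency:tree -Dverbose -Dincludes='*:javassist*'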
How to determine the specific source of a dependent library when multiple versions of JAR packages exist?
Many applications have multiple versions of JAR packages for the same library on the runtime CLASSPATH, so the version actually used depends on the order in which the JARs are loaded. When troubleshooting, it is often necessary to determine which JAR a class was loaded from. You can print each loaded class and its source (written to the .out log) using one of the following configuration items:
env.java.opts=-verbose:class                 # applies to both JobManager and TaskManager
env.java.opts.jobmanager=-verbose:class      # JobManager only
env.java.opts.taskmanager=-verbose:class     # TaskManager only
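Once -verbose:class is enabled, a Java 8 JVM writes lines such as "[Loaded <class> from <jar>]" to standard output. A sketch of searching for a suspect class; the class name and log path below are placeholders, and on YARN the .out files sit in the container log directories described in the next section.
grep '\[Loaded com.example.SuspectClass' /path/to/container/logs/taskmanager.out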
How to view the complete log of Flink application?
The JobManager/TaskManager logs of a running Flink application can be viewed in the WebUI, but many problems have to be analyzed against the complete logs, so it helps to understand YARN's log retention mechanism.
1. If the application has not finished, container logs are kept on the nodes where the containers run, under the directory configured by ${yarn.nodemanager.log-dirs} on each node, i.e. ${yarn.nodemanager.log-dirs}/<APPLICATION_ID>/<CONTAINER_ID>; they can also be accessed directly from the NodeManager WebUI: http://<NM_ADDRESS>/node/containerlogs/<CONTAINER_ID>/<USER>.
2. If the application has finished and log aggregation is enabled for the cluster (yarn.log-aggregation-enable=true), the NodeManager uploads the application's logs to distributed storage (usually HDFS) and deletes the local files once the application completes (rolling incremental upload can also be configured). To view all logs of an application, run the YARN command yarn logs -applicationId <APPLICATION_ID> -appOwner <USER>; add -containerId <CONTAINER_ID> -nodeAddress <NODE_ADDRESS> to view the logs of a single container (see the sketch below). You can also read the aggregated files directly from the distributed storage directory ${yarn.nodemanager.remote-app-log-dir}/${user}/${yarn.nodemanager.remote-app-log-dir-suffix}/<APPLICATION_ID>.
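For illustration, a hedged example of the yarn logs invocations above; the application ID, container ID, node address, and owner are placeholders, and -appOwner is only needed when fetching logs of another user's application.
# all logs of the application
yarn logs -applicationId application_1570000000000_0001 -appOwner flink_user
# logs of a single container
yarn logs -applicationId application_1570000000000_0001 \
  -containerId container_1570000000000_0001_01_000002 \
  -nodeAddress worker-node-01:45454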
Troubleshooting for Flink application resource allocation problems
If the Flink application fails to start and reach the RUNNING state, troubleshoot as follows:
1. First check the current state of the application. Based on the startup process described earlier:
- If the state is NEW_SAVING, the application information is being persisted. If it stays in this state, check whether the RM state-store service (usually a ZooKeeper cluster) is healthy.
- If it stays in SUBMITTED, a blocking read/write operation inside the RM may be causing events to pile up; locate the cause from the YARN cluster logs.
- If it is in ACCEPTED, first check whether the AM is healthy, and go to Step 2.
- If it is RUNNING but has not obtained all of the requested resources, so the job cannot run normally, go to Step 3.
2. Check whether the AM is healthy. In the YARN application page (http://<RM_ADDRESS>/cluster/app/<APPLICATION_ID>) or via the YARN application REST API (http://<RM_ADDRESS>/ws/v1/cluster/apps/<APPLICATION_ID>), look at the Diagnostics information and identify the cause and solution from the keywords it contains (a curl sketch of these REST queries follows the checklist below):
- Queue's AM resource limit exceeded: the queue has reached its limit on AM resources, i.e. the AM resources already used in the queue plus the newly requested AM resources exceed the queue's AM resource cap. You can raise the queue's AM resource percentage via yarn.scheduler.capacity.<QUEUE_PATH>.maximum-am-resource-percent.
- User's AM resource limit exceeded: the application's user has reached the limit on AM resources available to that user in the queue, i.e. the AM resources this user already uses in the queue plus the newly requested AM resources exceed the user's AM resource cap. You can raise the share of resources available to the user via yarn.scheduler.capacity.<QUEUE_PATH>.user-limit-factor and yarn.scheduler.capacity.<QUEUE_PATH>.minimum-user-limit-percent.
- AM container is launched, waiting for AM container to Register with RM: the AM has been started but has not finished its internal initialization (for example, a ZooKeeper connection timeout); check the AM logs and resolve according to the specific cause.
- Application is Activated, waiting for resources to be assigned for AM: the AM checks have passed and the application is waiting for the scheduler to allocate its AM resources; check the resources at the scheduler level and go to Step 4.
3. Confirm that the application really has resource requests that YARN has not satisfied. From the application list page, click the application ID to open the application page, then click the application attempt ID to open the attempt page, and check whether the Total Outstanding Resource Requests list contains Pending resources. If it does not, YARN has already assigned all requested resources; exit this check flow and go back to checking the AM. If it does, the scheduler has not completed the assignment; go to Step 4.
4. Check the scheduler. YARN-9050 adds automatic diagnosis of application problems on the WebUI and through the REST API, but it will only be released in Hadoop 3.3.0; for earlier versions, the following checks have to be done manually:
- Check the resources of the cluster and of the queue the application belongs to: in the scheduler page, expand the leaf queue and look at the resource information (Effective Max Resource, Used Resources). (1) Check whether the cluster's resources, the queue's resources, or the parent queue's resources are used up. (2) Check whether any single resource dimension in the leaf queue is close to or at its limit.
- Check for resource fragmentation: (1) Check the proportion of Used plus Reserved resources in the cluster total. When the cluster is almost fully used (for example, above 90%), resource fragmentation may appear and slow down allocation; if Reserved resources grow large, much of the cluster may effectively be locked, also slowing subsequent allocation. (2) Check how available resources are distributed across NMs. Even when overall utilization is not high, the distribution across dimensions can differ; for example, memory may be nearly exhausted on half of the nodes while CPU is plentiful, and CPU nearly exhausted on the other half while memory is plentiful, so a request that over-sizes one dimension may never be satisfiable.
- Check whether a high-priority application is repeatedly requesting and immediately releasing resources, which keeps the scheduler so busy satisfying that one application that it cannot serve the others.
- Check whether containers fail to start or exit right after starting; investigate via the container logs (including localization and launch logs), the YARN NM logs, and the YARN RM logs.
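The checks above can also be driven from YARN's standard REST endpoints; the sketch below is illustrative only, <RM_ADDRESS> and <APPLICATION_ID> are placeholders, and jq is assumed to be installed just for readable output.
# application state and Diagnostics (Step 2)
curl -s http://<RM_ADDRESS>/ws/v1/cluster/apps/<APPLICATION_ID>
# scheduler and queue resource usage (Step 4)
curl -s http://<RM_ADDRESS>/ws/v1/cluster/scheduler | jq .
# per-node available memory/vcores, to spot skewed resource distribution (Step 4)
curl -s http://<RM_ADDRESS>/ws/v1/cluster/nodes | jq '.nodes.node[] | {nodeHostName, availMemoryMB, availableVirtualCores}'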
▼ TaskManager startup exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is … found …
This exception is thrown when Flink's AM asks a YARN NM to start a container whose token has already expired. The common cause is that too much time passes between the AM receiving the container from the YARN RM and actually launching it, exceeding the container's validity period (10 minutes by default), after which the container is released. A further reason is that Flink launches containers serially after receiving the container resources returned by the YARN RM.
When there are many containers to start and the distributed file storage (such as HDFS) is slow (the TaskManager configuration has to be uploaded before each launch), container start requests can easily back up. FLINK-13184 optimizes this: first, a validity check is added before launch to avoid a pointless configuration upload; second, the launch is parallelized with asynchronous multithreading to speed it up.
▼ Failover exception 1: java.util.concurrent.TimeoutException: Slot allocation request timed out for…
The cause is that TaskManager resources cannot be allocated. Troubleshoot by following the steps in the "Troubleshooting for Flink application resource allocation problems" section above.
▼ Failover exception 2: java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id <CONTAINER_ID> timed out.
The direct cause is that the TaskManager heartbeat times out. Further causes may be as follows:
- The process has already exited, either because of an error or because it was killed by the YARN RM or NM preemption mechanism; trace the TaskManager and YARN RM/NM logs to confirm.
- The process is still running but has been disconnected from the cluster by a network problem; after the connection times out it exits on its own, and the JobManager performs a failover to recover (re-applying for resources and starting a new TaskManager).
- The process is stuck in long GC pauses, possibly caused by a memory leak or an improper memory configuration; locate the cause from the logs or with memory analysis tools.
▼ Failover exception 3: java.lang.Exception: Container released on a lost node
The cause is that the node running the container has been marked LOST by the YARN cluster; all containers on that node are released by the YARN RM, which notifies the AM. When the JobManager receives this exception, it performs a failover to recover (re-applying for resources and starting a new TaskManager), and any leftover TaskManager process exits on its own after the timeout.
Troubleshooting of Flink Cluster
First, analyze and locate the fault from the JobManager/TaskManager logs; for complete logs, see "How to view the complete log of Flink application" above. If more debugging information is needed, change the JobManager/TaskManager log level in the log4j configuration (${FLINK_HOME}/conf/log4j.properties) and restart the application. To inspect the internal state of a running process, the Java bytecode injection tool Byteman is recommended; for details, see "How Do I Install The Agent Into A Running Program?".
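A hedged sketch of attaching Byteman to an already running TaskManager; bminstall.sh and bmsubmit.sh ship with the Byteman distribution, while the PID lookup and the script path below are placeholders to adapt.
TM_PID=$(jps -m | grep -iE 'taskmanager|taskexecutor' | head -n 1 | awk '{print $1}')   # pick one TaskManager JVM (adjust the filter as needed)
${BYTEMAN_HOME}/bin/bminstall.sh -b ${TM_PID}               # install the agent into the running JVM
${BYTEMAN_HOME}/bin/bmsubmit.sh -l /path/to/trace.btm       # load a tracing rule script
${BYTEMAN_HOME}/bin/bmsubmit.sh -u /path/to/trace.btm       # later, unload the rules again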
References
The green text in the original article contains links; see the following resources for more information:
Byteman Documents
Maven Shade Plugin
YARN-9050
FLINK-13184
How Do I Install The Agent Into A Running Program?
Together, the two articles in this series summarize the startup process of Flink on YARN applications and provide troubleshooting ideas for common client and Flink Cluster problems, for your reference. We hope they prove helpful in practice.
Apache Flink community recommendation
Flink Forward Asia 2019 will be held in Beijing from November 28 to 30. Senior technology experts from nearly 30 well-known companies, including Alibaba, Tencent, Meituan, ByteDance, Baidu, Intel, Dell EMC, Lyft, and Netflix, as well as the founding team of Flink, will gather at the International Conference Center to discuss core technologies and the open source ecosystem of the big data era with developers from around the world. For the full agenda, see:
Developer.aliyun.com/special/ffa…
Reply "tickets" in the Flink community WeChat official account to claim one of a limited number of free tickets, first come, first served.
Flink community official WeChat public account