
Preface

This article summarizes the steps for locating common online (production) problems in Java applications. The main purpose is to give readers who have had little exposure to online problems some knowledge in advance, so that they are not at a loss when a real problem hits. After all, the author has been through that kind of scramble himself.

One reminder first. During an online emergency, there is only one overall goal: “Restore service as quickly as possible and eliminate the impact.” At every stage of the emergency, think about recovery first. Recovery does not require locating the root cause or having a perfect solution; it may rely on experienced judgment, a preset switch, and so on. What matters is that it restores service quickly. After that, preserve part of the scene so that you can locate, solve, and review the problem.

Ok, now let’s get down to business.

1. High/soaring CPU utilization

Note: CPU usage is an important indicator of how busy a system is. However, the safe threshold for CPU usage is relative: it depends on whether your system is IO-intensive or compute-intensive. Generally, compute-intensive applications have high CPU usage but low load.

Common causes:

  • Frequent GC

  • Infinite loops, thread blocking, IO wait, etc.

Simulation

To illustrate, here is a simple infinite loop that simulates a CPU surge.

Add the CpuReaper class to a minimal Spring Boot web project:

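    import javax.annotation.PostConstruct; // jakarta.annotation.PostConstruct on Spring Boot 3+
    import org.springframework.stereotype.Component;

    /**
     * @author Richard_yyf
     */
    @Component
    public class CpuReaper {

        // Runs once at startup and spins in a busy loop for roughly 1000 seconds.
        @PostConstruct
        public void cpuReaper() {
            int num = 0;
            long start = System.currentTimeMillis() / 1000;
            while (true) {
                num = num + 1;
                if (num == Integer.MAX_VALUE) {
                    System.out.println("reset");
                    num = 0;
                }
                // Stop once more than 1000 seconds have elapsed.
                if ((System.currentTimeMillis() / 1000) - start > 1000) {
                    return;
                }
            }
        }
    }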

Once packaged as a JAR, run it on the server:

    java -jar cpu-reaper.jar &

Step 1: Locate the offending thread

Method A: The traditional method

  1. top locates the process with the highest CPU usage

    Run the top command to view the CPU usage of all processes and find the culprit; in this case, our Java process. The PID column shows the process ID. (See the Appendix for any indicators that are unclear.)

  2. top -Hp <pid> locates the thread with the highest CPU usage

  3. printf '0x%x' <tid> converts the thread ID to hexadecimal

    printf '0x%x' 12817
    0x3211

  4. jstack <pid> | grep <hex tid> finds the thread stack

    jstack 12816 | grep 0x3211 -A 30
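The conversion in step 3 can also be done in Java. The following is a minimal sketch; the class name NidConverter is a hypothetical helper and not part of any tool mentioned here. It formats a decimal thread ID from top in the hexadecimal form that jstack prints in its nid field.

```java
class NidConverter {
    /** Formats a decimal OS thread ID the way jstack prints the nid field. */
    static String toNid(long threadId) {
        return "0x" + Long.toHexString(threadId);
    }

    public static void main(String[] args) {
        // Same input as the printf example in step 3.
        System.out.println(toNid(12817)); // prints 0x3211
    }
}
```

The resulting string is what you would pass to grep in step 4.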

Method B: show-busy-java-threads

This script comes from useful-scripts, an open source project on GitHub that provides many handy scripts, show-busy-java-threads among them. It condenses the tedious steps of Method A into a single command, as follows:

    wget --no-check-certificate https://raw.github.com/oldratlee/useful-scripts/release-2.x/bin/show-busy-java-threads
    chmod +x show-busy-java-threads
    ./show-busy-java-threads

    show-busy-java-threads
    # Finds the most CPU-consuming threads (5 by default) across all running Java processes and prints their stacks.

    show-busy-java-threads -p <Java process ID>
    # Analyzes only the specified Java process.

    show-busy-java-threads -c <number of threads to display>

Method C: Arthas thread

Alibaba's open source Arthas now covers almost all of our online troubleshooting needs with a complete toolset. For this scenario, a single thread -n command is all it takes.

    curl -O https://arthas.gitee.io/arthas-boot.jar  # download

Note that Arthas calculates the CPU percentage differently from the previous two methods: it reports the share of CPU time that each thread in the current JVM consumed over a sampling interval.

See the official documentation: alibaba.github.io/arthas/thre…
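To make the idea of "CPU time per thread over a sampling interval" concrete, here is a rough sketch based on the standard ThreadMXBean API. It is not Arthas's actual implementation; the class name ThreadCpuSampler and the way the percentage is computed (CPU time delta divided by elapsed wall-clock time) are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

class ThreadCpuSampler {
    /** Returns thread ID -> percentage of one CPU core used during the interval. */
    static Map<Long, Double> sample(long intervalMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, Double> usage = new HashMap<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return usage; // this JVM cannot measure per-thread CPU time
        }
        // First reading: CPU time (in nanoseconds) consumed so far by each live thread.
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }
        long start = System.nanoTime();
        try {
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return usage;
        }
        long elapsed = System.nanoTime() - start;
        // Second reading: the delta divided by the elapsed time is the usage share.
        for (long id : mx.getAllThreadIds()) {
            Long t0 = before.get(id);
            long t1 = mx.getThreadCpuTime(id);
            if (t0 == null || t0 < 0 || t1 < 0) {
                continue; // thread started or died during the interval
            }
            usage.put(id, 100.0 * (t1 - t0) / elapsed);
        }
        return usage;
    }

    public static void main(String[] args) {
        sample(1000).forEach((id, pct) -> System.out.printf("thread %d: %.1f%%%n", id, pct));
    }
}
```

Because the value depends on the chosen interval, it will generally not match the figures that top -Hp shows for the same thread.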

Follow-up analysis

Once the first step has led you to the problematic code, examine the thread stack. What comes next has to be analyzed case by case. Here are a few examples.

Case 1: The threads using the most CPU are GC threads.

    "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007fd99001f800 nid=0x779 runnable
    "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007fd990021800 nid=0x77a runnable
    "GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007fd990023000 nid=0x77b runnable
    "GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007fd990025000 nid=0x77c runnable

GC troubleshooting covers a lot of ground, so it gets a separate section later.

Case 2: The highest CPU usage is found in business threads

  • IO wait

    • For example, IO is blocked because of insufficient disk space

  • Waiting for kernel-level locks, such as synchronized

    • jstack -l <pid> | grep BLOCKED shows the thread stacks in the BLOCKED state

    • Dump the thread stacks and analyze which threads hold which locks.

    • Arthas provides thread -b to find the thread that is currently blocking other threads. It targets the synchronized case.
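The same BLOCKED-state check can be approximated from inside the JVM. The sketch below uses the standard ThreadMXBean API; the class name BlockedThreadFinder is hypothetical, and this is only an in-process analogue of grepping jstack output, not how jstack or Arthas work internally.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

class BlockedThreadFinder {
    /** Lists every thread currently in the BLOCKED state, with the lock it wants and its owner. */
    static List<String> findBlocked() {
        List<String> result = new ArrayList<>();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                result.add(info.getThreadName()
                        + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        findBlocked().forEach(System.out::println);
    }
}
```

The owner name in each entry points to the thread whose stack is worth reading first.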

2. Frequent GC

1. Review the GC process

Before we get to what follows, take a moment to review the entire GC process.

Following on from the previous section, the natural next step in this case is to look at what the GC is actually doing.

  • Method A: View the GC logs

  • Method B: jstat -gcutil <pid> <interval in ms> <count>; if the count is omitted, statistics are reported continuously

  • Method C: If your company has a component that monitors applications (such as Prometheus + Grafana), that is even more convenient
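As a complement to the methods above, collection counts and accumulated GC time can also be read in-process through the standard GarbageCollectorMXBean API. This is a small sketch; the class name GcStats and the output format are made up for illustration, and the collector names depend on which GC the JVM runs with.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcStats {
    /** Returns one line per collector with its collection count and accumulated time. */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        snapshot().forEach(System.out::println);
    }
}
```

Taking two snapshots some time apart and comparing the counts gives a quick sense of how frequently each collector is running.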

A few additional notes on enabling GC logs. An often-discussed question is whether GC logging should be turned on in a production environment. Since the overhead is usually very limited, my answer is to turn it on. However, you do not have to specify the GC logging parameters when starting the JVM.

The HotSpot JVM has a special class of parameters called manageable parameters, whose values can be changed at run time. All of the parameters discussed here, and those that begin with “PrintGC”, are manageable. This lets us turn GC logging on or off at any time. For example, we can set these parameters with the JDK's built-in jinfo tool, or through a JMX client by calling the setVMOption method of the HotSpotDiagnostic MXBean.

Again, kudos to Arthas ❤️, which provides the vmoption command for viewing and updating VM diagnostic parameters directly.
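The HotSpotDiagnostic MXBean route can be sketched as follows. The class name ManageableFlags is hypothetical, and the example uses HeapDumpOnOutOfMemoryError purely as a stand-in for a manageable flag, on the assumption that it is writable on the HotSpot JVM in use; which flags are manageable varies by JDK version, so list them first.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class ManageableFlags {
    private static HotSpotDiagnosticMXBean bean() {
        return ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    }

    /** Names of the VM options that this JVM allows to be changed at run time. */
    static List<String> writableOptionNames() {
        List<String> names = new ArrayList<>();
        for (VMOption option : bean().getDiagnosticOptions()) {
            names.add(option.getName());
        }
        return names;
    }

    /** Sets a manageable option and returns the value the JVM reports afterwards. */
    static String setAndRead(String name, String value) {
        bean().setVMOption(name, value);
        return bean().getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println(writableOptionNames());
        // Assumption: this flag is manageable on the running JVM.
        System.out.println(setAndRead("HeapDumpOnOutOfMemoryError", "true"));
    }
}
```

setVMOption throws an IllegalArgumentException for a flag that does not exist or is not writable, which is why checking the list first is the safer order.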

Once you have captured the GC logs, you can upload them to GC Easy to help with analysis and get visualized chart analysis results.

2. GC causes and location

“Promotion failed”

Objects promoted from the Survivor space do not fit into the old generation either, which triggers a Full GC (and an OOM is thrown if the Full GC cannot reclaim enough memory).

Possible causes:

  • “Survivor space too small, so objects enter the old generation too early”

    Check the SurvivorRatio parameter

  • “Large object allocation with not enough memory”

    Dump the heap and analyze object occupancy with a profiler or MAT

  • “The old generation still holds a large number of objects”

    Dump the heap and analyze object occupancy with a profiler or MAT

You can also infer the problem from the effect of a Full GC. Normally a Full GC should reclaim a lot of memory, so a healthy heap usage curve looks like a sawtooth. If the heap barely drops after a Full GC, you can infer that it contains a large and growing number of objects that cannot be collected. They push heap usage past the Full GC threshold, but the collection frees almost nothing, so Full GC keeps running. In other words, there may be a memory leak.
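The reasoning in the previous paragraph can be written down as a simple check over the heap usage measured right after each Full GC. This is only an illustrative heuristic: the class name LeakHeuristic, the minimum of three samples, and the growth threshold are all assumptions, not an established rule.

```java
class LeakHeuristic {
    /**
     * @param usedAfterFullGc heap in use right after each consecutive Full GC (any consistent unit)
     * @param minGrowth       fraction by which the last sample must exceed the first (hypothetical threshold)
     * @return true if the post-GC floor never drops and has grown by at least minGrowth
     */
    static boolean floorKeepsRising(long[] usedAfterFullGc, double minGrowth) {
        if (usedAfterFullGc.length < 3) {
            return false; // too few samples to judge
        }
        for (int i = 1; i < usedAfterFullGc.length; i++) {
            if (usedAfterFullGc[i] < usedAfterFullGc[i - 1]) {
                return false; // the floor dropped, so this Full GC did reclaim memory
            }
        }
        long first = usedAfterFullGc[0];
        long last = usedAfterFullGc[usedAfterFullGc.length - 1];
        return last >= first * (1.0 + minGrowth);
    }

    public static void main(String[] args) {
        long[] suspicious = {400, 450, 520, 600}; // MB after each Full GC
        long[] healthy = {400, 380, 410, 390};
        System.out.println(floorKeepsRising(suspicious, 0.2)); // prints true
        System.out.println(floorKeepsRising(healthy, 0.2));    // prints false
    }
}
```

A positive result is only a hint that a heap dump is worth taking; a legitimately growing cache produces the same pattern.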

In general, diagnosing GC-related problems comes down to memory analysis: dump a memory snapshot with a tool such as jmap (or Arthas's heapdump command), then examine it with a visual memory analysis tool such as MAT, JProfiler, or JVisualVM.
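A memory snapshot can also be triggered from inside the application through the HotSpotDiagnostic MXBean. The sketch below assumes a HotSpot JVM; the class name HeapDumper and the temporary-directory location are choices made for illustration only.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

class HeapDumper {
    /** Writes a heap dump of live objects into a fresh temporary directory and returns its path. */
    static Path dumpLiveObjects() {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            // The target file must not exist yet; recent JDKs also require the .hprof suffix.
            Path target = dir.resolve("heap.hprof");
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.toString(), true); // true = only objects that are still reachable
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Heap dump written to " + dumpLiveObjects());
    }
}
```

The resulting .hprof file can be opened in the same visual tools as a dump produced by jmap. Keep in mind that dumping pauses the application and the file can be as large as the heap.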

The steps after memory analysis depend on the specific problem, so readers will need to work through them case by case.

3. Thread pool exceptions

Take a Java thread pool with a bounded queue as an example. When a new task is submitted and fewer than corePoolSize threads are running, a new thread is created to handle it. Once the number of running threads reaches corePoolSize, new tasks are placed in the queue until the queue is full. When the queue is full, new threads are started to handle tasks, but never more than maximumPoolSize. When the queue is full and the maximum number of threads is already running, ThreadPoolExecutor rejects new tasks.
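The submission flow described above can be observed with a small experiment. This sketch (the class name BoundedPoolDemo is hypothetical) keeps submitting tasks that block on a latch until the default rejection policy throws, then reports how many tasks were accepted, how many threads exist, and how many tasks are queued.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolDemo {
    /** Submits blocking tasks until one is rejected; returns {accepted, poolSize, queued}. */
    static int[] fillUntilRejected(int core, int max, int queueCapacity) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(queueCapacity));
        int accepted = 0;
        try {
            while (true) {
                // Each task blocks until the latch opens, so no worker thread becomes free.
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                accepted++;
            }
        } catch (RejectedExecutionException e) {
            // Queue is full and maximumPoolSize threads are busy: the default AbortPolicy throws.
            return new int[] {accepted, pool.getPoolSize(), pool.getQueue().size()};
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] r = fillUntilRejected(1, 2, 1);
        // prints accepted=3 poolSize=2 queued=1
        System.out.println("accepted=" + r[0] + " poolSize=" + r[1] + " queued=" + r[2]);
    }
}
```

With corePoolSize 1, maximumPoolSize 2, and a queue of capacity 1, the first task takes the core thread, the second is queued, the third starts the second thread, and the fourth is rejected. In general the pool accepts maximumPoolSize plus the queue capacity of concurrently blocked tasks.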

Common problems and causes

Thread pool exceptions usually have one of the following causes:

  1. “Downstream service response time (RT) is too long”

    This may be caused by a fault in the downstream service. As the consumer, we need to set appropriate timeouts and a circuit-breaking and degradation mechanism.

    In addition, corresponding monitoring such as log monitoring and metrics monitoring is required. Do not wait until users notice the failure and report it from outside before you start checking logs.

  2. “Slow SQL or deadlocks in the database”

    Check the logs for the relevant keywords.

  3. “Deadlock in Java code”

    jstack -l <pid> | grep -i -E 'BLOCKED|deadlock'

Troubleshooting the first two types of problems generally relies on logs or monitoring components.
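For the first cause in the list above, the timeout-plus-fallback idea can be sketched with plain java.util.concurrent. The class name TimeoutCall and the fallback value are hypothetical, and this is only the timeout half of the picture: a real circuit breaker would also track failure rates and stop calling the downstream service for a while.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutCall {
    /** Runs the call on the given pool and returns the fallback if it fails or exceeds the timeout. */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> call,
                                 long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow call so its thread is freed
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the downstream call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Simulated slow downstream service: sleeps far longer than the timeout.
            String result = callWithTimeout(pool, () -> {
                Thread.sleep(5000);
                return "late";
            }, 100, "fallback");
            System.out.println(result); // prints fallback
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Bounding every downstream call this way keeps a slow dependency from holding pool threads indefinitely, which is what eventually fills the queue and triggers rejections.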

4. Recovering from common problems

The content of this section is taken from a referenced article.

5. Arthas

Again, I want to do a separate section on Arthas.

Arthas is an open source Java diagnostic tool from Alibaba. Built on the Java Agent mechanism, it uses Instrumentation to modify bytecode in order to diagnose Java applications.

  • dashboard: Real-time system data panel showing thread, memory, GC, and other information

  • thread: Displays thread information, such as the top N busiest threads

  • getstatic: Gets a static field value; for example, getstatic className attrName can be used to check the actual value of an online switch

  • sc: Views the classes loaded by the JVM, which helps in checking for JAR conflicts

  • sm: Views method information of the classes loaded by the JVM

  • jad: Decompiles classes loaded by the JVM, which helps determine why code logic is not being executed

  • logger: Views logger information and updates logger levels

  • watch: Observes method execution data, including input parameters, return values, and exceptions

  • trace: Traces the internal call path of a method and prints the time spent at each node, for performance analysis

  • tt: Records method invocations and replays them

The list above is excerpted from the official Arthas documentation.

In addition, Arthas integrates OGNL, a lightweight expression engine, which lets you do a great deal from within Arthas.

The rest is not covered here; those interested can consult Arthas's official documentation and GitHub issues.

6. Tools involved

A few more tools.

  • Arthas (Super Recommended ❤️❤️)

  • useful-scripts

  • GC easy

  • Smart Java thread dump analyzer – thread dump analysis in seconds

  • PerfMa – Java virtual machine parameter / thread dump / memory dump analysis

  • Linux command

  • Java N ax

  • MAT, JProfiler… Visual memory analysis tools

Conclusion

I know this article is far from a comprehensive summary of online exceptions: network problems (timeouts, TCP queue overflow…), off-heap memory, and many other scenarios are not covered. The main reason is that I have had little exposure to them and have not studied them in depth. Anything I forced myself to write would be superficial, and I would be afraid of misleading others 😅.

What I want to say is that troubleshooting Java applications online is a real test of whether someone has solid fundamentals and problem-solving ability. Without a solid grounding in topics such as how thread pools work, GC analysis, and Java memory analysis, the process will only leave you more confused. It also helps to read good write-ups of online incident investigations: even if you do not run into the same problems right away, you will gradually build a mental framework for solving similar ones, and when the time comes you can reason by analogy.

If this article is helpful to you, please give me a thumbs up. This is my biggest motivation 🤝🤝🤗🤗.

References

  • Developer.aliyun.com/article/757…

  • Arthas 3.2.0 documentation

  • Distributed Services Architecture: Principles, Design, and Practice

Appendix

Meaning of the indicators displayed by the top command

From: juejin.cn/post/684490…