A JVM process running a very busy online service frequently sits above 100% CPU. Here are some of the things you can learn from this article:

  • How to troubleshoot high CPU usage in Java programs
  • How Log4j2’s pseudo-asynchrony can cause a large number of exceptions for online services
  • How to optimize Kafka asynchronous sending
  • How an on-CPU flame graph works and how to read one
  • How to use a Trie (prefix tree) to optimize Spring’s path matching

First attempts

If the JVM CPU usage is high, the first response is to find the thread with the highest CPU usage and see what it is executing. Use the top command to view the CPU usage of all threads in the process, as shown below.

top -Hp your_pid

The output is as follows:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   48 root      20   0 30.367g 2.636g  12940 S  12.7  2.9  36:15.18 java
 2365 root      20   0 30.367g 2.636g  12940 R   1.3  2.9   2:33.64 java
 2380 root      20   0 30.367g 2.636g  12940 S   1.3  2.9   2:33.10 java
 2381 root      20   0 30.367g 2.636g  12940 S   1.3  2.9   2:33.41 java
10079 root      20   0 30.367g 2.636g  12940 S   1.3  2.9   0:30.73 java
   10 root      20   0 30.367g 2.636g  12940 S   1.0  2.9   4:08.54 java
   11 root      20   0 30.367g 2.636g  12940 S   1.0  2.9   4:08.55 java
   92 root      20   0 30.367g 2.636g  12940 S   1.0  2.9   2:53.71 java
  681 root      20   0 30.367g 2.636g  12940 S   1.0  2.9   2:52.56 java
  683 root      20   0 30.367g 2.636g  12940 S   1.0  2.9   2:56.81 java
  690 root      20   0 30.367g 2.636g  12940 S   1.0  2.9   3:34.24 java

You can see that the thread with the highest CPU usage has a PID of 48 (0x30 in hex). Use jstack to dump the thread stacks, then grep for 0x30, as shown below.

jstack 1 | grep -A 10 "0x30 "

The output is as follows.

"kafka-producer-network-thread | producer-1" #35 daemon prio=5 os_prio=0 tid=0x00007f9ac4fc7000 nid=0x30 runnable [0x00007f9ac9b88000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x0000000094ef70c8> (a sun.nio.ch.Util$3)
        - locked <0x0000000094ef70e0> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000009642bbb8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.kafka.common.network.Selector.select(Selector.java:686)

You can see that this is a Kafka sender thread. Our logging uses log4j2’s Kafka appender to write log messages to Kafka, and the log volume is very heavy. Let’s first optimize the CPU usage of this Kafka sender thread.

KafkaAppender optimization under Log4j2

KafkaAppender wraps a KafkaProducer. Testing showed that the parameters batch.size and linger.ms strongly influence how often KafkaProducer sends. Let’s look at what these parameters actually do.

linger.ms

KafkaProducer sends messages when the batch buffer is full or when the linger.ms timeout is reached. linger.ms specifies how long the sender may wait before the batch buffer fills up; it is roughly equivalent to TCP’s Nagle algorithm.

The default value is 0: the Sender thread sends as soon as there is data, without waiting, even if there is only one record in the batch buffer. Latency is low, but throughput is poor.

If a value greater than 0 is set, the sender waits up to that long while the buffer is not yet full and sends the records accumulated during linger.ms together. This reduces the number of requests and avoids sending lots of small packets. It increases latency, but improves throughput.

batch.size

When KafkaProducer sends multiple messages, it groups the messages destined for the same partition into a batch.

batch.size specifies the size, in bytes, of the per-batch send buffer. Note that this is not a message count; the default value is 16384 (16 KB).

When a batch buffer is full, all messages in it are sent out. This does not mean KafkaProducer waits until the batch is full; otherwise, if only one message ever arrived, it would never be sent. linger.ms and batch.size together determine when KafkaProducer sends.

Setting batch.size too small reduces throughput; setting it too large wastes memory.

Neither of these values was set in our online configuration, so the producer was running with linger.ms=0 and batch.size=16 KB. Because logs are generated very frequently, the Sender thread is almost never idle and keeps processing and sending packets.
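As an illustration only (the broker address, topic, and values below are placeholders rather than our production settings), this is how the two parameters would be set on a plain KafkaProducer; with log4j2’s KafkaAppender the same keys can be passed through the appender’s Property elements.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerBatchingSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("linger.ms", "100");     // wait up to 100 ms for a batch to accumulate
        props.put("batch.size", "131072"); // allow batches of up to 128 KB per partition
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("app-logs", "hello")); // placeholder topic
        }
    }
}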

Potential pit in log4j2’s asynchronous Appender

Before adjusting the Kafka sender parameters, there is a potential pit in log4j2’s asynchronous Appender that must be avoided; otherwise it will cause a large number of timeouts in online business interfaces.

Log4j2’s asynchronous Appender uses a local ArrayBlockingQueue to store messages coming from the application layer. The default queue size in log4j2 (version 2.7) is 128; we had already raised it to 1024. If KafkaAppender processes messages slowly, the queue fills up quickly, as shown below.

Once the queue is full, the options are either to block and wait or to discard subsequent logs. The pit is that log4j2’s default policy is DefaultAsyncQueueFullPolicy, which makes the current thread block and wait synchronously. We can switch this to a discard policy so that upper-layer business interfaces are not affected no matter how slowly the underlying log writing goes; in the worst case some logs are dropped. Log4j2 provides a configuration item for this: set the system property log4j2.AsyncQueueFullPolicy to Discard.

That’s not all. After setting the queue-full policy to Discard, log4j2 by default only discards logs at INFO level and below. If the system generates a large number of WARN and ERROR logs, the policy will still block upstream threads. You also need to set log4j2.DiscardThreshold to ERROR or FATAL.
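A minimal sketch of setting both properties programmatically before log4j2 initializes (the property names are the ones mentioned above; passing them as -D JVM options is equivalent):

public class Log4j2QueueFullPolicyConfig {
    public static void main(String[] args) {
        // Must run before log4j2 initializes; equivalent to
        // -Dlog4j2.AsyncQueueFullPolicy=Discard -Dlog4j2.DiscardThreshold=ERROR on the command line.
        System.setProperty("log4j2.AsyncQueueFullPolicy", "Discard");
        System.setProperty("log4j2.DiscardThreshold", "ERROR");
        // ... application bootstrap continues here
    }
}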

After modifying the KafkaProducer and log4j2 parameters, the CPU usage of the Kafka sender thread dropped below 5%.

The all-purpose flame graph

At first, I wanted to use tools such as perf, DTrace, and SystemTap to generate the flame graph, but our Docker containers do not run with the privileged flag, and none of those commands would work. Arthas provides a flame graph generation command, profiler, which uses async-profiler to sample the application and generate the flame graph.

After attaching Arthas to the JVM process, start sampling with profiler start; after it has run for a while, execute profiler stop to generate the flame graph SVG, as shown in the following image.

A flame graph has several characteristics:

  • Each box represents a function in the stack;
  • The Y-axis represents the depth of the call stack; the function below is the parent of the one above it. The deeper the call stack, the taller the flame;
  • The X-axis does not represent the passage of time but the number of samples. The wider a function is on the X-axis, the more often it was sampled and the longer it takes to execute.

As you can see from the graph above, the Kafka and Spring functions consume the most CPU. We have already covered how to optimize the Kafka side; next, let’s look at the Spring function stack.

Log4j line number calculation cost

Zooming into the SVG, you can see a persistent plateau at the top of the graph. The function is Log4jLogEvent.calcLocation, which log4j uses to compute the line number printed in log output, as shown in the figure below.

The line number is calculated by obtaining the current call stack. This calculation performs very poorly, and there are many benchmarks online showing just how slow it is.
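A tiny illustrative sketch (not log4j2’s actual implementation; the frame index is an assumption) of why caller-location lookup is expensive: the whole stack trace has to be materialized just to read one frame.

public class CallerLocationSketch {
    static String callerLocation() {
        StackTraceElement[] stack = new Throwable().getStackTrace(); // expensive: builds every frame
        StackTraceElement caller = stack[Math.min(2, stack.length - 1)]; // skip wrapper frames (illustrative index)
        return caller.getClassName() + ":" + caller.getLineNumber();
    }

    public static void main(String[] args) {
        System.out.println(callerLocation());
    }
}

In log4j2, this work is only done when the layout actually needs location data (for example the %L, %C, or %M converters), which is why turning off line-number output, described next, removes the cost.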

We turned off log4j’s line number output; CPU usage dropped a bit further, and the flat top in the flame graph disappeared.

Use a Trie prefix tree to optimize Spring’s path matching

Continuing the analysis of the remaining high-CPU function calls: for historical reasons, our URL design did not place the URLs that require no authentication under a common prefix, so the interceptor ended up with more than one hundred exclude-mapping entries, as shown below.

<mvc:interceptors>
    <mvc:interceptor>
        <mvc:mapping path="/**"/>
        <mvc:exclude-mapping path="/login"/>
        <mvc:exclude-mapping path="/*/login"/>
        <mvc:exclude-mapping path="/*/activity/teacher"/>
        <mvc:exclude-mapping path="/masaike/dynamic/**"/>
        <!-- ... more than a hundred of them ... -->
        <mvc:exclude-mapping path="/masaike/aaaa/**"/>
        <mvc:exclude-mapping path="/masaike/**/hello"/>
        <bean class="com.masaike.AuthenticationHandlerInterceptor"/>
    </mvc:interceptor>
</mvc:interceptors>

The Spring MVC matching logic for this lives in org.springframework.web.servlet.handler.MappedInterceptor:

  • For an interface that requires authentication, the entire excludePatterns list is traversed;
  • For an interface that genuinely does not require authentication, the loop breaks as soon as a pattern matches; in the worst case, all of the excludePatterns still have to be checked. A simplified sketch of this scan follows.
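The following is a simplified sketch of that linear scan, not Spring’s actual source; the class and method names are made up for illustration, and the Ant-style matching goes through Spring’s PathMatcher interface.

import java.util.List;
import org.springframework.util.PathMatcher;

// Illustrative only: the shape of the per-request cost, not Spring's real code.
public class ExcludeScanSketch {
    static boolean applies(String lookupPath, List<String> excludePatterns, PathMatcher matcher) {
        for (String pattern : excludePatterns) {
            // Each exclude pattern is Ant-matched against the request path.
            if (matcher.match(pattern, lookupPath)) {
                return false; // excluded: the loop breaks as soon as a pattern matches
            }
        }
        return true; // requires auth: every pattern was checked without a match
    }
}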

This is inefficient. We can optimize the matching by using a trie (prefix tree); the difference from an ordinary trie is that it must support the wildcards * and ** in the middle of a path.

Suppose we have the following paths:

  "/api/*/login"
, "/wildcard/**"
, "/wildcard/**/hello"
, "/v2/hello/"
, "/v2/user/info/"
, "/v2/user/feed/"
, "/v2/user/feed2/"
Copy the code

The resulting Trie tree is shown below.

.
└── api
    └── *
        └── login
└── v2
    └── hello
    └── user
        └── feed
        └── feed2
        └── info
└── wildcard
    └── **

The implementation is very simple. Each node keeps its children in a map, which makes lookups very fast.

// Requires Guava for Splitter.
import static java.util.Collections.emptyList;

import com.google.common.base.Splitter;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/**
 * @author Arthur.zhang ([email protected])
 */
public class PathTrie {
    private Node root = new Node(emptyList(), null);

    public PathTrie() {
    }

    public void insert(String path, String obj) {
        List<String> parts = getPathSegments(path);
        insert(parts, obj);
    }

    public void insert(List<String> parts, String o) {
        if (parts.isEmpty()) {
            root.obj = o;
            return;
        }
        root.insert(parts, o);
    }

    private static List<String> getPathSegments(String path) {
        return Splitter.on('/').splitToList(path).stream()
                .filter(it -> !it.isEmpty())
                .collect(Collectors.toList());
    }

    public boolean existsPath(String path) {
        return root.exists(getPathSegments(path), 0);
    }

    public void dump() {
        if (root != null) root.dump();
    }

    private static class Node {
        String name;
        Map<String, Node> children;
        String obj;

        Node(List<String> path, String obj) {
            if (path.isEmpty()) {
                this.obj = obj;
                return;
            }
            name = path.get(0);
        }

        private synchronized void insert(List<String> parts, String o) {
            String part = parts.get(0);
            Node matchedChild;
            // If the current node is **, the remaining path segments do not need to be inserted
            if ("**".equals(name)) {
                return;
            }
            if (children == null) children = new ConcurrentHashMap<>();

            matchedChild = children.get(part);
            if (matchedChild == null) {
                matchedChild = new Node(parts, o);
                children.put(part, matchedChild);
            }
            // Consume the current segment
            parts.remove(0);
            if (parts.isEmpty()) {
                // Reached the end of the path: mark the matched child as a leaf
                matchedChild.obj = o;
            } else {
                matchedChild.insert(parts, o);
            }
        }

        /**
         * @param pathSegments path segments, e.g. /a/b/c -> 'a', 'b', 'c'
         * @param level        depth of the current traversal, e.g. /a/b/c -> 0='a' 1='b' 2='c'
         */
        public boolean exists(List<String> pathSegments, int level) {
            if (pathSegments.size() < level + 1) {
                // The path is exhausted. If the current node is not a leaf,
                // check whether it has a ** child, which matches anything.
                if (obj == null && children != null) { // null check avoids an NPE on childless nodes
                    Node n = children.get("**");
                    if (n != null) {
                        return true;
                    }
                }
                return obj != null;
            }
            if (children == null) {
                return false;
            }
            String pathSegment = pathSegments.get(level);
            // 1. Exact match for the current segment
            Node n = children.get(pathSegment);
            // 2. If not found, try the single-segment wildcard *
            if (n == null) {
                n = children.get("*");
            }
            // 3. If still not found, a ** child matches everything that follows
            if (n == null) {
                n = children.get("**");
                if (n != null) {
                    return true;
                }
            }
            // 4. Nothing matched
            if (n == null) {
                return false;
            }
            // 5. A node matched: continue the recursive search
            return n.exists(pathSegments, level + 1);
        }

        @Override
        public String toString() {
            return "Node{" + "name='" + name + '\'' + ", children=" + children + '}';
        }

        /**
         * Print the prefix tree in a tree-like format, for example:
         * .
         * └── api
         *     └── *
         *         └── login
         * └── v2
         *     └── hello
         *     └── user
         *         └── feed
         *         └── feed2
         *         └── info
         * └── wildcard
         *     └── **
         */
        public void dump() {
            dump(0);
        }

        public void dump(int level) {
            if (level == 0) {
                System.out.println(".");
            } else {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < level - 1; i++) {
                    sb.append("\t");
                }
                sb.append("└── ").append(name);
                System.out.println(sb.toString());
            }
            if (children != null && !children.isEmpty()) {
                for (Map.Entry<String, Node> item : children.entrySet()) {
                    item.getValue().dump(level + 1);
                }
            }
        }
    }
}
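A quick usage sketch of the class above, using the paths from the earlier example:

public class PathTrieDemo {
    public static void main(String[] args) {
        PathTrie trie = new PathTrie();
        trie.insert("/api/*/login", "/api/*/login");
        trie.insert("/wildcard/**", "/wildcard/**");
        trie.insert("/v2/hello/", "/v2/hello/");

        System.out.println(trie.existsPath("/api/ios/login"));  // true: '*' matches one segment
        System.out.println(trie.existsPath("/wildcard/a/b/c")); // true: '**' matches any suffix
        System.out.println(trie.existsPath("/v2/hello/world")); // false: no wildcard under /v2/hello
        trie.dump(); // prints the tree shown earlier
    }
}

Compared with scanning more than a hundred Ant patterns for every request, a lookup now only walks one trie node per path segment.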

With the Trie implementation in place, average CPU consumption dropped by about 5%.

Summary

After the overall optimization, CPU usage during peak hours dropped from about 100% to 35%; the effect is quite significant.

Performance optimization is a full-stack job, and most problems can be solved once they are identified. Simply knowing the commands is not enough; only by understanding the underlying principles can we truly hit the target and find the core of the problem.

If you have any questions, you can scan the QR code below to follow my official account and get in touch with me.