1. Background

1.1. Topic sources

Recently, many developers working on mobile Internet and Internet of Things applications have sent me emails or Weibo messages asking about push services. The questions vary widely, but while answering them I found they boil down to the following:

  1. Can Netty serve as a push server?
  2. If Netty is used to develop push services, how many clients can a server support?
  3. What technical problems are commonly encountered when developing a push service with Netty?

Because so many people have asked, and their questions are fairly concentrated, I hope the case analysis and the summary of push service design points in this article can help you avoid detours in practical work.

1.2. Push service

In the mobile Internet era, push services have become an indispensable part of mobile apps: they improve user activity and retention. Our phones receive all kinds of advertisements and notification messages every day, most of which are delivered through push services.

With the development of the Internet of Things, most smart home products already support mobile push. In the future, every smart device connected to the IoT will be a push client, which means push services will have to handle massive numbers of devices and terminals.

1.3. Features of push service

The main features of mobile push service are as follows:

  1. The network is mainly the carriers' wireless mobile network, and its quality is unstable; for example, the signal in the subway is very poor and the connection is prone to brief drops.
  2. A huge number of clients connect, usually over long-lived connections, which consumes significant resources on both the client and the server.
  3. Since Google's push framework is unavailable in China, each Android app maintains its own long connection, so a single Android device carries multiple long connections. Even when there is nothing to push, the heartbeat traffic of these connections is considerable, increasing both data usage and power consumption.
  4. Unreliable: message loss, duplicate pushes, delayed delivery, and expired pushes occur from time to time;
  5. Spam messages are everywhere, and there is no unified service governance capability.

To address these drawbacks, some companies have proposed their own solutions. For example, the push service launched by Jingdong Cloud supports a "multiple applications, single service, single connection" model and uses AlarmManager-based timed heartbeats to save power and traffic.

2. A real case in the smart home field

2.1. Problem description

A smart-home MQTT message middleware maintained 100,000 online long connections and handled concurrent message requests from 20,000 users. After running for a period of time, the team observed what looked like a memory leak and suspected a Netty bug. Other relevant information is as follows:

  1. The MQTT middleware server has 16 GB of memory and an 8-core CPU;
  2. In Netty, the boss thread pool size is 1 and the worker thread pool size is 6; the remaining threads are allocated to business processing. The worker thread pool was later increased to 11, but the problem remained.
  3. The Netty version is 4.0.8.Final.

2.2. Fault location

First, dump the heap to analyze the objects and reference relationships suspected of leaking, as shown below:

We found that the number of Netty ScheduledFutureTask instances had increased by 9076%, to roughly 1.1 million. Analysis of the business code showed that the user relied on IdleStateHandler to run business logic when a link went idle, but the idle timeout was set to a very long 15 minutes.

Depending on how it is configured, Netty's IdleStateHandler starts three kinds of scheduled tasks: ReaderIdleTimeoutTask, WriterIdleTimeoutTask, and AllIdleTimeoutTask, all of which are added to the NioEventLoop task queue to be scheduled and executed.

Because the timeout was so long, 100,000 long links created 100,000 ScheduledFutureTask objects, each of which also held business member variables, consuming a great deal of memory. Some of these scheduled tasks were promoted to the old generation and were not collected by the JVM, so memory kept growing and the user mistakenly concluded there was a memory leak.

Further analysis showed that the timeout chosen by the user was simply unreasonable: a 15-minute idle timeout could not meet the design goal. After the redesign, the timeout was set to 45 seconds and memory could be reclaimed normally.
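
As an illustration, the difference is just the idle-time arguments passed to IdleStateHandler (a sketch based on the timeouts mentioned above; the handler name, pipeline setup, and the use of the all-idle timer are assumptions):

// Before: 15-minute all-idle timeout, which kept ~100,000 long-lived scheduled tasks alive
// ch.pipeline().addLast("idleStateHandler", new IdleStateHandler(0, 0, 15 * 60));

// After: 45-second all-idle timeout, so tasks complete and can be collected quickly
ch.pipeline().addLast("idleStateHandler", new IdleStateHandler(0, 0, 45));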

2.3. Summary of problems

With only 100 long connections, even long-running scheduled tasks would not cause a problem; the objects could be reclaimed from the young generation by minor GC. It is the scale of 100,000 long connections that magnifies a small issue into a serious one.

But what if the user genuinely needs a scheduled task with a long period? For a push service with massive numbers of long connections, careless coding can be fatal. Below we discuss, based on Netty's architectural characteristics, how to use Netty to build a push service that supports millions of clients.

3. Key points of Netty mass push service design

Netty is a high-performance NIO framework, so developing an efficient push service with it is technically feasible. However, because push services are inherently complex, building one that is both stable and fast is not easy, and the design must take the characteristics of push services into account from the start.

3.1. Modification of the maximum number of handles

The maximum number of file handles in Linux is one of the most important tuning parameters. By default, the maximum number of file handles in a single process is 1024. You can run the ulimit -a command to view related parameters, as shown in the following example:

[root@lilinfeng ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256324
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
...... (subsequent output omitted)

When the number of connections held by a single push server exceeds this limit, a "Too many open files" error is reported and all new client connections fail.

Edit /etc/security/limits.conf (for example with vi) and add the configuration below. After saving, log out and log back in as the current user, then run ulimit -a again to check whether the change has taken effect.

*  soft  nofile  1000000
*  hard  nofile  1000000

Note that although the per-process limit can be raised very high, once the number of handles reaches a certain order of magnitude, processing efficiency drops significantly. The limit should therefore be set reasonably according to the server's hardware and processing capacity; if a single server is not powerful enough, use a cluster instead.

3.2. Beware of CLOSE_WAIT

Anyone who has developed a mobile push service knows that mobile wireless networks are unreliable: clients frequently reconnect, and the network often drops momentarily.

In the push system with millions of long connections, the server needs to be able to correctly handle these network exceptions. The design points are as follows:

  1. The client's reconnection interval must be set properly to prevent connection failures caused by overly frequent reconnects (for example, when the port has not yet been released).
  2. Provide a mechanism to reject duplicate client logins;
  3. The server must handle I/O and decoding exceptions correctly to prevent handle leakage (see the sketch after this list).
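
As a minimal sketch of point 3 (the handler name is an assumption, and what you log or count before closing is up to the business), the key is to close the channel whenever an I/O or decoding exception is raised, so the underlying handle is released:

public class ExceptionClosingHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // Record the failure for monitoring, then release the connection and its handle.
        ctx.close();
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        // The peer has gone away (e.g. a network flash disconnection); make sure any
        // per-session resources tied to this link are released here as well.
        ctx.fireChannelInactive();
    }
}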

Last but not least, there is the problem of too many connections stuck in CLOSE_WAIT. These arise when clients disconnect because of network instability and the server fails to close the corresponding sockets in time. A connection in the CLOSE_WAIT state does not release resources such as handles and memory; if handles run out, the system reports "Too many open files", new clients cannot connect, and any operation that creates or opens a handle fails.

CLOSE_WAIT arises on the side that closes the connection passively. According to the TCP state machine, when the server receives the FIN sent by the client, the TCP stack automatically replies with an ACK and the connection enters the CLOSE_WAIT state. If the server never calls close() on the socket, the state cannot move from CLOSE_WAIT to LAST_ACK, and connections in CLOSE_WAIT accumulate in the system. A CLOSE_WAIT typically persists for at least 2 hours (the default timeout is 7200 seconds, i.e. 2 hours). If a server program piles up CLOSE_WAIT connections for some reason, it usually exhausts its resources and fails before they are released.

The possible causes of excessive CLOSE_WAIT are as follows:

  1. A bug in the program prevents the socket from being closed after the FIN is received; this could be a Netty bug or a bug in the business layer.
  2. The socket is closed, but not promptly: for example, the I/O thread is blocked unexpectedly, or it is executing too many user-defined tasks, so I/O operations are not handled in time and links cannot be released in time.

Below we analyze the potential fault points in the context of Netty.

Design point 1: do not run business logic (other than heartbeat sending and detection) on Netty's I/O threads. Why? In a Java process the number of threads cannot grow without bound, which means the number of Netty reactor threads must converge. Netty's default is the number of CPU cores x 2. For I/O-intensive applications the usual advice is to make the thread pool as large as possible, but that advice targets traditional synchronous I/O; for Netty's non-blocking reactor threads, an empirical upper bound is around the number of CPU cores x 2.

Suppose a single server holds 1 million long connections and has 32 cores; then each I/O thread handles L = 1,000,000 / (32 x 2) = 15,625 connections. If each connection exchanges one message (new push, heartbeat, or other administrative message) every 5 seconds, the average load per I/O thread is 15,625 / 5 = 3,125 messages/second. That is not much pressure for Netty's processing capability, but real business flows often add extra and relatively expensive logic such as performance statistics and interface logging. If this business logic runs directly on the I/O thread, it may block the thread and affect the reads and writes of the other links on the same thread, so passively closed links are not closed in time and CLOSE_WAIT accumulates.
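
One common pattern (a sketch only; the pool size of 16 and the ProtocolDecoder/BusinessHandler names are illustrative, not from the original text) is to bind the business handler to a separate EventExecutorGroup when adding it to the pipeline, so that its channelRead runs off the I/O thread:

// Business logic runs on this pool instead of the NioEventLoop I/O threads.
final EventExecutorGroup businessGroup = new DefaultEventExecutorGroup(16);

ServerBootstrap b = new ServerBootstrap();
b.childHandler(new ChannelInitializer<SocketChannel>() {
    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline().addLast("decoder", new ProtocolDecoder());                  // runs on the I/O thread
        ch.pipeline().addLast(businessGroup, "business", new BusinessHandler());  // runs on businessGroup
    }
});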

Design point 2: be careful when executing custom tasks on I/O threads. NioEventLoop, Netty's I/O processing thread, supports two kinds of custom task execution (a short illustration follows the list):

  1. An ordinary Runnable, executed by calling NioEventLoop's execute(Runnable task) method;
  2. A ScheduledFutureTask, i.e. a scheduled task, executed by calling NioEventLoop's schedule(Runnable command, long delay, TimeUnit unit) family of methods.
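
For example (a minimal sketch; it assumes an existing io.netty.channel.Channel, java.util.concurrent.TimeUnit, and a made-up 30-second delay), both entry points are reachable from a channel's event loop:

EventLoop eventLoop = channel.eventLoop();

// 1. An ordinary Runnable, queued behind pending I/O work on this event loop
eventLoop.execute(new Runnable() {
    @Override
    public void run() {
        // non-I/O work executed on the I/O thread
    }
});

// 2. A delayed task, which becomes a ScheduledFutureTask internally
eventLoop.schedule(new Runnable() {
    @Override
    public void run() {
        // delayed/periodic work executed on the I/O thread
    }
}, 30, TimeUnit.SECONDS);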

Why NioEventLoop supports executing user-defined Runnable and ScheduledFutureTask instances is beyond the scope of this article and will be covered in a future article; here we focus on their impact.

Allowing Runnable and ScheduledFutureTask execution in NioEventLoop means users can run non-I/O business logic there, usually related to message processing and protocol management. Such tasks compete for the CPU time that NioEventLoop would otherwise spend on I/O reads and writes. If there are too many user-defined tasks, or a single task runs for too long, I/O is starved, which indirectly leads to CLOSE_WAIT accumulation.

Therefore, if your code does submit Runnable or ScheduledFutureTask work, set an appropriate ioRatio. It can be configured through NioEventLoop's setIoRatio(int ioRatio) method; the default is 50, meaning I/O operations and user-defined tasks each get half of the execution time.
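
A minimal sketch (the 70:30 split is just an example value): NioEventLoopGroup exposes a convenience method that applies the ratio to all of its event loops.

NioEventLoopGroup workerGroup = new NioEventLoopGroup();
// Give I/O 70% of each event loop's time and custom tasks the remaining 30%.
workerGroup.setIoRatio(70);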

My recommendation: when the server handles a huge number of client long connections, do not execute custom tasks in NioEventLoop at all, nor scheduled tasks other than heartbeats.

Design point 3: be careful with IdleStateHandler. Many users rely on IdleStateHandler for heartbeat sending and detection, and this usage is recommended; it is far more efficient than starting your own scheduled task to send heartbeats. In practice, however, the heartbeat business logic must have a bounded processing delay in both normal and exceptional scenarios, otherwise an uncontrolled delay will unexpectedly block NioEventLoop. For example, suppose that when a heartbeat times out or an I/O exception occurs, the service raises an alarm by calling an email-sending interface. If the email server is slow, the sending client blocks on the timeout, the AllIdleTimeoutTask of IdleStateHandler is blocked in cascade, and ultimately the reads and writes of the other links on that NioEventLoop's multiplexer are blocked as well.

The same constraint applies to ReadTimeoutHandler and WriteTimeoutHandler.
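
One way to keep the delay bounded (a sketch only; the handler name, the two-thread alarm executor, and the sendAlarmEmail call are assumptions, not part of Netty) is to hand slow work such as email alarms to a dedicated executor instead of calling it from the idle-event callback:

public class AlarmingIdleHandler extends ChannelInboundHandlerAdapter {
    // Dedicated pool for slow, blocking work such as calling an email gateway.
    private static final ExecutorService ALARM_EXECUTOR = Executors.newFixedThreadPool(2);

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent) {
            final String remote = String.valueOf(ctx.channel().remoteAddress());
            ALARM_EXECUTOR.execute(new Runnable() {
                @Override
                public void run() {
                    // sendAlarmEmail(remote);  // hypothetical slow call, kept off the I/O thread
                }
            });
            ctx.close();  // release the idle link without waiting for the alarm
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }
}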

3.3. Reasonable heartbeat cycle

A million-scale push service means millions of long connections, and each one is kept alive by heartbeats between the app and the push server. Choosing a reasonable heartbeat period is therefore very important, and the choice must take the characteristics of mobile wireless networks into account.

When a smartphone connects to the mobile network, it is not directly connected to the Internet: the IP assigned by the operator is an internal IP, and traffic must pass through the operator's gateway, which performs Network Address Translation (NAT), before reaching the Internet. Simply put, a mobile terminal's Internet connection is a mapping between its internal IP address and port and an external IP address and port.

The Gateway GPRS Support Node (GGSN) implements this NAT function. To reduce the load on the gateway's NAT mapping table, most mobile operators delete a link's table entry after it has been idle for a while, which breaks the link. This deliberate shortening of the idle-connection release timeout was meant to save channel resources, but it forces Internet applications to send heartbeats far more often than they normally would just to keep push connections alive. On China Mobile's 2.5G network, for example, a connection is released after the baseband has been idle for about 5 minutes.

Because of these characteristics, a push service cannot set the heartbeat period too long, or the long connection will be released and clients will reconnect frequently; nor can it set the period too short, because without a unified heartbeat framework an aggressive heartbeat easily causes a signaling storm (as in WeChat's well-known heartbeat signaling storm problem). There is no single standard for the heartbeat period; 180 seconds may be a good choice, and WeChat uses 300 seconds.

In Netty, heartbeat detection can be implemented by adding an IdleStateHandler to the ChannelPipeline, specifying the idle time in its constructor, and then handling the idle event callback to send and check heartbeats. The code is as follows:

public void initChannel(Channel channel) {
    channel.pipeline().addLast("idleStateHandler", new IdleStateHandler(0, 0, 180));
    channel.pipeline().addLast("myHandler", new MyHandler());
}

Intercept the link idle event and handle the heartbeat:

public class MyHandler extends ChannelHandlerAdapter {
    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent) {
            // Heartbeat processing
        }
    }
}

3.4. Properly set the receive and send buffer capacity

Each long link needs to maintain its own receive and send buffers. The JDK's native NIO library uses java.nio.ByteBuffer, which is backed by a fixed-length byte array that cannot be expanded dynamically.

public abstract class ByteBuffer
    extends Buffer
    implements Comparable<ByteBuffer>
{
    final byte[] hb; // Non-null only for heap buffers
    final int offset;
    boolean isReadOnly;

The fixed capacity causes problems for users. Because the length of each message cannot be predicted, a relatively large ByteBuffer usually has to be pre-allocated; that is normally fine, but in a massive push system it puts a heavy memory burden on the server. Suppose the maximum size of a single push message is 10K and the average size is 5K. To be able to handle 10K messages, the ByteBuffer capacity is set to 10K, so each link actually wastes 5K of memory on average. With 1 million long links, each holding its own receive ByteBuffer, the waste is Total(M) = 1,000,000 * 5K = 4882M, roughly 4.8 GB. Such heavy memory consumption not only raises hardware costs but also easily leads to long Full GC pauses, which significantly harm system stability.

The most flexible approach is to adjust memory dynamically: size the receive buffer based on the messages actually received, trading CPU for memory. The strategy is as follows:

  1. ByteBuffer supports capacity expansion and contraction, which can be flexibly adjusted as needed to save memory.
  2. When receiving a message, the system analyzes the size of the received message according to the specified algorithm and predicts the size of the future message. The buffer capacity can be flexibly adjusted according to the predicted value to minimize resource loss and meet the normal function of the program.

Fortunately, Netty's ByteBuf supports dynamic capacity adjustment, and Netty provides two kinds of receive buffer allocators:

  1. FixedRecvByteBufAllocator: a fixed-size receive buffer allocator. The ByteBuf it allocates always has the same fixed length and is never shrunk based on the actual amount of data received. It can still grow when capacity is insufficient, but dynamic expansion is a basic capability of Netty's ByteBuf itself and has nothing to do with the allocator implementation.
  2. AdaptiveRecvByteBufAllocator: an allocator that adjusts the receive buffer capacity dynamically. It bases its calculation on the sizes of the datagrams the Channel received previously: if the receive buffer keeps being filled completely, the capacity is expanded; if the received datagrams are smaller than the expected value twice in a row, the capacity is shrunk to save memory.

Compared with FixedRecvByteBufAllocator, AdaptiveRecvByteBufAllocator is usually the more sensible choice. The RecvByteBufAllocator can be specified when creating the client or server, as follows:

 Bootstrap b = new Bootstrap();
            b.group(group)
             .channel(NioSocketChannel.class)
             .option(ChannelOption.TCP_NODELAY, true)
             .option(ChannelOption.RCVBUF_ALLOCATOR, AdaptiveRecvByteBufAllocator.DEFAULT);

If nothing is set explicitly, AdaptiveRecvByteBufAllocator is used by default.

It is also worth noting that, for both the receive and send buffers, the buffer size should be set around the average message size rather than the maximum message size, otherwise memory is wasted. The initial size of the receive buffer can be set through the following constructor:

/**
	 * Creates a new predictor with the specified parameters.
	 * 
	 * @param minimum
	 *            the inclusive lower bound of the expected buffer size
	 * @param initial
	 *            the initial buffer size when no feed back was received
	 * @param maximum
	 *            the inclusive upper bound of the expected buffer size
	 */
	public AdaptiveRecvByteBufAllocator(int minimum, int initial, int maximum)
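
For example (a sketch that reuses the 5K average / 10K maximum message sizes assumed earlier; the 64-byte minimum is also just an assumption), the allocator could be configured like this:

Bootstrap b = new Bootstrap();
b.option(ChannelOption.RCVBUF_ALLOCATOR,
         new AdaptiveRecvByteBufAllocator(64, 5 * 1024, 10 * 1024));
// minimum 64 bytes, initial 5K (the average message size), maximum 10K (the message size limit)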

For message sending, you usually construct and encode the ByteBuf yourself.
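
A minimal sketch of the sending side (the encodeMessage call is a placeholder for your application-level encoding): allocate the buffer from the channel's allocator so that whatever allocator is configured, pooled or not, is honoured, and let Netty release it after the write completes.

byte[] payload = encodeMessage();                 // hypothetical application-level encoding
ByteBuf out = ctx.alloc().ioBuffer(payload.length);
out.writeBytes(payload);
ctx.writeAndFlush(out);                           // Netty releases the buffer once it has been written out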

3.5. Memory pool

A push server hosts a huge number of long links, and each link is effectively a session. If every session holds data structures such as heartbeat data, receive buffers, and instruction sets, and these instances are created and discarded as each message is processed, the server comes under heavy GC pressure and consumes a great deal of memory.

The most effective strategy is a memory pool. Each NioEventLoop thread handles N links, and within a thread the links are processed serially. Without a pool, when link A is processed first, it creates a receive buffer and other objects; after decoding completes, the constructed POJO is wrapped into a task and handed to the backend thread pool, and the receive buffer is released. This create-and-release cycle repeats for every message on every link. With a memory pool, when link A receives a new datagram it borrows an idle ByteBuf from the NioEventLoop's memory pool, and after decoding it calls release to return the ByteBuf to the pool, where link B can reuse it later.

With the memory pool, the number of ByteBuf allocations (and subsequent garbage collections) per NioEventLoop drops from N = 1,000,000 / 64 = 15,625 to, at best, zero (assuming a pooled buffer is available for every request).

Let's use the GC optimization achieved with Netty 4's PooledByteBufAllocator as an example to evaluate the effect of the memory pool. The results are as follows:

Garbage is generated five times more slowly and is cleaned up five times faster. With the new memory pool mechanism, the network bandwidth can almost be saturated.

Before Netty 4, Netty 3 created a new heap buffer whenever a new message was received or a user sent a message to a remote peer, meaning a new byte[capacity] for every buffer. These buffers caused GC pressure and consumed memory bandwidth: to be safe, new byte arrays are zero-filled on allocation, which consumes memory bandwidth, and the zeroed memory is then usually overwritten with actual data, consuming the same bandwidth again. If the JVM offered a way to create byte arrays without zero-filling them, memory bandwidth consumption could be cut by 50%, but it currently does not.

Netty 4 implements a new ByteBuf memory pool, a pure Java version of jemalloc (which Facebook also uses). With it, Netty no longer wastes memory bandwidth zero-filling buffers. However, because pooled buffers do not rely on GC, developers must watch out for memory leaks: if you forget to release a buffer in your handler, memory usage grows without bound.

Netty does not use memory pools by default. You need to specify this parameter when creating clients or servers. The code is as follows:

Bootstrap b = new Bootstrap();
            b.group(group)
             .channel(NioSocketChannel.class)
             .option(ChannelOption.TCP_NODELAY, true)
             .option(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);

When using the memory pool, memory acquisition and release must come in pairs, i.e. every retain() must be matched by a release(), otherwise a memory leak results.

Note in particular that when a memory pool is used, once decoding is complete you must explicitly call ReferenceCountUtil.release(msg) to return the receive buffer ByteBuf to the pool; otherwise Netty considers it still in use, and a memory leak results.
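
A minimal sketch of the read path in a decoding handler (the decode step, the business executor, and BusinessTask are placeholders, not part of Netty):

@Override
public void channelRead(ChannelHandlerContext ctx, Object msg) {
    ByteBuf in = (ByteBuf) msg;
    try {
        Object pojo = decode(in);                           // hypothetical protocol decoding
        businessExecutor.execute(new BusinessTask(pojo));   // hand off; do not block the I/O thread
    } finally {
        ReferenceCountUtil.release(in);   // return the pooled buffer; skipping this leaks memory
    }
}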

3.6. Beware of the Invisible Log Killer

In general, it is well known that you can’t do things on Netty I/O threads that have uncontrollable execution time, such as accessing databases or sending emails. But one common but dangerous operation that is often overlooked is logging.

In production, interface logs usually need to be written in real time and other logs are kept at the ERROR level. When an I/O exception occurs in the push service, an exception log is written. If the disk's I/O wait (wio) is currently high, the write to the log file may block synchronously for an unpredictable length of time. The Netty NioEventLoop thread is then blocked, the socket cannot be closed in time, and the other links on that thread cannot be read or written either.

Take the widely used log4j as an example: even with AsyncAppender, when the log queue is full it blocks the calling thread synchronously until space becomes available.

 synchronized (this.buffer) {
      while (true) {
        int previousSize = this.buffer.size();
        if (previousSize < this.bufferSize) {
          this.buffer.add(event);
          if (previousSize != 0) break;
          this.buffer.notifyAll();
          break;
        }
        // The buffer is full: if blocking is enabled, the calling (business) thread
        // waits here until the dispatcher drains the queue.
        boolean discard = true;
        if ((this.blocking) && (!Thread.interrupted()) && (Thread.currentThread() != this.dispatcher)) {
          try {
            this.buffer.wait();
            discard = false;
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }

Bugs of this kind are well hidden: the period of high wio is often very short or occurs only occasionally, such faults are hard to reproduce in a test environment, and locating the problem is very difficult. You therefore need to be very careful when writing code and stay alert to these hidden mines.
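
One mitigation, sketched below with log4j 1.x's programmatic API (the buffer size and the wrapped fileAppender are assumptions), is to configure AsyncAppender as non-blocking so that log events are discarded instead of stalling the calling thread when the queue is full; the trade-off is possible log loss under I/O pressure:

AsyncAppender async = new AsyncAppender();
async.setBlocking(false);        // drop events rather than block the caller when the buffer is full
async.setBufferSize(8192);       // assumed queue size
async.addAppender(fileAppender); // the real destination appender, assumed to exist
Logger.getRootLogger().addAppender(async);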

3.7. TCP parameter optimization

Common TCP parameters, such as the TCP-level receive and send buffer sizes, correspond to ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF in Netty. They should be set according to the size of the push messages; for massive numbers of long connections, 32 KB is a reasonable choice.
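
A minimal sketch of setting these on the server side (childOption applies the values to each accepted connection; the 32 KB values follow the suggestion above):

ServerBootstrap b = new ServerBootstrap();
b.childOption(ChannelOption.SO_RCVBUF, 32 * 1024)   // per-connection TCP receive buffer
 .childOption(ChannelOption.SO_SNDBUF, 32 * 1024);  // per-connection TCP send buffer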

Another common optimization concerns soft interrupts, as shown in the figure: if the network card's hardware interrupts, and the soft interrupts they trigger, are all handled on CPU0, then CPU0 spends all its time on soft interrupts while the other CPUs sit idle, because the soft interrupts cannot be processed in parallel.

On Linux kernel 2.6.35 or later, enabling RPS can improve network performance by more than 20%. RPS computes a hash over the packet's source and destination addresses and ports and uses that hash to choose the CPU that will run the soft interrupt. At a higher level this means each connection is bound to a CPU, and the hash spreads the soft interrupts across multiple CPUs, improving communication performance.

3.8. JVM parameters

Two parameter adjustments matter most:

  • -Xmx: the maximum JVM heap size should be calculated from the memory model to arrive at a reasonably sized value.
  • GC-related parameters, such as the ratios of the young, old, and permanent generations, the GC strategy, and the ratio of the survivor spaces within the young generation, need to be set and tested for the specific scenario and continuously tuned to minimize the frequency of Full GC.

