1. The background
1.1. Amazing performance numbers
A friend recently told me in a private message that, using Netty4 plus Thrift's compressed binary codec, they achieved 100,000 (10W) TPS of cross-node remote service calls with 1 KB payloads of complex POJO objects. That is more than 8 times the throughput of a traditional communication framework based on Java serialization plus BIO (synchronous blocking IO).
In fact, this figure does not surprise me. Based on my 5+ years of NIO programming experience, it is entirely achievable: choose an appropriate NIO framework, design a high-performance Reactor thread model, and use a compressed binary codec.
Let’s take a look at how Netty supports 10W TPS cross-node remote service calls. Before we start, let’s take a quick look at Netty.
1.2. Netty Basics
Netty is a high-performance, asynchronous, event-driven NIO framework that supports TCP, UDP, and file transfer. As an asynchronous NIO framework, all of Netty's IO operations are asynchronous and non-blocking; users can obtain the I/O operation results proactively or through notification mechanisms.
As the most popular NIO framework, Netty is widely used in Internet services, big data, distributed computing, gaming, telecommunications, and other fields. Many well-known open source components in the industry are also built on the Netty NIO framework.
2. Netty high performance
2.1. Performance model analysis of RPC call
2.1.1. The three deadly sins of poor performance in traditional RPC calls
Network transmission problem: traditional RPC frameworks, and remote service (procedure) invocation based on RMI, use synchronous blocking IO. When client concurrency or network latency increases, synchronous blocking IO causes IO threads to block frequently on waits; since the threads cannot work efficiently, I/O processing capability deteriorates.
Below, we take a look at the drawbacks of BIO communication through the BIO communication model diagram:
Figure 2-1 BIO communication model diagram
In the BIO communication model, the server usually has an independent Acceptor thread responsible for listening for client connections. For each connection it accepts, it creates a new thread to handle the request; after processing completes, the thread returns a reply message to the client. This is a typical one-request-one-thread model. The biggest problem with this architecture is that it has no elastic scalability: when the number of concurrent accesses rises, the number of server threads grows linearly with it. Threads are very precious resources of the Java virtual machine, and when the thread count balloons, system performance drops sharply; handle overflows, thread stack overflows, and other issues can occur, and the server eventually goes down.
Serialization problems: Java serialization has the following typical problems:
- The Java serialization mechanism is an object encoding and decoding technology internal to Java and cannot be used across languages. For interconnection between heterogeneous systems, the stream produced by Java serialization would have to be deserialized into the original object (a copy) in another language, which is currently difficult to support.
- Compared with other open source serialization frameworks, the stream produced by Java serialization is too large, causing extra resource consumption both in network transmission and in persistence to disk.
- Serialization performance is poor (high CPU usage).
Threading model problem: because synchronous blocking IO is used, each TCP connection occupies one thread. Threads are very precious JVM resources; when blocking IO reads and writes prevent threads from being released promptly, system performance drops sharply, and in severe cases the virtual machine cannot even create new threads.
2.1.2. Three topics of high performance
- Transport: which channel is used to send data to the peer — BIO, NIO, or AIO. The IO model largely determines the performance of the framework.
- Protocol: which communication protocol to use, HTTP or an internal private protocol. The performance model varies with the protocol; an internal private protocol can usually be designed to perform better than a public one.
- Thread: how is the datagram read? In which thread is it encoded and decoded after being read, and how is the decoded message dispatched? The Reactor thread model greatly influences performance.
Figure 2-2 Three elements of RPC call performance
2.2. How Netty achieves high performance
2.2.1. Asynchronous non-blocking communication
In IO programming, when multiple client access requests must be processed at the same time, you can use multithreading or IO multiplexing. IO multiplexing lets the system handle multiple client requests in a single thread by multiplexing many blocking IO operations onto one blocking select call. Compared with the traditional multithread/multiprocess model, its biggest advantage is low system overhead: the system does not need to create extra processes or threads, nor maintain them while they run, which reduces maintenance work and saves system resources.
JDK 1.4 introduced support for non-blocking IO (NIO), and JDK 1.5 update 10 replaced select/poll with epoll, greatly improving NIO communication performance.
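To make the multiplexing model concrete, here is a minimal JDK NIO sketch (plain JDK, not Netty): a single thread blocks in Selector.select() and services every channel that becomes ready, so no per-connection thread is ever created. The port and the echo behavior are illustrative choices, not from the original article.

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class NioEchoServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.bind(new InetSocketAddress(8080));
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                       // one thread blocks here for all channels
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {            // new connection: register it, no new thread
                    SocketChannel ch = ((ServerSocketChannel) key.channel()).accept();
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {       // echo back whatever arrived
                    SocketChannel ch = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    if (ch.read(buf) < 0) { ch.close(); continue; }
                    buf.flip();
                    ch.write(buf);
                }
            }
        }
    }
}
```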
The JDK NIO communication model looks like this:
Figure 2-3 Multiplexing model diagram of NIO
NIO also provides two different Socket channel implementations, SocketChannel and ServerSocketChannel, corresponding to the Socket and ServerSocket classes. Both new channels support blocking and non-blocking modes. Blocking mode is very simple to use but offers poor performance and reliability; non-blocking mode is the opposite. Developers can choose the appropriate mode based on their needs: in general, low-load, low-concurrency applications can use synchronous blocking IO to reduce programming complexity, while high-load, high-concurrency network applications should be developed with NIO's non-blocking mode.
Netty architecture is designed and implemented according to Reactor model, and its server communication sequence diagram is as follows:
Figure 2-3 NIO server communication sequence diagram
The client communication sequence diagram is as follows:
Figure 2-4 NIO client communication sequence diagram
Netty's IO thread, NioEventLoop, aggregates a multiplexer (Selector) and can therefore serve hundreds or even thousands of client channels concurrently. Since read and write operations are non-blocking, this fully utilizes the IO thread and avoids suspending threads on frequent IO blocks. In addition, because Netty uses an asynchronous communication mode, one IO thread can concurrently handle N client connections and their reads and writes, which fundamentally fixes the traditional synchronous-blocking one-connection-one-thread model and greatly improves the architecture's performance, elastic scalability, and reliability.
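To illustrate the asynchronous model, the sketch below (my own, not from the article) shows a Netty 4 write: writeAndFlush returns a ChannelFuture immediately, and the result is delivered to a listener instead of blocking the calling thread. The connected Channel is assumed to exist.

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.util.CharsetUtil;

public final class AsyncWriteExample {
    // 'ch' is assumed to be an already-connected Netty Channel.
    static void sendAsync(Channel ch) {
        ChannelFuture f = ch.writeAndFlush(Unpooled.copiedBuffer("ping", CharsetUtil.UTF_8));
        // The write returns immediately; the listener fires later on the IO thread.
        f.addListener((ChannelFutureListener) future -> {
            if (!future.isSuccess()) {
                future.cause().printStackTrace();
            }
        });
    }
}
```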
2.2.2. Zero copy
Many users have heard of Netty's "zero copy" feature, but exactly where it happens is less clear. This section explains Netty's "zero copy" in detail.
Netty’s “Zero copy” is mainly reflected in the following three aspects:
- Netty receives and sends ByteBufs using direct buffers, which read and write the Socket with off-heap direct memory, so no second copy of the byte buffer is needed. If traditional heap buffers were used for Socket reads and writes, the JVM would copy the heap buffer into direct memory before writing it to the Socket; compared with off-heap direct memory, each message send carries one extra buffer copy.
- Netty provides a composite Buffer object that can aggregate multiple ByteBuf objects; users can operate on the composite buffer as conveniently as on a single buffer, avoiding the traditional approach of merging several small buffers into one large buffer via memory copies.
- Netty transfers files with the transferTo method, which sends the data in the file buffer directly to the target Channel, avoiding the memory copies caused by the traditional cyclic write approach.
Let's walk through each of these. First, look at how Netty creates its receive buffer:
Figure 2-5 Reading zero copy asynchronous messages
Each pass through the read loop fetches a ByteBuf object via the ioBuffer method of ByteBufAllocator.
Figure 2-6 ByteBufAllocator allocates out-of-heap memory using the ioBuffer
To avoid copying from the heap into direct memory, Netty's ByteBuf allocator creates off-heap memory directly, avoiding a second copy of the buffer during Socket reads and writes and using "zero copy" to improve read/write performance.
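Since the original code is only shown as a screenshot, here is a hedged sketch of the same idea: asking the allocator for an ioBuffer, which returns a direct (off-heap) buffer where the platform supports it, so the Socket read needs no heap-to-direct copy. The buffer size is illustrative.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

public final class IoBufferExample {
    public static void main(String[] args) {
        // ioBuffer prefers a direct (off-heap) buffer when the platform allows it.
        ByteBuf buf = ByteBufAllocator.DEFAULT.ioBuffer(1024);
        try {
            System.out.println("direct: " + buf.isDirect());
            buf.writeBytes(new byte[]{1, 2, 3});
        } finally {
            buf.release(); // hand the buffer back to the allocator
        }
    }
}
```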
The second "zero copy" mechanism is CompositeByteBuf, which presents multiple ByteBufs behind a single, unified ByteBuf interface. Its class definition is as follows:
Figure 2-7 CompositeByteBuf class inheritance relationship
CompositeByteBuf is effectively a ByteBuf wrapper that combines multiple ByteBufs into one collection and exposes them through the unified ByteBuf interface. Its definition is as follows:
Figure 2-8 CompositeByteBuf class definition
Adding a ByteBuf requires no memory copy; the relevant code is shown below:
Figure 2-9 Adding zero copy for ByteBuf
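The screenshot is not reproduced here; the following hedged sketch (Netty 4.1 API) shows the idea: a header and a body are aggregated into one logical buffer without copying either.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.CharsetUtil;

public final class CompositeExample {
    public static void main(String[] args) {
        ByteBuf header = Unpooled.copiedBuffer("HEADER", CharsetUtil.UTF_8);
        ByteBuf body   = Unpooled.copiedBuffer("BODY", CharsetUtil.UTF_8);

        // Aggregate both buffers into one logical view; no bytes are copied.
        // The 'true' flag advances the writer index over the added components.
        CompositeByteBuf message = Unpooled.compositeBuffer();
        message.addComponents(true, header, body);

        System.out.println(message.toString(CharsetUtil.UTF_8)); // HEADERBODY
        message.release(); // releases the underlying components too
    }
}
```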
Finally, let’s look at the “zero copy” of file transfer:
Figure 2-10 Zero Copy file transfer
For file transfer, Netty's DefaultFileRegion sends a file to the target Channel through the transferTo method. The FileChannel.transferTo method is documented as follows:
Figure 2-11 Zero Copy file transfer
On many operating systems, transferTo sends the contents of the file buffer directly to the target Channel without an explicit copy through user space. This is a more efficient transfer path, and it is what realizes "zero copy" for file transmission.
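As a hedged sketch of the same mechanism (my own code, assuming a connected Channel and no SSL handler in the pipeline, since SSL forces the bytes through user space):

```java
import io.netty.channel.Channel;
import io.netty.channel.DefaultFileRegion;
import java.io.RandomAccessFile;

public final class ZeroCopyFileSend {
    // 'ch' is assumed to be a connected Channel with no SSL/compression
    // handler in the pipeline (those defeat transferTo).
    static void sendFile(Channel ch, String path) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        // The region is handed to the kernel via FileChannel.transferTo.
        DefaultFileRegion region = new DefaultFileRegion(raf.getChannel(), 0, raf.length());
        ch.writeAndFlush(region); // Netty closes the file when the region is released
    }
}
```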
2.2.3. Memory pool
As the JVM and just-in-time compilation have matured, object allocation and collection has become a very lightweight task. For buffers, however, the situation is slightly different: allocation and reclamation of direct off-heap memory in particular can be time-consuming. To reuse buffers as much as possible, Netty provides a buffer reuse mechanism based on memory pools. Let's look at the pooled ByteBuf implementation:
Figure 2-12 Memory pool ByteBuf
Netty provides multiple memory management policies. You can configure related parameters in the boot assistance class to achieve differentiated customization.
Let's compare the performance of a pooled, recycled ByteBuf against an ordinary ByteBuf.
Use case 1: create a direct memory buffer using the pooled allocator:
Figure 2-13 Memory pool-based non-heap memory buffer test case
Use case 2: create a direct memory buffer without a memory pool:
Figure 2-14 Test case of creating non-heap memory buffers based on non-memory pools
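Both use cases appear as screenshots in the original; the following is a hedged reconstruction of the comparison. The 3-million iteration count matches the test described below; the 1 KB buffer size is an assumption.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.Unpooled;

public class PooledVsUnpooled {
    static final int LOOP = 3_000_000;
    static final byte[] PAYLOAD = new byte[1024];

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        for (int i = 0; i < LOOP; i++) {
            // Use case 1: pooled direct buffer, returned to the pool on release.
            ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(1024);
            buf.writeBytes(PAYLOAD);
            buf.release();
        }
        long pooled = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < LOOP; i++) {
            // Use case 2: a fresh unpooled direct buffer allocated every iteration.
            ByteBuf buf = Unpooled.directBuffer(1024);
            buf.writeBytes(PAYLOAD);
            buf.release();
        }
        long unpooled = System.nanoTime() - t0;

        System.out.printf("pooled: %d ms, unpooled: %d ms%n",
                pooled / 1_000_000, unpooled / 1_000_000);
    }
}
```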
Each case executes 3 million iterations. The performance comparison results are as follows:
Figure 2-15 Comparison of memory pool and non-memory pool buffer write performance
The performance test shows that the pooled ByteBuf outperforms the ordinary, unpooled ByteBuf by roughly a factor of 23 (performance data is strongly correlated with the usage scenario).
Let's briefly analyze how Netty's memory pool allocates memory:
Figure 2-16 AbstractByteBufAllocator buffer allocation
Moving on to the newDirectBuffer method, we see that it is an abstract method on AbstractByteBufAllocator, implemented by its subclasses; the code is as follows:
Figure 2-17 Different implementations of newDirectBuffer
Code execution reaches the newDirectBuffer method of PooledByteBufAllocator, which obtains the memory arena PoolArena from the thread-local cache and calls its allocate method to allocate memory:
Figure 2-18 Memory allocation for PooledByteBufAllocator
The allocate method for PoolArena is as follows:
Figure 2-18 PoolArena buffer allocation
We focus on the implementation of newByteBuf, which is also abstract; the subclasses DirectArena and HeapArena implement the different buffer types. Since the test case uses off-heap memory, we follow the DirectArena implementation:
Figure 2-19 PoolArena’s newByteBuf abstract method
If use of Sun's Unsafe is not enabled, DirectArena's newByteBuf falls back to creating a PooledDirectByteBuf:
Figure 2-20 DirectArena newByteBuf method implementation
That path executes PooledDirectByteBuf's newInstance method, as follows:
Figure 2-21 newInstance of PooledDirectByteBuf
The RECYCLER.get() method either reuses a ByteBuf object from the object pool or, if none is available, creates a new one. Once the ByteBuf has been obtained from the pool, the setRefCnt method of AbstractReferenceCountedByteBuf is called to reset the reference counter, which is used for object reference counting and memory reclamation (analogous to JVM garbage collection).
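In practice this means a pooled ByteBuf must be released explicitly: the reference count, not the garbage collector, decides when the memory returns to the pool. A short illustration (my own sketch):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public final class RefCountExample {
    public static void main(String[] args) {
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(256);
        System.out.println(buf.refCnt()); // 1: freshly leased from the pool
        buf.retain();                     // 2: register an extra owner
        buf.release();                    // back to 1
        buf.release();                    // 0: the memory goes back to the pool
    }
}
```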
2.2.4. Efficient Reactor thread model
There are three commonly used Reactor thread models, which are as follows:
- Reactor single-thread model;
- Reactor multithreading model;
- Master-slave Reactor multithreading model.
In the Reactor single-thread model, all IO operations are performed on the same NIO thread, whose responsibilities are as follows:
- As the NIO server, accept TCP connections from clients;
- As the NIO client, initiate TCP connections to the server;
- Read request or reply messages from the communication peer;
- Send request or reply messages to the peer.
The Reactor single-thread model is shown as follows:
Figure 2-22 Reactor single-thread model
Because the Reactor pattern uses asynchronous non-blocking IO, no IO operation causes blocking, so in theory a single thread can independently handle all IO-related operations. From an architectural point of view, one NIO thread can indeed fulfill all of these responsibilities: for example, the Acceptor receives a client's TCP connection request message; once the link is established, the corresponding ByteBuffer is dispatched to the designated Handler for decoding; and the user Handler can send messages back to the client through the same NIO thread.
For some low-volume applications, the single-threaded model can be used. However, it is not suitable for high-load, high-concurrency applications, mainly for the following reasons:
- One NIO thread handling hundreds of links at the same time cannot keep up: even if its CPU load reaches 100%, it cannot encode, decode, read, and send that volume of messages fast enough.
- When the NIO thread is overloaded, processing slows down, causing large numbers of client connections to time out. Timeouts trigger retransmissions, which add even more load, eventually leading to a huge message backlog and processing timeouts; the NIO thread becomes the system's performance bottleneck.
- Reliability: once the NIO thread terminates unexpectedly or enters an infinite loop, the entire communication module becomes unavailable and cannot receive or process external messages, causing node failure.
To solve these problems, the Reactor multithreading model evolved. Let's look at it next.
The main difference between the Reactor multithreaded model and the single-threaded model is that a group of NIO threads handles the IO operations. The schematic diagram is as follows:
Figure 2-23 Reactor multithreading model
Characteristics of the Reactor multithreaded model:
- A dedicated NIO thread, the Acceptor thread, listens on the server and accepts TCP connection requests from clients;
- Network IO operations (read, write, and so on) are handled by an NIO thread pool, which can be a standard JDK thread pool consisting of a task queue and N available threads responsible for reading, decoding, encoding, and sending messages;
- One NIO thread can handle N links at the same time, but each link maps to only one NIO thread, which prevents concurrent operations on it.
In most scenarios, the Reactor multithreading model meets performance requirements. In a few special scenarios, however, a single NIO thread that listens for and handles all client connections can become a problem: for example, with millions of concurrent client connections, or when the server must perform security authentication on client handshake messages, which is itself very CPU-intensive. In such scenarios a single Acceptor thread may run into performance trouble. To solve this, a third Reactor thread model emerged: the master-slave Reactor thread model.
The defining feature of the master-slave Reactor thread model is that the server no longer accepts client connections on a single NIO thread but on an independent NIO thread pool. After the Acceptor accepts a client TCP connection and finishes processing it, it registers the newly created SocketChannel with an IO thread in the sub-reactor thread pool, which then takes over that SocketChannel's reads, writes, encoding, and decoding. The Acceptor thread pool is used only for client login, handshake, and security authentication; once the link is established, it is registered with an IO thread of the back-end sub-reactor thread pool, which performs the subsequent IO operations.
Its threading model is shown below:
Figure 2-24 Reactor primary/secondary multithreading model
With the master-slave NIO thread model, a single server-side listening thread no longer has to process all client connections, so that bottleneck disappears. This is the threading model recommended in Netty's official demo.
In fact, Netty's threading model is not fixed: by creating different EventLoopGroup instances in the boot helper class and configuring the appropriate parameters, you can support all three Reactor threading models above. This flexible support for the Reactor thread model lets Netty meet the performance demands of different business scenarios.
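For example, the master-slave model amounts to passing two NioEventLoopGroups to the ServerBootstrap. The sketch below is mine, with Netty's LoggingHandler standing in for real business handlers and port 8080 chosen arbitrarily; the comments note how the other two models fall out of the same API.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.logging.LoggingHandler;

public final class MasterSlaveServer {
    public static void main(String[] args) throws InterruptedException {
        // Master-slave model: 'boss' only accepts connections; 'worker' does
        // all subsequent reads/writes/codec work. (A single one-thread group
        // used for both roles gives the single-thread model; one shared
        // multi-thread group gives the plain multithreading model.)
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup worker = new NioEventLoopGroup();
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, worker)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // LoggingHandler stands in for real business handlers.
                     ch.pipeline().addLast(new LoggingHandler());
                 }
             });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            worker.shutdownGracefully();
        }
    }
}
```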
2.2.5. Lock-free serial design concept
In most scenarios, parallel multithreading improves the concurrency of the system. However, if concurrent access to shared resources is handled improperly, severe lock contention results, which ultimately degrades performance. To avoid that cost wherever possible, a serialized design can be adopted: complete the processing of a message within one thread whenever possible, with no thread switching, thereby avoiding multithread contention and synchronization locks.
To maximize performance, Netty adopts a serial lock-free design: it performs serial operations inside the IO thread to avoid the performance degradation of multithreaded contention. On the surface, the serialized design looks CPU-inefficient and insufficiently concurrent; however, by adjusting the NIO thread pool's parameters, you can start multiple serialized threads that run in parallel. This locally lock-free design of serialized threads performs better than the one-queue-many-workers model.
The working principle diagram of Netty’s serialization design is as follows:
Figure 2-25 Working principle of Netty serialization
After Netty's NioEventLoop reads a message, it directly calls ChannelPipeline.fireChannelRead(Object msg). As long as the user does not actively switch threads, the NioEventLoop calls all the way through to the user's Handler with no thread switch in between. This serialized approach avoids the lock contention of multithreading and is optimal from a performance standpoint.
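One practical consequence, shown in the hedged sketch below (my own example): inside a non-shared handler, every callback for a given channel arrives on the one EventLoop thread bound to that channel, so channel-scoped state needs no synchronization.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// One instance per channel (not @Sharable), so this field is only ever
// touched by the single EventLoop thread bound to that channel.
public class CounterHandler extends ChannelInboundHandlerAdapter {
    private int messagesSeen; // no volatile/synchronized needed

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        assert ctx.channel().eventLoop().inEventLoop(); // always true here
        messagesSeen++;
        ctx.fireChannelRead(msg); // pass the event down the pipeline
    }
}
```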
2.2.6. Efficient concurrent programming
Netty’s efficient concurrent programming is mainly reflected in the following points:
- Extensive and correct use of volatile;
- Extensive use of CAS and the atomic classes;
- Use of thread-safe containers;
- Improving concurrency through read/write locks.
If you want to know more about Netty's efficient concurrent programming, read "Analysis of Multi-threaded Concurrent Programming in Netty", which introduces and analyzes Netty's multithreading techniques and their application in detail.
2.2.7. High-performance serialization framework
The key factors that affect serialization performance are summarized below:
- Size of the serialized codestream (network bandwidth usage);
- Serialization and deserialization performance (CPU usage);
- Cross-language support (for heterogeneous system interconnection and development-language switches).
Netty supports Google Protobuf by default; by extending Netty's codec interface, users can plug in other high-performance serialization frameworks, such as Thrift's compressed binary codec.
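As a hedged sketch of what that wiring typically looks like (MyMessage is a hypothetical protoc-generated Protobuf message class, not something from the article):

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.protobuf.ProtobufDecoder;
import io.netty.handler.codec.protobuf.ProtobufEncoder;
import io.netty.handler.codec.protobuf.ProtobufVarint32FrameDecoder;
import io.netty.handler.codec.protobuf.ProtobufVarint32LengthFieldPrepender;

public class ProtobufInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        // Inbound: split frames by the varint32 length prefix, then decode.
        ch.pipeline().addLast(new ProtobufVarint32FrameDecoder());
        // MyMessage is a hypothetical protoc-generated class.
        ch.pipeline().addLast(new ProtobufDecoder(MyMessage.getDefaultInstance()));
        // Outbound: prepend the length prefix, then encode the message.
        ch.pipeline().addLast(new ProtobufVarint32LengthFieldPrepender());
        ch.pipeline().addLast(new ProtobufEncoder());
        // ... add your business handler after the codecs ...
    }
}
```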
Let’s take a look at the comparison of byte arrays serialized by different serialization and deserialization frameworks:
Figure 2-26 Comparison of serialization stream sizes among serialization frameworks
As the figure shows, the Protobuf-serialized stream is only about a quarter the size of the Java-serialized stream. It is precisely this poor showing of Java native serialization that spawned the various high-performance open source serialization technologies and frameworks (poor performance is only one reason; other factors include cross-language support, IDL definitions, and so on).
2.2.8. Flexible TCP parameter configuration
Setting TCP parameters properly can improve performance in certain scenarios, for example SO_RCVBUF and SO_SNDBUF; set improperly, the impact on performance can be severe. The following configuration items have a notable impact on performance:
- SO_RCVBUF and SO_SNDBUF: 128 KB or 256 KB is usually recommended;
- TCP_NODELAY: the Nagle algorithm automatically coalesces small packets in the buffer into larger packets, preventing floods of small packets from congesting the network and thereby improving network efficiency. For latency-sensitive applications, however, this optimization needs to be switched off;
- Soft interrupts: if the Linux kernel supports RPS (2.6.35 or later), enabling RPS improves network throughput through soft-interrupt balancing. RPS computes a hash from the packet's source address, destination address, and source and destination ports, then chooses the CPU that will run the soft interrupt based on that hash. From a higher level, this binds each connection to a CPU and balances soft interrupts across multiple CPUs through the hash, improving parallel network processing performance.
Netty can flexibly configure TCP parameters in the startup assistance class to meet different user scenarios. The configuration interfaces are defined as follows:
Figure 2-27 TCP parameter configuration of Netty
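Since the configuration interface is only shown as a screenshot, here is a hedged sketch of setting these options on the boot helper; the values mirror the recommendations above, and SO_BACKLOG is an extra illustrative option:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;

public final class TcpTuning {
    static void configure(ServerBootstrap b) {
        // Option for the listening (parent) channel:
        b.option(ChannelOption.SO_BACKLOG, 1024);
        // Options applied to every accepted (child) connection:
        b.childOption(ChannelOption.SO_RCVBUF, 256 * 1024);
        b.childOption(ChannelOption.SO_SNDBUF, 256 * 1024);
        b.childOption(ChannelOption.TCP_NODELAY, true); // disable Nagle for latency-sensitive apps
    }
}
```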
2.3. Summary
By analyzing Netty's architecture and performance model, we find that its high-performance design is carefully thought out and well implemented. Thanks to that high-quality architecture and code, supporting 10W TPS of cross-node service invocation is not especially difficult for Netty.
References
- www.infoq.cn/article/net…