Preface

We will start with optimization of the underlying network I/O model, then move on to memory-copy optimization and threading-model optimization, and analyze in depth how communication frameworks such as Tomcat and Netty improve system performance by optimizing I/O.

Network I/O model optimization

In network communication, the bottom layer is the network I/O model in the kernel. As technology has developed, the operating system kernel has evolved five network I/O models. The book “UNIX Network Programming” classifies them as blocking I/O, non-blocking I/O, I/O multiplexing, signal-driven I/O, and asynchronous I/O. Each model is an optimized upgrade of the previous one.

The earliest model, blocking I/O, requires a user thread for each connection, and that thread is suspended until the I/O operation is ready or finished; this is the root cause of performance bottlenecks.

So where does blocking occur in socket communication?

In UNIX Network Programming, sockets are divided into stream sockets (TCP) and datagram sockets (UDP), of which TCP connections are the most commonly used. Let’s walk through the workflow of a TCP server (because TCP data transmission is complicated, with the possibility of packet splitting and merging, I assume only the simplest TCP data transfer):

  • First, the application creates a socket through the system call socket; the socket is a file descriptor assigned to the application by the system.
  • Second, the application gives the socket a name through the system call bind, which binds an address and port number to it.
  • Then listen is called to create a queue for incoming connections from clients.
  • Finally, the application listens for client connection requests through the system call accept.

When a client connects to the server, the server calls fork to create a child process, which reads messages from the client through the system call read and returns data to the client through write.
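Java has no direct fork call; a thread per connection plays the analogous role. Below is a minimal sketch of this blocking model in Java, where the port 8080 and the echo behavior are illustrative assumptions:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class BlockingEchoServer {
        public static void main(String[] args) throws Exception {
            // socket() + bind() + listen() are all wrapped by ServerSocket
            ServerSocket server = new ServerSocket(8080);
            while (true) {
                Socket client = server.accept();          // blocks until a client connects
                new Thread(() -> {                        // one thread per connection (the Java analogue of fork)
                    try (Socket s = client) {
                        InputStream in = s.getInputStream();
                        OutputStream out = s.getOutputStream();
                        byte[] buf = new byte[1024];
                        int n;
                        while ((n = in.read(buf)) != -1) { // blocks until the client writes data
                            out.write(buf, 0, n);          // echo the data back
                        }
                    } catch (Exception ignored) {
                    }
                }).start();
            }
        }
    }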

1. Blocking I/O

Throughout the socket communication workflow, a socket is blocking by default. That is, when a socket call is issued that cannot be completed immediately, the process is suspended by the system and goes to sleep, waiting for the corresponding operation to respond. In this workflow there are three places where blocking can occur.

Connect blocking: when the client initiates a TCP connection request, the system call connect is used to complete the three-way handshake. The client has to wait for the SYN and ACK sent by the server, and the server in turn blocks waiting for the client’s ACK that confirms the connection. This means every TCP connect blocks until the connection is established.

Accept blocking: a server using blocking sockets calls accept to receive incoming connections. If no new connection arrives, the calling process is suspended and enters the blocked state.

Read/write blocking: after a socket connection is created successfully, the server forks a child process and calls read to wait for the client to write data. If no data arrives, the child process is suspended and enters the blocked state.

2. Non-blocking I/O

Using fcntl, you can make all three of these operations non-blocking. If the operation cannot complete immediately, the call returns an EWOULDBLOCK or EAGAIN error instead of keeping the process blocked.

Once these operations are set to non-blocking, we need a thread to poll them repeatedly, which is the most traditional non-blocking I/O model.
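In Java, the same effect can be sketched with configureBlocking(false): accept() and read() return immediately instead of blocking, and a user thread keeps polling them. The port, buffer size, and sleep interval below are illustrative assumptions:

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class PollingNonBlockingServer {
        public static void main(String[] args) throws Exception {
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);              // the Java equivalent of fcntl(O_NONBLOCK)

            List<SocketChannel> clients = new ArrayList<>();
            ByteBuffer buf = ByteBuffer.allocate(1024);
            while (true) {                                // the user thread keeps polling
                SocketChannel client = server.accept();   // returns null immediately if no connection is pending
                if (client != null) {
                    client.configureBlocking(false);
                    clients.add(client);
                }
                Iterator<SocketChannel> it = clients.iterator();
                while (it.hasNext()) {
                    SocketChannel c = it.next();
                    buf.clear();
                    int n = c.read(buf);                  // returns 0 immediately if no data has arrived
                    if (n > 0) {
                        buf.flip();
                        c.write(buf);                     // echo back whatever was read
                    } else if (n == -1) {
                        c.close();                        // the client closed the connection
                        it.remove();
                    }
                }
                Thread.sleep(10);                         // crude pacing; this busy polling wastes CPU under load
            }
        }
    }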

3. I/O multiplexing

Using a user thread to poll the status of I/O operations is a disaster for CPU utilization when there are a large number of requests. So is there any other way to implement a non-blocking I/O socket?

Linux provides the I/O multiplexing functions select/poll/epoll. A process can block on one or more read operations through one of these system calls, and the kernel can then detect for us whether any of multiple reads is in the ready state.

The select() function is used to listen, within a timeout period, for readable, writable, and exception events on the file descriptors the user cares about. The Linux kernel treats all external devices as files; reading and writing a file goes through system calls provided by the kernel, which return a file descriptor (fd).

    int select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, const struct timeval *timeout);

Looking at the code above, the file descriptors monitored by select() are divided into three classes: writefds (writable file descriptors), readfds (readable file descriptors), and exceptfds (exception-event file descriptors).

The select() function blocks until a descriptor is ready or the call times out, and then returns. After select returns, the ready descriptors can be found by iterating over the fd_set with FD_ISSET. An fd_set can be understood as a collection of file descriptors and is manipulated with four macros: FD_ZERO, FD_SET, FD_CLR, and FD_ISSET.

The poll() function: before each call to select(), the system needs to copy the fd collection from user space to kernel space, which incurs some performance overhead. In addition, the number of fds a single process can monitor with select is 1024 by default; this limit can be raised by changing the macro definition or even recompiling the kernel, but because fd_set is implemented as an array, adding and removing a large number of fds is inefficient.

poll() has a mechanism similar to select(), with little difference in essence. poll() manages multiple descriptors by polling and processes them according to their state, but poll() has no limit on the maximum number of file descriptors.

poll() and select() share the same drawback: the array containing a large number of file descriptors is copied in its entirety between user space and the kernel’s address space, and regardless of whether the file descriptors are ready, the overhead grows linearly with their number.

The epoll() function: select/poll scans the fds sequentially for readiness, and the number of fds they support is limited, so their use is restricted.

Linux added the epoll call in kernel version 2.6, which uses an event-driven approach instead of polling to scan fds. epoll registers a file descriptor through epoll_ctl() and stores it in the kernel event table, which is implemented as a red-black tree. With a large number of I/O requests, insertion and deletion therefore perform better than the array-based fd_set used by select/poll, so epoll performs better and is not limited by the number of fds.

    int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

From the code above, we can see that epfd in epoll_ctl() is an epoll-specific file descriptor created by epoll_create(), op specifies the operation to perform (add, modify, or delete the descriptor), fd is the file descriptor to associate, and event specifies the type of event to listen for.

Once a file descriptor is ready, the kernel uses a callback-like mechanism to activate it quickly; the process is notified when it calls epoll_wait(), and it then completes the related I/O operations.

    int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

4. Signal-driven I/O

Signal-driven I/O is similar to the observer pattern: the kernel is the observer, and the signal callback is the notification. When a user process initiates an I/O request, the system call sigaction registers a signal callback for the corresponding socket. The user process is not blocked and can continue working. When the kernel data is ready, the kernel generates a SIGIO signal for the process, and the signal callback notifies the process to perform the related I/O operations.

Signal-driven I/O performs better than the first three I/O models because the process is not blocked while waiting for data to become ready and the main loop can continue working.

Signal-driven I/O is rarely used for TCP because SIGIO is a Unix signal that carries no additional information: if a signal can have multiple causes at its source, the receiver cannot tell what happened. A TCP socket can produce seven kinds of signal events, so when the application receives SIGIO it has no way to tell them apart.

However, signal-driven I/O is used for UDP communication, which has only one kind of data-arrival event. Under normal circumstances, as soon as a UDP process catches the SIGIO signal it calls recvfrom to read the arriving datagram; if something abnormal happens, an error is returned. NTP servers, for example, use this model.

5. Asynchronous I/O

Although signal-driven I/O does not block while waiting for data to become ready, the I/O operation after the notification still blocks: the process has to wait for the data to be copied from kernel space to user space. Asynchronous I/O, on the other hand, implements truly non-blocking I/O.

When a user process initiates an I/O request, it tells the kernel to start the operation and to notify the process after the entire operation is complete, including waiting for the data to become ready and copying it from the kernel to user space. The asynchronous I/O model is rarely used in real production environments because the code is complex and hard to debug, and few operating systems support it well (the Linux kernel’s support for asynchronous I/O is still immature, while Windows has implemented asynchronous I/O).
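For reference, Java exposes this model through the asynchronous channels added in NIO.2 (JDK 7). The sketch below is illustrative; the port and the echo behavior are assumptions:

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousServerSocketChannel;
    import java.nio.channels.AsynchronousSocketChannel;
    import java.nio.channels.CompletionHandler;

    public class AsyncEchoServer {
        public static void main(String[] args) throws Exception {
            AsynchronousServerSocketChannel server =
                    AsynchronousServerSocketChannel.open().bind(new InetSocketAddress(8080));

            // accept() returns immediately; the handler runs only after the connection is fully established
            server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
                @Override
                public void completed(AsynchronousSocketChannel client, Void att) {
                    server.accept(null, this);            // keep accepting further connections
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    // read() also returns immediately; the callback fires after the data has been copied in
                    client.read(buf, buf, new CompletionHandler<Integer, ByteBuffer>() {
                        @Override
                        public void completed(Integer bytesRead, ByteBuffer b) {
                            b.flip();
                            client.write(b);              // echo the received bytes back
                        }

                        @Override
                        public void failed(Throwable exc, ByteBuffer b) { }
                    });
                }

                @Override
                public void failed(Throwable exc, Void att) { }
            });

            Thread.currentThread().join();                // keep the main thread alive
        }
    }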

Java NIO implements non-blocking I/O with the I/O multiplexer Selector, which is based on the I/O multiplexing model among the five models above. In Java, Selector is a wrapper class around select/poll/epoll.

As mentioned in the TCP communication flow above, connect, accept, read, and write are the blocking operations in socket communication. They correspond to the four listening events OP_CONNECT, OP_ACCEPT, OP_READ, and OP_WRITE of the Selector’s SelectionKey.

In NIO server programming, a Channel is created to listen for client connections. Next, a Selector is created and the Channel is registered with it. The program polls the registered channels through the Selector; when one or more channels are ready, their ready events are returned. Finally, the program matches the listening event and performs the related I/O operation.
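A minimal sketch of that event loop, with an illustrative port and a simple echo handler as assumptions:

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class SelectorEchoServer {
        public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();          // wraps select/poll/epoll depending on the platform
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();                        // blocks until at least one channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {             // OP_ACCEPT: a new connection is ready
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {        // OP_READ: data has arrived on a connection
                        SocketChannel client = (SocketChannel) key.channel();
                        ByteBuffer buf = ByteBuffer.allocate(1024);
                        int n = client.read(buf);
                        if (n > 0) {
                            buf.flip();
                            client.write(buf);            // echo the data back
                        } else if (n == -1) {
                            client.close();               // the client closed the connection
                        }
                    }
                }
            }
        }
    }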

When creating a Selector, the program chooses which I/O multiplexing function to use based on the operating system version. Since JDK 1.5, if the program runs on Linux with a 2.6 or later kernel, NIO uses epoll instead of the traditional select/poll, which greatly improves NIO communication performance.

Because signal-driven I/O is not well suited to TCP communication, and asynchronous I/O is not yet mature in the Linux kernel, most frameworks implement network communication on top of the I/O multiplexing model.

Zero copy

In the I/O multiplexing model, the read/write I/O operations themselves still block, and multiple memory copies and context switches occur during these operations, which adds to the system’s performance overhead.

Zero copy is a technique that avoids multiple memory copies to optimize read and write I/O operations.

In network programming, an I/O operation is usually completed with read and write. Each read or write involves four memory copies, along the path: I/O device -> kernel space -> user space -> kernel space -> other I/O device.

The mmap function in the Linux kernel can replace the read/write I/O operations so that user space and kernel space share the same buffer. mmap maps an address in user space and an address in kernel space to the same physical memory address; both are virtual addresses that are ultimately mapped to physical memory through address translation. This approach avoids copying data between kernel space and user space. The epoll function in I/O multiplexing uses mmap to reduce memory copying.

In Java NIO programming, Direct Buffers are used to achieve this kind of zero copy. Java allocates a block of physical memory outside the JVM heap so that the kernel and the user process can share the same buffer.
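The sketch below shows the related NIO APIs: a direct buffer, an mmap-style MappedByteBuffer, and FileChannel.transferTo(), which delegates to sendfile where the operating system supports it. The file name, host, and port are illustrative assumptions:

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ZeroCopyDemo {
        public static void main(String[] args) throws Exception {
            // A direct buffer lives outside the JVM heap, so the kernel can fill it without an extra copy
            ByteBuffer direct = ByteBuffer.allocateDirect(4096);

            try (FileChannel file = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 8080))) {

                // mmap-style sharing: map the file into memory instead of read()-ing it into user space
                MappedByteBuffer mapped = file.map(FileChannel.MapMode.READ_ONLY, 0, file.size());
                System.out.println("mapped " + mapped.capacity() + " bytes via mmap");

                // transferTo() moves file -> socket without passing the data through user space at all
                long sent = file.transferTo(0, file.size(), socket);
                System.out.println("bytes sent: " + sent);
            }
        }
    }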

Thread model optimization

In addition to the kernel’s optimization of the network I/O model, NIO has also been optimized and upgraded at the user level. NIO performs I/O operations based on an event-driven model. The Reactor model is a common model for handling synchronous I/O events. Its core idea is to register I/O events with a multiplexer; once an I/O event is triggered, the multiplexer dispatches the event to an event handler, which performs the ready I/O operation. The model has the following three main components:

  • Event acceptor (Acceptor): responsible for receiving connection requests;
  • Event separator (Reactor): after receiving a request, it registers the established connection with the multiplexer Selector, listens on the Selector in a loop, and once an event is detected it dispatches the event to the event handler;
  • Event handlers (Handlers): perform the related event processing, such as read and write I/O operations.

1. Single-threaded Reactor thread model

NIO was originally implemented on a single-threaded basis, with all I/O operations performed on a single NIO thread. Because NIO is non-blocking I/O, one thread can theoretically complete all I/O operations.

However, NIO does not eliminate all blocking, because the user process is still blocked during the actual read and write of I/O data. This approach therefore hits a performance bottleneck in high-load, high-concurrency scenarios: if one NIO thread had to handle I/O for tens of thousands of connections at the same time, the system could not support requests of that magnitude.

2. Multithreaded Reactor thread model

To address the performance bottleneck of this single-threaded NIO in high-load, high-concurrency scenarios, thread pools were later used.

In both Tomcat and Netty, an Acceptor thread listens for connection request events. When a connection is established, it is registered with the multiplexer; once an event on it is detected, it is handed over to the Worker thread pool for processing. In most cases this threading model meets performance requirements, but when the number of connected clients rises by another order of magnitude, a single Acceptor thread may become a performance bottleneck.

3. Master-slave Reactor thread model

At present, mainstream NIO communication frameworks are based on the master-slave Reactor thread model. In this model, the Acceptor is no longer a single NIO thread but a thread pool. The Acceptor receives the client’s TCP connection request, and after the connection is established, subsequent I/O operations are handed over to the Worker I/O threads.
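In Netty, for example, this model corresponds to a boss EventLoopGroup (master Reactor) and a worker EventLoopGroup (slave Reactor). A minimal sketch, assuming Netty 4.x and an illustrative port and handler:

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioServerSocketChannel;
    import io.netty.handler.logging.LoggingHandler;

    public class MasterSlaveReactorServer {
        public static void main(String[] args) throws Exception {
            EventLoopGroup boss = new NioEventLoopGroup(2);   // "master" Reactor pool: accepts connections
            EventLoopGroup worker = new NioEventLoopGroup();  // "slave" Reactor pool: handles read/write I/O
            try {
                ServerBootstrap bootstrap = new ServerBootstrap();
                bootstrap.group(boss, worker)
                         .channel(NioServerSocketChannel.class)
                         .childHandler(new ChannelInitializer<SocketChannel>() {
                             @Override
                             protected void initChannel(SocketChannel ch) {
                                 // handlers added here run on the worker event loops
                                 ch.pipeline().addLast(new LoggingHandler());
                             }
                         });
                bootstrap.bind(8080).sync().channel().closeFuture().sync();
            } finally {
                boss.shutdownGracefully();
                worker.shutdownGracefully();
            }
        }
    }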

Tuning Tomcat parameters based on the threading model

In Tomcat, BIO and NIO are implemented based on the master-slave Reactor thread model.

In BIO, Tomcat’s Acceptor only listens for new connections; once a connection is established, it is handed over to a Worker thread, which listens for and performs the I/O reads and writes.

In NIO, Tomcat adds a Poller thread pool. After accepting a connection, the Acceptor does not hand the request directly to a Worker thread; instead it sends it to the Poller buffer queue first. Each Poller maintains a Selector object; when the Poller takes a connection from the queue, it registers it with that Selector, then iterates over the Selector to find the ready I/O operations and uses threads in the Worker pool to process the corresponding requests.

You can set the Acceptor and Worker thread pools with the following parameters.

acceptorThreadCount: the number of Acceptor threads. If the volume of client connection requests is very large, this number can be increased to improve the ability to handle connection requests. The default value is 1.

maxThreads: the number of Worker threads that handle I/O operations. The default value is 200. This parameter can be adjusted to the actual environment; bigger is not necessarily better.

acceptCount: Tomcat’s Acceptor thread is responsible for taking connections from the accept queue and handing them to Worker threads. acceptCount refers to the size of the accept queue.

When HTTP keep-alive is turned off and concurrency is high, you can increase this value appropriately. However, when HTTP keep-alive is enabled, Worker threads may be occupied for a long time because their number is limited, and connections may then time out in the accept queue; if the accept queue is too large, connections are easily wasted.

maxConnections: the number of sockets that can be connected to Tomcat. In BIO mode, one thread can handle only one connection, so maxConnections and maxThreads generally have the same value. In NIO mode, where one thread handles multiple connections at the same time, maxConnections should be set much larger than maxThreads; the default is 10000.
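As an illustration, these parameters could be set programmatically on an embedded Tomcat Connector as shown below. The concrete values are assumptions, the availability of acceptorThreadCount depends on the Tomcat version, and in a standalone installation the same attributes belong on the Connector element in server.xml:

    import org.apache.catalina.connector.Connector;
    import org.apache.catalina.startup.Tomcat;

    public class TunedEmbeddedTomcat {
        public static void main(String[] args) throws Exception {
            Tomcat tomcat = new Tomcat();

            // NIO connector; the protocol class name selects the NIO endpoint
            Connector connector = new Connector("org.apache.coyote.http11.Http11NioProtocol");
            connector.setPort(8080);
            connector.setProperty("acceptorThreadCount", "2");   // Acceptor threads (default 1)
            connector.setProperty("maxThreads", "400");          // Worker threads (default 200)
            connector.setProperty("acceptCount", "200");         // accept queue size
            connector.setProperty("maxConnections", "10000");    // sockets held at the same time

            tomcat.getService().addConnector(connector);
            tomcat.start();
            tomcat.getServer().await();
        }
    }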

Conclusion

If you found this useful, please like, favorite, and share; I will keep publishing selected technical articles!