Author | Yan Gaofei    Source | Alibaba Cloud Native public account

Dubbo is a lightweight, open-source Java service framework that many enterprises prefer when building distributed service architectures. Since 2014, ICBC has been exploring the transformation to a distributed architecture and has independently developed a distributed service platform based on open-source Dubbo.

The Dubbo framework runs stably and performs well when the numbers of providers and consumers are small. With the growing demand for online, diversified, and intelligent banking services, scenarios in which a single provider serves thousands or even tens of thousands of consumers are foreseeable.

Under such high load, if the server program is not well designed, its network services may handle tens of thousands of client connections inefficiently or even break down completely; this is known as the C10K problem. So, can a distributed service platform based on Dubbo handle the C10K scenario? To find out, we set up a large-scale connection environment, simulated service invocations, and conducted a series of explorations and verifications.

In the C10K scenario, a large number of Dubbo service invocation transactions fail

1. Prepare the environment

We used Dubbo 2.5.9 (whose default Netty version is 3.2.5.Final) to write a service provider and the corresponding service consumers. The provider's service method contains no actual business logic and only sleeps for 100ms; on the consumer side, the service timeout is set to 5s, and each consumer invokes the service once per minute.
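
A minimal sketch of the provider-side service used in this verification follows; the EchoService name and signature are illustrative assumptions, not the actual test code.

```java
// Illustrative service interface; the name is an assumption for this sketch.
public interface EchoService {
    String echo(String message);
}

// Provider implementation matching the setup above: no business logic,
// only a 100ms sleep.
public class EchoServiceImpl implements EchoService {
    @Override
    public String echo(String message) {
        try {
            Thread.sleep(100); // simulate the 100ms service method
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return message;
    }
}
```

On the consumer side, the 5s timeout corresponds to the standard timeout parameter of the Dubbo reference configuration (e.g. timeout="5000" on <dubbo:reference>).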

We prepared one 8C16G server to deploy a single service provider in a container, and several hundred 8C16G servers to deploy 7,000 service consumers in containers.

Start the Dubbo monitoring center to monitor service invocations.

2. Set up the verification scenario and observe the results

The verification results were not satisfactory: in the C10K scenario, Dubbo service invocations failed due to timeouts.

When a distributed service invocation takes a long time, every node on the link from consumer to provider holds thread pool resources for that whole time, adding performance overhead. If invocation concurrency rises suddenly, the nodes on the link are easily blocked, which affects other services' invocations and can further degrade the performance of the whole service cluster or even make it unavailable, triggering an avalanche. Service invocation timeouts therefore cannot be ignored, so we analyzed the timeout failures of Dubbo service invocations in this C10K scenario in detail.

C10K scenario analysis

Following the service invocation transaction link, we first suspected that the transaction timeouts were caused by process lag on the provider or consumer side, or by network delay.

We therefore enabled GC logging on the provider and consumer servers with failed transactions, captured jstack thread dumps of the processes several times, and captured network packets on the hosts.

1. Examine GC logs and jstack dumps

There were no obvious anomalies in GC duration, GC interval, memory usage, or thread stacks of the provider and consumer processes. The conjectures of timeouts caused by stop-the-world GC pauses or by improper thread design were provisionally ruled out.

2. Observe the failed transactions in the two scenarios

For the failed transactions in the two scenarios (transaction timeouts while the provider runs stably, and a large number of timeouts after the provider restarts), we examined the network packet captures separately and observed two distinct phenomena:

For scenario 1: Transaction timeout during provider stable operation

Tracing the packet captures and transaction logs on both sides: after the consumer initiates a service invocation request, the provider quickly receives the request packet, but 2s+ elapses between the receipt of the request packet and the start of transaction processing.

Observing the data flow of the transaction response, another 2s+ elapses between the completion of the provider's business method and the sending of the response packet to the consumer, after which the consumer quickly receives the response. As a result, the total transaction time exceeds the 5s service invocation timeout, and a timeout exception is thrown.

We therefore judged that the cause of the transaction timeouts lies not on the consumer side but on the provider side.

For scenario 2: a large number of transactions time out after the provider restarts

After a service invocation request is initiated, the provider's host receives the request packet from the consumer immediately, but instead of delivering it to the application layer, it replies with an RST packet. The transaction times out and fails.

A large number of RST packets were observed within 1-2 minutes after the provider restarted. A deployment script printed the number of connections in the ESTABLISHED state every 10ms after the restart; the connection count did not return to 7,000 immediately but only recovered after 1-2 minutes. During this period, each consumer saw its connection to the provider as ESTABLISHED, so we suspected that the provider side had unilateral (half-open) connections.

We analyzed these two anomalous scenarios separately.

Scenario 1: The provider spends a long time before and after actual transaction processing, causing timeouts

We collected the provider's operating status and performance indicators in detail:

  1. jstack dumps of the service provider process were collected on the provider server every 3s. The Netty worker threads were observed to be busy processing heartbeats roughly every 60 seconds.
  2. top -H output was captured at the same time. Nine Netty worker threads were among the top 10 threads by CPU time slice. Since the provider server has 8 cores and Dubbo's default number of Netty worker threads is the core count plus one, all 9 Netty worker threads were busy (see the sketch after this list).
  3. The system performance collection tool nmon was deployed on the server; CPU spikes were observed every 60 seconds, and the network packet counts spiked at the same moments.
  4. ss -ntp was used to continuously print the data backlog in the network receive and send queues; large backlogs were observed around the time points of the slow transactions.
  5. In the Dubbo framework, providers and consumers send heartbeat packets (packet length 17) at a 60-second interval, matching the interval above. Combined with the packet captures, many heartbeat packets appear near the time points of the slow transactions.
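
The default worker-thread count in item 2 follows from Dubbo 2.5.x's default I/O thread setting, which to our knowledge is the available processor count plus one, capped at 32 (cf. Constants.DEFAULT_IO_THREADS). A minimal sketch of that calculation:

```java
// Sketch of Dubbo 2.5.x's default Netty I/O thread count: CPU cores + 1,
// capped at 32. On the 8C provider this yields the 9 worker threads
// observed in top -H.
public class DefaultIoThreads {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors(); // 8 on the test server
        int ioThreads = Math.min(cores + 1, 32);                // -> 9
        System.out.println("Default Netty worker threads: " + ioThreads);
    }
}
```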

According to the heartbeat mechanism of the Dubbo framework, when the number of consumers is large, the heartbeat packets the provider must send and answer become very dense. We therefore suspected that the heavy heartbeat load kept the Netty worker threads busy, delaying the processing of transaction requests and increasing transaction times.
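
To illustrate why the heartbeats arrive in bursts, here is a hedged sketch of the heartbeat pattern described above. It is our own simplification rather than Dubbo's actual source: a single scheduled task sweeps all channels once per tick, so with thousands of consumers the heartbeat writes land almost simultaneously.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified model of a shared heartbeat task: on every 60s tick, each idle
// channel gets a heartbeat write, producing the periodic burst observed in
// the packet captures.
class HeartbeatSketch {
    interface Channel {
        long lastReadTime();   // last time a packet was received on this channel
        long lastWriteTime();  // last time a packet was sent on this channel
        void sendHeartbeat();
    }

    static final long HEARTBEAT_MS = 60_000; // Dubbo's default heartbeat interval

    static void start(final List<Channel> channels) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(() -> {
            long now = System.currentTimeMillis();
            for (Channel ch : channels) { // e.g. all 7000 consumer connections
                if (now - ch.lastReadTime() > HEARTBEAT_MS
                        || now - ch.lastWriteTime() > HEARTBEAT_MS) {
                    ch.sendHeartbeat(); // all idle channels fire in the same tick
                }
            }
        }, HEARTBEAT_MS, HEARTBEAT_MS, TimeUnit.MILLISECONDS);
    }
}
```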

We further analyzed the operating mechanism of the Netty worker threads and recorded each worker thread's processing time in three key stages: connection request processing, write queue processing, and selectedKeys processing. Roughly every 60 seconds (consistent with the heartbeat interval), processing and reading packets took longer and transaction times increased during those intervals, while the provider received more heartbeat packets at the same moments.
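
Per-stage timings like these can be gathered with a simple wrapper such as the following; this is our own illustrative instrumentation, not code from Dubbo or Netty.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative instrumentation: time one stage of a worker-thread iteration
// and log iterations that exceed a threshold, to correlate slow stages with
// the 60s heartbeat ticks.
class StageTimer {
    static final long SLOW_MS = 100; // threshold for flagging a slow stage

    static <T> T timed(String stage, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            if (ms > SLOW_MS) {
                System.err.println(Thread.currentThread().getName()
                        + " stage=" + stage + " took " + ms + " ms");
            }
        }
    }
}
```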

This confirmed the suspicion above: the Netty worker threads are kept busy by the heavy heartbeat load, which increases transaction times.

Scenario 2: Transactions time out due to unilateral connections

  • Analyze the cause of unilateral connections

During the TCP three-way handshake, if the server's full connection (accept) queue is full, unilateral connections are produced: the consumer, having completed the handshake from its point of view, considers the connection established, while the provider does not.

The size of the full connection queue is min(somaxconn, backlog): the minimum of the system parameter net.core.somaxconn and the backlog passed to listen(). somaxconn is a Linux kernel parameter with a default value of 128; the backlog is set when the socket is created, and the default backlog in Dubbo 2.5.9 is 50. The full connection queue in this environment is therefore 50, which we confirmed with the ss (Socket Statistics) command.
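
For illustration, the backlog is the second argument of Java's ServerSocket constructor, which is ultimately passed to listen(); the port number below is an illustrative assumption. A minimal sketch of where the min(somaxconn, backlog) cap applies:

```java
import java.net.ServerSocket;

// Minimal demo: the second constructor argument is the listen() backlog.
// The kernel caps the effective accept (full connection) queue at
// min(backlog, net.core.somaxconn), i.e. min(50, 128) = 50 here.
public class BacklogDemo {
    public static void main(String[] args) throws Exception {
        int backlog = 50; // Dubbo 2.5.9's default
        try (ServerSocket server = new ServerSocket(20880, backlog)) {
            System.out.println("Accept queue capped at min(" + backlog + ", somaxconn)");
        }
    }
}
```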

Observing the TCP connection queues verified that the full connection queue had overflowed.

That is, the insufficient capacity of the full connection queue caused the large number of unilateral connections. In this verification scenario, too many consumers subscribe to the provider: when the provider restarts, the registry pushes the provider's online notification to all consumers, and all consumers reconnect to the provider almost simultaneously, overflowing the full connection queue.

  • Analyze the scope of impact of unilateral connections

The impact of unilateral connections mostly falls on the consumer's first transaction; occasionally the first two or three transactions fail in succession.

Being on a unilateral connection does not necessarily make a transaction fail. When the full connection queue is full during the three-way handshake, if the half connection queue has free space, the provider starts a timer and retransmits SYN+ACK to the consumer. The retransmission interval doubles each time: 1s, 2s, 4s, 8s, 16s, 31s in total. If the full connection queue frees up within these retransmissions, the consumer replies with an ACK, the connection is established successfully, and the transaction succeeds.

If the full connection queue remains full throughout the retransmissions, the new transaction fails once the timeout period is reached.

Once the retransmission limit is reached, the connection is discarded. When the consumer subsequently sends a request, the provider replies with an RST, and the transaction fails after the timeout period is reached.

According to Dubbo's service invocation model, after the provider sends the RST, the consumer throws a "Connection reset by peer" exception and disconnects from the provider. The consumer cannot receive the response packet for the in-flight transaction, so a timeout exception results. Meanwhile, a consumer-side timer checks the connection to the provider every 2s and initiates a reconnection if the connection is abnormal, after which transactions return to normal.
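
A hedged sketch of that consumer-side reconnect check (our own simplification, not Dubbo's actual source):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified model of the 2s connection check described above: probe the
// connection periodically and re-establish it when it is found broken.
class ReconnectSketch {
    interface Client {
        boolean isConnected();
        void reconnect();
    }

    static void start(final Client client) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(() -> {
            if (!client.isConnected()) {
                client.reconnect(); // recover from the unilateral connection
            }
        }, 2, 2, TimeUnit.SECONDS);
    }
}
```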

3. Analysis and summary of C10K scenario problems

To sum up, the transaction timeouts above have two causes:

  • The heartbeat mechanism keeps the Netty worker threads busy. In each heartbeat task, the provider sends a heartbeat packet to every consumer connection over which no packet has been sent or received within one heartbeat period, and each consumer likewise sends heartbeat packets to every such provider connection. Since the provider is connected to a large number of consumers, heartbeat packets pile up, and processing them consumes substantial CPU, which lengthens the processing of service packets.
  • The capacity of the full connection queue is insufficient. The queue overflows after the provider restarts, producing a large number of unilateral connections, and the first transaction over a unilateral connection has a high probability of failing with a timeout.

4. Further thinking

  1. For scenario 1: How can we reduce the heartbeat processing time of a single Netty worker thread and improve the I/O threads' efficiency? The following schemes were initially envisaged:
  • Reduce the processing time of a single heartbeat
  • Increase the number of Netty worker threads to reduce the load on each I/O thread
  • Spread the heartbeats out to avoid dense bursts
  2. For scenario 2: How can we avoid the transaction failures caused by the initial flood of unilateral connections? The following schemes were envisaged (see the sketch after this list):
  • Increase the length of the TCP full connection queue, which involves the operating system, the container, and Netty
  • Improve the speed at which the server accepts connections
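
As a sketch of how the framework-level knobs above might be turned through Dubbo's API configuration: the setter names follow com.alibaba.dubbo.config.ProtocolConfig, and the "backlog" parameter key is an assumption to verify against your Dubbo version.

```java
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dubbo.config.ProtocolConfig;

// Hedged sketch of tuning the provider protocol via Dubbo's API configuration;
// verify setter names and supported parameters against your Dubbo version.
public class ProtocolTuning {
    public static ProtocolConfig tunedProtocol() {
        ProtocolConfig protocol = new ProtocolConfig();
        protocol.setName("dubbo");
        protocol.setPort(20880);
        protocol.setIothreads(16); // more Netty worker threads than the 8C default of 9

        // Pass a larger listen backlog as a URL parameter (assumed key "backlog");
        // the effective accept queue is still capped by net.core.somaxconn.
        Map<String, String> params = new HashMap<String, String>();
        params.put("backlog", "1024");
        protocol.setParameters(params);
        return protocol;
    }
}
```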

Improving the processing efficiency of transaction packets

1. Layer-by-layer optimization

Based on the above assumptions, we carried out extensive optimization at both the system level and the Dubbo framework level to improve transaction processing efficiency in the C10K scenario and the performance capacity of service invocation.

The optimization covers several aspects at both the operating system level and the Dubbo framework level. After verifying each optimization item by item, every measure brought an improvement to a varying degree.

2. Verification effect of the combined optimizations

Applying all of the above optimizations together worked best. In the verification scenario with one provider connected to 7,000 consumers, no transaction timeouts occurred after the provider restarted or during long-term operation. Compared with before optimization, the provider's peak CPU usage dropped by 30%, the processing-time difference between consumer and provider was kept within 1ms, and the P99 transaction latency fell from 191ms to 125ms. Besides improving the transaction success rate, the optimizations effectively reduce consumers' waiting time, reduce the resources occupied by service operation, and improve system stability.

3. Effect in actual production operation

Based on the above verification results, ICBC integrated the optimizations into its distributed service platform. As of publication, there are already production scenarios in which one provider connects to tens of thousands of consumers. After deploying the optimized version, no abnormal transaction timeouts occurred during provider version upgrades or long-term operation, and the actual behavior matches expectations.

Future

Industrial and Commercial Bank of China is deeply involved in building the Dubbo community and has met many technical challenges while applying Dubbo at financial scale. To meet the demanding operational requirements of highly sensitive financial transactions, ICBC has invested heavily in independent research and development, continuously improving the stability of its service system by extending and customizing the Dubbo framework. In the spirit of "from open source, back to open source", we will keep contributing general-purpose enhancements to the open source community.

In the future, we will continue to focus on applying Dubbo at financial scale, working with the community to keep improving Dubbo's performance capacity and high availability, accelerating the digital transformation and innovation of the financial industry, and keeping full control of the foundational core technologies.

About the author

Yan Gaofei is an architect in the microservices field, mainly engaged in service discovery, high-performance network communication, and related R&D work. He specializes in ZooKeeper, Dubbo, RPC protocols, and related technical directions.
