Background
Some time ago I evaluated Nacos (version 1.1.4) as a replacement for ZooKeeper as the Dubbo registry. I also used nacosSync, the migration tool provided by Nacos, to synchronize services from the existing registry to Nacos. It is not easy to use, at least not at production level, but that is outside the scope of this article; I will write a separate article on the pros and cons of this synchronization tool and the changes it still needs for production use. Early in testing, services kept going offline inexplicably, and I could not find the reason. While the evaluation was under way, Nacos released 1.2.0-beta.0, so I went to GitHub to read the 1.2.0-beta.0 release notes. I reviewed the fixed bugs one by one and merged the important fixes into the version under evaluation. One bugfix caught my attention.
The Nacos Java client sends its requests over a REST-style HTTP interface. The bugfix description says:
When Dubbo uses Nacos as its registry, a large number of connections in the TIME_WAIT state appear on the Dubbo consumer side and occupy many ports. Every request and heartbeat opens a new connection; none are shared. Judging from the Javadoc, the cause may be the use of HttpURLConnection with disconnect() called after each request, which closes the connection.
I logged in to the nacosSync server (which is essentially a Nacos client) and checked the connection states. I did not keep a capture from the live environment; the output was reproduced later in a simulation.
Then I looked at the error log:
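As a rough illustration of that check, the sketch below counts TCP connections per state from `netstat -an`-style output. The class name `TcpStateCounter`, the column layout, and the sample lines are assumptions made for this example; they are not the capture from the nacosSync server.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TcpStateCounter {
    /** Counts TCP connections per state from netstat-style lines. */
    static Map<String, Integer> countStates(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            // Assumed columns: proto, recv-q, send-q, local, remote, state.
            // Header lines and UDP rows are skipped.
            if (cols.length < 6 || !cols[0].startsWith("tcp")) {
                continue;
            }
            counts.merge(cols[5], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<String> sample = Arrays.asList(
            "tcp 0 0 10.0.0.5:51234 10.0.0.9:8848 TIME_WAIT",
            "tcp 0 0 10.0.0.5:51235 10.0.0.9:8848 TIME_WAIT",
            "tcp 0 0 10.0.0.5:51236 10.0.0.9:8848 ESTABLISHED");
        System.out.println(countStates(sample));
    }
}
```

A TIME_WAIT count that dwarfs the ESTABLISHED count on a client is the pattern the bugfix describes.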
java.net.ConnectException: Can't assign requested address (connect failed)
It was almost certain that this bug was causing serious problems and was the most likely reason services kept dropping offline.
Problem handling & analysis
I merged the fix from the issue into the version under evaluation, repackaged it, and restarted nacosSync. Sure enough, the TIME_WAIT count dropped, and over several days of testing, services registered in Nacos no longer went offline without reason. The fix is simple, but why does the problem occur? Take a look at the fix code.
The Javadoc for HttpURLConnection says:
Each HttpURLConnection instance is used to make a single request but the underlying network connection to the HTTP server may be transparently shared by other instances. Calling the close() methods on the InputStream or OutputStream of an HttpURLConnection after a request may free network resources associated with this instance but has no effect on any shared persistent connection. Calling the disconnect() method may close the underlying socket if a persistent connection is otherwise idle at that time.
Calling disconnect() may close the underlying connection, so the next request has to open a new one, and every connection the client closes leaves a socket in the TIME_WAIT state on the client side. To see why, it helps to review the basics of TCP: the three-way handshake that establishes a connection and the four-way handshake that terminates one. The following is based mainly on my understanding of Xie Xiren's "Computer Networks" (7th edition) and on articles found online.
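To make the Javadoc passage concrete, here is a minimal sketch of a GET helper that reads the response body to the end and closes the stream, but does not call disconnect(), so the JDK is free to keep the persistent connection for the next request. This is not the actual Nacos patch; the class name `KeepAliveGet` and the timeout values are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class KeepAliveGet {
    /** GET that drains and closes the body but never calls disconnect(). */
    static String get(String url) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(3000); // assumed value
            conn.setReadTimeout(3000);    // assumed value
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            InputStream raw = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            if (raw == null) {
                return "";
            }
            // Reading to EOF and closing the stream releases this instance's
            // resources while leaving the shared persistent connection alone.
            // Deliberately no conn.disconnect() here.
            try (InputStream in = raw) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length > 0) {
            System.out.println(get(args[0]));
        }
    }
}
```

Whether a connection is actually reused also depends on the server's keep-alive behavior and JDK settings such as the `http.keepAlive` system property, so treat this as a starting point rather than a guarantee.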
TCP Three-way handshake
As shown in the figure, the TCP three-way handshake to establish a connection is as follows:
- (1) Initially, A is CLOSED and B is in the LISTEN state. A initiates a connection request by sending a segment with SYN=1 and initial sequence number seq=x. A enters the SYN-SENT state.
- (2) After receiving the request, B agrees to establish the connection and sends a segment with SYN=1 and ACK=1, acknowledgment number ack=x+1, and its own initial sequence number seq=y. B enters the SYN-RCVD state.
- (3) After receiving B's confirmation, A must acknowledge it in turn with ACK=1, seq=x+1, ack=y+1. A enters the ESTABLISHED state, and B enters it once this acknowledgment arrives.
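The three steps can be traced with a small simulation. This is only a sketch of the sequence and acknowledgment numbers in the walkthrough, not a TCP implementation; the class and field names are made up for the example, and loss and retransmission are ignored.

```java
class HandshakeSketch {
    enum State { CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED }

    /** A TCP segment reduced to the fields used in the walkthrough. */
    static final class Segment {
        final boolean syn;
        final boolean ackFlag;
        final long seq;
        final long ack;

        Segment(boolean syn, boolean ackFlag, long seq, long ack) {
            this.syn = syn;
            this.ackFlag = ackFlag;
            this.seq = seq;
            this.ack = ack;
        }
    }

    static final class Endpoint {
        State state;
        final long isn; // initial sequence number (x for A, y for B)

        Endpoint(State state, long isn) {
            this.state = state;
            this.isn = isn;
        }
    }

    /** Runs steps (1) to (3) and returns the three segments in order. */
    static Segment[] handshake(Endpoint a, Endpoint b) {
        if (a.state != State.CLOSED || b.state != State.LISTEN) {
            throw new IllegalStateException("A must be CLOSED and B must be LISTEN");
        }
        // (1) A -> B: SYN=1, seq=x
        Segment syn = new Segment(true, false, a.isn, 0);
        a.state = State.SYN_SENT;
        // (2) B -> A: SYN=1, ACK=1, seq=y, ack=x+1
        Segment synAck = new Segment(true, true, b.isn, syn.seq + 1);
        b.state = State.SYN_RCVD;
        // (3) A -> B: ACK=1, seq=x+1, ack=y+1
        Segment ack = new Segment(false, true, syn.seq + 1, synAck.seq + 1);
        a.state = State.ESTABLISHED; // A is established once it has sent the ACK
        b.state = State.ESTABLISHED; // B is established once the ACK arrives
        return new Segment[] { syn, synAck, ack };
    }

    public static void main(String[] args) {
        Endpoint a = new Endpoint(State.CLOSED, 100);
        Endpoint b = new Endpoint(State.LISTEN, 300);
        for (Segment s : handshake(a, b)) {
            System.out.println("SYN=" + s.syn + " ACK=" + s.ackFlag
                + " seq=" + s.seq + " ack=" + s.ack);
        }
    }
}
```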
However, TCP also has to handle the abnormal cases:
- (Exception A) If A's segment in (1) is lost and B never receives it, A retransmits. Once the retries time out, A returns to the CLOSED state.
- (Exception B) If, in (2), B receives the request but does not reply, A sees the same situation as in Exception A. If B's reply is lost, A never receives the acknowledgment and so never sends (3); in this case B retransmits its reply and closes the connection after the retries time out.
- (Exception C) If A's acknowledgment in (3) is lost, A is already in the ESTABLISHED state and can send data, while B is still in the SYN-RCVD state. If B then receives a data segment from A, it enters the ESTABLISHED state. If A sends neither the final acknowledgment nor any data, B stays in the half-open state, retransmits (2), and closes the connection after the retries time out. A SYN flood attack deliberately exploits this half-open state.
TCP Four-way handshake (connection termination)
- (1) Initially, A and B are both in the ESTABLISHED state. A closes the connection and sends a segment with FIN=1, seq=u. A enters the FIN-WAIT-1 state.
- (2) After receiving the FIN, B replies with ACK=1, seq=v, ack=u+1 and enters the CLOSE-WAIT state. B can still send data to A at this point. A enters the FIN-WAIT-2 state after receiving B's reply.
- (3) After B finishes sending data, it tells A that the connection can be closed by sending FIN=1, ACK=1, seq=w, ack=u+1. B enters the LAST-ACK state.
- (4) After receiving B's FIN, A acknowledges it with ACK=1, seq=u+1, ack=w+1 and enters the TIME-WAIT state. A enters the CLOSED state after 2MSL (120 seconds if MSL is taken as 60 seconds; the actual wait is implementation-dependent, and Linux, for example, uses a fixed 60 seconds). B enters the CLOSED state as soon as it receives A's acknowledgment.
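The state changes in steps (1) to (4) can be written as a small transition function. This is a simplified sketch: the class, enum, and event names are made up for the example, and it leaves out simultaneous close (the CLOSING state), retransmission, and timeouts other than 2MSL.

```java
class TeardownSketch {
    enum State { ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, LAST_ACK, TIME_WAIT, CLOSED }
    enum Event { SEND_FIN, RECV_FIN, RECV_ACK, TIMEOUT_2MSL }

    /** Returns the next state, or throws if the walkthrough does not cover the event. */
    static State next(State s, Event e) {
        switch (s) {
            case ESTABLISHED:
                if (e == Event.SEND_FIN) return State.FIN_WAIT_1; // (1) active closer A
                if (e == Event.RECV_FIN) return State.CLOSE_WAIT; // (2) passive closer B
                break;
            case FIN_WAIT_1:
                if (e == Event.RECV_ACK) return State.FIN_WAIT_2; // (2) A receives B's ACK
                break;
            case FIN_WAIT_2:
                if (e == Event.RECV_FIN) return State.TIME_WAIT;  // (4) A receives B's FIN
                break;
            case CLOSE_WAIT:
                if (e == Event.SEND_FIN) return State.LAST_ACK;   // (3) B sends its FIN
                break;
            case LAST_ACK:
                if (e == Event.RECV_ACK) return State.CLOSED;     // (4) B receives A's ACK
                break;
            case TIME_WAIT:
                if (e == Event.TIMEOUT_2MSL) return State.CLOSED; // (4) A after 2MSL
                break;
            default:
                break;
        }
        throw new IllegalStateException("No transition from " + s + " on " + e);
    }

    /** Applies the events in order, starting from the given state. */
    static State run(State start, Event... events) {
        State s = start;
        for (Event e : events) {
            s = next(s, e);
        }
        return s;
    }

    public static void main(String[] args) {
        // Active closer (A) up to the point where it waits in TIME_WAIT.
        System.out.println(run(State.ESTABLISHED,
            Event.SEND_FIN, Event.RECV_ACK, Event.RECV_FIN));
        // Passive closer (B) all the way to CLOSED.
        System.out.println(run(State.ESTABLISHED,
            Event.RECV_FIN, Event.SEND_FIN, Event.RECV_ACK));
    }
}
```

The asymmetry is visible in the two runs: only the side that sends the first FIN passes through TIME_WAIT.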
Abnormal conditions:
- (Exception A) A sends the FIN and enters FIN-WAIT-1. If B does not reply, A retransmits until the retries time out, and then closes the connection directly.
- (Exception B) After replying to A, B enters the CLOSE-WAIT state but never sends its own FIN. B then stays in the CLOSE-WAIT state indefinitely.
- (Exception C) After receiving B's ACK, A enters the FIN-WAIT-2 state and waits for B's FIN. A can still receive data from B during this time. In theory A stays in FIN-WAIT-2 until it receives B's FIN, but real implementations apply a timeout, after which the connection is closed directly. On Linux, for sockets the application has already closed, this is controlled by net.ipv4.tcp_fin_timeout, which defaults to 60 seconds (it was 180 seconds in 2.2 kernels).
- (Exception D) After sending its FIN, B enters the LAST-ACK state. If it receives no acknowledgment, B retransmits the FIN until the retries time out, and then closes the connection.
From TCP's three-way and four-way handshakes, the following conclusions can be drawn:
- Closing takes four segments because a TCP connection is full-duplex: the first two close the A-to-B direction, and the last two close the B-to-A direction.
- The TIME_WAIT state exists on the side that closes first (here the client) so that the last ACK reaches B with high probability. If the client closed the connection without waiting 2MSL and that ACK were lost, B's retransmitted FIN would find no connection left to acknowledge it, and B could not move from LAST-ACK to CLOSED normally.
- Barring attacks, the CLOSE_WAIT and TIME_WAIT states are the ones most likely to cause problems. CLOSE_WAIT piles up when the passive side (usually the server) does not close the connection, typically because the code forgot to call close. TIME_WAIT piles up when the client opens and closes too many connections in a short period of time; reusing connections on the client solves this.
- If many connections are stuck in other intermediate states, analyze them against the state transitions above and consider whether an attack is under way.
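A back-of-the-envelope calculation shows why too many TIME_WAIT sockets ends in "Can't assign requested address". Each connection the client closes to the same remote address and port keeps one local ephemeral port busy for the TIME_WAIT period, so the sustainable rate of new connections is roughly the number of ephemeral ports divided by that period. The port range 32768 to 60999 and the 60-second period below are assumed Linux defaults, not values measured on the nacosSync server.

```java
class TimeWaitBudget {
    /**
     * Upper bound on new connections per second from one local IP to one
     * remote ip:port before every ephemeral port is stuck in TIME_WAIT.
     */
    static double maxConnectionsPerSecond(int firstPort, int lastPort, int timeWaitSeconds) {
        int ports = lastPort - firstPort + 1;
        return (double) ports / timeWaitSeconds;
    }

    public static void main(String[] args) {
        // Assumed defaults: ephemeral range 32768-60999, TIME_WAIT of 60 seconds.
        System.out.printf("%.1f connections/second%n",
            maxConnectionsPerSecond(32768, 60999, 60));
    }
}
```

Under these assumptions the budget is a few hundred new connections per second, and a 120-second TIME_WAIT halves it. A client that opens a fresh connection for every request and heartbeat can reach that limit, which is why reusing connections helps.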
Conclusion
- Be aware that a client issuing requests over short-lived connections can accumulate too many TIME_WAIT sockets.
- When writing code, pay attention to possible exceptions. Resources that are no longer used (including but not limited to connections) must be released promptly.
- When using open source products, read the GitHub issues as thoroughly as possible and try to anticipate problems in advance. When a new version is released, pay attention to the bugs it fixes and the new features it introduces.
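As one example of releasing connections promptly, the sketch below shows a passive closer that reads until the peer's FIN (read() returns -1) and then closes its own socket through try-with-resources, so it does not linger in CLOSE_WAIT. The class name `CloseOnEof` and the single-connection design are assumptions made to keep the example short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class CloseOnEof {
    /** Accepts one connection, reads until the peer's FIN, then closes the socket. */
    static long serveOne(ServerSocket server) {
        // try-with-resources closes the accepted socket on every path,
        // including exceptions, so our side of the connection is always released.
        try (Socket peer = server.accept();
             InputStream in = peer.getInputStream()) {
            long total = 0;
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) { // -1 means the peer has sent its FIN
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            client.getOutputStream().write("hello".getBytes(StandardCharsets.UTF_8));
            client.shutdownOutput(); // the client sends its FIN
            System.out.println("bytes read: " + serveOne(server));
        }
    }
}
```

Without the try-with-resources block (or an equivalent finally), an exception inside the read loop would leave the accepted socket open and the connection stuck in CLOSE_WAIT until the process exits or the socket is garbage-collected.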