Article source: blog.csdn.net/shootyou/ar…


Yesterday I resolved a server exception caused by incorrect HttpClient usage: blog.csdn.net/shootyou/ar…

As mentioned in that analysis, inspecting the server's network state revealed a large number of connections stuck in CLOSE_WAIT.


During routine server maintenance, the following command is often used:


netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

It displays information such as the following:

TIME_WAIT 814

CLOSE_WAIT 1

FIN_WAIT1 1

ESTABLISHED 634

SYN_RECV 2

LAST_ACK 1

The three most common states are ESTABLISHED, TIME_WAIT, and CLOSE_WAIT.
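As a cross-check on the one-liner above, the same tally can be computed by parsing /proc/net/tcp directly. This is a sketch, assuming Linux; the hex codes in the "st" column follow the kernel's tcp_states.h:

```python
# Rough Python equivalent of the netstat/awk one-liner: count TCP
# connections by state using /proc/net/tcp (Linux only).
from collections import Counter

# State codes as printed by the kernel (uppercase hex), per tcp_states.h.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT",   "03": "SYN_RECV",
    "04": "FIN_WAIT1",   "05": "FIN_WAIT2",  "06": "TIME_WAIT",
    "07": "CLOSE",       "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN",      "0B": "CLOSING",
}

def count_states(lines):
    """Count TCP states given the lines of /proc/net/tcp (header included)."""
    counts = Counter()
    for line in lines[1:]:                    # skip the header line
        fields = line.split()
        if len(fields) >= 4:                  # field 3 is the "st" column
            counts[TCP_STATES.get(fields[3], "UNKNOWN")] += 1
    return dict(counts)

if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            for state, n in sorted(count_states(f.readlines()).items()):
                print(state, n)
    except OSError:
        pass  # not on Linux / no procfs available
```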


I won't define every state here; instead, take a look at the TCP state-transition diagram below. Note that the "server" in this discussion is the machine receiving the business requests:

(Figure: TCP state-transition diagram)

You don't have to memorize all of these states; just understand the three common ones mentioned above. In day-to-day operation there is usually no need to check the network state at all, but when a server misbehaves, 80–90% of the time it is one of these two situations:

1. The server maintains a large number of TIME_WAIT states

2. The server keeps a large number of CLOSE_WAIT states

Because Linux limits the number of file handles a user process may hold (see: blog.csdn.net/shootyou/ar…), every connection held in TIME_WAIT or CLOSE_WAIT keeps a handle occupied. Once the limit is reached, no new requests can be accepted, "Too many open files" exceptions pile up, and Tomcat crashes…

Let's discuss how to handle each situation. Much of the material online confuses the two and suggests that tuning kernel parameters solves both, which is not accurate: kernel tuning can readily relieve TIME_WAIT, but curing CLOSE_WAIT requires fixing the program itself. Now let's look at the two cases in turn:


1. The server maintains a large number of TIME_WAIT states

This situation is quite common. Crawler servers and web servers (when the ops team did not tune kernel parameters at install time) often run into it. How does the problem arise?

As the state diagram shows, TIME_WAIT is held by the side that actively closes the connection. A crawler server is itself the "client": after finishing a crawl it initiates the close, enters TIME_WAIT, holds that state for 2MSL (twice the Maximum Segment Lifetime), and only then fully releases the connection's resources. Why do that? Why hold on to resources after the connection is already closed? The TCP/IP designers specified this for two main reasons:

1. To prevent stray packets from the old connection from reappearing after a delay and corrupting a new connection on the same address/port pair (after 2MSL, every duplicate packet from the old connection has expired).

2. To close the TCP connection reliably. If the final ACK sent by the active closer is lost, the passive side retransmits its FIN. Were the active closer already in CLOSED at that point, it would answer the retransmitted FIN with an RST instead of an ACK. So the active closer must linger in TIME_WAIT rather than go straight to CLOSED.

TIME_WAIT entries are reclaimed on a timer, so the state does not consume large amounts of resources unless the server handles a very large number of connections in a short period or is under attack.

Here’s a quote about MSL:


MSL is the maximum time a TCP segment may survive in the network on its way from source to destination. The Internet wasn't always as fast as it is today: imagine typing an address into your browser and waiting four minutes for the first byte to appear. That is all but impossible in today's network environment, so we can shorten TIME_WAIT considerably and free ports up faster for other connections.

Here’s another quote from a web source:


1. It is worth noting that for HTTP over TCP it is the server that closes the TCP connection, so the server enters TIME_WAIT. For a high-traffic web server this means a lot of TIME_WAIT entries: at 1,000 requests per second with a 240-second TIME_WAIT, the backlog is 240 × 1,000 = 240,000 records, and maintaining these states is a burden on the server. Modern operating systems use fast lookup structures to manage them, so checking whether a new TCP connection request hits a TIME_WAIT entry is cheap, but it is still never good to have so many states to maintain.
2. HTTP/1.1 made keep-alive the default behavior, reusing one TCP connection for multiple request/response pairs, in large part because of this very problem.

In other words, real HTTP interaction differs from the diagram above: it is not the client but the server that closes the connection, which is why web servers also accumulate a lot of TIME_WAIT.


So how do we solve this problem?


In short: let the server recycle and reuse TIME_WAIT resources quickly.


Add the following to /etc/sysctl.conf:


# Number of SYN retransmissions before the kernel gives up on a new
# outgoing connection. Should not exceed 255; the default of 5
# corresponds to roughly 180 seconds.
net.ipv4.tcp_syn_retries = 2
#net.ipv4.tcp_synack_retries = 2
# How long a connection must be idle before TCP starts sending keepalive
# probes (when keepalive is enabled). Default is 2 hours.
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_orphan_retries = 3
# How long a socket stays in FIN-WAIT-2 after the local end has closed it.
net.ipv4.tcp_fin_timeout = 30
# Length of the SYN queue. Default is 1024; raise it to queue more
# connections waiting to be accepted.
net.ipv4.tcp_max_syn_backlog = 4096
# Enable SYN cookies: when the SYN queue overflows, cookies are used to
# fend off small-scale SYN flood attacks. Default is 0 (disabled).
net.ipv4.tcp_syncookies = 1

# Enable reuse: allow sockets in TIME_WAIT to be reused for new TCP
# connections. Default is 0 (disabled).
net.ipv4.tcp_tw_reuse = 1
# Enable fast recycling of TIME_WAIT sockets. Default is 0 (disabled).
net.ipv4.tcp_tw_recycle = 1

# Reduce the number of keepalive probes sent before declaring a
# connection dead.
net.ipv4.tcp_keepalive_probes = 5
# Enlarge the network device receive queue.
net.core.netdev_max_backlog = 3000



Run /sbin/sysctl -p for the parameters to take effect.
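To verify what the kernel is actually using, each key can also be read back from procfs, since every sysctl key maps to a path under /proc/sys. A small sketch, assuming Linux:

```python
# Read back the current value of a sysctl key from procfs (Linux only).
# Each dotted key maps to a file: net.ipv4.tcp_fin_timeout
# -> /proc/sys/net/ipv4/tcp_fin_timeout

def read_sysctl(key):
    """Return the current value of a sysctl key as a string, or None
    if the key does not exist on this system."""
    path = "/proc/sys/" + key.replace(".", "/")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

if __name__ == "__main__":
    for key in ("net.ipv4.tcp_fin_timeout",
                "net.ipv4.tcp_tw_reuse",
                "net.ipv4.tcp_keepalive_time"):
        print(key, "=", read_sysctl(key))
```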


The main parameters to note here are:

net.ipv4.tcp_tw_reuse

net.ipv4.tcp_tw_recycle

net.ipv4.tcp_fin_timeout

net.ipv4.tcp_keepalive_*


net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle both enable reclaiming resources held in the TIME_WAIT state.

net.ipv4.tcp_fin_timeout shortens the time a server stays in FIN-WAIT-2 before moving to TIME_WAIT when the peer misbehaves.

The net.ipv4.tcp_keepalive_* family configures how the server probes whether a connection is still alive.

For more on keepalive, see hi.baidu.com/tantea/blog…
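Keepalive can also be configured per socket from application code, rather than system-wide. A hedged sketch using Python's socket module; the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT per-socket overrides are Linux-specific, hence the guards:

```python
import socket

def enable_keepalive(sock, idle=1200, interval=75, probes=5):
    """Turn on TCP keepalive for one socket and, where the platform
    supports it, override the kernel-wide tcp_keepalive_* defaults."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Idle seconds before the first probe (tcp_keepalive_time analogue).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between unanswered probes (tcp_keepalive_intvl analogue).
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Probes before the connection is declared dead (tcp_keepalive_probes).
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```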


[Update 2015-01-13]

Note the risks associated with tcp_tw_recycle: blog.csdn.net/wireless_te…
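As an application-level alternative to the kernel knobs above, the side that actively closes can avoid TIME_WAIT entirely with SO_LINGER and a zero timeout, which makes close() send an RST instead of the normal FIN handshake. This is a sketch of the mechanism, not a recommendation: any unsent data is discarded, so it is only appropriate where that loss is acceptable:

```python
import socket
import struct

def close_with_rst(sock):
    """Close a socket without leaving TIME_WAIT behind: SO_LINGER with
    l_onoff=1 and l_linger=0 makes close() abort the connection with an
    RST. Unsent data in the kernel buffer is discarded."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))  # l_onoff=1, l_linger=0
    sock.close()
```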


2. The server keeps a large number of CLOSE_WAIT states

Take a breather here. What began as a quick note on the difference between TIME_WAIT and CLOSE_WAIT has turned into a deeper dig than I expected.


TIME_WAIT can be addressed by tuning server kernel parameters, because TIME_WAIT is under the server's own control: it results either from a peer's connection anomaly or from the server not reclaiming resources quickly enough, not from a bug in your own program.

CLOSE_WAIT is different. As the state diagram shows, a connection lingers in CLOSE_WAIT in essentially one case: the peer has closed the connection, but our side never sends its own FIN. In other words, the application either fails to notice that the connection was closed or simply forgets to close it, so the resource stays occupied by the program. In my view no kernel parameter can fix this: the kernel has no right to reclaim resources held by a running program, short of terminating the program.


If you are battling a large number of CLOSE_WAIT connections caused by HttpClient, this post may be useful to you: blog.csdn.net/shootyou/ar…

In that post I used a scenario to illustrate the difference between CLOSE_WAIT and TIME_WAIT.

Server A is a crawler server. It uses a simple HttpClient to request resources from the Apache instance on resource server B. Under normal circumstances, if the request succeeds, server A actively closes the connection after fetching the resource; since it closed actively, server A's connection ends up in TIME_WAIT. And if an exception occurs? Suppose server B closes the connection first (for example, the request times out or errors on its side); if server A's code then fails to release the connection, because HttpClient never closes it, server A is left holding CLOSE_WAIT.
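The cure on the application side is structural: release the connection on every code path, including error paths. A minimal illustration with a raw socket and a finally block (a sketch of the principle; real HttpClient code should likewise release its connections in a finally block):

```python
import socket

def http_get(host, path="/", port=80, timeout=10):
    """Minimal HTTP/1.0 GET over a raw socket. The finally block
    guarantees the socket is closed on every path, so the kernel can
    finish the FIN handshake instead of parking the connection in
    CLOSE_WAIT after the peer closes."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # peer closed: response fully received
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sock.close()              # runs even when sendall/recv raises
```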


So the fix for a flood of CLOSE_WAIT boils down to one sentence: check your code. The problem lives in the server program.


References:

1. Handling TIME_WAIT on Windows: blog.miniasp.com/post/2010/1…

2. WebSphere server optimization notes, of some reference value: publib.boulder.ibm.com/infocenter/…

3. The meanings of the various kernel parameters: haka.sharera.com/blog/BlogTo…

4. Sysctl network optimization for Linux servers: blog.csdn.net/chinalinuxz…
