Coordinating with external systems has always been a troublesome problem, especially failures caused by basic environment configuration. The author recently solved a probabilistic failure in calls to an external service, more or less by accident. The investigation process is written up here in the hope that readers will know where to start when they run into this kind of problem.
The cause
The author's new system was about to go live and needed a PE to operate on it. But the responsible PE was tied up in a dispute with another developer over a different problem, so the author ended up waiting for half an hour. In the interest of getting the system launched sooner, I wondered whether I could help them deal with that problem quickly, so I could get back to coding as soon as possible.
On inquiry, it turned out this problem had already dragged on for three months. The symptoms were as follows:
Each client call failed with a probability of nearly 1/2 and reported an error (a SocketTimeoutException):
Troubleshooting
Talking it over with the appServer developers and the corresponding PE, we noted that the appServer uses short connections to nginx, and that the error was a SocketTimeoutException rather than a connect failure, so the leg between the appServer and nginx could be ruled out. Checking the logs on nginx then revealed a strange phenomenon, as shown in the following figure:
All appServer calls to one of the two nginx instances always succeeded, while calls to the other failed with high probability. The two nginx machines were configured identically. The second oddity: only calls to this particular peer service failed; all other business traffic through the same nginx was unaffected, as shown below:
These two strange phenomena left development and PE at loggerheads. Going by the first phenomenon — one nginx works while the other does not — it is reasonable to infer that the second nginx has a problem, so development demanded that nginx be replaced.
Going by the second phenomenon — the error occurs only when this one service is called, and every other service is fine — the fault must lie with the peer service's server, so PE held that nginx was not to blame.
After a long argument, the tentative plan was to scale out nginx and see what happened. The author felt this plan was not sound: blind expansion might even make things worse. Better to just capture packets and see what was actually going on.
Packet capture
In fact, I did not think nginx was the problem either; it is far too widely used and generic a component. The problem was more likely on the peer server. According to the peer service's developer, curl from his own server never failed; he even ran curl on his server N times on the spot without a single error. (Because this problem had gone unsolved for so long, he had been sent over to our company to assist in the investigation.)
So a network engineer captured packets just outside the firewall. The results were as follows:
```
Time                 Source IP   Destination IP   Info
2019-07-25 16:45:41  20.1.1.1    30.1.1.1         TCP 58850 -> 443 [SYN]
2019-07-25 16:45:42  20.1.1.1    30.1.1.1         TCP [TCP Retransmission] 58850 -> 443 [SYN]
2019-07-25 16:45:44  20.1.1.1    30.1.1.1         TCP [TCP Retransmission] 58850 -> 443 [SYN]
```
The ReadTimeout set on the appServer is 3s: the first SYN goes out at 16:45:41 and is retransmitted at +1s and +3s with no response, so right around the second SYN retransmission the 3s timeout fires and our end reports the error.
As shown below:
(Note: tcp_syn_retries was set to 2 on the Linux server where nginx resides, which is why there are exactly two retransmissions.)
Analysis of packet capture results
The second nginx sends a SYN to the peer service and gets no response at all, so nginx2's connection attempt times out, which in turn causes the ReadTimeout on the appServer (the appServer uses short connections to nginx).
The natural inference is that the SYN is being lost somewhere between our firewall and the peer service. But Alibaba Cloud, being a very stable provider, should not be dropping packets at anywhere near this rate. And since the peer server runs on the very mature Spring Boot, a framework bug of this kind is unlikely. Most probably, the peer server's own configuration is at fault.
Logging in to the peer server to troubleshoot
Since the other side's developer was on site, the author simply used his laptop to log in to their Alibaba Cloud server. A first look at dmesg showed a pile of errors, as below:
It felt relevant, but this information alone could not pinpoint the problem. Next I ran netstat -s:
This command produced the crucial clue, which translates to: 16,990 passive connections rejected because of timestamp! A little research showed that this rejection of passive connections happens under NAT when tcp_tw_recycle=1 is combined with tcp_timestamps=1 (the default). Our client calls also go out through NAT, which matches every characteristic of this problem. So I tried setting their tcp_timestamps to 0.
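For reference, a minimal sketch of the check and the change applied, assuming the standard sysctl key names (the counter string is the one netstat -s prints for LINUX_MIB_PAWSPASSIVEREJECTED; exact wording may vary by kernel version):

```
# the passive-rejection counter surfaced by netstat -s
netstat -s | grep -i 'passive connections rejected'
#   16990 passive connections rejected because of time stamp

# the two knobs involved (both were 1 on the failing server)
sysctl net.ipv4.tcp_tw_recycle net.ipv4.tcp_timestamps

# the workaround applied on site (disabling tcp_tw_recycle instead also works)
sysctl -w net.ipv4.tcp_timestamps=0
```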
Dozens more test calls went through, and not a single error was reported!
Linux source code analysis
Although the problem was solved, I wanted to see at the source level what exactly was going on, so I dug into the corresponding kernel source (based on linux-2.6.32). nginx initiates the handshake by sending the first SYN to the peer server, so we trace how the kernel handles an incoming SYN on a listening socket:
The incoming SYN travels through tcp_v4_do_rcv into tcp_rcv_state_process, which for a socket in the LISTEN state calls tcp_v4_conn_request — and that is where the tcp_timestamps logic lives, so let's follow it (irrelevant logic omitted below):
```c
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
	......
	/* VJ's idea. We save last timestamp seen
	 * from the destination in peer table, when entering
	 * state TIME-WAIT, and check against it before
	 * accepting new connection request.
	 *
	 * When a connection enters TIME_WAIT, the last timestamp seen from
	 * that peer IP is recorded in the peer table; the recorded timestamp
	 * is then checked when a new connection request arrives.
	 */
	/* this branch is entered only with tcp_timestamps and tcp_tw_recycle enabled */
	if (tmp_opt.saw_tstamp &&
	    tcp_death_row.sysctl_tw_recycle &&
	    (dst = inet_csk_route_req(sk, req)) != NULL &&
	    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
	    /* the request comes from the same peer IP */
	    peer->v4daddr == saddr) {
		/* TCP_PAWS_MSL == 60, TCP_PAWS_WINDOW == 1 */
		/* peer->tcp_ts_stamp: local time recorded when a connection from
		 * this peer IP entered TIME_WAIT; first check that we are within
		 * one minute of that record */
		if (get_seconds() < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
		    /* peer->tcp_ts: timestamp of the last packet seen from this IP;
		     * the new request's timestamp (req->ts_recent) is smaller */
		    (s32)(peer->tcp_ts - req->ts_recent) > TCP_PAWS_WINDOW) {
			/* count a passive connection rejection */
			NET_INC_STATS_BH(sock_net(sk),
					 LINUX_MIB_PAWSPASSIVEREJECTED);
			goto drop_and_release;
		}
	}
	......
}
```
That is: when tcp_timestamps and tcp_tw_recycle are both enabled, if a new connection from the same peer IP arrives within one minute of the last connection from that IP entering the TIME_WAIT state, and the new connection's timestamp is smaller than the timestamp recorded when that connection entered TIME_WAIT, the SYN is discarded via drop_and_release.
Let’s continue tracking drop_and_release:
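Abridged from the tail of tcp_v4_conn_request in linux-2.6.32, the labels look like this:

```c
/* tail of tcp_v4_conn_request() in net/ipv4/tcp_ipv4.c (linux-2.6.32), abridged */
drop_and_release:
	dst_release(dst);   /* release the route entry looked up earlier */
drop_and_free:
	reqsk_free(req);    /* free the connection request */
drop:
	return 0;           /* no SYN-ACK (and no RST) has been queued */
}
```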
Let’s look at what happens if tcp_v4_conn_request returns 0:
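The caller is tcp_rcv_state_process; its TCP_LISTEN branch in linux-2.6.32 looks roughly like this (abridged):

```c
/* net/ipv4/tcp_input.c (linux-2.6.32), tcp_rcv_state_process(), abridged */
case TCP_LISTEN:
	if (th->ack)
		return 1;

	if (th->rst)
		goto discard;

	if (th->syn) {
		/* tcp_v4_conn_request() is reached via this ops pointer */
		if (icsk->icsk_af_ops->conn_request(sk, skb) < 0)
			return 1;

		/* conn_request returned 0: the SYN skb is simply freed.
		 * In the drop_and_release case no SYN-ACK was queued,
		 * so nothing at all goes back to the client. */
		kfree_skb(skb);
		return 0;
	}
	goto discard;
```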
Tracing the source shows that in this case the SYN packet is simply discarded with no response of any kind, so the client's SYN retransmissions go unanswered as well — exactly consistent with the packet capture.
Matching this against the symptoms
First: why did only one of the two nginx machines fail?
The TCP timestamp is not the wall-clock time that the date command prints on the host. How it is computed is beyond the scope of this article; what matters is that each machine's timestamp is different, and the values can differ enormously (in Linux it is derived from jiffies, i.e. roughly the time since boot).
Because we call the peer through NAT, the two nginx machines appear to the peer server as one and the same IP address, so their timestamps get mixed together on the peer server. nginx1's timestamps happen to be larger than nginx2's, so within one minute after any nginx1 connection enters TIME_WAIT, every nginx2 connection request (and these are short connections, so they come constantly) carries a smaller timestamp and is discarded.
As shown below:
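To make this concrete, here is a minimal, runnable sketch of the same check with hypothetical timestamp values (all numbers are invented for illustration; the struct and helper are simplifications, not kernel code):

```c
#include <stdio.h>

#define TCP_PAWS_MSL    60  /* seconds a TIME_WAIT record stays valid */
#define TCP_PAWS_WINDOW  1

/* what the server remembers per peer IP (both nginx machines share one NAT IP) */
struct peer_record {
    unsigned int tcp_ts;       /* last timestamp seen from this IP */
    long         tcp_ts_stamp; /* local time when it entered TIME_WAIT */
};

/* same condition as the kernel snippet above */
static int syn_accepted(const struct peer_record *peer, long now,
                        unsigned int syn_ts)
{
    if (now < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
        (int)(peer->tcp_ts - syn_ts) > TCP_PAWS_WINDOW)
        return 0; /* dropped: counted as LINUX_MIB_PAWSPASSIVEREJECTED */
    return 1;
}

int main(void)
{
    /* hypothetical: a nginx1 connection (ts around 3,000,000) just entered
     * TIME_WAIT; nginx1 has been up far longer, so its timestamps dwarf nginx2's */
    struct peer_record peer = { .tcp_ts = 3000000, .tcp_ts_stamp = 1000 };

    printf("nginx1 SYN (ts=3000050): %s\n",
           syn_accepted(&peer, 1010, 3000050) ? "accepted" : "dropped");
    printf("nginx2 SYN (ts=50000):   %s\n",
           syn_accepted(&peer, 1010, 50000) ? "accepted" : "dropped");
    return 0;
}
```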
Why did the peer's own self-test always pass?
Because when the machine curls itself, both timestamps come from one and the same machine, so there is no confusion.
Why did nginx2 have no trouble calling other services?
Because tcp_tw_recycle was not enabled on the servers hosting those other external services. This problem could equally have been fixed by setting tcp_tw_recycle to 0 on the peer server. Incidentally, the tcp_tw_recycle parameter has been removed from newer Linux kernels (as of 4.12).
Conclusion
Because IP addresses are in short supply these days, most network architectures use NAT to interact with external networks — and that is exactly the scenario in which setting tcp_tw_recycle to 1 causes trouble.
Problems like this usually require some understanding of the TCP protocol before you can trace them to the root cause.
Original link: blog.51cto.com/14528283/24…