I. Background

Two months ago, one of our releases needed to integrate with the Tencent Conference APIs and receive callbacks for conference event changes. Tencent Conference sends HTTP requests through a webhook to a service in our test environment, referred to here as service A, with the assumed domain name a.test.abc.com.

When the callback configuration is saved, Tencent Conference verifies that the callback URL is reachable and requires a response within 3 seconds. The configuration can be saved only if the URL returns the expected response.

II. Symptoms

Saving the configuration failed with an exception. It succeeded only after more than ten repeated attempts (for details, see Section V at the end of this article). As shown in the figure:

III. Problem investigation

1. Check whether the interface contains complex logic

  • The interface is a simple GET request that returns a Base64-decoded string with no other processing.
  • The Controller layer contains no time-consuming AOP logic.
  • When the request is simulated with Postman from a local machine, over both the intranet and the Internet, the interface responds quickly, always within 200 ms.

2. Collect request data from the other party

We contacted the owner of the Tencent Conference backend, who searched their logs and found that the HTTP calls their backend made to the test-environment URL of service A had timed out.

3. Temporary solution

To avoid blocking test delivery, we temporarily added a forwarding layer in an online service hosted on Alibaba Cloud. Tencent Conference sends the request to the online service, which issues a second HTTP request with OkHttp and forwards it to the test service.

This forwarding layer temporarily resolved the problem, and business callbacks worked normally.

However, from the test environment's point of view, the Alibaba Cloud online server is also an external client. Why did calls from our own service to the test environment succeed while calls from the Tencent service timed out? Section V at the end of the article analyzes this.

4. Troubleshoot the network path

Before troubleshooting the network path, we deployed a simple web service exposing a callback interface on our own server and on the online server for comparison. From this we could basically conclude that the delay occurs when the Tencent Conference service accesses the test environment of service A.

4.1 Simulating an external request to the test environment interface and checking the time spent

Use curl to simulate an Internet request to the test environment interface and print the time spent in each phase:

  • Test from several Alibaba Cloud containers:

  • Test from a Tencent Cloud server:

You can see that domain name resolution (time_namelookup) accounts for most of the time.

By contrast, the TCP connection, the HTTP request and response, and the data transfer together take very little time, about 200 ms.

curl command example

curl -w "time_namelookup: %{time_namelookup}\ntime_connect: %{time_connect}\ntime_pretransfer: %{time_pretransfer}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n" -o /dev/null -s -L "https://a.test.abc.com/server-main/api/v1/meeting/disable/callback?check_str=xxx"

(Replace the URL with your own.)

Meaning of some curl -w variables

Parameter             Description
time_total            Total time of the whole operation, in seconds, with fractional precision
time_namelookup       DNS resolution time: from the start of the request until name resolution completes
time_connect          Time from the start until the TCP connection is established; includes DNS resolution (pure connect time = time_connect - time_namelookup)
time_appconnect       Time from the start until the SSL/SSH or similar handshake with the remote host completes
time_pretransfer      Time from the start until the transfer is just about to begin
time_redirect         Time spent on all redirect steps (DNS resolution, connect, pre-transfer, and transfer) before the final transaction starts
time_starttransfer    Time from the start until the first byte is about to be transferred; includes the time the web server needs to produce the response after the client sends the request
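Because these timers are cumulative from the start of the request, the cost of each phase is the difference between consecutive timers. The following is a minimal sketch of that arithmetic, assuming the five values printed by the curl command above are passed in order and that no redirect was followed. The function name phase_breakdown is hypothetical and is not part of curl.

```shell
#!/bin/sh
# phase_breakdown is a hypothetical helper, not part of curl.
# Arguments, in order (all in seconds, as printed by curl -w):
#   time_namelookup time_connect time_pretransfer time_starttransfer time_total
# Assumption: no redirect was followed, so the timers describe one transaction.
phase_breakdown() {
  awk -v dns="$1" -v conn="$2" -v pre="$3" -v start="$4" -v total="$5" 'BEGIN {
    printf "dns_lookup:    %.3f\n", dns            # name resolution
    printf "tcp_connect:   %.3f\n", conn - dns     # TCP connect only
    printf "tls_and_setup: %.3f\n", pre - conn     # handshake and request setup
    printf "server_wait:   %.3f\n", start - pre    # waiting for the first byte
    printf "data_transfer: %.3f\n", total - start  # receiving the body
  }'
}
```

For example, `phase_breakdown 5.012 5.045 5.120 5.190 5.195` (made-up numbers, not measurements from this investigation) makes it obvious when dns_lookup dominates and the remaining phases are small.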

4.2 Observing inbound traffic at the service

Use tcpdump to capture packets at the traffic entry point of service A, copy the capture file to a local PC, and analyze it with Wireshark. The time between establishing the TCP connection and initiating the HTTP request is indeed very short, which matches the timings printed above.

tcpdump command

tcpdump   -i  eth0  -s  0  -w  tcp.pcap

4.3 Observing the traffic details of the HTTP request from the external caller

Packet capture and analysis with tcpdump shows that the DNS server returns the IP address for the domain name within 0.3 ms. After this step, however, the client does not establish the TCP connection right away. Instead, it goes on to issue an AAAA query for a.test.abc.com. This step takes more than 5 seconds, which is consistent with the time measured in 4.1.

Each curl request issues DNS queries for both the IPv4 (A) and the IPv6 (AAAA) record. The IPv4 query returns quickly, while the IPv6 query often takes up to 5 seconds (IPv6 queries are forwarded to the upstream DNS) and ends with NXDOMAIN and no IP address.
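The difference between the A and AAAA lookups can also be checked without a packet capture by querying each record type separately. The following is a minimal sketch, assuming dig is installed on the client machine. The helper names query_time_ms, query_status, and compare_record_types are hypothetical, and the parsing relies on the ";; Query time: N msec" and "status: ..." lines that dig prints.

```shell
#!/bin/sh
# Hypothetical helpers for comparing A and AAAA lookups with dig.

# Read dig output on stdin and print the number from ";; Query time: N msec".
query_time_ms() {
  awk '/Query time:/ { print $4 }'
}

# Read dig output on stdin and print the response status (NOERROR, NXDOMAIN, ...).
query_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

# Query the A and AAAA records of one host and print one summary line each.
# +tries=1 disables retries; +time=10 waits long enough to see a slow answer.
compare_record_types() {
  host="$1"
  for rtype in A AAAA; do
    answer=$(dig +tries=1 +time=10 "$host" "$rtype")
    printf '%s %s status=%s time_ms=%s\n' "$host" "$rtype" \
      "$(printf '%s\n' "$answer" | query_status)" \
      "$(printf '%s\n' "$answer" | query_time_ms)"
  done
}
```

Running `compare_record_types a.test.abc.com` from the affected client should show whether the AAAA line is the slow one. Note that dig queries the resolver configured on that machine, so the result reflects that client's DNS path.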

4.4 Verifying the cause

Add -4 to the curl request to force IPv4 and check whether the delay comes from the IPv6 address lookup. With -4 the response was very fast, which confirmed the cause.
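The comparison can be sketched as a pair of small shell functions that run the same request once with curl's default behavior and once with -4. The function names lookup_time and compare_ipv4_only are hypothetical; only the curl options (-4, -s, -o, -w) are real.

```shell
#!/bin/sh
# Hypothetical helpers; only the curl options themselves are real.

# Print time_namelookup and time_total for one request.
# Extra arguments after the URL (for example -4) are passed through to curl.
lookup_time() {
  url="$1"
  shift
  curl "$@" -s -o /dev/null \
    -w 'time_namelookup: %{time_namelookup}  time_total: %{time_total}\n' "$url"
}

# Run the same request with the default address-family behavior and with IPv4 only.
compare_ipv4_only() {
  printf 'default: '; lookup_time "$1"
  printf 'ipv4:    '; lookup_time "$1" -4
}
```

Usage: `compare_ipv4_only "https://a.test.abc.com/server-main/api/v1/meeting/disable/callback?check_str=xxx"`. If the AAAA lookup is the cause, the time_namelookup value on the ipv4 line should be much smaller than on the default line.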

IV. Solution

With the findings and the packet captures in hand, we asked the O&M team for help. In the end, they configured a stub IPv6 (AAAA) response for the wildcard CNAME *.test.abc.com, so that such queries return NXDOMAIN directly without any further processing. This solved the problem.

V. Other questions encountered along the way

1. Requests from the Alibaba Cloud online environment to the test environment are also slow, so why did the temporary forwarding solution work?

OkHttp resolves the domain name in the HTTP request URL, by default through the JDK's InetAddress. The first time a domain name is resolved, the JVM caches the mapping from domain name to IP address in-process. This cache is separate from the operating-system-level DNS cache. Its lifetime is controlled by the JVM's networkaddress.cache.ttl setting rather than by the DNS record's TTL, and in some configurations entries never expire.

So the first request may take a long time. After that, subsequent HTTP requests use the cached IP address and do not need to resolve the domain name again.

See java.net.InetAddress#getAllByName0(java.lang.String, java.net.InetAddress, boolean)

2. Saving in Tencent Conference succeeds only occasionally after many attempts

Our guess is that the same IP caching mechanism is at work. Most web services today are deployed across multiple containers, so among many HTTP requests, as soon as two land on the same machine, the second can use the IP address cached by that component and send the request directly.