I. The fault

The basic architecture is shown in the figure. The client initiates an HTTP request to Nginx, which forwards the request to the gateway, which in turn forwards the request to the back-end microservice.

The symptom: every so often — anywhere from ten minutes to several hours apart — the client receives one or more consecutive request-timeout errors. The nginx access log shows status 499 for these requests, while the gateway log shows no corresponding request at all.

From the logs, the problem must lie in either nginx or Spring Cloud Gateway.

nginx version: 1.14.2; Spring Cloud version: Greenwich.RC2.

Nginx configuration is as follows:

[root@wh-hlwzxtest1 conf]# cat nginx.conf

worker_processes  8;

events {
    use epoll;
    worker_connections  10240;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    sendfile       on;
    tcp_nopush     on;
    tcp_nodelay    on;

    keepalive_timeout  65;
    #gzip  on;

    upstream dbg2 {
        server 10.201.0.28:8888;
        keepalive 100;
    }

    server {
        listen       80;
        server_name  localhost;
        charset      utf-8;

        location /dbg2/ {
            proxy_pass http://dbg2/;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}

To improve performance, nginx forwards requests to the gateway over HTTP/1.1 so that TCP connections can be reused.
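This reuse can be seen with a short Python sketch (a local toy server and `http.client`, purely illustrative — not the nginx/gateway setup): two requests issued on one `HTTPConnection` go out over the same TCP socket.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enable keep-alive on the server side

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output clean
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One HTTPConnection = one TCP connection; both requests reuse it.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
ports = []
for _ in range(2):
    conn.request("GET", "/")
    resp = conn.getresponse()
    resp.read()
    ports.append(conn.sock.getsockname()[1])  # local port of the underlying socket

print(ports[0] == ports[1])  # same local port -> same TCP connection was reused
conn.close()
server.shutdown()
```

If the local ports match across requests, no new TCP handshake happened — which is exactly the behavior `proxy_http_version 1.1` plus an empty `Connection` header buys us, and also the behavior that goes wrong when one side silently drops the connection.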

II. Troubleshooting

1. View TCP connections

[root@wh-hlwzxtest1 logs]# ss -n | grep 10.201.0.27:8888
tcp  ESTAB  0  0  10.197.0.38:36674  10.201.0.27:8888
tcp  ESTAB  0  0  10.197.0.38:40106  10.201.0.27:8888

[root@eureka2 opt]# ss -n | grep 10.197.0.38
tcp  ESTAB  0  0  ::ffff:10.201.0.27:8888  ::ffff:10.197.0.38:40106
tcp  ESTAB  0  0  ::ffff:10.201.0.27:8888  ::ffff:10.197.0.38:39266

Only one connection, (10.197.0.38:40106, 10.201.0.27:8888), appears on both hosts. nginx's socket on port 36674 has no counterpart on the gateway, and the gateway's socket facing port 39266 has no counterpart on nginx. The likely cause: one end closed its TCP connection abnormally without notifying the peer, or the peer never received the notification.
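A half-open connection of this kind can be spotted mechanically by mirroring the two `ss` listings against each other. A minimal sketch, using the address pairs transcribed from the output above:

```python
# (local, peer) pairs transcribed from the two ss listings above.
nginx_side = {
    ("10.197.0.38:36674", "10.201.0.27:8888"),
    ("10.197.0.38:40106", "10.201.0.27:8888"),
}
gateway_side = {
    ("10.201.0.27:8888", "10.197.0.38:40106"),
    ("10.201.0.27:8888", "10.197.0.38:39266"),
}

# A connection is healthy only if its mirror image exists on the other host.
half_open_on_nginx = {(l, p) for (l, p) in nginx_side if (p, l) not in gateway_side}
half_open_on_gateway = {(l, p) for (l, p) in gateway_side if (p, l) not in nginx_side}

print(half_open_on_nginx)    # the 36674 socket has no peer on the gateway
print(half_open_on_gateway)  # the 39266 socket has no peer on nginx
```

Each side is holding one connection the other side knows nothing about — the signature of an abnormal close that never reached the peer.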

2. Packet capture analysis

Packet capture data on the nginx side:

No. 8403: sends the HTTP request to the gateway;

No. 8404: no ACK received within one RTT, so the packet is retransmitted;

No. 8505: RTT is about 0.2 s; TCP retransmits;

No. 8506: no ACK received within 0.4 s; TCP retransmits;

No. 8507: no ACK received within 0.8 s; TCP retransmits;

No. 8509: no ACK received within 1.6 s; TCP retransmits;

...

No. 8439: no ACK received within 28.1 s (128 × RTT); TCP retransmits;

No. 8408: the configured request timeout (3 s) has expired, so nginx sends a FIN packet.
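The intervals above follow TCP's exponential backoff: each retry waits twice as long as the previous one, so the last interval reaches 128 × RTT. A back-of-the-envelope sketch (not the kernel's actual RTO logic):

```python
rtt = 0.2  # round-trip time observed in the capture, seconds
intervals = [rtt * 2 ** k for k in range(8)]  # 0.2, 0.4, 0.8, ... 25.6

print(intervals[-1])             # 25.6 s = 128 * RTT, the final retry interval
print(round(sum(intervals), 1))  # total time spent retransmitting
```

The doubling explains why a dead connection can tie up a request for tens of seconds before the sender gives up.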

Take a look at the packet capture data of the gateway:

No. 1372: at 17:24:31, receives an ACK packet from nginx, corresponding to No. 1348 in the nginx capture (the nginx server's clock runs about 1 minute 30 seconds fast);

No. 4221: two hours later, sends the first TCP keep-alive probe (the nginx capture confirms the connection was idle for those two hours);

No. 4253: 75 s later, sends the TCP keep-alive probe again;

No. 4275: another 75 s later, sends the probe again;

... nine probes in a row;

No. 4489: sends an RST packet to reset the connection.

The 2 hours, 75 seconds, and 9 probes match the system defaults:

[root@eureka2 opt]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@eureka2 opt]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@eureka2 opt]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

See article: Why do TCP-based applications need heartbeat packets
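For completeness: an application can override these defaults per socket instead of editing the sysctls. A sketch using the Linux-specific socket options (the 60/10/3 values are illustrative, not from the article's servers):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Per-socket overrides of tcp_keepalive_time / _intvl / _probes (Linux only):
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before reset

idle = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
sock.close()

# With the system defaults (7200 s + 9 * 75 s) a dead peer is detected only
# after 7875 s; with these values it would take 60 + 3 * 10 = 90 s.
print(idle)
```

With the defaults shown above, the gateway needed 7200 + 9 × 75 = 7875 seconds — well over two hours — to decide the nginx side was gone.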

3. Analysis

The packet captures above basically confirm that the problem lies with nginx. At 19:25 the gateway began sending TCP keep-alive probes to the nginx server; the connection still existed on the nginx side, yet nginx never responded, so the gateway eventually reset the connection. At 22:20 nginx sent an HTTP request over that same connection, but the gateway had already closed it, so no reply ever came.

III. The solution

1. proxy_send_timeout

First, look at nginx's timeout directives for the upstream connection:

proxy_connect_timeout: timeout for establishing the connection between nginx and the upstream server;

proxy_read_timeout: timeout for nginx receiving data from the upstream server (default 60 s): if no byte arrives within 60 s, the connection is closed;

proxy_send_timeout: timeout for nginx sending data to the upstream server (default 60 s): if no byte is sent within 60 s, the connection is closed.

These directives are scoped to a single HTTP exchange. proxy_send_timeout = 60s does not mean the connection is closed after 60 s without an HTTP request; it means that, while a request is being sent, the connection is closed if more than 60 seconds pass between two successive write operations. So these directives are not what we need.
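To make the "between two write operations" semantics concrete, here is a toy model (a hypothetical helper, not nginx code): the timer restarts after every successful write, so a slow but steady sender never times out.

```python
import time

def send_with_timeout(chunks, write, send_timeout, now=time.monotonic):
    """Fail only if the gap between two successive write operations
    exceeds send_timeout (nginx proxy_send_timeout-style semantics)."""
    last_write = now()
    for chunk in chunks:
        if now() - last_write > send_timeout:
            raise TimeoutError("no byte sent within send_timeout")
        write(chunk)
        last_write = now()  # the timer restarts after every successful write

# Fake clock: start, then (check, after-write) timestamps for each chunk.
times = iter([0.0, 0.1, 0.2, 50.0, 55.0])
out = []
send_with_timeout([b"a", b"b"], out.append, 60, now=lambda: next(times))
print(out)  # both chunks written: every inter-write gap stayed under 60 s
```

A transfer can therefore run far longer than 60 s in total without tripping the timeout, which is why these directives cannot close a connection that is merely idle between requests.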

2. keepalive_timeout (upstream)

From the ngx_http_upstream_module documentation:

Syntax:   keepalive_timeout timeout;
Default:  keepalive_timeout 60s;
Context:  upstream

This directive appeared in version 1.15.3.

Sets a timeout during which an idle keepalive connection to an upstream server will stay open.

Closing a TCP connection once it has been idle longer than the timeout is exactly what we need.

This directive requires nginx 1.15.3 or later, so nginx was upgraded to 1.15.8 and configured as follows:

http {
    upstream dbg2 {
        server 10.201.0.27:8888;
        keepalive 100;
        keepalive_requests 30000;
        keepalive_timeout 300s;
    }
    ...
}

Each TCP connection is now closed after carrying 30,000 HTTP requests or after 300 seconds of idleness.
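The rule of thumb behind the fix: the proxy's idle timeout toward the upstream should be shorter than whatever kills the connection further along (the backend's own idle close, or the OS keepalive window), so the proxy always closes first and never reuses a dead connection. A sanity-check sketch with the numbers from this incident:

```python
def proxy_closes_first(proxy_idle_timeout_s, backend_idle_limit_s):
    """True if the proxy retires an idle upstream connection before the
    backend (or its kernel keepalive) can silently kill it."""
    return proxy_idle_timeout_s < backend_idle_limit_s

# The gateway's keepalive window: 7200 s idle + 9 probes * 75 s before
# the RST (the sysctls shown earlier), i.e. 7875 s in total.
backend_limit = 7200 + 9 * 75

print(proxy_closes_first(300, backend_limit))      # after the fix: safe
print(proxy_closes_first(10 ** 9, backend_limit))  # pre-fix (no idle timeout): unsafe
```

Before the upgrade, nginx had no upstream idle timeout at all (modeled above as an effectively infinite value), so the gateway side always won the race and nginx was left holding dead connections.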

After further testing, the lost requests no longer appeared.

No. 938: after five minutes of idleness (the configured keepalive_timeout), nginx initiates the close by sending a FIN packet.