I. The Fault
The basic architecture is shown in the figure. The client initiates an HTTP request to Nginx, which forwards the request to the gateway, which in turn forwards the request to the back-end microservice.
The symptom: every ten minutes to every few hours, the client receives one or more consecutive request-timeout errors. The nginx access log shows status 499 for these requests, and no corresponding request appears in the gateway log.
From the log analysis, the problem lies in either nginx or Spring Cloud Gateway.
Nginx version: 1.14.2; Spring Cloud version: Greenwich.RC2.
Nginx configuration is as follows:
[root@wh-hlwzxtest1 conf]# cat nginx.conf
worker_processes  8;
events {
    use epoll;
    worker_connections  10240;
}
http {
    include       mime.types;
    default_type  application/octet-stream;
    sendfile      on;
    tcp_nopush    on;
    tcp_nodelay   on;
    keepalive_timeout  65;
    #gzip  on;
    upstream dbg2 {
        server 10.201.0.28:8888;
        keepalive 100;
    }
    server {
        listen       80;
        server_name  localhost;
        charset      utf-8;
        location /dbg2/ {
            proxy_pass         http://dbg2/;
            proxy_http_version 1.1;
            proxy_set_header   Connection "";
        }
    }
}
To improve performance, nginx forwards requests to the gateway over HTTP/1.1, so that TCP connections can be reused.
II. Troubleshooting
1. View TCP connections
[root@wh-hlwzxtest1 logs]# ss -n | grep 10.201.0.27:8888
tcp   ESTAB   0   0   10.197.0.38:36674   10.201.0.27:8888
tcp   ESTAB   0   0   10.197.0.38:40106   10.201.0.27:8888
[root@eureka2 opt]# ss -n | grep 10.197.0.38
tcp   ESTAB   0   0   ::ffff:10.201.0.27:8888   ::ffff:10.197.0.38:40106
tcp   ESTAB   0   0   ::ffff:10.201.0.27:8888   ::ffff:10.197.0.38:39266
Comparing the two sides, only the socket connection (10.197.0.38:40106, 10.201.0.27:8888) exists on both hosts; the other connections are half-open. The likely cause is that one end closed the TCP connection abnormally without notifying the peer, or the notification never reached the peer.
2. Packet capture analysis
Nginx packet capture data
No. 8403: nginx forwards the HTTP request to the gateway;
No. 8404: no ACK is received within one RTT, so TCP retransmits;
No. 8505: the RTT is about 0.2 s; TCP retransmits;
No. 8506: no ACK after 0.4 s; TCP retransmits;
No. 8507: no ACK after 0.8 s; TCP retransmits;
No. 8509: no ACK after 1.6 s; TCP retransmits;
...
No. 8439: no ACK after 28.1 s (about 128 RTT); TCP retransmits (see the backoff sketch after this list);
No. 8408: the request timeout is set to 3 s, so a FIN packet is sent to close the connection.
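The intervals above are TCP's exponential backoff at work: the retransmission timeout (RTO) roughly doubles after each unanswered retransmission. A minimal Python sketch of this idealized schedule, assuming an initial RTO of about 0.2 s (the measured RTT) and ignoring the kernel's clamping of the RTO:

# Idealized exponential backoff; real kernels clamp the RTO and keep
# retransmitting up to tcp_retries2 times (15 by default on Linux).
initial_rto = 0.2                      # seconds, roughly the measured RTT
elapsed = 0.0
for attempt in range(8):               # first 8 retransmissions
    rto = initial_rto * (2 ** attempt)
    elapsed += rto
    print(f"retransmission {attempt + 1}: wait {rto:5.1f} s "
          f"(~{int(rto / initial_rto)} RTT), total {elapsed:5.1f} s")
# The 8th retransmission waits 0.2 * 2**7 = 25.6 s (~128 RTT), which matches
# the roughly 28 s gap observed before packet No. 8439.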
Take a look at the packet capture data of the gateway:
No. 1372: at 17:24:31, the gateway receives the ACK packet from nginx, corresponding to No. 1348 in the nginx capture (the nginx server clock is about 1 minute 30 seconds fast);
No. 4221: 2 hours later, the gateway sends a TCP keepalive heartbeat packet (the nginx capture confirms the connection was idle during those 2 hours);
No. 4253: 75 s later, it sends the keepalive heartbeat again;
No. 4275: another 75 s later, it sends the heartbeat again;
... and so on, 9 probes in a row;
No. 4489: the gateway sends an RST packet to reset the connection.
2 hours, 75 seconds, 9 probes: these are the system's default TCP keepalive settings.
[root@eureka2 opt]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@eureka2 opt]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@eureka2 opt]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
See article: Why do TCP-based applications need heartbeat packets
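The values 7200 s, 75 s and 9 above are kernel-wide defaults; an application can also override them on a per-socket basis. A minimal sketch, assuming a Linux host (the TCP_KEEP* socket options are Linux-specific) and purely illustrative values:

import socket

# Enable TCP keepalive on a socket and override the kernel defaults
# (tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before the first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # unanswered probes before the connection is dropped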
3. Analysis
The packet captures above essentially confirm that the problem is on the nginx side. At 19:25, the gateway starts sending TCP keepalive heartbeat packets to the nginx server; the TCP connection still exists on the nginx server, yet no response ever comes back. At 22:20, nginx sends an HTTP request over that connection, but the gateway has already closed it, so no reply arrives.
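As a sanity check on that analysis, the timestamps line up with the kernel keepalive settings shown earlier. A rough reconstruction (the date is a placeholder; only the times matter):

from datetime import datetime, timedelta

last_activity = datetime(2019, 1, 1, 17, 25)              # last traffic seen on the connection
first_probe   = last_activity + timedelta(seconds=7200)   # tcp_keepalive_time
reset_time    = first_probe + timedelta(seconds=9 * 75)   # 9 probes, 75 s apart
print(first_probe.time())   # ~19:25 -- gateway starts keepalive probing
print(reset_time.time())    # ~19:36 -- gateway gives up and resets the connection
# By 22:20, when nginx reuses the connection, the gateway side has been gone
# for almost three hours.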
III. The Fix
1. proxy_send_timeout
Nginx provides several timeout directives for upstream connections (see the article "Nginx timeout configuration"):
proxy_connect_timeout: timeout for establishing a connection with the upstream server;
proxy_read_timeout: timeout for reading a response from the upstream server (default 60 s);
proxy_send_timeout: timeout for transmitting a request to the upstream server (default 60 s); if not a single byte is sent within 60 s, the connection is closed.
These directives apply to a single HTTP request, not to the connection as a whole. proxy_send_timeout = 60s does not mean the connection is closed when no HTTP request has been sent for 60 seconds; it means that, while a request is being sent, the connection is closed if more than 60 seconds pass between two successive write operations. So these parameters are not what we need (the sketch below illustrates the distinction).
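To make that distinction concrete, here is a rough sketch (a hypothetical helper, not code from nginx) of a per-write timeout in the spirit of proxy_send_timeout; the timer only runs while a request is being written, so a connection that merely sits idle between requests never trips it:

import socket

def send_request(sock: socket.socket, chunks, per_write_timeout=60.0):
    """Hypothetical illustration: the timeout bounds each individual write,
    not the total time or the connection's idle time (partial sends ignored
    for brevity)."""
    sock.settimeout(per_write_timeout)   # applies to each blocking send() call
    for chunk in chunks:
        sock.send(chunk)                 # the 60 s timer is re-armed on every write
    # After the request is fully written, no timer runs at all, so an idle
    # keepalive connection can stay open indefinitely.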
2. keepalive_timeout (upstream)
Module:  ngx_http_upstream_module
Syntax:  keepalive_timeout timeout;
Default: keepalive_timeout 60s;
Context: upstream
This directive appeared in version 1.15.3.
Sets a timeout during which an idle keepalive connection to an upstream server will stay open.
Closing a TCP connection once it has been idle for longer than a configured timeout is exactly what we need.
To use this directive, nginx is upgraded to 1.15.8, and the configuration file becomes:
http {
    upstream dbg2 {
        server 10.201.0.27:8888;
        keepalive 100;
        keepalive_requests 30000;
        keepalive_timeout 300s;
    }
    ...
}
This closes a TCP connection after it has carried 30,000 HTTP requests or has been idle for 300 seconds.
After continuing the test, no packet loss was found.
No. 938: after five minutes of idleness, nginx initiates a FIN packet to close the connection.
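For ongoing verification, a small probe script can be left running against the nginx endpoint to confirm the intermittent timeouts are gone. This is a hypothetical sketch (the URL and probe interval are placeholders), with the 3 s timeout matching the client timeout mentioned earlier:

import time
import urllib.request

URL = "http://wh-hlwzxtest1/dbg2/"      # placeholder endpoint behind nginx
CLIENT_TIMEOUT = 3                      # seconds, same as the client timeout above

while True:
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=CLIENT_TIMEOUT) as resp:
            resp.read()
    except Exception as exc:            # timeout or connection error
        print(f"{time.ctime()} request failed after {time.time() - start:.1f} s: {exc}")
    time.sleep(30)                      # probe every 30 s and let it run for hours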