In a Kubernetes cluster, an Ingress controller is commonly used to route traffic into the cluster; Traefik and Nginx are the usual choices, with the traditional Nginx also serving as the reverse proxy and load balancer inside many enterprises. Different scenarios may require specific configuration to work correctly, otherwise bizarre exception status codes show up. In this article we look at the 500 and 502 errors we ran into with Traefik.
Foreword
Before we begin, let’s look at a few exception status codes that are commonly encountered when Nginx is used as a reverse proxy:
- 499: The client disconnected. The request was not answered within the time the client allowed, so the client actively closed the connection and received no data (Nginx logs a 499). This is usually a client-side timeout.
- 500: Internal server error. The server hit an unknown error that prevented it from completing the request, usually a bug in the back-end business logic itself.
- 502: Bad gateway. The gateway did not receive the expected response from the upstream service (the upstream returned no data, or data that does not conform to the protocol), so the gateway gives up and returns a gateway error. This usually means the back-end server is down or the business logic timed out.
- 504: Gateway timeout. The gateway did not obtain a response from the upstream in time. This is usually a timeout of the Nginx gateway acting as a client towards the upstream server; the upstream itself keeps executing until it finishes. (Timeout of the Nginx gateway as a client.)
# 499: the client specified a timeout of N seconds, but the request actually took M seconds
# on the server (M > N), so the client got impatient and closed the connection
# For 499, either optimize the back-end logic or adjust the Nginx parameters
$ cat nginx.conf
proxy_ignore_client_abort on;

# Simulate a client-side timeout (here 3 seconds via curl's -m option)
$ curl -I -m 3 http://127.0.0.1/hello.php

# 502 is usually the case where the server behind the Nginx gateway has gone down,
# e.g. php-fpm.conf sets request_terminate_timeout = 5s, the worker is killed and
# abnormal data is returned to the Nginx gateway

# Typical back-end error logs for 502
recv() failed (104: Connection reset by peer) while reading response header from upstream
upstream prematurely closed connection while reading response header from upstream
connect() failed (111: Connection refused) while connecting to upstream

# In general the back end has crashed, or is overloaded and temporarily unresponsive.
# proxy_read_timeout can be increased in Nginx for temporary relief
$ cat nginx.conf
proxy_read_timeout 20s;

# 504: client -> Nginx -> backend. Nginx acts as a client of the backend; if the backend
# has not finished before Nginx's own client-side timeout expires, a 504 is returned
# (while the backend keeps executing to the end)
# For 504, the common practice is to optimize the backend logic to shorten execution time,
# or to increase the timeouts Nginx uses when acting as a client / proxy
$ cat nginx.conf
uwsgi_connect_timeout 5;
uwsgi_send_timeout 5;
uwsgi_read_timeout 5;
fastcgi_read_timeout 5;
fastcgi_send_timeout 5;
proxy_connect_timeout 90;
proxy_send_timeout 90;
proxy_read_timeout 90;
Traefik’s quirky 500 and 502
Deployment configuration of Traefik in a Kubernetes cluster
The traefik configuration for our current cluster is as follows:
# Traefik ConfigMap configuration file
$ cat traefik-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-config
  namespace: kube-system
data:
  traefik.toml: |
    defaultEntryPoints = ["http", "https"]
    debug = false
    logLevel = "INFO"
    InsecureSkipVerify = true
    [entryPoints]
      [entryPoints.http]
      address = ":80"
      compress = true
      [entryPoints.https]
      address = ":443"
        [entryPoints.https.tls]
    [web]
    address = ":8080"
    [kubernetes]
    [metrics]
      [metrics.prometheus]
      buckets = [0.1, 0.3, 1.2, 5.0]
      entryPoint = "traefik"
    [ping]
    entryPoint = "http"
# Traefik DaemonSet configuration
$ cat traefik-ds-v1.7.16.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
---
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  template:
    metadata:
      labels:
        k8s-app: traefik-ingress-lb
        name: traefik-ingress-lb
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: DoesNotExist
      serviceAccountName: traefik-ingress-controller
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      containers:
      - image: traefik:v1.7.16
        name: traefik-ingress-lb
        ports:
        - name: http
          containerPort: 80
          hostPort: 80
        - name: admin
          containerPort: 8080
        securityContext:
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
        args:
        - --api
        - --kubernetes
        - --logLevel=INFO
        - --traefikLog.filePath=/logdata/traefik.log
        - --configfile=/config/traefik.toml
        - --accesslog.filepath=/logdata/access.log
        - --accesslog.bufferingsize=100
        volumeMounts:
        - mountPath: /config
          name: config
        - mountPath: /logdata
          name: access-log
      volumes:
      - configMap:
          name: traefik-config
        name: config
      - name: access-log
        hostPath:
          path: /opt/logs/ingress/
---
kind: Service
apiVersion: v1
metadata:
  name: traefik-ingress-service
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - protocol: TCP
      port: 80
      name: web
    - protocol: TCP
      port: 8080
      name: admin
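To roll these manifests out and confirm Traefik is running, the usual kubectl workflow looks roughly like this (a usage sketch; file and label names as in the manifests above):

# Apply the ConfigMap, DaemonSet and Service
$ kubectl apply -f traefik-config.yaml -f traefik-ds-v1.7.16.yaml

# Check that a Traefik pod is running on every non-master node
$ kubectl -n kube-system get pods -o wide -l k8s-app=traefik-ingress-lb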
Python’s external API interface
# External ingress of the interface
$ kubectl get ingress -n s-data
NAME                 HOSTS                ADDRESS   PORTS   AGE
data-api.bgbiao.cn   data-api.bgbiao.cn             80      236d
ops.bgbiao.cn        ops.bgbiao.cn                  80      236d

# Test the external interface
$ curl data-api.bgbiao.cn -i
HTTP/1.1 401 Unauthorized
Access-Control-Allow-Headers: Content-Type, X-TOKEN
Access-Control-Allow-Origin: *
Content-Length: 58
Content-Type: application/json
Vary: Accept-Encoding
Date: Sun, 28 Jun 2020 14:55:00 GMT

# The interface requires login, so we pressure the login interface to simulate the problem
$ curl -X POST --data '@/root/login.json' -H 'Content-type:application/json' http://data-api.bgbiao.cn/account/users/login/ -i
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Content-Type, X-TOKEN
Access-Control-Allow-Origin: *
Content-Length: 250
Content-Type: application/json
Vary: Accept-Encoding
Date: Sun, 28 Jun 2020 14:56:33 GMT
The weird 500 and 502
After the service was deployed, everything was fine, but a simple pressure test revealed that some requests to the service failed.
# Pressure test with the ab tool
# 200 requests at a concurrency of 20; 7 requests failed during the run
$ ab -c 20 -n 200 -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/
...
Benchmarking data-api.bgbiao.cn (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests
Server Software:
Server Hostname: data-api.bgbiao.cn
Server Port: 80
Document Path: /account/users/login/
Document Length: 250 bytes
Concurrency Level: 20
Time taken for tests: 1.340 seconds
Complete requests: 200
Failed requests: 7
(Connect: 0, Receive: 0, Length: 7, Exceptions: 0)
Write errors: 0
Non-2xx responses: 7
Total transferred: 91371 bytes
Total body sent: 46400
HTML transferred: 48387 bytes
Requests per second: 149.21 [#/sec] (mean)
Time per request: 134.035 [ms] (mean)
Time per request: 6.702 [ms] (mean, across all concurrent requests)
Transfer rate: 66.57 [Kbytes/sec] received
               33.81 kb/s sent
               100.38 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 1 0.1 1 1
Processing: 2 116 27.8 114 179
Waiting: 2 116 27.8 114 179
Total: 3 117 27.8 116 180
Percentage of the requests served within a certain time (ms)
50%     116
66%     121
75%     125
80%     129
90%     154
95%     167
98%     173
99%     175
100%    180 (longest request)
# Save the pressure-test output to a file for simple analysis
# Of the 200 requests in this run, 4 failed: two 500 errors and two 502 errors
$ ab -v 10 -c 20 -n 200 -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/ > ab-log.txt
$ cat ab-log.txt | grep HTTP | sort | uniq -c
    196 HTTP/1.0 200 OK
      2 HTTP/1.0 500 Internal Server Error
      2 HTTP/1.0 502 Bad Gateway
      1 POST /account/users/login/ HTTP/1.0
Traefik 500 and 502 troubleshooting
In a Kubernetes cluster, Traefik plays much the same role as Nginx, and the causes of (and fixes for) the 500 and 502 status codes in the Nginx scenario were covered at the beginning of this article.
We showed the in-cluster Traefik configuration earlier. From an SRE point of view, any service in production needs observability data, so by default we persist Traefik's access log and process log (access.log and traefik.log respectively) and expose Traefik's Prometheus metrics endpoint.
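Since the metrics endpoint is served on the admin port (8080 in the DaemonSet above), the split of 5xx responses between the entrypoint and the individual backends can also be checked there. A quick sketch, assuming the default Traefik 1.7 metric names (traefik_entrypoint_requests_total / traefik_backend_requests_total, both labelled with the HTTP code) and a placeholder <traefik-node> address:

# Count 5xx responses as seen at the entrypoint and per backend
$ curl -s http://<traefik-node>:8080/metrics \
    | grep -E 'traefik_(entrypoint|backend)_requests_total' \
    | grep 'code="5'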
For the above pressure test request, we caught the following exception log in the access log:
$ tail -f access.log | grep data-api.bgbiao.cn | grep -v 'HTTP/1.0" 200'
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267376 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267385 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267410 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267418 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267484 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122267518 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:39 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267550 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 4ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272696 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 2ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272711 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272836 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 0ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272837 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms
As you can see, the status codes of the failed requests match those in our pressure-test results: 500 and 502. There are more than four such entries here, but that is not important for now.
When people see a 500 or 502 at the proxy layer, the first instinct is usually to blame the upstream service. That guess can be ruled out quickly, roughly as follows:
- keep accessing the upstream service directly while the 500s and 502s are being reproduced;
- pressure the upstream service directly with the same parameters, bypassing Traefik (see the sketch after this list);
- deploy the upstream service on a standalone ECS instance and pressure it with the same parameters.
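For the second method, the pod endpoints Traefik forwards to are already visible in the access log above (20.0.26.9:8080 and 20.0.41.8:8080), so ab can be pointed at one of them directly. A sketch with the same parameters as the earlier test:

# Hit one backend pod directly, bypassing Traefik
$ ab -c 20 -n 200 -T 'application/json' -p /root/login.json http://20.0.26.9:8080/account/users/login/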
After these three rounds of testing we had basically ruled out a problem in the upstream service itself, and started to suspect that Traefik had a performance or configuration problem of its own.
Change the log level in Traefik to DEBUG:
$ cat traefik-ds-v1.7.16.yaml
...
      args:
      - --api
      - --kubernetes
      - --logLevel=DEBUG
...
The following information is displayed in the log:
# 500-related logs
time="2020-06-28T15:35:05Z" level=debug msg="'500 Internal Server Error' caused by: http: server closed idle connection"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 500, Length: 21, duration: 1.486276ms"
# 502-related logs
time="2020-06-28T15:35:05Z" level=debug msg="'502 Bad Gateway' caused by: EOF"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 502, Length: 11, duration: 1.530677ms"
The 500 above is caused by Traefik itself closing an idle connection ("server closed idle connection"), while the 502 is caused by an EOF: the connection was torn down before the response had been fully read, so Traefik returned 502. This kind of problem is common in Nginx as well, where adjusting a few configuration parameters (mentioned at the beginning of this article) is usually enough; for Traefik, the deployment model and the relevant parameters had to be dug out of the documentation.
Searching Traefik's GitHub issues showed that this problem had come up several times:
- Traefik return 500 internal error – no 500 on backend
- Sporadic 502 response only when running through traefik
One issue is about the 500s and the other about the 502s, but they usually come in pairs.
500 and 502 problem solutions
As mentioned in the first issue above, Traefik enables HTTP keep-alive towards the backend by default when reverse-proxying, but our Python application did not. Since the test service is indeed written in Python (served by uwsgi), we adjusted that side first.
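A quick way to check whether a backend supports keep-alive is to let curl request the same URL twice: if the server keeps the connection open, curl's verbose output reports that it re-uses the connection, otherwise it reports closing it. A sketch against one of the pod endpoints seen earlier:

# Look for "Re-using existing connection" vs. "Closing connection" in the output
$ curl -sv -o /dev/null -o /dev/null \
    -X POST --data '@/root/login.json' -H 'Content-type:application/json' \
    http://20.0.26.9:8080/account/users/login/ \
    http://20.0.26.9:8080/account/users/login/ 2>&1 \
    | grep -iE 're-using|closing connection|left intact'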
# Add keep-alive to uwsgi.ini
$ cat uwsgi.ini
[uwsgi]
http = 0.0.0.0:8080
http-keepalive = 1
chdir = /opt/app/
wsgi-file = /opt/app/main.py
callable = app
stats = 0.0.0.0:8081
processes = 2
threads = 10
# After restarting the application, test again
# The 502 and 500 errors seen before have basically disappeared
# 200 concurrent clients, 10000 requests in total, 0 failures, about 1 minute overall (QPS around 160 😸)
$ ab -c 200 -n 10000 -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/
...
Concurrency Level: 200
Time taken for tests: 59.323 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 4670000 bytes
Total body sent: 2320000
HTML transferred: 2500000 bytes
Requests per second: 168.57 [#/sec] (mean)
Time per request: 1186.454 [ms] (mean)
Time per request: 5.932 [ms] (mean, across all concurrent requests)
Transfer rate: 76.88 [Kbytes/sec] received
               38.19 kb/s sent
               115.07 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 3 38.6 1 1035
Processing: 101 942 1457.7 857 32684
Waiting: 101 942 1457.7 857 32684
Total: 102 945 1458.0 861 32685
# P99 reaches about 7.4s
Percentage of the requests served within a certain time (ms)
50%     861
66%    1033
75%    1136
80%    1191
90%    1886
95%    2281
98%    4209
99%    7399
100%  32685 (longest request)
The 500 and 502 problems were temporarily solved by enabling HTTP keep-alive at the application layer. Can the problem also be mitigated from the Traefik side? That is what the second issue above discusses.
That is, modify the following traefik parameters and redeploy the traefik cluster:
# Disable Traefik's keep-alive connection pool towards the backends. The default is 200;
# if the parameter is 0, Go's standard-library DefaultMaxIdleConnsPerHost is used.
# keep-alive mainly reuses connections to reduce open files, but problems can occur
# with a large number of short connections
--maxidleconnsperhost=-1

# Set the number of retries (default 0); the idle and response-forwarding timeouts
# are increased below
--retry.attempts=10

# This parameter has been superseded by --respondingtimeouts.idletimeout (default 3m0s)
--idletimeout=60s

# The default is 0s
--forwardingtimeouts.responseheadertimeout=60s
# Traefik idle connection timeout
$ ./traefik --help | grep idletimeout
--idletimeout (Deprecated) maximum amount of time an idle (keep-alive) connection will remain (default "0s")
--respondingtimeouts.idletimeout IdleTimeout is the maximum amount duration an idle (keep-alive) connection will (default "3m0s")
# Response-timeout-related parameters
$ ./traefik --help | grep respondingtimeouts
--respondingtimeouts Timeouts for incoming requests to the Traefik instance (default "true")
--respondingtimeouts.idletimeout IdleTimeout is the maximum amount duration an idle (keep-alive) connection will (default "3m0s")
--respondingtimeouts.readtimeout ReadTimeout is the maximum duration for reading the entire request, including (default "0s")
--respondingtimeouts.writetimeout WriteTimeout is the maximum duration before timing out writes of the response. (default "0s")
# forwarding timeout
$ ./traefik --help | grep forwardingtimeouts
--forwardingtimeouts Timeouts for requests forwarded to the backend servers (default "true")
--forwardingtimeouts.dialtimeout The amount of time to wait until a connection to a backend server can be (default "30s")
--forwardingtimeouts.responseheadertimeout The amount of time to wait for a server's response headers after fully writing (default "0s")
The final modified traefik parameters are as follows:
- --maxidleconnsperhost=-1
- --api
- --kubernetes
- --logLevel=INFO
- --traefikLog.filePath=/logdata/traefik.log
- --configfile=/config/traefik.toml
- --accesslog.filepath=/logdata/access.log
- --accesslog.bufferingsize=100
- --forwardingtimeouts.responseheadertimeout=60s
- --respondingtimeouts.idletimeout=180s
- --retry.attempts=10
- --idletimeout=180s
The creepy 503 in SLB
With the domain name resolved directly to the Traefik nodes, the 502s and 500s no longer appeared. We then mounted the Traefik nodes behind an Alibaba Cloud intranet SLB, and a strange 503 problem started to show up.
# A simple pressure test through the SLB (the intranet uses the free SLB specification 😹)
# client->ali-slb->traefik->pods
$ ab -v 10 -c 200 -n 2000 -T 'application/json' -p postfile.json http://data-api.soulapp.cn/get_doc > slb-log.txt
$ cat slb-log.txt | grep 'HTTP/1.1 200' | wc -l
1322
$ cat slb-log.txt | grep 'HTTP/1.1 503' | wc -l
678
# client->traefik->pods
$ ab -v 10 -c 200 -n 2000 -T 'application/json' -p postfile.json http://data-api.c.bgbiao.cn/get_doc > traefik-log.txt
Completed 200 requests
Completed 400 requests
Completed 600 requests
Completed 800 requests
Completed 1000 requests
Completed 1200 requests
Completed 1400 requests
Completed 1600 requests
Completed 1800 requests
Completed 2000 requests
Finished 2000 requests
$ cat traefik-log.txt | grep 'HTTP/1.0 200' | wc -l
2000
Later I found the following note in the Alibaba Cloud documentation. In short, the SLB applies rate-limiting policies that depend on the instance specification, and when the limit is exceeded it returns 503 to the client.
Ali cloud TPS – 503
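A quick way to confirm that the 503s come from the SLB rather than from Traefik is to compare the two sides during the test: if Traefik's own access log never records a 503 while the client sees hundreds of them, the responses were generated in front of Traefik. A sketch, assuming the access-log path from the DaemonSet above:

# On a Traefik node: count 503s logged by Traefik during the test window
$ grep -c '" 503 ' /opt/logs/ingress/access.log

# On the client: count 503s actually received through the SLB
$ grep -c 'HTTP/1.1 503' slb-log.txt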