In the previous articles of this series, we set up a Kong gateway on an Alibaba Cloud K8s cluster and gradually moved the services from the old architecture over through a smooth migration. The next step is to consider how to operate and maintain it efficiently.

Operation and maintenance covers service monitoring and alarms, day-to-day management of the Kong gateway, routine K8s operations (which can also affect Kong), and how to make use of the Alibaba Cloud platform to keep all of this simple and efficient.

Overall service and individual request monitoring

By request response status

The Kong gateway monitoring dashboard shows how requests are being processed overall. On top of it, you can compute the percentage of requests handled normally (2xx responses) or abnormally (5xx responses).

For some core services, we set alarms on the response status. For example, an alarm is triggered when the percentage of normally handled requests (2xx responses) drops below 99%, or when the percentage of abnormally handled requests (5xx responses) rises above 1%.
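The rule above is simple enough to sketch in a few lines of Python. This is only an illustration of the threshold logic, not how any particular monitoring product implements it: the function names and the idea of feeding it a list of status codes collected over one evaluation window are assumptions made for the example.

```python
from collections import Counter


def status_class_percentages(status_codes):
    """Return the share of 2xx and 5xx responses as percentages of all requests."""
    total = len(status_codes)
    if total == 0:
        # No traffic in the window: treat it as healthy rather than divide by zero.
        return {"2xx": 100.0, "5xx": 0.0}
    classes = Counter(code // 100 for code in status_codes)
    return {
        "2xx": 100.0 * classes[2] / total,
        "5xx": 100.0 * classes[5] / total,
    }


def should_alarm(status_codes, min_2xx_pct=99.0, max_5xx_pct=1.0):
    """Alarm when the 2xx share is below 99% or the 5xx share is above 1%."""
    pct = status_class_percentages(status_codes)
    return pct["2xx"] < min_2xx_pct or pct["5xx"] > max_5xx_pct
```

Note that the two conditions are not redundant: a burst of 4xx responses can push the 2xx share below 99% while the 5xx share stays at zero, which is why both thresholds are checked.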

By request latency

For some core services, we set alarms on request latency. For example, the response time of the service in the figure above is usually under 50 ms, so we can set an alarm that fires when the p95 latency exceeds 500 ms.
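To make the p95 rule concrete, here is a minimal sketch that computes a percentile with the nearest-rank method and compares it with the 500 ms threshold. The function names and the plain list of latency samples are assumptions for illustration; a real monitoring system typically estimates percentiles from histograms instead of raw samples.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    # pct is expected to be an integer between 1 and 100.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_alarm(samples_ms, threshold_ms=500, pct=95):
    """Alarm when the chosen percentile of latency exceeds the threshold."""
    return percentile(samples_ms, pct) > threshold_ms
```

Using p95 rather than the average keeps a handful of slow outliers from triggering the alarm, while still catching the case where more than 5% of requests are slow.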

Kong gateway resource usage

A K8s cluster already has alarms for node resource usage (CPU, memory, and disk) and Pod resource usage (CPU and memory). The Kong gateway also runs as Pods, so as long as its CPU and memory specifications are configured properly, the cluster's default alarms are sufficient. In addition, as the gateway, Kong carries the inbound and outbound traffic of the entire cluster, so you also need to monitor inbound and outbound traffic. This helps, for example, to catch abnormally heavy traffic from a single service before it saturates the bandwidth and affects other services.

Dedicated SLB for individual services

Business services are reached via domain name -> SLB -> K8s Kong gateway. The gateway is shared, but an SLB can be either shared or dedicated.

Recommendation: give important services a dedicated SLB, which makes it easier to control the SLB specification and to manage it (log collection and log analysis). Secondary services share an SLB and its specification, which reduces both cost and the number of SLBs.

Note: Creating the SLB is covered in detail in the previous article.

How routine K8s operations affect the Kong gateway

Note: Here we assume that the Kong gateway is deployed as a Deployment.

Because of the "preserve source IP address" requirement described earlier, all K8s nodes serve as SLB backends, and a Kong gateway instance must run on every node. As a result, whenever we operate on a node, a Kong gateway instance is involved. The following uses restarting a node as an example:

Problem #1: All Pods on a node must be migrated before the node can be restarted. Before the Kong Pod is migrated, gateway traffic must first be taken away from it. How do we stop the SLB from forwarding traffic to a node?

The SLB that Alibaba Cloud creates for K8s finds its backend instances through a virtual server group, as shown in the figure below.

An easy way to do this is to adjust the weights manually and restore them after the whole operation, but this does not fit well into an automated process. The method I recommend instead is to have nodes in the unschedulable state removed from the SLB backend automatically.

The kubectl cordon and kubectl drain commands put a node into the unschedulable state. The annotation service.beta.kubernetes.io/alibaba-cloud-loadbalancer-remove-unscheduled-backend defaults to off, which means that nodes in the unschedulable state are not removed from the SLB's backend server group.

To have unschedulable nodes removed from the SLB backend server group, set service.beta.kubernetes.io/alibaba-cloud-loadbalancer-remove-unscheduled-backend to on, as in the following Service:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-remove-unscheduled-backend: "on"
  name: nginx
spec:
  externalTrafficPolicy: Local
  ports:
  - name: http
    port: 30080
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer

With this SLB feature in place, let's return to the node restart example. The process is as follows:

  1. Run kubectl cordon to set the node to the unschedulable state.
  2. Run kubectl drain to evict the instances from the node.
  3. Restart the node.
  4. Run kubectl uncordon to set the node back to the schedulable state.
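The four steps can be wrapped in a small script. The sketch below only assembles the kubectl commands that go before and after the restart and, by default, prints them instead of running them. The helper names and the node name are hypothetical, the --ignore-daemonsets flag is an assumption that your nodes run DaemonSet Pods (which drain otherwise refuses to skip), and the restart itself is left out because it depends on how your nodes are managed.

```python
import subprocess


def restart_commands(node):
    """Build the kubectl commands that surround a node restart."""
    before = [
        # Step 1: mark the node unschedulable so the SLB can drop it from its backends.
        ["kubectl", "cordon", node],
        # Step 2: evict the Pods, including the Kong instance, from the node.
        ["kubectl", "drain", node, "--ignore-daemonsets"],
    ]
    after = [
        # Step 4: make the node schedulable again once it is back up.
        ["kubectl", "uncordon", node],
    ]
    return before, after


def run(commands, dry_run=True):
    """Print the commands, or execute them and stop at the first failure."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # "node-1" is a placeholder node name.
    before, after = restart_commands("node-1")
    run(before)
    print("# step 3: restart the node here, then wait until it is Ready")
    run(after)
```

Keeping the dry run as the default makes it easy to review the exact commands before touching a production node.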

Problem #2: When nodes are added or removed, the number of Kong gateway instances, i.e. the replica count of the Deployment, needs to be adjusted to match.

Problem #3: Upgrading the Kong gateway instances.

In the current request path, the SLB forwards requests to each node, and kube-proxy on each node forwards them to the Kong instance on that node. Consider the following scenario: when the Kong gateway is upgraded and the Kong Pods enter a rolling update, the SLB does not learn about this at the same moment (even with health checks configured on the SLB, there can be a delay), so it keeps forwarding requests to the old version of the Kong Pod. If that old Kong Pod is in the Terminating state or has already exited, those requests go unanswered.

For this reason, we strictly divide a Kong gateway upgrade into three steps:

  1. First cut off the traffic so that no new requests come in.
  2. Upgrade using the Recreate strategy (a Deployment supports two upgrade strategies, Recreate and RollingUpdate, and the default is RollingUpdate).
  3. Restore traffic.
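For step 2, the Deployment's strategy has to be switched from the default RollingUpdate to Recreate. One way to do that is a kubectl patch; the sketch below builds the patch body and the matching command. The Deployment name and namespace ("kong" in the usage example) are placeholders, and the helper names are made up for this example. The rollingUpdate field is explicitly set to null because the API server rejects a Deployment that has type Recreate while leftover rollingUpdate parameters are still present.

```python
import json


def recreate_strategy_patch():
    """Patch body that switches a Deployment to the Recreate strategy."""
    return {
        "spec": {
            "strategy": {
                "type": "Recreate",
                # None becomes JSON null, which removes the old rollingUpdate settings.
                "rollingUpdate": None,
            }
        }
    }


def patch_command(deployment, namespace):
    """Build the kubectl patch command as an argument list."""
    return [
        "kubectl", "patch", "deployment", deployment,
        "-n", namespace,
        "-p", json.dumps(recreate_strategy_patch()),
    ]


if __name__ == "__main__":
    # "kong" is a placeholder for both the Deployment name and the namespace.
    print(" ".join(patch_command("kong", "kong")))
```

With Recreate, all old Pods are stopped before new ones start, so there is a window with no Kong instance at all. That is acceptable here only because step 1 has already taken the traffic away.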

The method for cutting off and restoring traffic is the one described in Problem #1.

Additional note: We want to keep the number of gateway upgrades to a minimum, and when an upgrade is necessary, we must research and test it thoroughly to minimize risk.

Additional note: If you want to simplify the operation, I recommend doing it during a low-traffic period.

Business SLB topics

Request log analysis

As mentioned above, we can configure a dedicated SLB for important services. I recommend a layer 7 (application layer) SLB, so that the log analysis feature of Alibaba Cloud can be enabled.

A simple query syntax lets you inspect and analyze business requests, which is very helpful for business developers.

Note: A disadvantage of a layer 7 SLB is that multiple SLBs are required when HTTPS requests for different domains are served together. A layer 4 SLB does not have this problem.

Business SLB monitoring and alarms

For Alibaba Cloud SLB monitoring and alarms, refer to the official documentation; we will not go into detail here.