Our company's cluster was constantly on the verge of collapse. After working with it for nearly three months, we identified the following reasons for its instability:

1. Unstable version release process
2. Lack of a monitoring platform [the most important reason]
3. Lack of a log system
4. Lack of operations documentation
5. Unclear request routes

Overall, the primary cause was the lack of a predictive monitoring platform: problems remained invisible until they had already occurred. Secondary causes were unclear server roles and an unstable release process.

Solutions

Unstable version release process

Refactor the release process: move the business fully onto K8s and build a CI/CD pipeline with Kubernetes at its core.

Release process

The release process is as follows:

Brief analysis: A developer pushes code to the develop branch (the develop branch is always kept up to date). The develop branch is then merged into the branch corresponding to the release environment, which triggers a WeChat Work (enterprise WeChat) notification and starts a GitLab Runner pod deployed in the K8s cluster; this new runner pod carries out the CI/CD operations.

The process has three stages: running the test cases, building and pushing the image, and updating the pod.
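As a rough illustration, a minimal `.gitlab-ci.yml` covering these three stages might look like the sketch below; the image names, registry address, and script name are placeholders rather than our actual configuration.

```yaml
# .gitlab-ci.yml -- minimal sketch of the three-stage pipeline (illustrative names only)
stages:
  - test
  - build
  - deploy

run-tests:
  stage: test
  image: golang:1.20            # assumed toolchain image; replace with the service's own
  script:
    - go test ./...

build-image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind            # docker-in-docker for simplicity; a K8s runner may use kaniko instead
  script:
    # push over the VPC endpoint of the Alibaba Cloud registry (placeholder address)
    - docker build -t registry-vpc.cn-hangzhou.aliyuncs.com/demo/app:$CI_COMMIT_SHORT_SHA .
    - docker push registry-vpc.cn-hangzhou.aliyuncs.com/demo/app:$CI_COMMIT_SHORT_SHA

update-pod:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    # kubernetes.sh wraps the per-service judgments described later in the article
    - bash kubernetes.sh "$CI_COMMIT_REF_NAME" "$CI_COMMIT_SHORT_SHA"
```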

When a service is deployed in the K8s cluster for the first time, you may need to create a Namespace, an imagePullSecret, a PV (StorageClass), a Deployment (pod controller), a Service, an Ingress, and so on. Images are pushed to and pulled from the Alibaba Cloud registry over the VPC endpoint rather than the public network, so there is no bandwidth limit. When the pipeline completes, the runner pod is destroyed and GitLab reports the result.
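For reference, a minimal sketch of the kind of resources that may have to exist before the first deployment is shown below; the namespace, names, and registry address are illustrative assumptions.

```yaml
# Sketch of first-deployment resources (all names are placeholders)
apiVersion: v1
kind: Namespace
metadata:
  name: demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      imagePullSecrets:
        - name: aliyun-registry          # imagePullSecret created in advance by ops
      containers:
        - name: demo-app
          image: registry-vpc.cn-hangzhou.aliyuncs.com/demo/app:latest
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
  namespace: demo
spec:
  selector:
    app: demo-app
  ports:
    - port: 80
      targetPort: 8080
```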

Note that the resource manifests do not include ConfigMaps or Secrets; for security reasons these should not live in the code repository. Our company uses Rancher as the K8s multi-cluster management platform, and these sensitive resources are handled by the operations team in Rancher's dashboard.

Service deployment logic diagram

The service deployment logic diagram is as follows:

Combined with the release-process analysis above, the logic diagram makes the flow clear. You can see that our company uses Kong instead of Nginx for authentication, authorization, and proxying; the SLB's IP address is bound to Kong. Steps 0, 1, and 2 belong to the test job; step 3 belongs to the build job; steps 4, 5, 6, and 7 belong to the change-pod stage. Not every service needs persistent storage; it depends on the actual situation, so that judgment has to be written into kubernetes.sh. I am trying to cover all environments with a single set of CI configuration, which means kubernetes.sh needs more conditional logic and .gitlab-ci.yml grows rather large. My recommendation is to use one CI template that can be applied to every environment (see the sketch after this paragraph). You should also take your own branching model into account; see the Git Branch Development Specification Manual and the Git Commit Specification.
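The sketch below shows one possible shape of such a reusable deploy template, using GitLab CI `rules:` and a YAML anchor; the branch names, environment names, and the arguments passed to kubernetes.sh are assumptions, not our real configuration.

```yaml
# Sketch: one deploy job template reused for every environment (names are assumptions)
.deploy-template: &deploy
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - bash kubernetes.sh "$DEPLOY_ENV" "$CI_COMMIT_SHORT_SHA"

deploy-test:
  <<: *deploy
  variables:
    DEPLOY_ENV: test
  rules:
    - if: '$CI_COMMIT_BRANCH == "test"'

deploy-prod:
  <<: *deploy
  variables:
    DEPLOY_ENV: prod
  rules:
    - if: '$CI_COMMIT_BRANCH == "master"'
      when: manual              # require a manual click before touching production
```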

Lack of a monitoring and early-warning platform

Build a reliable federated monitoring platform suited to our company's multi-cluster environment, so that several clusters can be monitored at the same time and alarms are raised before failures occur, allowing early intervention.

Monitoring and early-warning logic diagram

The monitoring and early warning logic diagram is as follows:

Brief analysis: Broadly, the monitoring approach used here is Prometheus plus shell or Go scripts, plus Sentry. Alarms are delivered via WeChat Work (enterprise WeChat) or corporate email.

The three colored lines in the figure represent three monitoring paths that deserve attention. Scripts handle backup alarms, certificate-expiry alarms, and intrusion detection. Prometheus here uses resource manifests adapted from prometheus-operator, with its data stored on NAS. Strictly speaking, Sentry is a log collection platform; I group it under monitoring because I value its ability to collect crash information from the application code. It acts as a business-logic monitoring platform, collecting and aggregating the error logs generated while the services run and raising alarms on them.
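As one example of the Prometheus side of this setup, a PrometheusRule resource (the prometheus-operator CRD) for an early-warning alert might look roughly like the following; the metric expression, threshold, and labels are illustrative assumptions.

```yaml
# Sketch of an early-warning alert rule via the prometheus-operator PrometheusRule CRD
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-early-warning
  namespace: monitoring
  labels:
    release: prometheus          # must match the ruleSelector of the Prometheus CR
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeMemoryPressure
          expr: |
            (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} memory usage above 85%"
```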

Note that what we deploy is a federated monitoring platform, not an ordinary single-cluster monitoring platform.

Logic diagram of the federated monitoring and early-warning platform

The logic diagram of the multi-cluster federated monitoring and early-warning platform is as follows:

Since our company runs several K8s clusters, deploying a separate monitoring and alerting stack on every cluster would be too inconvenient to manage. The strategy adopted here is therefore to federate the per-cluster monitoring platforms and manage them through a unified visual interface. Monitoring is implemented at three levels: operating system, application, and business. For traffic monitoring, Kong can be monitored directly with Grafana dashboard template 7424.
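Prometheus federation itself is configured on the central Prometheus by scraping each cluster's /federate endpoint. A minimal sketch, with placeholder cluster addresses and labels:

```yaml
# Central prometheus.yml -- federation scrape job (cluster address and label are placeholders)
scrape_configs:
  - job_name: 'federate-cluster-a'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'          # pull every job's series; narrow this in practice
    static_configs:
      - targets:
          - 'prometheus.cluster-a.example.com:9090'
        labels:
          cluster: cluster-a
```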

Lack of a log system

As the business moves fully onto K8s, the need for a log system becomes more urgent. A characteristic of K8s is that service failure logs are hard to retrieve; building an observable, filterable log system lowers the difficulty of fault analysis.

The logic diagram of the log system is as follows:

Brief analysis: With the whole business on K8s, management and maintenance become easier, but the difficulty of log management rises accordingly.

We know that pod restarts are driven by many factors and are not fully controllable, and every pod restart starts logging from scratch, meaning the logs from before the new pod started are no longer visible. There are several ways to persist logs, such as shipping them to remote storage or mounting them locally. Elasticsearch was chosen to build the log collection system for the sake of visualization, analyzability, and so on.
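The article does not name the log shipper, so as an assumption the sketch below uses Filebeat as the node-level agent (typically run as a DaemonSet), with a minimal filebeat.yml that ships container logs to Elasticsearch; the Elasticsearch address is a placeholder.

```yaml
# filebeat.yml -- minimal sketch (Filebeat as the shipper is an assumption;
# it would run as a DaemonSet so every node's container logs are collected)
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log

output.elasticsearch:
  # placeholder in-cluster Elasticsearch service address
  hosts: ["http://elasticsearch.logging.svc:9200"]
```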

Severe lack of operations documentation

Establish a Yuque-based document center for operations and maintenance information, recording the relevant operations, problems, scripts, and so on in detail so they can be consulted at any time.



Brief analysis: For security reasons, the documentation is not open to too many colleagues.

Operations work is special: security and documentation must both be guaranteed. In my view, whether you do operations or development, writing documentation is a skill you must master, for your own sake and for others'. Documents can be brief, but they must contain the core steps. I still believe every operations action should be recorded.

Unclear request routes

Based on the new cluster design, reorganize the cluster-level request routes and build a traffic management system that integrates authentication, authorization, proxying, connection, protection, control, and observation, so that the blast radius of a failure is effectively contained.

The request path logic diagram is as follows:



Brief analysis: A customer visits https://www.cnblogs.com/zisefeizhu. The services have been split into microservices, and communication between services is authenticated and authorized by Istio. Requests that need the database go to the database, those that need to read or write storage go to the PV, those that need the conversion service go to the conversion service... and the response is then returned.
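As a sketch of how Istio can authenticate and authorize the service-to-service calls described here, the manifests below enforce mutual TLS inside a namespace and allow only one named service account to call another service; the namespace, labels, and service-account names are assumptions.

```yaml
# Sketch of Istio service-to-service authentication and authorization (placeholder names)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: demo
spec:
  mtls:
    mode: STRICT                 # require mutual TLS between workloads in this namespace
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-order-to-payment
  namespace: demo
spec:
  selector:
    matchLabels:
      app: payment
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/demo/sa/order"]
```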

To sum up, we build: a Kubernetes-based CI/CD release process, a Prometheus-based federated monitoring and early-warning platform, an Elasticsearch-based log collection system, a Yuque-based document management center, and a Kong- and Istio-based integrated traffic service. Together these provide a solid guarantee of high stability and high reliability.

Attached: overall architecture logic diagram



Note: Follow the arrows and colors when reading the diagram.

Brief analysis: The diagram above may look confusing at first, but if you take it slowly and follow the module-by-module analysis above, it becomes clear. Different colored lines represent different modules of the system; following the arrows makes the flow quite clear.

Based on our company's current business traffic, the functional modules above should, in theory, keep the cluster stable. In my opinion this scheme can keep the business running stably in the K8s cluster for some time; any remaining problems would be at the code level. I am not using much middleware here; Redis is used as a cache but is not drawn in the diagram. After the design above is in place, I plan to add message middleware such as Kafka or RQ to the log system and the conversion service.