Author | yuan b, head of the log data collection client at Alibaba Cloud. The collection client, logtail, is currently deployed at a scale of millions of instances within the group, collects PB-level data from tens of thousands of applications every day, and has been tested by multiple Double 11 and Double 12 events.
Introduction: As K8s continues to iterate, developers who build logging systems on K8s gradually run into a variety of complex problems and challenges. In this article, the author draws on years of experience to analyze the difficulties of building a K8s logging system, in the hope of providing a useful reference for readers.
I have been working in the logging field for several years. Over the past year, more and more colleagues have come to ask how to build a logging system for Kubernetes, or to ask for help with problems they ran into along the way. This series writes down the problems they encountered and how to solve them, so that readers can avoid detours. It is planned as a long-running series that leans toward hands-on practice and experience sharing, and it will be updated from time to time as the technology evolves.
Preface
I first heard of Kubernetes in 2016. At that time Kubernetes was still in a "Three Kingdoms era" alongside Docker Swarm and Mesos. It emerged from this competition thanks to a number of advantages (scalability, declarative interfaces, cloud friendliness) and eventually became dominant. As the core project of the CNCF, Kubernetes is the foundation of cloud native. Alibaba is currently carrying out a cloud native transformation of its whole site based on Kubernetes, and within 1-2 years 100% of Alibaba's business will run on the public cloud. The core of the CNCF definition of cloud native is: in environments such as public, private, and hybrid clouds, use containers, service meshes, microservices, immutable infrastructure, and declarative APIs to build and run applications that are resilient, fault-tolerant, manageable, observable, and loosely coupled. Observability is an essential part of an application system, and cloud native design also calls for diagnosability, which covers cluster-level logs, metrics, and traces.
Why do we need a logging system
The usual process for locating an online fault is: discover the fault through metrics, locate the faulty module through traces, and then find the cause in that module's logs. Logs contain errors, key variables, and code execution paths, which are at the core of troubleshooting. Logs are therefore always an indispensable step in troubleshooting online problems.
Over Alibaba's more than ten years of development, the logging system has evolved along with the form of computing, in roughly three main stages:
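The metric, trace, log sequence described above can be sketched with toy data. This is only an illustration: the record shapes, the field names (`trace_id`, `module`, `error`), and the 5% threshold are assumptions made for the example, not the format of any particular monitoring or tracing system.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One hop of a traced request (hypothetical shape)."""
    trace_id: str
    module: str
    error: bool


@dataclass
class LogLine:
    """One log record that carries the trace id it belongs to (hypothetical shape)."""
    trace_id: str
    module: str
    message: str


def fault_detected(error_rate: float, threshold: float = 0.05) -> bool:
    # Step 1: a metric crossing a threshold tells us that something is wrong.
    return error_rate > threshold


def faulty_modules(spans):
    # Step 2: traces narrow the fault down to the modules whose spans failed.
    return {s.module for s in spans if s.error}


def cause_logs(logs, spans):
    # Step 3: the logs of the failing module, for the failing traces, hold the cause.
    failing = {(s.trace_id, s.module) for s in spans if s.error}
    return [line for line in logs if (line.trace_id, line.module) in failing]


# Toy data: request t1 fails in the "order" module, request t2 succeeds.
spans = [
    Span("t1", "gateway", False),
    Span("t1", "order", True),
    Span("t2", "gateway", False),
]
logs = [
    LogLine("t1", "gateway", "forwarded request"),
    LogLine("t1", "order", "NullPointerException in createOrder"),
    LogLine("t2", "gateway", "forwarded request"),
]

if fault_detected(error_rate=0.2):
    for line in cause_logs(logs, spans):
        print(line.module, line.message)
```

The point of the sketch is the last step: metrics and traces only say that something failed and where, while the log line is what names the actual error.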
- In the single-machine era, almost every application was deployed on a single machine, and when service load increased the only option was to switch to a higher-specification IBM minicomputer. Logs were part of the application system and were used mainly for program debugging, usually analyzed with common Linux text commands such as grep.
- As the single-machine system became a bottleneck for Alibaba's business growth, the Apsara (Feitian) project was started in order to truly scale out, and the Apsara 5K project was officially launched in 2013. At this stage, each business unit began its distributed transformation, and calls between services changed from local to distributed. To better manage, debug, and analyze distributed applications, we developed the Trace (distributed call tracing) system and various monitoring systems. What these systems have in common is that all logs (including metrics) are stored centrally.
- To support faster development and iteration, in recent years we have started containerization and begun to embrace the Kubernetes ecosystem, move the whole business to the cloud, and adopt Serverless. At this stage, logs have grown explosively in both scale and variety, and demand for digital and intelligent log analysis keeps increasing. A unified logging platform emerged in response.
The ultimate interpretation of observability
In the CNCF, the main role of observability is problem diagnosis. At the level of the company as a whole, however, observability covers not only DevOps but also business, operations, BI, audit, security, and other fields. Its ultimate goal is to make every aspect of the company digital and intelligent.
At Alibaba, almost every business role works with some kind of log data. To support these scenarios, we have developed many tools and functions: real-time log analysis, call tracing, monitoring, data processing, stream computing, offline computing, BI systems, audit systems, and so on. The logging system itself focuses on real-time data collection, cleaning, intelligent analysis, and monitoring, and on connecting to the various stream computing and offline systems.
Difficulties in building a Kubernetes logging system
There are many simple logging solutions, and they are relatively mature, so we will not repeat them here. We only discuss building a logging system for Kubernetes. A logging solution on Kubernetes is quite different from the solutions we previously built on physical machines and virtual machines. For example:
- Logs take more complex forms. Besides physical machine and virtual machine logs, there are container standard output, files inside containers, container events, Kubernetes events, and other information to collect.
- The environment is more dynamic. In Kubernetes, machines going down, going offline or coming online, Pods being destroyed, and scaling out or in are all normal. Logs are therefore transient (for example, a Pod's logs are no longer visible once the Pod is destroyed), so log data must be collected to the server side in real time, and log collection must be able to adapt to this highly dynamic environment.
- There are more types of logs. A request from a client passes through CDN, Ingress, Service Mesh, Pod, and other components, involving a variety of infrastructure, so the number of log types has grown considerably: for example, K8s system component logs, audit logs, ServiceMesh logs, and Ingress logs.
- The business architecture changes. More and more companies are adopting microservice architectures on Kubernetes. In a microservice system, development becomes more complex, and services depend more and more on each other and on underlying products, so troubleshooting becomes more complicated. Correlating logs across these dimensions is a difficult problem.
- Integrating the logging solution is difficult. We usually build a CI/CD system on Kubernetes that automates business integration and deployment as much as possible. Log collection, storage, and cleaning also need to be integrated into it and kept as consistent as possible with the K8s declarative deployment model. Existing logging systems, however, are usually standalone systems that are costly to integrate into CI/CD.
- Log scale. When a system is first set up, we usually choose to build the logging system ourselves from open source components. This works fine during the test and validation phase or in a company's early days, but as the business grows and log volume reaches a certain scale, self-built open source systems often run into all sorts of problems, such as tenant isolation, query latency, data reliability, and system availability. Although the logging system is not on the most critical path in IT, these problems can have a severe impact when they occur at critical moments. For example, during an emergency several engineers may query the logging system concurrently and overwhelm it, so that fault recovery takes longer and the impact of the incident grows.
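To make the points about container standard output and log correlation more concrete, here is a minimal sketch of attaching Pod metadata to a log line at collection time, so that the line stays attributable after the Pod is gone. It assumes the node exposes container stdout files named `<pod>_<namespace>_<container>-<container id>.log` (the convention commonly seen under `/var/log/containers`); verify the layout on your own nodes. The function and field names are hypothetical and are not part of logtail or any other collector.

```python
import re
from typing import NamedTuple, Optional


class ContainerLogSource(NamedTuple):
    pod: str
    namespace: str
    container: str
    container_id: str


# Assumed file name layout: <pod>_<namespace>_<container>-<64 hex char id>.log
# Pod, namespace, and container names cannot contain "_", so "_" is a safe separator.
_NAME_PATTERN = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_container_log_name(filename: str) -> Optional[ContainerLogSource]:
    """Return the source metadata encoded in a log file name, or None if it does not match."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        return None
    return ContainerLogSource(**match.groupdict())


def tag_line(filename: str, line: str) -> dict:
    """Attach source metadata to one raw log line before it is shipped to the server."""
    record = {"content": line.rstrip("\n")}
    source = parse_container_log_name(filename)
    if source is not None:
        record.update(source._asdict())
    return record


# Example with a made-up Pod; the container id is just 64 hex characters.
example_name = "web-7d4b9c_default_nginx-" + "a" * 64 + ".log"
record = tag_line(example_name, "GET /healthz 200\n")
```

Carrying the Pod, namespace, and container names inside every record is what later allows logs from many short-lived Pods to be filtered and correlated on the server side.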
I believe that those who have worked on building a K8s logging system will strongly relate to the difficulties analyzed above. In follow-up articles we will describe in detail, from a practical implementation perspective, how Alibaba builds its K8s logging system. Stay tuned.
This article is original content of the Yunqi Community and may not be reproduced without permission.