Alibaba's PB-Scale Kubernetes Log Platform: Construction Practice
QCon is a comprehensive technology conference hosted by InfoQ and held annually in London, Beijing, New York, São Paulo, Shanghai, and San Francisco. I was honored to speak at QCon's 10th-anniversary conference, presenting "The Construction Practice of Alibaba's PB-Scale Kubernetes Log Platform" in the operations and maintenance track chaired by Liu Yu. I have now organized the slides and transcript, hoping to share them with more readers.
The development of computing forms and the evolution of the log system
Over Alibaba's more than ten years of development, the log system has evolved alongside the company's computing forms, in roughly three stages:
- Single-machine era: almost all applications were deployed on single machines, and when service pressure grew, the only option was to switch to higher-specification IBM minicomputers. Logs, as part of the application system, were used for program debugging and analyzed with common Linux text commands such as grep.
- As the single-machine architecture became the bottleneck constraining Alibaba's business, the Feitian (Apsara) project was launched to truly scale out: the first line of Feitian code was written in 2009, and the Feitian 5K project officially launched in 2013. At this stage, every business underwent a distributed transformation, and calls between services went from local to remote. To better manage, debug, and analyze distributed applications, we built a Trace (distributed tracing) system and various monitoring systems. A common feature of these systems is centralized storage of all logs, metrics included.
- To support faster development and iteration, in recent years we have begun container transformation, embraced the Kubernetes ecosystem, moved business fully onto the cloud, and adopted Serverless. A crucial part of these transformations is observability, and logs are the best way to analyze how a system is running. At this stage, logs grew explosively in both scale and variety, the demand for digital and intelligent log analysis kept rising, and a unified log platform emerged as the natural answer.
The importance of logs and the platform's construction goals
Logs include not only debug logs from servers, containers, and applications, but also access logs of all kinds, middleware logs, user clicks, IoT/mobile logs, and database binlogs. Depending on their timeliness, these logs serve different scenarios:
- Quasi-real-time: mainly used for near-real-time (second-level latency) online monitoring, log viewing, O&M data support, and problem diagnosis. In the last two years, quasi-real-time business insight has also been built on this type of log.
- Hourly/daily: once data accumulates to the hour or day level, T+1 analysis can begin, such as user retention analysis, advertising effectiveness analysis, anti-fraud, operations monitoring, and user behavior analysis.
- Quarterly/yearly: at Alibaba, data is our most important asset, so many logs are kept for more than a year or permanently. These logs are mainly used for archiving, auditing, attack tracing, business trend analysis, and data mining.
At Alibaba, almost every business role works with various kinds of log data. To support these scenarios, we have developed many tools and functions: real-time log analysis, tracing, monitoring, data cleaning, stream computing, offline computing, BI systems, audit systems, and so on. Many of these systems are already very mature; the log platform itself focuses mainly on real-time scenarios such as intelligent analysis and monitoring, while the other functions are usually supported in an open, pluggable form.
Current status of Alibaba's log platform
Today, Alibaba's log platform covers almost all product lines and products. Our products also provide services on the cloud and have served tens of thousands of enterprises. With a daily write volume of more than 16 PB, over 40 trillion log lines, 2 million collection clients, and thousands of Kubernetes clusters served, it is one of the largest log platforms in China.
Why self-build
There are many open source solutions, such as ELK (Elasticsearch, Logstash, Kibana). A log system usually provides log collection/parsing, query and retrieval, log analysis, visualization/alerting, and so on, and all of these functions can be assembled from open source software. We nevertheless chose to build our own, mainly for the following reasons:
- Data scale: open source logging systems support small-scale scenarios well, but they struggle at Alibaba's very large (petabyte) scale.
- Resource consumption: we have millions of servers/containers and a large platform cluster, so we must minimize resource consumption both for collection and for the platform itself.
- Multi-tenant isolation: most systems built from open source software are not designed for multi-tenancy. When many businesses/systems share the logging platform, heavy traffic or misuse by a few users can easily bring down the entire cluster.
- Operation and maintenance complexity: Alibaba has a very complete service deployment and management system; building on internal components keeps operational complexity low.
- Advanced analysis requirements: almost all log system functions arise from concrete scenarios, and some scenarios have advanced analysis requirements that open source software cannot support, such as log context, intelligent analysis, and log-specific analysis functions.
Kubernetes log platform construction difficulties
Around the requirements of the Kubernetes scenario, the main difficulties in building the log platform are:
- Log collection: collection in Kubernetes is extremely critical and complex, mainly because Kubernetes is a highly complex environment: K8s has many subsystems, and the upper-layer business uses various languages and frameworks. At the same time, log collection should integrate with the Kubernetes system as much as possible and complete data collection in a Kubernetes-native way.
- Resource consumption: in K8s, services are usually very small, so data collection must consume as few resources as possible. A simple calculation: with 1,000,000 service instances, if each collection agent saves 1 MB of memory and 1% of a CPU core, the fleet saves roughly 1 TB of memory and 10,000 CPU cores overall (see the back-of-envelope sketch after this list).
- Operation and maintenance cost: operating a log platform is expensive, and we do not want every user who builds a Kubernetes cluster to also have to operate an independent log platform. The log platform must therefore be delivered as SaaS: applications/users only need to operate a web page to complete the whole process of data collection and analysis.
- Ease of use: the core function of a log system is troubleshooting, and troubleshooting speed directly determines work efficiency and the scale of losses. The K8s scenario needs a set of high-performance, intelligent analysis functions to help users locate problems quickly, assisted by a series of simple and effective visualizations.
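To make the resource arithmetic above concrete, here is a minimal back-of-envelope sketch in Go; the instance count and per-agent savings are simply the assumptions stated in the list, not measured figures:

```go
package main

import "fmt"

// Fleet-wide savings if every collection agent saves 1 MB of memory and 1%
// of a CPU core across 1,000,000 service instances (assumed figures).
func main() {
	const (
		instances      = 1_000_000
		memSavedMBEach = 1.0  // MB of memory saved per agent
		cpuSavedEach   = 0.01 // fraction of one CPU core saved per agent
	)

	totalMemTB := instances * memSavedMBEach / 1024 / 1024 // MB -> TB
	totalCores := instances * cpuSavedEach

	fmt.Printf("memory saved: ~%.2f TB\n", totalMemTB)  // ~0.95 TB, i.e. about 1 TB
	fmt.Printf("CPU saved:    %.0f cores\n", totalCores) // 10000 cores
}
```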
Kubernetes log data collection
Log collection is an essential part of both ITOM and future AIOps scenarios, because the data source directly determines the shape and functionality of the applications built on it. Over more than ten years we accumulated a set of log collection experience for physical and virtual machines, but it does not fully apply to Kubernetes. Here we expand on the topic in the form of questions:
Question 1: DaemonSet or Sidecar
The main log collection tool is the agent. In the Kubernetes scenario, there are two collection methods:
- DaemonSet mode: a log agent is deployed on each K8s node, and that agent collects the logs of all containers on the node to the server side.
- Sidecar mode: each POD runs a sidecar log agent container, which collects the logs generated by the POD's main container.
Each collection method has its advantages and disadvantages, summarized as follows:
| | DaemonSet mode | Sidecar mode |
|---|---|---|
| Log collection type | Standard output + some files | Files |
| Deployment and maintenance | Medium: a DaemonSet must be maintained | High: every POD that needs log collection must deploy a sidecar container |
| Log classification and storage | Medium: can be mapped by container/path | Each POD configured independently; high flexibility |
| Multi-tenant isolation | Weak: isolation only through configuration | Strong: isolated by container, with resources allocated separately |
| Supported cluster scale | Small to medium; at most on the order of hundreds of businesses | Unlimited |
| Resource usage | Low: one agent container per node | High: one agent container per POD |
| Query convenience | Fairly high: custom queries and statistics | High: customizable to business characteristics |
| Customizability | Low | High: each POD configured independently |
| Applicable scenarios | Single-function clusters | Large, mixed, and PaaS clusters |
Inside Alibaba, large PaaS clusters mainly use Sidecar collection, which offers the best isolation and flexibility, while clusters with relatively single functions (internal departments, self-built products) generally adopt DaemonSet, which has the lowest resource usage.
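As a minimal sketch of the Sidecar pattern using the upstream Kubernetes Go API types: the agent container shares an emptyDir volume with the business container and reads its log files from there. The image names and mount paths below are placeholders, not our actual deployment:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildSidecarPod builds a POD in which the business container and a log
// agent container share an emptyDir volume, so the agent can read the files
// the application writes.
func buildSidecarPod() *corev1.Pod {
	logVolume := corev1.Volume{
		Name:         "app-logs",
		VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
	}
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "app-with-log-sidecar"},
		Spec: corev1.PodSpec{
			Volumes: []corev1.Volume{logVolume},
			Containers: []corev1.Container{
				{
					Name:  "app",                    // main business container
					Image: "example.com/app:latest", // placeholder image
					VolumeMounts: []corev1.VolumeMount{
						{Name: "app-logs", MountPath: "/var/log/app"},
					},
				},
				{
					Name:  "log-agent",                // sidecar collector
					Image: "example.com/agent:latest", // placeholder image
					VolumeMounts: []corev1.VolumeMount{
						{Name: "app-logs", MountPath: "/var/log/app", ReadOnly: true},
					},
				},
			},
		},
	}
}

func main() {
	fmt.Println(buildSidecarPod().Name)
}
```

The DaemonSet variant differs only in packaging: one agent per node, typically mounting the node's container log directory instead of a per-POD volume.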
Question 2: How to reduce resource consumption
Our collection agent is the self-developed Logtail, written in C++/Go. Compared with open source agents, Logtail has a great advantage in resource consumption, but we keep squeezing the resource cost of data collection, especially in container scenarios. Normally, to improve logging and collection performance, local SSDs are used as log disks. A simple calculation: if each container reserves 1 GB of SSD and one physical machine runs 40 containers, each physical machine needs 40 GB of SSD for log storage, and 50,000 physical machines need 2 PB of SSD. To reduce this resource consumption, we developed a FUSE-based collection method together with colleagues at Ant Financial. FUSE (Filesystem in Userspace) virtualizes the log disk: the application writes logs directly to the virtual log disk, and the data is then collected by Logtail straight from memory to the server. The benefits of this method include:
- Physical machines no longer need to provide log disks for containers, achieving diskless logging.
- From the application's point of view it is still writing to a normal file system; no modification is needed.
- Data collection bypasses the disk and goes directly from memory to the server.
- All data is stored on the server side, which can scale out, so the log disk the application sees has effectively unlimited space.
Question 3: How to integrate seamlessly with Kubernetes
One of Kubernetes' big breakthroughs is using declarative APIs for service deployment, cluster management, and so on. In a K8s cluster, however, continuous integration and automated release of business applications/services/components is the norm. Configuring collection through a console or SDK is hard to integrate with the various CI and orchestration frameworks, and it forces users to manually configure log collection through the console after every application release. Therefore, we implemented a collection-configuration Operator based on Kubernetes' CRD extension mechanism. Users can configure collection directly through the K8s API, YAML, kubectl, Helm, and so on, truly folding log collection into the Kubernetes system and achieving seamless integration.
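A minimal sketch of what such a CRD-driven configuration can look like in Go; the group/version, kind, and field names here are illustrative stand-ins, not the platform's actual CRD schema:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LogCollectionConfigSpec declares what to collect and where to store it
// (field names are assumptions for illustration).
type LogCollectionConfigSpec struct {
	Logstore    string            `json:"logstore"`              // destination logstore
	InputType   string            `json:"inputType"`             // e.g. "file" or "stdout"
	InputDetail map[string]string `json:"inputDetail,omitempty"` // collector-specific settings
}

// LogCollectionConfig is the custom resource an Operator would watch.
// Shipping one of these alongside an application's Deployment makes log
// collection part of the same declarative release (kubectl apply / Helm).
type LogCollectionConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              LogCollectionConfigSpec `json:"spec"`
}

func main() {
	cfg := LogCollectionConfig{
		TypeMeta:   metav1.TypeMeta{APIVersion: "log.example.com/v1alpha1", Kind: "LogCollectionConfig"},
		ObjectMeta: metav1.ObjectMeta{Name: "nginx-stdout"},
		Spec: LogCollectionConfigSpec{
			Logstore:  "nginx-stdout",
			InputType: "stdout",
		},
	}
	fmt.Printf("%s %s -> logstore %q\n", cfg.APIVersion, cfg.Kind, cfg.Spec.Logstore)
}
```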
Question 4: How to manage millions of Logtail agents
There is a classic principle of people management: with 10 people, look after everyone personally; with 100, be decisive; with 1,000, be hands-off. The same applies to managing Logtail, our log collection agent, which went through three main stages:
- Hundred scale: a few years ago, when Logtail was first deployed, it ran on hundreds of physical machines. Like other mainstream agents, Logtail mainly performed data collection: input, processing, aggregation, and sending. Management was basically manual; when collection had a problem, someone logged in to the machine to investigate.
- Ten-thousand scale: as more and more applications were onboarded, each machine might host several applications collecting different types of data, and manually configured onboarding became harder and harder to maintain. We therefore focused on multi-tenant isolation and centralized configuration management, and added many control mechanisms such as rate limiting and degradation.
- Million scale: at this deployment scale, exceptions become the norm. What we need instead is a series of monitoring and reliability-assurance mechanisms and automated O&M tools, so that those mechanisms and tools complete agent installation, monitoring, self-recovery, and so on, truly letting operators take their hands off.
Kubernetes logging platform architecture
The figure above shows the overall architecture of the Alibaba Kubernetes log platform, which is divided, from bottom to top, into a log access layer, a platform core layer, and a solution integration layer:
- The platform provides many ways to access various types of log data: not only Kubernetes logs, but all logs related to the Kubernetes business, such as mobile logs, web application click logs, and IoT logs. All data can be collected by active push or passively via an agent. The agent can be our self-developed Logtail or an open source agent (Logstash, Fluentd, Filebeat, etc.).
- Logs first arrive in a real-time queue provided by the platform. Similar to a Kafka consumer group, we provide a real-time data subscription function that users can use to fulfill their own ETL requirements. The core functions of the platform include:
- Real-time search: a search-engine-like approach that supports keyword search across all logs at very large (petabyte) scale.
- Real-time analysis: provides interactive log analysis based on SQL92 syntax.
- Machine learning: provides intelligent analysis methods such as time series prediction, time series clustering, root cause analysis, and log clustering.
- Stream computing: interconnects with various stream computing engines, such as Flink, Spark Streaming, and Storm.
- Offline analysis: connects to offline analysis engines, such as Hadoop and MaxCompute.
- Based on full coverage of data sources and the core functions provided by the platform, and combined with the characteristics and application scenarios of Kubernetes logs, we build general Kubernetes log solutions, such as audit log analysis, Ingress log analysis, and ServiceMesh log analysis. Applications/users with specific requirements can also build upper-layer solutions directly on the OpenAPI provided by the platform, such as Trace systems and performance analysis systems.
Let's now walk through the platform's core functions from the perspective of troubleshooting.
PB-scale log query
The best way to troubleshoot a problem is to check the logs. For most people, the first thing that comes to mind is grepping the logs for key error information. Grep is one of the most popular commands among Linux programmers and works well in simple troubleshooting scenarios; when an application is deployed on multiple machines, it is combined with commands such as pgm and pssh. However, these commands do not suit dynamic, large-scale scenarios like Kubernetes. The main problems are:
- Queries are not flexible; it is difficult to express combinations of logical conditions with grep.
- Grep is a plain-text tool; it cannot treat log fields as typed values such as Long, Double, or even JSON.
- Grep assumes logs are stored on local disk. In Kubernetes, the local log space of an application is very small, services migrate and scale dynamically, and the local data source may no longer exist.
- Grep is a typical full scan. For data under 1 GB, query time is acceptable, but once data grows to TB or even PB scale, search engine technology is required.
In 2009, to address R&D efficiency and problem diagnosis at large scale (e.g., clusters of 5,000 machines), we began building a very-large-scale log query platform. Its main goal is speed: queries over billions of log entries complete easily in seconds.
Log context
After key logs are located through a query, we need to analyze system behavior and reconstruct the scene, that is, the log context at the time. For example:
- An error and the surrounding entries in the same log file
- Lines written in sequence by the same process through one LogAppender to the log module
- One request, grouped by the same session
- One cross-service request, tied together by the same TraceId
In the Kubernetes scenario, each container's standard output (stdout) and each of its files forms a context partition, for example Namespace+Pod+ContainerID+FileName/stdout. To support context, our collection protocol attaches a globally unique and monotonically increasing cursor to each minimum differentiation unit. This cursor takes different forms for stand-alone logs, Docker, K8s, mobile SDKs, Log4j/Logback, and so on.
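A simplified illustration of the cursor invariant: each context partition gets its own strictly increasing sequence number, so the server can later sort a partition's lines to restore context. The real collection protocol is more elaborate; this sketch only shows the core idea:

```go
package main

import (
	"fmt"
	"sync"
)

// cursorAllocator issues a monotonically increasing cursor per context
// partition (namespace+pod+containerID+file, or stdout).
type cursorAllocator struct {
	mu      sync.Mutex
	cursors map[string]uint64 // partition key -> last issued cursor
}

func newCursorAllocator() *cursorAllocator {
	return &cursorAllocator{cursors: make(map[string]uint64)}
}

// next returns a strictly increasing cursor for the given partition.
func (a *cursorAllocator) next(namespace, pod, containerID, file string) uint64 {
	key := namespace + "/" + pod + "/" + containerID + "/" + file
	a.mu.Lock()
	defer a.mu.Unlock()
	a.cursors[key]++
	return a.cursors[key]
}

func main() {
	alloc := newCursorAllocator()
	for i := 0; i < 3; i++ {
		fmt.Println(alloc.next("default", "nginx-1", "c0ffee", "stdout"))
	}
	// Prints 1, 2, 3: lines from one partition can always be re-ordered.
}
```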
An analysis engine for logging
In some complex scenarios, we need statistics over log data to find patterns: for example, aggregating by ClientIP to find attack source IPs, aggregating data to compute P99/P9999 latency, or combining multiple dimensions in one analysis. In the traditional model, you pair a stream or offline computing engine for the aggregation, then connect a visualization system for charts or an alerting system. Users then have to maintain multiple systems, data freshness degrades, and the connections between systems are prone to failure.
Therefore, our platform natively integrates log analysis, visualization, and alerting, minimizing the links users must configure. After years of practice, we found SQL to be the analysis method users accept most readily, so we implemented analysis based on the SQL92 standard and extended it with many advanced functions for log analysis scenarios, such as:
- Interval comparison: comparing data across periods is one of the most common methods in log analysis. We provide year-over-year/period-over-period functions that can, for example, compute the growth rate of today's PV versus yesterday's or last week's.
- IP geolocation functions: based on Taobao's high-accuracy IP geolocation database, we convert IPs to country, province, city, carrier, latitude/longitude, and so on. For example, the remote IP in common Nginx access logs and K8s Ingress access logs can be used directly to analyze geographic distribution.
- Join with external data sources: join logs against MySQL or CSV, for example looking up user information by ID in a database, or associating logs with network architecture data in a CMDB.
- Security functions: support common security analysis methods, such as lookups against high-risk IP databases, SQL injection analysis, and high-risk SQL detection.
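Two illustrative queries in the platform's "search | SQL" style, kept as Go string constants so the sketch stays self-contained. The function names ip_to_province and compare follow publicly documented Log Service analysis functions, but treat the exact spelling and signatures here as assumptions rather than a definitive reference:

```go
package main

import "fmt"

func main() {
	// Geographic distribution of Ingress clients, aggregated by province.
	geo := `* | SELECT ip_to_province(remote_ip) AS province, count(*) AS pv
	        GROUP BY province ORDER BY pv DESC LIMIT 10`

	// Day-over-day PV comparison: compare(pv, 86400) contrasts the current
	// window with the same window 86400 seconds (one day) earlier.
	dayOverDay := `* | SELECT compare(pv, 86400) FROM
	        (SELECT count(*) AS pv FROM log)`

	fmt.Println(geo)
	fmt.Println(dayOverDay)
}
```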
Intelligent Log Analysis
On the log platform, applications and users can access, query, analyze, and visualize logs and set alerts, covering exception monitoring and problem investigation and localization. However, with the continuous evolution of computing forms, application forms, and developers' responsibilities, and especially with the rise of Kubernetes, ServiceMesh, and Serverless in the past two years, problem complexity keeps rising and conventional means no longer suffice. So we began moving toward AIOps, with functions such as time series analysis, root cause analysis, and log clustering.
Time series analysis
- Through time series prediction, we can model CPU and storage time series to make scheduling more intelligent and keep overall utilization smooth. Storage teams can budget ahead and purchase machines based on forecasts of disk-space growth; when preparing department/product budgets, annual consumption can be forecast from annual bills to better control costs.
- A moderately large service may run on hundreds, thousands, or even tens of thousands of machines. It is hard to spot differences in each machine's behavior (its time series) by hand, but time series clustering quickly yields the distribution of behavior across the cluster and locates abnormal machines. For a single time series, anomaly detection methods can automatically locate the abnormal points.
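As a deliberately simple stand-in for the anomaly detection mentioned above, the sketch below flags points that deviate from a series' mean by more than k standard deviations. Production methods (forecasting, clustering) are far more sophisticated; this only conveys the idea of automatically locating abnormal points in one machine's metric series:

```go
package main

import (
	"fmt"
	"math"
)

// anomalies returns the indices of points more than k standard deviations
// away from the mean of the series.
func anomalies(series []float64, k float64) []int {
	var mean, varSum float64
	for _, v := range series {
		mean += v
	}
	mean /= float64(len(series))
	for _, v := range series {
		varSum += (v - mean) * (v - mean)
	}
	std := math.Sqrt(varSum / float64(len(series)))

	var idx []int
	for i, v := range series {
		if math.Abs(v-mean) > k*std {
			idx = append(idx, i)
		}
	}
	return idx
}

func main() {
	cpu := []float64{31, 30, 33, 29, 32, 31, 95, 30, 28} // one spike at index 6
	fmt.Println(anomalies(cpu, 2))                       // -> [6]
}
```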
Root cause analysis
Time-series functions are mainly used to find problems; pattern analysis (root cause analysis) is needed to find the root cause. For example, when the overall Ingress error rate (the proportion of 5XX responses) of a K8s cluster suddenly rises, how do we determine whether it is caused by a particular service, a user, a URL, a browser, a network problem in some region, a node exception, or the system as a whole? Usually such problems are investigated manually, dimension by dimension, for example:
- Group by service and check whether error rates differ across services
- No difference; then check the URLs
- Still nothing; check the browsers
- The browser correlates; continue to compare mobile and PC
- The mobile error rate is higher; check whether it is Android or iOS
- ...
The more dimensions a problem has, the more complex and time-consuming the investigation becomes, and by the time the cause is found, the impact has already spread. Therefore, we developed root cause analysis functions that, from multidimensional data, directly locate the combinations of dimensions that most affect the target metric (such as latency or failure rate). To locate problems even more precisely, we also support comparing the differences between two patterns: for example, when an exception occurs today, compare against yesterday's normal pattern to quickly find the cause, or run blue-green comparisons and A/B tests at release time.
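A greatly simplified sketch of the drill-down: for each dimension, find the value whose traffic slice contributes the most excess errors relative to the overall error rate. Real root cause analysis searches dimension combinations and scores them more carefully; this version only automates the single-dimension pass described in the list above:

```go
package main

import "fmt"

// Record is one aggregated slice of traffic with an error count.
type Record struct {
	Dims   map[string]string // e.g. {"service": "cart", "browser": "chrome"}
	Total  int
	Errors int
}

// worstDimension returns the dimension=value pair with the largest number of
// errors beyond what the overall error rate would predict for its traffic.
func worstDimension(records []Record, dims []string) (string, string) {
	var total, errors int
	for _, r := range records {
		total += r.Total
		errors += r.Errors
	}
	baseRate := float64(errors) / float64(total)

	bestDim, bestVal, bestExcess := "", "", 0.0
	for _, d := range dims {
		byVal := map[string]*Record{}
		for _, r := range records {
			v := r.Dims[d]
			if byVal[v] == nil {
				byVal[v] = &Record{}
			}
			byVal[v].Total += r.Total
			byVal[v].Errors += r.Errors
		}
		for v, agg := range byVal {
			// excess errors beyond what the base rate predicts for this slice
			excess := float64(agg.Errors) - baseRate*float64(agg.Total)
			if excess > bestExcess {
				bestDim, bestVal, bestExcess = d, v, excess
			}
		}
	}
	return bestDim, bestVal
}

func main() {
	records := []Record{
		{Dims: map[string]string{"service": "cart", "browser": "chrome"}, Total: 1000, Errors: 10},
		{Dims: map[string]string{"service": "cart", "browser": "android"}, Total: 500, Errors: 200},
		{Dims: map[string]string{"service": "user", "browser": "chrome"}, Total: 1200, Errors: 12},
	}
	d, v := worstDimension(records, []string{"service", "browser"})
	fmt.Printf("most suspicious slice: %s=%s\n", d, v) // -> browser=android
}
```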
Intelligent log clustering
The intelligent time-series functions above find the problem, and root cause analysis locates the key dimension combinations, but the final code-level troubleshooting still depends on the logs themselves. When log volume is large, manually filtering logs over and over is time-consuming, so we use intelligent clustering to group similar logs together, letting users quickly understand the system's running state from the clustered output.
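A toy version of the idea: reduce each line to a template by replacing variable fragments (numbers, hex IDs) with placeholders, then count lines per template. The production feature uses real clustering algorithms; this only shows why grouping shrinks what a human has to read:

```go
package main

import (
	"fmt"
	"regexp"
)

// variableToken matches fragments that usually vary between otherwise
// identical log lines: decimal numbers and hex identifiers.
var variableToken = regexp.MustCompile(`\b(0x[0-9a-fA-F]+|\d+)\b`)

// template collapses a log line to its invariant shape.
func template(line string) string {
	return variableToken.ReplaceAllString(line, "<*>")
}

func main() {
	lines := []string{
		"connect to 10.0.0.1 failed after 3 retries",
		"connect to 10.0.0.7 failed after 5 retries",
		"request 8231 served in 12 ms",
		"request 8232 served in 9 ms",
	}
	counts := map[string]int{}
	for _, l := range lines {
		counts[template(l)]++
	}
	for t, c := range counts {
		fmt.Printf("%4d  %s\n", c, t)
	}
	// Four raw lines collapse to two templates:
	//    2  connect to <*>.<*>.<*>.<*> failed after <*> retries
	//    2  request <*> served in <*> ms
}
```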
Upstream and downstream ecosystem integration
The Kubernetes log platform is mainly aimed at DevOps, Net/Site Ops, and Sec Ops problems, but that alone cannot satisfy every logging requirement, such as ultra-large-scale log analysis, BI analysis, or filtering against extremely large security rule sets. The strength of a platform lies largely in the strength of its ecosystem, so we connect with a broad upstream and downstream ecosystem to meet ever more log needs and scenarios.
Application case studies
Case 1: Log management for a hybrid-cloud PaaS platform
A large game company was upgrading its technical architecture, and most of its business would migrate to a PaaS platform built on Kubernetes. The customer adopted a hybrid-cloud architecture, which made unified construction and management of logs very difficult:
- Many internal applications: applications should not be exposed to details such as log collection and storage, yet full-link logs must be available to them.
- 1000+ microservices: log collection must work at large scale.
- Multi-cloud plus offline IDC: the same collection solution should work across multiple cloud vendors and offline IDCs.
- Short application life cycles: some applications live only briefly, so their data must be collected to the server in time.
- Overseas data backhaul: logs from overseas nodes are sent back for analysis, requiring stable and reliable transmission.
The customer finally chose the Alibaba Cloud Kubernetes log platform solution: Logtail solves the collection reliability problem, and the combination of public network, dedicated lines, and global acceleration solves the network problems. The system administrator uses a DaemonSet to uniformly collect the logs of all system components, while applications only need CRDs to collect their own business logs. On the platform side, the administrator can access all system-level logs and monitor and alert on them in a unified way. On the application side, each application can query not only its own business logs but also the logs of related middleware, Ingress, and system components for full-link analysis.
Case 2: Secondary development on the log platform
Many large business units within Alibaba want to do secondary development on top of our standard log platform to meet their departments' special needs, such as:
- Regulating data access through various rules and interface restrictions.
- Stringing the whole call chain together via TraceID to build a Trace platform.
- Fine-grained permission management for the many users within a department.
- Cost settlement for each sub-department.
- Integration with internal control and O&M systems.
These requirements can be implemented quickly with the OpenAPI and the SDKs for various languages that we provide. To reduce front-end workload, the platform also supports Iframe embedding, so pages such as the query box and dashboards can be embedded directly into a business unit's own systems.
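A sketch of how a business unit might wrap the platform's OpenAPI to build the TraceID-based chaining mentioned above. The client interface, project/logstore names, and the trace_id field are all hypothetical; a real integration would go through the platform's published SDK and authentication:

```go
package main

import "fmt"

// LogClient is a hypothetical stand-in for an SDK client.
type LogClient interface {
	Query(project, logstore, query string, fromUnix, toUnix int64) ([]string, error)
}

// traceLogs pulls every log line carrying the given TraceID so the whole
// call chain can be stitched together in the department's own UI.
func traceLogs(c LogClient, traceID string, fromUnix, toUnix int64) ([]string, error) {
	// Keyword-search syntax; "trace_id" is an assumed field name.
	return c.Query("my-project", "app-logs", "trace_id: "+traceID, fromUnix, toUnix)
}

// fakeClient lets the sketch run without any real backend.
type fakeClient struct{}

func (fakeClient) Query(_, _, q string, _, _ int64) ([]string, error) {
	return []string{"matched query: " + q}, nil
}

func main() {
	lines, _ := traceLogs(fakeClient{}, "4bf92f3577b34da6", 0, 1)
	fmt.Println(lines[0])
}
```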
Future work
At present, the Alibaba Kubernetes log platform has many applications both inside and outside the company. In the future we will continue to polish the platform and provide more complete solutions for applications and users. Follow-up work focuses on the following points:
- Further refine data collection, continuously optimizing reliability and resource consumption, with full multi-tenant isolation, so that on PaaS platforms the logs of all applications can be collected with a single DaemonSet.
- Provide more convenient and intelligent data cleaning services, cleaning and organizing heterogeneous data within the platform.
- Build an automatically updated knowledge graph for the Ops domain that supports heterogeneous data, so troubleshooting experience accumulates in a knowledge base, enabling anomaly search and inference.
- Provide an interactive training platform and build more intelligent automation capabilities, truly closing the Ops loop.
This article is original content of Alibaba Cloud's Yunqi Community and may not be reproduced without permission.