Author | Yuanyi, Technical Expert, Alibaba Cloud Storage Services

Introduction: In the previous article, "6 Typical Problems in Building a K8s Logging System: How Many Have You Encountered?", we explained why a log system is needed, why it is so important in a cloud-native environment, and the difficulties of building one in a cloud-native context. DevOps, SRE, and operations engineers will relate deeply after reading it. This article shares how to build a flexible, powerful, reliable, and scalable logging system for cloud-native scenarios.

Requirements drive architectural design

Technical architecture is the process of turning product requirements into a technical implementation. Thoroughly analyzing product requirements is the most fundamental skill an architect needs. Many systems are overturned shortly after they are built, and the root cause is usually that they never addressed the product's real needs.

The Log Service team I work on has nearly 10 years of experience in the logging field and serves almost every team within Alibaba, covering e-commerce, payments, logistics, cloud computing, gaming, instant messaging, IoT, and other domains. Years of product optimization and iteration have been driven by the evolving log requirements of these teams.

Fortunately, in recent years we have productized the system on Alibaba Cloud, where it serves tens of thousands of enterprise users, including the top Internet companies in live streaming, short video, news media, gaming, and other industries. Going from serving one company to serving tens of thousands changes the product qualitatively. On the cloud, we have to think much harder about what problems a log platform must solve for users, what the core appeal of logs really is, and how to meet the needs of different industries and business roles…

Requirements decomposition and functional design

In the previous article, we analyzed the logging requirements of the various roles in a company and summarized them into the following points:

  1. Collect logs in various formats and from various data sources, including non-K8s sources
  2. Quickly search for and locate problem logs
  3. Parse semi-structured/unstructured logs in various formats into structured data, with support for fast statistical analysis and visualization
  4. Compute service metrics from logs in real time and raise real-time alerts on those metrics (essentially APM)
  5. Support multi-dimensional correlation analysis over very large volumes of logs, where some delay is acceptable
  6. Easily connect to external systems, or serve custom data consumers such as third-party audit systems
  7. Enable intelligent alerting, prediction, and root-cause analysis based on logs and related time-series data, with support for custom offline training to improve results

To meet the above functional requirements, the log platform must have the following functional modules:

  1. Comprehensive log collection: support both DaemonSet and Sidecar collection modes to meet different collection requirements, and support data sources such as web, mobile, IoT, and physical/virtual machines;
  2. A real-time log channel: a necessary function for connecting upstream and downstream systems, ensuring logs can be conveniently consumed by multiple systems;
  3. Data cleaning (ETL: Extract, Transform, Load): clean logs in various formats, with support for filtering, enrichment, transformation, masking, splitting, aggregation, and so on (a minimal cleaning sketch follows this list);
  4. Log display and search: a must-have for every log platform, quickly locating logs by keyword and viewing their context; these seemingly simple functions are the hardest to do well;
  5. Real-time analysis: search can only locate some problems, while analysis and statistics help quickly identify root causes and compute service metrics;
  6. Stream computing: we usually use a stream computing framework (Flink, Storm, Spark Streaming, etc.) to compute real-time metrics or perform custom cleaning of the data;
  7. Offline analysis: operations and security requirements need multi-dimensional correlation over large volumes of historical logs, which currently only a T+1 offline analysis engine can handle;
  8. A machine learning framework: conveniently and quickly feed historical logs into a machine learning framework for offline training, and load the trained models into an online real-time algorithm library.
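
To make the cleaning module (item 3) concrete, here is a minimal Python sketch of the kind of transformation involved: parsing a semi-structured log line into fields, converting a type, and masking a sensitive field. The log format and field names are assumptions for illustration, not part of any specific product.

```python
import re
from typing import Optional

# The log format and field names below are illustrative assumptions.
# Example line:
# 2019-11-01T10:00:00 INFO user=alice ip=10.0.0.8 path=/order status=200
LOG_PATTERN = re.compile(
    r"(?P<time>\S+)\s+(?P<level>\w+)\s+user=(?P<user>\S+)\s+"
    r"ip=(?P<ip>\S+)\s+path=(?P<path>\S+)\s+status=(?P<status>\d+)"
)

def clean(line: str) -> Optional[dict]:
    """Extract/transform one raw line into a structured record."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # filtering: drop lines that do not parse
    record = m.groupdict()
    record["status"] = int(record["status"])              # type conversion
    record["ip"] = record["ip"].rsplit(".", 1)[0] + ".x"  # masking
    return record

print(clean("2019-11-01T10:00:00 INFO user=alice ip=10.0.0.8 path=/order status=200"))
# -> {'time': '2019-11-01T10:00:00', 'level': 'INFO', 'user': 'alice',
#     'ip': '10.0.0.x', 'path': '/order', 'status': 200}
```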

Open source solution design

With the help of the strong open source community, it is easy to build such a logging platform by combining open source software. Below is a very typical ELK-centric logging platform design:

  • Collection agents such as Filebeat and Fluentd implement unified data collection from containers.
  • To provide richer upstream/downstream integration and buffering capabilities, Kafka can be used as the receiving end for collected data.
  • The collected raw data needs further cleaning: you can subscribe to Kafka with Logstash or Flink and write the cleaned data back to Kafka (a minimal sketch of this leg follows the list).
  • The cleaned data can then be fed into Elasticsearch for real-time query and retrieval, Flink for real-time metric and alert computation, Hadoop for offline data analysis, and TensorFlow for offline model training.
  • For data visualization, commonly used components such as Grafana and Kibana can be used.
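
As a rough illustration of the Logstash/Flink leg of this pipeline, the Python sketch below subscribes to a hypothetical cleaned-data topic in Kafka and indexes each record into Elasticsearch. It assumes the kafka-python and elasticsearch client libraries; the topic name, index naming scheme, and addresses are placeholders.

```python
import json

from kafka import KafkaConsumer          # pip install kafka-python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Topic, index prefix, and addresses are illustrative placeholders.
consumer = KafkaConsumer(
    "app-logs-cleaned",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
es = Elasticsearch(["http://elasticsearch:9200"])

for msg in consumer:
    doc = msg.value
    # Route each record to a per-day index to simplify retention management.
    index = f"app-logs-{doc.get('time', 'unknown')[:10]}"
    es.index(index=index, document=doc)  # `document=` in the 8.x client; 7.x uses `body=`
```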

Why we chose to build our own

Combining open source software in this way is very efficient: thanks to the strong open source community and the accumulated experience of its large user base, we can quickly stand up such a system, and it can meet most of our needs.

Once the system is deployed: logs flow in from containers, they can be found in Elasticsearch, SQL runs successfully on Hadoop, charts appear in Grafana, alert SMS messages arrive… Getting the whole process working may take only a few days of extra effort, and when the system is finally running you can take a breath and lean back in your office chair.

However, reality rarely lives up to the ideal. After pre-release validation and testing we went to production and onboarded the first application; gradually more applications were connected, and more and more people started to use the platform… At this point many problems may be exposed:

  • Kafka and ES have to be scaled out constantly, as does the connector that synchronizes data from Kafka to ES. The most troublesome part is the collection agent: the Fluentd DaemonSet deployed on each node cannot be scaled at all, so when a single agent hits its bottleneck the only choice is to switch to the Sidecar mode, which brings a series of other problems, such as how to integrate with the CI/CD system, resource consumption, configuration planning, no stdout collection, and so on.
  • Starting from edge businesses, more and more core businesses are gradually onboarded, and reliability requirements keep rising: developers report that data cannot be found in ES, operations complains about inaccurate statistics, security complains that data is not real-time… Troubleshooting each issue means walking the whole path of collection, queuing, cleaning, and transport, so diagnosis is very expensive. At the same time, a monitoring scheme must be built for the log system itself so problems are detected promptly, and this scheme cannot be built on the log system it monitors, i.e., it must not depend on itself (see the canary sketch after this list).
  • As more and more developers use the logging platform to investigate problems, it becomes common for one or two people to submit a huge query that drives up overall system load and blocks everyone else's queries, or even triggers a full GC. At this point, companies with strong engineering capability may modify ES to support multi-tenant isolation, or build separate ES clusters for different business units, but then they end up operating and maintaining multiple ES clusters, which is still a heavy workload.
  • After we invest a lot of manpower and finally keep the log platform running for daily use, the finance department comes to us and says we are using too many machines and the cost is too high. Now we need to optimize cost, yet we still need all those machines: on most days utilization sits at 20%-30%, but peaks can reach 70%, so we cannot simply remove capacity. The only option is peak shaving and valley filling, which is yet another pile of work.
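
On the self-monitoring point above: a common pattern is an external end-to-end canary that does not run on the log platform itself. It emits a unique marker record and verifies the record becomes searchable within a deadline, alerting through an independent channel if not. The sketch below only illustrates the idea; the index name, field names, and deadline are assumptions, and it indexes directly into Elasticsearch rather than travelling the full agent → Kafka → cleaning path.

```python
import time
import uuid

from elasticsearch import Elasticsearch  # pip install elasticsearch

# Endpoint, index, and deadline are illustrative assumptions.
es = Elasticsearch(["http://elasticsearch:9200"])
CANARY_INDEX = "log-canary"
DEADLINE_SECONDS = 60

def pipeline_healthy() -> bool:
    """Write a unique marker, then poll until it is searchable or we time out."""
    marker = str(uuid.uuid4())
    es.index(index=CANARY_INDEX, document={"marker": marker, "ts": time.time()})
    deadline = time.time() + DEADLINE_SECONDS
    while time.time() < deadline:
        resp = es.search(index=CANARY_INDEX,
                         query={"term": {"marker.keyword": marker}})
        if resp["hits"]["total"]["value"] > 0:
            return True
        time.sleep(5)
    return False

if __name__ == "__main__":
    if not pipeline_healthy():
        # Alert through a channel that does NOT depend on the log system,
        # e.g. SMS or a separate paging service.
        print("ALERT: log pipeline end-to-end check failed")
```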

These are the problems a mid-sized Internet company typically encounters when building a log platform, and inside Alibaba they are magnified many times over:

  • For example, none of the open source software on the market can handle the traffic volume we face during Double 11.
  • With tens of thousands of business applications inside Alibaba and thousands of engineers working at the same time, concurrency and multi-tenant isolation have to be pushed to the extreme.
  • For the many core applications such as orders and transactions, full-link stability requires three or even four nines of availability.
  • With such a huge daily data volume, cost optimization is extremely important: a 10% cost reduction can translate into savings of hundreds of millions.

Alibaba's K8s logging solution

In response to the problems above, we have spent many years developing and polishing a K8s logging solution:

  1. We use Logtail, our self-developed log collection agent, for comprehensive K8s data collection. Logtail is now deployed on millions of machines across the Group, and its performance and stability have repeatedly passed the financial-grade test of Double 11.
  2. To reduce complexity, data queuing, cleaning and processing, real-time retrieval, real-time analysis, and AI algorithms are natively integrated rather than assembled, building-block style, from various open source software. This greatly shortens the data link, and a shorter link also means fewer opportunities for error.
  3. The queuing, cleaning and processing, retrieval, analysis, and AI engines are all deeply customized and optimized for log scenarios, meeting requirements for high throughput, dynamic scaling, second-level queries over hundreds of millions of logs, low cost, and high availability.
  4. For general stream computing and offline analysis scenarios, both the open source world and Alibaba already have very mature products, so we support them through seamless integration. Log Service currently integrates with dozens of downstream open source and cloud products.

The system now supports log analysis for the entire Alibaba Group, Ant Financial, and customers on Alibaba Cloud, with more than 16 PB of data written every day. Developing and operating a system of this scale involves many problems and challenges that cannot be elaborated here; interested readers can refer to our team's share: "Design and Implementation of Alibaba's 10 PB/Day Log System".
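
For the productized side, a query against Log Service looks roughly like the sketch below, assuming the aliyun-log-python-sdk; the endpoint, project, logstore, and credentials are placeholders, and the calls reflect our reading of the SDK's public API rather than a verified excerpt.

```python
import time

from aliyun.log import LogClient, GetLogsRequest  # pip install aliyun-log-python-sdk

# Endpoint, credentials, project, and logstore are placeholders.
client = LogClient("cn-hangzhou.log.aliyuncs.com",
                   "<access_key_id>", "<access_key_secret>")

now = int(time.time())
# One statement combines full-text filtering with SQL-style aggregation.
request = GetLogsRequest(project="my-project", logstore="app-logs",
                         fromTime=now - 900, toTime=now,
                         query="error | select count(*) as errors")
response = client.get_logs(request)
for log in response.get_logs():
    print(log.get_contents())
```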

Conclusion

This article has introduced, at the architecture level, how to build a K8s log analysis platform, covering both an open source solution and the solution Alibaba developed in-house. However, a lot of work remains before such a system can truly go to production and run effectively:

  1. What is the right way to use logs on K8s?
  2. K8s log collection: DaemonSet or Sidecar?
  3. How does the logging solution integrate with CI/CD?
  4. How should log storage be partitioned across applications in a microservices architecture?
  5. How to monitor K8s itself based on K8s system logs?
  6. How to monitor the reliability of the log platform itself?
  7. How to automatically inspect multiple microservices/components?
  8. How to automatically monitor multiple sites and quickly locate traffic anomalies?

"The Alibaba Cloud Native WeChat public account (ID: Alicloudnative) focuses on technical fields such as microservices, Serverless, containers, and Service Mesh, follows popular cloud-native technology trends and large-scale cloud-native implementation practices, and aims to be the public account that best understands cloud-native developers."