On November 23, Alibaba officially open-sourced iLogtail, its observable data collector. As the infrastructure for observable data collection inside Alibaba, iLogtail collects logs, monitoring data, traces, events, and other observable data for Alibaba Group and Ant. iLogtail runs in many environments, including servers, containers, K8s, and embedded systems, and can collect hundreds of types of observable data. It has been installed tens of millions of times and collects dozens of PB of observable data every day. It is widely used in online monitoring, problem analysis and localization, operational analysis, and security analysis.
Author: Yuan Yi | Source: Alibaba Technology official account
iLogtail and observability
Observability is not a new concept. It evolved gradually from IT system monitoring, troubleshooting, stability engineering, operational analysis, BI, security analysis, and related practices. Compared with traditional monitoring, its core advance is to collect as many kinds of observable data as possible so that the system becomes a white box. iLogtail is positioned as the collector of observable data: it gathers as much observable data of all kinds as it can and helps the observability platform build upper-layer application scenarios.
Challenges of observable data collection at Alibaba
Many open source agents exist for collecting observable data, such as Logstash, Filebeat, Fluentd, Collectd, and Telegraf. These agents are rich in functionality, and combining them with some extension work can basically meet the various internal data collection needs. However, they could not meet several key challenges in performance, stability, and controllability, so in the end we chose to build our own:
1. Resource consumption: Alibaba currently has millions of hosts (physical machines, virtual machines, and containers) that generate dozens of PB of observable data every day. For us, every 1 MB of memory saved and every 1 MB/s of performance gained is a huge resource saving, and the cost savings may reach millions or even tens of millions. Many open source agents are designed with functionality rather than performance in mind, and retrofitting an existing open source agent is basically infeasible. For example:
- Open source agents generally process 2-10 MB/s per core, while we want 100 MB/s.
- The memory of open source agents grows explosively when collection targets increase, data volume increases, collection is delayed, or the server side misbehaves. We want memory to stay low in all of these situations.
- The resource consumption of open source agents cannot be controlled from within and can only be capped by force with cgroups. The end result is repeated OOM kills and restarts, with data never getting collected. We want the agent to keep working normally within whatever CPU, memory, and traffic limits are specified.
2. Stability: Stability is an eternal topic. Beyond guaranteeing the accuracy of the collected data, the agent must never affect business applications, otherwise the impact would be disastrous. Apart from the basic stability of the agent itself, stability engineering needs many features that open source agents currently do not provide:
- Agent self-recovery: The agent recovers automatically after Critical events, with self-recovery at multiple levels such as the process, parent and child processes, and daemons.
- Global multi-dimensional monitoring: The stability of agents can be monitored from a global perspective across versions, collection configurations, load levels, and regions/networks.
- Problem isolation: Whatever problem occurs, the agent must isolate it as far as possible. For example, when one of several collection configurations on an agent fails, the other configurations must not be affected. When the agent itself fails, the application processes on the machine must not be affected.
- Rollback ability: Version updates and releases are unavoidable. When problems occur, we need to roll back quickly and ensure that data is still collected at least once, or even exactly once, during the problem and the rollback.
3. Controllability: Observable data has a very wide range of uses. Almost every business, operations, BI, and security team wants to use it, a single machine produces many kinds of data, and the data produced on one machine is used by people from several departments. In 2018, for example, an average virtual machine had more than 100 different types of data to collect, involving more than 10 different departments. Beyond this, many other enterprise-level features need to be supported, such as:
- Remote configuration management: In large-scale scenarios, logging in to machines by hand to modify configurations is practically impossible. A configuration management mechanism is therefore required, including graphical configuration management, remote storage, and automatic configuration delivery. Configurations must also be distinguishable by application, region, and owner. Because remote configurations are loaded and unloaded dynamically, the agent must also ensure that data is neither lost nor duplicated while a configuration is reloaded.
- Collection configuration priority: When multiple collection configurations run on one machine and resources are insufficient, configurations must be distinguished by priority. Resources go to higher-priority configurations first, while lower-priority configurations must still be kept from starving.
- Degradation and recovery ability: At Alibaba, major promotions and flash sales are routine. During such peaks many less important applications may be downgraded, and collection of their data has to be downgraded as well. After the peak has passed in the early morning, enough burst capacity is needed to catch up on the backlog quickly.
- Completeness of data collection: Monitoring, data analysis, and similar scenarios require accurate data, and accuracy presupposes that the data reaches the server in time. Determining, before computation starts, the point in time up to which the data of every machine and every file has been collected requires a very complex mechanism.
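The collection configuration priority requirement above (higher-priority configurations are served first, lower-priority ones must not starve) can be illustrated with a weighted round-robin. This is only a minimal sketch under stated assumptions, not iLogtail's actual scheduler: the names `ConfigQueue` and `schedule`, the weights, and the item-count budget are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ConfigQueue:
    """Pending data for one collection configuration (hypothetical model)."""
    name: str
    weight: int  # higher weight = higher priority; must be >= 1
    pending: deque = field(default_factory=deque)


def schedule(queues, budget):
    """Pick up to `budget` items using weighted round-robin.

    In each round every non-empty queue may send up to `weight` items,
    so high-priority configurations get a larger share while every
    low-priority configuration still sends at least one item per round.
    """
    sent = []
    # Visit higher-priority queues first within each round.
    ordered = sorted(queues, key=lambda q: q.weight, reverse=True)
    while len(sent) < budget and any(q.pending for q in ordered):
        for q in ordered:
            for _ in range(q.weight):
                if len(sent) >= budget or not q.pending:
                    break
                sent.append((q.name, q.pending.popleft()))
    return sent


# Two configurations compete for a budget of 8 items.
billing = ConfigQueue("billing", weight=3, pending=deque(range(10)))
debug = ConfigQueue("debug", weight=1, pending=deque(range(10)))
picked = schedule([billing, debug], budget=8)
names = [name for name, _ in picked]
```

With these weights the higher-priority queue receives three slots for every one slot of the lower-priority queue, and the lower-priority queue is never skipped for a whole round. A strict priority queue would be simpler but could starve low-priority configurations indefinitely under sustained load.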
Based on the above background and challenges, we have gradually optimized and improved iLogtail since 2013 to solve problems of performance, stability, and controllability, and it has withstood the tests of Alibaba's Double 11, Double 12, the Spring Festival Gala red envelope campaign, and other projects. iLogtail supports the collection of Logs, Traces, and Metrics. Its main features are as follows:
- Supports collection of a variety of Logs, Traces, and Metrics, with particularly good support for container and Kubernetes environments
- Very low resource consumption: single-core collection throughput of 100 MB/s, 5-20 times the performance of comparable observable data collection agents
- High stability: used and verified in production at Alibaba and by tens of thousands of Alibaba Cloud customers, with nearly ten million deployments collecting dozens of PB of observable data every day
- Supports plug-in extension: data collection, processing, aggregation, and sending modules can be extended freely
- Supports remote configuration management through a graphical console, SDK, and K8s Operator, making it easy to manage data collection on millions of machines
- Supports advanced management and control features such as self-monitoring, traffic control, resource control, proactive alerting, and collection statistics
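One common way to implement the kind of traffic control listed above is a token bucket. The article does not say which mechanism iLogtail uses, so the following is only an illustrative sketch: the class `TokenBucket`, its parameters, and the injected clock are assumptions made for the example.

```python
import time


class TokenBucket:
    """Byte-rate limiter: allows bursts up to `capacity` bytes and
    refills at `rate` bytes per second (hypothetical helper)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_send(self, nbytes):
        """Return True and consume tokens if `nbytes` fits in the budget."""
        now = self.clock()
        # Refill according to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should hold the data back and retry later


# Deterministic demo with a fake clock instead of real time.
fake_now = [0.0]
bucket = TokenBucket(rate=100, capacity=200, clock=lambda: fake_now[0])
first = bucket.try_send(150)    # fits in the initial burst
second = bucket.try_send(100)   # only 50 tokens left, rejected
fake_now[0] += 1.0              # one second later: 100 more tokens
third = bucket.try_send(100)
```

The bucket capacity sets how large a burst is tolerated and the refill rate sets the sustained limit, so an agent can stay within a configured traffic limit instead of being killed by an external cap.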
iLogtail development history
In keeping with the plain style of Alibaba engineers, iLogtail's name is also very plain. At first we wanted a unified tool to tail logs, so we called it Logtail. The "i" was added because it used inotify at the time, which kept log collection latency at the millisecond level, so it finally became iLogtail. Since development started in 2013, iLogtail's history can be roughly divided into three stages: the Feitian (Apsara) 5K stage, the Alibaba Group stage, and the cloud native stage.
1 Feitian (Apsara) 5K stage
As a milestone for cloud computing in China, on August 15, 2013, Alibaba Group officially put into operation a Feitian cluster with 5,000 servers (5K), becoming the first company in China to independently develop a large-scale general-purpose computing platform and the first company in the world to offer 5K cloud computing service capability externally.
Since 2009, the Feitian 5K project grew gradually from 30 machines to 5,000, continually solving core system problems such as scale, stability, operations and maintenance, and disaster recovery. iLogtail was born in this stage, originally to solve monitoring, problem analysis, and problem localization (what is now called "observability") for those 5,000 machines. The jump from 30 to 5,000 brought many observability challenges, including single-machine bottlenecks, problem complexity, ease of troubleshooting, and management complexity.
In the 5K stage, iLogtail essentially solved the challenges of moving from single-machine and small-cluster operation to large-scale operations monitoring. Its main features in this stage were:
- Functions: real-time collection of logs and monitoring data, with millisecond-level log capture latency
- Performance: single-core processing capacity of 10 MB/s; average resource usage across the 5,000-machine cluster of 0.5% of a CPU core
- Reliability: automatically watches for new files and folders, supports file rotation, and handles network interruptions
- Management: remote web management and automatic delivery of configuration files
- Operation and maintenance: added to the group's yum repository, with running-status monitoring and automatic exception reporting
- Scale: 30,000+ deployments, thousands of collection configuration items, 10 TB of data per day
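To make the file rotation point above concrete, here is a minimal sketch of a tailer that survives rename-based rotation by comparing inodes. It assumes a POSIX filesystem and rotation by rename (copy-truncate rotation is not handled). The class `Tailer` is hypothetical and is not how iLogtail itself is implemented.

```python
import os
import tempfile


class Tailer:
    """Follows one log path across rename-based rotation (illustrative)."""

    def __init__(self, path):
        self.path = path
        self.fh = None      # open handle; stays valid after a rename
        self.inode = None

    def _open(self):
        self.fh = open(self.path, "rb")
        self.inode = os.fstat(self.fh.fileno()).st_ino

    def _read_lines(self):
        out = []
        while True:
            pos = self.fh.tell()
            line = self.fh.readline()
            if not line.endswith(b"\n"):
                self.fh.seek(pos)  # partial line or EOF: retry next time
                break
            out.append(line.rstrip(b"\n").decode())
        return out

    def poll(self):
        """Return the complete lines appended since the previous call."""
        if self.fh is None:
            if not os.path.exists(self.path):
                return []
            self._open()
        # Drain the current handle first so lines written just before
        # the rotation are not lost.
        lines = self._read_lines()
        try:
            rotated = os.stat(self.path).st_ino != self.inode
        except FileNotFoundError:
            rotated = False
        if rotated:
            # The path now names a different file: switch to it.
            self.fh.close()
            self._open()
            lines += self._read_lines()
        return lines


# Demo: write, poll, rotate by rename, poll again.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "app.log")
    with open(path, "wb") as f:
        f.write(b"a\nb\n")
    tailer = Tailer(path)
    before = tailer.poll()
    with open(path, "ab") as f:
        f.write(b"c\n")
    os.rename(path, path + ".1")   # rotation
    with open(path, "wb") as f:
        f.write(b"d\n")
    after = tailer.poll()
    tailer.fh.close()
```

Keeping the old handle open until it is drained is what preserves the line written just before the rename. A production collector would also persist the inode and offset so that it can resume after a restart.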
2 Alibaba Group stage
iLogtail's use in the Alibaba Cloud Feitian 5K project solved the unified collection of logs and monitoring data. At that time Alibaba Group and Ant still lacked a unified, reliable log collection system, so we began promoting iLogtail as the log collection infrastructure for the group and Ant. Moving from a relatively self-contained project like 5K to group-wide use is not simple replication. We faced more deployments, higher requirements, and more departments:
- Operating at the scale of millions: At this time Alibaba and Ant had more than one million physical and virtual machines, and we hoped to operate, maintain, and manage a million-scale Logtail deployment with only 1/3 of the manpower
- Higher stability: At first, the data iLogtail collected was mainly used for troubleshooting. As application scenarios across the group broadened, requirements on log reliability kept rising, for example for billing data and transaction data, and iLogtail also had to withstand the very large data traffic of Double 11 and Double 12.
- Multiple departments and teams: Going from serving the 5K team to serving nearly a thousand teams, different teams use different iLogtail instances and a single iLogtail is shared by several teams, which posed new challenges for tenant isolation.
After several years of working with Alibaba Group and Ant, iLogtail made great progress in multi-tenancy and stability. Its main features at this stage were:
- Function: supports more log formats, such as regular expression, delimiter, and JSON; supports multiple log encodings; supports advanced processing such as data filtering and desensitization
- Performance: raised to 100 MB/s per core in minimalist mode, and 20 MB/s+ in regular expression, delimiter, and JSON modes
- Reliability: for data collection, added Polling-based collection, order guarantees for rotation queues, protection against log cleanup, and CheckPoint enhancements; for the process, added automatic recovery from Critical errors, automatic Crash reporting, and multi-level daemons
Principle of the order-preserving log collection scheme (for details, see "iLogtail Technology Sharing (I): Log Sequence Preserving Collection Scheme with Polling + Inotify Combination")
- Multi-tenant: supports whole-process multi-tenant isolation, multi-level high and low watermark queues, collection priority, configuration-level and process-level traffic control, and temporary degradation mechanisms
Multi-tenant isolation process (for details, see "iLogtail Technology Sharing (II): Multi-tenant Isolation Technology + Double Eleven Practical Effects")
- Operation and maintenance: automatic installation and guarding based on the group's StarAgent, proactive notification of exceptions, and a variety of self-check tools
- Scale: millions of deployments, thousands of internal tenants, 100,000+ collection configurations, PB-level data collected per day
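The high and low watermark queues mentioned above are described in detail only in the referenced article, so the following is a hedged, single-level sketch of the general idea: a queue that stops accepting input at a high watermark and resumes only after draining to a low watermark. The class `WatermarkQueue` and its thresholds are hypothetical.

```python
from collections import deque


class WatermarkQueue:
    """Bounded buffer with hysteresis (hypothetical, single level).

    Once the size reaches `high`, the queue stops accepting input and only
    resumes after the consumer has drained it down to `low`. The gap avoids
    flapping between "accept" and "reject" around a single threshold.
    """

    def __init__(self, low, high):
        if not 0 <= low < high:
            raise ValueError("require 0 <= low < high")
        self.low, self.high = low, high
        self.items = deque()
        self.accepting = True

    def push(self, item):
        """Return False when the producer should pause reading."""
        if not self.accepting:
            return False
        self.items.append(item)
        if len(self.items) >= self.high:
            self.accepting = False
        return True

    def pop(self):
        item = self.items.popleft()
        if len(self.items) <= self.low:
            self.accepting = True
        return item


q = WatermarkQueue(low=1, high=3)
accepted = [q.push(i) for i in range(5)]  # fills up at the third item
q.pop()                                   # size 2: still above low
still_blocked = q.push("x")
q.pop()                                   # size 1: low watermark reached
resumed = q.push("y")
```

If each collection configuration has its own queue of this kind, a slow or blocked tenant only pauses its own reader, which is one plausible way to keep tenants from affecting each other.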
3 Cloud native stage
As Alibaba moved all of its IT infrastructure to the cloud and SLS (Log Service), the product iLogtail belongs to, was officially commercialized on Alibaba Cloud, iLogtail began to fully embrace cloud native. The challenges it faced were no longer performance and reliability but how to adapt to cloud native (containerization, K8s, and the cloud environment), how to be compatible with open source protocols, and how to handle fragmented requirements. This was iLogtail's period of fastest growth, with many important changes:
- The original version of iLogtail was built with GCC 4.1.2, and its code also depended on the Feitian base libraries. To suit more environments, iLogtail was completely refactored so that a single codebase can be compiled and released for Windows/Linux, x86/Arm, server/embedded, and other environments
- Containerization and K8s support: in addition to collecting logs and monitoring data in container and K8s environments, configuration management was upgraded and extended through the Operator pattern. Configuring an AliyunLogConfig custom K8s resource is all that is needed to collect logs and monitoring data
iLogtail Kubernetes log collection principle (for details, see "Kubernetes Log Collection Principle Analysis")
- Plug-in extension: iLogtail added a plug-in system in which Input, Processor, Aggregator, and Flusher plug-ins can be freely extended to implement customized functionality
Overall flow of the iLogtail plug-in system (for details, see "iLogtail Plug-in System Introduction")
- Scale: tens of millions of deployments, tens of thousands of internal and external customers, millions of collection configuration items, dozens of PB of data collected per day
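The four plug-in roles named above (Input, Processor, Aggregator, Flusher) can be pictured as stages of a pipeline. The sketch below only mirrors those roles in Python for illustration. It does not reproduce iLogtail's real plug-in interfaces, and every function and class name in it is hypothetical.

```python
import json


def input_lines():
    """Input: produce raw events (a fixed list stands in for a real source)."""
    yield '{"level": "info", "msg": "started"}'
    yield '{"level": "debug", "msg": "noise"}'
    yield '{"level": "error", "msg": "disk full"}'


def parse_json(events):
    """Processor: turn raw text into structured records."""
    for raw in events:
        yield json.loads(raw)


def drop_debug(events):
    """Processor: filter out records that are not worth shipping."""
    return (e for e in events if e["level"] != "debug")


def batch(events, size):
    """Aggregator: group records into batches of at most `size`."""
    buf = []
    for e in events:
        buf.append(e)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


class MemoryFlusher:
    """Flusher: deliver batches to a backend (here, an in-memory list)."""

    def __init__(self):
        self.sent = []

    def flush(self, batches):
        for b in batches:
            self.sent.append(b)


def run_pipeline(source, processors, aggregator, flusher):
    """Chain the stages: Input -> Processors -> Aggregator -> Flusher."""
    stream = source()
    for p in processors:
        stream = p(stream)
    flusher.flush(aggregator(stream))


flusher = MemoryFlusher()
run_pipeline(input_lines, [parse_json, drop_debug],
             lambda ev: batch(ev, size=2), flusher)
```

Because each stage only consumes and produces a stream of events, a new data source, processing step, or output target can be added without touching the other stages, which is the main benefit a plug-in system of this shape offers.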
Open source background and expectations
We believe that open source is the best strategy for iLogtail and the way to release its greatest value. iLogtail is among the most fundamental software in the observability field. By open-sourcing it, we hope to build and optimize it together with the open source community and make it a world-class observable data collector. For the future of iLogtail, we look forward to the following:
- iLogtail has certain advantages over other open source collection software in performance and resource consumption. At our scale of tens of millions of deployments and dozens of PB of data per day, iLogtail saves 100 TB of memory and 100 million CPU core hours per year compared with open source software. We hope this collector can improve resource efficiency for more enterprises and bring about "common prosperity" in observable data collection.
- At present iLogtail is used only within Alibaba and by a small number of cloud customers (tens of thousands, which is still small compared with the tens of millions of enterprises worldwide), so it has faced relatively few scenarios. We hope that more companies from different industries and with different characteristics will use iLogtail and raise more requirements for data sources, processing, and output targets, enriching the upstream and downstream ecosystem that iLogtail supports.
- Performance and stability are the foundation of iLogtail. We hope the open source community will attract more talented developers to build iLogtail together and keep improving the performance and stability of this observable data collector.
This article is the original content of Aliyun and shall not be reproduced without permission.