At 8 PM on July 24, Qiniu Cloud senior big data architect Wang Ke gave a live broadcast entitled "How to Quickly Build an Intelligent Unified Log Management System", in which we discussed the key points to consider when building a log platform and shared Qiniu Cloud's practical experience in improving the throughput of its logging platform. This article is a reorganization of the live broadcast content.
Log samples
The figure above shows several sample logs to help illustrate what a log is; the first two entries are briefly explained here.
The first entry is a user behavior log from a website. It contains two types of information. One is user information: for example, "city" is where the user accessed the site from, and "userAgent" is the user's browser version. The other is program information, such as "pageSize" and "pageView".
The second entry is a log from a typical Nginx service, where "remote_addr" is the source address of the request, "request" is the request content, and "status" is the response status code.
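To make these two kinds of entries concrete, here is a hypothetical pair in the same spirit, using the field names mentioned above (the values are made up, not taken from the original figure):

```
{"city": "Beijing", "userAgent": "Chrome/75.0", "pageSize": 20, "pageView": 3}
192.0.2.10 - - [24/Jul/2019:20:00:01 +0800] "GET /index.html HTTP/1.1" 200 612
```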
Three basic questions
First question, what is a log?
A log is a record of the status and behavior of a device or program. Devices and programs output their event information in a predefined pattern, and that output is written to a file or database to form a log. So a log is really a record of event information; the Nginx log in the example above is a record of request events.
Second question, why do I need a log?
Devices and programs output key health information through events, for example whether an exception occurred on a server and what it was; these events are pre-defined at design time to assist users in maintenance. Logs record those events, so you can review the history of how a device or program has been running, understand how its behavior has changed over time, and maintain it better.
Third question, when do I need a log?
When a problem occurs in the system, the historical events recorded during its operation help you find the cause. Logs are also needed to compute statistics on system operating metrics.
Discussion on log platform construction
Introduction to the ELK solution
The most common log collection solution in the industry is ELK, an abbreviation of the three components in the stack.
• E is Elasticsearch, an open-source real-time distributed search engine, mainly used to store and search log data.
• L is Logstash, a log parsing tool that uses regular expressions to break a log into fields and passes them to Elasticsearch.
• K is Kibana, the presentation component. It displays Elasticsearch log data as visual charts, so you can see how many records were added to a log during a period of time, or how the counts of different event types fluctuate over time.
Before ELK appeared, log management basically meant logging in to the machine where the log lives and using Linux commands or manual inspection and counting, which was very inefficient. ELK has three advantages: it is fast, easy to use, and extensible. It is fast because logs are stored and indexed in Elasticsearch, so searches come back quickly. It is easy to use because ELK itself is easy to deploy; the official tutorial is enough to build a simple ELK stack. It is extensible because both Elasticsearch and Kibana have plugins that provide a lot of extra functionality, and there are many open-source components that play the same roles as Logstash and Kibana and can be swapped in depending on your scenario.
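As a minimal sketch of what the stack does, the Python snippet below parses an Nginx access-log line with a regular expression (the job Logstash usually does), indexes it into Elasticsearch, and runs the kind of aggregation Kibana would chart. It assumes a local Elasticsearch and the official `elasticsearch` Python client (8.x); the regex, index name, and sample line are illustrative, not part of the talk.

```python
import re
from elasticsearch import Elasticsearch

# Rough equivalent of a grok pattern for the default Nginx access-log format.
NGINX_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+)'
)

def parse_line(line: str) -> dict:
    """Break one Nginx access-log line into fields, like a Logstash filter would."""
    match = NGINX_PATTERN.match(line)
    return match.groupdict() if match else {"message": line}

es = Elasticsearch("http://localhost:9200")

line = '192.0.2.10 - - [24/Jul/2019:20:00:01 +0800] "GET /index.html HTTP/1.1" 200 612'
es.index(index="nginx-logs", document=parse_line(line))   # the "L" and "E" steps
es.indices.refresh(index="nginx-logs")                    # make the document searchable

# The "K" step, roughly: count how many requests returned each status code.
resp = es.search(index="nginx-logs", size=0,
                 aggs={"by_status": {"terms": {"field": "status.keyword"}}})
print(resp["aggregations"]["by_status"]["buckets"])
```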
Introduction to other log components
In addition to Logstash, there are several other common log collection tools.
• Rsyslog is generally used to collect the system's syslog.
• Fluentd is mainly used for container log collection.
• Filebeat, written in Go, focuses on file log collection.
• Flume adds a Channel component to the transmission link as a data cache that can persist data to disk to avoid data loss.
Besides collection tools, cache components are also used in log collection. There are two main ones: the Kafka message queue and the Redis in-memory database.
• Redis is mainly used in scenarios with small data volumes and low latency requirements, mostly together with Storm to process real-time data.
• Kafka is a general-purpose message queue component suitable for many scenarios, and there are several other message queue components similar to Kafka.
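A minimal sketch of using Kafka as this cache layer, assuming the `kafka-python` package and a broker at localhost:9092 (both assumptions, not details from the talk): the collector side pushes parsed entries into a topic, and a separate consumer drains it at its own pace, so a burst of logs never hits the storage layer directly.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

def write_to_storage(entry: dict) -> None:
    print("would write:", entry)   # placeholder for the real ES/file writer

# Collector side: push each parsed log entry into the queue instead of
# writing it straight to Elasticsearch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)
producer.send("app-logs", {"status": 200, "request": "GET /index.html"})
producer.flush()

# Consumer side (normally a separate process): drain the topic at its own pace.
consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="localhost:9092",
    group_id="log-writers",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for record in consumer:
    write_to_storage(record.value)
```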
Three factors and four principles of a log collection solution
There are three factors to consider when building a log collection solution out of collection tools and cache components: timeliness, order of magnitude, and complexity.
• Timeliness is whether log transmission must be low-latency: do events that occur on my devices and applications need to be searchable within the shortest possible time, or is a delay acceptable, and if so for how long: minutes, hours, or even half a day? When formulating a solution, timeliness determines whether to choose a Storm-based real-time streaming approach; as a hard index during testing, it is used in simulation tests to judge whether a solution is viable.
• The order of magnitude measures the number of log entries and the disk space they occupy. A program log that records events only when errors occur differs greatly from a sensor log collected every second, both in the entries and file sizes involved and in the requirements placed on collection components. The order of magnitude affects the choice of collection tools and cache components: when formulating a solution, a cache component should be added once the number of entries collected per day exceeds the million level, and beyond that you also need to consider how to balance resources between data writing and data querying.
• The last factor is complexity. The complexity of a collection solution comes mainly from the order of magnitude, the network environment, and the collection tools. For example, at a large order of magnitude we have to add a cache component, writing logs to Kafka first and then to ES; adding Kafka means also considering Kafka's high availability and possible data loss, so the complexity of the solution rises. Complexity affects how hard a solution is to implement and maintain; a highly complex design may be very difficult to actually land.
Considering these three factors, there are four principles to follow when choosing a log collection solution.
• Prefer files when the timeliness requirements are low and the order of magnitude is small, because files are the simplest form and the easiest to collect.
• Resolve parsing issues when the log is generated, not during transport; that is, constrain the log format at generation time rather than trying to unify it in transit. Most collection tools can rewrite log information in flight, adding fields or splitting entries, but these operations increase the complexity of the collection solution and raise the cost of replacing the collection tool, so they should be avoided (see the structured-logging sketch after this list).
• Different logs have their own characteristics, so must we pick a different tool for every specific scenario? I think you have to weigh the benefits a tool's features bring against the change in complexity it introduces. If the new tool does not affect existing components, adds no new development work, and its operation and maintenance are acceptable, it is fine to add it; otherwise, the fewer tools the better.
• Consider multiple levels of caching when the data volume is large. The main purpose of caching is to decouple upstream from downstream, so that more data can be written without affecting ES performance.
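A minimal sketch of the second principle: emit logs in a fixed, structured format at the point of generation so collectors never have to rewrite or re-parse them in transit. This uses only the Python standard library; the field names are illustrative.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload, ensure_ascii=False)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("web")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# -> {"time": "...", "level": "INFO", "logger": "web", "message": "page viewed"}
logger.info("page viewed")
```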
Key points in planning the container log collection scheme
Since containers are a very hot topic right now, I would like to talk separately about what needs to be considered for container log collection.
01 Collection Mode
• Most container orchestration today uses the Kubernetes (K8s) architecture, and collecting K8s log data is relatively simple: container logs can be read directly as files from /var/log/containers, and application logs inside containers can be collected by deploying the log collection tool as a DaemonSet.
• If K8s is not used and only Docker is used, collecting log data is slightly more troublesome. The container's own stdout can be collected through Docker's logging driver; application logs inside containers can be viewed with Docker's own commands or collected with third-party components such as fluentd-pilot. In Qiniu Cloud's practice, logs inside the container can also be shipped out directly.
02 Common Service requirements
• Avoid log loss during container migration: in K8s, use a PVC for persistent storage and write the logs into it. The added cost is that you then have to manage this persistent storage.
• Support multiple log collection tools: configure a different Pod for each log collection tool and run it as a DaemonSet.
• Expose container information in the container's application logs: K8s 1.7 provides a way to inject container information into the container as variables at startup; another way to obtain container system variables is through the DaemonSet.
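One common way to realize the last point, sketched here under the assumption that the Pod spec injects metadata as environment variables (for example via the Kubernetes Downward API), is to have the application attach those variables to every log record. The variable names POD_NAME and NODE_NAME are chosen for illustration, not taken from the talk.

```python
import logging
import os

class PodInfoFilter(logging.Filter):
    """Copy Pod metadata from the environment onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.pod_name = os.environ.get("POD_NAME", "unknown")
        record.node_name = os.environ.get("NODE_NAME", "unknown")
        return True

logging.basicConfig(
    format="%(asctime)s %(levelname)s pod=%(pod_name)s node=%(node_name)s %(message)s",
    level=logging.INFO,
)
logging.getLogger().addFilter(PodInfoFilter())
logging.info("request handled")
```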
Challenges facing the Qiniu Cloud log platform
The logs on the Qiniu Cloud log platform come mainly from three sources: first, logs from the Qiniu Cloud storage system; second, logs from the Qiniu Cloud container platform; and third, private logs uploaded by customers of the Qiniu Cloud intelligent log management platform. The average daily data inflow now exceeds 250 TB and 365 billion entries, and the largest single customer averages more than 45 TB a day.
The main challenges facing the Qiniu Cloud log platform are as follows:
• A large amount of data needs to be written and queried stably;
• Many data sources: data collection from multiple kinds of sources must be supported, and the collection process must be stable and monitored;
• Platform users have many functional requirements, so rich extended functions must be supported;
• Most platform users are enterprise users, so multi-tenant scenarios and user permission management must be supported.
One of the biggest challenges is how to store a large amount of data efficiently, ensuring that no data is lost while keeping resource consumption as low as possible.
The architecture of the Qiniu Cloud log platform
Qiniu Cloud first evaluated the ELK solution against these challenges:
• with Qiniu Cloud's large data volume, the ELK solution could not meet the requirements for system stability and data processing performance;
• the ELK solution could not meet Qiniu Cloud's stability requirements across multiple data sources;
• the ELK solution could not meet the multi-tenant requirements of Qiniu Cloud's users;
• the ELK solution could not meet Qiniu Cloud's requirements for machine learning and other extended functions.
The conclusion of the evaluation was that ELK could not solve the challenges Qiniu Cloud faced, so Qiniu Cloud took the road of in-house development. The current Qiniu Cloud log platform architecture is made up of the following components.
• Logkit: collects log data uniformly and supports hundreds of data sources (including files, MySQL, MSSQL, Elasticsearch, MongoDB, AWS CloudTrail, etc.); supports data format parsing and field conversion on the collection side; a built-in channel component ensures that data is not lost during transmission; and the status of collection tasks can be viewed, and task configurations managed, on a management page.
• Pipeline: the unified entry point of the log platform, which receives data transmitted through its API from log files, SDKs, web pages, IoT devices, and cloud storage. After data enters the Pipeline, a real-time computation component built on Kafka supports custom computation flows, and the results are delivered by the Export component to three storage back ends depending on the situation.
• Insight cluster: the log storage component developed by Qiniu Cloud on top of a deeply customized ES, mainly used to support retrieval and querying of very large-scale log data. The Insight cluster also supports multiple tenants and integrates alerting, reporting, and machine learning functions.
• Qiniu Cloud object storage: Qiniu Cloud's general-purpose storage component, used to store cold data whose search period has expired and to support data analysis in customized offline computation flows.
Qiniu Cloud experience sharing: the throughput problem
How does Qiniu Cloud solve the throughput challenge of its logging platform?
First, Pipeline's Kafka cluster can take in hundreds of terabytes of data simply by scaling out; the bottleneck then becomes how to write those hundreds of terabytes from Kafka into ES. The Export component responsible for writing to ES cannot pull data directly from Kafka, because the raw data may go through a computation step inside the Pipeline, so Export can only wait to be invoked by the Pipeline; in other words, Export cannot schedule its own writing progress but must stay consistent with the user's data operations. Nor can Export write to ES directly, because data from multiple users may need to be written at the same time, and direct writes from Export would cause resource contention and blocking.
Data transmission optimization ideas
Qiniu Cloud's data transmission optimization has two parts. First, Kafka data fetching, data format conversion, and filtering were moved into Export, so that the data coming out of Export can be written into ES directly; scheduling based on customer tasks is also placed in Export, so the data ingestion status can be fed back to customers in real time by tracking Export's status. Second, below Export we added a transmission queue layer that provides a cache: data first enters the transmission queue and is then written asynchronously according to the current state of ES, so that the writing process does not affect customers' queries and searches, and some fault-recovery measures are designed in.
The figure above shows the architecture of the transmission queue.
The transmission queue is implemented on top of a memory queue and a local file queue, similar in concept to the Flume Channel, but with some extra machinery assembled around it. The first piece is the agent component, which can be understood as a scheduler responsible for creating and assigning tasks. The second is the transaction pool, which handles both the enqueue and dequeue transactions of the memory queue. Why use the concept of a transaction here? Because when we write to ES, a batch of 200 records may come back with only 150 written and 50 not written. The point of a transaction in this context is that the batch is held until the write is confirmed, so in that situation we can send the 50 unwritten records back to the memory queue and let the memory queue continue writing them.
Transaction pool design
Through the design of this transmission queue layer, Qiniu Cloud achieves stable writing of large volumes of data. Overall, the transmission queue realizes dynamic resource allocation and task decomposition through a multi-level cache mechanism; the decomposed small blocks of data are then written into ES gradually according to ES's current state, which prevents ES from collapsing under heavy writes. The data is also kept in memory after being fetched from Kafka, which keeps the transfer fast.
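A toy sketch of the "transaction" idea described above: a batch taken from the memory queue is only discarded once Elasticsearch confirms it, and any records the bulk write rejects are pushed back onto the queue for another attempt. It assumes the official `elasticsearch` Python client (8.x); the queue, index name, and function are illustrative, not Qiniu's actual implementation.

```python
import queue
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
memory_queue: "queue.Queue[dict]" = queue.Queue()

def write_batch(batch: list[dict], index: str = "logs") -> None:
    """Write one batch; requeue every record that did not make it into ES."""
    actions = [{"_index": index, "_source": doc} for doc in batch]
    results = helpers.streaming_bulk(
        es, actions, raise_on_error=False, raise_on_exception=False
    )
    # streaming_bulk yields one (ok, item) pair per action, in order, so each
    # result can be paired with its original document and failures sent back.
    for doc, (ok, item) in zip(batch, results):
        if not ok:
            memory_queue.put(doc)   # the "rollback" half of the transaction
```

In the real platform the cache is the multi-level memory/file queue described above and scheduling is handled by the agent component; this only illustrates the requeue-on-partial-failure behaviour.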
Qiniu Cloud intelligent log management platform
Basic functions
The previous sections shared our architectural design and practical experience. At the level of functional details, we also did our own research and optimization to address the defects of traditional solutions, that is, the requirements ELK could not meet.
• At the data collection level, Logkit provides visual data collection, and during collection you can see the real-time data flow, broken down per machine.
• At the log search level, we support multiple users and user permission management; rebalance operations complete automatically without much attention from users and do not affect their query efficiency; and ES writing has been deeply optimized, so query efficiency is maintained even while a large amount of data is being written (two common write-tuning knobs of this kind are sketched after this list).
• At the data visualization level, we provide a search interface with a good user experience and a complete report system that easily meets users' reporting needs. In addition to native support for monitoring and alerting, we also support integration with third-party visualization tools such as Grafana.
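For illustration only, here are two common Elasticsearch knobs for heavy write loads; the talk does not spell out which optimizations Qiniu actually made. This assumes the official `elasticsearch` Python client (8.x) and an index name chosen for the example.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Refresh less often so bulk writes spend less time building searchable segments,
# at the cost of new documents taking up to 30 s to become visible in queries.
es.indices.put_settings(index="logs-2019.07.24",
                        settings={"index": {"refresh_interval": "30s"}})

# Drop replicas during the initial bulk load and add them back afterwards.
es.indices.put_settings(index="logs-2019.07.24",
                        settings={"index": {"number_of_replicas": 0}})
```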
Intelligent components
Considering the future trend toward data intelligence, we also provide machine learning components for users. The threshold for operating them is very low: as long as the relevant index and the metrics of interest are selected, the metric curve is modeled automatically. By monitoring metric data, the component intelligently detects outliers, for example catching under-reporting in time; and based on learning from historical data, it also supports predicting future trends, giving a forecast curve along the time dimension.
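To make the outlier-detection idea concrete, here is a toy threshold-based sketch on a metric series; it is only an illustration, not the model Qiniu's machine learning component actually uses.

```python
from statistics import mean, stdev

def find_outliers(series: list[float], window: int = 24, k: float = 3.0) -> list[int]:
    """Flag points deviating more than k standard deviations from the mean of
    the preceding `window` points (e.g. hourly log counts)."""
    outliers = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma and abs(series[i] - mu) > k * sigma:
            outliers.append(i)
    return outliers

# Example: a sudden drop in hourly log volume (possible under-reporting) is flagged.
counts = [1000.0 + (i % 5) for i in range(48)]
counts[40] = 120.0
print(find_outliers(counts))   # -> [40]
```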
The Qiniu Cloud intelligent log management platform covers the full life cycle of data and is applicable to operations monitoring, security auditing, business data analysis, and other scenarios. It is committed to reducing users' mental burden and helping customers upgrade digitally.
If you are interested in the Qiniu Cloud intelligent log management platform, you can click "Read the original article" to learn the details of the product. There is currently a free quota policy: 1 GB of new log data per month, 1 GB of log data per month, 1 month of log repository retention, and 1 million API calls per month. Everyone is welcome to register and try it out. Documentation site: https://pandora-docs.qiniu.com
People say
The "People Say" column is dedicated to discovering the ideas of technical people, including technical practice, practical technical content, technical insights, growth advice, and anything else worth discovering. We hope to gather the best technical minds and dig out original, sharp voices of our time.