At eight o’clock on August 27th, Xuesong Cheng, senior solution architect at Qiniuyun, gave a live broadcast entitled “Mining the Infinite Value of Log Big Data in Traditional Industries” at IT Big Club. He analyzed in depth the common operation and maintenance difficulties of traditional industries and the necessity of unified log management, and used real Pandora user cases to explain how to mine the infinite value of log big data in traditional industries.

This article is a compilation of the live broadcast content and is divided into two parts. The first part introduces the common operation and maintenance difficulties of traditional industries, the necessity of unified log management, and several typical log analysis scenarios.

What is operation and maintenance

First, let’s talk about what operation and maintenance is.

Many people have their own understanding of operation and maintenance, and they think it is a very simple thing. When an enterprise buys IT products such as hardware and software, it needs a team to keep them running normally. Problems inevitably arise while they are running, so a dedicated team is needed to guarantee their operation. If you think of operation and maintenance simply as keeping a platform running, I think that view is superficial. So what exactly is operation and maintenance? There are many accounts on the Internet of how the work is divided, including website operation and maintenance, system operation and maintenance, network operation and maintenance, database operation and maintenance, IT operation and maintenance development, and operation and maintenance security. This division of labor shows that operation and maintenance is actually a complex and systematic project.

Value of operation and maintenance

· Operation and maintenance staff should know the exact bottlenecks of the system and therefore its exact capacity, and know how to add capacity quickly before a bottleneck occurs.

· They know the risk points of the system, coordinate the modules upstream and downstream of each risk point, and put redundancy strategies in place; this is more reasonable than concentrating on the stability of a single module.

· Having worked in the field for a long time, they accumulate rich experience in architecture design and can guide the design and review of new architectures.

· Looking across the company’s different businesses, operation and maintenance can abstract the modules they have in common and manage them in a unified way, forming an internal capability platform and infrastructure platform, including shared microservices, and with them an effective platform and automated management methods.

The current situation and challenges of operation and maintenance personnel

From the value it delivers, we can see that operation and maintenance is a complex and systematic project. O&M engineers have a great deal of daily work to handle, so helping them do that work well is crucial. Yet O&M engineers now run into many problems in their daily work, mainly because the IT environment is becoming more and more complicated. Informatization is not accomplished overnight: at different stages a company builds different business systems and supporting applications and purchases different hardware. Because procurement is gradual and cumulative, the result is many types of network devices, a large number of servers of different types, a variety of virtualization solutions, different operating systems, and diverse application software and databases.

Some developers are more familiar with MySQL and use it as the database behind their applications; others have used Oracle for a long time and use Oracle instead. Different business software and business systems have different architectures and underlying platforms, and each platform brings its own monitoring system and related tools. This makes the overall IT environment of an enterprise complex and leads to many problems:

· Numerous and complex monitoring tools that cannot be managed in a unified way;

· Disorganized monitoring alarms: monitoring methods have various gaps, so problems are not noticed in time when they occur;

· Long troubleshooting time: systems are complex and the troubleshooting process is long, so the cause of a problem cannot be located quickly and accurately;

· Weak global view: without comprehensive control of the overall situation, problems cannot be predicted effectively;

· Security challenges: security problems such as intrusions and illegal operations cannot be detected efficiently;

· Heavy burden on administrators: faced with many heterogeneous monitoring tools, administrators bear a great mental burden.

Using logs for O&M management

A large number of O&M teams use logs for O&M management. Why?

A log system records information about every state of our systems in text form. This information can be understood as a record and projection, in the virtual world, of the behavior of devices or people. It helps us observe the normal status of a system and quickly locate faults while the system is running.

There are many types of logs, including system logs, application logs, security logs, and many database logs. Each log entry records a timestamp, device name, user, and operation. Logs help system operators and developers learn about server software and hardware, and check for configuration errors and the causes of errors. By analyzing logs regularly, you can learn about the load, performance, and security of a server, analyze problems in time, trace errors to their root cause, and correct them.

Here are a few examples of some of the monitoring or security logs you may encounter in your daily work.

Routine log analysis mainly deals with the following scenarios:

Centralized monitoring of the equipment room

The example in the figure above is a large-screen dashboard we built for a customer. It reflects the running status of the entire equipment room, so operation and maintenance personnel can see the overall daily status of the equipment room at a glance. The following is the architecture diagram we designed. We collect monitoring metrics for the relevant hardware and software from switches and servers, then read them into our log management system, where logs are stored, analyzed, monitored, and alerted on in a unified manner, and the results are finally presented on the large screen. This is currently the most classic way many operation and maintenance engineers use logs.

Application quality management

For example, if an enterprise has an OA system, a large number of logs are generated when people use it to look up the organizational structure or process daily electronic workflows, including approvals of business applications. We can analyze these logs to see the average response time or how often people use the platform, and so manage and track the quality of the application across the board. Suppose everyone complains that the OA system opens slowly and that data queries return slowly. Where exactly is the problem? We use the application quality management module to find the corresponding fault points and optimize application quality, providing a better experience for end users. Application quality management is used not only in Internet enterprises but also in many traditional enterprises.

Unified log management platform

The third is the unified log management platform, which is a further extension of scenarios 1 and 2. At first we may only monitor the equipment room, and then we want to monitor the business and application systems above it. Now we want to collect logs wherever they are generated in the enterprise, including logs from development, service operation, and equipment room operation and maintenance. These logs are collected together to form a unified log repository, similar to a data warehouse in the traditional sense.

A data warehouse stores all business data and structured data together for subsequent analysis. A unified log management platform collects all the logs generated by the enterprise in one place, where real-time or offline analysis can be performed, and the results are then output through interfaces or message queues to support specific business applications. The logs can be retrieved and analyzed for faster problem location and continuous data mining. Many enterprises are now building not only an internal unified data management platform but also an internal unified log management platform.

IoT data analysis and monitoring

The fourth combines Industry 4.0 and Made in China 2025, which the country is now vigorously promoting; the aim is to better support the development of manufacturing by means of the Internet of Things. Many manufacturing enterprises now add IoT probes or sensors to their production lines to collect the data generated while the line is running, such as workshop temperature and humidity, and machine speed, pressure, and flow. The collected data is sent back to the data platform as a data stream, where it is aggregated and analyzed in real time, for example as a rolling data display or real-time monitoring. As soon as an abnormal temperature, humidity, speed, pressure, or flow rate appears, the system needs to raise an alarm promptly so that workshop managers can resolve the problem in time.
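The alarm step described above can be sketched as a simple range check over a stream of sensor readings. This is only an illustration: the metric names, the allowed ranges, and the `Reading` type are all assumptions made for the example, not part of any particular platform.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical allowed ranges per metric; real limits depend on the production line.
LIMITS = {
    "temperature_c": (15.0, 35.0),
    "humidity_pct": (30.0, 70.0),
    "pressure_kpa": (90.0, 110.0),
}


@dataclass
class Reading:
    sensor_id: str
    metric: str
    value: float


def find_alarms(readings: Iterable[Reading]) -> Iterator[str]:
    """Yield an alarm message for every reading outside its allowed range."""
    for r in readings:
        limits = LIMITS.get(r.metric)
        if limits is None:
            continue  # no rule configured for this metric
        low, high = limits
        if not (low <= r.value <= high):
            yield f"ALARM {r.sensor_id}: {r.metric}={r.value} outside [{low}, {high}]"


# A tiny in-memory stand-in for the sensor data stream.
stream = [
    Reading("line1-t01", "temperature_c", 22.5),
    Reading("line1-t01", "temperature_c", 41.0),
    Reading("line1-h01", "humidity_pct", 55.0),
]
alarms = list(find_alarms(stream))
for message in alarms:
    print(message)
```

Because `find_alarms` is a generator, it can consume readings one at a time as they arrive rather than waiting for a complete batch.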

In addition, the operation of the entire production line needs to be monitored over a period of time, and even combined with quality control and quality management to find causal relationships between indicators such as temperature and humidity on the production line and actual product quality. These are some of the things companies are trying to do with the Internet of Things. In my opinion, these four scenarios and problems of log-based operation and maintenance are encountered in both traditional and emerging industries.

The necessity of unified log management

Therefore, we clearly feel that unified log management is very important for traditional industries. It can not only solve their operation and maintenance problems, but also improve an enterprise’s capabilities at the business level, including supporting many future business decisions and developments. In the past, logs were stored on different servers without centralized management, which made correlation analysis difficult, and logs were sometimes even deleted.

For example, data from traditional firewalls, IPS, and many other security devices is stored in each device’s own log system. Few enterprises currently do correlation analysis of security logs, so much of this data is wasted. If you manage dozens or hundreds of servers and still log in to each machine in turn to read logs, the process is cumbersome and inefficient. In most cases we need centralized log management that aggregates logs from all servers. In the era of big data, log volumes are huge and log types diverse, and enterprise data is like a gold mine waiting to be exploited. Once logs are centralized, however, log statistics and retrieval become harder, and for more demanding querying, sorting, and statistics across a large number of machines, the traditional methods inevitably fall short.

Technology selection for log management

There are many technical options for log management. The most traditional and simple one is to script command-line tools such as grep, sed, and awk. Many O&M engineers can write scripts on their own, but this approach is inefficient and error-prone. Data can also be collected into MySQL for unified aggregation and some simple calculations. Although this is convenient, MySQL’s performance limits mean it cannot support large data volumes, so its capacity is limited. Some enterprises use NoSQL databases to store large amounts of data, but these do not support cross-queries or full-text search, so looking up a specific log entry becomes very burdensome.

Later, many big data technologies emerged, such as Hadoop, Spark, and Storm, which can collect and process data in offline, real-time, or streaming mode. However, they are complicated to use, place high demands on the operation and maintenance team and the IT department, and do not support full-text retrieval, so not many companies use Hadoop or Spark for log management. Today the vast majority of log management uses ELK, which is convenient to download, install, and use. But ELK falls well short in product polish and user experience. Trying out its functions on small batches of data is no problem. However, if you want to use ELK to build a unified log repository or enterprise log center for all the logs of an entire equipment room or enterprise, its stability and ease of use will be severely challenged. Especially when the data volume reaches 100 terabytes, using ELK will run into many problems.

Key points of log management system construction

So how do we choose a log management system to support our internal operations or our log analysis? I think the key points of building a log management platform need to be considered from eight perspectives: data collection, cleaning, storage, search, monitoring and alerting, analysis, reporting, and openness.

Data collection

Data collection seems like a simple concept. However, it can be further divided into four functions: collection, parsing, transformation, and sending.

Data collection requires the log management platform to support a wide variety of data sources, which is a must-have for any good collection platform. These include relational databases such as MySQL, Oracle, and perhaps SQL Server, as well as non-relational databases, message queues, search platforms such as ES, and Hadoop services. Accurately collecting data from all of these sources is a necessary function of a data management platform. In addition, it is best to collect hardware metrics as well, such as server CPU and memory usage, storage usage, and the network traffic of network devices. These metrics may not be presented as logs, so you need a collection tool that can be deployed on a server or network device to gather the underlying hardware monitoring metrics. These are all capabilities that a data collection platform needs to provide.

Many logs record data as plain text. If you want to do in-depth log analysis, statistics, and calculation, you need to extract and split the content of each log. For example, a log from a security device needs to be split into fields such as the time, the source device, the security event name, and the description. So the first parsing feature a log management platform needs is a rich set of predefined parsing rules that make it easy to parse data into the relevant fields regardless of the log format.
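As an illustration of splitting a text log into fields, here is a minimal sketch of one predefined parsing rule written as a regular expression. The line layout (timestamp, device, event name, description) is a hypothetical format invented for this example; real security devices each have their own formats.

```python
import re
from typing import Dict, Optional

# Hypothetical line layout: "<timestamp> <device> <event_name>: <description>"
SECURITY_LOG = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<device>\S+) "
    r"(?P<event>[^:]+): "
    r"(?P<description>.*)$"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of a log line, or None if the rule does not match."""
    match = SECURITY_LOG.match(line.strip())
    return match.groupdict() if match else None


sample = "2024-01-15 08:30:01 fw-01 PORT_SCAN: 40 ports probed from 10.0.0.8"
fields = parse_line(sample)
print(fields)
```

Returning `None` for lines that do not match lets the caller route unparsed lines to a fallback rule instead of silently dropping them.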

The second is support for custom parsing rules for custom log formats. The reason is that logs are defined by each application’s developers during system development, including their format, content, and rules. This creates a situation where a thousand flowers bloom and logs vary widely from company to company. If we use one fixed set of parsing rules for the different logs of different systems, the rules will fit poorly. It is therefore very convenient for daily operation and maintenance if users can easily customize the parsing rules: for example, the user splits a sample log into several fields by marking delimiters, and the system automatically generates the corresponding parsing rule.
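One simple way to picture a user-defined rule is a delimiter plus a list of field names, from which a parser is generated. The sketch below assumes exactly that; `make_rule` and the sample line are hypothetical and do not describe how any particular product generates its rules.

```python
from typing import Callable, Dict, List


def make_rule(delimiter: str, field_names: List[str]) -> Callable[[str], Dict[str, str]]:
    """Build a parser from a user-chosen delimiter and ordered field names."""

    def parse(line: str) -> Dict[str, str]:
        # Limit the number of splits so the last field may contain the delimiter.
        parts = line.rstrip("\n").split(delimiter, len(field_names) - 1)
        if len(parts) != len(field_names):
            raise ValueError(f"expected {len(field_names)} fields, got {len(parts)}")
        return dict(zip(field_names, parts))

    return parse


# The user marks "|" as the delimiter on a sample line and names four fields.
order_rule = make_rule("|", ["time", "service", "level", "message"])
parsed = order_rule("2024-01-15T08:30:01|order-service|ERROR|payment timeout | retry 3")
print(parsed)
```

Raising an error on a field-count mismatch makes it obvious when a rule derived from one sample does not fit other lines from the same source.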

After collection and parsing comes data transformation. Why is transformation needed? There are certain fields in the log that we want to make more readable. For example, when a user accesses an internal business system, the log records the source IP address of the access. When I later analyze this log, I do not really care what the IP address is; I care about the account or the specific person behind it. So we need a transformation step that maps the IP address to the corresponding entity. With such transformation rules, subsequent data analysis and statistics become more accurate and the data is easier to use.
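The IP-to-entity transformation can be sketched as a lookup against a mapping table. The table contents, the field names `source_ip` and `account`, and the `"unknown"` fallback are assumptions for illustration; in practice the mapping might come from a directory service or an asset database.

```python
from typing import Dict

# Hypothetical mapping from intranet IP address to account name.
IP_TO_ACCOUNT = {
    "10.0.0.8": "alice",
    "10.0.0.9": "bob",
}


def enrich(record: Dict[str, str], table: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record with an 'account' field derived from 'source_ip'."""
    enriched = dict(record)  # leave the original record untouched
    enriched["account"] = table.get(record.get("source_ip", ""), "unknown")
    return enriched


access = {"time": "2024-01-15 08:30:01", "source_ip": "10.0.0.8", "url": "/oa/approval"}
enriched_access = enrich(access, IP_TO_ACCOUNT)
print(enriched_access)
```

Keeping the original IP alongside the derived account preserves the raw evidence while making later statistics readable.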

So collection, parsing, and transformation are all important, and none of these steps can be skipped. Finally, the processed data needs to be sent to a storage system for persistence or subsequent analysis. Collection, parsing, transformation, and sending are thus the four parts of the data collection function.

Data processing

After the data is collected, some further processing may be required. Simple data can be used directly for analysis or search without processing. But some business scenarios are more complex. For example, a large volume of incoming data may need simple statistics computed every five or ten minutes, or an application with strict real-time requirements may need to sample data as it arrives and match it against an existing business model or security model, in order to support business services or security situation monitoring. In these scenarios a data collection platform alone cannot meet the requirements. What is needed is a powerful data processing platform, ideally one similar to big data computing engines such as Hadoop and Spark, that can perform real-time or offline computation on different types of data sources, supports one-off, looped, and scheduled periodic task execution, and can persist the results of computation and analysis, exporting them to object storage, log analysis, or a business database to directly support subsequent production services.
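The "statistics every five minutes" case can be sketched as a tumbling-window count. This is a single-process illustration only, not how a distributed engine such as Spark would run it; the event tuples and the timestamp format are assumptions for the example.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def window_start(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to the start of its window (minutes must divide 60)."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def count_per_window(events: Iterable[Tuple[str, str]], minutes: int = 5) -> Counter:
    """Count events per (window start, log level)."""
    counts: Counter = Counter()
    for ts_text, level in events:
        ts = datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S")
        counts[(window_start(ts, minutes), level)] += 1
    return counts


# Hypothetical (timestamp, level) pairs as they might arrive from the collector.
events = [
    ("2024-01-15 08:31:10", "ERROR"),
    ("2024-01-15 08:34:59", "ERROR"),
    ("2024-01-15 08:35:00", "ERROR"),
    ("2024-01-15 08:36:00", "INFO"),
]
window_counts = count_per_window(events)
for (start, level), n in sorted(window_counts.items()):
    print(start, level, n)
```

The resulting counts are the kind of small, periodic result that could then be written to a database or object store for downstream services.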

Data analysis

After data collection and processing, we enter the data analysis stage. Here we need to analyze the collected logs quickly from all angles and display the results. This requires unified storage of logs, supporting at least TB-level and even PB-level data, with fast search over the data, chart generation, and monitoring, alerting, analysis, and prediction. The log management platform also needs to provide API interfaces so that it can integrate with third-party monitoring platforms and tools, or directly support business systems such as precision marketing and user profiling. All of these are functions that data analysis needs to support. In my daily communication with many users, I have found that they all encounter certain pain points in log analysis. I summarize four of them as follows:

Automatic field analysis: Parsing was completed in the log collection phase, where a standard text log was parsed into several fields. The platform can then compute statistics and analysis on these fields automatically, so that operations staff do not have to write scripts themselves to do the calculations. For example, the system can automatically tell you the average network traffic and its peak and lowest values, and if there are error logs, it can work out your top 10 errors and which users or devices they come from. Field analysis of this kind greatly reduces the calculation and task configuration work for users of the platform.
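The kind of statistics mentioned above (average, peak, and lowest traffic, plus the most frequent errors) can be sketched over already-parsed records as follows. The record layout and field names are assumptions made for this example.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical parsed records; 'error' is None when the entry is not an error.
records = [
    {"device": "sw-01", "bytes": 1200, "error": None},
    {"device": "sw-02", "bytes": 300, "error": "TIMEOUT"},
    {"device": "sw-01", "bytes": 4500, "error": "TIMEOUT"},
    {"device": "sw-03", "bytes": 800, "error": "AUTH_FAILED"},
]


def traffic_summary(rows: List[dict]) -> Dict[str, float]:
    """Average, peak, and lowest value of the 'bytes' field."""
    values = [row["bytes"] for row in rows]
    return {"avg": mean(values), "peak": max(values), "low": min(values)}


def top_errors(rows: List[dict], n: int = 10) -> List[Tuple[str, int]]:
    """The n most frequent error names with their counts."""
    return Counter(row["error"] for row in rows if row["error"]).most_common(n)


summary = traffic_summary(records)
errors = top_errors(records)
print(summary)
print(errors)
```

The same `Counter` pattern, keyed on `device` or a user field instead of `error`, answers the question of where the errors come from.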

Joint search: As the name implies, joint search uses one condition to search multiple log repositories at the same time. For example, firewall, IPS, antivirus, and access logs may be stored in different places, and after being collected into the log management platform they usually still sit in different log repositories. A security incident may involve an attack from a certain IP address or user name. I then need to retrieve the logs of all security devices by that IP address or user name and display the relevant content in one place. This is the joint search scenario, and it requires a function that can search across all the log repositories visible to the user.
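Joint search can be sketched as running one condition against several repositories and tagging each hit with its source. The in-memory dictionary below stands in for real log repositories, and plain substring matching is an assumption for brevity; a real platform would use an index and exact field matching.

```python
from typing import Dict, List, Tuple

# Hypothetical repositories, each holding raw log lines from one kind of device.
REPOS: Dict[str, List[str]] = {
    "firewall": [
        "08:30:01 DENY src=10.0.0.8 dst=10.0.1.5 port=22",
        "08:30:05 ALLOW src=10.0.0.9 dst=10.0.1.5 port=443",
    ],
    "ips": ["08:30:02 PORT_SCAN detected from 10.0.0.8"],
    "antivirus": ["08:31:00 scan finished on host ws-17, no threats"],
}


def joint_search(repos: Dict[str, List[str]], keyword: str) -> List[Tuple[str, str]]:
    """Return (repository name, line) for every line containing the keyword."""
    hits = []
    for name, lines in repos.items():
        for line in lines:
            # Naive substring match: "10.0.0.8" would also match "10.0.0.80".
            if keyword in line:
                hits.append((name, line))
    return hits


results = joint_search(REPOS, "10.0.0.8")
for repo, line in results:
    print(f"[{repo}] {line}")
```

Tagging every hit with the repository name is what allows the results from different devices to be displayed together without losing their origin.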

Word-selection analysis: In daily log analysis, not all tasks are fixed; sometimes they need to change flexibly with business requirements. For example, if I need to analyze the daily access behavior of a device or a user, I search for the user name, and the log management platform lists everything that matches. But when you look closely, there may be hundreds or even thousands of related logs, or even more for sensors. A single search criterion is often not enough to meet your analysis needs. At this point you can add a further AND condition in the search box to filter the results.

But could there be an easier way? For example, once you have found all the logs associated with a user name, could you select a word in one of the logs in the search results and have it filled into the search box automatically, so that the results are filtered a second time? Or could you exclude from the results every log that contains the selected word? If this feature can be implemented, it will greatly improve the ease of use of the platform and relieve many daily frustrations. This is the pain point of word-selection analysis.
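The secondary filtering described above amounts to refining an existing result set with extra "must contain" and "must not contain" terms. A minimal sketch, with hypothetical log lines and a hypothetical `refine` helper:

```python
from typing import Iterable, List, Sequence


def refine(results: Iterable[str],
           include: Sequence[str] = (),
           exclude: Sequence[str] = ()) -> List[str]:
    """Keep lines that contain every 'include' term and none of the 'exclude' terms."""
    return [
        line for line in results
        if all(term in line for term in include)
        and not any(term in line for term in exclude)
    ]


# Hypothetical first-pass results of searching for the user name "alice".
first_pass = [
    "08:30:01 user=alice action=login status=ok",
    "08:30:09 user=alice action=query status=timeout",
    "08:31:40 user=alice action=query status=ok",
    "08:32:10 user=alice action=logout status=ok",
]

# Selecting the word "action=query" narrows the results ...
queries = refine(first_pass, include=["action=query"])
# ... and selecting "status=ok" to exclude leaves only the problem entries.
problems = refine(first_pass, exclude=["status=ok"])
print(queries)
print(problems)
```

In a user interface, the selected word would simply be appended to `include` or `exclude` and the query rerun, so the user never has to retype the condition.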

Real-time search: All logs generated by the systems are continuously collected into the log platform as a data stream. When searching logs, users hope that new logs are displayed in real time. This way, when I make a change to a business system or recover from a failure, I can see the latest log status and easily tell whether the business is back to normal. This is a bit like the everyday tail -f scenario, where data scrolls by in real time. It is also a pain point that many users encounter in data analysis. If a product can solve these pain points and reduce the burden of using the platform, it can greatly reduce the pressure of daily operation and maintenance and improve overall work efficiency.
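The tail -f behavior mentioned above can be sketched for a single local file as a polling loop that starts at the end of the file and yields lines as they are appended. This is an assumption-laden simplification: a log platform would stream from its own storage, and the `max_idle_polls` parameter exists only so the sketch can stop in a demonstration.

```python
import os
import time
from typing import Iterator, Optional, TextIO


def follow(handle: TextIO,
           poll_seconds: float = 0.5,
           max_idle_polls: Optional[int] = None) -> Iterator[str]:
    """Yield lines appended to an open file, similar in spirit to `tail -f`.

    With max_idle_polls=None the iterator never ends; otherwise it stops
    after that many consecutive polls without new data.
    """
    handle.seek(0, os.SEEK_END)  # skip existing content, as tail -f does

    def new_lines() -> Iterator[str]:
        idle = 0
        while max_idle_polls is None or idle < max_idle_polls:
            line = handle.readline()
            if line:
                idle = 0
                # Note: a line still being written may arrive in two pieces.
                yield line.rstrip("\n")
            else:
                idle += 1
                time.sleep(poll_seconds)

    return new_lines()


# Usage (the path is hypothetical):
#     with open("/var/log/app.log") as f:
#         for entry in follow(f):
#             print(entry)
```

The seek happens when `follow` is called rather than on first iteration, so only lines written after that moment are reported.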

The Great Talk

The Great Talk column is dedicated to exploring the ideas of technical people, including technical practices, practical technical content, technical insights, growth tips, and anything else worth discovering. We hope to bring together the best technical people and surface original, sharp, and contemporary voices.
