Distributed systems have been around for a long time. As the cost of computing power and storage has fallen, their use has exploded. Modern Internet companies have grown extremely large and their systems increasingly complex, which makes monitoring very difficult: how to process massive volumes of log data, how to trace services, and how to locate faults efficiently.
Monitoring, one of the most important parts of operations and maintenance work, has always been a concern for operations engineers, and what concerns practitioners is what InfoQ covers. An InfoQ reporter interviewed Wu Qimin, an expert in the field of monitoring, about distributed monitoring systems. Wu Qimin is the chief architect of Ping An Bank’s Retail Network Finance Division. He worked at eBay for many years and served as head of architecture at Dianping and Ctrip. He is also the author of CAT, an open source real-time monitoring system.
In addition, Mr. Wu Qimin is a co-chair of this edition of CNUTCon, InfoQ’s Global Operations and Maintenance Technology Conference. Reportedly, some of the track producers are also former students of Mr. Wu.
Wu Qimin: As I understand it, common monitoring approaches can be classified in several ways:
- By monitoring level: business monitoring, application monitoring, and basic monitoring.
- By log source: log-file-based monitoring, database-based monitoring, network-based monitoring, and so on.
- By monitoring domain: front-end monitoring, back-end monitoring, full-link monitoring, inter-service monitoring, and so on.
- By monitoring target: system fault monitoring, service indicator monitoring, application performance monitoring, user behavior monitoring, and security compliance monitoring.
The monitoring field is very complex. Take CAT, an open source distributed monitoring system, as an example: it is strongest at application monitoring and has some business monitoring and basic monitoring capabilities. It transmits log data over the network and will support log files and databases as data sources. It already covers front-end and back-end monitoring. With a good rollout, full-link monitoring and inter-service monitoring can be achieved, and its monitoring targets are highly diverse. However, CAT places high demands on the development and operations skills of the monitoring team. Large Internet companies generally have dedicated monitoring R&D teams that do secondary development on top of CAT to support the varied needs of the business.
Wu Qimin: Too many! Take CAT as an example. At Meituan-Dianping, CAT takes in more than 400TB of new monitoring logs per day (some system logs are sampled), and at Ctrip the figure has long exceeded 300TB. Under that data volume, a problem that would be minor under normal circumstances is magnified thousands of times and becomes a major one. System throughput may break down, hot spots may appear and cause unbalanced resource usage and waste, or the storage system may fail to keep up. The most serious case is when a problem in monitoring causes a problem in the business services themselves.
From the very start of CAT’s design we were well aware that monitoring is deep water. We treated every problem in the field very seriously, validated every new report from multiple angles over a long period, and put every line of core code through strict review. In the most extreme case, a review ended with every line of a class needing to be changed.
CAT supports stand-alone development: a single machine is enough to start the CAT system for development, testing, and troubleshooting, which greatly improves development efficiency. The core system is also covered by a large number of unit tests and supported by solid modeling and code generation tools, so R&D efficiency stays high.
CAT can monitor CAT itself (except at the lowest level of monitoring): this builds a minimal application scenario and helps improve CAT’s own capabilities, which is also very important.
Being a lightweight, independent platform is an important feature of CAT. Many people ask: Spring is the most popular framework and very powerful, so why doesn’t CAT depend on it? My answer is that this is a question of dependency levels. Normally higher layers depend on lower layers, and lower layers should not depend back on higher ones. What happens otherwise? It is the classic coupling problem: when the bottom layer depends on the top layer, the range of places where the bottom layer can be used is severely constrained. Take C#: it is a fine language (PHP aside), but C# has to run on the .NET CLR, and .NET had to run on Windows, so it is easy to see the dilemma .NET faces today. Of course, .NET Core has since been adapted to run on Linux, but its functionality is still limited.
Take the CAT client SDK: our initial goal was for it to run on JDK 1.6 and above across a variety of platforms, including Spring and other technology stacks. That means CAT must depend on as few other runtime components as possible and, where it does depend on something, must avoid conflicts with the applications above it over components such as basic logging and Netty communication. In terms of system integration, however, CAT can integrate well with popular specifications, systems, and components: for a standard J2EE web project you simply configure a filter in the WAR, for Spring you add an auto-registered filter, and so on. These peripheral integrations should not be too challenging, and you can do them yourself.
As for the data volume challenge, 300TB of data per day means an average of about 6GB per second under normal conditions (double that at peak times, depending on the business type of the site), which takes at least 60 gigabit (1Gb) network cards to carry. Never mind how the monitoring data is collected: just transferring that much data to the CAT system for analysis is a big challenge. Some people ask whether we can pre-process on the client side, for example with aggregation, grayscale sampling, or compression. We can, but that may sacrifice the variety and timeliness of data processing (some reports may lack detail) and adds computing and storage load on the business nodes. The core of the transmission problem is how to avoid load imbalance across multiple nodes. It looks like a small problem, but it is really a mechanism problem that involves the whole system architecture. CAT has put a lot of thought into this, and the details will be shared in the future.
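The bandwidth figures above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation in decimal units; the function names, the 14-hour busy window, and the 80% NIC utilization are assumptions made for illustration, not numbers from the interview. Spread evenly over 24 hours, 300TB per day is about 3.5GB per second; the quoted figure of roughly 6GB per second matches the same volume concentrated in about 14 busy hours.

```python
import math

TB = 10**12  # decimal terabyte, in bytes
GB = 10**9   # decimal gigabyte, in bytes


def avg_rate_gb_per_s(tb_per_day, active_hours=24.0):
    """Average ingest rate in GB/s if the daily volume arrives within active_hours."""
    return tb_per_day * TB / (active_hours * 3600) / GB


def nics_needed(rate_gb_per_s, nic_gbit=1, utilization_percent=100):
    """Number of NICs of nic_gbit Gb/s needed to carry rate_gb_per_s.

    utilization_percent is the share of line rate we assume is usable in practice.
    """
    gbit_per_s = rate_gb_per_s * 8  # bytes to bits
    return math.ceil(gbit_per_s * 100 / (nic_gbit * utilization_percent))


uniform = avg_rate_gb_per_s(300)               # about 3.47 GB/s over a full day
busy_hours = avg_rate_gb_per_s(300, 14)        # about 5.95 GB/s over 14 busy hours (assumed)
at_line_rate = nics_needed(6.0)                # 48 one-gigabit NICs at 100% line rate
with_headroom = nics_needed(6.0, utilization_percent=80)  # 60 NICs at an assumed 80%
```

Under these assumptions, 6GB per second is 48Gb per second, so 48 gigabit cards would be saturated and about 60 leaves some headroom, which is consistent with the "at least 60" estimate. Doubling the rate at peak doubles the card count.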
After transmission come CAT’s own processing costs: the efficiency of serializing and deserializing log messages, the asynchronous distribution and consumption of log messages in independent threads, and the mechanism for computing statistics by domain and time period (hourly, daily, weekly, and monthly reports, and so on). CAT has a very good mechanism here: each piece of work is done within a single thread, so thread safety does not have to be a concern. This greatly simplifies the code, reduces the likelihood of synchronization problems, and keeps both readability and runtime efficiency high.
In the long run, however, storage is always a challenge for applications with large data volumes. Storage must deliver the throughput with no backlog building up, and it must also keep the financial cost down and be cost-effective. Small companies can get by, but large companies pay particular attention to this. CAT’s storage engine has evolved to version 6 (V6), and each version was redesigned from scratch. From conception through coding to integration and verification in production, V6 took six months and was challenging in every respect. The result is good: compared with the V5 engine, read and write efficiency is greatly improved and the number of files is greatly reduced (You Yong gave the specific numbers in Meituan’s CAT talk).
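The idea of asynchronous distribution into independent threads, with statistics kept per domain and per time period by a single thread, can be sketched as follows. This is a minimal illustration, not CAT’s actual code: the class name `HourlyAnalyzer`, the message fields, the queue size, and the choice to drop messages when the queue is full are all assumptions. Only the consumer thread touches the report, so the report itself needs no locks.

```python
import queue
import threading
from collections import defaultdict

HOUR_MS = 3600 * 1000


class HourlyAnalyzer:
    """Hypothetical analyzer: one bounded queue, one consumer thread, lock-free report."""

    def __init__(self, capacity=10000):
        self._queue = queue.Queue(maxsize=capacity)
        # Keyed by (domain, hour_start_ms); mutated by the consumer thread only.
        self.report = defaultdict(lambda: {"count": 0, "failures": 0, "total_ms": 0})
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def offer(self, message):
        """Called by receiver threads. Returns False (message dropped) if the queue is full."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            return False

    def stop(self):
        """Ask the consumer to finish the queued messages, then wait for it."""
        self._queue.put(None)  # sentinel marking the end of input
        self._thread.join()

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:
                break
            ts = message["timestamp_ms"]
            hour_start = ts - ts % HOUR_MS  # bucket by the hour the message falls in
            bucket = self.report[(message["domain"], hour_start)]
            bucket["count"] += 1
            bucket["total_ms"] += message["duration_ms"]
            if not message["ok"]:
                bucket["failures"] += 1


analyzer = HourlyAnalyzer()
analyzer.start()
analyzer.offer({"domain": "order", "timestamp_ms": 3_600_005, "duration_ms": 12, "ok": True})
analyzer.offer({"domain": "order", "timestamp_ms": 3_600_999, "duration_ms": 8, "ok": False})
analyzer.offer({"domain": "pay", "timestamp_ms": 7_200_000, "duration_ms": 30, "ok": True})
analyzer.stop()  # after this, the report can be read safely from the calling thread
```

The queue is the only shared structure and it handles its own synchronization. The trade-off in this sketch is that a slow analyzer loses messages instead of blocking the receivers; whether that is acceptable depends on the report.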
If all of the above are challenges of technical implementation, rolling out CAT monitoring is another challenge. Whether you instrument the code intrusively or use non-intrusive instrumentation through bytecode techniques, the first thing monitoring has to settle is the goal: what problems to solve and what indicators to watch, whether the problems are existing or potential. If there is no problem, there is no need to monitor. Once the goal is clear, the second step is to work out how to achieve it, using existing mechanisms or new, creative approaches. The third step is to implement the plan: where the data is, what means are needed to obtain it, and how it is sent to the back end for analysis through the CAT API. The last step, after the system goes online, is to verify whether it really achieves the expected result or whether the goal needs adjusting. These four steps can be iterated until the various monitoring problems are solved.
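To make "intrusive instrumentation in code" concrete, here is a minimal Python sketch of wrapping a unit of work so that its name, status, and duration are recorded. It is not the CAT API: the `Tracer` class, its `transaction` method, and the in-memory `events` list (standing in for sending data to a monitoring back end) are hypothetical names used only for illustration.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Hypothetical tracer that records one event per instrumented block."""

    def __init__(self):
        self.events = []  # stand-in for transmission to a monitoring back end

    @contextmanager
    def transaction(self, kind, name):
        start = time.perf_counter()
        status = "ok"
        try:
            yield
        except Exception as exc:
            status = type(exc).__name__  # record the failure type, then re-raise
            raise
        finally:
            self.events.append({
                "kind": kind,
                "name": name,
                "status": status,
                "duration_ms": (time.perf_counter() - start) * 1000.0,
            })


tracer = Tracer()

# A successful unit of work.
with tracer.transaction("URL", "/order/list"):
    pass

# A failing unit of work: the error is recorded and still propagates to the caller.
try:
    with tracer.transaction("SQL", "select_order"):
        raise ValueError("simulated failure")
except ValueError:
    pass
```

The instrumented code has to be edited to add the `with` blocks, which is what makes this approach intrusive. A non-intrusive approach would attach the same recording logic without touching the business code.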
Wu Qimin: There are probably many distributed system monitoring projects at home and abroad, but not many that can be open-sourced and whose owners are willing to open-source them. The more commonly used ones recently are ELK, Zipkin, and Pinpoint abroad, and CAT, SkyWalking, Open-Falcon, and others in China. Very few have truly been tested by the varied scenarios and heavy traffic of large companies; CAT is one of them. CAT has been deployed on their own initiative by hundreds of companies in China, and thousands of developers follow it. Recently, Meituan-Dianping has attached great importance to open-sourcing CAT and invested heavily in it. With heavy use at several large and medium-sized Internet companies (such as Ctrip, Ping An Bank, Liepin, and Lufax), I believe CAT will be used even more widely in the future. ELK does monitoring based on log files. Zipkin and Pinpoint do some things well and are used somewhat more by small companies; they are relatively simple and not very feature-rich, and companies run into many scalability problems with them later on.
Wu Qimin: A monitoring system usually consists of key modules such as log collection, log transmission, log analysis, log reporting, log storage, and report presentation, and the system works through the division of labor among these modules. Different products have different emphases and strengths. CAT’s architecture aims to be simple and efficient: simple to develop, simple to operate and maintain, and efficient to run. I once made a comparison with a large company: its monitoring system had about one monitoring machine for every 30 monitored application machines, while CAT had about one for every 300.
Wu Qimin: It’s a long story. When I first joined Dianping seven years ago (at the end of 2011, before the merger of Meituan and Dianping), I was full of confidence and thought that, after honing my skills at eBay for more than ten years, I would finally get to show what I could do. I did not expect that the Dianping site as a whole would have all kinds of failures so often, wearing everyone out. The leaders worked very hard too: whenever there was a problem they directed the response on the spot in person, or even looked at the screens themselves. I once saw three levels of leadership crowded around at the same time, quite a spectacular scene (it’s a pity nobody took pictures). One suspected a network problem, one suspected a database problem, one suspected a cache problem, ... (ten thousand words omitted). But a problem is a problem; it will not do you a favor and let you off because you are a leader. So troubleshooting always took a long time, and even after recovery the root cause was often unclear. During an outage, some steady, level-headed engineer would often suddenly call out that he had found a certain error in a certain log on a certain machine, and that it might be related to the fault. We would listen, figure it was worth a try, and act on it; a little later the symptoms would disappear and the outage was over. However, how to shorten the outage, what caused it, how to locate the root cause, and what lessons to learn from it often remained unclear.
In that situation, I suddenly realized that what the company lacked most was application monitoring, and that the problems above were only symptoms of not having good monitoring. As you may know, eBay has a very good application monitoring system called CAL, which is said to be one of eBay’s greatest assets. I did not take part in developing CAL, but I was an experienced user, and from time to time I shared my CAL experience with sibling teams. So I volunteered to take on monitoring and wrote the first line of CAT code. In the first few months I handled everything myself, from architect to programmer, from back end to front end, from testing to operations and maintenance.
CAT has many strengths, such as a distributed cluster, rich real-time reports, full log collection, and a highly reliable, highly available system. In a field as large as monitoring, CAT inevitably has many shortcomings too: the learning threshold is high, customization is difficult, usability is lacking, integration with other ecosystems is insufficient, language support is still limited (although C/C++, Python, Golang, Node.js, and other languages are supported), support for asynchronous and multithreaded code is insufficient, back-end configuration management needs strengthening, Docker support needs work, and so on. These are all things to improve together with the community in the future, and you are welcome to join the CAT community.
Wu Qimin: Let me talk about how CAT’s development started. At that time, Dianping already had a monitoring system, Hawk, which stored logs in a database (ELK was not available then). It had been in development for nearly a year, basically worked, and was being rolled out across the company. After I joined Dianping and listened to the team’s introduction, I felt its scalability was a serious problem. So I proposed developing a new monitoring system and drew a big mind map (unfortunately that computer has since broken and I cannot retrieve it). I called the relevant colleagues to a meeting to go over the requirements. Everyone listened in a daze, and nobody even asked a question. My feeling at the time was that they must have thought Old Wu was bragging; everyone knows what architects are like, and talk is easy. When I asked for the Hawk rollout to stop, many people came to me with concerns and confusion. At that point I made up my mind: I would build CAT, and if I failed I would leave, since I was always welcome back at eBay. Then I quietly did the design and development by myself. Looking back, that may have been my best time as a programmer; every day was full. I still remember that when my leader invited me to dinner, he asked what kind of support I wanted. I told him, “Give me a few months. Whatever I do and however I do it, please do not interrupt me. That is the biggest support you can give me.” The leader agreed and kept his word. After two months of hard work I finished the first CAT prototype and gave the leadership a demo. The scenario was designed in advance: while an application was running normally, a few errors were deliberately injected in the background to see whether CAT could catch them and tell us exactly where things had gone wrong. As an architect, I did have a few tricks up my sleeve.
The demonstration was very successful. The leader happily stood up on the spot, patted me on the shoulder, and expressed support for my further development work. With the leader’s recognition and trust, progress afterwards was relatively smooth. Of course it was not all plain sailing: there were a few small twists and turns, and at times I took the blame for others, but on the whole it went smoothly.
Wu Qimin: The cloud native era poses challenges to application architecture across the board, and monitoring is no exception. But the essence is the same; only the form differs. With Docker, the granularity of cloud application monitoring is finer than with physical machines and virtual machines. In the past, many systems distinguished cluster nodes by IP address. If the IP address of a Docker instance keeps changing, that is a challenge for monitoring. A Docker instance inside a Kubernetes Pod may have an internal IP address while the externally visible address is the Pod’s, which can make some scenarios impossible to link together. In addition, the life cycle of an application in a Docker container may be relatively short: an application on a VM is redeployed, whereas a Docker container is destroyed and rebuilt, which may affect the monitoring system in new ways.
Wu Qimin: In the field of monitoring, what I care about most is visibility. I often use an analogy with people: the system you develop is like a child you raise. There are too many variables that can affect the system’s architectural qualities. Environment configuration issues, network connectivity issues (another usual suspect), dirty data issues, service dependency issues, application load issues, machine performance issues, and even JVM and OS issues are all unknown to you and hard to predict in advance. So problem indicators have to be presented in a way that gives you a clear picture of the system’s state. Without measurement, there is no improvement. Without monitoring, you are blind. However, improving a system’s visibility is hard because it is meant for people, and people’s requirements keep changing: once one stage is satisfied, higher demands follow, so people are hard to serve well. CAT needs to improve its ease of use, but at the present stage it is mainly focused on covering more problem scenarios. The basic problem of food and clothing should be solved first; whether the experience is good can be left to the community.
The 4th CNUTCon Global Operations and Maintenance Technology Conference, hosted by InfoQ, has entered its final countdown and not many seats are left. Topics include “automated operations and maintenance”, “monitoring and analysis”, “log processing”, “microservice architecture”, “Kubernetes”, and more, with comprehensive, multi-angle analysis of 50+ best-practice cases in the field of operations and maintenance. The conference has invited experts from Twitter, Riot Games (developer of League of Legends), BAT, Huawei, and other leading companies in China and abroad.