On September 14 and 15, the GOPS Global Ops Conference was successfully held in Shanghai over two days, giving operations practitioners an excellent platform to communicate and learn from one another. Pei Zeliang, an engineer from the Architecture Platform Department of Tencent's Technology and Engineering Group (TEG), shared a talk on the theme of "Automated operations for Tencent's massive storage and CDN".

The video of the talk (including the high-definition slides) is available in the "Tencent Technology Class" mini program.


The prepared speech is attached below:

Pei Zeliang, from the Architecture Platform Department of Tencent's Technology and Engineering Group, has worked on building operations systems for more than 8 years. He took part in building the operations systems for Tencent Cloud CDB, Tencent's massive file storage system TFS, and Tencent's CDN service, from their early stages to relative maturity. He currently focuses on improving the operational quality of Tencent Cloud live streaming, video on demand, static-file CDN, COS and other businesses, and on building a more efficient and secure automated operations system.

What does Tencent's Architecture Platform Department do?

Tencent's Architecture Platform Department provides the massive storage and CDN services behind chat images in WeChat and QQ, photos in Moments, songs in QQ Music, Tencent game downloads, app downloads in the Yingyongbao app store, Tencent Cloud's COS object storage, video on demand and live streaming, and Tencent Video's on-demand and live services.

Today we manage more than 2 EB of total storage capacity, reserve more than 100 Tbps of bandwidth, run more than 200,000 servers, and have built more than 1,000 OC (edge) equipment rooms. The traffic of the services we provide accounts for more than 90% of Tencent's egress traffic, yet our operations staff number only about 50. To be clear, other teams support us behind the scenes, for example for machine procurement and maintenance and for building the equipment rooms; the 50 people are the operations staff for the services we manage.

You can understand our massive-service operations through the analogy of a power plant. Daily operation of a power station requires strong monitoring: you must detect anomalies in all kinds of indicators in real time, such as the current output voltage, current, and power generated. You also need to make all kinds of adjustments in the production environment during daily operation, such as loading fuel, adjusting capacity, and repairing parts; these correspond to our daily version and configuration changes and the repair of faulty machines. And of course safe operation is the foundation, because once an accident happens the consequences are unthinkable: for a power station, downstream industries and residents all lose power, causing huge economic losses; for us, lost user data cannot be recovered, and there would be an enormous crisis of trust. Intuitively, our operations challenges are a huge monitoring volume with many alarms, frequent changes to the live network, and high security requirements.

This is our automated operations system, which can be divided into three parts: basic systems, such as the configuration system, the device resource management system, and the resource budgeting and accounting system; general operations capability systems, such as monitoring, change, the PaaS operations platform, quality testing, and workflow; and business-specific operations systems, such as those for the photo album service, COS, and VOIP. What I am sharing today is the middle piece, the general operations capabilities. For us, all of these systems exist to guarantee business quality and control business costs.

Monitoring massive services

We have often run into complaints from the business or from users like "I can't see pictures in my Moments, what's going on?", while our staff look puzzled: "Everything seems fine, and I haven't received any alarms." That means monitoring is incomplete. Then we hunt for data in the various monitoring systems to figure out exactly what went wrong, click the "query" button, and the system keeps prompting "please wait, querying", finally returning data after a long delay; that means system performance is poor. When we finally get a view, another question comes up: thousands of machines are reporting failure data, so which machines are actually reporting large amounts of it? The system lacks the multi-dimensional drill-down analysis needed to find the source. Let's see how we solved these problems.

This is an overview of our monitoring. I will mainly cover what differs from common monitoring systems: reporting, real-time computing, automatic anomaly detection, and automatic analysis.

This is our reporting module. The upper layer sends data to the servers over the internal or external network. The data we monitor falls into three categories: structured data, i.e. time-series data; detailed logs, i.e. program flow data; and custom data, where businesses report whatever specific data they need through the monitoring reporting channel. Of course, the monitoring system cares most about structured time-series data such as traffic, request counts, and latency.

What is special about our reporting is that it is embedded in the business logic: every time a user request is processed, the related data is reported through the monitoring API. For example, every time the business logic calls a backend system, the call latency, the calling and called interfaces, success or failure, and the error code are reported through the monitoring API. Inside the reporting API we aggregate multiple data points of the same type into one and then report it to the monitoring agent daemon on the local machine. The agent sends second-level data directly to the monitoring platform, forming second-level monitoring, and at the same time aggregates multiple second-level data points of the same type into one minute-level data point. At present, 600 million minute-level structured data points are reported to the monitoring platform every minute.
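
To make the in-process aggregation idea concrete, here is a minimal sketch, not Tencent's actual API: each request reports a metric with its dimensions, reports of the same type are merged in memory, and the merged slots are flushed to a local agent roughly once a second. All names and the transport are illustrative assumptions.

```python
import time
from collections import defaultdict

class MonitorReporter:
    def __init__(self, flush_interval=1.0):
        self.flush_interval = flush_interval          # flush to the local agent about once a second
        self.buffer = defaultdict(lambda: {"count": 0, "latency_total": 0.0, "fail": 0})
        self.last_flush = time.time()

    def report(self, module, interface, region, isp, latency_ms, ok):
        # All reports sharing the same (module, interface, region, isp) are merged into one slot
        slot = self.buffer[(module, interface, region, isp)]
        slot["count"] += 1
        slot["latency_total"] += latency_ms
        slot["fail"] += 0 if ok else 1
        if time.time() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        for key, slot in self.buffer.items():
            send_to_local_agent(key, slot)            # hypothetical hand-off to the agent daemon
        self.buffer.clear()
        self.last_flush = time.time()

def send_to_local_agent(key, slot):
    print("second-level report", key, slot)           # placeholder for the real transport
```

The agent would then repeat the same merge over 60 seconds of slots to produce the minute-level data described above.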

To sum up, we support second-level, minute-level, detailed-log, and business-defined data in the same reporting channel.

Our monitoring system uses a structured, multi-dimensional, multi-metric model to describe a business's monitoring data. Metric and dimension are common concepts in OLAP. Concretely, a software module's monitoring data is decomposed into multiple metrics and multiple dimensions. For the Moments picture download module, for example, the metrics include traffic, latency, request count, failure count and so on, and the dimensions include region, carrier, picture specification and so on. Each monitored user download request corresponds to metric data for some combination of dimensions; the red line in the figure, for instance, indicates relatively high traffic for Shanghai Telecom. The monitoring system automatically maps each dimension-metric combination reported by a module to a unique feature ID, which is a number. The time-series values of that combination are then stored in a separate KV system with the feature ID as the key and the series as the value, while the relationship between the feature ID and the dimension-metric combination is stored in a DB as configuration data. One (key, value) pair stores the time-series values of 120 points over 2 hours, so the same key has 12 values per day. With a special compression algorithm, the daily storage volume exceeds 350 GB.
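
A toy sketch of the mapping and storage layout described above may help. In reality the ID mapping would live in a DB and the per-ID series in a KV store; here both are plain dictionaries and all names are illustrative.

```python
feature_ids = {}           # (module, metric, dims...) -> numeric feature ID (config data in the DB)
next_id = [1]
kv_store = {}              # (feature_id, two_hour_window) -> list of up to 120 minute points

def get_feature_id(module, metric, dims):
    key = (module, metric) + tuple(sorted(dims.items()))
    if key not in feature_ids:
        feature_ids[key] = next_id[0]
        next_id[0] += 1
    return feature_ids[key]

def write_point(module, metric, dims, minute_of_day, value):
    fid = get_feature_id(module, metric, dims)
    window = minute_of_day // 120                     # 12 windows of 120 minutes per day
    series = kv_store.setdefault((fid, window), [0] * 120)
    series[minute_of_day % 120] = value

write_point("photo_dl", "traffic", {"region": "Shanghai", "isp": "Telecom"}, 605, 1.7e9)
```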

To sum up, the multi-dimensional model can effectively describe a business's various monitoring data.

The multi-dimensional model is good, but it also brings an obvious problem: the software only reports the most basic data for each dimension combination, while users need to query data across dimensions. For example, as mentioned above, the software reports high traffic for Shanghai Telecom, but the user wants to see Telecom traffic as a whole.

What to do?

If the data were stored in MySQL, we could simply SELECT ... SUM ... GROUP BY, but data at this scale is clearly not suited to a relational database, so we store it in a KV system. We then adopt a "real-time computation plus on-demand computation" approach to solve the aggregation problem. Real-time computation aggregates, across machines, the basic data for the various dimension combinations reported directly by the software; on-demand computation aggregates on the fly when the combination a user queries is not one of the basic ones.
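
Here is a minimal illustration of the "pre-computed base data plus on-demand aggregation" idea: the finest-grained combinations are stored as reported, and a query that fixes only some dimensions is answered by summing the matching base rows on the fly. Data and dimension names are invented for the example.

```python
base = {
    # (region, isp, spec) -> requests per minute (toy numbers)
    ("Shanghai", "Telecom", "large"): 1200,
    ("Shanghai", "Unicom",  "large"): 300,
    ("Beijing",  "Telecom", "small"): 800,
}

def query(**fixed):
    # e.g. query(isp="Telecom") aggregates over all regions and specs
    total = 0
    for (region, isp, spec), value in base.items():
        row = {"region": region, "isp": isp, "spec": spec}
        if all(row[d] == v for d, v in fixed.items()):
            total += value
    return total

print(query(isp="Telecom"))   # 2000: Shanghai Telecom + Beijing Telecom combined
```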

Of course, you will have noticed there is still a problem: the dimension combinations a user queries can number in the hundreds of thousands, and on-the-fly aggregation cannot be guaranteed to return within one second. What then? Here we use our real-time-computed index technique. The idea is similar to building an index in MySQL to speed up a SELECT: we create an "index" for a dimension that clearly increases query time, where the index aggregates over that dimension in combination with the other dimensions, and the index is added to the real-time computation so its data is computed every minute. Does the user then have to maintain these indexes themselves? No. We have automated this: when the system finds that a business has so many dimension combinations that query speed may suffer, it automatically finds the optimal dimension and adds an index for it.
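
As a rough sketch of the automatic-index idea (the selection heuristic below is my own illustration, not the production logic): when the number of base combinations gets too large, pick a high-cardinality dimension, pre-aggregate over it every minute, and answer queries that do not fix that dimension from the pre-aggregated result.

```python
from collections import Counter

def pick_index_dimension(base_keys, dim_names, threshold=10000):
    # Illustrative heuristic: once there are too many base combinations,
    # index the dimension with the most distinct sub-items.
    if len(base_keys) <= threshold:
        return None
    cardinality = {d: len({key[i] for key in base_keys}) for i, d in enumerate(dim_names)}
    return max(cardinality, key=cardinality.get)

def build_index(base, dim_names, index_dim):
    # Pre-aggregate (sum) over index_dim, keyed by the remaining dimensions
    keep = [i for i, d in enumerate(dim_names) if d != index_dim]
    idx = Counter()
    for key, value in base.items():
        idx[tuple(key[i] for i in keep)] += value
    return idx
```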

The overall idea is somewhat similar to Apache Kylin's pre-computation, but different. Kylin pre-computes all combinations pre-defined by the user, regardless of whether they will actually be used, whereas we do one pass of pre-computation, plus a second pass of automatic, on-demand index pre-computation, plus on-the-fly aggregation on demand. Even if a user is looking at hundreds of thousands or even millions of dimension combinations, the system returns query results in sub-second time, while keeping the volume of pre-computed results from growing too large, which reduces system cost.

However, as business volume grows there are still challenges. For example, a dimension may contain too many sub-items, hundreds of thousands or even millions of them, such as CDN domain names or Tencent Cloud Live rooms. At that point even real-time computation is not very effective.

Careful analysis of the business characteristics shows that most of these are long-tail users that contribute very little to overall traffic; some domain names are requested only dozens or hundreds of times a day, and doing ordinary minute-level monitoring and alarming for them is unnecessary and a waste of monitoring resources. So we introduced the concept of differentiated monitoring for key businesses versus long-tail businesses. After data is reported, a real-time analysis module writes the long-tail business data into HBase, and a Spark-based long-tail analysis task runs a round of analysis every 5 minutes. Businesses that meet certain conditions, such as traffic reaching a certain threshold, are promoted to key businesses. This way, limited resources are focused on guaranteeing key businesses, while long-tail businesses still get 5-minute-level fault detection and are automatically promoted to key businesses within 5 minutes once their traffic reaches the bar, which strengthens monitoring capability.
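
A simplified 5-minute batch might look like the following (the real system uses Spark over HBase; the threshold and domain names here are invented): domains whose 5-minute traffic crosses the bar are promoted to "key business" and gain minute-level monitoring, while everything else stays on the cheaper 5-minute path.

```python
KEY_TRAFFIC_THRESHOLD = 1_000_000        # bytes per 5 minutes, illustrative only

def classify(five_minute_traffic_by_domain, current_key_set):
    promoted = {
        domain
        for domain, traffic in five_minute_traffic_by_domain.items()
        if domain not in current_key_set and traffic >= KEY_TRAFFIC_THRESHOLD
    }
    return current_key_set | promoted

key_domains = classify({"a.example.com": 2_500_000, "b.example.com": 300}, set())
print(key_domains)   # a.example.com is promoted; b.example.com stays in the long tail
```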

I mentioned second-level monitoring in the reporting section earlier. For us, the most important use of second-level data is not second-level alarming, which would bring a lot of glitch harassment. Rather, when an anomaly occurs it helps analysis, for example revealing uneven load across the seconds within a minute where minute-level data hides the real problem of momentary high load, and it supports second-level analysis for urgent scenarios such as live streaming and red-envelope events.

AIOps is very popular, and we have also explored this area. It is now applied to automatic anomaly detection and to filtering alarm glitches, with good results, and we are still exploring automatic root-cause tracing of anomalies. What I want to stress is that for our business, even with machine-learning-based anomaly detection and automatic glitch classification of alarms, traditional manually configured thresholds and fluctuation alarms still cannot be dropped. For example, when a business has recently become very popular and is sensitive to quality jitter, we will still configure traditional manual thresholds and fluctuation alarms. So in our view AIOps is not about automation taking over everything; the traditional threshold strategies are still needed as a backstop in case the automation fails.

For automatic anomaly detection we adopt a two-stage approach, first screening with statistical analysis algorithms. We currently use four: Grubbs, EWMA, least squares, and first-hour average. The input curves are today's, yesterday's, and last week's data for the same period, and the four algorithms vote on whether the current point is anomalous. The candidate anomalies are then passed to an Isolation Forest (IF) algorithm for an isolated-point judgment, which yields the final anomalies. These are unsupervised algorithms; they work well for relatively smooth data such as traffic, latency, and failure rate, but are basically ineffective in some scenarios. The chart on the right, for example, is live-streaming traffic: when a big streamer comes online the traffic rises sharply, when one goes offline it drops sharply, and the times streamers come and go are unpredictable for the business as a whole. A human looking at the curve struggles to tell what is abnormal and what is normal, and the algorithms are equally at a loss. What I want to emphasize is that algorithms must be combined with business characteristics rather than looking at the data alone, in order to ultimately improve accuracy. For live-streaming traffic we can simply use traditional thresholds, and add intelligent detection on the stall rate, which achieves the fault-detection effect we need.
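
The sketch below shows the shape of such a two-stage flow, not the production code: stage one lets cheap statistical detectors vote on the latest point over today's, yesterday's and last week's curves (EWMA and a z-score stand in for the four real detectors), and stage two confirms suspects with an Isolation Forest so lone blips are filtered. It assumes scikit-learn and NumPy are available.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def ewma_detector(series, alpha=0.3, k=3.0):
    ewma, resid = series[0], []
    for x in series[1:]:
        resid.append(abs(x - ewma))
        ewma = alpha * x + (1 - alpha) * ewma
    band = k * (np.mean(resid) + 1e-9)
    return abs(series[-1] - ewma) > band

def zscore_detector(series, k=3.0):
    hist = np.asarray(series[:-1])
    return abs(series[-1] - hist.mean()) > k * (hist.std() + 1e-9)

def is_anomaly(today, yesterday, last_week):
    series = np.concatenate([last_week, yesterday, today])
    votes = sum([ewma_detector(series), zscore_detector(series)])   # stand-ins for the 4 detectors
    if votes == 0:
        return False
    # Stage two: Isolation Forest over the recent window; -1 marks the latest point as isolated
    recent = series[-120:].reshape(-1, 1)
    clf = IsolationForest(contamination=0.05, random_state=0).fit(recent)
    return clf.predict(recent[-1:])[0] == -1
```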

Occasionally there are spikes in a business's quality data, for example the latency or failure rate rising noticeably for a few seconds and then immediately dropping back. Such glitches have no impact on most businesses, but if every one of them produced an alarm, the operations staff could not stand it.

We use supervised learning for intelligent glitch filtering. Since it is supervised, people must participate: we built an alarm service on WeChat where on-call owners can claim an alarm directly in WeChat and select the cause of the anomaly, one of the options being "glitch/jitter". This gives us labelled samples. We then build sample features from the business characteristics of the alarm, such as time, region, and carrier, and train RF, GBDT, and SVM glitch models separately. Subsequent alarms are voted on by the three models, and only if at least two judge an alarm to be a glitch do we finally classify it as one and downgrade the notification to the owner. The training is online: as owners keep claiming alarms the models keep getting more accurate, and the glitch filtering stays effective. In the end, intelligent anomaly detection plus glitch filtering comprehensively improves both the coverage and the accuracy of alarms.
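
A minimal sketch of the 2-of-3 vote might look like this; the feature columns and training data are invented placeholders for the real labelled alarms, and scikit-learn is assumed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Illustrative features per alarm: duration_s, region_id, isp_id, hour_of_day
X = np.array([[3, 1, 0, 12], [45, 2, 1, 3], [5, 1, 0, 11], [60, 3, 1, 2]])
y = np.array([1, 0, 1, 0])     # 1 = glitch claimed by the owner, 0 = real fault

models = [RandomForestClassifier(n_estimators=50, random_state=0),
          GradientBoostingClassifier(random_state=0),
          SVC()]
for m in models:
    m.fit(X, y)

def is_glitch(alarm_features):
    votes = sum(int(m.predict([alarm_features])[0]) for m in models)
    return votes >= 2          # at least 2 of 3 models must agree before downgrading the alarm

print(is_glitch([4, 1, 0, 12]))
```

In practice the models would be retrained continuously as new claims arrive, which is the "online" aspect described above.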

Automatic alarm analysis and handling is another relatively distinctive aspect of ours. We built a complete automatic analysis system: after an anomaly occurs, if a corresponding automatic analysis tool exists, the alarm is not sent directly to the user; instead it is sent to the analysis tool, and the analysis result is then pushed to the user. If a handling tool also exists, the user can click to handle the problem right when receiving the analysis result. Note that the analysis and handling tools live on our PaaS operations platform, where a tool can be built just by dragging components and writing small scripts; that is the convenience of our automatic analysis system. It should be emphasized that "automatic analysis" here does not mean the system has a general, unconditional ability to analyze root causes after any anomaly; even a person unfamiliar with the business could not achieve such general analysis. But for common problems, dedicated analysis and handling tools greatly improve operations efficiency.

For automatic alarm analysis and handling, the effect we ultimately want is similar to an automatic sprinkler system: when a fire is detected, the sprinklers turn on automatically to put it out.

Without smarter, more accurate capacity estimates, it is common to see operations and budget people arguing: "I need 200 machines." "Why do you need that many? What's the justification? Can you cut it down?"

On capacity management, we often hear about Docker, K8s and so on, so why do we need separate capacity management? For us, modules fall into two categories: stateless modules, which can be managed with Docker, and stateful modules, which are not well suited to Docker management. For Docker-managed capacity, we have second-level monitoring of CPU and traffic to compute capacity demand, and we pre-deploy containers on physical machines to achieve second-level scaling; we have built a Docker resource pool of more than one million cores, used for scenarios such as image compression, video transcoding, and AI training. For capacity management of stateful modules, we introduce machine learning algorithms to evaluate capacity automatically, linked with the budget and resource systems so that equipment growth requests can be submitted with one click.

Our capacity evaluation principle is this: from the monitoring system we can get a module's request count and the corresponding traffic and CPU data, and we build a model with regression algorithms. In practice we often add key business features to the model, such as image specification (large version, inline thumbnail) and request type (pictures, video, files), because requests of different image specifications and request types obviously generate very different traffic. When the business reports expected natural growth, or requests resources in advance for an event, it is easy to calculate how many additional requests there will be, how much traffic that implies, how much CPU it will consume, and therefore how many machines are needed. With machine learning, capacity assessment is no longer a rough human estimate but an accurate evaluation.
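
As an illustration of the idea (not the production model), a simple regression can link request volume and business features to CPU demand, so a growth projection translates directly into an equipment count. The data, feature choices, and cores-per-server figure below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: requests/s, share of large-spec images, share of video requests (toy data)
X = np.array([[1000, 0.2, 0.1], [2000, 0.2, 0.1], [1500, 0.5, 0.3], [3000, 0.4, 0.2]])
cpu_cores = np.array([8, 16, 15, 28])

model = LinearRegression().fit(X, cpu_cores)

future = np.array([[3600, 0.4, 0.2]])       # e.g. projected request mix after 20% growth
needed_cores = model.predict(future)[0]
servers = int(np.ceil(needed_cores / 32))   # assuming 32 usable cores per server
print(needed_cores, servers)
```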

From then on, operations no longer argues with the budget managers when requesting equipment. "Based on the performance of the existing modules, supporting the projected growth requires this many machines." "Then it seems we need to challenge the developers on their program's performance and whether it can be optimized."

Secure and efficient live network change and operation

Monitoring is for finding problems. Suppose monitoring is excellent and finds lots of problems, but there are no handy tools when problems occur: operations can only muddle along and take ages to resolve them, which is very inefficient and not what we want. Let's look at how we built tooling for changes and operations.

If the entire production system has only a few dozen machines, you can still SSH in directly. With thousands of machines that no longer works, and you can use Ansible or SaltStack. But in a complex network environment with hundreds of thousands of production machines and certain security policies, Ansible and SaltStack are not feasible either, so we built our own control platform. It has only two core functions, executing commands and transferring files; security policies such as operation workflow management, the template mechanism, and operation scope isolation are built on top of it. Of course, the control platform also has to hide the differences between various network and regional environments to give upper-layer users a unified experience.

For file transfer we use a P2P-like approach: when the source and target machines can connect directly, we transfer directly; when they cannot, the system automatically computes the optimal path and relays through access servers. All of this is invisible to upper-layer callers; the system completes it automatically. For example, the version file for a change may sit on a server in an IDC while the target machine is on a CDN edge node with no external network; in that case relayed file transfer is used automatically.
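
A toy sketch of the path decision might look like this: transfer directly when source and target can reach each other, otherwise relay through an access server. The hostnames and the reachability representation are purely illustrative.

```python
def plan_transfer(src, dst, reachable, access_servers):
    if (src, dst) in reachable:
        return [src, dst]                                   # direct transfer
    for relay in access_servers:
        if (src, relay) in reachable and (relay, dst) in reachable:
            return [src, relay, dst]                        # relay via an access server
    raise RuntimeError("no path from %s to %s" % (src, dst))

reachable = {("idc-1", "acc-1"), ("acc-1", "oc-edge-7")}
print(plan_transfer("idc-1", "oc-edge-7", reachable, ["acc-1"]))   # ['idc-1', 'acc-1', 'oc-edge-7']
```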

At present our control platform manages more than 300,000 endpoints and schedules more than 50 million jobs per day. For massive operations, the control platform is the only path through which operation systems touch production machines; nobody is allowed to operate production machines directly via expect scripts or SSH any more. The control platform is therefore a very basic and important part of automated operations.

Changes are a frequent cause of live-network failures, so the change system is a critical link, and it is built on top of the control platform. This is our change process: from building the version, to automated build and test, then a gray (canary) change, and then batched changes after the effect is confirmed. We have automated the whole process; we now run around 50 change tasks a day, touching more than 10,000 machines. A distinctive piece of automation is monitoring during the change: after each machine is changed, the system uses machine learning to check whether indicators such as machine load, business request volume, and failure count are abnormal compared with before the change. If everything is normal the change continues; once an anomaly is found the change is paused and the owner is notified to handle it.
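
A simplified sketch of per-machine change verification is below: compare each host's key metrics against its pre-change baseline and pause the rollout when the deviation is too large. The metric names, threshold, and callbacks are assumptions, not the real system's interfaces.

```python
def metrics_ok(before, after, max_ratio=1.3):
    for name in ("cpu_load", "failed_requests"):
        if after[name] > before[name] * max_ratio:
            return False, name
    return True, None

def rollout(machines, get_metrics, apply_change, notify):
    for host in machines:
        baseline = get_metrics(host)
        apply_change(host)
        ok, bad_metric = metrics_ok(baseline, get_metrics(host))
        if not ok:
            notify("change paused on %s: %s regressed" % (host, bad_metric))
            return False          # stop the batch and wait for a human decision
    return True
```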

Another distinctive piece is inspection after the change. Scaling and other changes differ from business to business. For example, business A faces the external network and uses TGW, Tencent's gateway product similar to LVS; when it scales out it must apply for firewall rules and set up tunnels, and if any step of the deployment is done wrong, problems eventually surface. So after the change we run a connected inspection tool to confirm the effect of the change. These inspection tools are not built into the change system, because they differ between businesses, are diverse, and change often, so they do not fit there; instead we put them on our own PaaS operations platform. The PaaS operations platform makes it easy to create such small tools and connect them to the change system, so that a single unified change system can achieve effects tailored to each business.

Previously each operations person had scripts they had written themselves; when you needed one you hunted around for it, like rummaging through a messy pile of tools. Because they were self-written, these personal tools were hard to hand over to others, hard to keep alive, and uncontrolled and insecure. What to do?

We built a lightweight PaaS operations platform on which operations staff can conveniently write all kinds of tools and quickly assemble complex processes. The goal is to quickly piece together needed functionality from tools plus processes. The platform itself has a tool scheduling engine and a process execution engine, uses the control platform as its foundation, and is exposed upward to other operation systems or used directly by operations staff. On the right is an example of a tool with a crontab added. There are now more than 1,000 tools and processes on the platform, and manual usage exceeds 5,000 runs per week.

With this PaaS operations platform we no longer maintain small scripts on our own machines; tools become easier to maintain, secure and controllable, and easy to share and pass on.

Live network security system

With monitoring done well, can we just sip coffee at our desks, watch the dashboards, take the occasional alarm call, and find everything fine? We have also built all these colourful tools with their many functions. But if the production machines are not specially protected and operations staff can touch them casually, it will end up like this picture: even though nobody intends any sabotage, walking close to danger all the time eventually leads to an accident. So we must build a safer path for operating production machines.

Ctrip's accidental deletion in 2015, Didi's accidental deletion in 2015, the accidental AWS S3 outage in 2017, and even Tencent Cloud's mistaken data reclamation in 2018 all demonstrate the importance of safe operations. Safety is no small matter, and you never know whose turn will come next.

We divide the security of production machines into two broad areas: humans logging in directly to production machines, and operation systems operating production machines. In principle, if we lock down these two aspects, we can guarantee the security of operations overall. Put simply: do not let operation systems operate production machines arbitrarily, and do not let people operate production machines arbitrarily. We have specific controls for both. Operation systems can only operate production machines through the control platform, so we lock down the control platform's security policies. Concretely, high-risk operations must be templated: the operating tool is reviewed by a human before going online, and after that none of the people who execute it can change the tool's code. In addition, the execution rate of high-risk operations is limited, for example to 100 devices per hour, and an operation system can only operate the production machines of its own business, not all production machines.
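
A sketch of such a policy gate is shown below: the tool must be an approved, frozen template, the hourly device quota must not be exceeded, and the caller may only touch machines belonging to its own business. Everything here (names, hash check, quota bookkeeping) is illustrative, not the platform's actual interface.

```python
import time
from collections import deque

APPROVED_TEMPLATES = {"restart_storage_svr": "sha256:ab12..."}   # reviewed tool -> frozen hash
HOURLY_DEVICE_LIMIT = 100
recent_ops = deque()                                             # timestamps of recently operated devices

def allow(tool, tool_hash, targets, caller_business, owner_of):
    if APPROVED_TEMPLATES.get(tool) != tool_hash:
        return False, "tool is not an approved, unmodified template"
    now = time.time()
    while recent_ops and now - recent_ops[0] > 3600:
        recent_ops.popleft()
    if len(recent_ops) + len(targets) > HOURLY_DEVICE_LIMIT:
        return False, "hourly device quota exceeded"
    if any(owner_of(h) != caller_business for h in targets):
        return False, "target machines outside the caller's business scope"
    recent_ops.extend([now] * len(targets))
    return True, "ok"
```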

For direct logins to production machines, we split permissions into two kinds: ordinary and root. Service processes are started by root, and humans get only the ordinary permission, which means that after logging in to a production machine a person can look but cannot modify. Of course, even having done all this, can we really guarantee there will be no mis-operations?

As long as there are people and constantly changing requirements, there is no guarantee of being 100% error-free. But we still need to do our best to ensure safety at the root and to minimize the loss any accident can cause.

Having done all this, 97% of our operations on live-network production machines are now carried out through the various operation systems, such as the change system, the PaaS operations platform, and business operation systems; about 2% go through the control platform; and only about 1% involve logging in directly to production machines to make changes, which is normally used only when handling faults. It is like this inverted triangle: the lower you go, the higher the risk and the lower the efficiency. This three-layer structure balances operations efficiency and security overall.

In short: avoid operating production machines directly whenever possible, avoid hastily writing tools on the spot, and use tools prepared in advance as much as possible, so as to stay away from danger.

Summary & Future

To sum up today's sharing: in monitoring we use a multi-dimensional, multi-metric model to describe data and a real-time computation plus on-demand computation approach for aggregation, adding an automatic indexing mechanism to improve query speed; for capacity we use machine learning algorithms for more accurate evaluation; for changes we built the control platform that manages a huge number of devices and automated anomaly detection during the change process; for tools we built the PaaS operations platform that gathers scattered small tools in one place, making them easy to use, share, and maintain; and for security we locked down the risk points of humans logging in to production machines and of operation systems operating them. In the end most production-machine operations go through front-end systems, and only a very small number of users log in to production machines, for example to locate faults.

In the future, we will continue to explore in operation and maintenance security, continue to make strides in AI OPS, and continue to dig into the value of massive data in operation.