Speaker introduction

Yang Lidong

A ten-year operations veteran, currently responsible for the backend operations of Weizhiyun QQ information and the relationship chain.

I have experienced the explosive growth of backend support for the Farm and Ranch games and the advertising business, as well as the development and maturation of the operations service system, architecture optimization, and automated operations. I have rich experience and my own thinking on building massive-scale operations service systems and automation systems.

Let me share an old, well-worn topic: automated operations.

Tencent Cloud's automation system covers a lot of ground, as the figure above shows. My colleagues have already shared parts of Zhiyun such as components, basic operations, architecture, scheduling, monitoring, AI, and algorithms.

What I am going to share is the part in the red box, such as package management. I will explain what our "packages" are; they are not LV or Hermès bags.

I will also share business CMDB configuration management and automated processes. Along the way I will talk about the pitfalls we ran into and our ideas and experience; if that brings you some inspiration, my goal is achieved.

I have worked in operations for many years. I used to run operations for the Farm and Ranch games and then the advertising business; now I am mainly responsible for operations of QQ passwords, data, and the relationship chain. These are the three pieces of content I want to share.

1. Our story

During rapid growth, all kinds of problems appear, and operations bears a lot of pressure along the way.

For example, the R&D team grows very fast and hires a lot of engineers in a short time, so requirements become correspondingly large and urgent. Moreover, they may not use the same framework, and their coding habits and components may differ.

On the operations side, you may manage to persuade the boss to add headcount, hire two new operations engineers, and then find that the two of them have quite different abilities.

Daily communication and the development and management of scripts are definitely affected, because people understand technical points, scripts, and architectures differently. On top of that there are business trips, transfers, and departures.

Finally, you may spend a lot of energy persuading the boss to invest in operations optimization and pick well-known solutions from the industry, only to find they cannot be pushed through: R&D may not cooperate, and however good the idea is, it is hard to land.

Beyond daily pressure, there are also sudden surges. The first example is the "everyone posts their 18-year-old photo" craze: everyone shared photos, and the number of pictures uploaded in a short time was four times the normal weekday evening peak. The second is QQ red envelopes: within two weeks we brought up about 800 modules and nearly 30,000 devices to support the Spring Festival red-envelope activity.

2. The past and present of the package management system

Before we talk about package management and operations solutions, let's talk about R&D organizational structure. When a company is just born, it must be the R&D folks who set up the environment; basically everything that needed doing was done by them. Then the boss decides that having R&D do operations on the side no longer works and dedicated operations is needed, so you are hired to fill the hole.

Which solutions work well under which organizational structure?

  • The first type is centralized: one head of R&D manages everything centrally, and development happens under a unified framework.

    Under this framework many things are easy to do. But it requires the head of R&D to have been aware of the problem before the company started, or right at the beginning; or, if the problem is only found along the way, to have a strong will and a strong ability to transform and correct it.

  • The second type is multi-center. This is our focus today. With multiple centers, different departments are responsible for different products; they may or may not use the same components, but their environment requirements are generally the same and do not vary much. In this situation a standardized, efficient model can emerge.

  • The third type is discrete. I just said the environments are generally consistent; if even the environments are inconsistent, things get absurd, and you are left with a tool-by-tool, case-by-case operations setup.

When we face these people problems, we must first have a concept and a guiding principle: the optimization of the whole program must be gradual, not achieved in one step.

There is no easy, low-cost, time-saving, effort-free solution.

Take a look at our path: the first step was package management; then, as the number of devices grew, we built the SPP framework components; then name services and scheduling; then CMDB resources, mirroring, and automated processes; and finally the data bank and intelligent operations. This is a modular, step-by-step delivery process that fits the growth of the company well.

Is it useful to hand AI operations to a company with 10 servers? No. I have been asked, "What should our company do?" My answer is: do what fits your specific situation; do not adopt something just because it has become a trend.

It can be very frustrating to have a lot of developers asking you for restart commands and routine operations. These questions are very simple, and they drag operations into repetitive work with little value.

Our package management system was built to solve the problems above: the most tiresome work operations has to take on, and work that rarely shows visible output. One idea was a unified development framework; companies like Google already have open-source solutions along those lines.

For a multi-center organization like ours, dealing with a single team is not easy, and dealing with multiple teams is even harder; developers are busy with their own business.

So we came up with a unified management framework. The starting point is to optimize operations management, and the goal is manageability; we do not want to force development to change underlying code extensively, which is unrealistic. So we sorted out the file directories and script functions.

This is the effect after the sorting. So what exactly is a package? I give "package" a simple definition: a collection of files designed to perform a specific set of functions.

The directory layout is planned around the functions of the package: Admin holds restart and cleanup scripts, Conf holds configuration files, Log holds log files, and Client holds plugins. This formal partition of files makes all packages look the same. It sounds simple, but there is more to it than that, as the following aspects show.
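
As a rough illustration only (the directory names below are taken from the description above; the check itself is an assumption, not Zhiyun's actual implementation), a layout check for such a convention might look like this:

```python
import os

# Assumed standard directories, per the description above:
# Admin = restart/cleanup scripts, Conf = configuration, Log = logs, Client = plugins.
REQUIRED_DIRS = ["Admin", "Conf", "Log", "Client"]

def missing_dirs(pkg_root):
    """Return the standard directories a package is missing."""
    return [d for d in REQUIRED_DIRS
            if not os.path.isdir(os.path.join(pkg_root, d))]

# Example (hypothetical path): missing_dirs("/usr/local/pkg/qq_profile-1.2") -> ["Client"]
```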

  • File management

At the time I was in charge of advertising operations, a cash-flow business where every minute costs money. I wanted to uninstall one package from one IP, accidentally selected everything, and found that all the packages on that IP had been deleted. I later added a soft link (symlink) step to the package's post-install script, so I no longer had to worry about a package being installed without its symlink. That really solved a big problem.
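
As a sketch of that kind of post-install step (hypothetical paths and names, not the actual Zhiyun script), the package could maintain a stable symlink that always points at the installed version:

```python
import os

def post_install_symlink(version_dir, link_path):
    """Point a stable symlink (e.g. .../app/current) at the freshly installed
    version directory, replacing any existing link atomically."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(version_dir, tmp)   # create the new link under a temporary name
    os.replace(tmp, link_path)     # atomic rename: callers never see a broken link
```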

Another big pitfall is public files, where multiple programs must access the same file. We still run into this: Qzone has a public file with thousands of lines of configuration, and to manage it we built a dedicated system with all kinds of locks. In short it is very complicated, and accidents happen often in this situation.

So if we manage programs with the idea of "packages", this kind of public file must be split into its own package rather than mixed in with others; once multiple editors are involved, there will be problems. Therefore, public files must be separated out.

  • Process management

If you need to start and stop processes all day long, the package must have process start/stop functions. As long as you have the parameters, you can start it. But we do impose one restriction: "stop uses exact matching" is a default that you are not allowed to edit, because our stop is automated.

These operations often lead to killing the wrong process, yours or someone else's, and that is why this rule exists.
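
A minimal sketch of that exact-match rule (an illustrative helper, not Zhiyun's actual stop script): only processes whose full command line matches exactly are stopped, so a loose substring match can never kill someone else's process.

```python
import subprocess

def stop_exact(cmdline):
    """Stop only processes whose full command line equals `cmdline` exactly."""
    out = subprocess.run(["ps", "-eo", "pid,args"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines()[1:]:             # skip the header row
        pid, _, args = line.strip().partition(" ")
        if args.strip() == cmdline:               # exact match, never a substring
            subprocess.run(["kill", pid])
```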

We also have a process self-monitoring function, which is needed when a process hangs or dies because of abnormal requests received by the business process.

So how do we do it? On the live network, every 3 minutes we run ps to look for the processes belonging to the package. If a process is running, we do nothing. If it is not, we check the marker bit: if the marker says it should be running, we pull it up, and alarm on failure.

If the process is not there and the marker bit is not set, then nothing is done.

Why is there such a thing as a "marker bit"? Those of you who understand the architecture know this is very complex: to maintain this bit we went through dozens of versions, with various locks and protections inside.

The need comes from the product. When users install a package, there is a requirement that the package be installed but not started, because it may depend on other files or other parts of the environment that are not ready yet; if started too early, it simply cannot come up.

We have a "start after install" option, and if you do not want the package to start, you turn it off. Even so, things kept happening that the user did not expect, and that is not acceptable. So we made this "marker bit" to record whether the process has been started: set when it is started, not set when it is not.
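
Putting the pieces together, a minimal sketch of the self-monitoring loop could look like the following. The package attributes (`process_name`, `start_script`, `started_flag`) and the print-based alarm are assumptions for illustration; only the 3-minute interval and the marker-bit rule come from the description above.

```python
import subprocess, time

CHECK_INTERVAL = 180  # seconds: scan the live network every 3 minutes

def is_alive(process_name):
    """True if any running command line contains the package's process name."""
    out = subprocess.run(["ps", "-eo", "args"], capture_output=True, text=True).stdout
    return any(process_name in line for line in out.splitlines()[1:])

def self_monitor(pkg):
    """pkg is assumed to carry .process_name, .start_script and .started_flag."""
    while True:
        if not is_alive(pkg.process_name):
            if pkg.started_flag:                    # the "marker bit": user has started it
                ok = subprocess.run([pkg.start_script]).returncode == 0
                if not ok:
                    print("ALARM: failed to pull up", pkg.process_name)  # stand-in for the real alarm
            # marker bit not set: installed but intentionally not started -> do nothing
        time.sleep(CHECK_INTERVAL)
```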

  • Version management

Product iterations are very, very frequent; in this example there are over 100 versions. How do we do version management? The differences between versions are captured by descriptions. Sounds fine, but what if the description is wrong? We have a solution.

There are three questions:

  1. If you store 1,000 packages with 100 versions each, do you really need 9.5 terabytes of storage?

  2. If 100 IPs are upgraded to a new version, is the transfer cost 12,000 MB or 2,000 MB?

  3. If there is a problem with the upgrade and we roll back, what is the network transfer cost?

This is our plan: call the diff function when downloading and transfer only the changed files, that is, the files that actually differ, which is very fast. We have over 40,000 packages and over 600,000 versions, yet only 3 terabytes of package storage.

What exactly is file-level diff? The image on the left is the Zhiyun package and the one on the right is the mirror package. With the Zhiyun package this is the transfer method; with the mirror package this is the diff: going from the second version to the third, only files 4 and 5 need to be transferred.

If files 1 and 2 are very large and have to be transferred every time, the overall upgrade experience and time cost become very poor. Our package management is very lightweight and agile, and this is what the technology actually looks like.

Our version rollback is also very fast. During an upgrade, the diff content is kept on local disk. If you want to roll back, you copy directly from local disk to local disk; the network cost is zero and nothing needs to be re-transferred.

All of our package versioning and storage is incremental, including saving, downloading, and rolling back.
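
A minimal sketch of how a file-level diff and the zero-network rollback could work under these assumptions (one content hash per file, diff files cached on local disk; the function names are illustrative, not the real implementation):

```python
import hashlib, os, shutil

def file_hashes(version_dir):
    """Relative path -> content hash for one package version."""
    result = {}
    for root, _, files in os.walk(version_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, version_dir)
            with open(path, "rb") as f:
                result[rel] = hashlib.sha1(f.read()).hexdigest()
    return result

def changed_files(old_dir, new_dir):
    """Only these files need to be transferred to upgrade old -> new."""
    old, new = file_hashes(old_dir), file_hashes(new_dir)
    return [p for p, h in new.items() if old.get(p) != h]

def rollback(current_dir, cached_old_files_dir):
    """Rollback copies the previously cached files back from local disk:
    disk-to-disk only, zero network cost."""
    for root, _, files in os.walk(cached_old_files_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, cached_old_files_dir)
            dst = os.path.join(current_dir, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
```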

We have more than 20,000 package operation tasks a day, which is very frequent; each takes 10 to 20 seconds, and the success rate reaches four nines. Many issues have been resolved through years of iteration.

  • Instance management

Instance management lets you know clearly, for each IP on the live network, exactly which program and which version it is running. We have done some fool-proofing and organizing for instances. Here are two points:

  • The first is masking the command line. In operations work I often run commands, but running commands is risky work: a wrong command is very troublesome, and rolling it back is also troublesome. Masking the command line makes package management very easy for developers to use.

  • The second is reducing the number of O&M objects. Reducing the number of managed objects is the eternal rule of optimization. In the past we typed every command, checking conditions and matches; the more commands you type, the more mental load you carry. Reducing the number of O&M objects improves efficiency.

Instance management is not that simple. Every IP looks fine, but it is not. Here we have versions 1.0, 1.1, 1.2, and 1.3 all at once; this really does exist on the live network. A few questions:

  • The first question: does this make sense?

  • The second problem: a higher version can clash with features of lower versions and cause accidents.

  • The third question: which version should be the baseline when operations expands capacity? Can you pick them out one by one when you are in charge of 10,000 or 20,000 devices? Write it down in a notepad? No way.

We do allow multiple versions. The Internet business is very agile and needs to get user data and feedback quickly, so of course we use canary release, gray release, and blue-green deployment. Because gray release is required, we must have at least two versions, a new one and a stable one; otherwise it does not work.

In the first figure everything is 1.0; developer A is in the middle of a gray release, and developer B pushes a 1.1 version with his feature to full rollout.

For those of you who do releases often, this is a problem we frequently encounter. Its cause is simple, but the consequences are serious; people often came running to ask me what to do because a colleague had pushed their in-progress version out.

Don't be nervous, I said at first; just send me a mail listing all the IPs in your gray release, and tell colleague B that there is a version in flight that has not finished gray release and which IPs it covers. By the fourth time this happened I said it is useless, mailing everyone's mailbox is useless, and I came up with another idea: constraints.

A cluster can have at most two versions. Canary release is required, so I set a constraint: at most two versions, never a third. It is like a couple: no third party is allowed until one of the original two leaves.

When you have 1.0 and 1.1, version 1.2 cannot be released; you must either upgrade the 1.0 instances to 1.1 or roll the 1.1 instances back to 1.0 first.
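
A sketch of that constraint check (illustrative only): a release is rejected if it would introduce a third distinct version into the cluster.

```python
def can_release(cluster_versions, new_version):
    """cluster_versions: the versions currently running on the cluster's instances."""
    versions = set(cluster_versions)
    if new_version in versions:
        return True            # continuing an existing rollout is fine
    return len(versions) < 2   # stable + canary only; a third version is rejected

# can_release(["1.0", "1.0", "1.1"], "1.2") -> False: finish or roll back 1.1 first
```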

Besides the version constraint, there is also a very convenient feature that helps operations choose the version to use for expansion. In fact both the full-rollout version and the gray version are legitimate versions, so we help users decide which one is the official version: we choose the version running on more than 50% of instances. Why more than 50%?

There is a long written description of the options, and the choice is yours. If you choose the new version, a problem may be amplified; if you choose the old version while the new one is still in gray release, you keep expanding with the old version, and the user experience is not great either.

There is also manual definition: each release requires an authorized person or a leader to approve and define which is the official version. That works too, but the efficiency is too low; with so many products released by multi-center teams, it cannot keep up. If yours is a traditional enterprise with a strict approval flow, releasing once every half month or month, then this problem does not exist.
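
A minimal sketch of the majority rule described above (assumed data shape: a list of the version each instance is running):

```python
from collections import Counter

def official_version(instance_versions):
    """The version running on more than 50% of instances is treated as official
    and used for capacity expansion; otherwise fall back to a manual decision."""
    version, count = Counter(instance_versions).most_common(1)[0]
    return version if count * 2 > len(instance_versions) else None

# official_version(["1.1", "1.1", "1.1", "1.2"]) -> "1.1"
```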

  • Release management

In release management we face frequent iteration, urgent rollbacks, huge scale, and huge numbers of files. We have already talked about how we handle these challenges very quickly and very lightly.

On the question of who releases: our releases are public releases, open to product, development, and testing; whoever raises the requirement and makes the change does the release, which is the most efficient.

We have a set of methods and constraints that let development, product, and testing work well with our release system; there are benefits, and there are risks.

Since it is a public release, many people are releasing, possibly to the same group of target machines, and in a complex business the upstream-downstream coupling is very strong: everything looks normal after your release, but upstream becomes abnormal. So change tracking is very important, because you need to quickly find out who was making changes at that point in time.

In my experience, if your business goes wrong, someone somewhere made a change; it is just a matter of who. For example, a power failure may trace back to a power adjustment, and if you can capture that data, great. If a fiber is cut, the change happened in the field and your business is affected; or a supporting department changed the operating system. There is always a change behind it.

So we built a change log. Once there is a problem we can check it immediately: after an alarm appears, we pick the corresponding time window and see what changed.
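
As a sketch of how such a lookup could work (the record fields are assumptions, not the actual change-log schema): given an alarm time, pull every change made shortly before it.

```python
from datetime import datetime, timedelta

# Assumed record shape: {"time": datetime, "operator": str, "target": str, "action": str}
change_log = []

def changes_before(alarm_time, window_minutes=30):
    """Changes made in the window leading up to an alarm, so the on-call can
    immediately see who was changing what at that point in time."""
    start = alarm_time - timedelta(minutes=window_minutes)
    return [c for c in change_log if start <= c["time"] <= alarm_time]
```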

Another issue facing release management is how to enforce the pace and timing of Canary releases.

For the above problem we are only in a half-solved state. We keep saying that for safety you must release gradually and pick the right time: do not release just before going off work, do not release in the middle of the night. But sometimes people release anyway and you have no way to stop them. To force the issue there are two ways: harden the system, or settle accounts afterwards. We have not taken very hard measures.

We adopted a semi-forceful method called the "release plan": you declare in advance what you are going to release, such as which package or version to which environment. Through such a release plan, many people know what you are going to do ahead of time, so that when you actually do it, it is not discovered too late.

We encourage this method, and many people have gotten used to it. It is not just there to force you; it also improves efficiency, because otherwise, with so many IPs, you would not know what is deployed on them.

I have detailed the characteristics of our package management across file management, process management, version management, instance management, and release management. To sum up, keeping it simple:

  • First, the package is a management framework. The whole point of packages is operations management; there is no need to invade the development code, and directory-by-function planning makes all packages look the same.

  • Second, packages are the bridge between roles in DevOps, simple and agile.

    DevOps is about multi-role, multi-team collaboration and how to work together effectively. We have been doing this with package management since 2006, and the most solid systems are the simple and agile ones.

  • Third, the package is a fine-grained O&M management object with a low transformation threshold. As I mentioned earlier, reducing the number of managed objects is very, very important; and even after reducing them, you can still see what is being managed from the concrete contents of a package, so it does not turn into a black box.

    In most companies, operations picks a solution and feels great about it, then goes back to the boss and negotiates with R&D, and the decision ends up being not to do it.

    In 2014 there was a business with 28,000 devices that decided to do package management. We helped them do it and made some small modifications; about 80 or 90 percent of the devices were switched over to package-based release, which is very efficient now.

  • Fourth, packages solve 60% of the problems. How dare I say that? Among all the functions and modules we have integrated so far, the package's PV and UV are the highest by far, well ahead of second place. For teams with very refined management or operations demands, the package alone solves many problems, and control over programs is very effective.

I’m making a simple comparison here between packages and Docker. As mentioned earlier, packages can be released very quickly. Packages can do it all, but Docker is different.

I did not cover much of the lower layers earlier, because packages are mostly about business logic, including processes and release methods tied to business logic. Packages sit at the business-logic level, whereas Docker sits at the environment level; Docker is ideal for solving environment problems.

A thought on package management: standardization. The evolution of packages moves from efficiency and flexibility toward rules and constraints, while the general trend of Internet development is bound to stay flexible.

We are now finding a balance, and that balance is standardization. For example: shared goals, pace, division of labor, hard constraints, majority rule, no special cases, and public release without approval; these are the standard contract we are trying to reach.

The most important point here is to "reach consensus". As the head of operations, you need to communicate with your R&D boss and many other people, tell them your optimization plan and your solution, and find the points that win them over.

And what is the point that we found? A very low transformation threshold with a huge efficiency gain. I believe many R&D teams do not have such a thing.

3. CMDB resource mirroring and process automation

Another manifestation of a complex business is architectural complexity. The diagram above looks similar to a microservices architecture: lots of tiny services without standard interfaces.

As you know, operating and managing microservices is very complicated, and the same is true here. There are many things outside the program itself: permissions, configuration files, devices, resource management, name services, file distribution, authentication services, and other interfaces, all of which are necessary for a service to run properly.

The traditional approach is documentation: what to do first, which command to run second, whom to call third, where to run it fourth and where to look for errors. We found that documentation is not a good way to carry ideas and architecture, so we set documentation aside and used CMDB mirroring instead.

People usually understand CMDB as being about hardware, but our CMDB is not only about hardware; it also covers the business.

The business layer is built on top of the physical layer and includes regions, packages, configurations, permissions, test cases, and so on. We collectively call everything that each service node needs, and that we manage logically, a resource.

So the question becomes: documents are unreliable, but are resources reliable? Resources come from the various subsystems in each area.

The maintenance of our CMDB mirror relies on consistency: some configuration information is reported from the live network, and the consistency logic layer fetches the corresponding record from the configuration library and repairs anything found to be wrong.

The repair logic used to overwrite the configuration with the live network; now it is the other way around, overwriting the live network with the configuration. The CMDB mirror lands on a page as shown in the figure on the right.
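
A minimal sketch of that repair direction (the data shapes and the `apply_fn` hook are assumptions): the CMDB record is the desired state, the live node reports the actual state, and any difference is repaired by pushing the CMDB value onto the live network.

```python
def repair_node(desired, actual, apply_fn):
    """desired: resources recorded in the CMDB mirror (the authoritative copy);
    actual: resources reported by the live node;
    apply_fn(key, value): assumed hook that pushes one resource onto the node."""
    for key, value in desired.items():
        if actual.get(key) != value:
            apply_fn(key, value)   # the configuration record overwrites the live network

# Hypothetical usage:
# repair_node({"package:qq_profile": "1.1", "perm:db": "rw"},
#             {"package:qq_profile": "1.0"},
#             apply_fn=lambda k, v: print("fix", k, "->", v))
```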

With resource mirroring in place, and with it being very reliable, one-click expansion becomes easy. The 23 steps above are easy to understand, just take a look. You cannot shuffle the order of those 23 steps; if you do, there will be problems, and that is what we worked out.

I also want to say that there are more than 200,000 machines on the live network and a huge number of services, yet we manage them with only four sets of condensed processes.

The image above shows the result of our demo, enlarged here to highlight automatic expansion.

To achieve automatic scaling we built some logic, but this logic is only for business modules with very strict, very high standards.

We expand capacity at high load and shrink it at low load. Against the module's own standards, we judge whether the conditions for automatic expansion or shrinkage are met; if so, we bring devices online or offline, controlling the pace of rollout through the load-balancing component and weights.
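
A sketch of the threshold side of that decision (the thresholds are made-up numbers for illustration; the real logic also checks the module's business-specific standards before acting):

```python
def scaling_action(load, high=0.75, low=0.25):
    """Expand at high load, shrink at low load, otherwise hold."""
    if load >= high:
        return "expand"
    if load <= low:
        return "shrink"
    return "hold"
```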

Change check-up: since this is an end-to-end process, we must be responsible from start to finish, not just at the beginning. After a change, we run a change check-up and send a change report so that the change owner and related people receive abnormal alarms; if everything is normal nothing is sent, and if something is abnormal it is reported. So the check-up must always run after the change.
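
A sketch of that "quiet when healthy" rule (the metrics shape and `notify` hook are assumptions): the check always runs after the change, but a report only goes out when something looks abnormal.

```python
def post_change_checkup(change_owner, metrics, notify):
    """metrics: {metric_name: is_abnormal} collected after the change;
    notify(owner, anomalies): assumed alerting hook. Silent when all is normal."""
    anomalies = [name for name, abnormal in metrics.items() if abnormal]
    if anomalies:
        notify(change_owner, anomalies)
```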

To summarize: I shared the ideas our team manages by, our package management, and our module resources and process automation. Automated operations as a whole contains much more, such as permissions, configuration, how to build the process engine, how to organize the operations team, event management, how to transfer resources more efficiently, the API platform, and business architecture; how to manage a single service, and how to manage a larger one.

A few years ago I heard many worried voices saying that operations was going to lose its jobs. Now cloud computing is very hot, automated operations is very popular, and AI operations is coming too. I do not think there is any need to worry. I have summarized three kinds of operations people who will stay in demand: the deep type, the global type, and the endurance type.

The endurance type is willing to work overtime and get a great deal done over a long period, as long as the business needs it.

The global type can take charge of many resources, for example for QQ red envelopes and New Year's Eve activities, or for emergencies and high-concurrency events. Global people have to step up to optimize operations solutions and solve problems for the business architecture.

The deep type may not have as many resources, but has a good understanding of operations.

Take AI operations, for example: which scenarios suit AI, and which features should be extracted in those scenarios for the algorithms and machine learning to work on? Machines will not figure this out by themselves; people are needed.

If you want to be an in-demand operations engineer, you have to liberate yourself and keep making yourself better. Thank you.

Note: This article is based on Yang Lidong’s speech at GOPS 2018 Shenzhen station.

Haven't seen enough of Tencent's operations system at GOPS 2018 Shenzhen? There will be a Tencent operations track at GOPS 2018 Shanghai, where Tencent operations director Nie Xin and Tencent Cloud director Liang Ding'an will be on site to talk about every aspect of Tencent's operations system.
