This article is compiled from the GOPS2017 Beijing speech "WeChat's Road to Efficient Operations at 900 Million Monthly Active Users". The Efficient Ops community is committed to growing together with you.
About the author:
Wu Lei, head of WeChat basic platform operations, joined the QQ Mail operations team in 2010 and has supported WeChat from its first version through its explosive growth from zero to hundreds of millions of concurrent online users. He is currently responsible for operating basic services such as message delivery and Moments, with a focus on building automated operations systems.
Today's topic is elastic scaling: capacity expansion and shrinkage.
As WeChat's business volume grew, what we cared about most was efficiency: in the early days, volume could double every two or three months, so how could we make sure operations kept up? Later, the focus shifted to cost. Our growth slowed somewhat after 2014, so the main concern became cost.
The talk is divided into four parts:
- Operation specifications
- Cloud management
- Capacity management
- Automatic scheduling
1. Operation specifications
1.1 Configuration File specifications
Let's start with the configuration file specification. We put a lot of effort into this up front. The overall system design is fairly complex; in the beginning we had dedicated configuration management engineers handle it, and later we standardized it.
Configuration file specifications fall into the following categories:
- Directory structure standards. This defines the directory structure a service uses when it is deployed online; these standards need to be defined first.
- Cross-service configuration item management. Why mention this? Some configuration items are needed by one service and also by another. If you change such an item, do you redeploy service A and then service B? We have a mechanism where every machine has a global common configuration directory, and releases to it are controlled through an automated grayscale rollout; this piece is handled separately.
- Configuration differences between instances of the same service. You may have run into similar problems: even with Docker and well-managed images, the business may need small adjustments on instance 1 after deployment, which you could manage with scripts. We consider that a messy state, so we extract all such differences in a unified way and try to keep the MD5 of the configuration files consistent across all environments.
- Configuration differences between development, test, and production. This is also an early-stage problem. We generally do not manage the development environment; the test environment is managed by operations, and production of course is too. Basically, the only difference between test and production is routing, and we ensure all other configuration items are identical. The effect is that however you expand capacity, we can directly copy an image or a package to the new instance with no follow-up scripts to modify it.
The goal: for the same version of the same service, all instances in all environments have configuration files with identical MD5 checksums.
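As a minimal sketch of what such a consistency check could look like (the directory layout, file glob, and service/instance names below are illustrative assumptions, not WeChat's actual tooling):

```python
import hashlib
from pathlib import Path

# Hypothetical deployment root; the real directory standard is not public.
CONFIG_ROOT = Path("/home/services")

def config_md5(instance_dir: Path) -> str:
    """Hash all config files under one instance directory in a stable order."""
    digest = hashlib.md5()
    for path in sorted(instance_dir.glob("etc/*.conf")):
        digest.update(path.read_bytes())
    return digest.hexdigest()

def check_consistency(service: str, instances: list) -> bool:
    """True if every instance of a service carries byte-identical configs."""
    md5s = {inst: config_md5(CONFIG_ROOT / service / inst) for inst in instances}
    if len(set(md5s.values())) > 1:
        print(f"[WARN] {service} config drift detected: {md5s}")
        return False
    return True
```

A whole-network scan like the one mentioned in section 1.4 could simply run such a check for every service and flag any drift.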
1.2 Name service specifications
The name service is even more important, and it covers three layers:
- Access layer: an LVS-like implementation
- Logical layer: an etcd-like implementation
- Storage layer: automated routing configuration
Service scaling is purely an operations task, independent of R&D's change releases.
The access layer and the logical layer use similar implementations; we did some in-house development on top of them based on our business characteristics.
The storage layer is a bit different from QQ's. QQ seems to use the same name service for both the logical layer and the storage layer; we haven't done that for our storage layer yet.
Our storage layer is still quite separate from the logical layer. Because data is involved, it is almost, but not quite, something operations can handle purely through configuration; some auxiliary scripts are still needed.
Service scaling is purely an operations task and should not be coupled to other factors; it is independent of development's change releases. One characteristic of our setup is that R&D releases changes through a change system provided by operations, with the whole chain automated, so developers basically need no manual steps. Operations, in turn, does not need to care about change releases and only has to manage service scaling.
1.3 Data Storage Specifications
- The access layer carries no data
- The logical layer has short-period caches and static data only; no dynamic data is persisted
- The storage layer has long-period caches and persists data using the Paxos protocol
With this, scaling the access layer and the logical layer never has to worry about data migration or cache hit ratios.
In terms of data storage, I think this pattern is quite common. The access layer definitely carries no data; for the logical layer we hope it carries no data, and we strictly require that it does not.
The logical layer often has to scale automatically. If, after running for a while, some logic has gone live and stored data on those machines, would scaling them down cause problems?
So the logical layer holds no dynamic data, only static data; things like messages sent by users and Moments posts obviously belong in the storage layer.
Don't carry data in the access layer either. Actually, there is one point in our history where we did: Chinese New Year red envelopes. When everyone is shaking their phone for red envelopes, the volume is terrifying. We designed the shake feature so that every shake request really does reach the access layer; we did not make the client send only one request per five shakes. Our access layer can withstand that volume, but the logical layer behind it certainly cannot handle tens of millions of requests per second, so we added very special logic to hand out red envelopes directly from data held in the access layer. Of course, this is a special scenario, and it is the only time we break the specification: during Chinese New Year.
1.4 Summary of operation specifications
- Goals
  - Services are operable
- Measures
  - Interception in the change system
  - Whole-network scanning for non-compliant services
There are a few points here. Our goal is simply that services are operable, meaning no manual steps are needed when expanding or shrinking capacity.
To make services operable, we added interception to the change system: if the next change does not conform to our operation specifications, or some earlier non-compliance exists, we block the change and require it to be brought into compliance first.
2. Cloud management
Next, let's talk about the cloud. Cloud adoption has been widespread and popular in recent years, and Docker is probably what most people use.
2.1 Why go to the cloud
- Nearly 5,000 microservices in total
- Resource contention when multiple services share a physical machine
First, why did we want to go to the cloud? By 2013 and 2014 we already had nearly 5,000 microservices. In fact, I only learned in recent years that this is called "microservices".
When I first heard the term, I remembered that some of QQ's courses on operating massive services mention similar ideas, which I think are completely consistent with microservices.
On the operations side, when R&D releases a new service we place basically no restrictions; we never tell R&D "don't create too many services". So the total number across the whole system is rather extreme, and with 5,000 services on a limited pool of physical machines, multiple services inevitably end up deployed on the same machine.
Deploying multiple services together leads to resource contention, and that is why we went to the cloud.
2.2 Which layers go to the cloud
- Access layer: dedicated physical machines; capacity is sufficient and changes are rare
- Logical layer: mixed deployment; capacity is hard to control and changes are frequent
- Storage layer: dedicated physical machines; capacity is controllable and changes are rare
Which layers go to the cloud? As just mentioned, the access layer has to absorb quite fierce traffic, and WeChat has a lot of users; if anything goes wrong there, the access layer could avalanche. So it mostly stays on dedicated physical machines with ample capacity and few changes, and there is no need to move it to the cloud for now.
The logical layer has the nearly 5,000 microservices mentioned earlier, so it is chaotic, changes frequently, and has capacity that is hard to control; that is the part we moved to the cloud.
The storage layer has not moved to the cloud, but it does have its own separate capacity management and data migration.
2.3 Cloud based on Cgroup
- Custom virtual machine models
  - VC11 = 1 CPU core + 1 GB memory
  - VC24 = 2 CPU cores + 4 GB memory
- Physical machine sharding
Our approach to the cloud is to use cgroups directly; some may not be familiar with the name, but it is the Linux kernel's cgroup mechanism.
In production we use simple virtual machine models, for example 1 CPU core + 1 GB memory. Early on we also considered network traffic as a dimension, but we later dropped it; with our system architecture, traffic turned out not to be a problem.
These are the models we carve physical machines into. One problem with arbitrary partitioning is that, over time, the physical machines accumulate fragments of leftover resources, much like disk fragmentation after running Windows for a long time; we have some special ways of dealing with that as well.
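To make the VC models concrete, here is a rough sketch of how a VC11/VC24 slice could be expressed through the cgroup v1 interface (the mount paths and group names are illustrative assumptions, not WeChat's actual layout):

```python
from pathlib import Path

# Illustrative cgroup v1 mount point; the real host layout is not public.
CG = Path("/sys/fs/cgroup")

def create_vc_slice(name: str, cpu_cores: int, mem_gb: int) -> None:
    """Create a 'VCxy'-style slice, e.g. VC11 = 1 core + 1 GB memory."""
    cpu = CG / "cpu" / name
    mem = CG / "memory" / name
    cpu.mkdir(parents=True, exist_ok=True)
    mem.mkdir(parents=True, exist_ok=True)

    period = 100_000                      # 100 ms CFS scheduling period
    (cpu / "cpu.cfs_period_us").write_text(str(period))
    (cpu / "cpu.cfs_quota_us").write_text(str(period * cpu_cores))
    (mem / "memory.limit_in_bytes").write_text(str(mem_gb * 1024 ** 3))

# VC11 = 1 CPU core + 1 GB memory, VC24 = 2 cores + 4 GB
create_vc_slice("VC11-instance-0", cpu_cores=1, mem_gb=1)
create_vc_slice("VC24-instance-0", cpu_cores=2, mem_gb=4)
```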
2.4 Docker is not enabled in production
- The Svrkit framework covers 100% of the network and is fully standardized
- The framework itself relies heavily on IPC interaction
- Our self-developed, non-invasive approach versus Docker's invasiveness
Here’s a list of reasons why we didn’t use Docker.
Docker was so popular that we made a few attempts to roll it out in 2014 and 2015.
Svrkit is a framework we developed ourselves inside WeChat, and it covers 100% of the network. What does 100% mean? We run no open source code, not even nginx, which is probably the most common choice for front-end access; we eventually replaced it with our own components as well.
With our own framework covering 100% of the production network, our level of standardization is quite good.
We can say that some of the problems Docker solves are not problems here.
- First, our need for Docker is not strong.
- Second, the framework itself uses a lot of IPC interaction, which Docker restricts. To keep it working we would have to break through various isolation mechanisms in Docker, and at that point you might as well not use Docker at all.
- Third, we had concerns about Docker's daemon model. A year or two ago people discussed this with me a lot: when the main Docker process needs to be upgraded, your services get restarted. This seems to have been solved recently, but in the early days we considered it a serious problem; we don't want changes to Docker itself to have any impact on online services.
2.5 Private Cloud Scheduling System
Based on the above, we developed our own Docker-like system for managing the cloud. This is our private cloud scheduling system:
- Developed on top of the Svrkit framework
- Borrows the strengths of mainstream scheduling systems such as Borg, YARN, Kubernetes, and Mesos
- Covers 80% of our microservices
As I mentioned, everything is built on the self-developed Svrkit framework, so the scheduling system is of course self-developed as well. At present it covers 80% of our microservices, still somewhat short of full coverage.
2.6 Private Cloud Scheduling System Architecture
This is our architecture. Anyone familiar with the industry can see at a glance that it is not much different from existing systems; there are some virtualization pitfalls in the middle, but you can assume it is broadly similar to what everyone else uses, so I won't go into detail.
2.7 Cloud Management Summary
- Goals
  - Resource isolation between services
  - Service scaling through page operations
- Measures
  - The deployment system intercepts services that have not moved to the cloud
  - Proactively migrating core services
In cloud management, one of our goals is resource isolation: one service's anomaly should not make other services abnormal.
The second goal is scaling through page operations. Some services are very large, with perhaps thousands of machines under a single service; in that case we don't want you logging in to machines, everything should be doable with a few clicks on a page.
To achieve this, we made the deployment system block services that have not moved to the cloud: if a service insists on deploying the old way, it has to go through the old, more cumbersome process.
For operation specifications and cloud management, I believe everyone has their own implementations and their own lessons learned. I'm less sure how everyone handles capacity.
3. Capacity management
3.1 How to support business growth
Here is a business growth curve, and alongside it our capacity curve.
Normally you expand once and stay level for a while, then at some point you find you can't hold the load, so you expand again and stay there; the capacity steps don't match the business curve very well.
The first question is: when your capacity falls below the current business volume, that is, when capacity is insufficient, can you detect it quickly?
The second is handling efficiency: once capacity is insufficient, how long does it take to bring the curve back up by expanding? A few days versus a few minutes is a very different level of efficiency.
We want to achieve an optimal capacity curve, one that exactly tracks business growth.
This curve is not hard to achieve: as long as you expand frequently enough, you get exactly the optimal capacity curve.
If you can detect a capacity shortfall within minutes and expand within minutes, you can draw essentially the same curve.
3.2 Using hardware metrics to evaluate capacity
How do you know a service is short on capacity? The usual first reaction is to look at hardware metrics: CPU usage, disk space, NIC bandwidth, memory. That helps, but it doesn't solve everything.
3.3 Evaluating the CPU Capacity
Service capacity usage = current CPU peak / empirical CPU upper limit
This is a way to calculate the capacity of a service using CPU.
A simple example: if the current CPU peak is 40% and experience says this service can run up to 80% CPU, then dividing gives 50% capacity used.
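Expressed as code, the calculation is just the ratio above (numbers taken from the example):

```python
def capacity_usage(current_cpu_peak: float, empirical_cpu_limit: float) -> float:
    """Service capacity usage = current CPU peak / empirical CPU upper limit."""
    return current_cpu_peak / empirical_cpu_limit

print(capacity_usage(0.40, 0.80))  # 0.5 -> the service is at 50% of its capacity
```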
3.4 Are hardware metrics reliable?
How reliable is this stuff? Here I draw two schematic diagrams:
For some services, the graph on the left holds: throughput keeps growing roughly in step with CPU.
Others look like the graph on the right, a bottleneck shape: once CPU rises past a certain point, throughput stops increasing.
3.5 Real Cases
Here is a real example from production: when we push artificial traffic in, we find that some services top out at 80% CPU no matter what we do, and others at 60%; they simply won't go higher.
3.6 Limitations of hardware metrics
We believe hardware metrics have some limitations:
- First, different services depend on different types of hardware: some are CPU-bound, some are memory-bound.
- Second, service quality near the critical point is unpredictable. As CPU approaches 80% or 100%, we observe that some services become unstable and unpredictable, while others can run at 80% CPU and stay stable for long periods.
Indeed, with so many online services, this behavior is simply not something we can control.
We believe only stress testing can give us an accurate capacity model.
3.7 Stress testing methods
There are two kinds of traffic for stress testing:
- Simulated traffic
- Real traffic
There are also two kinds of environments:
- The test environment
- The production environment
Combined, these give four stress testing methods.
The first, simulated traffic in the test environment, is the quality verification the test team does; that is their responsibility and we are not heavily involved.
The second, simulated traffic against the production environment, is like the full-link stress testing Taobao often does for Double Eleven. For WeChat we only do it for Chinese New Year; Chinese New Year is to WeChat roughly what Double Eleven is to Taobao, a fairly crazy state, so before the Spring Festival we push simulated traffic into production. It is not a method we use routinely.
The third, real traffic into the test environment, is usually used to verify storage performance: live traffic is mirrored to the test environment to measure how storage behaves. It simulates reality well without affecting production.
The fourth, real traffic against the production environment, is what I mainly want to talk about: we genuinely stress test production with real traffic, and I will explain what that looks like.
3.8 Stress testing in production
Production stress testing is actually simple to implement: once the name service meets the standard and everything follows a unified process, the stress test is just a matter of continually increasing the routing weight of one instance of a service and observing the consequences online.
3.9 Failures that production stress testing may cause
What is the risk of stress testing this way? Everyone can guess: when you push pressure onto a service, how do you make sure there is no incident? We think there are three key points:
- First, a stress test may cause a failure: can you detect it in time?
- Second, a stress test may expose deeper problems that your monitoring does not actually catch.
- Third, when something does go wrong, how do you recover quickly?
I think these three questions are the critical ones.
3.10 Self-protection of services
Can your service protect itself? We introduced the concept of fast rejection: under normal conditions a service might handle 1 million requests per minute, but when upstream sends it 5 million, it can still only process 1 million. Can it quickly reject the other 4 million?
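A minimal sketch of the fast-rejection idea, assuming a simple queue-depth and queueing-delay check (the limits and names are illustrative, not Svrkit's actual mechanism):

```python
import queue
import time

MAX_PENDING = 1000          # illustrative queue-depth limit
MAX_WAIT_MS = 50            # illustrative queueing-delay limit

pending = queue.Queue()     # (enqueue_time, request) pairs

class FastReject(Exception):
    """Raised immediately so the caller can fail over to another instance."""

def enqueue(request: dict) -> None:
    # Reject right away instead of letting requests pile up and time out slowly.
    if pending.qsize() >= MAX_PENDING:
        raise FastReject("overloaded: queue full")
    pending.put((time.monotonic(), request))

def dequeue() -> dict:
    enqueued_at, request = pending.get()
    waited_ms = (time.monotonic() - enqueued_at) * 1000
    if waited_ms > MAX_WAIT_MS:
        # Too stale to be worth processing; drop it and signal overload.
        raise FastReject(f"overloaded: queued for {waited_ms:.0f} ms")
    return request
```

Rejecting immediately keeps latency low for the requests that are accepted and gives the caller a clear signal to retry elsewhere.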
3.11 Retry protection for upstream services
What does upstream retry protection look like? For example, when you stress one instance by continually increasing its weight and it starts fast-rejecting requests, does the caller have a mechanism to retry on other instances and still complete the request? This is another key point, and our framework supports it.
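On the caller side, a sketch of retry protection built on the FastReject signal from the previous sketch might look like this (send_rpc is a stand-in, not the framework's real API):

```python
import random

class FastReject(Exception):
    """Overload signal; same idea as in the fast-rejection sketch above."""

def send_rpc(addr: str, request: dict) -> dict:
    """Stand-in for the framework's real RPC call."""
    raise NotImplementedError

def call_with_retry(instances: list, request: dict, max_tries: int = 3) -> dict:
    """Try up to max_tries distinct instances; skip any that fast-reject."""
    candidates = random.sample(instances, k=min(max_tries, len(instances)))
    for addr in candidates:
        try:
            return send_rpc(addr, request)
        except FastReject:
            continue   # this instance is overloaded: fail over immediately
    raise RuntimeError("all tried instances are overloaded")
```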
3.12 Three-dimensional monitoring
Is the monitoring system complete? Hardware monitoring, monitoring of the fast rejections just mentioned, latency monitoring between front end and back end, and the failure detection described earlier: we need to watch the whole chain.
Only with service self-protection, upstream retry protection, and multi-dimensional monitoring in place do we dare to stress test; without these three, stress testing in production is simply not feasible.
3.13 Second-level monitoring
Can you find your abnormality quickly? To solve this problem, we upgraded minute-level monitoring to second-level monitoring, with all anomalies detected within about 10 seconds.
The implementation is probably similar to what others do: each machine has a collection agent that used to report to a queue every minute, after which the data was aggregated and written to the database. We changed this to collect every 6 seconds, convert the data, and aggregate it at second granularity.
Writing into the database still happens at minute granularity.
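A rough sketch of the 6-second collection and second-level aggregation idea (the report format and hooks are placeholders, not WeChat's real pipeline):

```python
import time
from collections import defaultdict

REPORT_INTERVAL_S = 6   # collect and report every 6 seconds instead of every minute

def collect_loop(read_counters, report):
    """Agent side: sample local counters every 6 s and push them to the queue."""
    while True:
        report({"ts": int(time.time()), "counters": read_counters()})
        time.sleep(REPORT_INTERVAL_S)

def aggregate(batch):
    """Server side: merge reports from all machines into per-second totals."""
    per_second = defaultdict(lambda: defaultdict(int))
    for item in batch:
        for name, value in item["counters"].items():
            per_second[item["ts"]][name] += value
    return per_second
```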
3.14 Dynamic control of the ramp-up rate
There is also dynamic control of the ramp-up rate, which is a bit more complicated and harder to explain.
The left side of the picture shows our early stress test implementation: starting from one point, we ramped the instance's weight up fairly quickly until failures appeared, then backed off, then pushed up again, over and over, approaching the capacity limit that way. The problem is obvious: operations was creating trouble for itself during every stress test. Traffic goes up, failures appear, it comes back down, then goes up again; the failure curve looks like it has been gnawed by a dog. Operations could not stand this kind of stress testing.
This improvement relies on capabilities in our service framework. Every microservice has an enqueue/dequeue mechanism, and the queue backlog is a key indicator: we monitor the queueing delay, and once a backlog appears, the instance's performance is no longer keeping up. We use this indicator to adjust the ramp-up rate and roughly achieve the effect shown on the right: early on the weight is increased quickly, then once queue backlog and delay are observed the ramp slows, until it reaches a point where perhaps a small number of requests fail at the very end. With this approach the whole process involves at most a few seconds of impact, and we obtain both the performance model and the real stress-test value.
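A sketch of the backlog-driven ramp: the weight climbs quickly while the queue is empty and slows once queueing delay appears (thresholds, step sizes, and the metric/weight hooks are illustrative placeholders, not the real scheduler):

```python
import time

QUEUE_DELAY_SLOW_MS = 20      # backlog appearing: slow the ramp (illustrative)
QUEUE_DELAY_STOP_MS = 100     # clearly saturated: stop the test (illustrative)

def ramp_weight(get_queue_delay_ms, set_weight, start=100, fast_step=50, slow_step=5):
    """Increase an instance's routing weight quickly, then slowly once the queue backs up."""
    weight = start
    while True:
        delay = get_queue_delay_ms()
        if delay >= QUEUE_DELAY_STOP_MS:
            return weight                 # capacity limit reached at this weight
        step = fast_step if delay < QUEUE_DELAY_SLOW_MS else slow_step
        weight += step
        set_weight(weight)
        time.sleep(1)                     # observe for a moment before the next step
```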
3.15 Results of stress testing in production
This is the result of an actual stress test in production.
The curve in this graph rises quickly at first, then more slowly later on, and stops at the point where the stress test ends.
It took us quite a long time, maybe more than a year, to reach this state, where stress testing basically brings no failures to production services.
3.16 Capacity Management Summary
After we finished this task, we felt that there were several benefits:
- The resource requirements of a service can be accurately quantified. As just described, we can calculate this for every one of our microservices.
- The best machine model for a service can be determined. How do you know which model suits your microservice? Our stress tests show that some services behave differently on different models; once automated stress testing is in place, we simply try each model and measure, so we know which one is optimal.
4. Automatic scheduling
4.1 Automatic Capacity Expansion for Service Growth
The first problem is handling business growth automatically. As the business rises, all user requests are being tracked, so you know how the business is growing and what each microservice's growth curve looks like; combined with the stress tests described earlier, you know the performance of each instance, so you can accurately tell where your capacity ratio stands.
We generally let services run in the 50% to 60% range; 66% is the disaster-recovery reservation. We deploy across three IDCs, and if one of them goes down, the other two must absorb the load without any anomaly, which is where these two limits come from.
We also leave ourselves some lead time for purchasing machines, so some services are kept at around 50%.
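Putting the targets together, a sketch of the sizing arithmetic might look like this (per-instance throughput would come from the stress tests above; 2/3 reflects surviving the loss of one of three IDCs; all numbers are illustrative):

```python
import math

def required_instances(peak_qps: float, qps_per_instance: float,
                       target_utilization: float = 0.60,
                       dr_utilization_cap: float = 2 / 3) -> int:
    """How many instances keep the service within its utilization targets."""
    utilization = min(target_utilization, dr_utilization_cap)
    return math.ceil(peak_qps / (qps_per_instance * utilization))

# e.g. 1.2M QPS peak, 10k QPS per instance from stress testing, 60% target
print(required_instances(1_200_000, 10_000))   # -> 200 instances
```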
4.2 Automatic Capacity Expansion in abnormal cases
There are also some exceptions:
- Sudden business surges. Some products' volume spikes suddenly without anyone informing us. We handle this with somewhat crude methods, such as a CPU threshold: we set a point, and when CPU rushes past it we automatically trigger expansion.
- Application performance degradation. The business hasn't changed much, but the application has, and that can also cause capacity problems. We can detect this, depending on how often we stress test; if you stress test every day, you can plot the daily stress-test curve.
4.3 Evaluating application performance degradation
Since 2015 we have been stress testing some services every day and plotting their daily performance as a curve. You may find points in time where performance drops sharply; in this example the degradation was significant because a major version had been released that added extra processing logic to a feature, and it took a month or two before someone noticed and fixed it. I consider this kind of performance monitoring a byproduct of capacity management, and we use it a lot to know exactly how overall microservice performance is changing.
4.4 Performance management closed loop
Can we prevent such new performance drops from appearing at all? Since we already stress test every day, can we speed that up and block the degradation right at release time? For example, when a developer grayscales a new version onto one machine, we immediately stress test that grayscale machine to see how it performs; if performance has dropped, we pull the release.
4.5 Several business traffic patterns
These are some traffic patterns from our production network. The most common is the top one, which peaks between 9 and 10 p.m.
The middle pattern is generally work-related usage, which is heavier during working hours.
The bottom one is typified by WeChat campaigns, which tend to kick off around 10 p.m. There is also an older feature, Drift Bottle, with a lot of die-hard fans who show up right at midnight.
4.6 Peak shaving and valley filling
What do we want to do with these business curves?
First, the peak is so high that all our machines are provisioned for the highest peak. Can we do something special for that peak instead of dedicating so many machines to it?
Second, after midnight the business volume is so low that the machines are largely wasted.
We want a peak-shaving, valley-filling effect. The idea is simple: stagger the peaks of some services against others. A service gives up its resources once its peak has passed; which service then uses them doesn't matter to it.
4.7 Peak shaving for online services
- Normal services release resources once their peak hours have passed
- Services with staggered peaks obtain those resources
4.8 Valley filling with offline computing
- Running window for offline tasks
  - 01:00 ~ 08:00: tasks run without restriction
  - 08:00 ~ 20:00: task admission control
- Resource control via cpu.shares + memory.limit_in_bytes + blkio
- Offline tasks have the lowest priority
For offline computing, the wee hours are genuinely idle and well suited to scheduling offline jobs. Previously WeChat had few offline computing workloads, mainly voice input and other AI work; recently, new features such as Top Stories and Search use a lot of offline computing scheduling.
In terms of resource usage, these workloads are controlled fairly strictly through cgroup configuration.
We give offline tasks the lowest priority and use them to fill our valleys.
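A sketch of offline-task admission under this policy, assuming the reading above (01:00 ~ 08:00 unrestricted, 08:00 ~ 20:00 admission-controlled, evenings reserved for online peaks); the idle-CPU threshold is an assumption:

```python
from datetime import datetime

def offline_task_admitted(now: datetime, online_cpu_idle: float) -> bool:
    """Decide whether a low-priority offline task may start on this host."""
    hour = now.hour
    if 1 <= hour < 8:
        return True                       # deep valley: fill it with offline work
    if 8 <= hour < 20:
        return online_cpu_idle > 0.40     # admit only when online load leaves headroom (assumed threshold)
    return False                          # evening peak: keep resources for online services

print(offline_task_admitted(datetime.now(), online_cpu_idle=0.55))
```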
4.9 Automatic scheduling Summary
These are the goals we achieved with automatic scheduling:
- Full control over all online services, with resources fully utilized
- Offline tasks need no separately purchased compute: they use spare CPU and memory from our online services, with only storage provisioned separately
END