The 10th Double 11 has come to a successful close, but the exploration of technology never stops. Ali Technology has launched the series “Ten Years of Animal Husbandry Code”, inviting the core technical leaders who took part in Double 11 preparations over the years to look back on how Ali’s technology has changed.

“Stability is paramount”: supporting Double 11 steadily is the perennial goal of Ali’s engineers. Today You Ji, a senior technical expert at Ali, explains in detail the five stages in the evolution of Ali’s capacity planning, focusing on two factors: “the precision and certainty of capacity planning” and “the efficiency and cost of the processes around capacity planning”.

Alibaba senior technical expert You Ji

2018 marks the tenth year of Double Eleven. Every year, stability is the top concern in the preparation for November 11.

For a decade, stability technologies and Double 11 have grown together, intertwined like the two strands of the DNA double helix. Tempered by Double 11, Ali’s stability technology system has become a benchmark that the industry imitates.

Ten years on, Double 11 has left its deepest mark on our technology.

Since the first year, midnight on Double 11 has brought our highest business traffic ever, often dozens or even hundreds of times the daily level. How to make a distributed site with complex technology and business absorb this sudden traffic surge smoothly is therefore the problem we have been solving for the past 10 years.

Among all the preparations for Double 11, capacity planning is the most important and the most challenging. With such a large distributed system architecture, deciding how many resources to allocate to each business system is a real technical challenge.

Capacity planning is like a scale. On one side is cost: we need to support the business with as few resources as possible. On the other side is stability: at the lowest possible cost, every system should run at a suitable utilization level (its “water level”), so that the business operates normally and no resources are wasted locally.

Capacity planning evolution path

Driven by two core factors, “the precision and certainty of capacity planning” and “the efficiency and cost of the processes around capacity planning”, capacity planning has gone through five major stages of evolution:

I. Manual capacity estimation stage

In this period, the resources Double 11 required were still estimated by hand. Capacity planning in 2009, for example, was mainly manual: the colleagues in charge of each system met, collected the information in an Excel sheet, and spent half a day or a day settling the machine budget. Each system also usually carried a fairly large machine redundancy and business traffic was small, so even an inaccurate estimate had no great business impact.

II. Offline performance stress testing and capacity evaluation stage

On Double 11 in 2009, although the business volume was not even a fraction of the Double 11 peaks of recent years, the surge hit our systems very hard.

So in 2010 we began building a systematic capacity planning platform, and a capacity calculation formula was proposed for the first time. Two variables in the formula are crucial: the estimated business volume, which is the expected number of calls to the system, and the single-machine capacity, which is the maximum service capacity of one machine.

Capacity planning formula

The formula is not hard to understand. Dividing the estimated business volume by the service capacity of a single machine gives the minimum number of machines the business system needs. This minimum is the theoretical lower bound, and a buffer is added on top of it as a safety margin.
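As a rough illustration of this formula, the sketch below computes the minimum machine count and then applies a buffer. The function name `machines_needed`, the use of calls per second as the unit, and the choice to express the buffer as a ratio on top of the minimum are assumptions made for the example; the article only says that a buffer value is added.

```python
import math


def machines_needed(estimated_peak_qps: float,
                    single_machine_qps: float,
                    buffer_ratio: float) -> tuple[int, int]:
    """Return (theoretical minimum, planned count including buffer).

    buffer_ratio is a hypothetical knob: 0.25 means "25% extra machines".
    """
    if estimated_peak_qps < 0 or single_machine_qps <= 0 or buffer_ratio < 0:
        raise ValueError("volume and buffer must be >= 0, capacity must be > 0")
    # Lower bound: estimated business volume / single-machine capacity,
    # rounded up because machines come in whole units.
    minimum = math.ceil(estimated_peak_qps / single_machine_qps)
    # Safety margin on top of the lower bound.
    planned = math.ceil(minimum * (1 + buffer_ratio))
    return minimum, planned
```

For example, with an estimated peak of 100,000 calls per second and a single-machine capacity of 800, `machines_needed(100_000, 800, 0.25)` gives a minimum of 125 machines and a planned count of 157.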

The estimated business volume is an estimate of the load on a business system in a scenario such as Double 11: for example, how many people will, at the same time, view product details, open the shopping cart, place orders, pay, and so on. We estimate it through BI (business intelligence) analysis; combined with suitable prediction algorithms, this gives fairly accurate values.

The service capacity of a single machine is not so easy to obtain. In version 1.0 of the capacity planning platform in 2010, it was obtained mainly through offline performance testing. We already had a very mature offline performance test environment at the time, so we tested each business system there one by one and obtained its single-machine capacity value.

With these two key variables solved, the Csp capacity planning platform formally took its place on Ali’s technical stage. In 2010 we completed the transition from manual to systematic capacity planning.

III. Online stress testing and capacity evaluation stage

After its launch, the Csp capacity planning platform had an immediate effect on Double 11. Compared with the earlier purely manual model, it not only saved labor; more importantly, replacing experience-based estimation with data-driven calculation considerably improved the accuracy of our capacity planning.

To obtain a more accurate single-machine capacity, we explored online stress testing modes extensively and accumulated a great deal of experience, which later became a model for capacity planning in the industry:

 

A. Online simulated stress testing to obtain single-machine capacity

Online simulated stress testing sends simulated calls to the application system running online. Because the simulated requests run in the real environment, the single-machine capacity obtained is much more accurate. Online simulated stress testing is easy to operate, and many tools are available.

B. Online traffic replication stress testing to obtain single-machine capacity

Online simulated stress testing makes the test environment real, but it does not make the traffic fully real. If both the traffic and the environment are real, the single-machine capacity obtained online is more convincing. Online traffic replication copies the traffic of an online machine N times onto the target machine under test, so even when the online machine’s own traffic is very low, it can be effectively amplified N-fold.
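The idea of copying traffic N times can be sketched as follows. This is a toy model rather than the platform's implementation: `replication_factor`, `replicate` and the `send_to_target` callback are hypothetical names, and requests are represented as plain strings.

```python
import math
from typing import Callable, Iterable


def replication_factor(source_qps: float, target_qps: float) -> int:
    """Smallest N such that N copies of the source traffic reach target_qps."""
    if source_qps <= 0 or target_qps <= 0:
        raise ValueError("QPS values must be positive")
    return math.ceil(target_qps / source_qps)


def replicate(requests: Iterable[str],
              factor: int,
              send_to_target: Callable[[str], None]) -> int:
    """Forward every captured request `factor` times to the machine under test.

    Returns the number of requests sent.
    """
    sent = 0
    for request in requests:
        for _ in range(factor):
            send_to_target(request)
            sent += 1
    return sent
```

If the source machine serves 50 requests per second and the test needs 400, `replication_factor(50, 400)` returns 8, and each captured request is forwarded eight times.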

C. Online traffic diversion stress testing to obtain single-machine capacity

Given the complexity and cost of traffic replication, we kept looking for an online stress testing model that is both accurate and convenient. Ali’s business systems all use a distributed architecture, in which one business system is served by several machines at the same time. If we can concentrate the traffic of the distributed environment onto one machine, we can stress that single machine. The online traffic diversion model was then put to use in the production environment.

Online traffic diversion stress testing gives most business systems in Ali Group a very accurate online single-machine capacity, and it is currently the most widely used mode of online single-machine stress testing.
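Concentrating traffic onto one machine, as described above, can be sketched with load-balancing weights. The article does not say how traffic is steered, so the weight mechanism, the function names, and the `observe_latency_ms` callback (standing in for real monitoring) are all assumptions made for illustration.

```python
from typing import Callable


def target_qps(total_qps: float, machine_count: int, target_weight: float) -> float:
    """QPS reaching the target machine when every other machine keeps weight 1."""
    return total_qps * target_weight / (target_weight + machine_count - 1)


def measure_capacity(total_qps: float,
                     machine_count: int,
                     observe_latency_ms: Callable[[float], float],
                     max_latency_ms: float,
                     max_weight: int = 20) -> float:
    """Raise the target's weight step by step until latency breaks the limit.

    Returns the highest QPS at which the target still met max_latency_ms,
    or 0.0 if it failed even at its normal share of traffic.
    """
    best = 0.0
    for weight in range(1, max_weight + 1):
        qps = target_qps(total_qps, machine_count, weight)
        if observe_latency_ms(qps) > max_latency_ms:
            break  # the machine is saturated; stop diverting more traffic
        best = qps
    return best
```

In this toy model, a cluster of 10 machines serving 1,000 requests per second gives the target 100 requests per second at weight 1 and 250 at weight 3, so the measurement uses only real traffic and needs no replay tooling.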

IV. Full-link stress testing stage

The capacity planning platform solves capacity planning from the perspective of a single point. But single-point capacity planning rests on a precondition: that the downstream services it depends on are in very good shape, which does not hold in practice.

In addition, as the distributed architecture gains more and more technical components, it becomes hard to do single-point capacity planning for every link in the chain from front to back.

When midnight arrives on Double 11, every link in the chain, from the CDN and access layer through front-end applications, back-end services, caches, storage and middleware, faces huge traffic. At that moment the state of each application service is affected not only by its own load but also by the environment it depends on, and the effect keeps propagating upstream. If even one link has a small error, no one can be sure what that error will become after it accumulates across several layers upstream and downstream.

Therefore, besides planning capacity in advance, we also need a verification mechanism to confirm that all our preparations meet expectations. The best way to verify is to let the event happen early: if our systems could live through Double 11 a few times beforehand, the problem of capacity uncertainty would be solved.

2013 was a major milestone for Double 11 stability. We simulated Double 11 in the production environment to verify capacity in every respect; in other words, the birth of full-link stress testing solved the problem of capacity certainty. Since 2013, on the basis of full-link stress testing, we have built a whole ecosystem of supporting capabilities around capacity planning, which not only improves capacity but also lowers the cost and raises the efficiency of the entire process.

 

Simulating Double 11 in advance is not easy: its scale and complexity are unprecedented. Full-link stress testing had to overcome four major challenges:

 

  1. Hundreds of business systems are involved in Double 11, together with all the infrastructure and middleware along the whole link, and stress-test traffic must flow unimpeded through the entire path.

  2. Stress-test data (hundreds of millions of products and users) has to be constructed, with a data model as close as possible to Double 11.

  3. Full-link stress testing simulates Double 11 directly in the real online environment, so it must ensure that online data and services are not affected.

  4. Double 11 is a grand event with hundreds of millions of users, so a large-scale traffic platform is needed that can generate tens of thousands of user actions per second.
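The third challenge, keeping online data untouched, can be illustrated with a small sketch. The article does not describe how the isolation is achieved, so what follows is only one commonly used pattern and purely an assumption here: requests carry a hypothetical `is_stress_test` flag, and the data layer routes flagged traffic to separate "shadow" tables. The `Store` class and the `shadow_` prefix are invented for the example.

```python
SHADOW_PREFIX = "shadow_"  # hypothetical naming convention, not from the article


class Store:
    """Toy in-memory data layer that keeps stress-test data apart."""

    def __init__(self) -> None:
        self.tables: dict[str, dict[str, object]] = {}

    def _table_name(self, table: str, is_stress_test: bool) -> str:
        # Flagged traffic never touches the production table name.
        return SHADOW_PREFIX + table if is_stress_test else table

    def write(self, table: str, key: str, value: object,
              is_stress_test: bool = False) -> None:
        name = self._table_name(table, is_stress_test)
        self.tables.setdefault(name, {})[key] = value

    def read(self, table: str, key: str,
             is_stress_test: bool = False) -> object:
        name = self._table_name(table, is_stress_test)
        return self.tables.get(name, {}).get(key)
```

With this routing, a stress-test order written under the same key as a real order lands in `shadow_orders`, and the production `orders` table keeps its original value.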

Every year on the eve of Double 11, we run several rounds of full-link stress testing, continuously finding problems, optimizing iteratively and verifying business stability from every angle. Only after passing full-link stress testing can our systems face midnight on Double 11 with confidence. Full-link stress testing is the most important weapon for major promotions such as Double 11 and Double 12, and it will keep evolving with the business and play an irreplaceable role.

V. “Full-link stress testing + isolated environment + elastic scaling” technology ecosystem

 

After several years of development, full-link stress testing has gradually evolved from a single stress testing platform into a technology ecosystem.

Technical products such as the isolated environment, elastic scaling during stress tests, function preview and merchant-side full-link stress testing have all become important members of the full-link stress testing family, and together they play an important role in keeping Double 11 stable.

Evolution of the full-link stress testing system

Beyond the continuing growth in what full-link stress testing can do, its efficiency keeps improving, and intelligent techniques are gradually being applied to stress testing scenarios.

Root-cause analysis automatically locates problems found during a stress test; elastic scaling during the test adjusts the capacity ratio toward the optimum; and the system produces a detailed stress test report. The hope is that all of this can be completed automatically, with no one in attendance.

The Spearhead Program is moving toward this goal and has made progress in stages: the first-attempt success rate of full-link stress tests rose significantly in 2017 and 2018. In the past, a stress test might take many attempts to succeed. Under the Spearhead Program, unattended, routine stress tests are run in the isolated environment, which surfaced 80% of problems in advance and greatly reduced the rate at which large-scale stress tests were blocked.

Besides the breakthrough in intelligent stress testing, the role of full-link stress testing has also shifted significantly. Ali’s internal stress testing platform, once a tool for preparing big promotions, has been transformed and upgraded and is now offered on the cloud as PTS: https://www.aliyun.com/product/pts

Thousands of external Internet companies can stand on Ali’s shoulders, complete a precise capacity test through PTS, and readily gain the same capacity planning capability as Ali. Behind this, full-link stress testing has undergone a major technical evolution.

  1. Unified internal and external stress testing: one core stress testing foundation supports both the internal and the external stress testing systems, so most of the work can be reused and operation and maintenance costs drop sharply.

  2. Stress testing capability upgrade: besides generating request traffic on the order of 100 million per second, the platform can schedule large traffic and large data volumes within seconds, a capability that stands out in the industry.

  3. Openness: with more open extensibility, it supports not only a larger group of external users but also the increasingly diverse business forms of the Ali economy.

  4. Open source compatibility: it is compatible with the mainstream open source ecosystem, so existing configurations can run on the stress testing platform without any change.

  5. Solutionization: upstream and downstream of the core stress testing technology, we have built modules such as the recorder, data factory, instruction set and diagnostic expert, so that a single stress testing platform has evolved into a closed-loop solution for capacity planning as a whole.

     

In the future, we will do our best to bring a perfect experience to consumers, businesses and partners around the world. Ten years of herding code, and we ride on ahead!
