It is hard to imagine that Zheng Yangfei, born in 1992, has already served as leader of the cloud native performance and capacity team, general manager for stability of the 2018 Double 11, and vice captain of the 2020 Double 11. For six consecutive years, Double 11 has been both his training ground for leading teams and a window onto the technological evolution of Ant Group.

I asked Zheng Yangfei, “What do you think is the best thing you have done since you joined Ant?”

I had expected to hear about the “buy, buy, buy” frenzy and the pride of securing 1.3 billion orders through his own efforts. Instead, the young man scratched his head and gave an unexpected answer: “It should be the cloud native capacity technology that I am working on next.”

“I am the fifth person to take on this technology. Many predecessors failed, but I think I can do it.”

Leaving the halo of the past behind, Zheng Yangfei is already hurrying to the next stop. The road of technology is endless. Ant Group stands on a platform built by tens of thousands of people, ready to climb a new mountain.

“Double 11”: from new recruit to veteran

In 2013, Zheng Yangfei was an intern, still doing the tedious work of scaling servers up and down by hand.

By 2015 he had been pulled to the front line of Double 11. He signed a “bet” with his supervisor and took charge of the stability of the full-link stress test for the whole of Double 11. Going from an unknown participant to the project lead, Zheng suddenly saw his horizons open up.

For Zheng, a member of the post-90s generation, this was the first job he had led on his own.

The timing was hard, and the troubles did not come singly. Morale on the stability team was low because of the large number of failures in the first half of the year. Zheng Yangfei put it bluntly: “We just wanted to stand up for Ant. We couldn’t let people think we weren’t up to it.”

The young man charged into battle and stormed “Guangming Ding” (the Double 11 full-link stress test site). Zheng Yangfei recalled that at the time Guangming Ding left few seats for the Alipay team, and the head of big promotions for the Alibaba economy stood on site with a loudspeaker. Whenever something went wrong, the voice rang out loud and clear: “What’s wrong with Alipay? Why is Alipay down again?”

Zheng gritted his teeth and took on every problem. **“You have to deal with everything in that conference room, and you have to be able to cover any situation.”** At that point KPIs and bets no longer mattered: whenever the stress test curve jittered, the whole team’s hearts jittered with it.

But in the end, they pulled through. When the rush of traffic hit at 0:00, Alipay withstood the pressure, and Zheng collected his winnings from his supervisor: an Apple Watch.

**The carnival of the online shopping festival erupted, and the wheels of the era rolled quietly on.** Compared with the previous year, the 2015 Double 11 full-link stress test improved markedly in three respects. First, it was extended from the core systems to the entire system. Second, it was connected with the group’s stress tests. Third, it was turned into a platform: a full-link stress testing platform and tools were built, and part of the technical staff’s work was handed over to the platform.

In the following years, full-link stress testing evolved from a big-promotion tool into a routine, production-grade practice. As the technology matured and his understanding of the business deepened, Zheng Yangfei’s responsibility gradually expanded from Double 11 stress testing to Double 11 stability as a whole. In recent years, stress testing for big promotions has been, in his words, “as smooth as can be.”

At the same time, more and more technology has been folded into platforms. With the successive arrival of the big-promotion central control platform, the inspection platform, the change core, the rate limiting platform, the contingency plan platform and others, the number of pure technical support staff on the front line has fallen year by year, freeing the big-promotion technical team to tackle harder technical problems.

“Make big promotions driverless” is the vision of everyone who takes part in Double 11.

Cloud native capacity: from comfort zone to no man’s land

Zheng Yangfei, once known as the “little prince of stress testing”, said that one of the key points of big-promotion assurance technology is capacity assessment.

In other words, how to guarantee the stability of Double 11 at the lowest cost and with the highest efficiency. As these promotions become routine, traffic surges show up in day-to-day operation too, bringing capacity and stability problems with them.

In pursuit of low cost, high efficiency and high stability, Zheng Yangfei and his team have over the years distilled big-promotion technology into a variety of platform tools. Now the team faces an even less familiar and more difficult area: cloud native capacity.

**The role of cloud native capacity is to work out how many resources each application should use, based on historical trends and real-time forecasts.** Its mechanism combines classical and machine learning prediction algorithms with capacity scaling engineering built on cloud native technology, in order to keep overall cloud native application capacity stable and resource use reasonable.
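As a rough illustration of sizing from history plus real-time signals, here is a minimal Python sketch. It is not Ant's algorithm: the seasonal-naive forecast, the blending weight, the headroom factor and every name in it (`forecast_peak_qps`, `replicas_needed`, `qps_per_replica`) are assumptions made for the example.

```python
import math
from statistics import mean


def forecast_peak_qps(history, recent, season=24, blend=0.5):
    """Blend a seasonal-naive forecast with a short real-time average.

    history: past QPS samples, one per hour (hypothetical granularity)
    recent:  the latest real-time QPS samples
    """
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    # Seasonal-naive forecast: reuse the value seen exactly one season ago.
    seasonal = history[-season]
    # Real-time signal: average of the most recent samples.
    realtime = mean(recent)
    return blend * seasonal + (1 - blend) * realtime


def replicas_needed(peak_qps, qps_per_replica, headroom=0.3, min_replicas=2):
    """Turn a forecast into a replica count with safety headroom."""
    required = peak_qps * (1 + headroom) / qps_per_replica
    return max(min_replicas, math.ceil(required))


# Made-up numbers: a flat 100 QPS yesterday, 200 QPS right now.
peak = forecast_peak_qps([100.0] * 24, [200.0, 200.0])  # 150.0
print(replicas_needed(peak, qps_per_replica=50.0))      # 4
```

A production system would presumably replace the seasonal-naive step with a trained model; the point here is only the shape of the pipeline, forecast first and sizing second.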

The motivation is that resource utilization of online applications is consistently low, and because such applications are long-running, their resource specification and replica count are fixed when the resources are applied for. Ant Group hopes to find an autoscaling technology suited to a financial-scale business that flexibly adjusts application specifications and replica counts according to traffic characteristics, making traditional online applications serverless and thereby improving their resource utilization and saving costs.

Some people have considered using the HPA/VPA technology of the open source community (Kubernetes), but ran into difficulties in practice. First, for most online applications the relationship between service capacity and resource utilization is not simply linear, so scaling cannot be driven directly by metrics as in the community’s HPA. Second, Ant Group’s financial business demands high stability, and the business complexity accumulated for historical reasons is also high, which makes elastic scaling a high-risk change; technical risk controls have to be built so that anomalies do not turn into failures. Third, online applications need more than 10 minutes to scale out or in, which cannot meet the need for rapid elasticity.
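The first difficulty can be made concrete with a small sketch. The Kubernetes HPA documentation gives the proportional rule desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The utilization curve below is entirely made up (an idle overhead plus a superlinear term) and only stands in for "not simply linear"; `cpu_utilization` and `replicas_from_curve` are hypothetical names, not part of any library.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Proportional rule documented for the Kubernetes HPA."""
    return math.ceil(current_replicas * current_metric / target_metric)


def cpu_utilization(qps_per_replica):
    """Hypothetical non-linear curve: idle overhead plus a quadratic term."""
    return 0.10 + 0.4 * (qps_per_replica / 100) ** 2


def replicas_from_curve(total_qps, target_util, max_replicas=1000):
    """Smallest replica count whose utilization stays at or below target."""
    for n in range(1, max_replicas + 1):
        if cpu_utilization(total_qps / n) <= target_util:
            return n
    raise ValueError("target cannot be met within max_replicas")


# Made-up scenario: 10 replicas serving 600 QPS in total, target 55% CPU.
observed = cpu_utilization(600 / 10)
print(hpa_desired_replicas(10, observed, 0.55))  # 5
print(replicas_from_curve(600, 0.55))            # 6
```

Under this assumed curve the proportional rule would shrink the application to 5 replicas, which pushes each replica to 120 QPS and above the 55% target, while sizing from the curve keeps 6. That gap is the kind of scale-in risk a metric-driven rule can create when the relationship is not linear.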

For these reasons, designing and building a managed elastic capacity scheme suited to Ant’s production environment became an imperative for Zheng and his team.

**The architecture of cloud native elastic capacity technology is essentially a multi-layer, closed-loop negative feedback control system made up of a profiling system and an AutoScaler.** The profiling system uses big data technology and machine learning algorithms to plan each application optimally; the AutoScaler takes the application profile produced by that analysis and carries out multi-level HPA and VPA changes. The profiling system accumulates big data on application characteristics and analyzes it with offline and real-time algorithms, drawing on the patterns in application data and on feedback from the production environment to find the optimal solution for workloads. Changes to the profiling system itself go through change management and gray release control to reduce technical risk. The AutoScaler uses multi-level HPA to scale horizontally and VPA to scale vertically. Multi-level HPA uses Service Mesh to greatly shorten application startup time, giving stable and fast scale-out and reducing the risk of scale-in.
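The closed-loop idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the architecture described above: `recommend_replicas` stands in for the profiling side, `bounded_step` for an AutoScaler that limits each change as a simple risk control, and the 20 percent step limit and all numbers are invented for illustration.

```python
def recommend_replicas(measured_qps, target_qps_per_replica, min_replicas=2):
    """Profiling side (stand-in): replicas needed for the measured load."""
    needed = -(-measured_qps // target_qps_per_replica)  # integer ceiling
    return max(min_replicas, needed)


def bounded_step(current, recommended, max_step_percent=20):
    """Scaler side (stand-in): close the gap, but only a bounded step at a time."""
    max_step = max(1, current * max_step_percent // 100)
    delta = max(-max_step, min(max_step, recommended - current))
    return current + delta


def control_loop(current, qps_samples, target_qps_per_replica):
    """Measure, plan, act, then measure again; the gap shrinks every round."""
    trajectory = [current]
    for measured_qps in qps_samples:
        recommended = recommend_replicas(measured_qps, target_qps_per_replica)
        current = bounded_step(current, recommended)
        trajectory.append(current)
    return trajectory


# Made-up load: five rounds at 2000 QPS, then two rounds at 500 QPS.
print(control_loop(10, [2000] * 5 + [500] * 2, 100))
# [10, 12, 14, 16, 19, 20, 16, 13]
```

The gap between the recommended and the current replica count plays the role of the error signal, and each round reduces it. Capping the step size trades reaction speed for safety, which is one plausible way to keep a bad recommendation from turning into a large change.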

In short, the goal is autoscaling for classic applications: allocating resources optimally while ensuring stability. “When this technology matures, it can significantly reduce capacity failures and improve resource utilization,” Zheng Yangfei envisions.

Easier said than done. Four attempts before Zheng Yangfei had failed, and he himself had stumbled here too. He brushed aside the many doubts and objections: “Cloud native infrastructure is much better now than it used to be, we have gone deep into defining and understanding the problem, and we are a team that is not afraid of difficulty. I think we can make it.”

Having chosen the road, he presses on. Research into cloud native capacity is still in its infancy, but the team’s work has made some progress: in 2019, Zheng and his team saved about 10 percent of operations and maintenance costs for Ant.

He waved his hand excitedly and said, “I feel like I have a lot of money in my hand. I can buy whatever I want!”

“A new positioning for the Technical Risk Department”

Zheng Yangfei’s experience outlines the development trajectory of Ant’s Technical Risk Department: “turn positions into capabilities, and capabilities into platforms”.

A variety of platform tools have become the eighteen weapons of “unmanned” operation. If past Double 11s were a battlefield, today’s Double 11 is more like a military exercise: manpower costs have dropped sharply, and the Technical Risk Department often sends in new members to hone their skills. “If we rely too much on the existing platforms, we won’t have the tension we had in those days.”

**Platforms are the condensation of technology, and people are the key to creating and evolving it.** Take SRE as an example. The concept, first proposed by Internet companies abroad, stands for Site Reliability Engineer. An SRE is expected to be strong both in programming and algorithms and in network architecture, so true SREs are found only at the top Internet companies.

At Ant, SRE is defined differently: it stands for Site Risk Engineer. People unfamiliar with the concept often question it: isn’t this just PE? Are you taking the blame for other teams?

“SRE is not a position, but a kind of ability,” said Zheng.

“When our technical risk capability is mature enough, there will be no need for SRE positions.” Zheng Yangfei said that over the past two years the team has gradually moved away from the traditional SRE role. SRE has been built into software and platforms as a capability, and the engineers’ task is no longer traditional operations and maintenance but software engineering for those platforms.

Both inside and outside Ant, Zheng Yangfei and the whole technical risk team have never been short of doubters. Some people chose to quit and give up, while others kept moving forward.

**“The point of the Technical Risk Department is to analyze carefully the reasons behind each failure and distill a set of rules that prevent that kind of failure from happening again.”** A veteran of many years, Zheng Yangfei has become a senior member of the department. “I just want to prove two things: that I can find a sense of achievement here, and that the value of what we do can be recognized.”

When, I asked him, did he first realize that his job was relevant to everyone?

One day, Zheng recalled, a newly launched feature flooded Alipay’s customer service hotline with inquiries and complaints, and many of the customer service representatives missed lunch. “It’s hard to realize the weight of every line of code you type until you’ve been through that.”

“I’m no genius, certainly not. I’m just an ordinary person,” Zheng said.

He quoted a line from Yang Chaoyue that had recently gone viral: God does not only love smart people; he also looks after us stupid kids. “Thank you for giving ordinary people like us a chance.” At this, the “stupid kid” beamed.

A human life is ordinary, yet now and then it shines like a star. As the ants climb over the mountains, the world opens up, and every star shines in its place.

Join us

We are the Ant cloud native capacity team. The role of cloud native capacity technology is to work out how many resources each application should use, based on historical trends and real-time forecasts. Its mechanism combines classical and machine learning prediction algorithms with capacity scaling engineering built on cloud native technology, in order to keep overall cloud native application capacity stable and resource use reasonable.

Job description

  1. Responsible for research and development of Ant’s intelligent monitoring, performance and capacity, and risk data infrastructure, including requirements research, system analysis and design, core module implementation, tuning and maintenance.
  2. Lead the resolution of core technical problems, solve world-class distributed processing problems, and identify and resolve potential technical risks.
  3. Responsible for platform stability and system quality; guarantee the metrics for system availability and data quality.
  4. Take part in large-scale events such as Ant’s Double 11, using platform capabilities to ensure high availability and fund security of the whole Ant system under peak request load.
  5. Continuously integrate the platform with the various technical risk prevention and control business teams and systems to meet evolving business needs.

Job requirements

  1. Bachelor’s degree or above in computer software or a related field; strong passion for technology and sense of responsibility;
  2. Innovative, willing and eager to dig into technology; rigorous and logical, with the ability and habit of critical thinking;
  3. Solid computer science foundations, including algorithms and data structures, operating systems, computer architecture, computer networks and databases;
  4. Solid grounding in Java/C/C++/Rust/Go, good programming practice and a taste for clean code; familiar with at least one relational database such as Oracle or MySQL;
  5. High-availability experience at a well-known Internet company; experience in real-time computing (Spark/Flink/Storm) or massive data processing (Hadoop/HBase/Hive) is preferred;
  6. Strong ability to analyze and solve complex problems, with a strong sense of responsibility and mission.

Contact email

[email protected]