“Many people are doing the same things, but only when you push them to the extreme do the differences in technology show,” Chen Liang (aka Jun Yi), a researcher in Ant Financial’s technology risk department, told InfoQ. At the recent Alipay technology carnival, InfoQ held an exclusive interview with Chen Liang, a witness to and lead architect of several of Alipay’s technical architecture upgrades, and for the first time gained a systematic view of the Alipay technical risk system that has kept the platform stable through “Double 11” and other real-world tests.
Alipay technology risk system
Chen joined Alipay in 2007 and was responsible for the architecture of its search and communication middleware. Over the following ten years, he was responsible for the overall architecture of Alipay’s transaction splitting, which became the standard for Alipay’s database-splitting architecture, and for the overall design of the unitized third-generation architecture and its disaster recovery, which became Alipay’s unitized architecture standard. Summarizing his first ten years at Alipay, Chen Liang said:
I spent my first 10 years working on scalability.
During this period, the work was driven mainly by problems and demand. Chen Liang recalled, “At the beginning, Alipay was a monolithic architecture, with a minicomputer and two applications written in Java. That year, the DBA came to us and said that it would be difficult to sustain business growth without splitting the database.”
After a series of refactorings, this work was finally completed. At the time, Chen thought the architecture could support Alipay’s development for at least the next five to ten years. However, Double 11 soon arrived, and the impact of large-scale instantaneous traffic posed a new challenge to the architecture. The whole team moved straight on to research and development of remote multi-active (geographically distributed active-active) projects.
At that time, Alipay had two teams dealing with technical risk: the technical quality team and the operation and maintenance team. The technical quality team mainly handled functional testing of all kinds and resolved program bugs, failures, and similar problems. The operation and maintenance team was mainly responsible for operating, managing, and safeguarding the production infrastructure, applications, and databases. It also took on some technical risk work on its own initiative, but no systematic organization or methodology for technical risk had yet formed.
In 2013, Alipay’s technical team put forward the Quality 2.0 strategy, with the aim of extending its capability in the field of technical risk and systematically building up bug-detection capability. From then on, the construction of Alipay’s technical risk system gradually got on the right track.
Organizational structure evolution
In 2014, the Quality and Technology Department was established in the hope of addressing technical risk from a global perspective. The department had no operation and maintenance team, however. It mainly provided technical solutions for general quality testing and high-availability assurance, and drove the technical teams of the business departments to implement them. At that time the department had few staff; it was a small and lean department.
After more than a year, the Quality and Technology Department found that quality technology alone could not address the various failure risks in production. Although the department paid attention to both production and the development process, most of its energy went into delivering technical risk solutions, such as high availability and general quality testing, to the business and technical teams, and platforms for high availability and fund security had not yet emerged. Full-link stress testing and continuous integration platforms were in place, but there was no established platform for high availability.
At the time, the technical team decided that risk needed to be viewed from a higher and more holistic perspective than quality alone. In 2015, the Quality and Technology Department was upgraded to the Technical Risk Department, which focuses on researching and solving technical risk problems and on delivering the corresponding solutions and platforms.
In 2016, Chen Liang single-handedly created Alipay’s SRE (Site Risk Engineer, modeled on Google’s Site Reliability Engineer) system. The PE and DBA teams were brought into the technical risk department, with the PE team taking direct responsibility for technical risk prevention and control in production operations. The team as a whole functions as SRE. It is understood that this was also the first SRE team in China.
Chen Liang found that the traditional operation and maintenance mindset and culture could not fully solve Alipay’s stability problems, which is why the SRE team had to be set up. The traditional operation and maintenance model relies on human effort to handle risk. No amount of parameter tuning or configuration changes can fundamentally solve Alipay’s stability problems, and the operation and maintenance staff get little sense of achievement from such work. Problems in operations and maintenance are ultimately software problems, and a software platform needs to be built to manage risk better.
Chen Liang believes that the hardest part of building the SRE team was not the technology but getting the team’s engineers, and indeed the whole company, to recognize the value of SRE. That requires everyone to understand what new problems SRE can solve and why the traditional way of thinking no longer works.
It is understood that Alipay’s SRE team is made up mainly of R&D, operation and maintenance, and testing personnel, and that 80% of the operation and maintenance personnel are required to write stability-related code. Once formed, the team set out to deliver automatic fault location, adaptive disaster recovery, and fine-grained high availability (HA), so that users never notice any network or infrastructure jitter. Fine-grained high availability, also known as per-transaction high availability, works at the granularity of each individual user transaction, far finer than the data-center-level high availability common in the industry.
In 2016, the SRE team built many platforms and capabilities. At the same time, the technical team noticed two very important phenomena. First, production failures do not occur predictably; they usually happen by chance. Second, production failures are infrequent. The consequence is that the sample of failures is so small that there is no way to prove the platform can handle a real failure when one comes. In other words, the reliability of the defense system the SRE team had built could not be fully verified.
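The small-sample problem described above can be made concrete with a short calculation. The sketch below is an illustration only and is not taken from Alipay's tooling: it assumes each failure is an independent trial and uses the standard one-sided binomial bound for the case where every observed failure was handled successfully. The function name and the 95% confidence level are assumptions chosen for the example.

```python
def success_rate_lower_bound(consecutive_successes: int,
                             confidence: float = 0.95) -> float:
    """Lower bound on the true handling success rate.

    Assumes independent trials and that all observed failures were
    handled successfully. With n successes out of n trials, the
    one-sided lower bound p solves p**n = 1 - confidence.
    """
    if consecutive_successes < 1:
        raise ValueError("need at least one observed failure")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return alpha ** (1.0 / consecutive_successes)


if __name__ == "__main__":
    # Compare a handful of real failures with a large number of drills.
    for n in (3, 100):
        print(n, round(success_rate_lower_bound(n), 3))
```

Under these assumptions, three perfectly handled failures only show that the success rate is above roughly 0.37, while a hundred perfectly handled failures push the bound to roughly 0.97. This is one way to see why a low failure frequency leaves a defense system unverified until failures are deliberately injected.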
In 2017, the SRE team set up a dedicated, independent technical Blue Army, whose main job is to find weaknesses in the defense system and launch real attacks against it. The Blue Army does not answer to the business teams; it is responsible only for the stability and reliability of the defense system.
In the view of the technical Blue Army, failures are bound to happen; it is only a matter of time. The Blue Army does everything it can to trigger these failures so that the team is capable of handling them when they really occur. At present, full-stack technical attack and defense drills are held every week, while the fault defense system and the continuously optimized high availability architecture are built up by the Red Army of the SRE team in close cooperation with the various businesses.
Chen Liang said that the main work of Alipay’s technical risk team today comes down to two things: first, ensuring the stability of Alipay’s production environment; second, ensuring zero fund errors in the Internet financial system. The goal is clear, but working out how to solve the problem and mapping a viable path to it is not easy.
Technology evolution
Four years ago, we started out doing fault location, and now we are running real drills.
Looking back at how the team’s technical strength has changed, Chen Liang said Alipay’s attack and defense drills epitomize this technical evolution. So far, the drill has been held four times, and its duration has grown from one day to four days.
Chen Liang explained that at first the attack and defense drills were aimed mainly at disaster recovery. Some online network-disconnection drills were also conducted, but the system at that time was not ready for stability drills run directly online, so the work was mainly fault location within a narrow scope. The next year, the team built a new piece of infrastructure: a grayscale environment that was completely isolated from the production environment but could bring in production traffic for validation. The environment also carried stress-test traffic 24 hours a day, so the team could run stability attack and defense exercises in various environments, with a requirement to restore stability within 10 minutes. At this point the drills moved from fault location only to real drills.
Now the attack and defense drill has been extended to four days, and Alipay’s technical risk team exercises its overall failure recovery capability in the virtual environment. With AIOps and TRaaS, the team’s goal has become self-healing within five minutes, and the latest attack and defense data show that nearly 80% of businesses have achieved self-healing. The more complex disaster recovery (DR) drills have gone from 12 a year to more than 100, and the DR success rate has risen from 50% to 90%. In this process, Alipay has accumulated many capabilities related to technical risk. Two of them, TRaaS and AIOps, are briefly introduced below.
Alipay technology risk control platform TRaaS
In the past, we were quick to accept and adopt new technologies, but perhaps did too little sharing. Now we have opened up the entire risk control system built through the attack and defense drills.
Last year, at the Ant Financial ATEC Technology Conference in Hangzhou, Alipay officially launched TRaaS (Technological Risk-defense as a Service). Tested many times in practice, TRaaS is an immune system that combines Alipay’s entire distributed architecture with its technical risk capabilities, and combines AIOps with its high availability and fund security capabilities, so that the system can heal itself after failures and thereby gain immunity.
Chen said the decision to open up the full set of risk platforms built up through the attack and defense drills had two drivers. First, it follows Alipay’s strategy of openness: in the past, Alipay opened its middleware and PaaS platforms to customers. Second, users in the financial field have a real need for stability, and there has been no particularly good solution. Alipay is willing to productize and offer the technical capabilities it has accumulated over the years.
In short, TRaaS has three main features: 99.999% high availability; real-time verification of billions in funds within seconds; and 5-minute discovery with 5-minute self-healing immunity.
First, relying on Alipay’s three-site, five-data-center remote multi-active disaster recovery architecture and full-link stress testing, TRaaS achieves 99.999% availability, which means the system’s annual downtime is limited to about five minutes.
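The relationship between an availability percentage and the annual downtime it allows is simple arithmetic, sketched below. The function name is an illustrative assumption, and the calculation uses a 365-day year; it is a worked check of the figure, not a description of how TRaaS measures availability.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def max_annual_downtime_minutes(availability: float) -> float:
    """Downtime budget per year, in minutes, for a given availability.

    `availability` is a fraction, e.g. 0.99999 for "five nines".
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return MINUTES_PER_YEAR * (1.0 - availability)


if __name__ == "__main__":
    print(round(max_annual_downtime_minutes(0.99999), 2))
```

For 99.999%, the budget works out to a little over five minutes per year (about 5.26 minutes), which matches the roughly five-minute figure quoted for TRaaS.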
Second, Chen, who heads TRaaS, recalls that as its fund prevention and control system evolved, Alipay, like many banks, initially relied on manual reconciliation. It then moved to automated checks that exported full database tables and ran calculations over them. Later, as traffic grew, T+H checking was introduced, cutting the check cycle from days to hours, and exception management was added to the process. Finally, the system evolved to real-time business verification, adding circuit-breaker decisions, fund immunity, and intelligent monitoring, which gives TRaaS its ability to verify funds on a 100-billion-dollar scale within seconds.
Finally, TRaaS integrates Alipay’s exploration at the AIOps level.
AIOps
As mentioned above, self-healing is an important area of exploration for Alipay in AIOps. At present, self-healing recovery time is held to about 5 minutes. Chen Liang expects this time to keep shrinking as the AI algorithms are further optimized. He said that AI algorithms have certainly played a useful role in building the system, but AI-driven self-healing may be limited to certain scenarios and needs to be combined with software engineering modeling backed by SRE capability. Alipay also uses AI to locate root causes and handle alarms.
In the interview, Chen Liang said that the greatest value of AI in DevOps can be summarized as improving efficiency and expanding boundaries. On the one hand, models trained on historical monitoring data can assist engineers with business monitoring and so improve monitoring efficiency. On the other hand, AI greatly increases the number of monitoring points that can be configured and covers a wider range of business, which would be hard to achieve with existing manpower.
Alipay’s production environment is very complex. The greatest technical challenge in realizing AIOps comes from the scale and high concurrency of the data. To achieve high availability for the business, the technical risk team needs to find every possible cause of a failure, for example every cause that could bring payments down. Internally this process is called “seeking the denominator”, and AI plays an important role in it.
Take fund security as an example. For the same business, there are two tables, one upstream and one downstream in the SOA architecture, and the amount recorded for the same order must be identical in both. When the tables hold enough data, there are enough samples for training. AI can then be used to find the cause of failure behind each transaction whose amounts are inconsistent, and so keep extending the “denominator” of failure causes.
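The consistency rule in this example, that the same order must carry the same amount upstream and downstream, can be sketched as a simple reconciliation check. The code below is a minimal illustration under stated assumptions: each table is represented as an in-memory mapping from order ID to amount, and the function and variable names are hypothetical. It covers only the detection of inconsistent orders, not the AI-based cause analysis the article describes.

```python
from decimal import Decimal
from typing import Dict, List, Optional, Tuple

# (order_id, upstream_amount, downstream_amount); None means the order
# is missing from that side.
Mismatch = Tuple[str, Optional[Decimal], Optional[Decimal]]


def find_amount_mismatches(upstream: Dict[str, Decimal],
                           downstream: Dict[str, Decimal]) -> List[Mismatch]:
    """Return orders whose amounts differ or that exist on one side only."""
    mismatches: List[Mismatch] = []
    # Check every order ID seen on either side, in a stable order.
    for order_id in sorted(upstream.keys() | downstream.keys()):
        up = upstream.get(order_id)
        down = downstream.get(order_id)
        if up != down:
            mismatches.append((order_id, up, down))
    return mismatches


if __name__ == "__main__":
    upstream = {"A": Decimal("10.00"), "B": Decimal("5.50"), "C": Decimal("1.00")}
    downstream = {"A": Decimal("10.00"), "B": Decimal("5.05")}
    for row in find_amount_mismatches(upstream, downstream):
        print(row)
```

Amounts are held as `Decimal` rather than floating-point values so that equality comparison on money is exact. In this sketch, each mismatch row would be the kind of sample that feeds the later search for failure causes.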
As for future plans for the TRaaS platform, Chen Liang said that when conditions are ripe, the platform will integrate all of the Alipay technical risk team’s attack and defense capabilities, including the grayscale architecture, drill platform, self-healing platform, alarm processing platform, and change platform.
Future plans
In the future, technical risk prevention and control systems will have more intelligent features that minimize human intervention and, at best, run unattended. Chen Liang revealed that this will be the team’s main direction for at least the next two years: making all changes unattended. Of course, going unattended is simple in itself; the key is that risk control capability must keep up.
Chen Liang said frankly that, as Alipay’s technical risk capability is built out, he hopes to move the technical risk and AI capabilities into the cloud and combine them with Service Mesh, so that businesses can focus on writing business code and leave the rest to the cloud.
Chen Liang (alias Junyi) is a researcher in the Technical Risk Department of Ant Financial and a witness to and lead architect of several of Alipay’s technical architecture upgrades. Before joining Alipay, he worked on Chinese-language programming and founded a search website. He now leads Alipay’s technical risk team in researching and developing the architecture and products for Ant Financial’s new generation of high availability and fund security, such as AIOps and TRaaS.