Ali’s continuous delivery platform has gone through 8 years of iterative evolution and has grown into the most important R&D tool for tens of thousands of applications in the group. Its efficiency directly affects the daily work of tens of thousands of R&D engineers. However, a platform should not be just a pile of tools. It requires deep thinking about the R&D model of the Internet era, constant polishing, and the continual integration of engineer culture and engineering practice. We favor technology over control, adopt the latest engineering practices in the industry, and use the evolution of technology to solve the efficiency problems of engineers. This talk introduces the evolution of Alibaba’s continuous delivery tools, along with our thinking and practice on hot issues in delivery in the Internet industry.
Hello, everyone. I’m Shenxiu, from Alibaba. The topic I bring you today is continuous delivery in the Internet era. Why put the Internet in front of continuous delivery? What is special about Ali’s continuous delivery practice? I hope you can take something away from this talk.
I am in charge of building the continuous delivery platform and R&D tool chain in Ali, as well as offering the corresponding capabilities externally through Ali Cloud. Our external version is called Cloud Effect, which is currently in public beta on the public cloud.
Let me start with today’s main content.
- First, I will introduce the evolution of Ali’s continuous delivery tools and our thinking in building them over these years.
- Then we will discuss an important topic in the rapid evolution of Internet products: quality and efficiency, how we see the relationship between the two, and how to balance them.
- The other issues we face are delivery and DevOps. Traditional software companies are certainly familiar with delivery, but in the Ali scenario we have some new challenges. On DevOps, I will talk about our progress over the last year and one small innovation.
Timeline
Let me introduce how Alibaba’s continuous delivery platform developed. In the first stage, in 2009, we built a simple automated release tool to remove the single-point bottleneck formed by our SCM and PE colleagues. Many enterprises have probably been through the same process before this: developers submit a release request at a fixed time, the configuration management colleagues freeze the code and package it, and then hand it to the operations colleagues to release. That is barely adequate for small teams. As the number of applications grew and the online environment became more and more complex, the capacity and efficiency of the configuration management and operations colleagues began to hold back the development of our products. Of course, there was a little “corruption” too, such as having to buy the configuration management colleagues a coffee to squeeze into an emergency release. Just kidding.
Less than two years later, we had more and more developers, all kinds of complex R&D specifications, and all kinds of complex scripts online. Newcomers dug all kinds of pits, and those who came after suffered from falling into them. All of this had to be regulated, so in 2013 we unified everything from code change to online release and controlled it by building a unified deployment platform.
Apparently that wasn’t enough, and in 2016 the platform was upgraded again into a one-stop platform from requirements to code and from delivery to feedback. With projects, requirements, code, build, testing, release, pipelines, public opinion feedback and so on, the product picture is basically complete.
In 2017, we opened our 8 years of experience with platform tools on Ali Cloud as Cloud Effect, hoping to give back to the cloud ecosystem through Ali’s experience and, at the same time, to rely on the experience of the vast number of developers to help our tools grow.
Evolution of tools and ideas
With that said, let’s look at how our tools and ideas have evolved. I’ll expand on the following four points in detail.
- The first is automation, which is the first value a tool has to deliver and the most direct way to improve efficiency. For us, that means automating configuration, code, testing, and operation and maintenance.
- The second is standardization, which is the biggest mission of a tool platform. For example, Amazon’s Apollo environment deployment tool, which is often talked about, does this very well. Alibaba also has its own R&D standards and operation and maintenance standards. R&D standards such as the R&D mode, technology stack and configuration management are relatively easy to enforce. The operation and maintenance domain is quite difficult: at present, web applications, mobile applications, search, basic system software and so on each have their own systems, and the group and Ali Cloud also differ slightly. But with the advance of containerization and unified scheduling, these are all expected to be unified.
- The third is customization, which is a higher requirement for the platform. Teams of different kinds and skill levels naturally have different requirements for tools. We must not lose flexibility for the sake of standards, nor limit highly skilled teams in order to accommodate less skilled ones. Therefore, we first recommend appropriate delivery processes and management practices based on team maturity.
- The fourth is one-stop. When tools spring up everywhere, they start to hurt the developers. Different interactions and different integration styles not only increase the complexity of the system but also reduce efficiency as work flows between platforms. Therefore, we integrate the platform tools to connect the whole link from requirements to feedback and complete value-based delivery on one platform.
Automate everything
Let’s start with our first idea: automate everything. This diagram shows a common development process. We first pull a development branch from master, merge it into a release branch for release, and merge it back into master after publishing. Each R&D team should have its own set of conventions for handling branches. As teams get bigger, or as new people join, you need tools to standardize operations, improve collaboration efficiency and avoid mistakes.
You will be familiar with several R&D modes: trunk development, branch development, Gitflow and so on. Everything from pulling a branch, through merge requests, code merging and conflict resolution, is ultimately simplified into a pipeline. A novice developer can join the R&D work immediately and avoid such mistakes simply by operating on the platform.
Putting standardization into practice
In terms of standardization, the tools mainly cover the following aspects: application creation, testing and acceptance, standard environments, release gates and the deployment process.
One application corresponds to one code base and one service unit. We implement and iterate on group standards through code recommendation and technology stack templates, for example in promoting Spring Boot. Through resource orchestration, the infrastructure for an application can be set up quickly.
In testing and acceptance, the group coding conventions and security testing are the main standards promoted in recent years and have been implemented across the whole group. The code quality score is an objective, data-based evaluation of the current quality of an application.
The standard environment is the basis of the delivery pipeline and of operation and maintenance control, and it can be achieved easily through containerization and unified scheduling. The fourth point is interesting. The original idea was for tools to guide users to fix deployment or resource problems with automated tools, such as clearing logs or restarting a server. Now we rely more on self-healing: the platform handles the operation and maintenance of environment resources without human intervention, replacing people with tools.
Most release gates are control requirements. They are defined by the group’s unified control strategy to govern R&D behavior and quality.
Finally, in the deployment process, release strategies, monitoring, baselines and rollbacks are all required functions that I won’t go into here.
Customize the solution
To implement customization, we look at several elements of a solution:
- Team maturity: What is the team size: 1-2, 7, or 10+? Is it full stack, or are there independent test and operations teams? What is the quality level? Is there technical debt? Are there any specific conventions within the team?
- Iteration speed: Does the team release at any time of day, or deliver periodically? Is there a release window? From the perspective of continuous delivery, we do not want to put too many constraints on releases; a release should be possible whenever it is needed. Ali’s core applications are basically released at any time, even several times a day.
- BU technology stack: each BU has some specifications and individual differences of its own. Although we have been building a unified infrastructure and a unified R&D and operations platform, we still cannot achieve 100% unification. That is the direction we keep working toward.
- Finally, integration and delivery: Are there product integration requirements? Is it project delivery, or proprietary cloud delivery? The typical forms are e-commerce, mobile and Ali Cloud.
Based on these four factors, we derive several directions for customization:
- R&D mode: depending on the application, small teams adopt branch development, large teams that build an application jointly adopt Gitflow, and even larger teams adopt trunk development. Because microservice design is popular, the team behind each application keeps getting smaller, so branch development is the more popular mode in Ali.
- Technology stack: for Java, C++, scripting languages and so on, we use code recommendation and templates to help users create the code framework and build environment in one click.
- Deployment templates: the common practices are package templates and Dockerfiles. In Ali, many BU architects and PEs provide base images for a variety of technology stacks to help ordinary developers deploy environments quickly. Technology stack control of this kind also depends on tools.
- Finally, multi-stage pipelines are used to implement various types of delivery processes and meet the needs of integration and delivery.
One-stop platform
What you see now is the whole picture of Ali’s Cloud Effect R&D collaboration platform. It can be broadly divided into two parts, project collaboration and continuous delivery. The delivery part forms a complete closed loop from requirements to feedback. The feedback part includes not only performance metrics but also public opinion analysis and questionnaire surveys on the business itself, as well as an intelligent customer service tool.
Putting engineer culture on the platform
Finally, our ultimate goal in building platform tools is to put engineer culture on the platform. Of course, engineer culture is a rather abstract thing, and every enterprise has its own. Inside Alibaba, Taobao, B2B, Ali Cloud and other BUs each have their own characteristics. But there are four main points:
- Quality culture: Quality is the core of continuous delivery and the only path to team growth. Without a quality culture, efficiency is impossible: code quickly rots, nobody dares to touch it, and it becomes technical debt, to say nothing of rapid iteration and continuous delivery.
- Innovation culture: as the R&D middle platform, we cannot and need not be the source of all tool innovation and efficiency innovation. Ali itself has a strong culture of innovation. Today we bounce an idea around, and tomorrow a group of colleagues turns it into a tool or a small product. Innovations are discovered and quickly grow into ecosystems of their own. This happens every day. Our tool platform should become a carrier that brings the best innovations onto the platform, promotes the integration of similar products to avoid reinventing the wheel, drives users to them so they grow stronger and larger, and forms a positive cycle.
- Full stack culture: There is a lot of talk about DevOps, but DevOps without tools is basically empty talk. When our platform can help Dev automate Ops work, or make the necessary knowledge easy to learn, there are no dead ends left in the collaboration between R&D, testing and operations.
- Lean culture: This is the philosophy of our one-stop platform: value-based delivery and accurate, data-based measurement, to help developers or leaders evaluate product value and optimize team effectiveness.
Well, those are some of our practices in building Ali’s internal R&D tools. I offer them as a modest starting point, in the hope of drawing out better ideas in our discussion.
Next we are going to discuss quality and efficiency, which is probably an eternal topic in engineering. Today we will focus on how, in the Internet scenario, we can have both instead of trading one for the other.
Let’s take a look at some of our challenges in the Internet age:
When delivery speed determines the market: our CTO once said that R&D tools should ensure that an idea can go from birth to launch within 2 weeks. We try things quickly; if an idea fails, it is dropped, and if it works, a group of people is pulled together to make it bigger and stronger. This might seem very difficult to do in a traditional way, but it is happening.
Under this premise, how do we choose between quality and efficiency?
Will changing engines in flight become the norm? Following the previous premise, occupying the market first and then iterating and optimizing has basically become the consensus in our software development. Now if someone tells me we’re changing engines mid-flight, I’d probably just say “hehe”, because we do that all the time, right?
Continuous integration faces challenges
Let’s look at the challenges we face in continuous integration:
- Continuous integration without test coverage becomes a burden. When unit tests and API tests are not done well, continuous integration imposed through process is basically self-deception: it is either unstable or ineffective.
- Quality degrades as test teams are transformed and developers go full stack. There are too many requirements, there is no time to write tests, or developers are so used to being looked after by testers that they don’t notice the gap, and so on.
- Interdependence between test environments produces instability. The integration environment is arguably a major pain point in Ali’s R&D process.
Having looked at all these problems, we need to think: what else can we do besides pushing for better testing?
From tools to efficiency
So what I want to talk about today is getting from tools to efficiency, and from quality to efficiency. Let’s first look at where efficiency comes from.
- The first is quick feedback. I think the first word that comes to mind for efficiency is “fast”, which means quick feedback: speeding up builds and speeding up regression.
- The second is collaboration. Communication is often a big cost for programmers, and we’re not very good at it, right? We often joke that IQ is inversely proportional to EQ. Reducing collaboration costs can therefore improve efficiency effectively, for example through the branch development model, online review, mobile office and so on.
- The third is innovation. When the old extensive, manpower-heavy approach can no longer be sustained, innovation may be the only way out. Examples are dual-engine testing, mock testing and so on.
I will illustrate the above three points one by one.
Cost of collaboration
Let’s start with an example of collaboration, a comparison between branch development and trunk development.
Branch development means that everyone codes on branches, one per requirement, bug fix and so on. When the work needs to be integrated, the branches are merged into a temporary release branch, which is merged into the trunk once the release is done.
Trunk development means that developers code directly on the trunk, commit to the trunk as soon as development is done, and release a package built from the latest trunk code.
Now let’s compare the two R&D modes:
- First, whether branches are created. Branch mode creates a branch for each feature, which makes management and control easier: when you see a branch, you immediately understand what it is for. Trunk mode commits directly to the trunk, and changes are distinguished only by commit.
- Second, branch development needs to merge into release branches many times, which inevitably leads to repeated conflict resolution, and testing only after integration delays test feedback. Trunk development, on the other hand, resolves a conflict only once and integrates on commit.
- Third, when a feature should not be released, branch mode can simply drop that branch from the integration and merge the branches that do need to be released into a new release branch. Trunk development often uses feature switches to turn off the corresponding function, because stripping the code out is difficult.
- Fourth, when a release is rolled back online, in branch mode we can roll back the trunk and automatically revert the merge in the next release branch, to prevent the faulty code from going online again. Trunk mode requires a hotfix, and later releases may be blocked until it lands.
Across these four scenarios, we marked the side that is more favorable for collaboration in black and the less favorable side in white. You can see that branch mode seems to win slightly, especially on the third point, where arbitrary integration based on feature branches gives developers a high degree of freedom. In our practice, with microservices, a clear division of labor, little coupling and rapid iteration, tool support has greatly reduced the disadvantage of delayed integration feedback in branch development.
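The third point of the comparison can be sketched in a few lines of Python. This is only an illustration: `build_release`, `FeatureSwitch` and the commit and feature names are hypothetical and are not part of any Ali tool. The sketch contrasts rebuilding a release branch from a chosen set of feature branches with switching a feature off on the trunk.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def build_release(trunk: list[str], features: dict[str, list[str]],
                  selected: list[str]) -> list[str]:
    """Branch mode: a temporary release branch is trunk plus the selected feature branches."""
    release = list(trunk)
    for name in selected:
        release.extend(features[name])
    return release


@dataclass
class FeatureSwitch:
    """Trunk mode: the code is already on trunk, so a feature is disabled by a switch."""
    enabled: dict[str, bool] = field(default_factory=dict)

    def is_on(self, name: str) -> bool:
        # Unknown features default to off.
        return self.enabled.get(name, False)


# Hypothetical commits and feature branches.
trunk = ["c1", "c2"]
features = {"coupon": ["c3"], "new-cart": ["c4", "c5"]}

# Branch mode: "new-cart" is not ready, so rebuild the release branch without it.
release = build_release(trunk, features, selected=["coupon"])

# Trunk mode: both features are committed; "new-cart" ships in the package but stays off.
switches = FeatureSwitch({"coupon": True, "new-cart": False})
```

In branch mode the unreleased code never enters the package, while in trunk mode it is shipped but disabled, which is why the switch has to be checked wherever the feature is used.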
Quick feedback
Let’s move on to the next efficiency point: quick feedback. During development, as the number of test cases gradually increases, so does the test time. When the execution time reached 30 minutes, I found it unbearable: I had drunk several cups of coffee and the tests were still running. Here I’d like to introduce a rather fun tool for getting feedback quickly, which we call precision regression. It has several features: it uses the middleware’s full-link tracing to establish the correlation between test cases and business methods, it recommends the test cases to execute when code changes, and it delivers accurate regression and fast feedback.
Architecture diagram
Take a look at this picture. First we use EagleEye, our middleware plug-in for tracing, to instrument the test code and the application code and inject a TraceId. When a test case is executed, we record the complete link log from the test case to the application code, namely the EagleEye log. The logs are collected and sent to a real-time computing engine, which calculates the correlation between test code and application methods.
Once we know which application code each test case covers, then when code changes we naturally know which test cases cover the change. We only need to execute those test cases, so a test run of tens of minutes can be shortened to a few minutes or even seconds, which greatly improves efficiency.
Of course, this solution is not without flaws, so we recommend running the full set of test cases during the integration phase. That is acceptable, because tests are run far more often during development than during integration.
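The idea behind precision regression can be sketched as an inverted index from application methods to test cases. This is a minimal sketch under assumptions: the trace records, the method names and the functions `build_index` and `select_tests` are made up for illustration and do not reflect the EagleEye log format or the real-time computing engine described in the talk.

```python
from __future__ import annotations

from collections import defaultdict

# Hypothetical trace records: (test case, application method reached during the test).
TRACE_LOG = [
    ("test_create_order", "OrderService.create"),
    ("test_create_order", "StockService.reserve"),
    ("test_cancel_order", "OrderService.cancel"),
    ("test_cancel_order", "StockService.release"),
    ("test_query_stock", "StockService.query"),
]


def build_index(trace_log):
    """Map each application method to the set of test cases that reached it."""
    index = defaultdict(set)
    for test_case, method in trace_log:
        index[method].add(test_case)
    return index


def select_tests(index, changed_methods):
    """Recommend the test cases whose recorded links touch any changed method."""
    selected = set()
    for method in changed_methods:
        selected |= index.get(method, set())
    return sorted(selected)


index = build_index(TRACE_LOG)
print(select_tests(index, ["StockService.reserve"]))  # ['test_create_order']
```

A changed method that no recorded test has reached yields an empty recommendation, which is one reason a full run is still worthwhile during the integration phase.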
Test innovation
Let’s start with three problems:
- What do we do about incomplete test coverage? Writing test cases, especially good ones, takes a lot of time.
- Testing in the beta environment can cause financial-loss incidents. How do we deal with real traffic? If there is a bug, it will certainly cause problems, and even if the impact is small, some of them are irreparable.
- Test data is hard to maintain and is often contaminated. This is a complex problem and a real headache.
To solve these problems, the Tmall business team built a platform called dual-engine testing. Through online data collection, offline service isolation, replay and comparison, and automated regression, it helps us improve coverage and greatly improve efficiency.
Architecture diagram
Look at this architecture diagram. On the left is the online service. A client first collects online requests, including the request and response, as well as a snapshot of the call link to downstream systems, caches, the DB and so on. These are sent as MQ messages to the client in the beta environment for replay.
Is this just simple request replay? No. At its core, the tool isolates and mocks all of the application’s downstream dependencies. For example, when the application sends a SQL query to the DB, the dual-engine test platform intercepts the request and returns the snapshot of the same query taken online. Finally, it takes the application’s response, compares it in real time, and stores inconsistent results.
Through this mechanism, we can easily take online requests, replay them offline to test and debug, focus on testing the business code itself, and isolate dependencies to avoid interference.
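The record, isolate, replay and compare loop can be sketched as follows. Everything here is an assumption made for illustration: the snapshot layout, the `ReplayDownstream` class and the `get_discount` handler are hypothetical and do not describe the dual-engine test platform’s actual interfaces.

```python
from __future__ import annotations

# Hypothetical snapshot captured online: the request, each downstream call with its
# result, and the response the online service returned.
SNAPSHOT = {
    "request": {"user_id": 7},
    "downstream": {("db", "SELECT level FROM users WHERE id = 7"): "gold"},
    "response": {"user_id": 7, "discount": 0.2},
}


class ReplayDownstream:
    """Answers downstream calls from the snapshot instead of touching real systems."""

    def __init__(self, recorded):
        self.recorded = recorded

    def call(self, system, payload):
        return self.recorded[(system, payload)]


def get_discount(request, downstream):
    """Application code under test (a made-up example)."""
    level = downstream.call("db", f"SELECT level FROM users WHERE id = {request['user_id']}")
    discount = {"gold": 0.2, "silver": 0.1}.get(level, 0.0)
    return {"user_id": request["user_id"], "discount": discount}


def replay(snapshot, handler):
    """Replay one recorded request offline and compare with the online response."""
    actual = handler(snapshot["request"], ReplayDownstream(snapshot["downstream"]))
    return {"consistent": actual == snapshot["response"], "actual": actual}


result = replay(SNAPSHOT, get_discount)
```

Because the downstream answers come from the snapshot, a mismatch in `result` points at the business code itself rather than at test data or a dependent environment.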
Many products can also grow on top of this tool, such as test case management, failure analysis, offline replay and so on. At present, this platform has formed its own ecosystem in Ali, has been adopted by core applications, and has safeguarded the refactoring and upgrading of core code many times.
We’ve covered some of Alibaba’s past practices in continuous delivery and the current quality and efficiency challenges. Now I’ll talk about the two directions we’re currently working on and exploring: delivery and DevOps.
Internationalization and the shift to private clouds
When it comes to delivery, traditional enterprises are certainly familiar with it and can be called absolute experts. Now Internet companies like Alibaba face new delivery problems. What should we do when we want to deliver the e-commerce system to a joint venture company? What should we do when we want to export the infrastructure and the complete Ali technology system? What should we do when there are so many applications that they form a mesh of dependencies? Sounds like a huge job, right?
At present we face two changes in delivery. The first is from unified delivery to batch delivery: for example, we first deliver the e-commerce middle platform, and then the upper-level e-commerce business applications. The second is from whole delivery to block delivery: once everything has been delivered and versions start to iterate, an application landscape this large cannot be delivered as a whole and has to be delivered in blocks, and the block may be different every time. This certainly poses new challenges for tools.
Delivery efficiency challenges
Moving on to efficiency, here are three challenges:
- Quick setup: we need a low-cost, one-click way to build an environment, to reproduce problems or to create integration test environments for deliverables.
- Test regression: what happens when there are multiple versions of a service in the environment, and how can regression be fast enough?
- Link management: can the delivery process be made one-click, and can delivered versions be made visible and controllable, so that delivery risks are avoided?
Here are some examples for the above three points.
Exploration of delivery process
Here I have drawn the delivery process as a pipeline: dependency identification, comparative regression, fast setup, precision regression and one-click delivery.
First, dependency identification. Why do it? As I mentioned, when there are a very large number of applications and the mesh of dependencies is very complex, having people identify dependencies is not a good idea. And when I have to deliver one feature, pulling on every thread and having all the associated applications deliver together involves too many teams and is basically unsustainable. Tools are therefore necessary to identify dependencies automatically, and this can be done with full-link call data.
For example, when I modify version A1, the link data combined with the link snapshot of the delivery end shows that versions B1 and C1 must be delivered along with it. At this point, the system has delineated a delivery set for me, and the subsequent testing can be based on it.
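One way to picture how a delivery set could be delineated is a walk over the call links that keeps every application whose version in the group differs from the version at the delivery end. The rule, the data and the function `delivery_set` below are assumptions made for this sketch, not the platform’s actual algorithm.

```python
from __future__ import annotations

from collections import deque

# Hypothetical inputs. CALLS would come from full-link call data; the two version
# maps stand for the group environment and the snapshot of the delivery end.
CALLS = {"A": ["B", "D"], "B": ["C"], "C": [], "D": []}
GROUP_VERSIONS = {"A": "A1", "B": "B1", "C": "C1", "D": "D1"}
DELIVERED_VERSIONS = {"A": "A0", "B": "B0", "C": "C0", "D": "D1"}


def delivery_set(changed_app, calls, group_versions, delivered_versions):
    """Return the versions that must ship together with the changed application."""
    to_deliver = []
    seen = {changed_app}
    queue = deque([changed_app])
    while queue:
        app = queue.popleft()
        # An application joins the set when the delivery end runs a different version.
        if group_versions[app] != delivered_versions.get(app):
            to_deliver.append(group_versions[app])
        for callee in calls.get(app, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return to_deliver


print(delivery_set("A", CALLS, GROUP_VERSIONS, DELIVERED_VERSIONS))  # ['A1', 'B1', 'C1']
```

In this made-up data, D is called by A but is already at D1 on the delivery end, so it stays out of the set and its team is not involved in the delivery.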
Before the integration test, we can use the dual-engine test platform introduced earlier to replay data, run comparison tests and confirm the impact on the data at the delivery end.
Integration testing
Next we enter the integration test stage. First comes fast setup: through application container orchestration and middleware isolation, we can easily isolate A1, B1 and C1 to form a small integration link. With precision regression, we then get quick feedback on the changed functions.
One-click delivery
Finally, one-click delivery. In addition to basic version management, I can manage the delivery environment directly from the group environment, minimizing the cost of the delivery process for developers. The delivery side also supports grayscale release, which further reduces delivery risk.
The most important thing is a smooth feedback channel. Relying on the rich functions of the feedback module in the platform’s product picture introduced earlier, such as public opinion analysis and Q&A, I can quickly grasp the situation at the delivery end and take the necessary measures.
These are some of our approaches to the new delivery problems, and I welcome you to discuss them with me in depth.
On the last topic, DevOps, I’m not going to go into detail. I’ll just talk about some of the changes we’ve made around DevOps over the last year or so, and one small innovation.
In 2015, Ali Group began its transformation, removing most of the PE teams from its businesses and handing their work to development. In 2016, we built a unified scheduling platform, adopted Docker container technology, and completed the upgrade of core applications. 2017 will probably be the best year for DevOps at Alibaba: this year we will complete the containerization of all active applications and build a complete upstream and downstream tool chain.
Here I have listed three points. First, what are the most basic Ops tasks for R&D? Application configuration, environments, software baselines, and online changes plus pipeline control. These are used every day.
Before containerization, we did not handle such a complex set of tasks well. Once containerization took hold, our environments were further standardized, changes became code driven, and the management tools were greatly simplified. The earlier baseline tools, package validation tools and batch execution tools are no longer needed. And through the accompanying scheduling system, environment resources are truly handed over to the developers.
The operation and maintenance system has begun to move from complex to simple, to sink capabilities into services, and to go from self-service operation to self-healing, and it has started to try intelligent solutions such as large-scale elastic scheduling.
DevOps in Ali
It looks great, doesn’t it? But is it really? It is undeniable that Ops poses considerable challenges to developers, especially because they lack basic operation and maintenance knowledge and troubleshooting ability. Is DevOps simply handing Ops to Dev? Obviously not. That would not only hurt development, it would also be inefficient, with 10,000 people doing what 1,000 people used to do.
So we can’t let DevOps become a burden. DevOps robots are on the way.
The DevOps robot
DevOps robots are data-based, proactive service robots that are ready at any time to help developers fill gaps in knowledge and solve problems, and we have a mechanism that keeps knowledge acquisition running in a closed loop.
First, where does the data come from? Common sources are build errors, machine error logs, deployment-specific and container-specific errors, and so on. We collect these and store them in the data platform together with user actions, such as code change diffs and configuration changes.
Once we have data, we can use machine learning to analyze correlations and, together with the knowledge base accumulated from experts, tool operators and ordinary developers, form rules.
When the same problem arises again, the system automatically recommends relevant solutions to help developers solve it, and collects feedback to train the model.
Through our data accumulation and the construction of products such as Problem Square, we form a closed loop of problem occurrence → solution push → user feedback → knowledge contribution. At present, with only this simple accumulation, the matching rate has reached more than 70%. We believe that DevOps robots will become more intelligent through closed-loop knowledge training in the future.
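The closed loop can be sketched with a deliberately simple stand-in for the matching step. The talk describes machine learning over collected data; the sketch below swaps in plain keyword overlap so that it stays self-contained, and the `Rule` class, the knowledge base entries and the functions `recommend` and `feedback` are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Rule:
    keywords: set[str]
    solution: str
    helpful: int = 0      # feedback counters, updated by users
    unhelpful: int = 0


# Hypothetical knowledge base entries contributed by experts and developers.
KNOWLEDGE_BASE = [
    Rule({"outofmemoryerror", "heap"}, "Raise the JVM heap size in the deployment template."),
    Rule({"connection", "refused"}, "Check that the downstream service is up in this environment."),
]


def tokenize(text):
    """Lower-case the log and split it into alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def recommend(error_log, rules):
    """Return the rule sharing the most keywords with the log, or None if none match."""
    words = tokenize(error_log)
    best = max(rules, key=lambda r: len(r.keywords & words), default=None)
    if best is None or not (best.keywords & words):
        return None
    return best


def feedback(rule, solved):
    """Record whether the pushed solution helped, closing the loop."""
    if solved:
        rule.helpful += 1
    else:
        rule.unhelpful += 1


hit = recommend("java.lang.OutOfMemoryError: Java heap space", KNOWLEDGE_BASE)
feedback(hit, solved=True)
```

The feedback counters stand in for the training signal mentioned in the talk: a real system would feed them back into the model, and unmatched logs would become candidates for new knowledge contributions.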
About the author
Xin Chen (Shenxiu) is responsible for building the Ali Cloud Cloud Effect continuous delivery platform and R&D tools, and is committed to the research and exploration of enterprise R&D efficiency, product quality and DevOps. Xin Chen led the big data test team, the test tool R&D team and the continuous delivery platform team at Ali for 6 years, and has deep insights into R&D collaboration, testing, delivery, and operation and maintenance.