The author | | ali heavy silver source technology to the public
Preface
DevOps aims for shorter iterations and more frequent releases. But the more often you release, the more likely you are to introduce failures. More failures reduce service availability, which in turn hurts the customer experience. To safeguard service quality at the release stage, the last gate before users, Alibaba has gradually developed release strategies that meet the requirements of DevOps.
Before discussing Alibaba's practices, let's take a quick look at some common release strategies, the scenarios they apply to, and their pros and cons.
Common release strategies
1 Downtime release
A downtime release shuts down the service before the release, blocks user access, and then upgrades all services at once. This strategy usually goes with infrequent releases and requires thorough testing beforehand.
Characteristics of a downtime release:
- All components that need to be upgraded are consolidated into one release
- Most applications in a project will be updated
- The development and testing process before release often takes a long time
- If there is a problem at release time, the cost of fixing and rolling back is high
- It takes a long time and many teams to complete a downtime release
- Often, client and server upgrades need to be synchronized
Downtime releases are not suitable for Internet companies: the gap between releases is too long, features take too long to get from development to market, and the company reacts slowly to the market, which puts it at a disadvantage in a fully competitive market. Each release also causes financial losses because of the downtime.
Advantages:
- Simple; there is little need to consider compatibility between coexisting old and new versions
Disadvantages:
- The service is unavailable during the release
- It can only be done during off-peak business hours (often at night) and requires many teams to work together
- It is difficult to roll back after a failure
Applicable scenarios:
- Development and test environments
- Non-critical applications with little user impact
- Scenarios where compatibility is difficult to control
2 Canary release
The term canary release dates back to the early 20th century. Before going down a mine, British coal miners would take a caged canary with them. If the mine contained a high concentration of toxic gas such as carbon monoxide, the canary, being more sensitive than humans, would react before the miners were affected; once the canary was poisoned, the miners knew to evacuate immediately. A canary release delivers the new version of the software to a subset of users before it is released to everyone and tests it with real customer traffic, to make sure the software has no serious problems and to reduce the release risk.
In practice, a canary release usually goes to a small percentage of machines, say 2% of servers, for validation with real traffic, and the quick feedback from those machines decides whether to expand the release or roll it back. Canary releases are usually combined with a monitoring system that watches metrics to observe the health of the canary machines. If the canary test passes, all remaining machines are upgraded to the new version; otherwise the code is rolled back.
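The canary flow described above can be sketched roughly as follows. This is a minimal illustration, not Alibaba's implementation: the function names, the error-rate metric, and the tolerance value are all assumptions made for the example.

```python
def pick_canary_hosts(hosts, percent=2):
    """Return the canary group: `percent` of the hosts, rounded up, at least one."""
    count = max(1, (len(hosts) * percent + 99) // 100)  # integer ceiling
    return hosts[:count]


def canary_decision(canary_error_rate, baseline_error_rate, tolerance=0.01):
    """Decide what to do after observing the canary machines.

    Hypothetical rule: promote when the canary error rate is not more than
    `tolerance` above the error rate of the machines still on the old version.
    """
    if canary_error_rate <= baseline_error_rate + tolerance:
        return "promote"   # upgrade all remaining machines
    return "rollback"      # roll the canary machines back


hosts = [f"host-{i}" for i in range(100)]
canary = pick_canary_hosts(hosts)          # 2 of 100 hosts
decision = canary_decision(0.002, 0.001)   # canary looks healthy
```

A real system would look at several metrics over a time window rather than a single number, but the shape of the decision is the same.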
Advantages:
- The impact on user experience is minimal; only a small number of users are affected during the canary release
- Release safety is better protected
Disadvantages:
- The canary machines are few in number, so some problems may not surface
Applicable scenarios:
- Monitoring is complete and integrated with the release system
3 Grayscale/rolling release
Grayscale release is an extension of canary release: the release is divided into stages/batches, and the number of users increases with each stage/batch. If no problem is found with the new version in the current stage, the release moves on to the next stage with more users, until it reaches all users.
Grayscale release reduces release risk and is a zero-downtime strategy. It moves gradually from one version to the other by shifting the routing weight between the versions that are online. The whole release can take a long time, and old and new code coexist during that period, so compatibility between versions must be considered during development: the coexistence of old and new code must not affect functionality or user experience. When the new version has problems, grayscale release allows a quick rollback to the old version.
Combined with feature toggles and similar techniques, grayscale release can support more complex and flexible release strategies.
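As a rough sketch of the staged rollout described above, the following splits the machines into batches of increasing size and stops at the first unhealthy stage. The cumulative percentages and the `deploy` and `healthy` callbacks are hypothetical placeholders for a real release system and monitoring check.

```python
def plan_batches(hosts, cumulative_percents=(10, 30, 60, 100)):
    """Split hosts into batches so that each stage covers a larger share.

    `cumulative_percents` is an assumed schedule; the last entry should be
    100 so that every host is eventually released.
    """
    batches, done = [], 0
    for percent in cumulative_percents:
        target = len(hosts) * percent // 100
        if target > done:
            batches.append(hosts[done:target])
            done = target
    return batches


def rollout(batches, deploy, healthy):
    """Release batch by batch; stop and report a rollback on the first bad stage."""
    released = []
    for batch in batches:
        for host in batch:
            deploy(host)
        released.extend(batch)
        if not healthy(released):
            return "rolled_back", released
    return "completed", released


hosts = [f"host-{i}" for i in range(20)]
sizes = [len(b) for b in plan_batches(hosts)]   # [2, 4, 6, 8]
```

Because old and new versions run side by side between stages, the `healthy` check is also the natural place to verify that the two versions remain compatible.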
Advantages:
- The impact on user experience is relatively small, and no downtime is needed for the release
- Release risk can be controlled
Disadvantages:
- The release takes a long time
- A sophisticated release system and load balancers are required
- Compatibility needs to be considered while old and new versions coexist
Applicable scenarios:
- Production environments with high availability requirements
4 Blue-green release
A blue-green deployment uses two identical, independent production environments, one called "blue" and one called "green." The green environment is the production environment that users are currently using. To deploy a new version, you deploy it to the blue environment and run smoke tests there to check that it works properly. If the tests pass, the routing configuration is updated to divert user traffic from the green environment to the blue environment, and the blue environment becomes the production environment. The switch usually takes less than a second. If a problem occurs, the route is switched back to the green environment, and the problem is debugged in the blue environment. A blue-green deployment can therefore make the new version available to all users with a single switch, and the new features become visible to all users immediately.
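The switch-and-rollback behaviour can be pictured with a small model. This is only a sketch: `BlueGreenRouter` and the `smoke_test` callback are names invented for illustration, and a real switch would update a load balancer or routing configuration rather than an in-memory field.

```python
class BlueGreenRouter:
    """Tracks which of two identical environments receives user traffic."""

    def __init__(self, live="green", idle="blue"):
        self.live = live   # environment currently serving users
        self.idle = idle   # environment that receives the new version

    def release(self, smoke_test):
        """Switch traffic to the idle environment only if its smoke test passes."""
        if not smoke_test(self.idle):
            return False
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self):
        """Send traffic back to the previous environment."""
        self.live, self.idle = self.idle, self.live


router = BlueGreenRouter()
switched = router.release(lambda env: True)   # smoke test passes on "blue"
```

The single swap is what makes both the upgrade and the rollback fast, and it is also why a faulty release reaches every user at once.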
Advantages:
- Upgrade, switchover, and rollback are very fast
- Zero downtime
Disadvantages:
- The switchover is all at once, so a faulty release has a relatively large impact on users
- Twice as many machine resources are required
- The middleware and applications must support traffic switchover to the hot standby cluster
Applicable scenarios:
- Machine resources are in surplus or can be allocated on demand (backed by a cloud vendor)
5 A/B testing
A/B testing closely resembles grayscale release; the two differ in purpose. A/B testing focuses on making a decision based on the difference between version A and version B and ultimately choosing one version to deploy. Compared with grayscale release, A/B testing is more decision oriented, and it is more flexible in weighting and traffic switching than canary release.
For example, suppose a feature has two implementations, A and B. Through fine-grained traffic control, 50% of users are always directed to implementation A and the remaining 50% are always directed to implementation B. The conversion rates of the two implementations are then compared, and the implementation with the higher conversion rate is chosen as the final version of the feature.
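One common way to keep each user on the same implementation is to hash the user ID into a bucket. The sketch below assumes that approach; the experiment name, the 50/50 split, and the conversion numbers are illustrative, and the article does not say how Alibaba assigns users.

```python
import hashlib


def assign_variant(user_id, experiment="feature-x"):
    """Deterministically map a user to "A" or "B" with a 50/50 split.

    Hashing (experiment, user_id) means the same user always sees the same
    implementation for this experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "A" if bucket < 50 else "B"


def pick_winner(conversions, visitors):
    """Return the variant with the higher conversion rate."""
    rates = {variant: conversions[variant] / visitors[variant] for variant in visitors}
    return max(rates, key=rates.get)


winner = pick_winner({"A": 30, "B": 45}, {"A": 1000, "B": 1000})
```

Comparing raw rates is the simplest possible decision rule; in practice a significance test would normally be applied before declaring a winner.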
Advantages:
- Supports rapid experimentation
- The impact on user experience is small
- Tests can use production traffic
- Tests can target specific users
Disadvantages:
- Sophisticated traffic identification and control capabilities are required
- There are complex compatibility issues to consider
Applicable scenarios:
- Business exploration and innovation testing
- Choosing among multiple options
6 Traffic isolation environment release
The release strategies above all treat a single application as the release unit. But a functional module is often made up of multiple applications that provide the service together. Even when the application being released is faulty, the fault may not show up in that application itself; in complex cases it only surfaces later in downstream applications. Detecting such problems without affecting the user experience is very important. In addition, we sometimes want a new version of the code to affect only a small number of users once it goes live. Traditional grayscale release, however, cannot identify business traffic, so even if only one machine of an application has a problem, it may affect all users.
In the grayscale release on the left side of the figure below, every machine in App1 has some probability of routing to the faulty App2 machine (shown in red). In the isolated-environment release on the right, the new version of the code is first released into a full-link isolated environment, so even if the release has a problem, only a small number of users are affected.
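The difference between the two routing models can be sketched with a traffic tag that every hop on the link honours. Everything in this example is an assumption made for illustration: the instance names, the `baseline` and `isolated` tags, and the fallback to baseline instances for an application that has no isolated instances.

```python
# Hypothetical topology: each application has baseline instances on the old
# version and, if it is part of this release, isolated instances on the new one.
TOPOLOGY = {
    "App1": {"baseline": ["app1-v1-a", "app1-v1-b"], "isolated": ["app1-v2-iso"]},
    "App2": {"baseline": ["app2-v1-a", "app2-v1-b"], "isolated": ["app2-v2-iso"]},
    "App3": {"baseline": ["app3-v1-a"]},  # not part of this release
}


def pick_pool(app, tag):
    """Instances allowed to serve a request carrying `tag` at this hop."""
    pools = TOPOLOGY[app]
    return pools.get(tag, pools["baseline"])  # assumed fallback to baseline


def trace(apps, tag):
    """Candidate instances at each hop of a call chain for one request."""
    return {app: pick_pool(app, tag) for app in apps}


isolated_path = trace(["App1", "App2", "App3"], "isolated")
baseline_path = trace(["App1", "App2", "App3"], "baseline")
```

Because the tag travels with the request, ordinary user traffic never reaches an isolated instance, which is what limits the impact of a bad release to the small tagged group.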
Advantages:
- Can detect complex problems that involve multiple applications
- When a failure occurs, only a small number of users are affected
Disadvantages:
- The traffic isolation environment needs its own monitoring
- The system design is complex: the middleware and every application on the link must be able to identify the traffic
Applicable scenarios:
- Core production business scenarios
Alibaba's release best practices
We will walk through the release process to introduce Alibaba's release best practices.
1 Release Plan
Verify the functionality thoroughly before the release, and think about how to stop the bleeding if the release introduces a failure. It is therefore important to write a release plan beforehand. A typical release plan looks like this:
- Participants of this release
  - Developer
  - Tester
  - Code reviewer
- Release content
- Testing process
- Risk description
- Online verification plan
- Mitigation plan for problems that appear online
- Release steps
  - Release in x batches
  - Pause x hours after the first x batches are released
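A plan like the one above can also be kept as a structured record, so that a release tool can refuse to start while sections are still empty. The following is only a sketch; the class and field names are invented for illustration and simply mirror the checklist.

```python
from dataclasses import dataclass, field


@dataclass
class ReleasePlan:
    participants: dict = field(default_factory=dict)  # role -> person
    content: str = ""
    testing_process: str = ""
    risks: str = ""
    online_verification: str = ""
    mitigation: str = ""
    batches: int = 0
    pause_hours_after_first_batches: float = 0.0      # optional, may stay 0

    # Sections that must be filled in before the release may start.
    REQUIRED = ("participants", "content", "testing_process", "risks",
                "online_verification", "mitigation", "batches")

    def missing_sections(self):
        """Names of required sections that are still empty."""
        return [name for name in self.REQUIRED if not getattr(self, name)]


draft = ReleasePlan(content="Order service: new coupon rule")
todo = draft.missing_sections()
```

`REQUIRED` has no type annotation, so the dataclass treats it as a plain class attribute rather than a field.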
2 Different environments use different release strategies
Each of the release strategies described above has its own advantages and disadvantages. Choose the strategy that fits the characteristics and requirements of your scenario.
Generally speaking, the test environment is used for preliminary functional testing, so code is updated and released there frequently. If grayscale release were used with a large number of batches, development efficiency would drop sharply. In this case a single-batch downtime release, on one machine or several, is actually a good choice.
For the pre-release environment, you need to consider not only your own testing needs but also those of upstream and downstream developers, so a single-batch downtime release is no longer appropriate; releasing in two batches works instead.
For the online environment, you can release to the traffic isolation environment first, and then release to the online environment in batches.
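The per-environment choices above could be written down as a small policy table. This is a hypothetical sketch: the environment keys and field names are invented, and the number of production batches (5) is an assumed value, since the article does not give one.

```python
RELEASE_POLICY = {
    # Test: single-batch downtime release for fast iteration.
    "test": {"strategy": "downtime", "batches": 1, "isolated_first": False},
    # Pre-release: two batches so upstream/downstream testing is not interrupted.
    "pre-release": {"strategy": "rolling", "batches": 2, "isolated_first": False},
    # Production: isolation environment first, then batches (5 is assumed).
    "production": {"strategy": "rolling", "batches": 5, "isolated_first": True},
}


def release_steps(env):
    """Expand the policy for one environment into an ordered list of steps."""
    policy = RELEASE_POLICY[env]
    steps = []
    if policy["isolated_first"]:
        steps.append("release to traffic isolation environment")
    for i in range(1, policy["batches"] + 1):
        steps.append(f"{policy['strategy']} batch {i}/{policy['batches']}")
    return steps


production_steps = release_steps("production")
```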
3 Watch monitoring and alerts during the release
A release strategy alone cannot prevent failures; it is important to watch the application's monitoring data carefully during and after the release. Core metrics such as QPS, RT, success rate, and error count help users detect faults as early as possible. In addition, when each batch in a production release contains only a few machines, an anomaly in a metric on the released machines can be drowned out in the overall monitoring data because its data volume is small. It is therefore important to configure independent monitoring for the released machines.
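A small calculation shows how an anomaly on a few released machines can disappear in the aggregate. All numbers here are made up for illustration: 100 machines, 2 of them released, and an assumed 1% alert threshold on error rate.

```python
def error_rate(machines):
    """Error rate over a group of machines (errors / requests)."""
    requests = sum(m["requests"] for m in machines)
    errors = sum(m["errors"] for m in machines)
    return errors / requests if requests else 0.0


# 2 released machines failing 20% of requests, 98 healthy ones failing 0.1%.
released = [{"requests": 1000, "errors": 200} for _ in range(2)]
unreleased = [{"requests": 1000, "errors": 1} for _ in range(98)]

ALERT_THRESHOLD = 0.01                            # assumed alert level: 1%
overall = error_rate(released + unreleased)       # 498 / 100000
released_only = error_rate(released)              # 400 / 2000
```

In this example the overall error rate is below 0.5% and stays under the alert threshold, while the released machines alone show a 20% error rate, which is why they need their own monitoring view.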
4 Canary release and unattended monitoring
The vast majority of applications inside Alibaba are deployed across multiple data centers/units. The same code and configuration may work normally in some data centers/units but fail in others. When releasing in batches, the first batch therefore needs to cover all data centers/units, so that problems are exposed as early as possible. In addition, developers tend to pay attention only to the first few batches of a release and may not respond quickly if problems occur in later batches.
Unitization is intended to solve disaster recovery and scalability problems. The figure above shows Alibaba's unitized deployment architecture.
Applications also generally have many monitoring items. When the release cycle is long, developers cannot be expected to watch every monitoring item all the time. Some intelligent assistance is needed to point them to the monitoring items that deserve attention.
To solve these two problems, Alibaba designed and implemented its own canary release strategy. A canary release first releases to 10% of the application's machines in each data center/unit. The unattended intelligent monitoring system sets up independent monitoring for these machines. For each monitoring item, it compares the metric data of released and unreleased machines and also compares the data from before and after the release. If it finds an anomaly, it notifies the developers for further judgment.
This canary release strategy helps developers find problems as early as possible, reduces their workload, and improves development efficiency.
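The two ideas in this section, a first batch that covers every unit and an automatic comparison of monitoring data, might be sketched as below. This is not the actual unattended system: the function names, the 50% threshold, and the rule that both comparisons must deviate before a metric is flagged are assumptions chosen for the example.

```python
def first_batch(hosts_by_unit, percent=10):
    """Pick `percent` of the hosts in every unit (rounded up, at least one)."""
    batch = {}
    for unit, hosts in hosts_by_unit.items():
        count = max(1, (len(hosts) * percent + 99) // 100)
        batch[unit] = hosts[:count]
    return batch


def relative_change(new, old):
    """Relative difference of `new` against `old`."""
    if old == 0:
        return float("inf") if new > 0 else 0.0
    return (new - old) / old


def find_anomalies(released_now, unreleased_now, released_before, threshold=0.5):
    """Flag metrics that deviate from both unreleased machines and the pre-release data.

    Requiring both comparisons to exceed the threshold (an assumed rule)
    avoids flagging changes that affect every machine, such as a traffic peak.
    """
    anomalies = []
    for metric, value in released_now.items():
        vs_peers = relative_change(value, unreleased_now[metric])
        vs_past = relative_change(value, released_before[metric])
        if abs(vs_peers) > threshold and abs(vs_past) > threshold:
            anomalies.append(metric)
    return anomalies


flagged = find_anomalies(
    released_now={"rt_ms": 120, "error_count": 40},
    unreleased_now={"rt_ms": 100, "error_count": 5},
    released_before={"rt_ms": 105, "error_count": 4},
)
```

In the example, response time moved by about 20% and is left alone, while the error count is several times higher than both references and is reported for a developer to judge.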
Continuous integration and release
By choosing the right release strategy and following the best practices described above, release risk can be kept very low, even lower than the risk of a downtime release. In fact, a short release cycle with only a small amount of code per release is good practice. When deployment intervals are long, each deployment contains more code changes, which leads to more defects and outages. People then tend to add more reviews to reduce release risk, which in practice does little to reduce risk and mainly makes deployments take much longer. This is a reinforcing loop that keeps getting worse, and we need to break the vicious cycle through frequent, continuous deployment.
Summary
Agile development shortens time to market, lets customers get the features they want sooner, and lets product teams collect customer feedback and iterate on the product faster. To address the release risk that frequent releases bring under agile development, this article introduced a variety of release strategies, together with their advantages, disadvantages, and applicable scenarios. Applying these strategies in combination across different scenarios makes it possible to deliver high-quality products faster.
This article is the original content of Aliyun and shall not be reproduced without permission.