Summary: In 2022, how close is your team to continuous deployment? Continuous deployment is a term we hear a lot, but what exactly is it, and how can we achieve it? This article breaks down the concept and the implementation path of continuous deployment, layer by layer.
Planning & Editor | Ya Chun
In the cloud era, software is predominantly released in the form of services, which provides a practical basis for continuous release. A release presupposes that the release artifact has been deployed to production, so the premise of continuous release is continuous deployment.
Four requirements for continuous deployment
Continuous deployment requires that the system keep providing a stable, predictable service. Some release processes involve an outage window during which the system is unavailable; that is a non-continuous deployment mode.
We expect continuous deployment to meet four requirements:
First, it should be accurate — the deployment result should be accurate and predictable;
Second, it should be reliable — the online service is not affected at any point during the deployment process;
Third, it should be sustainable — there is a continuous supply of software increments that can be deployed;
Fourth, it should be low cost — the deployment process is cheap and efficient.
How do you do these four things?
1. Accurate and predictable deployment results
Accurate deployment depends on three prerequisites: clear release artifacts and configurations, a clear operating environment, and a clear release process and release strategy.
Here is a simple publishing example:
A release starts with an explicit image, the build artifact produced upstream. It also carries a number of configurations, such as the startup configuration and the container configuration. Then there is the environment: we configure the Kubernetes cluster in the deployment tool, and that configuration eventually forms the environment used throughout the DevOps process. Finally, we deploy the artifact and configurations into this environment, and the release is complete.
So releasing is the process of applying a collection of artifacts and configurations to a collection of environments. There must first be a clear artifact to release and a clear operating environment; once the artifact, configuration, and environment are all described explicitly, they form the release content, and we can move on to the next step.
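As a concrete sketch of such a description (the names, image tag, and namespace here are hypothetical, not taken from the article), the artifact, its configuration, and the target environment can all be captured in a single Kubernetes manifest:

```yaml
# Hypothetical release content: artifact (image) + configuration + environment (namespace).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service
  namespace: production                # the environment this release targets
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      containers:
        - name: demo-service
          image: registry.example.com/demo-service:1.4.2   # explicit build artifact
          envFrom:
            - configMapRef:
                name: demo-service-config                  # startup configuration
```

Once everything the release needs is written down explicitly like this, the deployment tool can apply it to the environment and the release content is unambiguous.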
The easiest way to release is kubectl apply, but this approach has several problems.
First, the result is uncertain. After kubectl apply, the Pods may not come up, the Deployment may never become available, the Service may fail, there may be too few Pods after the release, or resources may be unavailable; all of this is unknown. So whether the release succeeded, and how much of it succeeded, is unpredictable.
Second, the state is invisible. A release does not finish in an instant; it is a gradual process. How much has been rolled out, how many problems have occurred, and how much traffic has been switched over are all unknown.
Third, the process is not controllable. With this approach, once the command has been issued it cannot be taken back.
If the new version has a serious bug and all traffic drops to zero, there is no way back, which is very dangerous. So in a real release process we need the ability to intervene: for example, when I see that the release is causing a significant drop in availability, I need to be able to stop it immediately.
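One way to reduce that uncertainty, sketched below with assumed probe settings (the /healthz endpoint and port are hypothetical), is to declare readiness and a progress deadline in the Deployment itself, so the cluster can tell whether the rollout actually succeeded:

```yaml
# Making the result checkable: a Pod only counts as ready when its readiness
# probe passes, and the rollout is marked as failed if it cannot make progress
# within progressDeadlineSeconds.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service
spec:
  progressDeadlineSeconds: 600         # report failure if the rollout stalls for 10 minutes
  replicas: 3
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      containers:
        - name: demo-service
          image: registry.example.com/demo-service:1.4.2
          readinessProbe:
            httpGet:
              path: /healthz           # assumed health-check endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

Running kubectl rollout status deployment/demo-service after the apply then gives an explicit success-or-failure answer instead of a guess. It still does not make the process controllable, which is why the principles below are needed.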
Whatever the deployment approach, we want to minimize its impact on the online service, ideally to the point where the deployment process does not affect it at all. That is our second principle.
2. The deployment process does not affect online services
In order not to affect the online service, there are four requirements:
First, rolling deployment
The vast majority of services are deployed gradually, in a rolling, grayscale fashion, and traffic is only switched over once we are confident there is no problem, so the online service is never interrupted. Rolling can also be too fast, so the interval between batches must be long enough to detect problems and to collect enough data for a judgment (see the rolling-update sketch after this list).
Second, the deployment must be observable
The deployment itself may trigger alarms; for example, it lowers the capacity of some service nodes even though the service as a whole is fine. Deployment and monitoring therefore need to be integrated: first, meaningless alarms must be avoided; second, monitoring must detect deployment problems promptly. How is the service behaving? Is latency increasing? These need to be monitored.
Third, be ready to intervene
During deployment, unexpected problems can appear at any moment. We need intervention measures, such as diverting or switching traffic, to keep a problem from affecting the entire system.
Fourth, you can roll back at any time
If an intervention does not solve the problem quickly, it is time to roll back. Being able to roll back at any time matters because some failures that surface during deployment are very expensive to fix forward; rolling back quickly keeps the service unaffected.
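As a sketch of the rolling-deployment requirement above (the numbers are illustrative assumptions, not recommendations from the article), the Kubernetes rolling update strategy can cap how many Pods are replaced per batch and enforce a minimum soak time before the next batch:

```yaml
# Roll out in small batches: at most one Pod is taken down and one added at a
# time, and each new Pod must stay healthy for two minutes before it counts
# as available and the rollout moves on.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service
spec:
  replicas: 10
  minReadySeconds: 120
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      containers:
        - name: demo-service
          image: registry.example.com/demo-service:1.4.3
```

minReadySeconds only enforces a floor; the observation window between batches still has to be long enough for monitoring to catch problems.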
Examples of common release patterns
Here are some common release strategies.
(1) Grayscale release
The common architecture for grayscale release is shown above. There is a load balancer, and the service behind it is currently at version V1. To release the new version V2, one node can be taken out and switched to V2, so that one-fifth of the traffic goes to V2.
In this case, all of the original Pods belong to Deployment1, a new Pod belongs to Deployment2, and a portion of the traffic is routed from the load balancer through the Service into the new Deployment2.
In some cases, specific traffic, for example 5% of requests or requests whose cookie contains a grey flag, is routed into Deployment2 through an ingress or a service mesh for finer traffic control.
We expect Deployment2 to progressively replace Deployment1, with Deployment1's traffic gradually migrated away and Deployment1 eventually taken offline. Throughout the process users notice nothing, requests are served normally, and all monitoring, whether infrastructure, application, or business monitoring, stays normal. That is the expected result.
The most common approach to grayscale release is to create a new Deployment for the new version of the Pods, run the two Deployments side by side for a period of time, and achieve the grayscale effect by continually adjusting the number of Pods on each side. This is the most common deployment strategy, and its cost is relatively low. The disadvantage is that it cannot do very fine-grained traffic control, but it is worth considering if the service volume is not large.
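For the finer, percentage- or cookie-based routing mentioned above, here is a sketch using the NGINX Ingress controller's canary annotations (the host and Service names are hypothetical, and a service mesh could achieve the same effect):

```yaml
# Canary Ingress: 5% of traffic, plus any request whose "grey" cookie is set
# to "always", is routed to the Service in front of Deployment2; everything
# else continues to hit the V1 Service defined in the main Ingress.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-service-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
    nginx.ingress.kubernetes.io/canary-by-cookie: "grey"
spec:
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-service-v2      # Service selecting Deployment2's Pods
                port:
                  number: 80
```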
This type of release places requirements on the service. First, a given service can have only one release in progress at a time, because traffic has to be switched for verification.
Second, once the release is finished, only one version of the service should remain deployed.
Third, two versions are deployed, and both are serving, throughout the process. Both versions must work correctly, so that business requests are handled properly regardless of which version sits upstream or downstream.
Fourth, the release must not interrupt service. For ordinary short-connection services, make sure a session is not broken or disrupted by the release; for long connections, make sure the connection can be migrated automatically to the new service.
Finally, the release must not cause errors in user requests: there should be a graceful shutdown mechanism so that an instance stops accepting new requests and finishes the ones already in flight before it exits (see the sketch after the next paragraph). Only then is the desired grayscale effect guaranteed.
So a smooth grayscale release places requirements not only on the release tooling and the release strategy, but also on the application itself.
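As a sketch of the graceful shutdown requirement above (the endpoint, port, and timings are assumptions for illustration), a Pod can combine a readiness probe, a preStop hook, and a termination grace period so that an instance is drained before it is stopped:

```yaml
# Graceful shutdown: the preStop sleep gives load balancers and endpoints time
# to stop sending new requests, and terminationGracePeriodSeconds lets
# in-flight requests finish before the container is forcibly killed.
apiVersion: v1
kind: Pod
metadata:
  name: demo-service
  labels:
    app: demo-service
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: demo-service
      image: registry.example.com/demo-service:1.4.3
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 15"]   # wait for traffic to drain
```

Long-connection services additionally need application-level support for migrating or re-establishing connections; the manifest alone cannot provide that.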
Based on this, we have summarized some practical suggestions for grayscale release, for your reference.
First, we recommend that an application stay compatible with its previous version (or versions). How many versions it must be compatible with depends on what is online: sometimes several versions of the application run online at the same time, and they all need to be mutually compatible.
Second, create a new Deployment that provides the same service and perform the grayscale by adjusting the Pod count or the ingress traffic. Traffic-based control is recommended, because it can be tuned very finely.
Third, define the grayscale batches, and the proportion and observation time of each batch. Batches should be spaced far enough apart to identify and handle problems. If the interval is very short, monitoring may not have had time to raise an alarm before the next, larger batch starts, which can be very risky.
Fourth, watch business monitoring data in addition to infrastructure and application monitoring. Monitoring is a broad topic, but from the release point of view the ultimate goal is to avoid business loss. A release may make a business function unavailable or produce errors; worse, it may cause a large shift in a business metric such as user conversion rate or the number of successful logins. Such anomalies should be detected in time and the release suspended immediately.
Fifth, when the release itself is complete, switch the traffic first and observe, rather than rushing to clean up the old Pods, so that a future rollback is more efficient. If the old Pods are still in place, traffic can be switched back to them quickly, shortening the time the online service is affected.
Sixth, record the released version so that rollback is easy (see the sketch after this list). Besides the specific version, we also need to know where it was deployed in order to roll back. If the version is recorded and the compliance checks are well automated, one-click rollback becomes possible.
Seventh, rollback is not the same as releasing again. The rollback strategy differs from the release strategy: it should not use the same small batches as a release. To resolve the problem we should use fewer batches and less time, and roll back quickly.
Finally, if the system supports multiple tenants, it is advisable to isolate traffic and run the grayscale by tenant, which also makes A/B testing particularly convenient.
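As a sketch of recording versions for rollback (the change-cause text and names are illustrative), a Deployment can keep a revision history and annotate each rollout with its reason; kubectl rollout history then shows what went out, and kubectl rollout undo restores the previous revision in one command:

```yaml
# Keep recent ReplicaSets around for rollback and record why this rollout
# happened; the change-cause annotation appears in "kubectl rollout history".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service
  annotations:
    kubernetes.io/change-cause: "release 1.4.3, order-service bugfix"
spec:
  revisionHistoryLimit: 10
  replicas: 3
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      containers:
        - name: demo-service
          image: registry.example.com/demo-service:1.4.3
```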
(2) Blue-green deployment
Another common deployment is the blue-green deployment:
Blue-green deployment is similar to grayscale but requires more resources; whether it is feasible depends on the software's deployment pattern and the amount of machine resources available. Blue-green places lower demands on the software than grayscale: the new version is fully deployed before traffic is cut over, whereas grayscale requires the service to keep working correctly while it is still being deployed. The risk of blue-green is also higher, because once a problem occurs it is global.
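A minimal sketch of the blue-green switch in Kubernetes terms (names and labels are hypothetical): both versions run as separate Deployments, and cutting over is a single change to the Service selector, which is also how you roll back:

```yaml
# The Service currently selects the "blue" (running) version; changing the
# version label to "green" switches all traffic to the new Deployment at once.
apiVersion: v1
kind: Service
metadata:
  name: demo-service
spec:
  selector:
    app: demo-service
    version: blue          # change to "green" to cut over, back to "blue" to roll back
  ports:
    - port: 80
      targetPort: 8080
```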
Keeping online services unaffected is not only a matter of deployment strategy. There are other situations, for example a feature that is only half developed, or a service that must go live together with other services before it can be offered to users as a complete system. In these cases the feature switch pattern is needed.
A feature switch is in essence a special kind of configuration, usually delivered as dynamic configuration. You can keep deploying continuously with the switch off, and turn it on only after the client or front end has been released. Strictly speaking, turning on a feature switch is itself a release, so feature switches also need to be versioned.
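As a minimal sketch (the flag names and values are hypothetical), such a switch could be delivered as a ConfigMap that the service reads as dynamic configuration, with the switch state versioned alongside the code:

```yaml
# Feature switch delivered as configuration: the new checkout flow is already
# deployed but stays disabled until the front end that depends on it ships.
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-service-feature-flags
  labels:
    flags-version: "2022-03-07-01"     # flipping a flag is itself a versioned release
data:
  new-checkout-flow.enabled: "false"
  new-checkout-flow.allowlist: ""      # optional tenant/user allowlist for grayscale
```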
The ultimate goal is that anyone can release software at any time: the service can be released at any moment, anyone can release it with confidence, releasing is simple and requires no special skills, nothing major breaks after launch, and if something does go wrong it can be fixed quickly.
Therefore, our vision is that any application can be launched at any time.
For Alibaba this can be pictured as: no release freeze during Double 11. In fact, during Double 11 there are many emergency releases, and they need very solid technical support to be safe and reliable, because a single problem can become a highly visible, public failure. And the busier the system, the more likely a problem is to trigger an avalanche effect, a chain of failures that brings the whole system down.
3. Software increments for sustainable deployment
The key to continuous deployment is having continuously deployable software increments. In the upper row of the picture, the Mona Lisa is assembled one small piece at a time and only the fifth picture is complete; pictures 1 through 4 are all incomplete. In the lower row, pictures 1 through 5 are each complete, and each one simply adds more detail. What this means is:
(1) A software increment should correspond to a clear requirement value point that can be delivered.
(2) A software increment should be complete and independently releasable.
(3) A software increment should be independently verifiable.
Kent Beck once said:
In other words, integration is extremely important, because most collaboration in software development is about breaking problems down and then merging the results back together.
Integration has three steps: code submission, package deployment, and validation. The steps themselves are very simple.
The purpose of integration is to verify completeness: that the merged code can be built, that the corresponding functional tests can be run, and that risks are identified early. Therefore, integrate early and keep the batches small.
There are two kinds of units: from the deployment perspective, the deployable unit; from the integration perspective, the integrable unit.
The deployable unit spans from releasable to testable and takes the requirement perspective: an increment is a requirement or a feature, something the user can see and use. The integrable unit is a set of buildable pieces that logically build together, pass unit tests, and are verified at the code level; it takes the code perspective.
After code is submitted it is analyzed, then compiled and built. A successful compile is often the first piece of feedback, and the build process helps surface problems introduced while the code was being written. If the build itself is fast, developers are happy to rely on it; when it fails, they know right away what the problem is. Then come unit tests, integration tests, and functional tests, until the software reaches a releasable state. So the first, blue part of the picture is continuous integration, which takes the software to a release-ready state.
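As a sketch of such a pipeline (the article does not name a CI tool; GitHub Actions syntax and make targets are used here purely for illustration):

```yaml
# Continuous integration up to a release-ready state: build first for the
# fastest feedback, then unit tests, then integration and functional tests,
# and finally package the image that later gets deployed.
name: ci
on:
  push:
    branches: [main]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: make build                    # assumed build entry point
      - name: Unit tests
        run: make unit-test
      - name: Integration and functional tests
        run: make integration-test
      - name: Package release candidate
        run: make image VERSION=${{ github.sha }}
```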
4. Low-cost, efficient deployment and release
Once there are artifacts ready to release, how do you deploy and release them at low cost and high efficiency?
Let's start with some common problems. The most common is delayed integration, for example integrating only once a month and submitting changes in large monthly batches. The second is accumulated debt: the trunk is always unstable, full of problems, and never passes the tests.
The third is a lack of test automation: either testing is entirely manual, or automation exists but is too unstable to rely on, so people end up judging by hand whether the tests passed. The fourth is rework: quality problems or defects force release activities to be repeated, wasting a great deal of time.
Another is time-consuming activities, such as manual code checks, manual approval at every stage, and manual quality inspection of every stage, all of which take a lot of time and make the release inefficient. When the software finishes one piece of work and enters a new state, the transition to the next state is judged manually, which takes a long time; with all this waiting for feedback, it is hard to make the whole release efficient.
The figure above shows release statistics for two applications, app A and app B. Each dot represents a release: a green dot is a successful release, a yellow dot is a cancelled release, and a red dot is a failed release. The vertical axis is the time the release took, and the horizontal axis is the day it was completed.
Neither of these applications is doing very well. The first one releases very infrequently, sometimes only once or twice a month, though its success rate is relatively high. The second releases more often, perhaps once every few days, but its failure rate is very high, with far more failures than successes.
So both are problematic, and both take a long time to release, often 24 hours or more. If a release takes more than 8 hours, it cannot be finished within a working day and means overtime. Because releasing is high risk, many companies require someone to watch over the release, and nobody can leave before it is finished; if a release takes more than a day, say 12 hours each, it has to be split between two people.
For example, many teams schedule integration on Tuesday and release on Thursday, because a release is assumed to mean overtime: if the Thursday release cannot be finished it slips to Friday, and a Friday release means working on Saturday. In many cases, even with a Thursday slot, most releases actually go out on Friday night or Saturday, and quite a few on Sunday.
From these two charts we can see that application A releases very infrequently, only once in a long while, and that application B's many release failures carry real risk. When releasing takes a long time and is error-prone, it is difficult to release on demand.
Looking more closely at application B, several consecutive red dots followed closely by a green one usually mean a string of failed releases ending in an urgent fix, which implies the software was likely at risk while the emergency repair was underway. To keep releasing quickly, with high quality and with confidence, the abilities to integrate and to release both have to improve, which is what we turn to next.
The means of achieving fast integration are reducing batch size and keeping the flow smooth. Batch size is correlated with cycle time: small batches have shorter cycle times, large batches usually have long ones, while resource utilization is roughly the same either way. So reducing batch size shortens cycle time, shorter cycle time raises frequency, higher frequency speeds up feedback, and fixes and responses to problems land faster. The second means is keeping the flow smooth: by fixing the problems inside the process, we keep the road clear.
First, reduce batch size. As mentioned earlier for requirement granularity and code granularity, releasable units, testable units, and buildable units should all be as small as possible.
To keep the flow smooth there are many practices, the most common being automation of every kind: test automation, build automation, deployment automation, automation of the whole process, automated state transitions, and so on.
Second, manage anomalies, that is, avoid pile-ups during the release process: fix the anomaly first, then let the flow resume. When there is a problem in the release pipeline, stop first, hold further check-ins, stop triggering new runs, and solve the problem. Get the trunk healthy again before continuing with the rest of the integration work. In some companies, the person who submitted the code is responsible for fixing the problem, and if it cannot be fixed within half an hour or some other time limit, the system automatically backs the change out so that the next person can continue integrating.
Another is reducing dependencies. Too many external dependencies during integration and release also cause congestion, because dependencies mean waiting.
Another is building quality in: ensuring quality upstream makes the downstream faster. If a problem is not solved upstream, the downstream will certainly be blocked by it. We should look for problems as early and as far upstream as possible, and think upstream about how to set the downstream up for success.
The same goes for timely feedback: when a problem occurs, feed it back accurately and promptly to the specific person responsible. Avoid spamming, where too much useless or irrelevant information interrupts developers and erodes trust in the whole feedback mechanism.
Finally, reuse: reuse as much as possible and avoid reinventing the wheel.
These are our four principles and recommended practices for continuous deployment.
This article is the original content of Aliyun and shall not be reproduced without permission.