Summary: A major business or system change often affects the user experience or operation flow of an entire product. To contain the impact, a specific group of users, processes, or documents is selected, and only they flow through the system under the new post-change logic, while all other users stay on the old pre-change logic. This step is the starting point of a grayscale release scheme for an online system.
Source | Alibaba Tech official account
Part 1: The basic concepts of grayscale
1 A typical grayscale scheme
A major business or system change often affects the user experience or operation flow of an entire product. To contain the impact, a specific group of users, processes, or documents is selected, and only they flow through the system under the new post-change logic, while all other users stay on the old pre-change logic. This step is the starting point of a grayscale release scheme for an online system.
After users are divided into two groups according to the chosen rules, we mainly watch the users who hit the grayscale: whether the new logic executes as expected, whether data is produced as expected, and how the system changes as a whole. This is the grayscale observation stage, and online verification is a key step within it.
As users, orders, and other data flowing through the new logic gradually accumulate and demonstrate that the new system is correct and effective, more users should be migrated onto the new logic. This stage is usually called grayscale advancement. Sometimes the system is cut over to full volume immediately after a small-traffic verification; sometimes the volume must be increased gradually. The choice depends on the actual business and the system's capacity.
Finally, all users are brought within the scope of the new logic. At this point we must decide whether to take the grayscale logic and the old business logic offline together, leaving all users on the new logic only; the grayscale is then complete. In some cases, historical data makes a full cutover impossible for a long time, and the business system hosts two sets of logic indefinitely.
2 What problems does grayscale solve
If a change were rolled out to full volume immediately after release, any system, logic, or data problem would be disastrous: no user could create new orders, every new order would carry dirty data, and even pre-change data could be affected.
The grayscale process exists to avoid the biggest risk of a change: global impact. By shrinking the scope of influence and pairing it with online verification, monitoring, and alerting, the blast radius of any problem is kept within a limited range, for example a smaller volume of data to correct or a smaller capital loss.
The rule "no grayscale, no release" in production-safety guidelines embodies exactly this idea of minimizing the impact of problems through grayscale. A problem that would have been caught during grayscale can, without that protection, turn into a serious failure.
3 What risks does grayscale introduce
A grayscale scheme avoids global impact, but does it bring risks of its own? Yes. There is no silver bullet in engineering.
The first question is how to discover problems during the grayscale process.
This resembles ordinary release monitoring and alerting: both rely mainly on building and configuring logs, monitoring, and alert rules. But there are differences. How should alert thresholds be configured to detect anomalies in small traffic? Will the old logic outside the grayscale list trigger the new logic's alerts? Do the upstream and downstream systems affected by the grayscale also have corresponding grayscale monitoring? All of these affect whether grayscale problems can be discovered at all, and how quickly.
In addition, a grayscale system must pay special attention to the risk of capital loss. For any field involving capital loss, reconciliation checks must be in place before going online, or at the very least before the grayscale begins, with full coverage of every field the change introduces or touches: no reconciliation, no release.
During the grayscale process we can also enlist customer service, operations, product, and other colleagues to watch for and handle user feedback promptly, using non-technical means as a backstop and supplement for problem discovery.
The second question is how to control a problem's blast radius during the grayscale.
Grayscale data produced during the process must not contaminate non-grayscale data, and vice versa; the two must be fully isolated.
But the grayscale system must interact with upstream and downstream systems, the grayscale itself must advance, and when problems arise it must support pausing and even the more complex operation of rolling back. The grayscale is therefore a dynamic process overall, and throughout it the isolation between grayscale and non-grayscale data must be strictly maintained; otherwise the blast radius grows, endangering the whole system or even causing a serious failure.
Special attention is needed for the complexity of grayscale pause and grayscale rollback. If the pause mechanism fails to take effect, the problem's impact cannot be contained. Rollback involves halting the grayscale process, modifying existing grayscale data, repairing erroneous data, and more; it is generally the most complex part of the whole grayscale scheme.
Finally, handling problems can itself be complicated.
Production systems rarely have the resources or conditions for A/B testing; both grayscale and non-grayscale data are real business data. When a problem occurs, it cannot be solved by simply deleting the grayscale or dirty data. The volume of data to correct, the correctness of the corrected data, how to identify grayscale users, how to ensure the fix itself is correct, and how to ensure the fix actually repairs the problem data are all hard tasks and latent risks in the repair process.
Summary of this chapter:
A complex grayscale scheme introduces problems and risks of its own; the complexity of the whole system grows multiplicatively, and the quality-assurance plan for the grayscale grows more complex with it. How, then, can these risks be controlled while the project is still delivered with high quality? We often say that good quality is built in, not tested in, and the same applies to complex grayscale systems: a high-quality grayscale scheme requires not only thorough testing but, above all, good design. Production safety and project goals are by no means contradictory; with a well-designed grayscale scheme, you can have your cake and eat it too.
Part 2: The basic problems a grayscale design must solve
1 Selecting the grayscale dimension
Common grayscale rules in production systems include the tail digits of a user ID, the tail digits of a business document ID, whitelists, blacklists, timestamps, and so on.
A whitelist is often used for online smoke testing, for example verifying separately with dedicated test accounts. It cannot expand the grayscale range quickly, so it is unsuitable on its own, but combining it with other rules is recommended to make the grayscale process more flexible.
A blacklist shields special users (users with huge data volumes, key accounts, and the like) to reduce or avoid the grayscale's impact on them; in particular, when a problem appears during the grayscale, it can directly block them from entering the problematic logic.
Using the tail digits of a user ID or business document ID as the grayscale key is the most common approach. Several points deserve attention when choosing such a key.
First, the chosen key should be uniformly distributed, or nearly so, such as the Group's havanaId; otherwise the full user base cannot be brought onto the new logic in batches, and the ability to increase the grayscale gradually is lost. In the extreme case, the whole grayscale capability degenerates into a single global Boolean switch.
The common mistake here is not choosing an entirely wrong key, but assuming an ID is uniformly distributed when it is not. For example, using the last four digits of a user ID as the grayscale key in a unitized application is likely to cause trouble, because the user ID is already used to route traffic between units. ID generation is usually random in itself, but by the time the IDs reach a business system they may already follow a pattern, so such situations must be identified and guarded against.
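As a concrete illustration (not from the original article), here is a minimal Java sketch of a tail-number decision that mixes the ID through a hash first, so that even a patterned ID yields evenly distributed buckets; the 10,000-bucket granularity and the basis-point ratio are illustrative assumptions:

```java
/**
 * A minimal sketch of a tail-number grayscale decision. If the raw ID is not
 * uniformly distributed (e.g. its tail digits already encode unit routing),
 * mixing it through a hash restores an even bucket distribution. The mixing
 * constants are the standard SplitMix64 finalizer; the grayscale ratio
 * (0-10000 basis points) is assumed to come from a configuration switch.
 */
public final class GrayBucket {

    /** SplitMix64 finalizer: turns a patterned 64-bit ID into a well-mixed one. */
    private static long mix(long id) {
        long z = id + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Maps an ID to a stable bucket in [0, 10000). */
    public static int bucketOf(long userId) {
        return (int) Math.floorMod(mix(userId), 10_000L);
    }

    /** Hit the grayscale iff the bucket falls below the configured ratio. */
    public static boolean hitGray(long userId, int ratioBasisPoints) {
        return bucketOf(userId) < ratioBasisPoints;
    }
}
```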
Second, the logic for computing the key should be as simple as possible.
The grayscale key decides, inside the system, whether a request takes the new logic or the old. This conditional check usually recurs and is executed many times, so a particularly complex computation imposes extra overhead on the system. Moreover, a simple key computation also means simple business semantics, which helps both technical and non-technical colleagues across the business chain understand it quickly, speeds up locating and troubleshooting problems, and serves the long-term maintenance of the system.
Third, the actual choice must fit the business.
If you pick a brand-new business field as the grayscale key, do the upstream and downstream systems have to change in step? Do offline data and reports need modification? What if you pick a field that the downstream business never records, or that means nothing to it? These are retrofit costs that reasonable design can avoid.
Therefore, when selecting the grayscale key, prefer a field that already exists, is shared across upstream and downstream systems, and carries business meaning.
2 Simplifying the grayscale logic
The purpose of grayscale logic is merely to split users or documents into two groups, so it not only has no need to be complicated, it should be simplified as far as possible. If the business allows, a single field or a single variable is best.
First, this makes adjusting the grayscale progress, advancing or pausing it, a quick change to a single variable. Otherwise, adjusting several grayscale variables at once invites inconsistent advancement, incomplete coverage, inconsistent grayscale data, and other messy problems. For example, adjusting the user-ID range and the order-creation time at the same time may skip some users, or make the actual grayscale range far larger than expected. This kind of problem is in fact the most common one in real production. Recall: does every grayscale advancement or pause require several people jointly reviewing the grayscale script and repeatedly confirming the release content? Even with such a heavyweight process, it is still not 100% problem-free.
Second, once the grayscale starts, the grayscale data is often messy enough already. If several conditions must be combined to judge a record, that hinders problem localization and can even cause misjudgment. Take the user ID + timestamp example above: data actually produced by a faulty grayscale logic may be misread as data produced by the old logic before the cutover time, and misjudgments born of such complexity seriously slow down the stop-loss and handling of online problems.
Finally, for users or documents eligible for the grayscale, the admission threshold should initially be set appropriately high, which helps quickly exclude most data from the grayscale range. When we decide to drive a change through grayscale, we generally take a pessimistic view of the system and want to keep potential problems from escalating too fast. Letting as little data as possible onto the new logic at the start buys time for manual data verification, for validating monitoring and alerting, for validating reconciliation checks, and so on, and prevents the first wave of grayscale users from immediately turning a glitch into a major incident, which would defeat the purpose of doing grayscale at all.
A brief note: reducing the number of grayscale variables and setting a strict admission threshold do not contradict each other. The former is generally dynamic, a value read from a configuration switch; the latter is generally a fixed condition written statically in the code. For example, a change may use the user ID as its grayscale variable while initially hard-coding a threshold that admits only users above a certain level.
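Continuing the earlier sketch, the following illustrates one dynamic grayscale variable combined with one static entry condition; the config-push hook and the user-level field are illustrative assumptions, not part of the original article:

```java
/**
 * Sketch: one dynamic grayscale variable plus one static entry condition.
 * grayRatioBp is assumed to be pushed by a configuration middleware; the
 * level check is a fixed guard compiled into the code.
 */
public class GrayDecision {
    /** Dynamic: basis points in [0, 10000], adjusted at runtime to advance the grayscale. */
    private volatile int grayRatioBp = 0;

    /** Static entry threshold: only users at or above this level are eligible at all. */
    private static final int MIN_LEVEL = 3;

    public boolean shouldUseNewLogic(long userId, int userLevel) {
        if (userLevel < MIN_LEVEL) {
            return false; // hard-coded guard, not adjustable online
        }
        return GrayBucket.hitGray(userId, grayRatioBp); // the single dynamic variable
    }

    public void onConfigPush(int newRatioBp) {
        this.grayRatioBp = newRatioBp; // the only knob touched when advancing or pausing
    }
}
```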
3 How to initialize grayscale data
It is best for a grayscale to start from zero: no initial data is prepared in advance through data correction or batch jobs; instead, marking is triggered by real business requests, such as a user placing an order. A common practice is that when the data in a business request matches the grayscale rule, a special marker is written onto the corresponding DB record at creation time to indicate the hit. If necessary, a separate new table can be created, and a new record written to it to indicate that the user or document has hit the grayscale.
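A minimal sketch of this "mark on the first real write" practice, assuming a plain JDBC setup and an illustrative orders table with a gray_flag column:

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Sketch: the gray flag is written only when a real business request arrives and matches the rule. */
public class OrderWriter {
    private final GrayDecision grayDecision = new GrayDecision();

    public void createOrder(Connection conn, long userId, int userLevel,
                            long orderId, BigDecimal amount) throws SQLException {
        // evaluate the rule exactly once, at creation time
        boolean gray = grayDecision.shouldUseNewLogic(userId, userLevel);
        String sql = "INSERT INTO orders (order_id, user_id, amount, gray_flag) VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, orderId);
            ps.setLong(2, userId);
            ps.setBigDecimal(3, amount);
            ps.setInt(4, gray ? 1 : 0); // persist the hit: later steps read this flag, never the rule
            ps.executeUpdate();
        }
    }
}
```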
The advantage of this approach is a true zero start with no data-preparation step; the drawback is that the overall grayscale may advance more slowly. Data generated before the launch has already been pinned to the old logic, so if you want to take the grayscale logic offline after reaching full volume, generally you can only wait for that business data to drain away naturally.
A simplified example: orders paid before the grayscale started use the old logic. If that data is left untouched, the grayscale logic and the old logic can only be taken offline after all those orders have been confirmed as received. A real production system also has refunds, billing, and so on, so the waiting period only gets longer.
Some grayscale schemes, however, cannot simply initialize from the data carried in requests and must first initialize the full set of user data once, for example importing the data of online system A into the newly involved system B according to certain rules, as preparation for the grayscale. This has two advantages.
First, in some scenarios it simplifies the grayscale threshold judgment: all data can be treated as satisfying a certain precondition, saving one check. That check is usually a database query, and issuing it for the full business traffic often creates DB performance problems; in a distributed database, the distribution of grayscale data can even create single-database, single-table hot spots (we will not go deeper into DB issues here). In short, this scheme can effectively relieve or even avoid such problems.
Second, it accelerates the overall grayscale progress for the business and shortens the period from grayscale start to full volume. For business reasons we sometimes have to choose this scheme.
But the disadvantages are just as obvious. If the initialization imports table B from table A, the migration logic itself needs extra verification first, and the migration occupies part of the project schedule later on. The design of the migration process must also consider how to keep the data of systems A and B consistent: if new business data appears in system A during the migration, should it be migrated too? Should some records in table A be locked during migration? Should the services backed by table A even be stopped? If we really have to stop the service, that is hardly the Internet way.
4 Maintaining data consistency during the grayscale
The previous section covered the initial stage, but a grayscale usually starts at one business step and then affects the next. For example, the same user matches the grayscale rule at time t and is marked as a hit when the record is written; at time t+1 an operation needs to update that record, but because of a grayscale rollback or some other reason the rule no longer matches. How should this case be judged?
This class of problem is grayscale data consistency, the most central problem in grayscale design.
Principle 1: trust the grayscale marker already persisted.
Writing a record in one step and updating it in a later step is very common in business scenarios. Marking at creation time needs no further explanation; the basic principle for updates is to judge by the grayscale marker already in the data, not by whether the grayscale key currently matches. That is, an update operation always defers to what is found in the DB: if the record is marked as a grayscale hit, execute the new logic; otherwise, treat it as a miss and follow the old logic.
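A sketch of Principle 1 against the same illustrative table: the update path branches only on the persisted flag and never re-evaluates the rule. The two update methods are illustrative stubs:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Sketch of Principle 1: updates branch on the persisted flag, never re-running the rule. */
public class OrderUpdater {

    public void updateOrder(Connection conn, long orderId) throws SQLException {
        String sql = "SELECT gray_flag FROM orders WHERE order_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, orderId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new IllegalStateException("order not found: " + orderId);
                }
                if (rs.getInt("gray_flag") == 1) {
                    updateWithNewLogic(conn, orderId); // created under the new logic
                } else {
                    updateWithOldLogic(conn, orderId); // stays old, even if the key matches today
                }
            }
        }
    }

    private void updateWithNewLogic(Connection conn, long orderId) { /* illustrative stub */ }
    private void updateWithOldLogic(Connection conn, long orderId) { /* illustrative stub */ }
}
```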
Principle 2: during grayscale advancement, data consistency comes first.
When the grayscale advances, more users or documents fall within the matching range, and whether this data may enter the new logic must be considered.
For example, suppose the rule is based on the tail digits of a user's current-month bill ID. Once a bill is marked as a grayscale hit, every later update of that bill in the month must follow the new logic; if the bill missed the grayscale at creation, it keeps the old logic until it is settled and closed.
This principle overlaps with the previous one, but its core concern is the change in key matching between the creation and update stages caused by grayscale advancement. Here, too, the persisted DB marker generally remains the source of truth.
Conversely, data that matched before an advancement must still match after it. This rule sounds self-evident; only when it holds is data consistency preserved and can the change be called an advancement at all. In practice, however, it is sometimes violated by misconfigured grayscale switches and similar mistakes, so some error-proofing design for the configuration items is worth considering.
Also, during advancement, watch the consistency of the switch value across the machines in the cluster. First, make sure the changed switch value is pushed to every machine in the cluster; second, to keep the moment of advancement consistent, an effective timestamp is usually attached to the grayscale switch, avoiding problems caused by push delays.
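A sketch of such an effective-timestamp switch; the value format ("ratioBp@epochMillis") and the push hook are assumptions for illustration:

```java
/**
 * Sketch of a grayscale switch with an effective timestamp. Every machine
 * applies the new ratio only once the agreed moment has passed, so a slow
 * config push does not make part of the cluster advance early.
 */
public class TimedGraySwitch {
    private volatile int currentRatioBp = 0;
    private volatile int pendingRatioBp = 0;
    private volatile long effectiveAtMillis = Long.MAX_VALUE;

    /** Called by the config-push listener on every machine, e.g. with "3000@1718000000000". */
    public void onConfigPush(String value) {
        String[] parts = value.split("@");
        this.pendingRatioBp = Integer.parseInt(parts[0]);
        this.effectiveAtMillis = Long.parseLong(parts[1]);
    }

    /** Read on every request; flips to the pending value at the same wall-clock time everywhere. */
    public int ratioBp() {
        if (System.currentTimeMillis() >= effectiveAtMillis) {
            currentRatioBp = pendingRatioBp; // benign race: all machines converge on the same value
        }
        return currentRatioBp;
    }
}
```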
Principle 3: if the grayscale must advance quickly, consider starting a second grayscale dimension only after the first has fully completed.
As noted above, if a record was not marked at creation, the old logic should be used when it is updated, even if the grayscale key now matches. But this stretches the overall grayscale out over a very long time. For example, a refund can be initiated within 90 days of confirming receipt; must we then wait four months before fully cutting over to the new logic? Can the business tolerate that?
In that example we can proceed as follows: once order creation has reached full grayscale, creation can be regarded as fully cut over to the new logic, and we can then start grayscale marking at payment or at confirmation of receipt. This still honors the principle of advancing only one grayscale variable at a time.
By contrast, here are a few less-than-ideal ways to advance the grayscale quickly.
1. Advancing multiple grayscale dimensions at the same time
This is exactly the kind of design we tried to avoid when simplifying the grayscale logic. Rather than doing this, it is better to finish the first dimension after sufficient verification and only then advance the second.
2. Marking the grayscale at multiple entry points at the same time
This seems to speed up the elimination of unmarked creation records, but when several entry points write the same marker, how do you trace the cause of a problem? Should an update overwrite the marker written at creation? How do you stop all the marking synchronously when the grayscale pauses? In short, this is a scheme of high complexity and heavy verification cost.
3. Manual data correction
If data correction is going to be done anyway, it is better done in one round before the grayscale starts, so that the whole scheme benefits from the outset. Done mid-flight, the cost is higher and the benefit lower; from an ROI perspective it is not worthwhile.
During grayscale design, do not lightly try to overturn these simple principles. The simpler and more fundamental a principle is, the greater its influence; changing it tends to invalidate the entire design built on top of it.
Of course, some businesses only ever create records and never update them. For them the focus is not grayscale advancement but the pause and rollback strategies below.
5 Grayscale pause and grayscale rollback
Grayscale serves production safety, so matching circuit-breaker and rollback mechanisms must be established.
Principle 4: the grayscale process must support an overall pause, that is, a grayscale circuit breaker.
A grayscale circuit breaker does not require correcting data that has already entered the grayscale; it simply stops producing more grayscale data.
If we can simply stop advancing the grayscale, why add a switch like this? Here is an example.
Suppose the tail digits of the user ID are the grayscale key and n users have entered the new logic, when we discover a bottleneck on the DB side that must be fixed. The business-layer application has three options.
First, immediately shrink the grayscale range or roll back the code. This is undesirable: users who have hit the grayscale and entered the new logic usually cannot simply be sent back to the old one.
Second, neither advance the grayscale nor touch the switch, and let the system keep running. This is also risky: only n users have entered the new logic so far, but by total users times grayscale ratio, m more users may yet hit the rule and enter, possibly with m >> n. If the DB problem cannot be fixed before they arrive in force, the whole system faces disastrous consequences.
Third, trip the grayscale circuit breaker. Leave the grayscale switch alone, but stop any new user from matching the rule: the n users already in the new logic stay there, while even if m more users would match later, they cannot enter, so the system holds its current state and keeps running until the DB problem is repaired.
This example is enough to show why the grayscale pause capability must be built.
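A sketch of the circuit breaker layered over the earlier decision sketch; note that already-marked data keeps the new logic (Principle 1) while the fuse only blocks new entrants. The alreadyMarked lookup stands in for the persisted DB flag and is an illustrative assumption:

```java
/** Sketch of a grayscale circuit breaker: tripped means no new gray entrants, existing ones stay. */
public class GrayFuse {
    private volatile boolean tripped = false; // flipped by an emergency switch push

    public void trip()  { tripped = true;  }
    public void reset() { tripped = false; }

    /** Decision at record-creation time. */
    public boolean mayEnterGray(long userId, int userLevel, GrayDecision decision) {
        if (tripped) {
            return false; // hold the line: no new grayscale entrants
        }
        return decision.shouldUseNewLogic(userId, userLevel);
    }

    /** Decision at update time: the persisted mark wins, fuse or no fuse. */
    public boolean useNewLogic(boolean alreadyMarked) {
        return alreadyMarked;
    }
}
```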
Principle 5: only an operable grayscale rollback scheme is a meaningful grayscale rollback scheme.
Generally speaking, we all want a grayscale that is observable and rollback-capable. But for the business it is more important, and safer, to avoid the grayscale data-consistency problems described repeatedly above. When a problem occurs, mechanically executing a rollback can actually widen the impact, while using the pause capability to stop the bleeding quickly and then actively repairing the problem is usually the more appropriate course.
So, back to the rollback scheme itself: under the premise of data consistency and the other principles, can a reasonable rollback scheme be designed?
I believe it should be possible, but regrettably we have never managed it in project practice. Engineering resources are always limited, and it is impossible to pour large amounts of time and energy into a highly complex rollback scheme.
My conclusions about grayscale rollback schemes are therefore somewhat negative: if the complexity of the rollback scheme is hard to control, its correctness will be hard to verify;
a complex design lengthens the development and test cycles, potentially hurting the business even more;
and just as contingency plans must be rehearsed in advance, no one would dare take an unproven rollback scheme online.
Therefore, I recommend designing a full grayscale rollback strategy only for businesses whose model is relatively simple.
Summary of this chapter:
This chapter has discussed grayscale mainly from the technical perspective, which basically meets the design needs of a conventional scheme. But achievable does not mean done well. Beyond technical means, many other kinds of measures can be applied to a grayscale scheme to make it more complete and robust, to reach engineering goals such as observability and measurability, and to build a truly high-quality grayscale design.
Part 3: A more complete grayscale scheme
1 Good testability
We generally consider a finer-grained or more complex grayscale scheme in complex projects, where the technical complexity introduced by the grayscale is layered on top of the project's own business complexity. Conducting a complete test then becomes no small challenge. Be clear that testability must be taken seriously at design time. Make the data flow and state transitions inside the system observable: persist requests, intermediate processing values, switch values, branch outcomes, and other information completely to logs or the DB, especially key facts such as whether the grayscale was hit and which rule decided it. Do not let a complex system become a black box with only initial inputs and final outputs; otherwise joint debugging and testing will cost enormous communication effort, and defects may even be buried where they can never be found.
Keep logs consistent between upstream and downstream. The best code is self-explanatory, and the best logs should be too. If the upstream system uses a grayscale marker, the downstream system should use the same one; if the business semantics change downstream, add a field rather than overwriting or clearing the upstream field of the same name. When debugging or handling problems across several systems or teams, this ensures everyone understands a given tag or concept the same way. For example, if the upstream logs AA=true after matching rule AA, and the downstream derives rule BB from AA, the downstream should keep the AA field and additionally record BB=true.
When persisting data, consider traceability. When data crosses systems, be clear about what the business primary key is in each system; the downstream system should persist the upstream system's primary or unique key, ideally as a flattened dedicated field, and even index it in the DB. The first benefit is easier reconciliation later: the same unique key can be used to find the associated records across upstream and downstream systems. It also serves future extensibility: if systems further downstream need to be connected later, a key propagated unchanged from the top can effectively stitch multiple systems together. A typical example is the unique ID of the trading system being passed through transparently to the quota-detail, bill-detail, refund, and other downstream systems.
These testability-oriented design ideas apply not only to grayscale schemes but equally to schemes without grayscale; and it is not only testers who should identify testability gaps at the design stage, developers too should consciously design for testability.
2 Watching full-link pressure
A system change requires attention to pressure changes on downstream dependencies, and the grayscale design must take this into account as well, especially when the pressure shifts as the grayscale advances.
The typical scenario is that as the grayscale advances, more and more requests flow downstream. This case is easy to understand, so it will not be expanded on here. The main things to determine are whether downstream request volume grows linearly, logarithmically, or exponentially with the advancement (the last being the stuff of terrible failures), and whether the traffic model stays stable after each advancement or keeps changing.
Real business is rarely so simple. In some scenarios the downstream traffic peaks when the grayscale starts and shrinks as it advances. For example, after the grayscale starts every user must query a certain service, while users who match the grayscale can skip the query through some short-circuit path. If assessment uncovers this kind of special case, then beyond the usual pressure evaluation it may be worth adjusting the dependency design to eliminate the counterintuitive behavior.
Many other patterns are possible. Another extreme example: traffic climbs gradually with the advancement, reaches full volume after N days, and stabilizes; but on the 1st of the next month the business data must be regenerated in bulk, and with the grayscale now complete, a huge burst of traffic erupts at once and slams the downstream system. If such extreme scenarios are not identified and prepared for in advance, they cause unexpected, serious failures.
Besides downstream business systems, watch for bottlenecks on the DB side. Business applications can usually scale out a cluster quickly to absorb heavy traffic, but expanding a DB is more complex: it may involve data migration, locking, index rebuilds, and other high-risk operations, and a misstep can even affect other businesses sharing the same database or tables. When such a risk is identified, contact the DBA early to agree on a reasonable expansion plan, and reserve enough time to finish the expansion before the grayscale starts.
All pressure assessments can of course be verified with load tests. But treat the load test as acceptance of the design, not as the way to discover problems. Spotting a systemic problem on the eve of launch is usually too late, and the cost of forcing the launch or delaying it can be enormous.
3 Grayscale progress and monitoring
First, monitoring is the most important means of observation in the early grayscale stages. Building complete, comprehensive monitoring matters at three moments: right after the code goes online, when the grayscale opens, and while the grayscale scales up. Right after launch, focus on whether the old business logic still runs normally under the new code. When the grayscale opens, observe when the first data matching the new logic appears and which business branches it enters. While the grayscale scales, observe whether the traffic changes track the switch adjustments, and whether the error volume stays low or climbs linearly or even faster. Much of this monitoring should be kept after full volume is reached, and some of it deserves matching alert rules.
Second, reconciliation checks for a grayscale scheme differ somewhat from the usual setup. Normally the upstream system's table A serves as the left table (the data source of the check) and the downstream system's table B as the right table. During the grayscale, however, the upstream splits the data in two, writing to table B on a hit and skipping it on a miss, so the check must be reversed: treat the records written to table B after a grayscale hit as the left table and check them against upstream table A, making sure all grayscale data still holds the correct relationship with the upstream.
That leaves one more problem: if a user should have hit the grayscale but was never written to table B, how do we find it? One solution I propose is to still build the A-to-B check, but add an equivalent of the grayscale rule as a condition inside the check rule, and modify it as the grayscale advances. The check rules become very complex this way, and it places higher demands on how the persisted fields are designed.
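As a sketch of such a rule, assuming the online grayscale rule can be mirrored by the tail-bucket predicate from Part 2 and that both sides' keys are available via extracts; the record shape is illustrative:

```java
import java.util.List;
import java.util.Set;

/** Sketch of the forward check: data that should have hit the grayscale but never reached table B. */
public class GrayChecker {

    /** Illustrative upstream row: just the two keys the check needs. */
    public record UpstreamRecord(long userId, long orderId) {}

    public List<UpstreamRecord> findMissedGrayWrites(List<UpstreamRecord> upstreamRows,
                                                     Set<Long> downstreamOrderIds,
                                                     int currentRatioBp) {
        return upstreamRows.stream()
                // equivalent condition of the online rule, advanced in step with the grayscale
                .filter(r -> GrayBucket.hitGray(r.userId(), currentRatioBp))
                // should exist downstream but does not: a missed grayscale write
                .filter(r -> !downstreamOrderIds.contains(r.orderId()))
                .toList();
    }
}
```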
Finally, a small report can be built for the whole grayscale scheme for quick lookups, judging from the persisted results whether a specific user or record has matched the grayscale. Furthermore, aggregated grayscale statistics can be shown in the report, to judge whether the data distribution matches the intended pace of advancement and whether the next step should be accelerated or postponed. Such data both makes engineers more efficient when answering questions or troubleshooting, and gives business colleagues a global or local picture of the grayscale for better business decisions.
4 Emergency policies and repair measures
During grayscale advancement we receive feedback from the system and from users through many channels, including but not limited to monitoring, reconciliation, and user inquiries. When unexpected data or scenarios, or outright serious problems, are found, the standard move is to stop the bleeding first and repair afterwards. Common stop-loss actions include turning off the business-logic switch, or taking the new logic offline or rolling back its code; the incorrect data is then corrected in the repair phase.
But back to the original question: why do we do grayscale at all? Its value is precisely that it contains the blast radius in the early stage of a problem. If we only ever execute the generic playbook mechanically, why design a complex grayscale scheme?
For example, the stop-loss switch in a grayscale scheme can be designed to take the new logic fully offline, or merely to stop producing new grayscale users, as illustrated in the previous chapter. When there are multiple grayscale dimensions, separate controls can be designed to adjust dimension A and dimension B independently toward the same ends. What must be clear is that the semantics of the stop-loss switch should be as simple and unambiguous as possible, because the whole point of stopping the bleeding is that it can be executed immediately, without complex deliberation, in a very short time.
Detailed judgment and analysis belong to the subsequent repair phase. As the previous chapter argued, a rollback in the general sense can cause bigger problems, so one option is to fix the code logic or data and then resume advancing the grayscale. Rolling back the grayscale range is also an option, for example revoking users or documents that have already hit the grayscale and flipping them to a miss. Beyond the data-consistency and system-complexity issues already discussed, this kind of rollback must also be weighed from the business side (is it forward compatible?) and from the product side (is the user experience preserved after rollback?); these are important inputs to the decision of whether to build such a complex rollback scheme at all. For example, if a user gains a new feature the day they hit the grayscale and loses it the day after a rollback, that will very likely trigger inquiries or even complaints.
Production safety deserves a prominent place, but a project never has a single goal. This part of the grayscale design is necessarily a trade-off: do not act blindly from the standpoint of system stability alone, but weigh the design, development, testing, and operations costs against the impact on product and user experience in light of the actual business.
5 The end point of a grayscale scheme
By now we have covered most of the hard parts and the likely failure scenarios. Now let us discuss what needs attention when the grayscale advances smoothly.
First, after the grayscale completes, take the grayscale switch offline. The most obvious benefit is simpler code: once the grayscale is finished, its code is essentially dead business code. The old-logic code can be taken offline at the same time, so that everything executes the new logic directly, which also makes the code easier for other colleagues to read and maintain. This step is not mandatory, however, and may be subject to constraints.
The end goal of a grayscale is of course to switch fully to the new logic, but this can take a long time. A business example: if a far-month billing feature starts its grayscale on some day in May, some users will already have produced non-grayscale August bills, and by expectation the full bill population can only hit the grayscale in September at the earliest. In such cases, communicate thoroughly with the business side, which may not tolerate so long a cycle. Besides pushing the grayscale switch to full volume, the cycle can be compressed by accelerating the data-level advancement through data correction and similar means.
Reality can be messier still. Continuing the example: if some August bills become overdue, full coverage is still not reached in September. Transaction systems have similar cases in payment, confirmation of receipt, and refunds, where each cycle can be very long; stack dispute flows on top and the cycle becomes unbounded, so handling such long-tail data perfectly in the design is nearly impossible. Generally, once the whole approaches full volume, there will always be some abnormal or outlier data. From an engineering standpoint, as long as the non-grayscale data converges, the result is as expected and acceptable.
6 The cost of grayscale
We have said repeatedly that grayscale design requires trade-offs, especially around the most complex parts. But is the easy part free? No. Implementing a grayscale scheme in a project always incurs a cost, and every design should be weighed by its cost-benefit ratio to decide whether to keep or discard it.
The first cost is added complexity, mentioned many times already. We can usually afford this on complex projects, where the inherent complexity is so high that a little more costs little at the margin. For simple projects, though, we should ask whether grayscale is needed at all; or, where production-safety rules mandate it, a simple project should be matched with the most simplified grayscale scheme, to avoid shooting mosquitoes with a cannon.
The second cost is release delay. Design, development, and verification all need extra effort and schedule to ensure the grayscale logic is correct and effective, and time to fix grayscale problems will very probably be needed too. Properly estimating this extra work at the start of a project is also hard, because grayscale logic is usually orthogonal to business logic: counting test cases alone, each added grayscale switch theoretically doubles the number of related functional cases. Some cases can be excluded based on the actual business, but this growth trend is not a good sign for the project as a whole.
Then there is the lengthened project cycle, here meaning the period from release to full grayscale. As the earlier case shows, a project that starts its grayscale in May may not finish until September or later, which sounds unacceptable. In the extreme, a drawn-out grayscale can affect the design and launch of the next project, or even spread the impact downstream. If that happens, the design has already failed.
Finally, consider carefully whether to skip grayscale altogether. Skipping grayscale technically breaks the rules, and we generally advise against it. But engineering always has exceptions: some business scenarios cannot be grayscaled, or are genuinely better off without it. If you decide to skip the grayscale scheme, first sort out the problems grayscale would bring and the benefits of skipping it, and fully assess the risk of giving it up, so that the rest of the project team can understand and endorse the decision.
Summary of this chapter:
The quality of a project or product is never tested in; it is built in during design. I hope the design methods and ideas in the two chapters above give you more input and inspiration for designing and building more stable, robust projects in the future. Additions and corrections to this article are welcome.
The next chapter discusses how to guarantee the correctness of a complex grayscale scheme from the testing perspective.
Part 4: Quality assurance for grayscale schemes
The previous chapter focused on the design of the grayscale scheme. But the correctness and stability of a system rest not only on sound design; they also need comprehensive, well-reasoned testing. This chapter examines the quality-assurance system of a grayscale scheme in detail and lists the key points of test coverage.
1 Basic grayscale logic
This is the most basic test point: how data is split into the two groups. If the preset conditions are met, the grayscale is hit; otherwise it is not.
Grayscale matching results must be not only predictable but also stable. With the same data and configuration, a request must not hit the grayscale once and miss it the next time; otherwise serious problems follow.
For example, user A hits the grayscale on the first request and the matching result is persisted, but the judgment condition is accidentally perturbed, so a later request from user A fails to hit again. Low-level defects of this type must be found early, or they will block all subsequent testing.
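A determinism check of this kind is cheap to automate; here is a sketch assuming JUnit 5 and the tail-bucket decision from Part 2:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

/**
 * Sketch of a stability test for the grayscale decision: with the data and
 * configuration fixed, repeated evaluation must always return the same result.
 */
class GrayStabilityTest {

    @Test
    void sameInputAlwaysGivesSameResult() {
        int ratioBp = 3000; // configuration frozen for the test
        for (long userId = 1; userId <= 10_000; userId++) {
            boolean first = GrayBucket.hitGray(userId, ratioBp);
            for (int i = 0; i < 5; i++) {
                assertEquals(first, GrayBucket.hitGray(userId, ratioBp),
                        "grayscale decision flapped for userId=" + userId);
            }
        }
    }
}
```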
2 Persistence after a grayscale hit
Data that matches the grayscale sometimes needs to be persisted to the database. Tests should verify not only the grayscale marker but also the newly added fields; if a new table is written after a hit, every field of it should be checked completely as well.
If the persisted data is associated with upstream data, check that the records are consistent. Where feasible, push the developers to store the upstream key as a flattened field, which makes building upstream-downstream reconciliation easier. If fields such as the document number must be unique, add idempotency tests to prevent the same grayscale data from being written multiple times.
Besides outcome data, process data is critical too. The grayscale judgment may pass through multiple conditions, so the input value and outcome of each condition should be printed to the log for joint debugging and later troubleshooting. Also check that variable names in the logs are unique and their values correct, to prevent semantically confusing, useless log lines.
3 Grayscale compatibility
Since requests during the grayscale are split in two, the system must be able to handle both kinds; that is, before full grayscale, the old logic must remain available.
The old logic inside the new code is, in essence, no longer identical to the previous version's logic: the previous version executed the old logic directly with no grayscale judgment, whereas the new version adds a layer of judgment first. That layer may sometimes attach various tags to the parameters before they flow into downstream modules, with knock-on complications. For example, if the system adds an attribute to data that misses the grayscale, but a downstream process refuses to handle any traffic carrying extra marks, the old logic is broken in exactly this case.
If the grayscale spans multiple applications, inter-application compatibility must be considered too. Common test points include:
whether the grayscale affects the process interaction between upstream and downstream, for example when a downstream application hits the grayscale but application A does not, whether application A's monitoring and reconciliation are affected;
whether the grayscale introduces new downstream dependencies, removes or weakens existing ones, whether the strong/weak dependency design is reasonable, what the concrete dependencies look like, and whether circular dependencies or data-flow loops are introduced;
if both upstream and downstream applications are modified, whether releasing the downstream first will affect the not-yet-released upstream.
4 Grayscale advancement
A grayscale goes from zero, to partial coverage, to full coverage, and the advancement process itself needs focused testing.
First come the two endpoints: with the switch configured for all-old logic and for all-new logic, do requests produce the expected results?
Second, during advancement, if user A missed the grayscale on the previous request but, thanks to the expanded range, hits it on the next, is the request handled normally? Can user A be brought into, or excluded from, the scope of the new logic as expected?
Finally, evaluate the compatibility problems the advancement itself may cause. The concern here is dynamically evaluating internal-logic compatibility as the grayscale switch changes, which the static compatibility tests above may not cover; careful analysis of the actual business and design is needed to flush out the possible, deeply hidden, convoluted defects. For example: user A's first request this month misses the grayscale, so a record without the mark is written, meaning user A should stay off the grayscale for the month. On user A's second request, the query for the unmarked record times out; meanwhile the advancement has made user A match the rule, so a second, marked record is written, leaving two records with contradictory business semantics in the database.
5 Grayscale pause, or grayscale circuit breaking
As repeatedly noted above, the circuit-breaker capability is crucial to a grayscale scheme; at critical moments it is even the system's only escape route, so it deserves special attention.
First, once the breaker is tripped, ensure no new grayscale traffic enters. This has two meanings: on one hand, data that has never matched the grayscale must not start matching it; on the other, for data that has already matched, whether it continues to match depends on whether the grayscale system can roll back and whether it is forward compatible.
Second, with the breaker tripped, ensure the remaining grayscale logic is unaffected; this is also part of the basic logic test.
Tests of such emergency mechanisms must also be designed around real business scenarios: when several other business-logic switches coexist, do we test all switch combinations, prefer the combinations the business actually uses, or test only the limited set of combinations that can occur in emergency scenarios?
6 Grayscale rollback
As argued above, we generally do not recommend introducing complex rollback logic into grayscale schemes that involve grayscale data consistency. Undeniably, though, rollback is still valuable in some scenarios, and its quality must then be guaranteed by testing.
First, no new grayscale-matching data may be produced during the rollback; the point to protect is the same as when the circuit breaker is tripped.
Second, ensure the consistency of already-matched grayscale data during the rollback. The key scenario is how the system treats data that matched the grayscale in an earlier business step but no longer matches in the next. For example, an order matched the grayscale and was marked at creation, but misses the grayscale at the payment stage, so the mark must be removed at that point.
The rollback behavior of the grayscale switch itself should also be tested. If the switch has multiple dimensions or constraints, the case combinations become very complex, but they resemble the advancement tests above, which can serve as a reference.
Finally, a grayscale rollback generally has to change already-persisted grayscale data through data correction. This involves no code path to test, so consider guarding it with reconciliation rules instead.
7 Fault tolerance for abnormal configuration
Grayscale logic usually depends on a switch or a Diamond configuration item, which can itself introduce errors. If the system is a barrel, configuration items are often the shortest stave; the design should be optimized so that configuration mistakes cannot snowball into worse problems.
First, if a pushed grayscale switch value is malformed and the application cannot accept it, the application should keep using the last correct configuration. The middleware layer will faithfully persist whatever wrong value was entered into the Diamond configuration item, but the application can report an error at that point and discard the bad value the middleware delivers.
In addition, if the business allows, the application can fall back to a default value whenever an erroneous value is received.
A typical example: the previous configuration covers users with tail numbers 00 and 01, while the next version covers only tail number 0, which keeps the tail-00 users but silently drops the tail-01 users. If that configuration takes effect, data consistency for tail-01 users is broken. If instead the new configuration is validated, the validation finds that tail-01 users could hit the grayscale before the change but not after, and the problem is avoided.
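A sketch of such a validation, assuming the configuration is a set of tail-digit strings; it rejects any push under which a previously matching tail would stop matching, keeping the last good value instead:

```java
import java.util.Set;

/**
 * Sketch of an error-proofing check on the config-push path: a new tail-digit
 * set is accepted only if it covers every tail that already matched, so an
 * "advance" can never silently shrink the grayscale population
 * (e.g. "00,01" -> "0" dropping users whose IDs end in 01).
 */
public class TailConfigGuard {
    private volatile Set<String> activeTails = Set.of();

    public synchronized boolean onConfigPush(Set<String> newTails) {
        for (String old : activeTails) {
            // every ID matching an old tail must still match some new tail,
            // which holds iff the old tail string itself ends with a new tail
            boolean stillCovered = newTails.stream().anyMatch(old::endsWith);
            if (!stillCovered) {
                return false; // reject and alarm; the application keeps using activeTails
            }
        }
        activeTails = newTails;
        return true;
    }
}
```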
This is both exception logic to be considered in test design and a poka-yoke (error-proofing) mechanism to be considered in solution design.
8 Fault tolerance and alarms for abnormal data
If a new field is found missing during the grayscale but can be filled in by some backfill mechanism, the best treatment is silent handling that tolerates the bad data. For example, a user should have been marked with the grayscale when browsing goods but is found unmarked at the subsequent purchase; the user can simply be marked again there.
But if a core dependent field runs into a data-consistency error, processing should stop immediately. For example, an order carrying the grayscale-hit mark is found, at confirmation of receipt, to be missing a key new field that should have been written at the payment stage. Nothing should be processed at that point; instead, record error logs, throw exceptions, trigger external monitoring alarms, and wait for manual intervention.
Fault-injection facilities can simplify these test scenarios: deliberately break the consistency of the grayscale data and check whether the system's handling of the abnormal data matches expectations. The correctness of this behavior pays off greatly in complex situations such as grayscale rollback, for example compensating when the first request misses the grayscale and the second hits, or triggering an alarm and prompting another round of correction when a rollback's data correction is incomplete.
9 Impact on external systems
Beyond the data flow inside the business system, the impact on external systems must sometimes be considered. For example, a node sends a message to external systems on execution, and several downstream listeners of external business parties must run their own logic on receipt; or, most commonly, persisted data is periodically written to an offline data table.
Impacts on external systems should be synchronized to the downstream business parties at key milestones, such as the start of the change and the confirmation of the design, and the changes required downstream should be assessed. Run effective end-to-end tests and acceptance in the pre-release environment, and if necessary build separate monitoring or reconciliation for the data the new logic produces. During advancement, keep the downstream informed of the grayscale rhythm so they can observe whether their monitoring moves as expected.
The impact on offline tables is twofold: first, new check rules should be built for the changed parts; second, assess whether the change affects the check rules already built on those offline tables and whether it will cause false alarms or missed checks.
Gray flow model analysis
The flow model of gray scale process changes dynamically. First, there are some differences with the previous version of flow model before gray scale advance. Then with the advance of gray scale, the flow model will gradually change; Finally, the gray scale is stable.
In the stage after the changes go online and before the gray start, there is generally not much difference with the previous version of the service or DB dependency, otherwise these changes should also be included in the gray process. In this stage, we mainly need to evaluate the service invocation and new DB fields to determine whether there is complex calculation logic or impact on DB read and write.
By contrast, the gray-advance stage offers more points to analyze. As gray advances, traffic shifts simultaneously at several points: the query interface of the gray-judgment logic, the two sets of business-logic interfaces that requests are split between according to the gray-match result, the DB writes, and the flows of other dependencies and downstream parties. These points need to be sorted out one by one, and the possible consequences of each traffic change analyzed.
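For orientation, here is a minimal sketch (with hypothetical names) of those traffic points: the gray-judgment query adds one read per request, and the split shifts traffic between the new and old interfaces as gray advances.

```java
// Sketch of the gray split: every request pays for a gray-rule lookup,
// then flows into exactly one of the two business-logic paths.
public class GrayRouter {

    interface GrayRuleService {
        boolean hit(String userId); // 1. gray-judgment query: extra RPC/DB read
    }

    private final GrayRuleService rules;

    GrayRouter(GrayRuleService rules) {
        this.rules = rules;
    }

    void handle(String userId, Object order) {
        if (rules.hit(userId)) {
            processNew(order); // 2a. new logic: traffic grows with each advance
        } else {
            processOld(order); // 2b. old logic: traffic shrinks symmetrically
        }
    }

    void processNew(Object order) { /* 3. writes new DB fields/tables */ }

    void processOld(Object order) { /* legacy write path */ }
}
```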
Here are a few common scenarios in which performance problems emerge as pressure rises (a toy sketch after the list illustrates the sharding hotspot):
- Downstream rate limiting is triggered, raising the system's service failure rate.
- Downstream service RT grows, so the system's calls time out and the failure rate rises.
- The key used for gray matching is chosen poorly, so that after database- and table-sharding rules are applied, traffic concentrates into hotspots on a single database.
- A gray-advance step is too large, producing too many write requests in a short time and causing traffic or performance jitter across the whole database.
- The DB index added for the gray field does not fit the traffic model during the gray advance, so DB performance falls short of expectations.
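The toy program below illustrates the sharding-hotspot bullet: if gray is advanced by the rule "user ID ends in 00" while orders are sharded by `userId % 16`, all gray writes land on only 4 of the 16 shards (all rules and numbers here are invented).

```java
// Toy demonstration of a gray key interacting badly with a shard key.
public class GrayShardSkewDemo {
    public static void main(String[] args) {
        int shards = 16;
        long[] load = new long[shards];
        for (long userId = 0; userId < 1_000_000; userId++) {
            if (userId % 100 == 0) {             // gray rule: ID ends in 00
                load[(int) (userId % shards)]++; // shard rule: userId % 16
            }
        }
        for (int i = 0; i < shards; i++) {
            System.out.printf("shard %2d: %d gray writes%n", i, load[i]);
        }
        // Only shards 0, 4, 8 and 12 receive traffic, because
        // gcd(100, 16) = 4: the gray key never reaches the other shards.
    }
}
```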
After gray reaches full volume, the traffic model reaches, or gradually converges to, a new steady state. Besides continuing to observe the points watched during the gray advance, the post-full-volume switch actions should be considered, such as short-circuiting the gray-judgment logic to save a query, or moving the gray-condition query from one interface to another with better performance. In short, this phase mostly leaves room for performance optimization and is unlikely to make overall performance worse; beyond verifying basic functional correctness, such optimizations need little extra attention.
11 Pressure test the gray scale system
As the scenarios in the previous section suggest, pressure bottlenecks usually appear at the newly added service interfaces and DBs, and they must be analyzed against the specific business. Analysis alone is not a panacea, however: before a new interface or new database table goes live, a round of pressure testing should be run at the planned traffic level, to catch hidden defects the analysis may have missed.
The pressure-test traffic should be set based on the current online service's call volume; a reasonable target is the projected traffic at full gray magnified by 1.2 to 2 times. Amplification serves two purposes: coping with peak traffic, and exposing problems quickly. A common problem is traffic multiplication downstream. For example, suppose one request invokes a certain interface twice. At low traffic, the multiple is not obvious and may be mistaken for jitter caused by growth of real traffic over the same period, so the problem goes unnoticed; at high traffic, the multiple becomes immediately apparent.
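A back-of-envelope sketch of this sizing rule, with invented figures; it also shows how a known downstream fan-out (here, two calls per request) translates into the downstream load to expect during the test.

```java
// Rough pressure-test sizing: full-gray peak x 1.2..2, plus downstream fan-out.
public class PressurePlan {
    public static void main(String[] args) {
        double fullGrayPeakQps = 3_000;      // projected QPS at 100% gray (invented)
        double low = fullGrayPeakQps * 1.2;  // lower bound of test load
        double high = fullGrayPeakQps * 2.0; // upper bound of test load
        int fanOut = 2;                      // e.g. one request -> two downstream calls
        System.out.printf("test at %.0f..%.0f QPS; expect %.0f..%.0f downstream QPS%n",
                low, high, low * fanOut, high * fanOut);
    }
}
```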
If an online cluster must be used for pressure testing after release, and the test traffic is high, the isolation of shadow data from real data must also be considered. Shadow pressure-test requests need to exercise the new logic at full gray, yet gray cannot simply be opened for real online requests. In this case, an extra pressure-test switch is added to the code: at the entry point, the request's pressure-test marker field is inspected to decide whether to run the gray logic.
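A minimal sketch of such an entry-point switch, with hypothetical field and class names: shadow (pressure-test) requests are forced into the new logic by a dedicated switch, while real requests keep following the normal gray rules.

```java
// Entry-point routing: shadow traffic gets full gray, real traffic does not.
public class GrayEntryFilter {

    interface Request {
        boolean isShadowTraffic(); // marker field set by the test platform
        String userId();
    }

    static class GrayRules {
        static boolean hit(String userId) { return false; /* normal gray rule */ }
    }

    private volatile boolean forceGrayForShadow = true; // pressure-test switch

    boolean shouldUseNewLogic(Request req) {
        if (req.isShadowTraffic()) {
            return forceGrayForShadow; // full gray, but only for shadow requests
        }
        return GrayRules.hit(req.userId()); // real traffic: unchanged judgment
    }
}
```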
12 Establish check rules for gray scale
To ensure that problems in online data are discovered promptly after the new project goes live, all relevant check rules should be online no later than the start of gray.
Gray projects usually split traffic into gray-hit and gray-miss parts, and the choice of left table deserves thought when building checks. Take the system's entire request set as the full set and the gray-hit part as a subset: the data in the gray-hit subset must maintain a fixed relationship with the data of the full set. The converse does not hold, because the full set still contains gray-miss data, which cannot be expected to stay consistent with the gray-hit subset.
For example, suppose a grayscale system sits downstream and must reconcile with the upstream system to ensure that every upstream request is processed correctly. Here the gray-hit table should serve as the left table and the upstream request table as the right table when establishing the check.
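Expressed as a query, such a check is a left join driven by the gray-hit table. Below is a plain-JDBC sketch with hypothetical table and column names; in practice a non-zero count would be routed to the monitoring system rather than printed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Reconciliation sketch: every gray-hit record must have an upstream match.
public class GrayReconciliation {
    public static void main(String[] args) throws Exception {
        String sql =
                "SELECT COUNT(*) AS missing "
              + "FROM gray_hit_order g "       // left table: gray-hit subset
              + "LEFT JOIN upstream_request u "
              + "  ON g.request_id = u.request_id "
              + "WHERE u.request_id IS NULL";  // gray row without upstream match
        try (Connection conn = DriverManager.getConnection(args[0]);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            long missing = rs.getLong("missing");
            if (missing > 0) {
                // placeholder for a real alarm hook
                System.err.println("check failed: " + missing + " unmatched gray rows");
            }
        }
    }
}
```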
In addition, a check should be established for the gray-miss part to ensure its consistency. There are two approaches. If gray matching only adds new persisted tables without touching the original persistence logic, a full check can first be run against the old logic: after gray starts, hit or miss, the consistency constraints of the old logic must still hold. If, instead, data moves out of the old table after a gray hit and only the new table is written, then the upstream/downstream and subset/full-set relationships of the gray-miss part must be analyzed, and the subset chosen as the left table for the check, mirroring the treatment of the gray-hit part.
The principles above help verify that the data stays consistent whether gray is hit or missed, but they cannot confirm that the hit/miss decision itself is correct. That is mainly guaranteed by basic functional testing. As a supplement, a conditional clause equivalent to the gray rule can be embedded in the check rules and updated in step with each gray advance. This technique only suits real-time or near-real-time checks and may not apply to offline data checks, because historical data in an offline table may have followed a gray rule different from the current one. If necessary, the offline table can be queried manually on a one-off basis and the result judged against the gray-switch operation records.
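As a sketch of embedding the gray rule into a near-real-time check, assume a hypothetical rule of the form `user_id mod 100 < N`: rows inside the rule must carry the gray mark, and the constant must be updated with every gray advance.

```java
// Check-rule sketch with the current gray rule embedded as a condition.
public class GrayRuleCheck {

    // Must be updated in step with each gray advance.
    static final int GRAY_PERCENT = 20;

    public static void main(String[] args) {
        String sql =
                "SELECT COUNT(*) AS bad FROM order_main o "
              + "WHERE MOD(o.user_id, 100) < " + GRAY_PERCENT
              + "  AND o.gray_flag <> 1"; // in-scope rows must be gray-marked
        // Hand the statement to the check platform or scheduler.
        System.out.println(sql);
    }
}
```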
Finally, the timeliness of the grayscale system's check rules should be improved where feasible: in terms of problem-discovery efficiency, real-time checks beat offline, next-day checks by a wide margin. In the early stage of gray, problems are both more likely to surface and cheaper to fix, but only if they are found in time.
Conclusion
The quality-assurance strategy of a grayscale scheme must match its design strategy: a complex grayscale system design calls for a correspondingly thorough grayscale testing scheme. Returning to the purpose of gray itself, it exists to serve production safety, so comprehensive test coverage of the gray system is the bottom line of the bottom line and must be a focus of the testing work. Readers are welcome to share more of their own experience with gray-scale quality assurance.
This article is the original content of Aliyun and shall not be reproduced without permission.