One, preface
1.1 Start with two stories
Story 1: On May 27, 2015, a Lanxiang excavator cut an optical cable in Xiaoshan District, Hangzhou, and Alipay went down. Users could not pay even after logging in repeatedly, Taobao sellers could not collect money, and offline merchants could not take payments by scanning QR codes. “# Alipay exploded” became a trending term on Weibo. Before this, no one believed that Alipay could crash. When the accident happened, many users called their carriers to complain, preferring to believe that the China Mobile or China Unicom network was down rather than suspect Alipay. After this accident, however, when Alipay went down again, public reaction was much calmer. “Veteran” users seemed especially unfazed: “I guess the cable was cut again.”
Story 2: On July 13, 2021, netizens found that every function of Bilibili’s products seemed to be down. The news quickly became the top trending topic on Weibo. Bilibili did not return to normal until around 2:15 am on July 14. With 400 million users, Bilibili saw its technical strength questioned by a large number of netizens. An article on Zhihu by Bilibili’s technical director Mao Jian, “Bilibili High-Availability Architecture Practice”, was dug up and mocked by everyone as a case of “asking for the hammer and getting it”.
1.2 About me
After the two stories, a few words about myself. I was previously the POC for consumer-side (C-end) marketing and big promotions at Douyin e-commerce, and the chief technical execution PM for Alibaba’s 2020 Goods Festival and New Year’s Goods Festival big promotion. I have six years of back-end development experience in advertising and e-commerce, and I have been through the technical test of scenarios with large data volumes, high concurrency, and large sums of money.
1.3 Topic Selection
The two stories show how inadequate consideration of failure scenarios can damage a company’s reputation. For individual programmers, too, failure-oriented design matters a great deal. Responsibility for a company’s accidents eventually lands on individual programmers; accidents consume the organization’s trust in the individual and directly or indirectly affect that person’s career. At ByteDance the impact on individuals is modest, but at other companies an accident often means a programmer’s “year for nothing.”
What distinguishes programmers with different levels of experience? Besides architectural design, project management, technical planning, and technical leadership, the ability to design for failure is also extremely important.
Newcomers to business development sometimes feel falsely confident that their code is no different from a veteran’s. In fact, the normal flow of business code does not vary much from one author to another; handling exceptions, boundaries, and uncertainty is where a programmer’s real skill shows. Veterans have built up a lot of muscle memory over time. Faced with a specific problem, they immediately think of many failure-oriented design points and write highly available business code. Learning the methodology of failure-oriented design and gradually building your own muscle memory is the best way to turn a novice into a veteran.
With this in mind, I wrote this article to summarize my experience and lessons from the past few years, hoping to encourage more veterans to share their experience so that we can learn from each other and improve together.
Two, the Tao
At the Tao level, I want to talk about the worldview of design for failure.
2.1 Failure is everywhere
In an ideal world, hardware never ages, system software never expires, traffic always stays within the expected range, the code you write is bug-free, and the product manager never changes the requirements. Reality hits back hard: hardware will fail at some point, software will fail at some point, traffic will spike when you least expect it (even in the middle of your own wedding), no programmer writes code without bugs, and product managers change requirements every day, sometimes with requirements that conflict or are logically flawed.
Whether in the era of traditional software, the Internet, or the cloud, systems eventually fail at some point. Failure-oriented design is not about eliminating failure. It is about reducing, or even removing, the impact of failure, and protecting the wallets of both the company and the individual.
2.2 The only constant is change
Not only is failure everywhere, but change is everywhere.
2.2.1 Don’t hard-code – your PM is born to change requirements
“Don’t hard-code | your PM was born to change requirements” is the Feishu profile signature of a product manager I work with, and it struck a chord with me. Always be nervous about hard-coded values. By Murphy’s Law, the more certain you are that a field or feature will never change, the more likely it is to change. So configure more and hard-code less. You will respond to product requirements quickly enough to impress people, and you will also have more levers for fast recovery when a failure happens.
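As a minimal sketch of “configure more, hard-code less”, the snippet below reads a business limit from a key-value configuration and falls back to a safe default when the key is missing or malformed. The `Config` type, the `IntOr` helper, and the key names are hypothetical and only illustrate the idea; a real system would usually load the values from a configuration center.

```go
package main

import (
	"fmt"
	"strconv"
)

// Config is a hypothetical key-value snapshot, for example one loaded
// from a configuration center.
type Config map[string]string

// IntOr returns the integer stored under key, or def when the key is
// missing or malformed, so a bad config push cannot break the caller.
func (c Config) IntOr(key string, def int) int {
	raw, ok := c[key]
	if !ok {
		return def
	}
	v, err := strconv.Atoi(raw)
	if err != nil {
		return def
	}
	return v
}

func main() {
	cfg := Config{"coupon.max_per_user": "5", "coupon.daily_cap": "oops"}
	fmt.Println(cfg.IntOr("coupon.max_per_user", 3)) // configured value is used
	fmt.Println(cfg.IntOr("coupon.daily_cap", 1000)) // malformed value falls back
	fmt.Println(cfg.IntOr("coupon.min_order", 10))   // missing key falls back
}
```

Because the limit lives in configuration, changing it (or reverting it during an incident) does not require a code release.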
2.2.2 Isolate variability – programmers exist to respond to software changes
If system software never changed, would we need design patterns? Would we need object orientation? Wouldn’t one straight-through procedural script be faster and better? But if system software never changed, what would programmers be for? TikTok has become so powerful that it can make ByteDance a lot of money while doing nothing. Can TikTok’s programmers be laid off? It doesn’t seem so.
Design patterns are the sharp weapons our predecessors distilled for dealing with change. The 23 design patterns come down to one phrase: isolate variability. Whether creational, structural, or behavioral, the purpose of each pattern is to cage change inside the pattern.
2.2.3 Regular regression – function deteriorates during evolution
Regular regression testing is also an important principle in dealing with failure. The Internet iterates extremely fast. Traditional software tends to iterate in years, while Internet products iterate in weeks or even days. Every day, the functionality of the system can deteriorate as it evolves. Rapid iteration not only makes business code decay quickly into a mountain of shit, it also makes the internal logic increasingly bloated and even self-contradictory. At some point, code that was once good and bug-free becomes the trigger for an accident.
2.3 Be alert to the world of code
Be wary of the world of code, or one day you will learn the hard way.
2.3.1 Don’t believe the partner’s “nonsense”
Be skeptical of any approach or solution your partner gives you, and don’t trust any claim you have not verified yourself. Practice is the sole criterion for testing truth, and constant skepticism is a core quality of an engineer. Don’t wait until after a failure, when you and your partner are passing the blame back and forth, to regret it. Verifying more in the early stage protects you, protects your partner, and protects the plastic friendship between you.
2.3.2 Don’t trust code comments
A single incorrect code comment took me from Alibaba to ByteDance; I learned this lesson personally.
A wrong comment is worse than no comment. Don’t use wrong comments to dig pits for those who come after you. Save the children.
2.3.3 Do not trust function input
NPE (NullPointerException) is probably the most common error a programmer will encounter in his or her career, which is somewhat confusing because programmers know that function arguments need to be checked from the very first LeetCode problem.
This happens because the scenarios met in an online production environment are far more complex than a coding problem. That is the difference between academia and industry: in academia the problem is fixed, while in industry the problem is uncertain. Even if the upstream caller is a system you consider highly reliable, and even if you have read the surrounding code and concluded that a parameter can never be empty, it is still best to program defensively. A reliable system will occasionally hand you malformed parameters, and code that currently never passes an empty parameter may one day be changed beyond recognition.
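A minimal sketch of the defensive style described above: the function validates its input before using it rather than trusting the caller. The `Order` type, its fields, and the validation rules are hypothetical and chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Order is a hypothetical request object passed in by an upstream caller.
type Order struct {
	UserID int64
	Items  []string
}

// ErrInvalidOrder lets callers tell bad input apart from other failures.
var ErrInvalidOrder = errors.New("invalid order")

// TotalItems checks its argument instead of assuming the upstream caller
// always sends a well-formed order.
func TotalItems(o *Order) (int, error) {
	if o == nil {
		return 0, fmt.Errorf("%w: nil order", ErrInvalidOrder)
	}
	if o.UserID <= 0 {
		return 0, fmt.Errorf("%w: user id %d", ErrInvalidOrder, o.UserID)
	}
	return len(o.Items), nil
}

func main() {
	// A nil pointer is rejected with an error instead of causing a panic.
	_, err := TotalItems(nil)
	fmt.Println(err)

	n, err := TotalItems(&Order{UserID: 42, Items: []string{"a", "b"}})
	fmt.Println(n, err)
}
```

Returning a wrapped sentinel error keeps the failure visible to monitoring while letting the caller decide whether to degrade or reject the request.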
2.3.4 Don’t trust infrastructure
Even Alipay crashes, and even a system with six nines of availability is down for about 31 seconds a year. Don’t trust infrastructure: prepare for disasters and practice chaos engineering, so that you can sleep soundly every night instead of being woken by an alert call.
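The “31 seconds” figure follows from simple arithmetic: six nines means the system may be unavailable for one millionth of the year. The sketch below computes the yearly downtime budget for a given number of nines, assuming a 365-day year; the function name `AllowedDowntime` is only illustrative.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// AllowedDowntime returns the yearly downtime budget for an availability
// with the given number of nines (6 means 99.9999%).
// It assumes a 365-day year.
func AllowedDowntime(nines int) time.Duration {
	year := 365 * 24 * time.Hour
	unavailable := math.Pow(10, -float64(nines))
	return time.Duration(float64(year) * unavailable)
}

func main() {
	for n := 3; n <= 6; n++ {
		fmt.Printf("%d nines: %v per year\n", n, AllowedDowntime(n).Round(time.Millisecond))
	}
}
```

For six nines this gives roughly 31.5 seconds per year, which is the budget the whole stack, including the infrastructure you depend on, has to share.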
2.4 Design Principles
2.4.1 A simple scheme is the most elegant
If you design a technical solution that doesn’t have too many bells and whistles, but conveys a minimalist aesthetic, you may be close to success. Concise solutions represent lower understanding costs, lower maintenance costs, and better scalability.
If your solution is full of bells and whistles and looks complicated and rigorous, you may be on the way to causing headaches for yourself and others, and to a $2,500 monthly salary.
Of course, the most concise solution is not always the most suitable one. For example, a service on the core trading link demands far higher stability than a data-display service, so many high-availability designs are necessarily more complicated. The recommendation is to choose the most concise solution that still meets the stability requirements.
2.4.2 The open-closed principle is the general outline of design patterns
The open-closed principle is the general outline of design patterns, and most design patterns carry its shadow. Software entities should be open to extension and closed to modification, which can be achieved by “constraining through abstraction and encapsulating change”. The open-closed principle gives a software entity adaptability and flexibility together with stability and continuity.
The open-closed principle answers many common design questions:
1) Massive if-else code. A large number of if-else branches is clearly incompatible with the open-closed principle: every new branch modifies the original code structure. Here we can use the factory + strategy design patterns to remove the if-else, confining additions and changes of logic to the subclasses produced by the factory.
2) Lengthy business workflows. Business process code is often very long, and without proper encapsulation it is hard to read and maintain. Consider using the command + chain of responsibility design patterns to encapsulate the workflow. The benefits are that the overall workflow reads very clearly, the main process code can often shrink from hundreds of lines to fewer than 10, and changing the process can be as simple as removing or adding a node in the chain, which minimizes the impact of the change.
3) Changing the type of a historical field. As Internet products evolve, it is often necessary to change the type of an existing field. By the open-closed principle, we should not change the type of the original field but add a new field, so that the impact on upstream and downstream links is minimal.
4) Tampering with object attributes mid-process. Take a real business scenario: for some requests, Douyin Ultra Speed Edition must be handled the same way as Douyin. The simplest approach is to overwrite the APPID of Douyin Ultra Speed Edition with the Douyin APPID, but this violates the open-closed principle. Tampering with an object’s attributes in the middle of the program changes the semantics of the object; one day it will not behave as expected, and many accidents will follow. Instead, pass a new field in the context, so that each downstream step can choose the right field for its purpose without being misled by a field that was modified along the way.
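Point 1) above can be sketched as follows. This is a minimal illustration of factory + strategy, not the author’s production code: the `DiscountStrategy` interface, the tier names, and the discount rules are all hypothetical. A registry map plays the role of the factory, so supporting a new tier means adding one entry and one type rather than editing an if-else chain.

```go
package main

import "fmt"

// DiscountStrategy is the abstraction that stays closed to modification.
type DiscountStrategy interface {
	Apply(priceCents int64) int64
}

// noDiscount leaves the price unchanged.
type noDiscount struct{}

func (noDiscount) Apply(p int64) int64 { return p }

// percentOff reduces the price by a fixed percentage.
type percentOff struct{ percent int64 }

func (d percentOff) Apply(p int64) int64 { return p * (100 - d.percent) / 100 }

// registry acts as the factory: each tier maps to its own strategy.
var registry = map[string]DiscountStrategy{
	"regular": noDiscount{},
	"vip":     percentOff{percent: 10},
}

// StrategyFor returns the strategy for a tier and falls back to
// noDiscount for unknown tiers instead of failing.
func StrategyFor(tier string) DiscountStrategy {
	if s, ok := registry[tier]; ok {
		return s
	}
	return noDiscount{}
}

func main() {
	for _, tier := range []string{"regular", "vip", "unknown"} {
		fmt.Println(tier, StrategyFor(tier).Apply(10000))
	}
}
```

The fallback for unknown tiers is itself a small failure-oriented choice: an unexpected input degrades to the safest behavior instead of crashing the request.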
2.4.3 Laziness is the greatest virtue of programmers
Laziness is the greatest virtue of programmers. Good programmers are often unknown to the public; the programmer who shouts everywhere in the team to show how busy they are is more likely to be a slow poison for the team.
To be lazy, and to lie flat safely while still doing the business well, programmers must master three things: platforms, tools, and automation. Platforms rescue programmers from endless repetitive labor. Tools rescue programmers from manual operations and on-call duty. Automation makes programs run as smoothly as an assembly line and so raises the efficiency of each programmer. How well a programmer wields these three reflects the level that programmer has reached. With platforms, tools, and automation, we can standardize and scale, and help the company and the business keep moving up.
Three, organization and process
In this part, I want to talk about how to design for failure from the perspective of organization and process.
3.1 Organization
3.1.1 Jobs designed for failure
Test engineers, test development engineers, and risk control and security compliance engineers are the development engineer’s most reliable partners, and they are also the roles an enterprise creates to design for failure.
Test engineers are the gatekeepers of software quality. They guard online quality and are responsible for the quality and performance of the developers’ code. Test development engineers do a more technical kind of testing work: besides routine testing, they write testing tools and automation scripts to improve the quality and efficiency of testing. Risk control and anti-cheating engineers are responsible for the health of the business ecosystem; they monitor the business for abnormal behavior and improve the effectiveness of risk control. Security compliance engineers are responsible for information security and provide compliance consulting and information security risk assessment for projects.
3.1.2 Organizational form of failure-oriented design
A production safety team is one organizational form of failure-oriented design. It is usually a horizontal technical team that supports multiple business teams with standard setting and enforcement, production process control, accident review, and similar technical support, and it is responsible for online quality. Each business team often appoints a system stability owner who acts as the interface to the safety team and drives this work within that team’s own systems.
Pair programming is also an organizational form of failure-oriented design. Strict pair programming requires two programmers to work together at one computer: one types the code while the other reviews every line as it is typed. Pair programming helps programmers write shorter programs with better design and fewer defects. It also spreads knowledge, lets both partners progress quickly, and lets a senior programmer consolidate their own knowledge and experience while mentoring a newcomer. In addition, it avoids handover problems when a developer takes leave or quits.
Strict pair programming is extremely rare in the Internet industry, and few teams actually do it, perhaps because managers feel that two people doing the same thing greatly increases the cost of manpower. Still, some ideas from pair programming are worth borrowing. For example, two programmers can be paired as co-owners of a business area, backing each other up and reviewing each other’s code, which gains some of the benefits of pair programming.
3.2 Process
Assuming you don’t design for failure, the software development process might be simplified to code + release. However, the development process of mature enterprises is roughly as follows:
In the requirements stage, compliance, anti-cheating, and security assessments should be carried out up front, so that potential security and compliance risks are eliminated early.
In the coding stage, the technical design should include loss-stopping, degradation, and rollback measures, and a technical review and a security review should be held to evaluate the risks in the design. It is also best to write unit tests, which greatly improve code quality.
In the testing stage, developers self-test first; then test engineers carry out functional testing and security engineers carry out security checks. Because code changes can have side effects, a wider round of regression testing should be run to catch unexpected impact.
In the release stage, use a grayscale release mechanism. Release to a small number of machines first, or only to users in certain regions. After the grayscale release, run grayscale tests to verify that everything works, and then continue releasing in batches until the rollout is complete.
In the verification stage, test engineers run an online regression after the release to confirm that the feature is stable and available in production. For large events, it is often necessary to organize an online rehearsal or a crowd test with internal users. To keep the system from going down under unexpected traffic, run single-link and full-link stress tests. If possible, expose some of the risk ahead of the big event, for example with a small-scale online trial run.
In the operation stage, developers need solid monitoring and alerting, plus offline data reconciliation. To measure the project’s effect, A/B testing can be used to quantify the benefit.
When a fault occurs, recover first: restore service as quickly as possible to minimize online loss, and only then look for the root cause.
After the project ends or the fault is resolved, organize an effective review: summarize the problems in the process, form a concrete improvement plan, and keep following up on its implementation.
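One common way to implement a user-level grayscale release is to hash each user ID into a fixed number of buckets and enable the new version only for the first few buckets. The sketch below assumes this approach; the `InGrayscale` function, the 100-bucket layout, and the FNV hash are illustrative choices, not details from the release process described here.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// InGrayscale reports whether userID falls into the first `percent` of 100
// stable buckets. The same user always lands in the same bucket, so raising
// percent only ever adds users to the grayscale group.
func InGrayscale(userID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(userID)) // Write on a hash.Hash never returns an error
	return h.Sum32()%100 < percent
}

func main() {
	hit := 0
	for i := 0; i < 10000; i++ {
		if InGrayscale(fmt.Sprintf("user-%d", i), 5) {
			hit++
		}
	}
	fmt.Printf("%d of 10000 users are in the 5%% grayscale group\n", hit)
}
```

Because bucket membership is stable, a user who saw the new version at 5% still sees it at 20%, and rolling back is a matter of setting the percentage to 0.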
3.3 Some Views
3.3.1 How important test engineers are
Test engineers are the most important guardians of online quality, and their importance cannot be overstated. A good test engineer can do the following:
- Go beyond black-box testing: read the development code and design test cases from the code
- Design complete test cases that cover all test scenarios
- Write data reconciliation scripts, for both offline and real-time data reconciliation
- Write automated test tools
- Write data consistency monitoring scripts and tools that prevent and control capital loss
3.3.2 Unit testing saves the most time
Writing unit test cases may look time-consuming, but it is actually the approach that saves the most time. Unit tests ensure that the code behaves as we expect, which saves a great deal of rework across releasing, self-testing, joint debugging, and modifying code. In addition, code that can be unit tested tends to have clearer responsibilities, a more reasonable decomposition, and better stability.
3.3.3 Review is a necessary way to align high standards
Review is a necessary way to continually improve the organization and align on high standards. Through a cycle such as PDCA (Plan-Do-Check-Act), continuous improvement of the work accumulates knowledge, which then feeds into the execution of the next plan. As a result, the team executes better and better, and each individual becomes a Better Me.
3.3.4 R&D red line is the programmer’s protection umbrella
The R&D red line is an effective coercive machine for a company’s failure-oriented design. It is made of countless parts (specifications and rules); it is cold and mechanical, it never stops running, and it does not bend to any individual’s will. The red line forces programmers to follow the company’s processes and specifications and warns them not to make stupid mistakes. It looks cold and heartless, but in fact it is a protective umbrella for programmers.
Four, technical details
In this part, I want to talk about the specific technical details of failure-oriented design. There are too many details for the space available, so only the solutions to some classic technical problems are listed here.
4.1 Failure orientation as part of system design
- For unexpected traffic, apply rate limiting, overload protection, and adaptive capacity expansion;
- For timeouts or errors in dependent services, set a timeout on every dependent system, sort all dependencies into strong and weak ones, and degrade non-core dependencies at critical moments;
- For unexpected situations, prepare emergency plans in advance and rehearse them;
- For instantaneous traffic peaks, know the limits of the system, spread the traffic out, and avoid hot keys in the DB and the cache;
- For possible data center failures, build same-city active-active (or multi-active) and cross-region multi-active deployments;
- For human error, reduce manual operations by means of platforms, tools, and automation;
- Avoid single points of failure, and use redundancy to reduce the impact of a local failure on the whole system;
- Be careful with retries, so that a retry storm does not trigger an avalanche;
- Failure can only be reduced, not eliminated, so invest in monitoring and alerting, fault drills, and attack and defense drills to build up emergency response ability.
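The retry item in the list above can be sketched like this. It is one common way to keep retries from piling up on a struggling dependency: cap the number of attempts and wait with exponential backoff plus random jitter. The `Retry` function, the `ErrTransient` error, and the parameter values are hypothetical; the sketch assumes `base` is not negative and omits cancellation for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// ErrTransient stands in for a timeout or error from a dependency.
var ErrTransient = errors.New("transient failure")

// Retry calls fn at most maxAttempts times. Between attempts it sleeps for
// an exponentially growing backoff plus random jitter, so that many clients
// failing at the same moment do not all retry at the same moment.
// base must not be negative.
func Retry(maxAttempts int, base time.Duration, fn func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt == maxAttempts-1 {
			break // no point sleeping after the final attempt
		}
		backoff := base << uint(attempt) // base, 2*base, 4*base, ...
		jitter := time.Duration(rand.Int63n(int64(backoff) + 1))
		time.Sleep(backoff + jitter)
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := Retry(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return ErrTransient // the first two calls fail
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

The hard cap on attempts matters as much as the backoff: without it, every layer that retries multiplies the load on the layer below.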
4.2 Six levels of distributed locking
You only saw the second floor. You thought I was the first floor. Actually, I’m on the fifth floor.
— Dassima of Wuhu
A Redis distributed lock can be implemented at six levels. Let’s see which level the distributed lock we usually use is at.
Distributed lock design principles
- Mutual exclusion. Only one client holds the lock at any one time.
- Deadlock-free. A distributed lock is essentially a lease. If a client fails after acquiring the lock, the lock is released automatically after a period of time, so the resource does not stay locked forever.
- Consistency. External problems such as hardware faults or network anomalies, and internal problems such as slow queries or defects, can trigger a high-availability switchover in Redis, in which a replica is promoted to be the new master. If the business has strict requirements for mutual exclusion, the lock must keep its original state after the switch to the new master.
Level one
redis.SetNX(ctx, key, "1")
defer redis.Del(ctx, key)
Using the SetNX command solves mutual exclusion, but the lock is not deadlock-free: if the client fails before deleting the key, the lock is never released.
Level two
redis.SetNX(ctx, key, "1", expiration)
defer redis.Del(ctx, key)
Use a Lua script (or a single SET command with the NX and EX options) so that SetNX and Expire are applied atomically. The lock is now deadlock-free, but consistency is still not guaranteed.
Level three
redis.SetNX(ctx, key, randomValue, expiration)
defer redis.Del(ctx, key, randomValue)

// The following is the Lua script for Del (compare, then delete)
if redis.call("get",KEYS[1]) == ARGV[1] then
    return redis.call("del",KEYS[1])
else
    return 0
end
Set the lock value to a random number, and delete the lock only if it is the one the current thread/coroutine acquired. This prevents a slow program, whose lock has already expired, from deleting a lock now held by another thread/coroutine, and it achieves a certain degree of consistency.
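To make the reason for the random value concrete, the sketch below simulates the two operations with an in-memory store and a fake clock. It is not a Redis client: `fakeStore`, `SetNX`, and `CompareAndDelete` are hypothetical stand-ins that mirror the semantics of SET with NX and an expiration, and of the get-then-del Lua script.

```go
package main

import (
	"fmt"
	"sync"
)

// entry is a stored lock value with its expiry on a fake clock.
type entry struct {
	token    string
	expireAt int64 // milliseconds on the fake clock
}

// fakeStore is an in-memory stand-in for the two Redis operations the
// lock needs. It is NOT a Redis client.
type fakeStore struct {
	mu   sync.Mutex
	data map[string]entry
}

func newFakeStore() *fakeStore { return &fakeStore{data: map[string]entry{}} }

// SetNX stores token under key only if key is absent or already expired.
func (s *fakeStore) SetNX(key, token string, ttl, now int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.data[key]; ok && now < e.expireAt {
		return false
	}
	s.data[key] = entry{token: token, expireAt: now + ttl}
	return true
}

// CompareAndDelete removes key only if it still holds token, mirroring
// the get-then-del Lua script.
func (s *fakeStore) CompareAndDelete(key, token string, now int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.data[key]
	if !ok || now >= e.expireAt || e.token != token {
		return false
	}
	delete(s.data, key)
	return true
}

func main() {
	s := newFakeStore()
	fmt.Println(s.SetNX("lock", "token-A", 1000, 0))       // A acquires the lock
	fmt.Println(s.SetNX("lock", "token-B", 1000, 2000))    // A's lease expired, B acquires
	fmt.Println(s.CompareAndDelete("lock", "token-A", 2000)) // A's late unlock is refused
	fmt.Println(s.CompareAndDelete("lock", "token-B", 2000)) // B releases its own lock
}
```

With a plain delete instead of the compare-and-delete, the third call would have removed B’s lock and let a third client in while B was still working.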
Level four
func myFunc() (errCode *constant.ErrorCode) {
	errCode = DistributedLock(ctx, key, randomValue, LockTime)
	if errCode != nil {
		return errCode
	}
	// Release only after the lock has actually been acquired
	defer DelDistributedLock(ctx, key, randomValue)
	// doSomeThing
	return nil
}

func DistributedLock(ctx context.Context, key, value string, expiration time.Duration) (errCode *constant.ErrorCode) {
	ok, err := redis.SetNX(ctx, key, value, expiration)
	if err == nil {
		if !ok {
			return constant.ERR_MISSION_GOT_LOCK
		}
		return nil
	}
	// The write may have timed out yet still succeeded: wait, then check
	time.Sleep(DistributedRetryTime)
	v, err := redis.Get(ctx, key)
	if err != nil {
		return constant.ERR_CACHE
	}
	if v == value {
		// Timed out, but the write actually succeeded
		return nil
	} else if v != "" {
		// The lock has been taken by someone else
		return constant.ERR_MISSION_GOT_LOCK
	}
	// The lock has not been taken; try once more
	ok, err = redis.SetNX(ctx, key, value, expiration)
	if err != nil {
		return constant.ERR_CACHE
	}
	if !ok {
		return constant.ERR_MISSION_GOT_LOCK
	}
	return nil
}

// The following is the Lua script for Cad (compare and delete):
// if redis.call("get",KEYS[1]) == ARGV[1] then
//     return redis.call("del",KEYS[1])
// else
//     return 0
// end
func DelDistributedLock(ctx context.Context, key, value string) (errCode *constant.ErrorCode) {
	_, err := redis.Cad(ctx, key, value)
	if err != nil {
		return constant.ERR_CACHE
	}
	return nil
}
This level solves the “timed out but succeeded” problem. A write that times out on the client yet succeeds on the server is a classic problem: it happens only occasionally, but the consequences are catastrophic.
The remaining problems are:
- Single point of failure. With a single master, a master failure breaks the lock; with master and replica, a problem in master-replica replication breaks the lock.
- What if the lock expires before the business process has finished?
Level 5
Start a timer and renew the lock if it is about to expire before the process has finished. Only the lock acquired by the current thread/coroutine may be renewed.
// The following is the Lua script for renewing the lease
if redis.call("get",KEYS[1]) == ARGV[1] then
    return redis.call("expire",KEYS[1], ARGV[2])
else
    return 0
end
redis.Cas(ctx, key, value, value)
This guarantees consistency when the lock expires, but it does not solve the single point problem.
It is also worth stepping back: what if the renewal itself fails? How do we solve the nesting-doll problem of “high availability for the high-availability mechanism that ensures high availability”? The open-source library Redisson uses a watchdog to solve lock renewal to a certain extent. My personal advice, however, is not to renew the lock at all. A more concise and elegant approach is to lengthen the expiration time, because the maximum execution time of the code protected by the lock is controllable (RPC, DB, middleware, and other calls all have timeouts). We can therefore set the lock’s expiration time to be greater than the maximum execution time, which guarantees consistency at lock expiration in a simple and elegant way.
Level 6
Redis replication is asynchronous. If the master fails right after receiving a write and a high-availability switchover occurs, data still in the replication buffer may not have reached the new master (the original replica), resulting in data inconsistency. If the lost data belongs to a distributed lock, the locking mechanism breaks and the business misbehaves. There are two solutions to this problem:
1) Use RedLock. RedLock is a consistency solution proposed by the author of Redis. In essence it is a matter of probability: if the probability that a single master-replica Redis loses a lock during a high-availability switchover is k%, what is the probability that N independent Redis instances lose the lock at the same time? With distributed locks implemented using RedLock, the probability of losing the lock is (k%)^N. Given how stable Redis is, this probability fully meets the needs of the product.
The problems with RedLock are:
- Locking and unlocking latency is high.
- It is difficult to implement on clustered or standard Redis instances (master-replica architecture).
- It consumes too many resources. RedLock requires several independent Redis instances, either cloud instances or self-built ones.
2) Use the WAIT command. The Redis WAIT command blocks the current client until all previous write commands have been synchronized from the master to the specified number of replicas, or until the timeout (in milliseconds) expires. After acquiring the lock, the client calls WAIT and continues only once the data has reached the replica. WAIT returns the number of replicas that acknowledged the writes; if you asked for 1 replica and the result is at least 1, the synchronization succeeded and there is no need to worry about inconsistency. Compared with RedLock, this approach greatly reduces the cost.
4.3 Hotspot inventory deduction
Flash sales (“second kill”) are a very common interview topic. Many interviewers open by asking the candidate to design a flash-sale system, and candidates, being “experienced”, can quickly recite a familiar “standard answer”.
However, a flash sale is a relatively simple hot-spot inventory deduction problem, because the amount of inventory to deduct is small. A more typical hot-spot inventory deduction problem is the red envelope rain during the Spring Festival, when hundreds of millions of people grab red envelopes from the same pool of funds. There are two schemes for the Spring Festival red envelope rain:
Scheme 1
Existing problems:
- Inventory may be consumed unevenly across buckets, so some users fail to deduct inventory while others still can, which leads to user complaints.
Scheme 2
Distribute inventory in small amounts over many rounds to ease the uneven consumption of bucket inventory.
The 2021 Douyin Spring Festival red envelope campaign also used a good technical idea: stagger the times at which users enter, which lowers the instantaneous request peak.
How to design for failure
1) Why use a scheduled task to allocate inventory actively, instead of pulling inventory passively when a bucket runs short?
A: Because the QPS of active allocation is several orders of magnitude lower than that of passive pulling.
2) How to deal with heavy traffic?
A: Keep traffic away from the DB, split inventory into buckets, and scatter the requests.
3) Why not let one fixed master machine maintain the total inventory pool in Redis, instead of using a scheduled task that picks machines at random?
A: To avoid a single point of failure.
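Since the original diagrams are not available, the following is only a rough sketch of the ideas named in the answers above: inventory is split into buckets, each request touches a single bucket, and a scheduler tops up every bucket in small batches. The `Pool` type, the in-process counters, and the modulo routing are assumptions made for illustration; the article’s scheme keeps inventory in Redis, and `Replenish` here assumes a single scheduler goroutine and at least one bucket.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Pool models a total inventory that is handed out to buckets in small
// batches, so that no single counter takes all the traffic.
type Pool struct {
	remaining int64   // undistributed inventory, touched only by the scheduler
	buckets   []int64 // per-bucket inventory, updated atomically
}

// NewPool creates a pool with bucketCount empty buckets (bucketCount > 0).
func NewPool(total int64, bucketCount int) *Pool {
	return &Pool{remaining: total, buckets: make([]int64, bucketCount)}
}

// Replenish moves up to batch units from the pool into every bucket.
// It is meant to be called periodically by one scheduled task.
func (p *Pool) Replenish(batch int64) {
	for i := range p.buckets {
		n := batch
		if n > p.remaining {
			n = p.remaining
		}
		p.remaining -= n
		atomic.AddInt64(&p.buckets[i], n)
	}
}

// Deduct takes one unit from the bucket selected by userID and reports
// whether the deduction succeeded.
func (p *Pool) Deduct(userID uint64) bool {
	b := &p.buckets[userID%uint64(len(p.buckets))]
	for {
		cur := atomic.LoadInt64(b)
		if cur <= 0 {
			return false // this bucket is empty until the next replenish
		}
		if atomic.CompareAndSwapInt64(b, cur, cur-1) {
			return true
		}
	}
}

func main() {
	p := NewPool(10, 2)
	p.Replenish(2) // each bucket now holds 2 units, 6 remain in the pool
	fmt.Println(p.Deduct(0), p.Deduct(0), p.Deduct(0)) // third call finds bucket 0 empty
	p.Replenish(2)
	fmt.Println(p.Deduct(0)) // succeeds again after the top-up
}
```

Small, frequent top-ups limit how much inventory can be stranded in a cold bucket, which is the unevenness problem that Scheme 2 is meant to ease.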
Five, afterword
The beauty of programming is universal. Good code usually has a clear structure, clear meaning, and exquisite design. Whether reading or writing it, programmers feel a beauty that goes straight to the heart; readers cannot put it down, and the author is proud to call it a representative work. To preserve this beauty, however, we also need to design for failure and fully consider failure scenarios, so as to reduce the probability of failure and keep the system alive for as long as possible.
This article offers only some simple thoughts on failure-oriented design. Discussion, additions, and corrections are welcome.
Six, references
- Summary of design for failure – developer.aliyun.com/article/726…
- High performance distributed lock – help.aliyun.com/document_d…
Author: Wang Weiqiang