Hello, I’m Bella Sauce

“As long as it works, it’s fine.” “As long as something can run, it’s fine — either the program runs, or the programmer does.”

I believe many of you have heard these two lines. They sound fine at first: make it run, make it fast, make it better. But they describe only the “make it run” phase, which is fine as a starting point, not as a place to stay forever.

What is stability? I place it in the “make it better” stage. Stability means detecting problems in advance through a series of measures and striving to nip them in the bud. Stability requires that when a problem does occur, it is noticed before the business side notices, and handled so that its impact is minimized. Stability also requires that after a problem occurs, there is a post-mortem review to find what can be improved so the same thing does not happen again. In short, stability is about keeping the system running properly, keeping the business silky smooth, and keeping the data accurate.

Stability work is not just a collection of scattered tasks; it follows rules and has systematic methods. This article explains how a business team can build stability into daily work across eight aspects: mechanism control, monitoring and alarms, system risk review, life-saving measures, emergency response to online problems, fault drills, post-mortem reviews, and publicity. Together they cover before, during, and after an incident.

1. Mechanism control

Whatever can be handled by a system should be handled by a system; relying on people to do it manually is extremely risky. I believe many of you understand this. Stability assurance is no different: use a set of mechanisms and rules to regulate everyday behaviors.

Here are some examples:

1) Before release, a branch must merge the latest origin/master code.

2) Every branch release must go through code review (CR) and must be approved by testers. The developer and the tester cannot be the same person, unless the tester has approved the developer’s self-test. The publisher and the CR reviewer cannot be the same person either.

3) Every branch release should be recorded in a system, including at least the release time, the branch, the release content, its impact, whether tests passed, and so on.

4) Define a release window. Releasing outside the window requires an approval process; the approver may be the boss, the person in charge of stability in the team, or someone else such as a core developer, depending on the team. The window mainly exists to prevent night or weekend releases, so that when a problem occurs, response and handling are not delayed.

5) Define a release process, mainly for the server-side environments. For example: a branch must deploy successfully in the daily (test) environment before it can go to pre-release; only after the pre-release environment deploys successfully can it go online; when going online, it must first deploy successfully to the beta environment and be observed for a specified time before the rollout continues; the online rollout must be split into at least a certain number of batches with a minimum observation time per batch; and so on.

6) Never switch traffic all at once; there must be a gradual gray-scale (canary) process, whether by the proportion of machines released or by business-level gray marks. A step-by-step gray release is a process of verifying the feature and exposing problems while the affected scope stays controllable: you can tell the business side which range is in gray, so when a problem occurs they know where to look instead of scratching their heads.

7) Data repairs must be performed through tools. The tool should capture the current operator, operation time, and so on; it should be able to verify that the repair is correct; and every use of the tool should be logged in the background so that historical operations can be reviewed later.
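The gradual gray-scale process described above can be sketched as a deterministic bucketing function. This is a minimal illustration, not any particular release platform’s implementation; the names `in_gray` and `handle_request` are hypothetical, and a real system would manage the percentage centrally:

```python
import hashlib

def in_gray(user_id: str, percent: int) -> bool:
    """Deterministically decide whether a user falls into the gray range.

    The same user always lands in the same bucket, so raising `percent`
    step by step (1 -> 5 -> 20 -> 100) only ever adds users to the gray
    range; nobody flaps between the old and new code paths.
    """
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in [0, 100)
    return bucket < percent

def handle_request(user_id: str, gray_percent: int) -> str:
    # "new" / "old" stand in for the new and old code paths.
    return "new" if in_gray(user_id, gray_percent) else "old"
```

Because the bucketing is stable, the business side can be told exactly which range is in gray at each step, which is what makes problems during rollout easy to attribute.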

2. Monitoring and alarms

The point of monitoring and alarms is to detect a fault before the business side does, which greatly improves response speed and helps locate the fault. Once the problem is located and its impact can be assessed, think about how to stop the bleeding quickly and minimize the damage.

Note the following when setting alarms:

1) Alarm recipients. Do not send alarms to only one person; that is far too risky — they might miss an alarm while in the shower, for example. Instead, build a DingTalk alarm group and pull in everyone related to the business, plus the team’s stability owner, the boss, and so on. This avoids a “single point of failure” in alerting: even if one person misses the alarm, as long as someone sees it and handles it, you are covered.

2) Alarm priority. Set different priorities for different situations. For example: if the service success rate stays below 97% for 3 minutes, send an alarm to the DingTalk group; if it stays below 97% for 5 minutes, send an SMS to the subscribers; if it stays below 97% for 10 minutes, phone the subscribers directly.

3) Alarm content. This depends on the team’s services. In general you need monitoring and alarms at the DB level, on service success rate and failure counts, and at the machine level. At the DB level, monitor things like CPU usage, connection count, QPS, TPS, RT, and disk usage; at the machine level, CPU usage, load, disk usage, and so on. When configuring service success-rate alarms, remember to factor in duration.

4) Verifying that alarms are effective. This is critical: if the alarms themselves are invalid, then the recipients, priorities, and content above are all useless. Set the threshold too loose and the alarm never fires; set it too tight and you receive a flood of alarms every day, grow numb to them, and when the wolf really comes nobody believes it. The most reliable thresholds come from load testing: once you have baselines for the service, the DB, and the machines, alarms set against those baselines are the most trustworthy. Without load testing, observe the everyday water line, start with a relatively sensitive threshold, and tune it based on the alarms you actually receive.
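The duration-based escalation in point 2 can be sketched as a small decision function. This is a simplified illustration with the thresholds from the example above; the function name `alarm_channel` and the channel labels are hypothetical:

```python
from typing import Optional

def alarm_channel(success_rate: float, minutes_below: int,
                  threshold: float = 0.97) -> Optional[str]:
    """Pick an alarm channel by how long the success rate has stayed
    below the threshold; longer-lived breaches escalate further."""
    if success_rate >= threshold:
        return None          # healthy, no alarm
    if minutes_below >= 10:
        return "phone"       # call the subscribers directly
    if minutes_below >= 5:
        return "sms"
    if minutes_below >= 3:
        return "group"       # e.g. a DingTalk alarm group
    return None              # too short-lived; avoid flapping alarms
```

Requiring the breach to persist for at least 3 minutes before the first alarm is what keeps momentary dips from generating noise — the same duration consideration mentioned for success-rate alarms above.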

3. Sort out system risks

As the ancients said: know yourself and know your enemy, and you will never be in peril in a hundred battles. Why know the enemy? Because you cannot find his weaknesses until you know him. Stability is the same: by understanding the system, you learn where its risk points are. Of course, fully understanding a system takes a lot of time and effort. The ancients also said: a gentleman is not born different; he is simply good at making use of things. So in the early stage of sorting out system risk points, “make use of” your lovely colleagues: find the owner of each application and learn its current state, its potential risk points, and whether countermeasures exist.

When sorting out system risk points, pay attention to the following:

1) Whether the business link forms a closed loop. This is critical, because a gap at the system level is bound to have a huge impact. Checking for closure requires stepping outside the developer’s mindset and looking at the product as a whole: can it run properly under every circumstance? If you find that the loop is not closed, inform the product manager and your boss as soon as possible, think about how to solve it from the product’s perspective and what features are needed to make it whole, and schedule the fix as a high-priority development task.

2) Slow SQL. A slow SQL can do enormous damage: a single slow query may drag down an entire database, and any other SQL waiting to execute then sees its latency greatly amplified, which makes it hard for developers to tell which query is the truly slow one. At that point, calm down and work through the timeline to find the real slow SQL, or ask a DBA for help — professionals doing professional work get more reliable results in less time. When you find a candidate, it is best to verify that your conclusion and the DBA’s agree, so the real slow SQL is not missed.

3) Whether core and non-core applications affect each other. Services of different importance levels should be isolated at the physical level. A common example is read/write separation, where read and write requests go to different groups of machines so they do not interfere with each other. The same kind of isolation also applies at the DB level.
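The read/write separation just mentioned can be sketched as a tiny routing layer. This is only an illustration of the idea; the host names and the `pick_host` function are hypothetical, and in practice a connection pool or database middleware would do this routing:

```python
import random

# Hypothetical host pools; a real system would get these from middleware config.
PRIMARY = ["db-primary-1"]
REPLICAS = ["db-replica-1", "db-replica-2"]

def pick_host(sql: str) -> str:
    """Route writes to the primary and reads to a replica, so heavy read
    traffic on non-core paths cannot starve core write traffic."""
    verb = sql.lstrip().split(None, 1)[0].upper()
    if verb in ("INSERT", "UPDATE", "DELETE", "REPLACE"):
        return PRIMARY[0]
    return random.choice(REPLICAS)
```

The same splitting idea extends to isolating core from non-core applications: give each importance level its own machine group so a surge on one cannot take down the other.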

4. Life-saving measures

Mechanism control, monitoring and alarms, and sorting out system risks all aim to prevent problems. But who walks by the river often without ever wetting their shoes? When something does go wrong online, how do you stop the bleeding quickly? This is why life-saving measures must be prepared in daily work — once a problem has already occurred, it is too late to start preparing.

So what measures can keep us alive?

1) Rate limiting. Rejecting some requests does cause some failures, but it relieves pressure on the service and on the DB, and in some situations it can save your life. Sentinel is a commonly used rate-limiting tool: configure the limit values in the Sentinel console in advance, and when a problem occurs, push the limiting switch directly. Limits can be configured per cluster or per machine, and for all calling applications or for specific ones, depending on the business scenario and requirements.

2) Degradation, which can be lossy or lossless. Lossy degradation is visible to the business; lossless degradation is purely technical — for example, failing a query over from one data source to a standby one, which the business side never notices. A classic lossy example is the Spring Festival Gala red-envelope (hongbao) campaign: Baidu announced in advance that on New Year’s Eve, login and registration for Baidu Cloud Disk would be degraded, in order to keep the red-envelope campaign running smoothly. For lossy degradation, define in daily work under what conditions and at what system indicators degradation is allowed, to avoid degrading rashly with too little thought when a problem occurs and causing something worse.

3) Traffic switching. If the servers in one machine room become unavailable due to network, hardware, or other problems, switch the traffic to another machine room. Likewise, when one data source becomes unavailable, switch to a standby data source. This can reduce a problem’s impact or even resolve it outright.

4) Communication and reassurance. If a service facing external users has a problem, inform those users through the business side or other channels: tell them the system has an issue and that people are working on it. This reassures them instead of leaving them confused and at a loss. Official accounts posting notices on Weibo are one form of this.

5) Contact information for upstream and downstream teams, DBAs, and so on. Sometimes the problem is not in your own system: an anomaly upstream or downstream indirectly drags down your success rate. Having the contacts of your upstream and downstream partners lets you notify them quickly and get them investigating. Sometimes a DB problem needs a DBA to solve, so keeping DBA contacts handy matters too — one small command from a DBA may get you out of trouble.
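To make the rate-limiting idea concrete, here is a minimal token-bucket sketch. This is an illustration of the general technique, not Sentinel’s actual implementation — in practice you would configure flow rules in the Sentinel console as described above:

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: admit at most `rate` requests per
    second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True       # request admitted
        return False          # request rejected to protect the backend
```

Rejected requests fail fast instead of piling up on the DB, which is exactly the “some failure in exchange for survival” trade-off the measure is about.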

Life-saving measures such as rate limiting, degradation, traffic switching, and communication can only be used decisively if the triggering conditions and system indicators have been clearly defined in advance. That way, when a problem occurs, decisions can be made quickly, and misjudgments that lead to improper use and further problems can be avoided.

5. Emergency response to online problems

What if a problem really does happen online? Don’t panic — the more you panic, the more chaotic things get.

First, respond quickly and make it visible that someone is locating the problem. Once the problem is identified, the priority is how to stop the bleeding fast; this is where the life-saving measures above come in. And if no prepared measure applies? Then do whatever else is needed, including an emergency release.

Rapid response is not a one-person job. You need a communicator who reports current progress to the outside in a timely manner; a handler who locates the problem, assesses the impact, and proposes options; and a decision maker — perhaps the team’s stability owner, the boss, or the business side — who decides how to stop the bleeding quickly. Once a decision is reached, the handler executes it while the communicator keeps the outside informed.

6. Fault drill

The purpose of a drill is to simulate how people would react to a failure and how they would handle it. A fault drill should therefore not be announced to the team in advance, let alone which problem will be injected; otherwise the drill is meaningless. Of course, keep the drill within bounds so that it does not affect the online business.

7. Post-mortem review

After an online problem occurs, no matter how big or small, there should be a corresponding post-mortem review. A small problem can be reviewed in a small circle; a big problem warrants a wider set of participants.

What should the post-mortem cover?

1) When was the problem discovered?

Record the exact point in time.

2) How was the problem discovered — through monitoring alarms, or through business feedback?

This tells you whether the monitoring alarms are effective and whether there is room to optimize them.

3) When did someone respond to the problem?

This tells you whether the response was timely and whether there is room for improvement.

4) When was the problem located?

This assesses familiarity with the system, as well as the ability to respond to online problems. Some people take a long time to locate a problem because they do not understand the system; others because, flustered or under high pressure, they take longer than usual. The two cases call for different kinds of improvement.

5) Were there quick stop-the-bleeding measures, such as rate limiting, degradation, or traffic switching?

This shows whether the everyday preparation work is in place. If something is found missing, record it as an action item and complete it in later work. Better still, take the opportunity to comb through the whole system, see whether any other quick-stop measures are missing, and build them all together.

6) When upstream/downstream teams or DBAs needed to be involved, was coordination smooth?

This evaluates whether coordination and communication can be improved. After all, when upstream/downstream teams or DBAs must get involved, being able to reach the right people quickly saves a lot of time and helps solve the problem faster.

8. Publicity

Stability is never one person’s job. It concerns every member of the team, and every member should keep it in mind.

When taking on requirements, consider whether they are reasonable and whether the business link forms a closed loop.

When writing code, consider its robustness, whether the syntax is used correctly, and whether you are writing slow SQL.

When releasing, consider whether tests have passed, whether you are inside the release window, whether there is a gray-scale strategy, how problems found during the release will be handled, whether the release can be rolled back, and so on.

When an online problem occurs, how to mount a rapid emergency response is also something every member should understand.

Some publicity sessions are necessary to help everyone better understand what they should do to keep the system stable.

All right — I’m Bella Sauce, a girl who writes code at a BAT company. Welcome to follow my personal WeChat official account (Bella’s Technology Wheel) to learn and grow together! That’s all for today. See you next time!