This is a question I ask a lot, whether I am interviewing senior (high-P) candidates or sitting as a judge on a promotion panel. Most answers boil down to: rate limiting, degradation, circuit breaking. Very few people, however, give a systematic answer.
First, what are system reliability and availability? And what is system stability?
System reliability: a highly reliable system fails rarely and can run for long stretches without failure.
System availability: a highly available system spends little time in a failed state, stops the loss quickly, and is ready to serve at any given moment.
Take the example from chapter 8 of Distributed Systems: Principles and Paradigms:
If a system goes down for 1 ms every hour, its availability is over 99.9999%, yet it is still highly unreliable.
If a system never crashes but is shut down for two weeks every mid-August, it is highly reliable, yet only about 96% available.
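Spelled out, the arithmetic behind those two figures, taking availability as the fraction of time the system is up:

- 1 ms down per hour: 1 − 0.001 s / 3600 s ≈ 99.99997%, which is indeed above 99.9999%.
- Two weeks down per year: 50 / 52 ≈ 96.15%, the roughly 96% quoted above.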
System stability builds on top of reliability and availability: reduce the failure frequency, speed up the stop-loss (recovery), and, beyond that, keep performance steady rather than fast one moment and slow the next.
In other words, not only do I want the system to be ready to serve me as much of the time as possible, I also want it to serve me with a guaranteed quality.
So, back to the original question: what can we actually do to build system stability?
1. Improve system reliability and reduce the number of failures
Coding standards:
For Java engineers, I think the Alibaba Java Coding Guidelines are enough. Of course, I add a few extra red lines based on my own years of experience:
- MyBatis SELECT statements must include a LIMIT. Without one, a single query can return 1,000,000 rows and saturate the network card.
- MyBatis SELECT statements must have a WHERE clause that hits an index. On a large table with hundreds of millions of rows, a query that misses the index triggers a full table scan and can hang the database outright.
- MyBatis UPDATE and DELETE statements must be single-purpose SQL written for a single scenario; dynamically spliced SQL is not allowed, and the condition after WHERE must include an indexed column with high selectivity. I have seen dynamically combined UPDATE/DELETE SQL receive no condition at all, so the whole table was modified or deleted.
- RPC calls and DB operations are not allowed inside loop statements in Java code. With enough iterations, such a loop can hang the downstream service or the database outright, so just ban it. (A sketch of the batched alternative follows this list.)
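To illustrate the loop rule, here is a minimal Java sketch; `OrderClient`, `getOrder`, `getOrders`, and `Order` are hypothetical names standing in for your real RPC client and DTO.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical RPC client and DTO, standing in for the real ones.
interface OrderClient {
    Order getOrder(long id);
    List<Order> getOrders(List<Long> ids); // assumed batched endpoint
}

record Order(long id, String status) {}

public class OrderLookup {
    private static final int BATCH_SIZE = 200; // bound every batch

    private final OrderClient orderClient;

    public OrderLookup(OrderClient orderClient) {
        this.orderClient = orderClient;
    }

    // Anti-pattern: one RPC per iteration. 10,000 ids means 10,000 network
    // round trips, which can overwhelm the downstream service or the DB.
    public List<Order> fetchOneByOne(List<Long> ids) {
        List<Order> orders = new ArrayList<>();
        for (Long id : ids) {
            orders.add(orderClient.getOrder(id)); // RPC inside a loop: banned
        }
        return orders;
    }

    // Preferred: bounded batch calls, cutting the downstream call count
    // from ids.size() down to ids.size() / BATCH_SIZE.
    public List<Order> fetchInBatches(List<Long> ids) {
        List<Order> orders = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += BATCH_SIZE) {
            int to = Math.min(from + BATCH_SIZE, ids.size());
            orders.addAll(orderClient.getOrders(ids.subList(from, to)));
        }
        return orders;
    }
}
```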
Release rules:
- Services of high importance must pass a TL (team lead) code review before going live.
- Services of high importance must have a grayscale (canary) release strategy before going live.
- For services of high importance, when the code change is large and the blast radius is wide, the release must be scheduled at a time that does not affect the business (failing that, release during a business low-traffic window).
Incident retrospectives:
It is often said that the best way for adults to learn is to review what happened, and that applies to systems engineering as well. Since we have already paid expensive tuition for the failure, we should make sure it was worth the money.
- Use the 5 Whys, working from the single incident out to the broader pattern, to dig out the root cause. During the retrospective, ban perfunctory, box-ticking phrases like "forgot", "wasn't careful", "not in time", or "the schedule was too tight", so that the same failure, or the same class of failure, does not happen again.
- Along the two dimensions of "important and urgent" and "important but not urgent", draw up short- and medium-term TODOs, assign an owner and a deadline to each, and keep monitoring and following up until every TODO is done. Keep the TODOs concrete: avoid resolutions like "be more serious next time", "pay close attention next time", or "stay vigilant next time".
- Identify the person responsible for the fault. Mistakes must be owned, and a little pain is what makes people pay attention; when something goes wrong, someone has to step up and take responsibility. Do not worry that "the more you do, the more mistakes you make": with explicit reward-and-punishment rules, doing more falls under reward and erring more falls under punishment, and the two are not in conflict.
Load testing and rate limiting:
Load testing in production by replaying recorded peak traffic has become the standard approach. Ramp the pressure up gradually, at 1x, 1.5x, 2x, 2.5x, 3x the online peak, and so on, until the test shows how many multiples of peak traffic the current system can carry. Then set a rate-limiting threshold (usually below the measured capacity, to leave a margin for error) so the system cannot be overloaded into failure.
- Rate limiting should be layered: a system-wide limit, plus per-interface limits on core interfaces and on interfaces with high QPS or long RT (a minimal sketch follows this list).
- Load testing is a routine health check. Set the cadence according to the release frequency and the importance of the system; for high-importance services, a load test every two weeks is recommended.
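As a minimal sketch of per-interface limiting, here is one way to do it with Guava's RateLimiter; the interface names and thresholds are placeholders, and note that this limiter is per JVM instance, so a cluster-wide limit would need a distributed limiter instead.

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.Map;

public class ApiRateLimits {

    // Per-interface thresholds (permits/second), set below the load-tested
    // capacity to leave a safety margin. The values here are placeholders.
    private final Map<String, RateLimiter> limiters = Map.of(
            "createOrder", RateLimiter.create(500.0),   // core interface
            "queryOrders", RateLimiter.create(2000.0)); // high-QPS query

    // Returns true if the request may proceed; tryAcquire() never blocks,
    // so over-limit requests are rejected immediately.
    public boolean tryPass(String api) {
        RateLimiter limiter = limiters.get(api);
        return limiter == null || limiter.tryAcquire();
    }
}
```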
Anti-flooding (anti-brush) protection:
Guard against abuse from the outside as well as from the inside.
Externally, Distributed Denial of Service (DDoS) attacks use a large number of distributed machines to flood the target with requests, so that legitimate users cannot get served.
Internally, guard against upstream services, clients, or the front end repeatedly firing the same request at downstream services in a short window because of a buggy timed-refresh or retry loop.
The implementation is relatively simple: use username/IP plus method name as a Redis key, with the expiration time set to the anti-flood interval. Note that this fits coarse-grained, business-aggregation interfaces rather than query-by-ID interfaces, and the rules need to be agreed with upstream callers so that normal business traffic is not mistakenly blocked.
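A minimal sketch of that Redis key scheme, using the Jedis client; the five-second window and the key format are assumptions to adapt to your own interfaces.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class AntiFloodGuard {

    private static final int INTERVAL_SECONDS = 5; // assumed anti-flood window

    private final Jedis jedis;

    public AntiFloodGuard(Jedis jedis) {
        this.jedis = jedis;
    }

    // Returns true if the request may proceed. SET NX EX is atomic: the first
    // request in the window creates the key; duplicates see the existing key
    // and are rejected until it expires.
    public boolean allow(String userOrIp, String methodName) {
        String key = "antiflood:" + userOrIp + ":" + methodName;
        String result = jedis.set(key, "1",
                new SetParams().nx().ex(INTERVAL_SECONDS));
        return "OK".equals(result); // null means the key already existed
    }
}
```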
In addition, for core services that absolutely must not go down, even for half a minute, there are a few principles to follow:
- Minimal closed-loop call chain: depend on as few services and as little middleware as possible. (Make sure you yourself do not go down.)
- Maximum isolation: deploy unrelated, non-core services separately wherever possible. (Make sure you are not dragged down by others.)
In the next part, we will continue with how to improve system availability and overall system stability.
To be continued.