2020 was destined to be an extraordinary year. The pandemic disrupted everyone’s plans and, at the same time, created new demand for managing business online. As an engineer, I had to quickly build out system capacity and keep those systems running stably on short notice. Now that the year is ending, it is time to sort out and summarize my experience and thinking on system stability construction.

Introduction

Before we get into service stability, let’s talk about SLAs. A service level agreement (SLA) is often used to measure service stability. Availability is commonly quoted as “several nines”: the more nines, the higher the availability over the year, the more reliable the service, and the less downtime is allowed. An SLA usually serves as a concrete commitment between the service provider and the service user, covering quality, availability, and responsibility.

1 year = 365 days = 8760 hours

Three nines (99.9%): allowed downtime = 8760 * 0.1% = 8760 * 0.001 = 8.76 hours

Four nines (99.99%): allowed downtime = 8760 * 0.0001 = 0.876 hours = 0.876 * 60 = 52.6 minutes

Five nines (99.999%): allowed downtime = 8760 * 0.00001 = 0.0876 hours = 0.0876 * 60 = 5.26 minutes
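The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `allowed_downtime_hours` is made up here, and the sketch assumes a 365-day year as in the figures above.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, the same figure used above


def allowed_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime budget, in hours, for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% -> {hours:.4f} hours = {hours * 60:.2f} minutes")
```

Each extra nine divides the downtime budget by ten, which is why moving from three nines to five nines is an engineering problem rather than a tuning exercise.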

Behind a stringent service level agreement lies a whole set of engineering requirements.

1. What does system stability construction mean?

Many developers have their own understanding of what system stability refers to, but they may wonder whether that understanding is one-sided or whether it matches any standard. What exactly are the criteria and the boundaries?

Take a look at this explanation from Wikipedia:

Stability is a term used in mathematics or engineering to describe whether a system produces bounded outputs for bounded inputs.

If so, the system is called stable; if not, the system is unstable.

Simply put, system stability essentially means that the system responds in a deterministic, predictable way.

From another perspective, service stability construction is about ensuring that the system can meet the service levels committed to in the SLA.

2. Why is it necessary to build system stability?

To be sure, building service stability is very necessary, both for the normal day-to-day operation of the system and for stable, orderly operation during major holiday campaigns.

Let’s take a look at a few incidents caused by service stability failures:

1) On the day before the National Day holiday in 2020, under the demand of “the most difficult taxi-hailing day of 2020”, ride-hailing platforms including Didi went down in succession;

2) Amazon Prime Day 2018: glitches on Prime Day (customers could not add items to their shopping carts) cost the company $99 million;

3) In 2015, due to a computer system upgrade in some regions, Industrial and Commercial Bank of China saw transactions at counters and on electronic channels slow down or even fail to be accepted;

4) In 2012, the 12306 railway booking website suspended online ticket sales, refunds, and changes because the air-conditioning system in its machine room failed.

Service stability is very important for enterprises: failures not only cause direct economic losses but can also seriously affect the industry and people’s lives. Building service stability therefore matters a great deal.

3. Why is it difficult to build system stability?

When it comes to stability and how to improve stability metrics, many optimizations come to mind:

For example: adding servers, capacity expansion, timeout and retry, service degradation, resource isolation and backup, code logic optimization, asynchronous event-driven processing…

So what is the main difficulty of system stability construction?

3.1 The challenges are considerable

  • Traffic is unknown

Especially for a business that has just been reworked and launched, the main unknown in stability construction is the size of the traffic peak. With no prior experience to refer to, it is hard to tell whether the peak will be in the millions, the tens of millions, or higher.

  • The scale of change is large

Such stability work usually has to support launching XX capability within a short time, which often involves changes at every level of the system from bottom to top: adjusting the underlying data structures, reworking business logic, and optimizing user interaction. Time is short, the changes are large, and quality is hard to guarantee.

  • Uncertainty

Software engineering is often described as “the study of engineering methods to build and maintain effective, practical, and high-quality software”. It covers every aspect of building software, down to every detail, and any slight oversight can cause an overall failure, so uncertainty is a particularly serious problem.

3.2 System stability construction is a systematic project

It spans many links with a fine-grained, complex division of labor, and leaves no room for carelessness.

In terms of system structure, it can be divided into single-service system stability and multi-service cluster stability.

  • Single-service stability

It mainly includes: controllable feature configuration, cache acceleration (a sharp tool), service isolation (especially for third-party services), handling of exceptional scenarios, service monitoring with timely response, and so on.

  • Cluster stability

It mainly includes: a reasonable system architecture, solid cluster deployment, well-designed circuit breaking and rate limiting, a load-testing mechanism, a fine-grained monitoring system, and so on.

4. How to start with the construction of system stability?

4.1 Premise of system stability construction

Before proposing solutions for system stability construction, we need to clarify the following prerequisites:

  • Familiarity with the business: you need to know the overall business process and have a firm grip on it;

  • A clear view of the architecture: you need to understand the system’s technical architecture and have some practical experience.

Only with this command of both the business and the architecture can we break down and optimize the stability work with a basic guarantee of success.

4.2 Process Division

In general, system stability construction is treated as a dedicated special project. From the perspective of its process, it mainly covers the following aspects:

  • Premise: clear objectives (benchmarks)

  • Before an incident: request link optimization, service performance optimization and load testing, emergency plan formulation, failure drills

  • During an incident: fault monitoring, problem location, loss mitigation, problem repair

  • After an incident: fault review, rectification and optimization, and capturing the lessons learned

Service stability construction is a systematic project that covers all of these aspects.

5. Key actions of system stability construction

As the breakdown in the previous part shows, stability construction covers many points, and they are fairly miscellaneous. More often, a service stability project sorts out solutions for specific problems in specific scenarios.

So let us start small: looking at a single service, we can extract the key points of stability construction. After all, only when every single service in the chain is stable and reliable can the stability of the cluster, and of the whole engineering system, be guaranteed.

If the system faces a sudden surge in request traffic, how should we go about building service stability?

Key actions of stability construction can be divided into the following categories:

5.1 Peak shaving and rate limiting

Take the classic seckill (flash sale) scenarios, such as buying train tickets during the Spring Festival or Double 11 seckill events on e-commerce platforms: hundreds of millions of users flood in within a short time, producing huge instantaneous traffic (high concurrency).

No matter how much server capacity is added in advance, there is always a processing ceiling, so peak shaving and rate limiting strategies are necessary, much like the staggered traffic restrictions a city applies during its morning and evening rush hours. Seckill scenarios call for the same kind of solution.

So how do you do that?

  • Use message queues to shave peaks

A message queue buffers instantaneous traffic by turning a synchronous direct call into an asynchronous indirect push: one end of the queue absorbs the traffic peak, and the other end pushes messages out at a smooth rate.

A message queue is like a “reservoir”: it holds back the upstream flood and reduces the peak flow into the downstream river, thereby relieving the flood.
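The reservoir idea can be sketched with Python’s standard library. This is a minimal in-process illustration, not a production setup: `queue.Queue` stands in for real messaging middleware, and `handle`, `consumer`, and `run_burst` are hypothetical names invented for this example.

```python
import queue
import threading
import time


def handle(request_id: int) -> None:
    """Hypothetical downstream work; the short sleep stands in for real processing."""
    time.sleep(0.001)


def consumer(buffer: queue.Queue, processed: list) -> None:
    """Drain the buffer at the consumer's own steady pace."""
    while True:
        request_id = buffer.get()
        if request_id is None:  # sentinel: the producer has nothing more to send
            break
        handle(request_id)
        processed.append(request_id)


def run_burst(burst_size: int) -> list:
    """Accept a burst of requests at once and process them smoothly in the background."""
    buffer: queue.Queue = queue.Queue()
    processed: list = []
    worker = threading.Thread(target=consumer, args=(buffer, processed))
    worker.start()
    # The burst arrives all at once; enqueueing is cheap, so callers are not blocked.
    for request_id in range(burst_size):
        buffer.put(request_id)
    buffer.put(None)
    worker.join()
    return processed


if __name__ == "__main__":
    print(len(run_burst(100)), "requests processed")
```

The producer side finishes almost immediately while the consumer works through the backlog at a rate the downstream service can sustain. In a real system the buffer would also need a size limit and a policy for what happens when it fills up.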

  • Use baffles to filter invalid requests

A traffic baffle establishes a verification mechanism that filters out invalid requests, protecting core services from being hit by large numbers of them. A common solution is the Bloom filter.
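As a rough sketch of how a Bloom filter can act as such a baffle, the class below keeps a bit array and sets a few hashed positions per item. The class name, the default sizes, and the use of salted SHA-256 are all assumptions made for illustration; a production system would more likely use an existing library or a Redis-based implementation.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


if __name__ == "__main__":
    valid_ids = BloomFilter()
    valid_ids.add("sku-1001")  # hypothetical item id
    for request_id in ("sku-1001", "sku-does-not-exist"):
        verdict = "pass through" if valid_ids.might_contain(request_id) else "reject early"
        print(request_id, "->", verdict)
```

A request whose key is definitely absent is rejected before it reaches the core service. Because false positives are possible, the filter only reduces invalid traffic; the core service still has to validate whatever gets through.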

  • Adjust the product strategy

Adjusting the product strategy is a particularly effective means, sometimes even more effective than technical improvement and optimization.

For example, use a queuing strategy to break up high-concurrency requests, or spread the campaign over a longer time window so that requests do not all arrive at the same moment…
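On the technical side, the rate limiting discussed in this section is often implemented with a token bucket. The article does not name a specific algorithm, so the sketch below is an assumption chosen for illustration; the class name and the injectable `clock` parameter are likewise invented here.

```python
import time


class TokenBucket:
    """Admit a request only if a token is available; tokens refill at a fixed rate."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock  # injectable so the behaviour can be checked without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill according to elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate_per_sec=5, capacity=10)
    admitted = sum(limiter.allow() for _ in range(100))
    print(f"admitted {admitted} of 100 burst requests")
```

The capacity sets how large a burst is tolerated, and the refill rate sets the sustained throughput. Rejected requests can be queued, retried later, or answered with a friendly "please wait" page, which is where the product strategy comes back in.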

5.2 Cache Acceleration

Caching is a powerful tool for handling concurrency and can effectively improve system throughput. If necessary, add multiple levels of cache, designed along business and technical dimensions, to keep the hit ratio high.

Main application idea: place a Redis cache between the server and the database to reduce the number of requests that hit the database directly.
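A common way to apply this idea is the cache-aside (read-through) pattern: check the cache first and fall back to the database only on a miss. In the sketch below an in-memory dict stands in for Redis so that the example runs anywhere; `CacheAside`, `load_from_db`, and the TTL value are hypothetical names and settings, not part of any real client API.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Read-through helper; a dict stands in for Redis in this sketch."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float = 60.0) -> None:
        self.load_from_db = load_from_db
        self.ttl_seconds = ttl_seconds
        self._store: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry time)

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]  # cache hit: the database is not touched
        value = self.load_from_db(key)  # cache miss: fall back to the database
        self._store[key] = (value, time.monotonic() + self.ttl_seconds)
        return value


if __name__ == "__main__":
    def slow_db_lookup(key: str) -> str:
        time.sleep(0.05)  # stands in for a real query
        return f"row-for-{key}"

    cache = CacheAside(slow_db_lookup)
    print(cache.get("user:42"))  # first call goes to the database
    print(cache.get("user:42"))  # second call is served from the cache
```

The expiry time keeps stale entries from living forever. Writes need their own policy (for example, updating the database and then invalidating the cached key), which this sketch leaves out.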

5.3 Asynchronous Processing

The opposite of asynchronous is synchronous: tasks queue up one by one, and each must finish before the next one starts, a bit like candied hawthorns strung on a stick. A synchronous call must be processed and answered in real time, and the session is terminated once a timeout is exceeded; in the meantime the caller waits for the responder to finish processing and return.

Asynchronous processing does not block the current thread to wait for completion. Subsequent operations are allowed to proceed, and once another thread has finished processing, it notifies the caller through a callback.

It is important to emphasize that asynchrony is a design concept, and asynchronous operation is not the same as multi-threading. Common approaches such as messaging middleware and publish-subscribe (broadcast) patterns can all implement asynchronous processing.
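To illustrate that asynchrony does not require multiple threads, here is a minimal sketch using Python’s `asyncio`, which runs everything on a single thread. The function names and the simulated delay are made up for this example.

```python
import asyncio


async def send_notification(order_id: int) -> str:
    """Hypothetical slow I/O call; the sleep stands in for a network round trip."""
    await asyncio.sleep(0.01)
    return f"notified:{order_id}"


async def place_orders(order_ids):
    # Start all notifications without waiting for each one to finish first.
    tasks = [asyncio.create_task(send_notification(i)) for i in order_ids]
    # gather returns results in the same order as the tasks were passed in.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(place_orders([1, 2, 3])))
```

The three notifications overlap in time, so the total wait is close to one delay rather than three, yet no extra thread is created. While one call is waiting on I/O, the event loop simply moves on to the next.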

6. Some experience from the process of stability construction

6.1 Do load testing well

Load-test the system well in advance so that you know your numbers and can nip problems in the bud. Load estimates should be realistic rather than blindly inflated. For performance bottlenecks, try to improve and optimize them ahead of time, or make them a focus of defense.

6.2 Have the necessary emergency plans

We have to have contingency plans. Developers tend to be confident, which is both good and bad, and we need to be prepared for the worst. No engineer, however experienced, can foresee every unexpected event that might occur in the future, and failures tend to occur somewhere outside the plan (Murphy’s Law).

6.3 Improve the monitoring system

Establish a sound monitoring and alerting mechanism so that we do not become blind and deaf, and so that errors are noticed promptly. The main principle when placing monitoring points is: no dependency should be trusted!

6.4 Quick response capability

It is like changing an engine on a moving plane: whenever something goes wrong, do everything you can to cut losses “quickly”. Services should be graded into levels, following the principle of focusing on what is important, letting go of what is minor, and protecting core services. If a problem really cannot be located quickly, degrade layer by layer. Main goals: prevent problems from spreading, stop the failure, and recover quickly.
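Layer-by-layer degradation can be sketched as a switch that turns off the least important service level first and never touches the core. The three levels and the names `Level` and `DegradeSwitch` are assumptions for illustration; real systems usually drive such switches from a configuration center.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical service grades; a lower number means more important."""
    CORE = 0
    IMPORTANT = 1
    OPTIONAL = 2


class DegradeSwitch:
    """Tracks the least important level that is still being served."""

    def __init__(self) -> None:
        self.max_level = Level.OPTIONAL  # everything is enabled at the start

    def degrade(self) -> None:
        # Drop one layer at a time; core services are never switched off.
        if self.max_level > Level.CORE:
            self.max_level = Level(self.max_level - 1)

    def is_enabled(self, level: Level) -> bool:
        return level <= self.max_level


if __name__ == "__main__":
    switch = DegradeSwitch()
    switch.degrade()  # first step: turn off optional features only
    for level in Level:
        print(level.name, "enabled" if switch.is_enabled(level) else "degraded")
```

Each call to `degrade` sheds one more layer of load, which buys time to locate the fault while core services keep running.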

Conclusion

Key points of stability construction

  • Peak shaving and rate limiting: in the face of resource ceilings, handle traffic at both the technical and business levels, shaving peaks to guarantee service stability;

  • Cache acceleration: use caching to handle concurrency and improve system throughput, while avoiding hot keys and large keys;

  • Asynchronous processing (synchronous -> asynchronous): effectively improve response efficiency while ensuring the eventual consistency of data.

Technology serves the business

Technology must ultimately land on solving practical problems. Application scenarios are critical: optimization work should not be done purely for technology’s sake, because technology ultimately serves application scenarios and industries.

Try to treat the business vision as the ultimate goal, and use every technical means to ensure that it is achieved, so as to maximize the value of technology.

Don’t be bound by form; apply flexibly

A stability scheme needs to be adjusted and applied flexibly according to the scenario and must not be applied mechanically. In actual implementation, the key is to keep control of the main action path and, when there are multiple paths, to choose the one with the best return on investment. A recommended path of action is problem-driven: problem awareness -> problem analysis -> problem control -> problem resolution.

– [Denver annual essay | technologies of 2020] If you find any mistakes in this article, please point them out. Thank you!


Thanks for reading!