
Yesterday's big drama: Station B (Bilibili) went down, and everyone jumped in to analyze the causes of the outage, which really served up a wave of hot material for the autumn recruitment season! But while everyone was watching Station B, my dear Station A (AcFun) finally got to stand up!! After netizens did their utmost to redirect traffic its way, Station A went down too ~

Urgent notice: Station B has changed its domain name!!

Here is a brief introduction to the history of Station A:

Station A (AcFun), founded in 2007, was the first danmaku (bullet-comment) video site in China, and it once had its moment of glory. In fact, Station A kept Station B in its shadow right up until 2011, and was the largest danmaku video site in China at the time.

From 2009 to 2016, chaotic struggles over capital and personnel became the biggest obstacle on Station A's road to development, and eventually led to its gradual decline.

Station A's original live-streaming ("生放送") business was later spun off from Station A into a new company, which eventually became Douyu TV.

In 2018, Station A was wholly acquired by Kuaishou, and it has been struggling to operate ever since.

Station A stands up

With the majority of netizens redirecting traffic its way, Station A received a wave of massive traffic and its services were successfully knocked offline. But unlike Station B's crash, an avalanche caused by heavy traffic can be stopped quickly and service can be restored.

Crash analysis of Station A

Having posted all those screenshots, here comes a wave of rational analysis:

  • Crash recovery took about an hour (23:15–00:25)
  • The crash page itself was displayed normally
  • After recovery, some capabilities were circuit-broken

It is intuitively clear that Station A went down because its services were knocked over by the heavy traffic, resulting in service exceptions.

You really have to hand it to the brothers doing the traffic redirection.

The CDN went down

The CDN cache mainly covers H5 resources and videos. When Station A first started to fail, the page looked like this:

As you can see, the H5 page's static resources loaded fine and were still accessible, so the CDN was still in a normal state at that point. Later on, the whole page went down, and by then the CDN had been knocked out as well.

The database went down

The database going down is a guess: under the impact of heavy traffic, new-user logins and all kinds of different data access caused the cache hit ratio to drop sharply, so requests went straight through to the database. And here comes the first step of the avalanche: DB processing timeouts.
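
To make the "cache misses, DB takes the hit" chain concrete, here is a minimal cache-aside sketch (the class and method names are invented for illustration, not Station A's actual code): when a wave of brand-new users arrives, almost every lookup misses the cache and the database absorbs the full load.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal cache-aside sketch: when the cache misses, the request falls
// through to the database. Under a traffic surge full of first-time users,
// most lookups miss, and the DB absorbs nearly all of the load.
public class UserProfileService {
    private final Map<Long, String> cache = new ConcurrentHashMap<>();

    public String getProfile(long userId) {
        String profile = cache.get(userId);      // 1. try the cache first
        if (profile == null) {                   // 2. cache miss: new user, cold key...
            profile = loadFromDatabase(userId);  // 3. ...so the DB takes the hit
            cache.put(userId, profile);          // 4. backfill for later requests
        }
        return profile;
    }

    // Placeholder for the real DB query; slow queries here are where the timeouts begin.
    private String loadFromDatabase(long userId) {
        return "profile-" + userId;
    }
}
```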

The services went down

The database timeouts mentioned above are one possible cause of a service avalanche. If the bottleneck is not the DB but logical processing, data transfer and the like, which eat up the machine's CPU, I/O and other resources, then as traffic surges the machines become overloaded and can no longer respond to business requests quickly; that is another way a service avalanche starts.

Take Station A's video streaming as an example: slow responses from OSS, or limited transfer bandwidth, cause requests to pile up in the video service, and eventually the whole call chain avalanches. Other links in the chain can of course fail in a similar way. The main indicators to watch are CPU, memory and I/O load, plus interface response time; a small sketch of the pile-up problem follows.
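
To show what "requests piling up" looks like in code, here is a minimal sketch that bounds the pile-up with a small queue and a call timeout (the class name and the OSS helper are hypothetical): when the downstream slows down, the overflow is rejected up front instead of queueing until every thread on the machine is stuck behind it.

```java
import java.util.concurrent.*;

// Sketch of how piled-up requests turn into an avalanche, and one way to fail
// fast instead: a bounded queue plus a call timeout. If the downstream (e.g.
// OSS) slows down, extra requests are rejected immediately rather than queuing
// until every thread and socket on the machine is exhausted.
public class VideoStreamFacade {
    // Small, bounded pool: at most 16 in-flight calls and 100 waiting requests.
    private final ExecutorService pool = new ThreadPoolExecutor(
            16, 16, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.AbortPolicy()); // reject overflow instead of piling up

    public byte[] fetchSegment(String segmentId) throws Exception {
        Future<byte[]> future = pool.submit(() -> downloadFromOss(segmentId));
        try {
            // If OSS does not answer within 2 seconds, give up and free the thread.
            return future.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
            throw new RuntimeException("video segment fetch timed out", e);
        }
    }

    // Placeholder for the real OSS download; this is where slow responses originate.
    private byte[] downloadFromOss(String segmentId) {
        return new byte[0];
    }
}
```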

Station A's services are restored

The recovered experience is not fully satisfactory: you can directly and clearly feel that some capabilities have been circuit-broken in order to guarantee the basic ones.

Why did Station A recover so quickly?

Unlike Station B's service crash (a "physical attack"), Station A's outage was mainly caused by heavy traffic overloading the machines and triggering an avalanche. Most of its services are already in the cloud, whose advantage is that capacity can be scaled dynamically on demand (a problem money can solve is not a problem).

1. Dynamic scaling

Through service monitoring you can quickly locate the abnormal (heavily loaded) services, and targeted scaling reduces the load on each individual server while raising the cluster's overall processing capacity. For the video service mentioned earlier: servers overloaded, add machines! OSS bandwidth maxed out, pay up and widen the pipe!

2. Rate limiting

While capacity is being scaled up, rate limiting has to be added so the services don't get knocked over again. As one of the three musketeers of high availability, it protects services through gateway rate limiting, server-side rate limiting, transfer speed limiting and other means. Sentinel, which this blogger has introduced before, also provides this capability and keeps machines from avalanching under excessive load; a minimal example is sketched below.
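
Since Sentinel is mentioned, here is a minimal flow-control sketch using its core API; the resource name "videoDetail" and the 1000 QPS threshold are made up for illustration.

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

// Minimal Sentinel flow-control sketch: cap the "videoDetail" resource at
// 1000 QPS. Requests beyond the threshold are rejected up front instead of
// dragging the overloaded machine down with them.
public class FlowLimitDemo {

    public static void main(String[] args) {
        FlowRule rule = new FlowRule();
        rule.setResource("videoDetail");               // the protected resource name
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);    // limit by QPS
        rule.setCount(1000);                           // at most 1000 requests per second
        FlowRuleManager.loadRules(Collections.singletonList(rule));

        handleRequest();
    }

    static void handleRequest() {
        Entry entry = null;
        try {
            entry = SphU.entry("videoDetail");
            // ... normal business logic for the request goes here ...
        } catch (BlockException e) {
            // Over the limit: fail fast, e.g. return "busy, please try again later".
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```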

3. Circuit breaking

A circuit breaker deliberately disables some capabilities in order to keep the overall call chain stable. In the picture below, the recommendation capability may simply not have been restored yet, or it may have been circuit-broken on purpose; a hand-rolled sketch of the idea follows.
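
The sketch below is a hand-rolled illustration of the circuit-breaker idea, not Sentinel's actual implementation: after a run of failures the breaker opens and every caller gets the fallback immediately, so a broken recommendation service cannot drag the whole page down with it.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hand-rolled sketch of a circuit breaker: after too many consecutive
// failures the breaker "opens" and every call gets the fallback immediately;
// after a cool-down period it lets calls through again.
public class SimpleCircuitBreaker {
    private static final int FAILURE_THRESHOLD = 5;
    private static final long OPEN_MILLIS = 10_000;    // stay open for 10 seconds

    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private volatile long openedAt = 0;

    public <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();                      // breaker open: degrade immediately
        }
        try {
            T result = action.get();
            consecutiveFailures.set(0);                 // success resets the failure count
            openedAt = 0;
            return result;
        } catch (RuntimeException e) {
            if (consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                openedAt = System.currentTimeMillis();  // trip the breaker
            }
            return fallback.get();
        }
    }

    private boolean isOpen() {
        return openedAt != 0 && System.currentTimeMillis() - openedAt < OPEN_MILLIS;
    }
}
```

A call site might wrap the recommendation lookup as `breaker.call(() -> recommendClient.fetch(uid), Collections::emptyList)` (names hypothetical), so the homepage still renders, just without recommendations, instead of timing out.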

The overall architecture can be understood as:

A Station A exclusive: the five bananas must be eaten

Well folks, that's all for this post. I'll keep updating every week with a few high-quality articles on big-company interviews and common technology stacks. Thanks to everyone who has read this far; if you think this article is well written, please give it the one-click triple!! Creating content is not easy, thank you for your support and recognition, and see you in the next article!

I'm Jiuling. Anyone who wants to chat can add me on WeChat: JayCE-K, and follow the public account "Java tutorial" to get first-hand material! If there are any mistakes in this blog, please point them out in the comments. Thank you very much!