As everyone knows, even though I'm a programmer, I really love sports, dancing in particular. Every day when I get home, I study dance videos in the dance section of B station before going to bed.

Yesterday was no exception. After washing up, I hurried to sit down in front of the computer and opened the B station dance section, ready to learn the new moves from Yaoren Mao, Xin Xiaomeng and Xiao Xianruo. I have to say these ladies dance really well; even an introvert like me couldn't help wriggling along.

Just as I was about to learn the next move, the page turned into a 404 Not Found.

As a developer, my first instinct was that their system had broken, though I also suspected my own network. But my phone on mobile data worked fine, and my computer could open other websites normally.

I refreshed a few times and it was still the same. I felt a little sorry for whoever is on the hook for this; their year-end bonus is probably gone. (The site still hadn't been restored as of this writing.)

As an old programmer, I habitually started thinking about how B station's site architecture is put together, where this incident might have gone wrong, and how they would recover. (Occupational habit from the old job.)

First we can sketch a simple architecture diagram of a website, and then we can guess where the problem might be this time.

Since I was staying up late writing this, and I've never worked at a company whose main business is live video streaming and don't know their tech stack very well, I sketched it following the general logic of an e-commerce business instead.

From top to bottom: the entry point, CDN content distribution, front-end servers, back-end servers, distributed storage, big data analysis, risk control, and search and recommendation. I drew it quite casually, but I don't think the overall architecture differs much from this.

I looked around online at companies like Betyu, B station and A station. Their main technology stacks and technical challenges roughly include:

Video access and storage

  • Traffic
  • Routing to the nearest node
  • Video encoding and decoding
  • Resumable upload/download (far more involved than the toy IO examples we write; see the sketch after this list)
  • Database system & file system isolation
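
The resumable-download item deserves a tiny illustration. Below is a minimal sketch in Go of how the idea works at the HTTP level; it is my own assumption for illustration, with a hypothetical URL and file name, not B station's actual client. The client checks how many bytes it already has on disk and asks the server for the rest with a Range header.

```go
// Minimal sketch of a resumable ("breakpoint continuation") download:
// check how much of the file we already have, then ask the server for the
// rest via an HTTP Range header. URL and file name are hypothetical.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func resumeDownload(url, path string) error {
	// Open (or create) the local file and see how many bytes we already have.
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return err
	}
	offset := info.Size()
	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		return err
	}

	// Ask only for the bytes we are missing.
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// 206 Partial Content means the server honored the Range request.
	if offset > 0 && resp.StatusCode != http.StatusPartialContent {
		return fmt.Errorf("server did not honor the Range header: %s", resp.Status)
	}

	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	if err := resumeDownload("https://example.com/video.mp4", "video.mp4"); err != nil {
		fmt.Println("download failed:", err)
	}
}
```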

Concurrent access

  • Streaming media servers (every major vendor has them; the bandwidth cost is huge)
  • Data cluster, distributed storage, cache
  • CDN content distribution
  • Load balancing
  • Search engine (sharding; see the sketch after this list)
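
For the sharding and cache items above, the classic technique is consistent hashing: keys such as video IDs or cache entries are mapped onto a hash ring, so adding or removing a node only moves a small slice of the keys. Here is a self-contained sketch in Go; the node and video names are made up for illustration, and real systems add replication and health checks on top.

```go
// Minimal consistent-hash ring: a sketch of how cache keys or shard keys
// can be spread across nodes, not B station's actual implementation.
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

type Ring struct {
	hashes []uint32          // sorted hash points on the ring
	nodes  map[uint32]string // hash point -> node name
}

func NewRing(nodes []string, replicas int) *Ring {
	r := &Ring{nodes: make(map[uint32]string)}
	for _, n := range nodes {
		// Virtual nodes smooth out the key distribution.
		for i := 0; i < replicas; i++ {
			h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", n, i)))
			r.hashes = append(r.hashes, h)
			r.nodes[h] = n
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Get returns the node responsible for a key: the first hash point
// clockwise from the key's own hash.
func (r *Ring) Get(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.hashes[i]]
}

func main() {
	ring := NewRing([]string{"cache-1", "cache-2", "cache-3"}, 100)
	for _, video := range []string{"BV1xx411", "BV2yy422", "BV3zz433"} {
		fmt.Printf("%s -> %s\n", video, ring.Get(video))
	}
}
```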

Barrage (danmaku) system

  • Concurrency, threading
  • Kafka
  • NIO frameworks (Netty); see the sketch after this list
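
The list above names Kafka and an NIO framework (Netty) as the backbone of the barrage system. As a language-agnostic sketch of the same fan-out idea, here is a tiny in-memory broadcaster in Go: one room, many viewers, every comment pushed to each connected viewer. The types and names are hypothetical, and there is no persistence or networking.

```go
// Tiny in-memory sketch of the fan-out behind a barrage (danmaku) system.
// Real systems put Kafka/Netty and storage in the middle; this only shows
// the concurrency shape.
package main

import (
	"fmt"
	"sync"
)

type Danmaku struct {
	User, Text string
}

type Room struct {
	mu      sync.Mutex
	viewers map[int]chan Danmaku
	nextID  int
}

func NewRoom() *Room {
	return &Room{viewers: make(map[int]chan Danmaku)}
}

// Join registers a viewer and returns the channel their comments arrive on.
func (r *Room) Join() (int, <-chan Danmaku) {
	r.mu.Lock()
	defer r.mu.Unlock()
	id := r.nextID
	r.nextID++
	ch := make(chan Danmaku, 16) // small buffer so slow viewers don't block the room
	r.viewers[id] = ch
	return id, ch
}

// Publish fans a comment out to every connected viewer, dropping it for
// viewers whose buffer is full instead of stalling everyone else.
func (r *Room) Publish(d Danmaku) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, ch := range r.viewers {
		select {
		case ch <- d:
		default: // viewer too slow; drop rather than block
		}
	}
}

// Close disconnects every viewer.
func (r *Room) Close() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, ch := range r.viewers {
		close(ch)
		delete(r.viewers, id)
	}
}

func main() {
	room := NewRoom()
	var wg sync.WaitGroup

	for i := 0; i < 3; i++ {
		_, ch := room.Join()
		wg.Add(1)
		go func(n int, ch <-chan Danmaku) {
			defer wg.Done()
			for d := range ch {
				fmt.Printf("viewer %d sees %s: %s\n", n, d.User, d.Text)
			}
		}(i, ch)
	}

	room.Publish(Danmaku{User: "someone", Text: "this dance is great"})
	room.Close() // in a real server the room stays open; this just ends the demo
	wg.Wait()
}
```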

In fact, these are much the same technologies the rest of us learn; the difference is mostly the language mix of their microservices, where Go, PHP, Vue and Node probably account for a large share.

Let's analyze where this incident might have happened and what could have caused it:

1. Someone deleted the database and ran away

Weimob already went through this, so I don't think any company still hands out that much ops permission; for example, commands like rm -rf, fdisk and drop are usually banned outright at the host level.

Besides, databases these days almost certainly run multi-primary, multi-replica with multiple backups, and disaster recovery should be in good shape. And if only the database had blown up, the many static resources on the CDN should still load; instead the whole page went straight to 404.
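
On banning dangerous commands, one common pattern is for the jump or bastion server to check every command against a denylist before forwarding it. A rough sketch of that idea in Go; the list and the matching rule are made up for illustration, not any company's real tooling.

```go
// Rough sketch of the "ban dangerous commands" idea: a jump/bastion host
// checks each command against a denylist before letting it through.
// The denylist and matching rule are purely illustrative.
package main

import (
	"fmt"
	"strings"
)

var denied = []string{"rm -rf", "fdisk", "mkfs", "drop database", "drop table"}

// allowed returns false if the command contains any denied pattern.
func allowed(cmd string) bool {
	lower := strings.ToLower(cmd)
	for _, bad := range denied {
		if strings.Contains(lower, bad) {
			return false
		}
	}
	return true
}

func main() {
	for _, cmd := range []string{"ls -l /data", "rm -rf /", "DROP DATABASE prod;"} {
		if allowed(cmd) {
			fmt.Println("forwarded:", cmd)
		} else {
			fmt.Println("blocked:  ", cmd)
		}
	}
}
```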

2. A single microservice crash drags down a large cluster

Front end and back end are separated nowadays. If the back end dies, the front end can still load plenty of things; the data just won't come through. So for the whole cluster to hang, maybe the front end died too, or front and back died together. Either way the same problem remains: right now it looks like even static resources can't be reached.

Still, I think this one is somewhat plausible: a few services go down, errors pile up and drag the cluster with them, and everyone frantically refreshing the page makes it even harder for the remaining services to restart. But I don't rate it as likely as the last possibility I'll get to.
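
The usual defence against that kind of cascade is a circuit breaker: once calls to a struggling service keep failing, fail fast for a while instead of hammering it, so it has room to restart rather than being flooded by retries and page refreshes. A bare-bones sketch of the idea in Go, not any particular framework:

```go
// Bare-bones circuit breaker sketch: after too many consecutive failures the
// breaker "opens" and calls fail fast for a cooldown period, giving the
// struggling service room to come back.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open: failing fast")

type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openUntil time.Time
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrOpen // don't even try; downstream is presumed down
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openUntil = time.Now().Add(b.cooldown) // trip the breaker
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // a success resets the count
	return nil
}

func main() {
	b := &Breaker{threshold: 3, cooldown: 2 * time.Second}
	flaky := func() error { return errors.New("backend 502") } // pretend the backend is down

	for i := 0; i < 6; i++ {
		fmt.Println(b.Call(flaky))
	}
}
```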

3. Something went wrong at the server vendor

A big site like this is CDN + SLB + a cluster of origin sites. By rights, all the rate limiting, degradation and load balancing should be done very well, and by rights they wouldn't skip disaster recovery either.

So it's more likely there was a problem with the server vendor behind these front-facing services. If the CDN goes down, enormous pressure lands on the gateway and the load balancers, and that can eventually set off a chain avalanche that crashes the whole system.
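
The standard way to keep that avalanche from spreading is rate limiting at the gateway: when the CDN disappears and raw traffic hits the origin, shed the excess instead of letting it take everything down. A minimal sketch using Go's golang.org/x/time/rate token bucket; the limits are made-up numbers, not B station's real configuration.

```go
// Minimal gateway-style rate limiting sketch: a token bucket in front of the
// origin rejects excess traffic with 429s instead of letting a CDN failure
// turn into an avalanche. The limits here are illustrative only.
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

func main() {
	// Allow roughly 1000 requests per second with a burst of 200.
	limiter := rate.NewLimiter(rate.Limit(1000), 200)

	origin := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello from origin"))
	})

	// Wrap the real handler: reject immediately when the bucket is empty.
	limited := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "too many requests, please retry later", http.StatusTooManyRequests)
			return
		}
		origin.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", limited))
}
```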

What confuses me, though, is that B station's BFF layer should route users to whichever machine room is closest to them. If that were the case, when people across the country refreshed, some should be fine and some should be broken; yet right now it looks like everyone is broken. Did they really bet everything on a single node region of a single vendor?

I also saw claims online that a cloud data center caught fire. I don't know whether that's true, so I can only wait until I wake up and see B station's official announcement. In principle and in theory, B station should have plenty of safeguards across CDN, distributed storage, big data and search; if everything really was placed in one spot, that was not very wise.

My gut feeling is that some off-cloud physical servers had problems and the key business just happened not to be on the cloud. Companies nowadays run a hybrid of public cloud and private cloud, but the private cloud part carries B station's internal business, so their own machine room shouldn't be the issue.

If it really is as I said, everything bet on one server vendor, then it's still fine if only the CDN broke; but if the physical machines went wrong, data recovery could be slow. I did big data work myself before, so I know backups are incremental plus full; during a restore the healthy nodes can pull data from other regions, but if it all sits in one place, that's a real problem.

Analysis

No matter what the cause turns out to be, what we technical people and our companies should be thinking about is how to avoid this kind of thing.

Data backup: backups are a must, otherwise a natural disaster would be very painful. That's why many cloud vendors now build data centers in places with few natural disasters, like my hometown Guizhou, or at the bottom of lakes and the sea (it's relatively cool there, which cuts cooling costs a lot).

Full plus incremental backups are basically standard: daily, weekly and monthly incrementals on top of scheduled full backups, so even if the mechanical disks across a whole region die, the loss stays small (short of the earth being destroyed, you can recover). A sketch of how a restore stitches them together is below.
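
As a rough sketch of how full plus incremental backups fit together at restore time (hypothetical metadata, nothing to do with B station's actual tooling): take the most recent full backup, then replay every incremental made after it, oldest first.

```go
// Rough sketch of restoring from "full + incremental" backups: pick the latest
// full backup, then apply every incremental taken after it, oldest first.
// The backup metadata here is hypothetical and purely illustrative.
package main

import (
	"fmt"
	"sort"
	"time"
)

type Backup struct {
	Name    string
	Full    bool
	TakenAt time.Time
}

// restorePlan returns the backups to apply, in order.
func restorePlan(backups []Backup) []Backup {
	sort.Slice(backups, func(i, j int) bool { return backups[i].TakenAt.Before(backups[j].TakenAt) })

	// Find the most recent full backup; everything before it is irrelevant.
	lastFull := -1
	for i, b := range backups {
		if b.Full {
			lastFull = i
		}
	}
	if lastFull == -1 {
		return nil // no full backup: nothing to restore from
	}
	return backups[lastFull:] // the full backup plus the later incrementals
}

func main() {
	day := 24 * time.Hour
	now := time.Now()
	backups := []Backup{
		{"full-sunday", true, now.Add(-3 * day)},
		{"incr-monday", false, now.Add(-2 * day)},
		{"incr-tuesday", false, now.Add(-1 * day)},
		{"full-last-week", true, now.Add(-10 * day)},
	}
	for _, b := range restorePlan(backups) {
		fmt.Println("apply:", b.Name)
	}
}
```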

Converge ops permissions: the fear of someone deleting the database and running never goes away. Honestly I run rm -rf on servers all the time myself, but access generally goes through a jump server, where dangerous commands can be blocked.

Moving to the cloud + cloud native: cloud products are very mature now, and enterprises should place enough trust in the right cloud vendors (choosing well matters, of course). The capability of the cloud products themselves, plus the disaster recovery and emergency response they provide at critical moments, are things many companies simply don't have in-house.

Cloud native is the technology everyone has been watching in recent years. The Docker + k8s combination, together with the capabilities of cloud computing, can deliver unattended operation, dynamic scale-out and scale-in, and the emergency response mentioned above. But adopting it has its own costs, and I don't know whether a video system like B station's is a good fit for it.

Kubernetes itself also has some rough edges in its orchestration design and in how services communicate.

Build your own strength: whether you're on the cloud or not, I don't think you can lean too heavily on cloud vendors. You still need your own core technology stack and your own emergency mechanisms; what do you do if the cloud vendor really turns out to be unreliable? How to achieve true high availability is something enterprise engineers need to think through for themselves.

Many cloud vendors, for example, split one physical machine into several virtual machines to sell, so a single physical host carries multiple tenants. If one tenant is an e-commerce site running a Double Eleven sale and another is a game company, they fight each other for network bandwidth, packets may get dropped, and the game players get a terrible experience. That's why I say don't trust and depend on cloud vendors too much.

If the other tenant bought the machine to mine crypto, that's even worse: the compute gets drained and the host runs at full load, which is even more painful.

Fortunately this problem surfaced early, and at night, so there should be plenty of low-traffic hours to recover in. As I write this, most pages have come back, though from what I can see it's only a partial recovery.

In any case, hopefully next time this can be avoided entirely. I believe B station will be busy with architecture overhauls for quite a while to ensure true high availability.

I hope that in the future I can watch the dance section at night without a hitch, instead of staring blankly at the 2233 girls on a 502 or 404 page, hehe.