I was scrolling through Weibo over lunch when it suddenly went down. At first I thought my home network was flaky, but after switching to mobile data I still couldn't load anything and kept getting errors, so I figured Weibo itself had crashed. It turned out that Lu Han and Guan Xiaotong had just "announced their relationship" on Weibo. I probably wouldn't even know who Guan Xiaotong is if I hadn't watched the TV series "Good Sir" before. Wasn't the "Land CP" still being hyped just a few days ago? Why this, all of a sudden? Sigh... the entertainment circle really is a mess.

As a programmer, though, what interests me more is how Weibo copes with the instantaneous influx of high-concurrency, high-volume traffic. From Ma Yili's "See you on Monday" long ago, to the later rounds of cheating and drug scandals competing for headlines, to the Guo Jingming and Xue Zhiqian incidents a while back, and now Lu Han announcing his relationship... Weibo seems to go down every single time, and people joke that the platform is garbage because you can't refresh any content whenever a hot event breaks. But the Weibo backend must have been continuously refactored, upgraded, and optimized behind the scenes; reaching today's level is already quite impressive.

From the user's perspective (mainly my own perspective, haha)

For roughly half an hour, refreshing every 10 seconds or so would occasionally return a 5XX error. HTTP status codes starting with 5 indicate a problem on the server or gateway side: a 5XX error can come from the server rejecting the request, a gateway timeout, or a bug in the application code. Within about an hour of the hotspot event, normal service was restored.
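As an aside, a client that keeps hitting intermittent 5XX responses during an overload like this is better served by retrying with a backoff than by hammering refresh. Below is a minimal sketch of that idea in Python; the endpoint URL and the timing values are placeholders of my own, not anything Weibo actually exposes.

```python
import time
import requests

def fetch_with_backoff(url, max_retries=5, base_delay=1.0):
    """Fetch a URL, backing off exponentially when the server answers 5XX."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code < 500:
                return resp                       # success, or a 4XX we won't retry
        except requests.RequestException:
            pass                                  # network error: treat like a 5XX
        time.sleep(base_delay * (2 ** attempt))   # wait 1s, 2s, 4s, ... and retry
    raise RuntimeError(f"still failing after {max_retries} attempts")

# hypothetical endpoint, purely for illustration
# feed = fetch_with_backoff("https://example.com/api/feed")
```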

From the perspective of the site's developers and ops engineers

  • M: Oh my god, why is traffic spiking like this? Is it a bug? Don't tell me it's another cheating or drug scandal... emmm, and my National Day holiday isn't even over yet!
  • Ops: Guys, wake up! Add more machines, the system is about to go down!
  • Development: Stop rushing me! Rush me again and I'm done for!
  • Leader: Get the testers on it as soon as the scale-out is finished!
  • Test: I was just sitting at home and the blame fell out of the sky!

Of course, I made all of that up!

Why do I think Weibo actually performs well under high concurrency and heavy traffic? Compare it with Taobao: Taobao also has to handle enormous concurrency during the Double 11 shopping festival every year, but it can prepare well in advance, purchasing bandwidth, adding server resources, and setting up full geo-redundant disaster recovery, because most of the load is predictable. Weibo, on the other hand, can be hit by a hot event at any moment, which is a real test for its ops engineers. Of course, modern ops platforms are quite intelligent: they monitor all kinds of metrics in real time, and as soon as something looks abnormal they immediately fire off an SMS or email alert, after which engineers in each role scramble to allocate resources.
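For a rough idea of what that kind of metric-threshold alerting looks like (this is a toy sketch, not how Weibo's ops platform actually works; the metric names and thresholds are invented), the core loop could be as simple as:

```python
import time
import random

# invented thresholds, purely for illustration
THRESHOLDS = {"qps": 50_000, "error_rate": 0.05, "cpu_load": 0.85}

def read_metrics():
    """Stand-in for a real metrics source (Prometheus, Graphite, ...)."""
    return {
        "qps": random.randint(10_000, 80_000),
        "error_rate": random.random() * 0.1,
        "cpu_load": random.random(),
    }

def send_alert(metric, value, limit):
    """Stand-in for the SMS/email gateway call."""
    print(f"[ALERT] {metric}={value:.2f} exceeds threshold {limit}")

def monitor_loop(interval=10, rounds=3):
    """Poll the metrics every `interval` seconds and alert on anything abnormal."""
    for _ in range(rounds):
        metrics = read_metrics()
        for name, limit in THRESHOLDS.items():
            if metrics[name] > limit:
                send_alert(name, metrics[name], limit)
        time.sleep(interval)

if __name__ == "__main__":
    monitor_loop(interval=1)
```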

So why doesn't Weibo just add more server resources in normal times?

Server resources and network bandwidth are both important and expensive. Since high-concurrency scenarios don't happen all the time, keeping a large pool of mostly idle machines around just in case would be wasteful. You have to weigh cost against service availability.

I started this post intending to review what I had learned about high-availability architecture for large websites. But high performance, scalability, and extensibility are so intertwined with high availability that I found it hard to keep the scope contained; there is so much I want to write that I don't know where to start. So let me just go over a few points that come up again and again in high-availability architectures.

Highly available architecture for large web sites

Layering is a must for sites both small and large. A coarse-grained split typically gives you an application layer, a business layer, and a data layer. After this horizontal layering, each layer can be further split vertically by module. Take Weibo: I would expect its comment module and its like module to be decoupled from each other. The more complex the system, the finer the granularity of both the horizontal and the vertical splits. What looks to you like a single system may actually be served by hundreds or thousands of independently deployed systems.
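Purely as a conceptual sketch of that kind of vertical split (the class and method names are made up and have nothing to do with Weibo's real code), the comment and like features would each own their own data and expose only a narrow interface:

```python
class CommentService:
    """Owns comment data; can be deployed and scaled on its own."""
    def __init__(self):
        self._comments = {}                 # post_id -> list of (user, text)

    def add_comment(self, post_id, user, text):
        self._comments.setdefault(post_id, []).append((user, text))

class LikeService:
    """Owns like counters; a redeploy or failure here never touches comments."""
    def __init__(self):
        self._likes = {}                    # post_id -> count

    def like(self, post_id):
        self._likes[post_id] = self._likes.get(post_id, 0) + 1
        return self._likes[post_id]

# The application layer composes the two services without coupling them:
comments, likes = CommentService(), LikeService()
likes.like("weibo:42")
comments.add_comment("weibo:42", "fan_001", "Congratulations!")
```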

Clustering

Clustering is a very, very important concept in large-site architecture. Because any single server (whether an application server or a data server) is a potential single point of failure, you need failover: once one server goes down, another must take over.
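A bare-bones sketch of failover via health checks might look like the following; the server addresses and the /health endpoint are placeholders, and in practice this job is usually done by a load balancer or something like Keepalived rather than hand-rolled code:

```python
import requests

# hypothetical pool of application servers
SERVERS = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]

def healthy(server):
    """A server counts as healthy if its health endpoint answers 200 quickly."""
    try:
        return requests.get(server + "/health", timeout=1).status_code == 200
    except requests.RequestException:
        return False

def pick_server():
    """Fail over: route traffic to the first healthy server in the pool."""
    for server in SERVERS:
        if healthy(server):
            return server
    raise RuntimeError("no healthy server available")
```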

Application Server Cluster

In general, application servers should be stateless. What does that mean? First consider a stateful server: it keeps information about past requests (such as sessions) in its own memory, and later requests rely on that stored context. This forces sticky sessions, and if a stateful server goes down, the request state it holds is lost, which can lead to unpredictable problems. A stateless server, by contrast, stores no request context of its own: whatever user information it needs must be carried in the request itself or fetched from another cluster. Stateless servers are therefore more robust; state survives even if a server restarts or dies. They are also much easier to scale out: deploy the same application on a new machine, add it to the reverse proxy configuration, and it can start serving traffic normally.
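To make the contrast concrete, here is a deliberately simplified sketch (no real framework involved; the token and store names are invented). The stateful handler keeps sessions in its own process memory, while the stateless handler looks everything up in a shared external store on every request:

```python
# --- stateful: the session lives inside this one process ------------------
local_sessions = {}                      # lost forever if this process dies

def stateful_handler(token, action):
    user = local_sessions.get(token)
    return f"{user} did {action}" if user else "please log in again"

# --- stateless: the session lives in a shared external store --------------
shared_store = {}                        # stand-in for Redis/Memcached

def stateless_handler(token, action):
    user = shared_store.get(token)       # any server in the cluster can answer
    return f"{user} did {action}" if user else "please log in again"

# Any number of identical stateless servers can sit behind the reverse proxy,
# because none of them owns the session data.
shared_store["tok-123"] = "eakon"
print(stateless_handler("tok-123", "refresh feed"))
```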

Session management

Since the application servers are stateless, how is a user's login state (the session) managed? There are four common approaches, listed below. The first three have significant limitations, so I will only discuss the last one, a cluster of dedicated session servers.

  • Session replication
  • Source Address Hash (Session binding)
  • Use cookies to record sessions
  • The session server

    The idea is to split state out of the application tier: stateless application servers in front, stateful session servers behind them. The session server is of course itself a cluster. Sessions can be kept in a distributed cache or in a relational database; for something at Weibo's scale a distributed cache is essentially mandatory, because with a relational database the connection pool becomes a bottleneck and I/O takes far too long. A common key-value in-memory store is Redis, and I suspect Redis is also a good fit for counters such as like counts and a user's follower and following numbers.
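As a minimal sketch of a Redis-backed session store, assuming the redis-py client and a locally running Redis (the key names and TTL are my own choices, not anything Weibo uses):

```python
import uuid
import redis

# shared session store that every stateless application server connects to
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

SESSION_TTL = 30 * 60  # 30-minute sessions; an arbitrary choice

def create_session(user_id):
    """Called after a successful login; returns the token handed to the client."""
    token = uuid.uuid4().hex
    key = f"session:{token}"
    r.hset(key, mapping={"user_id": user_id})
    r.expire(key, SESSION_TTL)
    return token

def load_session(token):
    """Any application server in the cluster can resolve the token."""
    data = r.hgetall(f"session:{token}")
    return data or None

def like_post(post_id):
    """Counters such as like counts also fit Redis naturally."""
    return r.incr(f"post:{post_id}:likes")
```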

Load balancing

Now that clustering has come up, load balancing has to be mentioned too. Strictly speaking, load balancing probably belongs under scalability, so I won't go deep here; I'll just briefly list the common load-balancing methods and algorithms (with a small sketch of two of the algorithms after the lists).

Load balancing methods
  • HTTP redirect load balancing
  • DNS (domain name resolution) load balancing
  • Reverse proxy load balancing
  • IP load balancing
  • Data link layer load balancing
Load balancing algorithms
  • Round robin
  • Random
  • Source address hashing
  • Weighted round robin
  • Weighted random
  • Least connections
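To give a flavor of two of these algorithms, here is a small sketch; the backend addresses and weights are placeholders, and in production this logic normally lives in Nginx, LVS, or a hardware load balancer rather than application code:

```python
import hashlib
import itertools

# hypothetical backends with weights: (address, weight)
BACKENDS = [("10.0.0.1:8080", 3), ("10.0.0.2:8080", 1), ("10.0.0.3:8080", 1)]

# --- weighted round robin: repeat each backend by its weight, then cycle ---
_rotation = itertools.cycle(
    [addr for addr, weight in BACKENDS for _ in range(weight)]
)

def weighted_round_robin():
    return next(_rotation)

# --- source address hashing: the same client IP always maps to the same
#     backend, which is what gives you "session binding" -------------------
def source_hash(client_ip):
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    addr, _ = BACKENDS[digest % len(BACKENDS)]
    return addr

print([weighted_round_robin() for _ in range(5)])
print(source_hash("203.0.113.7"), source_hash("203.0.113.7"))  # stable mapping
```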

A quick digression

Something interesting just occurred to me: Weibo's architecture should use an asynchronous pull model rather than a synchronous push model. What does that mean? Take an example: Lu Han has over 30 million followers and Guan Xiaotong over 10 million. Suppose that when either of them posts, the post has to be pushed out to a 40-million-row follower list (assume a relational database for simplicity). That, in simplified terms, is the synchronous push model.

So what's the downside? First, it burns a huge number of database connections, and more importantly it goes against basic design sense: the same post ends up duplicated as redundant data 30-million-plus and 10-million-plus times for their respective fans. Now suppose Chen He and Deng Chao immediately like those posts; the bottleneck gets worse. Inserting 40-million-plus rows a moment ago was barely tolerable, but with the fan bases of all four celebrities combined you are looking at hundreds of millions of rows. Does it really make sense to write that much data into the database at once?

No matter; we can deliver content a different way: instead of pushing synchronously, we let clients pull asynchronously. Ever notice that you have to pull down to refresh on your phone to see new Weibo content? That is the asynchronous pull. What are the benefits? The obvious one is that hot data can be managed centrally instead of being duplicated across tens of millions of rows, and the system consumes far fewer resources. So where does the content get pulled from? The mainstream approach is to put hot content in a cache and hit the cache on every read, which cuts down on I/O and avoids timeouts caused by resource exhaustion.
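Here is a toy sketch of that pull-on-read idea (fan-out on read backed by a hot cache), using in-memory dictionaries as stand-ins for the cache and the post store; the function names are mine and the real Weibo feed service is certainly far more involved:

```python
import time

posts = {}      # stand-in for the post store: author -> list of (timestamp, text)
hot_cache = {}  # stand-in for Redis/Memcached: author -> recent posts

def publish(author, text):
    """Write once: store the post and refresh the author's hot-cache entry.
    No fan-out to tens of millions of follower timelines."""
    posts.setdefault(author, []).append((time.time(), text))
    hot_cache[author] = posts[author][-20:]          # keep only recent posts hot

def pull_timeline(following):
    """Called when the user pulls to refresh: merge the followed authors'
    recent posts from the cache, newest first."""
    merged = []
    for author in following:
        merged.extend(hot_cache.get(author, []))
    return sorted(merged, reverse=True)

publish("luhan", "Hello everyone, introducing my girlfriend @guanxiaotong")
publish("guanxiaotong", "^_^")
print(pull_timeline(["luhan", "guanxiaotong"]))
```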

PS: By this point I was starting to run out of steam and couldn't keep writing. I haven't blogged in a long time and have forgotten a lot. But today's hot event pushed me to sort this material out in my head again, which is timely, since my graduation project will also involve high concurrency and heavy traffic. Consider this riding today's hotspot to review what I learned before.

In fact, high-availability architecture also covers service upgrades, graceful degradation, data backup, failover, and more. There is much more to write about the high availability, high performance, and scalability of large websites, but without real hands-on experience some of it is hard to truly grasp. I'll come back to it later.

Finally, a small ad: I have been writing my tech blog on Jianshu, but I feel its content has become too miscellaneous to stay focused. I plan to keep the lighter posts on Jianshu and gradually move the tech blog over to Juejin. Follow the author on Jianshu: EakonZhao.