Title picture: The Tomb Raiders

How did Lu Han going public about his relationship with Guan Xiaotong bring down Sina Weibo's servers?

(200+ likes; top answerer in the Programmer topic)

I don't think this looks like a database failure. Weibo's architecture is well beyond a simple "distributed servers plus a database" setup; a design like that could not even withstand everyday load, let alone the traffic from Lu Han and Guan Xiaotong making big news.

Wang Gaofei just mentioned adding 1,000 servers. A database cannot be scaled out elastically on such short notice. What can be scaled out that quickly is limited to HTTP servers, middle-tier services, caches, and message queues.

Most likely Weibo either never wrote a fully automatic scale-out algorithm, or does not dare leave the decision entirely to one. Automatically ordering a few dozen extra servers when traffic rises is acceptable, but if a bug suddenly added a thousand, how much money would that cost Weibo? Capacity expansion of this magnitude usually requires manual confirmation by the operations team.
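
The kind of guard described here can be sketched roughly as follows. This is only an illustration, not Weibo's actual system: the function name `plan_scale_out`, the per-server capacity, and the auto-approval limit of 50 servers are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalePlan:
    servers_to_add: int          # extra servers the current load calls for
    needs_manual_approval: bool  # True when the step is too large to automate


def plan_scale_out(current_servers: int,
                   requests_per_sec: float,
                   capacity_per_server: float,
                   auto_limit: int = 50) -> ScalePlan:
    """Decide how many servers to add and whether a human must confirm.

    All numbers are illustrative assumptions, not real Weibo values.
    """
    # Servers needed to carry the observed load, rounded up.
    needed = math.ceil(requests_per_sec / capacity_per_server)
    # Never scale in here; this sketch only handles growth.
    to_add = max(0, needed - current_servers)
    # Small steps are applied automatically; large ones wait for operations.
    return ScalePlan(to_add, to_add > auto_limit)
```

With 100 servers that each handle 1,000 requests per second, a load of 120,000 requests per second yields a plan of 20 servers with no approval needed, while 1,100,000 requests per second yields 1,000 servers and is flagged for manual confirmation.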

It also happened around midday on the last day of a long holiday, an off-peak period, so the servers were not well prepared. There is no advance warning for a celebrity announcing a relationship; who knows when they will suddenly introduce a girlfriend on a whim…

What acquaintances are saying

Because without Zhuo Wei around, it was all too sudden

– XiaoXiaoBang

Based on the information available, the database was overwhelmed. This is a guess for now; afterwards I will write a program to analyze the likes, comments, and reposts from that period to verify it.

A site like Weibo is unlikely to be brought down by heavy traffic alone, because it tolerates errors in non-essential fields. Having lived through several hot events before, I believe that when big news breaks, Weibo temporarily sacrifices a little data accuracy to keep its key services available. In other words, it is hard to overwhelm Weibo with requests alone.

Judging by the counters on the post involved in the incident (likes, reposts, comments, replies to comments, likes on comments, and likes on comments attached to reposts), the most likely cause is that too many requests needed to write to the database at the time (write operations may have peaked at hundreds of thousands or even higher), and most of those writes landed on the same post. Moreover, some writes trigger further writes: a reply to a comment has to notify the commenter, a like has to be pushed to followers' feeds, and so on. The database could not bear the load and finally went down for a while.

In fact, with a well-built cache, read requests for core data could still have been served during this time. (Of course, Weibo's caching is not great: the data on my own profile page has been wrong for a long time, and reporting it has had no effect.) When the database is under too much pressure, part of the write requests can be made asynchronous, or some requests can be temporarily discarded in exchange for stability. This has both advantages and drawbacks, of course, and is not necessarily a good idea.

You could scrape the timestamps of all comments, reposts, replies, and likes on Lu Han's post from that period to see how many writes succeeded in the seconds before the failure.
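
The two mitigations mentioned above, making writes asynchronous and shedding some of them under pressure, can be sketched with a bounded in-memory queue. This is only an illustration under assumptions: `WriteBuffer`, its capacity, and the `apply` callback standing in for a database write are hypothetical names, and nothing here reflects how Weibo actually handles writes.

```python
import queue
from typing import Callable


class WriteBuffer:
    """Bounded buffer that decouples accepting a write from applying it."""

    def __init__(self, capacity: int) -> None:
        self._pending: queue.Queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # writes discarded because the buffer was full

    def submit(self, op: dict) -> bool:
        """Accept a write without blocking; shed it if the buffer is full."""
        try:
            self._pending.put_nowait(op)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, apply: Callable[[dict], None], max_batch: int) -> int:
        """Apply up to max_batch buffered writes; meant for a background worker."""
        applied = 0
        while applied < max_batch:
            try:
                op = self._pending.get_nowait()
            except queue.Empty:
                break
            apply(op)  # hypothetical stand-in for the real database write
            applied += 1
        return applied
```

With a capacity of 3, five back-to-back `submit` calls accept the first three writes and drop the last two, and a later `drain` applies the three buffered writes in order. The trade-off is the one noted above: the request path stays fast and the database sees a bounded write rate, but shed writes are lost unless the client retries.
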
An irresponsible, unverified guess (my drawing skills are limited and part of the process is omitted; the number of arrows thinning out from top to bottom roughly indicates that many requests are reads and never reach the database, which will have to do):

Here are two charts from Weibo's back-end data:

That might not be intuitive, right?

You only see it by comparison! The discussion trend for Guan Xiaotong shot up by 1122.9%. Impressive!
