Author | White MANna

While we were enjoying the National Day holiday, a major “accident” hit the Internet: Facebook and its Instagram and WhatsApp apps went down for about seven hours and five minutes, and browsers showed a DNS error when trying to open them. That is a big loss for Facebook, whose apps have 3.51 billion monthly active users and 2.76 billion daily active users. Investors estimate that the seven-hour outage cost more than $968 million. It also wiped $64.3 billion off Facebook’s market value and $7 billion off founder Mark Zuckerberg’s net worth.

Facebook said the root cause of the failure was a problem during routine maintenance. A configuration change on the backbone routers that coordinate network traffic between data centers caused problems with its DNS servers and took down internal tools and systems, which prevented operations personnel from remotely accessing the devices to restore the network. Operations and maintenance (O&M) personnel therefore had to physically enter the data centers, which are protected by strict access procedures, and restart the equipment manually. As a result, the mean time to recovery (MTTR) was stretched out severely.

In short, a bad command, a flawed audit tool, a DNS system that prevented a successful network recovery, and cumbersome data center processes all contributed to Facebook’s seven-hour meltdown.

Specifically, O&M personnel were performing maintenance on part of the backbone network. As part of routine maintenance to assess the availability of global backbone capacity, the work inadvertently took down all connections in the backbone network, disconnecting Facebook’s data centers worldwide. At the same time, Facebook’s architecture is designed to scale DNS services up or down based on server availability. When server availability dropped to zero because of the network failure, all DNS servers were taken out of service. This automatic response to the backbone failure appears to be the direct cause of the DNS failure. The DNS name servers take themselves out of service by sending messages to Border Gateway Protocol (BGP) routers, which store information about the routes used to reach a particular IP address. These routes are normally advertised to other routers so that they know how to direct traffic properly.

The BGP messages sent by Facebook’s DNS servers withdrew the route advertisements, so traffic could no longer be directed to anything on Facebook’s backbone network. The end result was that even though the DNS servers were still running, they could not be reached, and users lost service because the network they were trying to reach had effectively disappeared. Worse still, the same DNS services were used both for customer-facing websites and for Facebook’s own internal tools and systems.
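The chain of events described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Facebook’s actual implementation: the names `DnsServer`, `RoutingTable`, `health_check`, and `simulate` are hypothetical, and real BGP route advertisement is far more involved than a boolean flag.

```python
from dataclasses import dataclass, field


@dataclass
class DnsServer:
    """Toy DNS server that advertises its route only while it can reach a data center."""
    name: str
    advertised: bool = True

    def health_check(self, reachable_data_centers: int) -> None:
        # Assumption: the server withdraws its route when no data center is
        # reachable and re-advertises it otherwise.
        self.advertised = reachable_data_centers > 0


@dataclass
class RoutingTable:
    """Toy stand-in for what BGP routers know about reachable DNS servers."""
    servers: list = field(default_factory=list)

    def reachable(self) -> list:
        # Only servers whose route is still advertised can receive traffic.
        return [s.name for s in self.servers if s.advertised]


def simulate(reachable_data_centers: int) -> list:
    """Run one health check on every server and return the ones still reachable."""
    servers = [DnsServer("dns-a"), DnsServer("dns-b")]
    for server in servers:
        server.health_check(reachable_data_centers)
    return RoutingTable(servers).reachable()
```

Calling `simulate(0)` models the backbone failure: every server is still "running" as an object, but none of them is reachable, which is the situation users experienced.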

As you can see, DNS plays a very important role in all of this. So what is DNS? DNS is short for Domain Name System, a distributed database that maps domain names to IP addresses. Simply put, DNS is used to resolve domain names. Under normal conditions, every Internet access request from a user is resolved to a matching IP address through DNS. DNS is an application-layer protocol that serves other application-layer protocols, including but not limited to HTTP, SMTP, and FTP, by resolving the host name provided by a user to an IP address. The process is as follows:

(1) A DNS client runs on the user’s host (PC or mobile phone).
(2) The browser extracts the domain name field from the received URL, that is, the host name being visited, such as www.aliyun.com/, and passes this host name to the DNS client.
(3) The DNS client sends a query message containing the host name to the DNS server (this step involves a series of cache lookups and work by the distributed DNS cluster).
(4) The DNS client eventually receives a reply message containing the IP address that corresponds to the host name.
(5) Once the browser receives the IP address from DNS, it can initiate a TCP connection to the HTTP server located at that IP address.
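The steps above can be sketched with Python’s standard library. This is a minimal illustration, not how a browser is actually implemented: the function names `extract_host`, `resolve`, and `open_connection` are hypothetical, and the operating system’s resolver (reached through `socket.getaddrinfo`) stands in for the DNS client, hiding the cache lookups and recursion mentioned in step (3).

```python
import socket
from urllib.parse import urlsplit


def extract_host(url: str) -> str:
    """Step (2): pull the host name out of the URL."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host name found in URL: {url!r}")
    return host


def resolve(host: str, port: int, lookup=socket.getaddrinfo) -> list:
    """Steps (3)-(4): ask the resolver for the host's IP addresses."""
    infos = lookup(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for info in infos:
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address as a string.
        ip = info[4][0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def open_connection(url: str, timeout: float = 5.0) -> socket.socket:
    """Step (5): open a TCP connection to the first resolved address."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    ip = resolve(extract_host(url), port)[0]
    return socket.create_connection((ip, port), timeout=timeout)
```

The `lookup` parameter exists only so the resolver can be swapped out; during an outage like the one described here, the default lookup would raise `socket.gaierror` and step (5) would never be reached.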

The nearly seven-hour Facebook outage, which affected about 85 million users, was the worst since 2008. Looking back at the failure from the outside, one key problem stands out. Users reported that the websites and apps of Facebook, Messenger, WhatsApp, and Instagram could not refresh because of server errors. Facebook was almost completely offline in Europe, the Americas, Oceania, and Asian countries such as Japan, South Korea, and India, affecting users in dozens of countries and regions around the world. Yet it seems that Facebook did not catch these problems right away; the problem was only discovered after users in multiple countries and regions reported it.

Even a company as large as Facebook failed to detect the DNS failure promptly and suffered significant financial losses. Faced with failures like this, how can we monitor the health of our products and DNS and detect problems as soon as they occur? And how can we keep track of the user experience in different countries and regions around the world?

Among the wide range of APM products, non-intrusive cloud dial testing is the best solution. Alibaba Cloud (Aliyun) dial testing uses more than 1,000 monitoring points around the world, including real user monitoring points, to send network requests to the target domain name 24 hours a day, helping users monitor DNS service availability and resolution performance. DNS dial tests support both recursive and iterative query methods and let you specify the resolution server, so flexible dial test parameters can simulate real user access as closely as possible.

After dial test tasks run on schedule, Alibaba Cloud dial testing generates reports of DNS resolution times in different regions and clearly lists the details of the DNS request for each dial test, including the A record address, DNS time, and DNS resolution process, which helps users quickly analyze and locate DNS resolution problems.
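To make the idea of a DNS dial test and its per-region report concrete, here is a minimal sketch. It is not the Alibaba Cloud product or its API: `ProbeResult`, `probe`, and `report` are hypothetical names, and `region` is only a label, whereas a real dial test would actually run from a monitoring point in that region.

```python
import socket
import time
from dataclasses import dataclass
from statistics import mean


@dataclass
class ProbeResult:
    region: str
    domain: str
    ok: bool
    elapsed_ms: float
    addresses: list


def probe(region: str, domain: str, lookup=socket.gethostbyname_ex) -> ProbeResult:
    """Resolve the domain once and record the addresses and the time taken."""
    start = time.perf_counter()
    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist).
        _, _, addresses = lookup(domain)
        ok = bool(addresses)
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError.
        addresses, ok = [], False
    elapsed_ms = (time.perf_counter() - start) * 1000
    return ProbeResult(region, domain, ok, elapsed_ms, list(addresses))


def report(results: list) -> dict:
    """Summarize availability and average resolution time per region."""
    by_region = {}
    for result in results:
        by_region.setdefault(result.region, []).append(result)
    summary = {}
    for region, items in by_region.items():
        good = [r.elapsed_ms for r in items if r.ok]
        summary[region] = {
            "availability": len(good) / len(items),
            "avg_ms": mean(good) if good else None,
        }
    return summary
```

Failed lookups count against availability but are excluded from the average, so a region that is completely down shows an availability of 0 and no resolution time rather than a misleading number.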

By configuring DNS alarms, you can detect and fix DNS availability and resolution performance problems before users notice and report them, improving user satisfaction and reducing financial losses.
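A DNS alarm of this kind boils down to comparing measured values against thresholds. The sketch below is an assumption-laden illustration, not the configuration format of any real product: `AlarmRule`, `evaluate`, and the default thresholds (99% availability, 200 ms) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmRule:
    # Hypothetical thresholds; pick values that match your own service levels.
    min_availability: float = 0.99
    max_avg_ms: float = 200.0


def evaluate(rule: AlarmRule, region: str, availability: float,
             avg_ms: Optional[float]) -> list:
    """Return one alert message per threshold that the region violates."""
    alerts = []
    if availability < rule.min_availability:
        alerts.append(
            f"{region}: DNS availability {availability:.0%} "
            f"is below {rule.min_availability:.0%}"
        )
    # avg_ms is None when no lookup succeeded, so there is no time to compare.
    if avg_ms is not None and avg_ms > rule.max_avg_ms:
        alerts.append(
            f"{region}: average DNS resolution time {avg_ms:.0f} ms "
            f"is above {rule.max_avg_ms:.0f} ms"
        )
    return alerts
```

Checking availability and resolution time separately means a total outage and a slow but working resolver produce different alerts, which helps when deciding how urgently to respond.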

To catch this kind of problem early, start using cloud dial testing.

Part of the content is quoted from:
1. “Europe: Facebook suffered the worst outage in history, burning 6 billion in 7 hours” www.163.com/dy/article/…
2. “Why Facebook went wrong” baijiahao.baidu.com/s?id=171292…