At 5 a.m. on a Saturday in 2015, a user reported in the company's official QQ group that the official website would not open, although other users said it opened fine. The customer service rep got up and tried it on their own computer, saw no problem, and told the user it was probably an issue with their network and to try again later. By 8 a.m., more and more users were reporting that the official website would not open, and some reported that the app would not open either. Customer service called me while I was still asleep.

Analyzing and locating the problem

Woken by the call, I was completely confused and had no idea what was going on. I told customer service that I understood, would check right away, and would keep them posted. After washing my face with cold water I was properly awake, and I went over the releases of the previous two days from memory: module XX had gone live, which should not have had this effect; bug XX had been fixed, which should not have had this effect either. HTTPS had just been configured on the server, which seemed possibly related, but HTTPS had not yet been rolled out for the app. I opened my computer and checked the recent release records; nothing in them should have caused a problem this serious. Suspecting a network problem, I immediately called the operations manager and the relevant people to investigate.

While the network and operations people worked on ruling things out, we rechecked the web servers, database servers, business logs, database logs, and other monitoring data. All of it was normal. Pinging the domain name from my own machine failed, which made a network problem look even more likely. Accessing the service directly by its external IP address worked without any problem, which confirmed that the underlying services were fine. But the operations department reported that all the network equipment was normal and insisted that something must be wrong with our production code. All sides could only grit their teeth and keep investigating.

At 9 a.m., users in the group began reporting on a large scale that neither the official website nor the app would open, and some stirred things up by claiming that company XXX had run off with the money. (In 2015 many P2P companies did abscond, which left users as jumpy as birds startled by the twang of a bowstring: at the slightest problem they feared the company had run away. They had all trained themselves into monitoring experts, checking every day and refreshing in real time, even glancing at the day's earnings in the app when they got up at night to use the toilet.) The customer service 400 hotline was all but swamped. While continuing to troubleshoot, I reported the problem to the director and the company's executives, and advised customer service to explain to users that the IDC data center network was experiencing jitter, that the technical team was working on it urgently, that funds and data were not affected in any way, and that they should please be patient.

At 10 a.m., after repeated checks by development and operations, I began to suspect a problem with DNS resolution, although exactly what was wrong was still unclear. The CTO decided that everyone should take a taxi to the office and solve the problem there together. In the car, I went back over the whole path a user request takes, as shown in the picture below:


After arriving at the office, we verified things together along these lines. All of the company's services could be reached through their external and internal IP addresses, but not through the domain name. The logs of the monitoring server, firewall, and network devices were all normal.
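The check described here (services answer on their IP address but not on the domain name) can be written down as a small script. This is only a sketch of the idea, not what we ran at the time; the function names are made up for illustration, and the host name, IP address, and port would be your own.

```python
import socket


def can_resolve(hostname, resolver=socket.gethostbyname):
    """Return True if the hostname resolves to an address."""
    try:
        resolver(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def can_connect(ip, port, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds."""
    try:
        conn = socket.create_connection((ip, port), timeout=timeout)
    except OSError:
        return False
    conn.close()
    return True


def classify(dns_ok, ip_ok):
    """Turn the two observations into a rough first guess."""
    if not ip_ok:
        # The service cannot be reached even by IP: look at the
        # service itself or the network path first.
        return "service-or-network"
    if not dns_ok:
        # Reachable by IP but the name does not resolve: suspect DNS.
        return "dns"
    return "ok"
```

Calling `classify(can_resolve(name), can_connect(ip, 80))` from machines on different networks gives a quick first split between "the service is down" and "the name does not resolve".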

A crucial question

Since it was indeed a DNS resolution problem, new questions followed: why was DNS resolution failing, and how could we fix it? We also tested access from the networks of different operators (China Telecom, China Mobile, and China Unicom) and found that DNS resolution failed only on the Unicom network. Customer service feedback confirmed this: very few reports came from Telecom and Mobile users, and most came from Unicom users. So we started calling China Unicom. At first China Unicom would not accept our request, so we began calling as ordinary subscribers, demanding that they fix our Internet access problem immediately.
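Comparing operators comes down to asking each operator's DNS server the same question and comparing the answers. Below is a minimal sketch using only the Python standard library; it builds a plain A-record query by hand and reads just the response code and answer count from the reply header. The function names are my own, the resolver addresses you pass in are up to you, and a real tool such as dig does far more.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: id, flags (0x0100 = recursion desired), 1 question,
    # 0 answer / authority / additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def summarize_response(packet):
    """Return (rcode, answer_count) read from a DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    return flags & 0x000F, ancount


def ask(server_ip, hostname, timeout=3.0):
    """Send one UDP query to server_ip; return (rcode, answer_count),
    or None if nothing usable comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server_ip, 53))
            packet, _addr = sock.recvfrom(512)
        except OSError:  # includes timeouts
            return None
    if len(packet) < 12:
        return None
    return summarize_response(packet)
```

A response code of 0 with at least one answer means that resolver can resolve the name; a code of 3 (NXDOMAIN), a code of 2 (SERVFAIL), or no reply at all from one operator's resolver while others answer normally points at that operator's view of DNS.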

Thus began a long tug-of-war with Wanwang and China Unicom. Wanwang said that DNS on their side was working and that all their indicators were normal. When we called Unicom, they said they were already aware of the issue and that a specialist would get back to us; some time later a Unicom network engineer replied that this kind of situation is generally a DNS problem. In the six hours after we arrived at the office at 10:30 a.m., we took turns making some 50 or 60 calls to China Unicom, and filed N tickets and exchanged N phone calls with Wanwang.

During this period, the leadership also started pulling strings, asking friends at China Unicom and network operations experts to help locate and solve the problem. We tried many things ourselves, such as clearing the local DNS cache with the ipconfig /flushdns command, updating the DNS records on Wanwang's website, and deleting and re-adding the DNS records. None of this fully solved the problem. We also kept looking for a way to test access from different regions and operators, and eventually found two websites, 17CE and 360 Cloud test, that proved very practical. They have since become essential tools for me whenever I need to track down a network problem: they make it easy to see whether access from each operator and region is blocked and how fast it is. The screenshots are as follows:


We also found that the company's other domain names were all accessible; only the official website's domain name and its subdomains could not be reached. Along the way, many people asked us the same question: had we forgotten to pay for the domain name? We had asked operations about this early on and were told that it was not the problem. It was only around 12:30 p.m., after we pressed him repeatedly, that the operations engineer said that when he logged in to Wanwang a little after 8 a.m., the domain name had shown as being in arrears, and that he had paid the fee immediately. We were nearly driven to despair. Had there been no expiration reminders? It turned out that after the previous operations manager left, the phone number and email address registered with Wanwang were never updated, so the reminder emails and text messages never reached anyone.

Through our communication with Wanwang, China Unicom, and the leadership's contacts, together with our own tests and observations, we pieced together a preliminary explanation. The unpaid domain fee caused Wanwang to stop DNS resolution for the domain. Because users' machines and DNS servers cache records, some users could still reach the site while others could not. After the fee was paid, Wanwang's DNS was updated and pushed out, but DNS resolution involves many levels that have to be updated one after another. Some levels had not been updated yet, so users of the DNS providers that were still out of date could not reach the official website.
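The effect of caching can be shown with a toy model. It is a simplified illustration and not a reconstruction of what Wanwang or China Unicom actually run: two caching resolvers sit in front of one authoritative source, positive answers are cached for their TTL, and failures are cached for a short "negative TTL". All names, addresses, and timings are invented.

```python
class CachingResolver:
    """A toy caching resolver. upstream(name) returns (ip, ttl) or None."""

    def __init__(self, upstream, negative_ttl=300):
        self.upstream = upstream
        self.negative_ttl = negative_ttl
        self.cache = {}  # name -> (ip or None, expires_at)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: answer from cache
        answer = self.upstream(name)
        if answer is None:
            # Remember the failure for a while (negative caching).
            self.cache[name] = (None, now + self.negative_ttl)
            return None
        ip, ttl = answer
        self.cache[name] = (ip, now + ttl)
        return ip


# Hypothetical zone data held by the authoritative side.
zone = {"www.example.com": ("203.0.113.10", 600)}
resolver_a = CachingResolver(zone.get)
resolver_b = CachingResolver(zone.get)

resolver_a.resolve("www.example.com", now=0)   # A caches the record
record = zone.pop("www.example.com")           # resolution is suspended

# Users behind A still get in, users behind B do not.
print(resolver_a.resolve("www.example.com", now=100))
print(resolver_b.resolve("www.example.com", now=100))

zone["www.example.com"] = record               # fee paid, record restored

# B keeps failing until its cached failure expires.
print(resolver_b.resolve("www.example.com", now=250))
print(resolver_b.resolve("www.example.com", now=400))
```

The first half of the run mirrors "some users can access, some cannot"; the second half mirrors the slow recovery after the fee was paid.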

We asked Wanwang how long it would take, at the latest, for all DNS servers to pick up the update. The answer was that it would definitely be fine within 48 hours, but we could not afford to wait that long. As time went on, more and more users noticed the problem, the QQ and WeChat groups were boiling over, and the chairman of the board had started paying attention. Some customers said outright in the group, "Your technology is really not up to it" (and that was the polite version; some simply cursed at us)…

Temporary solution

Repeated 17CE tests showed that the networks in most regions had recovered, except for Beijing Unicom and a few other regions, which indicated that the DNS records in those regions had not yet been updated. Now that we had identified the problem and understood the cause, we figured it was worth trying a different DNS server, so we changed the local DNS address to Google's DNS resolution service. We then quickly wrote the first version of a workaround manual for anxious customers to use.

Users of the official website could fix their access by changing DNS, but what about the app? We could not just wait, so we had the developers change the address the client calls from the domain name to the external IP address and build a temporary version for users. Android was relatively easy: users could simply download and install the new build. But iOS review took at least a week at the time, and by then it would have been far too late. As it turned out, DNS can be configured separately on the iPhone. After setting it and testing, we found that changing DNS worked there as well, so we immediately added this to the manual and sent it to customer service and to the group for users.
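Our temporary build simply replaced the domain name with the external IP address. A gentler variant, which we did not build at the time, is to keep using the domain name and fall back to a known IP address only when resolution fails. The sketch below shows that idea; `pick_endpoint` and the fallback address are hypothetical.

```python
import socket


def pick_endpoint(hostname, fallback_ip, resolver=socket.gethostbyname):
    """Return (address, used_fallback).

    Prefer the address obtained through DNS; if the name cannot be
    resolved, use the hard-coded fallback address instead.
    """
    try:
        return resolver(hostname), False
    except OSError:  # socket.gaierror is a subclass of OSError
        return fallback_ip, True
```

Keep in mind that connecting by IP address bypasses the host name, so an HTTPS certificate issued for the domain will not validate against the bare IP, and a server that routes by host name needs the Host header set explicitly.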

Click to download the DNS update manual written at that time

Some people suggested simply switching production over to the external IP address. Opening the home page by external IP address is no problem, but the calls between systems and the related configuration files also use the domain name, and a hard switch might cause other problems. By the time we finished on the first day it was past 10 p.m. We had eaten one meal, at 4 p.m., and made N phone calls. We were all exhausted.

When we got to the office the next day, 17CE tests showed that every node was now reachable except two Beijing Unicom nodes, which still did not respond. But Beijing is the capital, and the vast majority of our users were in Beijing. On the one hand, we kept talking to Wanwang and Unicom to see how the problem could be resolved for good; on the other hand, we prepared for the worst in case it never recovered. In the production environment, we went through every configuration file that referenced the domain name and made sure each one could be switched to the external IP address at any time without affecting service. We also built a complete version of the app and made sure it could be released at any time, so that users could upgrade to a version that connected directly to the external IP address.
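Combing configuration files for the domain name is easy to get wrong by hand. A small helper like the one below can list every line that mentions the domain or one of its subdomains. It is a sketch under simple assumptions (plain-text configuration, a hypothetical `find_domain_refs` helper); it deliberately skips look-alike names such as notexample.com or example.com.cn.

```python
import re


def find_domain_refs(text, domain):
    """Return (line_number, matched_name) for each reference to the
    domain or any of its subdomains in the given text."""
    pattern = re.compile(
        r"(?<![\w.-])"          # not in the middle of a longer name
        r"(?:[\w-]+\.)*"        # optional subdomain labels
        + re.escape(domain)
        + r"(?!\.?[\w-])",      # not followed by more name characters
        re.IGNORECASE,
    )
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = (
    "api=https://api.example.com/v1\n"
    "other=notexample.com\n"
    "site=example.com:8080\n"
    "cn=example.com.cn\n"
)
for lineno, name in find_domain_refs(sample, "example.com"):
    print(lineno, name)
```

Running this over each configuration file before an emergency gives a ready list of the places that would have to change.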

By 10 p.m. on the second day, the two Beijing Unicom nodes were still unreachable. We agreed with the leadership that if they were still unreachable at 8 a.m. on Monday, we would release the modified system and app and force users to upgrade (no bids were issued on weekends at that time; the bidding plan was scheduled for the working week). On the third day, the first thing I did when I got up was to pick up my phone and check whether I could reach the official website over my Unicom network. I could, and everyone was happy.

As the saying goes, the more the truth is examined, the clearer it becomes. This incident also made me thoroughly understand the whole DNS resolution process.

DNS Resolution Process

The Domain Name System (DNS) is a hierarchical naming system for computers and network services. It is used on TCP/IP networks to translate host names and domain names into IP addresses. Put simply, DNS translates web addresses into IP addresses.

DNS process from user request to response

  • Step 1: The browser checks its own cache for a resolved IP address for the domain name. If one is found, resolution ends there. The browser's cache is limited in both size and retention time; how long a record may be kept is normally governed by its TTL, although browsers may apply their own limits.
  • Step 2: If the browser cache has no mapping for the domain name, the operating system checks whether the local hosts file has one. If it does, that IP address is used and resolution is complete.
  • Step 3: If the hosts file has no mapping for the domain name, the local DNS resolver cache is checked. If a mapping is found there, it is returned and resolution is complete.
  • Step 4: If neither the hosts file nor the local DNS resolver cache has a mapping, the host queries the preferred DNS server configured in its TCP/IP settings, referred to here as the local DNS server. If the queried domain name falls within a zone that this server hosts locally, the server returns the result to the client and resolution is complete. This answer is authoritative.
  • Step 5: If the domain name is not in a zone hosted by the local DNS server, but the server has the mapping in its cache, it returns the cached IP address. This answer is not authoritative.
  • Step 6: If neither its local zone files nor its cache can resolve the name, the local DNS server proceeds according to its own configuration (whether or not a forwarder is set). Without a forwarder, the local DNS server sends the request to one of the 13 root DNS servers. The root server determines which servers are responsible for the top-level domain (for example .com) and returns the IP address of a server for that top-level domain. The local DNS server then contacts that server. If that server cannot resolve the name itself, it returns the address of the next-level DNS server that manages the domain. The local DNS server then queries that server, and repeats the process until it finds the host record for the domain name.
  • Step 7: If a forwarder is used, the DNS server forwards the request to the upstream DNS server for resolution. If the upstream server cannot resolve the request, it either queries the root servers or forwards the request further upstream, and so on. Whether the local DNS server uses forwarding or root hints, the result is eventually returned to the local DNS server, which in turn returns it to the client.
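The steps above can be sketched as a toy simulation. It is a deliberately simplified model, not real resolver code: caches and the hosts file are plain dictionaries, the "servers" are made-up tables that either answer or refer to the next server, and forwarders, TTLs, and authoritative flags are left out. All server names and addresses are invented.

```python
def resolve_on_host(name, browser_cache, hosts_file, os_cache, local_dns):
    """Steps 1-3, then hand over to the local DNS server (step 4 onward).
    Returns (ip, where_the_answer_came_from)."""
    lookups = (
        ("browser cache", browser_cache),
        ("hosts file", hosts_file),
        ("OS resolver cache", os_cache),
    )
    for source, table in lookups:
        if name in table:
            return table[name], source
    return local_dns(name), "local DNS server"


def iterative_lookup(name, servers, start="root", max_hops=10):
    """Step 6 without a forwarder: follow referrals from the root down.
    Returns (ip or None, list of servers asked)."""
    server = start
    path = [server]
    labels = name.split(".")
    for _ in range(max_hops):
        table = servers[server]
        entry = None
        # Use the longest suffix of the name that this server knows.
        for i in range(len(labels)):
            suffix = ".".join(labels[i:])
            if suffix in table:
                entry = table[suffix]
                break
        if entry is None:
            return None, path
        kind, value = entry
        if kind == "answer":
            return value, path
        server = value          # referral: ask the next server down
        path.append(server)
    return None, path


# Hypothetical hierarchy: root -> .com servers -> the domain's servers.
SERVERS = {
    "root": {"com": ("referral", "tld-com")},
    "tld-com": {"example.com": ("referral", "ns-example")},
    "ns-example": {"www.example.com": ("answer", "203.0.113.10")},
}

ip, path = iterative_lookup("www.example.com", SERVERS)
print(ip, path)
```

If the entry for the domain disappears from the last table, `iterative_lookup` returns `None` for everyone without a cached copy, which is roughly the position our users were in.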

This incident taught us a hard lesson. First, there were gaps in process management: the handover when staff left was not done properly. Second, our crisis handling was immature, which hurt the company's reputation. Third, our monitoring was incomplete: for things such as external network access, monitoring should have been set up in advance.
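One piece of monitoring that would have caught this particular failure is a daily check of how many days remain before each domain name expires, independent of the registrar's reminder emails. The sketch below covers only the date arithmetic; where the expiry date comes from (for example the registrar's console) and how the alert is delivered are left open, and the 30-day threshold is an arbitrary assumption.

```python
from datetime import date


def days_until_expiry(expires_on, today):
    """Number of days left before the domain expires (negative if past)."""
    return (expires_on - today).days


def expiry_alert(expires_on, today, warn_within_days=30):
    """Return an alert message, or None if no alert is needed yet."""
    remaining = days_until_expiry(expires_on, today)
    if remaining < 0:
        return f"EXPIRED {-remaining} day(s) ago"
    if remaining <= warn_within_days:
        return f"expires in {remaining} day(s)"
    return None
```

Sending the result to a shared team channel rather than to one person's phone or mailbox also covers the handover gap described above.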

Sometimes a very serious problem comes from a small thing that you routinely overlook.

Pure smile