Background

Recently I ran into a troubleshooting problem on a legacy system that turned out to be fairly interesting, so I'm writing it up here for everyone to read and comment on.

Basic situation

Recently, a customer whose system had not been maintained for a long time called to report that the system was misbehaving. After a preliminary inquiry, it turned out the system had become inaccessible following an abnormal power failure in the machine room. I figured a service had probably just failed to come back up and it wouldn't be a big deal. After reassuring the customer, I dug out the basic information on the project. Its last maintenance could be traced back almost two years. We are responsible for the back end, while the front end was developed and operated by an outsourcing company, which by now has changed teams twice; the most recent group can no longer be reached. What's more, the customer's system is deployed on an internal network and cannot be reached from outside. That's roughly the situation. It didn't feel good; it smelled like a pit. But the customer had reported the problem, so let's have a look. And so the story begins.

The troubleshooting story

Get the basics

When I contacted the customer's contact person, I learned that the system's web interface could not be accessed, but the server could be pinged and reached over Telnet. OK, the basic tools were usable; the contact seemed fairly reliable.

First encounter

Because the system is deployed on the customer's internal network, I couldn't look at it myself. Fortunately, the front end is deployed independently on its own server and runs via docker run; since the deployment script uses --restart=always (sketched just after the list below), it should come back up automatically. In the spirit of "worth a try", I asked the customer to reboot the machine. After a long three-minute wait, the customer reported that it had started. I quickly had them open a browser and check, and sure enough the page could be accessed. I was quietly pleased. But then came the bad news:

  • The page loads extremely slowly: after more than 10 seconds it is still spinning.
  • It is impossible to log in to the system.
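
For reference, the restart policy mentioned above corresponds to a docker run roughly like the following. It is a minimal sketch; the container name, image, and port are made up:

    # hypothetical front-end container; --restart=always tells Docker to
    # bring it back up automatically after a crash or a host reboot
    docker run -d \
      --name frontend \
      --restart=always \
      -p 80:80 \
      frontend-image:latest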

WTF!!!! What was going on? A slow response is one thing, but not being able to log in usually means the back-end service is broken. I hurriedly got on a video call with the customer and, one action at a time, checked whether the back-end service was still alive. WTF!!!! Everything was normal. I had the customer open the browser console and look at the network tab: the login request was indeed getting no response, but on closer inspection it was not being sent to the back end at all; it was going to the front-end server itself. Huh? On second thought, the front end probably uses a Node or nginx proxy to get around cross-origin restrictions. So I had the customer send the login request directly to the back end with a tool, and as expected it went through normally. The problem was still on the front-end side. I relaxed without quite meaning to; after all, the front end is not ours, which makes things convenient when something goes wrong. - -! Then came more feedback from the customer: now the page would not open at all. WTF!!!! What was going on? By this point three hours had passed. Since it was dinner time for the customer, I suggested a break while I sorted out my thoughts. I also asked why we had not simply been invited on site this time: the customer's unit does not allow people from outside units in, otherwise we would have gone over and solved it in person (we are in the same city).
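
For the record, the "send the login request directly to the back end" check was essentially the following, sketched with curl; the host names, port, and path are made up:

    # through the front-end proxy (what the browser does) -- this one hung
    curl -v --max-time 10 -H 'Content-Type: application/json' \
      -d '{"username":"u","password":"p"}' \
      http://frontend-server/api/login

    # straight at the back-end service -- this one came back normally
    curl -v --max-time 10 -H 'Content-Type: application/json' \
      -d '{"username":"u","password":"p"}' \
      http://backend-server:8080/api/login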

Second encounter

I ate a boxed lunch while sorting out the information at hand:

  • The server can be pinged.
  • The back-end service is normal.
  • Requests sent to the back end through the front-end proxy time out.
  • The page will not open.
  • After a reboot the page can be accessed, but it loads very slowly.
  • After about 10 minutes the page becomes inaccessible again, with timeouts.

At this point the most likely culprit was the front-end server itself. Malware? Network packet loss? Neither felt right. After lunch I contacted the customer for round two. The first thing was to check the network: a simple ping with large packets showed no loss. The front-end server runs CentOS without a GUI, which was a challenge for the customer; for some reason their own ops platform could not log in to this machine, so they had to fall back on PuTTY. Half an hour later PuTTY was finally connected to the front-end server. A quick ps -x check produced a big surprise: there were N node processes, each using more than 100% CPU, adding up to roughly 800%, which matched the number of cores. So the problem was almost certainly the CPU being maxed out. Since every service on this machine runs under Docker, we used docker stats to check each container's resource usage, and sure enough every one of them was over 100%. That clarified the problem, but the bad news was that there was no straightforward fix.

I asked again whether there was any other way to reach their system, and the customer came up with an exciting piece of news they had just remembered: our unit is supposed to have a dedicated network card for accessing their network, and they gave me the contact for it. A savior at last. I got in touch right away, and half an hour later the dedicated network card was in my hands. Now I could finally dig in properly.
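
The checks on the front-end server boiled down to something like this (a sketch; the real output will differ):

    # processes sorted by CPU usage: a pile of node processes, each above 100%
    ps aux --sort=-%cpu | head -n 15

    # one-shot snapshot of per-container CPU and memory usage
    docker stats --no-stream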

Third encounter: the decisive battle

I took a sip of water to calm down, plugged in the network card, entered the login information, and connected to their network. I opened a terminal, connected to the business front-end server, and the final battle officially began. You have to see it with your own eyes to believe it: a single front-end server running nearly ten page applications, all on different ports. Presumably, as our business staff kept raising new requirements and pages were patched in on the fly, the outsourced developers simply dumped everything onto this one server. Time really is a pig-slaughtering knife, for people and for systems alike.

I contacted the business people and the customer to determine which applications were actually in use and stopped all the rest. OK, that left four services. The CPU was no longer saturated and the front page was accessible. With the server running smoothly again, I could keep narrowing down the problem. None of the four remaining services was receiving any traffic, yet CPU usage was still extremely high, which was clearly abnormal. I picked one service and looked at its log with docker logs XXX: it was frantically reconnecting to a WebSocket. I checked another service: same problem. At last there was hope; that could well be the cause. I checked the code to see what it was connecting to, and it turned out to be a middleware service. I checked that middleware: although the service was running, it could not be accessed, presumably left in a bad state by the power failure. Worse, it could not be shut down cleanly, so the only options were to restart the machine or kill it forcibly. OK, I tried restarting the middleware. WTF! Still no good. Its logs showed nothing out of the ordinary. Oh, shit. Now this was a real problem.

While I was getting annoyed, I happened to notice a message-middleware container among the Docker containers, but it was not running; inspecting it showed the container had indeed been run before. The likely story: the outsourced ops team originally deployed the middleware directly on the host, ran into compatibility problems, switched to a Docker deployment, and never removed the direct one. After the power failure, the directly deployed service started before Docker did, so the proper containerized middleware could not run, which in turn drove the Node services into their frantic reconnection loop.
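
The tell-tale checks looked roughly like this; the port number and container name are placeholders:

    # which process is actually listening on the middleware port?
    ss -lntp | grep 5672

    # list all containers, running or not: the message-middleware container
    # exists but is stopped, even though it is supposed to be the live one
    docker container ls -a | grep middleware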

So I quickly started the middleware in Docker, and everything finally went quiet; the pile of 100%-CPU services returned to normal in short order. Just when I thought the world was back in order, I removed the directly deployed middleware and invited the customer to verify the system, only to find that it still was not quite right: many things looked different from the original system, and some functions still did not work. WTF!!!! What now? This smelled like a classic version problem. I checked the code version: fine. I checked the Docker image versions: also fine. But looking at the container information, the image the running container was started from was not the latest one; it was a version nearly two years old, also started with --restart=always. I then used docker container ls -a to list all containers: a large number of old containers had never been destroyed, and they all used the same port. Now it was clear. After the power failure many containers were restarted automatically, but since they all fought over the same port, whichever container happened to come up first won, and that container was not built from the latest image. I stopped the wrong container, started the container built from the latest image, and deleted the expired containers and images.
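
The cleanup amounted to something like this; the container and image names are placeholders:

    # which image is the running container actually based on?
    docker inspect --format '{{.Config.Image}}' frontend

    # all containers, including exited ones competing for the same port
    docker container ls -a

    # stop and remove the stale container, then start one from the latest image
    docker stop frontend-old && docker rm frontend-old
    docker run -d --name frontend --restart=always -p 80:80 frontend-image:latest

    # finally, clear out images that are no longer referenced
    docker image prune -a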

We checked the system pages again, and finally everything was back to normal.

From the customer's first call to full service recovery took about six hours.

Follow-up

Although the problem was fixed, the root cause of the CPU saturation was never found. Could endlessly retrying a WebSocket connection really max out the CPU by itself? I tried to reproduce the production problem on my own machine: CPU usage did rise, but nowhere near saturation, typically 4-5% and at most around 20%. The incident has not recurred in production. Since the Docker and OS versions on my development machine differ from those in production, that cause cannot be ruled out. Later the dedicated network card was lent to someone else, so I lost access to the customer's network, and further investigation will have to wait until access is restored. My strong suspicion is that the Docker and operating-system versions are simply too old (they date from about four years ago) and that pm2-docker is used as the process guardian inside the containers, so there may be compatibility problems.
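
For completeness, comparing the two environments is just a matter of:

    docker version        # Docker client and daemon versions
    cat /etc/os-release   # distribution release (CentOS here)
    uname -r              # kernel version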

Review

  • The fault was triggered by the power failure combined with improper production-environment settings.
  • Technical cause: the configuration was wrong; the scenario of a power failure and a restart with no ops staff present was never fully considered.
  • Management cause: the handover was not thorough and too much depended on individual people, so for various reasons the production environment drifted out of control.

Conclusion

  • A failure is often caused by more than one problem; take care to eliminate every issue and verify the recovery.
  • Docker is a good thing, but its operation and maintenance need standardized management.
  • Don't rely on one person or a few people; keep documentation.
  • Blindly applying technologies you don't fully understand can hurt badly.