Recently, a relay module was released to production as an emergency change, and a total of 600 machines were updated within half a day. From 6pm our phones kept vibrating with alerts: signaling in the relay module was abnormal, the failure rate for users entering rooms shot up, and the monitoring curve showed a sharp dip. We immediately rolled back the relay module to restore service and then set out to locate the cause of the problem.
The background service that handles user requests is built on a specific framework. During the rollback, some processes core dumped and could not be restarted, so the rollback failed. Analysis of the core file showed that the crash occurred where shared memory is initialized, yet the release had not changed the code at that location. So this was not the root cause, only a symptom.
The priority was to get the business back to normal as quickly as possible. Since shared memory initialization was failing, releasing the shared memory should in theory let the processes start and serve normally. We located the shared memory keys and ran the following commands.
ipcrm -M 0x499602d2;
ipcrm -M 0x499602d3;
ipcrm -M 0x499602d4;
ipcrm -M 0x499602d5;
We rolled the program back to the previous stable version through the release system. Once the shared memory was removed, the process started immediately and service returned to normal. Once we had verified that this method caused no exceptions, we could begin planning the full version rollback. To reduce the impact on users, we could only perform the rollback at 11pm: stop the process -> clean up the shared memory -> restart the process. By 12:30 in the middle of the night, all 600 machines on the live network were running again, and the phones finally stopped vibrating.
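The stop -> clean up -> restart order can be sketched as a small shell helper. This is only a sketch, assuming a Linux host where `ipcs -m` prints the util-linux columns (key, shmid, owner, perms, bytes, nattch). The function names `nattch_for_key` and `remove_if_detached` are hypothetical and were not part of the original procedure. The helper refuses to remove a segment while processes are still attached, because on Linux removing an attached segment only marks it for destruction after the last process detaches.

```shell
# Print the attach count (nattch, column 6) for one key,
# reading `ipcs -m` style output on stdin.
nattch_for_key() {
    awk -v key="$1" '$1 == key { print $6 }'
}

# Remove a shared memory segment by key, but only if it exists
# and no process is attached to it any more.
remove_if_detached() {
    key=$1
    n=$(ipcs -m | nattch_for_key "$key")
    if [ -z "$n" ]; then
        echo "$key: no such segment"
    elif [ "$n" -eq 0 ]; then
        ipcrm -M "$key"
    else
        echo "$key: still attached by $n process(es), stop them first"
        return 1
    fi
}

# Example (after the relay processes have been stopped):
#   for k in 0x499602d2 0x499602d3 0x499602d4 0x499602d5; do
#       remove_if_detached "$k"
#   done
```

Checking the attach count first makes the ordering explicit: if the processes are still running, the cleanup stops instead of leaving a half-removed segment behind.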
The stack from the shared memory initialization crash clearly points to a memory problem, but it is most likely not the first crash site. Further analysis of the crash stacks turned up another core file with an out-of-bounds access. This site is most likely the root cause of the crash.
In the innermost stack frame, the problem occurs when memcpy triggers the system's memory protection, causing the process to be terminated by the system. The order of calls from the bottom to the top of the stack is normal and the stack itself shows no corruption, so it is highly likely that the cache function in this stack caused the problem.
Using the GDB commands frame and print to output the runtime values of each variable in the different stack frames, we found a buff flagged as out of bounds. In other words, memcpy accessed an illegal memory address, the operating system sent a SIGSEGV signal, and a segmentation fault occurred. The next step was to analyze where this fatal buff comes from and what could push it out of bounds.
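This kind of frame-by-frame inspection can also be run non-interactively. The following is a minimal sketch, assuming GDB is installed and the binary was built with debug symbols. The function name `inspect_frame` and the file names in the usage line are hypothetical. `-batch` and `-ex` are standard GDB command-line options.

```shell
# Print the backtrace of a core file, then select one frame
# and dump its arguments and local variables.
inspect_frame() {
    binary=$1 core=$2 frame=$3
    gdb -batch \
        -ex "bt" \
        -ex "frame $frame" \
        -ex "info args" \
        -ex "info locals" \
        "$binary" "$core"
}

# Example (hypothetical file names):
#   inspect_frame ./relay core.12345 1
```

Running it once per frame number gives the same information as stepping through the frames by hand, and the output can be saved alongside the core file.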
This buff caches incoming packets so that the receiver can request retransmission. It is a fixed-size memory pool: one large region is allocated right after the process starts, and the memory units are initialized according to the configuration.
The relay module has four processes running in the background. Each process must support 2000 users, and each user has two streams. So the pools need 1.6K * 800 * 2 * 2000 * 4, about 20G of memory in total, or 5G per process on average, while the online service machine is a gigabit machine with 8 cores and 16G of memory.
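The arithmetic above can be checked with a few lines of shell. This is a sketch under stated assumptions: "1.6K" is taken to mean 1600 bytes per memory unit, "800" is taken to be the number of units per stream, and "16G" is taken to be 16 GiB. The variable names are illustrative and do not come from the real configuration.

```shell
# Values from the text; the interpretation of each factor is an assumption.
unit_bytes=1600          # "1.6K" per memory unit, assumed to be 1600 bytes
units_per_stream=800     # assumed meaning of the factor 800
streams_per_user=2
users_per_process=2000
processes=4
ram_bytes=$((16 * 1024 * 1024 * 1024))   # 16G of physical memory, assumed GiB

per_process=$((unit_bytes * units_per_stream * streams_per_user * users_per_process))
total=$((per_process * processes))

echo "per process: $((per_process / 1000000)) MB"
echo "total:       $((total / 1000000)) MB"
echo "physical:    $((ram_bytes / 1000000)) MB"

if [ "$total" -gt "$ram_bytes" ]; then
    echo "configured pools exceed physical memory"
fi
```

Under these assumptions each process needs about 5.1 GB and the four processes together about 20.5 GB, which is more than the 16G the machine has.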
After this calculation, I felt a burst of joy. The cause of last night's problem was that the configured memory exceeded the size of the physical memory. Because of the operating system's virtual memory mechanism, the allocation did not fail at process startup; the process crashed only later, when it actually needed to write to that memory. As for why the shared memory had to be cleared, it is highly likely that abnormal data was written into the shared memory region. As a result, abnormal data was read back when the segment was opened by its key, triggering system protection.
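Since the allocation itself does not fail at startup, one way to catch this kind of misconfiguration earlier is to compare the configured pool size with physical memory before the process is started. The sketch below assumes a Linux host that exposes `MemTotal` in `/proc/meminfo`. The function names `mem_total_bytes` and `pool_fits` are hypothetical and are not part of the original service.

```shell
# Print physical memory in bytes, read from the MemTotal line (reported in kB).
# An optional argument selects a different meminfo-style file.
mem_total_bytes() {
    kb=$(awk '$1 == "MemTotal:" { print $2 }' "${1:-/proc/meminfo}")
    echo $((kb * 1024))
}

# Succeed only if the requested pool size in bytes is below physical memory.
pool_fits() {
    want=$1
    have=$(mem_total_bytes "$2")
    [ "$want" -lt "$have" ]
}

# Example: refuse to start when all pools together need 20480000000 bytes.
#   pool_fits 20480000000 || echo "pool configuration exceeds physical memory"
```

This is deliberately conservative: it ignores memory used by other processes, so a configuration that passes the check can still be too large in practice.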
This online incident taught me that when something goes wrong in production, do not panic. Quickly identify the change that caused the business impact, and give priority to restoring the business through operational means such as rollback, traffic scheduling and removing faulty machines, so that users are not affected. The problem itself can be located afterwards from the preserved crash scene. During development, we should evaluate the impact of every configuration change to avoid going online blindly.
Next, we will use this online incident to analyze the relationship between virtual memory and shared memory. Follow coding craftsmen for the latest articles.