CPU or memory spikes in a service are a common problem in deployed environments. Logs are usually the first tool for diagnosis, but they are not always enough to find the cause.
To locate the problem as accurately as possible, you also need to know how to inspect the service's runtime stack information through dump analysis. This article describes how to do dump analysis for .NET Core 2.2 and .NET Core 3.1 projects respectively (container deployments on Linux only).
Create dump file
Before creating a dump file, it is a good idea to find out which threads in the service are behaving abnormally and then analyze only those threads; scanning every thread would be time-consuming.
After entering the container, install htop:
apt-get update
apt-get install htop
View resource usage with htop:
The situation shown was simulated with a test program; the thread with PID 12 is the one that needs attention.
Run the following command to create the dump file (2.2.8 is used as an example; run createdump --help to see more parameters; the dotnet process in a container has PID 1 by default):
/usr/share/dotnet/shared/Microsoft.NETCore.App/2.2.8/createdump 1
After the command is executed, the dump file /tmp/coredump.1 is generated. Copy coredump.1 to a host directory with docker cp or kubectl cp, and then download it to the machine where the dump analysis will be done.
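As a rough sketch of the two steps above, the snippet below assembles the createdump invocation for a given runtime version and the docker cp command that copies the result to the host. The container name my-service and the host path /root/coredump.1 are assumptions for illustration. The commands are only printed, so they can be reviewed before being run.

```shell
#!/bin/sh
# Runtime version of the service inside the container (2.2.8 in this article).
runtime_version="2.2.8"
# Target process; the dotnet process is usually PID 1 in the container.
pid=1

# createdump lives in the shared runtime directory of the matching version.
createdump="/usr/share/dotnet/shared/Microsoft.NETCore.App/${runtime_version}/createdump"
dump_cmd="${createdump} ${pid}"

# Hypothetical container name; replace it with your own.
container="my-service"
copy_cmd="docker cp ${container}:/tmp/coredump.${pid} /root/coredump.${pid}"

echo "In the container: ${dump_cmd}"
echo "On the host:      ${copy_cmd}"
```

For Kubernetes, kubectl cp <namespace>/<pod>:/tmp/coredump.1 /root/coredump.1 plays the same role as docker cp, with the namespace and pod name filled in for your cluster.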
**Note:** In Docker deployments, createdump needs container privileges to run, so the --privileged=true parameter must be added when the container is started. In addition, generating the dump file requires a large amount of memory, so adjust the container memory limit accordingly.
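A container start command along these lines is one possibility. The image name my-service:latest and the 4g memory limit are assumptions, not values from this article. The command is printed rather than executed so it can be adapted first.

```shell
#!/bin/sh
# Hypothetical image name; replace it with the image of your service.
image="my-service:latest"
# Assumed limit; pick a value that leaves headroom for writing the dump.
memory_limit="4g"

# --privileged=true lets createdump attach to the process;
# --memory sets the container memory limit.
run_cmd="docker run -d --privileged=true --memory=${memory_limit} ${image}"

echo "$run_cmd"
```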
.NET Core 2.2
Currently, LLDB is the most common tool for this analysis, but building the environment from scratch is not recommended. Prebuilt images are available online and can be used directly, such as 6opuc/lldb-netcore, which by default is built on .NET Core SDK 2.2.8. If the service to be dumped does not run on 2.2.8, you need to modify the lldb-netcore source and rebuild the image.
Run the following command to enter LLDB:
docker run --rm -it -v /root/coredump.1:/tmp/coredump 6opuc/lldb-netcore
View the threads that were live when the dump was taken:
clrthreads -live
Select the thread to analyze. Thread ID 12 from htop is c in hexadecimal, so find the row whose OSID is c; the value in its first column, 7 here, is the thread number to select:
thread select 7
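The decimal-to-hexadecimal conversion used to match the htop thread ID against the OSID column can be done in the shell. This is a minimal sketch; the value 12 is the thread ID from the example in this article.

```shell
#!/bin/sh
# Thread ID as shown by htop (decimal).
tid=12

# clrthreads prints the OS thread ID (OSID) in hexadecimal.
osid=$(printf '%x' "$tid")

echo "OSID to look for: ${osid}"
```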
View the managed-code stack of the current thread:
clrstack
Run the soshelp command to see more commands.
.NET Core 3.1
Since .NET Core 3.0, the dotnet-dump tool has been available for dump analysis and is relatively easy to use. LLDB can of course still be used.
Install dotnet-dump:
dotnet tool install --global dotnet-dump --version 3.1.141901
Start the analysis:
dotnet-dump analyze /root/coredump.1
If an error occurs at this step, it usually indicates that the .NET Core SDK is not installed in /usr/share/dotnet. You can either point DOTNET_ROOT at the actual SDK location or reinstall the SDK.
View the managed threads:
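Setting DOTNET_ROOT might look like the sketch below. The SDK location $HOME/dotnet is an assumption for illustration; use the directory where the SDK is actually installed. The PATH line covers the case where the dotnet-dump command itself is not found, since global tools are installed under ~/.dotnet/tools by default.

```shell
#!/bin/sh
# Assumed SDK location (hypothetical); replace with the real install directory.
export DOTNET_ROOT="$HOME/dotnet"

# Global tools such as dotnet-dump are installed here by default.
export PATH="$PATH:$HOME/.dotnet/tools"

echo "DOTNET_ROOT=${DOTNET_ROOT}"
```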
clrthreads
If an error occurs at this step, the .NET Core SDK version installed on the analysis machine probably differs from the version that createdump came from in the container (for example, createdump from 3.1.3 and analysis with 3.1.12).
Specify the DBG number of the thread to analyze:
setthread 7
View the managed-code stack of the current thread:
clrstack
For more dotnet-dump commands, see docs.microsoft.com/zh-cn/dotne…
Case description
The following is a real case from a production environment: after running for a while, a service's CPU usage reached 100% and did not come back down.
After the abnormal thread was identified and its stack was dumped and analyzed several times, the problem turned out to be related to the following code:
The code uses NCalc, an open source component for evaluating expressions. The first guess was that an invalid expression was causing the parser to loop. However, inspecting the method parameters with dumpobj showed that they were all normal expressions, so this guess did not hold.
A search for similar issues in the project on GitHub showed that earlier versions did have a deadlock problem: github.com/sklose/NCal… This problem has been fixed in newer versions, and the NuGet package used by the affected service was indeed an old one, so the cause could essentially be pinned down. After the NuGet package was upgraded, the problem disappeared.
Conclusion
When facing a tough problem, you often have no clue where to start, and many problems are not easy to locate. When building services that support business capabilities, pay attention to the robustness of the code itself, and when using external components, pay attention to the health of their ecosystem. Dump analysis is only one of the tools that can help solve a problem.