A: background
1. Tell a story
Last year, all ISV SMS channels on Ali Justone Tower were connected to Ali Communication, so we made docking transformation and used.NET SDK provided by Ali.
Address: help.aliyun.com/document_de…
My colleague was using ONs-.NET v1.1.3 at that time. A few days after the launch of the program, there would be a program crash. At that time, I didn’t pay much attention to it, thinking that there was something wrong with my code or the environment. Until one day the server alarm, the program CPU actually went up to 100%, the server is 16 core 128GB oh…
Two: analyze the problem
1. Capture the dump file
The situation is more urgent, immediately to the program to send Ctrl+C command to let the program to exit, the results and exit, weird… In order to analyze the problem caught a dump down, and then forced to kill the program.
2. Look at the thread pool and what the individual threads are doing?
0:00 0 >! tp CPU utilization: 100% Worker Thread: Total: 0 Running: 0 Idle: 0 MaxLimit: 32767 MinLimit: 16 Work Request in Queue: 0 -------------------------------------- Number of Timers: 1 -------------------------------------- Completion Port Thread:Total: 1 Free: 1 MaxFree: 32 CurrentLimit: 1 MaxLimit: 1000 MinLimit: 16Copy the code
CPU utilization: 100%, CPU utilization: 100%, CPU utilization: 100%, CPU utilization: 100%, CPU utilization: 100%, CPU utilization: 100%, CPU utilization: 100% Clrstack prints out all managed thread stacks.
0:000> ~*e ! clrstack OS Thread Id: 0x1818 (0) Unable to walk the managed stack. The current thread is likely not a managed thread. You can run ! threads to get a list of managed threads in the process Failed to start stack walk: 80070057Copy the code
The output shows that there are no managed threads and the only thread 0 is not yet managed. The dumpStack finds the stack of unmanaged threads.
0:000> ~*e ! dumpstack OS Thread Id: 0x1818 (0) Current frame: ntdll! ZwRemoveIoCompletion+0x14 Child-SP RetAddr Caller, Callee 000000637323ef40 00007ff8327bac2f KERNELBASE! GetQueuedCompletionStatus+0x3f, calling ntdll! ZwRemoveIoCompletion 000000637323efa0 00007ff81b9c8a00 ONSClient4CPP! metaq_juce::URL::launchInDefaultBrowser+0x273d0, calling kernel32! GetQueuedCompletionStatus 000000637323f090 00007ff81ba3eb0a ONSClient4CPP! ons::Message::getMsgBody+0x5a8a, calling ONSClient4CPP! metaq_juce::URL::launchInDefaultBrowser+0x1f100 000000637323f140 00007ff81ba3f084 ONSClient4CPP! ons::Message::getMsgBody+0x6004, calling ONSClient4CPP! ons::Message::getMsgBody+0x5800 000000637323f280 00007ff81ba233b4 ONSClient4CPP! ons::ONSFactoryProperty::setSendMsgTimeout+0xa6b4, calling ONSClient4CPP! ons::ONSFactoryProperty::setSendMsgTimeout+0xa5d0 000000637323f2b0 00007ff81ba11b43 ONSClient4CPP! ons::ONSFactoryAPI::~ONSFactoryAPI+0x153 000000637323f310 00007ff81ba12d64 ONSClient4CPP! ons::SendResultONS::operator=+0xc44, calling ONSClient4CPP! ons::ONSFactoryAPI::~ONSFactoryAPI+0x10 000000637323f460 00007ff81ba83eb4 ONSClient4CPP! ons::Message::getStoreTimestamp+0xf484, calling ONSClient4CPP! ons::Message::getStoreTimestamp+0xf1c4 000000637323f630 00007ff8356f7d94 ntdll! RtlExitUserProcess+0xb4, calling ntdll! LdrShutdownProcess 000000637323f690 00007ff832777c23 KERNELBASE! CtrlRoutine+0xa3 000000637323f780 00007ff834df8364 kernel32! BaseThreadInitThunk+0x14, calling kernel32! WriteConsoleOutputW+0x530Copy the code
From the unmanaged call stack, where KERNELBASE! CtrlRoutine indicates that the main thread has received the Ctrl+C command, and from the top of the stack, it seems that it cannot exit because the main thread is taken over by ONSClient4CPP, and the C++ is making a remote connection and waiting for the network IO to return, but this should not be possible because the 16-core CPU will be full. The problem seems to be stuck here.
Three: restart the program found that the problem is still
1. Capture the dump file
After the program restarts, the CPU is again skyrocketing in 2 minutes. This time, we have learned to catch the dump file when the CPU reaches 60, 70%.
2. Continue troubleshooting
0:000>.debug session time: Fri Apr 17 17:36:50.000 2020 (UTC + 8:00) System Uptime: 355 days 5:33:48.092 Process Uptime: 0 days 0:02:11.000 Kernel time: 0 days 0:03:31.000 User time: 0 days 0:13:22.000 0:000>! tp CPU utilization: 59% Worker Thread: Total: 3 Running: 0 Idle: 3 MaxLimit: 32767 MinLimit: 16 Work Request in Queue: 0 -------------------------------------- Number of Timers: 1 -------------------------------------- Completion Port Thread:Total: 2 Free: 2 MaxFree: 32 CurrentLimit: 2 MaxLimit: 1000 MinLimit: 16Copy the code
This is 59% CPU utilization. That’s a bit early, but let’s take a look at Threads first.
0:00 0 >! threads ThreadCount: 25 UnstartedThread: 0 BackgroundThread: 8 PendingThread: 0 DeadThread: 16 Hosted Runtime: no Lock ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception 0 1 cdc 0000022bb9f53220 2a020 Preemptive 0000022BBBFACCE8:0000022BBBFADFD0 0000022bb9f27dc0 1 MTA 2 2 3dc 0000022bb9f7f9f0 2b220 Preemptive 0000000000000000:0000000000000000 0000022bb9f27dc0 0 MTA (Finalizer) 3 4 296c 0000022bb9fe97b0 102a220 Preemptive 0000000000000000:0000000000000000 0000022bb9f27dc0 0 MTA (Threadpool Worker) XXXX 5 0 0000022bb9ffc5a0 1039820 Preemptive 0000000000000000:0000000000000000 0000022bb9f27dc0 0 Ukn (Threadpool Worker) XXXX 6 0 0000022bd43938c0 1039820 Preemptive 0000000000000000:0000000000000000 0000022bb9f27dc0 0 Ukn (Threadpool Worker) . 163 24 29e8 0000022bd4898650 1029220 Preemptive 0000022BBC102108:0000022BBC103FD0 0000022bb9f27dc0 0 MTA (Threadpool Worker) 164 25 2984 0000022bd489d470 1029220 Preemptive 0000022BBC0EA2D0:0000022BBC0EBFD0 0000022bb9f27dc0 0 MTA (Threadpool Worker)Copy the code
In just 2 minutes and 11 seconds, the managed thread ThreadCount: 25 died. 16 threads, and from the threads list, windbg gave the maximum number is 164, indicating that there are currently (164+1) -25 =142 unmanaged threads, should be ali ONSClient4CPP opened, why so many threads, this is a very worthy of attention. Again, use ~*e! The dumpStack prints out the managed and unmanaged thread stacks for all threads, and since there’s too much information, I’ll just take a few pictures.
Personal speculation, purely technical discussion:
Figure 1:
From the stack, there are 105 threads stuck in NTDLL! ZwRemoveIoCompletion+0x14 here, and from ONSClient4CPP! Metaq_juce: : URL: : see, launchInDefaultBrowser + 0 x23072 seems to open a browser kernel, ali to send data, with the kernel estimate concurrent threshold open here is quite big, consulting the colleague is in front of a big customer to send a lot of messages, The C# SDK is reading like crazy. It’s probably related to CPU inflation.
Figure 2:
A classic example of a high CPU is when many threads are waiting in a critical section, and when one of the threads in the critical section leaves, the scheduled race of those 28 threads is also a cause of high CPU.
Personal level is limited, further dig unmanaged heap has not this technology (┬ _ ┬)…
Four: solutions
What can be the solution to this SDK problem? All you can think of is to go to the official website to find the latest version:
You can see the optimization point mentioned in the latest version of ONs-.NET V1.1.4: optimize the message pull process to avoid the message accumulation caused by the pull exception in special cases.
Sure enough, use the latest version of the SDK, 🐮👃.