Preface:

CPU usage is an important metric in server development and performance testing. What factors drive CPU usage up, and what can be done to bring it down?

If you still don't have a solution for your project after reading this article, please leave a comment below or message our official account. (I will also publish a post or two on memory optimization and memory-leak detection.)

Enough preamble. Let's get to the practical part.

Check the CPU usage

1. On Windows, you can see the CPU usage of each process directly in the task manager.

2. Actually, what I’m going to focus on is CPU monitoring in Linux.

We typically use: top -H -p <process ID>

For example: top -H -p 5490

Here we can see the CPU usage of all threads under the current process (%CPU column).

If you look closely, you may notice several child threads here with extremely high CPU usage, over 85%; that is where we will focus our optimization.

How to troubleshoot high CPU usage

What are some factors causing CPU usage to increase?

(1) Complex calculation

Running heavy algorithms such as audio/video codecs, image processing, or scientific computing, especially floating-point operations. Deeply nested loops can also drive CPU usage up.

For example, this nested-loop code:

    #include <vector>
    using std::vector;

    int main() {
        vector<int> aa;
        for (int i = 0; i < 10000; i++) {
            for (int j = 0; j < 1000; j++) {
                for (int k = 0; k < 1000; k++) {
                    aa.push_back(i);
                }
            }
        }
        return 0;
    }

Next, we compile it and run the resulting executable ./t:


You might be surprised to find that CPU usage stays at 99.7%-100%: three nested for loops doing frequent push_backs into a vector.

(2) Continuously occupying the CPU

Some high-priority processes/threads occupy the CPU continuously with little or no sleep, similar to a while(1) {} busy loop.

This is actually quite common in multithreaded code.

Suppose you have a thread pool, and the worker threads run the following logic:

    void ThreadPool::work() {
        // Every worker thread competes to pick tasks off the queue and run them.
        while (is_running_) {
            ThreadTask* task = nullptr;
            {
                std::unique_lock<std::mutex> lk(mtx_);
                if (!tasks_.empty()) {
                    task = tasks_.front();
                    tasks_.pop();              // remove the task from the queue
                } else if (is_running_ && tasks_.empty()) {
                    cond_.wait(lk);
                }
            }
            if (task != nullptr) {
                int ret = task->Run();         // do the task
                if (ret == RET_PAUSE_TASK) {
                    // The task needs to run again: push it back onto the queue
                    std::unique_lock<std::mutex> lk(mtx_);
                    tasks_.push(task);
                    cond_.notify_one();        // wake a thread to do the task
                } else {
                    delete task;
                    task = nullptr;
                }
            }
        }
    }

If a task's Run() returns RET_PAUSE_TASK, the task is pushed back onto the queue and a worker thread is woken up to pick it up again. Unfortunately, the same worker thread keeps winning the race for the task, so it immediately re-runs it through that re-queue branch, over and over. At this point the CPU is squeezed dry, and you will see usage close to 100%.

(3) Data copy

Making lots of data copies frequently.

I won't go into details here, but if you have a lot of memcpy calls, keep an eye on your CPU. On Linux in particular, a plain memcpy that copies byte by byte does not perform well, so frequently copying large buffers will increase CPU consumption.

(4) Frequent system calls

For example: frequently calling printf, reading and writing the disk, sending and receiving over the network, and so on.

I/O itself requires very little CPU; most of the work is delegated to Direct Memory Access (DMA). But first, a word about concurrency. Loosely speaking, concurrency means appearing to do A and B at the same time: run process A for a while, context switch, run process B for a while, then come back to A. This gives us the illusion that A and B run simultaneously. This is the process model.

It looks clever, but by itself it saves nothing. You still have to do both A and B, so whether you run A then B or switch back and forth, the total time should be the same; in fact switching adds the overhead of context switches on top.

What gives a concurrent system its value is precisely this process of dispatching tasks, communicating, and waiting. The real mechanism is of course far more complicated than that. For example, the CPU does not talk to the hard disk directly; there is a middleman between them, the DMA (Direct Memory Access) controller.

CPU computes the file address → delegates the read to DMA → DMA takes over the bus → process A blocks and is suspended → the CPU switches to process B → DMA raises an interrupt when the read completes → the CPU switches back to process A to work on the file

This process corresponds to the figure below (source: UNIX Network Programming). Look at the application's timeline: after aio_read, it is blank, and the CPU is free to do other things.

Say the CPU would otherwise wait 50 nanoseconds for the file read. Even if the two context switches cost 5 nanoseconds each, the CPU still gains 40 nanoseconds of usable time. As the figure shows, this is roughly how "asynchronous I/O" works among the five traditional I/O models. For more detail, I recommend UNIX Network Programming, Volume 1: The Sockets Networking API.

Computer hardware uses DMA to access disks and other I/O devices: after a request is issued, the CPU does not get involved until the DMA controller finishes the transfer and raises an interrupt to say it is done. So a single I/O operation costs very little CPU time, and blocking does not consume CPU either. In a single-threaded program the blocked program simply cannot make progress, while in a multithreaded program the CPU hands the time over to other threads and processes.

So while a single I/O operation does not consume much CPU time, very frequent I/O can still be very expensive.

How can I reduce the CPU usage

Now that you know the reasons for high CPU usage, the remedies follow naturally.

(1) Hardware acceleration and parallel computing/multithreading

Common forms of hardware acceleration include multi-core computing, GPUs, DMA, and hardware audio/video codecs. Many of them require the platform to provide API or driver support.

For complex or highly concurrent data processing, we often reach for a thread pool. For how to use one, see the thread pool articles earlier in this series:

UNIX(Multithreading) : 21– Implementation principle of thread pool

UNIX(Multithreading) : 22– Several common thread pools

UNIX(Multithreading) : 23– Thread pool considerations and FAQs

A thread is the basic unit of CPU scheduling and dispatch. When multiple threads run, they fight over CPU time; the winner gets the CPU and runs its task, then the next winner runs, and so on. At some point one CPU is simply not enough, so we add more cores to share the load and relieve the pressure on any single CPU. But more threads are not always better: a crowd fighting over the CPU does not necessarily get more real work done.
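For instance, here is a minimal sketch (not from the original article) of splitting a CPU-heavy computation across several threads with std::async; the worker count and the sum-of-squares workload are placeholders for illustration:

    #include <cstdio>
    #include <future>
    #include <vector>

    // Sum of squares over [begin, end): a stand-in for any CPU-heavy chunk of work.
    static long long SumSquares(long long begin, long long end) {
        long long s = 0;
        for (long long i = begin; i < end; ++i) s += i * i;
        return s;
    }

    int main() {
        const long long n = 1000000;
        const int workers = 4;                      // e.g. one task per core
        std::vector<std::future<long long>> futs;
        for (int w = 0; w < workers; ++w) {
            long long begin = n / workers * w;
            long long end   = (w == workers - 1) ? n : n / workers * (w + 1);
            // std::launch::async forces each chunk onto its own thread.
            futs.push_back(std::async(std::launch::async, SumSquares, begin, end));
        }
        long long total = 0;
        for (auto& f : futs) total += f.get();      // wait for all chunks to finish
        std::printf("total = %lld\n", total);
        return 0;
    }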

Another example: common audio/video codec algorithms are usually very complex, so GPU hardware acceleration is often used to improve throughput and reduce CPU consumption. In addition, FFmpeg's codec library exposes a thread_count parameter to specify the number of threads, so encoding efficiency can be improved through multithreaded computation.
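If you use FFmpeg, a minimal sketch of enabling multithreaded encoding might look like the following; the H.264 codec choice and the thread count of 4 are assumptions for illustration, and error handling is mostly omitted:

    extern "C" {
    #include <libavcodec/avcodec.h>
    }

    // Sketch: allocate an encoder context and ask libavcodec to use several threads.
    AVCodecContext* MakeThreadedEncoder() {
        const AVCodec* codec = avcodec_find_encoder(AV_CODEC_ID_H264);
        if (!codec) return nullptr;
        AVCodecContext* ctx = avcodec_alloc_context3(codec);
        if (!ctx) return nullptr;
        ctx->thread_count = 4;                  // number of worker threads (0 = auto)
        ctx->thread_type  = FF_THREAD_FRAME;    // frame-level parallelism if the codec supports it
        // ... set width, height, pix_fmt, time_base, then call avcodec_open2(ctx, codec, nullptr) ...
        return ctx;
    }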

(2) Optimize excessive CPU switching caused by for loops

For example, for the three-level nested for loop I showed earlier, you can adjust the logic so that there are as few nesting levels as possible.

Take a look at these two pieces of code:

    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 10000; j++) {
            //TODO ...
        }
    }

and

    for (int i = 0; i < 10000; i++) {
        for (int j = 0; j < 100; j++) {
            //balabala
        }
    }

Which is more efficient? The second one. Two reasons:

1. Putting the longest loop on the inside means the outer-loop transitions happen far less often, which improves instruction-cache efficiency and reduces the latency caused by cache misses and pipeline flushes.

2. Branch prediction succeeds more often once the same loop has repeated a few times, which improves pipeline efficiency.

Now compare two ways of traversing the same array int x[1000][100]:

    int x[1000][100];
    for (i = 0; i < 1000; i++)
        for (j = 0; j < 100; j++) {
            // access x[i][j]
        }

    int x[1000][100];
    for (j = 0; j < 100; j++)
        for (i = 0; i < 1000; i++) {
            // access x[i][j]
        }

Here the first version is more efficient than the second, and the reason lies in the hardware. The CPU accesses memory through the cache. On a typical general-purpose CPU, the L1 cache might be 16 KB, organized into 32-byte cache lines (cache line size = 32 bytes); every time data is brought into the L1 cache from the L2 cache or main memory, 32 bytes are fetched at once. In the first version, each line brought into L1 serves eight consecutive accesses (eight 4-byte ints per 32-byte line). In the second version, a line brought into L1 is used once and has almost no chance of being reused. The other difference between the two versions is that the first increments the address by a constant 4 after each access, while the second increments it by 400. If cache behavior were not a factor, it would indeed be enough to simply put the long loop inside and the short loop outside; but the main cost is not the difference in instruction count, as commonly imagined, rather the pipeline interruptions caused by branch mispredictions, which degrade performance.

(3) Learn to "yield" the CPU

There are several ways to "yield" the CPU in a program. The first is sleep; the second is a wait/signal (condition variable) mechanism; every programming language has similar interfaces.
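For the wait/signal approach, here is a minimal, self-contained sketch (all names are made up for illustration) using std::condition_variable: the consumer sleeps until it is notified instead of spinning on the queue:

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::mutex mtx;
    std::condition_variable cv;
    std::queue<int> items;
    bool done = false;

    void Consumer() {
        while (true) {
            std::unique_lock<std::mutex> lk(mtx);
            // Sleep (no CPU used) until there is work or we are told to stop.
            cv.wait(lk, [] { return !items.empty() || done; });
            if (items.empty() && done) break;
            int v = items.front();
            items.pop();
            lk.unlock();
            std::printf("consumed %d\n", v);    // do the real work outside the lock
        }
    }

    int main() {
        std::thread t(Consumer);
        for (int i = 0; i < 5; ++i) {
            {
                std::lock_guard<std::mutex> lk(mtx);
                items.push(i);
            }
            cv.notify_one();                    // wake the consumer instead of letting it poll
        }
        {
            std::lock_guard<std::mutex> lk(mtx);
            done = true;
        }
        cv.notify_one();
        t.join();
        return 0;
    }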

Another strategy is to lower (or raise) the priority of your process/thread.

In fact, the busy loop in the thread pool code above can be fixed by adding sleep(0) right after the RET_PAUSE_TASK branch pushes the task back onto the queue.

When a thread calls sleep(0), it gives up the rest of its current time slice but stays runnable, and it will run again at the next idle time slice; in effect, yielding the slice puts it at the end of the queue of runnable threads. The purpose of sleep(0) is simply to abandon the time slice, give other threads a chance to run, and get rescheduled as soon as the system is free again. A purely compute-bound thread can call sleep(0) to release some time slices and give I/O threads a breather. However, very frequent calls to sleep(0) can degrade performance significantly, so use it with caution. Putting sleep(0) at the end of a while(true) loop does reduce CPU usage, but an infinite loop is not recommended in the first place, and sleep(0) is not the best fix; the code should be restructured (for example with wait/signal) to avoid burning CPU.
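Applied to the worker loop shown earlier, the change might look like the fragment below; it reuses that example's member names (tasks_, mtx_, cond_, RET_PAUSE_TASK), which are assumptions, and uses std::this_thread::yield() as a portable stand-in for sleep(0):

    #include <thread>   // std::this_thread::yield

    // Inside ThreadPool::work(), in the branch that re-queues a paused task:
    if (ret == RET_PAUSE_TASK) {
        {
            std::unique_lock<std::mutex> lk(mtx_);
            tasks_.push(task);
            cond_.notify_one();
        }
        // Give up the rest of this time slice so another worker (or any other
        // thread) gets a chance to pick the task up, instead of spinning.
        std::this_thread::yield();   // or sleep(0) / sched_yield() on POSIX
    } else {
        delete task;
        task = nullptr;
    }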

Continue with the example:

Here is some thread pool demo code I wrote:

Link: pan.baidu.com/s/15GNTo59M…

Extraction code: IGSJ

And when you run it, you realize oh my God, a few simple printfs are taking up this much CPU?

So it is worth digging into the multithreading behavior.

I added a sleep after each worker thread finishes a pass through its task loop.

The result is a big difference: CPU usage drops, with no significant impact on execution efficiency.

(4) Avoid frequent data copies

In multithreaded programming, some data copying is inevitable, but a few techniques can remove the unnecessary copies, so keep that goal in mind while you code. For the memcpy case I mentioned earlier, you can write your own copy routine: with 4-byte alignment, for example, you do not need to copy byte by byte, you can copy 4 bytes at a time.
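A minimal sketch of that idea, assuming both buffers are 4-byte aligned and the length is a multiple of 4 (a production version would also handle the unaligned head and tail, and whether this actually beats your platform's memcpy should be measured):

    #include <cstddef>
    #include <cstdint>

    // Copy 'len' bytes, assuming 4-byte-aligned src/dst and len % 4 == 0.
    void Copy4ByteAligned(void* dst, const void* src, std::size_t len) {
        auto* d = static_cast<std::uint32_t*>(dst);
        auto* s = static_cast<const std::uint32_t*>(src);
        for (std::size_t i = 0; i < len / 4; ++i) {
            d[i] = s[i];                  // move one 32-bit word per iteration
        }
    }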

(5) Merge some system calls

In many cases you can batch the work: combine multiple prints into one, combine multiple disk/network requests before sending them, or use a cache to reduce frequent I/O reads and writes.

So when faced with a large number of I/O tasks, you sometimes need logic to merge the I/O, or a cache to relieve the I/O pressure.

Say I have a function that needs to write some player actions to a log file and it is called very frequently. I can add a condition such as if (log_count_++ % 500 == 0), so that only one log entry is written for every 500 calls:

    if (log_count_++ % 500 == 0) {
        LOG_INFO("writing... , cmd(0x%08x), bigSeq=%u, samllSeq=%u, E=%u, payload=%u, "
                 "confid=%u, groupid=%u, ssrc=%u, data_count=%u",
                 dts_msg->GetCommand(), dts_msg->GetSequence(), dts_msg->GetExtSEQ(),
                 dts_msg->GetExtFlag(), dts_msg->GetPayload(), conf_id_,
                 dts_msg->GetGroupID(), dts_msg->GetSSRC(), log_count_);
    }

For another example, suppose you have a lot of player/user data to write out, and it arrives in dense, frequent bursts. If you write to I/O every time a piece of data arrives, just watch your CPU.

To avoid burning CPU, you can cache part of the data in record_msg_cache_ and flush the cached data to I/O once it reaches the threshold g_max_msg_size. This preserves data integrity as much as possible while reducing CPU consumption.

    record_msg_cache_.push_back(tpb);
    auto uSize = record_msg_cache_.size();
    if (uSize > g_max_msg_size) {
        ret = storage_handler_->WriteData(conf_file, record_msg_cache_, msg_cache_size_);
        record_msg_cache_.clear();
        msg_cache_size_ = 0;
    }

Another example: if you have a large amount of player data to read and write, you can use shared memory, for instance a cache backed by a file block mapped from disk, which is convenient whenever the player data is updated.
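For instance, here is a minimal sketch of mapping a file into memory with POSIX mmap; the file name and mapping size are placeholders and error handling is abbreviated:

    #include <cstdio>
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        const char* path = "players.dat";           // placeholder file name
        const size_t size = 4096;                   // placeholder mapping size

        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, size) != 0) { perror("ftruncate"); return 1; }

        // MAP_SHARED: writes go back to the file and are visible to other processes
        // mapping the same file, so this doubles as simple shared memory.
        void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { perror("mmap"); return 1; }

        std::memcpy(mem, "player record", 14);      // update data in place, no write() call
        msync(mem, size, MS_SYNC);                  // optionally flush to disk

        munmap(mem, size);
        close(fd);
        return 0;
    }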

(6) Adjust thread priority

The operating system is very powerful and provides several scheduling policies. Here I will mainly cover three policies of the Linux kernel:

  • SCHED_OTHER: time-sharing scheduling policy (the default)

  • SCHED_FIFO: real-time scheduling policy, first in, first out

  • SCHED_RR: real-time scheduling policy with round-robin time slices

Real-time processes are scheduled first, and their scheduling weight is determined by their real-time priority. Time-sharing processes are weighted by their nice value and counter: the smaller the nice value and the larger the counter, the higher the probability of being scheduled. In other words, the process that has used the least CPU is scheduled first.

The difference between SCHED_RR and SCHED_FIFO:

When a SCHED_RR task uses up its time slice, the system allocates it a new one and puts it at the end of the ready queue. Placing RR tasks at the end of the queue guarantees fair scheduling among all RR tasks of the same priority.

A SCHED_FIFO task, once it has the CPU, keeps running until a higher-priority task arrives or it gives up the CPU itself.

If real-time processes of the same priority (and therefore the same scheduling weight) are ready, under FIFO each must wait for the running one to yield before a peer of the same priority can run, while RR lets each task run for a slice of time in turn.

Similarities:

  • Both RR and FIFO are only used for real-time tasks.

  • Both are created with a priority greater than 0 (1-99).

  • Both use a preemptive, priority-based scheduling algorithm.

  • Ready real-time tasks preempt non-real-time tasks immediately.

For all tasks using the time-sharing policy (SCHED_OTHER):

1. The task is created with the time-sharing policy and a nice value (-20 to 19).
2. The task's execution time on the CPU (its counter) is determined by its nice value.
3. If the task is not waiting on any resource, it is added to the ready queue.
4. The scheduler walks the ready queue, computes each task's dynamic priority (counter + 20 - nice), and picks the task with the largest value to run. When its time slice is used up (counter drops to 0) or it gives up the CPU, the task goes to the end of the ready queue (slice used up) or to the wait queue (gave up the CPU while waiting for a resource).
5. The scheduler repeats the calculation and goes back to step 4.
6. When the scheduler finds that all ready tasks have a weight of 0 or less, it repeats from step 2.

For all tasks using the FIFO policy (SCHED_FIFO):

1. The task is created with the FIFO policy and a real-time priority rt_priority (1-99).
2. If the task is not waiting on any resource, it is added to the ready queue.
3. The scheduler walks the ready queue and, based on real-time priority, picks the task with the highest scheduling weight to use the CPU.
4. If the scheduler finds that a higher-priority task has become ready (woken by a timer interrupt, by the currently running task, or otherwise), it immediately saves all of the current task's CPU register state onto its stack, loads the higher-priority task's register state from its stack into the CPU, and lets the higher-priority task run. Repeat step 3.
5. If the current task gives up the CPU because it is waiting for a resource, it is removed from the ready queue and added to the wait queue. Repeat step 3.

For all tasks using SCHED_RR:

1. The task is created with the RR policy, a real-time priority, and a nice value (which is translated into the length of its time slice).
2. If the task is not waiting on any resource, it is added to the ready queue.
3. The scheduler walks the ready queue and, based on real-time priority, picks the task with the highest scheduling weight to use the CPU.
4. If an RR task in the ready queue has a time slice of 0, the scheduler refills its time slice based on its nice value and moves it to the end of the ready queue. Repeat step 3.
5. If the current task gives up the CPU because it is waiting for a resource, it is added to the wait queue. Repeat step 3.

When time-sharing, round-robin, and FIFO scheduling coexist in the system:

1. RR and FIFO tasks are real-time processes; time-sharing tasks are non-real-time processes.
2. When a real-time process becomes ready while a non-real-time process is running on the CPU, the real-time process preempts it immediately.
3. Both RR and FIFO use real-time priority as the scheduling weight. RR is an extension of FIFO: under FIFO, if two processes have the same priority, which of them runs is determined by their position in the queue, which is somewhat unfair (same priority, so why does one get to keep running?). If two tasks of the same priority are set to RR instead, they take turns executing periodically, which guarantees fairness.

Here’s an example:

    std::thread t = std::thread(&MixerImp::MixerThread, this);
    sched_param sch;
    sch.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (pthread_setschedparam(t.native_handle(), SCHED_FIFO, &sch)) {
        // failed to set the scheduling policy; handle the error (e.g. log errno)
    }

Here I give thread t the highest priority available under the chosen scheduling policy (via sched_get_priority_max). You can also use pthread_getschedparam to inspect a thread's current scheduling policy and priority.
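For example, a minimal sketch of reading the policy and priority back with pthread_getschedparam (the std::thread t is assumed to be the one created above):

    #include <cstdio>
    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    // 't' is assumed to be the std::thread created in the previous snippet.
    void PrintSchedInfo(std::thread& t) {
        int policy = 0;
        sched_param sch{};
        if (pthread_getschedparam(t.native_handle(), &policy, &sch) == 0) {
            const char* name = (policy == SCHED_FIFO)  ? "SCHED_FIFO"
                             : (policy == SCHED_RR)    ? "SCHED_RR"
                             : (policy == SCHED_OTHER) ? "SCHED_OTHER"
                             : "unknown";
            std::printf("policy=%s priority=%d\n", name, sch.sched_priority);
        }
    }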

After tweaking the thread priority and scheduling policy, the results are once again quite good.