Author: Wang Lulu | Megvii MegEngine Architect

1. Background

In the early development of MegEngine Imperative Runtime, we faced a number of performance optimization problems. Besides the known areas that needed refactoring (early design compromises that sacrificed performance for development efficiency), there were unknown performance issues that could only be uncovered by observing and analyzing the program with profilers. MegEngine’s Imperative Runtime is a module written in Python and C/C++, and for such programs there is a dizzying number of profilers to choose from. While investigating the various profilers we also hit quite a few pitfalls. For example, we once found that the profiling results of two profilers for the same program differed greatly. At first we suspected that the accuracy of one of them was questionable; in the end we discovered that the two were simply observing different things: one only profiles the CPU time of the program, while the other profiles the wall clock time. Although some of this information can be found in corners of the documentation, many users will not notice it until they have fallen into the trap themselves. If we had found an article at the beginning describing the features, strengths, weaknesses, and usage scenarios of the various profilers, we might have saved ourselves a lot of time.

Therefore, this article attempts to summarize these experiences, hoping to give readers who use these tools for the first time some reference. Performance optimization is a very broad topic that involves the CPU, memory, disk, network, and more; this article focuses mainly on the performance optimization of Python and C/C++ extension programs on the CPU, and is organized around the following three questions:

  • What are some common optimization goals for Python and C/C++ extensions
  • What are the capabilities and limitations of common tools, and how should we select tools given an optimization goal
  • Use of various tools and visualization of results

In addition, this article introduces some of the basic concepts you need to know about performance optimization, and ends with a practical example of a performance optimization during the development of MegEngine to show you the optimization process.

2. Basic concepts

This section describes some basic concepts in performance optimization:

2.1 Wall clock time, CPU time and off-CPU time

The most direct measure of performance is the running time of the program, but knowing the running time alone gives little guidance on how to make the program faster. We want to know more about what happens during that time.

The time command on Linux gives us some rough information. We can measure the time it takes to compute the CRC checksum of a file by running the following command:

time cksum <some_file>

Taking my computer (MacBook Pro 2018) as an example, I get the following output:

8.22s user 1.06s system 96% cpu 9.618 total

This output tells us where the time was spent:

  • Total time: 9.618s
  • User CPU time: 8.22s
  • System CPU time: 1.06s

The reason why the CPU execution time is divided into two parts is that a running program not only executes its own code and some library code, but also calls functions provided by the operating system (i.e., system calls, which run with higher privileges than ordinary program code). So a program typically runs in one of two states: user mode, in which the CPU executes non-system-call code (that is, user code or library code), and kernel mode, in which the CPU executes system calls.

So the user CPU time mentioned above refers to the time spent in user mode, and the system CPU time refers to the time spent in kernel mode.

User CPU time + system CPU time = 8.22s + 1.06s = 9.28s, which is less than the total of 9.618s. The remaining time is time the program spent not running on the CPU at all, for example waiting for a file to be loaded from disk into memory; it counts in neither user CPU time nor system CPU time. We call the time a program spends executing on the CPU (user CPU time + system CPU time) CPU time (or on-CPU time), the time the program spends in states such as sleeping or waiting off-CPU time (or blocked time), and the time that actually passes in the real world while the program runs wall clock time (literally, the time on the wall clock). For a given thread: wall clock time = CPU time + off-CPU time.

Generally, CPU time takes a larger proportion in computation-intensive tasks, while off-CPU time takes a larger proportion in I/O intensive tasks. Knowing the difference between CPU time and off-CPU time is important for performance tuning. For example, if a program has a performance bottleneck on off-CPU time and we choose a tool that only looks at CPU time, it is difficult to find the real performance bottleneck, and vice versa.
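
The same distinction can also be observed from inside a Python program. The following is a minimal illustrative sketch (not part of any profiler): time.process_time() counts only the CPU time of the current process, while time.perf_counter() measures wall clock time, so the sleep below contributes only to off-CPU time:

import time

def busy_loop(n=10_000_000):
    # CPU-bound work: contributes to both CPU time and wall clock time
    s = 0
    for i in range(n):
        s += i
    return s

wall_start = time.perf_counter()   # wall clock time
cpu_start = time.process_time()    # CPU time (user + system) of this process

busy_loop()       # on-CPU time
time.sleep(1.0)   # off-CPU time: the thread is blocked and consumes no CPU time

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall clock: {wall:.2f}s, CPU time: {cpu:.2f}s, off-CPU time: {wall - cpu:.2f}s")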

2.2 Performance observation tool

We now know that a thread's execution time consists of CPU time and off-CPU time, but if we want to optimize the program's performance, this alone is not enough. We need to know more: during CPU time, what happens on the CPU and how much time each part takes; and where the thread gets blocked and how long the blocking lasts. We need performance observation tools to obtain this detailed information. Such an observation tool is usually called a profiler.

Different observation objects correspond to different profilers. In terms of CPU alone, profilers are numerous.

According to the scope of observation, profilers for the CPU can be roughly divided into two categories: process level (also called application level in some places) and system level (system wide), where:

  • The process level only observes what is happening on a process or thread
  • The system level is not limited to a single process. The observation objects are all programs running on the entire system

It is important to note that some tools can observe both the entire system and an individual process, such as perf, so such tools fall into both categories.

According to the observation method, profilers can be roughly divided into two categories: event based and sampling based, where:

  • Event based: records a specified set of events, such as entering or leaving a particular function, allocating memory, throwing an exception, etc. An event based profiler is also called a tracing profiler or tracer in some articles
  • Sampling based: samples some information about the running program at a specified frequency; usually what is sampled is the program's call stack (see the sketch after this list)
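
To make the sampling idea concrete, here is a minimal, purely illustrative sketch of a sampling based profiler written in Python: a background thread periodically captures the call stacks of all Python threads via sys._current_frames() and counts how often each stack appears. Real profilers, such as the ones introduced below, are far more efficient and accurate:

import collections
import sys
import threading
import time
import traceback

def sample_stacks(counter, interval=0.01, duration=2.0):
    # Sample the call stacks of all Python threads every `interval` seconds
    end = time.time() + duration
    while time.time() < end:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == threading.get_ident():
                continue  # skip the sampler thread itself
            # join frame names into a flamegraph-style "folded" stack string
            stack = ";".join(f.name for f in traceback.extract_stack(frame))
            counter[stack] += 1
        time.sleep(interval)

counts = collections.Counter()
sampler = threading.Thread(target=sample_stacks, args=(counts,), daemon=True)
sampler.start()

sum(i * i for i in range(10_000_000))  # the workload being "profiled"

sampler.join()
for stack, n in counts.most_common(3):
    print(n, stack)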

Even when our optimization target falls into one of the categories above, there is still a finer-grained classification. Before choosing a tool, we need to find out exactly what we are optimizing. Generally, no single profiler can meet all of our needs: we need different profilers for different optimization targets (such as Python threads, C/C++ threads, etc.), and even for the same target, we may need different profilers depending on what information we care about.

2.3 Python process model

This article focuses on the optimization of Python (including C/C++ extension) programs. A typical Python and C/C++ extension program process is shown below:

A Python process contains a Python main thread, and may also contain several Python subthreads and several C/C++ subthreads. Therefore, we further divide the optimization targets into three categories:

  • Python code in a Python thread
  • C/C++ extension code in Python threads
  • C/C++ threads

The Python thread here refers specifically to the CPython interpreter thread, while the C/C++ thread refers to the C/C++ thread that does not contain the Python call stack.
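
As a purely illustrative sketch of these three targets (assuming numpy is installed as an example of a C extension; the function names are hypothetical):

import threading
import numpy as np  # numpy is a C extension: calls into it run C code inside a Python thread

def pure_python_work():
    # (1) Python code in a Python thread: executed by the CPython interpreter
    return sum(i * i for i in range(1_000_000))

def extension_work():
    # (2) C/C++ extension code in a Python thread: the Python call stack leads
    # into the extension's C/C++ call stack
    a = np.random.rand(1000, 1000)
    return np.linalg.inv(a)

# (3) Pure C/C++ threads are usually created inside the extension itself
# (for example MegEngine's worker thread) and have no Python call stack,
# so they cannot be seen from Python code like this.

t = threading.Thread(target=extension_work)  # a Python subthread
t.start()
pure_python_work()                           # runs in the Python main thread
t.join()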

3. Classification and selection of profilers

We describe the profilers from the following three perspectives:

  • Whether they support profiling CPU time, off-CPU time, or wall clock time (CPU time + off-CPU time)
  • Whether they support profiling the C/C++ call stack
  • Whether they can parse the Python call stack out of the call stack of the CPython interpreter

We will introduce six profilers: py-spy, cProfile, pyinstrument, perf, systemtap, and eu-stack. To make it easier to choose between them, we have divided these profilers into four categories, which are listed in the table below, where ✓, ⚠ and × stand for supported, partially supported and not supported, respectively.

The first category is pure Python profilers, which can only observe the call stacks of Python functions in Python threads and are suitable for optimizing pure Python code. In this category we introduce cProfile (Python’s built-in profiler) and pyinstrument (a third-party Python profiler). There are many more, such as scalene, line-profiler, pprofile, etc.; since their capabilities differ little, we do not introduce them one by one here.

The second category is Python thread profilers. The main difference from the first category is that, in addition to observing the Python call stack in Python threads, they can also observe the C/C++ extension call stack.

The third category is system level profilers, which we use to observe the call stacks of C/C++ extensions in Python threads and of pure C/C++ threads. These profilers can observe the call stack of the CPython interpreter, but since they are not designed specifically for Python, they do not parse the call stacks of Python functions, so they are not well suited to observing the Python call stack. In this category we introduce perf and systemtap.

The last category contains only eu-stack, which can sample the program's C/C++ call stack on wall clock time (CPU time + off-CPU time) and can therefore be used as a wall clock profiler in this scenario.

It should be noted that no profiler is absolutely accurate: the profiler's own influence on the program and the randomness of sampling both affect the results. We should not take a profiling result as the actual state of the running program, but rather as an estimate (possibly even a biased one) of what actually happened.

In addition to profilers, we need tools to visualize the profiling results in order to analyze the performance bottlenecks. Unlike profilers, visualization tools are generally quite versatile. One widely used tool is flamegraph. This article will introduce how to use flamegraph, as well as an improved tool: speedscope.

Since the introduction of profiler needs to reference visualization tools, we will first introduce visualization tools and then introduce profiler.

4. Visualization tools

4.1 flamegraph

Flamegraph is a powerful tool for visualizing profiling results. It can not only process the output of various profilers, but also visualize the processed results. It can process the raw output of more than a dozen profilers on different platforms. In addition to visualizing the profiling results on the CPU, it can also visualize the output of some memory profilers.

Flamegraph is usually used to process the profiler’s raw output and generate an SVG file that can be opened in the browser. The effect is as follows:

The main function of flamegraph is to display the frequency distribution of the call stacks sampled by the profiler. The vertical stacks in the graph represent call stacks, and the width of each rectangular block represents how often that function was sampled at run time (for a fixed sampling rate, the sampling proportion is approximately equal to the time proportion of that execution path). By looking at the flame graph, we can see which execution paths the program takes and how much time each execution path accounts for, and then optimize the most time-consuming performance bottlenecks (or “hot spots”).

To learn more about Flamegraph, check out the author’s home page or Github repo:

  • Homepage: www.brendangregg.com/flamegraphs…
  • GitHub repo: github.com/brendangreg…

4.2 speedscope

Another tool worth introducing is speedscope. Speedscope is used in a similar way to flamegraph and is compatible with flamegraph’s output format. Speedscope has two advantages over flamegraph: 1) it is excellent in terms of visualization and interactivity, and 2) its runtime overhead is much lower than that of an SVG, so many windows can be opened at the same time without noticeable lag. Therefore, we recommend using speedscope in combination with flamegraph: use flamegraph to process the output data of the different tools, and use speedscope for visualization. Speedscope is a web app, and the author provides an address for using it directly: www.speedscope.app/. It can also be deployed locally, but using the web version is more convenient. This article uses speedscope to visualize profiling results. The following is a brief introduction to how to use it:

Go to www.speedscope.app/ and open up the profiling results in JSON format (we’ll see how to convert the results to this JSON in the various profiler usage sections below). You can see the following interface (one difference from Flamegraph is that Speedscope’s call stack is reversed):

Three modes can be selected in the upper left corner:

  • Time Order: timeline mode, in which left to right is the direction of time and each column in the middle represents the call stack sampled at that moment
  • Left Heavy: displays the time proportion of the functions in the call stack (estimated from the number of samples), that is, each layer of the call stack is sorted with the more time-consuming functions on the left and the less time-consuming ones on the right. Click on any function in the call stack:
    • In the lower left corner of the figure you can see the Total cost and Self cost of the function on the current call stack (This Instance). The integers in the figure are the number of samples, and the percentages are the sampling proportion (approximately equal to the time proportion).
    • The white box at the bottom of the figure is the call stack of that function.
  • Sandwich: sorts functions by total cost and by self cost. Clicking on a function shows its callers and callees

For more information, see Speedscope’s official repo: github.com/jlfwong/spe…

5. Performance observation tools

Let’s look at a few profilers that are commonly used in Python and C/C++ extension optimization. We will introduce the advantages and limitations of each profiler, and you should choose the right tool according to your actual needs.

5.1 py-spy

Py-spy is a sampling based profiler whose profiling targets are the Python call stack and the C/C++ extension call stack. The overhead of py-spy is moderate and does not slow the program down too much. It also supports outputting results directly in speedscope and flamegraph formats.

Repo: github.com/benfred/py-…

You can install it directly using pip:

pip install py-spy

Usage:

py-spy record --format speedscope -o output.json --native -- python3 xxx.py
# ==================== main options ====================
# --format: output format, default is flamegraph; the speedscope format is recommended
# --output: output file name
# --native: also sample the C/C++ call stack; without --native only the Python call stack is sampled
# ==================== other options ====================
# --rate: sampling frequency, default value 100; turning on --native increases the overhead, so it is recommended to lower --rate when --native is on
# --subprocesses: whether to sample subprocesses, default: no
# --threads: show the thread id in the output
# --idle: whether to sample idle (off-CPU time) threads, default: no; enable it or not depending on the specific scenario
# ==================== example ====================
py-spy record -sti --rate 10 --format speedscope --native -o output.json -- python3 xxx.py
# It is also possible to sample an already running Python process:
py-spy record --format speedscope -o output.json --native --pid 12345

Other features:

In addition to the record subcommand, py-spy also provides a top subcommand, which dynamically shows the most time-consuming functions on the command line, and a dump subcommand, which prints the call stack of a running Python process. For details, refer to the documentation: github.com/benfred/py-…

Py-spy supports sampling a program's call stack on CPU time or on wall clock time (CPU time + off-CPU time); what it samples are the Python call stacks and the C/C++ extension call stacks of Python threads.

While py-spy can also sample the call stacks of C/C++ threads, for now (v0.3.4) it puts the Python call stack and the C/C++ call stack together (see the GitHub issue: github.com/benfred/py-…) regardless of whether there is a real call relationship between the two. This causes the profiling results of C/C++ threads to be aggregated by Python call stack during visualization, which makes them difficult to analyze. Therefore, py-spy is currently only suitable for observing pure Python code and C/C++ extensions in Python threads (where there is a call relationship between Python and C/C++), and not for observing independent C/C++ threads; this is a shortcoming of py-spy.

5.2 cProfile

CProfile is an event based profiler; it works by hooking every Python function call in the Python interpreter and recording its time overhead. CProfile is Python's built-in profiler and so, in theory, has relatively good support for Python. It has quite a few limitations:

  • It can only observe pure Python function calls and cannot see the call stacks of C/C++ extensions
  • It only supports wall clock time
  • Its overhead is relatively large

Usage:

python3 -m cProfile xxx.py

The output format of cProfile is a table, as follows:

2173 function calls (2145 primitive calls) in 10.230 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:103(release)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:143(__init__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:147(__enter__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:151(__exit__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:176(cb)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
       32    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:222(_verbose_message)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:307(__init__)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:311(__enter__)
...

You can also specify an output file:

python3 -m cProfile -o xxx.cprof xxx.py
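
cProfile can also be driven from inside a program, which is convenient when only part of a script needs to be profiled. Below is a minimal sketch (the profiled function here is hypothetical); the dumped file can be fed to the visualization tools described next:

import cProfile
import pstats

def workload():
    # hypothetical function to be profiled
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# print the 10 most expensive entries sorted by cumulative time,
# then dump the stats to a file for gprof2dot / flameprof below
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
stats.dump_stats("xxx.cprof")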

The raw output is not easy to analyze, and we need some third-party tools to process the output data and visualize it. Here are two methods:

  • Turn it into a call graph
# dependency: gprof2dot, graphviz
# sudo apt install graphviz
# pip3 install gprof2dot
  
gprof2dot -f pstats xxx.cprof | dot -Tpng -o xxx.png && feh

Visual effects are as follows:

  • Into flamegraph
# dependency: flameprof
# pip3 install flameprof
 
flameprof xxx.cprof > flamegraph.svg

For more information, please refer to the cProfile documentation: docs.python.org/3/library/p…

5.3 pyinstrument

Pyinstrument is a sampling based profiler, similar in function to cProfile: it only supports sampling the pure Python call stack and likewise only supports wall clock time. However, its overhead is lower than that of cProfile, and it provides multiple visualization methods.

Official Repo address: github.com/joerick/pyi…

Installation:

pip3 install pyinstrument

Usage:

python3 -m pyinstrument xxx.py

The default output is displayed on the command line as follows:

Recorded: 15:50:07  Samples:  8794
Duration: 23.332    CPU time: 31.381
pyinstrument v3.4.1

Program: train.py

23.331 <module>  train.py:9
├─ 22.641 main  train.py:29
│  └─ 22.631 worker  train.py:114
│     ├─ 10.077 item  megengine/core/tensor/array_method.py:391
│     │  └─ 10.067 numpy  megengine/tensor.py:113
│     │     └─ 10.067 Tensor.numpy  <built-in>:0
│     ├─ 8.056 train_step  train_random.py:159
│     │  ├─ 5.396 __call__  megengine/module/module.py:120
│     │  │  └─ 5.392 forward  model.py:170
│     │  │     └─ 5.348 __call__  megengine/module/module.py:120
│     │  │        └─ 5.239 forward  megengine/module/sequential.py:94
│     │  │           └─ 5.198 __call__  megengine/module/module.py:120
│     │  │              └─ 4.838 forward  model.py:87
...

You can also output HTML files to view the above results in the browser:

python3 -m pyinstrument --html xxx.py

The effect is as follows:
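
Pyinstrument also provides a Python API, which is handy when only a fragment of a program needs to be profiled. A minimal sketch (the profiled function here is hypothetical):

from pyinstrument import Profiler

def workload():
    # hypothetical code to be profiled
    return sum(i * i for i in range(1_000_000))

profiler = Profiler()
profiler.start()
workload()
profiler.stop()
print(profiler.output_text(unicode=True, color=True))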

5.4 perf

Perf is the official Linux profiler, and its source code can be found in the Linux kernel. Perf is very powerful: it supports observing both the entire operating system and a single given process. In addition, perf can trace specified events and also supports sampling a process's call stack. Perf does its work in kernel mode, so its overhead is very low.

Install the perf:

sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

The perf version must be consistent with that of the Linux kernel. The uname -r in the installation command ensures this

Usage:

# Run the program directly and sample it
perf record -F 99 --call-graph dwarf python3 xxx.py
# Or sample a given thread id (tid)
perf record -F 99 --call-graph dwarf --tid <tid>
# You can also use --pid to specify the process id to sample; in that case perf samples all threads of that process

When finished, a file named perf.data is generated in the current directory

Visualize the results:

You can use perf’s own subcommands to analyze the profiling result:

perf report --stdio -i perf.data

We recommend turning it into a flame graph:

perf script -i perf.data > perf.unfold
# Clone FlameGraph
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
# Assuming perf.unfold was saved in <result_dir>
./stackcollapse-perf.pl <result_dir>/perf.unfold > perf.fold
./flamegraph.pl perf.fold > perf.svg

For users with Linux kernel version 5.8 or later, perf supports converting the results directly into a flame graph:

perf script report flamegraph -i perf.data

Perf is not a profiler for Python. It samples the call stack of the CPython interpreter, so you can’t see the specific Python function directly, as shown below:

Therefore, perf is less suitable for optimizing pure Python code and better suited to observing C/C++ threads. In addition, the perf record command only samples the program's call stack during CPU time. If you want to observe off-CPU time, you need to use perf's tracing features; for details, see this article: www.brendangregg.com/blog/2015-0… . So far, perf does not support observing CPU time and off-CPU time at the same time; the two cases can only be analyzed separately.

5.5 systemtap

Systemtap is a dynamic tracing tool for the operating system. It is even more powerful, and we can simply think of systemtap as a programmable system level tracer. Systemtap provides a scripting language with which users can write scripts to trace and monitor various system calls, user-space function calls, and other custom events, so users can obtain whatever information they need to observe while the program runs. Systemtap compiles the user's script into C code, compiles that into a kernel module, and then loads the module into the running operating system kernel.

This article does not intend to cover systemtap in detail; we only focus on its use in profiling (readers interested in systemtap can read the SystemTap Beginners Guide). In addition, this article does not show how to write scripts to implement a profiler; the OpenResty team provides a systemtap toolset (openresty-systemtap-toolkit) that we can use directly.

Here’s how to install SystemTap and use the toolset:

Installation of systemtap:

sudo apt install systemtap systemtap-runtime

In addition, we need to install the various kernel dependencies of the SystemTap runtime: We can install the dependencies directly by running sudo stap-prep, or manually if this command fails:

sudo apt install kernel-debuginfo-`uname -r`
sudo apt install kernel-debuginfo-common-`uname -r`
sudo apt install kernel-devel-`uname -r`

Systemtap requires sudo privileges to run. We can check whether the installation succeeded by running the following command:

sudo stap -e 'probe begin { printf("Hello, World!\n"); exit() }'

If Hello World is displayed, the installation is successful.

Use systemtap:

We need two tools from openresty-systemtap-toolkit: sample_bt and sample_bt_off_cpu, which sample the call stack on CPU time and on off-CPU time respectively:

# Get the toolkit
git clone https://github.com/openresty/openresty-systemtap-toolkit.git
cd openresty-systemtap-toolkit
# Sample the program's call stack on CPU time; sampling time is 10s, the target is the user-space call stack
sudo ./sample_bt -u -t 10 -p <pid> > cpu.bt
# Sample the program's call stack on off-CPU time; sampling time is 10s, the target is the user-space call stack
sudo ./sample_bt_off_cpu -u -t 10 -p <pid> > off_cpu.bt
# -u samples the user-space call stack; use -k instead to sample the kernel-space call stack
# -t is the sampling time, set to 10s here
# -p is the pid of the program to sample

Like Perf, SystemTap runs in kernel mode with very low overhead. And for Python threads, SystemTap also samples the call stack of the CPython interpreter, so it is more suitable for observing C/C++ threads.

Visualize the results:

Similar to perF, FlameGraph supports processing systemTap output:

stackcollapse-stap.pl <result_dir>/cpu.bt > cpu.fold
flamegraph.pl cpu.fold > cpu.svg

5.6 eu-stack

Eu-stack is a member of the Elfutils (sourceware.org/elfutils) toolset and its basic function is to sample a program’s call stack:

eu-stack -p <pid>
# ===== example output =====
#0  0x00007f5f96bbadd7 __select
#1  0x00007f5f95868278 call_readline
#2  0x0000000000425182 PyOS_Readline
#3  0x00000000004a84aa tok_nextc.cold.140
#4  0x00000000005a7774 tok_get
#5  0x00000000005a8892 parsetok
#6  0x00000000006351bd PyParser_ASTFromFileObject
#7  0x00000000004ad729 PyRun_InteractiveOneObjectEx
#8  0x00000000004afbfe PyRun_InteractiveLoopFlags
#9  0x0000000000638cb3 PyRun_AnyFileExFlags
#10 0x0000000000639671 Py_Main
#11 0x00000000004b0e40 main
#12 0x00007f5f96ac5bf7 __libc_start_main
#13 0x00000000005b2f0a _start

Unlike perf and SystemTap, eu-stack gets its C/C++ call stack even if the program is in an off-CPU state. Therefore, we can implement a sampling based profiler on wall clock time by writing a simple script that continuously samples the call stack of a thread.

A simple profiler implementation:

#!/bin/bash -xe
pid=$1
for x in $(seq 0 1000)
do
    eu-stack -p $pid > d_$x.txt || true
    sleep 0.2
done

Eu-stack does not require root permission to run, and it takes tens of milliseconds to run once, so the impact on the program is not too big.

Usage:

# 1. Install elfutils
sudo apt install elfutils
# 2. Run the script above
./profiler.sh <pid>

Visualization method:

Flamegraph also comes with a tool to process the output of the EU-stack, turning the collected samples into a Flamegraph, which can also be viewed in Speedscope:

# Process the samples collected by the script above
stackcollapse-elfutils.pl <result_path>/d*.txt > eu_perf.txt
# Use speedscope: open eu_perf.txt in speedscope.app directly
# Or use flamegraph:
flamegraph.pl eu_perf.txt > output.svg

6. Example

Finally, a real-world example of performance optimization during MegEngine development is presented to show a process from profiling to finding bottlenecks to performance optimization.

During MegEngine's early development we found that one of the detection models trained very slowly with single-machine 8-card data parallelism: one step took more than twice as long as on a single card. The first thing we suspected was the overhead of the all reduce that synchronizes gradients across the cards. To confirm this, we turned off gradient synchronization between the cards so that the training on each card was independent of the others; the speed only increased a little, and a step of 8-card training still took about twice as long as on a single card.

Since 8-card training launches 8 processes, in order to completely rule out interference from possible interactions between the 8 processes, we changed the 8-card training into launching 8 independent single-card training runs, and found that the speed barely changed. We also observed that the running time of one step was positively correlated with the number of single-card training runs launched: with 2 single-card runs the speed was only slightly slower, and with 4 it was about 50% slower.

We therefore suspected that the slowdown might come from the 8 processes competing for CPU resources. To confirm this, we bound each process to the same number of CPU cores, so that every process used the same number of CPUs whether one process or eight were training; the speed was once again almost unchanged. So the slowness of 8-card training should have nothing to do with CPU resource contention.

Since simple guessing could not locate the problem, we decided to find it with profilers. MegEngine's execution can be simplified into two threads: the Python main thread and the worker thread, where:

  • Python main thread: Executes Python training scripts and sends tasks to the queue
  • Worker thread: a pure C++ thread that is responsible for fetching tasks from the queue and executing them

We first profiled the Python main thread. Since we also wanted to see the C/C++ extensions, a pure Python profiler could not meet our needs, so we used py-spy. We turned on the --idle option so that samples on off-CPU time were included, and got the following result:

We found that about 80% of the main thread's time was spent waiting, as shown in the red box in the figure above. There are two places that get a tensor's numpy ndarray and one place that gets a tensor's shape, and in all of these cases the wait is caused by syncing with the worker thread. In MegEngine, the main thread and the worker execute asynchronously: when we need to get a tensor's numpy ndarray, we must wait for the worker to finish the computation on CUDA and copy the result into memory, which triggers a sync. The other case is getting a tensor's shape: if the shape of an operator's output cannot be inferred in advance, we have to wait for the worker to finish computing it before we know the shape, which also triggers a sync.

It can be seen that almost all the time of the main thread is spent waiting for the worker, indicating that the execution of the worker is slow, and the real problem should be in the worker thread.

Therefore, we went on to profile the worker thread. Since the worker is a pure C/C++ thread, the available tools were perf, systemtap and eu-stack. We were not sure whether the problem lay in CPU time or off-CPU time, and in order to observe both states together, we chose eu-stack and obtained the following result:

We found that the worker thread actually spends more than 85% of its time calling cudaGetDeviceProperties inside the topk operator (the red box in the figure). After consulting the documentation, we found that this API has a high overhead and that I/O traffic contention occurs when multiple processes call it at the same time, which is why the ioctl took so long with 8 processes. We later modified the implementation of topk to avoid calling cudaGetDeviceProperties, and after testing again the model returned to its normal speed: the single-process speed increased by about 13%, and the 8-process speed more than doubled.

7. References

  • Systems Performance: Enterprise and the Cloud
  • www.brendangregg.com/offcpuanaly…
  • www.brendangregg.com/flamegraphs…
  • iobservable.net/blog/2013/0…
  • blog.openresty.com.cn/cn/dynamic-…
  • qqibrow.github.io/performance…
  • www.percona.com/blog/2017/0…

Please join the MegEngine developer exchange group: 1029741705

For discussion or feedback on the use of the framework, please visit the forum: discuss.megengine.org.cn; GitHub project address: github.com/MegEngine/M… ; MegEngine official website: megengine.org.cn/