Abstract: Huawei Cloud experts share roughly 20 points covering optimization planning, execution, multi-processing, and the psychology of development, to teach you how to write high-performance code.

High-performance computing is a very broad topic, ranging from dedicated hardware/processors/architectures/GPUs, to operating systems/threads/processes/parallel and concurrent algorithms, to cluster/grid computing, and all the way up to supercomputers such as Tianhe-2 (TH-2).

This time, we will start from our own hands-on project work and, as always with original content, share our experience and lessons learned. The material covers roughly 20 points on optimization planning, execution, multi-processing, development psychology, and more, including example code snippets in Python.

In commercial software development, the core problem that high-performance computing solves can be put very plainly: "under limited hardware conditions, how do you take a piece of code that cannot finish running, make it run, and even make it fly?"

Performance Improvement Experience

Take two examples; feel the difference for yourself.

(1) Processing the historical document-reading behavior of 6.35 million users: the data processing time was optimized from 50 hours down to 15 seconds. (Yes, you read that right.)

(2) A Mongo-based wide-table build: from 20 hours down to the time it takes to get up and fetch a glass of water.

In the era of big data, an excellent programmer can write programs that perform hundreds or even thousands of times better than others'. Such a skill contributes enormously to a product, and it is a highlight and a bonus on his/her resume.

A bit of history

Around 2000, due to the limitations of PC hardware, programmers of that generation, such as Qiu Bojun and Lei Jun in China, or Bill Gates and John Carmack abroad, were able to squeeze out program performance at the machine-code/assembly level.

By the mid-2000s, PC hardware performance was evolving rapidly, and high-performance optimization was heard of mostly around embedded and mobile devices. Mainstream mobile devices of that era were developed with J2ME, with something like 128 KB of memory available. Programmers in those days had to watch program size (OTA download limits on the order of 128 KB) and memory usage very carefully, literally counting bytes on their fingers. For example, a program would usually contain only one class, because each additional class cost a few more KB of memory. Data files were merged into one to reduce the file count, so you had to track, for example, at which byte offset each piece of data started.

Around 2008, the first generation of iOS/Android smartphones was released. Available memory per app reached 1 GB, apps could be downloaded over Wi-Fi, and app sizes could exceed 100 MB. I just took a look at my P30: in terms of storage space, QQ uses 4 GB and WeChat uses 10 GB. Device performance improved, memory and storage became plentiful, and programmers were finally "liberated", until the advent of big data.

In the era of big data, the amount of data is growing crazily. It is common for your program to run all night for a large data set operation.

Basic knowledge

This post assumes that you already know the concepts of threads, processes, and the GIL. If you don't, read the following summary and remember the three basic facts at the end.

What is a process? What is a thread? What's the difference?

The following content is from Wikipedia: en.wikipedia.org/wiki/Thread…

Threads differ from traditional multitasking operating-system processes in several ways:

  • processes are typically independent, while threads exist as subsets of a process
  • processes carry considerably more state information than threads, whereas multiple threads within a process share process state as well as memory and other resources
  • processes have separate address spaces, whereas threads share their address space
  • processes interact only through system-provided inter-process communication mechanisms
  • context switching between threads in the same process typically occurs faster than context switching between processes

The famous GIL (Global Interpreter Lock)

The following is from Wikipedia.

A global interpreter lock (GIL) is a mechanism used in computer-language interpreters to synchronize the execution of threads so that only one native thread can execute at a time.[1] An interpreter that uses GIL always allows exactly one thread to execute at a time, even if run on a multi-core processor. Some popular interpreters that have GIL are CPython and Ruby MRI.

Basic knowledge summary:

  • Because of the famous GIL, only one thread in a CPython process can execute Python bytecode at a time (this keeps the interpreter thread-safe), so threads cannot run truly in parallel
  • For computation-intensive applications, use multiple processes
  • For I/O-intensive applications, use multithreading (see the sketch below)
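A minimal sketch of the I/O-intensive case (the URL below is a placeholder): because the GIL is released while a thread waits on network I/O, threads can overlap those waits.

	# I/O-bound work: threads overlap blocking waits (GIL released during I/O)
	from concurrent.futures import ThreadPoolExecutor
	from urllib.request import urlopen

	URLS = ["https://example.com"] * 5

	def fetch(url):
	    with urlopen(url) as resp:
	        return len(resp.read())

	with ThreadPoolExecutor(max_workers=5) as pool:
	    sizes = list(pool.map(fetch, URLS))
	    print(sizes)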

Practice points

So much for the groundwork. Now let's get down to the business of developing high-performance code.

I have always been thinking about how to share effectively. First, I insist on original content: if the same material can be found on the Internet, sharing it again wastes my time and everyone else's. Second, different things should be said in different ways to different audiences.

This time, the audience is mostly experienced Python programmers, so we won't spend much time on the basics; if something is unclear, don't worry, you can look it up afterwards and it will make sense. We will start from practical issues: I have summarized roughly 20 points and development techniques, and I hope they will help you in your future work.

Plan and design optimizations as early as possible; implement them as late as possible

When we receive a project, we can identify which parts of it are likely to have performance problems. In the design, we can think ahead, for example by choosing the right data structures and keeping classes and methods decoupled, so that future optimization stays possible.

We have seen projects that skipped this early design and tried to optimize late, only to find that the required changes were too large and the risk too high.

However, a common mistake here is premature optimization, which is much discussed in the software world. We should resist optimizing too early: first put the big picture together and implement the main features, and only then think about performance.

Make it simple, evaluate it, plan it, and optimize it

Evaluate the cost and benefit of each change. For example, suppose one module takes an hour to run; 3 hours of optimization, development, and testing might save 30 minutes, a 50% performance gain. Another module takes 30 seconds; the same effort might save 20 seconds, a 67% gain. Which module would you optimize first?

We recommend prioritizing the first module, because the absolute benefit is greater: it saves 30 minutes. The second module, at 30 seconds, is acceptable without optimization and should get the lowest priority.

In another case, if the second module is frequently called by other modules, then we need to reevaluate the priority.

When we optimize, we need to control an impulse we may well have: to optimize everything we can.

When we don't have a "hammer", we are troubled by problems for lack of skills and tools; but once we have a hammer, it's easy to see everything as a nail.

Use sampled data in development and debugging, combined with a configuration switch

For time-consuming calculations, add a sampling parameter during development and pass different arguments as needed. This lets you test quickly while keeping a clean separation between debugging and production code. Never use comments to turn code on and off.

Refer to the following schematic code:

	# Bad
	def calculate_bad():
	    # uncomment for debugging
	    # data = load_sampling_data()
	    data = load_all_data()

	# Good
	def calculate(sampling=False):
	    if sampling:
	        data = load_sampling_data()
	    else:
	        data = load_all_data()

Sort out the data pipeline and establish a performance measurement mechanism

I wrote my own decorator, @timeit, to make it easy to print how long each piece of code takes.

	@timeit
	def calculate():
	    pass
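The decorator's implementation is not shown here; a minimal sketch (not the author's actual code), using only the standard logging and time modules, could look like this:

	# A sketch of a timing decorator matching the log format shown below
	import functools
	import logging
	import time

	def timeit(func):
	    @functools.wraps(func)
	    def wrapper(*args, **kwargs):
	        logging.info("%s - Start", func.__name__)
	        start = time.perf_counter()
	        result = func(*args, **kwargs)
	        spent = time.perf_counter() - start
	        logging.info("%s - End - Spent: %fs", func.__name__, spent)
	        return result
	    return wrapper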

The log it produces is plain enough for anyone to understand. In production, you can also use configuration to control whether it prints.

[2020-07-09 14:44:09,138] INFO: TrialDataContainer.load_all_data - Start
...
[2020-07-09 14:44:09,172] INFO: preprocess_demand - Start
[2020-07-09 14:44:09,172] INFO: preprocess_demand - End - Spent: 0.012998s
...
[2020-07-09 14:44:09,186] INFO: preprocess_warehouse - Start
[2020-07-09 14:44:09,189] INFO: preprocess_warehouse - End - Spent: 0.002611s
...
[2020-07-09 14:44:09,454] INFO: preprocess_substitution - Start
[2020-07-09 14:44:09,628] INFO: preprocess_substitution - End - Spent: 0.178258s
...
[2020-07-09 14:44:10,055] INFO: preprocess_penalty - Start
[2020-07-09 14:44:20,818] INFO: preprocess_penalty - End - Spent: 10.763566s
[2020-07-09 14:44:20,835] INFO: TrialDataContainer.load_all_data - End - Spent: 11.692677s
[2020-07-09 14:44:20,836] INFO: ObjectModelsController.build - Start
[2020-07-09 14:44:20,836] INFO: ObjectModelsController.build_penalties - Start
[2020-07-09 14:44:20,836] INFO: ObjectModelsController.build_penalties - End - Spent: 0.000007s
[2020-07-09 14:44:20,837] INFO: ObjectModelsController.build_warehouses - Start
[2020-07-09 14:44:20,848] INFO: ObjectModelsController.build_warehouses - End - Spent: 0.011002s

In addition, Python provides profiling tools in the standard library, such as cProfile, which can be used to locate time-consuming functions.
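For example, profiling the calculate function above, sorted by cumulative time:

	# Locate time-consuming functions with the standard library's cProfile
	import cProfile

	cProfile.run("calculate()", sort="cumulative")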

Prioritize data read performance

In a complete project there are likely to be many candidate performance improvements. I recommend prioritizing data reading, because problems there are easy to locate, the code changes are relatively independent, and results come quickly.

Many machine learning projects, for example, need to build sample data for model training, and building data samples often means creating a wide table. Many databases offer ways to speed such operations up. Suppose we use MongoDB: it provides an aggregation pipeline that lets multiple data operations be sent to the DB in a single statement.
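A hedged sketch with pymongo (the database, collection, and field names are made up; $merge needs MongoDB 4.2+): filtering, grouping, and writing the wide table all happen inside the DB in one aggregate call, with no per-document loop in Python.

	# Push the whole transformation down to MongoDB in one statement
	from pymongo import MongoClient

	coll = MongoClient("mongodb://localhost:27017")["demo_db"]["user_reads"]

	pipeline = [
	    {"$match": {"event": "read"}},             # filter inside the DB
	    {"$group": {"_id": "$user_id",             # one row per user
	                "total_reads": {"$sum": 1}}},
	    {"$merge": {"into": "user_read_wide"}},    # materialize the wide table
	]
	coll.aggregate(pipeline)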

In one project where we first did this in a very rough way, it took almost 20 hours. After half a day of optimization we ran it, got up to fetch a glass of water, and it was finished by the time we came back: the time had dropped to about 1 minute.

Note that we often feel little incentive to optimize data reading, because it seems to happen only a few times. In practice, especially during the trial stage, data is read far more often than expected: we never stop changing the data, adding a field here or a feature there, so the data-reading code runs again and again, and the optimization pays off repeatedly.

Reduce time complexity, consider preprocessing, and trade space for time

If we think of performance optimization as a banquet, the data-reading part is the appetizer. Now let's get to the main course: reducing time complexity and trading space for time.

For example, if your program has O(n^2) complexity, it will be very inefficient at large data scales; if you can optimize it to O(n), or even O(1), you get orders-of-magnitude performance improvements.

For example, by using inverted lists (an inverted index) and preprocessing the data, trading space for time, we achieved the improvement from 50 hours to 15 seconds mentioned above.
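A toy illustration of the idea (the data is made up): one O(n) preprocessing pass builds an inverted list from user to documents, so each later lookup is a hash hit instead of a full scan.

	# Trade space for time: preprocess once, then answer queries in O(1)
	from collections import defaultdict

	events = [("alice", "doc1"), ("bob", "doc2"), ("alice", "doc3")]

	# One O(n) pass builds the inverted list: user -> documents read
	docs_by_user = defaultdict(list)
	for user, doc in events:
	    docs_by_user[user].append(doc)

	# Each query is now a dict lookup instead of a scan over all events
	print(docs_by_user["alice"])  # ['doc1', 'doc3']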

Because of the famous GIL, use multiple processes to improve performance, rather than multithreading

In the Python world, because of the famous GIL, the basic rules for improving computing performance are: for I/O-intensive applications, use multithreading; for computation-intensive applications, use multiple processes.

A multi-process example:

We prepared a long array and a relatively time-consuming arithmetic summation function.

	MAX_LENGTH = 20_000
	data = [i for i in range(MAX_LENGTH)]

	def calculate(num):
	    """Calculate the number and then return the result."""
	    result = sum([i for i in range(num)])
	    return result

Single-process execution example code:

	def run_sinpro(func, data):
	    """The function using a single process."""
	    results = []

	    for num in data:
	        res = func(num)
	        results.append(res)

	    total = sum(results)

	    return total

	%%time
	result = run_sinpro(calculate, data)
	result

CPU times: user 8.48 s, sys: 88 ms, total: 8.56 s
Wall time: 8.59 s
1333133340000

From here we can see that a single process takes ~9 seconds.

Next, let’s look at how this code can be optimized using multiple processes.

	# import multiprocessing tools; cpu_count() can help choose a pool size
	from multiprocessing import Pool, cpu_count

	# Note: on platforms that spawn (e.g. Windows), a Pool created in a
	# script must sit under an `if __name__ == "__main__":` guard.

	def mulp_map(func, iterable, proc_num):
	    """The function using multi-processes."""
	    with Pool(proc_num) as pool:
	        results = pool.map(func, iterable)

	    return results

	def run_mulp(func, data, proc_num):
	    """Sum the results computed by a pool of worker processes."""
	    results = mulp_map(func, data, proc_num)
	    total = sum(results)

	    return total

	%%time
	result = run_mulp(calculate, data, 4)
	result

CPU times: user 14 ms, sys: 19 ms, total: 33 ms
Wall time: 3.26 s
1333133340000

The same calculation takes about 9 seconds in a single process; on an 8-core machine, with 4 worker processes, it takes about 3 seconds, a reduction of more than 60%.

Multi-process: design the unit of work to be as small as possible

Let's imagine a scenario: you have 10 employees and 10 jobs, each consisting of the same 5 sub-jobs. How would you assign them? The obvious answer is to give each of the 10 people one job and run the 10 jobs in parallel. Sounds fine, right? Yet in a real project, designing parallel computation this way is likely to go wrong.

Here's a real-world case where the final performance gain from exactly this design was poor. The reason? (Pause here and think about it.)

There are two main reasons. First, the granularity of a parallel computing unit should not be too large; if it is, there are usually data exchange or sharing problems. Second, with coarse granularity, completion times vary widely across units, creating a weakest-link effect: the slowest unit determines the total time.

In that real-world case, the parallel computation took an hour, but analysis revealed that a single process took the whole hour while all the other processes finished their tasks within five minutes.

Another benefit of small units is that when something goes wrong, the code is easier to localize and maintain. So: make the unit of work as small as possible.
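A sketch of the idea (the job structure is hypothetical): flatten the 10 coarse jobs into 50 sub-jobs and let the pool balance the load.

	# Fine-grained units: one slow sub-job no longer stalls a whole job
	from multiprocessing import Pool

	def run_subjob(task):
	    job_id, sub_id = task
	    return (job_id, sub_id)  # stand-in for the real sub-job's work

	tasks = [(j, s) for j in range(10) for s in range(5)]  # 50 small units

	if __name__ == "__main__":
	    with Pool(10) as pool:
	        results = pool.map(run_subjob, tasks)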

Multi-process: Avoid communication or synchronization between processes

Once units are small enough, we should also avoid inter-process communication and synchronization as much as possible, since they cause waiting and stretch the overall execution time.

Multi-process: debugging is hard; besides logs, try gdb/pdb

A well-known problem with parallel computing is that it is hard to debug. An ordinary IDE debugger can break into only one process. A good practice is to print logs that include the PID, so problems can be localized. Note that in parallel code you don't want to print too many logs; if you have already tuned the single-process implementation as described above, the most important things to log are each process's start point, the data it processes, and its end point. Then, if you see one process holding everything back, take a close look at that process's data.
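For example, the standard logging module can stamp every line with the PID via the %(process)d attribute:

	# Include the PID in each log line to tell workers' output apart
	import logging

	logging.basicConfig(
	    format="[%(asctime)s] PID %(process)d %(levelname)s: %(message)s",
	    level=logging.INFO,
	)

	def worker(task_id):
	    logging.info("Start - task %s", task_id)   # start point
	    # ... the actual work, ideally logging the data being processed ...
	    logging.info("End - task %s", task_id)     # end point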

This is delicate work: when many processes start up and run for hours, you may have no idea what is going on. You can use tools such as top on Linux or Task Manager on Windows to monitor process status, and you can attach a tool like gdb/pdb to a process to see where it is stuck.

Multi-process: Avoid transferring large amounts of data as parameters

In a real project, well-designed units usually don't need many parameters, as in the simple example above. Be aware that when large data is passed as a parameter, memory consumption is high and creating child processes becomes slow, since the arguments must be serialized (pickled) for each worker.
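A common workaround, sketched here with hypothetical file names: pass each worker a lightweight reference, such as a file path or an index range, and let the worker load its own slice.

	# Pass references instead of data: each child loads only its partition,
	# avoiding the cost of pickling a huge object into every worker
	from multiprocessing import Pool

	def process_partition(path):
	    # load and process just this partition inside the worker
	    return path

	paths = [f"data/part-{i:04d}.csv" for i in range(8)]

	if __name__ == "__main__":
	    with Pool(4) as pool:
	        done = pool.map(process_partition, paths)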

Multi-process: fork? spawn?

Python supports three start methods for processes: spawn, fork, and forkserver. They differ in startup speed and in which resources the child inherits: spawn and forkserver start the child with only the necessary resources, while fork gives the child a copy of effectively everything in the parent.

The default varies by operating system and Python version. Windows only supports spawn, so spawn is its default; starting with Python 3.8, macOS also defaults to spawn; Unix-like systems default to fork. fork and forkserver are not available on Windows.
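You can also select the start method explicitly instead of relying on the platform default:

	# Explicitly choose the start method (call once, before starting processes)
	import multiprocessing as mp

	if __name__ == "__main__":
	    mp.set_start_method("spawn")   # "fork" / "forkserver" are Unix-only
	    print(mp.get_start_method())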

A soul-searching question: are multiple processes necessarily faster than a single process?

At this point we are mostly done. Using the Python multiprocessing API with the examples above, and following the points I mentioned, will solve more than 80% of problems; after all, performance optimization is not something you need every day. What follows are things you might only arrive at after a year of performance work. I write them down here for reference, in the hope of saving you some detours.

For example: are multiple processes necessarily faster?

As mentioned in the first point, every optimization has overhead. When multiprocessing doesn't solve your problem, don't forget to try going back to a single process; that may be the solution. (A real example: after 2 weeks of optimization, a program still took 3 hours to execute with 10 processes; switched back to a single process, it ran in under 30 minutes.)

Optimization psychology: with a hammer in your hand, everything looks like a nail

As mentioned above, sometimes what needs optimizing is the data structure, not the number of processes.

Optimization psychology: don't blindly trust "experts"

I believe many teams work like this: when a project hits a major technical problem, such as performance, management calls in outside experts to help. In my observation, 80% of the time this doesn't help much, and sometimes it makes things worse.

The reason is simple; in one sentence: no matter how much relevant experience someone has, it is highly unlikely that, based on the information you provide, they will point out in five minutes the problem you yourself cannot solve. If you don't believe me, test this idea against your own experience, or watch for it in the future. Why can it make things worse? Dependency. With experts around, people stop fighting all-out: "anyway, the experts will guide us." As Nietzsche put it, to accomplish what seems impossible, one must strain beyond one's capacity. So when a problem is really hard, it may actually help to believe, almost fanatically, that this is something only you can solve: you, and no one else.

During one performance optimization project that lasted nearly a month, a line from Detective Conan kept playing in my head: "There is only one truth." I firmly believed the solution was getting closer, even as I failed again and again, and in the end that belief helped a great deal.

Optimization psychology: optimization may be a long process of struggling through confusion every day

In the long, sometimes painful process of performance tuning, a patient listener helps a lot. He or she may not be able to point you to a solution, just listen patiently and say "it will be fine", but that helps clear your mind and can trigger a burst of inspiration. The same goes for everything else in life.

Optimization psychology: managers can help buy time and reduce psychological stress

Experienced managers, for example, negotiate with the business side and deliver in stages. Others come back every few hours asking, "Has the performance improved yet?", then put on a strange expression: "Is it really that hard?"

In one case I know of, the performance optimization lasted nearly a year, during which several contractors came, went, and broke down.

Therefore, we urge project managers to show more understanding for developers and to help them resist external pressure, rather than passing it straight through, or even amplifying it.

References

  • baike.baidu.com/item/High Performance Computing
  • www.liaoxuefeng.com/wiki/101695…
  • en.wikipedia.org/wiki/Thread…
  • en.wikipedia.org/wiki/Global…
  • git.huawei.com/x00349737/n…
  • docs.python.org/3/library/p…
