Thoughts on improving client-side computing performance

Preface

With the improvement of computing power and the explosive growth of data, artificial intelligence is being applied everywhere. The improvement in computing power owes a lot to the development of GPUs. Take object detection as an example: after several layers of feature extraction, the input image yields a classification score for each object of interest and a bounding box (Bbox) locating it in the image. At each layer, the operation between the input feature map and the convolution kernel is essentially a matrix operation, which is much better suited to GPUs.

Then there is mining. People often say that if you had bought a few bitcoins ten years ago you would be rich by now, and there is no denying it. Blockchain technology caught fire along with Bitcoin. So what exactly is mining? In everyday discussions I have rarely heard an explanation that I agree with or that resonates with me, so let me explain. In plain terms, mining is bookkeeping. Under the PoW (proof-of-work) consensus, reaching consensus means, in computer terms, hashing over and over until the hash value satisfies a predefined condition; that condition is the mining difficulty. Since the total amount of bitcoin is fixed, mining keeps getting harder, and all sorts of mining machines have emerged, along with factory-like mining pools and GPU clusters.

Both AI and blockchain require a lot of scientific computing. So whether we are mining or running experiments, the more machines (GPUs) we can pile together, the better; that much is certain. GPUs are good at data-parallel computation, and the efficiency gained from computing on multiple GPUs is real leverage.

But do you ever get the feeling that, as the Internet develops, as users' expectations of software keep rising, and as software architectures keep being upgraded, doing all AI algorithm processing centrally seems inappropriate? Somehow it just doesn't feel right.

For example, when I was working at a large company, I was mainly responsible for intelligent video clipping and the upstream and downstream distribution of video data. On the algorithm side, I mainly worked on CV, NLP and so on. The machines that supported the algorithms cost 400,000 RMB on average. It seems that some small algorithms could actually be deployed on the client side, which would save costs without users noticing. At the same time, client-side processing is closer to users and sees their most direct features, rather than working on data that has been wrapped in layer upon layer of protocols, even if that data changes little, if at all. Compared with cloud deployment, the benefits of client-side deployment are obvious:

1. High real-time performance: client-side processing saves the network transmission time of the data.
2. Resource savings: it makes full use of the client's computing power and storage space.
3. Good privacy: data is generated and consumed on the client side, avoiding the risk of privacy leaks during transmission.

But client-side computing power is a big bottleneck. Even some simple operations can lead to a bad experience. You can see this in the browser mining demo I developed and in the PoseNet example running in the browser: although I added some user-friendly touches, concentrating the computation on the client side does not make for a good user experience.

Why is the GPU good for large amounts of scientific (repetitive) computing?

From the very beginning, CPUs and GPUs were designed for different roles and different problems. The CPU is the core and the scheduler of the computer. It is not only used for arithmetic; its main design goal is low latency. It contains a small number of powerful arithmetic logic units (ALUs) and a large cache. The GPU's main application scenarios at present are image rendering, video decoding, deep learning, scientific computing and so on. It is best suited to single-instruction, multiple-data (SIMD) style computation, and its main design goal is high bandwidth (throughput). It contains a large number of ALUs but only a small cache. An ALU is a combinational logic circuit that can perform a variety of arithmetic and logic operations. The GPU is the better fit when you need to do the same thing to a lot of data; the CPU is the better fit when you need to do many different things to the same data.
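
To make "the same thing to a lot of data" concrete, here is a minimal sketch of matrix multiplication in plain JavaScript. It runs on the CPU and is only an illustration; the function name matMul and the row-major Float32Array layout are my own assumptions, not something from a particular library. The point to notice is that every output cell runs the same multiply-accumulate loop and does not depend on any other cell, which is the pattern a GPU can spread across its many ALUs.

```javascript
// Multiply an n×m matrix `a` by an m×p matrix `b`.
// Both inputs are row-major Float32Arrays (an assumption of this sketch).
function matMul(a, b, n, m, p) {
  const out = new Float32Array(n * p);
  for (let row = 0; row < n; row++) {
    for (let col = 0; col < p; col++) {
      // Each (row, col) cell is independent of every other cell:
      // the same instructions, applied to different data.
      let acc = 0;
      for (let k = 0; k < m; k++) {
        acc += a[row * m + k] * b[k * p + col];
      }
      out[row * p + col] = acc;
    }
  }
  return out;
}

const product = matMul(
  new Float32Array([1, 2, 3, 4]),
  new Float32Array([5, 6, 7, 8]),
  2, 2, 2
);
console.log(Array.from(product)); // [ 19, 22, 43, 50 ]
```

On a single CPU core the n × p cells are computed one after another; the more cells there are, the more the independent structure of the loop favors hardware with many simple compute units.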

What are the client-side limitations?

In fact, the main contradiction is client-side computing power: the performance bottleneck on the client cannot be relieved by piling up GPUs the way it can on the server. This is the biggest limitation. To perform a large number of operations on the client side, the language running there needs to perform well. Take JavaScript as the client-side language used for scientific computation. As we all know, JS is a dynamic, weakly typed, interpreted language. Does that mean it has no compiler? No: it is not that compilation is unnecessary, but that it happens at run time. Running JS requires a JS engine, the most common of which is V8. When JS code starts running, V8 parses the source into an abstract syntax tree (AST); its interpreter then generates bytecode from the AST and executes it.

An interpreter starts up and begins executing quickly, because it does not have to wait for the whole program to be compiled: it translates from the first line and executes as it goes. For web software and platforms, being able to run part of the code sooner and show the user something matters more. For repetitive code, however, a pure interpreter has to translate the same code over and over, which makes it run less efficiently. The computing discussed above consists largely of repeated calculation, and this is where JS is relatively limited. By contrast, this is where a compiled language shows its advantage: it may take longer to compile the whole source into machine-executable code, but because the code is compiled and optimized ahead of time, repeated work carries no extra translation cost and performance is better.

Of course, what I said above about how V8 handles JS is only a rough sketch. V8 applies all sorts of optimizations across the whole JS execution process, and many major companies in the industry have customized their engines to improve performance. For example, the engine profiles the running code, recording how many times a piece of code runs, how it runs and so on, and marks code by state so that it can be optimized to improve execution efficiency and performance.

Client-side AI applications

As I understand it, client-side AI means combining AI capabilities with the requirements of the client (app, Web, H5) and running algorithm prediction there. I think this is an inevitable trend; lightweight CNN models such as MobileNetV2 point in this direction.

When we train algorithms, we choose a suitable framework, such as PyTorch, TensorFlow or Caffe. The choice usually depends on whether the source code is available, whether the community is mature, whether it is easy to get started with, whether it is easy to write, and so on. One important point that cannot be overlooked, however, is the performance of the framework. The client side (JS) also has a number of inference frameworks:

1. TensorFlow.js
2. ONNX.js
3. WebDNN
4. Paddle.js

So how to improve the execution efficiency and performance of client-side frameworks is very important.

Speeding up client-side computing power

1. Web Workers

The Web Workers specification defines an API for spawning background scripts in your web application. Web Workers allow you to do things like fire up long-running scripts to handle computationally intensive tasks, but without blocking the UI or other scripts to handle user interactions.

Web Workers is an HTML5 API that lets web developers run a script in the background without blocking the UI. It can be used for computation-intensive work and takes full advantage of multiple CPU cores. Try a Web Workers demo to see the real effect.

Browser compatibility: all major browsers support Web Workers.

Parallel.js is a wrapper around Web Workers that can be used in Node or in the browser. It can allocate threads according to the number of cores in your CPU and then use its map and reduce interfaces to run computations in parallel. Adding a timing function makes it easy to compare how long a single thread and multiple threads take on the same problem, for example summing the numbers from 1 to 100000000000000.
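
The map and reduce idea can be sketched with the raw Web Workers API as follows. This is not Parallel.js's implementation; splitRange, sumRange and parallelSum are hypothetical names I made up for the illustration. The first two functions are plain JavaScript and run anywhere, while parallelSum assumes a browser (Worker, Blob and URL.createObjectURL) and is only defined here, not called.

```javascript
// Split the inclusive range [start, end] into at most `parts` contiguous chunks.
function splitRange(start, end, parts) {
  const total = end - start + 1;
  const size = Math.ceil(total / parts);
  const chunks = [];
  for (let lo = start; lo <= end; lo += size) {
    chunks.push([lo, Math.min(lo + size - 1, end)]);
  }
  return chunks;
}

// The "map" step each worker performs: sum the integers in [lo, hi].
function sumRange(lo, hi) {
  let sum = 0;
  for (let i = lo; i <= hi; i++) sum += i;
  return sum;
}

// Browser-only: one worker per chunk, then "reduce" the partial sums
// on the main thread. Returns a Promise for the total.
function parallelSum(start, end, workerCount) {
  // The worker script is built from sumRange's own source text.
  const source = `${sumRange.toString()}
    onmessage = (e) => postMessage(sumRange(e.data[0], e.data[1]));`;
  const url = URL.createObjectURL(
    new Blob([source], { type: 'application/javascript' })
  );
  const jobs = splitRange(start, end, workerCount).map(
    (chunk) => new Promise((resolve, reject) => {
      const worker = new Worker(url);
      worker.onmessage = (e) => { resolve(e.data); worker.terminate(); };
      worker.onerror = reject;
      worker.postMessage(chunk);
    })
  );
  return Promise.all(jobs).then((partials) => {
    URL.revokeObjectURL(url);
    return partials.reduce((a, b) => a + b, 0);
  });
}
```

In a browser, something like parallelSum(1, 1e8, navigator.hardwareConcurrency || 4) could be timed against sumRange(1, 1e8) on the main thread. Keep the upper bound modest: sums above 2^53 are no longer exact in JS numbers.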

2. WebGL

People familiar with OpenGL can get started with WebGL easily, although there are some differences in the programming language and the API. At the web page level, the canvas element provides a 2D or 3D context for drawing 2D or 3D graphics. Of course, you can also use a WebGL library to do the drawing; three.js is the most convenient one at present, in my opinion. The details of WebGL can be looked up online. One of the biggest benefits of WebGL is that its internal mechanism calls on the GPU to accelerate graphics drawing. This makes it possible to use WebGL for GPU-accelerated computing.

The color of each pixel can be represented by the four RGBA channels, each ranging from 0 to 255. If each channel is an 8-bit value, a pixel can store 32 bits, and this is the core idea of front-end GPU computing: each pixel can hold one 32-bit value, which is exactly an int or a uint. I tried this in my earlier browser-based mining demo, BWCoin, to speed up the JS hash computation, but it failed 😆. Still, it counts as experience! The procedure is that you need to write your own pixel-based hash method in WebGL. It really burns the brain ~
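
The packing itself can be sketched in plain JavaScript. This is only an illustration of "one pixel holds one uint32"; the function names and the byte order (R holds the most significant byte) are assumptions of mine, not the BWCoin code. A Uint8Array laid out this way is the kind of buffer that could then be uploaded to WebGL as an RGBA texture of unsigned bytes.

```javascript
// Split an unsigned 32-bit integer into four 8-bit channels [R, G, B, A].
// Assumed byte order: R is the most significant byte.
function packUint32ToRGBA(value) {
  const v = value >>> 0; // force to uint32
  return [(v >>> 24) & 0xff, (v >>> 16) & 0xff, (v >>> 8) & 0xff, v & 0xff];
}

// Rebuild the unsigned 32-bit integer from its four channels.
function unpackRGBAToUint32([r, g, b, a]) {
  return ((r << 24) | (g << 16) | (b << 8) | a) >>> 0;
}

// Encode a list of uint32 values as pixel data, four bytes per value.
function encodeToPixels(values) {
  const pixels = new Uint8Array(values.length * 4);
  values.forEach((value, i) => pixels.set(packUint32ToRGBA(value), i * 4));
  return pixels;
}

console.log(packUint32ToRGBA(0xdeadbeef));            // [ 222, 173, 190, 239 ]
console.log(unpackRGBAToUint32([222, 173, 190, 239])); // 3735928559
```

The final `>>> 0` matters: JS bitwise operators work on signed 32-bit integers, so without it any value with the top bit set would come back negative.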

3. WebAssembly

Finally, I want to talk about WASM. I experimented with it recently, and I think it might kill JS. For reference, see the browser support table for WASM and the WASM Chinese website.

WebAssembly, or WASM, is a new format that is portable, compact, fast to load and compatible with the web. Isn't that cool? If JS execution is not efficient enough, replace it directly: use C, C++ or even assembly to speed up computation on the client. After a few days of trying it, my understanding is that WebAssembly is a bytecode standard that runs in the browser as bytecode on a virtual machine. Its foundation is that many languages can target it: different languages are compiled to .wasm through LLVM and then executed. Because the binary has been compiled ahead of time, it skips the two steps of parsing and compiling that JS source needs. In scenarios with heavy computation and high performance requirements, such as image and video decoding, image processing and 3D, the advantages show. In other words, you no longer have to worry about JS execution efficiency. Write your client-side inference framework or low-level code in a more efficient low-level language, and performance can improve substantially.
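
To show what "running bytecode" looks like from the JS side, here is a minimal sketch that loads a tiny hand-assembled module through the standard WebAssembly JavaScript API. The byte array is an illustrative module I wrote out by hand; it exports a single function, add, that adds two 32-bit integers. Real projects would load a .wasm file produced by a compiler instead, and the variable names are my own.

```javascript
// A hand-assembled WebAssembly module exporting add(i32, i32) -> i32.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // "\0asm" magic + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section: one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export section: "add" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section: one body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0, local.get 1, i32.add, end
]);

// Synchronous compile + instantiate; acceptable here because the module is tiny.
const wasmModule = new WebAssembly.Module(wasmBytes);
const wasmInstance = new WebAssembly.Instance(wasmModule);
const { add } = wasmInstance.exports;

console.log(add(2, 3)); // 5
```

For larger modules, browsers expect the asynchronous forms such as WebAssembly.instantiate, so that compilation does not block the main thread.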

1. WASM installation and Emscripten compilation: you must have Git, CMake and Python installed. On macOS, CMake can be installed via Homebrew (brew install cmake).

# Get the emsdk repo
git clone https://github.com/emscripten-core/emsdk.git

# Enter that directory
cd emsdk

# Fetch the latest version of the emsdk (not needed the first time you clone)
git pull

# Download and install the latest SDK tools.
./emsdk install latest

# Make the "latest" SDK "active" for the current user. (writes .emscripten file)
./emsdk activate latest

# Activate PATH and other environment variables in the current terminal
source ./emsdk_env.sh

2. Write C code

#include <stdio.h>

int main(int argc, char **argv) {
	/* The trailing newline makes sure the line-buffered output is flushed to the console. */
	printf("Hello, bowen!\n");
	return 0;
}

3. Run the WASM compilation command to output the HTML file

emcc hello.c -s WASM=1 -o hello.html

4. Use the emrun command to create an HTTP web server to display our compiled files

emrun --no_browser --port 8080 .

5. Open hello.html in the browser (with the command above, http://localhost:8080/hello.html) and check the "Hello, bowen!" output on the console

Of course, this is just a quick taste of WASM. WASM is powerful and there are many things you can do with it, such as building a custom video player. I'm still exploring ~😁

Conclusion

A conclusion? Do I really need to summarize after all of the above? No. I'm telling you: client-side intelligence is certain and inevitable, so go and learn WebGL and WebAssembly now. 😆😆😝😝🤓🤓😎😎🤠🤠🤣🤣😈😈🤖🤖👴👴

Just kidding. To sum up: I saw an article on InfoQ titled "Imho, 90% of applications don't use WebAssembly". When I read it, I was tempted to reply, "Imho, it's the other 10% of applications where you get stuck". The author has a point, but it is always that 10% that needs a breakthrough. I admit that WASM is not widely useful at the moment, but the extreme experience, the point where a breakthrough is needed and where you get stuck by the neck, is often the handful of jobs that nobody does.

Secondly, using WebGL or WASM is a good way to improve computing power and performance on the client side, especially WASM.
