Summary: There is already plenty of practice in using WebGL for high-performance computing on the Web, such as TensorFlow.js in on-device intelligence and Stardust.js in visualization.

Source | Alibaba Tech official account

There is already plenty of practice in using WebGL for high-performance computing on the Web, such as TensorFlow.js in on-device intelligence and Stardust.js in visualization. In this article, we will cover the following:

  • The history of general-purpose computing on GPUs (GPGPU)
  • The technical principles of implementing GPGPU on the Web with today's graphics APIs, and the difficulties front-end developers may encounter
  • Relevant industry practice, including layout computation, animation interpolation, etc.
  • Limitations and future prospects

One What is GPGPU

Because of their different hardware architectures, GPUs and CPUs excel at different kinds of computing tasks. A CPU is designed for low latency: it devotes much of its die to a complex cache hierarchy and control logic (e.g. branch prediction), with ALUs taking up only a small portion. A GPU is built for high throughput and consists largely of ALUs. As a result, in single-instruction, multiple-data (SIMD) scenarios the computing speed of the GPU far exceeds that of the CPU, and the gap keeps widening.

Some modern GPUs, such as those based on Nvidia’s Turing architecture, even include dedicated hardware for tensor computation and ray tracing, which yields further performance gains on these computationally heavy tasks.

This brings us to GPGPU: General-Purpose computation on Graphics Processing Units, i.e. using GPUs for work other than rendering.

Since the concept was proposed in 2002, GPGPU has appeared in real-time encryption and decryption, image compression, random-number generation and other computing fields, and GPU Gems/GPU Pro devote dedicated chapters to it. With CUDA (Compute Unified Device Architecture), introduced by Nvidia, developers can write their own parallel computing code in C, Java, Python and other languages.

So how should we use the computing power of the GPU on the Web?

Two Principles of implementing parallel computing with WebGL

Modern graphics APIs (Vulkan/Metal/Direct3D) provide a Compute Shader stage in which developers can write computing logic. Since WebGPU is still under development, only the WebGL 1/2 graphics rendering APIs are available on the Web today, and neither supports Compute Shaders (WebGL 2.0 Compute has been abandoned), so we have to take a roundabout route. The last section of this article looks at future technologies.

Let’s set the API usage aside for a moment and look at how the CPU and GPU collaborate in parallel computing; the former is usually called the host and the latter the device. The first step is data initialization: data must be copied from CPU memory into GPU memory, which in WebGL is done by binding textures. In the second step, the CPU prepares the commands and data to be submitted to the GPU and compiles the compute program, which in WebGL is accomplished by calling a series of APIs. In the third step, the computing logic is distributed across the GPU cores for execution; this logic is therefore called the "kernel function". Finally, the results are copied from GPU memory back to CPU memory, which in WebGL1 is done by reading pixel values from the texture.
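To make the mapping concrete, here is a rough and heavily simplified outline of how these four steps correspond to WebGL1 calls (a fuller sketch appears in the texture-mapping section below):

```ts
// Rough mapping of the four GPGPU steps onto WebGL1 calls (simplified outline).
// `gl` would be a WebGLRenderingContext obtained from a (possibly offscreen) canvas.

// 1. Host -> device: upload input data as a texture.
//    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, w, h, 0, gl.RGBA, gl.FLOAT, inputData);

// 2. Prepare the "kernel": compile and link shaders, bind uniforms/attributes.
//    gl.compileShader(...); gl.linkProgram(program); gl.useProgram(program);

// 3. Launch: draw into a framebuffer object; the fragment shader then runs once
//    per output pixel, i.e. once per "thread".
//    gl.bindFramebuffer(gl.FRAMEBUFFER, fbo); gl.drawArrays(gl.TRIANGLES, 0, 3);

// 4. Device -> host: read the result back from the attached texture.
//    gl.readPixels(0, 0, w, h, gl.RGBA, gl.FLOAT, outputData);
```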

Next, starting from the GPU programming and execution models, we introduce the concepts of threads and thread groups, which are the key to GPU data parallelism. The following figure shows the hierarchy of grid, thread groups and threads; it uses DirectCompute terminology, but the concepts are not limited to it (a small indexing sketch follows the list).

  • Dispatch(x, y, z) allocates a three-dimensional grid of threads that share a global memory space;
  • The grid contains many thread groups (also called work groups, thread blocks or local groups), and each thread group contains many threads. Thread groups are also three-dimensional; their size is typically declared in the Shader via numthreads(x, y, z). Threads within a group can communicate through shared memory or synchronization primitives;
  • The Shader program ultimately runs on every thread. Each thread can obtain its own 3D coordinates within its thread group, as well as the 3D coordinates of its thread group within the whole grid, and map them to different data, which is how data parallelism is achieved.
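As a purely illustrative example (none of these names come from a real API), this is how a thread typically derives the data element it owns from its group and local coordinates:

```ts
// Illustrative only: how a thread maps its coordinates to a data element.
// numThreads is the thread-group size declared in the shader (e.g. numthreads(8, 8, 1));
// groupId / localId are the built-in coordinates each thread can query.
function globalThreadId(
  groupId: [number, number, number],
  localId: [number, number, number],
  numThreads: [number, number, number],
): [number, number, number] {
  return [
    groupId[0] * numThreads[0] + localId[0],
    groupId[1] * numThreads[1] + localId[1],
    groupId[2] * numThreads[2] + localId[2],
  ];
}

// A thread then processes one element, e.g. data[x + y * width], which is how
// "one thread per element" data parallelism works in practice.
```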

From the hardware perspective, a thread corresponds to a CUDA core on the GPU, a thread group corresponds to a Streaming Multiprocessor (SM), and the grid corresponds to the whole GPU.

1 WebGL1 texture mapping

The figure below, from “GPGPU Programming Technology – From GLSL and CUDA to OpenCL”, shows the classic GPGPU computing flow.

Normally the final output target of a graphics rendering API is the screen, where the rendering result is displayed. In a GPGPU scenario, however, we only want to read the computed result back on the CPU side, so we use the off-screen rendering capability of the API, i.e. render-to-texture, with a Framebuffer Object (FBO) as the render target. Textures store both the input data and the computed results, so when creating them we usually need to enable the floating-point extension OES_texture_float (its functionality is built into WebGL2).

The parallel computation happens in the rasterization stage: we write the computing logic (the kernel function) in the Fragment Shader, while the Vertex Shader is only responsible for mapping texture coordinates, so the geometry can be either a quad (4 vertices) or a full-screen triangle (3 vertices). Every pixel is processed in the same way; what is normally rendering logic becomes a computation, and the pixel values become the result.
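To make this concrete, here is a minimal, hypothetical WebGL1 sketch of the whole flow: a fragment-shader "kernel" that doubles every value of a small float texture. It assumes OES_texture_float is available; whether a float texture can be used as a render target and read back as FLOAT is implementation-dependent, and production code often packs results into 8-bit RGBA instead. Error handling is omitted.

```ts
// Minimal WebGL1 GPGPU sketch: the fragment shader doubles every element of a
// width x height float texture. Assumes OES_texture_float; no error handling.
const gl = document.createElement('canvas').getContext('webgl') as WebGLRenderingContext;
gl.getExtension('OES_texture_float');

const vs = `
attribute vec2 a_pos;
varying vec2 v_uv;
void main() {
  v_uv = a_pos * 0.5 + 0.5;            // map clip space to texture coordinates
  gl_Position = vec4(a_pos, 0.0, 1.0);
}`;
const fs = `
precision highp float;
uniform sampler2D u_input;
varying vec2 v_uv;
void main() {
  vec4 v = texture2D(u_input, v_uv);   // read this "thread"'s input element
  gl_FragColor = v * 2.0;              // kernel: double it
}`;

function compile(type: number, src: string) {
  const s = gl.createShader(type)!;
  gl.shaderSource(s, src);
  gl.compileShader(s);
  return s;
}
const program = gl.createProgram()!;
gl.attachShader(program, compile(gl.VERTEX_SHADER, vs));
gl.attachShader(program, compile(gl.FRAGMENT_SHADER, fs));
gl.linkProgram(program);
gl.useProgram(program);

// Full-screen triangle: 3 vertices are enough to cover every output pixel.
const buf = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, buf);
gl.bufferData(gl.ARRAY_BUFFER, new Float32Array([-1, -1, 3, -1, -1, 3]), gl.STATIC_DRAW);
const loc = gl.getAttribLocation(program, 'a_pos');
gl.enableVertexAttribArray(loc);
gl.vertexAttribPointer(loc, 2, gl.FLOAT, false, 0, 0);

const width = 4, height = 4;
function createTexture(data: Float32Array | null) {
  const t = gl.createTexture()!;
  gl.bindTexture(gl.TEXTURE_2D, t);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, width, height, 0, gl.RGBA, gl.FLOAT, data);
  return t;
}
const input = createTexture(new Float32Array(width * height * 4).fill(1)); // step 1: upload
const output = createTexture(null);

const fbo = gl.createFramebuffer();                                        // render to texture
gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, output, 0);

gl.viewport(0, 0, width, height);
gl.bindTexture(gl.TEXTURE_2D, input);                                      // texture unit 0
gl.uniform1i(gl.getUniformLocation(program, 'u_input'), 0);
gl.drawArrays(gl.TRIANGLES, 0, 3);                                         // step 3: "dispatch"

const result = new Float32Array(width * height * 4);
gl.readPixels(0, 0, width, height, gl.RGBA, gl.FLOAT, result);             // step 4: read back (slow)
```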

This approach has an obvious limitation, however: for all threads, a texture is either read-only or write-only; one thread cannot read a texture while another writes to it. This is essentially determined by the GPU's hardware design: letting multiple threads read and write the same texture simultaneously would require a complex synchronization mechanism to avoid read/write conflicts, which would inevitably hurt the efficiency of parallel execution. So in a classic GPGPU implementation we typically prepare two textures, one for the input data and one for the output.

In addition, this method supports neither synchronization between threads nor shared memory, so some parallel algorithms, such as the Bellman-Ford single-source shortest path algorithm, cannot be implemented this way.

The figure above also mentions the ping-pong technique. Many algorithms need to run iteratively; for example, the layout algorithms used in G6 require many iterations to reach a stable state, and the result of one iteration is the input of the next. In a real implementation we allocate two texture caches and swap the input and output textures after each iteration, hence "ping-pong".
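A minimal ping-pong sketch, continuing the hypothetical setup above (texA/fboA, texB/fboB, iterations and result are assumed to have been created the same way):

```ts
// Ping-pong sketch: two textures, each with its own framebuffer, swap roles
// between iterations so that the previous output becomes the next input.
// Data stays on the GPU until the final readPixels.
let read = { tex: texA, fbo: fboA };
let write = { tex: texB, fbo: fboB };

for (let i = 0; i < iterations; i++) {
  gl.bindFramebuffer(gl.FRAMEBUFFER, write.fbo);   // write into the "write" texture
  gl.bindTexture(gl.TEXTURE_2D, read.tex);         // sample from the "read" texture
  gl.drawArrays(gl.TRIANGLES, 0, 3);               // one iteration of the algorithm
  [read, write] = [write, read];                   // swap roles for the next round
}
// Only after the last iteration do we pay the cost of reading data back.
gl.readPixels(0, 0, width, height, gl.RGBA, gl.FLOAT, result);
```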

Note that readPixels (reading texture data back on the CPU side) is very slow, so apart from fetching the final result, calls to it should be minimized and data should stay on the GPU as much as possible.

Beyond these minimal sketches, we won’t expand on the actual use of the WebGL1 API here; refer to the relevant tutorials for details.

2 WebGL2 Transform Feedback

The deprecated WebGL 2.0 Compute (based on OpenGL ES 3.1) offered advanced features such as memoryBarrier and shared memory for synchronization between threads, but the team behind it eventually moved on to WebGPU.

WebGL2 offers another way to do parallel computation, this time in the Vertex Shader: Transform Feedback. It skips rasterization, so the Fragment Shader does not participate (in practice an empty shader is enough).

This scheme differs from the WebGL1 texture-mapping method in the following ways (a minimal sketch follows the list):

  • No Fragment Shader work is needed, so rasterization can be disabled globally via gl.enable(gl.RASTERIZER_DISCARD);
  • The computing logic is written in the Vertex Shader, removing the need for the somewhat arcane texture mapping; data is read and written through Buffers;
  • The result can be read directly with gl.getBufferSubData, although this call is still slow.
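Below is a minimal, hypothetical Transform Feedback sketch in which the Vertex Shader doubles every element of a buffer; the fragment shader is empty and rasterization is discarded. Error handling is omitted.

```ts
// Minimal WebGL2 Transform Feedback sketch: one vertex = one "thread";
// the vertex shader is the kernel, and its output is captured into a buffer.
const gl = document.createElement('canvas').getContext('webgl2') as WebGL2RenderingContext;

const vs = `#version 300 es
in float a_in;
out float v_out;
void main() { v_out = a_in * 2.0; }`;          // the "kernel": double each element
const fs = `#version 300 es
precision highp float;
out vec4 o;
void main() { o = vec4(0.0); }`;               // empty shader; rasterization is skipped

function compile(type: number, src: string) {
  const s = gl.createShader(type)!;
  gl.shaderSource(s, src);
  gl.compileShader(s);
  return s;
}
const program = gl.createProgram()!;
gl.attachShader(program, compile(gl.VERTEX_SHADER, vs));
gl.attachShader(program, compile(gl.FRAGMENT_SHADER, fs));
// Declare which varyings are captured into the feedback buffer, before linking.
gl.transformFeedbackVaryings(program, ['v_out'], gl.SEPARATE_ATTRIBS);
gl.linkProgram(program);
gl.useProgram(program);

const input = new Float32Array([1, 2, 3, 4]);
const inBuf = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, inBuf);
gl.bufferData(gl.ARRAY_BUFFER, input, gl.STATIC_DRAW);
const loc = gl.getAttribLocation(program, 'a_in');
gl.enableVertexAttribArray(loc);
gl.vertexAttribPointer(loc, 1, gl.FLOAT, false, 0, 0);

const outBuf = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, outBuf);
gl.bufferData(gl.ARRAY_BUFFER, input.byteLength, gl.STATIC_READ);
gl.bindBuffer(gl.ARRAY_BUFFER, null);                        // avoid double binding below

gl.enable(gl.RASTERIZER_DISCARD);                            // skip rasterization entirely
gl.bindBufferBase(gl.TRANSFORM_FEEDBACK_BUFFER, 0, outBuf);  // capture v_out here
gl.beginTransformFeedback(gl.POINTS);
gl.drawArrays(gl.POINTS, 0, input.length);                   // one vertex per "thread"
gl.endTransformFeedback();
gl.disable(gl.RASTERIZER_DISCARD);
gl.bindBufferBase(gl.TRANSFORM_FEEDBACK_BUFFER, 0, null);

const result = new Float32Array(input.length);
gl.bindBuffer(gl.ARRAY_BUFFER, outBuf);
gl.getBufferSubData(gl.ARRAY_BUFFER, 0, result);             // read back (still slow)
```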

Although this is an improvement over WebGL1, some important Compute Shader features are still missing.

Again, we won’t expand on the WebGL2 API beyond the sketch above; refer to the relevant tutorials for details.

Three Difficulties in implementation

Even with the above principles in mind, front-end developers still face many practical difficulties. Besides the learning cost of the graphics API and the Shader language themselves, the GPU programming model is relatively unfamiliar territory for the front end.

The first problem we encounter is whether an algorithm can be parallelized at all. Some computing tasks are very time-consuming and complex yet cannot be handed to the GPU, such as code compilation, so parallelizability is not directly related to complexity. There are no strict criteria; the judgement comes mostly from experience and existing industry practice (such as the graph layout/analysis algorithms discussed later). Data parallelism is worth considering whenever a task looks like "for every X do Y". For example, the figure below shows a single-source shortest path algorithm: it traverses the nodes and performs a "relaxation" operation on each edge, so we can consider parallelizing it by letting one thread handle one node.
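As a purely illustrative example (the graph data here is made up), the serial relaxation loop of Bellman-Ford shows this "for every X do Y" shape:

```ts
// Hypothetical inputs: dist holds the current shortest distance of every node,
// edges is the edge list of the graph, node 0 is the source.
const nodeCount = 4;
const edges = [
  { source: 0, target: 1, weight: 2 },
  { source: 1, target: 2, weight: 1 },
  { source: 0, target: 3, weight: 5 },
];
const dist = [0, Infinity, Infinity, Infinity];

// Serial Bellman-Ford relaxation. Within one round, every edge can be relaxed
// independently, so on the GPU one thread can own one node (or one edge) and
// the inner loop disappears.
for (let round = 0; round < nodeCount - 1; round++) {
  for (const { source, target, weight } of edges) {
    if (dist[source] + weight < dist[target]) {
      dist[target] = dist[source] + weight;       // "relax" one edge
    }
  }
}
```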

When we want to migrate an existing parallel algorithm to the GPU, the next problem is data-structure design. GPU memory is linear and has nothing like object structures, so some redesign is inevitable during migration; making the layout GPU-memory-friendly increases the difficulty further. You will see a linear representation of graphs in the "Graph algorithms" part of the application examples below.

The next problem is that some important features in Compute Shader are missing from both WebGL1 and WebGL2, so some algorithms already implemented in CUDA are not directly portable. This problem is explained in detail in the last section of this article.

We have mentioned shared memory and synchronization many times; a Reduce sum helps illustrate what they mean. The following figure shows 16 threads allocated to process an array of length 16, with thread 0 finally writing the result accumulated in the first element of shared memory out as the final answer. The process breaks down into the following steps (a WGSL sketch follows the list):

  1. Each thread loads data from global memory into shared memory.
  2. A barrier ensures that all threads in the thread group see up-to-date shared memory.
  3. The summation is performed in shared memory, with a synchronization after each step.
  4. Finally, after all threads have finished, the first thread writes the first element of shared memory to global output memory.
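WebGL cannot express these steps, so as an illustration only, here is roughly how the same 16-element reduce might look in WGSL (the WebGPU shading language discussed in the last section), embedded as a string:

```ts
// Sketch of a workgroup-level reduce sum in WGSL: shared memory + barriers.
const reduceWGSL = /* wgsl */ `
var<workgroup> shared_data: array<f32, 16>;          // shared memory of the thread group

@group(0) @binding(0) var<storage, read>       inputBuf:  array<f32>;
@group(0) @binding(1) var<storage, read_write> outputBuf: array<f32>;

@compute @workgroup_size(16)
fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
  let i = lid.x;
  shared_data[i] = inputBuf[i];                      // 1. global -> shared memory
  workgroupBarrier();                                // 2. everyone sees fresh data

  // 3. tree reduction inside shared memory, with a barrier after every step
  for (var stride = 8u; stride > 0u; stride = stride / 2u) {
    if (i < stride) {
      shared_data[i] = shared_data[i] + shared_data[i + stride];
    }
    workgroupBarrier();
  }

  if (i == 0u) {
    outputBuf[0] = shared_data[0];                   // 4. thread 0 writes the result
  }
}`;
```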

Without shared memory and synchronization, the result would clearly be incorrect; this is analogous to concurrent programming on the CPU, where omitting mutexes or read/write locks leads to unexpected races.

Finally, the room for optimization in GPU programming depends largely on the developer's understanding of the hardware. Taking the Reduce sum above as an example again, "DirectCompute Optimizations and Best Practices" presents more than five further optimizations of this version.

In addition, the GPU cannot be interrupted during Shader execution, which makes the code hard to debug; many rendering engines face the same problem. Unity, for example, can rely on RenderDoc, but WebGL has no such tool.

Four Application examples

Here we highlight some GPGPU applications in visualization, covering graph algorithms, high-performance animation, and parallel processing of massive data.

Since GPGPU is general-purpose computing, we obviously cannot cover every scenario in every field. Instead we analyze the design and implementation ideas behind the following computing tasks, hoping to give readers some inspiration, so that when they meet parallelizable algorithms in their own scenarios they can try accelerating them on the GPU.

1 Graph algorithms

Layout and analysis are two common classes of algorithms in graph scenarios. CUDA has high-performance graph analysis libraries such as nvGRAPH, covering shortest paths, PageRank and more, and supporting graphs with up to two billion edges.

Before implementing any specific algorithm, we first need to decide how to represent a graph as a linear structure. The most intuitive data structure is the adjacency matrix, shown in the figure below: with six nodes we use a 6x6 matrix, setting the element for each pair of connected nodes to 1. The figure is from Wikipedia's article on adjacency matrices.

However, this data structure has an obvious problem: it is very sparse and wastes space, especially as the node count grows. An adjacency list is a better choice. The linear structure is split into node and edge parts, designed around sequential reads of GPU memory and compressed as much as possible (for example, each RGBA component of an edge texel stores the index of an adjacent node). In G6's implementation, the repulsion calculation traverses all nodes other than the current one, which is entirely sequential reads; similarly, the attraction calculation traverses all edges of a node sequentially. Random reads appear only when fetching the coordinates of edge endpoints.
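As an illustration of such a linear encoding (not necessarily G6's exact packing), a CSR-style flattening might look like this:

```ts
// One possible linear graph encoding: flat typed arrays that can be uploaded
// as textures or buffers and read mostly sequentially on the GPU.
interface Graph { nodeCount: number; edges: Array<[number, number]>; }

function encode(graph: Graph) {
  const adj: number[][] = Array.from({ length: graph.nodeCount }, () => []);
  for (const [u, v] of graph.edges) { adj[u].push(v); adj[v].push(u); }

  // nodeTable[i * 2]     = offset of node i's adjacency list in edgeTable
  // nodeTable[i * 2 + 1] = its degree
  const nodeTable = new Float32Array(graph.nodeCount * 2);
  const edgeTable: number[] = [];
  for (let i = 0; i < graph.nodeCount; i++) {
    nodeTable[i * 2] = edgeTable.length;
    nodeTable[i * 2 + 1] = adj[i].length;
    edgeTable.push(...adj[i]);               // neighbor indices, stored contiguously
  }
  return { nodeTable, edgeTable: new Float32Array(edgeTable) };
}
```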

We won't expand on the specific algorithm implementations here; refer to the write-up on migrating G6's existing layout algorithms for details. The results vary greatly across algorithms: the Fruchterman layout benefits most, with the GPU version more than a hundred times faster once the node count exceeds a few thousand, whereas the GForce layout is actually slower than the CPU version when the node count is small.

2 SandDance

SandDance provides smooth switching of multidimensional data between multiple layouts. It extends the Vega specification, adds depth information to 2D scenes, and uses deck.gl for rendering. For the layout-switching animation specifically, luma.gl (the rendering engine underneath deck.gl) offers a high-level wrapper built on WebGL2 Transform Feedback that performs animation interpolation and data transformation on the GPU, which is far faster than doing the interpolation on the CPU in the traditional way. The same technique is commonly used for particle effects in general-purpose rendering engines.

3 P4: Portable Parallel Processing Pipelines

P4 focuses on processing and rendering massive data, generating Shader code for data aggregation and rendering at runtime; the aggregation part is somewhat similar to certain ops in TensorFlow.js. It is worth noting that some Reduce operations (max/min, count, sum, average) are implemented with WebGL blending: for example, a Reduce sum uses the gl.FUNC_ADD blend equation.
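As a sketch of the idea (the variable names are assumptions, not P4's actual code), additive blending turns "draw many points into one pixel" into a sum:

```ts
// Blending as a fixed-function "reduce": every data point is drawn as a point
// into the same output pixel, and the blender accumulates instead of overwriting.
gl.enable(gl.BLEND);
gl.blendEquation(gl.FUNC_ADD);        // out = src + dst, i.e. accumulate
gl.blendFunc(gl.ONE, gl.ONE);         // use both values unscaled
gl.drawArrays(gl.POINTS, 0, count);   // each point contributes its value to the sum
// gl.MIN / gl.MAX blend equations (core in WebGL2, EXT_blend_minmax in WebGL1)
// give min/max reduces in the same way.
```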

Five Current limitations and future prospects

As we have seen, WebGL is constrained by the capabilities of the underlying APIs and lacks many computation-related features to varying degrees, so many parallel algorithms cannot be implemented. As a result, the visualization world misses out on many suitable scenarios, and we badly need the next generation of more powerful Web APIs.

As the successor to WebGL, WebGPU is built on each operating system's more modern graphics API and exposes a lower-level interface, which gives developers more direct control over the GPU with less driver overhead, and treats rendering and compute as equals. It is already available in preview versions of Chrome/Edge/Safari.

For its shading language, WebGPU abandons the GLSL used by WebGL in favor of the new WGSL. For the compute features we care about, it provides the storageBarrier/workgroupBarrier synchronization functions. With these, algorithms such as single-source shortest path and other graph algorithms with open-source CUDA implementations can be ported to the Web; the author has tried implementing them in WGSL.
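For reference, here is a hedged sketch of dispatching a WGSL compute kernel (such as the reduce shown earlier) through the WebGPU JavaScript API; the exact method names have shifted across spec drafts, so this follows the current spec and may differ slightly in older preview builds:

```ts
// Sketch: run the reduceWGSL kernel from the earlier section on 16 floats.
// Assumes an ES module (top-level await) and a browser with WebGPU enabled.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const module = device.createShaderModule({ code: reduceWGSL });
const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});

const inputBuf = device.createBuffer({
  size: 16 * 4,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
const outputBuf = device.createBuffer({
  size: 4,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
device.queue.writeBuffer(inputBuf, 0, new Float32Array(16).fill(1));

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: inputBuf } },
    { binding: 1, resource: { buffer: outputBuf } },
  ],
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(1);            // one 16-thread workgroup is enough here
pass.end();
device.queue.submit([encoder.finish()]);
// Reading the result back requires copying outputBuf into a MAP_READ staging buffer (omitted).
```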

A more mature practice today is the Apache TVM community's addition of WebAssembly and WebGPU backends, which on macOS achieves almost the same efficiency as running native Metal directly.

In short, WebGPU is clearly the best choice for GPGPU on the Web for the foreseeable future.


This article is original content from Alibaba Cloud and may not be reproduced without permission.