This is the 8th day of my participation in the First Challenge 2022. For details: First Challenge 2022.
preface
As you probably already know, GPUs have evolved over the past decade toward more flexible architectures that let developers implement algorithms on them that do more than render 3D graphics. This capability is called GPU computing, and using it for general-purpose work is called GPGPU (General-Purpose GPU) programming.
So today we’ll leave rendering behind and dive into the GPU compute pipeline. Let’s begin today’s journey.
The target
The example we will implement today is matrix multiplication on the GPU: multiplying a 2×4 matrix by a 4×2 matrix.
The steps of matrix calculation by GPU can be simply summarized as follows:
- Create 3 GPUBuffers (two to hold the input matrices and one to hold the result of the matrix multiplication)
- Write code that evaluates shaders
- Set up calculation pipeline
- Submit calculation commands to the GPU
- Read the result from GPUBuffer
Let’s follow the steps above:
Coding
Create GPUBuffer
First, we prepare the matrix data:
const firstMatrix = new Float32Array([
  2, 4,
  1, 2, 3, 4,
  5, 6, 7, 8
]);
const secondMatrix = new Float32Array([
  4, 2,
  1, 2, 3, 4,
  5, 6, 7, 8
]);
In the matrix data above, the first two digits indicate the size of the matrix, for example, 2 and 4 in the first matrix indicate that the matrix is a 2×4 matrix.
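This size-prefixed layout can be captured in a small helper (a sketch; `encodeMatrix` is a hypothetical name, not part of the article's code):

```javascript
// Encode a rows×cols matrix into the size-prefixed Float32Array layout
// used above: [rows, cols, ...values], values in row-major order.
function encodeMatrix(rows, cols, values) {
  if (values.length !== rows * cols) {
    throw new Error('values length must equal rows * cols');
  }
  const data = new Float32Array(2 + values.length);
  data[0] = rows;
  data[1] = cols;
  data.set(values, 2);
  return data;
}

// The 2×4 matrix from the article:
const encodedFirst = encodeMatrix(2, 4, [1, 2, 3, 4, 5, 6, 7, 8]);
```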
Next, we create two GPUBuffers and write the matrix data into them, then prepare a third GPUBuffer to store the result:
const gpuBufferFirstMatrix = device.createBuffer({
  mappedAtCreation: true,
  size: firstMatrix.byteLength,
  usage: GPUBufferUsage.STORAGE,
});
const arrayBufferFirstMatrix = gpuBufferFirstMatrix.getMappedRange();
new Float32Array(arrayBufferFirstMatrix).set(firstMatrix);
gpuBufferFirstMatrix.unmap();

const gpuBufferSecondMatrix = device.createBuffer({
  mappedAtCreation: true,
  size: secondMatrix.byteLength,
  usage: GPUBufferUsage.STORAGE,
});
const arrayBufferSecondMatrix = gpuBufferSecondMatrix.getMappedRange();
new Float32Array(arrayBufferSecondMatrix).set(secondMatrix);
gpuBufferSecondMatrix.unmap();

const resultMatrixBufferSize =
  Float32Array.BYTES_PER_ELEMENT * (2 + firstMatrix[0] * secondMatrix[1]);
const resultMatrixBuffer = device.createBuffer({
  size: resultMatrixBufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
Note that we specify a different `usage` for each buffer when creating it. For the input matrices we use `GPUBufferUsage.STORAGE`. The buffer that stores the result additionally needs `GPUBufferUsage.COPY_SRC`, because we will later copy the data out of it so that it can be read back from the GPUBuffer.
Write the Shader program
struct Matrix {
  size: vec2<f32>;
  numbers: array<f32>;
}

@group(0) @binding(0) var<storage, read> firstMatrix: Matrix;
@group(0) @binding(1) var<storage, read> secondMatrix: Matrix;
@group(0) @binding(2) var<storage, write> resultMatrix: Matrix;

@stage(compute) @workgroup_size(2, 2)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  if (global_id.x >= u32(firstMatrix.size.x) || global_id.y >= u32(secondMatrix.size.y)) {
    return;
  }

  resultMatrix.size = vec2<f32>(firstMatrix.size.x, secondMatrix.size.y);

  let resultCell = vec2<u32>(global_id.x, global_id.y);
  var result = 0.0;
  for (var i = 0u; i < u32(firstMatrix.size.y); i = i + 1u) {
    let a = i + resultCell.x * u32(firstMatrix.size.y);
    let b = resultCell.y + i * u32(secondMatrix.size.y);
    result = result + firstMatrix.numbers[a] * secondMatrix.numbers[b];
  }

  let index = resultCell.y + resultCell.x * u32(secondMatrix.size.y);
  resultMatrix.numbers[index] = result;
}
Per the latest WebGPU standard, the attribute syntax `[[stage(vertex)]]` is now written as `@stage(vertex)`. As of this article, the WGSL code uses the newer style.
var
In `var<storage, read> firstMatrix: Matrix`, `storage` is the address space (other address spaces include `uniform`, `workgroup`, and so on), and `read` is the access mode, i.e. how the variable may be accessed: `read`, `write`, or `read_write`.
Next, notice the annotation @workgroup_size(2, 2).
workgroup
Workgroups are a series of “threads” that simultaneously execute the Compute Shader program and share variables in the workgroup address space. In other words, we can think of a workgroup as a “thread group.”
This “thread group” has three dimensions (i, j, k) and can be understood as a three-dimensional array: each unit in the array executes the compute shader once, and all units execute in parallel.
Inside the compute shader you can get the ID of the current execution unit; in the code above this is `global_invocation_id`. Within a single workgroup, the linear index of the invocation at position (i, j, k) can be calculated as follows:
i + (j * workgroup_size_x) + (k * workgroup_size_x * workgroup_size_y)
When we call the dispatch API, WebGPU starts the compute shader. The dispatch call specifies a “dispatch size”: a 3-dimensional array [group_count_x, group_count_y, group_count_z] that indicates how many workgroups should be executed.
So, for each invocation of the execution unit, there is a unique ID:
- 0 ≤ CSi < workgroup_size_x × group_count_x
- 0 ≤ CSj < workgroup_size_y × group_count_y
- 0 ≤ CSk < workgroup_size_z × group_count_z
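The workgroup index formula above can be checked in plain JavaScript (a sketch; `localIndex` is a name made up for illustration):

```javascript
// Linearize an invocation's (i, j, k) position within a workgroup of
// size (sizeX, sizeY, sizeZ): i + j * sizeX + k * sizeX * sizeY.
function localIndex(i, j, k, sizeX, sizeY) {
  return i + j * sizeX + k * sizeX * sizeY;
}

// For @workgroup_size(2, 2), the four invocations map to indices 0..3:
// (0,0,0) -> 0, (1,0,0) -> 1, (0,1,0) -> 2, (1,1,0) -> 3
```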
The matrix-multiplication logic in the shader itself should need no further explanation.
Set up calculation pipeline
const computePipeline = device.createComputePipeline({
  compute: {
    module: device.createShaderModule({
      code: computeShader,
    }),
    entryPoint: 'main',
  },
});

const bindGroup = device.createBindGroup({
  layout: computePipeline.getBindGroupLayout(0),
  entries: [
    {
      binding: 0,
      resource: { buffer: gpuBufferFirstMatrix },
    },
    {
      binding: 1,
      resource: { buffer: gpuBufferSecondMatrix },
    },
    {
      binding: 2,
      resource: { buffer: resultMatrixBuffer },
    },
  ],
});
Unlike the render pipeline, the compute pipeline does not specify textures, depth/stencil-test configuration, and so on; it only specifies a shader module.
Next, we create the bindGroup just as we did when rendering.
Submit calculation commands to the GPU
const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(computePipeline);
passEncoder.setBindGroup(0, bindGroup);
const x = Math.ceil(firstMatrix[0] / 2); // workgroups along x
const y = Math.ceil(secondMatrix[1] / 2); // workgroups along y
passEncoder.dispatch(x, y);
passEncoder.endPass();
The code before dispatch needs no further elaboration, so let’s explain the x and y passed to dispatch. As described in the workgroup section, they indicate how many workgroups to execute along each dimension. Here x = Math.ceil(2 / 2) = 1 and y = Math.ceil(2 / 2) = 1: a single workgroup covers the whole result matrix, so we effectively dispatch (1, 1).
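The workgroup-count calculation can be factored into a tiny helper (a sketch; `dispatchCount` is a hypothetical name):

```javascript
// Number of workgroups needed to cover n invocations along one
// dimension, with workgroups of the given size in that dimension.
function dispatchCount(n, workgroupSize) {
  return Math.ceil(n / workgroupSize);
}

// For our 2×4 and 4×2 matrices with @workgroup_size(2, 2):
const workgroupsX = dispatchCount(2, 2); // firstMatrix rows -> 1
const workgroupsY = dispatchCount(2, 2); // secondMatrix columns -> 1
```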
Read the result from GPUBuffer
The result of our computation is now stored in resultMatrixBuffer, and we need to read the data out of it. First we create another buffer and copy resultMatrixBuffer into it.
const gpuReadBuffer = device.createBuffer({
size: resultMatrixBufferSize,
usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});
// Encode commands for copying buffer to buffer.
commandEncoder.copyBufferToBuffer(
  resultMatrixBuffer /* source buffer */,
  0 /* source offset */,
  gpuReadBuffer /* destination buffer */,
  0 /* destination offset */,
resultMatrixBufferSize /* size */
);
// Submit GPU commands.
const gpuCommands = commandEncoder.finish();
device.queue.submit([gpuCommands]);
Here we need to specify the usage `GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ`. We can then submit the commands to the GPU.
After submitting the commands, we can get the results back from the GPU. `mapAsync` is the API that reads data back from a GPUBuffer.
await gpuReadBuffer.mapAsync(GPUMapMode.READ);
const arrayBuffer = gpuReadBuffer.getMappedRange();
console.log(new Float32Array(arrayBuffer));
Finally, we get the following result: `Float32Array [2, 2, 50, 60, 114, 140]`.
The first two numbers give the size of the result matrix (2×2), and the remaining numbers are its entries.
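We can double-check the GPU result against a plain-JavaScript reference multiplication (a sketch; `multiplyOnCpu` is a hypothetical name, operating on the article's size-prefixed layout):

```javascript
// Reference matrix multiplication over the size-prefixed Float32Array
// layout used in this article: [rows, cols, ...values], row-major.
function multiplyOnCpu(a, b) {
  const aRows = a[0], aCols = a[1];
  const bRows = b[0], bCols = b[1];
  if (aCols !== bRows) throw new Error('incompatible matrix sizes');
  const result = new Float32Array(2 + aRows * bCols);
  result[0] = aRows;
  result[1] = bCols;
  for (let r = 0; r < aRows; r++) {
    for (let c = 0; c < bCols; c++) {
      let sum = 0;
      for (let i = 0; i < aCols; i++) {
        sum += a[2 + r * aCols + i] * b[2 + i * bCols + c];
      }
      result[2 + r * bCols + c] = sum;
    }
  }
  return result;
}

const cpuResult = multiplyOnCpu(
  new Float32Array([2, 4, 1, 2, 3, 4, 5, 6, 7, 8]),
  new Float32Array([4, 2, 1, 2, 3, 4, 5, 6, 7, 8])
);
// cpuResult is [2, 2, 50, 60, 114, 140], matching what mapAsync reads back.
```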
conclusion
Today, we learned how to use WebGPU for computation. The process is largely the same as using the GPU for rendering; the trickier points are workgroups and how to read data back from a GPUBuffer. Once these are mastered, we can start using WebGPU for slightly more complicated calculations.
If you think this article is good, please give the author a thumbs-up. Your thumbs-up is the author’s motivation to update.