Translation: Crazy geek

Original: www.smashingmagazine.com/2019/04/web…

In this article, we’ll explore how to speed up Web applications by replacing JavaScript with compiled WebAssembly.

If you haven’t heard of WebAssembly, here is the short version: WebAssembly is a new language that runs alongside JavaScript in the browser. That’s right, JavaScript is no longer the only language that runs in the browser!

Beyond being “not JavaScript”, its biggest distinguishing feature is that you can take code written in C/C++/Rust (and more!), compile it into WebAssembly, and run it in the browser. Because WebAssembly is statically typed, uses linear memory, and is stored in a compact binary format, it is also very fast and could eventually allow us to run code at “near-native” speed, that is, close to what you would get by running the native binary on the command line. The ability to leverage existing tools and libraries in the browser, and the potential for speed, are two reasons WebAssembly stands out.

So far, WebAssembly has been used for everything from games (like Doom 3) to porting desktop applications to the Web (like AutoCAD and Figma). It is even used outside the browser, for example for efficient serverless computing.

This article is a case study of accelerating a Web data analysis tool with WebAssembly. To do this, we take an existing tool written in C that performs the same calculations, compile it into WebAssembly, and use it to replace the slow JavaScript calculations.

Note: This article delves into some advanced topics, such as compiling C code, but don’t worry if you don’t have experience with that; you will still be able to follow along and get a sense of what is possible with WebAssembly.

Background

The web application we will be working with is fastq.bio, an interactive web tool that gives scientists a quick preview of the quality of their DNA sequencing data. Sequencing is the process of reading the “letters”, or nucleotides, in a sample of DNA.

Here’s a screenshot of the program:

We won’t go into the details of the calculations, but in short, the charts in the screenshot give scientists a sense of how well the sequencing went and let them check the quality of the data at a glance.

While many command line tools are capable of generating such quality control reports, the goal of Fastq.bio is to provide an interactive preview of data quality in the browser. This is especially useful for scientists unfamiliar with the command line.

The input to the application is a plain text file output by the sequencing instrument. It contains a list of DNA sequences and a quality score for each nucleotide in those sequences. The file format is called “FASTQ”, hence the name fastq.bio.

If you’re curious about the FASTQ format, check out the FASTQ Wikipedia page. (Warning: The FASTQ file format can be intimidating.)

Fastq.bio: JavaScript implementation

In the original version of fastq.bio, the user first selects a FASTQ file from their computer. Using the File object, the program reads a small chunk of data starting at a random byte position (using the FileReader API). We then use JavaScript to perform basic string manipulation and calculate metrics for that chunk. One such metric tracks how many A’s, C’s, G’s, and T’s we see at each position along a DNA fragment.
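To make the kind of string manipulation described above concrete, here is a minimal sketch of a per-position nucleotide counter. It is not the original fastq.bio code: the function name `countBasesByPosition` is made up for illustration, and it assumes standard four-line FASTQ records (header, sequence, `+`, qualities) with unwrapped sequences.

```javascript
// Count how often each nucleotide appears at each read position within one
// chunk of FASTQ text. Assumption: four-line records with unwrapped sequences.
function countBasesByPosition(chunkText) {
  const lines = chunkText.split("\n");
  const counts = []; // counts[pos] = { A, C, G, T, N } at read position pos

  let i = 0;
  while (i + 3 < lines.length) {
    // A record starts with "@" and has "+" two lines later. A chunk cut at a
    // random byte offset may begin mid-record, so skip ahead until aligned.
    if (!lines[i].startsWith("@") || !lines[i + 2].startsWith("+")) {
      i += 1;
      continue;
    }
    const sequence = lines[i + 1].trim().toUpperCase();
    for (let pos = 0; pos < sequence.length; pos++) {
      if (!counts[pos]) counts[pos] = { A: 0, C: 0, G: 0, T: 0, N: 0 };
      const base = sequence[pos];
      // Anything other than A/C/G/T is counted as "N" (unknown base).
      counts[pos][base in counts[pos] ? base : "N"] += 1;
    }
    i += 4;
  }
  return counts;
}
```

The returned array can be handed to a plotting library one position at a time. The alignment check is a heuristic, since a quality line may also start with `@`.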

Once the metrics for a chunk have been calculated, we plot the results interactively with Plotly.js and move on to the next chunk in the file. The reason for processing the file in small chunks is simply to improve the user experience: processing the whole file at once would take too long, since FASTQ files are typically in the hundreds of gigabytes. We found that a chunk size between 0.5 MB and 1 MB made the application feel more seamless and returned information to the user more quickly, but this number will vary depending on the details of your application and how heavy the computations are.
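The sampling step could look roughly like the sketch below. Note one substitution: the article uses the FileReader API, whereas this sketch uses `Blob.slice()` together with `Blob.text()`, which does the same job with less code. The names `randomChunkRange` and `readRandomChunk` and the 1 MB default are illustrative assumptions.

```javascript
const CHUNK_SIZE = 1024 * 1024; // 1 MB, the upper end of the range mentioned above

// Pick a random byte range [start, end) of at most chunkSize bytes in a file.
// `random` is injectable so the function can be tested deterministically.
function randomChunkRange(fileSize, chunkSize = CHUNK_SIZE, random = Math.random) {
  const size = Math.min(chunkSize, fileSize);
  const start = Math.floor(random() * (fileSize - size + 1));
  return { start, end: start + size };
}

// Read one random chunk of a File (or any Blob) as text.
// Only the requested slice is read, not the whole file.
async function readRandomChunk(file, chunkSize = CHUNK_SIZE) {
  const { start, end } = randomChunkRange(file.size, chunkSize);
  return file.slice(start, end).text();
}
```

In the browser, `file` would be the File object taken from an `<input type="file">` element. Because the slice starts at an arbitrary byte, the first record in a chunk is usually incomplete and has to be skipped by whatever parses it.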

The architecture we initially implemented in JavaScript was very simple:

The architecture of the JavaScript implementation of fastq.bio: randomly sample from the input file, calculate metrics in JavaScript, plot the results, and loop.

The red box is where the string operations that generate the metrics are performed. This is the most computationally intensive part of the program, which made it a good candidate for runtime optimization with WebAssembly.

Fastq.bio: WebAssembly implementation

To explore whether WebAssembly could speed up our Web application, we searched for an off-the-shelf tool that calculates QC metrics for FASTQ files. Specifically, we looked for a tool written in C/C++/Rust that was already validated and trusted by the scientific community, so that it could be ported to WebAssembly.

After some research, we decided to use seqtk, a commonly used open source tool written in C that helps evaluate the quality of sequencing data.

Before compiling it into WebAssembly, let’s look at how seqtk is normally compiled into a binary that runs on the command line. According to the Makefile, this is the gcc command we need:

# Compile to binary
$ gcc seqtk.c \
   -o seqtk \
   -O2 \
   -lm \
   -lz

To compile seqtk into WebAssembly, on the other hand, we use the Emscripten toolchain, which provides drop-in replacements for existing build tools and makes working with WebAssembly much easier. If you don’t have Emscripten installed, you can download a Docker image we uploaded to Docker Hub that contains the tools you need (you can also install it from scratch, but that usually takes a while):

$ docker pull robertaboukhalil/emsdk:1.38.26
$ docker run -dt --name wasm-seqtk robertaboukhalil/emsdk:1.38.26

Inside the container, we can use the emcc compiler as a replacement for gcc:

# Compile to WebAssembly
$ emcc seqtk.c \
    -o seqtk.js \
    -O2 \
    -lm \
    -s USE_ZLIB=1 \
    -s FORCE_FILESYSTEM=1

As you can see, the differences between compiling to a binary and compiling to WebAssembly are minimal:

  1. Instead of outputting the binary executable seqtk, we ask Emscripten to generate a .wasm file and a .js file that handles instantiating our WebAssembly module.
  2. To support the zlib library, we use the USE_ZLIB flag. zlib is so common that it has already been ported to WebAssembly, and Emscripten will include it in our project.
  3. We enable Emscripten’s virtual file system. This is a POSIX-like file system, except that it lives in RAM inside the browser and disappears when the page is refreshed (unless you save its state in the browser using IndexedDB, which is beyond the scope of this article).

Why enable a virtual file system? To answer that, let’s compare calling seqtk from the command line with calling the compiled WebAssembly module from JavaScript:

# call from the command line
$ ./seqtk fqchk data.fastq

# call from the browser console
> Module.callMain(["fqchk", "data.fastq"])

Having access to a virtual file system is powerful because it means you don’t have to rewrite seqtk to accept string input instead of file paths. You can mount a chunk of data to the virtual file system as the file data.fastq and simply call seqtk’s main() function on it.
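As a rough sketch of the mount-and-call step, the helper below writes a chunk to the virtual file system with Emscripten's `FS.writeFile` and then runs `fqchk` on it. The helper name `runFqchk` and the way output is captured are illustrative assumptions, not the article's actual code. `Module` and `FS` are the objects exposed by the generated seqtk.js; depending on the Emscripten version and build settings they may need to be exported explicitly.

```javascript
// Lines that seqtk writes to stdout are collected here.
const stdoutLines = [];

// Emscripten reads these settings if this object is assigned to a global
// `Module` variable BEFORE the generated seqtk.js script is loaded.
const moduleConfig = {
  noInitialRun: true, // do not call main() automatically when the module loads
  print: (line) => stdoutLines.push(line), // capture stdout instead of logging it
};

// Hypothetical helper: write one chunk into the in-memory file system, run
// `seqtk fqchk` on it, and return the captured output lines.
function runFqchk(Module, FS, chunkText) {
  stdoutLines.length = 0; // forget output from the previous chunk
  FS.writeFile("data.fastq", chunkText); // lives only in browser RAM
  Module.callMain(["fqchk", "data.fastq"]);
  return stdoutLines.slice();
}
```

Overwriting the same path for every chunk keeps memory use flat, since only one chunk is held in the virtual file system at a time.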

With seqtk compiled into WebAssembly, here is the new fastq.bio architecture:

The architecture of the WebAssembly + WebWorkers implementation of fastq.bio: randomly sample from the input file, calculate metrics within a WebWorker using WebAssembly, plot the results, and loop.

As shown in the diagram, instead of running the calculations in the browser’s main thread, we use WebWorkers, which let us run the calculations in a background thread without hurting the responsiveness of the browser. Specifically, the WebWorker controller launches the Worker and manages communication with the main thread. On the Worker’s side, an API executes the requests it receives.

From the main thread, we can then ask the Worker to run a seqtk command on the file we just mounted. When seqtk finishes running, the Worker sends the result back to the main thread via a Promise. Once it receives the message, the main thread uses the output to update the charts. As in the JavaScript version, we process the file in chunks and update the visualizations at each iteration.
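One common way to get Promise-based communication with a Worker is to tag each request with an id and resolve the matching Promise when the reply arrives. The sketch below illustrates that pattern only. The class name `WorkerClient`, the message shape `{ id, action, payload }`, and the action names are assumptions, not the fastq.bio implementation.

```javascript
// Main-thread wrapper: every request gets an id, and the Worker's reply with
// the same id resolves the Promise that was handed back to the caller.
class WorkerClient {
  constructor(worker) {
    this.worker = worker;
    this.nextId = 0;
    this.pending = new Map(); // id -> resolve function of the waiting Promise
    this.worker.onmessage = (event) => {
      const { id, result } = event.data;
      const resolve = this.pending.get(id);
      if (resolve) {
        this.pending.delete(id);
        resolve(result);
      }
    };
  }

  // Send an action (for example "mount" or "exec") with its payload.
  request(action, payload) {
    return new Promise((resolve) => {
      const id = this.nextId++;
      this.pending.set(id, resolve);
      this.worker.postMessage({ id, action, payload });
    });
  }
}
```

In the browser this would be used as `const client = new WorkerClient(new Worker("worker.js"))`, followed by something like `await client.request("exec", ["fqchk", "data.fastq"])`, where `worker.js` is a hypothetical script that runs the WebAssembly module and posts back `{ id, result }`.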

Performance optimization

To assess whether WebAssembly really improves performance, we compared the JavaScript and WebAssembly implementations using the number of reads processed per second as the metric. We ignore the time it takes to generate the interactive charts, since both implementations use JavaScript for that.
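The metric itself is simple arithmetic, sketched below. This is an illustration rather than the benchmark harness actually used: the function names are made up, and it assumes the chunk holds whole four-line FASTQ records.

```javascript
// A FASTQ record spans four lines, so reads = non-empty lines / 4
// (assuming the chunk contains whole records).
function countReads(chunkText) {
  const lines = chunkText.split("\n").filter((line) => line.length > 0);
  return Math.floor(lines.length / 4);
}

// Convert a read count and an elapsed time in milliseconds to reads/second.
function readsPerSecond(numReads, elapsedMs) {
  return numReads / (elapsedMs / 1000);
}

// Time one implementation (any function that takes the chunk text) and
// report its throughput. Plotting time is excluded, as in the comparison above.
function benchmark(processChunk, chunkText) {
  const start = performance.now();
  processChunk(chunkText);
  const elapsedMs = performance.now() - start;
  return readsPerSecond(countReads(chunkText), elapsedMs);
}
```

The speedup of one implementation over another is then the ratio of their throughputs on the same chunks. For very small chunks the elapsed time can round to zero, so a real measurement should average over many chunks.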

Out of the box, we already see a roughly nine-fold speedup:

WebAssembly is nine times faster than a JavaScript implementation

That is already very good, given that it was relatively easy to achieve (once you understand WebAssembly, that is!).

Next, we noticed that although seqtk outputs many generally useful QC metrics, our application doesn’t actually use or plot many of them. By removing the output for the metrics we didn’t need, we saw an even greater speedup of 13-fold:

Removing unnecessary output can further improve performance.

Given how easy this was to achieve, it is another big improvement.

Finally, there is one more improvement we looked into. So far, fastq.bio has obtained the metrics of interest by calling two different C functions, each of which calculates a different set of metrics. One function returns information in the form of a histogram (that is, a list of values binned into ranges), while the other returns information as a function of position along the DNA sequence. Unfortunately, this means that the same chunk of the file is read twice, which is unnecessary.

So instead, let’s merge the code for the two functions into one (without even having to brush up on our C!). Because the two outputs have different numbers of columns, we did some wrangling on the JavaScript side to separate them again. It was worth it: this gave us a speedup of more than 20-fold!
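The wrangling on the JavaScript side might look something like the sketch below. The article does not describe the actual output layout, so this assumes tab-separated lines in which histogram rows have a known column count and every other row is a per-position row. The function name and the `histogramColumns` parameter are hypothetical.

```javascript
// Split merged, tab-separated output into two tables based on column count.
// Assumption: rows with exactly `histogramColumns` fields belong to the
// histogram; all other non-empty rows are per-position rows.
function splitByColumnCount(outputText, histogramColumns) {
  const histogramRows = [];
  const positionRows = [];
  for (const line of outputText.split("\n")) {
    if (line.trim() === "") continue; // ignore blank lines
    const fields = line.split("\t");
    if (fields.length === histogramColumns) {
      histogramRows.push(fields);
    } else {
      positionRows.push(fields);
    }
  }
  return { histogramRows, positionRows };
}
```

Routing rows by column count only works while the two outputs never share a column count. Prefixing each line with a tag in the C code would be a sturdier alternative.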

Finally, refactoring the code so that each file chunk is read only once gives us a performance improvement of more than 20 times.

A word of caution

When using WebAssembly, don’t always expect a 20x speedup. You might only get a 2x speedup, or a 20% speedup. You may even get a slowdown if you load very large files into memory, or if your application requires a lot of communication between WebAssembly and JavaScript.

Conclusion

We have seen that replacing slow JavaScript calculations with calls to a compiled WebAssembly module can yield a significant speedup. Because the code needed for these calculations already existed in C, we also got the added benefit of reusing a trusted tool. As mentioned earlier, WebAssembly is not always the right tool for the job, so use it wisely.
