WebAssembly (WASM) is close to machine code and produces smaller files, so WASM modules take less time to load and compile than the equivalent JS. Compiler toolchains such as Emscripten can also compile C/C++ into WASM files, which greatly broadens what the front end can do. Since its launch in 2017, WASM has been a hot topic among front-end developers, and there are now many WASM-based web applications for games, audio and video, in-browser file processing, and more. The IVWEB team is responsible for the business development of NOW Live and other live-streaming scenarios. While exploring practical uses of WASM, we built a live-stream player based on WASM.

1. Background

"WASM + H265 player" is not a very friendly title. Apart from the word "player", what are WASM and H265, and why did the IVWEB team build this?

1.1 What

WebAssembly (WASM) has been around for a few years now, and many developers have started using it in their projects to speed up compute-intensive code. I recommend reading this article to learn what WASM is all about.

What is H265? Continuous exploration in the audio and video field has paid off. ITU-T and ISO/IEC, the two major video standards bodies, release a new generation of video coding standards every few years, and the H.26x and MPEG series are updated continually. The widely used H.264 standard is now being succeeded by a new-generation coding standard, HEVC. High Efficiency Video Coding (HEVC), also known as H.265 (hereinafter H265), builds on H.264 and improves several techniques (dynamic macroblock partitioning, more intra-frame prediction modes, better motion compensation). This gives it better quality and a higher compression ratio: for video of the same image quality (PSNR), the bit rate can be reduced by 30-50%.

1.2 Why

Although our team is in charge of live streaming, as Web developers we can do only a small part of the work on the video content itself: mastering and using the browser's built-in player, the video element. We look forward to the improvements that H265 brings, but mainstream browsers (except Safari) cannot play H265 because they lack the decoding capability, and web-side decoding solutions such as flv.js do not support H265 either. Our preferred approach, the video tag, is therefore ruled out for H265. I don't believe it! I won't listen! I don't care! After this triple denial, we decided to keep exploring new solutions.

Lianmai (co-streaming) mixed-stream codec   Test duration   Traffic
H265                                        15 min          68.1 MB
H264                                        15 min          137 MB

In the 15-minute test, H265 saved about 30% of traffic compared with H264 for ordinary live streams, and about 50% for Lianmai mixed streams (the case measured in the table). (For ordinary live streams, Tencent Cloud currently sets the H265 bit rate conversion factor for NOW Live to 70, which in theory saves 30%.)
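
The saving for the mixed stream can be checked with a little arithmetic. The sketch below uses only the two totals measured in the table; the function names are illustrative, and the bit rate conversion assumes 1 MB = 1,000,000 bytes.

```javascript
// Measured totals from the 15-minute Lianmai mixed-stream test.
const h265Megabytes = 68.1;
const h264Megabytes = 137;
const durationSeconds = 15 * 60;

// Fraction of traffic saved by H265 relative to H264.
function savedFraction(h265, h264) {
  return 1 - h265 / h264;
}

// Average bit rate in kbit/s (assumes 1 MB = 1000 * 1000 bytes).
function averageKbps(megabytes, seconds) {
  return (megabytes * 1000 * 1000 * 8) / seconds / 1000;
}

const saved = savedFraction(h265Megabytes, h264Megabytes);
console.log(`saved: ${(saved * 100).toFixed(1)}%`);
console.log(`H265 average: ${averageKbps(h265Megabytes, durationSeconds).toFixed(0)} kbit/s`);
console.log(`H264 average: ${averageKbps(h264Megabytes, durationSeconds).toFixed(0)} kbit/s`);
```

The result is a saving of roughly one half, which matches the 50% figure quoted for the mixed stream.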

1.3 How

When WASM launched, audio and video encoding and decoding was one of its hottest application areas. GitHub hosts open-source WebAssembly projects for video clipping and video decoding, and Taobao Live and Huajiao Live also explored WASM + H265 players this year, so there is already successful practice in the player field. Analyzing the modules of a player, we found that several core modules (data IO, rendering layer, control layer) can be built on mature Web technology, while video decoding can be built on mature codec technology: decoding code based on FFmpeg is compiled into a WASM file with Emscripten and loaded into the Web page.

2. The process of a frame of data

Video data is encapsulated in streaming media. The commonly used streaming protocols are currently RTMP, HLS and HTTP-FLV; HLS and HTTP-FLV are transmitted over HTTP, so their data can be handed directly to JS for processing. In this section, we start from the stream and analyze how the player processes the data at each step.

Figure: WASM + H265 live player framework.

2.1 Data Flow I/O

The stream is first handled by the player's IO layer, where the player requests the data and hands it to the IO controller for later decoding and playback. Take NOW Live as an example: Tencent Cloud provides us with HTTP-FLV streams, in which audio and video data is encapsulated in an FLV container and transmitted over HTTP. The Web offers ways to process such streaming data: the stream can be read directly with XMLHttpRequest or the Fetch API. Here is a simple Fetch example:

fetch(url, {
  method: 'GET' // fetch has no responseType option; the body is read as a stream below
})
  .then((res) => {
    // res.body is a ReadableStream of Uint8Array chunks
    return res.body.getReader();
  });

Replace url with the address of the live stream to obtain a reader for the readable stream. The data can then be read with the reader's read method.

const stream = new ReadableStream({
  // Arrow function, so `this` is the enclosing player object rather than this literal
  start: (controller) => {
    const push = () => {
      reader.read()
        .then((res) => {
          if (res.done) {
            // End of stream processing
            controller.close();
            return;
          }

          // res.value is the chunk that was read; pass it to the IO controller
          this.emit('data', res.value);
          push();
        });
    };
    push();
  }
});

After receiving video data from Fetch, the IO controller first writes it to the cache pool. Only once the pool reaches the threshold set by the player does it pass the data on to the upper-level controller. The threshold for starting decoding cannot be too small: it must be at least the probesize configured for FFmpeg, because FFmpeg needs to read several frames in a loop to determine the video frame rate, bit rate, width, height, and so on.
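
The cache pool behaviour described above can be sketched as follows. This is a minimal illustration, not the player's actual implementation: the CachePool class, its method names and the threshold value are all hypothetical.

```javascript
// Accumulates stream chunks and releases them only once a byte threshold is reached.
class CachePool {
  constructor(thresholdBytes, onReady) {
    this.thresholdBytes = thresholdBytes; // hypothetical value; tune it against FFmpeg's probesize
    this.onReady = onReady;               // called with a Uint8Array of released data
    this.chunks = [];
    this.size = 0;
    this.started = false;
  }

  write(chunk) {
    if (this.started) {
      // Threshold already passed: forward data straight away.
      this.onReady(chunk);
      return;
    }
    this.chunks.push(chunk);
    this.size += chunk.length;
    if (this.size >= this.thresholdBytes) {
      this.started = true;
      this.onReady(this.flush());
    }
  }

  // Merge all buffered chunks into one contiguous Uint8Array.
  flush() {
    const merged = new Uint8Array(this.size);
    let offset = 0;
    for (const chunk of this.chunks) {
      merged.set(chunk, offset);
      offset += chunk.length;
    }
    this.chunks = [];
    this.size = 0;
    return merged;
  }
}

// Usage: a 10-byte threshold fed with 4-byte chunks.
const received = [];
const pool = new CachePool(10, (data) => received.push(data.length));
pool.write(new Uint8Array(4));
pool.write(new Uint8Array(4));
pool.write(new Uint8Array(4)); // crosses the threshold, the 12 buffered bytes are released
pool.write(new Uint8Array(4)); // forwarded directly
console.log(received);
```

In a real player, the write method would be called from the 'data' event of the IO layer, and onReady would hand the bytes to the decoder.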

2.2 WASM Data Interaction

The player's core controller receives the IO data and starts decoding. As mentioned earlier, the decoder is compiled with Emscripten into a .wasm file plus .js glue code; the glue code provides a global Module object for accessing the WASM lifecycle, memory model, and so on. How is the buffer data obtained in the player's IO layer passed to the decoder, and how are the decoded audio and video frames passed back to the player?

2.2.1 Player (JS) to Decoder (WASM)

At the heart of data interaction is understanding WASM's memory model. The memory size is set when the WASM file is built. In the following build script, we give WASM 64 MB of memory.

export TOTAL_MEMORY=67108864
emcc $WASM_SOURCE/decoder_stream.c ... \
    ... \
    -s TOTAL_MEMORY=${TOTAL_MEMORY} \
    ... \
    -o $LIB_PATH/libffmpeg_live.js

The memory model is an ArrayBuffer, which can be accessed through views that the glue code exposes on the global Module object. These WASM views correspond to TypedArray views in JS: HEAP8 (Int8Array), HEAPU8 (Uint8Array), HEAP16 (Int16Array), HEAPU16 (Uint16Array), and so on.
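
To make the idea of several views over one block of memory concrete, here is a small sketch that does not need the Emscripten glue code. It creates a standalone WebAssembly.Memory of the same 64 MB size and builds views comparable to the Module ones by hand; the local variable names mirror the Module properties only for readability.

```javascript
// One WebAssembly page is 64 KiB; 1024 pages give the 64 MB used in the build script.
const memory = new WebAssembly.Memory({ initial: 1024 });

// Views comparable to the ones Emscripten exposes on Module.
const HEAPU8 = new Uint8Array(memory.buffer);
const HEAP16 = new Int16Array(memory.buffer);
const HEAPU16 = new Uint16Array(memory.buffer);

// Write two bytes through the byte view...
HEAPU8[0] = 0xff;
HEAPU8[1] = 0xff;

// ...and read the same memory as one 16-bit value through the other views.
console.log(memory.buffer.byteLength); // total size in bytes
console.log(HEAPU16[0], HEAP16[0]);    // same bits, unsigned vs signed interpretation
```

All views share the same bytes, so a pointer returned by the decoder is simply a byte offset into HEAPU8.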

When a player (JS) needs to pass data to a decoder (WASM), there are three steps:

  1. Write the data into WASM memory

  2. Pass the data pointer and data length

  3. WASM reads the data based on the pointer and length

    const offset = Module._malloc(buffer.length); // _malloc allocates a block of WASM memory and returns a pointer

    Module.HEAPU8.set(buffer, offset); // Write the data into the allocated memory

    Module._sendData(offset, buffer.length); // Pass the pointer and the data length

    Module._free(offset); // Note: the memory must be freed after use

    int sendData(uint8_t *buff, const int size)
    {
        ...
        memcpy(pValid, buff, size); // Write to the data area
        pValid += size;             // Update the data end pointer
        ...
        logger("Data size: %d \n", size);
    }

2.2.2 Decoder (WASM) to Player (JS)

Passing data from the decoder (WASM) back to the player (JS) is essentially the reverse of passing data from the player to the decoder, and it also takes three steps:

  1. Pass in the player callback function

  2. The decoder calls the player callback function to pass data

  3. JS reads the data from WASM memory using the pointer and length

void initDecoder(VideoCallback videoCallback ...)
{
    ...
    if (decoder != NULL) {
        decoder->videoCallback = videoCallback; // Store the player callback
        ...
    }
}

int decodeVideoFrame()
{
    ...
    decoder->videoCallback(out_buffer, buffer_size); // Pass data by callback
    ...
}

function videoCallback(buff, size) {
    // View over the WASM heap: no copy is made here
    const videoBuffer = Module.HEAPU8.subarray(buff, buff + size);
    // Copy the bytes out of the WASM heap
    const data = new Uint8Array(videoBuffer);
}
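
The callback above first takes a subarray and then builds a new Uint8Array from it. The sketch below illustrates why the second step matters, using a plain Uint8Array as a stand-in for Module.HEAPU8: a subarray is only a view, so it changes as soon as the decoder reuses its output buffer, while the copy does not. The pointer and size values are made up for the illustration.

```javascript
// Stand-in for Module.HEAPU8: a byte view over the decoder's memory.
const heap = new Uint8Array(16);

// Pretend the decoder wrote a 4-byte frame at pointer 8.
const buff = 8;
const size = 4;
heap.set([1, 2, 3, 4], buff);

const view = heap.subarray(buff, buff + size); // shares memory with the heap
const copy = new Uint8Array(view);             // owns its own memory

// The decoder reuses the same buffer for the next frame.
heap.set([9, 9, 9, 9], buff);

console.log(Array.from(view)); // reflects the overwritten memory
console.log(Array.from(copy)); // still holds the original frame
```

Copying costs one extra memory pass per frame, but it lets the frame outlive the callback, for example while it waits in the rendering cache pool.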

2.3 Decoding core

As a web developer who can neither write C nor understand audio and video decoding, I learned a lot from Dr. Lei Xiaohua's introductory articles on decoding while building the decoder. Thanks to Dr. Lei for his efforts in popularizing the fundamentals of audio and video.

To decode the live stream, the decoder mainly does the following:

  1. Obtain video stream metadata information

  2. Video decapsulation

  3. Decode the video into YUV420 format

  4. Decode the audio into PCM format

  5. Data correction

The following is the core decapsulation/decoding process:

The first step is to read the video stream from memory and obtain its basic information (height, width, video decoder, audio decoder, audio sampling rate, sample format, etc.), which is later used to set the audio and video parameters of the rendering layer.

Decapsulation (demuxing) takes the video and audio streams out of the FLV container with av_read_frame() and stores them in a structure called AVPacket. The packet carries the presentation timestamp (PTS), decoding timestamp (DTS) and stream index (stream_index) of the data; stream_index tells you whether a packet belongs to the video stream or the audio stream.

Decoding is decompression: video decoding decompresses each frame of image data from the compressed stream, and audio decoding decompresses audio samples from the compressed stream. For video, we take the actual video frames out of the H265 stream, then scale them (width and height) and convert the data format.

// Feed data to the decoder
avcodec_send_packet(videoCodecCtx, packet);

// Receive the data frame output by the decoder
avcodec_receive_frame(videoCodecCtx, avframe);

// Create buffer space
uint8_t *out_buffer = (uint8_t *)av_malloc(buffer_size);

// Point the frame's data planes at the buffer
av_image_fill_arrays(
    frame->data,         // Frame data
    frame->linesize,     // Single row data size
    out_buffer,          // Buffer
    AV_PIX_FMT_YUV420P,  // Pixel data format
    ...                  // Width and height
);

// Create a format conversion context
struct SwsContext *sws_ctx = sws_getContext(
    width,                  // Input width
    height,                 // Input height
    videoCodecCtx->pix_fmt, // Input pixel format
    videoWidth,             // Output width
    videoHeight,            // Output height
    AV_PIX_FMT_YUV420P,     // Output pixel format
    SWS_BICUBIC,            // Scaling algorithm
    ...
);

// Format conversion
sws_scale(
    sws_ctx,                               // Format conversion context
    (uint8_t const *const *)avframe->data, // Input data
    ...
);

Note: Conversion performance varies with the conversion algorithm you choose (SWS_BICUBIC above).

Finally, the PTS (playback timestamp) of the current frame is computed, and the frame data is returned to the player for processing through a data callback.

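// Calculate the playback timestamp PTS (pts * time_base)
// NOTE: the member between "decoder->" and "->streams" is missing in the source;
// fmtCtx is a hypothetical name for the AVFormatContext pointer.
double timestamp = (double)avframe->pts * av_q2d(decoder->fmtCtx->streams[videoStreamIdx]->time_base);

// The callback function returns frame data
decoder->videoCallback(out_buffer, buffer_size, timestamp);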

The decoder decompresses the video stream on the CPU, and the higher the bit rate, the more CPU video decoding uses. To keep decoding from blocking the main thread, we run the decoder in a separate Web Worker.

2.4 Render layer: meeting the user

The decoded data will be handed to the rendering layer to meet the user. The data obtained from the decoder is in the following format:

const frame = {
    type,         // 0: video frame / 1: audio frame
    timestamp,    // PTS in milliseconds
    data          // Frame data
}

Audio frames and video frames are continuously fed into the rendering cache pool. After audio and video synchronization (described below), video frame data is taken from the cache pool and handed to WebGL for rendering, and audio data is taken out and handed to the audio player.
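
Before a YUV420 frame can be drawn, its single buffer usually has to be split into the Y, U and V planes. The source does not show how its renderer does this, so the following is only a sketch under two assumptions: the frame data is planar YUV420P, and the rows are tightly packed with no padding. The function name splitYuv420p is hypothetical.

```javascript
// Split a tightly packed YUV420P frame into its Y, U and V planes.
function splitYuv420p(data, width, height) {
  // Chroma planes are subsampled by 2 in both directions.
  const chromaWidth = Math.ceil(width / 2);
  const chromaHeight = Math.ceil(height / 2);
  const ySize = width * height;
  const chromaSize = chromaWidth * chromaHeight;

  if (data.length < ySize + 2 * chromaSize) {
    throw new RangeError('frame buffer is too small for a YUV420P frame');
  }

  return {
    y: data.subarray(0, ySize),
    u: data.subarray(ySize, ySize + chromaSize),
    v: data.subarray(ySize + chromaSize, ySize + 2 * chromaSize),
    chromaWidth,
    chromaHeight,
  };
}

// Usage with the 540 x 960 stream from the performance test.
const width = 540;
const height = 960;
const frameData = new Uint8Array((width * height * 3) / 2);
const planes = splitYuv420p(frameData, width, height);
console.log(planes.y.length, planes.u.length, planes.v.length);
```

A common approach is then to upload each plane as its own single-channel texture and do the YUV to RGB conversion in a fragment shader, so the CPU never has to convert pixels.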

The decoded audio is in PCM format, a digital representation of the audio signal: the analog audio signal is sampled at a fixed frequency, and each sample is converted into a digital value with a fixed number of bits. That frequency and number of bits are the audio sampling rate and bit depth. For example, 44100 Hz 16 bit means 44100 samples per second, with the amplitude of each sample quantized into 2^16 levels.
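
The Web Audio API works with 32-bit float samples in the range -1 to 1, one array per channel, so 16-bit PCM has to be rescaled and de-interleaved before playback. The sketch below assumes interleaved signed 16-bit samples, which the source does not state explicitly; both function names are hypothetical.

```javascript
// Convert interleaved signed 16-bit PCM into one Float32Array per channel.
function pcm16ToFloat32Channels(samples, channels) {
  const frames = Math.floor(samples.length / channels);
  const out = [];
  for (let c = 0; c < channels; c++) {
    out.push(new Float32Array(frames));
  }
  for (let i = 0; i < frames; i++) {
    for (let c = 0; c < channels; c++) {
      // Int16 spans -32768..32767; dividing by 32768 maps it into [-1, 1).
      out[c][i] = samples[i * channels + c] / 32768;
    }
  }
  return out;
}

// Bytes per second of raw PCM: sample rate * bytes per sample * channels.
function pcmBytesPerSecond(sampleRate, bitsPerSample, channels) {
  return sampleRate * (bitsPerSample / 8) * channels;
}

// Usage: two stereo frames laid out as L, R, L, R.
const interleaved = new Int16Array([16384, -16384, 32767, -32768]);
const [left, right] = pcm16ToFloat32Channels(interleaved, 2);
console.log(Array.from(left), Array.from(right));
console.log(pcmBytesPerSecond(44100, 16, 2)); // raw data rate of 44100 Hz 16-bit stereo
```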

Audio data can be played with the Web Audio API, which treats audio as a stream that passes through a chain of audio nodes (a rich set of audio processors) and is finally output to the speaker at the destination. The whole process resembles piping in gulp. To play PCM audio, we only need to add a node that converts PCM into an AudioBuffer.

play(frame) {
  ...
  // Format PCM
  const data = format(frame);
  const audioBuffer = audioCtx.createBuffer(
    channels,   // Number of channels
    length,     // Data length
    sampleRate  // Sampling rate
  );
  ...
  // Write the data into the AudioBuffer
  for (let channel = 0; channel < channels; channel++) {
    let offset = channel;
    const audioData = audioBuffer.getChannelData(channel);

    for (let i = 0; i < length; i++) {
      audioData[i] = result[offset];
      offset += channels;
    }
  }
  ...
  bufferSource.buffer = audioBuffer;
  bufferSource.connect(this.gainNode);
  bufferSource.start(playTime);
  ...
}

2.5 Overall Framework

So far, we have looked at each of the core components of playback from the flow of data. Here is the complete framework of the player:

3. Keep exploring

Decoding live streams is not enough: our goal is to use the player in production so that we can enjoy the benefits of H265. With that goal in mind, we explored several further areas.

3.1 First frame Duration

The duration of the first frame refers to the time from opening the page to the appearance of the first frame. The duration of the first frame is affected by multiple factors such as resource loading time and setting of decoder cache threshold. First, let’s take a look at what happens in the process of resource loading:

Besides HTML parsing and the subsequent decoding and rendering, the player loads four resources serially. First the player resource player.js is loaded; player.js initializes the decoder worker, which loads the worker resource; the worker loads the compiled WASM file; and finally the video stream is pulled.

Resource loading optimization is an area we are very familiar with, and there are three commonly used optimizations:

  • Merge resource requests (this is not useful for now)

  • Reduce resource size

    player.js and worker.js are minified with standard code compression to keep the JS resources as small as possible.

    The WASM file contains the player's decoding logic and the FFmpeg library, and was initially as large as 2.8 MB. After disabling all other FFmpeg demuxers and decoders and enabling only what the player needs (the HEVC, AAC and FLV demuxers and the HEVC and AAC decoders), the WASM file shrinks to 1.4 MB, and to about 400 KB once gzip is enabled on the CDN.

    --disable-demuxers --enable-demuxer=hevc --enable-demuxer=aac --enable-demuxer=flv \
    --disable-decoders --enable-decoder=hevc --enable-decoder=aac

  • Serial to parallel

    Resource preloading lets us fetch resources that will be needed later while player.js is still loading, so the decoder worker can be loaded in parallel with player.js. See the IVWEB "Playing with WASM" series article on Web Worker serial loading optimization.

    The FLV stream can also be preloaded and injected into the player's cache pool once the player is ready. For server-rendered pages, loading of the FLV stream can be moved forward to the HTML parsing phase.

    Parallel loading is currently being tested, and the team will have a detailed article on load time comparisons later.

The second factor that affects the first frame duration is the decoder cache threshold: the video stream is cached until it reaches this threshold before it is passed on for decoding. When FFmpeg demuxes, avformat_find_stream_info reads part of the audio and video streams for probing. Decoding before probing has finished will fail, so decoding is delayed until enough of the video stream has been loaded.

FFmpeg provides probesize and analyzeduration to control the size and duration of the probed data. When both are set, probing finishes as soon as either limit is reached. Reducing probesize or analyzeduration shortens the probing time of avformat_find_stream_info and therefore the first frame duration.
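
A rough estimate shows how strongly these two settings affect the wait. The sketch below is a simplification: it assumes the data arrives exactly at the stream bit rate (a CDN may deliver the first seconds faster), and the probesize and analyzeduration values are hypothetical, chosen only to show the effect. Both function names are illustrative.

```javascript
// Seconds of streaming needed to receive `bytes` at a given bit rate.
function secondsToBuffer(bytes, kbps) {
  return (bytes * 8) / (kbps * 1000);
}

// Probing ends when either limit is hit, so the effective wait is the smaller one.
function estimatedProbeSeconds(probeSizeBytes, analyzeDurationSeconds, kbps) {
  return Math.min(secondsToBuffer(probeSizeBytes, kbps), analyzeDurationSeconds);
}

// Stream from the performance test: 1400 kbit/s video + 64 kbit/s audio.
const streamKbps = 1400 + 64;

// Hypothetical settings: a 1 MiB probe limit versus a 64 KiB probe limit.
console.log(estimatedProbeSeconds(1024 * 1024, 5, streamKbps).toFixed(2));
console.log(estimatedProbeSeconds(64 * 1024, 5, streamKbps).toFixed(2));
```

With the larger limit the duration cap is what ends probing, while the smaller limit is reached in well under a second at this bit rate. The trade-off is that too little probe data can leave stream parameters undetected.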

3.2 Audio and video synchronization

The picture and the sound must stay in step, otherwise the viewing experience suffers badly. Following the audio and video synchronization scheme of traditional players (recommended reading: Player Technology Sharing (3): Audio and Video Synchronization), we synchronize video to audio.

While audio keeps playing, the PTS of each audio clip that finishes is used as the master clock. A video frame is taken from the video frame cache pool and its PTS is compared with that clock. If the difference is within the playback threshold (-50 ms to 50 ms), the frame is rendered. If the frame lags by more than the minimum threshold (-50 ms), it is discarded. If the frame is more than 50 ms ahead (the maximum threshold), it waits and is synchronized again when the next audio clip finishes.

// Audio player
bufferSource.onended = () => {
    onAudioUpdate(frame.timestamp);
};

function onAudioUpdate(timestamp) {
  ...
  const video = this.videoPool.shift();
  const diff = video.timestamp - timestamp;

  // The video is within the synchronization interval: render it directly
  if (diff <= this.min && diff > -this.min) {
    renderImage(video); // Render the frame that was just compared
  }

  // The video lags beyond the threshold: discard the frame
  if (diff <= -this.min) {
    emit('discardFrame', diff);
    discardFrame();
  }
  ...
}
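
The three cases (render, discard, wait) can also be written as a small pure function, which makes the thresholds easy to test. This sketch is a variation on the code above rather than the player's implementation: it peeks at the oldest frame instead of removing it, so a frame that is too early stays in the pool, and it keeps discarding late frames until one can be rendered. The function names are hypothetical.

```javascript
// Decide what to do with the oldest video frame, given the audio clock (ms).
function syncDecision(videoPts, audioPts, thresholdMs = 50) {
  const diff = videoPts - audioPts;
  if (diff <= -thresholdMs) return 'drop'; // video is too late
  if (diff > thresholdMs) return 'wait';   // video is too early
  return 'render';                         // within the sync window
}

// Consume frames from the pool until one is rendered or must wait.
function onAudioClock(videoPool, audioPts, render) {
  let dropped = 0;
  while (videoPool.length > 0) {
    const decision = syncDecision(videoPool[0].timestamp, audioPts);
    if (decision === 'wait') break;        // leave the frame in the pool
    const frame = videoPool.shift();
    if (decision === 'render') {
      render(frame);
      break;
    }
    dropped++;                             // 'drop': try the next frame
  }
  return dropped;
}

// Usage: audio clock at 1000 ms, frames at 900 (late), 1010 (in window), 1100 (early).
const videoPool = [{ timestamp: 900 }, { timestamp: 1010 }, { timestamp: 1100 }];
const rendered = [];
const dropped = onAudioClock(videoPool, 1000, (f) => rendered.push(f.timestamp));
console.log(dropped, rendered, videoPool.length);
```

In this run the late frame is discarded, the frame inside the window is rendered, and the early frame stays in the pool for the next audio clock update.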

3.3 Compatibility

The player uses several newer browser APIs, such as WASM, Web Worker, the Web Audio API, WebGL and Fetch, which poses considerable compatibility challenges. The player therefore needs to provide an API that tells whether the current environment can run it.

/**
 * Whether the current environment can run the player normally
 * @returns {Boolean} True or false
 */
function isSupported() {
    const supportWasm = isSupportWasm();       // Whether WASM is supported
    const supportWorker = isSupportWorker();   // Whether Web Workers are supported
    const supportAudio = isSupportAudio();     // Whether the Web Audio API is supported
    const supportWebGl = isSupportWebGL();     // Whether WebGL is supported
    const supportAbortController = isSupportAbortController(); // Whether fetch abort is supported

    return supportWasm && supportWorker && supportAudio && supportWebGl && supportAbortController;
}
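
The individual checks are not shown in the source. One possible way to write them is plain feature detection, as sketched below; the function names come from the code above, but the bodies are assumptions. They only test that each API exists, not that it works well on a given device.

```javascript
// Each helper only checks that the API exists; it does not prove it performs well.
function isSupportWasm() {
  return typeof WebAssembly === 'object' &&
    typeof WebAssembly.instantiate === 'function';
}

function isSupportWorker() {
  return typeof Worker === 'function';
}

function isSupportAudio() {
  // Older WebKit browsers expose the prefixed constructor only.
  return typeof AudioContext === 'function' ||
    typeof webkitAudioContext === 'function';
}

function isSupportWebGL() {
  if (typeof document === 'undefined') return false; // not running in a page
  try {
    const canvas = document.createElement('canvas');
    return Boolean(canvas.getContext('webgl') || canvas.getContext('experimental-webgl'));
  } catch (e) {
    return false;
  }
}

function isSupportAbortController() {
  return typeof fetch === 'function' && typeof AbortController === 'function';
}

console.log({
  wasm: isSupportWasm(),
  worker: isSupportWorker(),
  audio: isSupportAudio(),
  webgl: isSupportWebGL(),
  abort: isSupportAbortController(),
});
```

Using typeof on a possibly undefined global avoids a ReferenceError, so the checks are safe to run in any environment before the player is loaded.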

4. Performance

4.1 Test machine configuration

MacBook Pro (13-inch, 2017, Four Thunderbolt 3 Ports)
CPU: 3.1 GHz Intel Core i5
RAM: 8 GB 2133 MHz LPDDR3
Chrome: version 80.0.3968.0 dev (64-bit)

4.2 Live stream Parameters

Video resolution: 540 * 960
Video frame rate: 24
Video bit rate: 1400 Kbps
Audio bit rate: 64 Kbps

4.3 CPU and Memory Usage

Product             Average memory usage   Average CPU usage   CPU range   Memory range
IVWEB WASM player   276.95 MB              29.97%              8% - 75%    198 MB - 457 MB

Look to the future

In the future, when Chrome and other major browsers support H265 natively, will our research on a WASM player become meaningless? My view is that WASM will remain a battleground for exploring cutting-edge technology on the Web front end: just as audio and video codecs can be brought to the front end, other rich capabilities in the video field can be brought to the browser through WASM.