This article covers five topics: the three common modes of scalable video coding; the encoders adopted by WebRTC and how they are applied; the current state of scalable coding in WebRTC; object detection and bit-rate allocation based on scalable coding; and the application prospects and research directions for AI combined with scalable coding. [Rongyun Global Internet Communication Cloud]

Three common modes of scalable video coding

Existing networks and storage devices cannot handle raw video directly, so video must be compressed. The mainstream video compression standards are H.264, VP8, VP9, HEVC, and VVC. On the one hand, from H.264 to VVC, coding complexity keeps rising and compression efficiency keeps improving. On the other hand, transmission bandwidth varies from network to network and fluctuates over time, so a single bitstream cannot adapt to the network and device environments of different receivers. For example, 4G and 5G networks offer different transmission bandwidths: if the same bitstream were sent over both, the 5G bandwidth might go underused while viewing quality suffers.

Because a video application must serve many different receivers, two technologies are used to solve this problem: Simulcast and scalable video coding (SVC).

As shown in Figure 1, Simulcast means transmitting multiple bitstreams simultaneously, each at a different bit rate to suit a different bandwidth. A terminal on a high-bandwidth network can receive a high-bit-rate stream for a better viewing experience, while a terminal on a low-bandwidth network can receive a low-bit-rate stream to reduce playback stalls. However, Simulcast supports only a limited set of bit rates and adapts poorly to complex network environments. To address this, researchers proposed scalable video coding (SVC), in which the video is compressed once but can be decoded at multiple frame rates, spatial resolutions, or quality levels. For example, three spatial layers combined with two temporal layers yield six decodable operating points (see the sketch below). Compared with Simulcast, adaptability is greatly improved.

(Figure 1 Simulcast and scalable coding)
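To make the layer combinations concrete, here is a toy enumeration of the six operating points mentioned above; the resolutions and frame rates are illustrative, not from the source.

```typescript
// Toy enumeration: three spatial layers x two temporal layers = six
// operating points, all decodable from a single SVC bitstream.
const spatialLayers = ['180p', '360p', '720p'];   // illustrative resolutions
const temporalLayers = [15, 30];                  // illustrative frame rates (fps)

const operatingPoints = spatialLayers.flatMap(res =>
  temporalLayers.map(fps => `${res}@${fps}fps`)
);
console.log(operatingPoints); // 6 combinations
```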

There are three commonly used scalability modes: spatial scalability, quality scalability, and temporal scalability.

(Figure 2 Three common modes of scalable coding)

Spatial scalability (Figure 3) means that multiple images of different spatial resolutions are generated for each frame of the video. Decoding the base-layer bitstream yields a low-resolution image; if the enhancement-layer bitstream is also fed to the decoder, a high-resolution image is obtained.

(Figure 3 Spatial scalability)
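A toy one-dimensional sketch of this idea, with nearest-neighbour resampling standing in for a real codec's filters:

```typescript
// Toy spatial scalability: the base layer is a downsampled signal; the
// enhancement layer is the residual needed to restore full resolution.
function downsample(x: number[]): number[] {
  const out: number[] = [];
  for (let i = 0; i < x.length; i += 2) out.push((x[i] + (x[i + 1] ?? x[i])) / 2);
  return out;
}
function upsample(x: number[]): number[] {
  return x.flatMap(v => [v, v]); // nearest-neighbour upsampling
}

const original = [10, 12, 30, 32, 50, 52, 70, 72];
const base = downsample(original);                 // low-resolution base layer
const upsampled = upsample(base);
const enhancement = original.map((v, i) => v - upsampled[i]); // residual layer

// Base-only decoding gives a coarse signal; base + enhancement restores detail.
const fullRes = upsampled.map((v, i) => v + enhancement[i]);
console.log(base, fullRes); // fullRes equals the original
```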

Quality scalability (Figure 4): one feasible method is to coarsely quantize the DCT-transformed original image when encoding the base layer, then entropy-code it to form the base-layer bitstream. The coarsely quantized data is inverse-quantized to recover base-layer coefficients, which are subtracted from the original DCT coefficients to form a residual signal; the residual is then finely quantized and entropy-coded to produce the enhancement-layer bitstream.

(Figure 4 Quality scalability)
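A toy numerical sketch of the coarse/fine requantization just described (entropy coding omitted; the coefficient values are made up):

```typescript
// Toy quality scalability: coarse quantization -> base layer;
// fine quantization of the residual -> enhancement layer.
const quantize = (x: number, step: number) => Math.round(x / step);
const dequantize = (q: number, step: number) => q * step;

const coeffs = [103, -47, 12, 5];      // stand-ins for DCT coefficients
const coarseStep = 32;
const fineStep = 4;

const baseLayer = coeffs.map(c => quantize(c, coarseStep));
const residual = coeffs.map((c, i) => c - dequantize(baseLayer[i], coarseStep));
const enhLayer = residual.map(r => quantize(r, fineStep));

// Base-only reconstruction is coarse; the enhancement layer refines it.
const baseOnly = baseLayer.map(q => dequantize(q, coarseStep));
const refined = baseOnly.map((v, i) => v + dequantize(enhLayer[i], fineStep));
console.log(baseOnly, refined); // refined is much closer to coeffs
```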

Temporal scalability (Figure 5): the video sequence is divided into multiple non-overlapping layers. Frames in the base layer are encoded with ordinary video coding, yielding a base-layer bitstream at a basic temporal resolution (frame rate). For the enhancement layer, base-layer frames are used as references for inter-frame prediction to generate the enhancement-layer data.

(Figure 5 Temporal scalability)
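A minimal sketch of a two-layer temporal split, matching the two-temporal-layer example used later in the OpenH264 configuration; the simple even/odd assignment is our own assumption:

```typescript
// Two temporal layers: even frames are base (T0), odd frames enhancement (T1).
// T0 frames reference only other T0 frames, so dropping T1 still leaves a
// decodable stream at half the frame rate.
function temporalLayer(frameIndex: number): 0 | 1 {
  return frameIndex % 2 === 0 ? 0 : 1;
}

const frames = [0, 1, 2, 3, 4, 5, 6, 7];
const baseOnly = frames.filter(i => temporalLayer(i) === 0);
console.log(baseOnly); // [0, 2, 4, 6] -> e.g. 30 fps reduced to 15 fps
```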

Encoders adopted by WebRTC and how they are applied

WebRTC supports the VP8, VP9, and H.264 encoders. In terms of user experience, VP8 and H.264 deliver broadly similar results. VP9, the successor to VP8, outperforms both VP8 and H.264 in HD video compression.

As shown in Figure 6, from the encoders’ performance and their browser support we can draw the following conclusions: VP8 and H.264 encode with essentially the same quality, and both are generally acceptable; VP9 is used mainly in Google’s own video products and, notably, supports several SVC modes; HEVC can currently be used only on Apple platforms, so it is not recommended; AV1 is still so new that it is well supported only in Google products, so it is not recommended for now either.

(Figure 6 encoder support in browser)
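For completeness, here is a hedged sketch of how a web client can steer codec selection toward VP9 using the standard RTCRtpTransceiver.setCodecPreferences API; browser support for this API varies:

```typescript
// Prefer VP9 on a video transceiver so it is offered first in the SDP.
const pc = new RTCPeerConnection();
const transceiver = pc.addTransceiver('video');

const caps = RTCRtpSender.getCapabilities('video');
if (caps) {
  // Stable sort: VP9 entries first, everything else in original order.
  const codecs = [...caps.codecs].sort((a, b) => {
    const rank = (c: RTCRtpCodecCapability) =>
      c.mimeType.toLowerCase() === 'video/vp9' ? 0 : 1;
    return rank(a) - rank(b);
  });
  transceiver.setCodecPreferences(codecs);
}
```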

Application status of scalable coding in WebRTC

Before introducing the application of scalable coding in WebRTC, this section first describes how WebRTC communication and networking work.

As shown in Figure 7, communication between client A and client B can use either a direct peer-to-peer connection or a server-based mode. In large-scale deployments, the server-based mode is adopted for forwarding and stream processing.

(Figure 7 Simple WebRTC process)

For multi-party applications with many receivers, WebRTC deployments use one of three architectures: Mesh, MCU, and SFU.

In the Mesh scheme (Figure 8), terminals connect to each other in pairs, forming a mesh structure. For example, when terminals A, B, and C communicate many-to-many, terminal A must send its media (audio and video) separately to terminal B and terminal C; likewise, if B wants to share media, it must send the data to both A and C, and so on. This scheme demands high bandwidth from every terminal.

(Figure 8 Mesh scheme)

In the MCU (Multipoint Conferencing Unit) scheme (Figure 9), one server and multiple terminals form a star structure. Each terminal sends its own audio and video stream to the server; the server mixes the streams of all terminals in the same room and sends the resulting mixed stream back to each terminal, so that every terminal can see and hear the others. The server is effectively an audio/video mixer, which can put it under heavy load.

(Figure 9 MCU scheme)

The Selective Forwarding Unit (SFU) scheme (Figure 10) likewise consists of one server and multiple terminals. Unlike the MCU, however, the SFU does not mix the audio and video streams: after receiving a stream shared by one terminal, it simply forwards it to the other terminals in the room.

(Figure 10 SFU scheme)

Figure 11 compares the bandwidth consumed under the three architectures: SFU has the highest at 25 Mbps, and MCU the lowest at 10 Mbps.

(Figure 11 Bandwidth of the three networking schemes)
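The figure’s exact numbers depend on the scenario, but the scaling behaviour can be reasoned about with a back-of-envelope model; the sketch below is our own simplification, not the figure’s methodology:

```typescript
// Per-client bandwidth in an n-party call where each stream is b Mbps.
// Mesh: upload to n-1 peers, download from n-1 peers.
// MCU:  upload one stream, download one mixed stream.
// SFU:  upload one stream, download n-1 forwarded streams.
type Topology = 'mesh' | 'mcu' | 'sfu';

function uplinkMbps(t: Topology, n: number, b: number): number {
  return t === 'mesh' ? (n - 1) * b : b;
}
function downlinkMbps(t: Topology, n: number, b: number): number {
  return t === 'mcu' ? b : (n - 1) * b;
}

// Example: 6 participants, 1 Mbps per stream.
for (const t of ['mesh', 'mcu', 'sfu'] as Topology[]) {
  console.log(t, { up: uplinkMbps(t, 6, 1), down: downlinkMbps(t, 6, 1) });
}
```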

In terms of characteristics, the Mesh scheme offers poor flexibility; the MCU scheme requires transcoding, mixing, splitting, and other operations on the bitstreams; and the SFU scheme has been widely adopted thanks to its low server load and greater flexibility.

Figure 12 shows how forwarding works in Simulcast mode and in SVC mode. As the two diagrams show, SVC-based bitstream allocation gives the forwarding server far more room to adapt to each receiver. Whichever networking mode is used, SVC is more robust than Simulcast.

(Figure 12 Forwarding in Simulcast and SVC modes)

Support in WebRTC is shown in Figure 13. As the figure shows, H.264 supports only Simulcast, VP8 supports temporal scalability, and VP9 supports SVC in all three dimensions. VP9 is the codec Google mainly promotes, while optimization of the H.264 codec has received less attention, which limits WebRTC’s applicability to some extent. For example, Apple’s latest iPhone 13 has H.264 hardware acceleration; if the AV1 encoder were used instead, the advantages of SVC would become available, but hardware decoding would not be. In WebRTC, Simulcast by default starts multiple OpenH264 encoder instances in parallel through multi-threading, whereas SVC invokes OpenH264 for layered coding in the temporal and spatial domains.

(Figure 13 Support of scalable coding in WebRTC)
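From the application side, a client asks for Simulcast or SVC through the encoding parameters it hands to the RTCPeerConnection. The sketch below uses the standard sendEncodings field for Simulcast; the rid names and bit rates are illustrative, and the commented scalabilityMode line is the WebRTC-SVC extension, which only some browsers implement:

```typescript
// Publish one camera track as three Simulcast streams (q/h/f rids).
async function publishVideo(pc: RTCPeerConnection): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  pc.addTransceiver(stream.getVideoTracks()[0], {
    direction: 'sendonly',
    sendEncodings: [
      { rid: 'q', scaleResolutionDownBy: 4, maxBitrate: 150_000 },  // quarter res
      { rid: 'h', scaleResolutionDownBy: 2, maxBitrate: 500_000 },  // half res
      { rid: 'f', maxBitrate: 1_500_000 },                          // full res
      // SVC alternative (single encoding, e.g. VP9 with two temporal layers):
      // [{ scalabilityMode: 'L1T2' }]
    ],
  });
}
```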

Object detection and bit-rate allocation scheme based on scalable coding

For an SFU serving N channels, the bit rate forwarded to any one terminal must take into account the sum of the bit rates of the other N-1 terminals. For most video-conferencing content, the ratio of each layer’s bit rate to the total bit rate is roughly constant under a given temporal- and spatial-layer configuration. See Figure 14.

(Figure 14 Bit-rate distribution across layers)

Based on the phenomenon in Figure 14, video motion can be used as the main metric for allocating the bitstream. The framework of the scheme from the relevant papers is shown in Figure 15.

(Figure 15 Bit-rate allocation of the SVC encoder)

This scheme leaves room for improvement in two respects. First, it relies on the difference between the current frame and the previous frame, which does not accurately reflect the video’s motion. Second, features other than motion could be added to better capture changes in the image and video. The proposed improvement is shown in Figure 16.

(Figure 16 Proposed solution)
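To make the idea concrete, here is a minimal sketch of a frame-difference motion score driving a base/enhancement bit-rate split. This is our own illustration of the general approach, not the papers’ exact algorithm, and the weighting constants are arbitrary:

```typescript
// Crude motion score: mean absolute difference between consecutive luma planes.
function motionScore(prev: Uint8Array, curr: Uint8Array): number {
  let sum = 0;
  for (let i = 0; i < curr.length; i++) sum += Math.abs(curr[i] - prev[i]);
  return sum / curr.length; // 0 (static) .. 255 (maximal change)
}

// Split a total budget between base and enhancement layers: the more motion,
// the larger the share given to the base layer (illustrative weights only).
function splitBitrate(totalBps: number, motion: number): { base: number; enh: number } {
  const baseShare = Math.min(0.9, 0.5 + motion / 255);
  return { base: totalBps * baseShare, enh: totalBps * (1 - baseShare) };
}
```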

In WebRTC, the H.264 encoder is Cisco’s open-source OpenH264, and its scalable-encoding configuration file is shown below (Figure 17). This configuration sets up two temporal layers.

(Figure 17 OpenH264 scalable encoding configuration file)

SVC bitstreams are characterized by a multi-layer structure, and in practice sub-streams must be extracted from them. For temporal scalability, the sub-stream is extracted by examining the temporal ID in each NAL unit; for spatial scalability, the spatial (dependency) ID; and for quality scalability, the quality ID.
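The three IDs live in the SVC extension header of NAL unit types 14 (prefix) and 20 (coded slice extension), as defined in H.264 Annex G. A sketch of reading them, assuming each buffer starts at the NAL header with start codes already stripped:

```typescript
interface SvcLayerIds {
  dependencyId: number; // spatial layer (DID)
  qualityId: number;    // quality layer (QID)
  temporalId: number;   // temporal layer (TID)
}

// Parse nal_unit_header_svc_extension from an SVC NAL unit.
function parseSvcLayerIds(nal: Uint8Array): SvcLayerIds | null {
  const nalType = nal[0] & 0x1f;
  if (nalType !== 14 && nalType !== 20) return null; // not an SVC NAL unit
  const svcExtensionFlag = (nal[1] >> 7) & 0x1;
  if (!svcExtensionFlag) return null; // MVC extension, not SVC
  return {
    dependencyId: (nal[2] >> 4) & 0x7, // 3 bits after no_inter_layer_pred_flag
    qualityId: nal[2] & 0x0f,          // 4 bits
    temporalId: (nal[3] >> 5) & 0x7,   // top 3 bits of the third byte
  };
}
```

A temporal extractor, for instance, would keep a NAL unit only when its temporalId does not exceed the target layer.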

As can be seen from Figure 18, the base-layer bitstream of OpenH264 can be decoded directly by an AVC decoder, and the svc_extension_flag of the base layer equals 1.

(Figure 18 Decoding diagram of the base layer of scalable coding)
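In practice, stripping a stream down to this AVC-decodable base layer amounts to discarding the SVC-specific NAL units; a minimal sketch under that assumption:

```typescript
// Keep only plain AVC NAL units: drop prefix NAL units (type 14) and
// coded-slice-extension NAL units (type 20), which carry SVC syntax.
function extractBaseLayer(nals: Uint8Array[]): Uint8Array[] {
  return nals.filter(nal => {
    const type = nal[0] & 0x1f;
    return type !== 14 && type !== 20;
  });
}
```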

The NAL units of the SVC enhancement-layer bitstream contain SVC syntax, so the SVC bitstream must be transcoded before a plain AVC decoder can handle it. JSVM, the reference software for scalable coding, includes a dedicated transcoding module. Figure 19 shows the transcoding process, in which multiple NAL units are rewritten into AVC format.

(Figure 19 Transcoding of enhancement-layer NAL units in scalable coding)

Figure 20 shows the decoding result for a bitstream rewritten by JSVM: it can be decoded with a standard AVC decoder.

(Figure 20 Decoding diagram of NAL layer after transcoding)

Application prospects and research directions for AI combined with scalable coding

The most frequently used mode of scalable coding is spatial scalability, but the quality gap between different resolutions is quite noticeable when switching. At the ICME 2020 conference, researchers proposed a super-resolution model for video coding that reconstructs high-resolution images by extracting frames from different moments and fusing their features. The experimental results show an improvement in super-resolution quality.

(Figure 21 Structure of video super-resolution)

Such a model can be used in scalable codecs to reduce the visual discontinuity when switching between bitstreams of different resolutions.

MPEG-5 defines Low Complexity Enhancement Video Coding (LCEVC), which achieves higher compression efficiency than H.264 at the same PSNR. The encoder is shown in Figure 22; the Base Encoder can be any off-the-shelf encoder, such as H.264, VP8, or VP9.

Combining WebRTC with LCEVC is a promising direction for the future. As a new video coding standard, LCEVC has several notable characteristics: it improves the compression performance of the base-layer codec, it has low encoding and decoding complexity, and it provides a platform for additional enhancement features.

As Figure 22 shows, the overall coding complexity depends mainly on the Base Encoder. If the H.264 codec widely used in WebRTC is enhanced with LCEVC, coding performance improves significantly for a modest increase in complexity. Generally speaking, a real-time 1080p high-frame-rate sports stream encoded with H.264 needs a peak bit rate of about 8 Mbps, whereas with LCEVC 4.8 Mbps suffices, a saving of roughly 40%.

(Figure 22 LCEVC encoder)

Given LCEVC’s coding performance, the combination of LCEVC and WebRTC is likely to become an important direction for research and application.