Abstract: The next generation of video coding technology still follows the traditional evolution path: module enhancement on top of the classical architecture.
With the rise of short video and live streaming, online video users now spend more time than social media users. In the 5G era, video is expected to account for 85 to 90 percent of Internet traffic. As users' expectations for picture quality keep rising, delivering clearer video at lower cost over bandwidth-limited networks has become a central concern for many video application enterprises.
Zuo Wen, product manager of the cloud video service at Huawei Cloud, shared Huawei Cloud Video's thinking and application results on next-generation video coding technology. The talk covers three parts: first, Huawei Cloud Video's view of video industry trends and the challenges these trends pose to next-generation video coding technology; second, an introduction to next-generation video coding technology from the standards perspective; and finally, from the perspective of cloud video applications, some of Huawei Cloud's practices and explorations in video coding technology, in the hope of offering some inspiration.
1. Video industry trends
5G, cloud and AI have become the development trend of the ICT industry and of society as a whole, driving the continuous evolution of demand and technology across the video industry and its ongoing upgrade. Every link in the video life cycle is being refreshed, including video production, video processing, video transmission and video consumption.
- Video production: Multi-source data collection, including UHD, VR, free view, 3D modeling and video rendering
- Video processing: AI-based video processing is more real-time, intelligent and accurate, including various encoding methods
- Video transmission: ultra-low delay transmission, cloud side collaboration and so on
- Video consumption: deep combination of intelligent terminals to provide the best experience of video services
The essence of the video industry is the processing of media data, which is supported by computing power, storage, network and AI. At the same time, the video industry promotes the continuous progress of 5G, cloud and AI, which complement each other.
Video's evolution drives a large increase in computing power, storage and bandwidth requirements. Simply put, resolutions keep rising, from HD to UHD to 8K/VR; computing power demand grows about 24-fold, storage 12-fold and bandwidth 20-fold. These needs can be met well through the cloud, and only through the cloud can a high-quality video experience be delivered. Cloud-native video is the industry trend, and video will become a basic service capability of the cloud.
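The jump in requirements is easy to sanity-check with raw pixel-rate arithmetic. The sketch below is illustrative only: it compares an HD stream with an 8K/120 stream in raw pixels per second, whereas the 24x/12x/20x figures above also factor in codec efficiency and infrastructure effects.

```python
def pixel_rate(width, height, fps):
    """Raw pixels per second for a given resolution and frame rate."""
    return width * height * fps

hd = pixel_rate(1920, 1080, 30)       # 1080p at 30 fps
uhd_8k = pixel_rate(7680, 4320, 120)  # 8K at 120 fps
ratio = uhd_8k / hd                   # 64x more raw pixels per second
```

Even before compression, moving from HD to 8K/120 multiplies the raw pixel rate by 64, which is why every downstream resource (compute, storage, bandwidth) scales up by double-digit factors.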
The above is the overall industry trend; now for a specific scenario. Internet video has developed through two stages. The first, from 2008 to 2013, focused on long-form video and on-demand viewing. The second, from 2013 through last year, focused on live streaming and short video. What will the third, next-generation stage focus on? We believe that, driven by 5G, cloud and AI, video will enter the era of real-time interaction and VR/AR.
New video experiences create new demands, and interactive video is transitioning from IM to real-time audio and video. With the rise of co-streaming, anchor PK, live-stream shopping and video distribution, next-generation ultra-low-latency RTC at the 100-millisecond level has become the trend. VR/AR brings an immersive, 360-degree experience revolution in which users move from watching video to playing with video; as the experience improves, single-stream bandwidth grows from the megabit level to tens or even a hundred megabits. Cloud gaming is revolutionizing the games industry, with ten-millisecond latency requirements driving media processing power from the cloud to the edge.
RTC real-time audio and video will become a core infrastructure control point in the 5G era. RTC is widely used, and its market grows by more than 30% annually. Moreover, the technology not only empowers pan-entertainment industries such as live streaming and games, but is also penetrating large video industries such as online healthcare, education and finance.
The existing real-time audio and video market is booming with many players, but non-cloud players are hard to sustain. One reason is the relatively high technical threshold, especially for audio and video coding and for building the whole RTC network. Another is that each vendor currently uses its own private access protocol, so vendors cannot interoperate and customers cannot switch freely. In RTC products, we believe audio and video coding and processing will be one of the keys to building technical barriers and differentiated performance competitiveness.
Another application scenario is Cloud VR. We have always seen VR as a key scenario in the development of 5G. VR's development has been bumpy, but some of the problems encountered earlier are gradually improving. On the terminal side, headsets used to be very expensive, but thousand-yuan-class devices have gradually arrived, and the experience keeps getting better. Beyond devices, VR previously faced a serious shortage of content, and VR live streaming has largely alleviated that problem.
While those difficulties are gradually easing, VR faces a new problem: it is hard for Internet VR businesses to form a commercial closed loop. VR brings revenue growth, but bandwidth cost grows with it. The high-quality experience VR pursues requires higher bandwidth; high bandwidth inevitably means high cost, and high cost prevents the commercial loop from closing.
Against this backdrop, many players develop VR by cutting the experience instead: content below 4K, bitrates under 10 Mbps, and cardboard-class VR terminals. Such VR services can technically run, but the experience is poor, few customers pay, and the industry develops slowly. We therefore believe that, in VR's development, reducing bandwidth through video compression coding is the key element that can help achieve a commercial closed loop.
From the industry trends above, it is easy to see that user experience upgrades, industry upgrades and commercial cost together drive an all-round video upgrade: resolution from HD to 8K, frame rate from 30 to 120 fps, field of view from under 90 degrees to 360 degrees, dynamic range from SDR to HDR, and so on. These parameter upgrades drive the evolution of video compression coding; the pursuit of compression ratio is eternal!
In addition, HEVC/H.265 is actually an excellent coding technology, but because of its unfriendly early patent policy, its market share has never exceeded 13%. Fortunately, the situation is improving. The whole industry urgently needs video coding technology with a higher compression ratio, a better ecosystem and a more reasonable patent policy.
There are two routes to a higher compression ratio, and vendors are pursuing both:
- The standard technical route, which serves as the base kernel: H.266, AV1, AVS3, AI coding
- The non-standard technical route, which builds on the base standards and exploits characteristics of human visual perception: perceptual coding, content-adaptive coding, ROI coding
2. Next generation video coding technology
The following will introduce some work of Huawei Cloud Video in the next generation video coding technology from these two perspectives. These technologies benefit from the full support of Huawei 2012 Institute of Media Technology.
2.1 Next generation video coding standard technology
As can be seen from the figure above, the next generation of video coding standards can be roughly divided into three camps or types:
- International standards: VVC/H.266, developed jointly by MPEG and ITU-T VCEG, and EVC, developed in MPEG
- Domestic standards: AVS3 phase 1 and phase 2, launched or being launched by China's standards bodies. The main difference is that AVS3 phase 1 benchmarks against H.266, while phase 2 targets the future and may include some intelligent coding technologies
- AV1: an open, royalty-free technology launched by the AOM consortium led by Google
The next generation of video coding technology still follows the traditional evolution path: module enhancement on top of the classical architecture. In the H.266 CfP (Call for Proposals), Huawei and several other companies jointly submitted the P41 proposal, which ranked first in both PSNR and MOS evaluations; that proposal is also the basis for what follows. Huawei's count of core patents in VVC already places it in the first tier, which is a great achievement and also shows that China's basic research in video compression coding is no weaker than that of the established companies in Europe and America.
Take VVC as an example and take stock of its new enhancement tools. In the chart, the vertical axis is each tool's compression benefit and the horizontal axis is its codec complexity, with encoding complexity weighted somewhat more heavily. VVC enhances block partitioning, intra prediction, inter prediction, entropy coding, transform and quantization, and other modules; the main gains come from intra prediction, inter prediction, block partitioning, filtering, and tools derived with the help of machine learning. VVC has not yet introduced deep-learning-based coding tools.
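Per-tool compression benefit in charts like this is usually quoted as Bjøntegaard delta-rate (BD-rate): the average bitrate change between two rate-distortion curves at equal quality. The standard metric fits a cubic through four rate/PSNR points; the sketch below is a simplified pure-Python variant using a linear fit of log-bitrate against PSNR, shown only to make the shape of the computation concrete.

```python
import math

def _linfit(xs, ys):
    """Least-squares line ys = a + b*xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def bd_rate_linear(rates_ref, psnr_ref, rates_test, psnr_test):
    """Simplified BD-rate: average % bitrate change of test vs reference
    at equal PSNR, using a linear (not the standard cubic) fit of
    log(bitrate) against PSNR over the overlapping quality range."""
    a_r, b_r = _linfit(psnr_ref, [math.log(r) for r in rates_ref])
    a_t, b_t = _linfit(psnr_test, [math.log(r) for r in rates_test])
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    # the difference of two linear fits is linear, so its average over
    # [lo, hi] equals its value at the midpoint
    mid = (lo + hi) / 2
    diff = (a_t - a_r) + (b_t - b_r) * mid
    return (math.exp(diff) - 1) * 100  # negative = bitrate saving
```

For example, a test codec that needs exactly half the bitrate of the reference at every quality point comes out as a BD-rate of -50%.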
Three tools are highlighted in red; these are generally considered the best trade-offs among the VVC tool points. The blue one is ALF, the familiar adaptive loop filter: it has been studied since the H.265 era, and H.266 finally introduced it into the standard. The green one is affine motion prediction, which was primarily developed by Huawei. The orange ones are quantization techniques.
The EVC standard was proposed partly because the unfriendly patent policies around H.265 threatened to make even H.266 difficult to deploy. MPEG hopes that a new patent-friendly standard will both promote adoption and push changes in the licensing policies of H.265 and H.266. EVC was jointly proposed and promoted by Huawei, Samsung and Qualcomm, with Huawei contributing many technologies. When the project was approved, the goal was a 20% compression improvement over H.265; measured compression efficiency on 4K entertainment video improved by more than 30% over H.265. It has now entered the final standard voting stage.
AVS3 is a standard proposed in China. Its phase 1 benchmarks against H.266 and was completed and released first, in March 2019; in September 2019, Huawei HiSilicon simultaneously launched an AVS3 8K decoding chip. AVS3 offers a performance improvement of more than 20% over H.265, with a number of specific designs for entertainment and surveillance video that could push it further.
H.266 has basically been finalized. Its compression efficiency is about 40% better than H.265 in 4K video scenes, while decoding complexity increases by about 60%. The biggest problems at present are that its patent policy is not transparent, the patent fees may be relatively high, and its adoption may therefore be relatively slow.
EVC has also been finalized. Its compression efficiency improves by about 30% over H.265, with decoding complexity up about 60%. Its patent fees are expected to be relatively low and, just as importantly, relatively transparent and clear. Industry promotion and ecosystem building currently rely mainly on Samsung, Huawei and Qualcomm.
AVS3 phase 1 was released in March 2019. Its performance is solid: compression efficiency improves by about 25%, the complexity increase is relatively low, and patent fees are expected to be relatively low. Industry promotion and ecosystem building are proceeding through the Internet and other industries; many alliances and companies are already involved, and we hope AVS3 lands as soon as possible.
The table does not list figures for AV1, mainly because it differs from the other three standards: the AV1 open-source software directly targets commercial use, and its compression efficiency and decoding complexity are already well understood. A big advantage of AV1 is that it is royalty-free, a commitment made by the AOM alliance. In terms of industry adoption, AV1 has done a good job of ecosystem building.
2.2 AI coding
AI coding is another trend for next-generation video coding standards. It has been proposed since the HEVC and VVC standardization projects were established, but was shelved for the time being out of concern for computational complexity and the limited availability of general AI hardware. It remains a clear technology trend.
AI coding includes two evolutionary ideas. The first is a new architecture, similar to AI image coding, which has in fact achieved good results: the AI image coding technology led by Google has been applied well, while its application to video is still being explored. In the so-called new architecture, the traditional design is replaced by a black box that takes video in and produces compressed video out; it may not partition into blocks or follow other conventional steps at all, and its compression efficiency could be very high, but all of this is still under research.
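To make the black-box pipeline concrete, here is a deliberately tiny sketch of the encode-quantize-decode shape that learned codecs share. The analysis/synthesis pair below is a fixed pairwise mean/difference transform standing in for the neural networks (which are the part a learned codec actually trains); everything here is illustrative, not any real codec's design.

```python
def encode(pixels, step=8):
    """Toy end-to-end pipeline: analysis transform -> quantization.
    A learned codec replaces this fixed pairwise mean/difference
    transform with a neural analysis network."""
    latents = []
    for a, b in zip(pixels[::2], pixels[1::2]):
        mean, diff = (a + b) / 2, (a - b) / 2
        latents.append((round(mean / step), round(diff / step)))
    return latents

def decode(latents, step=8):
    """Synthesis transform: reconstruct pixel pairs from the latents.
    In a learned codec this would be the neural synthesis network."""
    out = []
    for qm, qd in latents:
        mean, diff = qm * step, qd * step
        out += [mean + diff, mean - diff]
    return out
```

Quantization makes the reconstruction lossy but bounded by the step size, which is exactly the rate/distortion knob a learned codec optimizes end to end.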
The other idea is to build on the classical architecture and enhance the modules within it, for example adapting and enhancing block partitioning, transform, vector quantization and intra prediction with different AI networks. Huawei is also researching this area and may put forward papers or proposals on AI coding in the future. We believe these two ideas of AI coding will eventually converge into an integrated design rather than remaining isolated from each other.
3. Huawei cloud video application and practice
3.1 Introduction to Cloud Video
The above briefly introduces the next-generation video coding standard technology. The following introduces the application and practice of Huawei Cloud Video in video coding technology from the perspective of commercial and non-standard applications.
Huawei Cloud Video was established in 2017. It currently includes two categories of services: the traditional live streaming, video-on-demand, media processing and surveillance services, and emerging industry-wide services such as RTC, VR/AR and ultra-HD live streaming. Huawei Cloud Video serves many scenarios, such as entertainment live streaming, short video, online education, enterprise live streaming, 4K live streaming and 4K production. We are committed to helping industry customers, partners, developers and ISVs quickly launch applications and build differentiated competitiveness to achieve a commercial closed loop. RTC embodies Huawei Cloud Video's understanding of the next generation of video, and we have actively promoted it; for RTC, we focus on building differentiated strengths such as ultra-low delay and audio and video quality.
3.2 Video coding technology
3.2.1 Video coding framework
In line with today's theme, the following focuses on Huawei Cloud Video's work on video coding technology. These technologies are fully supported by Huawei 2012 Institute of Media Technology. The coding kernel uses a standard encoder, such as the H.264, H.265, AVS3, H.266 or EVC mentioned above. On top of this kernel, we have optimized and applied different coding techniques for different scenarios: low-delay coding for RTC real-time audio and video; FOV tile coding for VR; cloud-edge collaborative coding for multi-view spatial video; intelligent semantic coding for surveillance; and perceptual coding plus picture-quality enhancement for live streaming and video-on-demand. In addition, Huawei Cloud Video uses Kunpeng and Ascend hardware to accelerate video encoding and transcoding: Kunpeng mainly provides CPU computation, while Ascend mainly provides AI acceleration.
3.2.2 Standard coding kernel
Next comes the video coding technology itself, starting with the coding kernel. Huawei Cloud has accumulated substantial technology in commercial encoders; for example, in the MSU codec competitions of recent years, the HW265 encoder took first place in a number of evaluations for two consecutive years, and this year we will submit a new encoder to MSU as well.
3.2.3 HD low-bitrate coding
The second technology is HD low-bitrate coding. It is by now close to a default technology among vendors and commercial deployments: on top of a standard coding kernel, it reduces the bitrate while keeping subjective quality from dropping. Its theoretical feasibility lies in the fact that existing video coding is based on Shannon's theory, whose rate-distortion model is continuous, whereas the human visual model is a discontinuous staircase; on each step of that staircase there is room to lower the bitrate.
HD low-bitrate coding generally includes three modules: first, a human-vision JND (just-noticeable-difference) model, i.e. how to find the JND; second, perceptual coding based on the JND; third, using perceptual coding to control the output of the standard coding kernel, which greatly reduces the bitrate while subjective quality stays unchanged.
Huawei Cloud Video has done a lot of work here and can achieve bitrate reductions of 30-50% across different application scenarios.
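A minimal sketch of the first two modules, assuming a toy luminance-masking JND model. The curve shape follows classic luminance-adaptation models (vision tolerates more error in very dark and very bright regions), but the constants and the QP mapping are illustrative assumptions, not Huawei's actual model.

```python
def luma_jnd(mean_luma):
    """Toy just-noticeable-difference threshold as a function of
    background luminance (0..255): high in dark regions, low in
    mid-tones, rising slowly again toward bright regions."""
    if mean_luma < 127:
        return 17 * (1 - (mean_luma / 127) ** 0.5) + 3
    return 3 / 128 * (mean_luma - 127) + 3

def qp_offset(block_pixels, base_qp, max_offset=6):
    """Perceptual control: raise QP (spend fewer bits) on blocks whose
    JND threshold says the eye will not notice the extra distortion."""
    mean = sum(block_pixels) / len(block_pixels)
    jnd = luma_jnd(mean)
    # map the JND range (~3..20) onto a QP offset of 0..max_offset
    off = round((jnd - 3) / 17 * max_offset)
    return min(base_qp + max(off, 0), 51)
```

A dark block gets a higher QP than a mid-gray one, which is where the "step" in the human visual model leaves bitrate on the table.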
HD low-bitrate technology has also reached a bottleneck: it originally considered only the coding and transmission channel. With the development of AI, is there further room? Huawei proposes a new line of thought: add receiver-side decoding complexity as a factor in the original rate-distortion model. The sender actively degrades the video, shrinking it through temporal or spatial downsampling so that the bitrate drops substantially. At the receiving end, AI-based super-resolution, frame interpolation or enhancement restores the video from the low-bitrate, low-resolution stream plus some side information, so the bitrate transmitted over the whole link falls significantly. Our preliminary tests show the bitrate can be reduced by at least 60%.
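The degrade-then-restore idea can be sketched as follows, with nearest-neighbour upscaling standing in for the receiver-side AI super-resolution network. This is illustrative only: a real system degrades before encoding and restores after decoding, and the AI restoration is what recovers the lost detail.

```python
def downsample(frame, factor=2):
    """Sender-side spatial degradation: keep every `factor`-th pixel."""
    return [row[::factor] for row in frame[::factor]]

def upsample(frame, factor=2):
    """Receiver-side restoration stand-in: nearest-neighbour upscale.
    A real system would run an AI super-resolution network here."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in range(factor)]
        out.extend([wide[:] for _ in range(factor)])
    return out

frame = [[(x + y) % 256 for x in range(8)] for y in range(8)]
small = downsample(frame)
restored = upsample(small)
# 2x2 downsampling alone removes 75% of the raw samples before encoding
saving = 1 - (len(small) * len(small[0])) / (len(frame) * len(frame[0]))
```

Even before the encoder touches the data, 2x downsampling in both dimensions quarters the sample count, which is why the end-to-end bitrate saving can be so large when the restoration network is good enough.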
3.2.4 Ultra-low delay coding
The RTC scenario is key to building our service capability for the next-generation video industry, and its core is low-delay coding. We propose a comprehensive low-delay scheme: joint optimization of encoding and rendering, a low-delay coding kernel, layered coding, and joint source-channel techniques, combined differently for different real-time scenarios and applications. Our preliminary tests show the overall encode-plus-decode delay at 1080p can reach the 10-millisecond level.
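One way to see why a 10-millisecond codec matters: an end-to-end 100 ms RTC target leaves little room once capture, transport and jitter buffering are accounted for. The numbers below are illustrative assumptions for a budget of this kind, not measured Huawei figures.

```python
# Illustrative end-to-end latency budget for a 100 ms RTC target (ms).
BUDGET_MS = {
    "capture": 15,        # camera readout + pre-processing
    "encode": 7,          # low-delay encoder
    "network": 40,        # one-way transport across the RTC network
    "jitter_buffer": 20,  # receiver-side de-jitter
    "decode": 3,          # low-delay decoder
    "render": 15,         # display pipeline
}

codec_share = BUDGET_MS["encode"] + BUDGET_MS["decode"]  # codec's slice
total = sum(BUDGET_MS.values())
```

With the codec held to roughly 10 ms, the remaining 90 ms can absorb network variance, which is what makes a 100-millisecond interactive experience feasible.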
3.2.5 VR FOV coding
For VR, especially 360° scenes, we propose FOV TWS coding. The principle is to tile the high-resolution panoramic video: multiple FOV tiles are combined with a 4K background stream, so that a 4K-capable terminal player can achieve 8K VR panoramic playback by fetching the FOV tiles for the current viewing angle together with the 4K panoramic background stream, while keeping MTP (motion-to-photon) latency low enough to avoid dizziness. This technique has been written into the OMAF standard, and the overall experience has been well received by users.
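The core tile-selection step can be sketched as follows: given the user's gaze direction, choose which tiles of the panorama must be sent at high resolution. This is a toy yaw-only model with equal-width longitude tiles; real tile-wise streaming schemes tile in both dimensions and add margins for head-motion prediction.

```python
def visible_tiles(yaw_deg, fov_deg=90, tiles=8):
    """Indices of equal-width longitude tiles of a 360° panorama that
    overlap the user's horizontal field of view (toy, yaw-only model)."""
    width = 360 / tiles
    half = fov_deg / 2
    chosen = []
    for t in range(tiles):
        center = t * width + width / 2
        # shortest angular distance from tile centre to gaze direction
        d = abs((center - yaw_deg + 180) % 360 - 180)
        if d <= half + width / 2:
            chosen.append(t)
    return chosen
```

Looking straight ahead (yaw 0) with a 90° FOV needs only 4 of the 8 tiles at high resolution; the rest of the sphere can ride on the low-resolution background stream, which is where the bandwidth saving comes from.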
3.2.6 Intelligent semantic coding
For the surveillance scenario, we propose a kind of intelligent semantic coding, built mainly from background modeling plus video content and motion analysis, with some real-time super-resolution and interpolation on the receiving side. Surveillance scenes are rich in detail, and the recognition rate of machine analysis must not drop; if compression is too aggressive, recognition rates may fall. Preliminary prototype results show more than 70% bitrate savings without reducing recognition rates for either humans or machines.
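A sketch of the background-modeling front end, using per-pixel exponential smoothing and a simple difference threshold. This is illustrative: production systems use far richer models, and the resulting mask would feed the encoder's bit allocation so moving regions keep full fidelity while the static background is coded coarsely.

```python
def update_background(bg, frame, alpha=0.05):
    """Running-average background model (per-pixel exponential smoothing)."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]

def roi_mask(bg, frame, thresh=20):
    """Mark pixels that deviate from the background model: these moving
    regions are the semantic ROI that must keep full coding fidelity."""
    return [[abs(f - b) > thresh for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]
```

On a mostly static surveillance feed, the mask is empty almost everywhere, so nearly all bits can go to the few regions where something actually moves.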
3.2.7 Cloud side collaborative coding of spatial video
Another technology is spatial video coding. Spatial video means free-viewpoint or multi-viewpoint video, another direction of future development: people are no longer satisfied with watching from a fixed perspective and want to watch from multiple or freely chosen viewpoints. For spatial video codecs, we propose cloud-edge collaborative coding, which dynamically generates view-switching streams on demand at the edge in a very short time, greatly reducing the bitrate of switching streams compared with general schemes; initial tests show the bandwidth cost can be reduced by at least 60%.
3.2.8 AI Video enhancement
Video quality and video bitrate are two key metrics in the video industry. The technologies above, whether standard or non-standard, all pursue a lower bitrate at the same picture quality. The other side of the coin is pursuing better subjective quality at the same bitrate. We have made many attempts here as well: based on cloud and terminal AI capabilities, we repair, enhance and reconstruct videos along the dimensions of resolution, frame rate and dynamic range according to the characteristics of each scene. Since real scenes often contain a mixture of distortion factors, we propose a multi-task video enhancement framework for mixed distortion, which adapts well to different scenes and different needs.
The above content introduces some practices and exploration of Huawei cloud video in video encoding and decoding, hoping to bring you some inspiration. Thank you!
This article is shared by Huawei Cloud community “Exploration of the Video Cloud Application of the next generation of video coding Technology”, originally written by: Audio and video manager.