In recent years the content business has boomed across the industry, and Taobao is actively shifting toward content as well. At LiveVideoStackCon 2021 Shanghai, we invited Wang Libo (Zhuang Shu), senior algorithm expert at Alibaba Taobao Technology, to review Taobao's evolution from image-and-text content to short video and live streaming. The talk introduces the application of audio and video algorithms and their future investment directions, including codecs, video processing, audio communication and interaction.

By Wang Libo (Zhuang Shu)

Edited by LiveVideoStack

I am very glad to share with you today. First, let me introduce myself: I am Wang Libo from the Taobao Technology department, and I go by Zhuang Shu. When I first received this assignment, I felt the topic was very broad, with a great deal to cover. After some thought, I decided to focus on three points so that I can present each in more detail.

The three points are: video compression effectively reduces costs (a view already widely accepted in the industry); video processing improves the picture quality experience (with Alibaba Cloud's narrowband HD, this is also gradually being accepted); and audio technology is a new productivity (a direction Taobao has pushed hard to explore over the past one to two years, which we hope will bring new thinking to the industry).

01 Business Introduction

First, a brief introduction to Taobao's content business. With the development of communication technology, the Internet content ecosystem has moved from text in the 2G era to pictures in the 3G era, and then to live streaming and short video in the 4G era. Taobao likewise faces a comprehensive "content-oriented" upgrade.

Taobao has moved from PC to mobile, from image-and-text content to live streaming and short video, and from traditional e-commerce to content e-commerce and then to discovery e-commerce and interest e-commerce. In 2020, the GMV of Taobao Live exceeded 400 billion. Last year, more than 700 million people watched Taobao Live in a single day. At the end of 2020, with the launch of Diantao and Guangguang, short video became a new engine for the growth of the content business. This rapid growth also brings huge cost pressure.

02 Video compression effectively reduces costs

2.1 Start from picture compression

First, video compression effectively reduces costs.

Before talking about video codecs, we should start with image compression. This is a small product-detail image. Before live streaming and short video took off, images were the main way people got information. As the user base grew, cost pressure increased: over the past few years, Taobao images have been displayed more than one hundred billion times per day on average. Cutting costs by lowering quality would sacrifice the user experience, so we rely on technology upgrades to improve compression efficiency.

2.1.1 Evolution of picture compression standards

You are probably familiar with image compression, which is essentially the process of removing spatial redundancy. In terms of standards, it has mainly progressed through JPEG, the VP8-based WebP, and HEVC MSP (Main Still Picture profile).

JPEG is the most widely used image compression standard. Released in 1992, it is now nearly 30 years old, and it is simple and efficient. About a decade ago, Google released the WebP format, based on the VP8 core. WebP improves on JPEG in block partitioning, prediction, transform, quantization and entropy coding, and adds a deblocking filter. HEVC goes a step further than WebP, upgrading several tools to improve compression efficiency. In addition, by introducing Tiles and WPP (wavefront parallel processing), HEVC gives codec implementations several ways to parallelize, which suits modern multi-core CPUs.

To compare the compression efficiency of the three formats on data sets from different scenarios, we designed the experiment shown in the figure. The conclusion is that WebP improves compression efficiency by 29% over JPEG, and HEVC by nearly 50% over JPEG.

2.1.2 APG format developed by Taobao itself

APG is an image format developed by Taobao, with three characteristics. First, very high compression efficiency: it saves 50% of the bit rate compared with JPEG, which is very close to HEVC. Second, an efficient mobile decoder: decoding time is 20% lower than WebP. Third, support for alpha channels and animation: animated images mostly use the GIF format, but GIF does not exploit the correlation between frames, so its compression efficiency is low; APG files are about 10 times smaller than the equivalent GIF.

In addition, we did a lot of work on the architecture of the whole system, such as high-concurrency real-time response, CDN delivery strategy, separation of storage and compute, and multi-site disaster recovery. In the end we built Taobao's hundred-billion-scale real-time image processing system, which greatly reduced business costs while preserving the image quality experience.

2.1.3 The content service has evolved to be video-oriented

As the content business evolves, video and live streaming dominate traffic. On the one hand, information has expanded from the spatial dimension to the temporal dimension; on the other hand, resolution has risen to 720P, 1080P or even 4K, and viewing time has multiplied. (According to the data Mr. Chen presented just now, the average time each person spends on video may be several dozen minutes.) We also know that each generation of video compression standard saves about 50% of the bit rate compared with the previous one, from MPEG-4 to H.264/AVC to H.265/HEVC, and then to H.266/VVC released last year. It is therefore natural to think of upgrading the coding standard to reduce video costs.

2.1.4 HEVC challenges in video business

First, let us discuss the challenges of deploying HEVC in the video business. Eight years have passed since the HEVC standard was published in 2013, yet it was not deployed in business at large scale until recent years. There are several reasons:

The first is encoding speed. HM, the official reference model of H.265, encodes 720P video at only 0.1 fps on an ordinary PC; compressing a 10-minute video would take a day or even several days. x265, the best open-source encoder in the industry, reaches only 6.8 fps at its slow preset, far short of the 30 fps needed for real-time encoding.

The second is coding quality. Because of complexity constraints, x265 saves only 18% of the bit rate compared with x264, far below the theoretical gain of HEVC over AVC (the HEVC standard was designed to save 50%).

The third is rate control. The industry has many rate control methods for common scenarios, such as ABR, CBR and CRF. However, real-time audio and video business scenarios are very complex, and these methods cannot be used directly.

The fourth is decoding compatibility and performance, which is also a common concern. Users watch in many environments, including Android, iOS and Web H5. H.265 is not well supported on H5, which limits its adoption, and hardware device compatibility is imperfect, so decoding compatibility is also a major challenge.
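To make the speed gap concrete, here is a small back-of-the-envelope calculation in Python. The 30 fps source frame rate is an assumption (the talk gives only the 10-minute duration and the encoder throughputs), and `encode_hours` is a name made up for this sketch.

```python
def encode_hours(duration_s: float, source_fps: float, encoder_fps: float) -> float:
    """Wall-clock hours needed to encode a clip at a given encoder throughput."""
    total_frames = duration_s * source_fps
    return total_frames / encoder_fps / 3600.0


# A 10-minute clip; the 30 fps source frame rate is an assumption.
clip_seconds = 10 * 60
source_fps = 30.0

print(f"HM   at 0.1 fps: {encode_hours(clip_seconds, source_fps, 0.1):.1f} h")
print(f"x265 at 6.8 fps: {encode_hours(clip_seconds, source_fps, 6.8):.2f} h")
```

Under that assumption HM needs about 50 hours, which matches "a day or even several days", while x265 at 6.8 fps needs roughly 44 minutes for the same 10 minutes of video, still more than four times slower than real time.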

2.1.5 S265 coding kernel optimization

Development of the S265 encoder began in 2017, and it took more than three years to reach a reasonably good state. We did a lot of optimization work on coding tools, fast algorithms, engineering optimization, rate control methods and the framework to improve compression efficiency and encoding speed.

Here is a detailed introduction to speed optimization related technologies.

1. CU depth prediction. HEVC block partitioning runs from 64×64 down to 8×8, giving four depth levels, and predicting the partition depth of a CTU is challenging. To guarantee coding performance, HM computes the RD cost at every level. We combine texture complexity, spatially and temporally adjacent blocks, and motion information from pre-processing to predict CU depth more accurately. There is also much further research on depth prediction; for example, some machine learning methods can accurately predict the partition level of a block.

2. Adaptive EarlySkip and RecursionSkip. The two algorithms are similar: EarlySkip skips the evaluation of other modes at the current level, while RecursionSkip stops further splitting below the current level. x265 has similar techniques, but we go further: when making the RecursionSkip decision, we also consider the SATD of the sub-blocks in skip and merge modes.

3. All-zero block detection. If a block is all zero after quantization, there is no need for RDO or coefficient coding; the question is how to predict that a block will be all zero. We found that a method suited to one block size, accurate at 8×8 for instance, may not be accurate at 32×32, so we predict whether a block is all zero based on both block size and internal texture strength.

4. Fast intra prediction. This has been studied by many people. H.265 has 35 intra prediction modes, and there are many papers on how to find the prediction angle quickly. We use a Bayesian estimation model to first determine whether the direction is horizontal or vertical, and then refine the angle, which speeds up angle selection.

5. Sub-pixel search. The traditional sub-pixel search evaluates 4 or 8 points around the best integer-pixel position, which is computationally expensive. Starting from the integer-pixel results, we derive the sub-pixel position with an error surface estimation model, which reduces the number of sub-pixel evaluations.

6. Multiple reference frame selection. To improve compression efficiency, current encoders use more reference frames, for example 3 to 4 in one direction. We weight reference frames by quality and distance to select suitable ones. The other question is how to break out of the traversal of the remaining reference frames once a good enough result has been found. Multiple reference frames are a good way to improve compression quality, but the growth in computational complexity has to be kept in check.

7. Fast distortion estimation. RDO needs a careful estimate of the error. Traditional SATD does not work well enough here, while a complete RDO calculation is very time-consuming. We therefore built a residual prediction model that derives the distortion from the quantized coefficients, avoiding inverse quantization and inverse transform. The other component of the RD cost is the bit cost. Running complete entropy coding is very time-consuming, so we estimate bits with a piecewise linear model, which makes the RDO calculation fast.

Others, such as deblocking and SAO optimizations, are more engineering-oriented.
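As an illustration of the sub-pixel search idea in item 5, one common textbook form of an error surface model fits a parabola through the matching costs at the best integer position and its two neighbours on each axis, then takes the parabola's minimum as the fractional offset. The sketch below assumes that form; it is not the S265 model, and the function names are made up for the example.

```python
def subpel_offset(e_minus: float, e_zero: float, e_plus: float) -> float:
    """Offset of the parabola minimum through three costs sampled at
    integer positions -1, 0, +1, where 0 is the best integer-pel position.

    For e(x) = a*x^2 + b*x + c the minimum lies at
    x* = (e(-1) - e(+1)) / (2 * (e(-1) - 2*e(0) + e(+1))).
    """
    curvature = e_minus - 2.0 * e_zero + e_plus
    if curvature <= 0.0:
        # Flat or non-convex costs: keep the integer position.
        return 0.0
    offset = (e_minus - e_plus) / (2.0 * curvature)
    # Clamp to half a pixel on either side of the integer position.
    return max(-0.5, min(0.5, offset))


def estimate_quarter_pel(cost_left, cost_center, cost_right, cost_up, cost_down):
    """Return (dx, dy) in quarter-pel units, estimated separately per axis."""
    dx = round(4 * subpel_offset(cost_left, cost_center, cost_right))
    dy = round(4 * subpel_offset(cost_up, cost_center, cost_down))
    return dx, dy


# Costs sampled from 100 * (x - 0.25)^2 + 10 horizontally, symmetric vertically.
print(estimate_quarter_pel(166.25, 16.25, 66.25, 50.0, 50.0))
```

The encoder can then evaluate only the predicted quarter-pel position (and perhaps its immediate neighbours) instead of all 8 surrounding candidates.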

Beyond the fast algorithms, here is a summary of compression performance: at the same speed, we have a gain of more than 30% over x265. Our encoder gains performance in the following areas:

1. Rate control. The goal of rate control is to allocate bits where they are most valuable. It is divided into frame-level and block-level rate control. At the frame level, we do accurate pre-analysis for I-frame and P-frame rate control; at the block level, we designed an enhanced CU-tree algorithm.

2. Hierarchical B-frames and reference structure optimization. Hierarchical B-frames are not difficult to implement and help compression efficiency a great deal. GOP structure optimization was covered above; here too we weigh the choice of reference frames.

3. Adaptive GOP size. As is well known, in static scenes more layers means higher compression efficiency, but this is less effective in scenes with motion. We therefore implemented adaptive GOP size and developed our own scenecut algorithm, which adapts to different motion intensities and to scene changes, including fade-ins and fade-outs.

4. Additional tools. Bi-Search, GPB and LTR are not available in the open-source x265 but help prediction efficiency, and MCTF is very helpful for removing coding noise. After adding these to S265, we obtained a BD-rate gain of more than 5%.

5. 2-pass encoding. 2-pass solves for a globally optimal qscale and is used in offline transcoding, where more analysis of the video is possible. In the usual solution the distortion measure is MSE; we re-derived the measure function and obtained a 5% gain in compression performance.

6. Dynamic CRF and PB offset. The usual practice is to fix the QP offsets of P-frames and B-frames; we instead adjust the frame-level QP according to frame complexity.

7. AQ and RDO. Their cost calculations are usually based on MSE, but if the target metric is SSIM, an SSIM-based model can be derived for AQ, and likewise for RD.

8. Screen content. For conferencing scenarios we implemented the IBC tool, which helps with PPT screen sharing, and we designed a search algorithm specifically for screen content. Traditional fast algorithms such as diamond search and hexagon search are very inefficient in SCC scenarios and struggle to find the optimal solution, whereas our self-developed algorithm improves efficiency.

That concludes the introduction to S265. I remember that in 2016 Kingsoft's KS265 took part in the MSU competition for the first time and achieved very good results, and many domestic peers later ranked well too. Alibaba began research on S265 in 2017 and entered the MSU competition for the first time in 2020, where we took first place in three categories: 1080P 30 fps PSNR, 1080P 1 fps PSNR, and 1080P 30 fps subjective quality.

2.2 Scene adaptive coding

In addition to the core encoder, we developed a scene-adaptive coding method for applying the encoder, in three steps:

1. Video analysis: machine learning methods segment the video and produce a high-level semantic classification, such as animation, sports, shows or product introductions.

2. Signal analysis: on another dimension, signal analysis detects the low-level features of the video, such as motion intensity, texture, noise intensity and brightness. Coding parameters are determined from both the high-level and the low-level information.

3. Adaptive decision engine (ADE): the optimal combination of encoding parameters is determined from semantic features, signal features and network conditions. The decision process is modeled as a constrained optimization problem.
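The constrained optimization in step 3 can be pictured with a toy example: among candidate parameter combinations, pick the one with the highest predicted quality whose bit rate fits the network and whose encoding cost fits the device. Everything in this sketch (the `Candidate` fields, the numbers, the exhaustive search) is hypothetical and only illustrates the shape of the problem, not the actual ADE.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str                 # hypothetical label for a parameter combination
    bitrate_kbps: int         # expected output bit rate
    compute_cost: float       # relative encoding cost (1.0 = device budget)
    predicted_quality: float  # score from some quality model, assumed given


def choose_parameters(candidates: Iterable[Candidate],
                      max_kbps: int,
                      max_cost: float) -> Optional[Candidate]:
    """Maximize predicted quality subject to bit rate and compute constraints."""
    feasible = [c for c in candidates
                if c.bitrate_kbps <= max_kbps and c.compute_cost <= max_cost]
    if not feasible:
        return None
    # On equal quality, prefer the cheaper bit rate.
    return max(feasible, key=lambda c: (c.predicted_quality, -c.bitrate_kbps))


# Made-up candidates for illustration only.
CANDIDATES = [
    Candidate("540p-fast", 800, 0.4, 75.0),
    Candidate("720p-fast", 1200, 0.6, 82.0),
    Candidate("1080p-medium", 2200, 0.9, 88.0),
    Candidate("1080p-slow", 2500, 1.4, 91.0),
]

best = choose_parameters(CANDIDATES, max_kbps=2300, max_cost=1.0)
print(best.name if best else "no feasible configuration")
```

In practice the semantic and signal features from steps 1 and 2 would feed the quality prediction, and the network condition would set the bit rate bound.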

2.3 S265 decoding improves coverage

H.265 decoding compatibility has always been a concern. If the production side encodes H.265 bitstreams but the player cannot decode H.265, the server has to transcode to H.264, which not only fails to reduce CDN bandwidth but also adds transcoding cost. We did a lot of work on the decoding side:

1. Hardware decoding adaptation. Essentially all device models on the market (>1000) have been adapted.

2. A self-developed high-performance native H.265 decoder. Tested with 720P video on a Mi 5, it reaches about 240 fps, achieving real-time decoding at very low power consumption.

3. H5 decoding. H.265 is not supported in H5, so we support H5 playback through WebAssembly. It currently achieves real-time decoding of 1080P 30 fps on an i7 computer with CPU usage below 30%.
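Taken together, the three decoding paths suggest a simple selection order at playback time. The sketch below is one possible way to express it; the function, its parameters and the returned labels are all hypothetical and do not describe Taobao's player.

```python
def pick_playback_path(platform: str,
                       hw_hevc: bool,
                       sw_hevc_realtime: bool,
                       wasm_supported: bool) -> str:
    """Choose a decoding path, falling back to an H.264 stream as a last resort.

    platform: "android", "ios" or "web" (hypothetical labels).
    hw_hevc: the device has an adapted hardware H.265 decoder.
    sw_hevc_realtime: the native software decoder runs in real time here.
    wasm_supported: the browser can run a WebAssembly decoder.
    """
    if platform in ("android", "ios"):
        if hw_hevc:
            return "hevc-hardware"
        if sw_hevc_realtime:
            return "hevc-software"
    elif platform == "web" and wasm_supported:
        return "hevc-wasm"
    # No usable H.265 path: request the transcoded H.264 stream.
    return "h264-fallback"


print(pick_playback_path("android", hw_hevc=False, sw_hevc_realtime=True,
                         wasm_supported=False))
```

The fewer sessions end up in the last branch, the less server-side transcoding to H.264 is needed.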

2.4 Landing process of Taobao encoder

Building on this exploration of the codec kernel and its applications, here is how the Taobao encoder was rolled out. We began investing in S265 research in 2017 and, after more than a year, started deploying it in the live streaming business. The first version went live in Q1 2019; it was not yet very good, saving about 30% of the bit rate. The second version, launched in Q1 2020, saved 40%. The third version, in Q1 this year, achieves 50% bit rate savings together with narrowband HD technology. For short video, we applied S265 to Taobao short video transcoding: the first version went live in Q3 2019 and the second in 2020.

Building on our experience with the S265 core encoder, we started developing the S266 codec in Q2 2020 and were the first in the industry to announce a commercial S266 codec.

On the decoder side, compared with VTM it is 3.5 times faster on a single core and 16 times faster on multiple cores. High-end phones (iPhone 12, P40) achieve 4K 30 fps decoding, and low-end phones achieve 720P 30 fps decoding on two cores. Memory consumption at 720P is below 35 MB and the binary is below 1 MB, which is critical for large apps, since an oversized package hinders download and installation. We are also building a VVC encoder internally. The targets are for the 1 fps Slow preset to save 50% of the bit rate compared with x265 veryslow, and for the 30 fps Fast preset to save 40% compared with x265 medium. As is well known, VVC is slower than HEVC: compressing a 1-minute 4K video takes several days with HEVC's HM, and may take a month with VVC's VTM. At present our Slow preset is 100 times faster than VTM with similar compression efficiency.

To sum up, Taobao's S265 intelligent coding solution aims to make video clearer and covers all business scenarios, including image compression, conference SCC, live streaming, cloud transcoding and even cloud gaming. The business strategies include adaptive scene classification, intelligent rate control, latency adaptation (optimizing for the latency requirements of each scenario, approaching the compression efficiency of unconstrained latency at very low latency), and compute adaptation (adjusting the encoding speed preset to the device). The codec kernel covers optimization of rate control and pre-processing, the coding tool set, fast algorithms and the coding framework. The system platform includes ARM (ARMv7/ARM64) and x86 (SSE/AVX) implementations, with FPGA- and ASIC-based implementations under consideration, plus a quality evaluation system and a training cluster that support encoder development.

03 Video processing improves picture quality experience

Next, some thoughts on how video processing can improve the picture quality experience.

3.1 Video processing improves picture quality

Video distortion has many sources: overexposure, scaling, defocus, stuttering, color loss, compression loss, noise, jitter and frame rate downsampling. We have our own set of video enhancement tools to address them, such as deblocking (DeBlk), super resolution (with on-device-oriented and server-side versions of the model), texture detail enhancement, deinterlacing (DEI), color enhancement, low-light enhancement and spatio-temporal denoising.

3.2 Taobao short video narrowband HD transcoding

All Taobao short video transcoding has migrated to narrowband HD technology. The video production chain consists of content editing, upload (which requires a high success rate and high speed, so we use multi-channel upload and segmented upload), transcoding, review (which screens out low-quality and distorted videos), and playback (post-processing and rendering according to the capability of the playback device).

I would like to focus on the transcoding service. Its core technologies are narrowband HD and S265, each with its own visual model. The narrowband HD processing model includes quality classification, fine texture removal, defocus-area weakening (to save bit rate), perceptual texture enhancement (to improve the visual experience), face protection (to avoid over-enhancement that looks unpleasant), mosaic repair and deinterlacing.

The S265 visual compression model has three parts. The first is the inflection point between human perception and distortion: the quality-versus-bit-rate curve is steep at first and flattens later, and we look for the cost-effective point where no further distortion is perceptible at a reasonable bit rate. The second is the sweet spot between bit rate and resolution: for different content, each bit rate suits a different resolution. If 1080P is forced into a very low bit rate of 300K, blocking makes the subjective experience very poor, while compressing at 540P or 360P gives a better visual experience. The third is scene-classified coding: different scene classes suit different coding parameters and bit rates.
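The sweet spot between bit rate and resolution can be illustrated with a small lookup: given quality measurements per resolution at a few bit rates, choose the resolution whose interpolated quality is highest at the target bit rate. The quality numbers below are invented for the example, and the linear interpolation is an assumption of this sketch, not a description of the S265 model.

```python
from bisect import bisect_left


def interp(points, x):
    """Linear interpolation over sorted (x, y) points, clamped at both ends."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def best_resolution(curves, target_kbps):
    """Resolution with the highest interpolated quality at the target bit rate."""
    return max(curves, key=lambda res: interp(curves[res], target_kbps))


# Hypothetical (kbps, quality score) measurements for one piece of content.
CURVES = {
    "360p": [(200, 60), (400, 70), (800, 74), (1600, 76)],
    "540p": [(200, 52), (400, 68), (800, 80), (1600, 84)],
    "1080p": [(200, 30), (400, 50), (800, 74), (1600, 90)],
}

for kbps in (300, 800, 1600):
    print(kbps, best_resolution(CURVES, kbps))
```

With these made-up curves, 300 kbps favours 360p and 1080p only wins at the high end, which mirrors the 300K example in the text. Real curves differ per content, which is why the decision is made per scene class.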

3.3 Beauty filters in e-commerce

Beautification is a basic function in content production and is widely accepted, but conventional beauty filters have problems in e-commerce scenarios, such as over-beautification, product discoloration, background blurring and high resource consumption. In PixelAI Beauty, we use Face3D reconstruction to keep facial reshaping natural, and an AI skin model to make sure beautification does not affect the background or the merchandise.

3.4 HDR10 End-to-end system

With the development of capture and display devices, HDR with 10-bit depth is seeing some adoption. We see three core elements in HDR10. The first is dynamic range, which helps us see content in low-light and overexposed scenes. The second is color gamut: BT.2020 support improves color reproduction. The third is 10-bit depth. HDR is very helpful for reproducing products faithfully, because the core of e-commerce live streaming and short video is to reproduce products, not to beautify them. However, HDR is an end-to-end system, and compatibility across all kinds of devices has to be considered, so we did some adaptation to improve the user experience. Content captured by an ordinary camera goes through the regular pipeline, while on high-end devices that support HDR10 the content is compressed and transmitted in 10 bit. At playback, HDR-to-SDR and 10-bit-to-8-bit conversion is applied according to the capability of the playback device. This way good phones get the best HDR experience, and ordinary phones still get a basic HDR experience. You can see that HDR brings the colors in the picture closer to the actual colors.
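As a small illustration of the 10-bit-to-8-bit step only, the sketch below reduces one sample with round-to-nearest and picks an output format from a device flag. It deliberately leaves out HDR-to-SDR tone mapping and gamut conversion, which a real pipeline needs as well; the function names and format labels are hypothetical.

```python
def to_8bit(sample_10bit: int) -> int:
    """Reduce one 10-bit sample (0..1023) to 8 bits with round-to-nearest."""
    if not 0 <= sample_10bit <= 1023:
        raise ValueError("expected a 10-bit sample in 0..1023")
    # Add half of the discarded step (2 of 4) before shifting, then clip.
    return min(255, (sample_10bit + 2) >> 2)


def output_format(device_supports_hdr10: bool) -> str:
    """Keep 10-bit HDR where the device can show it, else deliver 8-bit SDR."""
    return "hdr10-10bit" if device_supports_hdr10 else "sdr-8bit"


row_10bit = [0, 1, 2, 512, 1023]
print([to_8bit(v) for v in row_10bit], output_format(False))
```

Rounding before the shift avoids the systematic darkening that plain truncation would introduce.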

04 Audio technology improves experience and productivity

The third point is that audio technology improves experience and productivity. For the past several years audio was an accompaniment to video, but with the launch of Clubhouse last year, people realized that audio can stand on its own, which was a great inspiration for audio technology. Audio technology can also assist content production, content review and all kinds of audio processing. I believe audio will be a very important productivity going forward.

4.1 Service: Number of users and Duration

For the content business, the core metrics are the number of users and viewing duration. For technology, the core work is improving the consumption experience, improving anchors' production efficiency, and doing platform governance well. Audio can play a very important role in all of these. From the anchor's perspective, audio technology can be used for automatic editing of spoken segments, visitor alerts (so the anchor need not stay at the computer), broadcast assistance, and subtitle and soundtrack generation. From the user's perspective, audio interaction supports co-streaming, games, price guessing and voice comments. From the platform's perspective, audio can help detect pornographic and violent content, pirated streams, hotlinking and empty-shot streams.

Let me share a few typical cases. The first is that audio technology improves the sound quality experience. A simple audio transmission system consists of capture, pre-processing (AEC/ANS/AGC), encoding, network transmission (FEC/NACK), receiving (jitter buffer/NetEQ), decoding and resampling. To achieve high sound quality, every stage needs careful work: high-fidelity capture, dual-channel processing (AliDenoise, echo suppression, intelligent bel), adaptive bit rate in encoding (HE-AAC), QoS during transmission (FEC/NACK), and recovery of the original audio data at the receiver (PLC/NetEQ). Various sound effect technologies (3D sound, spatial sound, bass) are applied to enhance the listening experience. Adapting the audio to the content is also very important in live streaming. (Mr. Chen mentioned an example: in a music live room, sound quality is poor with an ordinary template, so several sound templates are needed for different types of live room.) With this system we support Taobao Live, voice chat rooms and other businesses.
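To show what FEC buys in the transmission stage, here is a minimal sketch of the simplest scheme, a single XOR parity packet per group, which lets the receiver rebuild one lost packet without a retransmission. It assumes equal-length packets and is only an illustration; the talk does not say which FEC scheme is used in production.

```python
from typing import List, Optional


def xor_parity(packets: List[bytes]) -> bytes:
    """XOR all packets together (equal lengths assumed) into one parity packet."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one lost packet (marked None) from the parity packet."""
    missing = [i for i, packet in enumerate(received) if packet is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet can repair only one loss")
    repaired = list(received)
    if missing:
        survivors = [p for p in received if p is not None]
        # XOR of the survivors and the parity equals the lost packet.
        repaired[missing[0]] = xor_parity(survivors + [parity])
    return repaired


group = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(group)
print(recover_missing([b"abcd", None, b"ijkl"], parity))
```

When more than one packet of a group is lost, this scheme cannot help, which is where NACK-based retransmission and PLC at the receiver come in.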

4.2 AliDenoise — Make the sound clearer

Taobao's self-developed AliDenoise is an intelligent noise reduction technology that makes speech clearer. Traditional noise reduction is based on a time-frequency (Fourier) transform plus Wiener gain; its pain points are poor suppression of non-stationary noise and failure at low SNR. AliDenoise instead performs end-to-end speech noise reduction in a data-driven way, training the model on the basis of the a priori SNR method. It also uses cache-buffer streaming processing, 1D convolution and model miniaturization. Its core advantages are strong noise reduction, high voice fidelity (we compared it with several competing products, and AliDenoise is better on both subjective and objective metrics), an extremely small model (a 1.6M model achieves noise reduction on common phones with only 6% CPU usage), and controllable latency (latency can be adjusted to the capability of the device). There are three audio clips: the first is the original street recording; the second is the result after RTC processing (the sound of passing cars is still obvious); the third is the result after AliDenoise processing, where non-stationary noise is well suppressed and the voice is well preserved.

Audio clips: street scene (original); after RTC processing; after AliDenoise processing. (To hear the audio clips above, please visit: mp.weixin.qq.com/s/FZL2nr8qd…)
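For reference, the classical baseline that data-driven denoisers are usually compared against can be sketched per frequency bin: estimate the a priori SNR with the well-known decision-directed rule and apply a Wiener-style gain. This shows only that traditional approach, with a conventional smoothing factor of 0.98 chosen as an assumption; it is not the AliDenoise model.

```python
def wiener_gain(prior_snr: float) -> float:
    """Wiener gain G = xi / (1 + xi) for a priori SNR xi (linear scale)."""
    if prior_snr < 0:
        raise ValueError("a priori SNR must be non-negative")
    return prior_snr / (1.0 + prior_snr)


def decision_directed_prior_snr(noisy_power: float,
                                noise_power: float,
                                prev_clean_power: float,
                                alpha: float = 0.98) -> float:
    """Decision-directed estimate: blend the previous frame's clean-speech
    estimate with the current a posteriori SNR minus one."""
    post_snr = noisy_power / noise_power
    return (alpha * prev_clean_power / noise_power
            + (1.0 - alpha) * max(post_snr - 1.0, 0.0))


def suppress_bin(noisy_mag: float, noise_power: float,
                 prev_clean_power: float, alpha: float = 0.98) -> float:
    """Return the attenuated magnitude of one frequency bin."""
    xi = decision_directed_prior_snr(noisy_mag ** 2, noise_power,
                                     prev_clean_power, alpha)
    return wiener_gain(xi) * noisy_mag


print(suppress_bin(noisy_mag=2.0, noise_power=1.0, prev_clean_power=0.0))
```

Because the noise power is assumed to change slowly, this kind of estimator tracks stationary noise well but reacts poorly to sudden sounds such as a passing car, which is the weakness the text describes.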

The second case is on-device interaction. Live quizzes have been a very popular interactive format in the past few years, and for Double Eleven 2020 we launched a price-guessing activity. Moving the answer input from touch screen to voice requires low latency, high concurrency and a low error rate. With server-side ASR, supporting 100,000 simultaneous online users would take thousands of servers. To address this, we adopted our own offline ASR technology to do speech recognition on the device. The model can be as small as 13M, uses 50M of memory, has a word error rate of 1.3%, and has a recognition latency below 50 ms.
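For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the recognized text and the reference, divided by the number of reference words. The sketch below assumes whitespace-separated words and the standard Levenshtein distance; how the 1.3% figure was measured is not stated in the talk.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("the price is ninety nine yuan",
                      "the price is nineteen nine yuan"))
```

A word error rate of 1.3% therefore means roughly one word-level error per 77 reference words.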

The third case is that voice technology assists live clip editing and short video production. For example, content production for Guangguang can be assisted in the Qinpai app. Content production has many needs, including removing useless segments, automatic captions, voice-over, music tagging, automatic soundtrack, audio speed change, voice change and noise reduction. Backed by a set of technologies and music libraries (the Xiami library of 10 million tracks, ASR and signal processing algorithms), we provide one-click import, one-click editing and other audio functions such as pause removal, music perception, automatic subtitles and automatic voice change. Audio technology greatly improves editing efficiency: editing that used to take 30 minutes can be cut to 3 minutes, with quality reasonably assured.

4.3 Live short Video and Audio solution — TaoAudio

We provide Taobao's businesses with TaoAudio, an audio solution for live streaming and short video. On the business side it supports Taobao Live, Diantao, Guangguang, Qinpai, voice chat rooms and more. Application solutions include live broadcast hotspots, live interaction, live security and short video editing. The algorithm layer has three core technologies: audio processing, audio security and voice interaction. The infrastructure includes on-device inference engines, cloud resources and end devices. In short, the core of audio is good sound quality, a strong interactive experience and platform security, and a rich music experience may follow in the future.

05 Development of Taobao audio and video algorithm

Finally, here is where Taobao's audio and video algorithms are heading.

1. The next-generation APG2, which should achieve higher compression efficiency than the previous generation.

2. Rolling out S266 and applying it in real business scenarios.

3. Exploring AR, 3D and multi-view live streaming. The traditional live streaming format has been fixed for many years, and we want more technologies to improve the interactive and immersive experience.

4. Next-generation narrowband HD technology, delivering higher quality at lower cost.

5. On-device ASR technology. The ASR used in the "price guessing" activity mentioned above needs further accuracy improvements and cost reductions.

6. Scene-adaptive voice enhancement. Traditional voice enhancement does not take the acoustic environment into account and does little adaptation (for example, which model to use in a noisy versus a quiet environment). Adding a scene detection mechanism lets it adapt to the recording scene and the listening environment.

7. Intelligent music and soundtrack services.

8. A large-scale no-reference quality evaluation system.

That is all for this talk. Thank you!