Author | Tian Xiaoxu

Interview guests | Chen Wei, Zhao Xiaohan, Wang Qi

This article is part of the “2021 Technology Roundup” series, focusing on the major developments in audio and video in 2021.

“2021 Year-end Technical Inventory” is a major series launched by Digger, covering Serverless, Service Mesh, big front end, databases, artificial intelligence, audio and video, and many other technical fields. Looking back and looking ahead, it reviews the development of IT technology in 2021, takes stock of the year's major events, and forecasts future trends. Alongside the series, we are launching our 15th technology-themed call for essays, inviting your views on the technology trends of 2022.

Audio and video technology now reaches into every aspect of daily life, from online classes, live streaming and gaming to handling official paperwork. As the metaverse becomes the new hot trend and audio and video technology draws more and more attention, what new developments and changes will audio and video technology see in 2022?

According to Sullivan’s “China Audio and Video Solutions Series Tracking Report for the First Half of 2021”, China’s audio and video market exceeded 30 billion yuan in the first half of 2021, with sub-markets such as pan-entertainment, online vocational education and e-commerce showing great potential. In the second half of 2021, the metaverse concept emerged and was enthusiastically embraced by investors. Although the metaverse is still in its embryonic stage, audio and video is undoubtedly an important means of realizing it. In the short term, the metaverse's popularity will continue in 2022 and drive modest growth in the audio and video market. In the long run, as technologies in multiple fields mature, audio and video applications combined with the metaverse's strongly consumer-facing scenarios are bound to break through the current ceiling and bring fresh real-world deployments.

Beyond imagination: from “usable” to “easy to use”

As the epidemic continues to drive the rapid development of real-time interactive audio and video, contactless communication and collaboration over online video has become a basic need in people’s life, study and work. From its earliest uses in social networking and interactive live streaming, real-time audio and video has expanded to all walks of life, including enterprise collaboration, live shopping, quality education, online medical care, financial services, and public and government services.

According to statistics on nearly 10,000 applications across industries in several major domestic app stores, the penetration rate of real-time audio and video exceeded 30% in 2021. What about the future? “In the next few years, the penetration rate of real-time audio and video technology in key industries will exceed 50 percent,” Zhao Bin, founder and CEO of Agora, said at the RTE conference.

The commercialization of ultra-high-definition (UHD) audio and video was a surprising change in 2021.

There were several notable industry events in UHD audio and video:

  • In November, academician Gao Wen won the first prize of the State Technological Invention Award for the project “Key Technologies of Ultra-High-Definition Video Polymorphic Primitives Encoding and Decoding”.
  • The AVS series of standards, with independent intellectual property rights, has been adopted as an international common format by the Global UHD Alliance and has an important say in the formulation of international standards such as VVC/H.266.
  • UHD real-time encoding and decoding chips have been independently developed, forming a complete industrial chain of “technical standard – chip terminal – system application”.
  • The upcoming Winter Olympics will adopt the AVS standard and 5G+8K UHD live broadcasting, which will be a landmark event.

So what challenges does the commercialization of UHD audio and video face? The most obvious is ensuring a stable network and adequate bandwidth. Any network fluctuation affects audio and video quality, so commercialization requires fast detection of network conditions, together with loss-resilience and transmission strategies that adapt to those conditions.
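As a rough illustration of what an adaptation strategy of this kind might look like, the sketch below picks a video quality level from a measured bandwidth estimate and packet-loss rate. The ladder values, the 0.8 headroom factor, the 5% loss threshold and the names `Rung` and `pick_rung` are all assumptions made for illustration; they are not figures or APIs from any vendor mentioned in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    name: str
    bitrate_kbps: int


# Hypothetical bitrate ladder, ordered from highest to lowest bitrate.
LADDER = [
    Rung("4K", 20000),
    Rung("1080p", 6000),
    Rung("720p", 3000),
    Rung("360p", 800),
]


def pick_rung(bandwidth_kbps: float, loss_rate: float,
              headroom: float = 0.8, loss_threshold: float = 0.05) -> Rung:
    """Return the highest rung whose bitrate fits the usable bandwidth.

    Usable bandwidth is the estimate scaled by `headroom`. When the loss
    rate exceeds `loss_threshold`, it is scaled again by (1 - loss_rate)
    to leave room for redundancy or retransmission (an assumed policy).
    """
    usable = bandwidth_kbps * headroom
    if loss_rate > loss_threshold:
        usable *= 1.0 - loss_rate
    for rung in LADDER:
        if rung.bitrate_kbps <= usable:
            return rung
    return LADDER[-1]  # nothing fits: fall back to the lowest rung


if __name__ == "__main__":
    for bw, loss in [(30000, 0.0), (8000, 0.0), (8000, 0.3), (500, 0.0)]:
        print(bw, loss, pick_rung(bw, loss).name)
```

A production system would re-run a decision like this continuously as new network measurements arrive, and would usually smooth the estimates to avoid oscillating between levels.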

In March 2019, the Ministry of Industry and Information Technology and the Broadcasting Administration jointly issued the Action Plan for the Development of the Ultra-High-Definition Video Industry (2019-2022). The plan states that by 2022, UHD video users will reach 200 million, and that UHD video will be applied at scale in culture, education, entertainment, security monitoring, health care, intelligent transportation, industrial manufacturing and other fields. “The scale of China’s UHD industry was close to 1.2 trillion yuan in 2019, and the overall scale of China’s UHD video industry is expected to exceed 4 trillion yuan by 2022,” said Nie Chenxi, head of the State Administration of Television.

According to the White Paper on the Development of the UHD Video Industry (2021) released by the China Institute of Electronic Information Industry Development, the domestic UHD video market reached 1.8 trillion yuan in 2020, of which direct sales revenue from the core segments of UHD video exceeded 810 billion yuan and industry applications exceeded 980 billion yuan. The direct sales revenue of hardware was about 90 billion yuan, and the sales revenue of solutions and integrated solutions exceeded 890 billion yuan.

This year is the final year of the action plan for the development of the UHD video industry. It remains to be seen how far 5G will drive the commercialization of the UHD video industry.

Another change is that audio and video codecs are in a transitional phase.

Chen Wei, head of Agora’s video engineering team, explained: “At present, most audio and video companies run three generations of codec technology side by side in practice. Typically, H.264 carries most of the traffic in low-latency communication, H.265/HEVC is used more in low-latency live streaming, and H.266/VVC and other new-generation encoders are being trialled in VOD. Meanwhile, Chrome is starting to support AV1 encoding, and WebRTC’s move toward AV1 is accelerating. Compared with VP9, H.264 and H.265, AV1 offers higher compression efficiency and visual quality, and as a free and open video coding format it has simpler patent terms. In 2021, AV1 was quickly taken up by hardware and software developers, and its adoption among Internet companies accelerated.”

In early 2021, Google introduced Lyra, a low-bitrate speech codec based on deep learning. This reignited the long-dormant codec community, and a number of companies have since released AI speech codecs, demonstrating from different technical angles the feasibility of trading computing power for bitrate.

In August 2021, Google detailed SoundStream, its still-experimental audio codec. It is an end-to-end “neural” audio codec that handles speech, music and ambient sound, and it can compress and enhance audio at the same time to remove background noise. SoundStream at 3 kbps is reported to perform close to the EVS codec at 9.6 kbps and to outperform the Opus codec at 12 kbps. It also performed better than the then-current version of Lyra at the same bitrate. A few months earlier, most practitioners would not have believed that 3 kbps could encode a listenable music signal.
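To put these bitrates in perspective, the short calculation below converts a constant bitrate into the payload size of one minute of audio. It is plain arithmetic on the figures quoted above, assuming a constant bitrate and ignoring packet headers; the helper name `kilobytes_per_minute` is made up for this illustration.

```python
def kilobytes_per_minute(bitrate_kbps: float) -> float:
    """Payload of one minute of audio at a constant bitrate, in kB (1 kB = 1000 bytes)."""
    bits = bitrate_kbps * 1000 * 60  # kbps -> bits per minute
    return bits / 8 / 1000           # bits -> bytes -> kilobytes


if __name__ == "__main__":
    for name, kbps in [("SoundStream", 3.0), ("EVS", 9.6), ("Opus", 12.0)]:
        print(f"{name} at {kbps} kbps: {kilobytes_per_minute(kbps):.1f} kB per minute")
```

At 3 kbps, a minute of audio is 22.5 kB of payload, compared with 72 kB at 9.6 kbps and 90 kB at 12 kbps.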

2021 was a breakthrough year for codecs; going forward, what matters more is finding the direction for the next breakthrough.

Besides codec technology, transmission is another important part of the audio and video field. The third change in 2021 is therefore that low-latency audio and video interaction is becoming the direction of future development.

In 2021, the pursuit of low-latency audio and video transmission continued. Many applications, such as the popular cloud gaming and real-time chorus scenarios, require RTC service providers to bring end-to-end transmission delay below 100 ms. According to Wang Qi, product manager at Agora, Agora can currently reach 64 ms under ideal conditions.
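One way to reason about a target like this is as a latency budget: the end-to-end delay is the sum of the delays of each stage in the pipeline, and the total has to stay under the target. The sketch below shows that bookkeeping. The stage names and per-stage values are hypothetical placeholders, not measurements from Agora or any other provider; only the 100 ms target comes from the text above.

```python
from typing import Mapping

TARGET_MS = 100  # end-to-end target quoted in the article

# Hypothetical per-stage delays in milliseconds (illustrative values only).
STAGES_MS = {
    "capture": 10,
    "encode": 10,
    "network": 30,
    "jitter_buffer": 20,
    "decode": 5,
    "render": 10,
}


def end_to_end_delay(stages: Mapping[str, float]) -> float:
    """Total delay is the sum of all stage delays."""
    return sum(stages.values())


def remaining_budget(stages: Mapping[str, float], target_ms: float = TARGET_MS) -> float:
    """Positive means headroom is left; negative means the target is missed."""
    return target_ms - end_to_end_delay(stages)


if __name__ == "__main__":
    print("total:", end_to_end_delay(STAGES_MS), "ms")
    print("headroom vs target:", remaining_budget(STAGES_MS), "ms")
```

Laying the budget out this way makes it clear which stage to attack first when a scenario such as real-time chorus demands a tighter target.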

Also noteworthy is the rapid rise in Clubhouse’s valuation from $100 million to $4 billion. One of the main reasons behind it is a realistic sense of presence: the low latency of RTC technology raises audience participation and recreates the experience of an offline salon.

The eve of the revolution: audio and video on the metaverse wave

The explosion of the metaverse also means that audio and video technology is at the forefront of the revolution.

According to Tianyancha, as of December 30, 2021, there had been more than 12,000 trademark applications with “meta-universe” in the name, and more than 1,700 and more than 1,000 trademark applications containing “Meta” and “METAVERSE”, respectively.

“Augmented reality” and “virtual reality” are two technical directions closely related to the metaverse. Augmented reality captures the state of the real world through audio and video technology and modifies it, involving technologies such as UHD video transmission, real-time transmission of live VR video, and 3D light acquisition and rendering.

Virtual reality builds virtual elements into the real world through technical means, or adds real elements to the virtual world, so that the virtual and the real merge and influence each other. It involves technologies such as AR real-time projection and real-time background segmentation.

So what will happen to audio and video in 2022, buoyed by the metaverse concept? In the view of Chen Wei, senior video algorithm engineer at Agora, high-definition video, immersive experience and low-latency interaction will be the new directions of development as network and terminal devices are upgraded.

  • HD video: As devices and networks become widespread, HD video applications will become a new trend. 4K/8K resolution, 10-bit color depth, HDR (high dynamic range), high frame rates of 60 fps, and UHD video capture, encoding and decoding, and rendering will all become commonplace.
  • Immersive experiences: The explosion of the metaverse has driven the development of immersive video experiences. VR/AR/MR/XR is moving from concept to commercialization, from flat images to 3D scenes, from fixed viewpoints to free viewpoints, and from real scenes to a blend of virtual and real. Immersive experiences will keep evolving, and the combination of image and graphics technology will bring immersive realism and interactivity.
  • Low-latency interaction: With upgrades to terminal equipment and network infrastructure, and the optimization of codec and transmission technologies, low-latency interactive scenarios will be further developed and popularized. On the one hand, real-time audio and video will permeate all walks of life and many application scenarios, becoming a piece of infrastructure. On the other hand, new scenarios such as online karaoke, cloud gaming, the industrial Internet and autonomous driving place higher demands on end-to-end delay, which further drives codec development, network architecture upgrades and transmission protocol optimization.

“I agree with the general trend toward the metaverse,” Wang Qi said. “I’m looking forward to seeing what sparks audio and video practitioners will strike in the metaverse over the next three to five years.”

Challenges are everywhere, and the answer is to go deep into the underlying technology

Opportunities and challenges coexist. In 2022, the audio and video industry is in a period of rapid development. For practitioners, the first priority is to adapt quickly to change. As audio and video penetrates more industries, audio and video technologies and innovative applications should be actively combined with 5G, AI, cloud computing, computer vision, graphics and other technologies.

Accordingly, at the level of more fundamental technology, Agora audio algorithm engineer Zhao Xiaohan identified the following possible challenges for putting audio and video technology into practice in 2022:

  • Smarter, more robust noise reduction algorithms: personalized noise reduction that meets industry needs, which may evolve from the convergence of speech and audio technologies, as Lyra did.
  • More immersive spatial audio algorithms and deployed end-to-end spatial reconstruction solutions: existing spatial audio algorithms essentially construct a sense of space, and refining algorithmic details helps make the constructed space more realistic. In the long run, faithful end-to-end spatial audio reconstruction may be the ultimate form, though it depends heavily on hardware advances and lower bandwidth costs.
  • HD audio codecs combined with AI: AI-based speech codecs have cut bandwidth from 9 kbps to 3 kbps. The next step worth studying is a cut from 196 kbps to 48 kbps, which offers a very high return relative to risk in terms of the absolute bandwidth saved.
  • More accurate and credible QoE quantification: this has two parts. One is offline quality assessment, which calls for more professional testing methodology and more accurate measurement hardware. The other is online quality assessment, which is mainly a matter of algorithm design, that is, how to use online data to directly derive experience indicators aligned with offline tests.
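The point about absolute bandwidth in the codec item above can be checked with simple arithmetic. The sketch below compares the two reductions mentioned there in both absolute and relative terms; the function name `savings` is made up for this illustration, and the only inputs are the bitrates quoted in the list.

```python
from typing import Tuple


def savings(before_kbps: float, after_kbps: float) -> Tuple[float, float]:
    """Return (absolute saving in kbps, relative saving as a fraction of the original)."""
    if before_kbps <= 0:
        raise ValueError("before_kbps must be positive")
    absolute = before_kbps - after_kbps
    return absolute, absolute / before_kbps


if __name__ == "__main__":
    for label, before, after in [("speech", 9, 3), ("HD audio", 196, 48)]:
        absolute, relative = savings(before, after)
        print(f"{label}: {before} -> {after} kbps saves {absolute} kbps ({relative:.1%})")
```

The relative savings are similar (about 67% versus about 76%), but the absolute saving for HD audio is 148 kbps per stream against 6 kbps for speech, which is why the HD case is described as the more rewarding target.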

Zhao Xiaohan said: “As the business and economic environment changes, practitioners should focus more on researching and optimizing the underlying technology, either pushing a particular technology to its limit or finding a new direction to work in, so that technology that reached 95% during the business boom is taken to 99%.”


Chen Wei, head of the Agora video engineering team

He was successively responsible for image processor chip and solution development at Hays Semiconductor, Happy and AMD. He has many years of experience in image processing, multimedia and real-time communication solutions, and currently heads the big front-end video engineering team at Agora.

Zhao Xiaohan, Agora Audio algorithm engineer

Graduated from Beijing Institute of Technology and joined Agora as an audio algorithm engineer. Has worked on algorithm research and implementation for the SOLOX series of codecs, noise reduction, frame loss compensation, and a real-time voice quality monitoring system.

Wang Qi, Agora product manager

Was successively in charge of communication and audio/video B2B products at Huawei, Wangsu and Tencent. Familiar with the audio and video products of mainstream cloud vendors, with a deep understanding of product business models and the details of operation and expansion in practice. Now in charge of the pan-entertainment and overseas business at Agora.

Has more than ten years of hands-on experience as a B2B product manager, with many successful product design and commercialization cases covering wireless communication, big data, live streaming and real-time audio and video. Has a deep understanding of and rich experience in communication-industry applications, and a well-developed methodology for B2B products.

During the epidemic, served Tencent Conference, Gaosi Classroom and other applications through audio and video PaaS products, supporting tens of millions of students and staff attending classes and working remotely, with more than 3 billion call minutes per day.
