From the invention of the telephone in 1860 to today’s voice interaction over the Internet, voice has always been the most natural and fundamental form of real-time interaction. Over the past few years, real-time voice interaction has become part of daily life for more and more people. However, everyone eventually runs into a weak network environment, which directly degrades the voice call experience. That is why Agora keeps applying cutting-edge technology to improve the voice calling experience: we are now the first in China to officially launch a machine-learning-based speech codec (speech AI Codec), Agora Silver. It delivers ultra-wideband sound quality at a 32 kHz sampling rate at an ultra-low bit rate, and further improves sound quality and naturalness with an AI noise-reduction algorithm.

Why introduce AI into traditional codecs?

In voice interaction, every user runs into weak networks. Some are caused by poor network infrastructure in the area; others occur in well-covered areas where congestion during peak hours still reduces the effective bandwidth allocated to each user. No one can guarantee that a network stays stable all the time; weak network conditions are here to stay.

Faced with a weak network, applications usually reduce the bit rate to lower bandwidth consumption and avoid choppy audio. But while this approach solves the problem of stuttering and dropouts, it introduces new ones.

At very low bit rates, traditional codecs preserve only a certain level of speech intelligibility (that is, you can still make out what the other person is saying), but they struggle to retain other information such as timbre. For example, at 6 kbps Opus can only encode narrowband speech, leaving an effective spectral bandwidth of just 4 kHz. What does that mean in practice?

Opus is the most widely used audio codec in the industry and the default codec in WebRTC. To adapt to different network conditions, its bit rate can be adjusted between 6 kbps and 510 kbps, so when the network is weak or bandwidth is limited it can drop to a minimum of 6 kbps. At that bit rate, only narrowband speech can be encoded. By industry convention, narrowband speech is coded at an 8 kHz sampling rate. According to the sampling theorem (the Nyquist sampling theorem), a digital signal can only be reconstructed into the original sound if the sampling frequency is more than twice the highest frequency in the audio signal. In other words, with an 8 kHz sampling rate the effective spectral bandwidth is only 4 kHz. The human voice sounds dull because much of the high-frequency content is lost.
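To make the Nyquist limit concrete, here is a minimal Python sketch (not codec code, just synthetic tones with numpy/scipy) showing that once audio is resampled to 8 kHz, nothing above 4 kHz survives:

```python
# Minimal sketch of the Nyquist limit using synthetic tones (not codec code).
import numpy as np
from scipy.signal import resample_poly

fs_wide = 32000                       # ultra-wideband sampling rate (Hz)
t = np.arange(fs_wide) / fs_wide      # 1 second of audio
# A 1 kHz tone (within the narrowband range) plus a 6 kHz tone (above 4 kHz)
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)

# Downsample to 8 kHz, as a narrowband codec path would
x_nb = resample_poly(x, up=1, down=4)

def dominant_freqs(sig, fs):
    """Return the frequencies of the strongest spectral peaks."""
    spec = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1 / fs)
    return freqs[spec > 0.1 * spec.max()]

print(dominant_freqs(x, fs_wide))     # peaks near 1000 Hz and 6000 Hz
print(dominant_freqs(x_nb, 8000))     # only the 1000 Hz peak survives
```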

After so many years of development, it has become difficult for traditional codecs to break through this bottleneck with algorithmic tuning alone. With the continuous progress of AI speech synthesis, especially WaveRNN-based speech generation, people have found that combining AI with an audio codec can reconstruct speech more completely at even lower bit rates.

What is a speech AI Codec?

The industry has already explored many ways of combining AI with audio codecs: some use WaveRNN on the decoding side to improve sound quality at low bit rates, others use AI on the encoding side to improve compression efficiency. Broadly speaking, any codec that uses machine learning or deep learning to compress or decode speech is a speech AI Codec.

The difficulties speech AI Codecs face today

AI has been explored in the design and development of many codec standards, and speech AI Codecs have begun to move from academia and standards work into practical business scenarios. For example, Google’s recently released Lyra runs at 3 kbps and reconstructs 16 kHz wideband speech. Its idea is to use a machine learning model on the decoding side to reconstruct a higher-quality signal from the low-bit-rate speech data it receives, making it sound closer to the original. Microsoft’s Satin takes a similar approach, delivering ultra-wideband speech at a 32 kHz sampling rate from 6 kbps.

However, compared with traditional codecs, speech AI Codecs still need to overcome several difficulties before they can be applied in practice:

Noise robustness

According to Shannon’s theorem, achieving a low bit rate requires a higher signal-to-noise ratio. Because speech AI Codec decoders mostly use a speech generation model to produce the audio signal, the most noticeable effect of background noise is that it turns into unnatural, speech-like artifacts, which badly degrades the listening experience. Combined with low-bit-rate compression, noise can quickly destroy intelligibility, making it sound as if the person on the other end is slurring their speech. In practice, therefore, an effective noise-reduction module is usually needed as a preprocessing step before encoding.
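To illustrate the "denoise, then encode" order, here is a hedged sketch: the toy spectral gate below stands in for a real noise-reduction module (it is not Silver's algorithm), and encode is a hypothetical placeholder for any low-bit-rate speech encoder.

```python
# Hedged sketch of the "denoise, then encode" pipeline described above.
# The denoiser is a toy spectral gate, not Silver's AI noise reduction;
# encode() is a hypothetical stand-in for any low-bit-rate speech encoder.
import numpy as np

def toy_denoise(frame, noise_floor):
    """Attenuate spectral bins whose magnitude is near the assumed noise floor."""
    spec = np.fft.rfft(frame)
    mag = np.abs(spec)
    gain = np.clip((mag - noise_floor) / (mag + 1e-12), 0.0, 1.0)
    return np.fft.irfft(spec * gain, n=len(frame))

def process_stream(frames, noise_floor, encode):
    """Denoise each frame *before* handing it to the low-bit-rate encoder."""
    for frame in frames:
        yield encode(toy_denoise(frame, noise_floor))
```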

Optimizing the algorithm model for mobile devices

AI decoding models often demand a lot of computing power. The speech generation model used on the decoding side is time-consuming, yet real-time interaction requires the model to run in real time on most mobile devices, because that is where most real-time interaction happens. For example, Google’s open-source Lyra, measured on a Kirin 960 chip, takes 40 ms to 80 ms to decode an audio packet containing 40 ms of speech. On a phone with that chip, such as the Huawei Honor 9, Lyra therefore cannot be used for real-time interaction. And that is only single-channel decoding: if multi-channel decoding is needed (a real-time conversation among several people), the compute cost multiplies, and ordinary devices may not be able to keep up. So for a speech AI Codec to be usable in real-time interactive scenarios, its compute cost on mobile devices must be optimized to meet real-time and latency requirements.
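A quick way to see why those numbers rule out real-time use is the real-time factor (decode time divided by audio duration); the small sketch below simply plugs in the figures quoted above:

```python
# Real-time factor (RTF) = decode time / audio duration; RTF >= 1 means
# the decoder cannot keep up with live audio.
def realtime_factor(decode_ms: float, frame_ms: float, channels: int = 1) -> float:
    return channels * decode_ms / frame_ms

# Lyra on a Kirin 960, per the measurement quoted above: 40-80 ms per 40 ms packet
print(realtime_factor(40, 40))              # 1.0 -> already at the limit
print(realtime_factor(80, 40))              # 2.0 -> cannot run in real time
print(realtime_factor(80, 40, channels=3))  # 6.0 -> a three-party call is far worse
```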

The trade-off between speech naturalness and computing power

Getting natural-sounding speech usually requires a model with higher computational cost, which is tightly coupled with the second challenge above.

A model with low computational cost can produce noticeable distortion and unnatural-sounding speech. For example, the most natural sample-by-sample generation models currently require roughly 3 to 20 GFLOPS of computation. Speech intelligibility and naturalness are generally evaluated with MUSHRA (a subjective listening test method for streaming and communication codecs, scored out of 100). A 20 GFLOPS model such as WaveRNN can reach a MUSHRA score of around 85, while a lighter model such as the 3 GFLOPS LPCNet scores around 75.
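As a rough feasibility check (the device throughput figure below is an illustrative assumption, not a measured value), dividing a model's per-second-of-audio compute by a mobile core's sustained throughput estimates how much of that core one decoding stream would occupy:

```python
# Rough feasibility check. The GFLOPS figures above are compute per second of
# generated audio; DEVICE_GFLOPS is an assumed, illustrative sustained
# throughput for one mobile CPU core, not a measured value.
DEVICE_GFLOPS = 10.0

def core_utilisation(model_gflops: float) -> float:
    """Fraction of one core's budget consumed by a single decoding stream."""
    return model_gflops / DEVICE_GFLOPS

print(core_utilisation(20))  # WaveRNN-class model: 2.0 -> not real time on one core
print(core_utilisation(3))   # LPCNet-class model: 0.3 -> feasible, at lower quality
```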

Silver’s features and comparison test results

In the Silver codec, we address these three problems with our own algorithms. As shown in the figure below, Silver first uses a real-time full-band AI noise-reduction algorithm to provide noise robustness. On the decoding side, Silver builds on a deeply optimized WaveRNN model to decode speech with minimal computational cost.

Features of Silver include:

1. Solves the noise robustness problem: it integrates our independently developed real-time full-band AI noise-reduction algorithm.

2. The machine learning model runs on mobile devices: a deeply optimized WaveRNN model decodes speech with minimal computational cost. Measured on a single core of a Qualcomm Snapdragon 855, decoding a 40 ms speech signal takes only 5 ms, smoothly supporting all kinds of real-time interactive scenarios.

3. Ultra-low bit rate: a minimum bit rate of 2.7 kbps, which saves even more bandwidth (see the rough estimate after this list).

4. High sound quality: it supports a 32 kHz sampling rate and ultra-wideband coding quality, with a full, natural timbre.
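The back-of-the-envelope sketch below, using only the figures listed above, shows the bandwidth saving at 2.7 kbps versus Opus’s 6 kbps floor and Silver’s decoding real-time factor from the Snapdragon 855 measurement:

```python
# Back-of-the-envelope figures using only the numbers listed above.
def payload_bytes_per_second(bitrate_kbps: float) -> float:
    """Codec payload only; packet headers and transport overhead are excluded."""
    return bitrate_kbps * 1000 / 8

print(payload_bytes_per_second(6.0))   # Opus at its 6 kbps floor: 750 bytes/s
print(payload_bytes_per_second(2.7))   # Silver at 2.7 kbps: 337.5 bytes/s (~55% less)

# Decoding cost from the Snapdragon 855 measurement: 5 ms per 40 ms frame
print(5 / 40)                          # RTF = 0.125, leaving headroom for multi-party calls
```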

We compared the speech intelligibility and naturalness of Silver, Opus (6 kbps), and Lyra using the MUSHRA methodology, as shown in the figure below. REF is the full-score (high) anchor and Anchor35 is the low-score anchor; the original recordings and deliberately degraded synthetic data are mixed into the test corpus as anchors for scoring. We tested three languages, and Silver scored higher than the other codecs in all of them.

We also tested the three codecs in different noise environments; the scoring results are shown below. With the help of the AI noise-reduction algorithm, Silver gives users more natural voice interaction.

For both noise-free and noisy environments, you can compare the original sound with the output of the different codecs using the audio clips we prepared. Since this platform cannot host audio, interested developers can click “here” to listen to the full audio comparison.

Due to space constraints, only a limited amount of audio can be shared. If you want to learn more about Silver, please visit the audio developer community and leave a comment in the forum.