From the invention of the telephone in the 19th century to voice interaction over the Internet today, voice has always been the most natural and fundamental form of real-time interaction. Over the past few years, real-time voice interaction has become part of more and more people's daily lives. But everyone runs into a weak network at some point, and that directly degrades the voice call experience. Agora therefore keeps applying cutting-edge technology to improve call quality. We are now the first in China to officially launch a speech AI codec based on machine learning: Agora Silver. It delivers ultra-wideband (UWB) sound quality at a 32 kHz sampling rate and an ultra-low bit rate, and uses an AI noise reduction algorithm to further improve the sound quality and naturalness of speech.
Why bring AI into traditional codecs?
During voice interaction, every user will encounter a weak network sooner or later. Sometimes the cause is poor network infrastructure in the region. In other cases the infrastructure is good, but congestion at peak hours still reduces the bandwidth that is effectively available to each user. No one can guarantee a stable network at all times, so weak network conditions are a long-term reality.
The usual response to a weak network is to lower the bit rate, which reduces bandwidth usage and avoids stuttering audio. But while this keeps the call from freezing or becoming unusable, it creates new problems.
At very low bit rates, traditional codecs can preserve only a certain degree of speech intelligibility (that is, the ability to make out what is being said) and struggle to preserve other information, such as timbre. For example, at a bit rate of 6 kbps Opus can only encode narrowband speech, with an effective spectral bandwidth of just 4 kHz. What does that mean in practice?
Opus is the most widely used audio codec in the industry and is the default codec for WebRTC. To adapt to different network conditions, its bit rate can be adjusted between 6 kbps and 510 kbps. On a weak network, or when bandwidth is limited, the bit rate can be lowered to the minimum of 6 kbps. At this rate, only narrowband speech coding is possible. By industry definition, narrowband speech coding uses a sampling rate of 8 kHz. According to the sampling theorem, also known as the Nyquist sampling theorem, a digital signal can be restored to the original sound only if the sampling frequency is more than twice the highest frequency in the sound signal. In other words, at a sampling rate of 8 kHz the effective spectral bandwidth is only 4 kHz. The voice sounds dull because much of its high-frequency content is lost.
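The relationship between sampling rate and usable audio bandwidth described above can be sketched in a few lines of Python. This is only an illustration of the Nyquist limit; the function name and the band table are chosen for this example and do not come from any codec library.

```python
def effective_bandwidth_hz(sample_rate_hz: float) -> float:
    """Return the Nyquist frequency: the upper limit of the representable spectrum."""
    return sample_rate_hz / 2.0


# Sampling rates mentioned in this article (labels are illustrative).
SAMPLE_RATES_HZ = {
    "narrowband (Opus at 6 kbps)": 8_000,
    "wideband (16 kHz)": 16_000,
    "32 kHz sampling": 32_000,
}

for label, rate in SAMPLE_RATES_HZ.items():
    limit = effective_bandwidth_hz(rate)
    print(f"{label}: {rate} Hz sampling -> audio content up to {limit:.0f} Hz")
```

Doubling the sampling rate doubles the spectrum that can be carried, which is why a 32 kHz codec can keep high-frequency detail that an 8 kHz narrowband codec has to discard.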
After so many years of development, algorithm tuning alone can hardly help traditional codecs get past this bottleneck. With the continued progress of AI speech synthesis, especially WaveRNN-based speech generation, it has been found that combining AI with an audio codec can restore speech more completely at lower bit rates.
What is a voice AI codec?
The industry is currently exploring many ways of combining AI with audio codecs. Some approaches use WaveRNN at the decoder to improve sound quality at low bit rates, while others use AI at the encoder to improve compression efficiency. In a broad sense, then, any codec that uses machine learning or deep learning to compress or decode speech can be regarded as a voice AI codec.
The difficulties voice AI codecs face today
AI has also been explored in the design and development of many codec standards, and voice AI codecs have moved from academic research and standardization to deployment in real business scenarios. For example, Google's recently released Lyra can restore 16 kHz wideband speech at a bit rate of 3 kbps. It uses a machine learning model at the decoder to reconstruct a high-quality signal from the low-bit-rate speech data it receives, so that the restored voice sounds closer to the original. A similar voice AI codec is Microsoft's Satin, which can restore UWB speech at a 32 kHz sampling rate from a 6 kbps bit rate.
However, compared with traditional codecs, voice AI codecs still face several difficulties in practical applications:
Noise robustness
According to Shannon's theorem, a lower bit rate calls for a higher signal-to-noise ratio. Because most voice AI codecs decode by running a speech generation model that synthesizes the audio signal, noise in the input tends to come out as unnatural, speech-like artifacts, which seriously hurts the listening experience. Combined with low-bit-rate compression, noise can also cause intelligibility to drop sharply, so that the person on the other end sounds as if they were slurring their words. In practice, therefore, a good noise reduction module is usually needed as a pre-processing step before encoding.
Model optimization for mobile devices
AI models often require a lot of computing power for decoding. The speech generation models used at the decoder are expensive to run, yet real-time interaction requires them to run in real time on most mobile devices, because that is where most real-time interaction takes place. For example, in our measurements on the Kirin 960 chip, Google's open-source Lyra needed 40 ms to 80 ms to decode an audio packet containing 40 ms of audio. On a phone with this chip, such as the Huawei Honor 9, Lyra therefore cannot be used in real-time interactive scenarios. And that is only single-channel decoding. Multichannel decoding (real-time communication among several people) multiplies the computing power required, and the average device may not be able to keep up. So for a voice AI codec to be usable in real-time interactive scenarios, it must also be optimized for mobile devices to meet real-time and latency requirements.
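To make the real-time requirement concrete, the sketch below computes the real-time factor (decode time divided by the duration of audio decoded) for the Lyra measurement quoted above. The function names are made up for this illustration, and the model assumes a single core decoding streams one after another with no other load, which is a simplification.

```python
def real_time_factor(decode_ms: float, audio_ms: float) -> float:
    """Decode time divided by audio duration; real-time use needs a value below 1.0."""
    return decode_ms / audio_ms


def max_realtime_streams(decode_ms: float, audio_ms: float) -> int:
    """Streams one core could decode back to back within a single packet period."""
    return int(audio_ms // decode_ms)


PACKET_MS = 40.0  # audio carried by one packet in the measurement above

for decode_ms in (40.0, 80.0):  # best and worst decode times reported above
    rtf = real_time_factor(decode_ms, PACKET_MS)
    streams = max_realtime_streams(decode_ms, PACKET_MS)
    print(f"decode {decode_ms:.0f} ms -> RTF {rtf:.2f}, streams per core: {streams}")
```

A real-time factor of 1.0 leaves no headroom for capture, network handling, or rendering, and anything above 1.0 means the decoder falls further behind with every packet. This is why a multi-party call, which decodes several streams at once, needs a real-time factor well below 1.0.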
The trade-off between speech naturalness and computing power
Natural-sounding speech usually requires a more powerful model, which ties directly into the second challenge above.
Models with less computing power can produce a lot of distortion and unnatural sound in the generated speech. For example, the most natural-sounding models, which generate speech sample by sample, usually require 3 to 20 GFLOPS of computation. The intelligibility and naturalness of speech generation models are generally evaluated with MUSHRA (Multiple Stimuli with Hidden Reference and Anchor), a subjective listening test for streaming-media and communication codecs. A 20 GFLOPS model such as WaveRNN can reach a MUSHRA score of around 85, while a lighter model such as the 3 GFLOPS LPCNet reaches only about 75.
Silver's features and comparative test results
In the Silver codec, we solve all three of these problems with self-developed algorithms. As shown in the figure below, Silver first applies a real-time full-band AI noise reduction algorithm to provide noise robustness. On the decoding side, Silver uses a deeply optimized WaveRNN model to decode speech with minimal computing power.
<center size=1>Silver codec flowchart </center>
Silver’s features include:
1. Noise robustness: Silver is combined with our self-developed real-time full-band AI noise reduction algorithm.
2. A machine learning model that runs on mobile devices: a deeply optimized WaveRNN model decodes speech with minimal computing power. Measured on a single core of the Qualcomm 855, it takes only 5 ms to decode 40 ms of speech, so it smoothly supports a wide range of real-time interactive scenarios.
3. Ultra-low bit rate: the bit rate can go as low as 2.7 kbps, saving more bandwidth.
4. High sound quality: Silver supports a 32 kHz sampling rate and ultra-wideband coding quality, so the sound is full and natural.
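The figures in the list above can be put in perspective with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes a 40 ms packet (the duration used in the decoding measurement, not a documented Silver frame size) and compares against uncompressed 16-bit mono PCM at 32 kHz, which is also an assumption made for this example.

```python
def payload_bits(bitrate_bps: float, packet_ms: float) -> float:
    """Number of coded bits available for one packet at a given bit rate."""
    return bitrate_bps * packet_ms / 1000.0


PACKET_MS = 40.0        # assumption: one packet carries 40 ms of speech
SILVER_BPS = 2_700.0    # lowest Silver bit rate quoted above
OPUS_LOW_BPS = 6_000.0  # lowest Opus bit rate quoted earlier
PCM_BPS = 32_000 * 16   # assumption: uncompressed 16-bit mono PCM at 32 kHz

silver_bits = payload_bits(SILVER_BPS, PACKET_MS)
opus_bits = payload_bits(OPUS_LOW_BPS, PACKET_MS)
saving = 1.0 - SILVER_BPS / OPUS_LOW_BPS
compression = PCM_BPS / SILVER_BPS

print(f"Silver payload per packet: {silver_bits:.0f} bits")
print(f"Opus (6 kbps) payload per packet: {opus_bits:.0f} bits")
print(f"Bit rate saving versus Opus at 6 kbps: {saving:.0%}")
print(f"Compression relative to raw PCM: about {compression:.0f}:1")
```

Under these assumptions each 40 ms packet carries only 108 bits of coded speech, which gives a sense of how much of the final signal has to be reconstructed by the generative model at the decoder rather than transmitted.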
We compared the speech intelligibility and naturalness of Silver, Opus (6 kbps), and Lyra using the MUSHRA method. The results are shown in the figure below. REF is the full-score anchor and Anchor35 is the low-score anchor: the original speech (full-score anchor) and poorly synthesized speech (low-score anchor) are mixed into the test material and scored along with the codecs. We tested three languages, and Silver scored higher than the other codecs in each.
We also compared the same three codecs under different noise conditions. The scores are shown below. Powered by its AI noise reduction algorithm, Silver provides users with more natural voice interaction.
In both noisy and noise-free environments, you can compare the original sound with the sound transmitted through each codec. Since this platform cannot host audio, interested developers can listen here.
Due to space constraints, the amount of audio that can be shared is limited. If you’d like to learn more about Silver, please visit the audio developer community and let us know in the comments section below.