Social audio has been around for decades, but the recent “interactive podcasting” boom has put audio interaction back in the spotlight. We don’t want to talk about the hype, though. We want to talk about the real-time audio interaction technology underneath it, built on decades of hands-on engineering experience.

What factors, from software algorithms to transport architecture, affect the sound quality of your calls? Why isn’t lower delay always better? How can machine learning and big data help optimize transmission and sound quality? Starting today, a four-part series on audio technology will answer these questions from multiple angles and share practical experience from the field.

We described the end-to-end audio and video transmission pipeline in “Detailed Low Latency”, as shown in the figure below. Every stage of this end-to-end pipeline presents technical challenges that affect delay and sound quality.

The next four articles will cover the technical principles and optimization ideas behind high sound quality and low delay, looking in turn at codecs, noise reduction and echo cancellation algorithms, network transmission, and sound quality tuning.

This article starts with the speech codec. But before diving into voice codecs specifically, it helps to understand how audio codecs work in general, so we can more quickly see what shapes the sound quality experience.

Speech coding and music coding

First, a brief introduction for engineers who are not yet familiar with how codecs work and what they do.

Audio coding is the process of converting an audio signal into a digital stream (as shown in the figure below). The encoder analyzes the audio signal to produce a set of parameters, then writes those parameters into a stream of bits according to agreed rules; this stream is known as the bitstream (or code stream). On receiving the bitstream, the decoder recovers the parameters according to the same rules and uses them to reconstruct the audio signal.

Audio codecs have a long history. The core algorithm of early codecs was nonlinear quantization, which looks simple by today’s standards. Its compression efficiency is not high, but it works for most audio types, speech and music alike. Later, as the technology developed and the field specialized, codecs evolved along two separate paths: speech codecs and music codecs.
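To make “nonlinear quantization” concrete, here is a minimal Python sketch of μ-law companding, the idea behind the classic G.711 telephone codec (simplified here to the continuous formula rather than G.711’s segmented table): quiet samples get finer quantization steps than loud ones, which matches how we hear.

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Compand a signal in [-1, 1] with the mu-law curve, then quantize to 8 bits."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((companded + 1) / 2 * 255).astype(np.uint8)

def mulaw_decode(codes, mu=255):
    """Undo the 8-bit quantization and the mu-law companding."""
    companded = codes.astype(np.float64) / 255 * 2 - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu

# A quiet 440 Hz tone survives the 8-bit round trip with small error,
# because mu-law spends its quantization steps where samples are small.
x = 0.3 * np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
x_hat = mulaw_decode(mulaw_encode(x))
print("max round-trip error:", np.max(np.abs(x - x_hat)))
```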

Speech codecs, used mainly to encode speech signals, gradually converged on the time-domain linear prediction framework. This framework exploits the articulation characteristics of the human vocal tract and decomposes the speech signal into dominant linear prediction coefficients plus a smaller residual signal. The linear prediction coefficients take very few bits to encode, yet they efficiently build the “skeleton” of the speech signal (think of it as hearing roughly what is being said, but not who is saying it); the residual signal is the “flesh and blood,” adding detail to the speech signal (with flesh and blood, you can tell who is speaking). This design greatly improves compression efficiency for speech, but at limited complexity the time-domain linear prediction framework cannot encode music signals well.
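As an illustrative sketch (not any production codec’s implementation), the snippet below shows the core of linear-prediction analysis: estimate the predictor coefficients from a frame’s autocorrelation, inverse-filter to get the residual, and check that the synthesis filter rebuilds the frame when the residual is kept intact. A real codec would quantize both parts, spending very few bits on the coefficients.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_analyze(frame, order=10):
    """Split a frame into LPC coefficients (the "skeleton") and a residual
    (the "flesh and blood") via the normal equations and inverse filtering."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])      # predictor coefficients
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
    return a, residual

def lpc_synthesize(a, residual):
    """Rebuild the frame by driving the LPC synthesis filter with the residual."""
    return lfilter([1.0], np.concatenate(([1.0], -a)), residual)

frame = np.random.randn(320) * np.hanning(320)   # stand-in for a 20 ms frame @ 16 kHz
a, res = lpc_analyze(frame)
rebuilt = lpc_synthesize(a, res)
print("max round-trip error:", np.max(np.abs(frame - rebuilt)))  # ~0
```

With an unquantized residual the round trip is exact; the compression win comes from the residual needing far fewer bits than raw samples once the predictable structure has been stripped out.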

Music codecs, which encode music signals, evolved in a different direction. Compared with the time-domain representation, a signal’s frequency-domain representation concentrates its information in a few frequency points, which makes it easier for the encoder to analyze and compress. So music codecs generally encode the signal in the frequency domain.
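A toy experiment makes the energy-compaction point (a sketch only; real music codecs use transforms such as the MDCT together with psychoacoustic models): a steady three-note chord spread across 2,048 time-domain samples collapses into a handful of significant frequency bins.

```python
import numpy as np

fs = 32000
t = np.arange(2048) / fs
# A steady three-note chord (A4, C#5, E5) occupying 2,048 time-domain samples.
chord = sum(np.sin(2 * np.pi * f * t) for f in (440.0, 554.37, 659.25))

power = np.abs(np.fft.rfft(chord * np.hanning(len(chord)))) ** 2
top = np.sort(power)[::-1]
# Nearly all of the energy sits in a dozen of the 1,025 frequency bins:
print("energy share of the 12 largest bins:", top[:12].sum() / power.sum())
```

An encoder can therefore spend its bits on those few bins and barely touch the rest, which is much harder to do sample by sample in the time domain.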

Later, as the technology matured, the two architectures merged again into the speech-music hybrid codec, exemplified by Opus, the default codec in WebRTC. A hybrid codec combines both encoding frameworks and automatically switches to the appropriate one based on the signal type. Well-known products in China and abroad, such as Discord, use Opus.
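Conceptually, a hybrid codec’s mode decision amounts to classifying each frame and routing it to the better-suited framework. The sketch below is purely illustrative (Opus’s real classifier is far more sophisticated); it uses spectral flatness, with a made-up threshold, as the speech/music cue.

```python
import numpy as np

def spectral_flatness(frame):
    """Geometric mean over arithmetic mean of the power spectrum:
    close to 1 for noise-like frames, close to 0 for strongly tonal ones."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def choose_framework(frame, threshold=0.3):
    # Made-up decision rule for illustration only.
    if spectral_flatness(frame) > threshold:
        return "time-domain (linear prediction) framework"
    return "frequency-domain framework"

fs = 48000
noise_like = np.random.randn(960)                     # 20 ms @ 48 kHz
tonal = np.sin(2 * np.pi * 440 * np.arange(960) / fs)
print(choose_framework(noise_like))   # time-domain (linear prediction) framework
print(choose_framework(tonal))        # frequency-domain framework
```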

What influences the interactive experience in speech coding?

To evaluate sound quality, we need to look at a codec’s technical specifications, which generally include sampling rate, bit rate, complexity, and packet loss resilience. What do these indicators mean, and how do they affect the audio experience?

You may have seen statements like “the higher the sampling rate, the better the sound quality” and “the higher the coding complexity, the better”, but that’s not true in a real-time interactive setting!

1. Sampling rate

Turning an analog signal that the human ear can hear into a digital signal that a computer can process requires sampling. Sound can be decomposed into a superposition of sinusoids of different frequencies and intensities. Sampling can be pictured as picking points off the sound wave, and the sampling rate is the number of points taken per second. The higher the sampling rate, the less information is lost in the conversion, and the closer the result is to the original sound.

The sampling rate determines the resolution of the audio signal. Within the range of human hearing, a higher sampling rate preserves more high-frequency components, making the signal sound clearer and brighter. For example, on a traditional telephone call the other party often sounds dull. That is because traditional telephony samples at 8 kHz, keeping only the low-frequency information needed for intelligibility and discarding many high-frequency components. For audio interaction, then, a higher sampling rate means better perceived quality, up to the limits of human hearing.
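A quick numerical check shows why the sampling rate is a hard ceiling on what the digital signal can carry: at the 8 kHz telephone rate, anything above the 4 kHz Nyquist limit is not merely attenuated, it becomes indistinguishable from a lower frequency (a Python sketch):

```python
import numpy as np

fs = 8000                                   # traditional telephone sampling rate
n = np.arange(80)                           # 10 ms of samples
high = np.sin(2 * np.pi * 6000 * n / fs)    # 6 kHz tone, above the 4 kHz Nyquist limit
low = np.sin(2 * np.pi * 2000 * n / fs)     # 2 kHz tone

# At fs = 8 kHz the 6 kHz tone aliases onto -2 kHz: the two sample sequences
# are exact negatives of each other, so the tones cannot be told apart.
print(np.max(np.abs(high + low)))           # ~0
```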

2. Bit rate

After sampling, the sound has been converted from analog to digital. The bit rate is the amount of data the digital signal carries per unit of time.

The bit rate determines how much detail survives encoding and decoding. The codec allocates its bit budget to the output parameters of each analysis module in order of priority: when the rate is limited, it encodes the parameters that matter most to speech quality first and drops those that matter least. At the decoder, because the parameter set is incomplete, the reconstructed speech signal inevitably suffers some damage. Generally, for the same codec, a higher bit rate means less damage. But that does not mean the higher the bit rate, the better. For one thing, the relationship between bit rate and quality is not linear: beyond the “quality sweet spot”, extra bits bring little audible improvement. For another, in real-time interaction a high bit rate can eat up bandwidth and cause network congestion, which leads to packet loss, which in turn hurts the user experience.
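The arithmetic behind the bit budget is simple. Assuming, for illustration, a codec that emits one frame every 20 ms, the bit rate fixes how many bits each frame may use, which is exactly the budget the encoder divides among its parameters by priority:

```python
FRAME_MS = 20   # assume one coded frame per 20 ms, i.e. 50 frames per second

for kbps in (8, 16, 32, 64):
    bits_per_frame = kbps * 1000 * FRAME_MS // 1000
    print(f"{kbps:>2} kbps -> {bits_per_frame:>4} bits "
          f"({bits_per_frame // 8:>3} bytes) per frame")
# 8 kbps leaves only 20 bytes per frame to cover every encoded parameter,
# which is why low-priority parameters get dropped first.
```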

Quality sweet spot: in video, the quality sweet spot is the best subjective quality achievable by choosing the right resolution and frame rate for a given bit rate and screen size. The same idea applies in audio.

3. Coding complexity

Coding complexity is generally concentrated in the encoder’s signal analysis modules. Broadly, the more thoroughly the speech signal is analyzed, the higher the potential compression, so coding efficiency and complexity are correlated. As with bit rate, the relationship between complexity and quality is not linear; there is a “quality sweet spot” here too. Whether a codec is practical often hinges on designing a high-quality algorithm within a limited complexity budget.

4. Packet loss resilience

First, how does packet loss resilience work in principle? When we transmit audio data, packets get lost. If the current packet is lost, we want some way to guess or obtain the rough content of the current frame, and then use that incomplete, approximate information to decode a speech frame similar to the original. Pure guessing rarely ends well; but if the previous or following packet can tell the decoder some key information about the lost one, then the more such information there is, the better the decoder can recover the lost frame. This “key information” carried in the previous or following packet is known as interframe redundancy. (We covered packet loss countermeasures in more detail last time.)

Packet loss resilience and coding efficiency are therefore somewhat mutually exclusive. Improving coding efficiency usually means squeezing out as much interframe redundancy as possible, while loss resilience depends on keeping a certain amount of it, so that a lost frame can be recovered from the preceding or following frames. In real-time interaction, users’ networks are unreliable: someone may walk into an elevator or ride in a speeding car. Such networks are riddled with packet loss and delay jitter, so a codec’s loss resilience is indispensable, and balancing it against coding efficiency takes careful algorithm design, polish, and verification. A minimal sketch of the redundancy idea follows.
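This sketch is simplified and hypothetical (in-band FEC in real codecs such as Opus is more elaborate): every packet piggybacks a coarse, low-bit-rate copy of the previous frame, so one lost packet can be patched from its successor, at the cost of the extra redundant bits carried in every packet.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int
    frame: bytes        # full-quality coded frame
    prev_coarse: bytes  # low-bit-rate redundant copy of the PREVIOUS frame

def decode(received, seq):
    """Decode frame `seq`; if its packet is lost, fall back to the
    redundant copy carried by packet seq + 1."""
    if seq in received:
        return received[seq].frame, "primary"
    nxt = received.get(seq + 1)
    if nxt is not None:
        return nxt.prev_coarse, "recovered from redundancy"
    return None, "concealment only (pure guessing)"

# Packet 2 is lost in transit, but its coarse copy rides inside packet 3.
received = {p.seq: p for p in [
    Packet(1, b"F1", b"F0~"),
    Packet(3, b"F3", b"F2~"),
    Packet(4, b"F4", b"F3~"),
]}
for seq in (1, 2, 3):
    print(seq, decode(received, seq))
```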

How to balance audio experience with technical metrics?

So how does Agora handle this? Our engineers took all of these factors into account and built Agora Nova (Nova), a high-definition voice codec designed for real-time communication.

32 kHz sampling rate

First, for the sampling rate, Nova chose 32 kHz rather than the 8 kHz or 16 kHz used by other speech codecs, which gives it a head start on voice quality. The 16 kHz rate common in the industry (WeChat, for example, uses 16 kHz) meets the basic requirements of speech intelligibility, but some voice detail can only be captured at a higher sampling rate. We wanted to offer higher-definition voice calls that preserve intelligibility while also improving clarity, which is why we chose 32 kHz.

Optimizing coding complexity

A higher sampling rate means clearer speech, but also more sample points per unit time to analyze, encode, and transmit, which drives up bit rate and complexity. That increase inevitably pressures users’ bandwidth and their devices’ performance and power consumption, which is not what we want. To address this, through theoretical derivation and extensive experimental verification we designed a streamlined coding scheme for the high-frequency speech components. With only a small increase in analysis complexity, it encodes the high-frequency signal at as little as 0.8 kbps (with other techniques, the high-frequency signal usually costs upwards of 1~2 kbps), greatly improving the clarity of the speech signal.
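Agora has not published Nova’s internals, so the snippet below is only a generic illustration, with hypothetical numbers, of why high-band coding can be cheap: instead of coding every high-frequency spectral line, transmit a coarse per-band energy envelope, a handful of parameters per frame. Four band energies at roughly 4 bits each, 50 frames per second, lands in the neighborhood of the 0.8 kbps figure above.

```python
import numpy as np

def high_band_envelope(frame, fs=32000, cutoff=8000, n_bands=4):
    """Hypothetical sketch: summarize everything above `cutoff` as a few
    per-band energies instead of coding each spectral line individually."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1 / fs)
    bands = np.array_split(power[freqs >= cutoff], n_bands)
    return np.array([band.mean() for band in bands])

frame = np.random.randn(640)              # 20 ms frame @ 32 kHz
print(high_band_envelope(frame))          # 4 parameters per frame
# 4 energies x ~4 bits x 50 frames/s = 800 bits/s, i.e. about 0.8 kbps.
```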

Balancing packet loss resistance and coding efficiency

To guarantee loss resilience, we chose the most balanced scheme that still preserves coding efficiency. Experimental verification shows that it maintains compression efficiency while also ensuring a good recovery rate under packet loss. Beyond Nova, we have also developed and open-sourced Agora Solo and SoloX, voice codecs with even stronger packet loss resistance for unstable network environments.

Agora Nova vs. Opus

Nova offers a variety of modes for different scenarios, such as adaptive mode, high-quality mode, low-power high-quality mode, ultra-high-definition mode, and ultra-low-bit-rate mode.

Compared with Opus, the advanced open-source codec, Nova carries 30% more effective spectral information at the same bit rate, thanks to its efficient signal processing algorithms. Under both subjective and objective evaluation, Nova’s voice coding quality is higher than Opus’s:

  • At the objective level, the quality evaluation algorithm defined in the ITU-T P.863 standard was used to score corpora encoded and decoded by the two codecs, and Nova consistently scored slightly higher than Opus.

  • At the subjective level, speech passed through the Nova codec is restored more faithfully than through Opus, which comes across as a more transparent sound with less quantization noise.

Thanks to this high-definition voice codec, the Agora SDK provides users around the world with a consistently high-quality interactive audio experience. Of course, the quality of a voice call depends not only on the codec’s encoding quality but also on other modules such as echo cancellation, noise reduction, and network transmission. In the next article, we will introduce Agora’s best practices for echo cancellation and noise reduction algorithms.

To learn more about real-time audio technology, follow the “Agora Developer” official account.