At the beginning of the year I started learning WebRTC for work. Its complex build environment and enormous code base left a deep impression on me; it was clearly going to be a hard nut to crack. But as the saying goes, all things are difficult before they are easy, and if you persevere you will eventually grasp the essence of WebRTC's design. Today I want to talk about audio in WebRTC.

WebRTC consists of three major modules: the voice engine, the video engine, and network transport. Among them, the voice engine is one of the most valuable pieces of technology in WebRTC. It implements the full audio processing chain: capture, pre-processing, encoding, sending, receiving, decoding, mixing, post-processing, and playback.

The audio engine mainly consists of the audio device module (ADM), the audio encoder factory, the audio decoder factory, the mixer, and the audio pre-processing module (APM).

Audio working mechanism

To understand the audio engine, you first need to understand its core implementation classes and how audio data flows through them. Let us briefly analyze both.

Audio Engine core class diagram:

The audio engine WebrtcVoiceEngine contains the audio device module AudioDeviceModule, the audio mixer AudioMixer, the audio 3A processor AudioProcessing, the audio management class AudioState, the audio encoder factory AudioEncoderFactory, the audio decoder factory AudioDecoderFactory, and the voice media channels for sending and receiving.

1. The audio device module AudioDeviceModule is responsible for the hardware layer, including audio capture, audio playback, and hardware operations.

2. The audio mixer AudioMixer is mainly responsible for mixing the audio to be sent (device-captured audio mixed with accompaniment audio) and the audio to be played (multiple received audio streams mixed with accompaniment audio).

3. The audio 3A processor AudioProcessing is mainly responsible for pre-processing audio data, including echo cancellation (AEC), automatic gain control (AGC), and noise suppression (NS). APM distinguishes two streams: the near-end stream and the far-end stream. The near-end stream is the data coming in from the microphone; the far-end stream is the received data.

4. The audio management class AudioState holds the audio device module ADM, the audio pre-processing module APM, the audio mixer, and the data flow hub AudioTransportImpl.

5. The audio encoder factory AudioEncoderFactory includes codecs such as Opus, iSAC, G711, G722, iLBC, and L16.

6. The audio decoder factory AudioDecoderFactory includes codecs such as Opus, iSAC, G711, G722, iLBC, and L16.
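
The following is a minimal sketch of how these components can be created with WebRTC's public factory helpers. The helper names are real, but the exact headers, signatures, and return types (raw pointers vs. rtc::scoped_refptr) have changed across WebRTC releases, so treat this as an assumption-laden outline rather than copy-paste code.

```cpp
#include <memory>

#include "api/audio_codecs/builtin_audio_decoder_factory.h"
#include "api/audio_codecs/builtin_audio_encoder_factory.h"
#include "api/task_queue/default_task_queue_factory.h"
#include "modules/audio_device/include/audio_device.h"
#include "modules/audio_mixer/audio_mixer_impl.h"
#include "modules/audio_processing/include/audio_processing.h"

// Sketch: assembling the core audio-engine components (version-dependent).
void CreateAudioEngineComponents() {
  // Encoder/decoder factories with the built-in codecs (Opus, G711, G722, ...).
  rtc::scoped_refptr<webrtc::AudioEncoderFactory> encoder_factory =
      webrtc::CreateBuiltinAudioEncoderFactory();
  rtc::scoped_refptr<webrtc::AudioDecoderFactory> decoder_factory =
      webrtc::CreateBuiltinAudioDecoderFactory();

  // Audio 3A processor (AEC / NS / AGC); configured later via ApplyConfig().
  rtc::scoped_refptr<webrtc::AudioProcessing> apm =
      webrtc::AudioProcessingBuilder().Create();

  // Mixer that combines the loudest non-muted sources.
  rtc::scoped_refptr<webrtc::AudioMixer> mixer =
      webrtc::AudioMixerImpl::Create();

  // Audio device module; kPlatformDefaultAudio selects the platform backend.
  std::unique_ptr<webrtc::TaskQueueFactory> task_queue_factory =
      webrtc::CreateDefaultTaskQueueFactory();
  rtc::scoped_refptr<webrtc::AudioDeviceModule> adm =
      webrtc::AudioDeviceModule::Create(
          webrtc::AudioDeviceModule::kPlatformDefaultAudio,
          task_queue_factory.get());

  // These components are then wired into AudioState / WebrtcVoiceEngine.
}
```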

Audio workflow flowchart:



1. The sender captures sound through a microphone

2. The sender passes the captured signal to the APM module for echo cancellation (AEC), noise suppression (NS), and automatic gain control (AGC) (see the APM sketch after this list)

3. The sender passes the processed data to the encoder for speech compression and encoding

4. The sender packetizes the encoded data in the RtpRtcp transport module and sends it to the receiver over the network

5. The receiver receives the audio data from the network and first feeds it to the NetEQ module for jitter removal, packet loss concealment, decoding, and related operations

6. The receiver sends the processed audio data to the sound card for playback
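
To make step 2 concrete, here is a minimal sketch of how the 3A features are enabled and how near-end and far-end data is pushed through APM. The field names follow the newer AudioProcessing::Config API; older releases used per-component Enable() calls, so treat the exact names and overloads as version-dependent assumptions.

```cpp
#include "modules/audio_processing/include/audio_processing.h"

// Sketch: enable 3A and process one 10 ms frame (assumes 48 kHz mono int16).
void Run3AOnOneFrame(webrtc::AudioProcessing* apm,
                     int16_t* near_end_10ms,   // microphone (near-end) frame
                     int16_t* far_end_10ms) {  // playout (far-end) frame
  webrtc::AudioProcessing::Config config;
  config.echo_canceller.enabled = true;     // AEC
  config.noise_suppression.enabled = true;  // NS
  config.gain_controller1.enabled = true;   // AGC
  apm->ApplyConfig(config);

  const webrtc::StreamConfig stream(48000, 1);  // 48 kHz, 1 channel

  // Far-end (reverse) stream: the signal about to be played out,
  // used by the AEC as its reference.
  apm->ProcessReverseStream(far_end_10ms, stream, stream, far_end_10ms);

  // Tell the AEC roughly how long the render-to-capture path is.
  apm->set_stream_delay_ms(50);  // 50 ms is a placeholder value

  // Near-end stream: the microphone signal; AEC/NS/AGC run here.
  apm->ProcessStream(near_end_10ms, stream, stream, near_end_10ms);
}
```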

The NetEQ module is the core module of the WebRTC speech engine

NetEQ is roughly divided into an MCU module and a DSP module. The MCU is mainly responsible for calculating and tracking delay and jitter and for generating the corresponding control commands. The DSP module receives and processes the corresponding data packets according to the MCU's control commands and passes the result to the next stage.
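
The following is a conceptual sketch of the MCU side of that split, not the actual NetEQ code: based on how the current buffer level compares with the target delay, the MCU picks a DSP operation (normal decode, time stretching, or concealment). The operation names mirror NetEQ's terminology, but the thresholds and function are illustrative assumptions.

```cpp
// Conceptual sketch of NetEQ's MCU decision logic (illustrative only).
enum class DspOperation {
  kNormal,            // decode and play the packet as-is
  kAccelerate,        // buffer too full: time-compress to reduce delay
  kPreemptiveExpand,  // buffer running low: time-stretch to build headroom
  kExpand             // no packet available: packet loss concealment
};

DspOperation McuDecide(int buffered_ms, int target_delay_ms,
                       bool packet_available) {
  if (!packet_available) {
    return DspOperation::kExpand;            // conceal the missing packet
  }
  if (buffered_ms > target_delay_ms * 2) {   // illustrative threshold
    return DspOperation::kAccelerate;
  }
  if (buffered_ms < target_delay_ms / 2) {   // illustrative threshold
    return DspOperation::kPreemptiveExpand;
  }
  return DspOperation::kNormal;
}
```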

Audio data flow

Building on the audio workflow introduced above, we will now refine the audio data flow, highlighting the role that the data flow hub AudioTransportImpl plays in this process.

AudioTransportImpl implements two data processing interfaces: RecordedDataIsAvailable on the capture side and NeedMorePlayData on the playout side. RecordedDataIsAvailable is responsible for processing captured audio data and distributing it to all sending Streams. NeedMorePlayData is responsible for mixing all received Streams, resampling the mix to the requested output sample rate, and feeding it to the APM as the far-end reference signal.
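
For reference, the two callbacks of webrtc::AudioTransport look roughly like this. The parameter lists approximate modules/audio_device/include/audio_device_defines.h and may differ slightly between WebRTC versions.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the webrtc::AudioTransport callbacks that AudioTransportImpl implements.
class AudioTransport {
 public:
  virtual ~AudioTransport() = default;

  // Capture path: the ADM delivers recorded samples here.
  virtual int32_t RecordedDataIsAvailable(const void* audioSamples,
                                          size_t nSamples,
                                          size_t nBytesPerSample,
                                          size_t nChannels,
                                          uint32_t samplesPerSec,
                                          uint32_t totalDelayMS,
                                          int32_t clockDrift,
                                          uint32_t currentMicLevel,
                                          bool keyPressed,
                                          uint32_t& newMicLevel) = 0;

  // Playout path: the ADM asks for the next block of samples to play.
  virtual int32_t NeedMorePlayData(size_t nSamples,
                                   size_t nBytesPerSample,
                                   size_t nChannels,
                                   uint32_t samplesPerSec,
                                   void* audioSamples,
                                   size_t& nSamplesOut,
                                   int64_t* elapsed_time_ms,
                                   int64_t* ntp_time_ms) = 0;
};
```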

RecordedDataIsAvailable internal main flow:

  1. The audio data captured by the hardware is first resampled to the sending sample rate
  2. The resampled audio data goes through audio pre-processing (APM)
  3. VAD processing
  4. Digital gain is applied to adjust the capture volume
  5. The audio data is called back to the outside for external pre-processing
  6. The mixer mixes all audio data to be sent, including the captured data and the accompaniment data
  7. The energy value of the audio data is calculated
  8. The data is distributed to all sending Streams

NeedMorePlayData internal main flow:

  1. Mix the received audio data of all Streams:
1.1 Calculate the output sample rate: CalculateOutputFrequency()

1.2 Collect audio data from the sources: GetAudioFromSources() selects the three non-muted sources with the highest energy for mixing

1.3 Perform the mix operation: FrameCombiner::Combine() (a usage sketch of the mixer follows this list)

  2. Under specific conditions, noise injection is performed on the capture side as a reference signal
  3. Mix local audio
  4. Digital gain is applied to adjust the playback volume
  5. The audio data is called back to the outside for external pre-processing
  6. The energy value of the audio data is calculated
  7. The audio is resampled to the requested output sample rate
  8. The audio data is fed to the APM for far-end reference signal processing
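
As a usage sketch of the mixing step, the snippet below uses the public webrtc::AudioMixer interface: each receive stream registers itself as an AudioMixer::Source via AddSource(), and the engine asks for one mixed 10 ms frame at a time via Mix(). Exact headers and the Source interface details are version-dependent assumptions.

```cpp
#include "api/audio/audio_frame.h"
#include "api/audio/audio_mixer.h"
#include "modules/audio_mixer/audio_mixer_impl.h"

// Sketch: mixing received streams into one 10 ms playout frame.
// AudioMixerImpl mixes the loudest non-muted sources (three by default).
void MixOnePlayoutFrame(webrtc::AudioMixer::Source* stream_a,
                        webrtc::AudioMixer::Source* stream_b) {
  rtc::scoped_refptr<webrtc::AudioMixer> mixer =
      webrtc::AudioMixerImpl::Create();

  mixer->AddSource(stream_a);
  mixer->AddSource(stream_b);

  webrtc::AudioFrame mixed;  // holds one 10 ms frame after mixing
  mixer->Mix(/*number_of_channels=*/1, &mixed);

  mixer->RemoveSource(stream_a);
  mixer->RemoveSource(stream_b);
}
```
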
Given the data flow in the figure above, why are FineAudioBuffer and AudioDeviceBuffer needed? Because WebRTC's audio pipeline only processes data in 10 ms chunks, while different operating system platforms deliver and consume audio in callbacks of different durations, and different sample rates also yield different durations. For example, on iOS a 16 kHz sample rate delivers 128-sample buffers, i.e. 8 ms of audio; an 8 kHz sample rate delivers 128-sample buffers, i.e. 16 ms of audio; and a 48 kHz sample rate delivers 512-sample buffers, i.e. about 10.67 ms of audio.

The AudioDeviceModule handles playback and capture, and the AudioDeviceBuffer always moves audio in and out in 10 ms chunks. For platforms whose native callbacks do not deliver or consume exactly 10 ms of audio, a FineAudioBuffer is inserted between the platform AudioDeviceModule and the AudioDeviceBuffer to convert the platform's audio buffers into the 10 ms frames that WebRTC can process. In the AudioDeviceBuffer, the number of samples and the sample rate actually delivered by the current hardware device are computed every 10 s, which can be used to check whether the hardware is working properly.
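
The conversion itself is just reframing: accumulate whatever the device delivers and hand out exactly sample_rate / 100 samples at a time. The class below is a simplified, hypothetical illustration of that idea, not the actual FineAudioBuffer implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Simplified illustration of a FineAudioBuffer-style adapter:
// device callbacks of arbitrary size in, exact 10 ms frames out.
class TenMsReframer {
 public:
  explicit TenMsReframer(int sample_rate_hz)
      : samples_per_10ms_(sample_rate_hz / 100) {}

  // Called from the device callback with however many samples the OS gives us
  // (e.g. 128 samples per callback at 16 kHz on iOS).
  void PushDeviceSamples(const int16_t* samples, size_t count) {
    pending_.insert(pending_.end(), samples, samples + count);
  }

  // Returns true and fills `frame` once a full 10 ms frame is available.
  bool PopTenMsFrame(std::vector<int16_t>* frame) {
    if (pending_.size() < samples_per_10ms_) return false;
    frame->assign(pending_.begin(), pending_.begin() + samples_per_10ms_);
    pending_.erase(pending_.begin(), pending_.begin() + samples_per_10ms_);
    return true;
  }

 private:
  const size_t samples_per_10ms_;
  std::deque<int16_t> pending_;
};
```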

Audio-related changes

  1. Implemented audio profiles supporting VoIP and Music scenarios, with a combined strategy covering sample rate, bitrate, encoding mode, and number of channels. On iOS, capture and playback run on separate threads, and dual-channel playback is supported.
  2. Compatibility of audio 3A parameters: delivered adaptation solutions for different devices.
  3. Headset adaptation: Bluetooth headsets and ordinary wired headsets, with dynamic 3A switching.
  4. The Noise_Injection algorithm injects noise as a reference signal, which is particularly important for echo cancellation in headphone scenarios.
  5. Support for local audio files and network audio files over HTTP & HTTPS.
  6. Implemented audio NACK, which improves audio's resilience to packet loss; in-band FEC is currently in progress.
  7. Optimized audio processing for single-talk and double-talk scenarios.
  8. Research on the iOS built-in AGC:
(1) The built-in AGC is effective for speech and music, but has no effect on noise and ambient background noise.

(2) The microphone hardware gain differs between models (iPhone 7 Plus > iPhone 8 > iPhone X), so with both software AGC and hardware AGC turned off, the volume heard at the remote end differs between devices.

(3) Besides the switchable AGC that iOS exposes, there is another AGC that is always active and fine-tunes the signal level. My guess is that the always-on AGC is iOS's analog AGC, probably tied to the hardware and with no API to switch it off, while the switchable one is a digital AGC.

(4) On most iOS models, the input volume decreases after the earphones are re-inserted from speaker mode. The current solution is to add a preGain to bring the input volume back to normal after the earphones are re-inserted.
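
As an illustration of the preGain workaround in (4), the helper below simply scales the captured int16 samples by a fixed linear gain with saturation. The function name and the gain value are hypothetical; the real code would choose the gain based on the measured drop in input level.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical preGain helper: scale captured samples and clamp to int16 range.
void ApplyPreGain(int16_t* samples, size_t count, float pre_gain) {
  for (size_t i = 0; i < count; ++i) {
    const float boosted = samples[i] * pre_gain;
    samples[i] = static_cast<int16_t>(
        std::max(-32768.0f, std::min(32767.0f, boosted)));
  }
}

// Example: after the earphones are re-inserted, boost capture by about 6 dB.
// ApplyPreGain(capture_frame, frame_size, 2.0f);
```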

Troubleshooting Audio Problems

Here are some of the most common audio problems and their causes: