RTC is the technology behind the online education industry's growth spurt during the 2020 pandemic. This article shares practical experience applying RTC technology in the music education scenario.
Author | Yicheng    Reviewer | Tai Yi
Music education scenario – the online practice partner
The COVID-19 pandemic of 2020 changed the landscape of quality-oriented education in online education, and music practice coaching is a typical scenario. Many offline music companies closed because they could not afford high rents, while users surged on online music education platforms, represented by The One, VIP Practice Partner, Quick Practice Partner, Meiyue Practice Partner, Music Notes, and others. According to public information, VIP Practice Partner has reached 30,000 classes per day, and Quick Practice Partner exceeded 1.2 million users in October 2020. Some investment institutions have pointed out that by 2022 the music education market is expected to exceed 400 billion yuan, with demand for online practice coaching approaching 100 billion yuan.

However, building a practice-coaching app that delivers high sound quality is not easy. In actual development, music coaching apps place higher demands on sound quality than ordinary online education apps. Below, taking piano education as an example, I analyze from a technical perspective the problems WebRTC encounters in the instrument education scenario and their solutions.
Spectrum of Musical Instruments
Taking the piano as an example, its spectrum is mainly concentrated below 5 kHz. The following figure shows the spectrum of a piano piece with a 44.1 kHz sampling rate, decoded by FFmpeg. As the figure shows, even allowing for the high-frequency harmonics and environmental sounds that may appear in a real recording, the energy is concentrated in the band below 7 kHz:
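As a side note, such a spectrogram can be reproduced with FFmpeg alone, for example with its showspectrumpic filter (the file names here are placeholders):

    ffmpeg -i piano.wav -lavfi showspectrumpic=s=1024x512 spectrum.png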
Analysis of factors affecting sound quality
Recording
WebRTC's pre-processing pipeline after audio capture is: Record -> ANS -> AEC -> AGC. Let's first analyze the impact of recording. In the following test, piano music is played from an Android phone placed about 15 cm from a Mac. In single-talk mode, the spectrum of the original piano piece is as follows: The spectrum after recording is as follows: In the figure, the spectrum below 400 Hz is almost entirely lost. However, since the sound here is played through the phone's speaker, travels through the air, and is then picked up by the peer's mic, this setup differs from a real piano lesson, so we also recorded a real piano lesson, as follows:
As can be seen, a real piano lesson recording preserves the spectrum quite differently from the phone-playback-and-record setup, so recording can be temporarily ruled out as a factor in the real piano scenario.
3A algorithms
Single-talk (AEC not in effect): spectrum after ANS: Conclusion: ANS causes a large spectral loss, and the loss in the mid and high frequencies is more serious.
Double-talk (through ANS and AEC): spectrum after ANS (with someone speaking at the far end): Spectrum after AEC: Double-talk also causes a great loss to the music, and the loss is concentrated in the AEC module.
Codecs
Opus is a hybrid encoder built from SILK and CELT, a design technically referred to as USAC (Unified Speech and Audio Coding). The codec has a music detector that determines whether the current frame is voice or music; voice is encoded as a SILK frame and music as a CELT frame. It is generally recommended not to restrict the encoder to a fixed mode.
At present, WebRTC sets the Opus Application to kVoip, and hybrid coding is enabled by default, without restricting the encoder to a fixed CELT-only or SILK-only mode. The music/speech decision made inside the encoder in hybrid coding mode: Test corpus: After selecting music mode + hybrid coding: After selecting voice mode + hybrid coding: Test feedback shows that switching out of SILK is very responsive when the encoder is set for music, but in VoIP mode the switch is sluggish: music that follows speech keeps being encoded with SILK for a while. As a result, SILK coding loses part of the high frequencies in the seconds after speech, which is slightly worse than CELT coding.
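To make these modes concrete, here is a minimal sketch using the public libopus C API. The sample rate, bitrate, and frame size are illustrative assumptions, not values taken from WebRTC:

    // Create an Opus encoder biased toward music rather than VoIP.
    #include <opus.h>
    #include <stdio.h>

    int main(void) {
      int err = 0;
      // OPUS_APPLICATION_AUDIO favors music fidelity;
      // OPUS_APPLICATION_VOIP favors speech intelligibility.
      OpusEncoder *enc = opus_encoder_create(48000, 2, OPUS_APPLICATION_AUDIO, &err);
      if (err != OPUS_OK) return 1;

      // Hint the signal type so the hybrid SILK/CELT decision leans toward music.
      opus_encoder_ctl(enc, OPUS_SET_SIGNAL(OPUS_SIGNAL_MUSIC));
      // A higher bitrate helps preserve the piano's high-frequency harmonics.
      opus_encoder_ctl(enc, OPUS_SET_BITRATE(96000));

      opus_int16 pcm[2 * 960] = {0};        // one 20 ms stereo frame at 48 kHz
      unsigned char packet[4000];
      opus_int32 n = opus_encode(enc, pcm, 960, packet, sizeof(packet));
      printf("encoded %d bytes\n", (int)n);

      opus_encoder_destroy(enc);
      return 0;
    }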
Summary
To sum up, the main factors affecting the sound quality of piano education are the noise suppression (ANS) module and the echo cancellation (AEC) module.
Technical solution for the piano education scenario
A complete solution must take into account the coexistence of voice and music in the piano education scenario. Each current audio frame must be classified as voice or music. If it is a voice frame, the normal 3A processing pipeline is followed; if it is a music frame, the 3A algorithms must be adjusted to preserve the integrity of the music as much as possible. The general flow chart is as follows:
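As a minimal sketch of this idea, the snippet below toggles WebRTC's software 3A through the AudioProcessing config. The Config fields come from recent WebRTC versions; the is_music_frame flag and the per-frame granularity are simplifying assumptions (a real implementation would come from a voice/music detector and would smooth the transitions):

    // Sketch: relax software 3A for music frames, full 3A for voice frames.
    #include "modules/audio_processing/include/audio_processing.h"

    void ConfigureApmForFrame(webrtc::AudioProcessing* apm, bool is_music_frame) {
      webrtc::AudioProcessing::Config config;
      if (is_music_frame) {
        // Music frame: keep NS/AEC/AGC from eating the harmonics.
        config.noise_suppression.enabled = false;
        config.echo_canceller.enabled = false;   // or keep a weaker filter
        config.gain_controller1.enabled = false;
      } else {
        // Voice frame: normal 3A pipeline.
        config.noise_suppression.enabled = true;
        config.noise_suppression.level =
            webrtc::AudioProcessing::Config::NoiseSuppression::kModerate;
        config.echo_canceller.enabled = true;
        config.gain_controller1.enabled = true;
      }
      apm->ApplyConfig(config);
    }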
Related technical issues
The preceding sections analyzed the factors affecting the piano education scenario and the corresponding technical solution. The following analyzes the related technical problems from an implementation perspective. According to the conclusions above, in the VoIP scenario iOS and Android generally adopt hardware 3A, but in the instrument education scenario software 3A must be adopted; otherwise the 3A algorithms cannot be adjusted dynamically between music and human voice.
1. Capture-mode pitfalls on the iOS platform
WebRTC uses AudioUnit for capture on the iOS platform; the relevant code is as follows: According to Apple's API documentation, iOS offers three I/O units, of which the Remote I/O unit is the most commonly used. It connects to the input and output audio hardware, provides low-latency access to incoming and outgoing sample values, and converts between the hardware audio format and the application audio format. The Voice-Processing I/O unit is an extension of the Remote I/O unit; it adds echo cancellation for voice chat, automatic gain correction, voice-quality tuning, and muting. The Generic Output unit does not connect to audio hardware; instead it provides a mechanism for sending the output of a processing chain back to the application, and is usually used for offline audio processing.
Therefore, in the instrument education scenario, the AudioUnit must be initialized in Remote I/O mode so that the recorded data is not processed by hardware 3A. However, after Remote I/O is enabled, the recording data looks as follows: Note that with Remote I/O the system hardware no longer applies volume gain, so the recorded volume is low (about -14 dB), and the gain of the corresponding software AGC algorithm needs to be raised accordingly.
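A minimal sketch of this choice with Apple's C AudioToolbox API (error handling omitted; the helper name and the bool switch are illustrative assumptions):

    #include <AudioToolbox/AudioToolbox.h>

    AudioUnit CreateCaptureUnit(bool instrument_scene) {
      AudioComponentDescription desc = {};
      desc.componentType = kAudioUnitType_Output;
      // Instrument education: raw Remote I/O, no hardware AEC/AGC/NS.
      // VoIP: Voice-Processing I/O, hardware 3A enabled.
      desc.componentSubType = instrument_scene ? kAudioUnitSubType_RemoteIO
                                               : kAudioUnitSubType_VoiceProcessingIO;
      desc.componentManufacturer = kAudioUnitManufacturer_Apple;

      AudioComponent comp = AudioComponentFindNext(nullptr, &desc);
      AudioUnit unit = nullptr;
      AudioComponentInstanceNew(comp, &unit);

      // Enable input on bus 1 (the input element of an I/O unit).
      UInt32 enable = 1;
      AudioUnitSetProperty(unit, kAudioOutputUnitProperty_EnableIO,
                           kAudioUnitScope_Input, 1, &enable, sizeof(enable));
      return unit;
    }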
2. Capture-mode pitfalls on the Android platform
Generally, on the Android platform in VoIP mode, most models can use hardware 3A through per-device adaptation, which not only ensures quality but also brings lower power consumption and latency. The Android platform can choose between Java capture/playback and OpenSL ES capture/playback; in general, OpenSL ES has lower latency and better device coverage. We therefore use OpenSL ES as the example to describe how the setup differs between the VoIP scenario and the instrument education scenario. OpenSL ES provides several modes of audioSource and streamType to choose from:
For VoIP scenarios, the OpenSL ES options are audioSource: VOICE_COMMUNICATION and stream_type: STREAM_VOICE. In instrument education scenarios, however, audioSource: MIC and stream_type: STREAM_MEDIA should be used; otherwise hardware 3A may be triggered.
audioSource: MIC with an incorrect stream_type: as can be seen in the figure, the wrong stream_type mode affects system capture and triggers hardware 3A during playback, causing serious damage to the music signal.
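The corresponding OpenSL ES configuration might look like the following sketch (the constants are from <SLES/OpenSLES_Android.h>; both calls must happen before the objects are Realized, and the helper names are illustrative):

    #include <SLES/OpenSLES.h>
    #include <SLES/OpenSLES_Android.h>

    void ConfigureRecorder(SLAndroidConfigurationItf cfg, bool instrument_scene) {
      // VoIP: the VOICE_COMMUNICATION preset enables the device's hardware 3A path.
      // Instrument education: the GENERIC preset (plain MIC source) bypasses it.
      SLuint32 preset = instrument_scene
                            ? SL_ANDROID_RECORDING_PRESET_GENERIC
                            : SL_ANDROID_RECORDING_PRESET_VOICE_COMMUNICATION;
      (*cfg)->SetConfiguration(cfg, SL_ANDROID_KEY_RECORDING_PRESET,
                               &preset, sizeof(preset));
    }

    void ConfigurePlayer(SLAndroidConfigurationItf cfg, bool instrument_scene) {
      // STREAM_VOICE routes through the voice-call path; STREAM_MEDIA keeps the
      // media path, avoiding hardware processing of the music signal.
      SLint32 stream = instrument_scene ? SL_ANDROID_STREAM_MEDIA
                                        : SL_ANDROID_STREAM_VOICE;
      (*cfg)->SetConfiguration(cfg, SL_ANDROID_KEY_STREAM_TYPE,
                               &stream, sizeof(stream));
    }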
3. Music/voice detection pitfalls
As mentioned above, Opus provides a voice and music detection module. According to our tests, voice/music detection is very responsive when the encoder is set to music mode by default; however, if the encoder is set to voice mode, switching from voice to music is delayed by several seconds. When making the mode decision, the encoder sets its thresholds according to the configured default mode. With speech coding the threshold is higher, which is why, when the input switches from voice to music, the encoder does not immediately switch to music mode: it tries to preserve as much voice information as possible, since voice has strong inter-frame correlation.
Therefore, it is recommended to adopt music coding mode by default in the piano education scenario, in order to retain music information to the greatest extent and reduce the damage that 3A processing causes to sound quality.
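In terms of the libopus API shown earlier, this amounts to creating the encoder with OPUS_APPLICATION_AUDIO and hinting OPUS_SET_SIGNAL(OPUS_SIGNAL_MUSIC) by default, assuming the encoder is driven directly rather than through WebRTC's defaults.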
Conclusion
Engineering a music education product on top of WebRTC involves many details: from audio capture, to 3A adaptation, to audio encoder parameter tuning, everything needs targeted optimization so that a clear speech signal is guaranteed while the rich detail of the music signal is preserved without distortion. In addition, as this online education market segment continues to mature, special instruments such as percussion will bring new technical difficulties, which RTC will need to explore and optimize further.
"Video Cloud Technology": the audio and video technology public account most worth following. Every week we push hands-on technical articles from Alibaba Cloud's front line, where you can exchange ideas with first-class engineers in the audio and video field.