In the industry, QoE (Quality of Experience) evaluation for real-time audio and video has long been an important topic, and it comes up every year at the RTE Real-Time Internet Conference. It matters because the RTE industry still lacks a good QoE evaluation method for real-time interactive scenarios. Drawing on objective real-time data and practical lessons from large-scale commercial use worldwide, Agora has officially launched its own non-reference objective evaluation method for real-time audio user experience: the Agora real-time audio MoS method. The method has been integrated into Agora audio/video SDK 3.3.1 and later. It currently provides only the score for the downlink (codec – transport – playback) link; an uplink quality scoring interface will be provided in the future. By calling this method, developers can objectively judge a user's audio interaction experience in real time and obtain important reference data for optimizing their own business and operations. Click "Read Original" and search for "mosValue" to browse the detailed documentation of this method. You may be asking: what are MoS scores and QoE? What is the principle behind Agora's MoS method? How does it differ from existing open-source approaches?
From "Hello? Hello?" to QoS and QoE
In the early days of voice calls there was no concept of Quality of Service (QoS); call quality could only be judged by how many times you had to shout "Hello? Hello?" down the line. Early web-based voice interaction faced the same problem. QoS was born in this context: its purpose is to provide end-to-end quality-of-service assurance according to the requirements and characteristics of each kind of business. The QoS mechanism is mainly aimed at operators and networks, focusing on network performance and traffic management rather than the end-user experience. People gradually found that an evaluation system based on traditional QoS is a poor match for the user experience. QoE (Quality of Experience), which pays more attention to what the user actually perceives, was therefore proposed, and over time QoE-based evaluation systems gradually developed. In the field of communication, several evaluation methods strongly related to QoE have emerged; they can be divided into subjective and objective evaluation methods. All of them express the current user experience as a MoS (Mean Opinion Score).
Shortcomings of existing QoE methods
Subjective evaluation method
The subjective evaluation method maps people's subjective impressions to a quality score, and is limited by listeners' expertise and individual differences. There is no consistent industry standard for subjective audio testing. Although the ITU offers recommendations and guidelines for subjective audio tests, each test has its own focus and is designed and carried out differently. The most common practice is to invite enough people to collect a statistically significant sample, give the testers some listening training, and finally have them rate the audio calls for signal distortion, background intrusion, and overall quality. Because obtaining a reasonably accurate subjective voice-quality rating takes so much manpower and time, subjective tests are rarely used to evaluate communication quality in the industry.
Objective evaluation method
Objective evaluation methods are divided into reference (full-reference) methods and non-reference methods. A reference method quantifies the degree of impairment of a damaged signal given the reference (lossless) signal, and yields an objective speech-quality score close to the subjective one. In 2001, the ITU-T P.862 standard defined the reference objective evaluation algorithm PESQ, which is mainly used to evaluate codec impairments in narrowband and wideband. The algorithm has been widely used to evaluate communication quality over the past two decades. As technology developed, the applicable scope of PESQ became narrower and narrower, so in 2011 the P.863 standard defined a more comprehensive and accurate reference algorithm, POLQA. Compared with PESQ, POLQA covers a wider bandwidth, is more robust to noisy signals and delay, and produces speech-quality scores closer to subjective scores.

A non-reference objective evaluation method needs no reference signal; it derives a quality score by analyzing only the input signal itself or link parameters. Well-known non-reference methods include P.563, ANIQUE+, the E-model, and P.1201. P.563, proposed in 2004, is mainly aimed at narrowband speech quality assessment. ANIQUE+, proposed in 2006, also targets narrowband speech; according to its author, its scoring accuracy exceeds that of the reference method PESQ. However, such signal-domain measurements cannot reflect network delay, packet loss, and so on, so they are a poor fit for today's real-time interactive scenarios built on Internet transmission. The E-model was proposed in 2003. Unlike the two methods above, the E-model is an impairment-quantification standard based on VoIP link parameters, rather than direct analysis in the signal domain.
The P.1201 series was proposed in 2012. For the audio part, the standard does not analyze the audio signal directly, but scores communication quality based on network state and signal state.
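To make the parameter-based family concrete: the E-model computes a transmission rating factor R from link parameters (delay, loss, codec impairments) and then maps R to an estimated MOS. Below is a minimal sketch of only the standard R-to-MOS mapping defined in ITU-T G.107; the computation of R itself from link parameters is omitted:

```python
def r_to_mos(r: float) -> float:
    """Map an E-model transmission rating factor R to an estimated MOS
    (the MOS-CQE mapping from ITU-T G.107)."""
    if r < 0:
        return 1.0          # unusable link floors at MOS 1
    if r > 100:
        return 4.5          # mapping saturates at MOS 4.5
    return 1.0 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# A toll-quality narrowband link (R around 93) maps to roughly MOS 4.4.
print(round(r_to_mos(93.2), 2))
```

This illustrates why the E-model suits link monitoring: the score comes entirely from measurable parameters, with no audio signal analysis at all.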
AI-based methods: limited improvement, hard to deploy in real-time scenarios
In recent years, a number of papers have used deep learning to score speech signals, typically training the model to fit the output of PESQ or another reference-based objective method on the speech under test. But this approach has two obvious drawbacks:
- First, its accuracy depends on model capacity and compute, and when a product ships, features that do not directly improve quality face very strict complexity and package-size constraints, precisely because they cannot directly improve the user experience.
- Second, the robustness of this method is severely tested by the multi-scenario nature of RTE. For example, chat-room scenarios with background music or sound effects pose a great challenge to deep-learning-based methods.
Reference-based objective methods require a lossless reference corpus, so their main value lies in verifying the quality of an algorithm, app, or scenario before it goes live; once your app or scenario is online, they cannot evaluate its voice interaction experience. The industry has therefore hoped that non-reference objective methods could help with post-launch experience evaluation. Regrettably, because of the diversity of scenarios and the complexity of the algorithms, none of the non-reference methods mentioned above can be fully applied to the field of RTE. Take the non-reference method P.563 as an example: the effective spectrum it can measure is only 4 kHz, it can only measure speech signals, and its robustness across corpora is very poor. Early on, we made the core algorithm of P.563 run in real time and ported it into our SDK, but testing showed that the variance of its scoring error across different types of corpus was too large, and in the end it was never productized. A deep-learning-based method can, in theory, be trained end-to-end to be more robust and less error-prone than P.563, but its algorithmic complexity and low return on investment remain two stumbling blocks.
A new QoE evaluation method for real-time audio interactive scenes
In summary, if we need an on-device evaluation method that gives real-time feedback on call quality, none of the above methods is appropriate. We need to find a new way to design a new evaluation system, which should have the following characteristics:
- It should be robust to the corpora (music/speech/mixtures) of diverse real-time interactive scenarios, without significant evaluation errors.
- It needs multi-sampling-rate (narrowband/wideband/ultra-wideband/full-band) assessment capability.
- Its complexity should be low enough to evaluate the voice quality of every stream in a multi-party call on any device without introducing significant performance overhead.
- The online quality score should align with offline test results, i.e., for the same call, the score reported online should be almost identical to the score obtained by later analyzing the call with a reference evaluation method.
When a QoE evaluation system meets the above characteristics, it effectively lets you do a "pre-launch quality evaluation" after the product has launched: you can see your users' current call-experience rating at any time. This not only strengthens the rating system but also helps you improve the user experience in a targeted way. Based on the characteristics of Agora's real-time interaction scenarios, we designed a real-time speech quality evaluation method based on hidden state: the Agora audio interaction MoS method. It combines signal processing, psychoacoustics, and deep learning to score the voice quality of a call in real time with very low algorithmic complexity.
* Figure: the Agora audio interaction MoS method compared with existing evaluation methods in the industry
The method has two main parts. The first is uplink quality evaluation at the sender, mainly used to score the quality of capture and signal processing. The second is downlink quality assessment at the receiver, mainly used to score the signal after codec and network impairments. For the overall architecture, refer to this diagram:
This article focuses on downlink quality assessment, the most important part of the real-time interactive experience. In this part we also account for the encoding module at the sender, so it covers the encode – send – transmit – decode – post-process – playback link. Unlike previous methods that fit a score from network state, we focus on monitoring the state of each module inside the SDK. The core idea of this design is very simple: on a completely impairment-free network, the downlink contains only coding impairment before playback, and none of the weak-network countermeasure modules is triggered. Once the network fluctuates, each weak-network countermeasure module starts to operate, and each activation affects the final sound quality to some degree. The core of building a downlink quality assessment algorithm therefore becomes obtaining the mapping between SDK module states and sound quality.

Of course, several other factors enter the actual downlink algorithm design, such as the encoder architecture, its efficiency on different corpora, the effective bitrate, and the network impairment model, all of which clearly affect the final listening experience and quality score. Broadly speaking, the root-mean-square error (RMSE) of state-of-the-art evaluation methods is about 0.3 in multi-weak-network environments, while the RMSE of the method we designed can be kept within 0.2 under the same conditions.
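For clarity, the RMSE figures cited here measure how far per-call predicted scores deviate from reference scores (e.g., POLQA or subjective ratings) on the same calls. A minimal sketch of that error metric, using made-up illustrative numbers rather than real measurement data:

```python
import math

def rmse(predicted, reference):
    """Root-mean-square error between predicted and reference MOS scores."""
    assert len(predicted) == len(reference) and predicted
    total = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(total / len(predicted))

# Hypothetical per-call scores under weak-network conditions.
predicted = [3.8, 2.9, 4.1, 3.2]
reference = [4.0, 3.1, 4.0, 3.5]
print(round(rmse(predicted, reference), 3))
```

An RMSE within 0.2 means the online score typically lands within about a fifth of a MOS point of the reference score, which is why the online and offline results can be treated as aligned.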
Based on this downlink quality assessment algorithm, we built a global audio quality map, which lets users monitor call quality in real time anywhere in the world. In the map below, the horizontal and vertical axes represent users in different regions, and the MoS scores shown reflect the QoE of their current calls:
This MoS method for audio interaction is available in Agora audio/video SDK 3.3.1 and later. You can obtain the quality score of each call in real time through the mosValue field in AgoraRtcRemoteAudioStats. At present, only the score of the downlink (codec – transmission – playback) link is provided; an uplink quality scoring interface will be provided in the future. For a detailed description of the interface parameters, click "Read the original text" to enter the Agora documentation center and search for "mosValue".
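The mosValue field is on the standard 1–5 MOS scale. As an illustration of how an app might surface it on an operations dashboard, here is a hypothetical helper that bins a score into quality labels. The band boundaries below are conventional MOS interpretation values, not Agora's official thresholds; consult the SDK documentation for the exact semantics of mosValue:

```python
def describe_mos(mos_value: float) -> str:
    """Map a 1-5 MOS score to a human-readable quality label.

    The band boundaries are conventional, illustrative values,
    not official Agora SDK thresholds.
    """
    if mos_value >= 4.0:
        return "excellent"
    if mos_value >= 3.5:
        return "good"
    if mos_value >= 3.0:
        return "fair"
    if mos_value >= 2.5:
        return "poor"
    return "bad"

# In a stats callback, an app might log: describe_mos(stats.mosValue)
print(describe_mos(4.2))
```

Such a mapping is useful for aggregating per-call scores into coarse dashboards, while the raw mosValue remains the right quantity for trend analysis.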