By Menseth

Preface

Optimizing audio quality is a complex systems-engineering problem, and noise reduction is an important part of it. After decades of development, traditional noise reduction has hit a bottleneck: non-stationary noise suppression in particular can no longer meet the needs of new scenarios. In recent years, the rise of AI techniques, represented by machine learning and deep learning, has brought new solutions for audio noise reduction in these difficult scenarios. As its online audio and video live-streaming business has grown, Agora has gradually built up its own expertise in this area. This article, produced by the Agora audio technology team, is part of a series on audio evaluation in special scenarios, focusing on AI noise reduction. Since the industry has not yet settled on evaluation criteria for audio, Agora's practice focuses on engineering implementation, moving from intrusive to non-intrusive methods. We invite colleagues in the industry to offer criticism and corrections.

Background

As developers, we want to give users a real-time interactive experience with high clarity, fluency, and high-fidelity sound quality. However, noise is always present and can interfere with calls. Different settings produce different noise, which can be stationary, non-stationary, or transient. Stationary noise does not change over time, such as white noise. Non-stationary noise changes over time, such as people talking or road noise. Transient noise, a subclass of non-stationary noise, is intermittent noise of short duration, such as keyboard taps, knocks on a table, or a door closing. In a real interactive scenario where both parties use mobile devices, if one party is in a restaurant, on a noisy street, in the metro, or in a noisy environment such as an airport, the other party receives a speech signal containing a large amount of noise. When the noise is too loud, the two parties cannot hear what each other is saying, which easily produces negative emotions such as anxiety and hurts the end-user experience. Therefore, to reduce the interference of noise with the speech signal and improve the user's experience, Noise Suppression (NS) is used to filter noise out of the noisy speech signal while retaining as much of the speech as possible, so that the voice each party hears is not disturbed by noise. An ideal NS technique removes noise while preserving the clarity, intelligibility, and comfort of speech.

Research on noise reduction began in the 1960s, and great progress has been made over decades of development. We roughly divide noise reduction algorithms into the following categories.

(1) Subspace methods. The basic idea is to map the noisy speech signal into a signal subspace and a noise subspace. The clean speech signal can be estimated by eliminating the noise-subspace components and keeping the signal-subspace components.

(2) Short-time spectral subtraction, which assumes the noise signal is stationary and changes slowly, and subtracts the spectrum of the estimated noise from the spectrum of the noisy signal to obtain the denoised speech signal.

(3) Wiener filtering, which estimates the speech signal with a Wiener filter under the minimum mean-square-error criterion and then extracts the speech from the noisy signal.

(4) Methods based on the auditory masking effect, which simulate human-ear perception by computing, for each frequency at a given moment, the minimum noise energy the ear can perceive. By keeping the noise energy below this threshold, residual noise is maximally masked while speech distortion is prevented.

(5) Methods based on noise estimation, which generally rely on differences between noise and speech characteristics, distinguishing noise components from speech components through Voice Activity Detection (VAD) or speech-presence probability. However, when the noise and speech characteristics are similar, these algorithms cannot accurately separate the speech and noise components of noisy speech.

(6) AI noise reduction. AI-based NS solves, to a certain extent, the problems of traditional NS techniques. Its advantages are most apparent on transient noise (short-duration, high-energy noise such as a door shutting or tapping) and on certain non-stationary noise (noise that changes quickly and fluctuates unpredictably over time, such as a noisy street).
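Of the methods above, short-time spectral subtraction (method 2) is simple enough to sketch in a few lines. The following is a minimal illustration only, assuming the first few frames of the signal are noise-only to seed the stationary-noise estimate; the function name and all parameters are made up for the example, and this is not a production NS algorithm.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, hop=128, noise_frames=5):
    """Basic magnitude spectral subtraction (illustrative sketch).

    Assumes the first `noise_frames` frames contain noise only,
    which is how a stationary-noise estimate is commonly seeded.
    """
    window = np.hanning(frame_len)
    # Split the signal into overlapping, windowed frames.
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Estimate the stationary noise magnitude from the leading frames.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the noise spectrum; floor at zero to avoid negative magnitudes.
    clean_mag = np.maximum(mag - noise_mag, 0.0)

    # Resynthesize by overlap-add, reusing the noisy phase.
    out = np.zeros(len(noisy))
    frames_out = np.fft.irfft(clean_mag * np.exp(1j * phase),
                              n=frame_len, axis=1)
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += frames_out[i]
    return out
```

Note how the method matches its stated weakness: because the noise estimate is frozen from the leading frames, it only tracks stationary noise, which is exactly why non-stationary and transient noise motivate the AI approaches in method (6).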

Whether we use traditional NS or AI NS, we must consider package size and computing power when the product ships, so that it can run on mobile and IoT devices. In other words, we need to maximize NS performance while keeping the model lightweight. This is one of the most challenging aspects of actually shipping a product. Once the model size is under control, can NS performance still meet the bar? Here we focus on how to evaluate NS performance: when tuning NS parameters, refactoring NS, proposing a new NS algorithm, or comparing different NS implementations, how can we evaluate NS performance from the perspective of user experience?

First, we divide NS evaluation methods into subjective and objective; objective tests are further divided into intrusive and non-intrusive, that is, with or without a reference input. Their definitions, advantages, and disadvantages follow.

Subjective test
Meaning: A subjective evaluation reflects the listener's impression of speech quality under a set of preset rules. Absolute Category Rating (ACR) is generally adopted, and sound quality is judged via the Mean Opinion Score (MOS): there is no reference speech, and the listener hears only the degraded speech and rates it on a scale of 1 to 5.
Advantages: directly reflects the user experience.
Disadvantages: high labor cost, long test cycle, poor repeatability, and sensitivity to individual subjective differences.

Objective test (intrusive)
Meaning: Predicts the subjective mean opinion score (MOS) from some form of distance between the reference speech and the test speech. For example, most papers evaluate their NS algorithms with PESQ, SNR, segmental SNR, or the Itakura-Saito distance.
Advantages: automated batch testing, saving labor and time.
Disadvantages: (1) not fully equivalent to the subjective user experience; (2) most objective metrics support only a 16 kHz sampling rate; (3) the reference and test signals must be frame-aligned, but real-time RTC audio is inevitably affected by the network, so the data cannot be frame-aligned, which directly hurts the accuracy of the objective metrics.

Objective test (non-intrusive)
Meaning: Predicts speech quality from the test speech alone.
Advantages: RTC audio quality can be evaluated in real time, with no need for the original reference signal.
Disadvantages: technically demanding; the model is difficult to build.

We believe subjective tests directly reflect the user experience, and if the subjective and objective test results agree, the accuracy of the objective test is validated; in that case the objective test can also stand in for user experience. Next, let's look at how Agora evaluates NS performance.

Agora NS evaluation

We are building a comprehensive, reliable, long-term NS measurement system. We believe it can handle any future noisy scenario (it currently covers more than 70 noise types) and any NS technique, and it does not prescribe a particular test corpus, sampling rate, or effective spectrum: anyone's speech can be the object under test. To this end, we verified existing NS measurement techniques and found that they could not cover all of our call scenarios or the noise types we test, nor could they represent subjective impressions. Therefore, we fit a new full-reference NS metric and use a deep-learning algorithm to build a no-reference model; the two approaches proceed in parallel. Below we briefly describe the existing NS evaluation metrics, our verification method, and how we build the full-reference and no-reference NS evaluation models.

1. Existing NS evaluation metrics: drawing on a large body of literature, authoritative papers, and open-source repositories such as github.com/schmiph2/py…, and according to our actual scenario requirements, we built an objective metric library for NS evaluation. It includes common metrics such as PESQ, SegSNR, and STOI, as well as various distance features between the reference speech and the test speech. For example, Cepstrum Distance (CD) reflects the influence of nonlinear distortion on sound quality; Log Spectral Distance (LSD) measures the distance between log power spectra; the Normalized Covariance Measure (NCM) computes the frequency-domain covariance between the envelope signals of the clean and the noisy speech; and Csig, Cbak, and Covl are predicted ratings [1-5] of speech distortion, background-noise intrusiveness, and overall quality respectively, composite measures formed by combining multiple objective measures. The reason for using composite measures is that different objective measures capture different characteristics of the distorted signal, so combining them linearly or nonlinearly may significantly improve correlation.
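To make the "distance between reference and test speech" idea concrete, here are minimal implementations of two of the simpler metrics mentioned above, segmental SNR and log-spectral distance. This is an illustrative sketch: the frame length, epsilon, and the conventional [-10, 35] dB clamp are common choices from the literature, not necessarily the exact settings in our metric library.

```python
import numpy as np

def segmental_snr(ref, test, frame_len=256, eps=1e-10):
    """Frame-wise SNR averaged over frames (SegSNR).

    Per-frame SNRs are clamped to [-10, 35] dB, a common convention
    so that silent frames do not dominate the mean.
    """
    n = min(len(ref), len(test)) // frame_len * frame_len
    ref_f = ref[:n].reshape(-1, frame_len)
    err_f = (ref[:n] - test[:n]).reshape(-1, frame_len)
    snr = 10 * np.log10((ref_f ** 2).sum(axis=1)
                        / ((err_f ** 2).sum(axis=1) + eps) + eps)
    return float(np.clip(snr, -10.0, 35.0).mean())

def log_spectral_distance(ref, test, frame_len=256, eps=1e-10):
    """RMS distance between log power spectra, averaged over frames."""
    n = min(len(ref), len(test)) // frame_len * frame_len
    R = np.abs(np.fft.rfft(ref[:n].reshape(-1, frame_len), axis=1)) ** 2
    T = np.abs(np.fft.rfft(test[:n].reshape(-1, frame_len), axis=1)) ** 2
    d = np.sqrt(np.mean((10 * np.log10(R + eps)
                         - 10 * np.log10(T + eps)) ** 2, axis=1))
    return float(d.mean())
```

Both are intrusive metrics: they need the clean reference, and both implicitly require the two signals to be frame-aligned, which is exactly the limitation discussed above for real-time RTC audio.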

Each metric corresponds to a change in some characteristic of the audio before and after NS, and each measures NS performance from a different angle. We cannot help but ask: do these metrics match subjective impressions? Beyond being algorithmically sound, how do we make sure they are consistent with subjective perception? If the objective metrics show no problem, does that mean subjective listening finds no problem either? And how do we ensure the coverage of these metrics?

2. Our verification method: to verify the accuracy of our objective metric library and its correlation with subjective experience, we ran crowdsourced subjective audio tests and developed an app specifically for crowdsourced subjective labeling. The whole process follows P.808 and P.835 and references the NS Challenge, with requirements on test data, duration, environment, equipment, testers, and so on. We focus on three dimensions: speech clarity (SMOS), noise comfort (NMOS), and overall quality (GMOS), each scored from 1 to 5. The MOS score descriptions and the app page design are given below.

So how strongly do the subjective annotation results correlate with the metrics in the objective metric library? We computed statistics for every metric in the library; here we give only the Pearson linear correlation coefficient (PLCC) between PESQ and the subjective labels:

PLCC with PESQ
Subjective SMOS: 0.68
Subjective NMOS: 0.81
Subjective GMOS: 0.79

The subjective SMOS, NMOS, and GMOS here are computed as mean values over 200 data items, each item labeled by 32 people.
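For reference, aggregating per-item MOS from rater scores and computing PLCC against an objective metric is straightforward. The helper names below are illustrative, and the sample data is made up for the example:

```python
import numpy as np

def mos(ratings):
    """Mean opinion score per item from an (n_items, n_raters) matrix."""
    return np.asarray(ratings, dtype=float).mean(axis=1)

def plcc(x, y):
    """Pearson linear correlation coefficient between two score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

In our setup, `x` would be the subjective SMOS/NMOS/GMOS vector over the 200 items and `y` the corresponding objective scores (e.g., PESQ), yielding the 0.68/0.81/0.79 correlations in the table above.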

3. How we build the full-reference and no-reference NS evaluation models: as subjective annotation data accumulated, we found that the existing metrics were not accurate enough to cover all our scenarios and noise types, let alone represent subjective impressions. We therefore fit a new composite MOS measure to evaluate NS performance.

The first approach is the full-reference model: the metrics in the objective metric library serve as input features, the crowdsourced labels serve as training targets, and three models are trained whose outputs score the speech, the noise, and the overall quality respectively.

The data set consists of 800 items, of which 70% are randomly selected as the training set and 30% as the test set. We chose GBDT (Gradient Boosting Decision Tree) to train and test the GMOS model. The upper part of the figure below shows the real GMOS of the training set against the GMOS predicted by the trained model on the training set; the lower part shows the real GMOS of the test set against the model's predictions on the test set. On the test set, the PLCC between real and predicted GMOS reaches 0.945, the Spearman rank-order correlation coefficient (SROCC) is 0.936, and the Root Mean Square Error (RMSE) is 0.26.
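The training setup just described can be sketched with scikit-learn. Everything below is illustrative: the feature matrix is synthetic stand-in data (in practice the columns would be the objective metrics such as PESQ, SegSNR, and STOI, with crowdsourced GMOS as the label), and the hyperparameters are assumptions, not our tuned values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 800-item objective-metric feature table.
# A hidden linear relation plus noise gives the model something to learn.
X = rng.standard_normal((800, 6))
y = 3.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(800)
y = np.clip(y, 1.0, 5.0)  # keep labels in the 1-5 MOS range

# 70/30 random split, as described in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  learning_rate=0.05)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

# Evaluate with the same statistics quoted in the text.
plcc = float(np.corrcoef(y_te, pred)[0, 1])
rmse = float(np.sqrt(np.mean((y_te - pred) ** 2)))
```

With real metric features and real labels, `plcc` and `rmse` here correspond to the 0.945 PLCC and 0.26 RMSE reported above; on this synthetic data the numbers will of course differ.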

Our second approach is the no-reference model. Full-reference objective metrics require the reference and test signals to be frame-aligned, but real-time RTC audio is inevitably affected by the network, so the data cannot be frame-aligned, which directly degrades metric accuracy. To avoid this, we are also building a no-reference SQA (Speech Quality Assessment) model. The current core technique is to convert the audio into a mel spectrogram, cut the spectrogram into segments, and use a CNN to extract quality features from each segment. Self-attention then models the feature sequence in time, letting the features interact temporally. Finally, an attention model computes each segment's contribution to the overall MOS, mapping the sequence to the final MOS score.
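The pipeline just described (CNN per segment, self-attention over time, attention pooling to a MOS) can be sketched as a small PyTorch module. This is an illustrative sketch only: the class name `NoRefSQA`, all layer sizes, and the sigmoid mapping to the 1-5 range are assumptions made for the example, not Agora's actual architecture.

```python
import torch
import torch.nn as nn

class NoRefSQA(nn.Module):
    """Sketch of a no-reference SQA model: CNN encodes each
    mel-spectrogram segment, self-attention models the segment
    sequence in time, and attention pooling maps it to one MOS."""

    def __init__(self, n_mels=64, d_model=128):
        super().__init__()
        # Per-segment quality-feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(), nn.Linear(16 * 4 * 4, d_model),
        )
        # Self-attention over the segment sequence (temporal interaction).
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.attn_pool = nn.Linear(d_model, 1)  # per-segment contribution
        self.head = nn.Linear(d_model, 1)       # pooled feature -> MOS

    def forward(self, segments):
        # segments: (batch, n_segments, n_mels, frames_per_segment)
        b, s, m, t = segments.shape
        feats = self.cnn(segments.reshape(b * s, 1, m, t)).reshape(b, s, -1)
        feats = self.encoder(feats)
        # Attention weights = each segment's contribution to the MOS.
        w = torch.softmax(self.attn_pool(feats), dim=1)
        pooled = (w * feats).sum(dim=1)
        # Squash the output into the 1-5 MOS range.
        return 1.0 + 4.0 * torch.sigmoid(self.head(pooled)).squeeze(-1)
```

Given a batch of segmented mel spectrograms, the module returns one MOS-scaled score per utterance; training would regress these against the crowdsourced labels.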

In the training curves, the horizontal axis is the epoch. The blue line shows the training loss over epochs; the red line shows the PLCC between the training-set predictions and labels over epochs; the green line shows the PLCC between the test-set predictions and labels over epochs. The offline results so far are promising, and we will add more scene data for model training later.

In the future

In the future, we will move to Audio Quality Assessment (AQA) directly, since noise is only one of the factors in audio that affect subjective experience. We will build a complete online real-time audio evaluation system. This system will be reliable over the long term and highly accurate, and will be used to assess how displeased or pleased users are during real-time audio interaction. The whole process includes scheme design, data-set construction, crowdsourced annotation (establishing annotation standards, cleaning and screening annotated data, verifying data distributions), model training and optimization, and online feedback. Although we face some challenges right now, with a SMART goal set, we will achieve it.

The Dev for Dev column

Dev for Dev (Developer for Developer) is an interactive, innovative practice jointly initiated by Agora and the RTC Developer Community. Through engineer-driven activities such as technology sharing, exchange of ideas, and project co-construction, it gathers the strength of developers, unearths and delivers the most valuable technical content and projects, and fully releases the creativity of technology.