Today I would like to talk about real-time voice quality: a rough overview of the existing methods in this field, followed by some things I want to do in the future.
Speech quality assessment methods
First, speech quality evaluation is generally divided into subjective and objective methods. Subjective evaluation relies entirely on human perception and comes in two kinds. In the first, the listener gets no original reference signal: they hear only a speech sample and, after listening, say what score they think it deserves. In the second, the listener is given an anchor and told that it represents the worst quality, and then rates the sample relative to that anchor. The latter is also the subjective method used most often in current papers.
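The first kind of subjective test can be sketched as a simple aggregation of listener scores. This is a minimal sketch, assuming a 1 to 5 rating scale and a normal-approximation 95% confidence interval; neither is stated in the talk, and the function name `mean_opinion_score` and the ratings are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (mean score, 95% confidence half-width) for listener ratings.

    Assumes a 1-5 scale and a normal approximation (z = 1.96).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the assumed 1-5 scale")
    score = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return score, half_width


# Hypothetical ratings from eight listeners for one speech sample.
mos, ci = mean_opinion_score([4, 5, 3, 4, 4, 5, 3, 4])
print(f"score = {mos:.2f} +/- {ci:.2f}")
```

The confidence interval is a reminder that a subjective score is only as stable as the number of listeners behind it.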
Objective evaluation method
Objective methods are divided, according to whether they need the original reference signal, into methods with reference and methods without reference. Among the methods with reference, the first standard, P.861 (PSQM), dates from around 1996. The idea is to take a clean speech signal and a degraded one, compare them for similarity or for auditory impairment, and then give a score. P.862 (PESQ) followed around 2000, and around 2004 came PESQ-WB, which extended the sampling rate covered by PESQ from 8 kHz to 16 kHz; PESQ-WB is what we generally use now. Many papers, for example on noise reduction, lossless processing and so on, still use it for evaluation. Around 2012, ITU-T published a new standard, P.863 (POLQA). POLQA is effectively an upgraded PESQ with some improvements for noise suppression, and its accuracy is quite high: POLQA scores are close to subjective listening scores, and the closer an objective score is to the subjective score, the more accurate the method.
Objective evaluation methods with reference
- P.861 (PSQM), the earliest standard
- P.862 (PESQ and PESQ-WB), the most widely used reference-based evaluation methods
- P.863 (POLQA), the latest reference-based evaluation method
Objective evaluation methods without reference
- P.563, the best-known narrowband no-reference evaluation method
- ANIQUE, according to its authors more accurate than the reference-based PESQ
- E-model / P.1201, parameter-domain evaluation methods
- XxNet, deep-learning evaluation methods
There are actually many of these. The most commonly used is ITU-T P.563. It needs only the degraded speech, not the original: it looks at the integrity of the speech, estimates the noise level, and checks whether the signal is smooth enough, and from these judges whether the speech is acceptable. If none of these features shows a problem, it gives a high score. If some features indicate a serious problem, such as breaks in the speech or excessive noise, it gives a relatively low score. After P.563 came ANIQUE, an American standard that, according to its literature, is more accurate than the reference-based PESQ method described above. Then there are the parameter-domain methods, which do not process the speech signal at all but estimate quality from state information. In the E-model, for example, the chain runs from capture through echo handling to the codec, and whenever a module introduces damage, the impairment factor for that damage is subtracted from the total. There is also the relatively new P.1201 standard, which covers both audio and video evaluation; the audio part mainly considers network parameters, the codec, volume parameters and so on.
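The parameter-domain idea of subtracting impairments from a total can be sketched as follows. The rating-to-score mapping is the one published with the E-model (ITU-T G.107); the base rating of 93.2 is the commonly quoted default, and the individual impairment values and the helper name `rating_after_impairments` are hypothetical, chosen only for illustration.

```python
def r_to_mos(r):
    """Map an E-model transmission rating R to an estimated MOS (G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
    # The polynomial dips marginally below 1 for very small R, so clamp it.
    return max(1.0, mos)


def rating_after_impairments(base_r, impairments):
    """Subtract each module's impairment factor from the base rating."""
    return base_r - sum(impairments.values())


# Hypothetical impairment factors, not measured values.
impairments = {"codec": 11.0, "delay": 5.0, "packet_loss": 10.0}
r = rating_after_impairments(93.2, impairments)
print(f"R = {r:.1f}, estimated MOS = {r_to_mos(r):.2f}")
```

Because only state information enters the calculation, the estimate is cheap to compute, but it is only as good as the impairment factors assigned to each module.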
Pain points of objective evaluation methods
- Reference-based methods: can only be used before going live
- No-reference, traditional signal domain: narrow application scenarios, poor robustness
- No-reference, traditional parameter domain: accuracy holds only under a limited set of weak-network conditions
- No-reference, deep learning: limited application scenarios and corpora, slightly higher complexity
- Narrow scenarios
- Poor accuracy
- Poor robustness
- High complexity
From offline testing to online
Online quality perception should have high accuracy, wide coverage, low complexity and strong robustness: the quality assessment should be accurate enough, cover most business scenarios, not introduce too much algorithmic complexity, and be only weakly correlated with the speech content.
Downlink quality evaluation method
A standard downlink process is encoding, transmission, decoding and playback, so the factors involved are codec performance, network quality, the quality of the weak-network countermeasure algorithms, device playback capability, and so on. We ran a set of data tests: across test cases covering multiple weak-network conditions, multiple devices and multiple modes, the MAE between this method's score and the POLQA reference score was below 0.1, the MSE was below 0.01, and the maximum error was below 0.15. The figure for this section shows the multi-weak-network test result for one mode of one device.
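For readers who want to reproduce the three error figures quoted above on their own data, here is a minimal sketch. The function name `error_metrics` and the sample scores are made up for illustration; they are not the measurements from these tests.

```python
def error_metrics(predicted, reference):
    """Return (MAE, MSE, maximum absolute error) between two score lists."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("score lists must be non-empty and of equal length")
    errors = [p - r for p, r in zip(predicted, reference)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    max_err = max(abs(e) for e in errors)
    return mae, mse, max_err


# Hypothetical scores: online estimate vs. reference-based score per test case.
estimated = [3.9, 4.1, 2.5]
reference = [4.0, 4.0, 2.6]
mae, mse, max_err = error_metrics(estimated, reference)
print(f"MAE={mae:.3f} MSE={mse:.4f} max={max_err:.3f}")
```

Reporting the maximum error alongside MAE and MSE matters here, because a single badly misjudged call is more damaging to an online monitor than a small average bias.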
Uplink quality evaluation method
The uplink has many modules, and each module is independent. So, first of all, each module needs its own detection capability. For example, the echo module needs to know by itself whether it may currently be leaking echo. Then, after all the modules have self-checked and before encoding, a unified detection module acts like a guard over the whole chain. Extracting what all scenarios have in common, we can summarize four points:
- Device acquisition stability
- Echo cancellation capability
- Noise suppression capability
- Volume adjustment ability
Causes of echo leakage
What we really want to know is whether echo is leaking right now. The causes of echo leakage generally fall into four categories:
- Delay jitter. It can have many causes: a thread stalls and the signal is not delivered in time; the external device in use is severely nonlinear; or, with dual devices, the signals become non-causal, which is generally caused by buffering
- Highly reverberant environment: the reverberation length exceeds the filter length
- The captured signal overflows, so the filter does not converge
- Double talk: heavy reliance on NLP, which takes care of one side and loses the other
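The reverberation item in the list above comes down to simple arithmetic: an adaptive filter can only model as much echo path as its taps span in time. A minimal sketch, assuming a time-domain filter with one tap per sample; the function names and the numbers (2048 taps, 16 kHz, 300 ms of reverberation, 20 ms of bulk delay) are hypothetical.

```python
import math


def filter_span_ms(num_taps, sample_rate_hz):
    """Length of echo path, in milliseconds, that the filter can model."""
    return 1000.0 * num_taps / sample_rate_hz


def covers_echo_path(num_taps, sample_rate_hz, reverb_ms, bulk_delay_ms=0.0):
    """True if the filter spans the bulk delay plus the reverberation tail."""
    return filter_span_ms(num_taps, sample_rate_hz) >= reverb_ms + bulk_delay_ms


def taps_needed(sample_rate_hz, reverb_ms, bulk_delay_ms=0.0):
    """Smallest tap count that spans the bulk delay plus the reverberation tail."""
    return math.ceil((reverb_ms + bulk_delay_ms) * sample_rate_hz / 1000.0)


span = filter_span_ms(2048, 16000)
print(f"filter spans {span:.0f} ms")
print("covers a 300 ms tail:", covers_echo_path(2048, 16000, 300.0, 20.0))
print("taps needed:", taps_needed(16000, 300.0, 20.0))
```

Whatever part of the tail falls outside the span cannot be cancelled linearly and is left for the nonlinear stage to suppress.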
Causes of noise
- Device noise: single-frequency noise, power-line frequency noise, laptop fan noise, irregular noise
- Ambient noise: babble, honking, etc.
- Signal overflow: popping sounds
- Introduced by algorithms: residual echo, etc.
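Of the causes listed above, signal overflow is the easiest to detect directly from the samples. The following is a minimal sketch, assuming 16-bit PCM input; the function name `clipping_ratio`, the `margin` parameter and the 1% alarm threshold are assumptions for illustration, not part of the system described here.

```python
def clipping_ratio(samples, full_scale=32767, margin=0):
    """Fraction of samples at or within `margin` of 16-bit full scale."""
    if not samples:
        return 0.0
    limit = full_scale - margin
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


def is_overflowing(samples, threshold=0.01):
    """Flag a frame when more than `threshold` of its samples are clipped."""
    return clipping_ratio(samples) > threshold


# Hypothetical frame: half of the samples sit at the 16-bit limits.
frame = [0, 1000, 32767, -32768, 200, 32767, -5, 32767]
print(f"clipped fraction: {clipping_ratio(frame):.2f}")
print("overflow detected:", is_overflowing(frame))
```

The same check is useful ahead of echo cancellation, since an overflowing capture signal is also one of the listed reasons a filter fails to converge.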
Causes of low volume
- Weak device capture capability or quiet speech (most cases)
- Weak device playback capability
- Small analog gain or analog boost gain (PC)
- Small digital gain (bidirectional)
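Whichever of these causes applies, the symptom can be measured the same way, as a low signal level relative to full scale. A minimal sketch, assuming 16-bit PCM samples; the function names and the -40 dBFS threshold are hypothetical and would need tuning on real devices.

```python
import math


def rms_dbfs(samples, full_scale=32768.0):
    """RMS level of a frame in dB relative to 16-bit full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20.0 * math.log10(rms / full_scale)


def is_too_quiet(samples, threshold_dbfs=-40.0):
    """Flag a frame whose level falls below an assumed threshold."""
    return rms_dbfs(samples) < threshold_dbfs


# A constant half-scale frame sits about 6 dB below full scale.
level = rms_dbfs([16384] * 100)
print(f"level: {level:.2f} dBFS")
print("too quiet:", is_too_quiet([30, -25, 40, -35]))
```

In practice the level should be measured only on frames that contain speech, otherwise pauses will be reported as low volume.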
Independent detection modules
- Howling detection: detection and suppression
- Noise detection: early warning
- Noise detection: quantifying the impact of the noise introduced
- Hardware detection: estimating the performance of external devices
In the future
Integration of perception, feedback, and monitoring
- The internal state is finer
- Experience coverage is wider
- Faster feedback
- Call coverage is more complete