A talk on real-time voice quality monitoring systems

Today, Mr. Wang will talk about the past and present of real-time voice quality monitoring systems. Everyone is familiar with real-time voice: WeChat voice chat and live video streaming are everyday examples.

In voice communication systems to date, many factors affect voice quality, including but not limited to delay, packet loss, packet delay variation (jitter), echo, and distortion introduced by coding.

Generally speaking, speech quality evaluation methods fall into three types: objective methods with a reference, subjective methods, and objective methods without a reference.

Objective evaluation methods with reference:

These methods compare the original reference audio or video with the distorted version, sample by sample (for video, pixel by pixel in each corresponding frame). Strictly speaking, they do not measure perceived quality but how similar, or faithful, the distorted signal is to the original. The simplest such measures, mean squared error (MSE) and peak signal-to-noise ratio (PSNR), are widely used.
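As a minimal sketch of these two measures, assuming both signals are already time-aligned and stored as equal-length lists of samples normalized to a peak of 1.0 (the function names and sample values are illustrative, not from the talk):

```python
import math


def mse(reference, degraded):
    """Mean squared error between two equal-length, non-empty sample sequences."""
    if len(reference) != len(degraded) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)


def psnr(reference, degraded, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the largest possible sample value."""
    err = mse(reference, degraded)
    if err == 0:
        return math.inf  # identical signals have no error energy
    return 10 * math.log10(peak ** 2 / err)


# Hypothetical data: the "distortion" is a constant offset of 0.1.
reference = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
degraded = [x + 0.1 for x in reference]

print("MSE :", mse(reference, degraded))
print("PSNR:", psnr(reference, degraded), "dB")
```

Both numbers describe fidelity to the reference only. Two distortions with the same MSE can sound very different, which is why perceptual models such as PESQ and POLQA exist.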

Speech quality is an important indicator of speech transmission performance, and how to build an accurate and reliable QoE (quality of experience) evaluation system has become a focus of current research. PESQ (Perceptual Evaluation of Speech Quality) is a QoE-oriented speech quality evaluation algorithm that the ITU standardized as ITU-T P.862, and it remains a popular algorithm today. The earliest standard in this line is ITU-T P.861, also called PSQM, a voice quality evaluation method based on PAQM. At present, P.862 PESQ and its wideband extension PESQ-WB are the most widely used reference methods, and the latest is P.863 POLQA. All of them rely on a lossless reference signal.

Objective evaluation method without reference:

Since the 1970s, objective evaluation of speech quality has developed rapidly, and researchers at home and abroad have proposed thousands of objective methods. Objective evaluation arose mainly to address the shortcomings of subjective evaluation: people had long wanted a way to assess the sound quality of voice equipment conveniently and quickly, with the evaluation itself carried out by hardware or software rather than by listeners. Most of these methods compare characteristic parameters of the original and distorted speech signals in the time-frequency domain or a transform domain; PSQM, PAMS, PSQM+ and similar methods are widely used. Among no-reference methods, P.563 is the best-known narrowband one. ANIQUE+, according to its authors, is more accurate than the reference-based PESQ. Others include parameter-domain methods such as the E-model and P.1201, and deep-learning methods such as xxNet.

Objective evaluation methods also have many disadvantages:

Reference methods: can only be used before launch
No-reference methods, traditional signal domain: narrow application scenarios and poor robustness
No-reference methods, traditional parameter domain: accuracy holds only under a limited range of weak-network conditions
No-reference methods, deep learning: limited application scenarios and corpora, slightly higher complexity

In general, objective speech quality assessment methods can be proposed from many directions, but their performance and reliability are ultimately determined by how well they correlate with subjective assessment. We usually judge this through a fitting process: collect subjective and objective scores under a range of conditions, then perform a least-squares fit on the pairs, with the objective score on the horizontal axis and the subjective score on the vertical axis. Plotting the fitted curve shows how the two assessments compare. The prediction mean squared error (MSE) is usually used to reflect the correlation between them. The smaller the prediction MSE, the better the correlation and the better the objective method performs; the larger it is, the worse the correlation and the worse the objective method performs.
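The fitting step can be sketched as follows. This is only an illustration under stated assumptions: a straight-line fit is used (real evaluations often fit a higher-order or logistic curve), and the score lists are made-up values, not measurements from the talk.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y ~ a * x + b; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def prediction_mse(x, y, a, b):
    """Mean squared error between the fitted prediction a * x + b and y."""
    return sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical scores under seven test conditions (MOS-like scale).
objective_scores = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]   # horizontal axis
subjective_scores = [1.4, 2.1, 2.4, 3.2, 3.4, 4.1, 4.4]  # vertical axis

a, b = fit_line(objective_scores, subjective_scores)
err = prediction_mse(objective_scores, subjective_scores, a, b)
print(f"fit: subjective ~ {a:.3f} * objective + {b:.3f}, prediction MSE = {err:.4f}")
```

A lower prediction MSE on held-out conditions indicates that the objective method tracks listener opinion more closely.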

At present, the emphasis is moving from offline testing to online monitoring, which calls for high precision, wide coverage, low complexity and strong robustness:

High precision: quality assessment is accurate enough
Wide coverage: most business scenarios are covered
Low complexity: no excessive algorithmic complexity is introduced
Strong robustness: weak correlation with speech content

Uplink quality evaluation method: capture - AEC - NS - AGC - diagnosis, with independent detection plus unified detection

Features: device capture stability, echo cancellation ability, noise suppression ability, volume adjustment ability

Downlink quality evaluation method: encoding - transmission - decoding - playback

Taking one laboratory as an example, its core indicators for validating data and mapping overall audio quality include codec performance, network quality, the quality of the weak-network resilience algorithms, and device playback capability.

In test cases covering multiple weak-network conditions, devices and modes, the MAE and MSE of this method against the POLQA reference score are below 0.1 and 0.01 respectively, and the maximum error is below 0.15.

Figure: multi-weak-network test results for one mode of one device

Let's briefly talk about NOMA (Non-Orthogonal Multiple Access), a promising 5G technology whose theoretical basis is multi-user information theory. Its advantage is that it improves spectral efficiency (rate/bandwidth) and the number of connections, which matches the explosive growth in data and access demand of the coming 5G era. NOMA can serve as a simple case for comparing uplink and downlink quality evaluation.

Comparison of uplink and downlink quality evaluation methods

1. Users' transmit power differs.

In downlink NOMA, each user's transmit power is limited by the base station's total transmit power and by the power allocated to the other users. Users with different channel quality are allocated different power: a user with a poor channel (low channel gain) is allocated high power, and a user with a good channel is allocated low power.

In the uplink, each user's transmit power is limited only by the maximum transmit power of its own device. When users' channel quality differs substantially, each user can simply transmit at its own maximum power. When the difference is small, a power allocation can be used that guarantees the performance of the poorer channel while improving the better one, although in practice this often hurts the user on the poorer channel.

2. The SIC decoding order differs.

In the downlink, each receiver receives the superimposed signal from the base station and has its own SIC (successive interference cancellation) receiver, which decodes the received signal successively until it obtains the signal it needs. For a given receiver, all components of the superimposed signal travel over the same channel, so the same channel gain is applied to each of them when calculating the rate, and the component with the highest received power is demodulated first.

In the uplink, by contrast, the users are the transmitters. Assuming their hardware has the same transmit capability, the users differ only in channel gain, and each transmits at its own maximum power. The signal from a user close to the base station therefore arrives with higher received power (received power = transmit power × channel gain). Here too the signal with the highest received power is demodulated first, which is the user with the highest channel gain, since the transmit powers are the same.

Decoding order: priority goes to the signal with the highest received power at the receiver. In the uplink this is the user with the best channel; in the downlink it is the component that was allocated the most power. Therefore, in a NOMA system, whether uplink or downlink, the receiver first demodulates the signal with the highest received power.
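The power and decoding-order rules above can be sketched numerically. This is a simplified illustration that assumes ideal SIC (a decoded signal is removed perfectly), Shannon rates in bits/s/Hz, and a two-user downlink cluster; the function names and all numbers are hypothetical.

```python
import math


def uplink_sic_order(users, noise_power):
    """users: list of (name, tx_power, channel_gain).

    Returns (name, received_power, rate) tuples in SIC decoding order,
    i.e. sorted by received power = tx_power * channel_gain, highest first.
    """
    received = sorted(
        ((name, tx * gain) for name, tx, gain in users),
        key=lambda item: item[1],
        reverse=True,
    )
    results = []
    for i, (name, rx) in enumerate(received):
        # Signals not yet decoded still act as interference.
        interference = sum(power for _, power in received[i + 1:])
        sinr = rx / (interference + noise_power)
        results.append((name, rx, math.log2(1 + sinr)))
    return results


def downlink_two_user_rates(total_power, weak_share, gain_weak, gain_strong, noise_power):
    """Rates for a two-user downlink cluster; `weak_share` is the fraction of
    base-station power given to the user with the poorer channel."""
    if not 0 < weak_share < 1:
        raise ValueError("weak_share must be between 0 and 1")
    p_weak = weak_share * total_power
    p_strong = (1 - weak_share) * total_power
    # Weak user decodes its own (high-power) signal, treating the rest as noise.
    sinr_weak = p_weak * gain_weak / (p_strong * gain_weak + noise_power)
    # Strong user first decodes and removes the weak user's signal, then its own.
    sinr_strong = p_strong * gain_strong / noise_power
    return math.log2(1 + sinr_weak), math.log2(1 + sinr_strong)


for name, rx, rate in uplink_sic_order([("far", 1.0, 0.1), ("near", 1.0, 0.9)], 0.1):
    print(f"uplink   {name}: received power {rx:.2f}, rate {rate:.2f}")

weak_rate, strong_rate = downlink_two_user_rates(1.0, 0.8, 0.1, 0.9, 0.01)
print(f"downlink weak: {weak_rate:.2f}, strong: {strong_rate:.2f}")
```

In the uplink example the near user is decoded first because equal transmit power and higher channel gain give it the higher received power, which matches the rule stated above.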

3. Users are affected by different interference.

In the downlink, users with poor channel quality are allocated high transmit power, so their signals are more likely to interfere with other users in the cluster. In other words, users with good channel quality are more likely to suffer interference.

In the uplink, all users' signals are superimposed at the base station, and users with poor channel quality are more likely to suffer interference than users with good channel quality.

4. The difficulty of implementation differs.

The uplink is easier to implement than the downlink. NOMA ultimately relies on multi-user detection and successive interference cancellation (SIC), and the SIC receiver performs the cancellation by distinguishing the received signal power of different users. In the downlink, the base station sends the superimposed signal to the users, so multi-user detection and SIC must be implemented in the user terminal. In the uplink, each user sends its own signal to the base station, so they need to be implemented only at the base station. A user terminal has far less processing capacity than a base station, which makes multi-user detection and SIC difficult to implement there.

If you are interested in NOMA, you can look up the relevant papers and materials to learn more. It is positioned as a promising 5G technology.

The following describes the causes of echo leakage, stray sounds and noise, and low volume in real-time voice.

Causes of echo leakage:

Delay jitter, busy threads, severe device nonlinearity, dual devices, and non-causality may occur during processing
Large reverberation environment: reverberation length exceeds filter length
Capture signal overflow: the filter does not converge
Double talk: strong dependence on NLP (nonlinear processing), easy to lose

Causes of stray sounds and noise:

Device noise: single-frequency noise, power-frequency (mains) noise, laptop fan noise, random noise
Ambient noise: babble, honking, etc.
Signal overflow: popping
Introduced by algorithms: residual echo, etc.
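Of the causes listed above, signal overflow is the easiest to check for directly. A minimal sketch, assuming 16-bit PCM samples given as Python integers; the function name and the idea of reporting a ratio are illustrative choices, not the talk's detector:

```python
def clipping_ratio(samples, full_scale=32767):
    """Fraction of 16-bit PCM samples whose magnitude reaches full scale.

    A high ratio suggests the capture signal overflowed, which is heard
    as popping and can also keep an echo-cancellation filter from converging.
    """
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    return clipped / len(samples)


# Hypothetical frame with three saturated samples out of eight.
frame = [0, 1200, 32767, -32768, 500, 32767, -40, 10]
print(f"clipped fraction: {clipping_ratio(frame):.3f}")
```

What ratio counts as a problem is a tuning decision; a production detector would more likely look for runs of consecutive saturated samples than for isolated ones.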

Causes of low volume:

Weak capture capability and quiet speech (the majority of cases)
Weak device playback capability
Small analog gain and analog boost gain
Small digital gain
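Whatever the cause, low volume shows up as a low signal level, which can be measured in dBFS. The sketch below assumes 16-bit PCM samples; the -30 dBFS threshold is a made-up illustrative value, not one given in the talk.

```python
import math


def rms_dbfs(samples, full_scale=32768.0):
    """RMS level of 16-bit PCM samples in dB relative to full scale."""
    if not samples:
        return -math.inf
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return -math.inf  # digital silence
    return 10 * math.log10(mean_square / full_scale ** 2)


def is_low_volume(samples, threshold_dbfs=-30.0):
    """True when the frame level is below the (hypothetical) threshold."""
    return rms_dbfs(samples) < threshold_dbfs


loud = [16384, -16384] * 4   # half of full scale
quiet = [100, -100] * 4      # very small amplitude
print(f"loud : {rms_dbfs(loud):.1f} dBFS, low volume: {is_low_volume(loud)}")
print(f"quiet: {rms_dbfs(quiet):.1f} dBFS, low volume: {is_low_volume(quiet)}")
```

In practice the level would be measured only on frames that contain speech, so that pauses do not drag the estimate down.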

Finally, the independent monitoring module can be divided into four parts: howling detection, stray-sound detection, noise detection, and hardware detection.

A brief outlook

In the future, I think perception, feedback and monitoring will be integrated and will become finer, wider, faster and more complete: internal state will be captured in finer detail, a wider range of experience will be covered, feedback will be faster, and calls will be more fully covered. I also believe that 5G technology, real-time audio and video transmission technology and quality evaluation systems in China will keep getting better.
