Hello everyone, I am CV Jun. I have dabbled in voice for some time, and today I will briefly describe how to evaluate the quality of voice before and after transmission, that is, how to assess the quality of our voice as captured and delivered by microphones and other audio equipment.

In terms of speech quality, there are three broad families of evaluation methods: full-reference objective evaluation, no-reference objective evaluation, and subjective evaluation.

If we subdivide these into their subclasses, there are many algorithms and evaluation ideas in use.

Voice quality is extremely important: it keeps our chats free of annoying noise, it makes military communication more reliable, and on every holiday it lets the long-missed, real, warm words and voice of a relative come through clearly on the phone call home.

How do we evaluate it?

For subjective evaluation, the main reference in China is the industry standard YD/T 2309-2011 on subjective audio testing. That standard in turn draws on the subjective-evaluation standards commonly adopted internationally: ITU-T P.800 (subjective evaluation of voice quality in telephone transmission systems), ITU-T P.805 (subjective evaluation of conversational quality), and ITU-T P.830 (subjective evaluation of telephone-band and wideband digital voice codecs).

CV Jun dug these evaluation methods out of the official websites; they are not new, but they are very comprehensive.

Figure 1: Test methods in the YD/T 2309-2011 standard

Scoring criteria

The scale can be 5-point or 7-point. If the mapping between scores and quality labels is defined in advance (higher score means better quality), no normalization is required; otherwise you need to normalize the scores.

Figure 2: YD/T 2309-2011 scoring criteria
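As a concrete illustration of the normalization just mentioned, here is a minimal sketch (the function name and the common 1-to-5 target scale are my own choices, not part of the standard) that linearly maps 5-point and 7-point ratings onto one scale:

```python
import numpy as np

def normalize_scores(scores, scale_min, scale_max, target=(1.0, 5.0)):
    """Linearly map ratings from their native scale onto a common MOS-like scale."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = target
    return lo + (scores - scale_min) * (hi - lo) / (scale_max - scale_min)

# A 7-point rating of 7 and a 5-point rating of 5 both land on 5.0.
seven = normalize_scores([1, 4, 7], 1, 7)
five = normalize_scores([1, 3, 5], 1, 5)
```

After this mapping, scores collected on different scales can be averaged together.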

Evaluation dimensions

The subjective evaluation standard lists many dimensions; in practice you should add or remove dimensions according to your actual product.

According to CV Jun, objective tests are usually divided into speech quality and speech intelligibility. This article first discusses speech quality; below are some common standards and related hands-on experience.

Objective evaluation – value based

Applicable standard: 1-65899

Value-based testing grades the signal at one or more activity levels; in the most widely used audio standards, the scoring models are trained from recordings covering different activity levels.

Objective evaluation – model based

(I) Background and standards

The earliest voice-quality evaluation in mobile networks was based only on a radio-side indicator (RxQual), while actual voice travels through many nodes such as radio, transmission, switching, and routing. A problem in any link degrades the user's perceived speech quality, so considering the radio indicator alone cannot find and localize voice-quality problems. Voice-quality evaluation based on user perception has therefore become the most important standard for assessing user voice quality.

Common speech-quality evaluation methods can be divided into subjective evaluation and objective evaluation. Early evaluation was purely subjective: after making a phone call, people judged the quality of the speech with their own ears. In 1996 the International Telecommunication Union began standardizing subjective tests that survey and quantify users' listening behavior and perceived speech quality.

(Note: for the GSM radio quality indicator, lower is better; a score of one beats a three.)

In real life, however, it is hard and costly to have people listen to and grade quality at scale. That is why the International Telecommunication Union has, one after another, produced sound-quality testing technology and standardized objective evaluation algorithms. PESQ, for example, starts from the listener's actual perception and uses a quantitative method to compute the degree of audio-quality degradation. PESQ was released by the ITU in February 2001 as the then-latest generation of speech-quality evaluation algorithms; because of its good accuracy and strong correlation with subjective scores, it became the most widely used objective algorithm for evaluating end-to-end speech quality in all kinds of networks. From the algorithm model (see Figure 6) we can see the overall flow: the reference and degraded signals are level-aligned, passed through an input filter that models the telephone handset, time-aligned, and transformed by an auditory model; the audible difference between the two representations is then extracted and mapped to a score. In general, the bigger the difference between the output signal and the reference signal, the lower the PESQ score.
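PESQ itself involves an auditory model and time alignment and is too long to reproduce here, but the core full-reference idea, comparing the spectra of the reference and the degraded signal frame by frame, can be shown with a toy sketch. Everything below (the log-spectral distance as the disturbance measure, the frame sizes, the test tone) is a simplification of my own, not the actual P.862 computation:

```python
import numpy as np

def frame_signal(x, frame=256, hop=128):
    """Split a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame) // hop)
    return np.stack([x[i * hop:i * hop + frame] for i in range(n)])

def log_spectral_distance(ref, deg, frame=256, hop=128, eps=1e-10):
    """Mean per-frame distance between log-magnitude spectra, in dB.
    A toy stand-in for PESQ's disturbance: more difference -> larger value."""
    w = np.hanning(frame)
    R = np.abs(np.fft.rfft(frame_signal(ref, frame, hop) * w, axis=1))
    D = np.abs(np.fft.rfft(frame_signal(deg, frame, hop) * w, axis=1))
    diff = 20 * np.log10(R + eps) - 20 * np.log10(D + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)
noisy = clean + 0.1 * rng.standard_normal(8000)
# The degraded copy is farther from the reference than the reference itself.
```

A real metric would additionally weight the difference perceptually, which is exactly what the auditory model in PESQ does.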

Earlier algorithms such as PSQM and MNB (ITU-T P.861) could only be used with specific codec types and test conditions, whereas PESQ (ITU-T P.862) covers a far wider range of codecs and network distortions, including filtering, variable delay, and transmission errors. (2) MOS test instruments. The model and algorithm have also been built into commercial test equipment: such an instrument embeds the PESQ algorithm module, plays a reference audio file through the system under test, records the degraded output, and computes MOS scores automatically, with no human listeners and no special phonetic expertise required.

Figure 9. Rohde & Schwarz audio analyzer with MOS test

summary

CV Jun has summarized the speech evaluation methods discussed above as follows:

Based on subjective judgment:

Representative protocols are the MOS, CMOS, and ABX listening tests. Their strength is that they measure real human perception directly; their weakness is that they are expensive, slow, and the scores drift between raters and sessions.

Based on objective models: algorithms such as PESQ model the distortions introduced along the chain (encoding and decoding, bit errors, packet loss, filtering, and so on) and predict a MOS-like score automatically, with no human listeners. Outside the conditions the model was built for, however, they can disagree with subjective results.

Based on signal distances: for example the MCD (Mel Cepstral Distortion) between a reference signal and a degraded or synthesized one. Simple and fully automatic, but it needs a reference and correlates only partly with perceived quality.

Based on deep learning: models such as MOSNet predict MOS directly from the signal, with no reference required; the cost is that they need rated speech data to train on, typically CNN or CNN-plus-recurrent regressors fitted to human scores. That is the direction we are most interested in.
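Of the metrics above, MCD is the easiest to write down. A minimal sketch of the standard MCD formula follows; it assumes you already have two time-aligned mel-cepstrum sequences (extracting them, for example with SPTK or WORLD, is out of scope here, so random arrays stand in):

```python
import numpy as np

def mcd(c_ref, c_syn):
    """Mel Cepstral Distortion in dB between two aligned mel-cepstrum
    sequences of shape (frames, dims). The 0th (energy) coefficient is
    conventionally excluded before calling this."""
    diff = np.asarray(c_ref) - np.asarray(c_syn)
    k = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return float(np.mean(k * np.sqrt(np.sum(diff ** 2, axis=1))))

rng = np.random.default_rng(1)
a = rng.standard_normal((100, 24))   # stand-in mel cepstra, 24 dims
b = a + 1.0                          # a uniformly shifted copy
```

Identical sequences give 0 dB; any mismatch increases the value, so lower MCD means the synthesized speech is spectrally closer to the reference.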

Comparing several metrics

1. MOS depends on the test setup. MOS estimates are provided by human raters, and the values are affected by many factors: the samples chosen, the instructions given, and the other systems rated in the same session. For this reason MOS values from different papers are not directly comparable; systems can only be compared fairly within one protocol. Google's SSW10 paper "Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs" compares several evaluation methods for long-form synthesized speech and finds that, when a sentence from a long text is rated, the presentation of the audio sample significantly influences the subjects' scores: a sentence played alone, without its context, is rated differently from the same sentence heard inside its paragraph.

The standard protocol is the Absolute Category Rating (ACR) described in ITU-T P.800 and P.800.1: participants listen to each sample and give an overall quality rating on the five-point scale. As a rule of thumb, a MOS of 4 or higher indicates good quality; below 3.6, a substantial share of subjects is not satisfied. The general requirements of a MOS test are: enough samples and enough raters; every audio sample played through the same device at the same level; every sequence rated under the same conditions. Alongside ACR there are comparative protocols such as DCR (Degradation Category Rating) and CCR, which rate degradation or compare pairs rather than absolute quality. Finally, do not report just a MOS; report its 95% confidence interval as well.

Here is a piece of code CV Jun found; you can have a look. It is fairly simple, so I will not walk through it.

# -*- coding: utf-8 -*-
import math

import numpy as np
import pandas as pd
from scipy.linalg import solve
from scipy.stats import t


def calc_mos(data_path: str):
    # table of individual scores read from CSV
    data = pd.read_csv(data_path)
    mu = np.mean(data.values)
    var_uw = (data.std(axis=1) ** 2).mean()
    var_su = (data.std(axis=0) ** 2).mean()
    mos_data = np.asarray([x for x in data.values.flatten() if not math.isnan(x)])
    var_swu = mos_data.std() ** 2

    # solve for the variance components
    x = np.asarray([[0, 1, 1], [1, 0, 1], [1, 1, 1]])
    y = np.asarray([var_uw, var_su, var_swu])
    [var_s, var_w, var_u] = solve(x, y)
    M = min(data.count(axis=0))
    N = min(data.count(axis=1))
    var_mu = var_s / M + var_w / N + var_u / (M * N)
    df = min(M, N) - 1
    t_interval = t.ppf(0.975, df, loc=0, scale=1)
    interval = t_interval * np.sqrt(var_mu)
    print('{}: {} +- {}'.format(data_path, round(float(mu), 3), round(interval, 3)))


if __name__ == "__main__":
    data_path = ''
    calc_mos(data_path)

Perceptual evaluation of speech quality (PESQ)

The processing flow is roughly as follows: first the system aligns the level of the original and degraded signals to a standard listening level, then filters both with a filter that models the telephone handset; after that, the two signals are time-aligned and converted into an internal perceptual representation. Distortions that listeners barely notice, such as constant linear filtering, overall gain changes, and variations in the silent intervals between utterances, are compensated for; the remaining audible disturbance is then extracted over time and frequency and mapped to a MOS-like score.
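The first step, level alignment, is easy to sketch. The -26 dBov target below is the value conventionally used in ITU tooling; treat the exact number and the function itself as illustrative rather than as the actual P.862 code:

```python
import numpy as np

def align_level(x, target_dbov=-26.0):
    """Scale a float signal (full scale = 1.0) so its RMS sits at
    target_dbov dB relative to full-scale overload."""
    rms = np.sqrt(np.mean(np.square(x)))
    target_rms = 10.0 ** (target_dbov / 20.0)
    return x * (target_rms / rms)

sig = 0.5 * np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000.0)
aligned = align_level(sig)
level_db = 20 * np.log10(np.sqrt(np.mean(aligned ** 2)))
```

A production implementation would measure the active speech level (ITU-T P.56) rather than the plain RMS, so silence does not skew the gain.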

Here CV Jun also introduces a comparison with PESQ: the P.563 algorithm, which is very easy to use.

Objective quality: the single-ended method P.563

Both P.563 and PESQ output quality scores, but PESQ needs the original signal as a reference, while P.563 works on the degraded signal alone, so P.563 is more widely usable; the price is lower precision. The P.563 algorithm has three stages: preprocessing, distortion-parameter estimation, and a mapping model. In preprocessing, the speech is first level-aligned to a standard level and filtered to model the receiving handset. A voice activity detector (VAD) then splits the signal into speech and non-speech parts: frames (the initial frame length is 4 ms) whose power lies above a dynamically adapted threshold are marked as speech. To improve the accuracy of the VAD, its raw decisions are post-processed: speech sections that are too short (for example under 12 ms) are discarded, and speech sections separated by pauses shorter than 200 ms are merged. Distortion parameters are then extracted from both the speech and the non-speech parts.
Among the many parameters P.563 computes, eight key parameters dominate, such as the speech-to-noise ratio (SNR). Background noise matters a great deal: speech with a strong noise background is usually scored with a MOS between 1 and 3. Based on the parameters, the dominant distortion class is determined, and the corresponding mapping model converts them into the final quality score.
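The VAD stage described above can be sketched in a few lines. This is a toy version: the 4 ms frame and the pruning of short speech runs come from the description, while the fixed threshold and all other details are simplifications of mine, not the actual P.563 logic:

```python
import numpy as np

def energy_vad(x, fs=8000, frame_ms=4, thresh_db=-40.0, min_frames=3):
    """Toy energy-based VAD: mark 4 ms frames above a fixed energy
    threshold as speech, then drop speech runs shorter than min_frames
    (cf. the post-processing step described above)."""
    flen = int(fs * frame_ms / 1000)
    n = len(x) // flen
    frames = x[:n * flen].reshape(n, flen)
    e_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = e_db > thresh_db
    out = active.copy()
    i = 0
    while i < n:                      # prune too-short speech runs
        if active[i]:
            j = i
            while j < n and active[j]:
                j += 1
            if j - i < min_frames:
                out[i:j] = False
            i = j
        else:
            i += 1
    return out

fs = 8000
silence = np.zeros(fs // 2)
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(fs // 2) / fs)
decisions = energy_vad(np.concatenate([silence, tone, silence]), fs)
```

On this silence-tone-silence input, only the middle block of frames should be marked active.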

CV Jun is already quite familiar with the background of these algorithms; have a look, and ask me about anything you do not understand.

Mapping model of objective evaluation results

In P.563 the mapping model is a linear model: the algorithm defines 12 linear equations that, given the extracted parameters, produce the final quality score.
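The idea of a linear mapping model can be shown with a few lines of least squares. The parameter names and the synthetic data below are hypothetical; only the technique, fitting a linear map from distortion parameters to subjective scores, is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical distortion parameters (e.g. SNR, level, spectral clarity)
# for 50 test conditions, plus a "true" linear rule we pretend raters follow.
params = rng.uniform(0.0, 1.0, size=(50, 3))
true_w = np.array([2.0, 1.0, 0.5])
mos = 1.0 + params @ true_w          # synthetic subjective scores

# Fit the linear mapping model: [params, 1] @ [w, b] ~= mos
A = np.hstack([params, np.ones((50, 1))])
coef, *_ = np.linalg.lstsq(A, mos, rcond=None)
w_hat, b_hat = coef[:3], coef[3]
```

With noise-free synthetic scores the fit recovers the weights exactly; with real rater data it recovers them in the least-squares sense.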


NISQA: non-intrusive speech quality assessment for voice communication networks

CV Jun takes you on a quick review; this algorithm was introduced in a previous article.

A deep network can extract features automatically, so this class of methods can feed the mel spectrogram or MFCC features directly into the model as input. CV Jun has to say, mel features work very well.
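For readers who want to see what the input features look like, here is a minimal numpy sketch of a mel filterbank and mel spectrogram. It uses the HTK-style mel formula; in practice you would use a library such as librosa, and the sizes below are arbitrary:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=512, fs=16000):
    """Triangular filters spaced evenly on the mel scale (HTK formula)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mel_spectrogram(x, n_fft=512, hop=256, n_mels=40, fs=16000):
    w = np.hanning(n_fft)
    n = 1 + max(0, (len(x) - n_fft) // hop)
    frames = np.stack([x[i * hop:i * hop + n_fft] * w for i in range(n)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return power @ mel_filterbank(n_mels, n_fft, fs).T   # (frames, n_mels)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
M = mel_spectrogram(x)
```

The (frames, n_mels) matrix M is exactly the kind of 2-D feature map a CNN-based quality model takes as input.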

As CV Jun's figure above shows, the whole network is quite simple: the MFCC features are fed into a CNN, and a MOS score is obtained at the output.

Details of CNN design are as follows:

Such an automatic quality model can be used to screen and compare different sound systems, for example to decide which TTS model or communication system to use, and it can be run single-ended (no reference) or double-ended (with a reference).

Abstract

Measuring speech quality has been studied for many years. In real applications, the remaining problems can be roughly divided into two parts:

noise

CV Jun: let me say something about noise, because it affects quality a lot.

Device noise: such as single frequency sound, laptop fan sound, etc.

Ambient noise: honking, etc

Signal overflow: clipping, for example from plosives

There are also low-volume issues, such as the device recording at a low level or the speaker talking quietly.
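Two of the problems above, clipping and low volume, can be checked cheaply before any model gets involved. The thresholds below are rules of thumb of my own, not from any standard:

```python
import numpy as np

def audio_health(x, clip_thresh=0.999, low_rms_db=-40.0):
    """Rule-of-thumb checks on a float waveform (full scale = 1.0):
    report clipping (samples pinned near full scale) and low volume
    (overall RMS below low_rms_db dBFS)."""
    x = np.asarray(x, dtype=float)
    clipped = np.mean(np.abs(x) >= clip_thresh)
    rms_db = 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)
    return {"clipped_ratio": float(clipped),
            "rms_dbfs": float(rms_db),
            "too_quiet": bool(rms_db < low_rms_db)}

quiet = 0.001 * np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0)
loud = np.clip(1.5 * np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0), -1, 1)
```

The quiet tone trips the low-volume flag, while the hard-limited tone shows a large share of clipped samples.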

The solution

Here are some suggestions. CV Jun believes these types of noise can each be detected accurately by a dedicated detector and then separated out.

That includes training dedicated detection models for ambient noise and for hardware noise.

CV Jun then introduces a solution for echo:

Echo processing: principle and debugging

This section is about what happens during a phone call. The far-end voice comes out of your loudspeaker, is picked up again by your microphone, and is sent back, so the other party hears their own voice delayed. You may tolerate it, but this is one of the most annoying defects a voice product can have. Echo comes in two kinds: acoustic echo, from speaker-to-microphone coupling, and line echo, which arises from the 2-wire/4-wire hybrid conversion in the telephone network. Echo processing splits into two parts: echo cancellation itself, and some debugging. 1) Echo cancellation is built from an adaptive filter plus an adaptive algorithm; the filter is usually FIR (sometimes IIR), and the adaptive algorithm keeps updating its coefficients. The figure below shows the typical configuration of an adaptive filter.

The figure above shows the input signal, the filter output, and the error signal. The adaptive algorithm that updates the filter is a member of the stochastic-gradient family (LMS and its variants).

2) The echo cancellation process.

Here is a block diagram of the basic principles of echo cancellation:

CV Jun shows the processing flow: (a) the far-end signal is taken as the reference input; (b) the far-end input drives the adaptive FIR filter, which produces an estimate of the echo; subtracting that estimate from the microphone signal gives the error signal E, which is what gets sent back to the far end. On debugging: 1) Echo cancellation is not something you can use well on textbook knowledge alone; if your foundations are solid, the algorithm code will make sense, so read it more than once. It makes more sense every time. 2) Run the algorithm offline first to test it: feed it recorded near-end and far-end files, write the EC output to a file, and listen to the result. If the echo is gone, congratulations, the algorithm works; otherwise something in the algorithm or its parameters needs to change. 3) Get the delay right: the echo is contained in the near-end PCM data at a certain offset relative to the far-end reference, and the canceller must be configured with this delay; if it is wrong, the output is no better than the input. 4) Capture PCM data from the real product, both near-end and far-end, use it as offline input, and check that no echo is audible in the output; only after this check should the algorithm go live on the device. Every piece of hardware is a specific platform: chip vendors have demo boards, and each customer has its own hardware, so the delay and the PCM capture points must be re-tuned per platform. Mobile apps face the same issue: the app configures a per-model delay, and after testing, that handset model uses it. After the steps above, the echo problem is essentially solved.
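The adaptive-filter loop described above can be sketched end to end. Below is a minimal NLMS echo canceller of my own (a stand-in for a production AEC: no double-talk detection, no delay estimation), run on a simulated echo path to show the residual shrinking:

```python
import numpy as np

def nlms_echo_cancel(far, mic, taps=64, mu=0.5, eps=1e-6):
    """NLMS adaptive filter: estimate the echo of `far` inside `mic`
    and return the error (echo-cancelled) signal."""
    w = np.zeros(taps)
    buf = np.zeros(taps)
    err = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far[n]                          # newest far-end sample
        y = w @ buf                              # estimated echo
        e = mic[n] - y                           # what we send to the far end
        w += mu * e * buf / (buf @ buf + eps)    # normalized LMS update
        err[n] = e
    return err

rng = np.random.default_rng(0)
far = rng.standard_normal(8000)              # far-end signal (white here)
echo_path = rng.standard_normal(32) * 0.1    # unknown room/line response
mic = np.convolve(far, echo_path)[:8000]     # microphone hears only echo
e = nlms_echo_cancel(far, mic)
# After convergence the residual echo is far smaller than the raw echo.
ratio = np.mean(e[-2000:] ** 2) / np.mean(mic[-2000:] ** 2)
```

With white far-end input and a filter longer than the echo path, NLMS converges within a few hundred samples; the tail of the error signal is orders of magnitude quieter than the raw echo.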

Finally

Speech enhancement: noise and its evaluation methods

Noise types

Common distortions are:

Additive noise: background sounds picked up by the microphone during recording.

Channel effects: band limitation of a single channel, and the channel's impulse response (convolutional distortion).

Nonlinear distortion: for example clipping caused by improper signal input gain.

Speech enhancement

CV Jun has just introduced the noise categories, so now we can discuss some concrete solutions. Roughly speaking, signal degradation falls into the three categories above:

Besides the desired speech, additive noise degrades both quality and intelligibility. Some additive noise is stationary and some changes over time; an adaptive filter can track such changes, but noise that is correlated with the speech cannot simply be identified and removed. Convolutional distortion, caused for example by the microphone position, the microphone's frequency response, and the codec's bandwidth limits, behaves differently from additive noise. Enhancement is done frame by frame: with w(n) the window function, M the frame shift, and N the window length, a typical setup uses 50% overlap between frames; a Hanning window is common, because too abrupt a window or too large a frame shift degrades quality. A classic technique is spectral subtraction: estimate the noise spectrum during speech pauses and subtract a multiple of it from the magnitude spectrum of each frame. If the subtraction factor is too small, residual noise remains; if it is too large, speech components are removed and the signal is distorted.
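The spectral-subtraction procedure just described can be sketched as follows. This is a bare-bones version of my own: the noise spectrum is estimated from a known noise-only excerpt rather than from detected pauses, and alpha is the over-subtraction factor discussed above:

```python
import numpy as np

def spectral_subtract(noisy, noise_est, n_fft=256, hop=128, alpha=2.0, floor=0.01):
    """Basic magnitude spectral subtraction with 50% overlap-add.
    alpha: over-subtraction factor; floor: spectral floor that limits
    musical noise. noise_est is a noise-only excerpt."""
    w = np.hanning(n_fft)
    # average noise magnitude spectrum from the noise-only excerpt
    nf = 1 + (len(noise_est) - n_fft) // hop
    N = np.mean([np.abs(np.fft.rfft(noise_est[i * hop:i * hop + n_fft] * w))
                 for i in range(nf)], axis=0)
    out = np.zeros(len(noisy))
    nframes = 1 + (len(noisy) - n_fft) // hop
    for i in range(nframes):
        spec = np.fft.rfft(noisy[i * hop:i * hop + n_fft] * w)
        mag = np.abs(spec)
        clean_mag = np.maximum(mag - alpha * N, floor * mag)
        out[i * hop:i * hop + n_fft] += np.fft.irfft(
            clean_mag * np.exp(1j * np.angle(spec)))
    return out

rng = np.random.default_rng(0)
fs = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
noise = 0.3 * rng.standard_normal(fs + 2048)
noisy = tone + noise[2048:]
enhanced = spectral_subtract(noisy, noise[:2048])
# compare residual error against the clean tone, away from the edges
err_before = np.mean((noisy[512:7424] - tone[512:7424]) ** 2)
err_after = np.mean((enhanced[512:7424] - tone[512:7424]) ** 2)
```

On this synthetic tone-plus-white-noise example, the residual error after subtraction drops well below the original noise energy; raising alpha further would suppress more noise at the cost of speech distortion, exactly the trade-off described above.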

conclusion

This article is long, but worthwhile. It summarizes the quality problems of voice transmission and voice codecs, both long-standing ones and recent ones, and it also proposes solutions for several kinds of noise, so that we can handle these problems better.

If you are interested in this article, have a look at another one I wrote on InfoQ about voice network algorithms and related solutions such as noise suppression; space is limited here, so I will integrate them together next time. There are also solution methods based on reinforcement learning and generative adversarial approaches, which are particularly strong; I can analyze those in detail later.