This is the first day of my participation in the First Challenge 2022. For details: First Challenge 2022

Preface

In real production environments, using machines to recognize objects is already commonplace: machines recognize fruit, license plates, and faces, and high recognition accuracy has brought great convenience to everyday life. However, all of the algorithms above are image-recognition algorithms, which boil down to extracting features from a sufficiently large dataset and training a neural network to obtain a robust model. So how does a machine recognize sound?

A brief introduction to voiceprint recognition

Voiceprint recognition is also known as speaker recognition: a technology that identifies people by their voice. Its main applications are speaker recognition, speaker verification, and speaker segmentation and clustering.

  • Speaker recognition: as the name suggests, identify which person in the audio library the input audio belongs to (1:N), i.e. "who are you?"
  • Speaker verification: verify whether the input audio and the claimed speaker are the same person (1:1).
  • Speaker segmentation and clustering: given a recording that contains multiple speakers, group together the audio that belongs to the same speaker; mainly used in scenarios such as doctor-patient conversations.

What does sound look like in a machine? How does a machine convert sound?

In fact, sounds exist in the machine as files (duh), and different files have different encoding formats, such as WAV files:

So how does a machine convert sound? In a machine, a sound is actually a signal, and the read function can be used to retrieve the contents of the signal, as in Python:

```python
import soundfile

# audio: the signal samples; sample: the sampling rate (Hz)
audio, sample = soundfile.read(file_path)
```

Here audio is a list of numbers, each number representing an amplitude, and sample is the sampling rate; the length of the list divided by the sampling rate gives the duration of the speech in seconds. Sampling frequency, bit depth, and so on are not covered in detail here.

What are the characteristics of sound?

Sound has many characteristics, such as the speaker's timbre, the content, the speaking rate, and so on. Just as feature extraction lets us focus on the information of interest in an image, the same holds for sound: in voiceprint recognition, the first step is usually to extract features from the sound signal. The main sound features are FBank and MFCC; here we briefly introduce FBank. Let's take a look at this code:

```python
import numpy
import torch
import torchaudio
import soundfile as sf

full_path = "/disc1/xxx/1 microphone.wav"
audio, sr = sf.read(full_path)
audio = torch.FloatTensor(numpy.stack([audio], axis=0))  # add a batch dimension
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, win_length=400, hop_length=160,
    f_min=20, f_max=7600, window_fn=torch.hamming_window, n_mels=80)
# ------------------------------------------------------------------
fbank_feature = mel(audio).log()  # yes, it's that simple
```
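As a sanity check, the shape of the FBank feature can be predicted from those parameters alone. This small sketch assumes the transform's default center padding, under which the number of frames is `num_samples // hop_length + 1`:

```python
# Parameters matching the MelSpectrogram call above
sample_rate, hop_length, n_mels = 16000, 160, 80

# For one second of audio at 16 kHz:
num_samples = sample_rate * 1

# With center padding, each hop produces one frame, plus one extra
n_frames = num_samples // hop_length + 1
print((n_mels, n_frames))  # (80, 101)
```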

Visualize the extracted information:

How does it feel? Doesn't it look just like an image?

That feeling is right. In fact, machines process sound on the same principle as they process images. The difference is that the audio file must first be converted into this special kind of "image"; only then can the usual neural-network operations such as convolution and pooling be applied to it.

Embedding

As with most tasks, if you want to recognize a person's voice, the key is to extract an embedding for that particular speaker, and the quality of the embedding determines recognition accuracy. The common approach is to attach a classification layer to the model and train it until it performs well; the classification layer is then removed, and the output of the remaining model is the embedding we need.
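The train-then-strip idea can be sketched in PyTorch. This is a minimal toy network, not any particular published architecture: an encoder maps FBank features to an embedding, and a classification head is used only during training and bypassed at inference.

```python
import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    """Toy speaker model: encoder -> embedding -> (training-only) classifier."""

    def __init__(self, n_mels=80, emb_dim=192, n_speakers=1000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over the time axis
            nn.Flatten(),
            nn.Linear(256, emb_dim),
        )
        # Trained with a speaker-classification loss, dropped at inference
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x, return_embedding=False):
        emb = self.encoder(x)
        return emb if return_embedding else self.classifier(emb)

net = SpeakerNet()
fbank = torch.randn(4, 80, 101)            # (batch, n_mels, frames)
emb = net(fbank, return_embedding=True)    # what we keep after training
print(emb.shape)  # torch.Size([4, 192])
```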

Speaker recognition

Now that the all-important embedding has been obtained, how do we recognize the speaker? The most common method is cosine similarity: the more similar two embeddings are, the more likely they belong to the same person, and vice versa. By comparing the input embedding with every embedding in the voiceprint library, the user is identified as the most similar enrolled speaker. Finally, a minimum threshold is set so that audio matching no one can be rejected.
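The 1:N identification step above can be sketched with NumPy. The embeddings and the names "alice" and "bob" are made up for illustration; real embeddings would come from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy voiceprint library (hypothetical enrolled embeddings)
enrolled = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 0.8, 0.6]}
query = [0.85, 0.2, 0.05]

# Compare against every enrolled embedding; keep the most similar
scores = {name: cosine_similarity(query, emb) for name, emb in enrolled.items()}
best = max(scores, key=scores.get)

# Minimum threshold: reject audio that matches no one well enough
threshold = 0.7
print(best if scores[best] >= threshold else "unknown")  # alice
```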

Speaker verification

Speaker verification is 1:1, so we only need to compare the similarity between a given speaker's embedding and that of the audio to be verified, setting a larger threshold for a strict task and a smaller threshold for an ordinary one.
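A minimal 1:1 verification sketch, using the same cosine similarity; the threshold value here is an arbitrary placeholder, not a recommended setting.

```python
import numpy as np

def verify(enrolled_emb, test_emb, threshold=0.75):
    """Return True if the two embeddings likely come from the same speaker.
    Raise the threshold for strict tasks, lower it for ordinary ones."""
    a = np.asarray(enrolled_emb, dtype=float)
    b = np.asarray(test_emb, dtype=float)
    score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return score >= threshold

print(verify([1.0, 0.0], [0.9, 0.1]))  # near-identical direction -> True
print(verify([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> False
```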

Speaker segmentation and clustering

Speaker segmentation and clustering is more complicated. A recording usually contains multiple speakers, so VAD (voice activity detection) is performed first; the detected speech is then cut into equal-length segments and each segment is encoded. Speaker change points can be found by comparing the similarity between adjacent segments, or all the segment embeddings can be clustered directly.
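The change-point idea can be illustrated with toy segment embeddings (made up here; in practice each would be the encoding of one speech segment after VAD): adjacent segments with low similarity mark a speaker change.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy segment embeddings: the first three from one speaker, the rest from another
chunks = [np.array(v, dtype=float) for v in
          [[1, 0], [0.95, 0.1], [0.9, 0.2], [0.1, 1], [0.0, 0.9], [0.2, 1]]]

# Declare a change point wherever adjacent segments are dissimilar
change_points = [i + 1 for i in range(len(chunks) - 1)
                 if cos(chunks[i], chunks[i + 1]) < 0.8]
print(change_points)  # [3]: the speaker changes before the 4th segment
```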

The flow of a speaker recognition system is as follows: