I. Introduction
Vector Quantization (VQ) is a commonly used compression technique. This paper mainly reviews:
1) VQ principle
2) Speaker Recognition based on VQ
Speaker recognition is also a classification problem.
Speaker recognition techniques mainly fall into the following categories:
Template matching methods. This kind of method is relatively mature. Its main steps are feature extraction, template training, and matching. Typical examples are dynamic time warping (DTW) and vector quantization (VQ).
DTW is based on dynamic programming, but it has disadvantages: 1) it relies heavily on voice activity detection (VAD); 2) it does not take full advantage of the temporal dynamics of speech, which helps explain why it was replaced by the hidden Markov model (HMM).
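The dynamic-programming idea behind DTW can be illustrated with a minimal Python sketch. It is not part of the MATLAB demo in this article; the function name `dtw_distance` and the toy sequences are assumptions made for illustration, and real systems align sequences of feature vectors rather than scalars.

```python
def dtw_distance(a, b):
    """Return the accumulated DTW alignment cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical toy data: the same contour spoken twice as slowly, and an unrelated one.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
other = [3.0, 1.0, 3.0, 1.0, 3.0]

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The time-stretched sequence aligns with the template at zero cost, while the unrelated sequence does not, which is the property template matching relies on.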
The VQ algorithm is a data compression method. Codebook construction and codeword search are its two basic problems. Codebook construction trains a good codebook from a large number of signal samples. Codeword search finds the codeword that best matches the input.
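Codeword search can be sketched in a few lines of Python. This is an illustrative sketch, not the article's MATLAB implementation: the helper names, the squared Euclidean distance, and the toy codebooks are all assumptions. Comparing the average distortion of a test utterance against each speaker's codebook is one common way VQ is applied to speaker recognition.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def nearest_codeword(vector, codebook):
    """Return (index, squared distance) of the codeword closest to vector."""
    best = min(range(len(codebook)),
               key=lambda k: squared_distance(vector, codebook[k]))
    return best, squared_distance(vector, codebook[best])


def average_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest codeword."""
    total = sum(nearest_codeword(v, codebook)[1] for v in vectors)
    return total / len(vectors)


# Hypothetical 2-D codebooks for two speakers and a few test feature vectors.
codebook_a = [(0.0, 0.0), (1.0, 1.0)]
codebook_b = [(5.0, 5.0), (6.0, 4.0)]
features = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8)]

index, distance = nearest_codeword(features[1], codebook_a)
print(index, distance)
print(average_distortion(features, codebook_a) <
      average_distortion(features, codebook_b))
```

In this toy setup the test vectors lie close to the codewords of `codebook_a`, so that codebook yields the lower average distortion.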
Classification methods based on statistical models
In essence, this kind of method is still a pattern recognition system, which requires feature extraction, classifier training, and a classification decision. Typical framework:
Commonly used models include: GMM, HMM, SVM, ANN, DNN or a variety of joint models.
GMM basic framework:
Similarly, there is the GMM-UBM (Universal Background Model) algorithm. It differs from GMM in that it trains one large GMM on the samples of all L classes together, whereas GMM trains a separate model for each class. With an SVM, MFCCs are used as features and each frame is one sample; VAD can be used to delete invalid audio segments before the classifier is trained directly. In recent years, methods based on sparse representation have also been used.
2 VQ principle
Vector Quantization is widely used in signal processing and data compression. In fact, there is a VQ step in multimedia compression formats such as JPEG and MPEG-4.
The name Vector Quantization may sound a bit intimidating, but it is actually not that sophisticated. As we all know, analog signals take continuous values, while a computer can only process discrete digital signals. When converting an analog signal into a digital one, we can replace an interval with a single value in that interval; for example, all values in [0, 1) become 0, and all values in [1, 2) become 1. A more formal definition is that VQ is the process of encoding the points of a vector space with a finite subset of them.
A typical example is the encoding of images. In the simplest case, consider a grayscale image where 0 is black, 1 is white, and each pixel value is a real number in [0, 1]. To encode this as a 256-level grayscale image, the easiest way is to map each pixel value x to the integer floor(x * 255). Of course, the raw data space does not have to be continuous. For example, to compress the image and store only 4 bits per pixel (instead of 8 bits), a simple mapping scheme is floor(x * 15 / 255), which encodes the integer values in the original range [0, 255] with integer values in [0, 15].
However, such a mapping scheme is quite naive. Although it reduces the number of gray levels and therefore compresses the image, the resulting quality may be poor if the original values are not evenly distributed. For example, if a 256-level grayscale image contains only the two values 0 and 13, the mapping above produces an all-black image because both values are mapped to 0. A better approach is to use clustering to pick representative points.
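The two uniform mappings described above, and the failure case with the values 0 and 13, can be checked with a short Python sketch. The function names are assumptions chosen for this illustration.

```python
import math


def quantize_unit_interval(x):
    """Map a real gray value x in [0, 1] to an integer level in [0, 255]."""
    return math.floor(x * 255)


def reduce_to_4_bits(value):
    """Map an 8-bit gray value in [0, 255] to a 4-bit level in [0, 15]."""
    return value * 15 // 255  # integer floor of value * 15 / 255


print(quantize_unit_interval(0.5))   # mid-gray
print(reduce_to_4_bits(255))         # white stays at the top level

# An image that uses only the gray values 0 and 13 collapses to a single level.
image = [0, 13, 13, 0, 13]
print([reduce_to_4_bits(v) for v in image])
```

Because 13 * 15 / 255 is below 1, both gray values land on level 0 and the contrast between them is lost.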
In practice, each pixel is treated as a data point: run k-means to obtain K centroids, then replace the value of every pixel in a cluster with the value of that cluster's centroid. The same can be done for color images such as RGB images, where each pixel is treated as a point in a 3-dimensional vector space.
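The clustering approach can be sketched in plain Python as follows. This is a minimal k-means (Lloyd iteration) written for illustration only; the function names and the deterministic initialization are assumptions of the sketch, and it is unrelated to the `vqlbg` routine used in the MATLAB code. Pixels are tuples, so a grayscale pixel is a 1-tuple and an RGB pixel would be a 3-tuple.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two pixels given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Return k centroids for the given points using Lloyd's iteration."""
    distinct = sorted(set(points))
    if len(distinct) < k:
        raise ValueError("need at least k distinct points")
    # Deterministic start (an assumption of this sketch): pick k distinct
    # points spread evenly over the sorted distinct values.
    step = (len(distinct) - 1) / max(k - 1, 1)
    centroids = [distinct[round(i * step)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        updated = []
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                updated.append(tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(dim)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids


def quantize(points, centroids):
    """Replace every point with its nearest centroid."""
    return [min(centroids, key=lambda c: squared_distance(p, c))
            for p in points]


# The two-value image from the text keeps both gray levels with K = 2.
two_level = [(0,)] * 6 + [(13,)] * 4
print(sorted(kmeans(two_level, 2)))

# Hypothetical noisy image with a dark group and a bright group.
grays = [(10,), (12,), (11,), (200,), (202,), (201,)]
codebook = kmeans(grays, 2)
print(quantize(grays, codebook))
```

Unlike the uniform mapping, the centroids adapt to where the pixel values actually lie, so the values 0 and 13 remain distinguishable after quantization.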
II. Source code
% Demo script that generates all graphics in the report and demonstrates our results.
[s6 fs6] = wavread('s6.wav');
[s1 fs1] = wavread('s1.wav');
%Question 2
disp('> Question 2: draw the original speech waveform ');
t = 0:1/fs1:(length(s1) - 1)/fs1;
plot(t, s1), axis([0, (length(s1) - 1)/fs1 -0.4 0.5]);
title('Waveform of original speech S1');
xlabel('time/s');
ylabel('Amplitude')
pause
close all
%Question 3 (linear)
disp('> Question 3: Draw a linear spectrum ');
M = 100; % frame shift (samples between the starts of consecutive frames)
N = 256; % frame length in samples
frames = blockFrames(s1, fs1, M, N); % split the signal into frames
t = N / 2;
tm = length(s1) / fs1;
subplot(121);
imagesc([0 tm], [0 fs1/2], abs(frames(1:t, :)).^2), axis xy;
title('Energy spectrum (M = 100, N = 256)');
xlabel('time/s');
ylabel('frequency/Hz');
colorbar;
%Question 3 (logarithmic)
disp('> Question 3: draw the logarithmic spectrum');
subplot(122);
imagesc([0 tm], [0 fs1/2], 20 * log10(abs(frames(1:t, :)).^2)), axis xy;
title('Log energy spectrum (M = 100, N = 256)');
xlabel('time/s');
ylabel('frequency/Hz');
colorbar;
D=get(gcf,'Position');
set(gcf, 'Position', round([D(1)*0.5 D(2)*0.5 D(3)*2 D(4)*1.3]))
pause
close all
%Question 4
disp('> Question 4: draw a graph with different frame length ');
lN = [128 256 512];
u=220;
for i = 1:length(lN)
N = lN(i);
M = round(N / 3);
frames = blockFrames(s1, fs1, M, N);
t = N / 2;
temp = size(frames);
nbframes = temp(2);
u=u+1;
subplot(u)
imagesc([0 tm], [0 fs1/2], 20 * log10(abs(frames(1:t, :)).^2)), axis xy;
title(sprintf('Log energy spectrum (M = %i, N = %i, frames = %i)', M, N, nbframes));
xlabel('time/s');
ylabel('frequency/Hz');
colorbar
end
D=get(gcf,'Position');
set(gcf, 'Position', round([D(1)*0.5 D(2)*0.5 D(3)*1.5 D(4)*1.5]))
pause
close all
%Question 5
disp('> Question 5: Mel space');
plot(linspace(0, (fs1/2), 129), (melfb(20, 256, fs1))');
title('Mel filter');
xlabel('Frequency/Hz');
pause
close all
%Question 6
disp('> Question 6: ');
M = 100;
N = 256;
frames = blockFrames(s1, fs1, M, N);
n2 = 1 + floor(N / 2);
m = melfb(20, N, fs1);
z = m * abs(frames(1:n2, :)).^2;
t = N / 2;
tm = length(s1) / fs1;
subplot(121)
imagesc([0 tm], [0 fs1/2], abs(frames(1:n2, :)).^2), axis xy;
title('Original energy spectrum');
xlabel('time/s');
ylabel('frequency/Hz');
colorbar;
subplot(122)
imagesc([0 tm], [0 20], z), axis xy;
title('Energy spectrum corrected by MEL cepstrum');
xlabel('time/s');
ylabel('Number of filters');
colorbar;
D=get(gcf,'Position');
set(gcf, 'Position', [0 D(2) D(3)*2 D(4)])
pause
close all
%Question 7
disp('> Question 7: 2D plot of acoustic vectors');
c1 = mfcc(s1, fs1);
c2 = mfcc(s6, fs6); % second speaker: s6.wav is the other file loaded at the top of the script
plot(c1(5, :), c1(6, :), 'or');
hold on;
plot(c2(5, :), c2(6, :), 'xb');
xlabel('5th Dimension');
ylabel('6th Dimension');
legend('Speaker 1', 'Speaker 2');
title('2D plot of acoustic vectors');
pause
close all
%Question 8
disp('> Question 8: plot the trained VQ codebooks')
d1 = vqlbg(c1,16);
d2 = vqlbg(c2,16);
plot(c1(5, :), c1(6, :), 'xr')
hold on
III. Operation results
IV. Notes
MATLAB version: 2014a