The chapter directory

Preface
Principles of speech recognition: signal processing and acoustic feature extraction; recognizing the characters that compose text; acoustic model; language model; vocabulary model
Speech acoustic feature extraction: principles of the MFCC and LogFBank algorithms
Practice 1: ASR speech recognition model, system workflow, HTTP-based API interface, and client
Practice 2: Baidu and iFlytek APIs
Practice 3: offline speech recognition with Vosk

Preface

Principles of Speech Recognition

First come the speech tasks, such as speech recognition and voice wake-up. When you hear these, you probably think of Chinese platforms like iFlytek and Baidu; the two companies account for about 80% of China's voice market and are doing very well. But because their high-precision technology is not open source, other companies have to spend a lot of money to buy their APIs, and speech recognition remains hard to learn on your own (one speech recognition project I trained needed 10 graphics cards running for 20 days), which slows down the development of speech recognition outside the big companies. Chen jun has collected a large number of SOTA principles and hands-on examples from the current field, so let's feast our eyes today!

Voice sampling

When speech input is digitized, the start and end of the utterance must be detected first, followed by noise reduction and filtering (besides the human voice there is plenty of background noise) so that the computer works only on the cleaned-up speech. For further processing, the audio signal then needs to be split into frames. From a microscopic point of view, a person's speech signal is relatively stable over a short period of time, a property called short-term stationarity, which is why the speech signal is processed frame by frame.

A frame usually covers 20~50 ms, and adjacent frames overlap; this redundancy prevents the signal at the two ends of a frame from being weakened, which would hurt recognition accuracy. The next step is key feature extraction. Because recognizing the raw waveform directly does not give good results, characteristic parameters have to be extracted through a frequency-domain transform. The most common approach is to extract MFCC features, turning each frame of waveform into a multidimensional feature vector according to the physiological characteristics of the human ear, so that the whole utterance becomes a vector matrix.

Frame-by-frame vectors are not very intuitive, so you can also represent speech with the spectrogram shown in the figure below, where each column, from left to right, is a 25-millisecond block. It is much easier to find patterns in such data than in the raw sound wave.

However, spectrograms are mainly used for speech research; speech recognition itself still works with the frame-by-frame feature vectors.

Recognising characters that make up text

After feature extraction come feature recognition and character generation. This part of the job is to find the current phoneme in each frame, then build words from several phonemes, and then build text from the words. The hardest part, of course, is finding the phoneme for each frame, because a single frame is shorter than a phoneme and only several frames together make up one phoneme; if the beginning is wrong, it is hard to correct later. How do we decide which phoneme a frame belongs to? The simplest way is probability: pick the phoneme with the highest probability. But what if several phonemes are equally probable for a frame? That can certainly happen. Everyone has a different accent, speaking rate and intonation, and even people can find it hard to tell whether you said "Hello" or something that merely sounds like it. Yet the text output of speech recognition must be unique, and people cannot step in to correct errors. At this point it is statistical decisions that turn multiple phonemes into words, and words into text.

This gives us three possible transcriptions: "hello", "hula" and "olo". Based on word probabilities, we find that "hello" is the most likely, so we output the text "hello". The example above shows how probability decides everything from frames to phonemes and from phonemes to words. But where do these probabilities come from? Could we count every phoneme, word and sentence humans have spoken over thousands of years and compute the probabilities from that? Obviously not. So what do we do? That is where the models come in:

Acoustic model

CV jun believes you must have heard of the acoustic model. Based on the basic states and probabilities of speech, we try to collect speech corpora covering different speakers, ages, genders, accents and speaking rates, as well as quiet, noisy and far-field recording conditions, and use them to train the acoustic model. For better results, different languages and dialects use different acoustic models, which improves accuracy and reduces computation.

Language model

The language model is then trained on a large amount of text to learn the probabilities of words and sentences. If the model contains only the two sentences "Today Monday" and "Tomorrow Tuesday", we can recognize only those two sentences. To recognize more sentences we just need to cover enough corpus, but the model grows and so does the computation. Therefore, in practical applications the models are usually restricted to an application domain, such as smart home, navigation, smart speakers, personal assistants or medical care, which reduces the amount of computation and improves accuracy.

Vocabulary model

Finally there is the vocabulary model, which is used very often in practice: it supplements the language model with a pronunciation dictionary and annotations for different pronunciations, covering for example place names, person names, song titles, trending words and domain-specific terms, and it is updated regularly. There are also many simplified but effective computational methods, such as the HMM (hidden Markov model). The hidden Markov model rests on two assumptions: first, an internal state transition depends only on the previous state; second, an output value depends only on the current state (or the current state transition). This simplifies the problem: the probability of each word in a sentence depends only on the previous word, so the computation is greatly reduced.
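To make this concrete, here is a toy sketch (not from the original article; the corpus and all numbers are purely illustrative) of estimating bigram probabilities from a tiny corpus and scoring a candidate sentence under the Markov assumption:

from collections import Counter

# Toy corpus; a real system would use a large, domain-specific text collection.
corpus = [
    ["today", "is", "monday"],
    ["tomorrow", "is", "tuesday"],
    ["today", "is", "tuesday"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    # P(word | prev) estimated by simple relative frequency (no smoothing).
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(words):
    # Markov assumption: each word depends only on the previous word.
    p = unigrams[words[0]] / sum(unigrams.values())
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["today", "is", "monday"]))   # plausible under this corpus
print(sentence_prob(["monday", "is", "today"]))   # scores 0.0 under this corpus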

Finally, speech is recognized as text.

Speech acoustic feature extraction: principles of the MFCC and LogFBank algorithms

In almost every automatic speech recognition system, the first step is to extract features from the speech signal. Extracting the relevant features helps to identify the linguistically relevant information and to discard irrelevant information such as background noise and emotion.

1 MFCC

As CV jun just mentioned, MFCC is a real classic. MFCC stands for Mel-frequency cepstral coefficients, and it has been one of the most commonly used speech feature extraction algorithms for decades. The algorithm is based on a linear transform of the log energy spectrum computed on the nonlinear Mel scale of sound frequency.

1.1 Framing

Since the raw WAV audio files stored on a computer's hard disk have variable length, we first need to cut them into fixed-length pieces called frames. Because the speech signal changes quickly, each frame usually lasts 10-30 ms, so that a frame contains enough signal periods while the signal does not change too drastically within it; the Fourier transform, which suits stationary signals, can then be applied. Since digital audio can have different sampling rates, the dimension of the per-frame vector also differs.
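As an illustration, here is a minimal NumPy sketch of the framing step; the 25 ms frame length, 10 ms hop and 16 kHz sampling rate are assumed example values, not settings taken from the project code:

import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Cut a 1-D waveform into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # samples between frame starts
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len]
        for i in range(num_frames)
    ])
    return frames  # shape: (num_frames, frame_len)

# Example: one second of a 440 Hz tone at 16 kHz -> 98 frames of 400 samples each.
t = np.arange(16000) / 16000.0
frames = frame_signal(np.sin(2 * np.pi * 440 * t))
print(frames.shape)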

1.2 Pre-emphasis

Since the sound signal produced at the human glottis is attenuated by about 12 dB per octave and the signal radiated from the lips by a further 6 dB per octave, there is very little high-frequency content left after the fast Fourier transform. The main purpose of pre-emphasizing the speech signal is therefore to boost the high-frequency part of each frame and improve the resolution of the high-frequency components.

1.3 Windowing

In the framing step, the continuous speech signal is cut directly into segments, and this truncation causes spectral leakage. The purpose of windowing is to remove the discontinuities of the short-time signal at the two ends of each frame. In the MFCC algorithm the window function is usually a Hamming window, a rectangular window or a Hanning window. Note that pre-emphasis must be performed before windowing.
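Below is a minimal sketch of pre-emphasis followed by Hamming windowing on already-framed data; the 0.97 pre-emphasis coefficient and the random stand-in frames are assumptions made only for illustration:

import numpy as np

def preemphasis(x, coeff=0.97):
    # y[n] = x[n] - coeff * x[n-1]; boosts the high-frequency part of each frame.
    return np.append(x[0], x[1:] - coeff * x[:-1])

frame_len = 400
frames = np.random.randn(98, frame_len).astype(np.float32)  # stand-in for the framed signal

emphasized = np.stack([preemphasis(f) for f in frames])   # 1.2 pre-emphasis
windowed = emphasized * np.hamming(frame_len)             # 1.3 windowing (Hamming)
print(windowed.shape)  # (98, 400)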

1.4 Fast Fourier Transform

After all the processing above we still have a time-domain signal, but little speech information can be read directly in the time domain. To continue feature extraction, the time-domain signal of each frame must be converted into its frequency-domain representation. For speech stored in a computer we need the discrete Fourier transform, and because the ordinary discrete Fourier transform is computationally expensive, it is usually implemented with the fast Fourier transform (FFT). Since the MFCC algorithm has already split the signal into frames and each frame is a short time-domain signal, this step is also called the short-time fast Fourier transform.

1.5 Computing the amplitude spectrum (taking the modulus of the complex numbers)

After the fast Fourier transform, the speech feature is a matrix of complex numbers, i.e. an energy spectrum. Because the phase spectrum contains very little useful information, we generally discard the phase spectrum and keep the amplitude spectrum.

There are generally two ways to do this: take the absolute value of each complex number, or take its squared magnitude.
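A short sketch of the short-time FFT and of taking the magnitude/power spectrum, continuing from windowed frames; the 512-point FFT length and the random stand-in frames are assumed example values:

import numpy as np

NFFT = 512
windowed = np.random.randn(98, 400).astype(np.float32)  # stand-in for the windowed frames

spectrum = np.fft.rfft(windowed, n=NFFT, axis=1)    # complex spectrum, shape (98, 257)
magnitude = np.abs(spectrum)                        # discard the phase, keep the amplitude
power = (magnitude ** 2) / NFFT                     # power spectrum fed to the Mel filterbank
print(magnitude.shape, power.shape)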

1.6 Mel filter

Mel filtering is one of the key steps of MFCC. The Mel filterbank consists of 20 triangular band-pass filters, which map the linear frequency axis onto the nonlinearly distributed Mel frequency scale.
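The Mel scale itself is just a nonlinear frequency mapping; the sketch below uses the commonly quoted conversion formula (an assumption, since the article does not give the constants) and shows how 20 filter centre frequencies spaced evenly in Mel become increasingly far apart in Hz:

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centre frequencies of 20 triangular filters between 0 Hz and 8 kHz:
# evenly spaced on the Mel scale, hence increasingly wide in Hz.
n_filters, f_max = 20, 8000.0
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(f_max), n_filters + 2)
print(np.round(mel_to_hz(mel_points)))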

2 LogFBank

The LogFBank (log filterbank) feature extraction algorithm is similar to MFCC; in fact, MFCC is computed on top of the LogFBank features. The main difference between LogFBank and MFCC is whether the discrete cosine transform (DCT) is applied at the end.

With the emergence of DNNs and CNNs, and the development of deep learning in general, neural networks can exploit the correlations within FBank and LogFBank features themselves to improve final recognition accuracy and reduce the WER, so the DCT step can be omitted.
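In practice both features are usually obtained with a library call. A minimal sketch using the third-party python_speech_features package (an assumption; the project in this article uses its own feature functions such as GetFrequencyFeature3 from general_function) makes the DCT difference visible:

from scipy.io import wavfile
from python_speech_features import mfcc, logfbank   # pip install python_speech_features

sample_rate, signal = wavfile.read("16k.wav")        # hypothetical 16 kHz mono file

mfcc_feat = mfcc(signal, samplerate=sample_rate, numcep=13)   # filterbank + log + DCT
fbank_feat = logfbank(signal, samplerate=sample_rate)         # filterbank + log, no DCT

print(mfcc_feat.shape)   # (num_frames, 13)
print(fbank_feat.shape)  # (num_frames, 26) with the default filter count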

SOTA principle + practice 1: deep convolutional neural network speech recognition

In recent years, deep learning has risen rapidly in the field of artificial intelligence and has had a profound impact on speech recognition; deep neural networks have gradually replaced the original HMM (hidden Markov model) approach. In human communication and knowledge dissemination, about 70 percent of information comes from speech. In the future, speech recognition will inevitably become an important part of intelligent life: it provides the necessary foundation for voice assistants and voice input, and will become a new way of human-computer interaction. So we need machines that can understand the human voice.

The acoustic model of this speech recognition system uses a deep convolutional neural network whose input is the spectrogram directly. The model structure borrows from VGG, one of the best network configurations in image recognition. Such a network is highly expressive, can see long history and future context, and is more robust than an RNN. At the output end the model combines perfectly with the CTC scheme, so the whole model can be trained end to end, transcribing the sound waveform directly into a sequence of Mandarin Chinese pinyin. The language model then converts the pinyin sequence into Chinese text using a maximum-entropy hidden Markov model. In addition, the system is designed to serve all users over the network. Feature extraction converts an ordinary WAV speech signal into the two-dimensional spectrogram image signal required by the neural network through framing and windowing operations.

The output of the acoustic model decoded by CTC often contains long runs of repeated symbols. We therefore need to merge consecutive identical symbols into one and then remove the blank (silence) separator markers, finally obtaining the decoded sequence of pinyin symbols.

The language model is a statistical language model that converts the pinyin into the final recognized text and outputs it. In essence, pinyin-to-text conversion is modeled as a hidden Markov chain, which achieves high accuracy. The code is parsed in depth below, piece by piece~

Import Keras and the related packages.

import platform as plat
import os
import time
import random

import keras as kr
import numpy as np
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Input, Reshape, BatchNormalization  # , Flatten
from keras.layers import Lambda, TimeDistributed, Activation, Conv2D, MaxPooling2D, GRU  # , Merge
from keras.layers.merge import add, concatenate
from keras import backend as K
from keras.optimizers import SGD, Adadelta, Adam

from general_function.file_wav import *
from general_function.file_dict import *
from general_function.gen_func import *
from general_function.muti_gpu import *
from readdata24 import DataSpeech

The default pinyin output size of the acoustic model is 1428, that is, 1427 pinyin syllables + 1 blank block.

NUM_GPU = 2

class ModelSpeech():
    def __init__(self, datapath):
        '''
        Initialization. The default pinyin output size is 1428,
        i.e. 1427 pinyin syllables + 1 blank block.
        '''
        MS_OUTPUT_SIZE = 1428
        self.MS_OUTPUT_SIZE = MS_OUTPUT_SIZE  # dimension of each character vector that the network finally outputs
        #self.BATCH_SIZE = BATCH_SIZE
        self.label_max_string_length = 64
        self.AUDIO_LENGTH = 1600
        self.AUDIO_FEATURE_LENGTH = 200

        self._model, self.base_model = self.CreateModel()

Convert the path separator for the current operating system.

        self.datapath = datapath
        self.slash = ''
        system_type = plat.system()  # determine the operating system
        if(system_type == 'Windows'):
            self.slash = '\\'  # backslash
        elif(system_type == 'Linux'):
            self.slash = '/'   # forward slash
        else:
            print('*[Message] Unknown System\n')
            self.slash = '/'   # forward slash
        if(self.slash != self.datapath[-1]):  # append a slash to the end of the directory path if missing
            self.datapath = self.datapath + self.slash

Define the CNN/LSTM/CTC model using the Keras functional API, designing the input layer, hidden layers and output layer.

    def CreateModel(self):
        '''
        Define the CNN/LSTM/CTC model, using the Keras functional API.
        Input layer: sequence of 200-dimensional feature vectors; the maximum length of one
            piece of speech data is set to 1600 (about 16 s).
        Hidden layers: convolution and pooling layers (3x3 kernels, pooling window size 2)
            plus fully connected layers.
        Output layer: fully connected layer with self.MS_OUTPUT_SIZE neurons and softmax activation.
        CTC layer: uses the CTC loss as the loss function.
        '''
        input_data = Input(name='the_input', shape=(self.AUDIO_LENGTH, self.AUDIO_FEATURE_LENGTH, 1))

        layer_h1 = Conv2D(32, (3, 3), use_bias=False, activation='relu', padding='same', kernel_initializer='he_normal')(input_data)  # convolution layer
        #layer_h1 = Dropout(0.05)(layer_h1)
        layer_h2 = Conv2D(32, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h1)  # convolution layer
        layer_h3 = MaxPooling2D(pool_size=2, strides=None, padding="valid")(layer_h2)  # pooling layer
        #layer_h3 = Dropout(0.05)(layer_h3)  # randomly drop part of the connections to prevent overfitting

        layer_h4 = Conv2D(64, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h3)  # convolution layer
        #layer_h4 = Dropout(0.1)(layer_h4)
        layer_h5 = Conv2D(64, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h4)  # convolution layer
        layer_h6 = MaxPooling2D(pool_size=2, strides=None, padding="valid")(layer_h5)  # pooling layer

        #layer_h6 = Dropout(0.1)(layer_h6)
        layer_h7 = Conv2D(128, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h6)  # convolution layer
        #layer_h7 = Dropout(0.15)(layer_h7)
        layer_h8 = Conv2D(128, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h7)  # convolution layer
        layer_h9 = MaxPooling2D(pool_size=2, strides=None, padding="valid")(layer_h8)  # pooling layer

        #layer_h9 = Dropout(0.15)(layer_h9)
        layer_h10 = Conv2D(128, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h9)  # convolution layer
        #layer_h10 = Dropout(0.2)(layer_h10)
        layer_h11 = Conv2D(128, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h10)  # convolution layer
        layer_h12 = MaxPooling2D(pool_size=1, strides=None, padding="valid")(layer_h11)  # pooling layer

        #layer_h12 = Dropout(0.2)(layer_h12)
        layer_h13 = Conv2D(128, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h12)  # convolution layer
        #layer_h13 = Dropout(0.3)(layer_h13)
        layer_h14 = Conv2D(128, (3, 3), use_bias=True, activation='relu', padding='same', kernel_initializer='he_normal')(layer_h13)  # convolution layer
        layer_h15 = MaxPooling2D(pool_size=1, strides=None, padding="valid")(layer_h14)  # pooling layer

        #test = Model(inputs=input_data, outputs=layer_h12)
        #test.summary()

        # 1600/8 = 200 time steps, each with (200/8) * 128 = 3200 features
        layer_h16 = Reshape((200, 3200))(layer_h15)
        #layer_h16 = Dropout(0.3)(layer_h16)  # randomly drop part of the connections to prevent overfitting
        layer_h17 = Dense(128, activation="relu", use_bias=True, kernel_initializer='he_normal')(layer_h16)  # fully connected layer

        #layer_h5 = LSTM(256, activation='relu', use_bias=True, return_sequences=True)(layer_h4)  # LSTM layer
        rnn_size = 128
        inner = layer_h17  # feed the dense features into the bidirectional GRU stack
        gru_1 = GRU(rnn_size, return_sequences=True, kernel_initializer='he_normal', name='gru1')(inner)
        gru_1b = GRU(rnn_size, return_sequences=True, go_backwards=True, kernel_initializer='he_normal', name='gru1_b')(inner)
        gru1_merged = add([gru_1, gru_1b])
        gru_2 = GRU(rnn_size, return_sequences=True, kernel_initializer='he_normal', name='gru2')(gru1_merged)
        gru_2b = GRU(rnn_size, return_sequences=True, go_backwards=True, kernel_initializer='he_normal', name='gru2_b')(gru1_merged)
        gru2 = concatenate([gru_2, gru_2b])

        layer_h20 = gru2
        #layer_h20 = Dropout(0.4)(gru2)
        layer_h21 = Dense(128, activation="relu", use_bias=True, kernel_initializer='he_normal')(layer_h20)  # fully connected layer
        #layer_h17 = Dropout(0.3)(layer_h17)
        layer_h22 = Dense(self.MS_OUTPUT_SIZE, use_bias=True, kernel_initializer='he_normal')(layer_h21)  # fully connected layer

        y_pred = Activation('softmax', name='Activation0')(layer_h22)
        model_data = Model(inputs=input_data, outputs=y_pred)
        #model_data.summary()

        labels = Input(name='the_labels', shape=[self.label_max_string_length], dtype='float32')
        input_length = Input(name='input_length', shape=[1], dtype='int64')
        label_length = Input(name='label_length', shape=[1], dtype='int64')

        # Keras doesn't currently support loss funcs with extra parameters,
        # so CTC loss is implemented in a lambda layer
        #layer_out = Lambda(ctc_lambda_func, output_shape=(self.MS_OUTPUT_SIZE, ), name='ctc')([y_pred, labels, input_length, label_length])  #(layer_h6)
        # CTC
        loss_out = Lambda(self.ctc_lambda_func, output_shape=(1,), name='ctc')([y_pred, labels, input_length, label_length])

Model loading mode

        model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out)
        model.summary()

        # clipnorm seems to speed up convergence
        #sgd = SGD(lr=0.0001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5)
        #ada_d = Adadelta(lr=0.01, rho=0.95, epsilon=1e-06)
        opt = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, decay=0.0, epsilon=10e-8)
        #model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=sgd)

        model.build((self.AUDIO_LENGTH, self.AUDIO_FEATURE_LENGTH, 1))
        model = ParallelModel(model, NUM_GPU)
        model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=opt)

Define CTC decoding

        # captures output of softmax so we can decode the output during visualization
        test_func = K.function([input_data], [y_pred])

        #print('[*Tip] Model created successfully, model compiled successfully')
        print('[*Info] Create Model Successful, Compiles Model Successful. ')
        return model, model_data

    def ctc_lambda_func(self, args):
        y_pred, labels, input_length, label_length = args
        y_pred = y_pred[:, :, :]
        #y_pred = y_pred[:, 2:, :]
        return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

Define training model and training parameters

    def TrainModel(self, datapath, epoch=2, save_step=1000, batch_size=32,
                   filename=abspath + 'model_speech/m' + ModelName + '/speech_model' + ModelName):
        '''
        Train the model.
        Parameters:
            datapath: path to the data
            epoch: number of iterations over the training set
            save_step: save the model every save_step steps
            filename: default file name prefix for saved models
        '''
        data = DataSpeech(datapath, 'train')
        num_data = data.GetDataNum()  # number of training samples
        yielddatas = data.data_genetator(batch_size, self.AUDIO_LENGTH)

        for epoch in range(epoch):  # iterate over epochs
            print('[running] train epoch %d .' % epoch)
            n_step = 0  # number of save_step-sized chunks already trained in this epoch
            while True:
                try:
                    print('[message] epoch %d . Have train datas %d+' % (epoch, n_step * save_step))
                    # data_genetator is a generator function
                    #self._model.fit_generator(yielddatas, save_step, nb_worker=2)
                    self._model.fit_generator(yielddatas, save_step)
                    n_step += 1
                except StopIteration:
                    print('[error] generator error. please check data format.')
                    break

                self.SaveModel(comment='_e_' + str(epoch) + '_step_' + str(n_step * save_step))
                self.TestModel(self.datapath, str_dataset='train', data_count=4)
                self.TestModel(self.datapath, str_dataset='dev', data_count=4)

    def LoadModel(self, filename=abspath + 'model_speech/m' + ModelName + '/speech_model' + ModelName + '.model'):
        '''
        Load the model parameters.
        '''
        self._model.load_weights(filename)
        self.base_model.load_weights(filename + '.base')

    def SaveModel(self, filename=abspath + 'model_speech/m' + ModelName + '/speech_model' + ModelName, comment=''):
        '''
        Save the model parameters.
        '''
        self._model.save_weights(filename + comment + '.model')
        self.base_model.save_weights(filename + comment + '.model.base')
        f = open('step' + ModelName + '.txt', 'w')
        f.write(filename + comment)
        f.close()

    def TestModel(self, datapath='', str_dataset='dev', data_count=32, out_report=False, show_ratio=True):
        '''
        Test the recognition effect of the model.
        '''
        data = DataSpeech(self.datapath, str_dataset)
        #data.LoadDataList(str_dataset)
        num_data = data.GetDataNum()  # number of samples in the test set
        if(data_count <= 0 or data_count > num_data):
            # if data_count is non-positive or larger than the test set, test on all data
            data_count = num_data

        try:
            ran_num = random.randint(0, num_data - 1)  # pick a random starting index

            words_num = 0
            word_error_num = 0

            nowtime = time.strftime('%Y%m%d_%H%M%S', time.localtime(time.time()))
            if(out_report == True):
                txt_obj = open('Test_Report_' + str_dataset + '_' + nowtime + '.txt', 'w', encoding='UTF-8')  # open the report file

            txt = ''
            for i in range(data_count):
                data_input, data_labels = data.GetData((ran_num + i) % num_data)  # read one test sample

                # if the sample is too long, use the next wav file instead
                num_bias = 0
                while(data_input.shape[0] > self.AUDIO_LENGTH):
                    print('*[Error]', 'wave data lenghth of num', (ran_num + i) % num_data,
                          'is too long.', '\n A Exception raise when test Speech Model.')
                    num_bias += 1
                    data_input, data_labels = data.GetData((ran_num + i + num_bias) % num_data)

                pre = self.Predict(data_input, data_input.shape[0] // 8)

                words_n = data_labels.shape[0]       # number of words in the label
                words_num += words_n                 # accumulate the total word count
                edit_distance = GetEditDistance(data_labels, pre)  # edit distance between label and prediction
                if(edit_distance <= words_n):        # normal case: edit distance no larger than the word count
                    word_error_num += edit_distance  # accumulate the word errors
                else:
                    word_error_num += words_n        # otherwise count all words of this sample as errors

                if(i % 10 == 0 and show_ratio == True):
                    print('Test Count: ', i, '/', data_count)

                txt = ''
                if(out_report == True):
                    txt += str(i) + '\n'
                    txt += 'True:\t' + str(data_labels) + '\n'
                    txt += 'Pred:\t' + str(pre) + '\n'
                    txt += '\n'
                    txt_obj.write(txt)

Define prediction functions and return prediction results.

            #print('*[Test result] ' + str_dataset + ' set word error ratio: ', word_error_num / words_num * 100, '%')
            print('*[Test Result] Speech Recognition ' + str_dataset + ' set word error ratio: ',
                  word_error_num / words_num * 100, '%')
            if(out_report == True):
                txt = '*[Test Result] Speech Recognition ' + str_dataset + ' set word error ratio: ' + str(word_error_num / words_num * 100) + ' %'
                txt_obj.write(txt)
                txt_obj.close()

        except StopIteration:
            print('[Error] Model Test Error. please check data format.')

    def Predict(self, data_input, input_len):
        '''
        Prediction: returns the list of recognized pinyin symbols.
        '''
        batch_size = 1
        in_len = np.zeros((batch_size), dtype=np.int32)
        in_len[0] = input_len

        x_in = np.zeros((batch_size, 1600, self.AUDIO_FEATURE_LENGTH, 1), dtype=np.float)
        for i in range(batch_size):
            x_in[i, 0:len(data_input)] = data_input
        base_pred = self.base_model.predict(x=x_in)
        #print('base_pred:\n', base_pred)

        #y_p = base_pred
        #for j in range(200):
        #    mean = np.sum(y_p[0][j]) / y_p[0][j].shape[0]
        #    print('max y_p:', np.max(y_p[0][j]), 'min y_p:', np.min(y_p[0][j]), 'mean y_p:', mean, 'mid y_p:', y_p[0][j][100])
        #    print('argmin:', np.argmin(y_p[0][j]), 'argmax:', np.argmax(y_p[0][j]))
        #    count = 0
        #    for i in range(y_p[0][j].shape[0]):
        #        if(y_p[0][j][i] < mean):
        #            count += 1
        #    print('count:', count)

        base_pred = base_pred[:, :, :]
        #base_pred = base_pred[:, 2:, :]

        r = K.ctc_decode(base_pred, in_len, greedy=True, beam_width=100, top_paths=1)
        #print('r', r)
        r1 = K.get_value(r[0][0])
        #print('r1', r1)
        #r2 = K.get_value(r[1])
        #print(r2)
        r1 = r1[0]
        return r1
        pass

    def RecognizeSpeech(self, wavsignal, fs):
        '''
        The function ultimately used for speech recognition:
        recognize a piece of speech given as a wav waveform sequence.
        '''
        #data = DataSpeech('E:\\speech dataset')
        #data.LoadDataList('dev')
        #data_input = GetMfccFeature(wavsignal, fs)
        #t0 = time.time()
        data_input = GetFrequencyFeature3(wavsignal, fs)
        #t1 = time.time()
        #print('time cost:', t1 - t0)

        input_length = len(data_input)
        input_length = input_length // 8

        data_input = np.array(data_input, dtype=np.float)
        #print(data_input, data_input.shape)
        data_input = data_input.reshape(data_input.shape[0], data_input.shape[1], 1)
        #t2 = time.time()
        r1 = self.Predict(data_input, input_length)
        #t3 = time.time()
        #print('time cost:', t3 - t2)
        list_symbol_dic = GetSymbolList(self.datapath)  # load the pinyin symbol list

Finally, the function used to perform speech recognition on a WAV speech file.

        r_str = []
        for i in r1:
            r_str.append(list_symbol_dic[i])

        return r_str
        pass

    def RecognizeSpeech_FromFile(self, filename):
        '''
        The function ultimately used for speech recognition:
        recognize the speech stored in the wav file given by filename.
        '''
        # read_wav_data is assumed to be the wav-reading helper provided by general_function.file_wav
        wavsignal, fs = read_wav_data(filename)
        r = self.RecognizeSpeech(wavsignal, fs)
        return r
        pass
    @property
    def model(self):
        '''
        Return the Keras model.
        '''
        return self._model


if(__name__ == '__main__'):

The main function starts

    datapath = abspath + ''
    modelpath = abspath + 'model_speech'

    if(not os.path.exists(modelpath)):  # check whether the directory for saving models exists
        os.makedirs(modelpath)          # if not, create it to avoid errors when saving

    system_type = plat.system()  # determine the operating system
    if(system_type == 'Windows'):
        datapath = 'E:\\speech dataset'
        modelpath = modelpath + '\\'
    elif(system_type == 'Linux'):
        datapath = abspath + 'dataset'
        modelpath = modelpath + '/'
    else:
        print('*[Message] Unknown System\n')
        datapath = 'dataset'
        modelpath = modelpath + '/'

    ms = ModelSpeech(datapath)

Principle + practice 2: Baidu and iFlytek speech recognition

End-to-end deep learning can be used to recognize either English or Mandarin Chinese, two very different languages. Because it replaces pipelines of hand-designed components with neural networks, end-to-end learning allows us to handle a wide variety of speech, including noisy environments, accents and different languages. The key to this approach is the use of HPC techniques, so that experiments that used to take weeks can now be run in days, which lets us iterate more quickly to find superior architectures and algorithms. Finally, using a technique called Batch Dispatch with GPUs in the data center, the system can be deployed online at low cost and serve a large number of users with low latency.

End-to-end speech recognition is an active research area and has been used, with convincing results, to rescore the output of DNN-HMM systems. The RNN encoder-decoder uses an encoder RNN to map the input to a fixed-length vector and a decoder network to map that vector to a sequence of output predictions. An RNN encoder-decoder with attention performs well at predicting phonemes. Combining the CTC loss function with an RNN to model the temporal information achieves good results in end-to-end speech recognition with character outputs. CTC-RNN models also predict phonemes well, although a dictionary is still needed in that case.

Data is also key to the success of end-to-end speech recognition systems. Hannun et al. used more than 7,000 hours of labeled speech. Data augmentation is very effective for improving the performance of deep learning in areas such as computer vision and speech recognition, and existing speech systems can also be used to bootstrap the collection of new data. Drawing inspiration from these earlier methods, Baidu's system bootstraps larger datasets and uses data augmentation to increase the amount of effectively labeled data.

The demonstration below recognizes the content of an audio file. Note: the token below is one I applied for myself; you are advised to apply for your own token by following my article.

import requests
import os
import base64
import json

apiUrl = 'http://vop.baidu.com/server_api'
filename = "16k.pcm"  # local speech file to recognize

size = os.path.getsize(filename)  # size of the local speech file, in bytes
file1 = open(filename, "rb")
text = base64.b64encode(file1.read()).decode("utf-8")  # base64-encode the audio content

data = {
    "format": "pcm",    # audio format
    "rate": 16000,      # sampling rate; assumed to be 16000 Hz to match 16k.pcm
    "dev_pid": 1536,    # Mandarin
    "channel": 1,       # number of channels, fixed value 1
    "token": "24.0c828682d414bf79b08f89c4c7dcd83a.2592000.1562739150.282335-16470175",  # access token for authentication
    "cuid": "Dc-85-de-f9-08-59",  # unique device identifier
    "len": size,        # length of the original audio in bytes
    "speech": text,     # base64-encoded audio data
}

try:
    r = requests.post(apiUrl, data=json.dumps(data)).json()
    print(r)
    print(r.get("result")[0])
except Exception as e:
    print(e)

iFlytek works in the same way; see the tutorial on its official website.

Practice 3: offline speech recognition with Vosk

CV jun has already introduced many principles and SOTA algorithms today, so due to space constraints I will share just one last item; if you want to know more, you are welcome to keep following this series.

Vosk supports more than 30 languages and performs well for offline speech recognition: github.com/alphacep/vo…

For Android, you need to install the Android package, then download the build tool Gradle and build the project with it.

After a successful build, an APK installation package is generated, which can be installed on a phone and used completely offline.

/**
 * Adds listener.
 */
public void addListener(RecognitionListener listener) {
    synchronized (listeners) {
        listeners.add(listener);
    }
}

/**
 * Removes listener.
 */
public void removeListener(RecognitionListener listener) {
    synchronized (listeners) {
        listeners.remove(listener);
    }
}

/**
 * Starts recognition. Does nothing if recognition is active.
 *
 * @return true if recognition was actually started
 */
public boolean startListening() {
    if (null != recognizerThread)
        return false;
    recognizerThread = new RecognizerThread();
    recognizerThread.start();
    return true;
}
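Besides the Android service code above, Vosk also provides a Python binding. Below is a minimal offline-recognition sketch, assuming the vosk package has been installed with pip, a model directory has been downloaded and unpacked next to the script as "model", and a 16 kHz mono WAV file named test.wav is available:

import json
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("test.wav", "rb")        # 16 kHz, 16-bit, mono PCM
model = Model("model")                  # unpacked Vosk model directory
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)            # feed audio chunk by chunk

print(json.loads(rec.FinalResult())["text"])  # final transcription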

The practice here is relatively simple. I have made many optimizations and support deployment in Android, Python, C++, Java and other languages; you are welcome to consult CV jun.

Intelligent voice interaction diagram

Conclusion

I have covered a lot today; thanks for reading. This article focuses mainly on tricks, interactivity and fun. Since the hands-on part deals with speech algorithms, I could not show you many flashy demos today.

Later on, I can walk you step by step through other areas of speech:

One: wake-word algorithms, models and SOTA for assistants such as Siri and Xiao Ai;

Two: speaker identification (speaker discrimination) ideas and SOTA;

Three: ideas and SOTA for multilingual, low-resource and difficult languages;

Four: SOTA solutions from the various speech competitions.