Wow, Python can do real-time translation

Welcome, and thanks for following along. This post continues the promise I made earlier: to finish the series of demo articles below, roughly one a month.

| No. | Planned date | Demo name, features & article content | Finished? | Article links |
| --- | --- | --- | --- | --- |
| 1 | September 3 | Text translation demo: single-text translation and batch translation | Completed | CSDN / WeChat Official Account |
| 2 | September 11 | OCR demo: batch upload and recognition; in the demo you can select different OCR types (handwriting / print / ID card / form / whole question / business card), then call the platform's capabilities, with the specific implementation steps | Completed | CSDN |
| 3 | October 27 | Speech recognition demo: upload a video, extract its audio, and run short-audio speech recognition | | CSDN |
| 4 | September 17 | Intelligent voice evaluation demo | | |
| 5 | September 24 | Essay correction demo | | |
| 6 | September 30 | Speech synthesis demo | | |
| 7 | October 15 | Single-question photo search demo | | |
| 8 | October 20 | Picture translation demo | | |

Recently, amid much anticipation, a certain fruit-branded phone maker held a press conference that did not announce the long-awaited new phone, but did unveil a few other products, including the latest version 14 of the fruit system. A few days later, people who had updated discovered a very interesting feature in the new system: translation, like this:

Strange translation knowledge has increased!

Compared with ordinary translation tools, a simultaneous-interpretation tool has more practical value. Just imagine communicating without barriers with friends whose languages you don't speak: that would be a beautiful thing, so why not implement such a tool as a backup? For a simultaneous translation tool, the logic is to recognize the speech first and then translate it, and recognition accuracy is the key factor in whether the translation succeeds. To keep the difficulty down, I decided to split the tool's development into two sessions. Let's tackle the speech recognition part first.

This demo continues to call the Youdao Zhiyun API, this time to implement real-time speech recognition.

Take a look at the interface and results:

The interface supports a variety of languages, but here I offer just four common ones:

I tested Chinese, Korean, and English separately, and the results look good.

Here, the recognition results are produced in real time, word by word, as the audio plays. Because recognition is fairly fast, there is hardly any perceptible delay.

1. Preparation

First of all, you need to create an instance and an application on your Youdao Zhiyun personal page, bind the application and the instance together, and obtain the application ID and key used to call the interface. For the details of individual registration and application creation, see the development process described in my earlier article on batch file translation.
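In the snippets that follow, the application ID and key are assumed to live in two module-level variables matching the names used later in recognise(); the values here are placeholders, of course:

app_key = 'your-app-id'        # application ID from the Youdao Zhiyun console
app_secret = 'your-app-key'    # application key bound to that ID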

The following describes the specific development process.

The first step is to analyze the interface's inputs and outputs against the real-time speech recognition documentation. The interface is designed to recognize a continuous audio stream in real time, convert it into text, and return the corresponding text stream, so communication runs over WebSocket and a call is divided into two stages: authentication and real-time communication.

During the authentication phase, you need to send the following parameters:

| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| appKey | String | Yes | ID of the application you created | ID |
| salt | String | Yes | UUID | UUID |
| curtime | String | Yes | Timestamp (in seconds) | TimeStamp |
| sign | String | Yes | Encrypted digital signature | sha256 |
| signType | String | Yes | Digital signature type | v4 |
| langType | String | Yes | Language selection; see the list of supported languages | zh-CHS |
| format | String | Yes | Audio format; WAV is supported | wav |
| channel | String | Yes | Number of channels; 1 (mono) is supported | 1 |
| version | String | Yes | API version | v1 |
| rate | String | Yes | Sampling rate | 16000 |

The signature sign is generated as follows: with signType=v4, sign = sha256(application ID + salt + curtime + application key).
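As a minimal sketch (assuming the app_key and app_secret variables from the preparation step), the signature can be computed with Python's hashlib; this encrypt() helper is the same one the recognise() method calls later:

import hashlib
import time
import uuid

def encrypt(sign_str):
    # Hex-encoded sha256 digest of the concatenated string.
    return hashlib.sha256(sign_str.encode('utf-8')).hexdigest()

salt = str(uuid.uuid1())           # a fresh UUID
curtime = str(int(time.time()))    # current timestamp in seconds
sign = encrypt(app_key + salt + curtime + app_secret)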

After authentication, the call enters the real-time communication stage: send the audio stream, receive the recognition results, and finally end the communication. One thing to note here: the audio you send should ideally be a clear WAV file with 16-bit depth, mono, at a 16 kHz sampling rate. When I started development, my recording device had problems and the audio quality was poor, so the interface kept returning error code 304 (facepalm).
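A quick sanity check of a recording with Python's built-in wave module can save that kind of debugging; this helper is my own addition, not part of the API:

import wave

def is_valid_for_asr(path):
    # The interface wants mono, 16-bit (2-byte) samples at a 16 kHz rate.
    with wave.open(path, 'rb') as wf:
        return (wf.getnchannels() == 1 and
                wf.getsampwidth() == 2 and
                wf.getframerate() == 16000)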

2. Development

This demo is developed with Python 3 and consists of maindow.py, audioandprocess.py, and recobynetease.py. The interface part uses Python's built-in Tkinter library and offers language selection, recording start, recording stop, and recognition. audioandprocess.py implements the recording and audio-processing logic, and finally calls the real-time speech recognition API through the methods in recobynetease.py.

1. Interface

Main elements:

root = tk.Tk()
root.title("netease youdao translation test")
frm = tk.Frame(root)
frm.grid(padx='80', pady='80')

label = tk.Label(frm, text='Select language type:')
label.grid(row=0, column=0)
combox = ttk.Combobox(frm, textvariable=tk.StringVar(), width=38)
combox["value"] = lang_type_dict
combox.current(0)
combox.bind("<<ComboboxSelected>>", get_lang_type)
combox.grid(row=0, column=1)

btn_start_rec = tk.Button(frm, text='Start recording', command=start_rec)
btn_start_rec.grid(row=2, column=0)
lb_Status = tk.Label(frm, text='Ready', anchor='w', fg='green')
lb_Status.grid(row=2, column=1)

btn_sure = tk.Button(frm, text='End and recognize', command=get_result)
btn_sure.grid(row=3, column=0)
root.mainloop()

After a language type is selected, recording begins; when recording ends, the interface calls the get_result() method to run recognition.

def get_result():
    lb_Status['text']='Ready'
    sr_result=au_model.stop_and_recognise()

2. Development of audio recording

The audio recording part uses the PyAudio library (installed via pip) to access the audio device and record the WAV files required by the interface, and uses the wave library to save the audio files.

(1) Construction of the Audio_model class:

    def __init__(self, audio_path, language_type, is_recording):
        self.audio_path = audio_path                        # directory for recordings
        self.audio_file_name = ''                           # name of the current recording
        self.language_type = language_type                  # index of the selected language
        self.language_dict = ["zh-CHS", "en", "ja", "ko"]   # languages offered in the demo
        self.language = ''
        self.is_recording = is_recording                    # recording-state flag
        self.audio_chunk_size = 1600                        # frames per buffer
        self.audio_channels = 1                             # mono, as the API requires
        self.audio_format = pyaudio.paInt16                 # 16-bit samples
        self.audio_rate = 16000                             # 16 kHz sampling rate
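For reference, a minimal sketch of how the main window might construct the model; the storage path and initial values here are illustrative:

# Store recordings under ./audio/, default to the first language
# (zh-CHS), and start in the not-recording state.
au_model = Audio_model('./audio/', 0, False)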

(2) Development of the record() method

The record() method implements the recording logic: it calls the PyAudio library, reads the audio stream, and writes it to a file.

    def record(self, file_name):
        # Open an input stream on the default recording device.
        p = pyaudio.PyAudio()
        stream = p.open(
            format=self.audio_format,
            channels=self.audio_channels,
            rate=self.audio_rate,
            input=True,
            frames_per_buffer=self.audio_chunk_size
        )
        # The WAV file is written with the parameters the API expects.
        wf = wave.open(file_name, 'wb')
        wf.setnchannels(self.audio_channels)
        wf.setsampwidth(p.get_sample_size(self.audio_format))
        wf.setframerate(self.audio_rate)

        # Keep reading from the stream until the flag is cleared.
        while self.is_recording:
            data = stream.read(self.audio_chunk_size)
            wf.writeframes(data)
        wf.close()
        stream.stop_stream()
        stream.close()
        p.terminate()
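Note that record() blocks in its while loop until is_recording is cleared, so it must not run on the Tkinter main thread. The post doesn't show start_rec(), but a minimal sketch under that assumption might be (the file name is illustrative):

import threading

def start_rec():
    # Set the flag first, then run the blocking record loop off the UI thread.
    au_model.is_recording = True
    au_model.audio_file_name = 'record.wav'
    threading.Thread(target=au_model.record,
                     args=(au_model.audio_file_name,),
                     daemon=True).start()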

(3) Development of the stop_and_recognise() method

The stop_and_recognise() method sets the Audio_model's recording flag to False and then calls the method that invokes the Youdao Zhiyun API.

    def stop_and_recognise(self):
        self.is_recording=False
        recognise(self.audio_file_name,self.language_dict[self.language_type])

3. Development of real-time speech recognition

The real-time speech recognition interface of Youdao Zhiyun communicates over WebSocket. To simplify the presentation logic, a separate window is developed here to display the recognition results, again using Tkinter:

root = tk.Tk()
root.title("result")
frm = tk.Frame(root)
frm.grid(padx='80', pady='80')
text_result = tk.Text(frm, width='40', height='20')
text_result.grid(row=0, column=1)

The recognise() method concatenates the required parameters onto the URI according to the interface documentation and passes the URI to the start() method, which requests the interface:

def recognise(filepath, language_type):
    print('l:' + language_type)
    global file_path
    file_path = filepath
    nonce = str(uuid.uuid1())
    curtime = str(int(time.time()))
    signStr = app_key + nonce + curtime + app_secret
    print(signStr)
    sign = encrypt(signStr)
    uri = "wss://openapi.youdao.com/stream_asropenapi?appKey=" + app_key + "&salt=" + nonce + \
          "&curtime=" + curtime + "&sign=" + sign + \
          "&version=v1&channel=1&format=wav&signType=v4&rate=16000&langType=" + language_type
    print(uri)
    start(uri)

The start() method is the core of the real-time recognition part. It invokes the recognition interface through WebSocket, using the websocket-client library (installed via pip):

def start(uri):
    websocket.enableTrace(True)
    # Register the callbacks that handle the session's messages and events.
    ws = websocket.WebSocketApp(uri,
                                on_message=on_message,
                                on_error=on_error,
                                on_close=on_close)
    ws.on_open = on_open
    ws.run_forever()
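Two names used by the callbacks never appear in the post: the result_arr list that collects sentences and the on_error handler. Minimal stubs, under my own assumptions:

# Sentences collected by on_message below.
result_arr = []

def on_error(ws, error):
    # Surface any transport or protocol errors during the session.
    print(error)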

When requesting the interface, the previously recorded audio file is first read and sent:

def on_open(ws):
    count = 0
    file_object = open(file_path, 'rb')
    # Stream the recorded file in 1600-byte chunks.
    while True:
        chunk_data = file_object.read(1600)
        if not chunk_data:
            break
        ws.send(chunk_data, websocket.ABNF.OPCODE_BINARY)
        time.sleep(0.05)
        count = count + 1
    print(count)
    # Tell the server that the audio stream has ended.
    ws.send('{"end": "true"}', websocket.ABNF.OPCODE_BINARY)

Then, during the communication, the messages returned by the interface are processed and the recognition results are collected:

def on_message(ws, message):
    result = json.loads(message)
    resultmessage = result.get('result')  # absent in pure status messages

    if resultmessage:
        # Each element carries the recognized sentence under ['st']['sentence'].
        resultmessage1 = resultmessage[0]
        resultmessage2 = resultmessage1["st"]['sentence']
        print(resultmessage2)

        result_arr.append(resultmessage2)

Finally, after the communication ends, the recognition results are displayed:

def on_close(ws):
    print_result(result_arr)
    print("### closed ###")

def print_result(arr):
    # Clear the text box, then show the collected sentences line by line.
    text_result.delete('1.0', tk.END)
    for n in arr:
        text_result.insert("insert", n + '\n')

The interfaces provided by Youdao Zhiyun worked as expected. Most of the effort in this development was spent on recognition failures caused by the poor quality of my own recordings; once the audio quality was acceptable, the recognition results were accurate. The next step is the translation part. With the Youdao Zhiyun API, real-time translation really can be this simple!