This article is reprinted on the WebRTC Chinese website. Its author, Gustavo Garcia, works at Epic Games and has over 20 years of experience in RTC server development.
In real-time communication services there are many applications of speech recognition, such as live captioning, real-time translation, voice commands, or storing and summarizing audio conversations.
Hangouts Meet launched speech recognition for live captions a few months ago, and recently the option was promoted to the main user interface.
What interests me most is the recognition technology itself, especially how to integrate DeepSpeech into an RTC media server to provide a cost-effective solution. But recognition technology is not the subject of this article; instead, I want to spend some time looking at how live captioning is implemented in Hangouts Meet from the signaling perspective.
There are at least three possible architectures for adding speech recognition to an advanced RTC service:
A) Speech recognition in the device: this is the cheapest solution for the service provider, but not all devices support it, the quality of on-device models is not as good as cloud models, and it consumes CPU, which is a problem for less capable devices.
B) Recognition in a separate server: this is inefficient from the network-transport point of view, because the client has to send the same audio twice at the same time (see the sketch after this list). That makes it expensive, but it requires no changes in the media server.
C) Recognition from the media server: this is the most efficient solution from the client and network perspective, but it requires some changes in the media server, so the cost for the service provider is higher.
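To make option (B) concrete, here is a minimal sketch of why the audio effectively goes out twice. The recognition endpoint (wss://stt.example.com/stream) and the message format are assumptions for illustration only and have nothing to do with Hangouts Meet: the microphone track is sent to the media server over WebRTC as usual, while a cloned copy is recorded and streamed to a separate recognition service.

```js
// Sketch of architecture (B): the same microphone audio is sent both to the
// media server (WebRTC) and to a hypothetical recognition service (WebSocket).
async function startDualAudio(peerConnection) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const [micTrack] = stream.getAudioTracks();

  // 1) Normal RTC path: the track goes to the media server.
  peerConnection.addTrack(micTrack, stream);

  // 2) Second path: record a cloned track and push the chunks to the
  //    (made-up) recognition server, so the audio is effectively sent twice.
  const sttSocket = new WebSocket('wss://stt.example.com/stream');
  const recorder = new MediaRecorder(new MediaStream([micTrack.clone()]),
                                     { mimeType: 'audio/webm;codecs=opus' });
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0 && sttSocket.readyState === WebSocket.OPEN) {
      sttSocket.send(event.data);
    }
  };
  sttSocket.onopen = () => recorder.start(250); // ship ~250 ms audio chunks
  sttSocket.onmessage = (event) => console.log('transcript:', event.data);
}
```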
Given that Google has its own speech recognition technology, cost is not their primary concern, so for Hangouts Meet the most reasonable choice is probably option (C). With that assumption, I tried to figure out how the transcribed text is transferred between the media server and the browser.
The first thing I did was to look at the minified code and the HTTP requests sent from the Hangouts Meet page during a session, but I didn't find anything. I searched for strings like "STT", "transcription" or "recognition" without luck. I also ran a very simple Web Speech snippet in the browser console while connected to Hangouts, and the results of that on-device recognition were different from the captions in the Hangouts page (the recognition in the page was better).
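The Web Speech test was roughly along these lines; this is a reconstruction of the kind of snippet you can paste into the Chrome console, not the exact code I ran:

```js
// Minimal Web Speech API test to compare its output with the page's captions.
// Chrome exposes the API as webkitSpeechRecognition.
const recognition = new webkitSpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;
recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    console.log(event.results[i][0].transcript,
                event.results[i].isFinal ? '(final)' : '(interim)');
  }
};
recognition.start();
```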
Next I decided to take a look at chrome://webrtc-internals and, voilà, there were DataChannels, and I was receiving lots of messages while I was talking. Those messages had to be the speech recognition content.
Then I looked at the webrtc-internals events and found that this DataChannel is not open by default when you first join Hangouts Meet; a dedicated DataChannel is created the first time you activate captions.
To figure out exactly which options that DataChannel is created with, and the format of the messages used to send the transcriptions, I replaced the RTCPeerConnection createDataChannel API with my own version, so I could intercept those calls with a simple snippet in the browser console.
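The interception can be done with something along these lines (a sketch, not necessarily the exact snippet; paste it before enabling captions so the patched version is the one the page calls):

```js
// Monkey-patch createDataChannel to log the label and options of every
// channel the page creates, plus the messages received on it.
const origCreateDataChannel = RTCPeerConnection.prototype.createDataChannel;
RTCPeerConnection.prototype.createDataChannel = function (label, options) {
  console.log('createDataChannel:', label, JSON.stringify(options));
  const channel = origCreateDataChannel.apply(this, arguments);
  channel.addEventListener('message', (event) => {
    console.log('message on', label, event.data);
  });
  return channel;
};
```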
That DataChannel is created specifically for the caption transcriptions, with maxRetransmits = 10. The payload is binary data, probably in Protobuf format, but we can still convert it to a string and see a couple of fields with a user ID and the transcribed text.
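For example, the binary payload can be dumped as readable text from the message handler above. The field layout is only a guess (it looks like Protobuf), so this simply decodes the bytes as UTF-8 and replaces the non-printable framing bytes:

```js
// Sketch: make the binary caption payload readable enough to spot the
// user ID and transcript strings embedded in it.
function dumpCaptionPayload(data) {
  const bytes = new Uint8Array(data);
  const text = new TextDecoder('utf-8', { fatal: false }).decode(bytes);
  console.log(text.replace(/[^\x20-\x7E]+/g, ' | ')); // keep printable ASCII
}

// Hook it into the channel intercepted earlier:
// channel.binaryType = 'arraybuffer';
// channel.addEventListener('message', (e) => dumpCaptionPayload(e.data));
```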