Video conferencing should be accessible to everyone, including users who communicate using sign language. However, because most video conferencing applications switch focus windows to whoever is speaking aloud, it is difficult for signers to "get the floor" so they can communicate easily and effectively. Enabling real-time sign language detection in video conferencing is challenging, because applications need to perform classification on the high-volume video feed as input, which makes the task computationally expensive. In part because of these challenges, research on sign language detection is very limited.
In “Real-Time Sign Language Detection using Human Pose Estimation”, presented at SLRTP 2020 and demoed at ECCV 2020, we present a real-time sign language detection model and demonstrate how it can be used to provide video conferencing systems with a mechanism to identify the signing person as the active speaker.
Our model
To provide a real-time working solution for a variety of video conferencing applications, we needed to design a lightweight model that would be easy to plug and play. Previous attempts to integrate models into video conferencing applications on the client side have demonstrated the importance of a lightweight model that consumes few CPU cycles, so as to minimize the impact on call quality. To reduce the input dimensionality, we isolate the information the model needs from the video so that each frame can be classified.
Since sign language involves the user’s body and hands, we first run PoseNet, a pose estimation model. This greatly reduces the input from an entire HD image to a small set of landmarks on the user’s body, including the eyes, nose, shoulders, and hands. We use these landmarks to calculate frame-to-frame optical flow, which quantifies the user’s motion for use by the model without retaining user-specific information. Each pose is normalized by the person’s shoulder width to ensure that the model attends to the signer regardless of their distance from the camera. The optical flow is then normalized by the video’s frame rate before being passed to the model.
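As a rough illustration of this preprocessing step, the sketch below computes the landmark "flow" between consecutive frames and normalizes it by shoulder width and frame rate. The keypoint indices and helper names are assumptions made for illustration, not the actual pipeline code.

```python
# A minimal sketch of the per-frame feature extraction described above.
# The landmark indices (LEFT_SHOULDER, RIGHT_SHOULDER) are illustrative assumptions.
import numpy as np

LEFT_SHOULDER, RIGHT_SHOULDER = 5, 6  # hypothetical indices into the PoseNet keypoints


def landmark_flow(prev_pose: np.ndarray, curr_pose: np.ndarray, fps: float) -> np.ndarray:
    """Compute normalized frame-to-frame landmark flow.

    prev_pose, curr_pose: arrays of shape (num_keypoints, 2) holding (x, y) positions.
    fps: video frame rate, used to normalize the motion magnitude.
    Returns a flat feature vector of per-landmark displacements.
    """
    # Frame-to-frame displacement of each landmark.
    flow = curr_pose - prev_pose

    # Normalize by shoulder width so distance from the camera does not matter.
    shoulder_width = np.linalg.norm(curr_pose[LEFT_SHOULDER] - curr_pose[RIGHT_SHOULDER])
    flow = flow / (shoulder_width + 1e-6)

    # Normalize by frame rate so the features represent motion per second.
    flow = flow * fps

    return flow.flatten()
```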
To test this approach, we used the German Sign Language corpus (DGS), which contains long videos of people signing and includes span annotations indicating in which frames signing took place. As a simple baseline, we trained a linear regression model to predict when a person is signing using the optical flow data. This baseline achieved about 80% accuracy, using only about 3 microseconds (0.000003 seconds) of processing time per frame. By including the optical flow from the previous 50 frames as context, the linear model achieved 83.4% accuracy.
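The following sketch shows how such a per-frame linear baseline with a 50-frame context window could be set up. It uses scikit-learn's LogisticRegression as a stand-in for the linear model, and the data shapes and placeholder arrays are illustrative only.

```python
# A minimal sketch of the linear baseline with a 50-frame context window,
# using per-frame flow features. Placeholder data stands in for the DGS corpus.
import numpy as np
from sklearn.linear_model import LogisticRegression

CONTEXT = 50  # number of previous frames used as context


def with_context(features: np.ndarray, context: int = CONTEXT) -> np.ndarray:
    """Stack each frame's features with those of the preceding `context` frames.

    features: (num_frames, feature_dim) array of per-frame flow features.
    Returns an array of shape (num_frames - context, feature_dim * (context + 1)).
    """
    windows = [features[i - context:i + 1].flatten()
               for i in range(context, len(features))]
    return np.stack(windows)


# Placeholder per-frame flow features and signing / not-signing labels.
features = np.random.rand(1000, 34)
labels = np.random.randint(0, 2, size=1000)

X = with_context(features)
y = labels[CONTEXT:]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```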
To generalize the use of context, we use a long short-term memory (LSTM) architecture, which contains memory of previous timesteps but requires no lookback. Using a single LSTM layer followed by a linear layer, the model achieves 91.5% accuracy with a processing time of 3.5 milliseconds (0.0035 seconds) per frame.
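A minimal Keras sketch of this kind of architecture is shown below: a single LSTM layer followed by a linear layer that outputs a per-frame signing probability. The layer sizes and feature dimension are assumptions, not the published configuration.

```python
# A sketch of the described architecture: one LSTM layer followed by a linear layer.
# Sizes are illustrative assumptions.
import tensorflow as tf

FEATURE_DIM = 34   # per-frame flow feature size (assumed)
SEQ_LEN = None     # variable-length sequences of frames

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, FEATURE_DIM)),
    # The LSTM carries memory of previous timesteps without explicit lookback.
    tf.keras.layers.LSTM(64, return_sequences=True),
    # Linear layer mapping the LSTM state to a per-frame signing probability.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```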
Proof of concept
Once we had a working sign language detection model, we needed a way to use it to trigger the active speaker function in video conferencing applications. We developed a lightweight, real-time sign language detection web demo that connects to various video conferencing applications and can set the user as the “speaker” when they sign. The demo uses PoseNet’s fast human pose estimation and the sign language detection model running in the browser with TF.js, which allows it to work reliably in real time.
When the sign language detection model determines that the user is signing, it delivers an ultrasonic audio tone through a virtual audio cable, which any video conferencing application can pick up as if the signing user were “speaking.” The audio is transmitted at 20 kHz, which is usually beyond the range of human hearing. Because video conferencing applications typically detect the audio “volume” as talking rather than detecting speech itself, this leads the application to believe that the user is speaking.
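As an illustration of this trigger mechanism, the sketch below generates a short 20 kHz tone whenever a detector flag is set. It assumes the sounddevice Python package with a virtual audio cable selected as the output device; the actual web demo implements this in the browser, not with this code.

```python
# A minimal sketch of the "ultrasonic trigger": play a 20 kHz tone when signing
# is detected. Assumes the `sounddevice` package and a virtual audio cable
# configured as the system output device.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 48000  # Hz; high enough to represent a 20 kHz tone
TONE_FREQ = 20000    # Hz, usually beyond the range of human hearing


def play_ultrasonic(duration_s: float = 0.5) -> None:
    """Emit a short 20 kHz sine tone so the conferencing app registers 'speech'."""
    t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
    tone = 0.2 * np.sin(2 * np.pi * TONE_FREQ * t).astype(np.float32)
    sd.play(tone, samplerate=SAMPLE_RATE, blocking=True)


# Example: trigger the tone when the model's per-frame confidence is high.
signing_confidence = 0.95  # placeholder for the detection model's output
if signing_confidence > 0.5:
    play_ultrasonic()
```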
You can try our experimental demonstration right away! By default, this demo acts as a sign language detector. The training code and model, as well as the web demo source code, are available on GitHub.
Demo
In the video below, we demonstrate the model in use. Note the yellow chart in the upper-left corner, which reflects the model’s confidence that the detected activity is indeed sign language. When the user signs, the chart value rises to nearly 100, and when she stops signing, it drops to zero. This happens in real time at 30 frames per second, the maximum frame rate of the camera used.
User feedback
To better understand how the demo works in practice, we conducted a user experience study in which participants were asked to use the experimental demo during a video conference and to communicate via sign language as usual. They were also asked to sign over one another and over speaking participants to test the speaker-switching behavior. Participants responded positively: the sign language was detected and treated as audible speech, and the demo successfully identified the signing participant and triggered the conferencing system’s audio meter icon to draw focus to them.
Conclusion
We believe that everyone should have access to videoconferencing applications and hope that this work is a meaningful step in that direction. We have shown how to leverage our model to make video conferencing more convenient for signers.