The popularity of the metaverse has filled people with fantasies about the future of virtual worlds. Previously, we showed how Agora's 3D spatial audio technology can faithfully reproduce real-world auditory experiences in virtual worlds and deepen players' immersion. Today we leave the metaverse and return to the real world to talk about Agora's own lip-synchronization technology, Agora Lipsync: how, without turning on a camera and without any facial capture technology, simply uploading a face image allows the speaker's voice signal to drive the mouth movements of that static face.

Before introducing Agora Lipsync technology, let’s take a brief look at two similar technologies in the industry:

  • Oculus Lipsync: Oculus Lipsync is a Unity integration for syncing the lip movements of virtual characters to speech. It analyzes audio input offline or in real time and predicts a set of visemes (visual mouth shapes) used to animate the lips of virtual characters or non-player characters (NPCs). To improve the accuracy of audio-driven facial animation, Oculus Lipsync uses a neural network model to learn the mapping between speech and phonemes: the input audio is converted into phonemes by the model, each phoneme corresponds to a specific viseme, and the lip and facial poses of the virtual character are then rendered through the Unity integration. This technology is mainly used for virtual anchors and games. (A minimal sketch of this phoneme-to-viseme lookup follows the list below.)

  • Facial expression capture: Many conferences and events today use holographic projection. A guest off stage wears specific hardware, and their body movements and the mouth movements of their speech are synchronized in real time to a virtual avatar on the big screen on stage. Achieving lip synchronization this way requires facial expression capture technology and the associated hardware.
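To make the phoneme-to-viseme idea behind engines like Oculus Lipsync concrete, here is a minimal Python sketch. The phoneme labels, viseme names, and helper function are purely illustrative assumptions, not Oculus Lipsync's actual API.

```python
# Illustrative sketch of the phoneme-to-viseme idea: audio is decoded into
# phonemes, each phoneme is looked up in a fixed viseme table, and the
# resulting viseme weights drive the avatar's mouth blend shapes.
# Table entries and names below are hypothetical.

PHONEME_TO_VISEME = {
    "AA": "viseme_aa",   # open mouth, as in "father"
    "IY": "viseme_ih",   # spread lips, as in "see"
    "UW": "viseme_ou",   # rounded lips, as in "you"
    "PP": "viseme_pp",   # closed lips, as in "put"
    "FF": "viseme_ff",   # lower lip against teeth, as in "fine"
    "SIL": "viseme_sil", # silence, neutral mouth
}

def visemes_for_frame(phonemes_with_weights):
    """Convert per-frame phoneme probabilities into viseme blend-shape weights."""
    weights = {}
    for phoneme, prob in phonemes_with_weights:
        viseme = PHONEME_TO_VISEME.get(phoneme, "viseme_sil")
        weights[viseme] = weights.get(viseme, 0.0) + prob
    return weights

# Example: a frame where the recognizer is fairly sure the speaker says an "aa" sound.
print(visemes_for_frame([("AA", 0.8), ("SIL", 0.2)]))
```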

In contrast to these two technologies, Agora's Agora Lipsync has a core difference: it requires neither a camera nor facial expression capture. Instead, Agora Lipsync uses generative adversarial networks, a deep learning technique, to intelligently associate speech in Chinese, English, or other languages with mouth movements and facial expressions. The driven portrait mimics the mouth shapes of a real speaker, and both 2D portrait images and 3D portrait models are supported.

Next, we’ll focus on the technology behind Agora Lipsync’s speech-driven mouth movements.

Generative adversarial networks + model lightweighting: driving portrait mouth movements from the voice signal

Voice-driven mouth movement technology, as the name implies, uses the speaker's voice signal to drive the mouth movements of a static face, so that the mouth states of the generated face closely match the speaker's voice. Realizing real-time, voice-driven talking-face generation requires overcoming many challenges. First, we need to find the correspondence between voice information and face information. Phonemes are the smallest articulation units of human speech, and we can find a corresponding mouth shape for each phoneme, but more than one mouth-shape state can produce the same phoneme. In addition, different people have different facial features and speaking styles, so this is a complicated one-to-many problem. There are further challenges, including whether the speaker's face becomes distorted and whether the mouth shape changes smoothly over time. Moreover, if the technology is used in low-latency real-time interactive scenarios, computational complexity and related issues must also be considered.

■ Figure 1: For example, for the phoneme A, there is no unique mouth shape

The traditional Lipsync approach combines speech processing with face modeling, but the number of mouth shapes it can drive from speech is often limited. Agora's Agora Lipsync, by contrast, generates talking-face images from speech in real time with a deep learning algorithm. The performance of deep learning algorithms keeps improving as the scale of data grows; by designing neural networks, features can be extracted from data automatically, reducing the work of hand-crafting feature extractors for each problem. Deep learning is already making waves in many fields such as computer vision and natural language processing.

In the task of voice-driven face image generation, we need to map a one-dimensional speech signal onto the two-dimensional pixel space of an image. Agora uses a generative adversarial network (GAN), a deep learning model whose idea comes from zero-sum game theory and which consists of two parts. One is the Generator, which receives random noise or other signals and generates target images. The other is the Discriminator, which judges whether an image is "real": its input is an image and its output is the probability that the image is real. The goal of the generator is to fool the discriminator by producing near-realistic images, and the goal of the discriminator is to tell the generator's fake images apart from real ones. The generator wants its fake images to look so realistic that the discriminator assigns them a high probability of being real, while the discriminator wants to assign fake images a low probability. Through this dynamic game, the two eventually reach a Nash equilibrium. There are many examples of such confrontation in nature: during biological evolution, prey gradually evolve traits that deceive predators, and predators in turn adjust how they recognize prey, co-evolving with them.
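To make the generator/discriminator roles concrete, here is a minimal, generic GAN sketch in PyTorch. It only illustrates the concept described above and is not Agora's model; the layer sizes, image resolution, and latent dimension are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a random latent vector to a flattened fake image."""
    def __init__(self, latent_dim=100, img_dim=64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),   # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Scores an image with the probability that it is real."""
    def __init__(self, img_dim=64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, img):
        return self.net(img)

G, D = Generator(), Discriminator()
z = torch.randn(8, 100)            # a batch of random noise
fake = G(z)                        # generator produces fake images
p_real = D(fake)                   # discriminator scores them
print(fake.shape, p_real.shape)    # torch.Size([8, 4096]) torch.Size([8, 1])
```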

After the GAN-based deep neural network has been trained, the generator can transform the input signals into realistic images. For the voice-driven image task, Agora designed a deep learning model trained on large-scale video corpus data, so that the model can generate a speaker's face from the input speech. Inside the model, features are extracted from the two different input modalities, voice and image, yielding an image latent vector and a voice latent vector; the model then learns the implicit mapping between these two cross-modal latent vectors and uses this relationship to reconstruct, from the fused latent features, a talking-face image that matches the original audio. Beyond the realism of the generated image, temporal stability and audio-visual consistency must also be considered, so we designed different loss functions to constrain the training. The whole model inference process runs end to end.
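The following PyTorch sketch illustrates, under assumed shapes and layer choices, the cross-modal structure just described: an audio encoder produces a voice latent vector, an image encoder produces an identity latent vector from the static portrait, and a decoder fuses the two to reconstruct a mouth-synced frame. It is a simplified stand-in, not Agora's actual network.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a short mel-spectrogram window into a voice latent vector."""
    def __init__(self, n_mels=80, frames=16, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * frames, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, mel):                        # mel: (B, n_mels, frames)
        return self.net(mel)

class ImageEncoder(nn.Module):
    """Encodes the static reference portrait into an identity latent vector."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, dim),
        )

    def forward(self, img):                        # img: (B, 3, 64, 64)
        return self.net(img)

class FaceDecoder(nn.Module):
    """Fuses voice and identity latents and reconstructs a talking-face frame."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),   # 32 -> 64
        )

    def forward(self, audio_z, id_z):
        x = self.fc(torch.cat([audio_z, id_z], dim=1)).view(-1, 64, 16, 16)
        return self.net(x)                         # (B, 3, 64, 64) synthesized frame

mel = torch.randn(2, 80, 16)                       # a short speech window
ref = torch.randn(2, 3, 64, 64)                    # the static reference portrait
frame = FaceDecoder()(AudioEncoder()(mel), ImageEncoder()(ref))
print(frame.shape)                                 # torch.Size([2, 3, 64, 64])
```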

Agora Lipsync also works across languages, including Chinese, Japanese, German, and English, and across people of different skin colors, serving users in different countries and regions.

From Figure 2 below, we can see more intuitively how the generative adversarial network learns to generate the speaker's face end to end.

Figure 2 can be divided into four processes:

1. The Generator in the deep learning model receives a face image and a short segment of speech, and produces a Fake Image through feature extraction and processing inside the Generator.

2. "Real Data" in the figure refers to the video sequences used for training, from which the target image matching the Audio is extracted. The difference between this target Image and the Fake Image produced by the Generator is measured, and the Generator's parameters are updated by backpropagation according to the loss function, so that the Generator learns to produce more realistic Fake Images.

3. When comparing the differences, the target Image from Real Data and the Fake Image are both fed into the Discriminator, so that the Discriminator learns to distinguish real from fake.

4. Throughout training, the Generator and the Discriminator compete with and learn from each other until their performance reaches a balanced state. The resulting Generator produces images whose mouth shapes closely resemble those of a real human face. (A minimal sketch of one such training step follows the figure caption below.)

■ Figure 2: How the generative adversarial network generates the corresponding face image
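The sketch below mirrors the four steps above in a single illustrative training iteration: the generator synthesizes a fake frame from the reference image and audio, the discriminator is updated to separate real frames from fake ones, and the generator is updated to fool the discriminator while staying close to the target frame through an L1 reconstruction term. The loss weighting and the assumed G(audio, ref_img) interface are hypothetical, not Agora's actual recipe.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, ref_img, audio, real_frame, lambda_rec=10.0):
    """One illustrative GAN training step for a voice-driven talking face.

    G is assumed to map (audio, ref_img) to a synthesized frame (e.g. a wrapper
    around the encoder/decoder sketch above); D scores frames as real/fake.
    """
    # Step 1: the generator produces a fake talking-face frame from image + audio.
    fake_frame = G(audio, ref_img)

    # Steps 2-3: discriminator update, learning to score real high and fake low.
    opt_d.zero_grad()
    real_score = D(real_frame)
    fake_score = D(fake_frame.detach())
    loss_d = F.binary_cross_entropy(real_score, torch.ones_like(real_score)) + \
             F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score))
    loss_d.backward()
    opt_d.step()

    # Step 4: generator update, fooling the discriminator while matching the target.
    opt_g.zero_grad()
    adv_score = D(fake_frame)
    loss_g = F.binary_cross_entropy(adv_score, torch.ones_like(adv_score)) + \
             lambda_rec * F.l1_loss(fake_frame, real_frame)
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```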

A deep learning model can generate the speaker's face end to end, but it often carries a large computational cost and parameter count. Given storage and power constraints, running such an algorithm in real time on limited resources remains challenging. Commonly used model-lightweighting techniques include manually designed lightweight structures, neural architecture search, knowledge distillation, and model pruning. In Agora Lipsync's voice-driven mouth-movement task, the model Agora designed is essentially an image generation model with a relatively large footprint. Using model-lightweighting techniques, we built an end-to-end lightweight voice-driven image model that can animate a static image into a talking face from the voice stream alone. While preserving quality, the model's computation and parameter count are greatly reduced, meeting the deployment requirements of mobile devices. Fed with a voice signal, it drives a still face image to produce mouth movements in real time and keeps the audio and the picture in sync.
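As one illustration of the lightweighting techniques named above, the snippet below sketches a knowledge-distillation loss in which a small "student" generator is trained to imitate the frames produced by a large "teacher" generator while still matching the ground-truth frame. The blending weight and the presence of the ground-truth term are assumptions for illustration, not Agora's actual training setup.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_frame, teacher_frame, real_frame, alpha=0.5):
    """Blend imitation of the teacher's output with matching the ground-truth frame."""
    imitate_teacher = F.l1_loss(student_frame, teacher_frame.detach())  # no grad to teacher
    match_target = F.l1_loss(student_frame, real_frame)
    return alpha * imitate_teacher + (1.0 - alpha) * match_target
```

Trained this way, only the small student model needs to be shipped and run on the mobile device at inference time.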

Now that we have introduced the technical principles of Agora Lipsync, let's look at its application scenarios. Compared with metaverse virtual worlds and real-world video social scenes, Agora Lipsync fills a gap in voice-based social scenes: users get some of the visual presence of a video call while chatting on mic, without ever turning on the camera. It has great application value in chat rooms, interactive podcasts, video conferencing, and other scenarios.

Voice chat room: In a traditional voice chat room, users usually pick a real or virtual avatar and then chat together on mic, and the room typically has to rely on personalized, interesting topics to sustain the quality of conversation and keep users around. By adding voice-driven mouth-movement technology, the chat can take a more vivid and entertaining form. Players who do not want to turn on the camera can choose a nice or funny photo of themselves as their profile picture, so that the other side seems to see their face move as if they were talking in person, which ultimately gives players more motivation to keep chatting in the room.

Interactive podcast: Last year, interactive podcast platforms represented by Clubhouse became popular all over the world. Compared with traditional voice chat rooms, interactive podcast platforms differ noticeably in topic content and user relationships: podcast rooms mostly discuss technology, the internet, work, entrepreneurship, the stock market, music, and similar topics, and users are also very willing to upload their real profile photos. By adding voice-driven mouth-movement technology, conversations between users can feel more engaging and realistic.

Video conferencing: In video conferencing scenarios, participants are usually asked to turn on their cameras whenever possible. However, some users find it difficult to do so, which results in meetings where some participants join with video and others with voice only. On the one hand, Agora Lipsync spares users who cannot turn on their camera the awkwardness: driving the mouth movements of a face image creates the impression of a real person participating in the video conference. On the other hand, with voice-driven talking faces the conference does not need to transmit a video stream at all, only the voice stream; especially under weak network conditions, this not only avoids frozen or delayed pictures but also reduces transmission costs.

At present, Agora Lipsync mainly supports 2D portrait images and 3D portrait models. In the future, with the continued research of Agora's algorithm team, the technology will be further upgraded: it will not only support cartoon avatars but is also expected to drive the movement of the head, eyes, and other features from voice, enabling a wider range of application scenarios and greater value.

If you would like to learn more about or gain access to Agora Lipsync, please click "Read the original article" to leave your information, and we will contact you promptly for further communication.