What exactly is the wildly popular Avatar?

With the rise of the metaverse concept, the word Avatar has begun to appear more and more frequently. The English word Avatar was introduced to many people in 2009 by James Cameron's 3D sci-fi blockbuster Avatar. What many people do not know, however, is that the word was not invented by the director: it comes from Sanskrit and is an important term in Hinduism. According to the Cambridge English Dictionary, Avatar currently has three main meanings.

Avatar in the Cambridge Dictionary © Cambridge University Press

Avatar originally comes from the Sanskrit avatarana, formed from ava (off, down) + tarati (to cross over), literally meaning "to descend to the earth." It refers to the incarnation of a deity on earth, and usually refers specifically to the god Vishnu taking human or animal form. The word entered the English language in 1784.

In 1985, Chip Morningstar and Joseph Romero used the term Avatar to describe the user's online representation while designing Habitat, an online role-playing game, for Lucasfilm Games (LucasArts). Then, in 1992, science fiction writer Neal Stephenson published Snow Crash, which described a metaverse running parallel to the real world: everyone in the real world has an online avatar in the metaverse. This was the first time the term appeared in the mass media.

In the Internet era, programmers began to use the term Avatar widely in software systems to describe an image representing the user or the user's persona, commonly called an "avatar" or "personal show." An avatar can be a three-dimensional figure from an online game or virtual world, or the two-dimensional picture commonly used in online forums and communities. It is a token that stands in for the user.

From QQ Show to Avatar

Nowadays, letting users create their own avatars has become standard for all kinds of software applications, and with the development of technology, user avatars have evolved from ordinary 2D images into 3D ones. In 2017, Apple released Animoji, a new feature on the iPhone X that uses facial recognition sensors to detect changes in the user's facial expression, records the user's voice with the microphone, and generates cute 3D animated emoji that can be shared with friends through iMessage. The first generation, however, did not support user-defined images and offered only built-in animal cartoon avatars; a later update let users freely sculpt facial features to create stylized face avatars. Today, automatic face-creation features can be seen in many products: from just one or a few photos they automatically generate a CG model matching the characteristics of the user's face, but this relies on complex CG modeling and rendering technology.

An Avatar can also skip the expensive CG modeling and rendering pipeline altogether and use machine learning algorithms to "stylize" a photographed human face. In other words, the target style learned during training is automatically transferred and fused with the original facial features of the person in the photo, producing a stylized face Avatar that preserves the user's facial characteristics.

Four technical routes to a face-stylized Avatar

What is face stylization?

Face stylization converts a real face image into an avatar in a specific style, such as a cartoon, anime, or oil-painting style, as shown below:

Broadly speaking, face stylization can be achieved through texture mapping, style transfer, cycle-consistent adversarial networks, and latent-variable mapping.

Texture mapping

In texture mapping, a sample image is given, and the algorithm automatically attaches the texture of that image, pixel by pixel or patch by patch, onto the target face, forming a plausible, natural, and animatable face mask [1].

[1]
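
As a rough illustration of this idea (a minimal sketch, not the FaceBlit method of [1]), the snippet below warps a style exemplar onto a target photo triangle by triangle, assuming matching facial landmarks for both images are already available from some landmark detector such as dlib or MediaPipe:

```python
import cv2
import numpy as np

def warp_texture(style_img, target_img, style_pts, target_pts):
    """Piecewise-affine texture transfer.

    style_pts / target_pts: (N, 2) float arrays of corresponding facial landmarks
    (assumed to come from a landmark detector and to lie inside the images).
    """
    h, w = target_img.shape[:2]
    out = target_img.copy()

    # Delaunay-triangulate the target landmarks so each triangle can be warped on its own.
    subdiv = cv2.Subdiv2D((0, 0, w, h))
    for p in target_pts:
        subdiv.insert((float(p[0]), float(p[1])))

    def nearest_index(pt):
        return int(np.argmin(np.linalg.norm(target_pts - pt, axis=1)))

    for tri in subdiv.getTriangleList().reshape(-1, 3, 2):
        # Skip triangles that touch the virtual outer vertices added by Subdiv2D.
        if np.any(tri < 0) or np.any(tri[:, 0] >= w) or np.any(tri[:, 1] >= h):
            continue
        idx = [nearest_index(v) for v in tri]
        src_tri = style_pts[idx].astype(np.float32)
        dst_tri = target_pts[idx].astype(np.float32)

        # Affine-warp the corresponding patch of the exemplar into target coordinates.
        M = cv2.getAffineTransform(src_tri, dst_tri)
        warped = cv2.warpAffine(style_img, M, (w, h))

        # Paste only the pixels inside the triangle, gradually building the face mask.
        mask = np.zeros((h, w), dtype=np.uint8)
        cv2.fillConvexPoly(mask, dst_tri.astype(np.int32), 255)
        out[mask > 0] = warped[mask > 0]

    return out
```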

Style transfer

Given one or a group of style photos, a learning-based method extracts a style encoding from the style images and a content encoding from the target face image, and automatically generates the corresponding stylized image from the two encodings [2, 3]. This only changes the surface texture of the face picture; it cannot reasonably preserve or adjust the structural attributes of the face, so it cannot produce meaningful structural style changes.

[3]
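
The core of such learning-based methods can be illustrated with an AdaIN-style fusion of the two encodings. The sketch below is a simplified illustration in the spirit of [2, 3], with a tiny untrained encoder/decoder standing in for a real pretrained network:

```python
import torch
import torch.nn as nn

class StylizeNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder: a small conv stack standing in for a pretrained VGG encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder maps the fused features back to an image.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 3, 3, padding=1),
        )

    @staticmethod
    def adain(content_feat, style_feat, eps=1e-5):
        # Adaptive Instance Normalization: re-normalize the content features so their
        # per-channel mean/std match those of the style features (the "style encoding"
        # is just the channel statistics of the style image).
        c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
        c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
        s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
        s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
        return s_std * (content_feat - c_mean) / c_std + s_mean

    def forward(self, face_img, style_img):
        content_code = self.encoder(face_img)   # structure/content of the photographed face
        style_code = self.encoder(style_img)    # texture and colour statistics of the style image
        return self.decoder(self.adain(content_code, style_code))

# Usage: stylized = StylizeNet()(face_batch, style_batch), with (N, 3, H, W) tensors
# whose H and W are divisible by 4.
```

Because the fusion only re-normalizes feature statistics, it transfers texture and colour but leaves the geometric structure of the face untouched, which is exactly the limitation noted above.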

Cycle-consistent adversarial network

Using a cycle-consistent adversarial network and its reconstruction constraints, stylization can be trained without paired samples. This route is often combined with style transfer, which extracts the style encoding and content encoding separately, and it can also explicitly model and deform the structural information of the face according to the target style attributes (for example, based on facial key points). However, because there is no constraint on the intermediate result (the B in A->B->A), the final generation effect is uncontrollable and unstable; that is, the plausibility of A->B cannot be guaranteed [4].

[4]
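
The sketch below shows the generator-side objective of such a cycle-consistent setup (in the spirit of CycleGAN-style methods such as [4]); the generators and discriminators are assumed to be ordinary image-to-image and image-to-score networks supplied by the caller:

```python
import torch
import torch.nn.functional as F

def generator_losses(G_AB, G_BA, D_A, D_B, real_A, real_B, lambda_cyc=10.0):
    """Generator-side loss for one unpaired batch of real faces (A) and style images (B)."""
    fake_B = G_AB(real_A)   # A -> B: the stylization we actually want
    fake_A = G_BA(real_B)   # B -> A
    rec_A = G_BA(fake_B)    # A -> B -> A, should reproduce the original photo
    rec_B = G_AB(fake_A)    # B -> A -> B, should reproduce the original style image

    # Adversarial terms: each translated image should fool the discriminator of its domain.
    pred_B = D_B(fake_B)
    pred_A = D_A(fake_A)
    adv = F.mse_loss(pred_B, torch.ones_like(pred_B)) + F.mse_loss(pred_A, torch.ones_like(pred_A))

    # Cycle-consistency (reconstruction) terms: these make unpaired training possible,
    # but they only constrain the round trip, not the intermediate fake_B itself,
    # which is why the A -> B result can remain uncontrolled.
    cyc = F.l1_loss(rec_A, real_A) + F.l1_loss(rec_B, real_B)

    return adv + lambda_cyc * cyc
```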

Latent-variable mapping

Latent-variable mapping generally starts from a pre-trained real-face generation model and fine-tunes it on a set of style images to obtain a corresponding stylized face generation model [5, 6]. An encoding network, or a multi-step optimization procedure, maps the input face image to its corresponding latent variable, and that latent variable is then fed into the stylized generation model to obtain the stylized version of the face. Optimization-based latent mapping usually gives better results but requires heavy computation in practice. And although the mapped latent variable captures the global information of the face, it easily loses the details of the original input face, so the generated result may fail to reflect individual identifying features and detailed expressions.

[5] Sample images from Toonify. Photos /

[6]
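
The sketch below illustrates the optimization-based variant of latent-variable mapping: the input photo is inverted into the latent space of a pretrained real-face generator, and the same latent is then decoded by the fine-tuned stylized generator. Both generators are assumed pretrained models with a simple `generator(latent) -> image` interface, and the plain pixel loss stands in for the perceptual losses typically used in practice:

```python
import torch
import torch.nn.functional as F

def stylize_by_inversion(photo, real_generator, style_generator,
                         latent_dim=512, steps=300, lr=0.05):
    """photo: (1, 3, H, W) tensor matching the generators' output resolution."""
    # Start from a random latent and optimize it so that the real-face generator
    # reproduces the input photo.
    latent = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(real_generator(latent), photo)
        loss.backward()
        opt.step()

    # The same latent, decoded by the fine-tuned stylized generator, gives the avatar.
    with torch.no_grad():
        return style_generator(latent)
```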

Alibaba Cloud Video Cloud's self-developed Cartoon Intelligent Drawing Avatar

In 2020, the Cartoon Intelligent Drawing Avatar developed by Alibaba Cloud Video Cloud was launched and attracted attention from the industry. At the Computing Conference in October 2021, the Cartoon Intelligent Drawing project was shown at the Alibaba Cloud developer booth, where nearly 2,000 attendees lined up to try it, making it one of the most popular demos of the conference.

Alibaba Cloud's Cartoon Intelligent Drawing adopts the latent-variable mapping approach: by mining the salient features of the input face picture (such as eye size and nose shape), it automatically generates a virtual image with personal characteristics, i.e. the stylized effect.

First, a model that can generate high-definition face pictures, the real-face simulator, is trained in an unsupervised way on a massive, self-owned, licensed high-definition face dataset; controlled by latent variables, it can generate a large number of high-definition face pictures with different facial features. This model is then fine-tuned with a small number of collected target-style images (which do not need to correspond one to one with real faces) to obtain the stylized simulator. The real-face simulator and the stylized simulator share latent variables; that is, one latent variable can be mapped to a pair consisting of a "fake" face image and its corresponding stylized image.
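
A minimal sketch of how the shared latent space yields paired training data, assuming both simulators are generator models taking the same latent vector (names and sizes here are illustrative):

```python
import torch

@torch.no_grad()
def build_paired_dataset(real_simulator, style_simulator, num_pairs=50000, latent_dim=512):
    """Decode each sampled latent with both simulators to obtain a (real, stylized) pair."""
    pairs = []
    for _ in range(num_pairs):
        z = torch.randn(1, latent_dim)        # one latent variable controls both outputs
        real_face = real_simulator(z)         # "fake" but photo-realistic face
        stylized_face = style_simulator(z)    # its stylized counterpart
        pairs.append((real_face.cpu(), stylized_face.cpu()))
    return pairs  # used to supervise the image-translation network described next
```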

By sampling a large number of latent variables, we can obtain a large number of data pairs covering different face attributes (gender, age, expression, hairstyle, whether glasses are worn, and so on), which are then used to train an image translation network; its design is described below.

Model design

Because a face has an innate structure (eyes, nose, and so on), and because the structures of a real face and its stylized virtual image differ (cartoon eyes, for example, tend to be large and round), a local-region correlation module is added to the network, so that the user's eyes and the avatar's eyes keep a clear correspondence, together with a facial reconstruction constraint. This makes the generated virtual image lively and cute while preserving the user's personal characteristics.
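
The sketch below illustrates one training step under these constraints. It is an illustrative simplification, not the production model: the translation network, the eye-region boxes, and the loss weights are assumptions, and the local-region term is approximated here by simply up-weighting the reconstruction loss on eye regions, whereas the actual module computes correlations between local features of the input and output:

```python
import torch
import torch.nn.functional as F

def train_step(translator, optimizer, real_face, cartoon_face, eye_boxes):
    """One step; eye_boxes is a list of (y0, y1, x0, x1) crops, one per sample in the batch."""
    optimizer.zero_grad()
    pred = translator(real_face)

    # Paired reconstruction constraint: match the stylized target produced by the simulators.
    loss_rec = F.l1_loss(pred, cartoon_face)

    # Simplified local-region term: up-weight the loss on eye regions so that the
    # generated (large, round) cartoon eyes still follow the user's own eye shape.
    loss_local = 0.0
    for i, (y0, y1, x0, x1) in enumerate(eye_boxes):
        loss_local = loss_local + F.l1_loss(pred[i, :, y0:y1, x0:x1],
                                            cartoon_face[i, :, y0:y1, x0:x1])

    loss = loss_rec + 0.5 * loss_local   # the weighting here is illustrative
    loss.backward()
    optimizer.step()
    return loss.item()
```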

Results:

The future of the Avatar

Thanks to the rapid development of AI technology, we already have the technology to create virtual humans, but I believe this is only the beginning. In the foreseeable future, Avatars will appear more and more frequently in virtual worlds as the digital embodiments of the metaverse's inhabitants, and Avatars will become an extremely important class of digital asset in the virtual world.

Finally, quoting Zuckerberg's description of digital avatars to analysts in July: "The defining quality of the metaverse is presence, which is this feeling that you're really there with another person or in another place. Creation, avatars, and digital objects are going to be central to how we express ourselves, and this is going to lead to entirely new experiences and economic opportunities."

References:
[1] Aneta Texler, Ondřej Texler, Michal Kučera, Menglei Chai, and Daniel Sýkora. FaceBlit: Instant Real-time Example-based Style Transfer to Facial Videos. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 4(1), 2021.
[2] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A Neural Algorithm of Artistic Style. Journal of Vision, Vol. 16, 326, September 2016.
[3] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A Learned Representation for Artistic Style. In International Conference on Learning Representations (ICLR), 2017.
[4] Kaidi Cao, Jing Liao, and Lu Yuan. CariGANs: Unpaired Photo-to-Caricature Translation. ACM Transactions on Graphics (SIGGRAPH Asia), 2018.
[5] Justin N. M. Pinkney and Doron Adler. Resolution Dependent GAN Interpolation for Controllable Image Synthesis Between Domains. NeurIPS 2020 Workshop.
[6] Guoxian Song, Linjie Luo, Jing Liu, Wan-Chun Ma, Chunpong Lai, Chuanxia Zheng, and Tat-Jen Cham. AgileGAN: Stylizing Portraits by Inversion-Consistent Transfer Learning. ACM Transactions on Graphics (SIGGRAPH), 2021.
