With iOS 12, released this year, Apple introduced an interesting new feature called Memoji: a new kind of Animoji that uses the phone’s front-facing camera to animate an avatar you create yourself. When the feature first launched, there was a wave of Memoji impressions online.


While playing with Memoji, did you ever wonder whether you could create one from just a single photo? In this article, machine learning expert Pat Niemeyer uses neural networks to create Memoji from real photos of people. Specifically, he tests the VGG Face model, a network trained for face recognition, to see how well it can judge the similarity between real photos and Memoji, which “look” quite unlike photographs. He then uses these similarity scores to guide feature selection and create Memoji for new subjects.


The link to the code is at the bottom of the article.

Photo: Memoji created for Trump’s headshot

Photo: Memoji for Barack Obama’s headshot


Looking at the results

The two images above show the results of the neural-network-guided generation. They leave a lot to be desired, but I think the approach is fun, simple, and effective. My personal feeling is that the network may not even be “seeing” these Memoji as faces at all. There are several caveats worth mentioning up front:


  • The cartoon image

The first question is what a person “looks” like when represented in cartoon form. Cartoons exaggerate a character’s most distinctive features, but some features, such as hairstyle, are not fixed and vary greatly from photo to photo, even from day to day. For this reason, it seems plausible that a network trained to recognize individuals captures hair information in an abstract way that accommodates this variation. Conversely, that means it may not be well suited to generating a hairstyle from an arbitrary image.


  • Skin color and hair color

It’s hard to infer skin color from photos taken under arbitrary lighting conditions, and a simple test I ran performed fairly poorly. In my tests, the network generally preferred lighter skin tones and did not always distinguish well between realistic and unrealistic options (not shown). And while it did well at telling light hair from dark hair, it was more or less wrong when it came to brightly colored hair.



  • There is no API

One of the limitations I ran into while experimenting with Memoji generation is that there is currently no API for producing them in bulk (there is no direct way to automate them on iOS). This limits how efficiently we can search the space of possible Memoji as part of the generation process. Ideally, we would use something like a genetic algorithm to aggressively optimize feature combinations without relying on the features being separable, but that is not feasible here.


  • Choice of photo

Which photo you choose to create the Memoji from also has a big impact on the final outcome. Some photos generated much better results than others, and I’m not sure why. In general, I tried to look for photos that are representative, well cropped, and facing forward.


Neural network & setup

The actual code for this experiment is quite small. I’ll go through it piece by piece; the source link is at the end of the article.


  • VGG

VGG is a very popular neural network architecture for image recognition. VGG Face is a version of the model trained specifically for face recognition, and the entire trained network (the layer definitions and the learned weights) is publicly available for download. Using this pre-trained model is by far the best option, since training a model like this from scratch takes a great deal of time.


Figure: VGG Face image model architecture


  • Torch

I used the Torch scientific computing framework, which provides the environment needed to run the VGG model, including a Lua-based programming environment, libraries for tensor math, and building blocks for neural networks. I chose Torch simply because I had used it a lot before.


With just a few lines of code, Torch can load the pre-trained VGG Face model and run an image through it. The basic process is:


net = torch.load('./torch_model/VGG_FACE.t7')
img = load_image(my_file)
output = net:forward(img)


A few additional steps, such as loading the image and normalizing it, can be found in the source code.
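As a rough illustration, here is a minimal sketch of what that load/normalize step might look like. The 224×224 input size, per-channel mean values, and BGR channel order follow the usual VGG Face preprocessing convention and are assumptions on my part, not the author’s source:

require 'image'

-- Minimal sketch of load_image (assumptions: 224x224 input, VGG Face channel means, BGR order).
local function load_image(path)
  local img = image.load(path, 3, 'float'):mul(255)   -- load as RGB, scale to 0..255
  img = image.scale(img, 224, 224)                    -- VGG expects a 224x224 input
  local mean = { 129.1863, 104.7624, 93.5940 }        -- approximate per-channel (R, G, B) means
  for c = 1, 3 do img[c]:add(-mean[c]) end            -- subtract the mean from each channel
  return img:index(1, torch.LongTensor{ 3, 2, 1 })    -- reorder RGB -> BGR for the Caffe-trained weights
end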


  • Using the network layers

As shown in the figure above, VGG contains several types of layers. The input is a tensor of RGB image data, to which the network applies a series of convolutions, pooling operations, weightings, and other transformations. The “shape” and dimensionality of the data change as each layer learns increasingly abstract features. Finally, the last layer produces a one-dimensional prediction vector of 2622 elements, representing the probability that the input matches each of the specific people the network was trained on.


In our case we don’t care about those predictions; we want to use the network to compare arbitrary faces of our own. To do that, we can take the output of the layer just below the prediction layer, which provides a 4096-element vector describing the features of a face.

output = net.modules[selectedLayer].output:clone()


VGG16 has 16 weight layers, but the Torch implementation unrolls into a 40-module “model” structure, and module 38 produces the output we need.
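To make that concrete, here is a small, hedged sketch of how you might inspect the module list and pull out the 4096-element descriptor after a forward pass. The index 38 follows the text above; nothing else here is taken from the author’s code:

local selectedLayer = 38

print(net)                                   -- prints the full 40-module structure of the loaded model
net:forward(img)                             -- a forward pass fills in each module's .output field
local descriptor = net.modules[selectedLayer].output:clone()
print(#net.modules, descriptor:nElement())   -- expect 40 modules and a 4096-element vector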


  • Similarity

What we do, basically, is run pairs of images through the network and compare their outputs with a similarity metric. One simple way to compare two very large vectors of numbers is the dot product:

torch.dot(output1, output2)


This produces a single scalar value, which you can think of as a measure of how closely the two vectors “agree”, i.e. how much they point in the same direction in the high-dimensional space.


In these tests, we want to compare a candidate Memoji against several reference images and combine the results. So I normalize each pairwise value and average them to get a single “score”:


local sum = 0
for i = 1, #refs do
  local ref = refs[i]
  local dotself = torch.dot(ref, ref)
  sum = sum + torch.dot(ref, target) / dotself
end
...
return sum / #refs


The normalization means an image scores 1.0 when compared with itself, so higher “score” values imply greater similarity.


There are plenty of other metrics we could have used; the two most obvious are the Euclidean distance between the outputs and the mean squared error. I tried both briefly, but the dot product seemed to yield better results, which puzzles me a bit.
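For reference, here is a minimal sketch of what those alternative metrics look like alongside the dot product. This is an illustration under the assumption that output1 and output2 are two layer-38 descriptors; it is not code from the project:

local diff = output1 - output2
local euclidean = diff:norm()              -- L2 (Euclidean) distance between the descriptors
local mse = torch.pow(diff, 2):mean()      -- mean squared error
local dot = torch.dot(output1, output2)    -- the dot-product metric actually used above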


The first test: a set of Memoji heads

The first thing I wanted to check was whether the face network could handle cartoon-style Memoji at all. I started by grabbing 63 Memoji images from a Google search (mostly from Apple’s demos).



I then chose one as a reference image and had the network rank all of the Memoji, showing me the top three matches.
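A minimal sketch of that ranking step might look like the following; memojiFiles, descriptorFor, and score are hypothetical helpers standing in for the loading, forward-pass, and scoring code described above:

local results = {}
for _, file in ipairs(memojiFiles) do
  local candidate = descriptorFor(file)    -- load the image, forward it, grab the layer-38 output
  results[#results + 1] = { file = file, score = score({ reference }, candidate) }
end
table.sort(results, function(a, b) return a.score > b.score end)   -- highest score first
for i = 1, 3 do print(results[i].file, results[i].score) end       -- top three matches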


It worked out pretty well. Not only did the network find the identical Memoji (ranked first with a score of 1.0), but the second and third results were plausible matches as well.


Real images

Now comes the “real” image test: how does the neural network respond when real-life photos are compared with these Memoji? I grabbed some photos of famous people, and here are the results:

The results are interesting. Note that the neural network can only choose from a limited number of Memoji, and the main features it sees in these images may not match our expectations. Also note that the scores (confidence) in these comparisons are substantially lower than in the Memoji-to-Memoji comparisons.


The generation process

Next, I wanted to reverse the process and create a Memoji using the neural network to select the features. This is where it gets a little tricky. As I mentioned earlier, there is no obvious way to create Memoji automatically on iOS. While it might be possible on a jailbroken device, or by reproducing the Memoji artwork some other way, I wanted to keep things simple.

In my tests, I connected my phone to my laptop and mirrored its screen using QuickTime Player’s recording capability, keeping the window in a corner so that a script could grab it for processing. I then stepped through the options for each feature, selecting each one on the phone and pressing enter to capture and rank the output.
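The capture loop, sketched very roughly below, just waits for a keypress, grabs the mirrored screen region, and scores it. grabScreen, load_image, score, and refs are hypothetical helpers and data, not the author’s actual script:

while true do
  io.write('select the next option on the phone, then press enter... ')
  io.read()                                           -- wait for the keypress
  local img = load_image(grabScreen())                -- grabScreen() saves the mirrored region to a file
  net:forward(img)                                    -- forward pass through VGG Face
  local candidate = net.modules[selectedLayer].output:clone()
  print(score(refs, candidate))                       -- similarity against the reference photos
end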



Obviously this wasn’t ideal, for a couple of reasons. First, the process was painful (93 pairs of choices, over a dozen tries). More importantly, it only lets us evaluate one feature change at a time. In theory we could iterate, repeating the process until the network no longer suggests any changes, but that is not perfect either, since there may be “local minima” (the order of evaluation matters if one feature affects the perception of another).


On top of that, the annoying thing is that the avatar follows my face through the camera the whole time, so even small movements can affect the score.


Results

I gave away a few of the results at the beginning, but they could certainly be more polished.

The network also seems rather unstable in its choice of facial features, meaning that in some cases the top three options it selects are very similar to one another, but sometimes they are not. For example, here are the top three options the network chose for Trump’s hairstyle:

Some of the features matched my expectations, such as Obama’s famously prominent ears, and the network did come up with three results, ranked here from best to worst match:

Photo: Ears chosen for Obama Memoji

But the choice of eyes varied a lot:


Photo: The eyes chosen for Obama Memoji

In contrast, the eyes chosen for Mr. Trump varied much less:

Photo: The eyes chosen for Trump Memoji


At first Obama’s chin looked too square to me, but after a while I decided it was about right (which shows how subjective these judgments can be).


Hair color

As I said before, skin color can be difficult to infer, and hair color can be an issue as well. Trump’s and Obama’s hair colors came out fine, but when I later tested with a subject who has bright red hair, the network kept preferring gray:



Although the top three generated results included red hair, the outcome was not ideal. I tried a number of photos to see what factors influenced it, and also tweaked the way the images were preprocessed. However, even after substantially modifying the input image, the network kept producing a gray-haired Memoji.


In another experiment, I looked at the results generated from other layers of the network. Recall that we chose layer 38 because it is the highest layer representing facial features, but we can compare it with lower layers to see the difference. The results were indeed different when using layer 32, with hair color getting more attention. That layer corresponds to the last “pooling” layer in the VGG16 model, just before the first “fully connected” layer, so it probably retains more spatial and color information. This layer and other lower layers did a better job of generating Memoji for the bright red hair.
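Switching layers only requires reading a different module’s output. The sketch below shows the idea, using the module indices 38 and 32 from the text; the flattening of the pooled output is my assumption, not the author’s code:

net:forward(img)
local fcDescriptor   = net.modules[38].output:clone():view(-1)   -- fully connected facial features (4096 elements)
local poolDescriptor = net.modules[32].output:clone():view(-1)   -- flattened last pooling layer, keeps more spatial/color detail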


I also tried several ways of averaging the hair color choices (over the top three picks, and with the alternative layer). While this produced more reasonable results, it still never ended up with the desired bright red hair.


The Torch script for this project is available in my GitHub repository:

Github.com/patniemeyer…


References:


https://patniemeyer.github.io/2018/10/29/generating-memoji-from-photos.html