• Modern Face Recognition with Deep Learning
  • By Adam Geitgey
  • Permalink: Modern Face Recognition with Deep Learning
  • Translator: Vgbhfive

We’re going to learn how modern face recognition works. But just recognizing your friends would be too easy, so let’s push this technology to its limits and solve a more challenging problem: telling Will Ferrell apart from Chad Smith.

One of these people is Will Ferrell. The other is Chad Smith. I swear they’re different people!


How do you use machine learning on a very complicated problem?

So far, in Parts 1, 2, and 3, we have used machine learning to solve isolated problems with only one step. All of those problems could be solved by choosing one machine learning algorithm, feeding in the data, and getting a result.

But face recognition is actually a series of related problems:

  • First, look at a picture and find all the faces in it.
  • Second, focus on each face and understand that, even if a face is turned in a strange direction or is in bad lighting, it is still the same person.
  • Third, pick out unique features of the face that distinguish it from other faces, such as eye size and face length.
  • Finally, compare those unique features to all the people we already know in order to determine the person’s name.


Our human brains do all of this automatically and instantly. In fact, humans are remarkably good at recognizing the faces we see in everyday life. Computers don’t have that ability (at least not yet), so we have to teach them how to do each step of the process separately.

We need to build a pipeline in which each step of face recognition is addressed separately and the results of the current step are passed on to the next step. In other words, we need to link several machine learning algorithms together.

Here is how a basic pipeline for detecting faces might work:


Face recognition

To solve this problem, we will use a different machine learning algorithm for each step. I won’t fully explain each algorithm, but you will learn the main idea behind each one and how to build your own face recognition system in Python.

1. Find all the faces

The first step in our pipeline is face detection. Obviously, we have to locate all the faces in a picture before we can tell them apart.

If you’ve used any smartphone camera in the past decade, you’ve probably seen face detection in action. Face detection is a great camera feature: when the camera can automatically pick out faces, it can make sure every face is in focus before taking the picture. Here, however, we will use this capability for another purpose – finding the areas of the image we want to pass on to the next step.


In the early 2000s, Paul Viola and Michael Jones developed a face detection method that could run fast enough on cheap cameras to become mainstream. But there are now more reliable solutions. We will use a method invented in 2005 called the Histogram of Oriented Gradients, or HOG for short.


To find faces in an image, we first make the image black and white, because we don’t need color data to find faces:

We will then look at every single pixel in the image, one at a time. For each pixel, we also look at the pixels that directly surround it:

Our goal is to figure out how dark the current pixel is compared to the pixels directly surrounding it. Then we draw an arrow showing the direction in which the image is getting darker:

Looking at just this one pixel and the pixels touching it, the image is getting darker towards the upper right.

If you repeat this process for every single pixel in the image, you end up with every pixel being replaced by an arrow. These arrows are called gradients, and they show the flow of the whole image from light to dark:

This may seem like a random thing to do, but there is a good reason for converting pixels to gradients. If we analyze pixels directly, a really dark image and a really bright image of the same person will have completely different pixel values. But by considering only the direction in which brightness changes between neighboring pixels, both the dark image and the bright image end up with the exact same representation. That makes the problem much easier to solve.


But saving the gradient for every single pixel gives us far too much detail. It would be better if we could just see the basic flow of lightness and darkness at a higher level, so that we capture the basic pattern of the image.

To do this, we split the image into small squares of 16x16 pixels each. In each square, we count how many gradients point in each major direction (how many point up, point up-right, point right, and so on). Then we replace that square in the image with the arrow direction that was strongest.
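To make that binning step concrete, here is a rough numpy sketch. The cell size and the number of direction bins are just illustrative choices, and real HOG implementations add block normalization and other refinements on top of this idea:

```python
import numpy as np


def dominant_gradient_directions(gray, cell_size=16, n_bins=9):
    """For each cell_size x cell_size square, find the strongest gradient direction."""
    # Brightness change along the vertical (gy) and horizontal (gx) axes.
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % np.pi          # direction of each gradient, 0..180 degrees

    rows, cols = gray.shape[0] // cell_size, gray.shape[1] // cell_size
    dominant = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            cell = (slice(r * cell_size, (r + 1) * cell_size),
                    slice(c * cell_size, (c + 1) * cell_size))
            # Each pixel votes for its direction bin, weighted by gradient strength.
            hist, edges = np.histogram(angle[cell], bins=n_bins, range=(0, np.pi),
                                       weights=magnitude[cell])
            dominant[r, c] = edges[hist.argmax()]  # keep only the strongest direction
    return dominant


# Example: run it on a random 128x128 "image" just to show the output shape.
fake_gray = np.random.rand(128, 128)
print(dominant_gradient_directions(fake_gray).shape)   # -> (8, 8)
```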

The end result is that we have converted the original image into a very simple representation that captures the basic structure of the face in a simple way:

The original image is converted to HOG representation, which can capture the main features of the image regardless of the brightness of the image.

We’ve come a long way. To find faces in this HOG image, all we have to do is find the part of the image that looks most similar to a known HOG pattern extracted from a bunch of other training faces:

Using this technology, we can easily find faces in any image.


If you want to try it out yourself using Python and Dlib, the following code will help you generate and view HOG representations of images.
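As a stand-in for the linked example, here is a small sketch that visualizes the HOG representation with scikit-image and then runs dlib's HOG-based face detector. The image file name is a placeholder:

```python
import dlib
import matplotlib.pyplot as plt
from skimage import io, color
from skimage.feature import hog

image = io.imread("test_image.jpg")          # placeholder path
gray = color.rgb2gray(image)

# Compute and display the HOG representation of the image.
_, hog_image = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                   cells_per_block=(1, 1), visualize=True)
plt.imshow(hog_image, cmap="gray")
plt.show()

# dlib's frontal face detector is itself built on HOG features plus a linear classifier.
detector = dlib.get_frontal_face_detector()
for box in detector(image, 1):               # 1 = upsample once to find smaller faces
    print("Face found at", box.left(), box.top(), box.right(), box.bottom())
```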

2. Pose and project faces

At this point we run into a problem: we have isolated the faces in our image, but a face turned in a different direction looks completely different to a computer:

Humans can easily recognize that both images belong to Will Ferrell, but a computer would treat them as two completely different people.

To solve this problem, we will try to warp each image so that the eyes and lips are always in the same place in the image. This will make it much easier to compare faces in the later steps.

To do this, we will use an algorithm called face landmark estimation. There are many ways to do this, but we will use the method invented by Vahid Kazemi and Josephine Sullivan in 2014.

The basic idea is that there are 68 specific points (called landmarks) on every face: the top of the chin, the outer edge of each eye, the inner edge of each eyebrow, and so on. We then train a machine learning algorithm to find these 68 specific points on any face:

Find 68 landmarks on each face.

Here is the result of locating the 68 face landmarks on our test image:

Now that we know where the eyes and mouth are, we simply rotate, scale, and shear the image so that the eyes and mouth are centered as well as possible. We won’t do any fancy 3D warps; we only use basic image transformations, such as rotation and scaling, that preserve parallel lines. These are called affine transformations.
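As an illustration of that kind of affine warp, here is a sketch using OpenCV. The three source points stand in for landmark positions (eyes and mouth) you would get from the landmark step, and both the source and target coordinates are made-up example values:

```python
import cv2
import numpy as np

image = cv2.imread("face.jpg")                     # placeholder path

# Three landmark positions found on this particular face (hypothetical values):
src = np.float32([[120, 140],                      # left eye
                  [210, 135],                      # right eye
                  [165, 230]])                     # center of mouth

# Where we want those landmarks to end up in the aligned 256x256 output:
dst = np.float32([[80, 95], [176, 95], [128, 180]])

# An affine transform (rotation, scale, translation, shear) that maps src onto dst.
matrix = cv2.getAffineTransform(src, dst)
aligned = cv2.warpAffine(image, matrix, (256, 256))
cv2.imwrite("face_aligned.jpg", aligned)
```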

Now, no matter how the face is rotated, we can center the eyes and mouth at roughly the same place in the image, which makes our next step more accurate.


If you want to try it yourself using Python and Dlib, the code below for finding face landmarks is a great help.
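A minimal sketch of that landmark step with dlib: the 68-point predictor model (shape_predictor_68_face_landmarks.dat) is distributed separately by dlib, and the image path is a placeholder.

```python
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = dlib.load_rgb_image("face.jpg")            # placeholder path
for box in detector(image, 1):
    landmarks = predictor(image, box)
    # landmarks.part(i) is the i-th of the 68 points (chin, eyes, eyebrows, ...).
    points = [(landmarks.part(i).x, landmarks.part(i).y) for i in range(68)]
    print(points[:5])
```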

3. Encode faces

Now that we’ve solved the posing problem, it’s time to recognize and tell apart the faces in the image. To do this, we need a way to extract a few basic measurements from each face. Then we can measure an unknown face in the same way and find the known face with the closest measurements (for example, we could measure the size of each ear, the distance between the eyes, the length of the nose, and so on).

The most reliable way to measure a face

It turns out that measurements that seem obvious to us humans (eye color, for example) don’t really make sense to a computer looking at individual pixels in an image. Researchers have found that the most accurate approach is to let the computer figure out which measurements to collect on its own. Deep learning is better than humans at determining which parts of a face are important to measure.

The solution is to train a deep convolutional neural network. But instead of training it to recognize objects in pictures, we train it to generate 128 measurements for each face.

The training process works by looking at three face images at a time:

  • Load a training face image of a known person.
  • Load another picture of the same known person.
  • Load a picture of a totally different person.

The algorithm then looks at the measurements it is currently generating for each of those three images. It then tweaks the neural network slightly so that the measurements it generates for #1 and #2 are slightly closer together, while the measurements for #2 and #3 are slightly further apart:

After repeating this step millions of times over millions of images of thousands of different people, the neural network learns to reliably generate 128 measurements for each person. Any ten different pictures of the same person should give roughly the same measurements.
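A minimal sketch of the idea behind that "triplet" training step, written in plain numpy. The margin value and the squared-distance measure are illustrative; real systems train a deep network with variants of this loss:

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Penalize the network unless the two images of the same person are closer
    together than the image of the different person, by at least `margin`."""
    pos_dist = np.sum((anchor - positive) ** 2)   # distance between the same-person pair
    neg_dist = np.sum((anchor - negative) ** 2)   # distance to the different person
    return max(0.0, pos_dist - neg_dist + margin)


# Three 128-measurement vectors produced by the network (random stand-ins here):
a, p, n = (np.random.rand(128) for _ in range(3))
print(triplet_loss(a, p, n))
```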

The 128 measurements generated for each face are called an embedding. The idea of reducing complicated raw data, such as a picture, to a computer-generated list of numbers comes up a lot in machine learning.

Encode images of our faces

The process of training a convolutional neural network to output face embeddings requires a lot of data and computing power. Even with an expensive NVIDIA Tesla graphics card, it takes about 24 hours of continuous training to get good accuracy.

But once the network has been trained, it can generate measurements for any face, even ones it has never seen before, so this training only needs to be done once. Luckily, pre-trained networks are already available (for example from the OpenFace project mentioned below), and we can use one of those directly.

At this point, all we need to do is run each face image through the pre-trained neural network to get 128 measurements for that face. Here are the measurements for a test image:
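Using the face_recognition library mentioned at the end of this article, getting those 128 measurements takes only a couple of lines (the image path is a placeholder):

```python
import face_recognition

# Load an image file and compute one 128-number encoding for each face found in it.
image = face_recognition.load_image_file("test_image.jpg")   # placeholder path
encodings = face_recognition.face_encodings(image)

print(len(encodings), "face(s) found")
if encodings:
    print(encodings[0])    # a numpy array with 128 measurements for the first face
```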

So which parts of the face are these 128 numbers measuring, exactly? It turns out that we have no idea, and it doesn’t really matter to us. All we care about is that the network produces nearly identical measurements when looking at two different pictures of the same person.


If you want to try this step yourself, OpenFace provides a Lua script that will generate embeddings for all the images in a folder and write them to a CSV file.

4. Find the name of the face from the encoding

The last step is actually the simplest of the whole process. All we need to do is find the person in our database of known people whose measurements are closest to those of our test image.

To do this we can use any basic machine learning classification algorithm. No fancy deep learning tricks are needed; we will just use a simple linear SVM classifier, though many other classification algorithms would also work.

All we need to do is train a classifier that takes the measurements from a new test image and tells us which known person is the closest match. Running this classifier takes only a fraction of a second, and the result is the person’s name.
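A sketch of that classification step with scikit-learn. In practice the encodings and names would come from step 3; random vectors are used here only so the sketch runs on its own:

```python
import numpy as np
from sklearn import svm

# Stand-ins for the 128-number encodings and matching name labels from step 3.
known_encodings = np.random.rand(60, 128)
known_names = ["Will Ferrell", "Chad Smith", "Jimmy Fallon"] * 20

# A simple linear SVM trained to map an encoding to a person's name.
classifier = svm.SVC(kernel="linear")
classifier.fit(known_encodings, known_names)

# For a new face, compute its encoding the same way and ask the classifier for a name.
unknown_encoding = np.random.rand(1, 128)
print(classifier.predict(unknown_encoding)[0])
```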

So let’s test it. First, I trained a classifier with the embeddings of about 20 pictures each of Will Ferrell, Chad Smith, and Jimmy Fallon:

And here is the result on new test images.

The results speak for themselves: the neural network recognizes faces in different poses, even side profiles, without any trouble.


Review

Let’s review the steps above:

  • Encode a picture with the HOG algorithm to create a simplified version of the image. Using this simplified image, find the part of the image that looks most like a generic HOG encoding of a face.
  • Figure out the pose of the face by finding its main landmarks. Once these landmarks are found, use them to warp the image so that the eyes and mouth are centered.
  • Pass the centered face image through a neural network that knows how to measure facial features, and save the resulting 128 measurements.
  • Look at all the faces we’ve measured before and find the person whose measurements are closest to the face we need to identify now. That’s our match.

The author has wrapped all of the steps above into a Python face recognition library called face_recognition.
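A minimal end-to-end sketch using that face_recognition library (the image file names are placeholders):

```python
import face_recognition

# Encode one known picture of each person (placeholder file names).
known_names = ["Will Ferrell", "Chad Smith"]
known_encodings = [
    face_recognition.face_encodings(face_recognition.load_image_file("will.jpg"))[0],
    face_recognition.face_encodings(face_recognition.load_image_file("chad.jpg"))[0],
]

# Find and identify every face in a new photo.
unknown = face_recognition.load_image_file("unknown.jpg")
for encoding in face_recognition.face_encodings(unknown):
    matches = face_recognition.compare_faces(known_encodings, encoding)
    name = known_names[matches.index(True)] if True in matches else "unknown"
    print(name)
```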


References

Github.com/ageitgey/fa…

medium.com/@ageitgey/m…

Github.com/cmusatyalab…