This article combines the mobile device camera with TensorFlow.js in the browser to build a real-time face emotion classifier. Since the backstory runs a bit long, readers more interested in the implementation itself can skip ahead to the technical solution overview.

The demo:

Preface

After twenty-five years of watching the seasons come and go, I thought nothing could stir me anymore. Then, to my surprise, the goddess actually agreed to my invitation for a date on Christmas Eve. Faced with such a golden opportunity, this front-end engineer suddenly panicked. Unlike some of the gentlemen around here, handsome and suave, I always end up single precisely because I can never guess what a girl is thinking. It is already the year 8102; if an outstanding young man like me cannot leave the single life behind, the Party and the people will never approve! Having learned my lesson, I decided to put my technical strengths to work, max out the skill tree of reading faces and moods, become a young man with real insight into the goddess's feelings, and win Christmas!

Getting started

Requirements analysis

We front-end engineers finally got TensorFlow.js in 2018, which means that even if we are weak on the theory and can't write Python, we can still pick up a trick or two from the algorithm folks using nothing but JS. So, if we can get photos of the goddess through legitimate channels during the date, can't we use an algorithm to analyze whether she is happy or… no, she must be happy!

However, things on a date change quickly; we can't just snap a picture, keep it on the phone, and then go back to our quiet home to analyze it with code. That would be far too slow! Time is life: if we can't know the goddess's mood on the spot, we might as well give ourselves -1s!

So, our goal is to be able to see something like this in real time on the phone (well, it's a bit crude, but this article focuses on functionality, haha):

Technical Solution Overview

It's quite simple: we need two things, image acquisition and applying the model. As for how to display the results, come on, we're front-end engineers; rendering is everyday work. For the front-end readers, the only unfamiliar part is probably how the algorithm model works; for the algorithm readers, the only unfamiliar part is probably how the camera is used on a mobile device.

Our process is shown in the figure below (we will optimize for computation speed later):

Next, let's walk through how to implement it following this flowchart!

Core 1: Image acquisition and display

Image acquisition

How do we capture images or a video stream with a mobile device? That's where WebRTC comes in. WebRTC (Web Real-Time Communication) is an API that enables web browsers to conduct real-time voice or video conversations. It was open-sourced on 1 June 2011 and, with the support of Google, Mozilla, and Opera, was included in the W3C recommendation.

Starting the camera and getting the captured video stream is exactly where we need the capability WebRTC provides; the core API is navigator.mediaDevices.getUserMedia.

The compatibility of this method is shown below. As you can see, it is reasonably well supported on common mobile phones, although there may still be differences across phone models, system types and versions, and browser types and versions. For better compatibility, consider using adapter.js as a shim, which isolates our app from those API differences. Some interesting examples can also be found here; the adapter.js implementation itself is worth a read.
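As a minimal illustration (not from the original article), a defensive feature check before calling the camera API might look like this:

// A minimal sketch: feature-detect getUserMedia before using it.
// With adapter.js loaded as a shim, older prefixed implementations get normalized for us.
if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
  console.log('getUserMedia is supported, safe to start the camera');
} else {
  console.log('getUserMedia is not supported in this browser');
}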

So how does it work? We can look it up on MDN: MediaDevices.getUserMedia() asks the user for permission to use a media input and produces a MediaStream of the specified type (e.g. an audio or video stream); the returned Promise resolves with a MediaStream object. If permission is denied or no matching media is found, the corresponding exception is thrown:

navigator.mediaDevices.getUserMedia(constraints)
.then(function(stream) {
  /* use the stream */
})
.catch(function(err) {
  /* handle the error */
});
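A side note not covered by the snippet above: on a phone you can also hint which camera to use through the standard facingMode constraint. A minimal sketch:

// A minimal sketch: ask for the front-facing ("selfie") camera.
// Use 'environment' instead of 'user' to request the rear camera.
const constraints = {
  audio: false,
  video: { facingMode: 'user' }
};

navigator.mediaDevices.getUserMedia(constraints)
  .then(stream => { /* use the stream */ })
  .catch(err => { /* handle the error */ });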

Therefore, we can set this up uniformly in the entry file:

class App extends Component {
  constructor(props) {
    super(props);
    // ...
    this.stream = null;
    this.video = null;
    // ...
  }

  componentDidMount() {
    // ...
    this.startMedia();
  }

  startMedia = () => {
    const constraints = {
      audio: false,
      video: true
    };
    navigator.mediaDevices.getUserMedia(constraints)
      .then(this.handleSuccess)
      .catch(this.handleError);
  }

  handleSuccess = (stream) => {
    this.stream = stream; // Keep the captured video stream
    this.video.srcObject = stream; // Feed it to the video element
  }

  handleError = (error) => {
    console.log('navigator.getUserMedia error: ', error);
  }
  // ...
}

Real-time display

Why do we need this.video? Because we not only have to display the captured video stream, we also want to visually mark the goddess's facial expression on it, so we display the video stream and draw basic shapes at the same time through a Canvas.

canvas.getContext('2d').drawImage(video, 0, 0, canvas.width, canvas.height);

Of course, we don't actually need to put a video element in the view; the App simply keeps it on the instance. canvas.width and canvas.height need to take the size of the mobile device into account.
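For illustration only (the original article does not show this setup), the internally maintained video element and a device-aware canvas size could be sketched roughly like this; this.canvas is assumed to be a reference to the on-screen canvas:

// A rough sketch of assumed setup, not the original code:
// keep the video element off the DOM and size the canvas to the viewport.
this.video = document.createElement('video');
this.video.autoplay = true;
this.video.playsInline = true; // avoids fullscreen playback on iOS Safari

this.canvas.width = window.innerWidth;
this.canvas.height = window.innerHeight;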

Drawing rectangles and text is very simple; we only need the location information computed by the model:

export const drawBox = ({ ctx, x, y, w, h, emoji }) => {
  ctx.strokeStyle = EmojiToColor[emoji];
  ctx.lineWidth = '4';
  ctx.strokeRect(x, y, w, h);
}

export const drawText = ({ ctx, x, y, text }) => {
  const padding = 4;
  ctx.fillStyle = '#ff6347';
  ctx.font = '16px sans-serif'; // ctx.font needs a font family to take effect
  ctx.textBaseline = 'top';
  ctx.fillText(text, x + padding, y + padding);
}

Core 2: Model prediction

Here we need to break the problem apart. Since "identifying the emotion behind the goddess's expression" is an image classification problem, it requires us to do two things:

  • Extract the face regions from the image;
  • Feed the extracted image patches to the model as input for classification.

Let’s discuss these two points step by step.

Face extraction

We'll do that with face-api.js. face-api.js is a face detection and recognition library for the browser built on top of the TensorFlow.js core API (@tensorflow/tfjs-core). It ships three very lightweight models suitable for mobile devices: SSD MobileNet V1, Tiny YOLO V2, and MTCNN. Understandably, accuracy takes a hit: these models have been slimmed down in model size, complexity, and power consumption. Some mobile devices built specifically for computation could handle the full models, but an ordinary phone usually has no discrete graphics card, so we can only use the mobile versions of the models.

We'll use MTCNN here. Let's take a quick look at the design of the model, shown below. As you can see, our image frames are converted into tensors of different sizes, passed through different nets, and run through a bunch of max-pooling; in the end, face classification, bounding box regression, and landmark localization are performed jointly. In short, given an input image, we get the classification of every face in it, the location of each detection box, and finer-grained locations of the eyes, nose, and lips.

Of course, you don't need to think about any of this too carefully when using face-api.js; it provides a lot of abstraction and encapsulation, and it is even quite considerate to front-end folks by shielding us from the notion of tensors: you simply pass an img DOM element with its src loaded to the wrapped methods (which return Promises), and it is converted internally into the tensors the model needs. With the following code, we can extract the faces from a video frame.

export class FaceExtractor {
  constructor(path = MODEL_PATH, params = PARAMS) {
    this.path = path;
    this.params = params;
  }

  async load() {
    this.model = new faceapi.Mtcnn();
    await this.model.load(this.path);
  }

  async findAndExtractFaces(img) {
    // ... some basic null checks to ensure the model is loaded before use
    const input = await faceapi.toNetInput(img, false, true);
    const results = await this.model.forward(input, this.params);
    const detections = results.map(r => r.faceDetection);
    const faces = await faceapi.extractFaces(input.inputs[0], detections);
    return { detections, faces };
  }
}

Emotion classification

Well, finally the core feature! A "good" habit is to first check GitHub for open-source code to reference (ignore this if you're a big shot). Here we use an existing real-time face detection and emotion classification model to complete our core function; it can distinguish happy, angry, sad, disgusted, neutral, and so on.

When it comes to using TensorFlow.js in the browser, most of the time we apply existing models: existing TensorFlow and Keras models are converted with tfjs-converter into a format TensorFlow.js can consume. It is worth mentioning that phones come with many built-in sensors and can collect a lot of data, so I believe TFJS will have plenty of room to play in the future. The specific conversion steps can be found in the documentation; let's keep going.

So, can we pass an img DOM element into the model as we did with face-api.js? No. The input to the model we use here is not an arbitrary image but a tensor that has been resized to a specified size and reduced to grayscale only. So before we continue, we need to do some preprocessing on the original image.

Ha ha, there's no dodging it any longer; let's take a moment to understand what a tensor is! Here's how the TensorFlow website explains it:

Tensors are generalizations of vectors and matrices to potentially higher dimensions. TensorFlow internally represents tensors as n-dimensional arrays of basic data types.

Don't worry if that sounds abstract; let's draw a picture of what a tensor looks like:

So we can simply think of a tensor as a higher-dimensional matrix stored as nested arrays. Of course, the RGB images we usually deal with have three channels; does that mean our image data is a three-dimensional tensor (width, height, channel)? Not quite: in TensorFlow the first dimension is usually N, the number of images (more precisely, the batch size), so the shape of an image tensor is generally [N, height, width, channel], a four-dimensional tensor.
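To make the shapes concrete, here is a small sketch using the same tf.fromPixels API that appears in the code below; someImgOrCanvas is just a placeholder for any image or canvas element:

// A quick illustration of the shapes involved (sizes depend on the source element)
const rgb = tf.fromPixels(someImgOrCanvas); // shape: [height, width, 3]
const batched = rgb.expandDims(0);          // shape: [1, height, width, 3], i.e. [N, h, w, c]
console.log(rgb.shape, batched.shape);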

So how do we preprocess the image? First, we center the pixel values from [0, 255] to [-127.5, 127.5], and then scale them down to [-1, 1].

const NORM_OFFSET = tf.scalar(127.5);

export const normImg = (img, size) => {
  // Convert to a tensor
  const imgTensor = tf.fromPixels(img);

  // Normalize from [0, 255] to [-1, 1].
  const normalized = imgTensor
    .toFloat()
    .sub(NORM_OFFSET) // center
    .div(NORM_OFFSET); // scale

  const { shape } = imgTensor;
  if (shape[0] === size && shape[1] === size) {
    return normalized;
  }

  // Adjust to the specified size
  const alignCorners = true;
  return tf.image.resizeBilinear(normalized, [size, size], alignCorners);
}

Then convert the image to grayscale:

export const rgbToGray = async imgTensor => {
  const minTensor = imgTensor.min()
  const maxTensor = imgTensor.max()
  const min = (await minTensor.data())[0]
  const max = (await maxTensor.data())[0]
  minTensor.dispose()
  maxTensor.dispose()

  // The gray-scale image should be normalized to [0, 1], according to the interval of pixel values
  const normalized = imgTensor.sub(tf.scalar(min)).div(tf.scalar(max - min))

  // The grayscale value is the average value of RGB
  let grayscale = normalized.mean(2)

  // Extend the channel dimension to get the correct tensor shape (h, w, 1)
  return grayscale.expandDims(2)
}

In this way, our input goes from a 3-channel color image to a 1-channel grayscale image.

Note that the preprocessing we do here is relatively simple, partly to avoid diving into the details, and partly because we are using a pre-trained model, so there is no need for complex preprocessing to improve training.

With the images ready, we need to prepare the model! Our model mainly needs to expose two methods: load, which loads the model, and classify, which classifies an image. Loading the model is as simple as calling tf.loadModel. Note that loading is an asynchronous process; the project we scaffolded with create-react-app already supports async/await in its bundled webpack configuration.

class Model {
  constructor({ path, imageSize, classes, isGrayscale = false }) {
    this.path = path
    this.imageSize = imageSize
    this.classes = classes
    this.isGrayscale = isGrayscale
  }

  async load() {
    this.model = await tf.loadModel(this.path)

    // Warm up
    const inShape = this.model.inputs[0].shape.slice(1)
    const result = tf.tidy(() => this.model.predict(tf.zeros([1, ...inShape])))
    await result.data()
    result.dispose()
  }

  async imgToInputs(img) {
    // Convert to a tensor and resize
    let norm = await normImg(img, this.imageSize)
    // Convert to grayscale input
    norm = await rgbToGray(norm)
    // This is where the batch size is set to 1
    return norm.reshape([1, ...norm.shape])
  }

  async classify(img, topK = 10) {
    const inputs = await this.imgToInputs(img)
    const logits = this.model.predict(inputs)
    const classes = await this.getTopKClasses(logits, topK)
    return classes
  }

  async getTopKClasses(logits, topK = 10) {
    const values = await logits.data()
    let predictionList = []

    for (let i = 0; i < values.length; i++) {
      predictionList.push({ value: values[i], index: i })
    }

    predictionList = predictionList
      .sort((a, b) => b.value - a.value)
      .slice(0, topK)

    return predictionList.map(x => {
      return { label: this.classes[x.index], value: x.value }
    })
  }
}

export default Model

So we can see that our model returns something called logits, and in order to know what the classification actually is, we run getTopKClasses. This may be a little confusing if you are less familiar with the area. In fact, for a classification model, the returned result is not one specific class but a probability distribution over all the classes. For example:

// For illustration
const classifyResult = [0.1, 0.2, 0.25, 0.15, 0.3];

In other words, the result of our classification is not that the object in the image is definitely a person or definitely a dog, but that it could be each of them with some probability. For example, if our labels correspond to ['woman', 'man', 'big dog', 'small dog', 'husky'], then the result above should be read as: the object in the image has a 25% chance of being a big dog and a 20% chance of being a man.

That is why we need getTopKClasses. In our scenario we only care about the most likely emotion, so we take the top-1 value of the probability distribution to get the most probable prediction.
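As a quick illustration (the labels below are made up for the example, not the model's actual class list), picking the top-1 entry from the distribution above would look like this:

// Hypothetical labels, for illustration only
const classes = ['woman', 'man', 'big dog', 'small dog', 'husky'];
const classifyResult = [0.1, 0.2, 0.25, 0.15, 0.3];

// Pair each probability with its index, then take the largest one (top-1)
const top1 = classifyResult
  .map((value, index) => ({ value, index }))
  .sort((a, b) => b.value - a.value)[0];

console.log(classes[top1.index], top1.value); // 'husky' 0.3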

See how semantically clear the high-level methods wrapped by TFJS are?

Finally, we integrate the face extraction function mentioned above with the emotion classification model, and add some basic canvas drawing:

  // Slight adjustment
  analyzeFaces = async (img) => {
    // ...
    const faceResults = await this.faceExtractor.findAndExtractFaces(img);
    const { detections, faces } = faceResults;

    // Classify each extracted face
    let emotions = await Promise.all(
      faces.map(async face => await this.emotionModel.classify(face))
    );
    // ...
  }

  drawDetections = () => {
    const { detections, emotions } = this.state;
    if (!detections.length) return;

    const { width, height } = this.canvas;
    const ctx = this.canvas.getContext('2d');
    const detectionsResized = detections.map(d => d.forSize(width, height));
    detectionsResized.forEach((det, i) => {
      const { x, y } = det.box;
      const { emoji, name } = emotions[i][0].label;
      drawBox({ ctx, ...det.box, emoji });
      drawText({ ctx, x, y, text: emoji, name });
    });
  }

And we’re done!

Real-time optimization

In fact, there is one more thing to consider: real-time performance. Our pipeline uses two models, and even though we picked versions streamlined for mobile devices, there are still performance problems. If we run predictions in a blocking fashion, frame by frame, there will be noticeable delays, and the goddess's smile will look jerky and stiff.

So we need to do some optimization to get a smoother picture.

Here we use a flag to mark whether a model computation is currently in progress. If it is, we simply move on to the next event loop; otherwise we kick off the asynchronous model computation. Meanwhile, every event loop still performs the Canvas drawing, so the detection box is always displayed, using the previous model result cached in state. This approach makes sense because the movement of a face is usually continuous (if it isn't, we may need to reconsider this world), so it displays results smoothly without blocking on the model computation and causing stutter; it is essentially a form of discrete sampling.

  handleSnapshot = async () => {
    // ... some Canvas preparation
    canvas.getContext('2d').drawImage(video, 0, 0, canvas.width, canvas.height);
    this.drawDetections(); // Draw the result maintained in state
    
    // Use flag to determine whether a model prediction is in progress
    if (!this.isForwarding) {
      this.isForwarding = true;
      const imgSrc = await getImg(canvas.toDataURL('image/png'));
      this.analyzeFaces(imgSrc);
    }

    const that = this;
    setTimeout(() => {
      that.handleSnapshot();
    }, 10);
  }

  analyzeFaces = async (img) => {
    // ... other operations
    const faceResults = await this.models.face.findAndExtractFaces(img);
    const { detections, faces } = faceResults;

    let emotions = await Promise.all(
      faces.map(async face => await this.models.emotion.classify(face))
    );
    this.setState(
      { loading: false, detections, faces, emotions },
      () => {
        // Once the new prediction arrives, set the flag back to false so prediction can run again
        this.isForwarding = false;
      }
    );
  }

Results

Let’s test this on the goddess to see how it works:

Well, not bad at all! Even though it sometimes classifies a smile as neutral, Gakki's acting is still a bit… all right, all right, we're running out of time. Grab your gear and get ready for the date: put on a handsome plaid shirt, dressed like a proper programmer.

The ending

On date night, over hot pot and karaoke, the goddess and I talked very happily. Just as the atmosphere was growing ambiguous and the conversation turned to feelings, I naturally asked about the goddess's ideal type. To my surprise, she suddenly said:

At that moment I thought of Eason’s song:

Write a card to someone who can be sent heartbroken like scraps of paper in the street

References

  • developer.mozilla.org/en-US/docs/…

  • github.com/webrtc/adap…

  • github.com/justadudewh…

  • github.com/tensorflow/…

  • Zhang K, Zhang Z, Li Z, et al. Joint face detection and alignment using multitask cascaded convolutional networks[J]. IEEE Signal Processing Letters, 2016, 23(10): 1499-1503.

The article can be reproduced at will, but please keep the original link.

If you are passionate enough to join ES2049 Studio, please send your resume to caijun.hcj(at)alibaba-inc.com.