• Real-time Object Detection with YOLO
  • By Matthijs Hollemans
  • The Nuggets translation Project
  • Translator: Danny Lau
  • Proofread by: Dalston Xu, DeepMissea

Deep Learning on iOS — Real-time object detection on iOS with YOLO

Translator’s Note: Here are some terms you may encounter while reading this article.

  • Metal: a low-level graphics and compute application programming interface (API) introduced by Apple in iOS 8. It provides the thin layer software needs to run on different graphics chips. (It sits at a similar level to OpenGL ES.)
  • Classifier: a function or model that maps input data records to one of a given set of categories and can be used for prediction.
  • Batch normalization: a technique that counteracts shifts in the data distribution inside intermediate layers during training, which helps prevent vanishing or exploding gradients and speeds up training.
  • The terminology in this translation mainly follows the Chinese translation of Stanford University’s UFLDL Tutorial by Sun Xun et al.

Object detection is one of the classic problems in computer vision: identifying what objects are contained in a given image and their position in the image.

Detection is a more complex problem than classification. Classification also identifies objects, but it doesn’t tell you where they are in the image, and it doesn’t work well on images that contain multiple objects.

YOLO is a clever neural network for handling real-time object detection.

In this blog post I will show you how to get a “mini” version of YOLOv2 to run on iOS using MetalPerformanceShaders.

Before you go any further, be sure to check out this impressive YOLOv2 trailer. 😎

How does YOLO work

You can turn a classifier such as VGGNet or Inception into an object detector by sliding a small window across the image. At each position, you run the classifier to get a prediction for what kind of object is inside the current window. A sliding window gives you hundreds of predictions for the image, but you only keep the ones the classifier is most certain about.

This scheme works but it’s obviously very slow, since you have to run the classifier many times. A slightly better approach is to first predict which parts of the image are likely to contain interesting information (so-called region proposals) and then run the classifier only on those regions. With region proposals the classifier does less work than with sliding windows, but it still runs many times.

YOLO takes a completely different approach. It is not a traditional classifier that has been repurposed as an object detector. YOLO really does look at the image only once (hence the name: You Only Look Once), but in a clever way. YOLO splits the image into a grid of 13×13 cells:

The 13×13 grid

Each grid cell is responsible for predicting five bounding boxes. A bounding box describes a rectangle that encloses an object.

YOLO also outputs a confidence score that tells us how certain it is that the predicted bounding box actually encloses an object. This score doesn’t say anything about what kind of object is inside the box, just whether the shape of the box is any good.

The predicted bounding boxes might look something like this (the higher the confidence score, the thicker the box is drawn):

For each bounding box, the cell also predicts a class. This works just like a classifier: it gives a probability distribution over all the possible classes. This version of YOLO was trained on the PASCAL VOC dataset, which covers 20 different classes, such as:

  • bicycle
  • boat
  • car
  • cat
  • dog
  • person
  • and so on…

The confidence value of the bounding box and the prediction of the class combine into a final score that tells us how likely it is that the bounding box contains a particular type of object. For example, the big, thick yellow box on the left gives an 85% chance that it contains the object “dog.”

The bounding boxes with their class scores
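
As a rough illustration of how such an 85% score could come about, here is the arithmetic with made-up numbers (they are not taken from the model):

// Hypothetical numbers, just to illustrate how the final score is formed.
let boxConfidence: Float = 0.90     // how sure YOLO is that the box contains some object
let dogProbability: Float = 0.94    // the class probability for "dog" in this box
let finalScore = boxConfidence * dogProbability   // 0.846, i.e. roughly 85% "dog"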

There are 13×13 = 169 cells, and each cell predicts 5 bounding boxes, so we end up with 845 bounding boxes. As it turns out, most boxes have low confidence values, so we’ll only keep those with a final score of 30% or above (you can change the lower limit depending on how accurate you want it to be).

Here are the final predictions:

The final prediction

Out of a total of 845 bounding boxes, we only kept these three because they gave the best results. But notice that although there are 845 separate predictions, they all run simultaneously — the neural network only runs once. That’s why YOLO is so powerful and fast.

(The image above is from pjreddie.com.)

The neural network

The architecture of YOLO is very simple, it is a convolutional neural network:

Layer         kernel  stride  output shape
---------------------------------------------
Input                          (416, 416, 3)
Convolution    3x3      1      (416, 416, 16)
MaxPooling     2x2      2      (208, 208, 16)
Convolution    3x3      1      (208, 208, 32)
MaxPooling     2x2      2      (104, 104, 32)
Convolution    3x3      1      (104, 104, 64)
MaxPooling     2x2      2      (52, 52, 64)
Convolution    3x3      1      (52, 52, 128)
MaxPooling     2x2      2      (26, 26, 128)
Convolution    3x3      1      (26, 26, 256)
MaxPooling     2x2      2      (13, 13, 256)
Convolution    3x3      1      (13, 13, 512)
MaxPooling     2x2      1      (13, 13, 512)
Convolution    3x3      1      (13, 13, 1024)
Convolution    3x3      1      (13, 13, 1024)
Convolution    1x1      1      (13, 13, 125)

This neural network uses only standard layer types: convolution with a 3×3 kernel and max-pooling over a 2×2 window, nothing fancy. There is no fully-connected layer in YOLOv2.

Note: The “mini” version of YOLO we will be using has only 9 convolutional layers and 6 pooling layers. The full YOLOv2 model has three times as many layers and a somewhat more complex architecture, but it is still a regular convolutional network.

The final convolutional layer has a 1×1 kernel and reduces the data to 13×13×125. This 13×13 should look familiar: it is the size of the grid the image was divided into.

So we end up with 125 channels for every grid cell. These 125 numbers contain the bounding box data and the class predictions. Why 125? Well, each cell predicts 5 bounding boxes, and a bounding box is described by 25 data elements (a short breakdown follows the list below):

  • the x, y coordinates, width, and height of the bounding box rectangle
  • the confidence score
  • the probability distribution over the 20 classes
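
Restating that arithmetic as a tiny sketch:

let boxesPerCell = 5
let numClasses   = 20
let valuesPerBox = 4 + 1 + numClasses              // x, y, w, h + confidence + 20 class scores = 25
let channelsPerCell = boxesPerCell * valuesPerBox  // 5 × 25 = 125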

Using YOLO is simple: you give it an input image (resized to 416×416 pixels), it goes through the convolutional network in a single pass, and out comes a 13×13×125 tensor describing the bounding boxes for the grid cells. All you need to do then is compute the final scores for the bounding boxes and throw away the ones scoring lower than 30%.

Hint: To learn more about how YOLO works and how it was trained, check out this fascinating talk by one of its inventors. The video actually describes YOLOv1, an older version with a slightly different architecture, but the main ideas are the same. It’s worth watching!

Converting to Metal

The architecture I just described is that of Mini YOLO, which is the version we’ll be using in the iOS app. The full YOLOv2 network has three times as many layers, which is a bit too big to run quickly on current iPhones. That’s why Mini YOLO uses fewer layers, which makes it considerably faster than its big brother but also a little less accurate.

YOLO is written in Darknet, a custom deep learning framework by the YOLO authors. The downloadable weights are only available in Darknet format. Although Darknet is open source, I didn’t really want to spend a lot of time figuring out how it works.

Fortunately, someone has already done that work and converted the Darknet model to Keras, which happens to be the deep learning tool I use. So all I had to do was run the YAD2K script to convert the Darknet weights to Keras, and then write my own script to convert the Keras weights to Metal.

However, there was one wrinkle: YOLO uses a technique called batch normalization after its convolutional layers.

The idea behind batch normalization is that neural networks work best when the data is clean. Ideally, the data going into a layer has an average value of zero and not too much variance. Anyone who has done machine learning should find this familiar, because we often use a technique called “feature scaling” or “whitening” on our input data to achieve this effect.

Batch normalization does a similar kind of feature scaling for the data as it flows between layers. This trick helps the neural network perform better because it keeps the data from deteriorating as it moves through the network.
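
As a reminder of what that kind of feature scaling looks like on plain data, here is a generic sketch (not YOLO-specific): standardize an array to zero mean and unit variance.

// Generic "feature scaling" sketch: shift to zero mean, scale to unit variance.
func standardize(_ values: [Float]) -> [Float] {
  let mean = values.reduce(0, +) / Float(values.count)
  let variance = values.reduce(0) { $0 + ($1 - mean) * ($1 - mean) } / Float(values.count)
  let std = variance.squareRoot()
  // Small constant avoids dividing by zero when all values are equal.
  return values.map { ($0 - mean) / (std + 1e-8) }
}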

To give you an idea of what batch normalization does, take a look at the following two histograms, which show the output of the first convolutional layer with and without batch normalization applied.

Batch normalization is important when training a deep network, but it turns out we can do without it at inference time. That works out well, because skipping batch normalization makes our app faster. And in any case, Metal does not have an MPSCNNBatchNormalization layer.

Batch normalization usually happens after the convolutional layer and before the activation function (a “leaky” ReLU in YOLO’s case) is applied. Since both convolution and batch normalization perform a linear transformation of the data, we can combine the batch normalization layer’s parameters with the convolution weights. This is called “folding” the batch normalization layer into the convolution layer.

Long story short: with a bit of math we can remove the batch normalization layers, but it does mean we have to change the weights of the convolutional layer that comes before.

A quick recap of what a convolutional layer computes: if x are the pixels in the input image and w are the weights of the layer, then the convolution basically computes each output pixel as follows:

out[j] = x[i]*w[0] + x[i+1]*w[1] + x[i+2]*w[2] + ... + x[i+k]*w[k] + b

This is a dot product of the input pixels with the convolution weights, plus a bias term b.

And here is the batch normalization that gets applied to that convolution output:

        gamma * (out[j] - mean)
bn[j] = ----------------------- + beta
            sqrt(variance)

It subtracts the mean from the output pixel, divides by the standard deviation (the square root of the variance), multiplies by a scaling factor gamma, and then adds an offset beta. These four parameters (mean, variance, gamma, and beta) are what the batch normalization layer learns as the network is trained.

To remove the batch normalization, we can shuffle these two equations around to compute new weights and a new bias for the convolutional layer:

           gamma * w
w_new = --------------
        sqrt(variance)

        gamma * (b - mean)
b_new = ------------------ + beta
           sqrt(variance)

Doing a convolution with these new weights and bias on input x gives the same result as the original convolution plus batch normalization.

Now we can remove the batch normalization layer and just use the convolutional layer, but with the adjusted weights and bias terms w_new and b_new. We repeat this for every convolutional layer in the network.

Note: in fact, the convolutional layers in YOLO don’t use a bias, so b is always 0 in the equations above. But notice that after folding in the batch normalization parameters, the convolutional layer does get a bias term.
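
To make the folding a bit more concrete, here is a minimal sketch of that computation in Swift, applied to one output channel at a time. It is purely illustrative (the real conversion happens offline in the yolo2metal.py script), and it adds a small epsilon under the square root for numerical stability, which the formulas above leave out.

// Illustrative only: fold one output channel's batch-norm parameters into the
// convolution weights and bias that precede it.
func foldBatchNorm(weights: [Float], bias: Float,
                   gamma: Float, beta: Float,
                   mean: Float, variance: Float,
                   epsilon: Float = 1e-3) -> (weights: [Float], bias: Float) {
  let scale = gamma / (variance + epsilon).squareRoot()
  let newWeights = weights.map { $0 * scale }   // w_new = gamma * w / sqrt(variance)
  let newBias = (bias - mean) * scale + beta    // b_new = gamma * (b - mean) / sqrt(variance) + beta
  return (newWeights, newBias)
}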

Once all the batch normalization layers are folded into their preceding convolutional layers, we can convert the weights to Metal. This is a simple rearranging of the arrays (Keras stores them in a different order than Metal) followed by writing them out to a binary file of 32-bit floating point numbers.

If you’re curious, take a look at the conversion script yolo2metal.py for the details. To verify that the folding works, the script creates a new model with the adjusted weights and no batch normalization layers, and then compares its predictions with those of the original model.

The iOS app

Needless to say, I used Forge to build the iOS app. 😂 You can find the code in the YOLO folder. To try it out: download or clone Forge, open Forge.xcworkspace in Xcode 8.3 or later, and run the YOLO target on an iPhone 6 or newer.

The easiest way to test the app is to point your iPhone at these YouTube videos:

The app itself is quite simple.

The interesting code is in YOLO.swift. First, it initializes the convolutional network:

let leaky = MPSCNNNeuronReLU(device: device, a: 0.1)

let input = Input()

let output = input
         --> Resize(width: 416, height: 416)
         --> Convolution(kernel: (3, 3), channels: 16, padding: true, activation: leaky, name: "conv1")
         --> MaxPooling(kernel: (2, 2), stride: (2, 2))
         --> Convolution(kernel: (3, 3), channels: 32, padding: true, activation: leaky, name: "conv2")
         --> MaxPooling(kernel: (2, 2), stride: (2, 2))
         --> ...and so on...

The input from the camera is scaled to 416×416 pixels and then fed into the convolutional and max-pooling layers. So far this works just like any other convnet.

What’s interesting is what happens with the output. Recall that the output of the convnet is a 13×13×125 tensor: every cell in the grid laid over the image has 125 channels of data. These 125 numbers contain the bounding box and class predictions, so we need to sort through this output somehow. That all happens in the function fetchResult().

Note: The code in fetchResult() runs on the CPU, not the GPU, because that was easier to implement. That said, the nested loop might well run more efficiently in parallel on the GPU. I may revisit this in the future and write a GPU version.

Here’s how fetchResult() works:

public func fetchResult(inflightIndex: Int) -> NeuralNetworkResult<Prediction> {
  let featuresImage = model.outputImage(inflightIndex: inflightIndex)
  let features = featuresImage.toFloatArray()

The output of the convolutional layers comes in the form of an MPSImage. We first convert it to an array of Float values, called features, so it’s easier to work with.

The body of fetchResult() is a big nested loop. It goes over all the grid cells and over the five predictions for each cell:

for cy in 0..<13 {
  for cx in 0..<13 {
    for b in 0..<5 {
      . . .
    }
  }
}

Inside this loop, we compute the bounding box b for the grid cell (cy, cx).

First we read the x, y, width, and height of the bounding box from the features array, as well as the confidence score:

let channel = b*(numClasses + 5)
let tx = features[offset(channel, cx, cy)]
let ty = features[offset(channel + 1, cx, cy)]
let tw = features[offset(channel + 2, cx, cy)]
let th = features[offset(channel + 3, cx, cy)]
let tc = features[offset(channel + 4, cx, cy)]

The helper function offset() is used to find the right place in the array to read from. Metal stores its data in textures with groups of 4 channels at a time, which means the 125 channels are not stored consecutively but are scattered around. (Take a look at the source code for a more in-depth analysis.)
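
To give an idea of what such an indexing function might look like, here is a hypothetical offset(), assuming toFloatArray() returns the data slice by slice, with each slice holding 4 channels of the 13×13 grid. The real implementation in YOLO.swift may differ in the details.

// Hypothetical layout: one slice of 4 channels after another, each slice storing
// gridSize × gridSize × 4 values. Check the real offset() in YOLO.swift for the
// layout Forge actually uses.
let gridSize = 13

func offset(_ channel: Int, _ x: Int, _ y: Int) -> Int {
  let slice = channel / 4                  // which group of 4 channels
  let channelInSlice = channel % 4         // position within that group
  let sliceSize = gridSize * gridSize * 4  // values stored per slice
  return slice * sliceSize + y * gridSize * 4 + x * 4 + channelInSlice
}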

We still need to process the five values tx, ty, tw, th, and tc further, because their format is a little strange. If you’re not sure where these formulas come from, take a look at the paper (the format is a side effect of how the network was trained).

Translator’s note: This paper was written by YOLO’s author and gives a more detailed description of the training process.

let x = (Float(cx) + Math.sigmoid(tx)) * 32
let y = (Float(cy) + Math.sigmoid(ty)) * 32

let w = exp(tw) * anchors[2*b    ] * 32
let h = exp(th) * anchors[2*b + 1] * 32

let confidence = Math.sigmoid(tc)

x and y now represent the center of the bounding box in the 416×416 image that we fed into the neural network; w and h are the width and height of the bounding box in that same image space. (The factor 32 is the size of one grid cell in pixels: 416 / 13 = 32.) The confidence score of the bounding box is tc; we put it through a sigmoid to turn it into a percentage.

Now we have our bounding box, and we know how confident YOLO is that it contains an object. Next, let’s look at the class predictions to find out what kind of object YOLO thinks it is:

var classes = [Float](repeating: 0, count: numClasses)
for c in 0..<numClasses {
  classes[c] = features[offset(channel + 5 + c, cx, cy)]
}
classes = Math.softmax(classes)

let (detectedClass, bestClassScore) = classes.argmax()

Recall that the features array contains 20 channels with the class predictions for the bounding box. We read these into a new array, classes. Since this part works like a classifier, we apply a softmax to turn the array into a probability distribution, and then pick the class with the highest score as the winner.
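
If you’re wondering what the softmax step actually does, here is a minimal version. Forge provides its own Math.softmax, so this sketch is only meant to illustrate the idea.

import Foundation

// Minimal softmax: exponentiate the scores (shifted by the maximum for numerical
// stability) and normalize them so they sum to 1, giving a probability distribution.
func softmax(_ scores: [Float]) -> [Float] {
  let maxScore = scores.max() ?? 0
  let exps = scores.map { exp($0 - maxScore) }
  let sum = exps.reduce(0, +)
  return exps.map { $0 / sum }
}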

Now we can compute the final score for the bounding box, for example “this bounding box is 85% sure it contains a dog”. Since there are 845 bounding boxes in total, we only want to keep the ones whose score exceeds a certain threshold:

let confidenceInClass = bestClassScore * confidence
if confidenceInClass > 0.3 {
  let rect = CGRect(x: CGFloat(x - w/2), y: CGFloat(y - h/2),
                    width: CGFloat(w), height: CGFloat(h))

  let prediction = Prediction(classIndex: detectedClass,
                              score: confidenceInClass,
                              rect: rect)
  predictions.append(prediction)
}

The code above loops through each cell in the grid. When the loop ends, we usually have an array of 10 to 20 predictions.

We have already filtered out the bounding boxes with low scores, but some of the remaining boxes may still overlap others quite a lot. So the last thing we do in fetchResult() is something called non-maximum suppression, to prune those duplicate boxes.

  var result = NeuralNetworkResult<Prediction>()
  result.predictions = nonMaxSuppression(boxes: predictions,
                                         limit: 10, threshold: 0.5)
  return result
}

The nonMaxSuppression() function uses a simple algorithm:

  1. Start with the bounding box that has the highest score.
  2. Remove all the remaining bounding boxes that overlap it by more than a certain threshold (for example, more than 50%).
  3. Go back to step 1 and repeat until there are no more bounding boxes left.

This removes boxes that score high but overlap too much with other boxes; only the best ones are kept. (A sketch of such a function follows below.)
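
Here is a sketch of what such a greedy non-maximum suppression could look like. The Prediction type is simplified to a hypothetical Box with just a score and a rectangle, and the real nonMaxSuppression() in Forge may be implemented differently.

import CoreGraphics

// Simplified stand-in for the real Prediction type: just a score and a rectangle.
struct Box {
  let score: Float
  let rect: CGRect
}

// Intersection-over-union of two rectangles: shared area divided by combined area.
func iou(_ a: CGRect, _ b: CGRect) -> Float {
  let intersection = a.intersection(b)
  if intersection.isNull { return 0 }
  let intersectionArea = intersection.width * intersection.height
  let unionArea = a.width * a.height + b.width * b.height - intersectionArea
  return unionArea > 0 ? Float(intersectionArea / unionArea) : 0
}

// Greedy non-maximum suppression: repeatedly keep the best-scoring box and drop
// everything that overlaps it by more than `threshold`.
func nonMaxSuppression(boxes: [Box], limit: Int, threshold: Float) -> [Box] {
  var remaining = boxes.sorted { $0.score > $1.score }
  var selected: [Box] = []
  while let best = remaining.first, selected.count < limit {
    selected.append(best)
    remaining.removeFirst()
    remaining.removeAll { iou(best.rect, $0.rect) > threshold }
  }
  return selected
}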

And that’s pretty much all there is to it: a regular convolutional network plus a bunch of post-processing of the results.

How well does it work?

The YOLO website claims that the mini version can achieve 200 frames per second. But of course this is on a desktop-class GPU, not a mobile device. So how fast can it go on the iPhone?

It takes about 0.15 seconds to process one image on my iPhone 6s. That is only about 6 FPS, barely fast enough to call it real-time. If you point the phone at a passing car, you can see the bounding box trailing a little behind the car. Still, I’m impressed that this technique works at all. 😁

Note: As I explained above, the bounding boxes are processed on the CPU, not the GPU. Would it be faster to run everything on the GPU? Maybe, but the CPU code only takes 0.03 seconds, 20% of the total time. It’s possible to move some of that work to the GPU, but I’m not sure it’s worth it, since the convolutional layers still take up 80% of the time.

I think the main reason it’s slow is the convolutional layers with 512 and 1024 output channels. From my experiments, it seems that MPSCNNConvolution has more trouble with small images that have many channels than with larger images that have fewer channels.

One thing I’d like to try is a different network architecture, such as SqueezeNet, and retrain that network to do the bounding box prediction in its last layer. In other words, take the YOLO idea and implement it on top of a smaller, faster convnet. Would the loss in accuracy be worth the gain in speed?

Note: The recently released Caffe2 framework also runs on iOS with Metal support. The caffe2-ios project includes a version of Mini YOLO. It appears to run slightly slower than the pure Metal version, at about 0.17 seconds per frame.

Credits

To learn more about YOLO, check out the following papers written by its authors:

  • You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2015)
  • YOLO9000: Better, Faster, Stronger by Joseph Redmon and Ali Farhadi (2016)

My implementation is based in part on TensorFlow’s Android Demo TF Detect, Allan Zelener’s YAD2K, and Darknet’s source code.


The Nuggets Translation Project is a community that translates high-quality technical articles from around the Internet and shares them on Juejin (Nuggets), covering Android, iOS, React, front end, back end, product, design, and more. Keep an eye on the Nuggets Translation Project for more quality translations.