Geoffrey Hinton, often called the godfather of deep learning, has spoken on numerous occasions about the Capsule idea he has been working on, and now there is a paper that gives a feel for its properties. AI Technology Review introduces the main results of the paper below.

Background

In current neural networks, every neuron in a given layer does the same thing; for example, each neuron in a convolutional layer performs the same convolution operation. Hinton believes it is entirely possible for different neurons to focus on different entities or attributes, for instance having different neurons attend to different categories from the start (rather than producing a normalized classification only at the end). More specifically, some neurons could focus on location, some on size, and some on orientation. This resembles the way the human brain has dedicated areas for language and vision, rather than scattering those functions across the whole brain.

To keep the network from becoming cluttered, Hinton proposed packing the neurons that focus on the same category or attribute together, like capsules. When the neural network runs, the pathways between capsules form a sparsely activated tree structure (only some of the pathways in the tree are active), and this is the core of his Capsule theory. Capsules are also much more interpretable. It is worth noting that Jeff Dean, who is also at Google Brain (though not in the same office), likewise sees sparsely activated neural networks as an important direction for the future; it will be interesting to see whether he comes up with a different approach.

While the Capsule structure matches the intuition of perceiving multiple attributes at once, it also raises the obvious question of how the different capsules should be trained and how the network should determine the activation relationships between them. Hinton's paper focuses on learning the connection weights (routing) between capsules.

Resolving the Routing Problem

First, the neurons in each layer are grouped into capsules. Each capsule has an "activity vector", which is its representation of the category or attribute it attends to. Each node in the tree corresponds to an active capsule. Through an iterative routing process, each active capsule selects a capsule in the layer above as its parent node. For the higher levels of a visual system, this iterative process has the potential to solve the problem of how the parts of an object are assembled into a whole.

When representing an entity in the network, one attribute is special among the many: the probability of the entity's presence (the network's confidence that it has detected an object of a given type). Conventionally this is represented by a single logistic unit whose output lies between 0 and 1, where 0 means absent and 1 means present. In this paper, Hinton wants the activity vector to represent both the presence of an entity and its attributes. The values of the vector's individual dimensions represent the different properties, while the magnitude of the whole vector represents the probability that the entity is present. To ensure that this length, i.e. the probability, never exceeds one, the vector is normalized by a nonlinear function, so that the entity's properties are in effect represented by the direction of the vector in a high-dimensional space.
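Concretely, the paper's "squash" nonlinearity scales a capsule's total input so that short vectors shrink toward zero length and long vectors approach length 1, while the direction is preserved. A minimal NumPy sketch (the function body follows the formula in the paper; the helper name is ours):

    import numpy as np

    def squash(s, axis=-1, eps=1e-8):
        # v = (|s|^2 / (1 + |s|^2)) * (s / |s|): length in [0, 1), direction kept
        sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
        return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)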

One great benefit of such activity vectors is that they help a lower-level capsule choose which higher-level capsule to connect to. Initially, a lower-level capsule provides input to all the higher-level capsules: it multiplies its own output by a weight matrix to obtain a prediction vector for each of them. If the scalar product between a prediction vector and the output vector of some higher-level capsule is large, top-down feedback increases the coupling coefficient between those two capsules and decreases the coupling coefficients between the lower-level capsule and the other higher-level capsules. After a few iterations, the connections between the lower-level capsules that contribute the most and the higher-level capsules that receive their contributions grow stronger and stronger.
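A compact sketch of this routing loop in NumPy, reusing squash from above (the array shapes and variable names are illustrative, not taken from the paper's code; the experiments in the paper use three iterations):

    def dynamic_routing(u_hat, n_iters=3):
        # u_hat: prediction vectors, shape (n_lower, n_upper, dim)
        n_lower, n_upper, dim = u_hat.shape
        b = np.zeros((n_lower, n_upper))            # routing logits, initially uniform
        for _ in range(n_iters):
            e = np.exp(b - b.max(axis=1, keepdims=True))
            c = e / e.sum(axis=1, keepdims=True)    # coupling coefficients (softmax over upper capsules)
            s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum of predictions per upper capsule
            v = squash(s)                           # (n_upper, dim) output vectors
            b = b + (u_hat * v[None]).sum(axis=-1)  # agreement: prediction . output
        return v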

In the authors' view, this routing-by-agreement approach is far more effective than earlier mechanisms such as max pooling, which keeps only the single most active feature.

Network Building

The authors built a simple CapsNet. Every layer except the last is convolutional, but these are now "capsule" layers: vector outputs replace a CNN's scalar feature outputs, and max pooling is replaced by routing by agreement. As in a CNN, higher layers look at larger regions of the image, but location information is preserved throughout because nothing is max-pooled away. In the lower layers, spatial position is encoded simply by which capsules are active.

At the bottom of the network, the multi-dimensional capsules exhibit different characteristics, acting somewhat like the separate elements in traditional computer graphics rendering, with each capsule focusing on its own kind of feature. This is quite different from the usual computation in current computer vision, which combines elements at different spatial locations in the image to form an overall understanding (or in which every region of the image first activates the whole network and the results are then combined). The PrimaryCaps and DigitCaps layers follow the bottom convolutional layer.
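For reference, the layer sizes reported in the paper (a 9x9 convolution with 256 channels, then a 9x9 stride-2 convolution forming 32 channels of 8-D primary capsules, then 10 digit capsules of 16 dimensions) can be sanity-checked with a small shape calculation:

    def conv_out(size, kernel, stride):
        # output spatial size of a valid (unpadded) convolution
        return (size - kernel) // stride + 1

    h = conv_out(28, 9, 1)    # Conv1: 28x28 -> 20x20, 256 channels
    h = conv_out(h, 9, 2)     # PrimaryCaps conv: 20x20 -> 6x6
    print(h * h * 32)         # 1152 primary capsules (8-D each), routed to 10 DigitCaps of 16-D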

Experimental Results

Because capsules have these new properties, the experiments in the paper do more than run benchmarks; they also include considerable analysis of what the capsules bring.

Digit Recognition

First, on the MNIST dataset, CapsNet achieves an excellent error rate with three routing iterations and without many layers.

The authors also reconstructed the images "the network thinks it recognizes" from the representations inside CapsNet, showing that for correctly recognized samples (left of the vertical line in the figure), CapsNet correctly captures the details of the image while suppressing noise.
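These reconstructions come from a decoder that sees only the activity vector of one digit capsule, with all the others masked out; in the paper the decoder itself is a stack of fully connected layers. A sketch of the masking step (the helper name and shapes are ours):

    def mask_for_reconstruction(digit_caps, target=None):
        # digit_caps: (n_classes, dim) output vectors of the top capsule layer.
        # Keep the capsule of the target class (or the longest vector at test
        # time), zero the rest, and flatten as input to the decoder.
        lengths = np.linalg.norm(digit_caps, axis=-1)
        idx = lengths.argmax() if target is None else target
        masked = np.zeros_like(digit_caps)
        masked[idx] = digit_caps[idx]
        return masked.reshape(-1)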

Robustness

The DigitCaps part of the network is robust to small transformations because it can separately learn rotations, stroke thickness, and style variations in writing. After training CapsNet on an MNIST dataset in which the digits were randomly shifted on a black background, the authors used it to classify the affNIST dataset, whose samples are MNIST digits with small affine changes, as shown in the figure below. Used directly, this CapsNet classified affNIST correctly 79% of the time; a CNN with a similar number of parameters, trained in the same way, reached only 66%.

Segmenting Highly Overlapping Digits

The authors built the MultiMNIST dataset by overlaying digits from MNIST, so that the bounding boxes of the two digits overlap by 80% on average. CapsNet's recognition results are, unsurprisingly, better than the CNN baseline, but the authors' subsequent graphical analysis is what really shows the beauty of capsules.

As shown in the figure, the authors take the digits corresponding to the two most activated capsules as the recognition result and reconstruct the recognized image elements accordingly. For correctly identified samples (L is the pair of true labels, R is the pair of labels corresponding to the two most activated capsules), one can see that because the capsules work independently, a feature used in one digit's recognition does not interfere with the other's; features in the overlapping region can be reused by both digits (i.e. the strokes may overlap).
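Reading off the two recognition results amounts to picking the two longest activity vectors and reconstructing each one independently with the masking idea above (a hypothetical helper, not the paper's code):

    def top2_digits(digit_caps):
        # classes of the two most activated capsules, e.g. for MultiMNIST
        lengths = np.linalg.norm(digit_caps, axis=-1)
        return np.argsort(lengths)[-2:][::-1]

    # each digit is then reconstructed separately:
    # for d in top2_digits(digit_caps):
    #     decoder_input = mask_for_reconstruction(digit_caps, target=d)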

On the other hand, each capsule still needs enough supporting evidence from its surroundings, rather than blindly assuming that features in the overlapping region should be reused. The left figure below shows results when one highly activated capsule and one weakly activated capsule are selected (*R means one of the digits is neither the true label nor the recognition result; L is still the true label). In the (5, 0) example, the capsule for "7" could not find enough evidence of a "7", so its activation was weak. In the (1, 8) example, there were no supporting features for "0", so the capsule for "0" did not reuse the overlapping region.

Discussion of the Capsule Approach

At the end of the paper, the authors discuss the capsules' performance. They believe that, compared with CNNs, capsules' ability to handle different attributes separately improves robustness to image transformations and helps with image segmentation. The assumption underlying capsules, that at most one entity of a given category appears at any one position in the image, also lets a capsule record every aspect of a category instance's attributes in a separate representation such as the activity vector, and lets the network make better use of spatial information through matrix-multiplication modeling. But capsule research is just getting started: the authors feel that capsules are to image recognition what RNNs were to speech recognition in the early 2000s, just the beginning of something great.

Full paper: https://arxiv.org/pdf/1710.09829.pdf

Collated and compiled by AI Technology Review.
