Previous articles in this series:
Summary of CNN Visualization Technology (I): Visualization of feature maps
Summary of CNN Visualization Technology (II): Visualization of convolution kernels
Welcome to the public account CV Technical Guide, which focuses on technical summaries of computer vision, tracking of the latest techniques, and interpretation of classic papers.
Introduction:
Previously, we introduced two visualization methods, feature map visualization and convolution kernel visualization, both of which are common in papers. These two methods are mostly used to analyze what the model learns at a given layer. Once you understand them, it is easy to understand how images are recognized and classified by a neural network.
However, I recently saw a question on Zhihu about fall detection with YOLOv3, where the asker hoped to improve accuracy through multi-task learning by adding face recognition. Clearly, the asker did not understand how a neural network can analyze video with a temporal dimension to recognize behavior, and, more fundamentally, did not understand how a neural network actually recognizes a class. When this understanding is wrong, the model selection, scheme design, and improvements will all be unreasonable.
(I answered this question on Zhihu: what is the correct way to detect falls? If you are interested, you can check it out.)
So in this article, we introduce a way to see what information the model relies on to recognize different classes: class visualization, or, more generally, a heat map. The main methods here belong to the CAM family: currently CAM, Grad-CAM, and Grad-CAM++.
CAM (Class Activation Map)
As shown in the figure above, the structure of CAM consists of a CNN feature extraction network, global average pooling (GAP), a fully connected layer, and Softmax.
Implementation principle: an image passes through the CNN feature extraction network to obtain feature maps; global average pooling is then applied to each feature map to produce a one-dimensional vector, and the class probabilities are obtained through the fully connected layer and Softmax.
Suppose there are n channels before the GAP; then the GAP produces a vector of length 1 × n. Suppose there are m categories; then the weight of the fully connected layer is an n × m tensor. (Note: batch size is ignored here.)
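To make these shapes concrete, here is a minimal PyTorch sketch; the sizes n = 512 and m = 1000 are hypothetical, and note that PyTorch stores the FC weight as an m × n matrix:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: n = 512 channels before GAP, m = 1000 categories.
feature_maps = torch.randn(512, 7, 7)    # output of the last conv layer (batch ignored)
pooled = feature_maps.mean(dim=(1, 2))   # GAP -> vector of length n = 512
fc = nn.Linear(512, 1000)                # FC weight is stored as (m, n) = (1000, 512)
scores = fc(pooled)                      # class scores, length m = 1000
probs = scores.softmax(dim=0)            # Softmax -> class probabilities
```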
For a given category C, we now want to visualize which regions of the original image the model considers important for identifying category C; in other words, what information the model relies on to decide that the image belongs to category C.
The method is to take the weights in the fully connected layer used to compute the probability of category C, denoted w, which is the bottom half of the figure above. Then a weighted sum of the feature maps before GAP is computed with these weights. Since the feature maps are smaller than the original image at this point, the weighted sum is upsampled to obtain the class activation map.
The formula is as follows (k indexes channels, c indexes categories, $f_k(x, y)$ is the activation of feature map k at location $(x, y)$, and $w_k^c$ is the FC weight connecting channel k to category c):

$$M_c(x, y) = \sum_k w_k^c \, f_k(x, y)$$
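Putting the pieces together, below is a minimal sketch of the CAM computation, assuming the feature maps before GAP and the FC weight matrix have already been extracted from a trained model; the function name compute_cam and all sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def compute_cam(feature_maps, fc_weight, class_idx, image_size):
    """CAM: weighted sum of feature maps using the FC weights of one class.

    feature_maps: (n, h, w) activations before GAP
    fc_weight:    (m, n) fully connected layer weights
    class_idx:    the category C to visualize
    image_size:   (H, W) of the original image, for upsampling
    """
    weights = fc_weight[class_idx]                        # w_k^c, shape (n,)
    cam = (weights[:, None, None] * feature_maps).sum(0)  # sum_k w_k^c f_k(x, y)
    # The feature maps are smaller than the input, so upsample to image size.
    cam = F.interpolate(cam[None, None], size=image_size,
                        mode="bilinear", align_corners=False)[0, 0]
    # Normalize to [0, 1] so the map can be rendered as a heat map.
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam
```

For a model that already has the GAP + FC structure (e.g., a torchvision ResNet), feature_maps would come from the last convolutional block and fc_weight from model.fc.weight.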
Effect:
Analysis of CAM
CAM has a fatal defect: its structure is fixed as CNN + GAP + FC + Softmax. That is, if you want to visualize an existing model but it has no GAP operation (as most existing models do not), you must modify the original model structure and retrain it, which is quite cumbersome. Moreover, if the model is large, retraining after modification may not reach the original performance, and the visualization becomes meaningless.
Therefore, an improved version, Grad-CAM, was developed to address this defect.
Grad-CAM
The biggest feature of Grad-CAM is that it no longer requires modifying the existing model structure or retraining; it can be applied directly to the original model.
Principle: Grad-CAM also works on the last layer of feature maps of the CNN feature extraction network. For the category C to be visualized, Grad-CAM back-propagates the output score of category C to the last layer of feature maps, obtaining the gradient of category C with respect to each pixel of those feature maps. Global average pooling is then applied to these gradients, which yields the weighting coefficient alpha for each feature map. The paper notes that the coefficients obtained this way are almost equivalent to the coefficients in CAM. Next, the weighted sum of the feature maps is computed, rectified with ReLU, and then upsampled.
The reason for using ReLU is that negative values can be considered unrelated to the identification of class C (they may instead be related to other classes), whereas positive values have a positive influence on identifying C.
The formulas are as follows ($y^c$ is the score for category C, $A^k$ is the k-th feature map, and Z is the number of pixels in a feature map):

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$$

$$L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$$
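As a concrete illustration, here is a minimal PyTorch sketch of this procedure using hooks, so the model itself is left untouched; the function name grad_cam and the choice of target layer are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx, image_size):
    """Grad-CAM on an unmodified model via forward/backward hooks."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(
        lambda m, inp, out: acts.append(out))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gin, gout: grads.append(gout[0]))

    score = model(image)[0, class_idx]    # score y^c for category C
    model.zero_grad()
    score.backward()                      # gradients w.r.t. the feature maps
    h1.remove(); h2.remove()

    A = acts[0][0]                        # feature maps A^k, shape (n, h, w)
    alpha = grads[0][0].mean(dim=(1, 2))  # GAP over gradients -> alpha_k^c
    cam = F.relu((alpha[:, None, None] * A).sum(0))  # ReLU(sum_k alpha_k^c A^k)
    cam = F.interpolate(cam[None, None], size=image_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return cam / (cam.max() + 1e-8)       # normalize for display
```

For example, with a torchvision ResNet one might call grad_cam(model, img, model.layer4[-1], class_idx=243, image_size=(224, 224)), where img is a normalized (1, 3, 224, 224) input tensor.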
The structural diagram of Grad-CAM is shown in the figure above. Readers unfamiliar with Guided Backpropagation can refer to the first article of this CNN visualization technology summary.
The results are as follows:
Grad-CAM was later followed by an improved version, Grad-CAM++, whose main improvements are more accurate localization and better suitability for images containing multiple targets of the same class, e.g., seven or eight people in one image.
The improvement lies in a new way of obtaining the weighting coefficients, whose derivation is rather complicated, so it will not be introduced here. Interested readers can read the paper through the link at the end of this article.
The next article will summarize visualization tools and projects. The content will appear in the CV technical summary section.
CAM: https://arxiv.org/pdf/1512.04150.pdf
Grad-CAM: https://arxiv.org/pdf/1610.02391v1.pdf
Grad-CAM++: https://arxiv.org/pdf/1710.11063.pdf
Reference paper:
1. Learning Deep Features for Discriminative Localization
2. Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization
3. Grad-CAM++: Generalized Gradient-based Visual Explanations for Deep Convolutional Networks
This article is from the technical summary series of the public account CV Technical Guide.
To get a PDF of the technical summary articles below, reply to the public account with the keyword "Technical summary".
Other articles
Shi Baixin, Peking University: From the reviewer’s perspective, how to write a CVPR paper
Siamese network summary
Summary of computer vision terms (I): building a knowledge system of computer vision
Summary of underfitting and overfitting techniques
Summary of normalization methods
Summary of common ideas of paper innovation
Summary of methods of reading English literature efficiently in CV direction
A review of small sample learning in computer vision
A brief overview of knowledge distillation
Optimize OpenCV video read speed
NMS summary
Technical summary of loss function
Technical summary of attention mechanisms
Summary of feature pyramid technology
Summary of pooling technology
Summary of data augmentation methods
Summary of CNN Structure Evolution (I): Classic models
Summary of CNN Structure Evolution (II): Lightweight models
Summary of CNN Structure Evolution (III): Design principles
How to view the future of computer vision
Summary of CNN Visualization Technology (I): Visualization of feature maps
Summary of CNN Visualization Technology (II): Visualization of convolution kernels
Summary of CNN Visualization Technology (III): Class visualization
Summary of CNN Visualization Technology (IV): Visualization tools and projects