Paper title: Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Georgia Institute of Technology, Facebook AI Research
Year: 2017
Official account: CVpython (published simultaneously)
Some time ago, I used Grad-CAM to visualize the output of a neural network. At the time I was doing a multi-label classification task, but the visualization results looked strange, and something always felt off. Writing up this paper has given me a deeper understanding of Grad-CAM, and I finally know where the visualization problem was back then. Oh yeah!
1. What problem does the paper solve?
Although CNN models have made great breakthroughs in the CV field, a CNN is still like a “black box”: it is hard to understand what is going on inside, and its interpretability is poor. If the model does not work, it is difficult to explain why. The authors therefore propose Grad-CAM to provide visual explanations for the decisions made by a CNN.
2. How does the model proposed in this paper solve the problem?
Many studies have shown that deeper layers of a CNN capture higher-level visual structure, while the spatial information in the convolutional features is lost at the fully connected layers. Therefore, the last convolutional layer offers the best compromise between high-level semantics and detailed spatial information (why is this a compromise?). Grad-CAM uses the gradient information that “flows” into the last convolutional layer of the CNN to understand how important each neuron is to the decision.
The overall structure of Grad-CAM is shown in the figure below:
The input image is passed forward to obtain the feature map $A$. For a category $c$, there is a class score $y^c$ before the softmax. Now let $A^k_{(x,y)}$ denote the value of the $k$-th channel of the feature map $A$ at position $(x,y)$, and compute:

$$\alpha^c_k = \frac{1}{Z}\sum_{x}\sum_{y}\frac{\partial y^c}{\partial A^k_{(x,y)}} \tag{1}$$

where $Z$ is the number of spatial positions in the feature map.
So let’s figure out what the derivative of $y^c$ with respect to $A^k_{(x,y)}$ gives us.
Take a simple example: $y = w_1 x_1 + w_2 x_2$, where $x_1, x_2$ are independent variables and $w_1, w_2$ are their respective coefficients. The partial derivative of $y$ with respect to $x_1$ is $w_1$. If $x_1$ is more important to $y$, its coefficient $w_1$ is naturally larger, so the partial derivative of $y$ with respect to $x_1$ is also larger. Doesn’t that mean the derivative can reflect how important a variable is to the function? The answer is obviously yes (if you think this is not rigorous, please point it out).
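To see this in code, here is a tiny PyTorch sketch (the numbers are arbitrary) checking that the gradient recovers each coefficient:

```python
import torch

# Tiny sanity check of the argument above: for y = w1*x1 + w2*x2 the partial
# derivative of y with respect to each input is exactly that input's
# coefficient, so a larger gradient means a larger influence on y.
x1 = torch.tensor(3.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)
w1, w2 = 10.0, 0.1    # x1 matters much more to y than x2 does

y = w1 * x1 + w2 * x2
y.backward()

print(x1.grad, x2.grad)   # tensor(10.) tensor(0.1000) -> the gradients recover w1, w2
```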
So what does the derivative of $y^c$ with respect to $A^k_{(x,y)}$ give us? It gives the importance of the feature value $A^k_{(x,y)}$ to $y^c$, and global average pooling these derivatives gives the importance of the $k$-th channel of the feature map to $y^c$.
Formula (1) above computes a coefficient for each channel of the feature map; these coefficients are then used to linearly combine the channels, as shown in Formula (2):

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha^c_k A^k\Big) \tag{2}$$
The reason for adding the $\mathrm{ReLU}$ is that the authors are only interested in features that have a positive influence on the class score, so features with a negative influence are filtered out.
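As a concrete reading of Formulas (1) and (2), here is a minimal PyTorch sketch; the VGG16 backbone, the choice of target layer, and the variable names are my own assumptions for illustration, not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal Grad-CAM sketch following Formulas (1) and (2).
model = models.vgg16(pretrained=True).eval()
img = torch.randn(1, 3, 224, 224)      # stand-in for a preprocessed input image

# Forward pass, keeping the activation A of the last convolutional layer.
x, A = img, None
for i, layer in enumerate(model.features):
    x = layer(x)
    if i == 29:                        # ReLU right after the last conv layer of VGG16
        A = x
        A.retain_grad()                # keep dy^c / dA during backprop
x = torch.flatten(model.avgpool(x), 1)
scores = model.classifier(x)           # pre-softmax class scores y^c

c = scores.argmax(dim=1).item()        # target category c
scores[0, c].backward()                # gradients "flow" back into A

dA = A.grad                                      # shape (1, K, H, W)
alpha = dA.mean(dim=(2, 3), keepdim=True)        # Formula (1): GAP of the gradients
cam = F.relu((alpha * A).sum(dim=1)).detach()    # Formula (2): weighted sum + ReLU
cam = F.interpolate(cam.unsqueeze(1), size=img.shape[-2:],
                    mode="bilinear", align_corners=False)[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize for display
```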
Although Grad-CAM’s visualizations are class-discriminative and can localize the relevant regions, they lack the ability to show fine-grained importance. For example, in Figure 1(c), although Grad-CAM can localize the cat region, it is hard to tell from the low-resolution heatmap why the network predicts “tiger cat”. In order to localize and show fine-grained detail at the same time, the authors combine Guided Backpropagation and Grad-CAM via element-wise multiplication to obtain Guided Grad-CAM, as shown in Figure 1(d, j).
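A sketch of that combination is below; `guided_backprop` is a placeholder for an input-space gradient map assumed to be computed separately with the guided-backpropagation ReLU rule, which is not shown here:

```python
import torch.nn.functional as F

# Guided Grad-CAM = element-wise product of the (upsampled) Grad-CAM heatmap
# and the Guided Backpropagation saliency map.
def guided_grad_cam(cam, guided_backprop):
    # cam: (h, w) low-resolution Grad-CAM heatmap
    # guided_backprop: (3, H, W) fine-grained saliency map at input resolution
    cam = F.interpolate(cam[None, None], size=guided_backprop.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    return guided_backprop * cam    # broadcasts the heatmap over the channels
```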
3. What was the result of the experiment?
Better localization ability, and classification performance is not weakened.
4. What is the guiding significance for us?
It feels like this is the most important point.
- The visualization results of Grad-CAM (both the localized region and the fine-grained details) give us a way to explain why a model fails. For example, when an image is misclassified, we can visualize it and check: is the problem with the region the model attends to, or with the fine-grained features it extracts?
- It can also reveal bias in the dataset. The paper gives an example of classifying doctors vs. nurses: the visualizations show that the model localizes on the face and hairstyle, misidentifies some female doctors as nurses and some male nurses as doctors, and thus has a gender bias, assuming that men are doctors and women are nurses. Looking at the dataset, it turns out that roughly 78 percent of the doctor images are of men and 93 percent of the nurse images are of women, so the dataset itself is biased.