Analysis of advantages and disadvantages of Softmax

Personal homepage: xiaosongshine.github.io

"Softmax" is a "soft" version of max. In CNN classification problems, the ground truth is in one-hot form. Taking a four-class problem as an example, the ideal output is (1, 0, 0, 0), or (100%, 0%, 0%, 0%), which is the ultimate target we want the CNN to learn.

The magnitudes of the network outputs vary widely, and the largest output corresponds to the classification result we want. Classification confidence is usually expressed as a percentage, and the simplest way to get one is to take each output's proportion of the total. Assuming the output features are $(x_1, x_2, x_3, x_4)$, the most direct and plain approach, as opposed to softmax, is what we will call hard max here:

$$f(x_i) = \frac{x_i}{\sum_{j} x_j}$$

What is now common is softmax, which nonlinearly amplifies each output $x$ to $e^{x}$ before normalizing:

$$f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

What is the difference between hard max and softmax? Let's look at some examples.

For the same output features, softmax gets much closer to the one-hot target than hard max does, as the sketch below illustrates. In other words, softmax reduces the training difficulty and makes multi-class problems easier to converge.
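To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the original post); the feature vector (5, 1, 1, 1) is the example discussed below:

```python
# Compare plain proportional normalization ("hard max") with softmax.
import numpy as np

def hard_max(x):
    return x / x.sum()                 # raw outputs as a proportion of the total

def soft_max(x):
    e = np.exp(x - x.max())            # subtract the max for numerical stability
    return e / e.sum()

x = np.array([5.0, 1.0, 1.0, 1.0])
print(hard_max(x))   # [0.625 0.125 0.125 0.125]  -- still far from one-hot
print(soft_max(x))   # [0.948 0.017 0.017 0.017]  -- much closer to (1, 0, 0, 0)
```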

What does this tell us? Softmax encourages the output of the true target class to be larger than those of the other classes, but it does not require it to be much larger. For the feature mapping used in face recognition, softmax encourages the features of different classes to separate, but it does not encourage them to separate by a large margin. For an output such as (5, 1, 1, 1) above, the loss is already very small, the CNN is close to convergence, and the loss barely decreases any further.
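Continuing the sketch above, the softmax cross-entropy loss for (5, 1, 1, 1) with the first class as the ground truth is already tiny (again an illustrative computation, not the author's code), so the gradient signal pushing the classes further apart essentially vanishes:

```python
# Cross-entropy loss of softmax for the output (5, 1, 1, 1) with target class 0.
import numpy as np

x = np.array([5.0, 1.0, 1.0, 1.0])
p = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
loss = -np.log(p[0])                   # one-hot target (1, 0, 0, 0)
print(loss)                            # ~0.053, already very small
```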

Training a CNN with softmax loss on MNIST and visualizing the 2-dimensional feature mapping of the 10 classes gives the following:
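For reference, here is a minimal PyTorch sketch of the kind of setup that produces such a plot (the architecture and hyperparameters are my own assumptions, not the original author's code): the penultimate layer has only 2 units, so each image maps to a 2-D feature that can be scattered directly, and training uses plain softmax cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc_feat = nn.Linear(64 * 7 * 7, 2)   # the 2-D feature mapping that gets plotted
        self.fc_cls = nn.Linear(2, 10)            # linear classifier feeding the softmax

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        feat = self.fc_feat(x.flatten(1))
        return feat, self.fc_cls(feat)

train_set = datasets.MNIST(".", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=128, shuffle=True)
model = SmallNet()
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for images, labels in loader:                     # one epoch is enough for a rough plot
    feat, logits = model(images)
    loss = F.cross_entropy(logits, labels)        # softmax loss
    opt.zero_grad()
    loss.backward()
    opt.step()
# After training, run the model over the data again, collect `feat` for every image,
# and scatter-plot the 2-D points colored by label to obtain a figure like the one
# described above.
```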

In this visualization the different classes are clearly separated, but this is not sufficient for the feature-vector comparison required in face recognition. In face recognition, feature-vector similarity is commonly measured with the L2 distance or the cosine distance. We discuss the two cases separately (a numeric sketch follows the list below).

  • L2 distance: the smaller the L2 distance, the higher the vector similarity. It can happen that the distance between feature vectors of the same class (yellow) is larger than the distance between feature vectors of different classes (green).

  • Cosine distance: the smaller the angle between two vectors, the larger the cosine similarity and the higher the vector similarity. It can happen that the angle between feature vectors of the same class (yellow) is larger than the angle between feature vectors of different classes (green).
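The following hypothetical 2-D points (invented for illustration, not read off the plot) show both failure cases numerically:

```python
import numpy as np

def l2(a, b):
    return np.linalg.norm(a - b)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# L2 case: two same-class features lying on the same ray but with very different
# norms are farther apart than a nearby feature from another class.
a_near, a_far = np.array([1.0, 0.2]), np.array([9.0, 1.8])   # same class (yellow)
b = np.array([0.9, 0.5])                                      # different class (green)
print(l2(a_near, a_far), l2(a_near, b))    # ~8.16 vs ~0.32: same-class distance is larger

# Cosine case: a class occupying a wide angular wedge can contain a pair whose
# angle is larger than the angle to a neighboring class.
deg = np.deg2rad
a1 = np.array([np.cos(deg(5)),  np.sin(deg(5))])              # same class, 5 degrees
a2 = np.array([np.cos(deg(40)), np.sin(deg(40))])             # same class, 40 degrees
c  = np.array([np.cos(deg(50)), np.sin(deg(50))])             # different class, 50 degrees
print(cos_sim(a1, a2), cos_sim(a2, c))     # ~0.82 vs ~0.98: same-class similarity is lower
```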

To sum up:

  1. The deep features trained with softmax divide the whole hyperspace or hypersphere according to the number of classes, which guarantees that the classes are separable. This is well suited to multi-class tasks such as MNIST and ImageNet, because there the test categories are guaranteed to lie within the training categories.
  2. But softmax does not require intra-class compactness or inter-class separation, so it is not suitable for the face recognition task: a training set on the order of 10,000 identities is tiny compared with a potential test set of the roughly 7 billion people in the world, it is impossible to collect training samples for everyone, and moreover we usually require that the training and test identities do not overlap.
  3. Therefore, softmax needs to be modified: besides guaranteeing separability, the feature vectors should be as compact as possible within a class and as separated as possible between classes.