“Depth Concept”: Learning and a Deep Understanding of the Loss Functions Used in Metric Learning
0. Concept introduction
Metric learning, also known as Distance Metric Learning (DML), is a branch of machine learning. At its core it is similarity learning, or equivalently distance learning, because under certain conditions similarity and distance can be converted into each other. For example, two vectors in a coordinate space can be compared either by cosine similarity or by Euclidean distance.
In general, metric learning involves the following components:
- Encoder model: encodes the raw data into feature vectors (the focus is on how to train this model)
- Similarity discrimination algorithm: compares the similarity of a pair of feature vectors (how to compute the similarity and how to set the threshold)
Metric learning algorithms based on deep learning can be divided into two schools:
- Network design school: represented by the Siamese network (twin network)
- Loss-function improvement school: represented by the XX-Softmax family
This article focuses on loss-function improvement, which has developed rapidly and is widely used in recent years.
In face recognition and voiceprint recognition, the algorithmic improvements of metric learning are mainly reflected in the design of the loss function, which guides the optimization of the whole network. Many commonly used loss functions, from the traditional Softmax loss to CosFace and ArcFace, embody this kind of improvement.
The loss functions of SphereFace, CosFace and ArcFace are all modifications of the Softmax loss.
| Baseline | Softmax loss |
| --- | --- |
| Extended algorithms | Triplet loss, Center loss |
| Latest algorithms | A-Softmax Loss (SphereFace), Cosine Margin Loss, Angular Margin Loss, ArcFace |
1. Softmax loss
This is the Softmax loss function:

$$L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_{j}^{T}x_i + b_j}}$$

where $W_j^{T}x_i + b_j$ is the output of the fully connected (classification) layer for class $j$. To make the loss decrease, we let the target class's share of the total, $\dfrac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_j e^{W_j^{T}x_i + b_j}}$, become larger, so that the number inside log() approaches 1; since log(1) = 0, the total loss decreases.
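As a concrete illustration of the formula above, here is a minimal NumPy sketch of the Softmax loss forward pass; the function and variable names are my own and not from the original post:

```python
import numpy as np

def softmax_loss(logits, labels):
    """Softmax (cross-entropy) loss.
    logits: (N, C) outputs of the fully connected classification layer, W^T x + b.
    labels: (N,)  ground-truth class indices."""
    # subtract the row-wise max for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    # the loss is small when the target-class probability is close to 1 (log(1) = 0)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# usage: 4 samples, 3 classes
logits = np.random.randn(4, 3)
labels = np.array([0, 2, 1, 0])
print(softmax_loss(logits, labels))
```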
W and b are the parameters of the classification layer; they are in effect the class centers that are finally learned. In the figure below, the symmetry axis of each colored cluster corresponds to a class center, and the set of colored points is x = encoder(raw input), i.e. the output of the layer just before the classification layer.
How should we understand the figure below? Shouldn't the output of the penultimate layer be high-dimensional?
One way to read the image: think of the feature space as a sphere that has been flattened into a two-dimensional picture for easier visualization. (Personal understanding.)
How is this done in practice? With dimensionality reduction.
So how is the classification itself carried out?
As we know, Softmax assigns each sample to the class with the largest score (argmax); the target class only needs to score higher than all the others. Reflected in the diagram, each point is assigned to whichever class center (determined by W and b) it is closest to.
It turns out that classification with the Softmax loss can complete the task well, but a big problem appears when the learned features are used for similarity comparison.
(See “Depth Concept” · Analysis of the advantages and disadvantages of Softmax.)
- L2 distance: the smaller the L2 distance, the higher the vector similarity. It can happen that the distance between feature vectors of the same class (yellow) is larger than the distance between feature vectors of different classes (green).
- Cosine similarity: the smaller the included angle, the larger the cosine similarity and the higher the vector similarity. It can happen that the angle between feature vectors of the same class (yellow) is larger than the angle between feature vectors of different classes (green). (A small sketch of both measures follows this list.)
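A small sketch of the two measures mentioned above (the example vectors and function names are illustrative, not from the original post):

```python
import numpy as np

def l2_distance(a, b):
    # smaller L2 distance -> more similar
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # smaller angle -> larger cosine similarity -> more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 0.2]), np.array([0.9, 0.4])
print(l2_distance(a, b), cosine_similarity(a, b))
```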
To sum up:
- The deep features trained with the Softmax loss partition the whole hyperspace (or hypersphere) according to the number of classes, which guarantees that the classes are separable. This is well suited to closed-set multi-class tasks such as MNIST and ImageNet, because the test categories are necessarily among the training categories.
- But Softmax does not require intra-class compactness or inter-class separation, so it is not suitable for face recognition: a training set on the order of 10,000 identities is tiny compared with the roughly 7 billion people in the world it may be compared against, it is impossible to collect training samples for everyone, and, what is more, the training and test identities generally do not overlap.
- Therefore, Softmax needs to be modified: in addition to guaranteeing separability, the feature vectors should be as compact as possible within a class and as separated as possible between classes.
This loss only considers whether the classification is correct, but does not constrain the intra-class distance. For this reason, the Center Loss function was proposed. (paper)
2. Center loss
Center Loss requires not only that the classification be correct, but also that the features of each class stay compact around their class center:

$$L_C = \frac{1}{2}\sum_{i=1}^{N}\left\|x_i - c_{y_i}\right\|_2^2$$

In this formula, $c_{y_i}$ denotes the center of class $y_i$ and $x_i$ is the feature of each face. The authors add this term to the Softmax loss and use a parameter $\lambda$ to control the intra-class distance, so the overall loss function is:

$$L = L_S + \lambda L_C$$
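A minimal NumPy sketch of the center-loss term above, assuming precomputed class centers (in real training the centers are updated alongside the network, which is omitted here; all names are illustrative):

```python
import numpy as np

def center_loss(features, labels, centers):
    """L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2
    features: (N, D) face features x_i
    labels:   (N,)   class indices y_i
    centers:  (C, D) learned class centers c_j"""
    diff = features - centers[labels]
    return 0.5 * np.sum(diff ** 2)

# total loss: L = L_S + lambda * L_C  (lambda balances intra-class compactness)
```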
3. Triplet Loss
The triplet loss is built from triplets consisting of an Anchor, a Positive, and a Negative sample. As the figure above shows, the Anchor starts out far from the Positive; we want the Anchor and the Positive to be pulled as close together as possible (intra-class distance) and the Anchor and the Negative to be pushed as far apart as possible (inter-class distance). The loss can be written as

$$L = \sum_{i}\Big[\left\|f(x_i^a)-f(x_i^p)\right\|_2^2 - \left\|f(x_i^a)-f(x_i^n)\right\|_2^2 + \alpha\Big]_+$$

The first term of the expression is the intra-class distance and the second is the inter-class distance. Optimizing with gradient descent keeps shrinking the intra-class distance and enlarging the inter-class distance, so the loss function keeps decreasing.
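A minimal sketch of the triplet loss above (the margin value and names are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||a - p||^2 - ||a - n||^2 + margin, 0), averaged over the batch.
    anchor, positive, negative: (N, D) embedding batches."""
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # intra-class distance
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # inter-class distance
    return np.maximum(d_ap - d_an + margin, 0.0).mean()
```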
The algorithms above are relatively traditional and older; the newer algorithms are described below.
4. L-Softmax
The plain Softmax loss does not consider the inter-class distance; Center Loss makes each class compact but does not explicitly push the classes apart; and Triplet Loss is time-consuming because triplets have to be constructed. Hence a new algorithm was proposed.
L-Softmax refines the loss at its source. Writing the target logit as $W_{y_i}^{T}x_i = \|W_{y_i}\|\,\|x_i\|\cos(\theta_{y_i})$, it changes the term inside the log of the Softmax loss from

$$\log\frac{e^{\|W_{y_i}\|\|x_i\|\cos(\theta_{y_i})}}{\sum_{j} e^{\|W_j\|\|x_i\|\cos(\theta_j)}}\quad\text{into}\quad\log\frac{e^{\|W_{y_i}\|\|x_i\|\psi(\theta_{y_i})}}{e^{\|W_{y_i}\|\|x_i\|\psi(\theta_{y_i})} + \sum_{j\neq y_i} e^{\|W_j\|\|x_i\|\cos(\theta_j)}}$$

where $\psi(\theta)$ is built from $\cos(m\theta)$. L-Softmax not only makes the inter-class distance larger, it also compresses the intra-class distance to be more compact.
In other words, $\cos(\theta)$ for the target class is changed to $\cos(m\theta)$.
Multiplying $\theta$ by $m$ has the effect of adding a margin, making the intra-class distance more compact and the inter-class distance larger. The larger $m$ is, the larger the margin between classes: since the cosine function is monotonically decreasing on $(0, \pi)$, keeping $\cos(m\theta)$ large enough to win the classification forces $\theta$ itself to become smaller.
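For reference, a sketch of the piecewise function $\psi(\theta)$ that L-Softmax uses so the modified target logit stays monotonically decreasing in $\theta$ over the whole $(0, \pi)$ range (my reading of the L-Softmax paper; names are illustrative):

```python
import numpy as np

def lsoftmax_psi(theta, m=4):
    """psi(theta) = (-1)^k * cos(m*theta) - 2k  for theta in [k*pi/m, (k+1)*pi/m].
    The target logit then becomes ||W_y|| * ||x|| * psi(theta_y)."""
    k = np.floor(theta * m / np.pi)
    sign = np.where(k % 2 == 0, 1.0, -1.0)
    return sign * np.cos(m * theta) - 2.0 * k
```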
5. SphereFace (A-Softmax)
A-Softmax makes a small change to L-Softmax: when applying the margin, it adds two constraints, normalizing the weights so that $\|W_j\| = 1$ and fixing the bias $b = 0$:

$$L = -\frac{1}{N}\sum_{i}\log\frac{e^{\|x_i\|\psi(\theta_{y_i})}}{e^{\|x_i\|\psi(\theta_{y_i})} + \sum_{j\neq y_i} e^{\|x_i\|\cos(\theta_j)}}$$

This makes the prediction of the model depend only on the angle between $W$ and $x$.
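A rough sketch of the two A-Softmax constraints on the classification layer (shapes and names are illustrative):

```python
import numpy as np

# classification layer parameters (D = 512 features, C = 10 classes)
W = np.random.randn(512, 10)
W = W / np.linalg.norm(W, axis=0, keepdims=True)  # constraint 1: ||W_j|| = 1
b = np.zeros(10)                                   # constraint 2: b = 0
# the logit for class j is then ||x|| * cos(theta_j),
# and the target class uses ||x|| * psi(theta_y) as in L-Softmax
```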
6. CosFace
The CosFace loss function is as follows:

$$L = -\frac{1}{N}\sum_{i}\log\frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j\neq y_i} e^{s\cos\theta_j}}$$

Here the features and the classification weights are L2-normalized, $s$ is the radius of the hypersphere (the scale applied to the normalized features), and $m$ is the margin.
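A minimal sketch of how the CosFace margin modifies the logits before the ordinary softmax cross-entropy (the s and m values are just common defaults, and the names are mine):

```python
import numpy as np

def cosface_logits(features, weights, labels, s=30.0, m=0.35):
    """cos(theta) from L2-normalized features (N, D) and weights (D, C),
    then subtract the margin m from the target class and scale by s."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos_theta = x @ w                                    # (N, C) cosine similarities
    cos_theta[np.arange(len(labels)), labels] -= m       # additive cosine margin
    return s * cos_theta                                 # feed into softmax cross-entropy
```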
7. ArcFace
Comparing ArcFace with CosFace, ArcFace maximizes the classification margin directly in angle space, while CosFace maximizes it in cosine space:

$$L = -\frac{1}{N}\sum_{i}\log\frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j\neq y_i} e^{s\cos\theta_j}}$$

The reason for this modification is that the angular margin acts on the angle more directly than the cosine margin does.
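A comparable sketch for ArcFace, where the margin m is added to the angle itself rather than subtracted from the cosine (again, values and names are illustrative; production implementations usually expand cos(θ + m) analytically and handle edge cases, which this sketch skips):

```python
import numpy as np

def arcface_logits(features, weights, labels, s=64.0, m=0.5):
    """Additive angular margin: use cos(theta + m) for the target class."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos_theta = np.clip(x @ w, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    theta[np.arange(len(labels)), labels] += m   # margin applied in angle space
    return s * np.cos(theta)                     # feed into softmax cross-entropy
```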
The decision boundaries of classification are as follows:
The ArcFace algorithm flow is as follows: