Softmax
Softmax Loss is a softmax activation plus a cross-entropy loss. The softmax activation outputs a probability for each class, i.e. a probability distribution. The cross-entropy loss then takes the negative logarithm of the probability of the correct class and sums over samples; the formula is as follows:
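$$L = -\sum_{i}\log\frac{e^{s_{y_i}}}{\sum_{j} e^{s_j}}$$

where $s_j$ is the score (logit) of class $j$ and $y_i$ is the ground-truth class of sample $i$.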
L-Softmax
L-Softmax was the first to add a margin to Softmax Loss. The idea of a margin is similar to that in Triplet Loss: push different classes farther apart while making each class more compact.
Considering that the last layer is usually a fully connected layer $W x$ that maps the hidden state to the class dimension, we replace $s_j$ with the symbol $f_j$ (and $s_{y_i}$ with $f_{y_i}$), which is just a dot product:
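$$f_j = W_j^{\top} x_i,\qquad L_i = -\log\frac{e^{f_{y_i}}}{\sum_{j} e^{f_j}}$$

where $W_j$ is the class-$j$ weight vector of $W$ and the bias is omitted.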
Further, the inner product can be expanded as follows:
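$$f_j = W_j^{\top} x_i = \|W_j\|\,\|x_i\|\cos\theta_j$$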
Here $\theta_j$ is the angle between the weight vector $W_j$ and $x$. Rewriting the softmax loss:
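$$L_i = -\log\frac{e^{\|W_{y_i}\|\,\|x_i\|\cos\theta_{y_i}}}{\sum_{j} e^{\|W_j\|\,\|x_i\|\cos\theta_j}}$$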
In the binary case, suppose a sample $x$ belongs to class 1. For softmax to give class 1 the higher probability we need $W_1 \cdot x > W_2 \cdot x$, that is:
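$$\|W_1\|\,\|x\|\cos\theta_1 > \|W_2\|\,\|x\|\cos\theta_2$$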
The cosine function is decreasing from 0 to 180 degrees, so when the norms of $W_1$ and $W_2$ are equal this simply requires $\theta_1 < \theta_2$.
Now, to be a bit stricter and make the margin wider, we can add a condition: multiply $\theta_1$ by an integer $m$, where $0 \leq \theta_1 \leq \frac{\pi}{m}$. The requirement becomes:
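$$\|W_1\|\,\|x\|\cos(m\theta_1) > \|W_2\|\,\|x\|\cos\theta_2$$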
As you can see, for a class-1 sample to satisfy this formula, $m\theta_1$ must be smaller than $\theta_2$. Recall that when $x$ lies between $W_1$ and $W_2$, $\theta_1 + \theta_2$ equals the angle between $W_1$ and $W_2$, so $\theta_1$ has to be very small for the inequality to hold. Similarly, a class-2 sample requires $\theta_2$ to be very small. This yields two separate decision boundaries with a decision margin between them.
The L-Softmax loss then replaces $\cos(\theta_{y_i})$ in the target term with a function $\psi(\theta_{y_i})$:
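$$L_i = -\log\frac{e^{\|W_{y_i}\|\,\|x_i\|\,\psi(\theta_{y_i})}}{e^{\|W_{y_i}\|\,\|x_i\|\,\psi(\theta_{y_i})} + \sum_{j\neq y_i} e^{\|W_j\|\,\|x_i\|\cos\theta_j}}$$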
When $0 \leq \theta \leq \frac{\pi}{m}$, $\psi(\theta) = \cos(m\theta)$;
when $\frac{\pi}{m} \leq \theta \leq \pi$, $\psi(\theta) = D(\theta)$.
The function $D$ needs to be monotonically decreasing, and at $\theta = \frac{\pi}{m}$ it must equal $\cos\frac{\pi}{m}$ so that $\psi$ stays continuous.
The $\psi$ function proposed in the original paper is as follows, with $k \in [0, m-1]$:
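$$\psi(\theta) = (-1)^k \cos(m\theta) - 2k,\qquad \theta \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$$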
A-Softmax
A-Softmax is the same as L-Softmax, except that the weight vectors $W_j$ are normalized to norm 1 and the bias is set to 0.
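With $\|W_j\| = 1$ and $b = 0$, the loss simplifies to:

$$L_i = -\log\frac{e^{\|x_i\|\,\psi(\theta_{y_i})}}{e^{\|x_i\|\,\psi(\theta_{y_i})} + \sum_{j\neq y_i} e^{\|x_i\|\cos\theta_j}}$$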
In the binary case, class 1 only needs to satisfy:
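$$\cos(m\theta_1) > \cos\theta_2$$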
An advantage of A-Softmax is that it has a clear geometric interpretation on the hypersphere: with normalized weights, classification depends only on the angles between the feature and the class directions.
AM-Softmax
AM-Softmax constructs the margin by addition instead of multiplication, that is:
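$$\psi(\theta) = \cos\theta - m$$

so in the binary case class 1 needs $\cos\theta_1 - m > \cos\theta_2$.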
AM-Softmax also removes the bias, normalizes the weights (and the input features), and additionally introduces a scale hyperparameter $s$.
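With normalized features and weights, the logits are just cosines, and the AM-Softmax loss for sample $i$ is:

$$L_i = -\log\frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j\neq y_i} e^{s\cos\theta_j}}$$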
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdMSoftmaxLoss(nn.Module):

    def __init__(self, in_features, out_features, s=30.0, m=0.4):
        '''
        AM Softmax Loss
        '''
        super(AdMSoftmaxLoss, self).__init__()
        self.s = s
        self.m = m
        self.in_features = in_features
        self.out_features = out_features
        self.fc = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x, labels):
        '''
        input shape (N, in_features)
        '''
        assert len(x) == len(labels)
        assert torch.min(labels) >= 0
        assert torch.max(labels) < self.out_features

        # project each class weight vector onto the unit sphere
        for W in self.fc.parameters():
            W.data = F.normalize(W.data, dim=1)

        # normalize each feature vector, so fc now outputs cosine similarities
        x = F.normalize(x, dim=1)
        wf = self.fc(x)

        # numerator: s * (cos(theta_y) - m) for the ground-truth class of each sample
        numerator = self.s * (torch.diagonal(wf.transpose(0, 1)[labels]) - self.m)

        # cosine similarities of all the non-target classes, shape (N, out_features - 1)
        excl = torch.cat([torch.cat((wf[i, :y], wf[i, y + 1:])).unsqueeze(0) for i, y in enumerate(labels)], dim=0)

        # softmax denominator: exp of the target term plus exps of the other scaled logits
        denominator = torch.exp(numerator) + torch.sum(torch.exp(self.s * excl), dim=1)

        L = numerator - torch.log(denominator)
        return -torch.mean(L)
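For reference, a minimal usage sketch (the feature dimension, class count, and batch size below are arbitrary choices for illustration, not values from the text):

criterion = AdMSoftmaxLoss(in_features=512, out_features=10, s=30.0, m=0.4)
embeddings = torch.randn(4, 512)         # features from some backbone network
labels = torch.randint(0, 10, (4,))      # ground-truth class indices
loss = criterion(embeddings, labels)
loss.backward()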