Softmax

Softmax Loss is a softmax activation followed by a cross-entropy loss. The softmax activation outputs a probability for each class, i.e., a probability distribution. The cross-entropy loss takes the negative logarithm of each predicted probability and sums it, weighted by the target distribution. The formulas are as follows:


$$f(s)_{i}=\frac{e^{s_{i}}}{\sum_{j}^{C}e^{s_{j}}}$$

$$CE=-\sum_{i}^{C}t_{i}\log(f(s)_{i})$$
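As a quick sanity check, here is a minimal sketch in PyTorch (the logits and class count are made up for illustration) showing that the two formulas above reproduce what the built-in cross entropy computes:

```python
import torch
import torch.nn.functional as F

s = torch.tensor([[2.0, 1.0, 0.1]])   # logits s for one sample, C = 3 classes
t = torch.tensor([1.0, 0.0, 0.0])     # one-hot target t (true class is 0)

# f(s)_i = e^{s_i} / sum_j e^{s_j}
probs = torch.exp(s) / torch.exp(s).sum(dim=1, keepdim=True)

# CE = -sum_i t_i * log(f(s)_i)
ce_manual = -(t * torch.log(probs)).sum()

# PyTorch's built-in cross entropy applies the softmax internally
ce_builtin = F.cross_entropy(s, torch.tensor([0]))

print(ce_manual.item(), ce_builtin.item())   # the two values match
```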

L-Softmax

L-Softmax was the first to add a margin to Softmax Loss. The idea of the margin is similar to that of Triplet Loss: push different classes further apart while making each class more compact internally.

Considering that the last layer generally maps the hidden state to the class dimension and is usually a fully connected layer $W x$, we replace $s_{j}$ with the symbol $f_{y_{i}}$ for the target class, where each logit is an inner product of a weight vector and the feature:


$$f_{y_{i}}=W^{T}_{y_{i}}x_{i}$$

Further, the inner product can be written as:


$$f_{j}=||W_{j}|| \, ||x_{i}|| \cos(\theta_{j})$$

Here $\theta_{j}$ is the angle between the vector $W_{j}$ and $x_{i}$. Rewriting the softmax loss:


$$L_{i}=-\log\left(\frac{e^{||W_{y_{i}}|| \, ||x_{i}|| \cos(\theta_{y_{i}})}}{\sum_{j}e^{||W_{j}|| \, ||x_{i}|| \cos(\theta_{j})}}\right)$$
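This decomposition can be checked numerically. Below is a minimal sketch (the 4-dimensional feature and 3 classes are arbitrary choices for illustration) showing that the raw logits of a bias-free linear layer equal $||W_{j}|| \, ||x_{i}|| \cos(\theta_{j})$:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc = nn.Linear(4, 3, bias=False)   # last layer: hidden dim 4 -> 3 classes
x = torch.randn(1, 4)

logits = fc(x)                     # f_j = W_j^T x

W_norm = fc.weight.norm(dim=1)     # ||W_j||
x_norm = x.norm(dim=1)             # ||x||
cos_theta = (fc.weight @ x.t()).squeeze() / (W_norm * x_norm)   # cos(theta_j)

# f_j = ||W_j|| * ||x|| * cos(theta_j) is identical to the raw logits
print(torch.allclose(logits.squeeze(), W_norm * x_norm * cos_theta))
```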

For binary classification, assume a sample $x$ belongs to class 1. For softmax to assign class 1 the higher probability, we need $W_{1} \cdot x > W_{2} \cdot x$, that is:


$$||W_{1}|| \, ||x|| \cos(\theta_{1}) > ||W_{2}|| \, ||x|| \cos(\theta_{2})$$

Since cosine is decreasing from 0 to 180 degrees, when the norms of $W_{1}$ and $W_{2}$ are equal this requires $\theta_{1} < \theta_{2}$.

To be a bit stricter and make the margin wider, we can add a condition: multiply $\theta_{1}$ by an integer $m$ (with $0 \leq \theta_{1} \leq \frac{\pi}{m}$), which changes the inequality to:


$$||W_{1}|| \, ||x|| \cos(m\theta_{1}) > ||W_{2}|| \, ||x|| \cos(\theta_{2})$$

As you can see, for a sample of class 1 to satisfy this inequality, $m\theta_{1}$ must be smaller than $\theta_{2}$. Since $\theta_{1}+\theta_{2}$ equals the fixed angle between $W_{1}$ and $W_{2}$ (with $x$ lying between them), $\theta_{1}$ now has to be much smaller than under plain softmax. Similarly, a sample of class 2 requires $\theta_{2}$ to be very small. This yields two separate decision boundaries with a decision margin between them.
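A quick numeric check (with a made-up $m=2$ and example angles) shows why the multiplied angle makes the condition stricter: a $\theta_{1}$ that already wins under plain softmax can fail once it is multiplied by $m$:

```python
import math

m = 2
theta1, theta2 = math.radians(50), math.radians(60)   # example angles, ||W_1|| = ||W_2||

plain  = math.cos(theta1) > math.cos(theta2)           # ordinary softmax boundary: True
margin = math.cos(m * theta1) > math.cos(theta2)       # L-Softmax condition: False

print(plain, margin)   # True False -> theta1 must shrink further to satisfy the margin
```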

The L-Softmax loss is then written as:


$$L_{i}=-\log\left(\frac{e^{||W_{y_{i}}|| \, ||x_{i}|| \, \psi(\theta_{y_{i}})}}{e^{||W_{y_{i}}|| \, ||x_{i}|| \, \psi(\theta_{y_{i}})}+\sum_{j\neq y_{i}}e^{||W_{j}|| \, ||x_{i}|| \cos(\theta_{j})}}\right)$$

When $0 \leq \theta \leq \frac{\pi}{m}$, $\psi(\theta)=\cos(m\theta)$.

When $\frac{\pi}{m} \leq \theta \leq \pi$, $\psi(\theta)=D(\theta)$.

The function $D$ needs to be monotonically decreasing, and at $\theta=\frac{\pi}{m}$ it must join continuously with the first piece:


$$D\left(\frac{\pi}{m}\right)=\cos\left(\frac{\pi}{m}\right)$$

The $\psi$ function proposed in the original paper is as follows, with $k \in [0, m-1]$:


$$\psi(\theta)=(-1)^{k}\cos(m\theta)-2k,\quad \theta \in \left[\frac{k\pi}{m},\frac{(k+1)\pi}{m}\right]$$
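Here is a minimal sketch of this piecewise $\psi$ (the helper name `psi` and the choice $m=4$ are just for illustration); it agrees with $\cos(m\theta)$ on $[0, \frac{\pi}{m}]$ and decreases monotonically over $[0, \pi]$:

```python
import math

def psi(theta, m):
    """Piecewise psi(theta) = (-1)^k * cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m]."""
    k = min(int(theta * m / math.pi), m - 1)   # which segment theta falls into, k in [0, m-1]
    return (-1) ** k * math.cos(m * theta) - 2 * k

# psi is monotonically decreasing over [0, pi] and matches cos(m*theta) on [0, pi/m]
for theta in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]:
    print(round(theta, 2), round(psi(theta, m=4), 4))
```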

A-Softmax

A-Softmax is the same as L-Softmax, except that the weight vectors $W_{j}$ are normalized to norm 1 and the bias is set to 0:


$$L_{i}=-\log\left(\frac{e^{||x_{i}|| \cos(\theta_{y_{i},i})}}{\sum_{j}e^{||x_{i}|| \cos(\theta_{j,i})}}\right)$$

In the binary case, class 1 only needs to satisfy:


$$\cos(m\theta_{1})>\cos(\theta_{2})$$

An advantage of A-Softmax is that it has a clear geometric interpretation on a hypersphere.
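A minimal sketch of the A-Softmax setup (normalization only, the margin $m$ is left out; the layer sizes are made up) shows that with unit-norm weights and zero bias the logit depends only on $||x_{i}||$ and the angle:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
fc = nn.Linear(4, 3, bias=False)        # bias is removed, as in A-Softmax
x = torch.randn(2, 4)

W = F.normalize(fc.weight, dim=1)       # each ||W_j|| = 1
logits = F.linear(x, W)                 # logits[i, j] = ||x_i|| * cos(theta_{j,i})

# recover cos(theta_{j,i}) by dividing out ||x_i||
cos_theta = logits / x.norm(dim=1, keepdim=True)
print(cos_theta.min().item() >= -1.0 and cos_theta.max().item() <= 1.0)   # True
```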

AM-Softmax

AM-Softmax constructs the margin additively (on the cosine) instead of multiplicatively (on the angle), that is:


$$\psi(\theta)=\cos\theta-m$$

AM-Softmax also removes the bias, normalizes both the weights and the input features, and additionally introduces a scale hyperparameter $s$:


$$L_{i}=-\log\left(\frac{e^{s(\cos \theta_{y_{i}}-m)}}{e^{s(\cos \theta_{y_{i}}-m)}+\sum_{j\neq y_{i}}e^{s\cos \theta_{j}}}\right)$$

Code

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdMSoftmaxLoss(nn.Module):

    def __init__(self, in_features, out_features, s=30.0, m=0.4):
        '''
        AM Softmax Loss
        '''
        super(AdMSoftmaxLoss, self).__init__()
        self.s = s
        self.m = m
        self.in_features = in_features
        self.out_features = out_features
        self.fc = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x, labels):
        '''
        input shape (N, in_features)
        '''
        assert len(x) == len(labels)
        assert torch.min(labels) >= 0
        assert torch.max(labels) < self.out_features

        # normalize each weight row so that ||W_j|| = 1
        for W in self.fc.parameters():
            W.data = F.normalize(W.data, dim=1)

        # normalize each input feature so that ||x_i|| = 1
        x = F.normalize(x, dim=1)

        # wf[i, j] = cos(theta_{j, i})
        wf = self.fc(x)

        # numerator: s * (cos(theta_{y_i}) - m) for the target class of each sample
        numerator = self.s * (torch.diagonal(wf.transpose(0, 1)[labels]) - self.m)

        # excl: cosines of all non-target classes for each sample
        excl = torch.cat([torch.cat((wf[i, :y], wf[i, y + 1:])).unsqueeze(0)
                          for i, y in enumerate(labels)], dim=0)

        # denominator: e^{numerator} + sum_{j != y_i} e^{s * cos(theta_j)}
        denominator = torch.exp(numerator) + torch.sum(torch.exp(self.s * excl), dim=1)

        L = numerator - torch.log(denominator)
        return -torch.mean(L)
```
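A short usage sketch (the embedding size, class count, and batch are hypothetical), assuming the class defined above:

```python
# hypothetical example: 512-dim embeddings, 10 classes
criterion = AdMSoftmaxLoss(in_features=512, out_features=10, s=30.0, m=0.4)
features = torch.randn(8, 512)            # a batch of embeddings from some backbone
labels = torch.randint(0, 10, (8,))
loss = criterion(features, labels)
loss.backward()
```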