This article is about the paper "A Univariate Bound of Area Under ROC".

This article was first published on the WeChat public account "Code Nongxiu Refining Factory", which shares papers and techniques on machine learning, recommendation systems, data mining, and deep learning. As always, my level is limited, so please bear with me; discussion in any form is welcome, and we can learn and improve together. Thanks for your support!

To download the original paper, follow the public account and reply "UBAUC" in the background.

Abstract & Intro

Area under the ROC curve (AUC) is an important metric for binary classification and bipartite ranking problems. However, AUC is difficult to optimize directly as a learning objective (the 0-1 loss is discrete), so most existing algorithms optimize a surrogate loss of AUC instead. A major drawback of these surrogate losses is that they require pairwise comparisons between training samples, which slows down training and, in the online setting, requires extra local storage. This paper proposes a new surrogate of the AUC risk that needs no pairwise comparison yet still ranks predictions correctly. The authors further show that the sorting operation can be avoided, so the learning objective derived from this surrogate has linear complexity in both time and storage. Finally, experiments verify the effectiveness of the online and batch AUC optimization algorithms based on the proposed surrogate loss.

Problem Definition


Let $\{(x_i, y_i)\}_{i=1}^N$ be the training samples, where $x_i \in \mathcal{R}^d$ and $y_i \in \{-1, +1\}$ is the label. $\mathcal{I}^+ = \{i \mid y_i = +1\}$ and $\mathcal{I}^- = \{i \mid y_i = -1\}$ are the index sets of the positive and negative samples, with $N^+ = |\mathcal{I}^+|$, $N^- = |\mathcal{I}^-|$, and $N^+ + N^- = N$. Define the indicator function $\mathbf{I}_a$, which returns 1 when the condition $a$ is true and 0 otherwise. A binary classifier $c_{w,\theta}: \mathcal{R}^d \mapsto \{-1, +1\}$ is built from a prediction function $f_w: \mathcal{R}^d \mapsto \mathcal{R}$ and a prediction threshold $\theta$, where $w$ is the parameter.

Let $c_i = f_w(x_i)$ be the prediction score of the $i$-th sample, and assume that no two prediction scores are exactly equal.

Given a threshold $\theta$, a negative sample whose prediction score exceeds the threshold is a false positive, and the false positive rate is

$$FPR(\theta) = \frac{1}{N^-}\sum_{i \in \mathcal{I}^-} \mathbf{I}_{f_w(x_i) > \theta}$$

Similarly, a positive sample whose prediction score exceeds $\theta$ is a true positive, and the true positive rate is

$$TPR(\theta) = \frac{1}{N^+}\sum_{i \in \mathcal{I}^+} \mathbf{I}_{f_w(x_i) > \theta}$$

The AUC risk is defined as the fraction of misordered positive-negative pairs, i.e. pairs in which a positive sample is ranked below a negative sample:

$$L_{AUC}(w) = \frac{1}{N^+ N^-}\sum_{i \in \mathcal{I}^+}\sum_{j \in \mathcal{I}^-} \mathbf{I}_{c_i < c_j} \tag{2}$$

So $L_{AUC} = 0$ corresponds to a perfect ranking, and $L_{AUC}$ is independent of the threshold $\theta$.
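To make the definition concrete, here is a minimal Python sketch (my own, not from the paper) that computes the AUC risk of Formula (2) by explicitly counting misordered positive-negative pairs:

```python
import numpy as np

def auc_risk_pairwise(scores, labels):
    """AUC risk = fraction of (positive, negative) pairs in which the
    positive sample's score is below the negative sample's score."""
    pos = scores[labels == +1]
    neg = scores[labels == -1]
    # Explicit pairwise comparison: O(N^+ * N^-) time and memory.
    misordered = (pos[:, None] < neg[None, :]).sum()
    return misordered / (len(pos) * len(neg))

# Toy usage
scores = np.array([0.9, 0.2, 0.3, 0.4, 0.6])
labels = np.array([+1, -1, +1, -1, -1])
print(auc_risk_pairwise(scores, labels))  # 2 misordered pairs out of 6 -> 0.333...
```

The quadratic cost of this pairwise comparison is exactly what the paper's surrogate is designed to avoid.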

Method

AUC Risk Without Pairwise Comparison

Let $(c_1^\uparrow, c_2^\uparrow, \dots, c_N^\uparrow)$ be $(c_1, c_2, \dots, c_N)$ sorted in ascending order, i.e. $c_1^\uparrow < c_2^\uparrow < \dots < c_N^\uparrow$. Let $r_i^+$ be the position of the $i$-th positive sample in the sorted list, and let $c_i^{\uparrow+}$ be the score of the $i$-th positive sample in the sorted list $(c_1^\uparrow, c_2^\uparrow, \dots, c_N^\uparrow)$, i.e. $c_i^{\uparrow+} = c_{r_i^+}^{\uparrow}$.

The figure above assumes $N^+ = 7$, $N^- = 6$, and $(r_1^+, r_2^+, r_3^+, r_4^+, r_5^+, r_6^+, r_7^+) = (4, 6, 7, 8, 9, 11, 13)$. The second positive sample (circled) is ranked below two negative samples, so its contribution to the AUC risk is $N^- + i - r_i^+ = 6 + 2 - 6 = 2$. Counting all misordered positive-negative pairs this way gives $3 + 2 + 2 + 2 + 2 + 1 + 0 = 12$, so the AUC risk is $\frac{12}{6 \times 7} = \frac{2}{7}$, consistent with the definition introduced earlier.
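The arithmetic of this example can be checked in a few lines (my own sketch):

```python
N_pos, N_neg = 7, 6
r = [4, 6, 7, 8, 9, 11, 13]  # ranks of the positive samples in the ascending list

# Contribution of the i-th positive sample: N^- + i - r_i^+  (1-based i)
contrib = [N_neg + i - ri for i, ri in enumerate(r, start=1)]
print(contrib)                         # [3, 2, 2, 2, 2, 1, 0]
print(sum(contrib))                    # 12
print(sum(contrib) / (N_pos * N_neg))  # 0.2857... = 2/7
```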

This observation leads to the following theorem, which counts the misordered positive-negative pairs directly from the ranks, as illustrated in the example above.

Theorem 2: When there are no tied predictions on the training set,

$$L_{AUC}(w) = \frac{1}{N^+ N^-}\left[\sum_{i=1}^{N^+}(N^- + i) - \sum_{i=1}^{N^+} r_i^+\right]$$

Proof: Consider the $i$-th positive sample with rank $r_i^+$. The number of negative samples ranked below it is $r_i^+ - i$, so there are $N^- + i - r_i^+$ negative samples ranked above it, each of which forms a misordered pair. Summing over all positive samples gives the result above.

Here $\sum_{i=1}^{N^+}(N^- + i)$ is the sum of the $N^+$ largest positions in the sorted list, and $\sum_{i=1}^{N^+} r_i^+$ is the sum of the positions actually occupied by the positive samples. Thus the AUC risk defined in Formula (2) is proportional to the difference between these two sums.
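A small self-contained check of Theorem 2 against the pairwise definition, on random scores (my own sketch, assuming no tied scores):

```python
import numpy as np

def auc_risk_from_ranks(scores, labels):
    """Theorem 2: AUC risk computed from the ranks of the positive samples only."""
    order = np.argsort(scores)                  # ascending sort, O(N log N)
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    r_pos = ranks[labels == +1]                 # positions r_i^+ of positive samples
    n_pos, n_neg = len(r_pos), len(scores) - len(r_pos)
    i = np.arange(1, n_pos + 1)
    return ((n_neg + i).sum() - r_pos.sum()) / (n_pos * n_neg)

rng = np.random.default_rng(0)
scores = rng.normal(size=200)
labels = rng.choice([-1, +1], size=200)
pos, neg = scores[labels == +1], scores[labels == -1]
pairwise = (pos[:, None] < neg[None, :]).mean()        # Formula (2)
assert np.isclose(auc_risk_from_ranks(scores, labels), pairwise)
```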

Univariate Bound on AUC risk

Now let us formally introduce the method of this paper.

Based on Equation (2) and the sorted prediction scores $(c_1^\uparrow, c_2^\uparrow, \dots, c_N^\uparrow)$, a surrogate loss for the AUC risk is obtained by replacing ranks with scores:

$$\tilde{L}(w) = \frac{1}{N^+ N^-}\left[\sum_{i=N^-+1}^{N} c_i^\uparrow - \sum_{i=1}^{N^+} c_i^{\uparrow+}\right] \tag{3}$$

$\tilde{L}$ is non-negative, because the second term (the sum of the positive-sample scores) is always less than or equal to the first term (the sum of the $N^+$ largest scores).
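A minimal sketch of this surrogate (my own code; I assume the same $\frac{1}{N^+ N^-}$ normalization as the AUC risk): sort the scores, take the sum of the $N^+$ largest, and subtract the sum of the positive-sample scores.

```python
import numpy as np

def auc_surrogate_sorted(scores, labels):
    """Eq.(3)-style surrogate: (sum of the N^+ largest scores) minus
    (sum of the positive-sample scores), scaled by 1/(N^+ * N^-)."""
    pos = scores[labels == +1]
    n_pos, n_neg = len(pos), len(scores) - len(pos)
    top = np.sort(scores)[-n_pos:]      # requires an O(N log N) sort
    return (top.sum() - pos.sum()) / (n_pos * n_neg)
```

The value is zero exactly when the positive samples occupy the top $N^+$ positions, i.e. when the ranking is perfect.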

Computing $\tilde{L}$ without Explicit Ranking

Formula (3) requires sorting, which is still a time-consuming operation. According to Theorem 3 below, we can find an equivalent form of Equation (3) that does not require sorting.

Theorem 3: For $N$ real numbers $z_1 < \dots < z_N$, the sum-of-top-$k$ problem is equivalent to

$$\sum_{i=N-k+1}^{N} z_i = \min_{\lambda}\left\{ k\lambda + \sum_{i=1}^{N}[z_i - \lambda]_+ \right\} \tag{4}$$

and the optimal $\lambda^*$ satisfies $z_{N-k} \le \lambda^* \le z_{N-k+1}$.

Proof: First, note that $\sum_{i=N-k+1}^{N} z_i$ is the optimal value of the following linear program:

$$\max_{p}\; p^\top z \quad \text{s.t.}\quad p^\top \mathbf{1} = k,\; 0 \le p_i \le 1 \tag{12}$$

Its Lagrangian is

$$L(p, a, b, \lambda) = p^\top z - a^\top (p - \mathbf{1}) + b^\top p + \lambda (k - p^\top \mathbf{1}) \tag{13}$$

where $a \ge 0$ and $b \ge 0$ are the Lagrange multipliers of the box constraints and $\lambda$ is the multiplier of the equality constraint. Setting the partial derivative of $L$ with respect to $p$ to $0$ gives $z - a + b - \lambda\mathbf{1} = 0$. Substituting this back into Equation (13) yields the dual of Equation (12):

$$\min_{a, \lambda}\; k\lambda + a^\top \mathbf{1} \quad \text{s.t.}\quad a \ge 0,\; a \ge z - \lambda\mathbf{1} \tag{14}$$

The constraints of Equation (14) tell us that $a_i \ge [z_i - \lambda]_+$, and the objective attains its minimum when equality holds. Rearranging gives Equation (4). Further, when we choose $\lambda^*$ satisfying $z_{N-k} \le \lambda^* \le z_{N-k+1}$, we have

$$k\lambda^* + \sum_{i=1}^{N}[z_i - \lambda^*]_+ = k\lambda^* + \sum_{i=N-k+1}^{N}(z_i - \lambda^*) = \sum_{i=N-k+1}^{N} z_i$$
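The identity in Theorem 3 is easy to verify numerically; a small sketch assuming distinct values:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=20)
k = 7

top_k_sum = np.sort(z)[-k:].sum()

# Any lambda between the (N-k)-th and (N-k+1)-th order statistics is optimal;
# here we take lambda = z_{N-k+1}, i.e. the k-th largest value.
lam = np.sort(z)[-k]
min_val = k * lam + np.maximum(z - lam, 0).sum()

print(np.isclose(top_k_sum, min_val))   # True
```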

According to Theorem 3 (with $z_i = c_i$ and $k = N^+$), Equation (3) can be written as

$$\tilde{L}(w) = \frac{1}{N^+ N^-}\min_{\lambda}\left\{ N^+\lambda + \sum_{i=1}^{N}[f_w(x_i) - \lambda]_+ - \sum_{i \in \mathcal{I}^+} f_w(x_i) \right\}$$

which can be further rewritten, by splitting the sum over positive and negative samples, as

$$\tilde{L}(w) = \frac{1}{N^+ N^-}\min_{\lambda}\left\{ \sum_{i \in \mathcal{I}^+}\left(\lambda - f_w(x_i) + [f_w(x_i) - \lambda]_+\right) + \sum_{i \in \mathcal{I}^-}[f_w(x_i) - \lambda]_+ \right\}$$

According to the hinge-loss property $[a]_+ - a = [-a]_+$:

$$\tilde{L}(w) = \frac{1}{N^+ N^-}\min_{\lambda}\left\{ \sum_{i \in \mathcal{I}^+}[\lambda - f_w(x_i)]_+ + \sum_{i \in \mathcal{I}^-}[f_w(x_i) - \lambda]_+ \right\}$$

So, based on $\tilde{L}$, the learning objective becomes

$$\min_{w, \lambda}\; \frac{1}{N^+ N^-}\left\{ \sum_{i \in \mathcal{I}^+}[\lambda - f_w(x_i)]_+ + \sum_{i \in \mathcal{I}^-}[f_w(x_i) - \lambda]_+ \right\} \tag{5}$$

In Equation (5), the auxiliary variable $\lambda$ can be understood as a threshold separating the two classes, and, as in the original definition of the AUC risk, $\tilde{L}(w)$ is independent of the choice of threshold because it takes the minimum over all possible threshold values.

$\tilde{L}(w)$ also has an intuitive interpretation in the context of binary classification: it penalizes only the positive samples whose prediction is below the threshold, via $[\lambda - f_w(x_i)]_+$, and the negative samples whose prediction is above the threshold, via $[f_w(x_i) - \lambda]_+$. Moreover, by Theorem 3, the optimal threshold satisfies $\lambda^* \in [c_{N^-}^\uparrow, c_{N^-+1}^\uparrow]$.
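Combining Equation (5) with the location of $\lambda^*$, the surrogate can be evaluated without a full sort; a sketch (my own) that uses linear-time selection for $\lambda$ and checks the result against the sorted form of Equation (3):

```python
import numpy as np

def auc_surrogate_hinge(scores, labels):
    """Eq.(5)-style evaluation in O(N) time: the inner minimization over lambda
    is solved by picking lambda between the N^- -th and (N^- + 1)-th smallest
    scores (Theorem 3 with k = N^+)."""
    pos = scores[labels == +1]
    neg = scores[labels == -1]
    lam = np.partition(scores, len(neg))[len(neg)]   # selection, not a full sort
    loss = np.maximum(lam - pos, 0).sum() + np.maximum(neg - lam, 0).sum()
    return loss / (len(pos) * len(neg))

# Sanity check against the sorted form of Eq. (3)
rng = np.random.default_rng(2)
scores = rng.normal(size=100)
labels = rng.choice([-1, +1], size=100)
pos = scores[labels == +1]
n_pos, n_neg = len(pos), 100 - len(pos)
sorted_form = (np.sort(scores)[-n_pos:].sum() - pos.sum()) / (n_pos * n_neg)
assert np.isclose(auc_surrogate_hinge(scores, labels), sorted_form)
```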

Relation with SVM Objective

Careful readers may notice that, when the prediction function is linear, $f_w(x) = w^\top x$, $\tilde{L}(w)$ is very similar to the objective function of an SVM.

Recall the SVM loss. The classifier in an SVM is generally $w^\top x + b$; here the bias term is written with a negative sign for easier comparison. If the threshold $\lambda$ plays the role of the bias in the SVM, i.e. the classifier is $w^\top x - \lambda$, the SVM objective (omitting the regularization term) is

$$\tilde{L}_{SVM}(w, \lambda) = \sum_{i=1}^{N}[1 + y_i(\lambda - w^\top x_i)]_+$$

It can be seen that this has the same hinge-loss structure as Formula (5). Furthermore, $\tilde{L}_{SVM}(w, \lambda)$ is an upper bound of $\tilde{L}(w)$ (up to the $\frac{1}{N^+ N^-}$ scaling), because

$$[1 + y_i(\lambda - w^\top x_i)]_+ \ge [y_i(\lambda - w^\top x_i)]_+$$

so that $N^+ N^- \, \tilde{L}(w) \le \min_\lambda \tilde{L}_{SVM}(w, \lambda)$. This helps explain a long-standing experimental observation: when evaluated by AUC, the standard SVM does not consistently outperform methods that directly maximize AUC, because a loss that directly bounds the AUC risk is a tighter surrogate.
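A quick numerical illustration of this bound (my own sketch; both hinge sums are left unnormalized, and the scores stand in for $w^\top x_i$):

```python
import numpy as np

def svm_hinge(scores, labels, lam):
    # sum_i [1 + y_i * (lam - f_w(x_i))]_+   (bias written as -lam, no regularizer)
    return np.maximum(1 + labels * (lam - scores), 0).sum()

def auc_hinge(scores, labels, lam):
    # sum_i [y_i * (lam - f_w(x_i))]_+   (the unnormalized Eq. (5) objective)
    return np.maximum(labels * (lam - scores), 0).sum()

rng = np.random.default_rng(1)
scores = rng.normal(size=50)
labels = rng.choice([-1, +1], size=50)
for lam in np.linspace(-2.0, 2.0, 9):
    assert svm_hinge(scores, labels, lam) >= auc_hinge(scores, labels, lam)
```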

The two objective functions also differ in two important respects:

  • The constant 1 in the SVM loss is the margin that the classifier is required to establish.
  • $\lambda$ in $\tilde{L}(w)$ is eliminated by the minimization, while in the SVM $\lambda$ (the bias) remains a free parameter. (Because the optimal value satisfies $\lambda^* \in [c_{N^-}^\uparrow, c_{N^-+1}^\uparrow]$, it can be optimized directly and iteratively.)

OPTIMIZATION

Resolving Ties in Prediction Scores

Directly optimizing Equation (5) is problematic: the norm of $w$ in Equation (5) is not constrained, so the learning objective can be driven down simply by shrinking the magnitude of $w$, which leads to the trivial solution $w = 0$.

The fundamental reason is that $\tilde{L}(w)$ is derived under the assumption that there are no ties among the predicted scores, while the trivial solution corresponds to the extreme opposite: regardless of the data, the prediction function always produces the same output (0).

To solve this problem, this article extends the objective function with two additional terms:

Here the second term is a least-squares term that counteracts the tendency to shrink $w$ toward zero, and $\Omega(w)$ is the regularization term.
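The exact form of the extended objective is given in the paper and is not reproduced here; the sketch below is only my guess at a plausible instantiation, with an assumed squared-error data term and hypothetical weights `rho` and `mu`:

```python
import numpy as np

def extended_objective(w, lam, X, y, rho=1.0, mu=1.0):
    """AUC hinge surrogate + an assumed least-squares term (keeps the scores away
    from the all-zero trivial solution) + L2 regularization Omega(w)."""
    scores = X @ w
    pos, neg = scores[y == +1], scores[y == -1]
    surrogate = (np.maximum(lam - pos, 0).sum() +
                 np.maximum(neg - lam, 0).sum()) / (len(pos) * len(neg))
    least_squares = rho * np.mean((scores - y) ** 2)   # assumed form, not the paper's
    regularizer = 0.5 * mu * np.dot(w, w)              # Omega(w) = (mu/2) * ||w||^2
    return surrogate + least_squares + regularizer
```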

Linear Predictor

When $f_w(x) = w^\top x$ and $\Omega(w) = \frac{1}{2}\|w\|^2$, the term $[x^\top w - \lambda]_+$ is jointly convex in $(w, \lambda)$: for any $\alpha \in [0, 1]$ and any pairs $(w, \lambda)$ and $(w', \lambda')$, the hinge of their convex combination is bounded by the convex combination of the hinges. Therefore $\sum_{i=1}^{N}[x_i^\top w - \lambda]_+ + N^+\lambda$ is convex, and $\min_\lambda\left\{\sum_{i=1}^{N}[c_i - \lambda]_+ + N^+\lambda\right\}$, as a partial minimization of a jointly convex function, is also convex in $w$.

Under the linear classifier, the loss function is as follows:

Batch Learning

Under the batch learning setting, all training samples are available, so a block coordinate descent algorithm can be used to optimize (8): the optimization over $w$ (with the other block fixed) can be transformed into the following constrained optimization problem:

This is a convex quadratic optimization problem, which can be solved by the interior point method when the dimensionality of $w$ is low or moderate. For high-dimensional $w$, the online learning algorithm is more efficient because it avoids forming the Hessian matrix.
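As a rough illustration of block coordinate descent on such an objective (my own simplification: the $\lambda$ block is updated in closed form via Theorem 3, and the $w$ block is updated by a subgradient step rather than the paper's interior-point QP; the objective is the assumed form from the previous sketch):

```python
import numpy as np

def block_coordinate_descent(X, y, rho=1.0, mu=1.0, lr=0.01, iters=200, seed=0):
    """Alternate a closed-form lambda update with a subgradient step on w."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    n_pos, n_neg = int((y == +1).sum()), int((y == -1).sum())
    for _ in range(iters):
        scores = X @ w
        lam = np.partition(scores, n_neg)[n_neg]      # lambda block: closed form
        # w block: subgradient of hinge surrogate + least-squares + L2 terms
        g = np.zeros_like(scores)
        g[y == +1] = -(scores[y == +1] < lam).astype(float) / (n_pos * n_neg)
        g[y == -1] = (scores[y == -1] > lam).astype(float) / (n_pos * n_neg)
        g += 2 * rho * (scores - y) / len(y)          # assumed least-squares term
        w -= lr * (X.T @ g + mu * w)                  # mu * w from Omega(w)
    return w
```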

Online Learning

In the online learning setting, the step size is chosen as $\eta_t \sim \frac{1}{\sqrt{t}}$.
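A minimal sketch of one online step under these assumptions (again using the assumed extended objective from above; the paper's exact update rule may differ):

```python
import numpy as np

def online_update(w, lam, x, y, t, rho=1.0, mu=1.0):
    """One stochastic (sub)gradient step on a single example (x, y),
    with step size eta_t ~ 1/sqrt(t)."""
    eta = 1.0 / np.sqrt(t)
    s = np.dot(w, x)
    if y == +1:
        g_s, g_lam = ((-1.0, 1.0) if s < lam else (0.0, 0.0))   # [lam - s]_+
    else:
        g_s, g_lam = ((1.0, -1.0) if s > lam else (0.0, 0.0))   # [s - lam]_+
    g_s += 2 * rho * (s - y)             # assumed least-squares term
    w = w - eta * (g_s * x + mu * w)     # mu * w from Omega(w)
    lam = lam - eta * g_lam
    return w, lam
```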

Experiments