Support Vector Machine

Linearly separable: there exists a straight line (2D) / plane (3D) / hyperplane (4D and above) that separates the two classes.

Linearly inseparable: no single line/plane/hyperplane can separate the two classes.

1. Mathematical expressions

Mathematically define training samples and their labels

Suppose we have $N$ training samples and their labels $\{(X_1,y_1),(X_2,y_2),\dots,(X_N,y_N)\}$,

where $X_i = [x_{i1}, x_{i2}]^T$ and $y_i \in \{+1, -1\}$.

$y_i$ is the label: if $X_i$ belongs to $C_1$, then $y_i = +1$; if $X_i$ belongs to $C_2$, then $y_i = -1$.

Let’s define linear separability in vector form

Suppose $X_i = \begin{bmatrix} x_{i1} \\ x_{i2} \end{bmatrix}$ and $w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$.

(1) If $y_i = +1$, then $w^T X_i + b > 0$;

(2) if $y_i = -1$, then $w^T X_i + b < 0$.

Combining the two cases ($y_i = +1$ or $-1$): a set of training samples $\{(X_i, y_i)\}$, $i = 1 \sim N$, is linearly separable if there exists $(w, b)$ such that, for all $i = 1 \sim N$, $y_i(w^T X_i + b) > 0$.
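As a quick illustration of this criterion, here is a minimal sketch (the data, `w`, and `b` are made-up values for the example, not from the original notes) that checks whether a candidate $(w, b)$ satisfies $y_i(w^T X_i + b) > 0$ for every sample:

```python
import numpy as np

# Toy 2-D data: two clusters with labels +1 / -1 (illustrative values only)
X = np.array([[2.0, 2.5], [3.0, 3.0], [2.5, 3.5],        # class C1
              [-1.0, -1.5], [-2.0, -1.0], [-1.5, -2.5]]) # class C2
y = np.array([+1, +1, +1, -1, -1, -1])

# A candidate hyperplane w^T x + b = 0 (chosen by hand for this toy set)
w = np.array([1.0, 1.0])
b = -1.0

margins = y * (X @ w + b)          # y_i (w^T X_i + b) for every sample
print(margins)
print("separates the data:", bool(np.all(margins > 0)))
```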

2. How to solve the linearly separable problem

The optimal classification line sought by the support vector machine should satisfy:

  1. The line separates the two classes;
  2. The line maximizes the margin;
  3. The line lies in the middle of the margin and is equidistant from the support vectors on both sides.

Extending to higher dimensions, the support vector machine finds the hyperplane that maximizes the margin and is equidistant from the support vectors on both sides. To derive the optimization problem, we rely on the following facts.

Fact 1

$w^T x + b = 0$ and $(aw^T)x + (ab) = 0$ ($a \neq 0$) describe the same hyperplane.

Fact 2

The distance from a point $X_0$ to the hyperplane $w^T x + b = 0$ is $d = \frac{|w^T X_0 + b|}{||w||}$.

By rescaling $(w, b) \rightarrow (aw, ab)$, we can always arrange that $|w^T X_0 + b| = 1$ on the support vectors $X_0$, while $|w^T X_i + b| > 1$ on the non-support-vector samples.

Based on Fact 2, the distance from a support vector $X_0$ to the hyperplane is $d = \frac{|w^T X_0 + b|}{||w||} = \frac{1}{||w||}$. Therefore, maximizing the distance from the support vectors to the hyperplane is equivalent to minimizing $||w||$. In practice we minimize $\frac{1}{2}||w||^2$, which makes the derivative cleaner.

It is easy to verify that $\frac{\partial \frac{1}{2}||w||^2}{\partial w} = w$.

Here $w$ is an $n \times 1$ vector, $w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}$.

The constraints are $y_i(w^T x_i + b) \geq 1$, $i = 1 \sim N$.

The constant $1$ on the right-hand side could be any positive number. The role of $y_i$ is to fold the two cases into one inequality, so that samples on one side of the hyperplane satisfy $w^T x_i + b \geq 1$ and samples on the other side satisfy $w^T x_i + b \leq -1$.

To sum up, in the linearly separable case, the optimization problem by which the support vector machine finds the optimal hyperplane can be expressed as:

  • Minimize:
    $\frac{1}{2}||w||^2$

  • Constraints: $y_i(w^T x_i + b) \geq 1$  $(i = 1 \sim N)$

The above is a quadratic programming (QP) problem: the objective function is quadratic in the unknowns and the constraints are linear (first-order), so it can be handled by standard QP solvers.
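To make the QP concrete, here is a minimal sketch using the `cvxpy` modeling library (the library choice and the toy data are assumptions of this example, not part of the original notes; any QP solver would do) to solve the hard-margin problem:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])
N, n = X.shape

w = cp.Variable(n)
b = cp.Variable()

# Minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```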

3. How to solve the linearly inseparable problem

We need to relax the constraints, that is, introduce a slack variable $\delta_i$ for each training sample and its label.

The constraints become $y_i(w^T x_i + b) \geq 1 - \delta_i$  $(i = 1 \sim N)$.

3.1 Modified SVM optimization problem

  • Minimize:
    $\frac{1}{2}||w||^2+C\sum_{i=1}^N \delta_i$

    or

    $\frac{1}{2}||w||^2+C\sum_{i=1}^N \delta_i^2$
  • Constraints:

    • $\delta_i \geq 0$  $(i=1\sim N)$

    • $y_i(w^Tx_i + b) \geq 1 -\delta_i$  $(i=1\sim N)$

The penalty factor $C$ is set manually in advance (it is a hyperparameter of the algorithm), e.g. 10000. In practice, one tries different values of $C$, measures the recognition rate of the algorithm, and then selects the best hyperparameter.
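As a hedged illustration of this hyperparameter search, here is a minimal sketch using scikit-learn's `SVC` and `GridSearchCV` (the candidate values of C and the synthetic data are arbitrary choices for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data (illustrative only)
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=0)

# Try several values of C and keep the one with the best cross-validated accuracy
param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000, 10000]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", search.best_score_)
```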

4. Low-dimensional to high-dimensional mapping

The advantage of the support vector machine is that it can enlarge the set of candidate decision functions by mapping the feature space from a low dimension to a high dimension. Note that in the high-dimensional space the SVM still uses a linear hyperplane to classify the data.

This is different from other kinds of algorithms, such as artificial neural networks and decision trees, which directly produce richer families of decision functions; for example, an artificial neural network composes multiple layers of nonlinear functions.

Example 1

By manually specifying a 2-D to 5-D mapping $\varphi(x)$, a linearly non-separable data set can become a linearly separable data set.
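Here is a minimal sketch of such a hand-crafted 2-D to 5-D mapping (the particular monomials chosen are an assumption for illustration; the exact $\varphi$ of the original example is not given in these notes):

```python
import numpy as np

def phi(x):
    """Map a 2-D point [x1, x2] to a 5-D feature vector (degree-2 monomials)."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

# XOR-style data is not linearly separable in 2-D ...
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])

# ... but after the mapping, w = [0, 0, 0, 0, 1], b = 0 separates it:
Z = np.array([phi(x) for x in X])
w, b = np.array([0, 0, 0, 0, 1.0]), 0.0
print(y * (Z @ w + b))   # all entries positive -> linearly separable in 5-D
```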

Theorem

Suppose $N$ training samples are drawn at random in an $M$-dimensional space, and each is randomly assigned the label +1 or -1. Let $P(M)$ be the probability that these training samples are linearly separable. Then, as $M$ approaches infinity, $P(M)$ tends to 1.

Understanding: when we increase the dimension $M$ of the feature space, the number of parameters $(w, b)$ to be estimated for the hyperplane increases, and so does the degree of freedom of the model, so it becomes more likely that a data set which is inseparable in the low dimension can be separated.

Conclusion: mapping the training samples from a low dimension to a high dimension increases the probability of linear separability.
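A rough empirical check of this theorem, as a sketch (the sample size, dimensions, and number of trials are arbitrary choices; scipy is an assumed dependency): for each dimension M we draw random points with random labels and test strict separability by solving the feasibility problem $y_i(w^T x_i + b) \ge 1$ as a linear program.

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Feasibility LP: does some (w, b) satisfy y_i (w^T x_i + b) >= 1 ?"""
    N, M = X.shape
    # Variables: [w_1 ... w_M, b]; constraints: -y_i (w^T x_i + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    res = linprog(c=np.zeros(M + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (M + 1))
    return res.success

rng = np.random.default_rng(0)
N, trials = 40, 200
for M in [2, 5, 10, 20, 40]:
    hits = sum(is_linearly_separable(rng.standard_normal((N, M)),
                                     rng.choice([-1, 1], size=N))
               for _ in range(trials))
    print(f"M = {M:3d}  estimated P(M) = {hits / trials:.2f}")
```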

Support vector machine optimization problem

Assuming that $\varphi(x)$ has been determined, we simply replace $x$ in the SVM optimization problem with $\varphi(x)$.

  • Minimize:
    $\frac{1}{2}||w||^2+C\sum_{i=1}^N \delta_i$

    or

    $\frac{1}{2}||w||^2+C\sum_{i=1}^N \delta_i^2$
  • Constraints:

    • $\delta_i \geq 0$  $(i=1\sim N)$

    • $y_i[w^T\varphi(x_i) + b] \geq 1 -\delta_i$  $(i=1\sim N)$

  1. All $x_i$ are replaced by $\varphi(x_i)$.

  2. Implicit premise: in the low-dimensional problem, $w$ has the same dimension as $x_i$; in the high-dimensional problem, $w$ has the same dimension as $\varphi(x_i)$.

5. Definition of kernel function

Vapnik, the founder of support vector machines, took this idea further by pointing out that we do not need to know the explicit form of $\varphi(x)$. As long as we know, for any two vectors in the space, the value $K(X_1, X_2) = \varphi(X_1)^T \varphi(X_2)$, we can still train the SVM and compute $w^T\varphi(x)+b$ to decide which class $x$ belongs to.

We define the kernel function $K$ together with the mapping $\varphi$; the two are in one-to-one correspondence. Before showing how to compute $w^T\varphi(x)+b$ through the kernel, let us look at what properties a kernel must satisfy so that there exists a mapping $\varphi(x)$ with $K(X_1, X_2) = \varphi(X_1)^T\varphi(X_2)$.

The answer is Mercer's Theorem: such a $\varphi$ exists when $K$ satisfies Mercer's condition. For example, the Gaussian kernel $K(X_1, X_2) = e^{-\frac{||X_1 - X_2||^2}{2\sigma^2}}$ satisfies Mercer's Theorem, but its $\varphi(x)$ cannot be written as an explicit expression. Nevertheless, there are ways to compute the value of $w^T\varphi(x)+b$, and hence the category of a test sample $X$.
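As a small sketch of Mercer's condition in practice (the value of sigma and the random points are arbitrary), the following computes a Gaussian Gram matrix and checks that it is symmetric positive semi-definite, which is what Mercer's condition requires on any finite sample set:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    """K(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))        # 20 random 2-D points

# Gram matrix K_ij = K(X_i, X_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", eigvals.min())   # >= 0 (up to numerical error)
```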

6. Prime Problem and Dual Problem

Before going further, we need to review some basics of primal and dual problems in convex optimization.

Prime Problem

Minimize: $f(w)$

Subject to:

$g_i(w) \leq 0$, $i = 1 \sim K$

$h_i(w) = 0$, $i = 1 \sim M$

Dual Problem

Define $L(w, \alpha, \beta) = f(w) + \sum_{i=1}^K \alpha_i g_i(w) + \sum_{i=1}^M \beta_i h_i(w) = f(w) + \alpha^T g(w) + \beta^T h(w)$

where

$\alpha = [\alpha_1, \alpha_2, \dots, \alpha_K]^T$

$\beta = [\beta_1, \beta_2, \dots, \beta_M]^T$

$g(w) = [g_1(w), g_2(w), \dots, g_K(w)]^T$

$h(w) = [h_1(w), h_2(w), \dots, h_M(w)]^T$

The dual problem is

Maximize: $\theta(\alpha, \beta) = \inf_{w} L(w, \alpha, \beta)$

Subject to: $\alpha_i \geq 0$, $i = 1 \sim K$
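As a worked toy example of this primal/dual pair (not part of the original notes, just a sanity check of the definitions), take $f(w) = w^2$ and $g(w) = 1 - w \leq 0$, with no equality constraints. Then

$$L(w, \alpha) = w^2 + \alpha(1 - w), \qquad \theta(\alpha) = \inf_w L(w, \alpha) = \alpha - \frac{\alpha^2}{4},$$

and maximizing $\theta$ over $\alpha \geq 0$ gives $\alpha^* = 2$, $\theta(\alpha^*) = 1$, which equals the primal optimum $f(w^*) = 1$ at $w^* = 1$ (strong duality holds here, since $f$ is convex and $g$ is linear).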

6.1 Strong Duality theorem

If the primal objective function is convex and the constraint functions are linear, then $f(w^*) = \theta(\alpha^*, \beta^*)$, i.e. the duality gap $f(w^*) - \theta(\alpha^*, \beta^*)$ equals 0.

6.2 KKT conditions

If $f(w^*) = \theta(\alpha^*, \beta^*)$, then by the strong duality theorem it follows that, for all $i = 1 \sim K$, $\alpha_i g_i(w^*) = 0$: either $\alpha_i = 0$ or $g_i(w^*) = 0$.

7. Transforming the SVM into a dual problem

7.1 The original problem

  • Minimize:
    $\frac{1}{2}||w||^2+C\sum_{i=1}^N \delta_i$

    or

    $\frac{1}{2}||w||^2+C\sum_{i=1}^N \delta_i^2$
  • Constraints:

    • $\delta_i \geq 0$  $(i=1\sim N)$

    • $y_i(w^T\varphi(x_i) + b) \geq 1 -\delta_i$  $(i=1\sim N)$

First, convert $\delta_i \geq 0$ $(i = 1 \sim N)$ into $\delta_i \leq 0$ $(i = 1 \sim N)$ (this amounts to replacing $\delta_i$ by $-\delta_i$, which also flips the sign of the slack term in the first form of the objective), so that every constraint has the form $g_i \leq 0$. We obtain:

  • Minimize:
    $\frac{1}{2}||w||^2-C\sum_{i=1}^N \delta_i$
    or
    $\frac{1}{2}||w||^2+C\sum_{i=1}^N \delta_i^2$
  • Constraints:

    • $\delta_i \leq 0$  $(i=1\sim N)$

    • $1 +\delta_i-y_i(w^T\varphi(x_i) + b) \leq 0$  $(i=1\sim N)$

It is easy to see that the constraints are linear and the objective function is convex, so the strong duality theorem applies.

Next we write down the dual problem. Note that the variable $w$ of the generic primal formulation now corresponds to $(w, b, \delta_i)$, that $g_i(w)$ covers the two families of inequality constraints above, and that this problem has no equality constraints $h_i(w)$.

7.2 Dual problem

Using the first form of the objective, $\frac{1}{2}||w||^2 - C\sum_{i=1}^N \delta_i$, the dual problem is:

  • Maximize:
    $\theta(\alpha,\beta)=\inf_{w,\delta,b}\left\{\frac{1}{2}||w||^2-C\sum_{i=1}^N \delta_i+\sum_{i=1}^N\beta_i\delta_i+\sum_{i=1}^N\alpha_i\left[1+\delta_i-y_i(w^T\varphi(x_i) + b)\right]\right\}$
  • Constraints:

    • $\alpha_i \geq 0$  $(i=1\sim N)$

    • $\beta_i \geq 0$  $(i=1\sim N)$

Take the derivatives with respect to $(w, b, \delta_i)$ and set them to 0:
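The original equations (1)–(3) appear to have been lost in conversion; the standard stationarity conditions, consistent with the relations $w=\sum_i\alpha_i y_i\varphi(X_i)$ and $\alpha_i+\beta_i=C$ used later in this article, are:

$$\frac{\partial L}{\partial w}=0 \;\Rightarrow\; w=\sum_{i=1}^N \alpha_i y_i \varphi(x_i) \quad (1)$$

$$\frac{\partial L}{\partial b}=0 \;\Rightarrow\; \sum_{i=1}^N \alpha_i y_i = 0 \quad (2)$$

$$\frac{\partial L}{\partial \delta_i}=0 \;\Rightarrow\; \alpha_i+\beta_i=C \quad (3)$$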

Substituting equations (1), (2) and (3) into the objective function, we get:
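The resulting expression also seems to be missing here; the standard form of the dual, consistent with the algorithm flow in Section 8 and the remark below, is:

$$\theta(\alpha,\beta)=\sum_{i=1}^N \alpha_i-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j y_i y_j\,\varphi(X_i)^T\varphi(X_j),$$

to be maximized subject to $0\le\alpha_i\le C$ and $\sum_{i=1}^N \alpha_i y_i=0$ $(i=1\sim N)$.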

The first of these constraints, $0 \leq \alpha_i \leq C$, comes from combining $\alpha_i \geq 0$ and $\beta_i \geq 0$ with $\alpha_i + \beta_i = C$.

8. Algorithm flow for solving the problem

Since $\varphi(X_i)^T\varphi(X_j) = K(X_i, X_j)$ is the kernel function, the dual problem can be solved without knowing the explicit expression of $\varphi(X)$, yielding $\alpha_i$, $i = 1 \sim N$. Once $\alpha$ has been found, $w$ is given by $w = \sum_{i=1}^N \alpha_i y_i \varphi(X_i)$.

Since $\varphi(x)$ does not necessarily have an explicit expression, neither does $w$. We now show that $w^T\varphi(x)+b$ can still be computed without knowing $w$ explicitly.

So first we solve for $b$.

According to the KKT conditions, for all $i = 1 \sim N$ we have $\alpha_i[1 + \delta_i - y_i(w^T\varphi(X_i) + b)] = 0$ and $\beta_i\delta_i = 0 \Rightarrow (C - \alpha_i)\delta_i = 0$. Since $w = \sum_{j=1}^N \alpha_j y_j \varphi(X_j)$, we have $w^T\varphi(X_i) = \sum_{j=1}^N \alpha_j y_j \varphi(X_j)^T\varphi(X_i) = \sum_{j=1}^N \alpha_j y_j K(X_j, X_i)$.

On the other hand, if for some $i$ we have $\alpha_i \neq 0$ and $\alpha_i \neq C$, then by the KKT conditions there must be $\delta_i = 0$, and hence

$1 + \delta_i - y_i(w^T\varphi(X_i) + b) = 0$,

where $y_i w^T\varphi(X_i) = \sum_{j=1}^N \alpha_j y_i y_j K(X_j, X_i)$.

So we only need to find one index $i$ with $\alpha_i \neq 0$ and $\alpha_i \neq C$ (i.e. $0 < \alpha_i < C$); for that $i$,


$b=\frac{1-\sum_{j=1}^N\alpha_jy_iy_jK(X_j,X_i)}{y_i}$

Now consider a test sample $X$ whose category we need to determine. We compute $w^T\varphi(X) + b = \sum_{i=1}^N \alpha_i y_i \varphi(X_i)^T\varphi(X) + b = \sum_{i=1}^N \alpha_i y_i K(X_i, X) + b$.

In other words, even if $\varphi(x)$ is unknown, $w^T\varphi(x)+b$ can be computed through the kernel function alone (the "kernel trick").

8.1 Conclusion

The judgment criteria are:

If $\sum_{i=1}^N \alpha_i y_i K(X_i, X) + b \geq 0$, then $X \in C_1$;

If $\sum_{i=1}^N \alpha_i y_i K(X_i, X) + b < 0$, then $X \in C_2$.

Thus the classification of $X$ can be completed using only the kernel function.
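Here is a minimal sketch (assuming scikit-learn; the synthetic data and the value of gamma are arbitrary) that fits an RBF-kernel SVM and then reproduces its decision values $\sum_i \alpha_i y_i K(X_i, X) + b$ by hand from the support vectors, to illustrate that only kernel evaluations are needed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, random_state=1)
y = 2 * y - 1                                # relabel to {-1, +1} as in the text

gamma = 0.5                                  # fixed so we can reuse it below
clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)

def rbf(A, B):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X_test = X[:5]
# sum_i alpha_i y_i K(X_i, X) + b, using only the support vectors
# (clf.dual_coef_ stores the products alpha_i * y_i)
manual = rbf(X_test, clf.support_vectors_) @ clf.dual_coef_.ravel() + clf.intercept_[0]

print(np.allclose(manual, clf.decision_function(X_test)))   # True
print(np.where(manual >= 0, "C1 (+1)", "C2 (-1)"))
```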

8.2 Some common kernel functions

  1. Linear kernel: $K(X_1,X_2)=X_1^TX_2$
  2. Polynomial (Poly) kernel: $K(X_1,X_2)=(X_1^TX_2+1)^d$
  3. Gaussian radial basis function (Rbf) kernel: $K(X_1,X_2)=e^{-\frac{||X_1-X_2||^2}{2\sigma^2}}$ → the most commonly used kernel function
  4. Sigmoid (tanh) kernel: $K(X_1,X_2)=\tanh(\beta X_1^TX_2+b)$, where $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$ (all four are sketched in code below)
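A minimal numpy sketch of these four kernels (the parameter values d, sigma, beta, b are placeholders, not values from the original notes):

```python
import numpy as np

def linear_kernel(x1, x2):
    return x1 @ x2

def poly_kernel(x1, x2, d=3):
    return (x1 @ x2 + 1) ** d

def rbf_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x1, x2, beta=1.0, b=0.0):
    return np.tanh(beta * (x1 @ x2) + b)

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, poly_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(x1, x2))
```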

9. The king and pawn versus king problem in chess

In the sample data set shown here, pairs of letters and numbers represent the positions of the black king, white king and white pawn on the board, “draw” means a draw, and “six” means white can take up to six moves to checkmate black.

For a concrete MATLAB example, see Support Vector Machine (MATLAB program). Note that the svmtrain function is not available in MATLAB 2019 versions, so the code and parameters need to be adapted accordingly.

10. Performance metrics of a recognition system

You cannot simply use recognition rates to evaluate performance.

10.1 Confusion matrix


For example, there are $TP+FN$ positive samples and $FP+TN$ negative samples in total. The recognition rate (accuracy) is $\frac{TP+TN}{TP+FN+FP+TN}$.

When the entries of the confusion matrix are expressed as rates (probabilities) rather than counts, they satisfy $TP+FN=1$ and $FP+TN=1$.
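A small sketch (assuming scikit-learn; the labels are made up) computing the confusion matrix, the recognition rate, and the TP/FP rates:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([+1, +1, +1, +1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([+1, +1, -1, +1, -1, -1, +1, -1, -1, -1])

# With labels ordered [+1, -1]: rows = true class, columns = predicted class
(tp, fn), (fp, tn) = confusion_matrix(y_true, y_pred, labels=[+1, -1])

accuracy = (tp + tn) / (tp + fn + fp + tn)
tpr = tp / (tp + fn)    # true positive rate
fpr = fp / (fp + tn)    # false positive rate
print(f"TP={tp} FN={fn} FP={fp} TN={tn}  accuracy={accuracy:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```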

10.2 The ROC curve

The closer the ROC curve is to the upper-left corner, the better the system performs. The larger the AUC (area under the ROC curve), the better the performance; the lower the EER (equal error rate, the operating point at which the false positive rate equals the false negative rate, FP = FN), the better the performance.
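A short sketch (assuming scikit-learn; the data and SVM settings are arbitrary) computing the ROC curve, the AUC, and an approximate EER:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
scores = clf.decision_function(X_te)          # one continuous score per test sample

fpr, tpr, thresholds = roc_curve(y_te, scores)
fnr = 1 - tpr
eer_idx = np.argmin(np.abs(fnr - fpr))        # threshold where FP rate is closest to FN rate

print("AUC :", auc(fpr, tpr))
print("EER :", (fpr[eer_idx] + fnr[eer_idx]) / 2)
```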

11. Deal with multiple categories

SVM has three ways to deal with multi-class problems, i.e. problems with more than two categories:

  1. Modify the objective function and constraints of the optimization problem (not commonly used)
  2. One class vs. the other classes (one-vs-rest)
  3. One class vs. another class (one-vs-one)

11.1 One class vs. the other classes

Construct $N$ SVMs, one per class. For example:

Suppose there are three classes: $C_1, C_2, C_3$.

SVM1: (C1C2) VS (C3)

SVM2: (C1C3) VS (C2)

SVM3: (C2C3) VS (C1)

Suppose $y=+1$ indicates the group listed first and $y=-1$ the class listed second. Then a hard decision can be used to determine which category a new sample $X$ belongs to.

For example, if the outputs of the three SVMs are $y=+1$, $y=+1$, $y=-1$, then $X \in C_1$. If the outputs are $y=+1$, $y=-1$, $y=-1$, the result is ambiguous between $C_1$ and $C_2$, and a soft decision must be made based on the values of $\sum_i\alpha_i y_i K(X_i,X)+b$ of the conflicting SVMs: the class indicated by the smaller (more negative) value wins.

11.2 One class vs. another class

Construct $\frac{N(N-1)}{2}$ SVMs, one for each pair of classes. For example:

Suppose there are three classes: $C_1, C_2, C_3$.

SVM1: (C1) VS (C2)

SVM2: (C1) VS (C3)

SVM3: (C2) VS (C3)
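As a practical sketch (assuming scikit-learn; the synthetic data and SVM settings are arbitrary), both strategies can be tried with the library's built-in wrappers. Note that these standard one-vs-rest / one-vs-one implementations differ in small details from the grouping scheme written out above, but they illustrate the same two ideas:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Three-class toy data (the classes stand in for C1, C2, C3)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="rbf", C=10.0)).fit(X_tr, y_tr)   # N binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=10.0)).fit(X_tr, y_tr)    # N(N-1)/2 binary SVMs

print("one-vs-rest accuracy:", ovr.score(X_te, y_te))
print("one-vs-one  accuracy:", ovo.score(X_te, y_te))
```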