preface

The second edition of “Statistical Learning Methods” has been out for some time, and I have recently gotten around to reading it. When I read the first edition more than a year ago I knew too little and found it hard to follow, so I hope to gain something this time, and something new. These articles only record some key points and supplementary material for study.

Chapter 2 table of contents

  • Perceptron model

  • Perceptron learning strategies

    • Linear separability of data sets
    • Perceptron learning strategies
  • Perceptron learning algorithm

    • The original form of the perceptron learning algorithm
    • Convergence of the algorithm
    • Dual form of perceptron learning algorithm

Scattered notes

The perceptron is a linear classification model for binary classification.

  • Empirical risk minimization of the loss function $L(w,b)$
  • This chapter involves the inner product of vectors, including the concept of the hyperplane and the explanation of linearly separable data sets. The strategy part explains the considerations behind the choice of loss function. In addition, more of the connection between the perceptron and SVM comes from the idea of margin; that idea is not actually developed in this chapter's introduction, but the corresponding literature is given in the references.
  • For the two examples covered in this chapter, consider why $\eta=1$, and then consider the parameter space. The corresponding test case implementations are designed and shown later.
  • The proof of convergence mentions the technique of folding the bias into the weight vector. The merged vector is called the augmented weight vector and is applied in both LR and SVM. However, the book does not express this technique in the same way everywhere, nor does it adopt a unified notation for it. Three chapters of this book discuss the convergence of algorithms: the perceptron, AdaBoost, and the EM algorithm.
  • The Gram matrix $G=[x_i\cdot x_j]_{N\times N}$
  • The activation function of the perceptron is the sign function.
  • Perceptron is the basis of neural network and support vector machine.
  • When we talk about decision boundaries, we’re really thinking about geometric interpretations of algorithms.
  • The following figure illustrates why the perceptron cannot handle the XOR problem.

In the figure, blue and red are the two classes of points, and a linear separating hyperplane would have to be perpendicular to those lines.

  • The book mentions the functional margin and the geometric margin; the "interval" here is the margin.
  • The separating hyperplane divides the feature space into two parts, one for the positive class and one for the negative class. The side the normal vector points to is the positive class; the other side is the negative class.
  • Perceptron loss function: $L=\max(0, -y_i(w\cdot x_i+b))$.

The three elements

model

  • Input space: $\mathcal{X} \subseteq \mathbf{R}^n$
  • Output space: $\mathcal{Y} = \{+1, -1\}$
  • Model: $f(x)=\operatorname{sign}(w\cdot x+b)$
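As a minimal sketch (my own, not the book's code), the model is just a one-line decision function; mapping $w\cdot x+b=0$ to $+1$ is an assumption here:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Perceptron model f(x) = sign(w . x + b); 0 is mapped to +1 here."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```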

strategy

  • To define a learning strategy is to define the loss function and minimize it.
  • Note that "empirical" risk is mentioned here, so learning is an operation based on the training data set.

Loss function selection

  • A natural choice of loss function is the total number of misclassified points. However, such a loss function is not a continuous differentiable function of the parameters $w, b$ and is not easy to optimize.
  • An alternative loss function is the total distance from the misclassified points to the hyperplane $S$. This is the empirical risk function (loss function) $L(w,b) = -\sum_{x_i \in M} y_i(w\cdot x_i + b)$, where $M$ is the set of misclassified points. Given the training data set $T$, the loss function $L(w,b)$ is a continuous differentiable function of $w$ and $b$.
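A minimal sketch of this empirical risk, assuming NumPy arrays `X` (one row per instance) and labels `y` in $\{+1,-1\}$, and counting points with zero margin as misclassified:

```python
import numpy as np

def perceptron_loss(X, y, w, b):
    """L(w, b) = -sum over misclassified points of y_i * (w . x_i + b)."""
    margins = y * (X @ w + b)   # functional margin of every point
    mis = margins <= 0          # misclassified points: the set M
    return -np.sum(margins[mis])
```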

algorithm

Original form

Notice the iterative formula in the original form: you can append a 1 to $x$ and merge $w$ and $b$; the merged vector is what the book calls the augmented weight vector.
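A runnable sketch of the original form under the same NumPy conventions as above; `perceptron_train` is my own name, not the book's, and the final comment notes where the augmented-weight-vector trick would apply:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    """Original form: whenever y_i*(w.x_i + b) <= 0,
    update w <- w + eta*y_i*x_i and b <- b + eta*y_i."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # misclassified (or on the hyperplane)
                w += eta * y_i * x_i
                b += eta * y_i
                mistakes += 1
        if mistakes == 0:       # no misclassified points left: converged
            break
    # Augmented-weight-vector view: append 1 to each x_i and fold b into w,
    # so both updates become a single vector update.
    return w, b
```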

Dual form

The basic idea of the dual form is to express $w$ and $b$ as linear combinations of the instances $x_i$ and labels $y_i$, and to find $w$ and $b$ by solving for their coefficients.

Gram matrix

In the dual form, the training instances appear only in the form of inner products.

$G=[x_i\cdot x_j]_{N\times N}$
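A sketch of the dual form under the same assumptions as before (names are hypothetical); the instances enter only through the precomputed Gram matrix:

```python
import numpy as np

def perceptron_train_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual form: w = sum_i alpha_i*y_i*x_i, b = sum_i alpha_i*y_i."""
    N = X.shape[0]
    G = X @ X.T                    # Gram matrix G[i, j] = x_i . x_j
    alpha, b = np.zeros(N), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(N):
            # decision value of x_i written in terms of alpha and the Gram matrix
            if y[i] * (np.sum(alpha * y * G[i]) + b) <= 0:
                alpha[i] += eta
                b += eta * y[i]
                mistakes += 1
        if mistakes == 0:
            break
    w = (alpha * y) @ X            # recover w as a linear combination of instances
    return w, b, alpha
```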

example

Example 2.1

In this example, $\eta=1$.

The perceptron learning algorithm may arrive at different solutions because it adopts different initial values or selects misclassified points in different orders.

In addition, after this example there is a section proving the convergence of the algorithm. To simplify the description and derivation, it mentions the technique of folding the bias into the weight vector, which may also be used when computing the inner product.
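As a quick check of the sketch above, here is the data I believe Example 2.1 uses (positive points (3,3) and (4,3), negative point (1,1)); treat the data and the expected output as assumptions if your edition differs:

```python
import numpy as np

X = np.array([[3., 3.], [4., 3.], [1., 1.]])   # x1, x2 positive, x3 negative (assumed)
y = np.array([1, 1, -1])

w, b = perceptron_train(X, y, eta=1.0)   # the sketch from the algorithm section
print(w, b)   # with this cyclic visiting order the sketch ends at w = (1, 1), b = -3
```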

Example 2.2

This is also a simple example. Notice two things

  • $\eta=1$
  • $\alpha_i \leftarrow \alpha_i + 1$, $b \leftarrow b + y_i$

Above:

  • Why is $\eta$ chosen to be 1, i.e., why is a step size of order 1 enough?
  • The update expression has already been simplified using the result $\eta=1$.

So here we can observe the effect of adjusting the learning rate $\eta$: the learning rate determines the parameter space, as the sketch below illustrates.
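A small experiment with the hypothetical `perceptron_train` helper from earlier: with zero initialization the sequence of misclassified points does not depend on $\eta$, so changing $\eta$ only rescales $(w, b)$ while the hyperplane itself stays the same.

```python
import numpy as np

X = np.array([[3., 3.], [4., 3.], [1., 1.]])
y = np.array([1, 1, -1])

w1, b1 = perceptron_train(X, y, eta=1.0)
w2, b2 = perceptron_train(X, y, eta=0.5)
print(w1, b1)   # e.g. [1. 1.] -3.0
print(w2, b2)   # e.g. [0.5 0.5] -1.5  -- half of (w1, b1), i.e. the same hyperplane
```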

Logic_01

The often-cited XOR problem cannot be realized with a perceptron because the corresponding data is not linearly separable. But you can use the perceptron to perform other logical operations, that is, provide the data for the corresponding logical operation and then learn the model.

The data in this example is binary; the NOT operation only applies to the first dimension of the input vector.
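A sketch of what such a test case might look like, reusing the `perceptron_train` sketch from the algorithm section; the gate encodings and the $\{+1,-1\}$ label convention are my own assumptions:

```python
import numpy as np

# Truth tables for two-input gates; labels encoded in {+1, -1}
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
gates = {
    "AND": np.array([-1, -1, -1, 1]),
    "OR":  np.array([-1, 1, 1, 1]),
    "NOT": np.array([1, 1, -1, -1]),   # NOT applied to the first dimension only
}

for name, y in gates.items():
    w, b = perceptron_train(X, y, eta=1.0)
    preds = np.where(X @ w + b >= 0, 1, -1)
    print(name, w, b, (preds == y).all())   # all True: these gates are learnable

# XOR (labels [-1, 1, 1, -1]) is not linearly separable, so training would never
# reach zero mistakes and would only stop when max_epochs is hit.
```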

Logic_02

The data in this example is ternary

MNIST_01

Select two classes of digits to distinguish; different choices should give somewhat different results. The data is not uploaded here, but sklearn provides a corresponding data set that can be referenced directly. Note that the test case uses 0 and 1, which are relatively easy to tell apart.
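A sketch of such a test case: here sklearn's built-in 8×8 digits set (`load_digits`) stands in for MNIST, only the classes 0 and 1 are kept, and the `perceptron_train` sketch from earlier is reused. Everything except the sklearn calls is an assumption:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)   # keep only digits 0 and 1
X = digits.data[mask]
y = np.where(digits.target[mask] == 1, 1, -1)        # map labels to {+1, -1}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

w, b = perceptron_train(X_train, y_train, eta=1.0)
preds = np.where(X_test @ w + b >= 0, 1, -1)
print(np.mean(preds == y_test))   # 0 vs 1 is an easy pair, so accuracy should be high
```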

The problem

Loss function

There is a question on Zhihu:

Why can the denominator in the perceptron's loss function be dropped? Some people say it is a positive number so it does not matter, but the denominator contains $w$, which is also unknown. Does it really not matter when you look for the extremum of the loss function? I can't get my head around it.

The corresponding passage in the book ($P_{27}$) ignores $\frac{1}{\|w\|}$ and obtains the loss function of perceptron learning.

Does it matter when you consider the extremum of the loss function?

  • The perceptron deals with linearly separable data sets and binary classification, $\mathcal{Y}=\{+1,-1\}$, so multiplying by $y_i$ really just contributes a sign;

  • The loss function is $L(w,b) = -\sum_{x_i \in M} y_i(w\cdot x_i + b)$, where $M$ is the set of misclassified points. On a linearly separable data set a separating hyperplane $S$ will definitely be found, so the best this loss function can do is reach zero.

  • If a point is correctly classified, $y_i(w\cdot x_i+b) = |w\cdot x_i+b|$; if it is misclassified, a minus sign is added to keep the quantity positive. That is the minus sign in the loss function, and that quantity is the margin;

  • Multiplying by $\frac{1}{\|w\|}$, with $\|w\|$ the norm of the hyperplane's normal vector, gives the geometric margin, which is the distance from the point to the hyperplane. The difference between the functional margin and the geometric margin is that when the parameters of the same hyperplane $(w,b)$ are scaled equally to $(kw,kb)$, the hyperplane represented is unchanged, yet the functional margin from a point to the hyperplane is enlarged as well, while the geometric margin stays constant;

  • In the concrete implementation of the algorithm, $w$ is initialized and then adjusted at each iteration on the misclassified points. Since $w$ has to be initialized anyway, there is no need to agonize over the $\|w\|=1$ case: not considering $\frac{1}{\|w\|}$ amounts to the same thing;

  • This is how a misclassified point is adjusted for:

    $$\begin{aligned} w &\leftarrow w+\eta y_ix_i \\ b &\leftarrow b+\eta y_i \end{aligned}$$

As mentioned above, $y_i$ is a sign, so the perceptron can be interpreted as moving the hyperplane toward the side of the misclassified point by adjusting $w,b$, until finally the classification is completely correct.

  • The solution of the perceptron is not unique; it depends on the initial values and on the order in which misclassified points are adjusted;
  • Can such adjustments really find a solution to the perceptron? Yes: Novikoff proved that through a finite number of searches a separating hyperplane can be found that classifies the training data completely correctly.

So, if you only consider the extremum of the loss function, it does not matter: on a linearly separable data set the final loss is 0. The denominator is just a normalization by the normal vector, and not normalizing gives the same kind of result, since the perceptron solution is not unique anyway; a positive factor does not affect the direction of the hyperplane.
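To make the last point concrete, here is a small numeric check (my own sketch, using the $w=(1,1)$, $b=-3$ hyperplane from Example 2.1): scaling $(w,b)$ by $k$ scales the functional margin by $k$ but leaves the geometric margin and the sign of $w\cdot x+b$ unchanged.

```python
import numpy as np

w, b = np.array([1., 1.]), -3.0
x, y = np.array([3., 3.]), 1

for k in [1.0, 2.0, 10.0]:
    wk, bk = k * w, k * b
    functional = y * (wk @ x + bk)               # grows linearly with k
    geometric = functional / np.linalg.norm(wk)  # unchanged by k
    print(k, functional, geometric)
# The sign of w.x + b never changes either, so dropping 1/||w|| does not change
# which points count as misclassified.
```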