This series reviews the course "Machine Learning Foundations" by Professor Hsuan-Tien Lin of the Department of Computer Science and Information Engineering, National Taiwan University. The focus is on organizing the main ideas rather than taking detailed notes, so some details are omitted.
The course consists of 16 lectures divided into 4 parts:
- When Can Machines Learn?
- Why Can Machines Learn?
- How Can Machines Learn?
- How Can Machines Learn Better?
This article is part 3, corresponding to lectures 9-12 of the original course.
Main contents of this part:
- A detailed treatment of the linear regression algorithm, its generalization guarantee, and whether it can be used for binary classification;
- A detailed treatment of the logistic regression algorithm, with an introduction to gradient descent;
- The relationships and differences among PLA, linear regression, and logistic regression on classification problems, with an introduction to stochastic gradient descent;
- The OVA and OVO methods for multiclass classification;
- Nonlinear feature transforms, and how to control the complexity of the transform.
1. Linear regression
In Part 1, we discussed the categories of machine learning problems: when $\mathcal{Y}=\mathbb{R}$, the problem is regression.
1.1 Linear regression algorithm
The hypothesis set of linear regression is very simple: $h(\mathbf{x})=\mathbf{w}^T\mathbf{x}$.
Its pointwise error measure can be set to $\text{err}(\hat{y},y)=(\hat{y}-y)^2$. The in-sample and out-of-sample errors are, respectively,

$$E_{\text{in}}(\mathbf{w})=\dfrac{1}{N}\sum\limits_{n=1}^{N}(\mathbf{w}^T\mathbf{x}_n-y_n)^2$$

and

$$E_{\text{out}}(\mathbf{w})=\mathop{\mathcal{E}}\limits_{(\mathbf{x},y)\sim P}(\mathbf{w}^T\mathbf{x}-y)^2.$$

We want to minimize $E_{\text{in}}$.
Since $E_{\text{in}}(\mathbf{w})$ is continuous, differentiable, and convex, we can minimize it by setting its gradient to $\mathbf{0}$. As shown in the figure:
If $X^TX$ is invertible (which is usually the case when $N\gg d+1$), we can directly obtain $\mathbf{w}_{\text{LIN}}=(X^TX)^{-1}X^T\mathbf{y}$.
What if $X^TX$ is singular? We can define the pseudo-inverse $X^\dagger$ and take $\mathbf{w}_{\text{LIN}}=X^\dagger\mathbf{y}$.
In practice, it is recommended to use $X^\dagger$ directly: it avoids having to check whether $X^TX$ is invertible, and it is numerically stable even when $X^TX$ is nearly singular.
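As a concrete illustration, here is a minimal NumPy sketch of the algorithm (the synthetic data, noise scale, and dimensions are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # prepend x_0 = 1
w_true = rng.normal(size=d + 1)                            # hypothetical target weights
y = X @ w_true + 0.1 * rng.normal(size=N)                  # targets plus noise

# w_LIN = X^dagger y; np.linalg.pinv works whether or not X^T X is invertible
w_lin = np.linalg.pinv(X) @ y

E_in = np.mean((X @ w_lin - y) ** 2)  # in-sample squared error
print(E_in)
```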
1.2 Generalization of linear regression
Linear regression seems to have no “learning” process, so is it machine learning?
In fact, learning can be said to "happen" as long as $E_{\text{out}}(\mathbf{w}_{\text{LIN}})$ is guaranteed to be small enough.
Here, instead of starting from the VC dimension, we explain from another angle why $E_{\text{out}}(\mathbf{w}_{\text{LIN}})$ is sufficiently small.
Let's first look at how big the average $E_{\text{in}}$ is:
where
$H=XX^\dagger$ is called the hat matrix because it maps $\mathbf{y}$ to $\hat{\mathbf{y}}$. If $\mathbf{y}$ is generated from the ideal $f(X)\in\text{span}(X)$ plus $\mathbf{noise}$, then $I-H$ also maps $\mathbf{noise}$ to $\mathbf{y}-\hat{\mathbf{y}}$:
and $\text{trace}(I-H)=N-(d+1)$. The trace can be understood as "energy", so we have
Averaging $E_{\text{in}}$ gives $\overline{E_{\text{in}}}=\text{noise level}\cdot\left(1-\dfrac{d+1}{N}\right)$; similarly, $\overline{E_{\text{out}}}=\text{noise level}\cdot\left(1+\dfrac{d+1}{N}\right)$.
Therefore, $\overline{E_{\text{in}}}$ and $\overline{E_{\text{out}}}$ behave as follows:
If $N\to\infty$, both converge to $\sigma^2$ (the noise level), and the expected generalization error is $\dfrac{2(d+1)}{N}$. So, learning happens!
VC theory says that the probability of $E_{\text{in}}$ and $E_{\text{out}}$ being far apart is bounded; here we have shown that their average difference converges. These are different angles, but both demonstrate the power of generalization.
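The averages above are easy to check empirically. Below is a small simulation sketch (the choices of $N$, $d$, $\sigma$, and the number of trials are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma, trials = 50, 5, 0.5, 2000

e_ins = []
for _ in range(trials):
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    w_f = rng.normal(size=d + 1)              # ideal weights, so f(X) lies in span(X)
    y = X @ w_f + sigma * rng.normal(size=N)  # f(X) plus noise
    w_lin = np.linalg.pinv(X) @ y
    e_ins.append(np.mean((X @ w_lin - y) ** 2))

print(np.mean(e_ins))                # empirical average of E_in
print(sigma**2 * (1 - (d + 1) / N))  # theoretical sigma^2 * (1 - (d+1)/N)
```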
1.3 Binary classification via linear regression
In linear classification, $\mathcal{Y}=\{+1,-1\}$, $h(\mathbf{x})=\text{sign}(\mathbf{w}^T\mathbf{x})$, and $\text{err}(\hat{y},y)=\mathbf{1}_{[\hat{y}\ne y]}$; finding the optimal solution is an NP-hard problem.
Since $\{+1,-1\}\subset\mathbb{R}$, the positive and negative labels can also be treated as real numbers, and in linear regression $\mathcal{Y}=\mathbb{R}$. So can we simply run linear regression to get $\mathbf{w}_{\text{LIN}}$ and then let $g(\mathbf{x})=\text{sign}(\mathbf{w}_{\text{LIN}}^T\mathbf{x})$?
The error measures of linear classification and linear regression are $\text{err}_{0/1}=\mathbf{1}_{[\text{sign}(\mathbf{w}^T\mathbf{x})\ne y]}$ and $\text{err}_{\text{sqr}}=(\mathbf{w}^T\mathbf{x}-y)^2$, respectively; their relationship is as follows:
Clearly $\text{err}_{0/1}\le\text{err}_{\text{sqr}}$. As a result, we have
That is, making the regression $E_{\text{in}}$ small enough also makes the classification $E_{\text{out}}$ small enough, although the upper bound is loose. Doing so trades bound tightness for computational efficiency.
In practice, $\mathbf{w}_{\text{LIN}}$ is often used as the initial weight vector for PLA or the pocket algorithm.
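A quick sketch of this regression-for-classification idea (the toy data and target weights are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
w_target = np.array([0.5, 1.0, -1.0])      # hypothetical separating weights
y = np.sign(X @ w_target)                  # +1/-1 labels

w_lin = np.linalg.pinv(X) @ y              # regress directly on the +1/-1 labels
err_01 = np.mean(np.sign(X @ w_lin) != y)  # 0/1 classification error of g
print(err_01)                              # small, and w_lin can seed PLA/pocket
```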
2. Logistic regression
2.1 Logistic regression algorithm
In hard binary classification, what we are interested in is
But in many scenarios we want soft classification, i.e., the probability of belonging to a certain class; what we are interested in is
The problem is that our data are labeled with each sample's category, not with the probability of the sample belonging to a category.
For a sample with features $\mathbf{x}=(x_0,x_1,x_2,\cdots,x_d)$, compute a weighted score $s=\sum\limits_{i=0}^{d}w_ix_i$. We can convert this score into an estimated probability using the logistic function $\theta(s)$. In other words, the logistic regression hypothesis is $h(\mathbf{x})=\theta(\mathbf{w}^T\mathbf{x})$.
The most commonly used logistic function is $\theta(s)=\dfrac{e^s}{1+e^s}=\dfrac{1}{1+e^{-s}}$.
The function image is as follows:
As you can see, it is a smooth, monotonic, S-shaped (sigmoid) function.
Next, we define $E_{\text{in}}(\mathbf{w})$ for logistic regression. First, the target function $f(\mathbf{x})=P(+1|\mathbf{x})$ can be rewritten as
Assume the dataset at hand is $\mathcal{D}=\{(\mathbf{x}_1,\circ),(\mathbf{x}_2,\times),\ldots,(\mathbf{x}_N,\times)\}$.
Then the probability that $f$ generates $\mathcal{D}$ is
and the likelihood that our hypothesis $h$ generates $\mathcal{D}$ is
If $h\approx f$, then the likelihood of $h$ generating $\mathcal{D}$ should be close to the probability of $f$ generating $\mathcal{D}$, and the probability of $f$ generating $\mathcal{D}$ should be large (it is precisely the dataset we sampled). So the learning algorithm can take
For $h(\mathbf{x})=\theta(\mathbf{w}^T\mathbf{x})$, by the symmetry of the function, $1-h(\mathbf{x})=h(-\mathbf{x})$, so
Since $P(\mathbf{x}_1),P(\mathbf{x}_2),\ldots,P(\mathbf{x}_N)$ are independent of $h$, we have
Now we want to maximize this to find the final $h$. Substituting $h(\mathbf{x})=\theta(\mathbf{w}^T\mathbf{x})$ and taking the logarithm (the logarithm is monotonic and does not change the maximizer), it becomes
After negating (turning maximization into minimization) and dividing by $N$ (which does not change the minimizer), it becomes
Expanding $\theta(s)$, we obtain
Let $\text{err}(\mathbf{w},\mathbf{x},y)=\ln\left(1+\exp(-y\mathbf{w}^T\mathbf{x})\right)$.
This is called the cross-entropy error, and $E_{\text{in}}(\mathbf{w})=\dfrac{1}{N}\sum\limits_{n=1}^{N}\text{err}(\mathbf{w},\mathbf{x}_n,y_n)$.
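As a sketch, the cross-entropy $E_{\text{in}}$ can be computed in a numerically stable way like this (the shapes and the leading column of ones in `X` are assumptions):

```python
import numpy as np

def cross_entropy_e_in(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n w^T x_n)); y in {+1, -1}."""
    # np.logaddexp(0, s) computes ln(1 + exp(s)) without overflow
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))
```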
2.2 Gradient descent
Next we minimize $E_{\text{in}}(\mathbf{w})$. It is continuous, differentiable, twice-differentiable, and convex, so we can try to set its gradient to $\mathbf{0}$. Computing its gradient gives
The gradient can be seen as a weighted average of the $-y_n\mathbf{x}_n$ terms with weights $\theta(\cdot)$. For it to be zero, there are two possibilities:
- All $\theta(-y_n\mathbf{w}^T\mathbf{x}_n)$ are zero, which means every sample satisfies $y_n\mathbf{w}^T\mathbf{x}_n\gg 0$, i.e., $\mathcal{D}$ is linearly separable;
- If $\mathcal{D}$ is not linearly separable, the weighted sum itself must be 0; this is a nonlinear equation in $\mathbf{w}$ with no closed-form solution.
Iteration can proceed in a similar way to PLA, namely $\mathbf{w}_{t+1}\leftarrow\mathbf{w}_t+\eta\mathbf{v}$, where $\mathbf{v}$ determines the update direction and $\eta$ determines the step size, as shown below:
How should we iterate? A greedy approach makes $E_{\text{in}}$ smaller step by step. With some $\eta$ given, determining the direction $\mathbf{v}$ at each step amounts to solving
This looks even harder to solve. But if $\eta$ is small enough, we can use a local linear approximation (a Taylor expansion):
$E_{\text{in}}(\mathbf{w}_t)$ and $\nabla E_{\text{in}}(\mathbf{w}_t)$ are known and $\eta$ is given, so we only need to determine $\mathbf{v}$. Note that the second term above is essentially the inner product of two vectors, which is minimized when the two vectors point in opposite directions. So to minimize the expression, take
The gradient descent update becomes: for a given small $\eta$,

$$\mathbf{w}_{t+1}\leftarrow\mathbf{w}_t-\eta\dfrac{\nabla E_{\text{in}}(\mathbf{w}_t)}{\Vert\nabla E_{\text{in}}(\mathbf{w}_t)\Vert}$$
Too small an $\eta$ makes progress very slow, while too large an $\eta$ causes instability. It is best to use a variable $\eta$, as shown below:
So what makes a better $\eta$? We can make it positively correlated with $\Vert\nabla E_{\text{in}}(\mathbf{w}_t)\Vert$: multiply the original fixed $\eta$ by $\Vert\nabla E_{\text{in}}(\mathbf{w}_t)\Vert$. The update rule then becomes
The new $\eta$ can be called the fixed learning rate.
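Putting the pieces together, here is a minimal fixed-learning-rate gradient descent sketch for logistic regression (the defaults for `eta` and the iteration count `T` are illustrative assumptions):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_gd(X, y, eta=0.1, T=1000):
    """Minimize the cross-entropy E_in by fixed-learning-rate gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        # gradient of E_in: (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)
        grad = np.mean(sigmoid(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
        w -= eta * grad
    return w
```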
3. Linear models for classification
3.1 Comparison of the three algorithms
Denote $s=\mathbf{w}^T\mathbf{x}$. The three models (linear classification, linear regression, logistic regression) are summarized below:
The $ys$ here can be called the classification correctness score: it measures how correct a classification is; the larger the value, the more "correct" the classification.
If the cross-entropy error function $\text{err}_{\text{CE}}(s,y)$ is scaled (divided by $\ln 2$), we get $\text{err}_{\text{SCE}}(s,y)=\log_2(1+\exp(-ys))$.
If you plot their error functions, you get the following:
As can be seen from the figure, $\text{err}_{0/1}\le\text{err}_{\text{SCE}}(s,y)=\dfrac{1}{\ln 2}\text{err}_{\text{CE}}(s,y)$ always holds.
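A figure of this kind can be reproduced with a short plotting sketch (the plotted range of $ys$ is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

ys = np.linspace(-3, 3, 300)        # classification correctness score
err_01 = (ys <= 0).astype(float)    # 0/1 error: 1 iff the sign is wrong
err_sqr = (ys - 1) ** 2             # squared error, rewritten in terms of ys
err_sce = np.log2(1 + np.exp(-ys))  # scaled cross-entropy error

plt.plot(ys, err_01, label="err 0/1")
plt.plot(ys, err_sqr, label="err sqr")
plt.plot(ys, err_sce, label="err SCE")
plt.xlabel("ys"); plt.ylim(0, 5); plt.legend(); plt.show()
```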
So $\text{err}_{\text{CE}}$ can also be used for the classification task; there are two ways to see this:
- From the classification point of view, we have
- From the cross-entropy point of view, we have
Either way, making $E_{\text{in}}^{\text{CE}}(\mathbf{w})$ small enough guarantees that $E_{\text{out}}^{0/1}(\mathbf{w})$ is small enough, so logistic regression or linear regression can be used for linear classification.
When PLA, linear regression, and logistic regression are used for classification, their advantages and disadvantages are as follows:
3.2 Stochastic gradient descent
Each PLA iteration takes $O(1)$ time, but each iteration of logistic regression (or the pocket algorithm) must go over all the samples in $\mathcal{D}$, taking $O(N)$ time. Can each iteration be made to take $O(1)$ time as well?
We update with $\mathbf{w}_{t+1}\leftarrow\mathbf{w}_t+\eta\mathbf{v}$:
As you can see, computing the gradient requires going over all the samples, which is too expensive. The $\dfrac{1}{N}\sum\limits_{n=1}^{N}$ can be viewed as an expectation $\mathcal{E}$, equivalent to averaging the results of repeatedly drawing one random sample. The gradient on a single random sample, $\nabla_\mathbf{w}\text{err}(\mathbf{w},\mathbf{x}_n,y_n)$, is called the stochastic gradient, and the true gradient can be seen as its expectation:
In this way, iterations can be performed with stochastic gradient descent (SGD). Its advantages are simplicity and low computational cost, making it well suited to big data and online learning; its drawback is that it is less stable.
In logistic regression, the step of updating with SGD becomes
This is very similar to the PLA update step, which looks like this:
Thus, logistic regression with SGD can be regarded as a "soft" PLA. Conversely, when $\mathbf{w}_t^T\mathbf{x}_n$ is very large, PLA with $\eta=1$ can be regarded as SGD logistic regression.
There are two rules of thumb when using SGD:
- When should it stop? When $t$ is large enough (do not check whether the gradient is actually 0, as that would bring back the cost of computing the full gradient);
- When $\mathbf{x}$ is in a typical range, $\eta=0.1$ is a good default.
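A minimal SGD sketch for logistic regression incorporating these two rules of thumb (`T` and `eta` defaults are the illustrative values mentioned above):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_sgd(X, y, eta=0.1, T=10000, seed=0):
    """Logistic regression by SGD: fixed number of steps, eta = 0.1."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))  # draw one random sample
        # move opposite the stochastic gradient: the "soft PLA" update
        w += eta * sigmoid(-y[n] * (X[n] @ w)) * y[n] * X[n]
    return w
```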
4. Multiclass classification
4.1 Multiclass classification via logistic regression
Suppose $\mathcal{Y}=\{\square,\diamondsuit,\triangle,\star\}$ with data distributed as follows:
Each category can be classified separately, as shown below:
But when combining them at the end, there is a problem: for some regions, it cannot be determined which category they belong to:
How can this be solved? Logistic regression can be used as a "soft" classifier: for each of the $K$ categories, use the dataset
to train $\mathbf{w}_{[k]}$:
After training, combine them by taking $g(\mathbf{x})=\arg\max_{k\in\mathcal{Y}}\theta(\mathbf{w}_{[k]}^T\mathbf{x})$, which assigns each point to a category:
This is called OVA (one-versus-all) decomposition. Its advantages are efficiency and compatibility with logistic-regression-like methods; its disadvantage is that when $K$ is large, the $\mathcal{D}_{[k]}$ tend to be very unbalanced. For example, with 100 fairly evenly distributed classes, each training run pits one class against the other 99, so the two sides differ greatly in size.
This can be extended further, e.g., multinomial ("coupled") logistic regression, which adds constraints such as "the probabilities of belonging to the different classes should sum to 1" to make it better suited to multiclass problems.
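Here is a minimal OVA sketch on top of a soft binary trainer, e.g. the `logistic_gd` routine sketched earlier (that dependency, the `train(X, y)` signature, and the dictionary bookkeeping are illustrative choices):

```python
import numpy as np

def ova_train(X, y, classes, train):
    # one w_[k] per class, trained on "class k vs. everything else" labels
    return {k: train(X, np.where(y == k, 1.0, -1.0)) for k in classes}

def ova_predict(X, ws):
    # g(x) = argmax_k theta(w_[k]^T x); theta is monotonic, so comparing
    # the raw scores w_[k]^T x yields the same argmax
    classes = list(ws)
    scores = np.column_stack([X @ ws[k] for k in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```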
4.2 Multiclass classification via binary classification
To overcome the imbalance problem, we can train on pairs of classes, using the dataset
to perform linear binary classification:
Finally, take $g(\mathbf{x})=\text{tournament champion}\{\mathbf{w}_{[k,\ell]}^T\mathbf{x}\}$, i.e., the class that wins the most pairwise matchups:
Such a method is called OVO (one-versus-one) decomposition. Its advantages are efficiency in each training run (less data is used per run), stability, and compatibility with any binary classification method. Its disadvantage is that computing all the $\mathbf{w}_{[k,\ell]}$ takes $O(K^2)$ training runs and more storage. OVO is often used when $K$ is not very large.
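And a matching OVO sketch with tournament voting (again, `train` stands in for any binary trainer, e.g. pocket or logistic regression; the structure is an illustrative assumption):

```python
import numpy as np
from itertools import combinations

def ovo_train(X, y, classes, train):
    ws = {}
    for k, l in combinations(classes, 2):
        mask = (y == k) | (y == l)  # keep only the two classes
        ws[(k, l)] = train(X[mask], np.where(y[mask] == k, 1.0, -1.0))
    return ws

def ovo_predict(x, ws, classes):
    votes = {k: 0 for k in classes}
    for (k, l), w in ws.items():      # each pairwise classifier votes
        votes[k if x @ w > 0 else l] += 1
    return max(votes, key=votes.get)  # the tournament champion
```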
5. Nonlinear transformation
For some datasets, $E_{\text{in}}$ is large no matter which linear model is used:
5.1 Quadratic hypothesis set
We find that if we use a circle as its classification boundary, it is actually divisible:
So must we redesign a circular PLA, circular regression, and so on? Of course not. We can map $\mathbf{x}\in\mathcal{X}$ to $\mathbf{z}\in\mathcal{Z}$ via a transform $\Phi$, so that data that are circularly separable in $\mathcal{X}$ become linearly separable in $\mathcal{Z}$.
Mapping to the $\mathcal{Z}$ space via $\Phi_2(\mathbf{x})=(1,x_1,x_2,x_1^2,x_1x_2,x_2^2)$ yields the general quadratic hypothesis set:
Of course, higher-order nonlinear transforms can also be used. The flow of nonlinear transformation is shown below:
The specific steps are as follows:
- First use $\Phi$ to transform $\{(\mathbf{x}_n,y_n)\}$ into $\{(\mathbf{z}_n=\Phi(\mathbf{x}_n),y_n)\}$;
- Train a model $\tilde{\mathbf{w}}$ on $\{(\mathbf{z}_n,y_n)\}$ with a linear classification algorithm $\mathcal{A}$;
- Return $g(\mathbf{x})=\text{sign}\left(\tilde{\mathbf{w}}^T\Phi(\mathbf{x})\right)$.
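These three steps can be sketched with the quadratic transform $\Phi_2$ and linear regression as the linear algorithm $\mathcal{A}$ (the circular toy data are an assumption for illustration):

```python
import numpy as np

def phi2(X):
    # X rows are (x1, x2); returns (1, x1, x2, x1^2, x1*x2, x2^2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sign(0.6 - X[:, 0]**2 - X[:, 1]**2)      # circularly separable labels

Z = phi2(X)                                     # step 1: transform to Z space
w_tilde = np.linalg.pinv(Z) @ y                 # step 2: linear model in Z space
g = lambda Xnew: np.sign(phi2(Xnew) @ w_tilde)  # step 3: g(x) = sign(w~^T Phi(x))
print(np.mean(g(X) != y))                       # in-sample 0/1 error
```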
5.2 Cost of complexity
Suppose the $Q$-th order nonlinear transform is used:
How large is $1+\tilde d$? With $d$ features, each term of the transform (after padding with the constant 1) is a monomial $x_1^{q_1}x_2^{q_2}\cdots x_d^{q_d}$ whose exponents sum to at most $Q$; adding a slack exponent $q_0$ turns this into counting the nonnegative integer solutions of $q_0+q_1+\cdots+q_d=Q$. A stars-and-bars argument works: imagine $Q+d+1$ balls in a row and place $d$ partitions among them, splitting them into $d+1$ segments; the number of balls in each segment minus 1 gives the exponent at the corresponding position. Since every segment must contain at least one ball, partitions can only go in the $Q+d$ gaps between adjacent balls (at most one per gap), giving $\binom{Q+d}{d}$ placements. Hence $1+\tilde d=\binom{Q+d}{d}=O(Q^d)$, the number of terms on the right of the equation above.
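The count is easy to sanity-check numerically, e.g. for $Q=3$, $d=2$ (values chosen arbitrarily):

```python
from math import comb
from itertools import product

Q, d = 3, 2
# brute force: count exponent tuples (q_1, ..., q_d) with q_1 + ... + q_d <= Q
brute = sum(1 for qs in product(range(Q + 1), repeat=d) if sum(qs) <= Q)
print(brute, comb(Q + d, d))  # both print 10
```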
When $Q$ is large, on the one hand the cost of computation and storage is high; on the other hand, $1+\tilde d$ is an upper bound on $d_{\text{VC}}(\mathcal{H}_{\Phi_Q})$. If $Q$ is too large, $d_{\text{VC}}$ becomes too large and the model loses its generalization ability.
5.3 Choosing the transform
How should $Q$ be chosen? Let $\Phi_0(\mathbf{x})=(1)$, $\Phi_1(\mathbf{x})=\left(\Phi_0(\mathbf{x}),x_1,x_2,\ldots,x_d\right)$, ..., $\Phi_Q(\mathbf{x})=\left(\Phi_{Q-1}(\mathbf{x}),x_1^Q,x_1^{Q-1}x_2,\ldots,x_d^Q\right)$, and denote their hypothesis sets $\mathcal{H}_0,\mathcal{H}_1,\ldots,\mathcal{H}_Q$. These have nested relationships
As shown in the figure:
And their VC dimensions satisfy
If $g_i=\arg\min_{h\in\mathcal{H}_i}E_{\text{in}}(h)$, then their $E_{\text{in}}$ values satisfy
How should $Q$ be chosen in practice? The safe approach is to check whether $E_{\text{in}}(g_1)$ is small enough: if it is, stop there; otherwise, move to a slightly more complex model, i.e., move rightward in the figure below: