Learning notes on Zhou Zhihua's book *Machine Learning*, recording the learning process. This post covers Chapter 8.
1 Individual Learners and Ensembles
Ensemble learning completes a learning task by constructing and combining multiple learners. Ensembles can be divided into homogeneous and heterogeneous:
- Homogeneous ensemble: all individual learners are of the same type. Such individual learners are called "base learners", and the algorithm that produces them is the "base learning algorithm".
- Heterogeneous ensemble: contains individual learners of different types, which are called "component learners".
By combining multiple learners, ensemble learning can often achieve significantly better generalization performance than a single learner. This is especially pronounced for weak learners (learners whose generalization performance is only slightly better than random guessing).
A common intuition says that mixing good and bad things yields something a bit worse than the best and a bit better than the worst. Why, then, can an ensemble outperform its best single learner? Consider binary classification where the ensemble predicts by voting, i.e. the majority rules: as long as most individual learners are right on each example, the ensemble is right. To obtain a good ensemble, individual learners should therefore be "good but different": each should be reasonably accurate, and they should be diverse, differing from one another.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
For the binary classification problem $y \in \{-1, +1\}$ with true function $f$, assume each base classifier has error rate $\epsilon$, i.e. for each base classifier $h_i$:

$$P(h_i(x) \neq f(x)) = \epsilon$$
If $T$ base classifiers are combined by simple voting, the ensemble classifies correctly as long as more than half of the base classifiers are correct:

$$H(x) = \mathrm{sign}\left(\sum_{i=1}^{T} h_i(x)\right)$$
Assuming the base classifiers' error rates are mutually independent, Hoeffding's inequality gives the ensemble error rate:

$$P(H(x) \neq f(x)) = \sum_{k=0}^{\lfloor T/2 \rfloor} \binom{T}{k} (1-\epsilon)^k \epsilon^{T-k} \leq \exp\left(-\frac{1}{2} T (1-2\epsilon)^2\right)$$
This shows that as the number of individual classifiers $T$ increases, the ensemble error rate decreases exponentially and eventually tends to 0. Note, however, that the derivation assumes the base learners' errors are mutually independent. In real tasks the individual learners are trained to solve the same problem, so they are clearly not independent. How to produce and combine "good but different" learners is therefore the core question of ensemble learning research.
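The exponential drop can be checked numerically. A quick sketch of the exact voting error and the Hoeffding bound above (function names are mine):

```python
import math

def ensemble_error(T, eps):
    """Exact error rate of a majority vote over T independent base
    classifiers with error rate eps: the ensemble errs when at most
    floor(T/2) base classifiers are correct."""
    return sum(math.comb(T, k) * (1 - eps) ** k * eps ** (T - k)
               for k in range(math.floor(T / 2) + 1))

def hoeffding_bound(T, eps):
    """Upper bound exp(-T * (1 - 2*eps)^2 / 2) from Hoeffding's inequality."""
    return math.exp(-T * (1 - 2 * eps) ** 2 / 2)

# With eps = 0.4, the ensemble error shrinks rapidly as T grows:
for T in (1, 11, 51, 101):
    print(T, ensemble_error(T, 0.4), hoeffding_bound(T, 0.4))
```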
Based on how the individual learners are generated, current ensemble learning methods fall roughly into two categories:
- Boosting: a sequential method with strong dependencies among individual learners, which must be generated one after another.
- Bagging and Random Forest: parallel methods without strong dependencies, whose individual learners can be generated simultaneously.
2 Boosting
Boosting algorithms share a common working mechanism: first, train a base learner from the initial training set; then adjust the training-sample distribution according to that learner's performance, so that the samples it misclassified receive more attention later; then train the next base learner on the adjusted distribution. This is repeated until the number of base learners reaches a preset value $T$, and finally the $T$ learners are combined by weighting.
The most famous Boosting algorithm is AdaBoost. It can be derived as an additive model, i.e. a linear combination of base learners

$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$

that minimizes the exponential loss function:

$$\ell_{\exp}(H \mid \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H(x)}\right]$$
The loss is sensible: when $f(x)$ and $H(x)$ agree, their product $f(x)H(x)$ is positive, so the exponent is negative and the loss is small; when they disagree, the product is negative and the loss is large.
If $H(x)$ minimizes the exponential loss, take the partial derivative with respect to $H(x)$:

$$\frac{\partial \ell_{\exp}(H \mid \mathcal{D})}{\partial H(x)} = -e^{-H(x)}\, P(f(x)=1 \mid x) + e^{H(x)}\, P(f(x)=-1 \mid x)$$

Setting the partial derivative to 0 gives:

$$H(x) = \frac{1}{2} \ln \frac{P(f(x)=1 \mid x)}{P(f(x)=-1 \mid x)}$$

Thus:

$$\mathrm{sign}(H(x)) = \underset{y \in \{-1,+1\}}{\arg\max}\, P(f(x)=y \mid x)$$

This means $\mathrm{sign}(H(x))$ attains the Bayes-optimal error rate; in other words, the exponential loss can replace the original 0/1 loss function as an optimization target.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
In AdaBoost, the first base classifier $h_1$ is obtained by applying the base learning algorithm to the initial data distribution; $h_t$ and $\alpha_t$ are then produced iteratively. Once the base classifier $h_t$ has been generated from the distribution $\mathcal{D}_t$, its weight $\alpha_t$ should minimize the exponential loss of $\alpha_t h_t$:

$$\ell_{\exp}(\alpha_t h_t \mid \mathcal{D}_t) = \mathbb{E}_{x \sim \mathcal{D}_t}\left[e^{-f(x)\alpha_t h_t(x)}\right] = e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t$$

where $\epsilon_t = P_{x \sim \mathcal{D}_t}(h_t(x) \neq f(x))$.
Taking the derivative of the exponential loss with respect to $\alpha_t$:

$$\frac{\partial \ell_{\exp}(\alpha_t h_t \mid \mathcal{D}_t)}{\partial \alpha_t} = -e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t$$

Setting it to 0 yields the classifier-weight formula:

$$\alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t}$$
+++
After obtaining $H_{t-1}$, AdaBoost adjusts the sample distribution so that the next base learner $h_t$ can correct some of $H_{t-1}$'s mistakes. The approach is to minimize $\ell_{\exp}(H_{t-1} + \alpha_t h_t \mid \mathcal{D})$; since $\alpha_t$ does not change which $h_t$ is optimal, this simplifies to minimizing:

$$\ell_{\exp}(H_{t-1} + h_t \mid \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_{t-1}(x)}\, e^{-f(x)h_t(x)}\right]$$
Noticing that $f^2(x) = h_t^2(x) = 1$, the Taylor expansion of $e^{-f(x)h_t(x)}$ gives:

$$\ell_{\exp}(H_{t-1} + h_t \mid \mathcal{D}) \simeq \mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_{t-1}(x)}\left(1 - f(x)h_t(x) + \frac{1}{2}\right)\right]$$
So the ideal base learner is:

$$h_t(x) = \underset{h}{\arg\min}\, \ell_{\exp}(H_{t-1} + h \mid \mathcal{D}) = \underset{h}{\arg\max}\, \mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_{t-1}(x)}\, f(x)h(x)\right]$$
Let $\mathcal{D}_t$ denote the distribution

$$\mathcal{D}_t(x) = \frac{\mathcal{D}(x)\, e^{-f(x)H_{t-1}(x)}}{\mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_{t-1}(x)}\right]}$$
Then the ideal $h_t$ minimizes the classification error under the distribution $\mathcal{D}_t$. So the weak classifier is trained on $\mathcal{D}_t$, and its classification error under $\mathcal{D}_t$ should be less than 0.5. Comparing $\mathcal{D}_t$ and $\mathcal{D}_{t+1}$ yields the distribution-update rule $\mathcal{D}_{t+1}(x) = \mathcal{D}_t(x)\, e^{-\alpha_t f(x) h_t(x)} / Z_t$, where $Z_t$ is a normalization factor.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
In summary, from the classifier-weight formula, the distribution-update formula, and the exponential loss function we obtain the complete AdaBoost procedure.
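The procedure can be sketched in Python with threshold stumps as base learners. This is a minimal illustration on a classic ten-point toy set, not the book's pseudocode; all function names are mine:

```python
import numpy as np

def adaboost(X, y, T=10):
    """Minimal AdaBoost with threshold stumps on 1-D data, y in {-1,+1}.
    Returns the classifier weights alpha_t and the stumps (threshold, polarity)."""
    m = len(X)
    D = np.full(m, 1.0 / m)                     # initial distribution D_1
    alphas, stumps = [], []
    for _ in range(T):
        # exhaustively pick the stump h(x) = s * sign(x - thr) with the
        # smallest weighted error under the current distribution D_t
        best = None
        for thr in X:
            for s in (1, -1):
                pred = s * np.where(X > thr, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, thr, s, pred)
        err, thr, s, pred = best
        if err >= 0.5:                          # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)   # classifier-weight formula
        D = D * np.exp(-alpha * y * pred)       # distribution update
        D = D / D.sum()                         # normalize (the Z_t factor)
        alphas.append(alpha)
        stumps.append((thr, s))
    return alphas, stumps

def predict(alphas, stumps, X):
    """Weighted combination H(x) = sum_t alpha_t * h_t(x), then sign."""
    H = sum(a * s * np.where(X > thr, 1, -1)
            for a, (thr, s) in zip(alphas, stumps))
    return np.sign(H)

X = np.array([0., 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
alphas, stumps = adaboost(X, y, T=3)
print((predict(alphas, stumps, X) == y).mean())  # → 1.0
```

No single stump can separate this labeling, yet three boosted stumps classify it perfectly, which is exactly the point of the method.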
Boosting requires the base learner to learn from a specified data distribution. This can be implemented by "re-weighting": in each training round, a new weight is assigned to each training sample according to the sample distribution. For base learning algorithms that cannot accept weighted samples, "re-sampling" can be used instead: in each round, the training set is re-sampled according to the sample distribution, and the base learner is trained on the sample set obtained by re-sampling.
Generally, the two approaches show no significant difference in performance. Note that in each round Boosting must check whether the current base learner satisfies the basic condition (e.g. whether it is better than random guessing); if the condition is not met, the learner is discarded and the learning process stops. In that case the preset number of rounds $T$ may be far from reached, and the final ensemble, containing only a few base learners, may perform poorly. With re-sampling, however, a "restart" is possible: after the unqualified learner is discarded, the training samples are re-sampled according to the current distribution and a new base learner is trained on the new sample, so the process can continue until all $T$ rounds are completed.
From the perspective of bias-variance decomposition, Boosting focuses on reducing bias, which is why it can build a strong ensemble from learners with quite weak generalization performance.
3 Bagging and Random Forests
To obtain an ensemble with strong generalization performance, its individual learners should be as independent of one another as possible. A practical approach is to train different base learners on overlapping subsets of the samples.
3.1 Bagging
Bagging is the best-known representative of parallel ensemble methods. It is based directly on bootstrap sampling: each time, one sample is drawn at random from the initial data set and then put back, so it may be drawn again. About 63.2% of the samples in the initial training set appear in a bootstrap sample of the same size.
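The 63.2% figure comes from $(1 - 1/m)^m \to 1/e$, so the expected coverage is $1 - 1/e \approx 0.632$. A quick simulation (function name is mine):

```python
import random

def bootstrap_coverage(m, trials=200, seed=0):
    """Fraction of the m original samples that appear at least once in a
    bootstrap sample of size m, averaged over several trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        picked = {rng.randrange(m) for _ in range(m)}  # draw m with replacement
        total += len(picked) / m
    return total / trials

# converges to 1 - 1/e ≈ 0.632 as m grows
print(bootstrap_coverage(1000))
```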
In this way we can draw $T$ sample sets, each containing $m$ training samples, and train a base learner on each. Bagging typically combines the base learners by simple voting (classification) or simple averaging (regression). From the perspective of bias-variance decomposition, Bagging focuses on reducing variance, so it is particularly effective on learners that are sensitive to sample perturbation, such as unpruned decision trees and neural networks.
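A minimal Bagging sketch, using a 1-nearest-neighbour rule as the base learner and simple voting to combine (the toy data and all names are mine):

```python
import random
from collections import Counter

def bagging_train(data, T, base_learner, seed=0):
    """Train T base learners, each on a bootstrap sample of `data`
    (a list of (x, y) pairs)."""
    rng = random.Random(seed)
    m = len(data)
    return [base_learner([data[rng.randrange(m)] for _ in range(m)])
            for _ in range(T)]

def bagging_predict(learners, x):
    """Simple voting over the base learners' predictions."""
    return Counter(h(x) for h in learners).most_common(1)[0][0]

def one_nn(sample):
    """Base learner: 1-NN rule fit on the bootstrap sample."""
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

data = [(0.0, "neg"), (1.0, "neg"), (2.0, "neg"),
        (3.0, "pos"), (4.0, "pos"), (5.0, "pos")]
learners = bagging_train(data, T=11, base_learner=one_nn)
print(bagging_predict(learners, 4.2), bagging_predict(learners, 0.8))
```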
3.2 Random forest
Random Forest is an extended variant of Bagging: on top of a Bagging ensemble built from decision-tree learners, it further introduces random attribute selection into the training of each decision tree.
Specifically, a traditional decision tree selects the optimal attribute from the full attribute set of the current node. In a random forest, for each node of a base decision tree, a subset of $k$ attributes is first randomly selected from that node's attribute set, and the optimal attribute is then chosen from this subset for splitting. The parameter $k$ controls how much randomness is introduced; $k = \log_2 d$ is generally recommended, where $d$ is the number of attributes.
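The per-node attribute sampling can be sketched as follows (the function name and the rounding choice are mine; the book only specifies $k = \log_2 d$):

```python
import math
import random

def random_attribute_subset(d, seed=None):
    """Pick k = log2(d) candidate attributes (rounded, at least 1) out of d,
    as done at each node of a random-forest tree before choosing the split."""
    k = max(1, round(math.log2(d)))
    return sorted(random.Random(seed).sample(range(d), k))

print(random_attribute_subset(64, seed=0))  # 6 of the 64 attribute indices
```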
Random forests are simple, easy to implement, and computationally cheap, yet they show surprisingly strong performance on many real tasks and have been called "a method representative of the state of the art in ensemble learning". Although random forest makes only a small change to Bagging, the difference matters: in Bagging, the "diversity" of base learners comes only from sample perturbation (sampling the initial training set), whereas in random forest it comes from both sample perturbation and attribute perturbation. This increases the difference between individual learners, so the generalization performance of the final ensemble can be further improved.
4 Combination Strategy
Suppose the ensemble contains $T$ learners $\{h_1, h_2, \dots, h_T\}$, where the output of $h_i$ on example $x$ is $h_i(x)$.
4.1 Averaging
Simple averaging:

$$H(x) = \frac{1}{T}\sum_{i=1}^{T} h_i(x)$$

Weighted averaging:

$$H(x) = \sum_{i=1}^{T} w_i h_i(x), \qquad w_i \ge 0,\quad \sum_{i=1}^{T} w_i = 1$$
4.2 Voting
Majority voting: a label is predicted only if it receives more than half of the votes; otherwise the prediction is rejected. This ability to refuse a prediction makes it a good mechanism for learning tasks with high reliability requirements.
Plurality voting: predict the label receiving the most votes; if several labels tie, one of them is chosen arbitrarily.
Weighted voting: each classifier's vote is multiplied by its weight, and the label with the largest total weighted vote is predicted.
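The three voting rules can be sketched directly (function names are mine):

```python
from collections import Counter

def plurality_vote(preds):
    """Label with the most votes; ties broken arbitrarily."""
    return Counter(preds).most_common(1)[0][0]

def majority_vote(preds):
    """Label only if it wins more than half the votes, else reject (None)."""
    label, count = Counter(preds).most_common(1)[0]
    return label if count > len(preds) / 2 else None

def weighted_vote(preds, weights):
    """Sum the weights of each label's supporters; pick the heaviest label."""
    score = {}
    for p, w in zip(preds, weights):
        score[p] = score.get(p, 0.0) + w
    return max(score, key=score.get)

votes = ["A", "B", "A", "C", "A"]
print(plurality_vote(votes), majority_vote(votes))      # A A
print(weighted_vote(votes, [0.1, 0.5, 0.1, 0.2, 0.1]))  # B
```

Note that weighting can overturn a plurality: label A has three votes, but B's single heavily-weighted vote wins.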
4.3 Learning methods
When plenty of training data is available, a more powerful combination strategy is "learning to combine": the combination itself is carried out by another learner. Stacking is the typical representative. Here the individual learners are called primary learners, and the learner used for the combination is called the secondary learner or meta-learner.
Stacking first trains the primary learners on the initial data set, then "generates" a new data set for training the secondary learner: in this new data set, the outputs of the primary learners are used as the input features, while the labels of the original samples remain the labels.
Studies have shown that it works well to use the class probabilities output by the primary learners as the input attributes of the secondary learner, with multi-response linear regression (MLR) as the secondary learning algorithm.
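A minimal Stacking sketch on a toy 2-D task, where each primary learner sees only one coordinate and outputs a class probability, and plain least squares stands in for MLR. All names and settings here are illustrative, not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy task: y = sign(x1 + x2); each primary learner only sees one coordinate.
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sign(X[:, 0] + X[:, 1])

p1 = sigmoid(5 * X[:, 0])   # primary learner 1: estimated P(y=+1) from x1
p2 = sigmoid(5 * X[:, 1])   # primary learner 2: estimated P(y=+1) from x2

# New data set: primary-learner probabilities (plus a bias column) become
# the inputs; the original labels stay the targets. In practice these
# predictions should come from a hold-out / cross-validation split.
Z = np.column_stack([p1, p2, np.ones(len(y))])

# Secondary learner: least-squares fit, a stand-in for MLR.
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
stacked = np.sign(Z @ w)

acc1 = np.mean(np.sign(p1 - 0.5) == y)       # a single primary learner ≈ 0.75
print("primary:", acc1, "stacked:", np.mean(stacked == y))
```

Neither primary learner can solve the task alone, but the secondary learner, combining their probability outputs, gets close to perfect.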
5 Diversity
5.1 Error-Divergence Decomposition
Suppose we use individual learners $h_1, h_2, \dots, h_T$ to complete a regression task $f: \mathbb{R}^d \mapsto \mathbb{R}$, combined into an ensemble $H$ by weighted averaging. For example $x$, the "divergence" of learner $h_i$ is defined as:

$$A(h_i \mid x) = \left(h_i(x) - H(x)\right)^2$$

Then the divergence of the ensemble is:

$$\bar{A}(h \mid x) = \sum_{i=1}^{T} w_i A(h_i \mid x) = \sum_{i=1}^{T} w_i \left(h_i(x) - H(x)\right)^2$$

Divergence expresses the inconsistency of the individual learners on sample $x$, i.e. it reflects their diversity to some extent. The squared errors of an individual learner and of the ensemble are:

$$E(h_i \mid x) = \left(f(x) - h_i(x)\right)^2, \qquad E(H \mid x) = \left(f(x) - H(x)\right)^2$$
Let $\bar{E}(h \mid x) = \sum_{i=1}^{T} w_i E(h_i \mid x)$ denote the weighted average of the individual errors; then:

$$\bar{A}(h \mid x) = \bar{E}(h \mid x) - E(H \mid x)$$
Let $p(x)$ be the probability density of the samples; integrating over the whole sample space:

$$\sum_{i=1}^{T} w_i \int A(h_i \mid x)\, p(x)\, dx = \sum_{i=1}^{T} w_i \int E(h_i \mid x)\, p(x)\, dx - \int E(H \mid x)\, p(x)\, dx$$
Similarly, the generalization error and the divergence of individual learner $h_i$ over the whole sample space are:

$$E_i = \int E(h_i \mid x)\, p(x)\, dx, \qquad A_i = \int A(h_i \mid x)\, p(x)\, dx$$
The generalization error of the ensemble is:

$$E = \int E(H \mid x)\, p(x)\, dx$$
Let $\bar{E} = \sum_{i=1}^{T} w_i E_i$ and $\bar{A} = \sum_{i=1}^{T} w_i A_i$ denote the weighted average error and the weighted average divergence of the individual learners. Substituting into the equation above yields the error-divergence decomposition:

$$E = \bar{E} - \bar{A}$$
The expression says: the more accurate the individual learners and the greater their diversity, the better the ensemble. However, this decomposition cannot be optimized directly: $\bar{A}$ is not a diversity measure that can be operated on directly, and it can only be estimated after the ensemble has been constructed. Moreover, the derivation above applies only to regression, not to classification.
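The identity $E = \bar{E} - \bar{A}$ holds exactly for any weighted-average ensemble with convex weights, which is easy to verify numerically. A small sketch with synthetic regression data (all settings are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Regression ensemble: weighted average of T learners on n sample points.
T, n = 5, 1000
f = rng.normal(size=n)                       # ground-truth values
h = f + rng.normal(scale=0.5, size=(T, n))   # individual predictions
w = np.full(T, 1.0 / T)                      # convex weights, sum to 1

H = w @ h                                    # ensemble prediction
E_i = np.mean((h - f) ** 2, axis=1)          # individual generalization errors
A_i = np.mean((h - H) ** 2, axis=1)          # individual divergences

E_bar = w @ E_i                              # weighted average error
A_bar = w @ A_i                              # weighted average divergence
E_ens = np.mean((H - f) ** 2)                # ensemble error

print(E_ens, E_bar - A_bar)                  # the two values coincide
```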
5.2 Diversity Measures
Diversity measures estimate the diversity of individual learners. A common practice is to consider the pairwise similarity/dissimilarity of individual classifiers.
Given a data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ for a binary classification task with $y_i \in \{-1, +1\}$, the predictions of classifiers $h_i$ and $h_j$ give the contingency table:

|            | $h_i = +1$ | $h_i = -1$ |
| ---------- | ---------- | ---------- |
| $h_j = +1$ | $a$        | $c$        |
| $h_j = -1$ | $b$        | $d$        |

Here $a$ is the number of samples that both classifiers predict as $+1$ (similarly for $b$, $c$, $d$), and $a + b + c + d = m$. Commonly used diversity measures include:
- Disagreement measure:

$$dis_{ij} = \frac{b + c}{m}$$

- Correlation coefficient:

$$\rho_{ij} = \frac{ad - bc}{\sqrt{(a+b)(a+c)(c+d)(b+d)}}$$

- Q-statistic:

$$Q_{ij} = \frac{ad - bc}{ad + bc}$$

- $\kappa$-statistic:

$$\kappa = \frac{p_1 - p_2}{1 - p_2}$$

where $p_1$ is the probability that the two classifiers agree and $p_2$ is the probability that they agree by chance:

$$p_1 = \frac{a + d}{m}, \qquad p_2 = \frac{(a+b)(a+c) + (c+d)(b+d)}{m^2}$$
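All four measures follow directly from the contingency-table entries; a small helper (function name is mine):

```python
def diversity_measures(a, b, c, d):
    """Pairwise diversity measures from the 2x2 contingency table of two
    classifiers: a = both +1, b = h_i=+1 & h_j=-1, c = h_i=-1 & h_j=+1,
    d = both -1."""
    m = a + b + c + d
    dis = (b + c) / m                                     # disagreement measure
    rho = (a * d - b * c) / (((a + b) * (a + c) * (c + d) * (b + d)) ** 0.5)
    Q = (a * d - b * c) / (a * d + b * c)                 # Q-statistic
    p1 = (a + d) / m                                      # observed agreement
    p2 = ((a + b) * (a + c) + (c + d) * (b + d)) / m**2   # chance agreement
    kappa = (p1 - p2) / (1 - p2)
    return dis, rho, Q, kappa

# two classifiers that agree on 80 of 100 samples
print(diversity_measures(40, 10, 10, 40))
```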
5.3 Enhancing Diversity
- Data sample perturbation: e.g. bootstrap sampling in Bagging, sequential sampling in AdaBoost.
- Input attribute perturbation: e.g. random forest, which randomly selects $k$ candidate attributes at each node.
- Output perturbation: e.g. the flipping method (randomly changing the labels of some training samples), the output modulation method (converting classification outputs into regression outputs), and so on.
- Algorithm parameter perturbation: e.g. the number of hidden-layer neurons in a neural network.