Chapter 7 Boosting

The ensemble approach is a common machine learning technique for combining several predictors in order to create a more accurate one. This chapter studies an important family of ensemble methods known as boosting, and more specifically the AdaBoost algorithm. This algorithm has been shown to be very effective in practice in some scenarios and is based on a rich theoretical analysis. We first introduce AdaBoost, show how it can rapidly reduce the empirical error as a function of the number of boosting rounds, and point out its relationship to some well-known algorithms. Next, we present a theoretical analysis of AdaBoost's generalization properties based on the VC-dimension of its hypothesis set, and then a finer analysis based on the notion of margin. The margin theory developed in this context can be applied to other similar ensemble algorithms. A game-theoretic interpretation of AdaBoost further helps to analyze its properties and reveals the equivalence between the weak learning assumption and a separability condition.

7.1 Introduction

For a non-trivial learning task, it is often difficult to directly devise an accurate algorithm satisfying the strong PAC-learning requirements of chapter 2. However, there is often hope of finding simple predictors guaranteed to perform only slightly better than random. A formal definition of such weak learners is given below. As in the PAC-learning chapter, we let $n$ be a number such that the computational cost of representing any element $x \in \mathcal{X}$ is at most $O(n)$, and we denote by $\mathrm{size}(c)$ the maximal cost of the computational representation of $c \in C$.

Definition 7.1 (Weak learning)

A concept class $C$ is said to be weakly PAC-learnable if there exists an algorithm $\mathcal{A}$, $\gamma > 0$, and a polynomial function $\mathrm{poly}(\cdot,\cdot,\cdot)$ such that for any $\delta > 0$, for all distributions $\mathcal{D}$ on $\mathcal{X}$, and for any target concept $c \in C$, the following holds for any sample size $m \geq \mathrm{poly}(1/\delta, n, \mathrm{size}(c))$:

$$\mathbb{P}_{S \sim \mathcal{D}^m}\Big[R(h_S) \leq \tfrac{1}{2} - \gamma\Big] \geq 1 - \delta, \qquad (7.1)$$

AdaBoost($S = ((x_1, y_1), \ldots, (x_m, y_m))$)
 1   for $i \leftarrow 1$ to $m$ do
 2       $\mathcal{D}_1(i) \leftarrow \frac{1}{m}$
 3   for $t \leftarrow 1$ to $T$ do
 4       $h_t \leftarrow$ base classifier in $\mathcal{H}$ with small error $\epsilon_t = \mathbb{P}_{i \sim \mathcal{D}_t}[h_t(x_i) \neq y_i]$
 5       $\alpha_t \leftarrow \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
 6       $Z_t \leftarrow 2[\epsilon_t(1 - \epsilon_t)]^{\frac{1}{2}}$   ⊳ normalization factor
 7       for $i \leftarrow 1$ to $m$ do
 8           $\mathcal{D}_{t+1}(i) \leftarrow \frac{\mathcal{D}_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
 9   $f \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
10   return $f$

Figure 7.1  The AdaBoost algorithm for base classifiers $\mathcal{H} \subseteq \{-1, +1\}^{\mathcal{X}}$.

where $h_S$ is the hypothesis returned by algorithm $\mathcal{A}$ when trained on sample $S$. When such an algorithm $\mathcal{A}$ exists, it is called a weak learning algorithm for $C$, or a weak learner. The hypotheses returned by a weak learning algorithm are called base classifiers. The key idea behind boosting techniques is to use a weak learning algorithm to build a strong learner, that is, an accurate PAC-learning algorithm. To do so, boosting techniques use an ensemble approach: they combine the different base classifiers returned by a weak learner to create a more accurate predictor. But which base classifiers should be used, and how should they be combined? The next section addresses these questions by describing in detail one of the most prevalent and successful boosting algorithms, AdaBoost.
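To make the notion of a base classifier concrete, here is a minimal sketch (not from the original text) of one common choice of weak learner operating on weighted data: a decision stump, that is, a threshold on a single feature, which corresponds to the axis-aligned hyperplanes used as base classifiers in the example of figure 7.2. The NumPy-based interface and the function names train_stump and stump_predict are our own assumptions.

    import numpy as np

    def stump_predict(X, feature, threshold, polarity):
        """Decision stump: +1 on one side of a threshold on one feature, -1 on the other."""
        return polarity * np.where(X[:, feature] <= threshold, 1.0, -1.0)

    def train_stump(X, y, D):
        """Return the stump with the smallest D-weighted training error.

        X: (m, d) array of points; y: (m,) array of labels in {-1, +1};
        D: (m,) array of non-negative weights summing to one.
        """
        m, d = X.shape
        best, best_err = None, np.inf
        for feature in range(d):
            for threshold in np.unique(X[:, feature]):
                for polarity in (1.0, -1.0):
                    pred = stump_predict(X, feature, threshold, polarity)
                    err = D[pred != y].sum()   # weighted error: P_{i~D}[h(x_i) != y_i]
                    if err < best_err:
                        best_err, best = err, (feature, threshold, polarity)
        return best, best_err

Such a learner returns, for any distribution $\mathcal{D}$ over the training indices, a hypothesis minimizing the $\mathcal{D}$-weighted error, which is exactly what line 4 of figure 7.1 requires when $\mathcal{H}$ is the set of stumps.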

We use $\mathcal{H}$ to denote the hypothesis set from which the base classifiers are selected, which we sometimes refer to as the base classifier set. Figure 7.1 gives the pseudocode of AdaBoost in the case where the base classifiers are functions mapping from $\mathcal{X}$ to $\{-1, +1\}$, thus $\mathcal{H} \subseteq \{-1, +1\}^{\mathcal{X}}$; figure 7.2 illustrates the algorithm on an example.

Figure 7.2  An example of AdaBoost with axis-aligned hyperplanes as base classifiers. (a) The top row shows the decision boundary at each boosting round. The bottom row shows how the weights are updated at each round, with incorrectly (resp., correctly) classified points given increased (resp., decreased) weights. (b) Visualization of the final classifier, constructed as a non-negative linear combination of the base classifiers.

The algorithm takes as input a labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, with $(x_i, y_i) \in \mathcal{X} \times \{-1, +1\}$ for all $i \in [m]$, and maintains a distribution over the indices $\{1, \ldots, m\}$. Initially (lines 1-2), the distribution is uniform ($\mathcal{D}_1$). At each round of boosting, that is at each iteration $t \in [T]$ of the loop in lines 3-8, a new base classifier $h_t \in \mathcal{H}$ is selected that minimizes the error on the training sample weighted by the distribution $\mathcal{D}_t$:

$$h_t \in \underset{h \in \mathcal{H}}{\operatorname{argmin}}\ \mathbb{P}_{i \sim \mathcal{D}_t}[h(x_i) \neq y_i] = \underset{h \in \mathcal{H}}{\operatorname{argmin}} \sum_{i=1}^{m} \mathcal{D}_t(i)\, 1_{h(x_i) \neq y_i}.$$

$Z_t$ is simply a normalization factor ensuring that the weights $\mathcal{D}_{t+1}(i)$ sum to one. The precise reason for the definition of the coefficient $\alpha_t$ will become clear later.

For now, observe that if the error $\epsilon_t$ of the base classifier at round $t$ is less than $\frac{1}{2}$, then $\frac{1-\epsilon_t}{\epsilon_t} > 1$ and $\alpha_t$ is positive ($\alpha_t > 0$). Thus, the new distribution $\mathcal{D}_{t+1}$ is defined from $\mathcal{D}_t$ by substantially increasing the weight of index $i$ if the point $x_i$ is incorrectly classified ($y_i h_t(x_i) < 0$) and, on the contrary, decreasing it if $x_i$ is correctly classified. This has the effect of focusing more, at the next round of boosting, on the points that were misclassified and less on those correctly classified.
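As a brief worked example with illustrative numbers of our own choosing, suppose $\epsilon_t = 0.25$. Then

$$\alpha_t = \tfrac{1}{2}\log\tfrac{1-\epsilon_t}{\epsilon_t} = \tfrac{1}{2}\log 3 \approx 0.549, \qquad Z_t = 2[\epsilon_t(1-\epsilon_t)]^{\frac{1}{2}} = 2\sqrt{0.1875} \approx 0.866,$$

so, before normalization by $Z_t$, the weight of every misclassified point is multiplied by $e^{\alpha_t} = \sqrt{3} \approx 1.73$, while the weight of every correctly classified point is multiplied by $e^{-\alpha_t} \approx 0.577$.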

After $T$ rounds of boosting, AdaBoost returns a classifier based on the sign of the function $f$, which is a non-negative linear combination of the base classifiers $h_t$. The weight $\alpha_t$ assigned to $h_t$ in that sum is a logarithmic function of the ratio of the accuracy $1 - \epsilon_t$ to the error $\epsilon_t$ of $h_t$. Thus, more accurate base classifiers are assigned a larger weight in that sum. Figure 7.2 illustrates the AdaBoost algorithm; the size of the points represents the distribution weight assigned to them at each boosting round. For any $t \in [T]$, we will denote by $f_t$ the linear combination of the base classifiers after $t$ rounds of boosting: $f_t = \sum_{s=1}^{t} \alpha_s h_s$. In particular, we have $f_T = f$. The distribution $\mathcal{D}_{t+1}$ can be expressed in terms of $f_t$ and the normalization factors $Z_s$, $s \in [t]$, as follows:

$$\forall i \in [m], \quad \mathcal{D}_{t+1}(i) = \frac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}. \qquad (7.2)$$

We will use this identity several times in the proofs of the following sections. It can be shown directly by repeatedly expanding the definition of the distribution:

$$\mathcal{D}_{t+1}(i) = \frac{\mathcal{D}_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{\mathcal{D}_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \cdots = \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{m \prod_{s=1}^{t} Z_s}.$$
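To tie the pieces together, the following is a minimal sketch of the pseudocode of figure 7.1 (our own, under the same NumPy assumptions as before), using the hypothetical train_stump and stump_predict helpers defined earlier as the weak learner; it also checks identity (7.2) numerically on the final distribution.

    import numpy as np

    def adaboost(X, y, T):
        """Minimal sketch of the AdaBoost pseudocode (figure 7.1) with decision stumps."""
        m = X.shape[0]
        D = np.full(m, 1.0 / m)                           # lines 1-2: uniform initial D_1
        stumps, alphas, Zs = [], [], []
        for t in range(T):
            (feat, thr, pol), eps = train_stump(X, y, D)  # line 4: weak learner on D_t
            eps = max(eps, 1e-12)                         # guard against eps = 0
            alpha = 0.5 * np.log((1.0 - eps) / eps)       # line 5
            pred = stump_predict(X, feat, thr, pol)
            D = D * np.exp(-alpha * y * pred)             # line 8: numerator of the update
            Z = D.sum()                                   # line 6: equals 2*sqrt(eps*(1-eps)) for this alpha
            D = D / Z                                     # line 8: normalize to obtain D_{t+1}
            stumps.append((feat, thr, pol)); alphas.append(alpha); Zs.append(Z)
        def f(X_new):                                     # line 9: f = sum_t alpha_t h_t
            return sum(a * stump_predict(X_new, ft, th, pl)
                       for a, (ft, th, pl) in zip(alphas, stumps))
        # identity (7.2): D_{T+1}(i) = exp(-y_i f_T(x_i)) / (m * prod_s Z_s)
        assert np.allclose(D, np.exp(-y * f(X)) / (m * np.prod(Zs)))
        return f

The returned $f$ is the real-valued combination; the classifier returned by AdaBoost is its sign, e.g. np.sign(f(X_test)).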

The AdaBoost algorithm can be generalized in several ways:

Instead of the hypothesis with minimal weighted error, $h_t$ may be the base classifier returned by a weak learning algorithm trained on the sample weighted by $\mathcal{D}_t$;

The range of the base classifiers could be $[-1, +1]$, or, more generally, a bounded subset of $\mathbb{R}$;

The coefficients $\alpha_t$ can be different and may not even admit a closed form. In general, they are chosen to minimize an upper bound on the empirical error, as discussed in the next section. Of course, in this general case the hypotheses $h_t$ are not binary classifiers, but their sign can define the label, and their magnitude can be interpreted as a measure of confidence.

For the remainder of this chapter, the range of the base classifiers in $\mathcal{H}$ will be assumed to be included in $[-1, +1]$. We now further analyze the properties of AdaBoost and discuss its typical use in practice.
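In practice, one rarely implements this generalized AdaBoost from scratch. As a brief illustration (assuming scikit-learn is available, with an arbitrary synthetic dataset and parameter values of our own choosing), its AdaBoostClassifier implements boosting with shallow decision trees, typically depth-one stumps, as the default base classifiers.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    # A synthetic binary classification task, purely for illustration.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = AdaBoostClassifier(n_estimators=100)   # n_estimators plays the role of T
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))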

7.2.1 Bound on the empirical error

We first show that the empirical error of AdaBoost decreases exponentially fast as a function of the number of boosting rounds.

Theorem 7.2

The empirical error of the classifier returned by AdaBoost verifies:

$$\hat{R}_S(f) \leq \exp\Big[-2 \sum_{t=1}^{T} \Big(\frac{1}{2} - \epsilon_t\Big)^2\Big]. \qquad (7.3)$$

Furthermore, if for all $t \in [T]$, $\gamma \leq \big(\frac{1}{2} - \epsilon_t\big)$, then $\hat{R}_S(f) \leq \exp(-2\gamma^2 T)$.
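Before turning to the proof, here is a quick numerical illustration with arbitrary values of our own choosing: if the weak learner achieves an edge of $\gamma = 0.1$ at every round, then after $T = 200$ rounds the bound gives

$$\hat{R}_S(f) \leq \exp(-2 \cdot (0.1)^2 \cdot 200) = e^{-4} \approx 0.018,$$

so fewer than 2% of the training points are misclassified; moreover, since the empirical error is a multiple of $\frac{1}{m}$, it is exactly zero as soon as the bound drops below $\frac{1}{m}$.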

Proof: Using the general inequality $1_{u \leq 0} \leq \exp(-u)$, valid for all $u \in \mathbb{R}$, and identity (7.2), we can write:

$$\hat{R}_S(f) = \frac{1}{m}\sum_{i=1}^{m}$$