For the time being, however, it is instructive to continue with the current approach and consider how it might be applied in practice to data sets of limited size where we may wish to use relatively complex and flexible models. One technique that is often used to control over-fitting in such cases is regularization, which involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of the squares of all the coefficients, leading to a modified error function of the form


$$\widetilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2 \tag{1.4}$$

Here $\|\mathbf{w}\|^2 \equiv \mathbf{w}^{\mathrm{T}}\mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2$, and the coefficient $\lambda$ controls the relative importance of the regularization term compared with the sum-of-squares error term. Note that the coefficient $w_0$ is usually omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable; alternatively it may be included but given its own regularization coefficient (this topic is covered in more detail in Section 5.5.1, p. 541). Again, the error function in (1.4) can be minimized exactly in closed form. Such techniques are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression. In the context of neural networks, this approach is known as weight decay.
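Because (1.4) is quadratic in $\mathbf{w}$, its minimizer follows directly from the normal equations $(\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})\,\mathbf{w}^* = \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$, where $\boldsymbol{\Phi}$ is the matrix of polynomial features. The sketch below illustrates this; the helper names and the synthetic $\sin(2\pi x)$ data are illustrative assumptions, not code from the text.

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix with columns x^0, x^1, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def fit_regularized(x, t, M, lam):
    """Closed-form minimizer of the regularized error (1.4):
    w* = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    Phi = design_matrix(x, M)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

# Illustrative data: noisy samples of sin(2*pi*x), as in the running example.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=x_train.shape)

w_star = fit_regularized(x_train, t_train, M=9, lam=np.exp(-18))
print(w_star)
```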

Figure 1.7 $M=9$ polynomials fitted using the regularized error function (1.4) for $\ln\lambda=-18$ and $\ln\lambda=0$. The case of no regularization, i.e. $\lambda=0$, corresponds to $\ln\lambda=-\infty$ and is shown at the bottom right of Figure 1.4.

Figure 1.7 shows the result of fitting a polynomial of order $M=9$ to the same data set as before, but now using the regularized error function given by (1.4). We see that, for $\ln\lambda=-18$, over-fitting is suppressed and we obtain a much closer representation of the underlying function $\sin(2\pi x)$. If, however, we use too large a value of $\lambda$, we again obtain a poor fit, as shown in Figure 1.7 for $\ln\lambda=0$. Table 1.2 shows the corresponding coefficients of the fitted polynomials, indicating that regularization has the desired effect of reducing the magnitude of the coefficients.

Table 1.2 Coefficients $\mathbf{w}^*$ of the $M=9$ polynomial for different values of the regularization parameter $\lambda$. Note that $\ln\lambda=-\infty$ corresponds to the model without regularization, i.e. the graph at the bottom right of Figure 1.4. We see that the typical magnitude of the coefficients decreases as the value of $\lambda$ increases.
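The shrinkage behaviour summarized in Table 1.2 can be reproduced with a short sketch such as the following, which rewrites the minimization of (1.4) as an ordinary least-squares problem on an augmented design matrix so that very small $\lambda$ values remain numerically stable; the data set and the particular $\ln\lambda$ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

M = 9
Phi = np.vander(x, M + 1, increasing=True)  # columns x^0 ... x^M

# Minimizing (1.4) is equivalent to ordinary least squares on an augmented
# system: stack sqrt(lambda)*I under Phi and zeros under t.
for ln_lam in (-np.inf, -18.0, 0.0):
    lam = np.exp(ln_lam)                    # exp(-inf) gives 0.0, i.e. no regularization
    A = np.vstack([Phi, np.sqrt(lam) * np.eye(M + 1)])
    b = np.concatenate([t, np.zeros(M + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(f"ln lambda = {ln_lam:6}:  max |w_j| = {np.abs(w).max():.3g}")
```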

The effect of the regularization term on the generalization error can be seen by plotting the root-mean-square error (1.3) for both the training set and the test set against $\ln\lambda$, as shown in Figure 1.8. We see that, in effect, $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting.
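A sweep of the kind behind Figure 1.8 might look as follows; the training and test sets and the grid of $\ln\lambda$ values are illustrative assumptions, and the augmented-system solver is just one numerically stable way to minimize (1.4).

```python
import numpy as np

M = 9
phi = lambda x: np.vander(x, M + 1, increasing=True)   # columns x^0 ... x^M

rng = np.random.default_rng(1)
def make_data(n):
    """Illustrative data: noisy samples of sin(2*pi*x)."""
    xs = np.sort(rng.uniform(0, 1, n))
    return xs, np.sin(2 * np.pi * xs) + rng.normal(scale=0.3, size=n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def fit(lam):
    """Minimize (1.4) via least squares on the augmented system
    [Phi; sqrt(lam) I] w = [t; 0], which stays stable for tiny lambda."""
    A = np.vstack([phi(x_train), np.sqrt(lam) * np.eye(M + 1)])
    b = np.concatenate([t_train, np.zeros(M + 1)])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def e_rms(w, x, t):
    """Root-mean-square error (1.3): sqrt(2 E(w) / N)."""
    return np.sqrt(np.mean((phi(x) @ w - t) ** 2))

for ln_lam in range(-35, 1, 5):
    w = fit(np.exp(ln_lam))
    print(f"ln lambda = {ln_lam:4d}   "
          f"train E_RMS = {e_rms(w, x_train, t_train):.3f}   "
          f"test E_RMS = {e_rms(w, x_test, t_test):.3f}")
```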

The issue of model complexity is an important one and will be discussed at length in Section 1.3. Here we simply note that, if we were trying to solve a practical problem using this approach of minimizing an error function, we would have to find a way to determine a suitable value for the model complexity. The results above suggest a simple way of achieving this, namely by partitioning the available data into a training set, used to determine the coefficients $\mathbf{w}$, and a separate validation set (also called a hold-out set), used to optimize the model complexity (either $M$ or $\lambda$). In many cases, however, this will prove to be too wasteful of valuable training data, and we have to seek more sophisticated approaches.
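A minimal sketch of this hold-out procedure follows, assuming the same polynomial model and synthetic $\sin(2\pi x)$ data as in the earlier sketches; the split sizes and the candidate $\lambda$ grid are arbitrary illustrative choices.

```python
import numpy as np

M = 9
phi = lambda x: np.vander(x, M + 1, increasing=True)

# Illustrative data set, split into a training part and a hold-out
# (validation) part used only for choosing the regularization strength.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
perm = rng.permutation(len(x))
train, val = perm[:20], perm[20:]

def fit(lam, x_tr, t_tr):
    """Closed-form minimizer of the regularized error (1.4)."""
    Phi = phi(x_tr)
    return np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t_tr)

def e_rms(w, x_ev, t_ev):
    """Root-mean-square error (1.3) on the given data."""
    return np.sqrt(np.mean((phi(x_ev) @ w - t_ev) ** 2))

# Pick the candidate lambda with the smallest validation error.
candidates = [np.exp(k) for k in range(-30, 1, 3)]
best_lam = min(candidates,
               key=lambda lam: e_rms(fit(lam, x[train], t[train]),
                                     x[val], t[val]))
print("selected lambda:", best_lam)
```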

So far, our discussion of polynomial curve fitting has relied largely on intuition. We now seek a more principled approach to the problem of pattern recognition by turning to a discussion of probability theory. As well as providing the foundation for nearly all of the subsequent developments in this book, it will also give us some important insights into the concepts we have introduced in the context of polynomial curve fitting, and will allow us to extend these to more complex situations.

Figure 1.8 Root-mean-square error (1.3) versus $\ln\lambda$ for the $M=9$ polynomial.