3.1. Model selection, underfitting and overfitting

If you’ve ever changed the model structure or hyperparameters in an experiment, you may have noticed that a model that is more accurate on the training data set is not necessarily more accurate on the test data set. Why is that?

3.1.1. Training error and generalization error

Before explaining the above phenomena, we need to distinguish between training error and generalization error. Generally speaking, the former refers to the error of the model on the training data set, while the latter refers to the expected error of the model on any test data sample, and is often approximated by the error on the test data set. The training error and generalization error can be calculated using the previously introduced loss function, such as the square loss function used for linear regression and the cross entropy loss function used for Softmax regression.

Let’s use the college entrance examination as an analogy to explain training error and generalization error. The training error can be thought of as the error rate on practice questions from previous years’ exams (the training questions), while the generalization error can be approximated by the error rate when actually taking the college entrance examination (the test questions). Suppose both the training questions and the test questions are randomly sampled from an unknown, large question bank that follows the same syllabus. If an elementary school student who has not studied the high school material answers both sets, the error rates on the training and test questions are likely to be similar. However, if a third-year high school candidate practices the training questions repeatedly, an error rate of 0 on the training questions does not mean the result on the real examination will be the same.

In machine learning, we usually assume that each sample in the training data set (the training questions) and the test data set (the test questions) is generated independently from the same probability distribution. Under this independent and identically distributed (i.i.d.) assumption, for any given machine learning model (with fixed parameters), its expected training error and generalization error are equal. For example, if we set the model parameters to random values (the elementary school student), the training error and generalization error will be very close. However, as we saw in the previous sections, the parameters of a model are learned by training on the training data set, and they are chosen to minimize the training error. Therefore, the expected training error is less than or equal to the generalization error; that is, in general, the parameters learned from the training data set make the model perform at least as well on the training data set as on the test data set. Because the generalization error cannot be estimated from the training error, blindly reducing the training error does not necessarily reduce the generalization error.
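
As a quick illustration of the point that a model whose parameters ignore the training data has similar training and test errors, here is a minimal sketch using the Gluon interface imported later in this section; the data and variable names are purely illustrative.

from mxnet import nd
from mxnet.gluon import loss as gloss, nn

# 100 training and 100 test samples drawn from the same distribution
X_train = nd.random.normal(shape=(100, 1))
X_test = nd.random.normal(shape=(100, 1))
y_train, y_test = 2 * X_train + 1, 2 * X_test + 1

net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize()  # random parameters, no training at all
loss = gloss.L2Loss()

# the two errors should be close, because the parameters never saw the training data
print('train error:', loss(net(X_train), y_train).mean().asscalar())
print('test error:', loss(net(X_test), y_test).mean().asscalar())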

Machine learning models should focus on reducing the generalization error.

3.1.2. Model selection

In machine learning, we often need to evaluate the performance of several candidate models and select one of them. This process is called model selection. The candidate models can be models of the same kind with different hyperparameters. Taking the multilayer perceptron as an example, we can choose the number of hidden layers, as well as the number of hidden units and the activation function in each hidden layer. Considerable effort in model selection is usually required to obtain an effective model. Next, we describe the validation data set, which is often used in model selection.

3.1.2.1. Validation data set

Strictly speaking, the test set should be used only once, after all the hyperparameters and model parameters have been selected. The test data cannot be used for model selection, for example to tune hyperparameters. Since the generalization error cannot be estimated from the training error, we should not rely only on the training data for model selection either. In view of this, we can reserve some data outside the training data set and test data set for model selection. This reserved data is called the validation data set, or validation set for short. For example, we can randomly select a small portion of a given training set as the validation set and use the rest as the actual training set.
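
Such a random split can be sketched as follows; the helper name train_valid_split is ours for illustration and is not part of the d2lzh package.

import random
from mxnet import nd

def train_valid_split(features, labels, valid_ratio=0.2):
    # shuffle the sample indices and hold out a fraction for validation
    idx = list(range(features.shape[0]))
    random.shuffle(idx)
    n_valid = int(len(idx) * valid_ratio)
    valid_idx, train_idx = nd.array(idx[:n_valid]), nd.array(idx[n_valid:])
    return (nd.take(features, train_idx), nd.take(labels, train_idx),
            nd.take(features, valid_idx), nd.take(labels, valid_idx))

Given NDArrays features and labels, calling train_valid_split(features, labels, 0.2) would reserve 20% of the samples for validation and return the remaining 80% as the actual training set.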

However, in practical applications the test data is rarely used only once and then discarded, because data is not easy to obtain. As a result, the line between validation and test data sets can be blurred in practice. Strictly speaking, unless explicitly stated otherwise, the test sets used in the experiments in this book are actually validation sets, and the test results reported by the experiments (e.g., test accuracy) are actually validation results.

3.1.2.2. K-fold cross-validation

Because the validation data set does not participate in model training, reserving a large amount of validation data is too wasteful when training data is scarce. An improved approach is K-fold cross-validation. In K-fold cross-validation, we split the original training data set into K non-overlapping sub-data sets and then perform K rounds of model training and validation. In each round, we validate the model on one sub-data set and train the model on the other K−1 sub-data sets. Over these K rounds of training and validation, a different sub-data set is used for validation each time. Finally, we average the K training errors and the K validation errors, respectively.
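
The data-splitting step can be sketched as follows; the helper name get_k_fold_data is only illustrative and is not defined elsewhere in this section. Calling it for i = 0, …, K−1, training and validating each time, and averaging the resulting errors completes the procedure.

from mxnet import nd

def get_k_fold_data(k, i, X, y):
    # split the samples into k folds of equal size along the first dimension
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        X_part = X[j * fold_size: (j + 1) * fold_size, :]
        y_part = y[j * fold_size: (j + 1) * fold_size]
        if j == i:
            # the i-th fold is held out for validation
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            # concatenate the remaining k-1 folds for training
            X_train = nd.concat(X_train, X_part, dim=0)
            y_train = nd.concat(y_train, y_part, dim=0)
    return X_train, y_train, X_valid, y_valid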

3.1.3. Underfitting and overfitting

Next, we explore two problems that commonly occur in model training: one is that the model cannot achieve a low training error, which we call underfitting; the other is that the training error of the model is much lower than its error on the test data set, which we call overfitting. In practice, we should deal with both underfitting and overfitting as much as possible. While many factors can contribute to these two fitting problems, here we focus on two: model complexity and training data set size.

3.1.3.1. Model complexity

To explain model complexity, we take polynomial function fitting as an example. Given a training data set consisting of a scalar feature x and the corresponding scalar label y, the objective of polynomial function fitting is to find a K-th order polynomial function

$$\hat{y} = b + \sum_{k=1}^{K} x^k w_k$$

to approximate y. In the formula above, w_k is the weight parameter of the k-th term and b is the bias parameter. As in linear regression, polynomial function fitting uses a square loss function. In particular, first-order polynomial function fitting is also called linear function fitting.
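
For concreteness, the prediction of this model can be written in a few lines of MXNet code; this is only a sketch, and the weights and bias below are arbitrary illustrative values rather than learned parameters.

from mxnet import nd

def poly_predict(x, w, b):
    # compute b + sum_{k=1}^{K} x^k * w_k for a column vector of inputs x
    K = len(w)
    powers = nd.concat(*[nd.power(x, k) for k in range(1, K + 1)], dim=1)
    return nd.dot(powers, nd.array(w)) + b

# third-order example: 1.2x - 3.4x^2 + 5.6x^3 + 5 evaluated at x = 1 and x = 2
print(poly_predict(nd.array([[1.0], [2.0]]), [1.2, -3.4, 5.6], 5))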

Higher-order polynomial functions are more complex than lower-order ones because they have more model parameters and a larger space of candidate functions. Therefore, on the same training data set, higher-order polynomial functions are more likely to achieve a lower training error than lower-order ones. Given a training data set, the typical relationship between model complexity and error is shown in Figure 3.4: if the complexity of the model is too low, underfitting is likely to occur; if it is too high, overfitting is likely to occur. One way to deal with underfitting and overfitting is to choose a model of appropriate complexity for the data set.

Figure 3.4: Influence of model complexity on underfitting and overfitting

3.1.3.2. Training data set size

Another important factor affecting underfitting and overfitting is the size of the training data set. In general, overfitting is more likely to occur if the training data set contains too few samples, especially fewer than the number of model parameters (counted element-wise). In addition, the generalization error does not increase as the number of samples in the training data set grows. Therefore, within the limits of available computing resources, we usually want the training data set to be large, especially when the model is complex, for example a deep learning model with many layers.

3.1.4. Polynomial function fitting experiment

To understand the influence of model complexity and training data set size on underfitting and overfitting, let us take polynomial function fitting as an example. First, import the packages and modules required for the experiment.

In [1]:
%matplotlib inline
import d2lzh as d2l
from mxnet import autograd, gluon, nd
from mxnet.gluon import data as gdata, loss as gloss, nn

3.1.4.1. Generate data sets

We will generate a synthetic data set. Given a sample feature x, in both the training data set and the test data set we use the following third-order polynomial function to generate the sample’s label:

$$y = 1.2x - 3.4x^2 + 5.6x^3 + 5 + \epsilon,$$

where the noise term ϵ follows a normal distribution with mean 0 and standard deviation 0.1. The number of samples in both the training data set and the test data set is set to 100.

In [2]:
n_train, n_test, true_w, true_b = 100, 100, [1.2, -3.4, 5.6], 5
features = nd.random.normal(shape=(n_train + n_test, 1))
poly_features = nd.concat(features, nd.power(features, 2),
                          nd.power(features, 3))
labels = (true_w[0] * poly_features[:, 0] + true_w[1] * poly_features[:, 1]
          + true_w[2] * poly_features[:, 2] + true_b)
labels += nd.random.normal(scale=0.1, shape=labels.shape)

Take a look at the first two samples of the generated dataset.

In [3]:
features[:2], poly_features[:2], labels[:2]
Out[3]:
(
 [[2.2122064]
  [0.7740038]]
 <NDArray 2x1 @cpu(0)>,
 [[ 2.2122064   4.893857   10.826221  ]
  [ 0.7740038   0.5990819   0.46369165]]
 <NDArray 2x3 @cpu(0)>,
 [51.674885   6.3585763]
 <NDArray 2 @cpu(0)>)

3.1.4.2. Define, train, and test models

We first define the plotting function semilogy, in which the y-axis uses a logarithmic scale.

In [4]:
def semilogy(x_vals, y_vals, x_label, y_label, x2_vals=None, y2_vals=None,
             legend=None, figsize=(3.5, 2.5)):
    d2l.set_figsize(figsize)
    d2l.plt.xlabel(x_label)
    d2l.plt.ylabel(y_label)
    d2l.plt.semilogy(x_vals, y_vals)
    if x2_vals and y2_vals:
        d2l.plt.semilogy(x2_vals, y2_vals, linestyle=':')
        d2l.plt.legend(legend)

Like linear regression, polynomial function fitting uses a square loss function. Because we will try to fit the resulting data set with models of different complexity, we put the model definition part in the fit_and_plot function. The training and testing steps for polynomial function fitting are similar to the related steps in Softmax regression described in the section “Implementing Softmax Regression from scratch.”

In [5]:
num_epochs, loss = 100, gloss.L2Loss()

def fit_and_plot(train_features, test_features, train_labels, test_labels):
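    # the model is a single linear output layer; its effective complexity is
    # determined by which feature columns (e.g. x alone, or x, x^2, x^3) are
    # passed in as train_features / test_features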
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize()
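    # cap the mini-batch size at the number of training examples, since the
    # overfitting experiment below trains on only two samples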
    batch_size = min(10, train_labels.shape[0])
    train_iter = gdata.DataLoader(gdata.ArrayDataset(
        train_features, train_labels), batch_size, shuffle=True)
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.01})
    train_ls, test_ls = [], []
    for _ in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
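        # after each epoch, record the loss on the full training and test sets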
        train_ls.append(loss(net(train_features),
                             train_labels).mean().asscalar())
        test_ls.append(loss(net(test_features),
                            test_labels).mean().asscalar())
    print('final epoch: train loss', train_ls[-1], 'test loss', test_ls[-1])
    semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
             range(1, num_epochs + 1), test_ls, ['train', 'test'])
    print('weight:', net[0].weight.data().asnumpy(),
          '\nbias:', net[0].bias.data().asnumpy())

3.1.4.3. Third-order polynomial function fitting (normal)

We first use a third-order polynomial function of the same order as the data-generating function for fitting. The experiment shows that both the training error and the error on the test data set of this model are low. The learned model parameters are also close to the true values: w1 = 1.2, w2 = -3.4, w3 = 5.6, b = 5.

In [6]:
fit_and_plot(poly_features[:n_train, :], poly_features[n_train:, :],
             labels[:n_train], labels[n_train:])
final epoch: train loss 0.0067995964 test loss 0.010894175
weight: [[ 1.319636  -3.3645654  5.5645313]]
bias: [4.9527817]

3.1.4.4. Linear function fitting (underfitting)

Next, let’s try linear function fitting. Clearly, the training error of this model is difficult to reduce further after declining in the early epochs, and it remains high after the final epoch. Linear models tend to underfit on data sets generated by a nonlinear model, such as a third-order polynomial function.

In [7]:
fit_and_plot(features[:n_train, :], features[n_train:, :], labels[:n_train],
             labels[n_train:])
final epoch: train loss 43.99766 test loss 160.84781
weight: [[15.548418]]
bias: [2.2836545]

3.1.4.5. Insufficient training samples (overfitting)

In fact, even with a third-order polynomial function model of the same order as the data-generating model, the model can still easily overfit if the training samples are insufficient. Let us train the model with only two samples. Clearly, the training samples are too few, even fewer than the number of model parameters. This makes the model appear overly complex, so that it is easily influenced by the noise in the training data. During training, although the training error is low, the error on the test data set is high. This is a classic case of overfitting.

In [8]:
fit_and_plot(poly_features[0:2, :], poly_features[n_train:, :], labels[0:2],
             labels[n_train:])
final epoch: train loss 0.4027369 test loss 103.314186
weight: [[1.3872364 1.9376589 3.5085924]]
bias: [1.2312856]

We will continue to discuss the problem of overfitting and methods for dealing with it in the next two sections.

3.1.5. Summary

  • Because the generalization error cannot be estimated from the training error, blindly reducing the training error does not necessarily reduce the generalization error. Machine learning models should focus on reducing the generalization error.
  • You can use validation data sets for model selection.
  • Underfitting means that the model cannot achieve a low training error, while overfitting means that the training error of the model is much lower than its error on the test data set.
  • Models with appropriate complexity should be selected and too few training samples should be avoided.