
Machine Learning -6- Advice for Applying Machine Learning

This article records Andrew Ng's advice on applying machine learning, including:

  • Advice for applying machine learning
  • Evaluating a hypothesis
  • Model selection and cross-validation
  • Diagnosing bias and variance
  • Regularization and overfitting

Advice for applying machine learning

When we use a trained model to predict unknown data and find large errors, what can we try next?

  • Get more training examples
  • Try a smaller set of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing the regularization parameter λ
  • Try increasing the regularization parameter λ

Evaluating a Hypothesis

When training a learning algorithm, we consider how to choose parameters that minimize the training error. In the process of model building it is easy to overfit, so how do we evaluate whether a model overfits?

To check whether the algorithm overfits, the data set is divided into a training set and a test set, usually in a 70/30 ratio. The key point is that both sets need to contain all types of data, so the data is usually shuffled before being split, as sketched below.
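A minimal sketch of this shuffle-and-split step in NumPy (the array names `X` and `y` are illustrative):

```python
import numpy as np

def shuffle_split(X, y, train_ratio=0.7, seed=0):
    """Shuffle the data so both sets contain all types of examples,
    then split it 70/30 into a training set and a test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_ratio * len(y))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```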

After fitting the model on the training set, we verify it on the test set. There are two different methods:

  1. Linear regression model: compute the cost function $J_{test}(\theta)$ on the test data
  2. Logistic regression model:
    • compute the cost function $J_{test}(\theta)$ on the test data, or
    • compute the misclassification (0/1) error for each test example and take the average (see the sketch below)
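A minimal sketch of the misclassification-error computation for logistic regression (the function names are illustrative, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def misclassification_error(theta, X_test, y_test):
    """Average 0/1 error on the test set: predict 1 when h_theta(x) >= 0.5."""
    predictions = (sigmoid(X_test @ theta) >= 0.5).astype(int)
    return np.mean(predictions != y_test)  # fraction of wrong predictions
```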

Model selection and cross validation

Cross validation

What is cross-validation?

Cross-validation splits the data three ways: 60% of the data is used as the training set, 20% as the cross-validation set, and 20% as the test set.

Model selection
  • Use the training set to train 10 candidate models (e.g., polynomials of degree 1 through 10)
  • For each of the 10 models, compute the cross-validation error (the value of the cost function) on the cross-validation set
  • Select the model with the smallest cross-validation error
  • Use the model selected in the step above to compute the generalization error (the value of the cost function) on the test set; a sketch follows after the formulas below
  • The training error is expressed as:

J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2

  • The cross-validation error (computed on the cross-validation set) is expressed as:

J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x^{(i)}_{cv})-y^{(i)}_{cv})^2

  • The test error is expressed as:

J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h_{\theta}(x^{(i)}_{test})-y^{(i)}_{test})^2
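A sketch of the whole selection loop for one-dimensional data, assuming polynomial models of degree 1 through 10 (`np.polyfit`/`np.polyval` stand in for whatever training routine is actually used):

```python
import numpy as np

def select_degree(x_train, y_train, x_cv, y_cv, max_degree=10):
    """Fit one polynomial per degree on the training set and keep the
    degree whose cross-validation error J_cv is smallest."""
    best_degree, best_err = None, np.inf
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x_train, y_train, d)  # minimizes training error
        cv_err = np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2) / 2
        if cv_err < best_err:
            best_degree, best_err = d, cv_err
    return best_degree
```

The generalization error is then reported once, on the test set, using only the selected degree.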

Diagnosing Bias and Variance (Bias vs. Variance)

If the results of an algorithm are not very good, there are only two cases: either the bias is too high or the variance is too high. In other words, the model is either underfitting or overfitting.

Plot the training-set and cross-validation-set cost-function errors against the degree of the polynomial in the same graph:

1. High-bias region

The cost-function errors on the cross-validation set and the training set are both large and approximately equal.

2. High-variance region

The cross-validation error is much larger than the training error, which is very low. These two patterns can be turned into a rough diagnostic rule, as sketched below.
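A minimal sketch of such a rule; the thresholds are illustrative and problem-dependent, not from the course:

```python
def diagnose(j_train, j_cv, gap_factor=2.0, acceptable_error=0.1):
    """Crude bias/variance check from training and cross-validation errors.
    gap_factor and acceptable_error are illustrative thresholds."""
    if j_cv > gap_factor * j_train:
        return "high variance (overfitting): J_cv is much larger than J_train"
    if j_train > acceptable_error:
        return "high bias (underfitting): both errors are large and close"
    return "no obvious bias or variance problem"
```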

Regularization and Bias/Variance

Regularization basics

Regularization is mainly used to address overfitting. Overfitting means the model fits the sample data well but predicts poorly on new data.

  • The first model is a linear model that underfits and cannot fit our training set well
  • The third model, a fourth-degree polynomial, puts too much emphasis on fitting the raw data and loses the point of the algorithm: predicting new data
  • The middle model seems to fit best

Example

Suppose we need to fit the polynomial in the figure below, with a regularization term added:

  • When λ is very large, high bias occurs: all parameters are penalized heavily and the hypothesis $h_\theta(x)$ degenerates into roughly a horizontal line
  • When λ is very small (approximately zero), the regularization term has almost no effect and high variance occurs

With polynomial fitting, the higher the degree of x, the better the fit to the training data, but the prediction ability may correspondingly be worse. Two treatments for overfitting:

  1. Discard features that do not help prediction. You can choose which features to keep manually, or use a model selection algorithm, such as PCA.
  2. Regularization: keep all the features, but reduce the magnitude of the parameters.

Adding a regularization parameter

In the model $h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2^2 + \theta_3x_3^3 + \theta_4x_4^4$, the overfitting problem is mainly caused by the higher-order terms:

Overfitting can be mitigated by adding a regularization term to the cost function, where λ is the regularization parameter:


J(\theta)=\frac{1}{2m}\sum^m_{i=1}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum^n_{j=1}\theta^2_{j}

Note: generally we do not penalize $\theta_0$; adding the regularization term penalizes the parameters $\theta_1, \dots, \theta_n$. Comparing the regularized model with the original model:

  • If λ is too large, all parameters are driven toward zero and the model degenerates into $h_\theta(x)=\theta_0$, resulting in underfitting (high bias). A sketch of the regularized cost follows.
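A minimal sketch of the regularized cost in NumPy, matching the formula above (note that `theta[0]` is deliberately skipped in the penalty):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) with an L2 penalty; theta_0 is not penalized."""
    m = len(y)
    squared_error = np.sum((X @ theta - y) ** 2) / (2 * m)
    penalty = lam * np.sum(theta[1:] ** 2)  # skip theta[0]
    return squared_error + penalty
```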

Choosing the parameter λ

  1. Train several models with different degrees of regularization (e.g., λ = 0, 0.01, 0.02, 0.04, …, 10)
  2. Compute the cross-validation error of each model on the cross-validation set
  3. Select the model with the smallest cross-validation error
  4. Use the model selected in step 3 to compute the generalization error on the test set (see the sketch below)
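A sketch of this λ-selection loop; `train_fn` and `cv_error_fn` are hypothetical callables standing in for your training routine and an unregularized cross-validation error:

```python
import numpy as np

# Candidate values, roughly doubling each step (an illustrative grid).
lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]

def select_lambda(train_fn, cv_error_fn, lambdas):
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        theta = train_fn(lam)       # fit with regularization strength lam
        err = cv_error_fn(theta)    # J_cv is computed WITHOUT the penalty term
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```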

Learning Curves

The learning curve is used to judge whether a learning algorithm suffers from a bias problem or a variance problem.

The learning curve is a graph of the training-set error and the cross-validation-set error as functions of the number of training examples m.


J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2

J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x^{(i)}_{cv})-y^{(i)}_{cv})^2

The relationship between the number of training examples m and the cost function J:

See the result in Figure 1

  • When the sample size is small, the training-set error is very small while the cross-validation error is very large
  • As the sample size increases, the gap between the two decreases

Note: in the high-bias (underfitting) case, increasing the number of samples has little effect.

In the high-variance case, increasing the number of samples can improve the algorithm. A sketch for computing learning curves follows.
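A minimal sketch of the learning-curve computation; `fit_fn` and `error_fn` are hypothetical placeholders for your training routine and the unregularized error:

```python
def learning_curves(X_train, y_train, X_cv, y_cv, fit_fn, error_fn):
    """For each training-set size m, train on the first m examples and
    record J_train (on those m examples) and J_cv (on the full CV set)."""
    j_train, j_cv = [], []
    for m in range(1, len(y_train) + 1):
        theta = fit_fn(X_train[:m], y_train[:m])
        j_train.append(error_fn(theta, X_train[:m], y_train[:m]))
        j_cv.append(error_fn(theta, X_cv, y_cv))
    return j_train, j_cv
```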

Conclusion

  1. Get more training examples – addresses high variance
  2. Try a smaller set of features – addresses high variance
  3. Try getting additional features – addresses high bias
  4. Try adding polynomial features – addresses high bias
  5. Try decreasing the regularization parameter λ – addresses high bias
  6. Try increasing the regularization parameter λ – addresses high variance

Bias and variance in neural networks

A small neural network has few parameters and is prone to high bias and underfitting.

A large neural network has many parameters and is prone to high variance and overfitting.

It is usually better to choose a larger neural network and apply regularization than to use a smaller neural network; a hedged sketch follows.
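As an illustration with scikit-learn (not part of the original course material), where `alpha` is sklearn's L2 penalty and plays the role of λ:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A small network versus a larger network with L2 regularization (alpha).
nets = {
    "small": MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0),
    "large+reg": MLPClassifier(hidden_layer_sizes=(100, 100), alpha=1.0,
                               max_iter=2000, random_state=0),
}
for name, net in nets.items():
    print(name, cross_val_score(net, X, y, cv=5).mean())
```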

Precision and recall

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |

Precision: $P=\frac{TP}{TP+FP}$. Recall: $R=\frac{TP}{TP+FN}$

Recall and precision are in tension: when one is pushed higher, the other tends to fall. The relationship diagram is as follows:

The balance point between recall and precision is generally summarized by the $F_1$ score: $F_1=\frac{2PR}{P+R}$, as in the sketch below.
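A small sketch computing precision, recall, and F1 from confusion-matrix counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Example: 90 true positives, 10 false positives, 30 false negatives
print(precision_recall_f1(90, 10, 30))  # (0.9, 0.75, 0.818...)
```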