
Machine Learning -6- Advice for Applying Machine Learning

This article records Andrew Ng's advice on applying machine learning, including:

  • Advice for applying machine learning
  • Evaluating a hypothesis
  • Model selection and cross-validation
  • Diagnosing bias and variance
  • Regularization and overfitting

Advice for applying machine learning

When we use a trained model to predict unknown data and find large errors, what can we try next?

  • Get more training examples
  • Try a smaller set of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing the regularization parameter λ
  • Try increasing the regularization parameter λ

Evaluating a Hypothesis

When training a learning algorithm, we consider how to choose parameters that minimize the training error. In the process of model building it is easy to overfit, so how do we evaluate whether a model overfits?

To check whether the algorithm overfits, the data set is divided into a training set and a test set, usually in a 70/30 ratio. The key point is that both sets need to contain all types of data, so the data is usually shuffled before being split, as sketched below.
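A minimal sketch of this shuffle-and-split step in NumPy (the array names `X` and `y` are illustrative):

```python
import numpy as np

def shuffle_split(X, y, train_ratio=0.7, seed=0):
    """Shuffle the data so both sets contain all types of examples,
    then split it 70/30 into a training set and a test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_ratio * len(y))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```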

After fitting the model on the training set, we verify it on the test set. There are two different methods:

  1. Linear regression model: compute the cost function $J_{test}(\theta)$ on the test data
  2. Logistic regression model:
    • compute the cost function $J_{test}(\theta)$ on the test data, or
    • compute the misclassification (0/1) error for each test example and take the average (see the sketch below)
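A minimal sketch of the misclassification-error computation for logistic regression (the function names are illustrative, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def misclassification_error(theta, X_test, y_test):
    """Average 0/1 error on the test set: predict 1 when h_theta(x) >= 0.5."""
    predictions = (sigmoid(X_test @ theta) >= 0.5).astype(int)
    return np.mean(predictions != y_test)  # fraction of wrong predictions
```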

Model selection and cross validation

Cross validation

What is cross-validation?

Cross-validation splits the data three ways: 60% of the data is used as the training set, 20% as the cross-validation set, and 20% as the test set.

Model selection
  • Use the training set to train 10 candidate models (e.g., polynomials of degree 1 through 10)
  • For each of the 10 models, compute the cross-validation error (the value of the cost function) on the cross-validation set
  • Select the model with the smallest cross-validation error
  • Use the model selected in the step above to compute the generalization error (the value of the cost function) on the test set; a sketch follows after the formulas below
  • The training error is expressed as:

J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2

  • The cross-validation error (computed on the cross-validation set) is expressed as:

J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x^{(i)}_{cv})-y^{(i)}_{cv})^2

  • The test error is expressed as:

J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h_{\theta}(x^{(i)}_{test})-y^{(i)}_{test})^2
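A sketch of the whole selection loop for one-dimensional data, assuming polynomial models of degree 1 through 10 (`np.polyfit`/`np.polyval` stand in for whatever training routine is actually used):

```python
import numpy as np

def select_degree(x_train, y_train, x_cv, y_cv, max_degree=10):
    """Fit one polynomial per degree on the training set and keep the
    degree whose cross-validation error J_cv is smallest."""
    best_degree, best_err = None, np.inf
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x_train, y_train, d)  # minimizes training error
        cv_err = np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2) / 2
        if cv_err < best_err:
            best_degree, best_err = d, cv_err
    return best_degree
```

The generalization error is then reported once, on the test set, using only the selected degree.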

Diagnosing Bias and Variance (Bias vs. Variance)

If the results of an algorithm are not very good, there are only two cases: either the bias is too high or the variance is too high. In other words, the model is either underfitting or overfitting.

Plot the training-set and cross-validation-set cost-function errors against the degree of the polynomial in the same graph:

1. High-bias region

The cost-function errors on the cross-validation set and the training set are both large and approximately equal.

2. High-variance region

The cross-validation error is much larger than the training error, which is very low. These two patterns can be turned into a rough diagnostic rule, as sketched below.
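A minimal sketch of such a rule; the thresholds are illustrative and problem-dependent, not from the course:

```python
def diagnose(j_train, j_cv, gap_factor=2.0, acceptable_error=0.1):
    """Crude bias/variance check from training and cross-validation errors.
    gap_factor and acceptable_error are illustrative thresholds."""
    if j_cv > gap_factor * j_train:
        return "high variance (overfitting): J_cv is much larger than J_train"
    if j_train > acceptable_error:
        return "high bias (underfitting): both errors are large and close"
    return "no obvious bias or variance problem"
```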

Regularization and Bias/Variance

Regularization basics

Regularization is mainly used to address overfitting. Overfitting means the model fits the sample data well but predicts poorly on new data.

  • The first model is a linear model that underfits and cannot fit our training set well
  • The third model, a fourth-degree polynomial, puts too much emphasis on fitting the raw data and loses the point of the algorithm: predicting new data
  • The middle model seems to fit best

Example

Suppose we need to fit the polynomial in the figure below, with a regularization term added:

  • When λ is very large, high bias occurs: all parameters are penalized heavily and the hypothesis $h_\theta(x)$ degenerates into roughly a horizontal line
  • When λ is very small (approximately zero), the regularization term has almost no effect and high variance occurs

With polynomial fitting, the higher the degree of x, the better the fit to the training data, but the prediction ability may correspondingly be worse. Two treatments for overfitting:

  1. Discard features that do not help prediction. You can choose which features to keep manually, or use a model selection algorithm, such as PCA.
  2. Regularization: keep all the features, but reduce the magnitude of the parameters.

Adding a regularization parameter

In the model $h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2^2 + \theta_3x_3^3 + \theta_4x_4^4$, the overfitting problem is mainly caused by the higher-order terms:

Overfitting can be mitigated by adding a regularization term to the cost function, where λ is the regularization parameter:


J(\theta)=\frac{1}{2m}\sum^m_{i=1}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum^n_{j=1}\theta^2_{j}

Note: generally we do not penalize $\theta_0$; adding the regularization term penalizes the parameters $\theta_1, \dots, \theta_n$. Comparing the regularized model with the original model:

  • If λ is too large, all parameters are driven toward zero and the model degenerates into $h_\theta(x)=\theta_0$, resulting in underfitting (high bias). A sketch of the regularized cost follows.
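A minimal sketch of the regularized cost in NumPy, matching the formula above (note that `theta[0]` is deliberately skipped in the penalty):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) with an L2 penalty; theta_0 is not penalized."""
    m = len(y)
    squared_error = np.sum((X @ theta - y) ** 2) / (2 * m)
    penalty = lam * np.sum(theta[1:] ** 2)  # skip theta[0]
    return squared_error + penalty
```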

Choosing the parameter λ

  1. Train several models with different degrees of regularization (e.g., λ = 0, 0.01, 0.02, 0.04, …, 10)
  2. Compute the cross-validation error of each model on the cross-validation set
  3. Select the model with the smallest cross-validation error
  4. Use the model selected in step 3 to compute the generalization error on the test set (see the sketch below)
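A sketch of this λ-selection loop; `train_fn` and `cv_error_fn` are hypothetical callables standing in for your training routine and an unregularized cross-validation error:

```python
import numpy as np

# Candidate values, roughly doubling each step (an illustrative grid).
lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]

def select_lambda(train_fn, cv_error_fn, lambdas):
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        theta = train_fn(lam)       # fit with regularization strength lam
        err = cv_error_fn(theta)    # J_cv is computed WITHOUT the penalty term
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```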

Learning Curves

The learning curve is used to judge whether a learning algorithm suffers from a bias problem or a variance problem.

The learning curve is a graph of the training-set error and the cross-validation-set error as functions of the number of training examples m.


J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2

J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x^{(i)}_{cv})-y^{(i)}_{cv})^2

The relationship between the number of training examples m and the cost function J:

See the result in Figure 1

  • When the sample size is small, the training-set error is very small while the cross-validation error is very large
  • As the sample size increases, the gap between the two decreases

Note: in the high-bias (underfitting) case, increasing the number of samples has little effect.

In the high-variance case, increasing the number of samples can improve the algorithm. A sketch for computing learning curves follows.
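A minimal sketch of the learning-curve computation; `fit_fn` and `error_fn` are hypothetical placeholders for your training routine and the unregularized error:

```python
def learning_curves(X_train, y_train, X_cv, y_cv, fit_fn, error_fn):
    """For each training-set size m, train on the first m examples and
    record J_train (on those m examples) and J_cv (on the full CV set)."""
    j_train, j_cv = [], []
    for m in range(1, len(y_train) + 1):
        theta = fit_fn(X_train[:m], y_train[:m])
        j_train.append(error_fn(theta, X_train[:m], y_train[:m]))
        j_cv.append(error_fn(theta, X_cv, y_cv))
    return j_train, j_cv
```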

Conclusion

  1. Get more training examples – addresses high variance
  2. Try a smaller set of features – addresses high variance
  3. Try getting additional features – addresses high bias
  4. Try adding polynomial features – addresses high bias
  5. Try decreasing the regularization parameter λ – addresses high bias
  6. Try increasing the regularization parameter λ – addresses high variance

Bias and variance in neural networks

A small neural network has few parameters and is prone to high bias and underfitting.

A large neural network has many parameters and is prone to high variance and overfitting.

It is usually better to choose a larger neural network and apply regularization than to use a smaller neural network; a hedged sketch follows.
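As an illustration with scikit-learn (not part of the original course material), where `alpha` is sklearn's L2 penalty and plays the role of λ:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A small network versus a larger network with L2 regularization (alpha).
nets = {
    "small": MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0),
    "large+reg": MLPClassifier(hidden_layer_sizes=(100, 100), alpha=1.0,
                               max_iter=2000, random_state=0),
}
for name, net in nets.items():
    print(name, cross_val_score(net, X, y, cv=5).mean())
```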

Precision and recall

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |

Precision: $P=\frac{TP}{TP+FP}$. Recall: $R=\frac{TP}{TP+FN}$

Recall and precision are in tension: when one is pushed higher, the other tends to fall. The relationship diagram is as follows:

The balance point between recall and precision is generally summarized by the $F_1$ score: $F_1=\frac{2PR}{P+R}$, as in the sketch below.
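A small sketch computing precision, recall, and F1 from confusion-matrix counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Example: 90 true positives, 10 false positives, 30 false negatives
print(precision_recall_f1(90, 10, 30))  # (0.9, 0.75, 0.818...)
```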