1. Model error
1.1 Sources of error
- There are two main sources of error: bias and variance.
- Bias measures the gap between the model's expected prediction and the true value, i.e. the accuracy of the model itself.
- Variance measures how much the model's predictions fluctuate around their own expectation, i.e. the stability of the model.
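A standard way to make the two sources precise is the bias-variance decomposition of the expected squared error (standard notation, not taken from the original notes): with true function f, learned model f̂, and observation noise of variance σ²,

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

where the expectation is over training sets (and noise), and the noise term is irreducible.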
1.2 An understanding of under-fitting and over-fitting
- A simple model (left) gives error dominated by bias; this is underfitting. The model should be redesigned at this point: the target function f̂ may not be contained in the chosen function set at all, so add more features, or consider higher powers and a more complex model. Forcing in more training data will not help here, because the function set itself is inadequate, and no amount of extra data can make a better function appear in it.
- A complex model (right) gives error dominated by variance; this is overfitting. A crude fix: collect more data (a small sketch of both cases follows below).
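A minimal numeric sketch of the two regimes, using polynomial fits of different degrees on a hypothetical cubic target (the data and function names are illustrative assumptions, not from the notes):

```python
# Fit polynomials of increasing degree to noisy samples of a cubic target.
# Degree 1 underfits (high bias: poor even on training data); a high degree
# overfits (high variance: low training error, high test error). Collecting
# more data mainly helps the overfitting case.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = x**3 - x + 0.1 * rng.normal(size=n)   # "true" function plus noise
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(1000)

for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)                      # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)  # training MSE
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)     # test MSE
    print(f"degree={degree:2d}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```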
1.3 Approximation error and estimation error
- Approximation error: roughly the training error on the existing training set; it is more concerned with "training".
- Estimation error: the error on the test set; it is more concerned with generalization.
- The pair is somewhat analogous to the concepts of bias and variance.
- In the K-nearest neighbor algorithm, a small k tends to give low approximation error and high estimation error, i.e. the model tends to overfit (see the sketch below).
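An illustrative check of this claim (assumes scikit-learn is installed; the dataset is synthetic):

```python
# With small k the training (approximation) error is very low, while the
# validation (estimation) error is typically worse than for a moderate k,
# i.e. the model overfits.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:2d}  train acc={knn.score(X_tr, y_tr):.3f}  "
          f"val acc={knn.score(X_va, y_va):.3f}")
```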
1.4 Model Selection
- There is a tradeoff between bias and variance. Train different models on a training set and compare their errors on the test set. The model with the smallest error is not necessarily the best, because the test set you have is not the true, complete test set. For example, a model may reach an error of 0.5 on the available test set, but once more test data is collected the error will usually be greater than 0.5.
- Cross validation. Split the training set into two parts: a training part and a validation part. Train the candidate models on the training part, compare them on the validation set to pick the best one, then retrain that model on the full training set and finally evaluate it on the public test set. The errors obtained this way are generally larger. It is tempting to go back and tweak the model so that it looks better on the public test set, but this is not recommended.
- N-fold cross validation. Split the training set into N parts and use a different part as the validation set each round; the average validation error is used to compare models (a sketch follows below).
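A minimal N-fold cross-validation sketch (the function names and interface are assumptions for illustration, not the notes' own code):

```python
import numpy as np

def n_fold_cv(X, y, n_folds, fit, error):
    """fit(X, y) -> model; error(model, X, y) -> scalar validation error.
    X and y are assumed to be NumPy arrays indexable by integer index arrays."""
    idx = np.random.permutation(len(X))           # shuffle, then split into N folds
    folds = np.array_split(idx, n_folds)
    errs = []
    for i in range(n_folds):
        val = folds[i]                            # fold i is the validation set this round
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(X[tr], y[tr])
        errs.append(error(model, X[val], y[val]))
    return float(np.mean(errs))                   # average validation error
```

The model with the lowest average validation error is then retrained on the full training set before it ever touches the public test set.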
2. Gradient descent
Tip 1: Adjust the learning rate
- Carefully adjust your learning rate
In the figure above, the black curve on the left is the loss function. Suppose we start from the highest point on the left. If the learning rate is just right (the red line), the lowest point is found smoothly. If the learning rate is too small (the blue line), descent is very slow; given enough time it would still get there, but in practice we may not be able to wait that long. If the learning rate is a bit too large (the green line), the updates oscillate back and forth and never reach the lowest point. And if it is far too large (the yellow line), the parameters simply fly out of the valley, and the loss only gets larger with each update.
While this visualization is intuitive, it is only possible when the parameters are one- or two-dimensional; higher-dimensional parameter spaces cannot be visualized this way.
The solution is shown on the right of the figure: plot how the loss changes with the number of parameter updates. If the learning rate is too small (blue line), the loss decreases very slowly; if it is too large (green line), the loss drops rapidly at first but then gets stuck; if it is far too large (yellow line), the loss blows up; the red curve, with a well-chosen learning rate, steadily reaches a good result. A small numeric sketch of this diagnosis follows below.
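The same diagnosis can be reproduced numerically on a toy loss (a hypothetical one-dimensional quadratic, not from the notes): record the loss after every update for several learning rates and compare the curves.

```python
def loss(w):                      # simple convex loss, minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):                      # its derivative
    return 2.0 * (w - 3.0)

for lr in (0.001, 0.1, 1.1):      # too small / about right / too large
    w, history = 10.0, []
    for _ in range(30):
        w -= lr * grad(w)         # gradient descent update
        history.append(loss(w))
    # too small: barely decreases; about right: near zero; too large: blows up
    print(f"lr={lr:<6} final loss={history[-1]:.3g}")
```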
- Adaptive learning rate. A simple idea: as the number of updates increases, reduce the learning rate by some factor.
- Adagrad. Each parameter's learning rate is divided by the root mean square of that parameter's previous derivatives (a sketch follows below).
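A minimal Adagrad sketch (standard form; the function names and hyperparameters are assumptions). Dividing a time-decayed learning rate by the RMS of past derivatives is equivalent to dividing a fixed learning rate by the square root of the sum of past squared gradients, which is what the code accumulates:

```python
import numpy as np

def adagrad(grad_fn, w0, lr=0.1, steps=100, eps=1e-8):
    """grad_fn(w) -> gradient array; returns the parameters after `steps` updates."""
    w = np.asarray(w0, dtype=float)
    g_sq_sum = np.zeros_like(w)                    # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        g_sq_sum += g ** 2
        w -= lr * g / (np.sqrt(g_sq_sum) + eps)    # per-parameter effective learning rate
    return w
```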
Tip 2: Stochastic gradient descent
- Instead of computing the loss over all the data as before, compute the loss L^n for a single example and update the parameters immediately, which makes each update much faster (sketch below).
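A sketch of stochastic gradient descent for linear regression with squared loss (the setup is an assumption for illustration): each update uses only the gradient of one example's loss L^n.

```python
import numpy as np

def sgd_linear(X, y, lr=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for n in rng.permutation(len(X)):   # visit examples one at a time, in random order
            err = X[n] @ w + b - y[n]       # residual of example n
            w -= lr * err * X[n]            # gradient of L^n w.r.t. w
            b -= lr * err                   # gradient of L^n w.r.t. b
    return w, b
```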
Tip 3: Feature scaling
- If the scales of the input features differ greatly, using the same learning rate for all parameters does not work well (a standardization sketch follows below).
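A common remedy is to standardize each feature using statistics computed on the training set (a sketch under that assumption; not the notes' own code):

```python
import numpy as np

def standardize(X_train, X_test):
    """Rescale each feature to zero mean and unit variance."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8                         # avoid division by zero
    return (X_train - mean) / std, (X_test - mean) / std     # reuse training statistics for test data
```

After scaling, the loss surface is closer to circular, so a single learning rate works reasonably in every parameter direction.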