As we mentioned in the last video, a process called regularization helps linear regression and other machine learning algorithms when generalization performance starts to deteriorate.
Now, let’s talk about regularization, another key concept in machine learning, in a little more detail.
Recall that in machine learning we minimize a training error, MSE train in the case of regression, even though our primary interest is in MSE test. With this in mind, the idea of regularization becomes clearer.
The main idea is to modify the objective function by adding a term that depends on the model parameters, in the hope that the new objective will produce a model with smaller variance.
Here I’m showing this new objective function, which I call J.
It has two terms.
The first is the familiar MSE training loss.
Note that it is a function of the model parameter W and data X.
The second term is a regularizer function, Omega of W, multiplied by a weight lambda. This weight lambda is called the regularization parameter.
Note that the second term is only a function of the model parameter W, not the input X.
Now, the effect of this new term depends on both the form of the regularizer Omega and the value of the regularization parameter lambda.
If lambda is close to zero, we go back to the original situation without any regularization.
On the other hand, if lambda is very large, the optimizer will essentially ignore the first term and try to minimize the second term alone. Since the second term is completely independent of the data, the resulting solution will simply be the minimizer of the regularizer Omega, fixed once and for all regardless of the actual data. For any intermediate value of lambda, the resulting model parameters will lie somewhere between the values obtained in these two limits.
This situation is what we need.
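To make this concrete, here is a minimal sketch of the objective in Python. The function names, and the choice of a linear model with a squared-L2 regularizer, are my own illustrative assumptions, not something fixed by the lecture.

```python
import numpy as np

def mse_train(W, X, y):
    """In-sample mean squared error of a linear model y_hat = X @ W."""
    residuals = X @ W - y
    return np.mean(residuals ** 2)

def omega(W):
    """Illustrative regularizer: the squared L2 norm of the weights."""
    return np.sum(W ** 2)

def J(W, X, y, lam):
    """Regularized objective: training loss plus lambda times the regularizer.
    As lam -> 0 we recover plain MSE minimization; for very large lam the
    optimizer effectively ignores the data and shrinks W toward the
    minimizer of omega."""
    return mse_train(W, X, y) + lam * omega(W)
```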
I’ll explain how we choose the best value for lambda in a few minutes, but first let me show you some popular examples of regularizers commonly used in machine learning.
The first common regularizer is the square of the L2 norm of the vector W.
Linear regression with this regularizer is called ridge regression. What it does is favor solutions whose weights are not too large.
Another popular regularizer is the L1 norm of the vector W.
This is called L1 or LASSO regularization, and it turns out that it encourages sparsity of the solution, that is, it drives some weights exactly to zero.
Another regularizer, sometimes used when the weights should be non-negative and sum to one, is the entropy regularizer shown here.
This regularization comes from Bayesian statistics.
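As a rough sketch, these three regularizers could be written as follows. Treating W as a numpy vector, and the sign convention for the entropy term, are assumptions made here for illustration.

```python
import numpy as np

def omega_ridge(W):
    """Squared L2 norm: penalizes large weights (ridge regression)."""
    return np.sum(W ** 2)

def omega_lasso(W):
    """L1 norm: tends to drive some weights exactly to zero (LASSO)."""
    return np.sum(np.abs(W))

def omega_entropy(W, eps=1e-12):
    """Negative entropy: assumes W is non-negative and sums to one, like a
    probability distribution; eps guards against log(0)."""
    return np.sum(W * np.log(W + eps))
```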
We will discuss regularization methods in more depth in the next lesson of this project.
But for now, assuming you have decided on the right regularization term, I want to discuss how to choose the best value of lambda.
The regularization parameter is our first example of what we call hyperparameters.
We’ll see more examples of hyperparameters in this section, but what are hyperparameters? In general, hyperparameters are any quantitative features of a machine learning algorithm that are not optimized directly by minimizing the training, or in-sample, loss such as MSE train.
The main function of hyperparameters is to control model capacity, that is, the model’s ability to progressively fit more complex data.
An example of a regression hyperparameter is the degree of the polynomial used for regression, that is, whether it is linear, quadratic, cubic, and so on.
Lambda, the regularization parameter we just introduced, is another commonly used hyperparameter.
Other examples of hyperparameters include the depth of a decision tree, the number of layers in a neural network, or the number of nodes per layer.
Further examples are parameters that determine how quickly the model adapts to new data.
These parameters are called learning rates, and proper selection is often very important in practice.
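Returning to the polynomial-degree example for a moment, the small sketch below shows capacity control in action; the synthetic sine data and the scikit-learn pipeline are my own illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * x).ravel() + 0.1 * rng.standard_normal(30)

# The degree is a hyperparameter: it controls model capacity, and the
# in-sample fit (R^2) improves as the capacity grows.
for degree in [1, 2, 3, 5]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    print(degree, model.score(x, y))
```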
What all of these hyperparameters have in common is that they are usually selected using one of two methods.
The first method is very simple.
We simply split the training set into a smaller training set and a set called the validation set.
For example, you can set aside about 20% of your training data for the validation set.
The idea is to use the new training set to adjust the parameters of the model, and then use the validation set to adjust the hyperparameters.
Once both the parameters and the hyperparameters are tuned, the final performance of the model is evaluated on a test set, the same way as before.
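Here is a minimal sketch of this split-based tuning scheme using scikit-learn. The synthetic data, the ridge model, and the small candidate grid for lambda (called alpha in scikit-learn) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real problem.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Set aside about 20% of the training data as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

best_lam, best_err = None, np.inf
for lam in [0.01, 0.1, 1.0, 10.0]:            # candidate values of lambda
    model = Ridge(alpha=lam).fit(X_tr, y_tr)  # tune parameters W on the smaller training set
    err = mean_squared_error(y_val, model.predict(X_val))  # tune lambda on the validation set
    if err < best_err:
        best_lam, best_err = lam, err

# Final performance is still judged on the held-out test set.
final = Ridge(alpha=best_lam).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final.predict(X_test))
```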
This approach is simple and theoretically sound, but it may not be ideal when the data set is small.
In this case, it may be undesirable to set aside data for a validation set, as training on less data may result in poorer accuracy.
To deal with this, a method called cross-validation is used.
Unlike the first method, the cross-validation method does not discard any information during training.
Here’s how cross-validation works.
Let’s say we have N samples that we can use for training but N is small, so it’s problematic to set aside some data.
So, here’s what we do.
First, we define a set of possible values for the hyperparameter we want to optimize, for example lambda, the regularization parameter.
Typically, this is a small set of possible values, so we just want to choose the best value from this set of candidates.
Next, we divide the entire training data set into k blocks of equal size, X1 through Xk.
We then start the following loop, which we repeat for all the values of regularization parameters in our set of candidate values.
First, we take out the first block and train our model on the rest of the training data.
Once done, we use the current candidate value of the hyperparameter to evaluate the model’s out-of-sample error on block X1.
So far, this looks exactly the same as the validation-set approach, but the difference lies in the next step. After recording the out-of-sample error obtained on the first block, we put it back into the training set and take out the next block, X2. We then repeat all the steps above, that is, we train the model on blocks X1, X3, and so on through Xk, then run the trained model on block X2 and record the estimated out-of-sample error.
We continue the loop until every block has been held out in turn, and then calculate the average out-of-sample error across all blocks.
We perform this procedure separately for all candidate values of the hyperparameter, and the optimal value will be the one that gives the minimum average out-of-sample error.
What I have described is called k-fold cross-validation, where k is the number of blocks into which we divide the training data set.
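A compact sketch of this loop using scikit-learn follows; the ridge model, the candidate grid, and k equal to 5 are illustrative choices. Note that scoring='neg_mean_squared_error' returns negated errors, hence the minus sign.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# A small synthetic data set stands in for the N training samples.
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # k blocks of roughly equal size
candidates = [0.01, 0.1, 1.0, 10.0]                      # candidate values of lambda

avg_errors = {}
for lam in candidates:
    # For each block in turn: train on the other k-1 blocks, score on the held-out block.
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=kfold,
                             scoring="neg_mean_squared_error")
    avg_errors[lam] = -scores.mean()  # average out-of-sample MSE over the k blocks

best_lam = min(avg_errors, key=avg_errors.get)  # minimizes the average out-of-sample error
```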
There are some special cases worth mentioning here.
First of all, if we set k equal to 1, then obviously we won’t have any held-out block at all.
So, this case is not very interesting for our purposes.
Second, if we take k equal to N, the number of samples, we get the limiting case of cross-validation called leave-one-out cross-validation.
In this case, at each step of the loop exactly one data point is left out of model estimation and can therefore be used for out-of-sample testing.
Leave-one-out cross-validation is rarely used in practice because it becomes very time-consuming once your data set grows large.
A more popular option is 10-fold or 5-fold cross-validation, where the number of blocks is 10 or 5, respectively.
Tuning hyperparameters is usually a very important part of building machine learning algorithms, and it can also be very time-consuming, especially if you have many hyperparameters or no good initial guess for them, so that a wide range of possible values has to be searched.
We will conduct this analysis in many parts of the project, including topics such as supervised, unsupervised and reinforcement learning.
To summarize, we have covered many important concepts of machine learning, such as overfitting, the bias-variance decomposition, regularization, and hyperparameter tuning.
In the next lecture, we’ll begin to see how these concepts can be applied to real-world financial problems that can be solved using supervised learning methods.