Author | Vagif Aliyev Compiled by | VK Source | Towards Data Science

Linear regression is probably one of the most common algorithms in machine learning, and it is something every practitioner must know. It is often a beginner’s first encounter with a machine learning algorithm, so understanding how it operates is crucial to building a good foundation.

So let’s break down the real question: what exactly is linear regression?

Linear regression definition

Linear regression is a supervised learning algorithm that models the relationship between a dependent variable and one or more independent variables using a linear function. In other words, it fits a linear trend line that best captures the relationship in the data and, from this line, predicts what the target value is likely to be.

Great, I know the definition, but how does it work? Good question! To answer this question, let’s step through how linear regression works:

  1. Fit a line to the data.

  2. Calculate the distance between each point and the line, square each distance, and sum them (the values are squared so that negative distances do not cancel out positive ones and skew the result). This sum is the error of the algorithm, better known as the residual sum of squares.

  3. Store the error of each iteration.

  4. Based on an optimization algorithm, “move” the line slightly so that it fits the data better.

  5. Repeat steps 2-4 until the desired result is achieved or the remaining error is acceptably small.

This method of fitting a line is called the least squares method.
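In symbols, the error from step 2 is the residual sum of squares:

$$\mathrm{RSS} = \sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

where $m$ is the number of data points, $y^{(i)}$ is the true target value, and $\hat{y}^{(i)}$ is the line’s prediction.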

The math behind linear regression

Feel free to skip this section if you already understand the math behind linear regression.

The linear regression hypothesis is as follows:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$$

It can be simplified as:

$$h_\theta(x) = \theta^T x$$

The algorithm basically performs the following operations:

  1. Take a vector Y (your data labels: housing prices, stock prices, etc.):

This is your target vector, which will later be used to evaluate your predictions (more on that later).

  2. Take a matrix X (the features of the data):

These are the features of the data, i.e. age, sex, height, etc. This is the data the algorithm will actually use for prediction. Notice how there is a feature 0. This is called the intercept term, and it is always equal to 1.

  3. Take a weight vector and transpose it:

This is the magic of the algorithm. All the feature vectors get multiplied by these weights; this is called the dot product. In effect, you will try to find the best combination of these values for a given data set. This is called optimization.

  4. Get the output vector:

This is the prediction vector produced from the data. You can then evaluate your model’s performance by passing this vector to a cost function.

This is basically the whole algorithm in mathematical terms. You should now have a solid understanding of how linear regression functions. But a few questions remain: what is an optimization algorithm? How do we choose the best weights? How do we measure performance?
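To make these operations concrete, here is a minimal NumPy sketch of the prediction step; the numbers are hypothetical, purely for illustration:

import numpy as np

# Feature matrix X: one row per instance; column 0 is the intercept term (always 1)
X = np.array([[1.0, 25.0, 170.0],
              [1.0, 32.0, 180.0]])

# Weight vector theta: one weight per feature, including the intercept
theta = np.array([5.0, 0.3, 0.1])

# The dot product of each row with theta gives the output (prediction) vector
predictions = X.dot(theta)
print(predictions)  # [29.5 32.6]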

The cost function

The cost function is essentially a formula that measures the loss, or “cost”, of a model. If you’ve ever entered a Kaggle competition, you’ve probably come across some. Common ones include:

  • Mean squared error (MSE)

  • Root mean squared error (RMSE)

  • Mean absolute error (MAE)

These functions are essential for model training and development because they answer the basic question, “How well does my model predict new instances?” Keep this in mind as it relates to our next topic.
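As an example, mean squared error, the cost function we will implement later, is defined as:

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$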

Optimization algorithm

Optimization is often defined as the process of improving something to its full potential. This also applies to machine learning. In the ML world, optimization is essentially trying to find the best combination of parameters for a data set. This is basically the “learning” part of machine learning.

I will discuss the two most common methods: gradient descent and the normal equation.

Gradient descent

Gradient descent is an optimization algorithm designed to find the minimum of a function. It does this by iteratively taking steps in the direction of the negative gradient. In our case, gradient descent continually updates the weights by moving down the slope of the tangent line to the cost function.

A concrete example of gradient descent

To better illustrate gradient descent, let’s look at a simple example. Imagine a person at the top of a mountain who wants to reach the bottom. They might look around to see which direction to step in so as to get down fastest, then take a step in that direction, bringing them closer to the bottom. However, they have to be careful on the way down, because it is possible to get stuck at certain points, so they have to choose their step size accordingly.


Again, the goal of gradient descent is to minimize a function. In our case, that means minimizing the cost of our model. It does this by finding the tangent (gradient) of the function and moving in the opposite direction. The size of each “step” the algorithm takes is defined by a parameter known as the learning rate, which basically controls how far down we move. With this parameter, we must be aware of two cases:

  1. If the learning rate is too high, the algorithm may fail to converge (never reach the minimum) and instead bounce around the minimum without ever settling on it

  2. If the learning rate is too small, the algorithm will take too long to reach the minimum, and it may “get stuck” in a local minimum

We also have a parameter that controls how many times the algorithm iterates through the data set.


Since this algorithm is very important for machine learning, let’s review what it does:

  1. Randomly initialize the weights (this is called random initialization)

  2. The model makes predictions using these random weights

  3. The predictions are evaluated with a cost function

  4. The model runs gradient descent: it finds the tangent (gradient) of the cost function and takes a step along the negative slope, using the update rule written out below

  5. The process repeats for N iterations, or until a stopping condition is met
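In symbols, each iteration applies the standard gradient descent update, which for linear regression with the MSE cost takes the form our code will use later:

$$\theta := \theta - \alpha\,\frac{1}{m}\,X^T\left(X\theta - y\right)$$

where $\alpha$ is the learning rate.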

Advantages and disadvantages of gradient descent

Advantages:

  1. It can reduce the cost function to the global minimum (very close to, or equal to, 0)

  2. One of the most efficient optimization algorithms

Disadvantages:

  1. It can be slow on large data sets, because it uses the entire data set for every gradient computation

  2. It is prone to getting stuck in local minima

  3. The user must manually choose the learning rate and the number of iterations, which can be time-consuming

Now that gradient descent has been introduced, let’s look at the normal equation.

The Normal Equation

If we go back to our mountain example, using the normal equation is like reaching the bottom instantly instead of walking down step by step. It uses linear algebra to solve for the weights directly, and it can produce results just as good as gradient descent in a fraction of the time.
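For reference, this is the closed-form solution that our implementation will compute later:

$$\theta = (X^T X)^{-1} X^T y$$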

Advantages and disadvantages of the normal equation

Advantages:

  1. No need to choose a learning rate or a number of iterations

  2. Very fast

Disadvantages:

  1. Does not scale well to large data sets, because inverting the $X^T X$ matrix is computationally expensive

  2. Fails when $X^T X$ is not invertible (for example, when features are redundant or highly correlated)

Feature scaling

This is an important preprocessing step for many machine learning algorithms, especially ones that rely on distances and gradient-based calculations, such as linear regression trained with gradient descent. It essentially rescales our features so that they all fall in a similar range. Think of a house and a scale model of that house: the shape is the same (they are both houses), but the size is different (5 m != 500 m). We do this for the following reasons:

  1. It speeds up the algorithm

  2. Some algorithms are sensitive to scale. In other words, if features have different scales, it is possible that features of higher magnitude are assigned higher weights. This will affect the performance of the machine learning algorithm, and obviously we don’t want our algorithm to be biased towards one feature.

To demonstrate this, suppose we have three points named A, B, and C, and we compare the distances AB and BC before and after scaling (see the sketch below).
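Here is a minimal NumPy sketch of this effect; the values of A, B, and C are hypothetical, purely for illustration:

import numpy as np

# Hypothetical points A, B, and C: the second feature is on a much larger scale
A = np.array([25.0, 50000.0])
B = np.array([50.0, 51000.0])
C = np.array([35.0, 48000.0])

def dist(p, q):
    return np.linalg.norm(p - q)

# Before scaling, the large-magnitude feature dominates both distances
print(dist(A, B), dist(B, C))  # ~1000.3, ~3000.0

# Standardize each feature: subtract its mean, divide by its standard deviation
points = np.vstack([A, B, C])
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
A_s, B_s, C_s = scaled

# After scaling, both features contribute comparably to the distances
print(dist(A_s, B_s), dist(B_s, C_s))  # both are now of similar magnitude

We can clearly see that, after scaling, the features are far more comparable and unbiased than they were before.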

Write linear regression from scratch

Well, now the moment you’ve been waiting for: the implementation!

Note: all the code can be downloaded from this Github repo. However, I recommend that you follow the tutorial before looking at it, because then you’ll better understand the code you’re actually writing:

Github.com/Vagif12/ML-…

First, let’s do some basic imports:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

Yes, that’s all we need to import! We use NumPy for the mathematical implementation, Matplotlib for plotting, and scikit-learn’s Boston housing dataset.

Load and split data
data = load_boston()
X,y = data['data'],data['target']

Next, let’s create a custom train_test_split function to split our data into a training and test set:

# Split training and test sets
def train_test_divide(X,y,test_size=0.3,random_state=42):
    np.random.seed(random_state)
    train_size = 1 - test_size
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand,(100*train_size))
    
    X_train = X[split]
    y_train = y[split]
    X_test =  X[~split]
    y_test = y[~split]
    
    return X_train, X_test, y_train, y_test
X_train,X_test,y_train,y_test = train_test_divide(X,y,test_size=0.3,random_state=42)

Basically, we:

  1. Get the test set size.

  2. Set a random seed to ensure our results are reproducible.

  3. Derive the training set size from the test set size.

  4. Take a random sample of our instances.

  5. Split the randomly selected instances into a training set and a test set.

Our cost function

We will implement MSE, or mean squared error, a common cost function for regression tasks:

def mse(preds,y):
        m = len(y)
        return 1/(m) * np.sum(np.square((y - preds)))
  • m refers to the number of training instances

  • y refers to our label vector

  • preds refers to our predictions

To write clean, reusable, and efficient code that adheres to good software development practice, we will create a linear regression class:

class LinReg:
    def __init__(self,X,y):
        self.X = X
        self.y = y
        self.m = len(y)
        self.bgd = False
  • bgd is a flag that defines whether batch gradient descent should be used.

Now we will create a method to add the intercept term:

def add_intercept_term(self,X):
        # Insert a column of ones at index 0 to serve as the intercept feature
        X = np.insert(X,0,np.ones(X.shape[0]),axis=1).copy()
        return X

This inserts a column of ones at the beginning of our features. It exists purely so that the matrix multiplication picks up the intercept term.

If we did not add this, we would force the hyperplane to pass through the origin, causing it to tilt so much that it could not fit the data properly.

Scaling our features:

def feature_scale(self,X):
        # Standardize: subtract the overall mean and divide by the overall standard deviation
        X = (X - X.mean()) / (X.std())
        return X

Next, we will randomly initialize the weights:

def initialise_thetas(self):
        np.random.seed(42)
        self.thetas = np.random.rand(self.X.shape[1])

Now we will write the normal equation from scratch, using the formula $\theta = (X^T X)^{-1} X^T y$:

def normal_equation(self):
        A = np.linalg.inv(np.dot(self.X.T,self.X))  # (X^T X)^-1
        B = np.dot(self.X.T,self.y)                 # X^T y
        thetas = np.dot(A,B)
        return thetas

Basically, we split the computation into three parts:

  1. We take the inverse of X transpose dot X

  2. We take the dot product of X transpose and the label vector

  3. We take the dot product of those two results

That’s the normal equation! Not bad! Now we will implement batch gradient descent, using the update rule $\theta := \theta - \alpha\,\frac{1}{m}\,X^T(X\theta - y)$ shown earlier:

def batch_gradient_descent(self,alpha,n_iterations):
        self.cost_history = [0] * (n_iterations)
        self.n_iterations = n_iterations
        
        for i in range(n_iterations):
            h = np.dot(self.X,self.thetas.T)  # current predictions
            gradient = alpha * (1/self.m) * ((h - self.y)).dot(self.X)  # learning rate times the gradient
            
            self.thetas = self.thetas - gradient  # step in the negative gradient direction
            self.cost_history[i] = mse(np.dot(self.X,self.thetas.T),self.y)  # record the cost
            
        return self.thetas

Here, we do the following:

  1. We set alpha (the learning rate) and the number of iterations

  2. We create a list to store our cost function history, for plotting later

  3. We loop through the data set n_iterations times

  4. We obtain the predictions and compute the gradient (the slope of the tangent line to the cost function)

  5. We update the weights by moving in the negative direction of the gradient

  6. We record the current cost using our custom MSE function

  7. Repeat; when done, we return the final weights

Let’s define a fitting function to fit our data:

def fit(self,bgd=False,alpha=0.158,n_iterations=4000):
        self.X = self.add_intercept_term(self.X)
        self.X = self.feature_scale(self.X)
        if not bgd:
            self.thetas = self.normal_equation()
        else:
            self.bgd = True
            self.initialise_thetas()
            self.thetas = self.batch_gradient_descent(alpha,n_iterations)

Here we simply check whether the user wants to use gradient descent, and proceed accordingly.

Let’s build a function to plot the cost function:

def plot_cost_function(self):
        if self.bgd:
            plt.plot(range(self.n_iterations),self.cost_history)
            plt.xlabel('No. of iterations')
            plt.ylabel('Cost Function')
            plt.title('Gradient Descent Cost Function Line Plot')
            plt.show()
        else:
            print('Batch Gradient Descent was not used!')

One final method, to predict unlabeled instances:

def predict(self,X_test):
        self.X_test = X_test.copy()
        self.X_test = self.add_intercept_term(self.X_test)
        self.X_test = self.feature_scale(self.X_test)
        predictions = np.dot(self.X_test,self.thetas.T)
        return predictions

Now let’s see which optimization method produces better results. First, let’s try gradient descent:

lin_reg_bgd = LinReg(X_train,y_train)
lin_reg_bgd.fit(bgd=True)

mse(y_test,lin_reg_bgd.predict(X_test))

OUT:
28.824024414708344

Let’s plot the cost function to see how it decreases over the iterations:
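Using the plot_cost_function method we defined on the class:

lin_reg_bgd.plot_cost_function()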

So we can see that at about 1000 iterations it starts to converge.

Now the normal equation:

lin_reg_normal = LinReg(X_train,y_train)
lin_reg_normal.fit()

mse(y_test,lin_reg_normal.predict(X_test))

OUT:
22.151417764247284

So we can see that the normal equation performs slightly better than gradient descent. This may be because the data set is small and we did not choose the optimal learning rate.

Going further

  1. Greatly increase the learning rate. What happens?

  2. Skip feature scaling. Is there a difference?

  3. See if you can implement an even better optimization algorithm. Evaluate your model on the test set.


Writing this article was really fun and, although it is a bit long, I hope you learned something today.

Original link: towardsdatascience.com/machine-lea…
