This article is part of a series of notes and code reproductions (this part: logistic regression) for Andrew Ng’s machine learning course [1].

Author: Huang Haiguang [2]

Note: Notes and assignments (including data and original assignment files) and videos can be downloaded on Github [3].

I publish the course notes and course code on the official account “Machine Learning Beginners”. This is part two, Logistic Regression, which corresponds to the third week of the original course and includes both the notes and the assignment code (the original assignments were written in Octave; they are reproduced here in Python).

Part ONE: Regression code

The assignment code [4] for this article can be downloaded in full.

Markdown file for notes [5]

PDF file of notes [6]

Notes Section contents

6. Logistic Regression

  • 6.1 Classification Problems
  • 6.2 Representation of hypothesis
  • 6.3 Determining boundaries
  • 6.4 Cost Function
  • 6.5 Simplified cost function and gradient descent
  • 6.6 Advanced Optimization
  • 6.7 Multiclass Classification: One-vs-All

7. Regularization

  • 7.1 Problems of over-fitting
  • 7.2 Cost Function
  • 7.3 Regularized linear regression
  • 7.4 Regularized logistic regression model

6. Logistic Regression

6.1 Classification Problems

Reference video: 6 - 1 - Classification (8 min).mkv

In this video and the next few videos, we’ll start talking about classification.

In classification problems, the variable you are predicting takes discrete values. We will study an algorithm called logistic regression, which is one of the most popular and widely used learning algorithms.

In a classification problem, we try to predict whether the result belongs to a certain class (for example, true or false). Examples include determining whether an email is spam, determining whether a financial transaction is fraudulent, and the tumor classification problem we discussed earlier, which distinguishes malignant tumors from benign ones.

Let’s start with the problem of binary classification.

The two classes that the dependent variable $y$ may belong to are called the negative class and the positive class, so $y \in \{0, 1\}$, where 0 represents the negative class and 1 represents the positive class.

Suppose we use linear regression to solve a classification problem. The label $y$ is 0 or 1, but the output of a linear regression hypothesis may be much greater than 1 or much less than 0, even though every training sample has a label of 0 or 1. Since we know the label should be 0 or 1, it would be strange for the algorithm to output values far outside that range. The algorithm we study next, logistic regression, has the property that its output always lies between 0 and 1.

By the way, logistic regression is a classification algorithm, and we use it as one. The word “regression” in its name can be confusing, but logistic regression is applied when the label $y$ takes discrete values such as 0 and 1.

In the next video, we’ll start learning the details of the logistic regression algorithm.

6.2 Representation of hypothesis

Reference video: 6 - 2 - Hypothesis Representation (7 min).mkv

In this video I want to show you the expression for the hypothesis function, that is, the function we use to represent our hypothesis in a classification problem. We said we want the output of our classifier to be between 0 and 1, so we need a hypothesis function whose predicted value always lies between 0 and 1.

Recalling the classification of breast cancer mentioned at the beginning, we can use linear regression to find a straight line suitable for the data:

A linear regression model can only predict continuous values, whereas a classification problem needs an output of 0 or 1. We can predict:

When $h_\theta(x) \ge 0.5$, predict $y = 1$.

When $h_\theta(x) < 0.5$, predict $y = 0$.

For the data shown above, such a linear model seems to do a good job of classifying. But suppose we observe another, very large malignant tumor and add it to our training set as an example. Refitting then gives us a new line.

At this point, it is not appropriate to use 0.5 as a threshold to predict whether the tumor is benign or malignant. It can be seen that the linear regression model is not suitable for solving such problems because its predicted values can exceed the range of [0,1].
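To make the threshold problem concrete, here is a minimal sketch that fits a straight line to 0/1 tumor labels and finds where that line crosses 0.5. The tumor sizes and the helper name `threshold_at_half` are made up for illustration and are not part of the course material.

```python
import numpy as np


def threshold_at_half(sizes, labels):
    """Fit y = a*x + b by least squares and return the x where the line equals 0.5."""
    a, b = np.polyfit(sizes, labels, deg=1)
    return (0.5 - b) / a


# Hypothetical tumor sizes; 0 = benign, 1 = malignant.
sizes = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
before = threshold_at_half(sizes, labels)

# Add one very large malignant tumor and refit.
sizes_out = np.append(sizes, 30.0)
labels_out = np.append(labels, 1.0)
after = threshold_at_half(sizes_out, labels_out)

print(f"threshold before: {before:.2f}, after: {after:.2f}")
# True would mean the malignant tumor of size 5 is still classified correctly.
print("size 5 predicted malignant after refit:", 5.0 >= after)
```

With these numbers the single outlier pushes the 0.5 crossing to the right of size 5, so a tumor that was classified correctly before is now predicted benign.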

We introduce a new model, logistic regression, whose output always lies between 0 and 1. The hypothesis of the logistic regression model is $h_\theta(x) = g(\theta^T X)$, where $X$ is the feature vector and $g$ is the logistic function, a commonly used S-shaped (sigmoid) function given by $g(z) = \frac{1}{1 + e^{-z}}$.

Python code implementation:

import numpy as np


def sigmoid(z):
    # Logistic (sigmoid) function; works on scalars and NumPy arrays
    return 1 / (1 + np.exp(-z))

The graph of this function is:

Together, we get the hypothesis of the logistic regression model: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T X}}$

Understanding of the model: for a given input variable, $h_\theta(x)$ uses the chosen parameters to compute the estimated probability that the output variable equals 1, i.e. $h_\theta(x) = P(y = 1 \mid x; \theta)$. For example, if for a given $x$ the fitted parameters yield $h_\theta(x) = 0.7$, there is a 70% probability that $y$ is the positive class, and the corresponding probability of the negative class is 1 - 0.7 = 0.3.
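A small sketch of this probabilistic reading, assuming made-up parameters `theta` and a single made-up example `x` whose first entry is the constant 1 for the intercept:

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# Hypothetical parameters and one example; x[0] = 1 is the intercept term.
theta = np.array([-4.0, 1.0, 0.5])
x = np.array([1.0, 3.0, 4.0])

p_positive = sigmoid(theta @ x)   # estimated P(y = 1 | x; theta)
p_negative = 1 - p_positive       # estimated P(y = 0 | x; theta)
print(p_positive, p_negative)
```

The two probabilities always sum to 1, so knowing the output of the hypothesis is enough to know both.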

6.3 Determining boundaries

Reference video: 6 - 3 - Decision Boundary (15 min).mkv

Now, the concept of decision boundaries. This concept helps us better understand what the hypothesis function of logistic regression is calculating.

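In logistic regression, we predict:

When $h_\theta(x) \ge 0.5$, predict $y = 1$.

When $h_\theta(x) < 0.5$, predict $y = 0$.

From the S-shaped function plotted above, we know that:

$g(z) = 0.5$ when $z = 0$

$g(z) > 0.5$ when $z > 0$

$g(z) < 0.5$ when $z < 0$

Since $z = \theta^T x$, this means: when $\theta^T x \ge 0$, predict $y = 1$; when $\theta^T x < 0$, predict $y = 0$.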

Now suppose we have a model: $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$

And the parameter vector $\theta$ is [-3, 1, 1]. Then when $-3 + x_1 + x_2 \ge 0$, i.e. $x_1 + x_2 \ge 3$, the model predicts $y = 1$. We can draw the straight line $x_1 + x_2 = 3$; that line is the decision boundary of our model, separating the region where the prediction is 1 from the region where the prediction is 0.

Given this distribution of our data, what model would fit?

Since the region $y = 0$ and the region $y = 1$ need to be separated by a curve, we need quadratic features: $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$. If $\theta$ is [-1, 0, 0, 1, 1], the decision boundary we get is exactly a circle centered at the origin with radius 1.
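The two boundaries above can be checked with a few lines of code. This is a sketch only; the function names `predict_linear` and `predict_circle` are made up, and the parameter values are the ones quoted in the text.

```python
def predict_linear(x1, x2, theta=(-3.0, 1.0, 1.0)):
    """Predict 1 when theta0 + theta1*x1 + theta2*x2 >= 0, else 0."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return int(z >= 0)


def predict_circle(x1, x2, theta=(-1.0, 0.0, 0.0, 1.0, 1.0)):
    """Same rule with the extra quadratic features x1^2 and x2^2."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2 + theta[3] * x1**2 + theta[4] * x2**2
    return int(z >= 0)


# One point on each side of the line x1 + x2 = 3.
print(predict_linear(1, 1), predict_linear(2, 2))
# One point inside and one outside the unit circle.
print(predict_circle(0.5, 0.5), predict_circle(1, 1))
```

Note that the sigmoid never has to be evaluated to make a prediction: the sign of $\theta^T x$ already decides the class.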

We can use very complex models to fit the decision boundaries of very complex shapes.

6.4 Cost Function

Reference video: 6 - 4 - Cost Function (11 min).mkv

In this video we are going to show you how to fit the parameters $\theta$ of a logistic regression model. Specifically, I will define the optimization objective, or cost function, used to fit the parameters. This is the fitting problem for the logistic regression model in a supervised learning setting.

For linear regression models, we defined the cost function as the sum of squared model errors. In theory we could use the same definition for logistic regression, but when we plug $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ into a cost function defined this way, the result is a non-convex function.

This means that the cost function has many local minima, which makes it harder for gradient descent to find the global minimum.

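The cost function of linear regression is: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. We redefine the cost function of logistic regression as: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$, where

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

The relationship between $h_\theta(x)$ and $\mathrm{Cost}(h_\theta(x), y)$ is shown in the figure below:

The function constructed this way has the following properties: when the actual $y = 1$ and $h_\theta(x)$ is also 1, the error is 0; when $y = 1$ but $h_\theta(x)$ is not 1, the error increases as $h_\theta(x)$ becomes smaller. When the actual $y = 0$ and $h_\theta(x)$ is also 0, the cost is 0; when $y = 0$ but $h_\theta(x)$ is not 0, the error increases as $h_\theta(x)$ increases. The constructed $\mathrm{Cost}(h_\theta(x), y)$ simplifies to: $\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$. Substituting into the cost function gives: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$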

Python code implementation:

import numpy as np


def cost(theta, X, y):
    # theta: (n+1,) parameters, X: (m, n+1) design matrix, y: (m, 1) labels
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / len(X)
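As a quick sanity check of the cost function, the sketch below evaluates an equivalent version written with plain NumPy arrays (an assumption made here for brevity; the function above uses `np.matrix`) on a tiny made-up data set. With all parameters at zero every prediction is 0.5, so the cost should equal $\log 2$.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(theta, X, y):
    """Mean cross-entropy cost J(theta) using plain ndarrays; y is a 1-D 0/1 vector."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))


# Tiny made-up data set: a column of ones plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

j_zero = cost(np.zeros(2), X, y)             # every prediction is 0.5
j_good = cost(np.array([-4.5, 2.0]), X, y)   # boundary at x = 2.25, between the classes
print(j_zero, j_good)
```

Parameters that place the boundary between the two groups give a lower cost than the all-zero starting point.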

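After obtaining such a cost function, we can use gradient descent to find the parameters that minimize it. The algorithm is as follows:

Repeat { $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$ (simultaneously update all $\theta_j$) }

After taking the derivative, we get:

Repeat { $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ (simultaneously update all $\theta_j$) }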

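In this video, we defined the cost function for a single training sample. Convexity analysis is beyond the scope of this course, but it can be shown that the cost function we chose gives us a convex optimization problem: the cost function is convex and has no local optima other than the global one.

Derivation process:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$

Consider: $h_\theta(x^{(i)}) = g(\theta^T x^{(i)})$ with $g'(z) = g(z)(1 - g(z))$, which gives $\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right)x_j^{(i)}$

So:

$\frac{\partial}{\partial\theta_j}J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\left(1 - h_\theta(x^{(i)})\right) - (1 - y^{(i)})h_\theta(x^{(i)})\right]x_j^{(i)} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

Note: Although the resulting gradient descent update looks the same as the one for linear regression, here $h_\theta(x) = g(\theta^T X)$ differs from the hypothesis of linear regression, so the two algorithms are actually different. In addition, it is still necessary to perform feature scaling before running gradient descent.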
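One way to gain confidence in the derivative formula is to compare it with a finite-difference approximation of the cost. The following is a minimal sketch on made-up data; the helper `numerical_gradient` and the step size `eps` are illustrative choices, not part of the course code.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))


def gradient(theta, X, y):
    """Analytic gradient: (1/m) * X^T (h - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad


# Made-up data: intercept column plus two features.
X = np.array([[1.0, 0.5, 2.0], [1.0, 1.5, 0.3], [1.0, 3.0, 1.0], [1.0, 4.0, 2.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.array([0.1, -0.2, 0.3])

max_diff = np.max(np.abs(gradient(theta, X, y) - numerical_gradient(theta, X, y)))
print(max_diff)
```

If the analytic formula were wrong (for example, a flipped sign), the difference would be of the same order as the gradient itself rather than close to zero.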

Some alternatives to gradient descent: besides gradient descent, several other algorithms are often used to minimize the cost function. They are more sophisticated, usually do not require choosing the learning rate by hand, and are usually faster than gradient descent. They include conjugate gradient, BFGS (Broyden-Fletcher-Goldfarb-Shanno), and limited-memory BFGS (L-BFGS).

6.5 Simplified cost function and gradient descent

Reference video: 6 - 5 - Simplified Cost Function and Gradient Descent (10 min).mkv

In this video, we’re going to figure out a slightly simpler way to write cost functions, instead of the way we’re doing it now. At the same time, we need to figure out how to use the gradient descent method to fit the parameters of logistic regression. Therefore, after listening to this lecture, you should know how to implement a complete logistic regression algorithm.

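This is the cost function of logistic regression:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right), \qquad \mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

This formula can be combined as:

$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$

That is, the cost function of logistic regression:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$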

Given this cost function, what do we do to fit the parameters? We look for the parameters $\theta$ that make $J(\theta)$ as small as possible, that is, we solve $\min_\theta J(\theta)$, which gives us some parameter vector $\theta$. Given a new sample with features $x$, we can then use the parameters fitted on the training set to output the prediction of the hypothesis. Recall that the output of our hypothesis is a probability: $h_\theta(x) = P(y = 1 \mid x; \theta)$, the probability that $y = 1$ given the input $x$ and the parameters $\theta$. So the next step is to work out how to minimize the cost function $J(\theta)$ as a function of $\theta$, so that we can fit the parameters to the training set.

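The method to minimize the cost function is gradient descent. Here is our cost function:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$

If we want to minimize this function of $\theta$, this is the template for gradient descent that we usually use:

Repeat { $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$ (simultaneously update all $\theta_j$) }

We update each parameter over and over again with this expression: the parameter itself minus the learning rate $\alpha$ times the derivative term. Computing the derivative gives:

$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

The sum is essentially the prediction error times $x_j^{(i)}$. Putting this partial derivative back into the template, we can write the gradient descent algorithm as follows:

Repeat { $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ (simultaneously update all $\theta_j$) }

So, if you have $n$ features, the parameter vector $\theta$ includes $\theta_0, \theta_1, \theta_2$ all the way to $\theta_n$, and you use this expression to update all of them at the same time.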

Now, if you compare this update rule to what we used for linear regression, you’ll be surprised to see that this is exactly what we used for linear regression gradient descent.

So, are linear regression and logistic regression the same algorithm? To answer this question, we need to look at logistic regression and see what has changed. In fact, the definition of the hypothesis has changed.

For linear regression, the hypothesis is: $h_\theta(x) = \theta^T X = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

For logistic regression, the hypothesis is now: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T X}}$

So even though the update rules look basically the same, gradient descent for logistic regression and gradient descent for linear regression are actually two different things, because the definition of the hypothesis has changed.

In the previous video, when we talked about gradient descent for linear regression, we talked about how to monitor gradient descent to make sure it converges, and I usually do the same thing for logistic regression, to monitor gradient descent to make sure it converges properly.

When we run gradient descent for logistic regression, we have the parameters $\theta_0$ through $\theta_n$, and we need to update all of them with this expression. We could update them with a for loop, for example for i = 0 to n, or for i = 1 to n+1. It is also possible to avoid the for loop, and ideally we prefer a vectorized implementation that updates all of these parameters at the same time.
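The difference between the two styles can be sketched as follows. Both functions perform one simultaneous update of all parameters; the data, the learning rate, and the function names `step_loop` and `step_vectorized` are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def step_loop(theta, X, y, alpha):
    """One gradient descent step with an explicit loop over the parameters."""
    m = len(y)
    error = sigmoid(X @ theta) - y          # computed once, before any update
    new_theta = theta.copy()
    for j in range(len(theta)):
        new_theta[j] = theta[j] - alpha * np.sum(error * X[:, j]) / m
    return new_theta


def step_vectorized(theta, X, y, alpha):
    """The same simultaneous update written as one matrix expression."""
    return theta - alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)


# Made-up data: intercept column plus two features.
X = np.array([[1.0, 0.5, 2.0], [1.0, 1.5, 0.3], [1.0, 3.0, 1.0], [1.0, 4.0, 2.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(3)

same = np.allclose(step_loop(theta, X, y, 0.1), step_vectorized(theta, X, y, 0.1))
print(same)
```

In the loop version the error vector is computed before any parameter changes, which is what makes the update simultaneous; recomputing it inside the loop with partially updated parameters would be a different algorithm.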

And finally, we talked about feature scaling earlier when we talked about linear regression, and we saw how feature scaling improves the convergence rate of gradient descent, and this feature scaling method also applies to logistic regression. If you have a wide range of features, then feature scaling can also make gradient descent converge faster in logistic regression.

That’s it. Now you know how to implement logistic regression, which is a very powerful and probably the most widely used classification algorithm in the world.

6.6 Advanced Optimization

Reference video: 6 - 6 - Advanced Optimization (14 min).mkv

In the last video, we talked about minimizing the cost function of logistic regression using gradient descent. In this video, I will teach you some advanced optimization algorithms and concepts. With these methods, logistic regression can run much faster than with gradient descent, which also makes the algorithm better suited to large machine learning problems, for example problems with a very large number of features. Now let’s look at gradient descent from a different angle. We have a cost function $J(\theta)$ that we want to minimize. What we need to do is write code that, given the parameters $\theta$, computes two things: $J(\theta)$ and the partial derivatives $\frac{\partial}{\partial\theta_j}J(\theta)$ for $j$ equal to 0, 1, all the way to $n$.

Assuming we have code that does both of these things, all gradient descent does is repeatedly perform the update $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$. Another way to think about gradient descent: we write code to compute $J(\theta)$ and its partial derivatives, plug these into gradient descent, and it minimizes the function for us. Technically, gradient descent does not need code that computes the cost function $J(\theta)$; it only needs the derivatives. But if you want your code to monitor the convergence of $J(\theta)$, you need to write code for both the cost function and the partial derivatives. Once we have code that computes both, we can use gradient descent.

However, gradient descent is not the only algorithm we can use. There are other, more advanced and more sophisticated algorithms that optimize the cost function in different ways once we can compute these two quantities. Conjugate gradient, BFGS, and L-BFGS (limited-memory BFGS) are examples. Each needs a way to compute $J(\theta)$ and a way to compute the derivative terms, and then uses a more sophisticated strategy than gradient descent to minimize the cost function. The details of these three algorithms are beyond the scope of this course. In fact, you could easily spend days or weeks studying them in an advanced numerical computing course. Let me just tell you about some of their properties:

These three algorithms have many advantages:

One is that with any of these algorithms you usually don’t have to choose the learning rate $\alpha$ manually. One way to think about them is that, given a way to compute the derivative terms and the cost function, they have an intelligent inner loop, called a line search, that automatically tries different learning rates and picks a good one. It can even choose a different learning rate for each iteration, so you don’t have to choose one yourself. These algorithms actually do more sophisticated things than just choosing a good learning rate, so they often converge much faster than gradient descent, but a detailed discussion of what they do is beyond the scope of this course.

(This part of the lecture is omitted here.)

The main point to take away is this: write a function that returns the value of the cost function and the value of the gradient. To apply this to logistic regression, or even to linear regression, all you need to do is supply the code that computes these two things; the same optimization algorithms work for both.
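The idea of "a function that returns the cost and the gradient" can be sketched in Python with SciPy. This is an assumption made for illustration: the lecture itself demonstrates the idea in Octave, and the data set below is made up. `scipy.optimize.minimize` accepts `jac=True` to indicate that the objective returns the pair (cost, gradient).

```python
import numpy as np
from scipy.optimize import minimize


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost_and_gradient(theta, X, y):
    """Return J(theta) and its gradient, the two quantities the optimizer needs."""
    h = sigmoid(X @ theta)
    j = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    grad = X.T @ (h - y) / len(y)
    return j, grad


# Made-up data whose classes overlap, so that a finite minimizer exists.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# jac=True tells SciPy that the function returns (cost, gradient) together.
result = minimize(cost_and_gradient, x0=np.zeros(2), args=(X, y), method="BFGS", jac=True)
print(result.x, result.fun)
```

No learning rate appears anywhere in this code; the optimizer chooses its own step sizes internally.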

Now you know how to use these advanced optimization algorithms. Because you are relying on a sophisticated optimization library, the algorithm becomes somewhat more opaque and perhaps a little harder to debug. But these algorithms typically run much faster than gradient descent.

So when I have a big machine learning problem, I choose these advanced algorithms over gradient descent. With these concepts, you should be able to apply logistic regression and linear regression to larger problems, which is the concept of advanced optimization.

In the next video, I want to show you how to modify a logistic regression algorithm that you already know, and make it work for multicategory classification problems.

6.7 Multiclass Classification: One-vs-All

Reference video: 6 - 7 - Multiclass Classification_ One-vs-all (6 min).mkv

In this video, we will talk about how to use Logistic regression to solve multi-category classification problems. Specifically, I want to use a classification algorithm called one-vs-all.

Let’s look at some examples.

First example: suppose you want a learning algorithm that automatically sorts your email into different folders, or tags it automatically. You might have several folders or labels: email from work, from friends, from family, and about hobbies. This gives a classification problem with four classes, represented by $y = 1$, $y = 2$, $y = 3$, and $y = 4$.

The second example is about medical diagnosis. If a patient comes into your clinic with a stuffy nose, they may not be ill, represented by $y = 1$; they may have a cold, represented by $y = 2$; or they may have the flu, represented by $y = 3$.

Third example: if you are using machine learning for a weather classification problem, you might want to distinguish sunny, cloudy, rainy, and snowy days. In all of the examples above, $y$ takes a small number of discrete values, such as 1 to 3 or 1 to 4. These are multiclass classification problems. By the way, it doesn’t matter whether the classes are indexed 0, 1, 2, 3 or 1, 2, 3, 4. I prefer to start the class index at 1 rather than 0, but the choice makes no real difference.

For the previous binary classification problem, however, our data might look something like this:

For a multi-class classification problem, our data set might look like this:

I use 3 different symbols to represent 3 categories, and the question is given 3 types of data sets, how do we get a learning algorithm to classify them?

We already know how to do binary classification: with logistic regression, and for example a straight line, you can split a data set into a positive class and a negative class. Using the idea of one-vs-all classification, we can apply the same tool to multiclass problems.

Here is how one-vs-all classification works. It is sometimes also called “one-vs-rest”.

Now we have a training set, as shown in the figure above, with three categories, which we represent by triangles, boxes, and crosses. What we’re going to do is take a training set and divide it into three binary classification problems.

So let’s start with class 1, which is represented by triangles, and we can actually create a new pseudo training set, with type 2 and type 3 as negative and type 1 as positive, and we create a new training set, as shown in the figure below, and we’re going to fit out a suitable classifier.

The triangles here are the positive samples, and the circles are the negative samples. Think of it this way: set the value of the triangles to 1 and the value of the circles to 0, and train a standard logistic regression classifier, which gives us a decision boundary.

To achieve this transformation, we mark one of the classes as the positive class ($y = 1$) and all the other classes as negative. This model is denoted $h_\theta^{(1)}(x)$. Then, similarly, we choose another class to label as positive ($y = 2$), label all the other classes as negative, and denote that model $h_\theta^{(2)}(x)$, and so on. In the end we get a series of models, which can be summarized as: $h_\theta^{(i)}(x) = p(y = i \mid x; \theta)$, where $i = 1, 2, 3, \dots, k$

Finally, when we need to make predictions, we run all the classifiers, and for each input variable, we choose the most likely output variable.

To summarize, what we need to do is train a logistic regression classifier $h_\theta^{(i)}(x)$ for each possible class $i$. Then, to make a prediction for a new input $x$, we feed $x$ into each of our three classifiers and pick the class $i$ for which $h_\theta^{(i)}(x)$ is largest, i.e. $\max_i h_\theta^{(i)}(x)$.

You now know the basic rule for picking a class: choose the classifier that is most confident. Whichever value of $i$ gives the highest probability is the value of $y$ we predict. This is the multiclass classification problem and the one-vs-all approach. With this small trick, you can use logistic regression classifiers on multiclass problems as well.
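The one-vs-all procedure described above can be sketched as follows. The three clusters are synthetic, the helper names (`train_binary`, `one_vs_all`, `predict`) and the learning rate and iteration count are illustrative assumptions, and plain batch gradient descent stands in for whatever optimizer you prefer.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def train_binary(X, y, alpha=0.1, iterations=2000):
    """Plain batch gradient descent for one binary logistic regression classifier."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta


def one_vs_all(X, labels, classes):
    """Train one classifier per class: that class is positive, all others negative."""
    return np.array([train_binary(X, (labels == c).astype(float)) for c in classes])


def predict(all_theta, X, classes):
    """Pick the class whose classifier reports the highest probability."""
    return classes[np.argmax(sigmoid(X @ all_theta.T), axis=1)]


# Three made-up, well-separated clusters; the first column is the intercept term.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
points = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
X = np.hstack([np.ones((60, 1)), points])
labels = np.repeat(np.array([1, 2, 3]), 20)
classes = np.array([1, 2, 3])

all_theta = one_vs_all(X, labels, classes)
accuracy = np.mean(predict(all_theta, X, classes) == labels)
print(accuracy)
```

Each row of `all_theta` holds the parameters of one binary classifier, so prediction reduces to one matrix product followed by an argmax over the classes.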

7. Regularization

7.1 Problems of over-fitting

Reference video: 7 - 1 - The Problem of Overfitting (10 min).mkv

So far, we have studied several different learning algorithms, including linear regression and logistic regression, which are effective in solving many problems, but when they are applied to certain machine learning applications, there are problems with over-fitting, which can cause them to perform poorly.

In this video I’ll explain what an overfitting problem is, and in the next few videos we’ll talk about a technique called regularization that can improve or reduce overfitting.

If we have a very large number of features, the hypotheses we learn may fit the training set very well (the cost function may be almost zero), but may not generalize to the new data.

Here is an example of a regression problem:

The first model is a linear model. It underfits: it does not fit our training set well. The third model, a fourth-degree polynomial, puts too much emphasis on fitting the raw data and loses sight of the real purpose of the algorithm, which is predicting new data. If we give it a new value to predict, it will perform badly. This is overfitting: the model fits our training set very well, but it may not perform well when predicting from new inputs. The model in the middle seems to be the best fit.
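The effect on the training error can be reproduced with a few lines of NumPy. The five data points below are made up; the sketch only shows that the training error keeps falling as the polynomial degree rises, reaching essentially zero once the model has as many coefficients as there are points.

```python
import numpy as np

# Five made-up training points that rise and then flatten out.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.6, 3.4, 3.9, 4.1])

train_error = {}
for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, deg=degree)        # least-squares polynomial fit
    residuals = np.polyval(coeffs, x) - y
    train_error[degree] = np.mean(residuals**2)  # mean squared error on the training set

print(train_error)
```

A training error near zero for the fourth-degree fit says nothing about how it behaves on new inputs, which is exactly the point of the paragraph above.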

The same problem arises in classification:

In terms of polynomials, the higher the degree of $x$, the better the fit to the training data, but the worse the predictive ability may become.

The question is, if we find an overfitting problem, what do we do about it?

  1. Discard some features that don’t help us predict correctly. You can manually select which features to keep, or use some model selection algorithm to help (such as PCA)
  2. Regularization. Keep all the features, but reduce the magnitude of the parameter.

7.2 Cost Function

Reference video: 7 - 2 - Cost Function (10 min).mkv

In the regression problem above, suppose our model is: $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$. We can see from the previous examples that it is the higher-order terms that cause the overfitting, so if we can make the coefficients of these higher-order terms close to zero, we can fit well. So what we do is reduce the values of these parameters $\theta$ to some extent, and that is the basic idea of regularization. We decide to reduce the size of $\theta_3$ and $\theta_4$, and to do so we modify the cost function by putting a penalty on $\theta_3$ and $\theta_4$. When minimizing the cost we then have to take this penalty into account, and we end up choosing smaller $\theta_3$ and $\theta_4$. The modified cost function is as follows: $\min_\theta \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\theta_3^2 + 10000\theta_4^2\right]$

The $\theta_3$ and $\theta_4$ selected with such a cost function have much less influence on the predicted results than before. If we have a very large number of features and do not know which of them to penalize, we penalize all of them and let the software that optimizes the cost function choose the degree of those penalties. As a result, we obtain a relatively simple hypothesis that helps prevent overfitting: $J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$

Here $\lambda$ is called the regularization parameter. Note: by convention, we do not penalize $\theta_0$. A possible comparison between the regularized model and the original model is shown in the following figure:

If the regularization parameter $\lambda$ is too large, all parameters are shrunk toward zero and the model becomes $h_\theta(x) = \theta_0$, the red line in the figure above, which underfits. So why does the additional term $\lambda\sum_{j=1}^{n}\theta_j^2$ decrease the values of $\theta$? Because if we make $\lambda$ very large, then to make the cost function as small as possible, all the values of $\theta$ (excluding $\theta_0$) are reduced to some extent. But if $\lambda$ is too large, $\theta$ (excluding $\theta_0$) approaches 0 and all we get is a line parallel to the $x$ axis. Therefore, for regularization to work well, we should choose a reasonable value of $\lambda$. Having reviewed the cost function, let’s apply regularization to linear regression and logistic regression so that we can keep them from overfitting.

7.3 Regularized linear regression

Reference video: 7 - 3 - Regularized Linear Regression (11 min).mkv

For the solution of linear regression, we previously derived two learning algorithms: one based on gradient descent and one based on normal equations.

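The cost function of regularized linear regression is:

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$

If we minimize this cost function with gradient descent, then because $\theta_0$ is not regularized, the algorithm splits into two cases:

Repeat until convergence {

$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$

$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$

for $j = 1, 2, \dots, n$

}

Rearranging the update for $j = 1, 2, \dots, n$ in the algorithm above gives:

$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

It can be seen that regularized gradient descent differs from the original algorithm in that each update first shrinks $\theta_j$ by an extra factor and then applies the original update rule.

We can also use the normal equation to solve the regularized linear regression model, as shown below:

$\theta = \left(X^T X + \lambda L\right)^{-1} X^T y$, where $L$ is the identity matrix with its top-left entry replaced by 0.

The matrix $L$ has size $(n+1) \times (n+1)$.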
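A minimal sketch of the regularized normal equation, assuming the usual form $\theta = (X^T X + \lambda L)^{-1} X^T y$ with $L$ equal to the identity matrix except for a zero in the top-left entry. The data and the function name `normal_equation_reg` are made up for illustration.

```python
import numpy as np


def normal_equation_reg(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, where L is the identity with L[0, 0] = 0."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                      # do not penalize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)


# Made-up data: intercept column plus x, x^2, x^3 features.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

theta_plain = normal_equation_reg(X, y, lam=0.0)
theta_reg = normal_equation_reg(X, y, lam=10.0)
print(np.linalg.norm(theta_plain[1:]), np.linalg.norm(theta_reg[1:]))
```

With `lam=0.0` this reduces to the ordinary normal equation; with a positive `lam` the penalized coefficients come out smaller in norm, which is the shrinkage effect described in section 7.2.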

7.4 Regularized logistic regression model

Reference video: 7 - 4 - Regularized Logistic Regression (9 min).mkv

For logistic regression problems, we have looked at two kinds of optimization algorithms in previous lectures: we first used gradient descent to optimize the cost function $J(\theta)$, and then we looked at more advanced optimization algorithms that require you to supply the cost function $J(\theta)$ yourself.

Similarly, for logistic regression, we add a regularization term to the cost function to obtain: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$

Python code:

import numpy as np


def costReg(theta, X, y, learningRate):
    # learningRate here plays the role of the regularization parameter lambda
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    # theta_0 (the first column) is excluded from the penalty
    reg = (learningRate / (2 * len(X))) * np.sum(np.power(theta[:, 1:theta.shape[1]], 2))
    return np.sum(first - second) / len(X) + reg

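To minimize the cost function, the gradient descent algorithm is obtained by taking the derivative:

Repeat until convergence {

$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$

$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$

for $j = 1, 2, \dots, n$

}

Note: this looks the same as linear regression, but here $h_\theta(x) = g(\theta^T X)$, so it is different from linear regression. In Octave, we can still use the fminunc function to find the parameters that minimize the cost function, but note that the update rule for $\theta_0$ differs from that of the other parameters. Note:

  1. Although the gradient descent update for regularized logistic regression looks the same as the one for regularized linear regression, the two are quite different because their $h_\theta(x)$ differ.
  2. $\theta_0$ does not take part in any of the regularization.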
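The update rule above corresponds to the gradient sketched below, written with plain NumPy arrays and checked against finite differences of the regularized cost. The names `cost_reg` and `gradient_reg`, the data, and the value of `lam` are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost_reg(theta, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * sum of theta_j^2 for j >= 1."""
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    return data_term + lam / (2 * len(y)) * np.sum(theta[1:] ** 2)


def gradient_reg(theta, X, y, lam):
    """Gradient of cost_reg; theta[0] gets no regularization term."""
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
    grad[1:] += lam / len(y) * theta[1:]
    return grad


# Made-up data: intercept column plus two features.
X = np.array([[1.0, 0.5, 2.0], [1.0, 1.5, 0.3], [1.0, 3.0, 1.0], [1.0, 4.0, 2.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.array([0.5, -0.2, 0.3])
lam = 1.0

# Finite-difference check of every coordinate.
eps = 1e-6
numeric = np.array([
    (cost_reg(theta + eps * e, X, y, lam) - cost_reg(theta - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(gradient_reg(theta, X, y, lam) - numeric)))
```

The first component of the gradient is the same with or without regularization, which matches the separate update rule for $\theta_0$.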

Right now you may feel you have only a rudimentary understanding of machine learning algorithms, but once you have mastered linear regression, advanced optimization algorithms, and regularization, frankly, you probably understand machine learning better than many engineers do. You now know a good deal about machine learning, arguably more than many Silicon Valley engineers who are building products with machine learning algorithms.

In the lessons that follow, we will learn a very powerful nonlinear classifier. With both linear regression and logistic regression we can construct polynomial features to handle nonlinear problems, but you will gradually discover that there are more powerful nonlinear classifiers for such problems. We are going to learn an algorithm that is far more powerful than the approaches we have used so far.


Code section

Machine learning exercise 2 – Logistic regression

This note contains the second programming exercise for machine learning on Coursera in Python. Please refer to the assignment file [1] for detailed descriptions and equations. In this exercise, we will implement logistic regression and apply it to a classification task. We will also improve the robustness of the algorithm by adding regularization to the training algorithm and test it in more complex cases.

Code modification and annotation: Huang Haiguang

Logistic regression

In the first part of this exercise, we will build a logistic regression model to predict whether a student will be admitted to a university. Imagine that you are an administrator in the relevant university department and want to estimate each applicant’s chance of admission from their scores on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training sample, you have the applicant’s two exam scores and the final admission decision. To accomplish this prediction task, we will build a classification model that estimates the probability of admission from the two exam scores.

Let’s start by examining the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

path = 'ex2data1.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
data.head()
   Exam 1     Exam 2     Admitted
0  34.623660  78.024693  0
1  30.286711  43.894998  0
2  35.847409  72.902198  0
3  60.182599  86.308552  1
4  79.032736  75.344376  1

Let’s create a scatter plot of two scores and use color coding to visualize if the sample is positive (accepted) or negative (not accepted).

positive = data[data['Admitted'].isin([1])]
negative = data[data['Admitted'].isin([0])]


fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
plt.show()

There seems to be a clear decision boundary between the two categories. Now we need to implement logistic regression so that we can train a model to predict the outcome. The equations implemented in the following code examples are described in “ex2.pdf” in the “Exercises” folder.

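The sigmoid function

$g$ denotes the logistic (sigmoid) function: $g(z) = \frac{1}{1 + e^{-z}}$

Together, we get the hypothesis function of the logistic regression model: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

def sigmoid(z):
    return 1 / (1 + np.exp(-z))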

Let’s do a quick check to make sure it works.

nums = np.arange(-10, 10, step=1)

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(nums, sigmoid(nums), 'r')
plt.show()
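Beyond the plot, a few numeric properties of the sigmoid are easy to check. The sketch below redefines `sigmoid` so that it is self-contained; the range of test inputs is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, same definition as in the text."""
    return 1 / (1 + np.exp(-z))

z = np.arange(-10, 10, step=1)
s = sigmoid(z)

# g(0) is exactly one half
print(sigmoid(0))
# outputs stay strictly inside (0, 1)
print(bool(np.all((s > 0) & (s < 1))))
# symmetry: g(-z) = 1 - g(z)
print(bool(np.allclose(sigmoid(-z), 1 - s)))
# the function is monotonically increasing
print(bool(np.all(np.diff(s) > 0)))
```

These properties are what allow the output to be read as a probability that crosses 0.5 exactly where the input is zero.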

Great! Now we need to write the cost function to evaluate the result. Cost function:

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$

def cost(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / (len(X))

Now, we’re going to do some setup, very similar to what we did in exercise 1 in linear regression.

# add a ones column - this makes the matrix multiplication work out easier
data.insert(0, 'Ones', 1)

# set X (training data) and y (target variable)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]

# convert to numpy arrays and initialize the parameter array theta
X = np.array(X.values)
y = np.array(y.values)
theta = np.zeros(3)

Let’s check the dimensions of the matrices to make sure everything is okay.

theta

array([ 0.,  0.,  0.])

X.shape, theta.shape, y.shape

((100, 3), (3,), (100, 1))

Let’s calculate the cost for the initial parameters (theta is all zeros).

cost(theta, X, y)

0.69314718055994529
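The value 0.693... is not specific to this dataset. With theta equal to zero every prediction is 0.5, so each sample contributes -log(0.5) = ln 2 to the average, whatever its label. The sketch below checks this on hypothetical random data (`X_demo` and `y_demo` are made up for illustration), using a plain-array version of the cost.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    """Mean cross-entropy; X is (m, n), y is (m,), theta is (n,)."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Hypothetical data, only to show the value does not depend on X or y
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y_demo = rng.integers(0, 2, size=20)

initial_cost = cost(np.zeros(3), X_demo, y_demo)
print(initial_cost, np.log(2))
```

Seeing ln 2 at the starting point is therefore a quick sign that the cost function is wired up correctly.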

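That looks good. Next, we need a function to calculate the gradient given our training data, labels, and parameters theta.

“Gradient descent”

  • This is batch gradient descent.
  • Vectorized form: $\frac{1}{m} X^T \left( \text{sigmoid}(X\theta) - y \right)$, whose j-th component is $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
def gradient(theta, X, y):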
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)


    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)


    error = sigmoid(X * theta.T) - y


    for i in range(parameters):
        term = np.multiply(error, X[:,i])
        grad[i] = np.sum(term) / len(X)


    return grad

Note that we are not actually performing gradient descent in this function; we are only computing the gradient for a single step. In the original exercise, an Octave function called “fminunc” is used to find the optimal parameters, given functions that compute the cost and the gradient. Since we use Python, we can do the same thing with SciPy’s “optimize” namespace.
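For contrast, a plain batch gradient descent loop would repeatedly apply that gradient. The following is a minimal sketch under stated assumptions: the data is hypothetical, and the step size `alpha` and the iteration count are illustrative values, not settings from the exercise.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Vectorized gradient of the mean cross-entropy cost."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def gradient_descent(X, y, alpha=0.1, iterations=500):
    """Repeat theta := theta - alpha * gradient, starting from zeros."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta = theta - alpha * gradient(theta, X, y)
    return theta

# Hypothetical data: one feature, noisy labels that tend to be 1 when it is positive
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X_demo = np.column_stack([np.ones(200), x])
y_demo = (x + 0.5 * rng.normal(size=200) > 0).astype(float)

theta_gd = gradient_descent(X_demo, y_demo)
print(cost(np.zeros(2), X_demo, y_demo), cost(theta_gd, X_demo, y_demo))
```

A hand-written loop like this needs a well-chosen step size and, on unscaled features such as raw exam scores, can converge slowly, which is one reason to hand the cost and gradient to an optimizer instead.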

Let’s look at the output of the gradient function with our data and zero initial parameters.

gradient(theta, X, y)

array([ -0.1       , -12.00921659, -11.26284221])
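A gradient function is easy to get subtly wrong, so it can be worth comparing it against finite differences of the cost. The sketch below does this for a vectorized gradient on hypothetical data; `gradient_vectorized`, `numerical_gradient`, and the demo arrays are names introduced here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient_vectorized(theta, X, y):
    """(1/m) * X^T (sigmoid(X theta) - y), with y as a flat (m,) array."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Hypothetical data and parameters
rng = np.random.default_rng(1)
X_demo = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y_demo = rng.integers(0, 2, size=50).astype(float)
theta_demo = np.array([0.1, -0.2, 0.3])

analytic = gradient_vectorized(theta_demo, X_demo, y_demo)
numeric = numerical_gradient(theta_demo, X_demo, y_demo)
print(np.max(np.abs(analytic - numeric)))
```

If the two disagree by more than a small tolerance, the analytic gradient (or the cost) most likely contains a bug.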

SciPy’s truncated Newton (TNC) implementation can now be used to find the optimal parameters.

import scipy.optimize as opt
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
result

(array([-25.1613186 ,   0.20623159,   0.20147149]), 36, 0)

Let’s see what the cost function evaluates to at this solution.

cost(result[0], X, y)

0.20349770158947464
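To get a feel for the fitted parameters, they can be used to score a single applicant. The sketch below assumes the fitted parameters are roughly (-25.16, 0.206, 0.201) and hard-codes them; the applicant with exam scores 45 and 85 and the helper `boundary_exam2` are hypothetical, introduced only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed values, taken as an approximation of the optimizer result
theta_fit = np.array([-25.1613186, 0.20623159, 0.20147149])

# Hypothetical applicant: the leading 1 is the intercept term
applicant = np.array([1.0, 45.0, 85.0])
probability = sigmoid(applicant @ theta_fit)
print(round(float(probability), 3))

# Decision boundary: theta0 + theta1 * x1 + theta2 * x2 = 0, solved for x2
def boundary_exam2(exam1, theta):
    return -(theta[0] + theta[1] * exam1) / theta[2]

print(round(float(boundary_exam2(45.0, theta_fit)), 1))
```

Under these assumed parameters the applicant's second score lies above the boundary value for a first score of 45, so the estimated probability is above 0.5 and the model would predict admission.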

Next, we need to write a function that outputs predictions for a dataset X using the parameters theta we learned. We can then use this function to score the training accuracy of our classifier. Hypothesis function of the logistic regression model: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

When $h_\theta(x)$ is greater than or equal to 0.5, y=1 is predicted.

When $h_\theta(x)$ is less than 0.5, y=0 is predicted.

def predict(theta, X):
    probability = sigmoid(X * theta.T)
    return [1 if x >= 0.5 else 0 for x in probability]
theta_min = np.matrix(result[0])
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
# percentage of correct predictions
accuracy = 100 * sum(correct) / len(correct)
print('accuracy = {0:.0f}%'.format(accuracy))

accuracy = 89%

Our logistic regression classifier correctly predicted whether a student was admitted 89 percent of the time. Not bad! Remember, though, that this is the accuracy on the training set. We did not hold out a separate test set or use cross-validation to get a true estimate, so this number is likely higher than the true performance (this topic will be explained later).
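A hold-out evaluation starts by splitting the rows into two disjoint groups. The helper below is a minimal sketch of that step only; `holdout_indices`, the 30 percent test fraction, and the seed are assumptions made here, not part of the original exercise.

```python
import numpy as np

def holdout_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle row indices and split them into disjoint train and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

train_idx, test_idx = holdout_indices(100)
print(len(train_idx), len(test_idx))

# Usage idea (X, y as in the text): fit theta on X[train_idx], y[train_idx],
# then report accuracy on X[test_idx], y[test_idx] only.
```

Accuracy measured on the held-out rows is an estimate of how the classifier behaves on applicants it has not seen during training.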

Regularized logistic regression

In the second part of this exercise, we will improve the logistic regression algorithm by adding a regularization term. If regularization is new to you, or if you want the mathematical background for this section, refer to “ex2.pdf” in the “Exercises” folder. In short, regularization is a term in the cost function that makes the algorithm favor a “simpler” model (in this case, a model with smaller coefficients). This helps to reduce overfitting and improve the generalization ability of the model. So, let’s get started.

Imagine you are a production supervisor at a factory and you have the results of two tests for some chips. From these two tests, you want to decide whether each chip should be accepted or discarded. To help you make this decision, you have a dataset of test results from past chips, from which you can build a logistic regression model.

Much like part 1, let’s start with data visualization!

path = 'ex2data2.txt'
data2 = pd.read_csv(path, header=None, names=['Test 1', 'Test 2', 'Accepted'])
data2.head()

     Test 1   Test 2  Accepted
0  0.051267  0.69956         1
1  0.092742  0.68494         1
2  0.213710  0.69225         1
3  0.375000  0.50219         1
4  0.513250  0.46564         1

positive = data2[data2['Accepted'].isin([1])]
negative = data2[data2['Accepted'].isin([0])]

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(positive['Test 1'], positive['Test 2'], s=50, c='b', marker='o', label='Accepted')
ax.scatter(negative['Test 1'], negative['Test 2'], s=50, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')
plt.show()

Wow, this data looks a lot more complicated than the previous one. In particular, you’ll notice that there is no linear decision boundary that properly separates the two classes. One way to handle this with a linear technique such as logistic regression is to construct features derived from polynomials of the original features. Let’s start by creating a set of polynomial features.

degree = 5
x1 = data2['Test 1']
x2 = data2['Test 2']


data2.insert(3, 'Ones', 1)


for i in range(1, degree):
    for j in range(0, i):
        data2['F' + str(i) + str(j)] = np.power(x1, i-j) * np.power(x2, j)


data2.drop('Test 1', axis=1, inplace=True)
data2.drop('Test 2', axis=1, inplace=True)


data2.head()

   Accepted  Ones       F10       F20       F21       F30       F31       F32       F40       F41       F42       F43
0         1     1  0.051267  0.002628  0.035864  0.000135  0.001839  0.025089  0.000007  0.000094  0.001286  0.017551
1         1     1  0.092742  0.008601  0.063523  0.000798  0.005891  0.043509  0.000074  0.000546  0.004035  0.029801
2         1     1  0.213710  0.045672  0.147941  0.009761  0.031616  0.102412  0.002086  0.006757  0.021886  0.070895
3         1     1  0.375000  0.140625  0.188321  0.052734  0.070620  0.094573  0.019775  0.026483  0.035465  0.047494
4         1     1  0.513250  0.263426  0.238990  0.135203  0.122661  0.111283  0.069393  0.062956  0.057116  0.051818
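The feature-building loops can be wrapped in a small function, which makes the naming scheme explicit and lets a single table row be checked by hand. This is a sketch; `polynomial_features` is a name introduced here, and it mirrors the loops above rather than any library routine.

```python
import numpy as np

def polynomial_features(x1, x2, degree=5):
    """Same loops as above: F{i}{j} = x1^(i-j) * x2^j for 1 <= i < degree, 0 <= j < i."""
    features = {}
    for i in range(1, degree):
        for j in range(0, i):
            features['F' + str(i) + str(j)] = np.power(x1, i - j) * np.power(x2, j)
    return features

# First row of the data (Test 1 = 0.051267, Test 2 = 0.69956)
row = polynomial_features(np.array([0.051267]), np.array([0.69956]))
print(list(row))
print(float(row['F21'][0]))
```

One detail worth noticing: because the inner loop stops at `j = i - 1`, the loops as written never produce pure powers of `x2` (for example `x2` itself or `x2` squared). Whether that is intended is not stated in the text.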

Now we need to modify the cost and gradient functions from Part 1 to include the regularization term. The first is the cost function:

Regularized cost function

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

def costReg(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    reg = (learningRate / (2 * len(X))) * np.sum(np.power(theta[:,1:theta.shape[1]], 2))
    return np.sum(first - second) / len(X) + reg
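The effect of the penalty can be isolated by evaluating the same parameters with and without it. The sketch below uses a plain-array version of the regularized cost on hypothetical data; `cost_reg`, `lam`, and the demo values are names and numbers chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """Cross-entropy plus (lam / 2m) * sum of squared non-intercept parameters."""
    m = len(y)
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return data_term + penalty

# Hypothetical data and parameters
rng = np.random.default_rng(2)
X_demo = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y_demo = rng.integers(0, 2, size=30).astype(float)
theta_demo = np.array([5.0, 1.0, -2.0])

unpenalized = cost_reg(theta_demo, X_demo, y_demo, lam=0.0)
penalized = cost_reg(theta_demo, X_demo, y_demo, lam=1.0)
print(penalized - unpenalized)  # (1 / 60) * (1 + 4)
```

The difference depends only on the non-intercept parameters: the intercept of 5.0 contributes nothing, which matches the slice `theta[:,1:...]` in the function above.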

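Notice the “reg” term in this function. Note also the additional “learningRate” parameter. Despite its name, it plays the role of the regularization parameter $\lambda$, a hyperparameter that controls the strength of the regularization term. Now we need to add the regularized gradient function:

If we want to minimize the cost function with gradient descent, then because $\theta_0$ is not regularized, the update rule splits into two cases:

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$

$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$

For j = 1, 2, ..., n, the update rule above can be rearranged to obtain:

$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

def gradientReg(theta, X, y, learningRate):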
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)


    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)


    error = sigmoid(X * theta.T) - y


    for i in range(parameters):
        term = np.multiply(error, X[:,i])


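        if (i == 0):
            # the intercept term is not regularized
            grad[i] = np.sum(term) / len(X)
        else:
            # theta[0, i] is a scalar, so the assignment into grad[i] is unambiguous
            grad[i] = (np.sum(term) / len(X)) + ((learningRate / len(X)) * theta[0, i])

    return grad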
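The regularized gradient can also be written without the loop and verified against finite differences of the regularized cost. The following is a sketch on hypothetical data; `cost_reg`, `gradient_reg`, `lam`, and the demo arrays are introduced here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    return data_term + (lam / (2 * m)) * np.sum(theta[1:] ** 2)

def gradient_reg(theta, X, y, lam):
    """Unregularized gradient plus (lam / m) * theta_j for every j >= 1."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    penalty = (lam / m) * theta
    penalty[0] = 0.0  # the intercept is not regularized
    return grad + penalty

# Hypothetical data and parameters
rng = np.random.default_rng(3)
X_demo = np.column_stack([np.ones(40), rng.normal(size=(40, 3))])
y_demo = rng.integers(0, 2, size=40).astype(float)
theta_demo = np.array([0.5, -1.0, 0.25, 2.0])

# Central finite differences on cost_reg as an independent check
eps = 1e-6
numeric = np.array([
    (cost_reg(theta_demo + eps * e, X_demo, y_demo, 1.0)
     - cost_reg(theta_demo - eps * e, X_demo, y_demo, 1.0)) / (2 * eps)
    for e in np.eye(len(theta_demo))
])
analytic = gradient_reg(theta_demo, X_demo, y_demo, 1.0)
print(np.max(np.abs(analytic - numeric)))
```

Agreement between the two confirms that the cost and the gradient treat the intercept the same way, which the optimizer relies on.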

Initialize variables as we did in Part 1.

# set X and y (remember from above that we moved the label to column 0)
cols = data2.shape[1]
X2 = data2.iloc[:,1:cols]
y2 = data2.iloc[:,0:1]


# convert to numpy arrays and initialize the parameter array theta
X2 = np.array(X2.values)
y2 = np.array(y2.values)
theta2 = np.zeros(11)

Let’s set the regularization parameter (named learningRate in the code) to a reasonable initial value. We can adjust it later if necessary (i.e. if the penalty is too strong or not strong enough).

learningRate = 1

Now, let’s call the new regularized functions with theta initialized to 0 to make sure the calculations work.

costReg(theta2, X2, y2, learningRate)

0.6931471805599454

gradientReg(theta2, X2, y2, learningRate)

array([ 0.00847458,  0.01878809,  0.05034464,  0.01150133,  0.01835599,  0.00732393,  0.00819244,  0.03934862,  0.00223924,  0.01286005,  0.00309594])

Now we can use the same optimization function as in the first part to calculate the optimized result.

result2 = opt.fmin_tnc(func=costReg, x0=theta2, fprime=gradientReg, args=(X2, y2, learningRate))
result2

(array([  1.22702519e-04,   7.19894617e-05,   3.74156201e-04,   1.44256427e-04,   2.93165088e-05,   5.64160786e-05,   1.02826485e-04,   2.83150432e-04,   6.47297947e-07,   1.99697568e-04,   1.68479583e-05]), 96, 1)

Finally, we can use the prediction function from Part 1 to see how accurate our model is on the training data.

theta_min = np.matrix(result2[0])
predictions = predict(theta_min, X2)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y2)]
# percentage of correct predictions
accuracy = 100 * sum(correct) / len(correct)
print('accuracy = {0:.0f}%'.format(accuracy))

accuracy = 65%

That is 77 correct predictions out of the 118 training samples.

While we have implemented these algorithms ourselves, it is worth noting that we can also use a high-level Python library such as scikit-learn to solve this problem.

from sklearn import linear_model  # scikit-learn's linear model package
model = linear_model.LogisticRegression(penalty='l2', C=1.0)
model.fit(X2, y2.ravel())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

model.score(X2, y2)

0.66101694915254239

This is about 66 percent, essentially the same as the training accuracy of our own implementation. Keep in mind that this result was computed with the default parameters; we may need to tune some of them to improve on it.

Reference material

[1] Machine learning course: www.coursera.org/course/ml
[2] Huang Haiguang: https://github.com/fengdu78
[3] Github: github.com/fengdu78/Co…
[4] Job code: https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes/blob/master/code/ex2-logistic%20regression/ML-Exercise2.ipynb
[5] Markdown file: github.com/fengdu78/Co…
[6] PDF file: https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes/blob/master/ machine learning personal notes full version v5.4 – A4 print edition. PDF
