Notes of Andrew Ng’s Machine Learning —— (6) Classification
Intro to Classification
A classification problem is something like this: given a patient with a tumor, we have to predict whether the tumor is malignant or benign. The expected output is a discrete value.
The classification problem is just like the regression problem, except that the values we want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem, in which $y$ can take only two values, 0 and 1.
For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and $y$ may be 1 if it is a piece of spam mail and 0 otherwise. Hence, $y \in \{0, 1\}$. 0 is also called the negative class and 1 the positive class, and they are sometimes also denoted by the symbols – and +. Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.
To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 to 1 and all less than 0.5 to 0. However, this method doesn't work well because classification is not actually a linear function. So what we actually do to solve classification problems is use an algorithm named logistic regression. Note that even though it is called regression, it actually does classification.
Logistic Regression
Hypothesis Representation
We could approach the classification problem ignoring the fact that $y$ is discrete-valued, and use our old linear regression algorithm to try to predict $y$ given $x$. However, it is easy to construct examples where this method performs very poorly.
Intuitively, it also doesn't make sense for $h_\theta(x)$ to take values larger than 1 or smaller than 0 when we know that $y \in \{0, 1\}$. To fix this, let's change the form of our hypotheses $h_\theta(x)$ to satisfy $0 \le h_\theta(x) \le 1$.
The Sigmoid Function, also called the Logistic Function, is this:

$$g(z) = \frac{1}{1 + e^{-z}}$$
Its graph is an S-shaped curve: $g(z)$ equals $0.5$ at $z = 0$, approaches $1$ as $z \to +\infty$, and approaches $0$ as $z \to -\infty$.
Our new form uses the Sigmoid Function:

$$h_\theta(x) = g(\theta^T x)$$
The logistic function $g(z)$ maps any real number to the $(0, 1)$ interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.
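As a small illustration (my own sketch, not part of the original notes), the logistic function can be written in Octave as a vectorized helper; the name `sigmoid` is just a convention, and it is reused in the later snippets:

% sigmoid.m -- a minimal sketch of the logistic function g(z) = 1 / (1 + e^(-z)).
% Works element-wise, so z may be a scalar, a vector, or a matrix.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

For example, sigmoid(0) returns 0.5, while sigmoid(10) is very close to 1 and sigmoid(-10) is very close to 0.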
We can also simply write $h_\theta(x)$ like this:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
$h_\theta(x)$ will output the probability that our output is 1. For example, $h_\theta(x) = 0.7$ gives us a probability of 70% that our output is 1 (and the probability that it is 0 is 30%):

$$h_\theta(x) = P(y = 1 \mid x; \theta) = 1 - P(y = 0 \mid x; \theta)$$
Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

$$h_\theta(x) \ge 0.5 \rightarrow y = 1$$

$$h_\theta(x) < 0.5 \rightarrow y = 0$$
The way our logistic function $g$ behaves is that when its input is $\ge 0$, its output is $\ge 0.5$:

$$g(z) \ge 0.5 \quad \text{when} \quad z \ge 0$$
In fact, we know that:

$$z = 0,\ e^{0} = 1 \Rightarrow g(z) = 1/2$$

$$z \to \infty,\ e^{-\infty} \to 0 \Rightarrow g(z) = 1$$

$$z \to -\infty,\ e^{\infty} \to \infty \Rightarrow g(z) = 0$$
So if our input to $g$ is $\theta^T x$, then that means:

$$h_\theta(x) = g(\theta^T x) \ge 0.5 \quad \text{when} \quad \theta^T x \ge 0$$
From these statements we can now say:

$$\theta^T x \ge 0 \Rightarrow y = 1$$

$$\theta^T x < 0 \Rightarrow y = 0$$
The decision boundary is the line that separates the area where $y = 0$ and where $y = 1$. It is created by our hypothesis function.
Example: suppose

$$\theta = \begin{bmatrix} 5 \\ -1 \\ 0 \end{bmatrix}$$

In this case, $y = 1$ if $5 + (-1)x_1 + 0 \cdot x_2 \ge 0$, i.e. if $x_1 \le 5$. Our boundary is a straight vertical line placed on the graph where $x_1 = 5$; everything to the left of that line denotes $y = 1$, while everything to the right denotes $y = 0$.
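As a quick check (my own illustration, reusing the `sigmoid` helper sketched earlier), this decision rule can be evaluated directly in Octave:

% A quick check of the boundary above; theta = [5; -1; 0] and x = [1; x1; x2]
% with the intercept term first. predict() returns 1 when h(x) >= 0.5.
theta = [5; -1; 0];
predict = @(x1, x2) sigmoid(theta' * [1; x1; x2]) >= 0.5;
predict(3, 7)   % x1 = 3 is left of the line x1 = 5, so this returns 1
predict(8, 7)   % x1 = 8 is right of the line, so this returns 0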
Non-linear decision boundaries:
The input to the sigmoid function $g(z)$ (e.g. $\theta^T X$) doesn't need to be linear; it could be a function that describes a circle (e.g. $z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$) or any shape that fits our data.
Example:
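For instance, taking $\theta_0 = -1$ and $\theta_1 = \theta_2 = 1$ in the circle form above (an illustrative choice of parameters), we get $z = -1 + x_1^2 + x_2^2$, so the classifier predicts $y = 1$ whenever $x_1^2 + x_2^2 \ge 1$, i.e. for points on or outside the unit circle centred at the origin; the decision boundary is the circle $x_1^2 + x_2^2 = 1$ itself.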
Logistic Regression Model
In this part, we will implement the logistic regression model.
Cost Function
If we use the same cost function that we use for linear regression, $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, for logistic regression, it will be non-convex (it looks wavy), which causes many local optima and makes it hard to find the global minimum.
So, what we actually need is a new logistic regression cost function, which guarantees that $J(\theta)$ is convex for logistic regression:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$$

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
Plotting $\mathrm{Cost}(h_\theta(x), y)$ against $h_\theta(x)$ for the two cases makes its behaviour clear:
If our correct answer ‘y’ is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
If our correct answer ‘y’ is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.
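To make this concrete (my own numbers, not from the original notes): if $y = 1$ and the hypothesis is confident and correct, say $h_\theta(x) = 0.99$, the cost is $-\log(0.99) \approx 0.01$; if it is confident and wrong, say $h_\theta(x) = 0.01$, the cost is $-\log(0.01) \approx 4.6$, and it keeps growing without bound as $h_\theta(x) \to 0$.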
Simplified Cost Function
We can simplify the $\mathrm{Cost}$ function by compressing the two conditional cases into one case:

$$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$$
In this definition, when $y = 1$, the term $-(1 - y)\cdot\log(1 - h_\theta(x))$ will be $0$; when $y = 0$, the term $-y\cdot\log(h_\theta(x))$ will be $0$. Obviously, this is equal to the previous definition but easier to implement.
Now, we can fully write out our entire cost function as follows:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$$
And a vectorized implementation is:

$$h = g(X\theta)$$

$$J(\theta) = \frac{1}{m}\left(-y^T\log(h) - (1 - y)^T\log(1 - h)\right)$$
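A minimal Octave sketch of this vectorized cost (`costFunctionJ` is my own helper name, assuming a design matrix `X` with a leading column of ones, a 0/1 label vector `y`, and the `sigmoid` helper from above):

% costFunctionJ.m -- a minimal sketch of the vectorized logistic cost above.
% theta is (n+1)-by-1, X is m-by-(n+1) with a leading column of ones,
% y is m-by-1 containing 0s and 1s; sigmoid() is the helper sketched earlier.
function J = costFunctionJ(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
end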
Gradient Descent
The general form of gradient descent is:

$$\text{Repeat}\ \{\quad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) \quad\}$$
Working out the derivative part using calculus, we get:

$$\text{Repeat}\ \{\quad \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \quad\}$$
Actually, this update rule is identical in form to the one we used in linear regression; the difference is that $h_\theta(x)$ is now the sigmoid of $\theta^T x$. And notice that we still have to simultaneously update all values in $\theta$.
Vectorized implementation:

$$\theta := \theta - \frac{\alpha}{m}X^T\left(g(X\theta) - \vec{y}\right)$$
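And a minimal Octave sketch of this vectorized update (illustrative names; `alpha` is the learning rate and `num_iters` a fixed iteration count):

% gradientDescent.m -- a minimal sketch of the vectorized update above.
% alpha is the learning rate, num_iters a fixed number of iterations;
% the whole theta vector is updated simultaneously in a single line.
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    theta = theta - (alpha / m) * X' * (sigmoid(X * theta) - y);
  end
end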
Advanced Optimization
There are many more sophisticated, faster ways to optimize $J(\theta)$ that can be used instead of gradient descent, such as Conjugate Gradient, BFGS and L-BFGS.
It is not suggested that we write these more sophisticated algorithms ourselves; we should use libraries instead, as they are already tested and highly optimized. Octave provides them.
We can use Octave's fminunc() optimization algorithm to do that.
To use this advanced optimization, we first need to provide a function that evaluates the following two values for a given input value $\theta$:

$$J(\theta)$$

$$\frac{\partial}{\partial\theta_j}J(\theta)$$
We can write a single function that returns both of these:
function [jVal, gradient] = costFunction(theta)
  jVal = <code to compute J(theta)>
  gradient = <code to compute derivative of J(theta)>
end
Then we are going to set up an options structure with optimset() and an initial theta as well, and send them to fminunc():
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
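For concreteness, here is one way the pieces could fit together (a sketch only: giving costFunction extra X and y arguments and wrapping it in an anonymous function are my own additions, not part of the original snippet; `sigmoid` is the helper from earlier):

% costFunction.m -- a filled-in sketch that also takes the data X and y,
% and returns both the logistic cost and its gradient.
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
  gradient = (1 / m) * X' * (h - y);
end

% In the calling script, wrap it so fminunc sees a function of theta only:
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);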
Multiclass Classification
Now we will approach the classification of data when we have more than two categories. Instead of $y = \{0, 1\}$, we will expand our definition so that $y = \{0, 1, \dots, n\}$.
Since $y = \{0, 1, \dots, n\}$, we divide our problem into $n + 1$ ($+1$ because the index starts at $0$) binary classification problems; in each one, we predict the probability that $y$ is a member of one of our classes.
We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
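In symbols, one hypothesis is trained per class and the most confident one wins:

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \qquad i \in \{0, 1, \dots, n\}$$

$$\text{prediction} = \max_i\left(h_\theta^{(i)}(x)\right)$$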
The following image shows how one could classify 3 classes:
To summarize:
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$. To make a prediction on a new $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
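As a closing sketch (my own illustration), one-vs-all prediction could look like this in Octave, assuming `all_theta` stores one trained $\theta$ per row and `sigmoid` is the helper from earlier:

% predictOneVsAll.m -- a minimal sketch of one-vs-all prediction.
% all_theta is (n+1)-by-(features+1): one trained theta per class, classes 0..n.
% X is m-by-(features+1) with a leading column of ones.
function p = predictOneVsAll(all_theta, X)
  probs = sigmoid(X * all_theta');   % m-by-(n+1) matrix of class probabilities
  [~, idx] = max(probs, [], 2);      % column index of the most confident classifier
  p = idx - 1;                       % shift from 1-based columns to labels 0..n
end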