WeChat official account: Youerxiaohu | Author: Peter | Editor: Peter

Machine Learning (1): Supervised and Unsupervised Learning

Topics covered in the first week include:

  • Supervised and unsupervised learning
  • Univariate linear regression problem
  • Cost function
  • Gradient descent algorithm

Supervised Learning

Predicting housing prices in Boston using supervised learning (a regression problem)

  • In most cases, a straight line might fit the data
  • Sometimes a quadratic curve might fit better

What is the regression problem?

In supervised learning, we give the learning algorithm a data set, say a set of houses, in which each sample is labeled with the correct price, i.e. the price it actually sold for. The algorithm then learns to predict prices for new houses. Because we are estimating a continuous-valued output, this is a regression problem.

Using supervised learning to predict whether a breast tumor is benign or malignant (a classification problem)

  • The horizontal axis shows the size of the tumor
  • On the vertical axis, 1 is malignant, 0 is benign

What is a classification problem?

The machine learning task here is to estimate the probability that a tumor is malignant or benign; this is a classification problem.

In a classification problem we try to predict discrete output values, for example 0 or 1 for benign or malignant; in general, a classification problem may have more than two possible outputs.

For example, if there are three types of breast cancer, you might predict the discrete outputs 0, 1, 2, and 3: 0 is benign, 1 is type 1 breast cancer, 2 is type 2, and 3 is type 3. This is also a classification problem.

Applications

  • Spam filtering
  • Disease classification

Unsupervised Learning

  • In supervised learning, data is labeled
  • In unsupervised learning, the data is unlabeled; the main example discussed is the clustering algorithm

Main applications of unsupervised learning:

  • Understanding and applying genetics
  • Social network analysis
  • Organizing large computer clusters
  • Market segmentation
  • Classifying news events

Linear Regression with One Variable

Housing problem

The horizontal axis is the different size of the house, and the vertical axis is the sale price of the house.

Supervised learning: the correct answer is given for each sample. In supervised learning, we are given a data set called the training set.

Regression problem: predict a continuous output value based on previous data.

Classification problem: predict discrete output values. For example, examining a cancer tumor and determining whether it is benign or malignant is a 0/1 discrete output problem.

How supervised learning works

Learning process explanation:

  • The housing prices in the training set are fed to the learning algorithm
  • The learning algorithm outputs a function, conventionally denoted h
  • h stands for hypothesis and represents the function produced by the learning algorithm
  • h takes an input x and returns a value y, so h is a function mapping from x to y
  • A possible form of the hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$; with only one feature (input variable), this is called univariate linear regression (see the sketch below)
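As a rough illustration, here is a minimal Python/NumPy sketch of such a hypothesis function (the data values and parameter settings below are made up):

```python
import numpy as np

def hypothesis(x, theta0, theta1):
    """Univariate linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * np.asarray(x)

# Illustrative only: predict prices for a few house sizes
sizes = np.array([1.0, 2.0, 3.0])
print(hypothesis(sizes, 0.5, 1.2))  # -> [1.7 2.9 4.1]
```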

Cost Function

The cost function is also called the squared error function or the squared error cost function.

In linear regression we have a training set, where $m$ denotes the number of training samples, for example $m = 47$. Our hypothesis function, the one used to make predictions, has the linear form $h_\theta(x) = \theta_0 + \theta_1 x$.

Notation

  • $m$: number of training samples
  • $h_\theta(x) = \theta_0 + \theta_1 x$: the hypothesis function
  • $\theta_0$ and $\theta_1$: the two model parameters, namely the intercept on the y-axis and the slope of the line

The modeling error

The modeling setup and objective:

  1. The red points in the graph represent the real values $y_i$ in the data set
  2. $h(x)$ is the value predicted by the model
  3. Objective: choose the model parameters that minimize the sum of squared modeling errors


$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^{2}$$
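A minimal sketch of this cost function in NumPy (the toy data and parameter values are made up for illustration):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(y)
    predictions = theta0 + theta1 * x   # h_theta(x) for every sample
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 3.5])
print(compute_cost(x, y, 0.0, 1.0))  # cost for theta0 = 0, theta1 = 1
```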

Cost function intuitive explanation 1

In this case we simplify by assuming $\theta_0 = 0$. Then $h(x)$ is a function of $x$, the cost function $J(\theta_0, \theta_1)$ is a function of $\theta$, and the goal is to minimize the cost function.
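A small numeric sketch of this simplification: with $\theta_0 = 0$ fixed, sweeping $\theta_1$ over a few values shows that $J$ is smallest at the best-fitting slope and grows on either side (the toy data below is made up):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    m = len(y)
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])  # perfectly linear toy data, so the minimum is at theta1 = 1

for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(theta1, compute_cost(x, y, 0.0, theta1))
```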

Cost function intuitive explanation 2

This is explained with a contour plot. From the contour plot it can be seen that there is some point that minimizes the cost function, i.e. there exists a point in three-dimensional space that minimizes $J(\theta_0, \theta_1)$.
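A rough sketch of how such a contour/surface view can be computed: evaluate $J$ on a grid of $(\theta_0, \theta_1)$ values and locate the smallest entry, which is the point the contour plot would show at the bottom of the bowl (grid ranges and data are made up):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    m = len(y)
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 3.0])

theta0_vals = np.linspace(-2.0, 2.0, 41)
theta1_vals = np.linspace(-1.0, 3.0, 41)
# J[i, j] is the cost at (theta0_vals[i], theta1_vals[j])
J = np.array([[compute_cost(x, y, t0, t1) for t1 in theta1_vals] for t0 in theta0_vals])

i, j = np.unravel_index(J.argmin(), J.shape)
print("approximate minimizer:", theta0_vals[i], theta1_vals[j])
```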

Gradient Descent

Idea

Gradient descent is an algorithm used to minimize a function.

  • The idea behind it: start with a randomly chosen combination of parameters $(\theta_0, \theta_1, \ldots, \theta_n)$, compute the cost function, and then look for the next combination of parameters that reduces the value of the cost function the most.

  • Keep doing this until you reach a local minimum. Because not all parameter combinations have been tried, you cannot be sure that the resulting local minimum is the global minimum (see the sketch after this list).
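A minimal sketch of this idea on a toy one-dimensional cost function (the function, starting point, and learning rate below are made up):

```python
def J(theta):
    """Toy cost function with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def dJ(theta):
    """Derivative of the toy cost function."""
    return 2.0 * (theta - 3.0)

theta = 0.0           # start from an arbitrary initial guess
alpha = 0.1           # learning rate
for _ in range(100):  # repeatedly step in the direction that decreases J
    theta = theta - alpha * dJ(theta)

print(theta, J(theta))  # theta ends up close to 3, J close to 0
```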

Batch gradient descent

The algorithm formula is:

$$\theta_j := \theta_j - \alpha\,\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad (\text{for } j = 0 \text{ and } j = 1)$$

Feature: the two parameters need to be updated simultaneously (synchronously), as shown in the sketch below.
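A small sketch of what "updated synchronously" means: both new values are computed from the old parameters before either one is overwritten. The toy partial derivatives below are made up just to make the snippet runnable:

```python
# Illustrative only: toy partial derivatives standing in for dJ/dtheta0 and dJ/dtheta1
def dJ_dtheta0(t0, t1):
    return 2 * t0 + t1

def dJ_dtheta1(t0, t1):
    return t0 + 2 * t1

theta0, theta1, alpha = 1.0, -1.0, 0.1

# Correct: compute both updates from the *old* values, then assign together
temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
theta0, theta1 = temp0, temp1

# Incorrect (not synchronous): overwriting theta0 first would make the
# second update see a mixture of old and new parameter values
# theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
# theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
print(theta0, theta1)
```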

Intuitive explanation of gradient descent:

First look at the specific update formula: $$\theta_j := \theta_j - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}$$

Specific description: update $\theta$ so that $J(\theta)$ decreases in the direction of steepest descent, iterating continuously until a local minimum is finally reached.

Learning rate: $\alpha$ is the learning rate; it determines how large a step we take in the direction that reduces the cost function the most.

  • Learning rate too small: convergence is slow and it takes a long time to reach the lowest point
  • Learning rate too large: the update may overshoot the lowest point or even fail to converge (both cases are illustrated in the sketch below)
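A rough sketch of both failure modes on the toy cost $J(\theta) = \theta^2$ (the specific learning rates are made up):

```python
def run_gradient_descent(theta, alpha, iters):
    """Gradient descent on J(theta) = theta**2, whose derivative is 2*theta."""
    for _ in range(iters):
        theta = theta - alpha * 2 * theta
    return theta

print(run_gradient_descent(1.0, 0.001, 50))  # too small: barely moves toward the minimum at 0
print(run_gradient_descent(1.0, 0.1, 50))    # reasonable: converges close to 0
print(run_gradient_descent(1.0, 1.1, 50))    # too large: overshoots and diverges
```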

Gradient Descent for Linear Regression

Gradient descent is a very commonly used algorithm; it is not limited to linear regression or to the linear regression model with the squared-error cost function. Here we combine gradient descent with that cost function.

The gradient descent method is applied to the previous linear regression problem. The key is to calculate the derivative of the cost function, namely:


$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^{2}$$

For $j = 0$:

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right)$$

For $j = 1$:

$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left( \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)\cdot x^{(i)} \right)$$
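A NumPy sketch of these two partial derivatives (the helper name and toy data are illustrative):

```python
import numpy as np

def gradients(x, y, theta0, theta1):
    """Partial derivatives of J(theta0, theta1) for univariate linear regression."""
    m = len(y)
    err = theta0 + theta1 * x - y        # h_theta(x_i) - y_i for every sample
    d_theta0 = np.sum(err) / m           # dJ/dtheta0
    d_theta1 = np.sum(err * x) / m       # dJ/dtheta1
    return d_theta0, d_theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.5])
print(gradients(x, y, 0.0, 0.0))
```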

Then the algorithm can be rewritten as:

Repeat {


$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right)$$

$$\theta_1 := \theta_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left( \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)\cdot x^{(i)} \right)$$

}

This gradient descent algorithm is called batch gradient descent algorithm. Its main features are as follows:

  • At every step of gradient descent, we use all of the training samples
  • When computing the derivatives in gradient descent, we need to sum over all m training samples (a complete sketch follows below)
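Putting the pieces together, here is a minimal batch gradient descent loop for univariate linear regression (the data, learning rate, and iteration count below are made up):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, iters=1000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x.

    'Batch' means every iteration sums over all m training samples.
    """
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        err = theta0 + theta1 * x - y
        # simultaneous update of both parameters
        temp0 = theta0 - alpha * np.sum(err) / m
        temp1 = theta1 - alpha * np.sum(err * x) / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])    # roughly y = 2x
print(batch_gradient_descent(x, y))   # theta1 should come out close to 2
```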