Notes of Andrew Ng’s Machine Learning —— (3) Multivariate Linear Regression
Multiple Features
Linear regression with multiple variables is also known as "multivariate linear regression".
Notations
We now introduce notation for equations where we can have any number of input variables (Multiple Features, i.e. Multivariate):
- $m$: the number of training examples.
- $n$: the number of features.
- $x^{(i)}$: the input (features) of the $i^{th}$ training example, an $n$-dimensional vector.
- $x^{(i)}_j$: the value of feature $j$ in the $i^{th}$ training example.
Hypothesis
The multivariable form of the hypothesis function accommodating these multiple features is as follows:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
In order to develop intuition about this function, we can think about $\theta_0$ as the basic price of a house, $\theta_1$ as the price per square meter, $\theta_2$ as the price per floor, etc. $x_1$ will be the number of square meters in the house, $x_2$ the number of floors, etc.
Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

$$h_\theta(x) = \theta^T x$$
This is a vectorization of our hypothesis function for one training example.
Note: For convenience, we assume $x^{(i)}_0 = 1 \textrm{ for } i \in 1, \dots, m$. This allows us to do matrix operations with $\theta$ and $x$, hence making the two vectors $\theta$ and $x^{(i)}$ match each other element-wise (that is, have the same number of elements: $n+1$).
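As a quick illustration, here is a minimal Octave sketch of this vectorized hypothesis, computed for all $m$ training examples at once (the house numbers are invented for the example):

```octave
% X is an m-by-(n+1) design matrix whose first column is all ones (x0 = 1);
% theta is an (n+1)-by-1 parameter vector.
X = [1 2104 5;           % each row: [x0, size, #bedrooms] (invented values)
     1 1416 3;
     1 1534 3];
theta = [80; 0.1; 25];   % invented parameters
h = X * theta;           % m-by-1 vector of predictions, one per example
```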
Gradient Descent For Multiple Variables
Let's restate the setup for the multiple-variable case:
Hypothesis: $h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j$.

Parameters: $\theta_0, \theta_1, \dots, \theta_n$.

Cost function: $J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$.
Or, in vectorized form:
Hypothesis: $h_\theta(x) = \theta^T x$.

Parameters: $\theta$.

Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2$.
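A minimal Octave sketch of this vectorized cost (the function name `computeCostMulti` follows the course's programming-exercise convention; the body here is just an illustration):

```octave
function J = computeCostMulti(X, y, theta)
% Vectorized cost: J(theta) = (1 / (2m)) * sum((X * theta - y) .^ 2)
% X: m-by-(n+1) design matrix, y: m-by-1 targets, theta: (n+1)-by-1.
m = length(y);
errors = X * theta - y;                  % m-by-1 residuals
J = (1 / (2 * m)) * (errors' * errors);  % sum of squared residuals over 2m
end
```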
The gradient descent will be like this:

repeat until convergence {

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_j \qquad \textrm{simultaneously for } j = 0, 1, \dots, n$$

}
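Here is a hedged Octave sketch of this loop in vectorized form (the signature mirrors the course's `gradientDescentMulti` exercise; a fixed `num_iters` stands in for the convergence test):

```octave
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
% Runs num_iters steps of gradient descent, updating all theta_j simultaneously.
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    % Vectorized simultaneous update of every theta_j:
    theta = theta - (alpha / m) * (X' * (X * theta - y));
    J_history(iter) = computeCostMulti(X, y, theta);  % track J for debugging
end
end
```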
The following image compares gradient descent with one variable to gradient descent with multiple variables:
Feature Scaling
We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

$$-1 \le x_{(i)} \le 1 \quad \textrm{or} \quad -0.5 \le x_{(i)} \le 0.5$$
These aren’t exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.
In practice, we often think it's OK for variables to range within $[-3, -\frac{1}{3}) \cup (+\frac{1}{3}, +3]$.
Two techniques to help with this are feature scaling and mean normalization.
Feature scaling
Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.
Mean normalization
Mean normalization involves subtracting the average value of an input variable from the values of that input variable, resulting in a new average value for the input variable of just zero.
Implement
We implement both of these techniques by adjusting our input values as shown in this formula:

$$x_i := \frac{x_i - \mu_i}{s_i}$$
- $\mu_i$ is the average of all the values for feature $(i)$;
- $s_i$ is the range of values ($\max - \min$), or $s_i$ could also be the standard deviation.
For example, if $x_i$ represents housing prices with a range of 100 to 2000 and a mean value of 1000, then $x_i := \dfrac{price - 1000}{1900}$.
In Octave
In Octave, the function `mean` gives us the average of the values for feature $(i)$, while the function `std` gives us the standard deviation of the values for feature $(i)$. So we can write the program like this:
```octave
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));

mu = mean(X);                 % row vector: mean of each column (feature)
sigma = std(X);               % row vector: std of each column (feature)
X_norm = (X - mu) ./ sigma;   % broadcast: subtract mean, divide by std

end
```
`featureNormalize(X)` returns a normalized version of `X` where the mean value of each feature is 0 and the standard deviation is 1. This is often a good preprocessing step to do when working with learning algorithms:
First, for each feature dimension, compute the mean of the feature and subtract it from the dataset, storing the mean value in `mu`. Next, compute the standard deviation of each feature and divide each feature by its standard deviation, storing the standard deviation in `sigma`.
Note that X is a matrix where each column is a feature and each row is an example. You need to perform the normalization separately for each feature.
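A quick usage sketch (training values invented; note that any new example must be normalized with the same `mu` and `sigma` before prediction):

```octave
X = [2104 5; 1416 3; 1534 3; 852 2];        % invented [size, #bedrooms] rows
[X_norm, mu, sigma] = featureNormalize(X);
% Normalize a new example with the SAME mu and sigma:
x_new = ([1650 3] - mu) ./ sigma;
```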
Learning Rate
Debugging gradient descent
Make a plot with the number of iterations on the x-axis. Now plot the cost function $J(\theta)$ over the number of iterations of gradient descent. If $J(\theta)$ ever increases, then you probably need to decrease the learning rate $\alpha$.
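A minimal Octave sketch of such a plot, assuming a design matrix `X` (bias column included), targets `y`, and the `gradientDescentMulti` sketch from above:

```octave
theta0 = zeros(size(X, 2), 1);
[theta, J_history] = gradientDescentMulti(X, y, theta0, 0.01, 400);
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
xlabel('Number of iterations');
ylabel('Cost J(\theta)');   % should decrease on every iteration if alpha is OK
```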
Automatic convergence test
Declare convergence if $J(\theta)$ decreases by less than $\epsilon$ in one iteration, where $\epsilon$ is some small value such as $10^{-3}$. However, in practice it's difficult to choose this threshold value.
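As an illustration only (the threshold and the surrounding loop variables are assumptions, not the course's code), such a test could sit inside the descent loop like this:

```octave
epsilon = 1e-3;   % assumed threshold; hard to choose well in practice
if iter > 1 && abs(J_history(iter - 1) - J_history(iter)) < epsilon
    fprintf('Converged after %d iterations.\n', iter);
    break;
end
```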
Making sure gradient descent is working correctly
It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
To summarize:
If $\alpha$ is too small: slow convergence.

If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and thus may not converge.
Implement
We should try different values of $\alpha$ to find a suitable one by drawing #iterations vs. $J(\theta)$ plots.

E.g. to choose $\alpha$, try:
…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
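A possible way to draw these plots in Octave, reusing the `gradientDescentMulti` sketch from above (the 50-iteration count is an arbitrary choice for a quick comparison):

```octave
alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];
num_iters = 50;
hold on;
for k = 1:numel(alphas)
    theta0 = zeros(size(X, 2), 1);
    [~, J_history] = gradientDescentMulti(X, y, theta0, alphas(k), num_iters);
    plot(1:num_iters, J_history);    % one J(theta) curve per learning rate
end
xlabel('Number of iterations');
ylabel('J(\theta)');
legend(arrayfun(@num2str, alphas, 'UniformOutput', false));
hold off;
```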
Features and Polynomial Regression
Combine Features
We can improve our features and the form of our hypothesis function in a couple different ways.
For example, we can combine multiple features into one, such as combining $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.
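In Octave this is a one-liner; here is a sketch using the course's frontage-and-depth house example (the values are invented):

```octave
x1 = [50; 30; 40];   % frontage of each house (invented values)
x2 = [20; 25; 30];   % depth of each house (invented values)
x3 = x1 .* x2;       % new combined feature: the lot area
```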
Polynomial Regression
To fit the data well, our hypothesis function may need to be non-linear. So we can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
For example, if our hypothesis function is $h_\theta(x) = \theta_0 + \theta_1 x_1$, then we can create additional features based on $x_1$, to get the quadratic function $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$, or the cubic function $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$. To make it a square root function, we could do: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$.
In the cubic version, we can create new features $x_2$ and $x_3$, where $x_2 = x_1^2$ and $x_3 = x_1^3$, then we can get a set of thetas via gradient descent for multiple variables.
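A minimal Octave sketch of building these polynomial features (the values of $x_1$ are made up):

```octave
x1 = [1; 2; 3; 4];                         % made-up original feature
X_poly = [ones(size(x1)) x1 x1.^2 x1.^3];  % columns: x0, x1, x2 = x1.^2, x3 = x1.^3
% X_poly can be fed to gradientDescentMulti as before, but note the
% feature-scaling warning below.
```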
⚠️ Note: if you choose your features this way, then feature scaling becomes very important. E.g. if $x_1$ has range $1 \sim 1000$, then the range of $x_1^2$ becomes $1 \sim 1000000$, and the range of $x_1^3$ becomes $1 \sim 1000000000$.