Linear regression
Regression is a class of methods for modeling the relationship between one or more independent variables and a dependent variable. In the natural and social sciences, regression is often used to represent the relationship between inputs and outputs.
Most tasks in machine learning are related to prediction. Regression comes into play whenever we want to predict a numerical value. Common examples include predicting prices (of houses, stocks, etc.), predicting lengths of stay (for hospital inpatients, etc.), and forecasting demand (for retail sales, etc.). But not every prediction problem is a regression problem. In later chapters we will cover classification problems, where the goal is to predict which of a set of categories a data point belongs to.
Basic elements of linear regression
Linear regression, which dates back to the early 19th century, is the simplest and most popular of the standard tools for regression. Linear regression rests on a few simple assumptions. First, the relationship between the independent variables $\mathbf{x}$ and the dependent variable $y$ is assumed to be linear, i.e., $y$ can be expressed as a weighted sum of the elements of $\mathbf{x}$, allowing for some noise in the observations. Second, we assume that any noise is well behaved, e.g., that it follows a normal distribution.
To explain linear regression, let's take a practical example: we want to estimate the price of a house (in dollars) based on its area (in square feet) and age (in years). To develop a model that predicts house prices, we need to collect a real dataset, including each house's sale price, area, and age. In machine learning terminology, this dataset is called a training dataset or training set. Each row of data (e.g., the data corresponding to one housing transaction) is called a sample, also known as a data point or data instance. The target we are trying to predict (here, the price of a house) is called a label or target. The independent variables (area and age) on which the prediction is based are called features or covariates.
In general, we use $n$ to denote the number of samples in the dataset. For the sample with index $i$, its input is expressed as $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top$, and the corresponding label is $y^{(i)}$.
1. Linear model
The linear assumption is that the target (house price) can be expressed as a weighted sum of features (area and age) as follows:
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$
:eqlabel:`eq_price-area`
$w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ in :eqref:`eq_price-area` are called weights, which determine the influence of each feature on our prediction. $b$ is called the bias (also offset or intercept). The bias is the value the prediction takes when all features are 0. Even though no real house will have an area of 0 or be exactly 0 years old, we still need the bias term: without it, the expressiveness of our model would be limited. Strictly speaking, :eqref:`eq_price-area` is an affine transformation of the input features, characterized by a linear transformation of the features via the weights combined with a translation via the bias term.
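To make this concrete, here is a tiny worked example of the affine model in Python; the parameter values below are made up purely for illustration:

```python
# Hypothetical parameters: w_area = 100 (dollars per square foot),
# w_age = -1000 (dollars per year of age), and bias b = 50000.
w_area, w_age, b = 100.0, -1000.0, 50000.0

area, age = 1500.0, 10.0  # a 1500 sq ft, 10-year-old house
price = w_area * area + w_age * age + b
print(price)  # 100 * 1500 - 1000 * 10 + 50000 = 190000.0
```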
Given a dataset, our goal is to find the model weights $\mathbf{w}$ and bias $b$ such that the model's predictions roughly match the true prices in the data. The predicted output is determined from the input features by the affine transformation of the linear model, which in turn is determined by the chosen weights and bias.
In machine learning, we usually work with high-dimensional datasets, where it is more convenient to use linear-algebra notation. When our input contains $d$ features, we express the prediction $\hat{y}$ (the "hat" symbol denotes an estimate of $y$) as:

$$\hat{y} = w_1 x_1 + \ldots + w_d x_d + b.$$
By collecting all features into the vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into the vector $\mathbf{w} \in \mathbb{R}^d$, we can express the model concisely in dot-product form:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$
:eqlabel:`eq_linreg-y`
In :eqref:`eq_linreg-y`, the vector $\mathbf{x}$ corresponds to the features of a single data sample. The matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ conveniently refers to all $n$ samples of our entire dataset, where each row of $\mathbf{X}$ is a sample and each column is a feature.
For the feature collection $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via a matrix-vector product:

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b.$$
The summation in this process uses the broadcasting mechanism (described in detail in :numref:`subsec_broadcasting`). Given the training features $\mathbf{X}$ and the corresponding known labels $\mathbf{y}$, the goal of linear regression is to find weights $\mathbf{w}$ and bias $b$ such that, given features of new samples drawn from the same distribution as $\mathbf{X}$, the error of the predicted labels on the new samples is as small as possible.
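As a minimal sketch of this vectorized prediction (with illustrative shapes and random data, not the housing dataset), using paddle:

```python
import paddle

n, d = 4, 2                  # illustrative: 4 samples, 2 features
X = paddle.randn([n, d])     # feature matrix, one sample per row
w = paddle.randn([d])        # weight vector
b = paddle.to_tensor(1.0)    # scalar bias

# matmul(X, w) has shape [n]; adding the scalar b broadcasts it
# across all n predictions.
y_hat = paddle.matmul(X, w) + b
print(y_hat.shape)  # [4]
```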
Although we believe that the best model for predicting $y$ given $\mathbf{x}$ is linear, it is hard to find a real dataset of $n$ samples in which $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for all $1 \leq i \leq n$. Whatever instruments we use to observe the features $\mathbf{X}$ and labels $\mathbf{y}$, small observation errors are likely. Therefore, even if we are confident that the underlying relationship between features and labels is linear, we add a noise term to account for observation errors.
Before we can begin searching for the best model parameters $\mathbf{w}$ and $b$, we need two more things: (1) a way to measure the quality of a model; (2) a method for updating the model to improve the quality of its predictions.
2. Loss function
Before we start thinking about how to fit a model to data, we need to determine a measure of fit. The loss function quantifies the gap between the target's true value and the predicted value. We usually choose a non-negative number as the loss; smaller values mean smaller loss, and a perfect prediction has a loss of 0. The most commonly used loss function in regression problems is the squared error. When the prediction for sample $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is defined as:

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$
The constant $\frac{1}{2}$ makes no essential difference but is slightly simpler in form (its coefficient cancels to 1 when we differentiate the loss). Since the training dataset is not under our control, the empirical error is a function of the model parameters only. To illustrate, consider the regression problem for the one-dimensional case plotted in :numref:`fig_fit_linreg`.
Because of the quadratic term in the squared error, a larger difference between the estimate $\hat{y}^{(i)}$ and the observation $y^{(i)}$ leads to a larger loss. To measure the quality of the model over the entire dataset, we compute the mean of the losses (equivalently, the sum) over the training set of $n$ samples:

$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
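As a minimal sketch, the squared loss and its mean over a batch can be computed as follows (the values below are synthetic, purely for illustration):

```python
import paddle

def squared_loss(y_hat, y):
    """Per-sample squared loss l(i) = (y_hat - y)**2 / 2."""
    return (y_hat - y) ** 2 / 2

y_hat = paddle.to_tensor([2.5, 0.0, 2.0])  # predictions
y = paddle.to_tensor([3.0, -0.5, 2.0])     # true labels
print(squared_loss(y_hat, y).mean())       # mean loss L(w, b) over the batch
```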
When training the model, we want to find a set of parameters $(\mathbf{w}^*, b^*)$ that minimizes the total loss over all training samples:

$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\; L(\mathbf{w}, b).$$
3. Analytical solution
Linear regression happens to be a very simple optimization problem. Unlike most other models we cover in this book, its solution can be expressed in a simple formula called the analytical solution. First, we subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column of all ones to the feature matrix. The prediction problem is then to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. There is only one critical point on the loss surface, and it corresponds to the loss minimum over the whole domain. Setting the derivative of the loss with respect to $\mathbf{w}$ to 0 yields the analytical solution:

$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.$$
Simple problems like linear regression have analytical solutions, but not every problem does. Analytical solutions lend themselves to nice mathematical analysis, but their requirements are so restrictive that they find little use in deep learning.
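As a minimal sketch of the analytical solution on synthetic data (the true parameters below are made up), we can solve the normal equations with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w, true_b = np.array([2.0, -3.4]), 4.2
y = X @ true_w + true_b + rng.normal(scale=0.01, size=100)

# Append a column of all ones so that w absorbs the bias b.
X1 = np.hstack([X, np.ones((100, 1))])

# Solve (X1^T X1) w* = X1^T y, i.e. w* = (X1^T X1)^{-1} X1^T y.
w_star = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(w_star)  # approximately [2.0, -3.4, 4.2]
```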
4. Stochastic gradient descent
Even when we can’t get an analytical solution, we can still train the model effectively. For many tasks, models that are hard to optimize work better. So it’s important to figure out how to train these hard-to-optimize models.
In this book, we use a method called gradient descent, which can be used to optimize almost all deep learning models. It reduces the error by iteratively updating the parameters in the direction that decreases the loss function.
The simplest use of gradient descent is to compute the derivative of the loss function (the mean loss over all samples in the dataset) with respect to the model parameters; this derivative is also called the gradient. In practice, however, this can be very slow: we must traverse the entire dataset before every parameter update. Therefore, we typically sample a random minibatch of examples each time we need to compute an update, a variant called minibatch stochastic gradient descent.
In each iteration, we first randomly sample a minibatch $\mathcal{B}$ consisting of a fixed number of training samples. We then compute the derivative (gradient) of the minibatch's average loss with respect to the model parameters. Finally, we multiply the gradient by a predetermined positive number $\eta$ and subtract it from the current parameter values.
We express this update process with the following formula ($\partial$ denotes the partial derivative):

$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b).$$
To summarize, the algorithm proceeds as follows: (1) initialize the model parameters, e.g., randomly; (2) repeatedly sample random minibatches from the dataset and update the parameters in the direction of the negative gradient. For squared losses and affine transformations, we can write this out explicitly:

$$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
:eqlabel:`eq_linreg_batch_update`
$\mathbf{w}$ and $\mathbf{x}$ in :eqref:`eq_linreg_batch_update` are vectors. Here the more elegant vector notation is more readable than coefficient notation (e.g., $w_1, w_2, \ldots, w_d$). $|\mathcal{B}|$ denotes the number of samples in each minibatch, also known as the batch size, and $\eta$ denotes the learning rate. The values of the batch size and learning rate are usually specified manually in advance rather than obtained through model training. Such parameters, which are tunable but not updated during training, are called hyperparameters. Hyperparameter tuning is the process of choosing them, and hyperparameters are typically adjusted based on the results of training iterations as evaluated on a separate validation dataset.
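Putting the pieces together, here is a minimal NumPy sketch of minibatch stochastic gradient descent for linear regression on synthetic data (the true parameters, learning rate, and batch size below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.4]) + 4.2 + rng.normal(scale=0.01, size=1000)

w, b = np.zeros(2), 0.0    # (1) initialize the model parameters
lr, batch_size = 0.03, 10  # hyperparameters, specified by hand

for epoch in range(3):
    # (2) sample random minibatches and step along the negative gradient
    for idx in rng.permutation(1000).reshape(-1, batch_size):
        err = X[idx] @ w + b - y[idx]          # shape [batch_size]
        w -= lr / batch_size * (X[idx].T @ err)
        b -= lr / batch_size * err.sum()

print(w, b)  # should approach [2.0, -3.4] and 4.2
```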
After training for a predetermined number of iterations (or until some other stopping condition is met), we record the estimates of the model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$. However, even if our function is truly linear and noiseless, these estimates do not exactly minimize the loss function: the algorithm converges toward the minimum slowly but cannot reach it precisely in a finite number of steps.
Linear regression happens to be a learning problem with only one minimum over the entire domain. But for models as complicated as deep neural networks, the loss surface usually contains many minima. Deep learning practitioners seldom struggle to find parameters that minimize the loss on the training set. In fact, the harder task is to find parameters that achieve low loss on data we have never seen before, a challenge called generalization.
5. Making predictions with the model
Given the "learned" linear regression model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can now estimate the price of a new house (not contained in the training data) from its area $x_1$ and age $x_2$. The process of estimating targets given features is called prediction or inference.
This book will try to stick with the word prediction. Although inference has become standard terminology in deep learning, it is something of a misnomer: in statistics, inference refers more to estimating parameters from a dataset. This misuse of terminology often causes misunderstandings when deep learning practitioners talk with statisticians.
Vectorization for speed
When training our models, we often want to process entire minibatches of samples at once. To do this efficiently, we need to vectorize the computation and leverage fast linear algebra libraries rather than writing costly for loops in Python.
```python
%matplotlib inline
import math
import time

import numpy as np
import paddle
```
To illustrate why vectorization matters so much, let's consider two ways of adding vectors. We instantiate two 10000-dimensional vectors containing all ones. In one approach we loop over the vectors with a Python for loop; in the other we rely on a single call to +.
```python
n = 10000
a = paddle.ones([n])
b = paddle.ones([n])
```
Since we will benchmark running time frequently throughout this book, let's define a timer:
```python
class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of times."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated times."""
        return np.array(self.times).cumsum().tolist()
```
Now we can benchmark the workload.
First, we use a for loop to add the vectors one element at a time.
```python
c = paddle.zeros([n])
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
f'{timer.stop():.5f} sec'
```
```
'0.85271 sec'
```
Alternatively, we use the overloaded + operator to compute the elementwise sum.
```python
timer.start()
d = a + b
f'{timer.stop():.5f} sec'
```
```
'0.01487 sec'
```
It turns out that the second method is dramatically faster than the first. Vectorizing code often yields order-of-magnitude speedups. Moreover, pushing more of the math to the library means we write fewer calculations ourselves, reducing the potential for errors.
Normal distribution and squared loss
Next, we interpret the squared loss objective through an assumption about the noise distribution.
The normal distribution is closely related to linear regression. The normal distribution, also known as the Gaussian distribution, was first applied to astronomy by the German mathematician Gauss. Simply put, if a random variable $x$ has mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$), its normal probability density function is:

$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right).$$
Next [we define a Python function to calculate the normal distribution].
```python
def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)
```
```python
from matplotlib import pyplot as plt
from IPython import display

def use_svg_display():
    """Use the svg format to display a plot in Jupyter. Defined in :numref:`sec_calculus`"""
    display.set_matplotlib_formats('svg')

def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib. Defined in :numref:`sec_calculus`"""
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()

def set_figsize(figsize=(3.5, 2.5)):
    """Set the figure size for matplotlib. Defined in :numref:`sec_calculus`"""
    use_svg_display()
    plt.rcParams['figure.figsize'] = figsize
```
```python
def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points. Defined in :numref:`sec_calculus`"""
    if legend is None:
        legend = []

    set_figsize(figsize)
    axes = axes if axes else plt.gca()

    # Return True if `X` (tensor or list) has 1 axis
    def has_one_axis(X):
        return (hasattr(X, "ndim") and X.ndim == 1 or
                isinstance(X, list) and not hasattr(X[0], "__len__"))

    if has_one_axis(X):
        X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        if len(x):
            axes.plot(x, y, fmt)
        else:
            axes.plot(y, fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
```
We can now visualize the normal distributions.
```python
# Use NumPy again for visualization
x = np.arange(-7, 7, 0.01)

# Mean and standard deviation pairs
params = [(0, 1), (0, 2), (3, 1)]
plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
     ylabel='p(x)', figsize=(4.5, 2.5),
     legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
```
As we can see, changing the mean shifts the curve along the $x$ axis, and increasing the variance spreads the distribution out and lowers its peak.
One reason the mean squared error (MSE) loss can be used for linear regression is that we assume the observations contain noise that follows a normal distribution. The noise model is:

$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Therefore, we can now write the likelihood of observing a particular $y$ for a given $\mathbf{x}$:

$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} \left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2\right).$$
Now, according to the maximum likelihood principle, the optimal values of the parameters $\mathbf{w}$ and $b$ are those that maximize the likelihood of the entire dataset:

$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p\left(y^{(i)} \mid \mathbf{x}^{(i)}\right).$$
Estimators chosen according to this principle are called maximum likelihood estimators. Although maximizing a product of many exponential functions may look difficult, we can simplify it, without changing the objective, by maximizing the logarithm of the likelihood. For historical reasons, optimization problems are usually stated as minimization rather than maximization, so we instead minimize the negative log-likelihood $-\log P(\mathbf{y} \mid \mathbf{X})$:

$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^n \frac{1}{2} \log\left(2 \pi \sigma^2\right) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2.$$
If we assume that $\sigma$ is some fixed constant, we can ignore the first term because it does not depend on $\mathbf{w}$ or $b$. The second term is identical to the mean squared error introduced earlier, up to the constant $\frac{1}{\sigma^2}$. Fortunately, the solution does not depend on $\sigma$. Therefore, under the assumption of Gaussian noise, minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model.
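A small numerical check of this equivalence: up to the constant term and the $\frac{1}{\sigma^2}$ factor, the negative log-likelihood under Gaussian noise equals the summed squared loss (the residuals and $\sigma$ below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
err = rng.normal(size=5)  # residuals y - (w^T x + b)
sigma = 0.7

nll = np.sum(0.5 * np.log(2 * np.pi * sigma**2) + err**2 / (2 * sigma**2))
const = 5 * 0.5 * np.log(2 * np.pi * sigma**2)
sq_loss_sum = np.sum(err**2 / 2)

print(np.isclose(nll, const + sq_loss_sum / sigma**2))  # True
```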
From linear regression to deep networks
So far, we’ve only talked about linear models. Although neural networks cover more and richer models, we can still describe linear models in the same way that we describe neural networks, thus treating linear models as neural networks. First, we rewrite the model with a “layer” notation.
1. Neural network diagram
Practitioners of deep learning like to draw diagrams to visualize what is happening in their models. In :numref:`fig_single_neuron`, we depict the linear regression model as a neural network. Note that the diagram shows only the connectivity pattern, i.e., how each input connects to the output, and hides the values of the weights and biases.
In the neural network shown in :numref:`fig_single_neuron`, the inputs are $x_1, \ldots, x_d$, so the number of inputs (or feature dimension) in the input layer is $d$. The output of the network is $o_1$, so the number of outputs in the output layer is 1. Note that the input values are all given and there is only a single computational neuron. Since the model's focus is on where computation takes place, we usually do not count the input layer when counting layers; in other words, the network in :numref:`fig_single_neuron` has 1 layer. We can think of the linear regression model as a neural network consisting of a single artificial neuron, or as a single-layer neural network.
For linear regression, every input is connected to every output (here there is only one output); we call this transformation (the output layer in :numref:`fig_single_neuron`) a fully connected layer or dense layer. We will discuss networks composed of such layers in detail in the next chapter.
2. Biology
Since linear regression (invented in 1795) predates computational neuroscience, it may seem anachronistic to describe it as a neural network. Why did the cyberneticists and neurobiologists Warren McCulloch and Walter Pitts use linear models as a starting point when they began developing models of artificial neurons? Consider :numref:`fig_Neuron`, a picture of a biological neuron consisting of dendrites (input terminals), the nucleus (CPU), the axon (output wire), and the axon terminals (output terminals), which connect to other neurons via synapses.
Dendrites receive information $x_i$ from other neurons (or environmental sensors such as the retina). That information is scaled by synaptic weights $w_i$, which determine the effect of each input (e.g., activation or inhibition via the product $x_i w_i$). The weighted inputs from multiple sources are aggregated in the nucleus as a weighted sum $y = \sum_i x_i w_i + b$, and this information is then sent into the axon $y$ for further processing, usually after some nonlinear processing via $\sigma(y)$. From there, it either reaches its destination (e.g., a muscle) or enters another neuron via its dendrites.
Of course, the high-level idea that many such units can be combined, given the right connectivity and the right learning algorithm, to produce behavior far more interesting and complex than any single neuron alone owes much to our study of real biological neural systems.
Most deep learning research today draws little direct inspiration from neuroscience. As Stuart Russell and Peter Norvig put it in their classic textbook Artificial Intelligence: A Modern Approach :cite:`Russell.Norvig.2016`: although airplanes may have been inspired by birds, ornithology has not been the primary driver of aviation innovation for centuries. Likewise, inspiration in deep learning today comes in equal or greater measure from mathematics, statistics, and computer science.
Summary
- The key elements in machine learning models are training data, loss functions, optimization algorithms, and the model itself.
- Vectorization makes the math simpler and runs faster.
- Minimizing the objective function is equivalent to performing maximum likelihood estimation.
- A linear regression model is also a simple neural network.