Preface

The data is stored in multidimensional NumPy arrays, also known as tensors. In general, all current machine learning systems use tensors as their basic data structure. Tensors are so fundamental to the field that Google’s TensorFlow is named after them. So what is a tensor? At its core, a tensor is a container for data. The data is almost always numeric, so a tensor is a container for numbers. You are probably familiar with matrices, which are 2D tensors. Tensors generalize matrices to an arbitrary number of dimensions (in the context of tensors, a dimension is usually called an axis).

Tensors

Scalar (0D tensor)

A tensor that contains only one number is called a scalar (also a scalar tensor, zero-dimensional tensor, or 0D tensor). In NumPy, a float32 or float64 number is a scalar tensor (or scalar array). You can use the ndim attribute to display the number of axes of a NumPy tensor. A scalar tensor has zero axes (ndim == 0). The number of axes of a tensor is also called its rank. Here is a NumPy scalar.

import numpy as np 
p = np.array(12)
print(p.ndim)
#output:0

Vector (1D tensor)

An array of numbers is called a vector, or 1D tensor. A 1D tensor has exactly one axis. Here is a NumPy vector.

import numpy as np
x = np.array([1, 2, 3, 4, 5])
print(x.ndim)
# output: 1

This vector has five entries, so it is called a 5-dimensional vector. Don’t confuse a 5D vector with a 5D tensor! A 5D vector has only one axis and has five dimensions along that axis, whereas a 5D tensor has five axes (and may have any number of dimensions along each axis). Dimensionality can denote either the number of entries along a specific axis (as in a 5D vector) or the number of axes in a tensor (as in a 5D tensor), which can be confusing at times. In the latter case, it is technically more correct to talk about a tensor of order 5 (the order of a tensor being its number of axes), but the ambiguous notation 5D tensor is more common.

Matrix (2D tensor)

An array of vectors is a matrix, or 2D tensor. A matrix has two axes (often referred to as rows and columns). You can visualize a matrix as a rectangular grid of numbers. Here is a NumPy matrix.

import numpy as np
x = np.array([[5, 78, 2, 34, 0],
              [6, 79, 3, 35, 1],
              [7, 80, 4, 36, 2]])
print(x.ndim)
# output: 2

3D and higher dimensional tensors

Packing several such matrices into a new array yields a 3D tensor, which you can visualize as a cube of numbers. Here is a NumPy 3D tensor.

import numpy as np
x = np.array([[[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]]])
print(x.ndim)
# output: 3

Packing 3D tensors into an array creates a 4D tensor, and so on. In deep learning, you will generally work with tensors from 0D to 4D, although you may encounter 5D tensors when processing video data.

Key attributes

Tensors are defined by three key properties.

  • Number of axes (order, also called rank). For example, a 3D tensor has three axes and a matrix has two. Python libraries such as NumPy call this the tensor’s ndim.
  • Shape. This is a tuple of integers giving the number of dimensions (entries) the tensor has along each axis. For example, the matrix example above has shape (3, 5), and the 3D tensor example has shape (3, 3, 5). A vector has a shape with a single element, such as (5,), whereas a scalar has an empty shape, ().
  • Data type (usually called dtype in Python libraries). This is the type of the data contained in the tensor. For example, a tensor’s type can be float32, uint8, float64, and so on. On rare occasions you may see a char tensor. Note that variable-length string tensors are generally not available in NumPy (or in most other libraries), because tensors live in preallocated, contiguous memory segments, and variable-length strings cannot be stored that way.
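The three attributes above can be inspected directly on any NumPy array. The following is a minimal sketch; the variable names (scalar, vector, matrix) are illustrative and not part of the original text.

```python
import numpy as np

# Illustrative tensors of rank 0, 1, and 2.
scalar = np.array(12)
vector = np.array([1, 2, 3, 4, 5])
matrix = np.zeros((3, 5), dtype=np.float32)

# Print the number of axes, the shape, and the data type of each tensor.
for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, t.ndim, t.shape, t.dtype)
```

The default integer dtype depends on the platform, so the printed dtype of the first two arrays may vary.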

Related concepts

  • Selecting specific elements of a tensor is called tensor slicing.

  • In general, the first axis of all data tensors in deep learning (axis 0, because the index starts at 0) is the sample axis (sometimes called the sample dimension).

  • Deep learning models do not process the entire data set at once, but break the data into small batches. For such batch tensors, the first axis (axis 0) is called the batch axis or batch dimension. You’ll come across this term a lot when working with Keras and other deep learning libraries.

  • Real-world data tensors

    • Vector data: 2D tensors of shape (samples, features).
    • Time series data or sequence data: 3D tensors of shape (samples, timesteps, features).
    • Images: 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width).
    • Video: 5D tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width).
  • Vector data: in such a dataset, each data point is encoded as a vector, so a batch of data is encoded as a 2D tensor (that is, an array of vectors), where the first axis is the samples axis and the second axis is the features axis.
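Slicing and batching along the sample axis can be sketched as follows. The dataset here is a hypothetical array of 128 samples of 28 x 28 values, chosen purely for illustration; batch_size and n are likewise assumed names.

```python
import numpy as np

# Hypothetical dataset: 128 samples, each a 28 x 28 grid (illustrative shape).
data = np.zeros((128, 28, 28), dtype=np.float32)

# Tensor slicing: select samples 10 to 19 along axis 0; ":" keeps an axis whole.
my_slice = data[10:20, :, :]

# Batching: cut the data into chunks of batch_size along the sample axis.
batch_size = 32
first_batch = data[:batch_size]
n = 3
nth_batch = data[batch_size * n : batch_size * (n + 1)]

print(my_slice.shape, first_batch.shape, nth_batch.shape)
```

In each batch, axis 0 plays the role of the batch axis described above.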

Tensor operations

Element-wise operations

The relu operation and addition are element-wise operations: they are applied independently to each entry of the tensors involved, which makes them highly amenable to massively parallel implementations. To write a naive Python implementation of an element-wise operation, you can use a for loop. The following code is a naive implementation of an element-wise relu.

def naive_relu(x):
    assert len(x.shape) == 2  # x is a 2D NumPy tensor

    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)
    return x

Addition can be implemented in the same way.

def naive_add(x, y):
    assert len(x.shape) == 2  # x and y are 2D NumPy tensors
    assert x.shape == y.shape

    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[i, j]
    return x
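In practice, NumPy provides these element-wise operations as built-in, vectorized functions, so the explicit loops are not needed. A minimal sketch with small illustrative inputs (the values of x and y are assumptions made for the example):

```python
import numpy as np

x = np.array([[-1.0, 2.0], [3.0, -4.0]])
y = np.array([[10.0, 20.0], [30.0, 40.0]])

relu_x = np.maximum(x, 0.0)  # element-wise relu
sum_xy = x + y               # element-wise addition

print(relu_x)
print(sum_xy)
```

Both calls give the same results as the loop-based versions above, without a Python-level loop.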

Broadcasting

Above, we added two tensors of the same shape. What happens when the two tensors being added have different shapes, for example a 2D tensor and a vector? When possible, and if there is no ambiguity, the smaller tensor is broadcast to match the shape of the larger tensor. Broadcasting consists of two steps.

  • Axes (called broadcast axes) are added to the smaller tensor so that its ndim matches that of the larger tensor.
  • The smaller tensor is repeated along these new axes so that it matches the full shape of the larger tensor.

Let’s look at a concrete example. Suppose x has shape (32, 10) and y has shape (10,). First, we add an empty first axis to y, so that its shape becomes (1, 10). Then we repeat y 32 times along this new axis, so that we end up with a tensor Y of shape (32, 10), where Y[i, :] == y for i in range(0, 32). At this point we can add x and Y, because they have the same shape.

In actual implementations, no new 2D tensor is created, because that would be terribly inefficient; the repetition is purely virtual. Here is what a naive implementation would look like.

def naive_add_matrix_and_vector(x, y):
    assert len(x.shape) == 2  # x is a 2D NumPy tensor
    assert len(y.shape) == 1  # y is a NumPy vector
    assert x.shape[1] == y.shape[0]

    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[j]
    return x
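The two broadcasting steps can also be reproduced explicitly in NumPy and compared with the implicit result of x + y. This is a sketch with illustrative data; y_expanded and y_repeated are assumed names.

```python
import numpy as np

x = np.arange(320, dtype=np.float64).reshape(32, 10)
y = np.arange(10, dtype=np.float64)

# Step 1: add a broadcast axis so that y has the same ndim as x.
y_expanded = np.expand_dims(y, axis=0)             # shape (1, 10)

# Step 2: repeat y along the new axis.
# np.broadcast_to returns a read-only view, so no data is copied.
y_repeated = np.broadcast_to(y_expanded, x.shape)  # shape (32, 10)

# NumPy applies both steps implicitly when the shapes are compatible.
z = x + y

print(y_expanded.shape, y_repeated.shape, z.shape)
```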

Tensor dot product

The dot operation, also called a tensor product (not to be confused with an element-wise product), is the most common and most useful tensor operation. Unlike element-wise operations, it combines entries of the input tensors. In NumPy, Keras, Theano, and TensorFlow, the element-wise product is written with the * operator. The dot operation uses a different syntax in TensorFlow, but in both NumPy and Keras it is done with the standard dot operator.

import numpy as np
z = np.dot(x, y)

In mathematical notation, the dot operation is written with a dot (.).

z = x . y

What does the dot product do from a mathematical standpoint? Let’s first look at the dot product of two vectors x and y. The calculation process is as follows.

import numpy as np

def naive_vector_dot(x, y):
    assert len(x.shape) == 1  # x and y are NumPy vectors
    assert len(y.shape) == 1
    assert x.shape[0] == y.shape[0]

    z = 0.
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

Notice that the dot product of two vectors is a scalar, and that only vectors with the same number of elements are compatible for a dot product. You can also take the dot product of a matrix x and a vector y, which returns a vector whose entries are the dot products between y and the rows of x. It can be implemented as follows.

import numpy as np
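def naive_matrix_vector_dot(x, y):
    assert len(x.shape) == 2  # x is a NumPy matrix
    assert len(y.shape) == 1  # y is a NumPy vector
    assert x.shape[1] == y.shape[0]  # axis 1 of x must match axis 0 of y

    z = np.zeros(x.shape[0])  # vector of zeros with one entry per row of x
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            z[i] += x[i, j] * y[j]
    return z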

You could also reuse the code written previously, which highlights the relationship between a matrix-vector product and a vector product.

def naive_matrix_vector_dot(x, y):
    z = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        z[i] = naive_vector_dot(x[i, :], y)
    return z

Note that as soon as one of the two tensors has an ndim greater than 1, dot is no longer symmetric; that is, dot(x, y) is not the same as dot(y, x).

More generally, you can take the dot product of higher-dimensional tensors, following the same shape-compatibility rules as in the 2D case. For example, (a, b, c, d) . (d,) gives (a, b, c), and (a, b, c, d) . (d, e) gives (a, b, c, e).
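Both points, the loss of symmetry and the shape rules for higher-dimensional inputs, can be checked with np.dot. The following is a small sketch; the matrices a and b and the all-ones tensors are illustrative choices.

```python
import numpy as np

# Two small matrices for which the order of the operands matters.
a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])
ab = np.dot(a, b)
ba = np.dot(b, a)
print(np.array_equal(ab, ba))  # False

# Shape rules for higher-dimensional tensors.
t = np.ones((2, 3, 4, 5))
shape_with_vector = np.dot(t, np.ones((5,))).shape
shape_with_matrix = np.dot(t, np.ones((5, 6))).shape
print(shape_with_vector, shape_with_matrix)
```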

Tensor reshaping

Reshaping a tensor means rearranging its rows and columns to match a target shape. The reshaped tensor has the same total number of elements as the initial tensor. A simple example makes reshaping easier to understand.

x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])

print(x.shape)
# output: (3, 2)

x = x.reshape((6, 1))
print(x)
# output:
# [[0.]
#  [1.]
#  [2.]
#  [3.]
#  [4.]
#  [5.]]

A special case of reshaping that is commonly encountered is transposition. Transposing a matrix means exchanging its rows and its columns, so that x[i, :] becomes x[:, i].

x = np.zeros((300, 20))
x = np.transpose(x)
print(x.shape)
# output: (20, 300)
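One detail worth checking is that transposing a matrix and calling reshape with the swapped shape give the same shape but not the same arrangement of entries. A minimal sketch on an illustrative 3 x 2 matrix:

```python
import numpy as np

x = np.arange(6).reshape((3, 2))  # [[0, 1], [2, 3], [4, 5]]

transposed = np.transpose(x)      # swaps the axes: rows become columns
reshaped = x.reshape((2, 3))      # keeps the row-major order of the entries

print(transposed)
print(reshaped)
```

Both results have shape (2, 3), but only the transposed one satisfies transposed[j, i] == x[i, j].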

Gradient-based optimization

For a continuous, smooth function f, a small change in x around a point p can be approximated by a linear function with slope a. This slope a is called the derivative of f at p. If a is negative, a small increase of x around p results in a decrease of f(x) (as shown in Figure 2-10). If a is positive, a small increase of x results in an increase of f(x). Furthermore, the absolute value of a (the magnitude of the derivative) tells you how quickly this increase or decrease happens.



For every differentiable function f(x) (differentiable means that a derivative can be computed; smooth, continuous functions, for example, are differentiable), there exists a derivative function f'(x) that maps each value of x to the slope of the local linear approximation of f at that point. For example, the derivative of cos(x) is -sin(x), the derivative of f(x) = a * x is f'(x) = a, and so on.
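The two derivative examples above can be checked numerically with a finite-difference estimate of the slope. This is only a sketch: numerical_derivative, the point p, and the step eps are names and values assumed for the illustration.

```python
import math

def numerical_derivative(f, x, eps=1e-5):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

p = 0.7
slope_cos = numerical_derivative(math.cos, p)                # close to -sin(p)
slope_linear = numerical_derivative(lambda x: 3.0 * x, p)    # close to 3.0

print(slope_cos, -math.sin(p))
print(slope_linear)
```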

Gradients

A gradient is the derivative of a tensor operation. It generalizes the concept of a derivative to functions of multidimensional inputs, that is, functions that take tensors as inputs. Consider an input vector x, a matrix W, a target y, and a loss function loss. You can use W to compute a prediction y_pred, and then compute the loss, or mismatch, between the prediction y_pred and the target y.

y_pred = dot(W, x) 
loss_value = loss(y_pred, y)

If the data inputs x and y are frozen, this can be interpreted as a function mapping values of W to loss values.

loss_value = f(W)

Let’s say the current value of W is W0. Then the derivative of f at the point W0 is a tensor gradient(f)(W0) with the same shape as W, where each coefficient gradient(f)(W0)[i, j] indicates the direction and magnitude of the change in loss_value observed when modifying W0[i, j]. That tensor gradient(f)(W0) is the gradient of the function f(W) = loss_value at W0.

We saw earlier that the derivative of a function f(x) of a single coefficient can be interpreted as the slope of the curve of f. Likewise, gradient(f)(W0) can be interpreted as the tensor describing the slope of f(W) around W0.

For a function f(x), you can reduce the value of f(x) by moving x a little in the direction opposite to the derivative. Similarly, for a function f(W) of a tensor, you can reduce f(W) by moving W in the direction opposite to the gradient, for example W1 = W0 - step * gradient(f)(W0), where step is a small scaling factor. In other words, going against the gradient intuitively puts you lower on the curve. Note that the scaling factor step is needed because gradient(f)(W0) only approximates the behavior of f close to W0, so you do not want to move too far from W0.
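A single update W1 = W0 - step * gradient(f)(W0) can be sketched on a concrete case. The example below assumes a squared-error loss, for which the gradient has a simple closed form; the data, the loss, and the names f and gradient_f are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

# Illustrative fixed data: input vector x and target y.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, -1.0])
W0 = np.zeros((2, 3))

def f(W):
    """Assumed loss: squared error between dot(W, x) and y."""
    return np.sum((np.dot(W, x) - y) ** 2)

def gradient_f(W):
    """Gradient of f with respect to W; it has the same shape as W."""
    return 2.0 * np.outer(np.dot(W, x) - y, x)

step = 0.01
W1 = W0 - step * gradient_f(W0)  # move against the gradient

print(f(W0), f(W1))  # the loss is lower after the step
```

With a much larger step the update can overshoot and increase the loss, which is why the scaling factor has to stay small.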

Stochastic gradient descent

Given a differentiable function, it is theoretically possible to find its minimum analytically: the minimum of a function is a point where the derivative is 0, so all you have to do is find all the points where the derivative goes to 0 and check at which of these points the function has the lowest value. Applied to a neural network, this means finding analytically the combination of weight values that yields the smallest possible loss. This can be done by solving the equation gradient(f)(W) = 0 for W. This is a polynomial equation of N variables, where N is the number of coefficients in the network. Although such an equation can be solved for N = 2 or N = 3, doing so is intractable for real neural networks, where the number of parameters is never less than a few thousand and can often reach several tens of millions.

Instead, you can adjust the parameters little by little based on the current loss value on a random batch of data. Because you are dealing with a differentiable function, you can compute its gradient, which gives you an efficient way to implement the steps below. If you update the weights in the direction opposite to the gradient, the loss will be a little smaller each time.

  • Draw a batch of training samples x and corresponding targets y.
  • Run the network on x to obtain predictions y_pred.
  • Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
  • Compute the gradient of the loss with respect to the network’s parameters (a backward pass).
  • Move the parameters a little in the direction opposite to the gradient, for example W -= step * gradient, thus reducing the loss on the batch a bit.

Easy enough! What was just described is called mini-batch stochastic gradient descent (mini-batch SGD). The term stochastic refers to the fact that each batch of data is drawn at random.
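The five steps above can be sketched end to end on a toy problem. The example assumes a linear model with a mean squared error loss instead of a real neural network, so that the gradient can be written by hand; the data, true_w, step, and batch_size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: targets are a noiseless linear function of the inputs.
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(256, 3))
Y = X @ true_w

def mse(w, x, y):
    """Mean squared error of the linear model x @ w against y."""
    return np.mean((x @ w - y) ** 2)

w = np.zeros(3)
step = 0.1
batch_size = 32
initial_loss = mse(w, X, Y)

for _ in range(200):
    # 1. Draw a random batch of samples and targets.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    x_batch, y_batch = X[idx], Y[idx]
    # 2. Run the model on the batch.
    y_pred = x_batch @ w
    # 3 and 4. Gradient of the batch loss with respect to w.
    gradient = 2.0 * x_batch.T @ (y_pred - y_batch) / batch_size
    # 5. Move the parameters against the gradient.
    w -= step * gradient

final_loss = mse(w, X, Y)
print(initial_loss, final_loss)
```

In a deep learning framework, the gradient computation in the loop would be provided by the framework rather than derived by hand.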

In addition, there are multiple variants of SGD that differ by taking previous weight updates into account when computing the next weight update, rather than looking only at the current value of the gradients. Examples include SGD with momentum, Adagrad, and RMSProp. Such variants are known as optimization methods or optimizers. The concept of momentum, which is used in many of these variants, deserves particular attention. Momentum addresses two issues with SGD: convergence speed and local minima. The figure below shows the curve of a loss as a function of a network parameter.



As you can see, around a certain parameter value there is a local minimum: around that point, moving left increases the loss, but so does moving right. If the parameter is optimized via SGD with a small learning rate, the optimization process may get stuck at the local minimum instead of making its way to the global minimum. You can avoid such issues by using momentum, which draws inspiration from physics. A useful mental image is to think of the optimization process as a small ball rolling down the loss curve. If it has enough momentum, the ball will not get stuck in a ravine and will end up at the global minimum. Momentum is implemented by moving the ball at each step based not only on the current slope value (current acceleration) but also on the current velocity (resulting from past acceleration). In practice, this means updating the parameter w based not only on the current gradient value but also on the previous parameter update, as in this naive implementation.

past_velocity = 0.
momentum = 0.1  # constant momentum factor
while loss > 0.01:  # optimization loop
    w, loss, gradient = get_current_parameters()
    velocity = past_velocity * momentum - learning_rate * gradient
    w = w + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    update_parameter(w)
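The loop above relies on helpers that are not defined in the text. To try the update rule, one option is to substitute a toy one-parameter loss. The sketch below assumes loss(w) = w ** 2 and a starting value of 5.0; loss_fn, gradient_fn, learning_rate, and the step cap are hypothetical choices made for the example.

```python
def loss_fn(w):
    """Toy loss chosen for illustration."""
    return w ** 2

def gradient_fn(w):
    """Derivative of the toy loss."""
    return 2.0 * w

w = 5.0               # arbitrary starting parameter
past_velocity = 0.0
momentum = 0.1        # constant momentum factor
learning_rate = 0.1
steps = 0
loss = loss_fn(w)

# The step cap guards against an endless loop if the settings diverge.
while loss > 0.01 and steps < 1000:
    gradient = gradient_fn(w)
    velocity = past_velocity * momentum - learning_rate * gradient
    w = w + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    loss = loss_fn(w)
    steps += 1

print(steps, w, loss)
```

This toy loss has a single minimum, so the sketch only shows the mechanics of the update, not the escape from a local minimum.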

Backpropagation algorithm

In the previous algorithm, we casually assumed that because a function is differentiable, we can explicitly compute its derivative. In practice, a neural network function consists of many tensor operations chained together, each of which has a simple, known derivative. For example, the following network f is composed of three tensor operations a, b, and c, with weight matrices W1, W2, and W3.

f(W1, W2, W3) = a(W1, b(W2, c(W3))) 

Calculus tells us that such a chain of functions can be differentiated using the following identity, called the chain rule: (f(g(x)))' = f'(g(x)) * g'(x). Applying the chain rule to the computation of the gradient values of a neural network gives rise to an algorithm called backpropagation (also sometimes called reverse-mode differentiation). Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter had in the loss value. Nowadays, and for years to come, people will implement networks in modern frameworks that are capable of symbolic differentiation, such as TensorFlow. This means that, given a chain of operations with known derivatives, these frameworks can compute a gradient function for the chain (by applying the chain rule) that maps network parameter values to gradient values. With such a function available, the backward pass reduces to a call to this gradient function. Thanks to symbolic differentiation, you never have to implement the backpropagation algorithm by hand. For this reason, this section does not spend your time and attention on deriving the exact formulation of backpropagation. All you need is a good understanding of how gradient-based optimization works.
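To make the chain rule concrete, here is a hand-written backward pass through a tiny chain of three operations, checked against a finite-difference estimate. The chain (multiply, tanh, squared error), the function names forward and backward, and the numeric values are assumptions made for the illustration; a framework would derive the same gradient automatically.

```python
import math

def forward(w, x, y):
    """Chain of three operations: multiply, tanh, squared error."""
    u = w * x
    v = math.tanh(u)
    loss = (v - y) ** 2
    return u, v, loss

def backward(w, x, y):
    """Gradient of the loss with respect to w, by the chain rule."""
    u, v, loss = forward(w, x, y)
    dloss_dv = 2.0 * (v - y)   # derivative of the squared error
    dv_du = 1.0 - v ** 2       # derivative of tanh
    du_dw = x                  # derivative of w * x with respect to w
    return dloss_dv * dv_du * du_dw

w, x, y = 0.5, 1.5, 0.2
analytic = backward(w, x, y)

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = (forward(w + eps, x, y)[2] - forward(w - eps, x, y)[2]) / (2 * eps)

print(analytic, numeric)
```

The backward pass multiplies the local derivatives starting from the loss and moving toward the parameter, which is the top-to-bottom order described above.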