This article is one of a series of notes I wrote while studying Deep Learning with Python (2nd edition, by Francois Chollet). The articles are Jupyter notebooks converted to Markdown, and I will release all of the notebooks on GitHub once the series is complete.
You can read the original text of the book (in English) online at: livebook.manning.com/book/deep-l…
The author of the book also provides a set of Jupyter notebooks: github.com/fchollet/de…
Chapter 2. Before we begin: the mathematical building blocks of neural networks
A first look at neural networks
Programming languages start with “Hello World”; deep learning starts with MNIST.
MNIST is used for handwritten digit recognition: it consists of 28×28 grayscale images of handwritten digits, each with a corresponding label (an integer from 0 to 9).
Import the MNIST data set
# Loading the MNIST dataset in Keras
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
Take a look at the training set:
print(train_images.shape)
print(train_labels.shape)
train_labels
Output:
(60000, 28, 28)
(60000,)
array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)
Here is the test set:
print(test_images.shape)
print(test_labels.shape)
test_labels
Output:
(10000, 28, 28)
(10000,)
array([7, 2, 1, ..., 4, 5, 6], dtype=uint8)
Network building
Let’s construct a neural network for learning MNIST sets:
from tensorflow.keras import models
from tensorflow.keras import layers
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28, )))
network.add(layers.Dense(10, activation='softmax'))
Neural networks are made up of layers. One layer is like a distillation filter, which “filters” the incoming data and “refines” the information it needs to pass to the next layer.
A series of such layers combine to process the data like an assembly line: layer by layer, the data being processed, or the “representation of the data”, becomes more and more useful for the result we ultimately want.
The network we’ve just built consists of two Dense layers, so called because they are densely connected, or fully connected.
The data goes to the last layer (layer 2), which is a 10-way softmax layer. This layer outputs an array of 10 probability values that add up to 1. This output “represents” information that is useful for predicting the digit in the image: each probability value is the probability that the input image belongs to one of the 10 digit classes (0-9)!
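If you want to check the structure you just stacked, Keras can print a summary of the model. The parameter counts in the comments are my own back-of-the-envelope calculation, roughly what you should see:
network.summary()
# dense (Dense): output shape (None, 512), 401,920 parameters (784 * 512 weights + 512 biases)
# dense_1 (Dense): output shape (None, 10), 5,130 parameters (512 * 10 weights + 10 biases)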
Compile
Next, we need to compile the network. This step needs to be given three parameters:
- Loss function: a function that evaluates how well your network performs
- Optimizer: How to update (optimize) your network
- Metrics to monitor during training and testing; in this example we only care about one metric: prediction accuracy
network.compile(loss="categorical_crossentropy",
optimizer='rmsprop',
metrics=['accuracy'])
Preprocessing
Processing the images
We also need to preprocess the image data so that it has the form the network expects.
The images in the MNIST dataset are 28×28 arrays of uint8 values in [0, 255], while our network expects flattened float32 vectors of length 28×28 with values in [0, 1].
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
Processing the labels
The labels also need to be processed.
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
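to_categorical one-hot encodes the integer labels. A quick check of my own (not from the book) shows what it produces:
print(to_categorical([0, 2, 1]))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]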
Training the network
network.fit(train_images, train_labels, epochs=5, batch_size=128)
Output:
Train on 60000 samples
Epoch 1/5
60000/60000 [==============================] - 3s 49us/sample - loss: 0.2549 - accuracy: 0.9254
Epoch 2/5
60000/60000 [==============================] - 2s 38us/sample - loss: 0.1025 - accuracy: 0.9693
Epoch 3/5
60000/60000 [==============================] - 2s 35us/sample - loss: 0.0676 - accuracy: 0.9800
Epoch 4/5
60000/60000 [==============================] - 2s 37us/sample - loss: 0.0491 - accuracy: 0.9848
Epoch 5/5
60000/60000 [==============================] - 2s 42us/sample - loss: 0.0369 - accuracy: 0.9888
<tensorflow.python.keras.callbacks.History at 0x13a7892d0>
As you can see, the training is fast, and after a while you have 98%+ accuracy on the training set.
Try it again with the test set:
test_loss, test_acc = network.evaluate(test_images, test_labels, verbose=2) # verbose=2 to avoid a looooong progress bar that fills the screen with '='. https://github.com/tensorflow/tensorflow/issues/32286
print('test_acc:', test_acc)
Output:
10000/1 - 0s - loss: 0.0362 - accuracy: 0.9789
test_acc: 0.9789
Our trained network does not perform quite as well on the test set as it did on the training set; this gap is an example of “overfitting”.
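As a small extra experiment of my own (not part of the book’s listing above), you can look at the probabilities the trained network assigns to a single test image:
predictions = network.predict(test_images[:1])  # shape (1, 10): one probability per digit class
print(predictions[0].argmax())                  # the most likely digit
print(predictions[0].max())                     # the probability assigned to it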
Data representation of neural networks
A tensor is an array of arbitrary dimensions (an array in the programming sense). A matrix is a 2-dimensional tensor.
We often refer to a dimension of a tensor as an axis.
Getting to know tensors
Scalars (0D tensors)
A scalar is a tensor with zero dimensions (zero axes); it contains a single number.
In Numpy, a float32 or float64 number is a scalar tensor.
import numpy as np
x = np.array(12)
x
Output:
array(12)
x.ndim  # number of axes (dimensions)
Output:
0
Vectors (1D tensors)
A vector is a 1-dimensional tensor (1 axis); it contains a list of scalars.
x = np.array([1, 2, 3, 4, 5])
x
Output:
array([1, 2, 3, 4, 5])
x.ndim
Output:
1
We call a vector that has five elements like this a five-dimensional vector. But notice that the 5D vector is not a 5D tensor!
- 5D vector: has only one axis, and has five dimensions along this axis.
- The 5D tensor: has 5 axes, and can have any dimension along each axis.
This is confusing, because “dimension” sometimes means the number of axes and sometimes the number of elements along an axis.
So it is clearer to speak in terms of order and say “a tensor of order 5”.
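A quick Numpy check of my own that illustrates the difference:
v = np.zeros((5,))             # a 5-dimensional vector: 1 axis with 5 elements
t = np.zeros((2, 2, 2, 2, 2))  # a tensor of order 5: 5 axes
print(v.ndim, t.ndim)          # 1 5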
Matrices (2D tensors)
A matrix is a tensor of order 2 (2 axes, usually called rows and columns); it contains a list of vectors.
x = np.array([[5, 78, 2, 34, 0],
              [6, 79, 3, 35, 1],
              [7, 80, 4, 36, 2]])
x
Output:
array([[ 5, 78, 2, 34, 0],
[ 6, 79, 3, 35, 1],
[ 7, 80, 4, 36, 2]])
Copy the code
x.ndim
Output:
2
Higher-order tensors
If you pack matrices into an array, you get a tensor of order 3.
Pack order-3 tensors into an array and you get a tensor of order 4, and so on for higher-order tensors.
x = np.array([[[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]]])
x.ndim
Output:
3
In deep learning, we usually use tensors of order 0 to 4.
The three key attributes of a tensor
- Order (number of axes): 3, 5, …
- Shape (the number of dimensions along each axis): (2, 1, 3), (6, 5, 5, 3, 6), …
- Data type: float32, uint8, …
Let’s look at the tensor data in MNIST:
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print(train_images.ndim)
print(train_images.shape)
print(train_images.dtype)
Output:
3
(60000, 28, 28)
uint8
So train_images is an order-3 tensor of 8-bit unsigned integers.
Let’s display one of the images:
digit = train_images[0]
import matplotlib.pyplot as plt
print("image:")
plt.imshow(digit, cmap=plt.cm.binary)
plt.show()
print("label: ", train_labels[0])
Output:
label: 5
Copy the code
Manipulating tensors in Numpy
Tensor slicing:
my_slice = train_images[10:100]
print(my_slice.shape)
Output:
(90, 28, 28)
Copy the code
Is equivalent to:
my_slice = train_images[10:100, :, :]
print(my_slice.shape)
Output:
(90, 28, 28)
Copy the code
And also equivalent to:
my_slice = train_images[10:100, 0:28, 0:28]
print(my_slice.shape)
Output:
(90, 28, 28)
Select 14×14 in the lower right corner:
my_slice = train_images[:, 14:, 14:]
plt.imshow(my_slice[0], cmap=plt.cm.binary)
plt.show()
Output: (plot of the bottom-right 14×14 corner of the first image)
Select 14×14 at the center:
my_slice = train_images[:, 7: -7.7: -7]
plt.imshow(my_slice[0], cmap=plt.cm.binary)
plt.show()
Output: (plot of the central 14×14 patch of the first image)
Data batches
In deep learning datasets, the first axis (index 0) is usually called the “samples axis” (or “samples dimension”).
In deep learning, we typically don’t process the whole data set at once, we process it batch by batch.
In MNIST, one batch could contain 128 samples:
# The first batch
batch = train_images[:128]
# The second batch
batch = train_images[128:256]
# The n-th batch
n = 12
batch = train_images[128 * n : 128 * (n + 1)]
Therefore, when using batch, we also call the first axis “batch axis”.
Common tensor representations of data
Data | Tensor order | Shape |
---|---|---|
Vector data | 2D | (samples, features) |
Time series | 3D | (samples, timesteps, features) |
Images | 4D | (samples, height, width, channels) or (samples, channels, height, width) |
Video | 5D | (samples, frames, height, width, channels) or (samples, frames, channels, height, width) |
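To make these shapes concrete, here is a small sketch of my own (the specific sizes are just illustrative):
vector_data = np.zeros((1000, 16))       # 1000 samples with 16 features each
timeseries = np.zeros((250, 390, 3))     # 250 samples, 390 timesteps, 3 features
images = np.zeros((128, 256, 256, 3))    # 128 RGB images, channels-last
video = np.zeros((4, 240, 144, 256, 3))  # 4 clips of 240 frames each
print(vector_data.ndim, timeseries.ndim, images.ndim, video.ndim)  # 2 3 4 5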
“Gears” of neural networks: Tensor operations
In our first neural network example (MNIST), each of our layers actually does something like this on the input data:
output = relu(dot(W, input) + b)
Here, input is the input data, W and b are attributes (weights) of the layer, and output is the layer’s output.
The expression involves relu, dot, and addition operations, which we explain next.
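Before breaking these down, here is the whole expression written out in plain Numpy, as a sketch of my own (the shapes are chosen to match the MNIST layer above; the real Keras layer arranges the computation a little differently):
W = np.random.random((512, 784))          # the layer's kernel (weights)
b = np.random.random((512,))              # the layer's bias
x = np.random.random((784,))              # one flattened 28 * 28 input image
output = np.maximum(np.dot(W, x) + b, 0)  # relu(dot(W, input) + b)
print(output.shape)                       # (512,)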
Element-wise operations
An element-wise operation acts on each element of the tensor independently. For example, let’s implement a simple relu (relu(x) = max(x, 0)):
def naive_relu(x):
    assert len(x.shape) == 2  # x is a 2D Numpy tensor.
    x = x.copy()  # Avoid overwriting the input tensor.
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)
    return x
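A quick check of my own that the naive version agrees with Numpy’s built-in element-wise maximum:
x = np.array([[1., -2.], [-3., 4.]])
print(naive_relu(x))      # [[1. 0.] [0. 4.]]
print(np.maximum(x, 0.))  # same result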
Addition is also a per-element operation:
def naive_add(x, y):
    # assert x and y are 2D Numpy tensors and have the same shape.
    assert len(x.shape) == 2
    assert x.shape == y.shape
    x = x.copy()  # Avoid overwriting the input tensor.
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[i, j]
    return x
In Numpy, these element-wise operations are already implemented for us. The actual computation is delegated to BLAS routines written in C or Fortran, so it is very fast.
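To get a feel for the speed difference, here is a rough timing sketch of my own, comparing the naive loops above with Numpy (absolute times will vary from machine to machine):
import time
x = np.random.random((20, 100))
y = np.random.random((20, 100))
t0 = time.time()
for _ in range(1000):
    z = x + y
    z = np.maximum(z, 0.)
print("numpy / BLAS:", time.time() - t0)
t0 = time.time()
for _ in range(1000):
    z = naive_add(x, y)
    z = naive_relu(z)
print("naive loops: ", time.time() - t0)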
You can check whether BLAS is installed:
import numpy as np
np.show_config()
Output:
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
Here’s how to use numpy’s element-by-element relu and add:
a = np.array([[1, 2, 3],
              [-1, 2, -3],
              [3, -1, 4]])
b = np.array([[6, 7, 8],
              [-2, -3, 1],
              [1, 0, 4]])
c = a + b # Element-wise addition
d = np.maximum(c, 0) # Element-wise relu
print(c)
print(d)
Output:
[[ 7  9 11]
 [-3 -1 -2]
 [ 4 -1  8]]
[[ 7  9 11]
 [ 0  0  0]
 [ 4  0  8]]
Broadcasting
When performing an element-wise operation on two tensors with different shapes, the smaller tensor will be “broadcast” to match the shape of the larger one, if possible.
Specifically, broadcasting is possible between two tensors with shapes (a, b, …, n, n + 1, …, m) and (n, n + 1, …, m).
For example:
x = np.random.random((64, 3, 32, 10))  # x is a random tensor with shape (64, 3, 32, 10).
y = np.random.random((32, 10))         # y is a random tensor with shape (32, 10).
z = np.maximum(x, y)                   # The output z has shape (64, 3, 32, 10) like x.
Broadcasting works in two steps:
- Axes (called broadcast axes) are added to the smaller tensor until its number of axes (ndim) matches the larger tensor’s
- The smaller tensor is repeated along these new axes until its shape matches the larger tensor’s
E.g.
x: (32, 10), y: (10,)
Step 1: add an empty first axis to y, giving Y with shape (1, 10)
Step 2: repeat Y 32 times along this new axis, giving Y with shape (32, 10)
Now Y[i, :] == y for every i in range(0, 32).
Of course, a real implementation doesn’t actually copy the data like this, which would waste memory; the repetition happens inside the algorithm instead. For example, here is a simple implementation of adding a vector to a matrix:
def naive_add_matrix_and_vector(m, v):
    assert len(m.shape) == 2  # m is a 2D Numpy tensor.
    assert len(v.shape) == 1  # v is a Numpy vector.
    assert m.shape[1] == v.shape[0]
    m = m.copy()  # Avoid overwriting the input tensor.
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            m[i, j] += v[j]
    return m

naive_add_matrix_and_vector(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                            np.array([1, -1, 100]))
Output:
array([[ 2, 1, 103],
[ 5, 4, 106],
[ 8, 7, 109]])
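For comparison (my own check), Numpy performs the same broadcast automatically when you add a matrix and a vector:
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
v = np.array([1, -1, 100])
print(m + v)  # v is broadcast across each row; same result as naive_add_matrix_and_vector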
Dot product of tensors
The tensor dot product, or tensor product, is computed in Numpy with np.dot(x, y).
The operation of the dot product can be seen in the following simple program:
# Dot product of two vectors
def naive_vector_dot(x, y):
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    assert x.shape[0] == y.shape[0]
    z = 0.
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

# Dot product of a matrix and a vector
def naive_matrix_vector_dot(x, y):
    z = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        z[i] = naive_vector_dot(x[i, :], y)
    return z

# Matrix dot product
def naive_matrix_dot(x, y):
    assert len(x.shape) == 2
    assert len(y.shape) == 2
    assert x.shape[1] == y.shape[0]
    z = np.zeros((x.shape[0], y.shape[1]))
    for i in range(x.shape[0]):
        for j in range(y.shape[1]):
            row_x = x[i, :]
            column_y = y[:, j]
            z[i, j] = naive_vector_dot(row_x, column_y)
    return z
a = np.array([[1, 2, 3],
              [-1, 2, -3],
              [3, -1, 4]])
b = np.array([[6, 7, 8],
              [-2, -3, 1],
              [1, 0, 4]])
naive_matrix_dot(a, b)
Output:
array([[ 5., 1., 22.],
[-13., -13., -18.],
[ 24., 24., 39.]])
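A quick cross-check of my own against Numpy’s built-in implementation (same values, integer dtype here):
print(np.dot(a, b))
print(a @ b)  # the @ operator is equivalent for Numpy arrays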
The same rule applies to dot products between higher-dimensional tensors. For example (in terms of shapes):
(a, b, c, d) . (d,) -> (a, b, c)
(a, b, c, d) . (d, e) -> (a, b, c, e)
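A quick way to verify these shape rules with Numpy (the shapes here are just examples I picked):
x = np.random.random((2, 3, 4, 5))
print(np.dot(x, np.random.random((5,))).shape)    # (2, 3, 4)
print(np.dot(x, np.random.random((5, 6))).shape)  # (2, 3, 4, 6)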
Tensor reshaping
Reshaping, in short, keeps the same elements but rearranges them into a different shape.
x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])
print(x.shape)
Output:
(3, 2)
Copy the code
x.reshape((6, 1))
Output:
array([[0.],
[1.],
[2.],
[3.],
[4.],
[5.]])
x.reshape((2, 3))
Output:
array([[0., 1., 2.],
[3., 4., 5.]])
Copy the code
Transposition is a special kind of reshaping: it exchanges rows and columns.
After transposition, the original x[i, :] becomes x[:, i].
x = np.zeros((300, 20))
y = np.transpose(x)
print(y.shape)
Output:
(20, 300)
Copy the code
The “engine” of neural networks: Gradient based optimization
In our first neural network example (MNIST), each layer performs operations on the input data:
output = relu(dot(W, input) + b)
In this formula, W and b are attributes of the layer (its weights, or trainable parameters). Specifically, W is the layer’s kernel attribute and b is its bias attribute.
These “weights” are what the neural network learns from the data.
At first, these weights are randomly initialized to small values. Then, starting from that random output, the network adjusts them based on feedback and gradually improves.
This improvement happens inside a “training loop”, which repeats as long as necessary:
- Draw a batch of training samples X and the corresponding targets y
- Forward pass: run the network on X to obtain the predictions y_pred
- Compute the loss of the network on this batch from y_pred and y
- Update the parameters in some way to reduce the loss
The first three steps are relatively simple; the fourth step, updating the parameters, is the hard part. An effective and practical approach is to exploit the fact that the network is differentiable: compute the gradient of the loss and move the parameters in the opposite direction of the gradient.
Derivative
This section explains the definition of a derivative.
(See the book for the details.)
So if you know the derivative, and you have to update x to minimize a function f of x, you just have to move x in the opposite direction of the derivative.
Gradient
The gradient is the derivative of a tensor operation; it generalizes the notion of derivative to functions of multidimensional (tensor) inputs. The gradient at a point can be interpreted as describing the curvature of the function around that point.
Consider:
y_pred = dot(W, x)
loss_value = loss(y_pred, y)
If x and y are fixed, then loss_value will be a function of W:
loss_value = f(W)
Let the current point be W0. The derivative (gradient) of f at W0 is written gradient(f)(W0); it is a tensor of the same shape as W, and each element gradient(f)(W0)[i, j] gives the direction and magnitude of the change in f when W0[i, j] is changed.
So, to change the value of W in order to minimize f, you can move in the opposite direction of the gradient (i.e. the direction of gradient descent):
W1 = W0 - step * gradient(f)(W0)
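To make the idea concrete, here is a toy one-parameter example of my own (not from the book): repeatedly stepping against the derivative of f(w) = (w - 3) ** 2 drives w toward the minimum at w = 3.
w = 0.0     # starting value of the parameter
step = 0.1  # learning rate
for _ in range(100):
    gradient = 2 * (w - 3)   # derivative of (w - 3) ** 2 at w
    w = w - step * gradient  # move against the gradient
print(w)  # very close to 3.0, the minimizer of f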
Stochastic Gradient Descent
In theory, the minimum of a differentiable function must be attained at a point where its derivative is zero. So we could find all the points where the derivative is zero, compare the function values there, and take the smallest.
Applied to a neural network, this means solving the equation gradient(f)(W) = 0 for W, an equation in N variables (N = the number of parameters in the network). In practice N is generally not less than a thousand, which makes solving this equation analytically essentially impossible.
So instead we use the four-step procedure above, and in the fourth step we use gradient descent: update the parameters a little at a time in the opposite direction of the gradient, moving step by step in the direction that reduces the loss:
- Draw a batch of training samples X and the corresponding targets y
- Forward pass: run the network on X to obtain the predictions y_pred
- Compute the loss of the network on this batch from y_pred and y
- Update the parameters in a way that reduces the loss, which breaks down into the two steps below:
- Backward pass: compute the gradient of the loss with respect to the network’s parameters
- Move the parameters a little in the opposite direction of the gradient to reduce the loss (W -= step * gradient)
This method is called mini-batch stochastic gradient descent (mini-batch SGD). The word stochastic (random) refers to the fact that each batch of data in step 1 is drawn at random.
Some variants of SGD update values not only by looking at the current gradient, but also by looking at the last weight update. These variations are called “optimization methods” or “optimizers.” In many of these variants, a concept called momentum is used.
Momentum addresses two problems with SGD: convergence speed and local minima. When the learning rate is small, plain SGD can get stuck in a local minimum instead of continuing toward the global minimum; momentum helps it escape.
The momentum here is the momentum concept that comes from physics. We can imagine a small ball rolling down the loss surface (the direction of gradient descent), and if there is enough momentum, it can “dash” past the local minimum and not get trapped there. In this example, the ball’s motion is determined not only by the slope of its current position (the current acceleration), but also by its current velocity (which depends on the previous acceleration).
This idea is put into the neural network, that is, a weight update, not only looks at the current gradient, but also looks at the last weight update:
# Naive implementation of optimization with momentum
# (pseudocode: loss, learning_rate, get_current_parameters and update_parameter
#  are placeholders standing in for the rest of the training machinery)
past_velocity = 0.
momentum = 0.1  # constant momentum factor
while loss > 0.01:  # optimization loop
    w, loss, gradient = get_current_parameters()
    velocity = past_velocity * momentum + learning_rate * gradient
    w = w + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    update_parameter(w)
Backpropagation algorithm: the chain rule
A neural network is a chain of tensor operations, such as:
f(W1, W2, W3) = a(W1, b(W2, c(W3)))  # where W1, W2, W3 are weights
Calculus gives us the chain rule: (f(g(x)))' = f'(g(x)) * g'(x).
By applying the chain rule to the neural network, an algorithm called “Backpropagation”, which is also called “reverse mode differentiation”, is produced.
Back propagation starts from the final calculated loss, and works backwards from the top layer of the neural network to the bottom layer. The chain rule is used to calculate the contribution of each parameter in each layer to the loss.
Frameworks like TensorFlow today have a capability called “symbolic differentiation.” This allows these frameworks to automatically compute the gradient function for a given neural network operation, and then instead of manually implementing back propagation (which is interesting, but really annoying to write), we can just take the value from the gradient function.
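For example, here is a minimal sketch of my own (assuming TensorFlow 2.x) of asking the framework for a gradient instead of deriving it by hand:
import tensorflow as tf
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x + 2.0 * x  # y = x^2 + 2x
grad = tape.gradient(y, x)
print(grad.numpy())      # 8.0, i.e. dy/dx = 2x + 2 evaluated at x = 3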