原文 : Deep Learning From Scratch IV: Gradient Descent and Backpropagation
Translation: Sun Yimeng
- Chapter 1: Calculation diagram
- Chapter two: Perceptrons
- Chapter three: Training standards
- Chapter 4: Gradient descent and back propagation
- Chapter 5: Multilayer perceptron
- Chapter 6: TensorFlow
Gradient descent
In general, to find the minimum value of the function, we can set its derivative to be 0, and then solve for the parameters. But actually you have to find 和 A closed form solution of phi is impossible, so we have to passGradient descentIteratively find the minimum.
Imagine you are standing on a high mountain, and if you want to get down to the ground in the shortest time, choose the steepest part of the mountain.
Gradient descent is the same, it takes a random number in the range of parameters, and then iteratively searches for the lossThe smallest case. At every step of the iteration, you find oneDrop the value in the steepest direction, and then continue iterating one step in that direction.
The one-dimensional case of this process is shown below:
The direction in which the steepest decline occurs at a particular point is determined by the gradient of the function at that point: the negative of the gradient (because it is falling) is the direction in which the loss falls fastest.
In that case, how do you minimize itHere are some ideas:
- First, 和 A random value
- To calculateabout 和 The gradient of
- Take the negative of the gradient as the direction and iterate one step further
- Go back to step 2
Let’s write an operation that minimizes a node by gradient descent. The learning_rate parameter allows you to set the step size for each step.
import numpy as np
class Operation:
"""Represents a graph node that performs a computation. An `Operation` is a node in a `Graph` that takes zero or more objects as input, and produces zero or more objects as output. """
def __init__(self, input_nodes=[]):
"""Construct Operation """
self.input_nodes = input_nodes
# Initialize list of consumers (i.e. nodes that receive this operation's output as input)
self.consumers = []
# Append this operation to the list of consumers of all input nodes
for input_node in input_nodes:
input_node.consumers.append(self)
# Append this operation to the list of operations in the currently active default graph
_default_graph.operations.append(self)
def compute(self):
"""Computes the output of this operation. "" Must be implemented by the particular operation. """
pass
class Graph:
"""Represents a computational graph """
def __init__(self):
"""Construct Graph"""
self.operations = []
self.placeholders = []
self.variables = []
def as_default(self):
global _default_graph
_default_graph = self
class placeholder:
"""Represents a placeholder node that has to be provided with a value when computing the output of a computational graph """
def __init__(self):
"""Construct placeholder """
self.consumers = []
# Append this placeholder to the list of placeholders in the currently active default graph
_default_graph.placeholders.append(self)
class Variable:
"""Represents a variable (i.e. an intrinsic, changeable parameter of a computational graph). """
def __init__(self, initial_value=None):
"""Construct Variable Args: initial_value: The initial value of this variable """
self.value = initial_value
self.consumers = []
# Append this variable to the list of variables in the currently active default graph
_default_graph.variables.append(self)
class add(Operation):
"""Returns x + y element-wise. """
def __init__(self, x, y):
"""Construct add Args: x: First summand node y: Second summand node """
super().__init__([x, y])
def compute(self, x_value, y_value):
"""Compute the output of the add operation Args: x_value: First summand value y_value: Second summand value """
return x_value + y_value
class matmul(Operation):
"""Multiplies matrix a by matrix b, producing a * b. """
def __init__(self, a, b):
"""Construct matmul Args: a: First matrix b: Second matrix """
super().__init__([a, b])
def compute(self, a_value, b_value):
"""Compute the output of the matmul operation Args: a_value: First matrix value b_value: Second matrix value """
return a_value.dot(b_value)
class Session:
"""Represents a particular execution of a computational graph. """
def run(self, operation, feed_dict={}):
"""Computes the output of an operation Args: operation: The operation whose output we'd like to compute. feed_dict: A dictionary that maps placeholders to values for this session """
# Perform a post-order traversal of the graph to bring the nodes into the right order
nodes_postorder = traverse_postorder(operation)
# Iterate all nodes to determine their value
for node in nodes_postorder:
if type(node) == placeholder:
# Set the node value to the placeholder value from feed_dict
node.output = feed_dict[node]
elif type(node) == Variable:
# Set the node value to the variable's value attribute
node.output = node.value
else: # Operation
# Get the input values for this operation from node_values
node.inputs = [input_node.output for input_node in node.input_nodes]
# Compute the output of this operation
node.output = node.compute(*node.inputs)
# Convert lists to numpy arrays
if type(node.output) == list:
node.output = np.array(node.output)
# Return the requested node value
return operation.output
def traverse_postorder(operation):
"""Performs a post-order traversal, returning a list of nodes in the order in which they have to be computed Args: operation: The operation to start traversal at """
nodes_postorder = []
def recurse(node):
if isinstance(node, Operation):
for input_node in node.input_nodes:
recurse(input_node)
nodes_postorder.append(node)
recurse(operation)
return nodes_postorder
red_points = np.random.randn(50, 2) - 2*np.ones((50, 2))
blue_points = np.random.randn(50, 2) + 2*np.ones((50, 2))
class negative(Operation):
"""Evaluate negative numbers element by element."""
def __init__(self, x):
"""Construct negative operation parameter list: x: input node"""
super().__init__([x])
def compute(self, x_value):
"""Calculate the negative operation output parameter list: x_value: input value"""
return -x_value
class log(Operation):
"""Take the logarithm of each element."""
def __init__(self, x):
"""Construct log argument list: x: input node"""
super().__init__([x])
def compute(self, x_value):
"""Calculate log operation output parameter list: x_value: input value"""
return np.log(x_value)
class multiply(Operation):
"""For each element, return the value of x times y."""
def __init__(self, x, y):
"""Construct the multiplication argument list: x: input node for the first multiplier y: input node for the second multiplier"""
super().__init__([x, y])
def compute(self, x_value, y_value):
"""Compute the output of the multiplication operation Args: x_value: the value of the first multiplier y_value: the value of the second multiplier"""
return x_value * y_value
class negative(Operation):
"""Evaluate negative numbers element by element."""
def __init__(self, x):
"""Construct negative operation parameter list: x: input node"""
super().__init__([x])
def compute(self, x_value):
"""Calculate the negative operation output parameter list: x_value: input value"""
return -x_value
class reduce_sum(Operation):
"""Calculate the sum of the elements of a tensor in one or more dimensions."""
def __init__(self, A, axis=None):
"""Construct reduce_sum parameter list: A: tensor to perform reduce operation Axis: dimensions that require reduce, if 'None' (the default), extend reduce for all dimensions."""
super().__init__([A])
self.axis = axis
def compute(self, A_value):
"""Calculate the parameter list of the output values of the reduce_sum operation: A_value: tensor value of the input"""
return np.sum(A_value, self.axis)
class softmax(Operation):
""Return the result of a's Softmax function.""
def __init__(self, a):
"""Construct softmax parameter list: A: input node"""
super().__init__([a])
def compute(self, a_value):
"""Calculate the output values of the Softmax Operation parameter list: a_value: input value"""
return np.exp(a_value) / np.sum(np.exp(a_value), axis=1)[:, None]
from queue import Queue
class GradientDescentOptimizer:
def __init__(self, learning_rate):
self.learning_rate = learning_rate
def minimize(self, loss):
learning_rate = self.learning_rate
class MinimizationOperation(Operation):
def compute(self):
# Calculate gradient
grad_table = compute_gradients(loss)
Pass through all nodes
for node in grad_table:
if type(node) == Variable:
Find the gradient corresponding to the node
grad = grad_table[node]
# Do one iteration in the direction of the negative gradient
node.value -= learning_rate * grad
return MinimizationOperation()
Copy the code
The figure below depicts an iterative process of gradient descent. Let’s start with a random line (label 1) and move forward to find a slightly better line (label 2). Go one step at a time until you find a good dividing line.
In the above realization of gradient descent, we use the compute_gradient(Loss) function to calculate the gradient of the output of loss operation on other node N in the calculation graph (that is, the direction in which the loss increases the fastest, and the negative of the gradient is the direction in which the loss decreases the fastest).
Consider the following calculation:
According to the chain rule of derivatives, we can get:
So if we want to figure outaboutThe gradient of theta from thetaHere we go, all the way backFor each node along the way, compute the gradient of its output with respect to its input, one by one, all the way to. And then you multiply all the gradients.
Consider this scenario:
In this case, from 到 There are two paths: 和 . As a result,about 的Total differentialThe solution is as follows:
From this we can derive a general algorithm to calculate the gradient of the loss with respect to a node: start at the node representing the loss and do the breadth-first search backwards. Each time a node N is visited, the gradient of the output of n is calculated. Here, the gradient is calculated by performing the following operations on each consumption node C of node N (consumption node is node C that takes the output of node N as input) :
- Gain loss about
c
Gradient of outputG
- will
G
onc
About the output ofn
The gradient of the output of
And then we add up these gradients.
As a prerequisite for backpropagation, we need to specify a function for each operation that calculates the gradient of each input to the operation from the gradient of the output to the operation.
To do this we define a decorator @registergradient (Operation_name) :
# Operation mapping to the corresponding gradient calculation function
_gradient_registry = {}
class RegisterGradient:
""Decorator for registering gradient calculation function for op type""
def __init__(self, op_type):
"""Create a decorator instance with op_type as the Operation type parameter list: op_type: the name of Operation"""
self._op_type = eval(op_type)
def __call__(self, f):
"""Register function 'f' as a gradient calculation function of 'op_type'"""
_gradient_registry[self._op_type] = f
return f
Copy the code
Now assume that the _gradient_registry dictionary is complete, so let’s do the back-propagation:
from queue import Queue
def compute_gradients(loss):
Grad_table [node] can be used to retrieve the gradient of the node output
grad_table = {}
The gradient with respect to the loss is 1
grad_table[loss] = 1
# Reverse breadth-first search, starting at the loss node
visited = set()
queue = Queue()
visited.add(loss)
queue.put(loss)
while not queue.empty():
node = queue.get()
# If the node is not a lost node
ifnode ! = loss:#
Calculate the loss gradient with respect to the node output
#
grad_table[node] = 0
Pass through all consumption nodes
for consumer in node.consumers:
Extract the gradient of loss with respect to the output of the consumption node
lossgrad_wrt_consumer_output = grad_table[consumer]
Compute the function of the gradient of the input of the consumer node based on the gradient of the output of the consumer node
consumer_op_type = consumer.__class__
bprop = _gradient_registry[consumer_op_type]
Get the gradient of loss with respect to all inputs of the consumption node
lossgrads_wrt_consumer_inputs = bprop(consumer, lossgrad_wrt_consumer_output)
if len(consumer.input_nodes) == 1:
Lossgrads_wrt_consumer_inputs is a scalar if the consumption node has only one input node
grad_table[node] += lossgrads_wrt_consumer_inputs
else:
Lossgrads_wrt_consumer_inputs is an array of gradients for each input node
Get the sequence of this node in the input node of the consumption node
node_index_in_consumer_inputs = consumer.input_nodes.index(node)
Get the gradient of the loss with respect to the node
lossgrad_wrt_node = lossgrads_wrt_consumer_inputs[node_index_in_consumer_inputs]
Theta to the total gradient with respect to this node
grad_table[node] += lossgrad_wrt_node
#
Queue each input node
#
if hasattr(node, "input_nodes") :for input_node in node.input_nodes:
if not input_node in visited:
visited.add(input_node)
queue.put(input_node)
# return all gradients about visited points
return grad_table
Copy the code
For each operation, we define a function that translates the gradient of losses with respect to the output of the operation into the gradient of losses with respect to each input node of the operation, storing them in a list. Calculating the gradient of a matrix is boring, so let’s ignore that, and I’ll just give you the result. Skipping this section will not affect your understanding of the whole.
If you want to know how to get this result, it looks like this:
- Find the partial derivatives of each output value with respect to each input value (possibly a tensor of order greater than 2, i.e., neither scalar, vector, nor matrix, involving a lot of summation)
- Using the chain rule, calculate the gradient of the input with respect to the node by the gradient of the output with respect to the node. The result is going to be a tensor of the same form as the input tensor, which means that if the input is a matrix, the result is going to be a matrix.
- Write the result as a series of matrices operation to improve the operation efficiency. This is a tricky step.
Negative operation corresponds to gradient calculation
Known aboutThe gradient of, so aboutThe gradient of phi is.
@RegisterGradient("negative")
def _negative_gradient(op, grad):
"""List of gradient parameters for calculating 'negative' : op: we need to process 'negative' Operation grad: gradient return value for this' negative 'Operation: Gradient of input with respect to 'negative'""
return -grad
Copy the code
The gradient of the logarithm
Known aboutThe gradient ofAbout,The gradient of phi is.
@RegisterGradient("log")
def _log_gradient(op, grad):
""Compute the gradient of log. Parameter list: op: to process 'log' Operation grad: Gradient of the output of 'log' Operationa Return value: gradient of the input of 'log'""
x = op.inputs[0]
return grad/x
Copy the code
The gradient corresponding to multiplication is computed
Known aboutThe gradient ofAbout,The gradient ofAbout theThe gradient is.
@RegisterGradient("multiply")
def _multiply_gradient(op, grad):
"""Calculate the gradient of 'multiply'. Parameter list: op: The multiply 'Operation grad: About the gradient of the output of the multiply' operationa Return value: About the gradient of the input of the multiply '""
A = op.inputs[0]
B = op.inputs[1]
return [grad * B, grad * A]
Copy the code
Matmul corresponding gradient operation
Known aboutThe gradient ofAbout,The gradient ofAbout theThe gradient is.
@RegisterGradient("matmul")
def _matmul_gradient(op, grad):
"""Calculate the gradient of 'matmul'. Parameter list: op: to process 'matmul' Operation grad: Gradient of the output of 'matmul' Operationa Return value: gradient of the input of 'matmul'"""
A = op.inputs[0]
B = op.inputs[1]
return [grad.dot(B.T), A.T.dot(grad)]
Copy the code
The gradient operation for add
Known aboutThe gradient isAbout,The gradient ofAbout theThe gradient of phi is also. so 和 The shape is the same.
if 和 Different shapes, like matricesThe number of rows is 100, vector 1, 2The number of rows is 1, soIs toAdded to theOn each line of. In this case, the gradient calculation is a little more complicated, so we won’t go into the details here.
@RegisterGradient("add")
def _add_gradient(op, grad):
"""Calculate the gradient parameter list corresponding to 'add' : op: to process 'add' Operation grad: Gradient on the output of 'add' Operationa Return value: gradient on the input of 'add'"""
a = op.inputs[0]
b = op.inputs[1]
grad_wrt_a = grad
while np.ndim(grad_wrt_a) > len(a.shape):
grad_wrt_a = np.sum(grad_wrt_a, axis=0)
for axis, size in enumerate(a.shape):
if size == 1:
grad_wrt_a = np.sum(grad_wrt_a, axis=axis, keepdims=True)
grad_wrt_b = grad
while np.ndim(grad_wrt_b) > len(b.shape):
grad_wrt_b = np.sum(grad_wrt_b, axis=0)
for axis, size in enumerate(b.shape):
if size == 1:
grad_wrt_b = np.sum(grad_wrt_b, axis=axis, keepdims=True)
return [grad_wrt_a, grad_wrt_b]
Copy the code
Reduce_sum corresponding gradient calculation
Known aboutThe gradient of the output of, is about inputThe gradient of is repeated along the specified axis.
@RegisterGradient("reduce_sum")
def _reduce_sum_gradient(op, grad):
"""Calculate the gradient parameter list corresponding to 'reduce_sum' : op: to process the 'reduce_sum' Operation grad: The gradient return value of the output of 'reduce_sum' operationa: On the input gradient of 'reduce_sum'"""
A = op.inputs[0]
output_shape = np.array(A.shape)
output_shape[op.axis] = 1
tile_scaling = A.shape // output_shape
grad = np.reshape(grad, output_shape)
return np.tile(grad, tile_scaling)
Copy the code
Gradient calculation corresponding to Softmax
@RegisterGradient("softmax")
def _softmax_gradient(op, grad):
"""Calculate the gradient parameter list corresponding to 'softmax' : op: to process 'softmax' Operation grad: Gradient of the output of 'softmax' Operationa Return value: gradient of the input of 'softmax'"""
softmax = op.output
return (grad - np.reshape(
np.sum(grad * softmax, 1),
[-1, 1]
)) * softmax
Copy the code
Now let’s try out the code above and see what the best weight we can get for the perceptron is.
Create a new Graph
Graph().as_default()
X = placeholder()
c = placeholder()
# Randomly initialize weights
W = Variable(np.random.randn(2, 2))
b = Variable(np.random.randn(2))
# Build a perceptron
p = softmax(add(matmul(X, W), b))
# Build cross entropy loss
J = negative(reduce_sum(reduce_sum(multiply(c, log(p)), axis=1)))
Build minimize operationMinimization_op = GradientDescentOptimizer (learning_rate = 0.01). Minimize (J)# build placeholder input
feed_dict = {
X: np.concatenate((blue_points, red_points)),
c:
[[1, 0]] * len(blue_points)
+ [[0, 1]] * len(red_points)
}
Create a session
session = Session()
Perform 100 gradient descent iterations
for step in range(100):
J_value = session.run(J, feed_dict)
if step % 10 == 0:
print("Step:", step, " Loss:", J_value)
session.run(minimization_op, feed_dict)
Print the final result
W_value = session.run(W)
print("Weight matrix:\n", W_value)
b_value = session.run(b)
print("Bias:\n", b_value)
Copy the code
Notice, now we start with a pretty high loss, and then we let it go down pretty quickly.
Finally, let’s draw the final dividing line:
Import matplotlib.pyplot as PLT [/amalthea_pre_exercise_code] [amalthea_sample_code] W_value = Np.array ([[1.27496197 [1.11820232-2.01586474]]) b_value = Np.array ([-0.45274057-0.39071841])# Draw the line y = -x
x_axis = np.linspace(-4, 4, 100)
y_axis = -W_value[0][0]/W_value[1][0] * x_axis - b_value[0]/W_value[1][0]
plt.plot(x_axis, y_axis)
# Draw the red/blue dots
plt.scatter(red_points[:,0], red_points[:,1], color='red')
plt.scatter(blue_points[:,0], blue_points[:,1], color='blue')
plt.show()
Copy the code
If you have any questions, please ask them in the comments.
If you want to see if you can make a move into artificial intelligence, please click right: Intelligent AI class