Original: Deep Learning From Scratch IV: Gradient Descent and Backpropagation

Translation: Sun Yimeng


  • Chapter 1: Computational Graphs
  • Chapter 2: Perceptrons
  • Chapter 3: Training Criterion
  • Chapter 4: Gradient Descent and Backpropagation
  • Chapter 5: Multi-Layer Perceptrons
  • Chapter 6: TensorFlow

Gradient descent

In general, to find the minimum of a function we could set its derivative to zero and solve for the parameters. For the losses we care about, however, a closed-form solution is usually impossible, so we instead find the minimum iteratively via gradient descent.

Imagine you are standing on a mountain and want to reach the ground as quickly as possible: at each step, you would walk in the direction where the slope is steepest.

Gradient descent works the same way. It starts from random values of the parameters and then iteratively searches for the parameters that make the loss smallest. At every iteration it determines the direction in which the loss decreases most steeply, and takes one step in that direction.

The one-dimensional case of this process is shown below:

The direction of steepest descent at a given point is determined by the gradient of the function at that point: the negative of the gradient (since we are descending) is the direction in which the loss decreases fastest.

So how do we minimize the loss? The basic procedure is as follows (a toy standalone sketch of this loop follows the list):

  1. Start with random values for the parameters
  2. Compute the gradient of the loss with respect to the parameters
  3. Take a step in the direction of the negative gradient
  4. Go back to step 2
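
As a toy illustration of this loop (not part of the article's computational-graph code), here is plain gradient descent on a simple one-dimensional function whose derivative we can write down directly:

import numpy as np

# Minimize f(x) = (x - 3)^2 with plain gradient descent; f'(x) = 2 * (x - 3)
learning_rate = 0.1
x = np.random.randn()          # 1. start from a random value

for step in range(100):
    grad = 2 * (x - 3)         # 2. gradient of f at the current point
    x -= learning_rate * grad  # 3. step in the direction of the negative gradient
                               # 4. repeat

print(x)  # converges to approximately 3, the minimizer of f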

Let’s write an operation that minimizes the value of a node via gradient descent. Its learning_rate parameter sets the size of each step.

import numpy as np

class Operation:
    """Represents a graph node that performs a computation. An `Operation` is a node in a `Graph` that takes zero or more objects as input, and produces zero or more objects as output. """

    def __init__(self, input_nodes=[]):
        """Construct Operation """
        self.input_nodes = input_nodes

        # Initialize list of consumers (i.e. nodes that receive this operation's output as input)
        self.consumers = []

        # Append this operation to the list of consumers of all input nodes
        for input_node in input_nodes:
            input_node.consumers.append(self)

        # Append this operation to the list of operations in the currently active default graph
        _default_graph.operations.append(self)

    def compute(self):
        """Computes the output of this operation. Must be implemented by the particular operation."""
        pass

class Graph:
    """Represents a computational graph """

    def __init__(self):
        """Construct Graph"""
        self.operations = []
        self.placeholders = []
        self.variables = []

    def as_default(self):
        global _default_graph
        _default_graph = self

class placeholder:
    """Represents a placeholder node that has to be provided with a value when computing the output of a computational graph """

    def __init__(self):
        """Construct placeholder """
        self.consumers = []

        # Append this placeholder to the list of placeholders in the currently active default graph
        _default_graph.placeholders.append(self)

class Variable:
    """Represents a variable (i.e. an intrinsic, changeable parameter of a computational graph). """

    def __init__(self, initial_value=None):
        """Construct Variable Args: initial_value: The initial value of this variable """
        self.value = initial_value
        self.consumers = []

        # Append this variable to the list of variables in the currently active default graph
        _default_graph.variables.append(self)

class add(Operation):
    """Returns x + y element-wise. """

    def __init__(self, x, y):
        """Construct add Args: x: First summand node y: Second summand node """
        super().__init__([x, y])

    def compute(self, x_value, y_value):
        """Compute the output of the add operation Args: x_value: First summand value y_value: Second summand value """
        return x_value + y_value

class matmul(Operation):
    """Multiplies matrix a by matrix b, producing a * b. """

    def __init__(self, a, b):
        """Construct matmul Args: a: First matrix b: Second matrix """
        super().__init__([a, b])

    def compute(self, a_value, b_value):
        """Compute the output of the matmul operation Args: a_value: First matrix value b_value: Second matrix value """
        return a_value.dot(b_value)

class Session:
    """Represents a particular execution of a computational graph. """

    def run(self, operation, feed_dict={}):
        """Computes the output of an operation Args: operation: The operation whose output we'd like to compute. feed_dict: A dictionary that maps placeholders to values for this session """

        # Perform a post-order traversal of the graph to bring the nodes into the right order
        nodes_postorder = traverse_postorder(operation)

        # Iterate all nodes to determine their value
        for node in nodes_postorder:

            if type(node) == placeholder:
                # Set the node value to the placeholder value from feed_dict
                node.output = feed_dict[node]
            elif type(node) == Variable:
                # Set the node value to the variable's value attribute
                node.output = node.value
            else:  # Operation
                # Get the input values for this operation from node_values
                node.inputs = [input_node.output for input_node in node.input_nodes]

                # Compute the output of this operation
                node.output = node.compute(*node.inputs)

            # Convert lists to numpy arrays
            if type(node.output) == list:
                node.output = np.array(node.output)

        # Return the requested node value
        return operation.output


def traverse_postorder(operation):
    """Performs a post-order traversal, returning a list of nodes in the order in which they have to be computed Args: operation: The operation to start traversal at """

    nodes_postorder = []

    def recurse(node):
        if isinstance(node, Operation):
            for input_node in node.input_nodes:
                recurse(input_node)
        nodes_postorder.append(node)

    recurse(operation)
    return nodes_postorder


red_points = np.random.randn(50, 2) - 2*np.ones((50, 2))
blue_points = np.random.randn(50, 2) + 2*np.ones((50, 2))

class negative(Operation):
    """Evaluate negative numbers element by element."""

    def __init__(self, x):
        """Construct negative operation parameter list: x: input node"""
        super().__init__([x])

    def compute(self, x_value):
        """Calculate the negative operation output parameter list: x_value: input value"""
        return -x_value

class log(Operation):
    """Take the logarithm of each element."""

    def __init__(self, x):
        """Construct log argument list: x: input node"""
        super().__init__([x])

    def compute(self, x_value):
        """Calculate log operation output parameter list: x_value: input value"""
        return np.log(x_value)

class multiply(Operation):
    """For each element, return the value of x times y."""

    def __init__(self, x, y):
        """Construct the multiplication argument list: x: input node for the first multiplier y: input node for the second multiplier"""
        super().__init__([x, y])

    def compute(self, x_value, y_value):
        """Compute the output of the multiplication operation Args: x_value: the value of the first multiplier y_value: the value of the second multiplier"""
        return x_value * y_value

class reduce_sum(Operation):
    """Calculate the sum of the elements of a tensor in one or more dimensions."""

    def __init__(self, A, axis=None):
        """Construct reduce_sum parameter list: A: tensor to perform reduce operation Axis: dimensions that require reduce, if 'None' (the default), extend reduce for all dimensions."""
        super().__init__([A])
        self.axis = axis

    def compute(self, A_value):
        """Calculate the parameter list of the output values of the reduce_sum operation: A_value: tensor value of the input"""
        return np.sum(A_value, self.axis)

class softmax(Operation):
    ""Return the result of a's Softmax function.""

    def __init__(self, a):
        """Construct softmax parameter list: A: input node"""
        super().__init__([a])

    def compute(self, a_value):
        """Calculate the output values of the Softmax Operation parameter list: a_value: input value"""
        return np.exp(a_value) / np.sum(np.exp(a_value), axis=1)[:, None]

from queue import Queue

class GradientDescentOptimizer:
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def minimize(self, loss):
        learning_rate = self.learning_rate

        class MinimizationOperation(Operation):
            def compute(self):
                # Calculate gradient
                grad_table = compute_gradients(loss)

                # Iterate over all nodes
                for node in grad_table:
                    if type(node) == Variable:
                        # Retrieve the gradient for this node
                        grad = grad_table[node]

                        # Do one iteration in the direction of the negative gradient
                        node.value -= learning_rate * grad

        return MinimizationOperation()

The figure below depicts the iterative process of gradient descent: we start with a random line (label 1), take a step to find a slightly better line (label 2), and keep going one step at a time until we arrive at a good dividing line.


In the gradient descent implementation above, we used a function compute_gradients(loss) that computes the gradient of the output of the loss operation with respect to every other node n in the computational graph (that is, the direction in which the loss grows fastest; the negative of the gradient is the direction in which the loss decreases fastest).

Consider the following chain of computations, in which a node a feeds into a node b, which in turn feeds into the loss e (the node names here are illustrative; the original figure is not reproduced):

According to the chain rule of derivatives, we get

    ∂e/∂a = ∂e/∂b · ∂b/∂a

So if we want the gradient of e with respect to a, we can start at e and walk backwards through the graph towards a, computing for every node along the way the gradient of its output with respect to its input, and then multiply all of these gradients together.

Now consider a scenario in which a influences e along two different paths, say once via b and once via c. In that case, the total derivative of e with respect to a is the sum of the contributions of both paths:

    ∂e/∂a = ∂e/∂b · ∂b/∂a + ∂e/∂c · ∂c/∂a
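
To make the two-path case concrete, here is a small numerical check with made-up functions (illustrative only): take b = 2a, c = a**2, and e = b + c, so that ∂e/∂a = 2 + 2a.

a = 1.5
b = 2 * a          # first path:  a -> b -> e, with ∂b/∂a = 2
c = a ** 2         # second path: a -> c -> e, with ∂c/∂a = 2a
e = b + c          # ∂e/∂b = ∂e/∂c = 1

# Total derivative via the chain rule: sum of the contributions of both paths
analytic = 1 * 2 + 1 * 2 * a

# Finite-difference approximation of ∂e/∂a
eps = 1e-6
e_plus = 2 * (a + eps) + (a + eps) ** 2
numeric = (e_plus - e) / eps

print(analytic, numeric)   # both are approximately 5.0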


From this we can derive a general algorithm for computing the gradient of the loss with respect to any node: start at the node representing the loss and perform a breadth-first search backwards through the graph. Each time a node n is visited, we compute the gradient of the loss with respect to n's output by performing the following steps for every consumer c of node n (a consumer is a node c that takes the output of n as one of its inputs):

  • retrieve the gradient G of the loss with respect to the output of c
  • multiply G by the gradient of c's output with respect to n's output

Then we sum these gradients over all of n's consumers.

As a prerequisite for backpropagation, we need to specify, for each operation, a function that computes the gradients of the loss with respect to the operation's inputs, given the gradient of the loss with respect to the operation's output.

To do this, we define a decorator @RegisterGradient(operation_name):

# Maps operation types to their corresponding gradient computation functions
_gradient_registry = {}


class RegisterGradient:
    ""Decorator for registering gradient calculation function for op type""

    def __init__(self, op_type):
        """Create a decorator instance with op_type as the Operation type parameter list: op_type: the name of Operation"""
        self._op_type = eval(op_type)

    def __call__(self, f):
        """Register function 'f' as a gradient calculation function of 'op_type'"""
        _gradient_registry[self._op_type] = f
        return f

Now assume that the _gradient_registry dictionary is fully populated; we can then implement backpropagation:

from queue import Queue


def compute_gradients(loss):
    # grad_table[node] will contain the gradient of the loss w.r.t. the node's output
    grad_table = {}

    # The gradient of the loss with respect to the loss itself is 1
    grad_table[loss] = 1

    # Perform a breadth-first search, backwards from the loss
    visited = set()
    queue = Queue()
    visited.add(loss)
    queue.put(loss)

    while not queue.empty():
        node = queue.get()

        # If this node is not the loss node
        if node != loss:
            #
            # Compute the gradient of the loss with respect to this node's output
            #
            grad_table[node] = 0

            # Iterate over all of this node's consumers
            for consumer in node.consumers:

                # Retrieve the gradient of the loss w.r.t. the consumer's output
                lossgrad_wrt_consumer_output = grad_table[consumer]

                # Retrieve the function that computes gradients with respect to the
                # consumer's inputs, given the gradient w.r.t. the consumer's output
                consumer_op_type = consumer.__class__
                bprop = _gradient_registry[consumer_op_type]

                # Get the gradient of the loss with respect to all of the consumer's inputs
                lossgrads_wrt_consumer_inputs = bprop(consumer, lossgrad_wrt_consumer_output)

                if len(consumer.input_nodes) == 1:
                    # If the consumer has a single input node, lossgrads_wrt_consumer_inputs is a scalar
                    grad_table[node] += lossgrads_wrt_consumer_inputs

                else:
                    # Otherwise, lossgrads_wrt_consumer_inputs is an array of gradients,
                    # one for each input node

                    # Get the index of this node among the consumer's input nodes
                    node_index_in_consumer_inputs = consumer.input_nodes.index(node)

                    # Get the gradient of the loss with respect to this node
                    lossgrad_wrt_node = lossgrads_wrt_consumer_inputs[node_index_in_consumer_inputs]

                    # Add it to the total gradient with respect to this node
                    grad_table[node] += lossgrad_wrt_node

        #
        # Append each input node to the queue
        #
        if hasattr(node, "input_nodes"):
            for input_node in node.input_nodes:
                if not input_node in visited:
                    visited.add(input_node)
                    queue.put(input_node)

    # Return gradients for each visited node
    return grad_table


For each operation, we now define a function that turns the gradient of the loss with respect to the operation's output into the gradients of the loss with respect to each of the operation's inputs, stored in a list. Deriving these gradients for matrices is rather tedious, so we skip the derivations and simply state the results. Skipping this part will not affect your understanding of the whole.

If you want to know how such a result is derived, the general approach is:

  • Find the partial derivative of each output value with respect to each input value (this can be a tensor of order greater than 2, i.e. neither scalar, vector, nor matrix, and involves a lot of summation)
  • Using the chain rule, compute the gradient of the loss with respect to the node's inputs from the gradient of the loss with respect to the node's output. The result will be a tensor of the same shape as the input tensor: if the input is a matrix, the result is a matrix
  • Rewrite this result as a sequence of matrix operations for efficiency. This is the tricky step

The gradient of negative

Given the gradient G of the loss with respect to -x, the gradient of the loss with respect to x is -G.

@RegisterGradient("negative")
def _negative_gradient(op, grad):
    """List of gradient parameters for calculating 'negative' : op: we need to process 'negative' Operation grad: gradient return value for this' negative 'Operation: Gradient of input with respect to 'negative'""
    return -grad

The gradient of log

Given the gradient G of the loss with respect to log(x), the gradient of the loss with respect to x is G/x.

@RegisterGradient("log")
def _log_gradient(op, grad):
    ""Compute the gradient of log. Parameter list: op: to process 'log' Operation grad: Gradient of the output of 'log' Operationa Return value: gradient of the input of 'log'""
    x = op.inputs[0]
    return grad/x

The gradient of multiply

Given the gradient G of the loss with respect to the element-wise product A⊙B, the gradient of the loss with respect to A is G⊙B, and the gradient with respect to B is G⊙A.

@RegisterGradient("multiply")
def _multiply_gradient(op, grad):
    """Calculate the gradient of 'multiply'. Parameter list: op: The multiply 'Operation grad: About the gradient of the output of the multiply' operationa Return value: About the gradient of the input of the multiply '""

    A = op.inputs[0]
    B = op.inputs[1]

    return [grad * B, grad * A]

The gradient of matmul

Given the gradient G of the loss with respect to AB, the gradient of the loss with respect to A is G B^T (i.e. grad.dot(B.T)), and the gradient with respect to B is A^T G (i.e. A.T.dot(grad)).

@RegisterGradient("matmul")
def _matmul_gradient(op, grad):
    """Calculate the gradient of 'matmul'. Parameter list: op: to process 'matmul' Operation grad: Gradient of the output of 'matmul' Operationa Return value: gradient of the input of 'matmul'"""

    A = op.inputs[0]
    B = op.inputs[1]

    return [grad.dot(B.T), A.T.dot(grad)]
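
Since the matmul rule is the least obvious one here, a small numerical sanity check may help (illustrative only, not part of the original code). Because matrix multiplication is linear in A, the difference quotient is essentially exact:

import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
G = np.random.randn(3, 2)                 # upstream gradient w.r.t. A.dot(B)

# Analytic gradients, as returned by _matmul_gradient
grad_A = G.dot(B.T)
grad_B = A.T.dot(G)

# Finite-difference check of one entry of grad_A (grad_B can be checked analogously)
eps = 1e-6
A_pert = A.copy()
A_pert[1, 2] += eps
numeric = np.sum(G * (A_pert.dot(B) - A.dot(B))) / eps

print(np.isclose(grad_A[1, 2], numeric))  # True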

The gradient of add

Given the gradient G of the loss with respect to a + b, the gradient of the loss with respect to a is simply G, and the gradient with respect to b is also G, provided that a and b have the same shape.

If a and b have different shapes, for example a matrix a with 100 rows and a row vector b with a single row, we assume that b is added to every row of a. In this case the gradient computation is a little more involved; we won't go into the details here, but the code below handles it by summing the upstream gradient over the broadcast dimensions (a small numerical illustration follows the code).

@RegisterGradient("add")
def _add_gradient(op, grad):
    """Calculate the gradient parameter list corresponding to 'add' : op: to process 'add' Operation grad: Gradient on the output of 'add' Operationa Return value: gradient on the input of 'add'"""
    a = op.inputs[0]
    b = op.inputs[1]

    grad_wrt_a = grad
    while np.ndim(grad_wrt_a) > len(a.shape):
        grad_wrt_a = np.sum(grad_wrt_a, axis=0)
    for axis, size in enumerate(a.shape):
        if size == 1:
            grad_wrt_a = np.sum(grad_wrt_a, axis=axis, keepdims=True)

    grad_wrt_b = grad
    while np.ndim(grad_wrt_b) > len(b.shape):
        grad_wrt_b = np.sum(grad_wrt_b, axis=0)
    for axis, size in enumerate(b.shape):
        if size == 1:
            grad_wrt_b = np.sum(grad_wrt_b, axis=axis, keepdims=True)

    return [grad_wrt_a, grad_wrt_b]
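
As a small numerical illustration of the broadcasting case described above (not part of the original code): if b has shape (2,) and is added to every row of a matrix of shape (100, 2), the gradient of the loss with respect to b is the upstream gradient summed over the rows, which is exactly what the while-loop in _add_gradient computes:

import numpy as np

b = np.random.randn(2)
grad = np.random.randn(100, 2)            # upstream gradient w.r.t. a + b

grad_wrt_b = grad
# Collapse the extra leading dimension introduced by broadcasting
while np.ndim(grad_wrt_b) > len(b.shape):
    grad_wrt_b = np.sum(grad_wrt_b, axis=0)

print(grad_wrt_b.shape)                              # (2,)
print(np.allclose(grad_wrt_b, np.sum(grad, axis=0))) # True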

The gradient of reduce_sum

Given the gradient G of the loss with respect to the output of reduce_sum, the gradient of the loss with respect to the input A is obtained by repeating G along the reduced axis.

@RegisterGradient("reduce_sum")
def _reduce_sum_gradient(op, grad):
    """Calculate the gradient parameter list corresponding to 'reduce_sum' : op: to process the 'reduce_sum' Operation grad: The gradient return value of the output of 'reduce_sum' operationa: On the input gradient of 'reduce_sum'"""
    A = op.inputs[0]

    output_shape = np.array(A.shape)
    output_shape[op.axis] = 1
    tile_scaling = A.shape // output_shape
    grad = np.reshape(grad, output_shape)
    return np.tile(grad, tile_scaling)
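
A quick illustration of this rule (not part of the original code): if A has shape (3, 4) and we sum over axis 1, the upstream gradient has shape (3,), and each of its entries is copied across the corresponding row of A:

import numpy as np

A = np.arange(12.0).reshape(3, 4)
grad = np.array([10.0, 20.0, 30.0])       # upstream gradient w.r.t. np.sum(A, axis=1)

# The same computation as in _reduce_sum_gradient above, with axis = 1
output_shape = np.array(A.shape)          # [3, 4]
output_shape[1] = 1                       # the reduced axis collapses to 1 -> [3, 1]
tile_scaling = A.shape // output_shape    # [1, 4]
grad_wrt_A = np.tile(np.reshape(grad, output_shape), tile_scaling)

print(grad_wrt_A)
# [[10. 10. 10. 10.]
#  [20. 20. 20. 20.]
#  [30. 30. 30. 30.]]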

The gradient of softmax

@RegisterGradient("softmax")
def _softmax_gradient(op, grad):
    """Calculate the gradient parameter list corresponding to 'softmax' : op: to process 'softmax' Operation grad: Gradient of the output of 'softmax' Operationa Return value: gradient of the input of 'softmax'"""

    softmax = op.output
    return (grad - np.reshape(
        np.sum(grad * softmax, 1),
        [-1, 1]
    )) * softmax
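
The expression above is the vector-Jacobian product of the row-wise softmax: for each row, the gradient with respect to the input is s * (G - sum_j(G_j * s_j)), where s is the softmax output and G is the upstream gradient. Here is a small finite-difference check (illustrative only, not part of the original code):

import numpy as np

def row_softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=1)[:, None]

a = np.random.randn(3, 4)                 # a batch of 3 rows, 4 classes
s = row_softmax(a)
g = np.random.randn(3, 4)                 # some upstream gradient

# Analytic gradient, as computed by _softmax_gradient above
analytic = (g - np.reshape(np.sum(g * s, 1), [-1, 1])) * s

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric = np.zeros_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        a_plus, a_minus = a.copy(), a.copy()
        a_plus[i, j] += eps
        a_minus[i, j] -= eps
        numeric[i, j] = np.sum(g * (row_softmax(a_plus) - row_softmax(a_minus))) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True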


Now let’s try out the code above and see what the best weights we can find for our perceptron are.

# Create a new graph
Graph().as_default()

X = placeholder()
c = placeholder()

# Randomly initialize weights
W = Variable(np.random.randn(2, 2))
b = Variable(np.random.randn(2))

# Build a perceptron
p = softmax(add(matmul(X, W), b))

# Build cross entropy loss
J = negative(reduce_sum(reduce_sum(multiply(c, log(p)), axis=1)))

# Build the minimization operation
minimization_op = GradientDescentOptimizer(learning_rate=0.01).minimize(J)

# Build placeholder inputs
feed_dict = {
    X: np.concatenate((blue_points, red_points)),
    c:
        [[1, 0]] * len(blue_points)
        + [[0, 1]] * len(red_points)

}

# Create a session
session = Session()

# Perform 100 gradient descent iterations
for step in range(100):
    J_value = session.run(J, feed_dict)
    if step % 10 == 0:
        print("Step:", step, " Loss:", J_value)
    session.run(minimization_op, feed_dict)

# Print the final result
W_value = session.run(W)
print("Weight matrix:\n", W_value)
b_value = session.run(b)
print("Bias:\n", b_value)

Notice that we start out with a fairly high loss, which then drops quite quickly.

Finally, let’s draw the final dividing line:

import matplotlib.pyplot as plt

# Draw the dividing line
x_axis = np.linspace(-4, 4, 100)
y_axis = -W_value[0][0]/W_value[1][0] * x_axis - b_value[0]/W_value[1][0]
plt.plot(x_axis, y_axis)

# Draw the red/blue dots
plt.scatter(red_points[:,0], red_points[:,1], color='red')
plt.scatter(blue_points[:,0], blue_points[:,1], color='blue')
plt.show()

If you have any questions, please ask them in the comments.

