

Learning PyTorch with Examples — PyTorch Tutorials 1.9.1+cu102 documentation

We will use a problem of fitting y=sin(x) with a third order polynomial as our running example. The network will have four parameters, and will be trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.

In other words, we fit the curve of sin(x) with a third-order polynomial that has four parameters, and use gradient descent to find the optimal values of those parameters.

NumPy

Although this is a PyTorch tutorial, we first implement the example with NumPy to help you understand what is happening under the hood:

import numpy as np
import math

# Create the input and output data
x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)

# Randomly initialize the four weights
a = np.random.randn()
b = np.random.randn()
c = np.random.randn()
d = np.random.randn()

# Set your learning rate
learning_rate = 1e-6


for t in range(2000):
    # Forward pass: compute predicted y = a + b x + c x^2 + d x^3
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute the loss: sum((y_pred - y)^2)
    loss = np.square(y_pred - y).sum()
    # Print the loss every 100 iterations, just to monitor training progress
    if t % 100 == 99:
        print(t, loss)

    # Backpropagation: compute the gradients of a, b, c and d by taking the partial
    # derivative of the loss with respect to each parameter. The resulting formulas
    # are used below; see the derivation after the code if they are unclear.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d

print(f'Result: y = {a} + {b} x + {c} x^2 + {d} x^3')
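As an optional sanity check (a minimal sketch of my own, assuming matplotlib is installed and the NumPy block above has just been run), you can plot the learned polynomial against the true sin(x):

import matplotlib.pyplot as plt

# Uses x, y and the learned weights a, b, c, d from the block above
y_fit = a + b * x + c * x ** 2 + d * x ** 3
plt.plot(x, y, label="sin(x)")
plt.plot(x, y_fit, label="fitted 3rd-order polynomial")
plt.legend()
plt.show()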

Derivation of the gradients:


loss = \sum_{i=1}^{2000}(y_{pred}-y)^2 = \sum_{i=1}^{2000}(a+bx+cx^2+dx^3-y)^2

Take the partial derivatives of the expression above:


\frac{\partial loss}{\partial a} = \frac{\partial \sum_{i=1}^{2000}(a+bx+cx^2+dx^3-y)^2}{\partial a} = \frac{\partial \sum_{i=1}^{2000}(y_{pred}-y)^2}{\partial (y_{pred}-y)} \cdot \frac{\partial (a+bx+cx^2+dx^3-y)}{\partial a} = \sum_{i=1}^{2000}2(y_{pred}-y)


\frac{\partial loss}{\partial b} = \frac{\partial \sum_{i=1}^{2000}(a+bx+cx^2+dx^3-y)^2}{\partial b} = \frac{\partial \sum_{i=1}^{2000}(y_{pred}-y)^2}{\partial (y_{pred}-y)} \cdot \frac{\partial (a+bx+cx^2+dx^3-y)}{\partial b} = \sum_{i=1}^{2000}2x(y_{pred}-y)

In the same way:


\frac{\partial loss}{\partial c} = \sum_{i=1}^{2000}2x^2(y_{pred}-y)


\frac{\partial loss}{\partial d} = \sum_{i=1}^{2000}2x^3(y_{pred}-y)
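If you want to convince yourself that these formulas are correct, a quick finite-difference check is enough. The sketch below is my own addition (not part of the tutorial; the helper name loss_fn is just for illustration) and compares the analytic gradient of a with a numerical approximation:

import math
import numpy as np

x = np.linspace(-math.pi, math.pi, 2000)
y = np.sin(x)
a, b, c, d = np.random.randn(4)

def loss_fn(a, b, c, d):
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    return np.square(y_pred - y).sum()

# Analytic gradient with respect to a, from the derivation above
y_pred = a + b * x + c * x ** 2 + d * x ** 3
grad_a = (2.0 * (y_pred - y)).sum()

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric_grad_a = (loss_fn(a + eps, b, c, d) - loss_fn(a - eps, b, c, d)) / (2 * eps)
print(grad_a, numeric_grad_a)  # the two values should agree closely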

PyTorch

If this can all be written in NumPy, why use PyTorch at all?

Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients.

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won’t be enough for modern deep learning.

NumPy provides convenient objects for computing with n-dimensional arrays and is a general scientific computing framework. However, it knows nothing about computation graphs, deep learning, or gradients, and it cannot use GPUs to accelerate computation, so NumPy alone is not enough for modern deep learning. Hence PyTorch.
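As a small illustration of the GPU point (a sketch of my own, not from the tutorial), switching the computation to a GPU in PyTorch only requires choosing a device and creating the tensors on it:

import math
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.linspace(-math.pi, math.pi, 2000, device=device)  # created directly on the device
y = torch.sin(x)

# An existing CPU tensor can also be moved with .to(device)
w = torch.randn(()).to(device)
print(x.device, w.device)

With that in mind, here is the same fitting example written with PyTorch tensors: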

import torch
import math


dtype = torch.float
device = torch.device("cpu")
# Uncomment the line below to run on the GPU
# device = torch.device("cuda:0") 

# Create the input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Random initialization parameters
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Calculate and output loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backpropagation
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d


print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

autograd

PyTorch: Tensors and autograd (automatic differentiation)

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it’s pretty simple to use in practice. Each Tensor represents a node in a computational graph. If x is a Tensor that has x.requires_grad=True, then x.grad is another Tensor holding the gradient of x with respect to some scalar value.

In addition to the advantages over NumPy described above, we no longer need to write the backpropagation step by hand, because in PyTorch the autograd package computes it automatically. When we use autograd, the forward pass defines a computational graph whose nodes are tensors and whose edges are functions that produce output tensors from input tensors. Backpropagating through this graph then gives us the gradients easily.

Although it sounds complicated, it is simple to use. Each tensor represents a node in a computational graph. If x is a tensor with x.requires_grad=True, then x.grad is another tensor holding the gradient of x with respect to some scalar value.
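A minimal illustration of this behaviour (my own sketch, not part of the tutorial):

import torch

x = torch.tensor(2.0, requires_grad=True)
z = x ** 3      # the forward pass builds the graph z = x^3
z.backward()    # backpropagate from the scalar z
print(x.grad)   # dz/dx = 3 * x^2 = tensor(12.)

With that in mind, here is the full fitting example using autograd: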

import torch
import math

dtype = torch.float
device = torch.device("cpu")
# Uncomment the line below to run on the GPU
# device = torch.device("cuda:0")  

# Create Tensors to hold input and outputs.
# By default requires_grad=False, which indicates that we do not need to
# compute gradients for these tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# requires_grad=True indicates that we want to compute gradients with respect
# to these tensors during the backward pass.
a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
b = torch.randn((), device=device, dtype=dtype, requires_grad=True)
c = torch.randn((), device=device, dtype=dtype, requires_grad=True)
d = torch.randn((), device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(2000):
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Now loss is a Tensor of shape (1,)
    # loss.item() gets the scalar value held in the loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call computes the gradients
    # of the loss with respect to all tensors with requires_grad=True.
    # Their values are then stored in the corresponding tensors' .grad attributes.
    loss.backward()

    # Manually update weights
    # Wrap the update in torch.no_grad() because the weights have requires_grad=True,
    # but we don't want autograd to track these update operations.
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually zero the gradients after updating the weights.
        # The gradients from the previous iteration must be cleared before the next
        # backward() call; otherwise the gradient values keep accumulating.
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')
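To see why the gradients have to be cleared, here is a small sketch (my own addition) showing that .grad accumulates across backward() calls:

import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 3).backward()
print(w.grad)        # tensor(3.)

(w * 3).backward()   # without clearing, the new gradient is added to the old one
print(w.grad)        # tensor(6.)

w.grad = None        # clearing resets the accumulation
(w * 3).backward()
print(w.grad)        # tensor(3.)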

Define a new autograd function

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.

In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions. We can then use our new autograd operator by constructing an instance and calling it like a function, passing Tensors containing input data.

Under the hood, each primitive autograd operator really consists of two functions that operate on tensors:

  • Forward propagation: Computes the output tensor from the input tensor
  • Back propagation: Receives the gradient of the output tensor with respect to a scalar and computes the gradient of the input tensor with respect to the same scalar

In PyTorch we can implement our own forward and backward passes by subclassing torch.autograd.Function. We can then use the new autograd operator by constructing an instance and calling it like a function, passing tensors containing the input data.

Our previous prediction model was y = a + bx + cx^2 + dx^3. We now change the model to y = a + b P_3(c + dx), where P_3(x) = \frac{1}{2}(5x^3 - 3x) is the Legendre polynomial of degree three. Let's implement a custom autograd Function for this new model:

# -*- coding: utf-8 -*-
import torch
import math


class LegendrePolynomial3(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a tensor containing the input and return a
        tensor containing the output. ctx is a context object that can be used to
        stash information for the backward pass. You can cache arbitrary objects
        for use in the backward pass with ctx.save_for_backward.
        """
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input. By the chain rule this is grad_output * P3'(input),
        where P3'(x) = 1.5 * (5x^2 - 1).
        """
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)


dtype = torch.float
device = torch.device("cpu")

# declare tensors that hold inputs and outputs
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)


# Randomly initialize weights
# y = a + b * P3(c + d * x), we need four weights abcd
# For this example the weights need to be initialized reasonably close to the
# correct result to ensure convergence.
# (The author's question: how do we know the correct answer in advance?)
a = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
b = torch.full((), -1.0, device=device, dtype=dtype, requires_grad=True)
c = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
d = torch.full((), 0.3, device=device, dtype=dtype, requires_grad=True)

learning_rate = 5e-6
for t in range(2000):
    # To apply our Function, we use the Function.apply method and alias it as P3
    P3 = LegendrePolynomial3.apply

    y_pred = a + b * P3(c + d * x)

    # Calculate and output loss
    loss = (y_pred - y).pow(2).sum()
    if t % 500 == 0:
        print(t, loss.item())

    # Backpropagation
    loss.backward()

    # Update weight
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Zero the gradients after each update; otherwise the gradient values
        # from previous iterations keep accumulating.
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} * P3({c.item()} + {d.item()} x)')
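Since we wrote the backward pass by hand, it is worth checking it. PyTorch's torch.autograd.gradcheck compares the analytic gradients against numerically computed ones; the sketch below is my own addition (assuming the LegendrePolynomial3 class defined above is in scope). gradcheck expects double-precision inputs with requires_grad=True:

import torch

test_input = torch.randn(20, dtype=torch.double, requires_grad=True)
# Returns True if the hand-written backward matches the numerical gradients
print(torch.autograd.gradcheck(LegendrePolynomial3.apply, (test_input,)))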

Neural network module