0x00 Abstract
This article walks through two examples taken from the official PyTorch documentation. Rather than translating those documents sentence by sentence, it picks out the key points and adds my own understanding.
We learned the basic concepts of automatic differentiation in the previous two articles; from here we continue by analyzing how PyTorch implements it. Because the material is large and complicated, the plan is to use 2 to 3 articles to cover how forward propagation is implemented and 3 to 4 articles to cover how backward propagation is implemented.
The first two articles in the series are linked as follows:
Automatic Differentiation of Deep Learning Tools (1)
Automatic Differentiation of Deep Learning Tools (2)
0x01 Overview
When training neural networks, the most commonly used algorithm is backpropagation. In this algorithm, the parameters (model weights) are adjusted according to the gradient of the loss function with respect to each parameter. To compute these gradients, PyTorch has a built-in reverse-mode automatic differentiation engine called torch.autograd. It supports automatic gradient computation for any computational graph.
1.1 How autograd encodes the history
Conceptually, autograd records a computational graph. When a tensor is created with requires_grad set to True, PyTorch knows that it needs to automatically differentiate operations on that tensor. PyTorch then records the history of every operation applied to it, producing a conceptual directed acyclic graph whose leaves are the input tensors of the model and whose roots are the output tensors of the model. The user does not need to encode all possible execution paths of the graph in advance: what you run is what gets differentiated. By tracing this graph from the roots to the leaves, the gradients can be computed automatically using the chain rule.
Internally, autograd represents this graph as a graph of Function (also called Node) objects (real expressions), which can be evaluated with their apply() method.
1.2 Application
During forward propagation, autograd does the following:
- Run the requested operation to compute the result tensor.
- Build the DAG used to compute gradients, maintaining a record of every operation performed (including the operation's gradient function and the resulting new tensor). The exact way to compute each tensor's gradient is stored in the grad_fn property of the result tensor.
When forward propagation is complete, backward propagation is started by calling .backward() on the root of the DAG, and autograd does the following:
- Compute the gradient from each grad_fn, building the backward computational graph that holds the gradient-calculation methods.
- Accumulate the gradients in each tensor's .grad attribute, applying the chain rule all the way back to the leaf tensors.
- Recreate the computational graph on every iteration, which is what allows us to use Python code to change the shape and size of the graph between iterations.
Note that the DAG in PyTorch is dynamic: after every .backward() call, autograd starts populating a new computational graph, recreated from scratch. This is exactly what allows us to use Python control flow to change the shape and size of the graph at every iteration.
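As a minimal sketch of this dynamic behavior (my own example, not from the documentation), ordinary Python control flow can change the graph from one iteration to the next, and gradients accumulate in .grad across backward calls:

import torch

x = torch.tensor(3., requires_grad=True)
for step in range(2):
    # A fresh graph is recorded on every forward pass, so its structure
    # can differ between iterations.
    y = x * 2 if step == 0 else x ** 2
    y.backward()
    # Gradients accumulate in x.grad unless cleared (e.g. x.grad = None):
    # the first pass prints tensor(2.), the second prints tensor(8.) because 2 + 6 accumulates.
    print(x.grad)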
0x02 Examples
Let’s take a look at two examples, both of which are from the official PyTorch documentation.
2.2 Example Interpretation 1
We start from pytorch.org/tutorials/b… to demonstrate and explain.
2.2.1 Code
Example code is as follows:
import torch

# Part 1: compute gradients with backward()
a = torch.tensor(2., requires_grad=True)
b = torch.tensor(6., requires_grad=True)
O = 3*a**3
P = b**2
Q = O - P
external_grad = torch.tensor(1.)
Q.backward(gradient=external_grad)
print(a.grad)
print(b.grad)

print("=========== grad")

# Part 2: compute the same gradients with torch.autograd.grad()
a = torch.tensor(2., requires_grad=True)
b = torch.tensor(6., requires_grad=True)
Q = 3*a**3 - b**2
grads = torch.autograd.grad(Q, [a, b])
print(grads[0])
print(grads[1])

# Inspect the autograd graph (O and P still refer to the tensors from part 1)
print(Q.grad_fn.next_functions)
print(O.grad_fn.next_functions)
print(P.grad_fn.next_functions)
print(a.grad_fn)
print(b.grad_fn)
The output is:
tensor(36.)
tensor(-12.)
=========== grad
tensor(36.)
tensor(-12.)
((<MulBackward0 object at 0x000001374DE6C308>, 0), (<PowBackward0 object at 0x000001374DE6C288>, 0))
((<PowBackward0 object at 0x000001374DE6C288>, 0), (None, 0))
((<AccumulateGrad object at 0x000001374DE6C6C8>, 0),)
None
None
Q is computed as Q = 3a³ − b².
2.2.2 Analysis
The dynamic graph is built during forward propagation. In forward propagation, Q is the final output; in backward propagation, Q is the starting input of the computation, i.e. the root of the backward graph.
In the example, the tensors are: a is 2, b is 6, and Q is tensor(-12., grad_fn=<SubBackward0>).
The corresponding partial derivatives are ∂Q/∂a = 9a² = 36 and ∂Q/∂b = −2b = −12, which match the gradients printed above.
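We can check these derivatives directly against autograd's output (a small sketch in the spirit of the tutorial):

import torch

a = torch.tensor(2., requires_grad=True)
b = torch.tensor(6., requires_grad=True)
Q = 3*a**3 - b**2
Q.backward()

# dQ/da = 9a^2 = 36, dQ/db = -2b = -12
print(9*a**2 == a.grad)   # tensor(True)
print(-2*b == b.grad)     # tensor(True)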
2.3 Example Interpretation 2
This time we use pytorch.org/tutorials/b… as the example.
2.3.1 Sample code
Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following way:
import torch
x = torch.ones(5) # input tensor
y = torch.zeros(3) # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
2.3.2 Tensors, functions, and computational graphs
The code above defines the following computational graph:
Credit: pytorch.org/tutorials/_…
In this network, w and b are the parameters we need to optimize. Therefore, we need to compute the gradients of the loss function with respect to these variables. To do this, we set the requires_grad property on these tensors.
Note that you can set requires_grad when you create the tensor, or later by calling x.requires_grad_(True).
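For example (a trivial sketch), the in-place method flips the flag on an existing tensor:

import torch

w = torch.randn(5, 3)     # created without gradient tracking
print(w.requires_grad)    # False
w.requires_grad_(True)    # enable tracking afterwards, in place
print(w.requires_grad)    # True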
The function we apply to tensors to build the computational graph is in fact an object of the Function class. This object knows how to compute the function in the forward direction and how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the grad_fn property of the tensor.
print('Gradient function for z =', z.grad_fn)
print('Gradient function for loss =', loss.grad_fn)
The output is as follows:
Gradient function for z = <AddBackward0 object at 0x7f4dbd4d3080>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward object at 0x7f4dbd4d3080>
2.3.3 Calculate the gradient
loss.backward()
print(w.grad)
print(b.grad)
The output is:
tensor([[0.1881, 0.1876, 0.0229],
        [0.1881, 0.1876, 0.0229],
        [0.1881, 0.1876, 0.0229],
        [0.1881, 0.1876, 0.0229],
        [0.1881, 0.1876, 0.0229]])
tensor([0.1881, 0.1876, 0.0229])
Note:
- We can only obtain the grad properties of the leaf nodes of the computational graph whose requires_grad property is set to True. Gradients are not available for any other node in the graph.
- For performance reasons, we can only perform gradient calculation with backward once on a given computational graph. If we need to call backward several times on the same graph, we need to pass retain_graph=True to the backward call.
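As a minimal sketch of the second point (my own example), calling backward a second time only works if the first call kept the graph alive:

import torch

x = torch.tensor(2., requires_grad=True)
y = x ** 2

y.backward(retain_graph=True)   # keep the graph so we can backpropagate again
print(x.grad)                   # tensor(4.)

y.backward()                    # allowed because the graph was retained above
print(x.grad)                   # tensor(8.) -- gradients accumulate in x.grad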
2.3.4 Disable gradient tracking
By default, all tensors with requires_grad=True track their computation history and support gradient calculation. However, there are cases where we do not need this. For example, if we have already trained the model and only want to apply it to some input data, i.e. we only want to do forward computations through the network, we can stop tracking by surrounding our computation code with a torch.no_grad() block:
z = torch.matmul(x, w)+b
print(z.requires_grad)
with torch.no_grad():
z = torch.matmul(x, w)+b
print(z.requires_grad)
Output:
True
False
Another way to achieve the same result is to use the detach() method on tensors:
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)
The output
False
There are reasons why you might want to disable gradient tracking:
- Some parameters in the neural network are marked as frozen parameters. This is a very common scenario when fine-tuning a pretrained network (see the sketch after this list).
- To speed up computation when only forward propagation is performed, because computations on tensors that do not track gradients are more efficient.
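A small sketch of the frozen-parameter case (the model here is made up for illustration): setting requires_grad to False on a layer's parameters keeps them out of gradient computation:

import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))

# Freeze the first layer; only the last layer will receive gradients.
for param in model[0].parameters():
    param.requires_grad = False

out = model(torch.randn(4, 10))
out.sum().backward()
print(model[0].weight.grad)              # None -- frozen, no gradient computed
print(model[2].weight.grad is not None)  # True -- still being fine-tuned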
0x03 Logical relationship
If we look at forward computation from the perspective of the computational graph, it consists of building the graph and executing the graph. Building the graph describes the relationships between the node operations; executing the graph runs those operation relationships, which is the process of tensors propagating forward through the computational graph.
Forward computation relies on some base classes, and before looking at forward propagation in detail, we need to look at the logical relationships between these base classes. Analyzing the PyTorch system from a DAG perspective, the logic is as follows.
- Graph: represents the computing task. PyTorch treats computation as a directed acyclic graph, i.e. a computational graph, but this is a conceptual graph: there is no actual graph data structure in the code.
- The graph is made up of nodes and edges.
- Node: represents an operation.
  - A node receives zero or more Tensors through its incoming edges, performs its computation, and produces zero or more Tensors.
  - A node's member variable next_functions is a list of tuples describing which other functions this node outputs to. The length of the list is the number of edges of this grad_fn, and each tuple corresponds to one Edge, containing (edge.function, edge.input_nr). (A small code sketch that walks this structure follows the figure below.)
- Edge: represents the flow relationship between operations.
  - Edge.function: the function this Edge outputs to.
  - Edge.input_nr: which input of that Function this Edge feeds into.
- Tensor: represents the data, i.e. what flows between the nodes; without it the graph is meaningless.
See the following figure for details:
+---------------------+ +----------------------+
| SubBackward0 | | PowBackward0 |
| | Edge | | Edge
| next_functions +-----+--------> | next_functions +----------> ...
| | | | |
+---------------------+ | +----------------------+
|
|
| +----------------------+
| Edge | MulBackward0 |
+--------> | | Edge
| next_functions +----------> ...
| |
+----------------------+
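To make the structure above concrete, here is a small sketch (my own code, not from the original article) that walks next_functions recursively from the root grad_fn of the example in section 2.2; each tuple it visits is an (edge.function, edge.input_nr) pair:

import torch

a = torch.tensor(2., requires_grad=True)
b = torch.tensor(6., requires_grad=True)
Q = 3*a**3 - b**2

def walk(fn, depth=0):
    # Print this node, then follow each outgoing edge to the next node.
    print("  " * depth + type(fn).__name__)
    for next_fn, input_nr in fn.next_functions:
        if next_fn is not None:        # None marks an input that needs no gradient
            walk(next_fn, depth + 1)

walk(Q.grad_fn)
# SubBackward0
#   MulBackward0
#     PowBackward0
#       AccumulateGrad
#   PowBackward0
#     AccumulateGrad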
With that out of the way, the next article will cover some of the basic classes related to the PyTorch differential engine.
0xEE Personal information
★★★★★ Thoughts on life and technology ★★★★★
Wechat official account: Rosie’s Thoughts
0xFF References
Github.com/KeithYin/re…
Pytorch Learning Note (13) : Low-level implementation parsing of backward processes
Initialization of PyTorch
Pytorch’s automatic derivative mechanism – the establishment of computational graph
How autograd encodes the history
Pytorch.org/tutorials/b…
Pytorch Note (Calculation diagram + Autograd)-Node(1)
Explain the network construction in Pytorch
PyTorch’s optimizer
Distribution of PyTorch
PyTorch’s Tensor
PyTorch’s Tensor
PyTorch’s Tensor
PyTorch dynamic diagram (part 2)
PyTorch dynamic diagram (part 1)
Calculation diagram — Explain teacher Li Hongyi’s PPT with Pytorch
How to use PyTorch to find gradients automatically
PyTorch Automatic Derivative (Autograd) principle analysis
Pytorch Automatic Derivation of Autograd
PyTorch’s core developers take the inner workings of the game personally
PyTorch Automatic differential fundamentals
Towardsdatascience.com/pytorch-aut…