In early 2019, ApacheCN organized volunteers to translate the PyTorch 1.0 documentation into Chinese and received official authorization from PyTorch, so I am sure many people have already seen it on the Chinese official website. So far, however, we are short of proofreaders, and we hope more people will participate. We have been in email contact with PyTorch's Bruce Lin for a while; at an appropriate time we will organize volunteers to work on other PyTorch projects. Please join us and follow our work. We hope this series is helpful to you.

Translator: bat67

Proofread by FontTian

In PyTorch, the core of all neural networks is the autograd package. Let’s take a quick look at this package first, and then train our first neural network.

The autograd package provides automatic differentiation for all operations on tensors. It is a define-by-run framework, which means that backpropagation is determined by how your code runs, and can be different at every iteration.

Let’s look at some simple examples.

Tensor

torch.Tensor is the core class of this package. If you set its attribute .requires_grad to True, it tracks all operations on that tensor. When the computation is finished, you can call .backward() to have all the gradients computed automatically. The gradients for this tensor are accumulated into its .grad attribute.
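
To illustrate that last point, here is a minimal sketch (not part of the tutorial code) showing that gradients from repeated .backward() calls are accumulated in .grad, which is why training loops normally zero the gradients between iterations:

import torch

w = torch.ones(2, requires_grad=True)

(w * 3).sum().backward()
print(w.grad)        # tensor([3., 3.])

(w * 3).sum().backward()
print(w.grad)        # tensor([6., 6.]) -- accumulated, not replaced

w.grad.zero_()       # reset before the next iteration
print(w.grad)        # tensor([0., 0.])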

To stop a tensor from tracking history, you can call .detach() to detach it from the computation history and prevent future computation on it from being tracked.

To prevent tracking history (and using memory), you can also wrap the code block in with torch.no_grad():. This is especially useful when evaluating a model, because the model may have trainable parameters with requires_grad=True, but we don’t need to compute gradients for them during evaluation.
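
As a quick illustration of the last two paragraphs, here is a minimal sketch (not part of the tutorial code) contrasting .detach() and with torch.no_grad(); both stop gradient tracking, just at different granularities:

import torch

x = torch.ones(3, requires_grad=True)

y = (x * 2).detach()          # same values as x * 2, but detached from the graph
print(y.requires_grad)        # False

with torch.no_grad():         # no operation inside this block is tracked
    z = x * 2
print(z.requires_grad)        # False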

There is another class that is very important for the implementation of autograd: Function.

Tensor and Function are interconnected and build up an acyclic graph that encodes the complete history of computation. Every tensor has a .grad_fn attribute that references the Function that created the tensor (except for tensors created directly by the user, whose grad_fn is None).

If you need to compute derivatives, you can call .backward() on a Tensor. If the tensor is a scalar (that is, it holds a single element of data), you don’t need to pass any arguments to backward(); if it has more elements, however, you need to specify a gradient argument, which is a tensor of matching shape.
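
The shape-matching requirement is easy to check empirically. The following is a minimal sketch (not part of the tutorial code) showing that calling backward() on a non-scalar tensor without a gradient argument raises an error, while passing a shape-matching tensor works:

import torch

x = torch.ones(3, requires_grad=True)
y = x * 2                        # y is a vector, not a scalar

try:
    y.backward()                 # no gradient argument -> RuntimeError
except RuntimeError as e:
    print("backward() on a non-scalar needs a gradient argument:", e)

y.backward(torch.ones_like(y))   # works: gradient argument matches y's shape
print(x.grad)                    # tensor([2., 2., 2.])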

import torch

Create a tensor and set requires_grad=True to track computation on it:

x = torch.ones(2, 2, requires_grad=True)
print(x)

Output:

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)

Let’s do an operation on this tensor:

y = x + 2
print(y)

Output:

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)

y was created as the result of an operation, so it has a grad_fn attribute.

print(y.grad_fn)

Output:

<AddBackward0 object at 0x7f1b248453c8>

More operations on y

z = y * y * 3
out = z.mean()

print(z, out)

Output:

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)

.requires_grad_(…) changes an existing tensor’s requires_grad flag in place. The flag defaults to False if not given.

a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

Output:

False
True
<SumBackward0 object at 0x7f1b24845f98>

Gradients

Now let’s backpropagate. Because out is a scalar, out.backward() is equivalent to out.backward(torch.tensor(1.)).

out.backward()

Print the gradients d(out)/dx:

print(x.grad)

Output:

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])

What we have is a matrix with all values of 4.5.

Let’s call the out tensor $o$.

We have $o = \frac{1}{4}\sum_i z_i$, $z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$. Therefore $\frac{\partial o}{\partial x_i} = \frac{3}{2}(x_i+2)$, and so $\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
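
The derivation can also be checked numerically. Below is a minimal sketch (not part of the tutorial code) that rebuilds out from scratch and compares autograd’s result against the hand-derived gradient 1.5 * (x + 2):

import torch

x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()

out.backward(torch.tensor(1.))        # same as out.backward() for a scalar output
analytic = 1.5 * (x.detach() + 2)     # hand-derived d(out)/dx

print(torch.allclose(x.grad, analytic))  # expected: True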

Mathematically, if there is a vector-valued function $\vec{y}=f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix:

$$J=\left(\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\ \vdots & \ddots & \vdots\\ \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right)$$

Generally speaking, torch.autograd is an engine for computing vector-Jacobian products. That is, given any vector $v=\left(\begin{array}{cccc} v_{1} & v_{2} & \cdots & v_{m}\end{array}\right)^{T}$, it computes the product $v^{T}\cdot J$. If $v$ happens to be the gradient of a scalar function $l=g\left(\vec{y}\right)$, that is, $v=\left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$, then by the chain rule the vector-Jacobian product is the gradient of $l$ with respect to $\vec{x}$:

$$J^{T}\cdot v=\left(\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\ \vdots & \ddots & \vdots\\ \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right)\left(\begin{array}{c} \frac{\partial l}{\partial y_{1}}\\ \vdots\\ \frac{\partial l}{\partial y_{m}} \end{array}\right)=\left(\begin{array}{c} \frac{\partial l}{\partial x_{1}}\\ \vdots\\ \frac{\partial l}{\partial x_{n}} \end{array}\right)$$

(Note that $v^{T}\cdot J$ gives a row vector, which can equivalently be treated as a column vector by taking $J^{T}\cdot v$.)

This property of the vector-Jacobian product makes it very convenient to feed external gradients into a model that has a non-scalar output.

Now let’s look at an example of a vector-Jacobian product:

x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

Output:

tensor([-278.6740,  935.4016,  439.6572], grad_fn=<MulBackward0>)

In this case, y is no longer a scalar. torch.autograd cannot compute the full Jacobian directly, but if we only want the vector-Jacobian product, we simply pass the vector to backward() as an argument:

v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

print(x.grad)

Output:

tensor([4.0960e+02, 4.0960e+03, 4.0960e-01])
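
Because y here is just x scaled elementwise, the Jacobian is diagonal and the vector-Jacobian product can be checked by hand. Below is a minimal sketch (not part of the tutorial code) that fixes the number of doublings at k, so that J = diag(2**k) and x.grad should equal v * 2**k:

import torch

x = torch.randn(3, requires_grad=True)
k = 10
y = x * (2 ** k)                 # elementwise scaling, so J = diag(2**k)

v = torch.tensor([0.1, 1.0, 0.0001])
y.backward(v)                    # computes J^T v

print(torch.allclose(x.grad, v * (2 ** k)))  # expected: True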

You can also prevent autograd from tracking the history of tensors with .requires_grad=True by wrapping the code block in with torch.no_grad():.

print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

Output:

True
True
False

Follow-up reading:

Autograd and Function are documented at pytorch.org/docs/autogr…