0x00 Summary

This series of roughly ten articles looks at how PyTorch's auto-differentiation capability is implemented. This article is the second one on forward propagation, and it introduces the remaining PyTorch base classes involved in automatic differentiation (gradient calculation). Because the text is long (about 12,000 words), it is split into two parts.

The previous chapters in the series are linked as follows:

Automatic Differentiation of Deep Learning Tools (1)

Automatic Differentiation of Deep Learning Tools (2)

Automatic differentiation of Deep Learning Tools (3) — Interpretation of examples

PyTorch implements forward propagation (1) — Base class (1)

0x01 Previous review

In the previous article we covered some of the basic classes such as Variable, Function, and Tensor; now we continue with the remaining basic classes. SubBackward0, PowBackward0, and MulBackward0 are all derived classes of Node. In this article we will continue to refine the following diagram.

+---------------------+              +----------------------+
| SubBackward0        |              | PowBackward0         |
|                     |      Edge    |                      |  Edge
|   next_functions  +-----+--------> |     next_functions +----------> ...
|                     |   |          |                      |
+---------------------+   |          +----------------------+
                          |
                          |
                          |          +----------------------+
                          |  Edge    | MulBackward0         |
                          +--------> |                      |  Edge
                                     |     next_functions +----------> ...
                                     |                      |
                                     +----------------------+
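
For reference, a backward graph of this shape is produced by a one-line forward expression. The following minimal Python sketch uses an assumed expression (z = x ** 2 - x * y, not necessarily the exact example from the earlier articles) and inspects the resulting structure:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

z = x ** 2 - x * y           # sub(pow(x, 2), mul(x, y))

print(z.grad_fn)             # <SubBackward0 object at ...>
print(z.grad_fn.next_functions)
# ((<PowBackward0 object at ...>, 0), (<MulBackward0 object at ...>, 0))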

0x02 TensorImpl

2.1 Overview

PyTorch makes a lot of use of the Bridge design pattern, and at::Tensor uses the Bridge pattern to hand over the implementation to TensorImpl.

class TORCH_API Tensor {
 private:
  struct unsafe_borrow_t {
    explicit unsafe_borrow_t() = default;
  };

  explicit Tensor(unsafe_borrow_t, const Tensor& rhs)
      : impl_(c10::intrusive_ptr<at::TensorImpl, UndefinedTensorImpl>::reclaim(rhs.impl_.get())) {}

  friend MaybeOwnedTraits<Tensor>;

 protected:
  friend class ::caffe2::Tensor;

  void enforce_invariants();
  c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl> impl_;
};

Details are as follows:

+------------------------------------------------+          +---------------------------+
|Tensor                                          |          |TensorImpl                 |
|                                                |          |                           |
|                                                |  bridge  |                           |
|   <TensorImpl, UndefinedTensorImpl> impl_ +--------------> |   autograd_meta_         |
|                                                |          |                           |
|                                                |          |   named_tensor_meta_      |
+------------------------------------------------+          |                           |
                                                            |   pyobj_                  |
                                                            |                           |
                                                            |   sizes_and_strides_      |
                                                            |                           |
                                                            |   storage_offset_         |
                                                            |                           |
                                                            |   data_type_              |
                                                            |                           |
                                                            |   device_opt_             |
                                                            |                           |
                                                            +---------------------------+

2.2 Definition

TensorImpl is defined as follows. Since this article is about automatic differentiation and forward propagation, we focus on the member variable related to that topic, which is autograd_meta_. The remaining fields are metadata describing the tensor: its size, its strides, the type of the elements it contains, the device it lives on, and so on.

struct C10_API TensorImpl : public c10::intrusive_ptr_target {
  Storage storage_;

 private:
  // This pointer points to an AutogradMeta struct that stores autograd-specific
  // fields (such as grad_ / grad_fn_ / grad_accumulator_). This pointer always
  // has unique ownership (meaning only one TensorImpl can own it at a time).
  //
  // autograd_meta_ can be nullptr, as an optimization. When this occurs, it is
  // equivalent to having an autograd_meta_ pointing to a default constructed
  // AutogradMeta; intuitively, tensors which don't require grad will have this
  // field set to null.
  //
  // This means accessors on autograd_meta_ have to be careful to test if they
  // got a nullptr, and handle default behavior appropriately in that case.
  //
  // Note that we don't enforce the invariant that if the AutogradMeta is
  // default constructed, it is nullptr (to do this, we'd have to continuously
  // check if an AutogradMeta became, by mutation, equal to the default
  // constructed form. (This might be useful, but it seems rare enough that
  // a requires_grad=True variable will turn back into the requires_grad=False
  // version.) So there are three representable states:
  //
  //    1. autograd_meta_ == nullptr
  //    2. autograd_meta_ is default constructed (semantically, same as (1))
  //    3. autograd_meta_ has nontrivial information content
  //
  std::unique_ptr<c10::AutogradMetaInterface> autograd_meta_ = nullptr; // the focus of this article

 protected:
  std::unique_ptr<c10::NamedTensorMetaInterface> named_tensor_meta_ = nullptr;
  c10::VariableVersion version_counter_;
  PyObject* pyobj_ = nullptr;
  c10::impl::SizesAndStrides sizes_and_strides_;
  int64_t storage_offset_ = 0;
  int64_t numel_ = 1;
  caffe2::TypeMeta data_type_;
  c10::optional<c10::Device> device_opt_;
  bool is_contiguous_ : 1;
  /* HasContiguityPolicy */ uint8_t has_contiguity_ : 2;
  bool storage_access_should_throw_ = false;
  bool is_channels_last_ : 1;
  bool is_channels_last_contiguous_ : 1;
  bool is_channels_last_3d_ : 1;
  bool is_channels_last_3d_contiguous_ : 1;
  bool is_non_overlapping_and_dense_ : 1;
  bool is_wrapped_number_ : 1;
  bool allow_tensor_metadata_change_ : 1;
  bool reserved_ : 1;
  DispatchKeySet key_set_;
};

For automatic differentiation, std::unique_ptr<c10::AutogradMetaInterface> autograd_meta_ = nullptr; is the key member.

This member variable stores autograd-related information such as grad_, grad_fn_ and grad_accumulator_. A TensorImpl owns at most one AutogradMeta at a time.

autograd_meta_ is what distinguishes a Variable with autograd capability from an ordinary tensor:

  • For tensors that do not require a gradient, autograd_meta_ is null.
  • As an optimization, a default-constructed AutogradMeta is treated as equivalent to a null autograd_meta_, so any code that accesses autograd_meta_ must check for nullptr and fall back to the default behavior.
  • For tensors that do require a gradient, autograd_meta_ is generally initialized to an AutogradMeta or a DifferentiableViewMeta (the Python sketch after this list shows the observable effect).
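
These states are not directly visible from Python, but their effects are. A minimal sketch:

import torch

a = torch.randn(3)                       # does not require grad: autograd_meta_ stays null
print(a.requires_grad, a.grad_fn)        # False None

b = torch.randn(3, requires_grad=True)   # requires grad: autograd metadata is materialized
print(b.requires_grad, b.grad_fn)        # True None (a leaf has no grad_fn)

c = b * 2                                # non-leaf produced by an op: grad_fn is set
print(c.grad_fn)                         # <MulBackward0 object at ...>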

AutogradMetaInterface is defined as follows, which is an abstract interface that requires derived classes to implement concrete functions.

struct C10_API AutogradMetaInterface {
  virtual void set_requires_grad(
      bool requires_grad,
      at::TensorImpl* self_impl) = 0;
  virtual bool requires_grad() const = 0;
  virtual at::Tensor& mutable_grad() = 0;
  virtual const at::Tensor& grad() const = 0;
  virtual const at::Tensor& fw_grad(uint64_t level, const at::Tensor& self)
      const = 0;
  virtual void set_fw_grad(
      const at::Tensor& new_grad,
      const at::Tensor& self,
      uint64_t level,
      bool is_inplace_op) = 0;
  virtual ~AutogradMetaInterface();
};

0x03 Classes related to automatic differentiation

The following classes are related to automatic differentiation.

3.1 AutogradMeta

AutogradMeta inherits from AutogradMetaInterface and stores autograd-related information, such as a node's gradient value and its gradient-computation function. It is defined as follows:

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                              AutogradMeta
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

/// Each `Variable` has one unique `AutogradMeta` struct, which stores autograd
/// metadata fields that are necessary for tracking the Variable's autograd history.
/// As an optimization, a Variable may store a nullptr, in lieu of a default
/// constructed AutogradMeta.
/// 1. A `grad_fn`, if the variable is in the interior of the graph. This is the
///    gradient of the function that produced the variable.
/// 2. A `grad_accumulator`, if the variable is a leaf, which accumulates a
///    scalar gradient value into its `grad` variable.

struct TORCH_API AutogradMeta : public c10::AutogradMetaInterface {
  std::string name_;

  Variable grad_;                        // the current gradient of this Variable
  std::shared_ptr<Node> grad_fn_;        // the interior node responsible for computing this Variable's gradient;
                                         // PyTorch decides whether a Variable is a leaf by checking whether
                                         // grad_fn_ is empty; it is accessed through the grad_fn() method
  std::weak_ptr<Node> grad_accumulator_; // the gradient accumulation function of a leaf Variable

  // This field is used to store all the forward AD gradients
  // associated with this AutogradMeta (and the Tensor it corresponds to)
  // There is a semantic 1:1 correspondence between AutogradMeta and
  // ForwardGrad but:
  //   - This field is lazily populated.
  //   - This field is a shared_ptr but it must never be
  //     shared by multiple Tensors. See Note [ Using ForwardGrad ]
  // Any transition from not_initialized to initialized
  // must be protected by mutex_
  std::shared_ptr<ForwardGrad> fw_grad_; // forward AD gradients

  std::vector<std::shared_ptr<FunctionPreHook>> hooks_;
  std::shared_ptr<hooks_list> cpp_hooks_list_;

  // Only meaningful on leaf variables (must be false otherwise)
  bool requires_grad_;

  // Only meaningful on non-leaf variables (must be false otherwise):
  // whether a non-leaf Variable retains its gradient
  bool retains_grad_;

  bool is_view_;

  // The "output number" of this Variable; e.g., if this variable
  // was the second output of a function, then output_nr == 1.
  // We use this to make sure we can setup the backwards trace
  // correctly when this variable is passed to another function.
  uint32_t output_nr_;

  // Mutex to ensure that concurrent read operations that modify internal
  // state are still thread-safe. Used by grad_fn(), grad_accumulator(),
  // fw_grad() and set_fw_grad()
  // This is mutable because we need to be able to acquire this from const
  // version of this class for the functions above
  mutable std::mutex mutex_;
};

The main member variables of AutogradMeta are as follows:

  • grad_ : stores the gradient of this Variable instance, and is itself a Variable.

  • grad_fn_ : a Node instance, set only for non-leaf Variables. PyTorch checks whether grad_fn() is null to determine whether a Variable is a leaf (the Python sketch after this list shows how the leaf / non-leaf split surfaces).

  • grad_accumulator_ : also a Node instance, present only for leaf nodes.

    • It is accessed through grad_accumulator() of Variable.
    • grad_accumulator_ is the gradient accumulation function.
    • The accumulated gradient is stored in the grad_ variable.
  • requires_grad_ : indicates whether this Variable instance requires a gradient.

  • retains_grad_ : only meaningful for non-leaf nodes; indicates whether this non-leaf Variable keeps its gradient.

  • is_view_ : a flag indicating whether this Variable instance is a view (it has no storage of its own and is backed by a base Variable).

  • version_counter_ : version number.

  • output_nr_ : a number recording which output of its producing Node this Variable is; for example, 0 means it is the first output.

  • base_ : the base Variable of a view.
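
The leaf / non-leaf split described above can be observed from Python (the fields themselves are C++ internals and only surface indirectly). A minimal sketch:

import torch

x = torch.tensor(2.0, requires_grad=True)   # leaf: grad_fn_ is null, grad_accumulator_ is used
y = x ** 2                                  # non-leaf: grad_fn_ points to PowBackward0
y.retain_grad()                             # sets retains_grad_ so this non-leaf keeps its .grad

print(x.is_leaf, x.grad_fn)                 # True None
print(y.is_leaf, y.grad_fn)                 # False <PowBackward0 object at ...>

y.backward()
print(x.grad)                               # tensor(4.)  accumulated into grad_ by AccumulateGrad
print(y.grad)                               # tensor(1.)  kept only because retain_grad() was called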

Taking grad_fn_ set to SubBackward0 as an example, the details are as follows:

+----------------------------------------------+          +-------------------------+
|Tensor                                        |          |TensorImpl               |
|                                              |          |                         |
|                                              |  bridge  |                         |
|   <TensorImpl, UndefinedTensorImpl> impl_ +-----------> |    autograd_meta_ +---------+
|                                              |          |                         |   |
|                                              |          |    named_tensor_meta_   |   |
+----------------------------------------------+          |                         |   |
                                                          |    pyobj_               |   |
                                                          |                         |   |
                                                          |    sizes_and_strides_   |   |
                                                          |                         |   |
                                                          |    storage_offset_      |   |
                                                          |                         |   |
                                                          |    data_type_           |   |
                                                          |                         |   |
                                                          |    device_opt_          |   |
                                                          |                         |   |
                                                          |                         |   |
                                                          +-------------------------+   |
                                                                                        |
                   +-------------------------+                                          |
                   | AutogradMeta            |                                          |
                   |                         +<-----------------------------------------+
                   |                         |
                   |      grad_accumulator_  |
                   |                         |            +-------------------------+
                   |      grad_fn_ +--------------------> | SubBackward0            |
                   |                         |            |                         |
                   |      hooks_             |            |                         |
                   |                         |            |                         |
                   |      retains_grad_      |            |           next_edges_   |
                   |                         |            |                         |
                   |      output_nr_         |            |                         |
                   |                         |            |                         |
                   |      fw_grad_           |            |                         |
                   |                         |            |                         |
                   +-------------------------+            +-------------------------+
​

3.2 DifferentiableViewMeta

For an input variable, many operations return a new variable that shares storage with the input; the new variable is called a view variable on top of the base variable. In PyTorch there are two types of views: differentiable and non-differentiable. To support proper versioning, base and view variables must share the same version counter (version_counter), regardless of type.

The DifferentiableViewMeta is used to handle differentiable views.

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                          DifferentiableViewMeta
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

/// DifferentiableViewMeta is created to support gradient tracking of
/// such **in-place** operations. In particular,
///   + if an in-place op is done on base, the grad_fn field of the view may
///     become stale. So accesses should always go through grad_fn(), which
///     reconstructs an updated grad_fn if the version_counter has incremented.
///     All other fields are always valid.
///   + if an in-place op is done on view, in rebase_history() of view, which is
///     called after every in-place op in VariableType.cpp, the grad_fn of base
///     is updated.
///   + if a single autograd Node returns multiple differentiable views, if any
///     output is modified by an inplace operation, the autograd engine will make
///     an equivalent graph (corresponding to the view operations) without using
///     equivalent graph, where each output is treated as if it were produced by a
///     distinct view operation. This discards the original (e.g., user provided)
///     grad_fn. If the provided grad_fn does more than the backward of the view,
///     then the DifferentiableViewMeta must be created with creation_meta=
///     CreationMeta::MULTI_OUTPUT_NODE to prevent the engine from ignoring the
///     provided grad_fn.

enum class CreationMeta : uint8_t {
  DEFAULT,
  IN_CUSTOM_FUNCTION,
  MULTI_OUTPUT_NODE,
  NO_GRAD_MODE,
  MULTI_OUTPUT_SAFE,
  INFERENCE_MODE
};

struct TORCH_API DifferentiableViewMeta : public AutogradMeta {
 private:
  /// Informations about the views
  c10::optional<ViewInfo> backward_info_;
  c10::optional<ViewInfo> forward_info_;

  // Optimization to reduce the number of ViewInfo we create.
  // In the (very common) case where backward_info_ == forward_info_, we only
  // populate backward_info_ (that should be used as both the forward and backward
  // view information) and set shared_view_info_ = true.
  // Invariants:
  //   - If shared_view_info_ is false, there is no special constraints on
  //     backward_info_ and forward_info_
  //   - If shared_view_info_ is true, we must have:
  //     - backward_info_.has_value() == true
  //     - forward_info_.has_value() == false
  bool shared_view_info_;

  /// The two following fields are extra information that we track to ensure that
  /// any operation on this backward view is valid.

  /// The value of the version_counter at the time grad_fn was created. The
  /// grad_fn field is stale if attr_version_ != version_counter.current_version().
  uint32_t attr_version_;
  CreationMeta creation_meta_;
};
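
The shared version counter can be observed from Python. The sketch below relies on the internal _base and _version attributes, which are implementation details rather than public API:

import torch

base = torch.randn(3, requires_grad=True)
view = base[:2]                          # a differentiable view of base

print(view._base is base)                # True: the view records its base Variable
print(base._version, view._version)      # equal: base and view share one version counter

with torch.no_grad():
    base.add_(1)                         # in-place op on the base

print(base._version, view._version)      # both bumped, because the counter is shared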

3.3 AutogradContext

AutogradContext is the context used when operating with autograd. It stores information generated during the forward pass so that it can be accessed during backward propagation.

/// Context to save information during `forward` that can be accessed in `backward`
/// in custom autograd operations (see `torch::autograd::Function` for details).
struct TORCH_API AutogradContext {
  // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init)
  AutogradContext() : materialize_grads_(true) {}
  AutogradContext(const AutogradContext &other) = delete;
  AutogradContext& operator=(const AutogradContext& other) = delete;

  /// Can be used to save non-variable data for `backward`.
  // NOLINTNEXTLINE(cppcoreguidelines-non-private-member-variables-in-classes)
  ska::flat_hash_map<std::string, at::IValue> saved_data;

  /// Saves the list of variables for a future call to `backward`. This
  /// should be called at most once from inside of `forward`.
  void save_for_backward(variable_list to_save);

  /// Marks variables in the list as modified in an in-place operation. This
  /// should be called at most once from inside of `forward` and all arguments
  /// should be inputs.
  void mark_dirty(const variable_list &inputs);

  /// Marks outputs in the list as not requiring gradients. This should be called
  /// at most once from inside of `forward` and all arguments should be outputs.
  void mark_non_differentiable(const variable_list &outputs);

  // Sets whether undefined output grad tensors should be expanded to tensors
  // full of zeros before calling backward function. Default value is true.
  void set_materialize_grads(bool value);

  /// Get the list of variables that were saved in `forward` using
  /// `save_for_backward()`. Before returning them to the user, a check is made to
  /// ensure that they were not modified by any in-place operations.
  variable_list get_saved_variables() const;
  const std::unordered_set<at::TensorImpl*>& get_and_bump_dirty() const;
  const std::unordered_set<at::TensorImpl*>& get_non_differentiable() const;

 private:
  std::unordered_set<at::TensorImpl*> non_differentiable_;
  std::unordered_set<at::TensorImpl*> dirty_inputs_;
  std::vector<torch::autograd::SavedVariable> saved_variables_;
  variable_list to_save_;
  bool materialize_grads_;

  // The CppNode in the autograd graph that owns this AutogradContext. We need a
  // weak_ptr to avoid a refcycle. Since grad_fn_ owns this AutogradContext, it
  // will always be alive when we want to use it.
  std::weak_ptr<Node> grad_fn_;
  bool has_freed_buffers_;

  void save_variables();

  template <class T> friend struct CppNode;
};

For users, AutogradContext mainly shows up when customizing autograd Functions. The following is an example taken from the source-code comments.

/// ```
/// class MyFunction : public Function<MyFunction> {
///   public:
///   static variable_list forward(AutogradContext *ctx, int n, Variable var) {
///      // Save data for backward in context
///      ctx->saved_data["n"] = n;
///      var.mul_(2);
///      // Mark var as modified by inplace operation
///      ctx->mark_dirty({var});
///      return {var};
///   }
///
///   static variable_list backward(AutogradContext *ctx, variable_list
///       grad_output) {
///      // Use data saved in forward
///      auto n = ctx->saved_data["n"].toInt();
///      return {grad_output[0]*n};
///   }
/// };
/// ```
///
/// To use `MyFunction`:
/// ```
/// Variable x;
/// auto y = MyFunction::apply(6, x);
/// // Example backward call
/// y[0].sum().backward();
/// ```
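
The Python-side torch.autograd.Function exposes the same facilities through the ctx object: save_for_backward() for tensors, plain attributes on ctx for non-tensor data (the analogue of saved_data), mark_dirty(), and so on. A minimal sketch (ScaleBy is a made-up example, not part of PyTorch):

import torch
from torch.autograd import Function

class ScaleBy(Function):
    @staticmethod
    def forward(ctx, x, n):
        ctx.n = n                    # non-tensor data, analogous to ctx->saved_data["n"]
        ctx.save_for_backward(x)     # tensors must go through save_for_backward
        return x * n

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors     # checked against in-place modification before being returned
        return grad_output * ctx.n, None   # one gradient per forward input; n is not differentiable

x = torch.randn(4, requires_grad=True)
y = ScaleBy.apply(x, 3)
y.sum().backward()
print(x.grad)                        # a tensor filled with 3.0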

This leads us to the autograd Function class.

3.4 Autograd Function

Autograd uses Function to compute results and gradients and to encode the operation history. Every operation performed on a Tensor creates a new Function object, which performs the computation and records that it happened. The operation history is retained as a DAG of Functions, with edges representing data dependencies (input <- output).

Typically, the only way for users to interact with Function is to create subclasses that define new operations; this is the recommended way to extend torch.autograd. For more details on how to use this class, see the notes on extending the autograd engine: pytorch.org/docs/stable…

If you want to use custom autograd operations, implement a Function subclass with static forward and backward functions.

  • forward can take any number of arguments and should return either a variable list or a single Variable.

    • Any Variable argument will be registered in the computation graph, but Variables buried inside vectors/sets or other data structures will not be traversed for registration.
    • You can use c10::optional<Tensor> as one of the parameters; if it holds a value, it is registered as a Variable in the graph.
    • forward should take a pointer to torch::autograd::AutogradContext as its first parameter. Variables can be saved in the context with ctx->save_for_backward(), while other data is saved as <std::string, at::IValue> pairs in the ctx->saved_data map.
  • backward should take a pointer to torch::autograd::AutogradContext and a variable list as parameters.

    • This variable list contains as many variables as forward produced outputs.
    • backward should return as many variables as forward had inputs, each holding the gradient with respect to the corresponding input.
    • Variables saved in forward can be retrieved with ctx->get_saved_variables(), and other saved data can be accessed through ctx->saved_data.
    • When backward is called, the computation graph is processed in topological order by calling the backward of each Function object and passing the returned gradients on to the next Function.

An example of a derived subclass of Function is as follows:

class Exp(Function):
​
     @staticmethod
     def forward(ctx, i):
         result = i.exp()
         ctx.save_for_backward(result)
         return result
​
     @staticmethod
     def backward(ctx, grad_output):
         result, = ctx.saved_tensors
         return grad_output * result
​
#Use it by calling the apply method:
output = Exp.apply(input)

As mentioned earlier, Function has been replaced by Node, so next we look at Node.

0x04 Node

In earlier versions, Node was called Function, but later changed to Node to better match the concept of nodes.

Node is an abstract class that represents an operation taking zero or more Variables as input and producing zero or more Variables as output. A Node's input node in the forward graph is its output node in the backward-propagation graph. In PyTorch's autograd mechanism, all functions derive from this class and override its apply method; instances of subclasses can then be invoked through the call operator.

When the autograd system is viewed as a graph, Node is a vertex, and Nodes are connected to each other by (directed) Edges, where an Edge is represented by a (Node, input_nr) pair. Variables are the inputs and outputs of Nodes and move along these edges during graph execution. When two or more edges (from different sources) point to the same input of a Node, the values produced along all of these edges are implicitly summed before being forwarded to the target Node.

Subclasses usually represent differentiable functions and their gradient operators. Note, however, that the definition of Node is very general: a Node takes zero or more inputs and produces zero or more outputs, so its use is very flexible and goes beyond pure mathematical operations. For example, the AccumulateGrad function is a sink: it accepts one input but produces no output, instead accumulating the input as a side effect. At the other end, the GraphRoot function takes no input from other functions but produces multiple outputs. For details, see torch/csrc/autograd/function.h.

4.1 Definition

Let’s look at the definition of the Node class. For better illustration, we keep only the member variables and remove the member functions.

using edge_list = std::vector<Edge>;

struct TORCH_API Node : std::enable_shared_from_this<Node> {
 protected:
  /// Performs the `Node`'s actual operation.
  virtual variable_list apply(variable_list&& inputs) = 0;

  /// Calls `apply()`, but instruments it with tracing machinery.
  variable_list traced_apply(variable_list inputs);

  /// NOTE [ Sequence Number ]
  ///
  /// The sequence_nr has two main usages in autograd:
  ///
  /// 1) Helps determine the node's execution priority in the engine.
  ///    All else being equal, nodes with higher priority numbers are executed first.
  ///    Thus, nodes corresponding to ops executed later are the first to be executed in
  ///    the backward pass. One caveat is that we prioritize AccumulateGrad nodes by
  ///    explicitly setting its sequence_nr to be UINT64_MAX.
  /// 2) The sequence number of this `Node` is paired with the thread_id it was created in
  ///    as a unique identifier by the profiler to annotate recorded events.
  ///    The purpose of this is to help users (and possibly programs) interpreting the profiler's
  ///    output to correlate backward nodes with its forward ops.
  ///    We need both sequence_nr and thread_id to identify a node because sequence_nr is
  ///    thread_local, i.e., starts counting up from zero in a new thread
  // Sequence number used to correlate backward nodes with forward ops in the
  // profiler and provide determinism in the engine.
  const uint64_t sequence_nr_;

  // NOTE [ Topological Number ]
  //
  // topological_nr is used to prune branches in the DAG during autograd discovery as
  // maintaining topological_nr helps us check in O(1) if there does NOT exist
  // a directed path between two nodes.
  //
  // The topological order number of this `Node` representing the length of the
  // longest possible path from this Node to any leaf node. If you are leaf node,
  // aka AccumulateGrad, this will be zero. This value has the property that
  // For every pair of nodes X, Y in G, existence of a directed path from X to Y
  // implies topo_nr(X) > topo_nr(Y). The converse is not true, however, so we
  // cannot prove existence of a path from X to Y, only non-existence.
  //
  // One assumption we make when using topo_nr is that once a node
  // has been used, i.e., has a parent node, its own topo_nr does not change
  // we have added some checks with the `has_parent_` field to enforce this.
  //
  // What NOT to do:
  //
  //   1) 2 -> 1 -> 0              In this diagram we label nodes with their topo_nr.
  //      2 -> 1 -> 0              We have two simple graphs that can each arise from
  //                               `t.exp().exp()`, for example.
  //   2) 2 -> 1 -> 0
  //              /
  //      2 -> 1 -> 0              We add 2 as a next edge to 1 even though 1 already
  //                               has a parent.
  //   3) 2 -> 1 -> 0
  //              /
  //      2 -> 3 -> 0              2 < 3, yet there exists a path from 2 to 3!
  //
  uint64_t topological_nr_ = 0;

  // Tracks whether this node has been added as the next_edge of another node
  // via set_next_edge(s), which always calls topological_nr() of all its children
  // See NOTE [ Topological Number ] for why we need this.
  mutable bool has_parent_ = false;

  // Id of the thread that created the instance
  uint64_t thread_id_ = 0;

  std::mutex mutex_;

  edge_list next_edges_;
  PyObject* pyobj_ = nullptr; // weak reference
  std::unique_ptr<AnomalyMetadata> anomaly_metadata_ = nullptr;
  std::vector<std::unique_ptr<FunctionPreHook>> pre_hooks_;
  std::vector<std::unique_ptr<FunctionPostHook>> post_hooks_;
  at::SmallVector<InputMetadata, 2> input_metadata_;

  // operator() is overridden here; it calls apply()
  variable_list operator()(variable_list&& inputs) {
    // In the first iteration of named tensors, autograd ignores names and
    // operates on unnamed tensors. In the long term, autograd should
    // probably operate with names.
    at::NoNamesGuard no_names_guard;

    bool pre_sampled = false;
    if (at::shouldRunRecordFunction(&pre_sampled)) {
      // Using RecordFunction to trigger observers in the backward pass
      at::RecordFunction guard(at::RecordScope::BACKWARD_FUNCTION, pre_sampled);
      if (guard.isActive()) {
        // Using sequence number and thread id to correlate with
        // the forward pass function
        guard.setForwardThreadId(thread_id_);
        if (guard.needsInputs()) {
          guard.before(
              name(),
              std::vector<c10::IValue>(inputs.begin(), inputs.end()),
              sequence_nr());
        } else {
          guard.before(name(), sequence_nr());
        }
      }
      // keeping stack guard object alive during the call
      return apply(std::move(inputs));
    } else {
      return apply(std::move(inputs));
    }
  }
};

Its constructor is:

  explicit Node(
      uint64_t sequence_nr,
      edge_list&& next_edges = edge_list())
      : sequence_nr_(sequence_nr),
      next_edges_(std::move(next_edges)) {
​
    for (const Edge& edge: next_edges_) {
      update_topological_nr(edge);
    }
​
    if (AnomalyMode::is_enabled()) {
      metadata()->store_stack();
​
      // If anomaly mode is enabled and graph is constructed, then assign the
      // currently evaluating node as the parent of this node.
      // A parent is a Node where this Node is created.
      // We are tracking the parents to track multiple backward operations.
      assign_parent();
    }
​
    // Store the thread_id of the forward operator.
    // See NOTE [ Sequence Numbers ]
    thread_id_ = at::RecordFunction::currentThreadId();
  }

4.2 Important member variables

Let’s explain some important member variables in detail.

4.2.1 input_metadata_

input_metadata_ holds the metadata of the input data and describes the input parameters of a Function.

4.2.2 next_edges_

This is the edge associated with the operator in the forward process.

We think of PyTorch’s Autograd system as a graph, where each Node instance is a graph Node and each Node instance is connected via Edge. Edge is a structure that represents edges in the graph by pairing (Function, input_nr). A Node member, next_edges_, is a set of Edge instances that represent the (other) nodes to which the return value of this Node instance is to be output, i.e., next_edges_ is the link between nodes.

Node’s inputs and outputs are Variable instances, so when a graph is executed, the Variable instance flows between these edges. When two or more edges point to the same Node (the Node has an entry degree greater than 1), the output of these edges is implicitly added up and sent to the pointed target Node.
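
This implicit summation is easy to see from Python: when a tensor feeds into several operations, the gradients flowing back along the corresponding edges are added together at its AccumulateGrad node. A small sketch:

import torch

x = torch.tensor(2.0, requires_grad=True)

y = x * x + x        # x is used twice, so two edges eventually lead back to x
y.backward()

print(x.grad)        # tensor(5.) == d(x*x)/dx + d(x)/dx = 2*x + 1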

The user can add an edge to a Node with add_next_edge(), obtain a particular edge with next_edge(index), and obtain the full list of edges to iterate over via the next_edges() method.

4.2.3 sequence_nr_

This variable is used to associate backward nodes with forward operations in the network and to provide determinism in the engine. sequence_nr_ grows monotonically as Function instances are constructed, which serves two purposes:

  • It helps determine the node's execution priority in the engine. All other things being equal, nodes with higher priority numbers are executed first; therefore, an operation performed later in the forward pass is executed earlier in the backward pass. One thing to note is that for AccumulateGrad nodes we explicitly set sequence_nr_ to UINT64_MAX. In PyTorch's backward graph, AccumulateGrad is the leaf-node type that terminates the graph; an AccumulateGrad instance has a variable member pointing to its leaf node.
  • The sequence_nr_ of this node is paired with the thread_id it was created in as a unique identifier used by the profiler to annotate recorded events. The goal is to help the user (and possibly a program) interpret the profiler's output and correlate backward nodes with their forward operations. Both sequence_nr and thread_id are needed because sequence_nr is thread_local, i.e., it starts counting from zero in each new thread.

4.2.4 topological_nr_

This variable is the topological sequence number of the node and represents the length of the longest possible path from that node to any leaf node. For a leaf node itself, i.e., AccumulateGrad, topological_nr_ is zero.

topological_nr_ is used to prune branches in the DAG during autograd discovery; maintaining it lets us check in O(1) time whether there is no directed path between two nodes.

Topological_nr_ has the following properties:

  • For each pair of nodes X, Y in G, the existence of a directed path from X to Y implies topo_nr(X) > topo_nr(Y). The converse does not hold, however, so we can only use it to prove the non-existence of a path from X to Y, never its existence.
  • One assumption we make when using topological_nr_ is that once a node has been used, i.e., once it has a parent node, its own topological_nr_ will not change. We have added some checks through the has_parent_ field to enforce this.

4.2.5 operator ()

variable_list operator()(variable_list&& inputs) is the primary method of Node. It takes a vector of Variable instances as input, produces a vector of Variable instances as output, and calls apply(), the concrete business function. The method relies on C++ polymorphism to turn a call on the operator into a call to the subclass's apply() implementation.

All functions used in PyTorch's backward computation inherit from this class and override its pure virtual apply() method.

0x05 Edge

As its name suggests, Edge is an edge of the computation graph. Its main member variables are:

  • std::shared_ptr<Node> function: the Node this edge points to.
  • uint32_t input_nr: specifies which input of function this Edge corresponds to.

using tensor_list = std::vector<at::Tensor>;
using variable_list = std::vector<Variable>;
using edge_list = std::vector<Edge>;
using saved_variable_list = std::vector<SavedVariable>;
using IndexRange = std::pair<size_t, size_t>;

/// Represents a particular input of a function.
struct Edge {
  Edge() noexcept : function(nullptr), input_nr(0) {}

  Edge(std::shared_ptr<Node> function_, uint32_t input_nr_) noexcept
      : function(std::move(function_)), input_nr(input_nr_) {}

  /// Convenience method to test if an edge is valid.
  bool is_valid() const noexcept {
    return function != nullptr;
  }

  // Required for use in associative containers.
  bool operator==(const Edge& other) const noexcept {
    return this->function == other.function && this->input_nr == other.input_nr;
  }

  bool operator!=(const Edge& other) const noexcept {
    return !(*this == other);
  }

  /// The function this `Edge` points to.
  std::shared_ptr<Node> function;

  /// The identifier of a particular input to the function.
  uint32_t input_nr;
};

}} // namespace torch::autograd

0x06 Logic diagram

We refine the logic diagram as follows; the top half is the Python world and the bottom half is the C++ world:

+--------------------------------------------+            +------------------------------+
| SubBackward0                               |            | PowBackward0                 |
|                                            |    Edge    |                              |    Edge
|   next_functions[0] = (PowBackward0, 0) +--------------> |    next_functions +-------------> ...
|                                            |            |                              |
|                                            |            +------------------------------+
|                                            |
|                                            |            +------------------------------+
|   next_functions[1] = (MulBackward0, 0) +--------------> | MulBackward0                 |
|                                            |            |                              |    Edge
+--------------------------------------------+            |    next_functions +-------------> ...
                      ^                                    |                              |
                      |                                    +------------------------------+
                      |
 Python               |
+--------------------------------------------------------------------------------------------------------+
 C++                  |
                      |
                      v
+---------------------------------------------+          +----------------------+         +------------------+
| SubBackward0                                |          | Edge 1               |         | PowBackward0     |
|                                             |          |                      |         |                  |
|                              +------------------------> |    function +----------------> |                  |
|                              |              |          |                      |         |                  |
|    next_edges_ = [Edge 1, Edge 2]           |          |    input_nr = 0      |         |                  |
|                              +              |          +----------------------+         +------------------+
|                              |              |
+---------------------------------------------+          +----------------------+         +------------------+
                               |                          | Edge 2               |         | MulBackward0     |
                               |                          |                      |         |                  |
                               +------------------------> |    function +----------------> |                  |
                                                          |                      |         |                  |
                                                          |    input_nr = 0      |         |                  |
                                                          +----------------------+         +------------------+


Now that the basic classes in propagation have been analyzed, the next article will show you how to use these classes to accomplish forward propagation.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

0xFF Reference

Github.com/KeithYin/re…

Pytorch Learning Note (13) : Low-level implementation parsing of backward processes

Initialization of PyTorch

Pytorch’s automatic derivative mechanism – the establishment of computational graph

How autograd encodes the history

Pytorch.org/tutorials/b…

Pytorch Note (Calculation diagram + Autograd)-Node(1)

Explain the network construction in Pytorch

PyTorch’s optimizer

Distribution of PyTorch

PyTorch’s Tensor

PyTorch’s Tensor

PyTorch’s Tensor

PyTorch dynamic diagram (part 2)

PyTorch dynamic diagram (part 1)

Calculation diagram — Explain teacher Li Hongyi’s PPT with Pytorch

How to use PyTorch to find gradients automatically

PyTorch Automatic Derivative (Autograd) principle analysis

Pytorch Automatic Derivation of Autograd

PyTorch’s core developers take the inner workings of the game personally

PyTorch Automatic differential fundamentals

Towardsdatascience.com/pytorch-aut…