0x00 Abstract
PyTorch DataParallel training replicates the model across multiple GPUs before training. While analyzing it, the author found that the replication process of DataParallel is hard to understand without first sorting out some basic knowledge about the GPU, which is what this article does.
Other articles in this series are as follows:
- Automatic Differentiation of Deep Learning Tools (1)
- Automatic Differentiation of Deep Learning Tools (2)
- Automatic Differentiation of Deep Learning Tools (3) -- Interpretation of Examples
- How PyTorch Implements Forward Propagation (1) -- Base Class (1)
- How PyTorch Implements Forward Propagation (2) -- Base Class (2)
- How PyTorch Implements Forward Propagation (3) -- Implementation
- How PyTorch Implements Backward Propagation (1) -- Calling the Engine
- How PyTorch Implements Backward Propagation (2) -- Static Structure of the Engine
- How PyTorch Implements Backward Propagation (3) -- Dynamic Logic of the Engine
- How PyTorch Implements Backward Propagation (4) -- The Specific Algorithm
- PyTorch Distributed (1) -- History and Overview
0x01 Questions
Before DataParallel runs forward propagation, it needs to replicate the model and scatter the input data onto the GPUs, as shown in the following figure.
From this we have a few questions:
- What exactly is behind moving the model to the GPU?
- How do I call GPU operations on top of the CPU?
- How to seamlessly switch between CPU and GPU operations?
- Do I need to move the loss function onto the GPU?
We’ll look at them all.
Note that CUDA and the Dispatcher are only outlined here, just enough for the reader to follow the overall process; those who are interested can dig into them on their own.
0x02 Move model to GPU
2.1 CUDA operations
CUDA is a GPU programming model developed by NVIDIA that provides a GPU programming interface. Users can build GPU-computing applications through CUDA programming.

torch.cuda is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors allocated by the user are created on that device by default. Users can change the selected device with torch.cuda.device. Once a tensor is allocated, you can operate on it regardless of which device is currently selected, and PyTorch places the result on the same device as the original tensor.

By default, cross-GPU operations are not allowed unless peer-to-peer memory access is enabled; the exceptions are torch.Tensor.copy_ and other methods with similar copying semantics, such as torch.Tensor.to and torch.Tensor.cuda.

Here is a concrete example from the PyTorch documentation; as you can see, tensors can be created and manipulated on specific devices.
```python
cuda = torch.device('cuda')     # Default CUDA device
cuda0 = torch.device('cuda:0')
cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)
y = torch.tensor([1., 2.]).cuda()
# y.device is device(type='cuda', index=0)

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.tensor([1., 2.], device=cuda)

    # transfers a tensor from CPU to GPU 1
    b = torch.tensor([1., 2.]).cuda()
    # a.device and b.device are device(type='cuda', index=1)

    # You can also use ``Tensor.to`` to transfer a tensor:
    b2 = torch.tensor([1., 2.]).to(device=cuda)
    # b.device and b2.device are device(type='cuda', index=1)

    c = a + b
    # c.device is device(type='cuda', index=1)

    z = x + y
    # z.device is device(type='cuda', index=0)

    # even within a context, you can specify the device
    # (or give a GPU index to the .cuda call)
    d = torch.randn(2, device=cuda2)
    e = torch.randn(2).to(cuda2)
    f = torch.randn(2).cuda(cuda2)
    # d.device, e.device, and f.device are all device(type='cuda', index=2)
```
2.2 Module
A deep learning model can be regarded as a container of parameters: running the model is essentially performing basic matrix operations on the input with those parameters. In general, user-defined models derive from the nn.Module class. Distributed training involves synchronously updating parameters and copying the model to multiple workers, so we first need to look at the state a Module holds. As you can see from the definition, a Module's member variables fall into two groups: state variables and hook functions.
```python
class Module:
    dump_patches: bool = False
    _version: int = 1
    training: bool
    _is_full_backward_hook: Optional[bool]

    def __init__(self):
        """
        Initializes internal Module state, shared by both nn.Module and ScriptModule.
        """
        torch._C._log_api_usage_once("python.nn_module")

        self.training = True                            # whether the module is in training mode
        self._parameters = OrderedDict()                # parameters updated by back propagation
        self._buffers = OrderedDict()                   # state that is saved but not updated by BP
        self._non_persistent_buffers_set = set()
        self._backward_hooks = OrderedDict()
        self._is_full_backward_hook = None
        self._forward_hooks = OrderedDict()
        self._forward_pre_hooks = OrderedDict()
        self._state_dict_hooks = OrderedDict()
        self._load_state_dict_pre_hooks = OrderedDict()
        self._modules = OrderedDict()                   # sub-modules
```
We will focus on the state variables. There are four main ones:

- self.training: whether the network is in training mode.
- self._modules: the sub-modules of the network, defined recursively.
- self._parameters: the parameters of the network, i.e., the tensors that are updated by back propagation (BP) during training and are therefore the targets of gradient updates.
- self._buffers: state that is not updated by BP during training but still needs to be saved, such as the running mean and variance in BatchNorm; these are optimized not through gradient back propagation but in other ways.
In essence, once a model's network structure has been defined, the combination of self._parameters and self._buffers constitutes a concrete state of that model. So if we need to copy a model:

- self._modules is part of the network structure; when we copy the model to other workers, it is copied along with them.
- self._parameters and self._buffers both need to be copied explicitly to the other workers so that the same state is maintained across different Python processes (see the sketch below).
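To make this concrete, the following sketch inspects these attributes on a tiny module. TinyNet here is just an illustrative class, not something from the PyTorch code base; _modules, _parameters and _buffers are internal attributes, while named_parameters() and named_buffers() are their public accessors.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.fc = nn.Linear(4, 2)    # a sub-module, stored in _modules
        self.bn = nn.BatchNorm1d(2)  # has both parameters and buffers

    def forward(self, x):
        return self.bn(self.fc(x))

net = TinyNet()
print(list(net._modules.keys()))     # ['fc', 'bn']  -> the network structure
print(list(net._parameters.keys()))  # []  -> parameters live on the leaf sub-modules
print([n for n, _ in net.named_parameters()])
# ['fc.weight', 'fc.bias', 'bn.weight', 'bn.bias']                -> updated by BP
print([n for n, _ in net.named_buffers()])
# ['bn.running_mean', 'bn.running_var', 'bn.num_batches_tracked'] -> saved, not updated by BP
```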
So, does that mean we just need to copy self._modules, self._parameters and self._buffers? Let’s move on.
2.3 Moving the model
2.3.1 Example
Having seen how to manipulate tensors on the GPU, let’s look at how to place the model on the GPU.
First we define a model.
```python
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))
```
Then use the model as follows.
```python
model = ToyModel().cuda(device_ids[0])  # move the model onto the GPU
ddp_model = DDP(model, device_ids)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = ddp_model(torch.randn(20, 10))
labels = torch.randn(20, 5).to(device_ids[0])
loss_fn(outputs, labels).backward()
optimizer.step()
```
2.3.2 operation
In the example, the cuda method is used to copy the model to the GPU. Its comments state that it moves the parameters and buffers of the model to the GPU. The code actually calls self._apply with a lambda that invokes t.cuda(device) on each tensor.
```python
def cuda(self: T, device: Optional[Union[int, device]] = None) -> T:
    r"""Moves all model parameters and buffers to the GPU.

    This also makes associated parameters and buffers different objects. So
    it should be called before constructing optimizer if the module will
    live on GPU while being optimized.

    .. note::
        This method modifies the module in-place.

    Args:
        device (int, optional): if specified, all parameters will be
            copied to that device

    Returns:
        Module: self
    """
    return self._apply(lambda t: t.cuda(device))
```
Let’s look at some other functions that you’re familiar with.
First, the to method essentially calls t.to(...) on each tensor via self._apply; we have omitted some of the checking code.
```python
def to(self, *args, **kwargs):
    r"""Moves and/or casts the parameters and buffers.

    This can be called as

    .. function:: to(device=None, dtype=None, non_blocking=False)
    .. function:: to(dtype, non_blocking=False)
    .. function:: to(tensor, non_blocking=False)
    .. function:: to(memory_format=torch.channels_last)

    Its signature is similar to :meth:`torch.Tensor.to`, but only accepts
    floating point or complex :attr:`dtype`s. In addition, this method will
    only cast the floating point or complex parameters and buffers to :attr:`dtype`
    (if given). The integral parameters and buffers will be moved
    :attr:`device`, if that is given, but with dtypes unchanged. When
    :attr:`non_blocking` is set, it tries to convert/move asynchronously
    with respect to the host if possible, e.g., moving CPU Tensors with
    pinned memory to CUDA devices.

    See below for examples.

    .. note::
        This method modifies the module in-place.

    Args:
        device (:class:`torch.device`): the desired device of the parameters
            and buffers in this module
        dtype (:class:`torch.dtype`): the desired floating point or complex dtype of
            the parameters and buffers in this module
        tensor (torch.Tensor): Tensor whose dtype and device are the desired
            dtype and device for all parameters and buffers in this module
        memory_format (:class:`torch.memory_format`): the desired memory
            format for 4D parameters and buffers in this module (keyword
            only argument)

    Returns:
        Module: self
    """
    device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(*args, **kwargs)

    def convert(t):
        if convert_to_format is not None and t.dim() in (4, 5):
            return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
                        non_blocking, memory_format=convert_to_format)
        return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

    return self._apply(convert)
```
Second, the cpu method likewise calls t.cpu() on each tensor via self._apply.
```python
def cpu(self: T) -> T:
    r"""Moves all model parameters and buffers to the CPU.

    .. note::
        This method modifies the module in-place.

    Returns:
        Module: self
    """
    return self._apply(lambda t: t.cpu())
```
Therefore, we need to examine the _apply method.
2.3.3 _apply method
We can see that the main logic of _apply is:

- Recursively apply fn to all child modules.
- Traverse _parameters:
  - Process each parameter by calling fn on it, obtaining param_applied.
  - Reset the parameter using param_applied.
  - If the parameter has a gradient:
    - Process the gradient by calling fn on it, obtaining grad_applied.
    - Reset the parameter's gradient using grad_applied.
- Traverse _buffers:
  - Process each buffer by calling fn on it and store the result back.
```python
def _apply(self, fn):
    # First apply fn recursively to all sub-modules
    for module in self.children():
        module._apply(fn)

    def compute_should_use_set_data(tensor, tensor_applied):
        if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
            # If the new tensor has compatible tensor type as the existing tensor,
            # the current behavior is to change the tensor in-place using `.data =`,
            # and the future behavior is to overwrite the existing tensor. However,
            # changing the current behavior is a BC-breaking change, and we want it
            # to happen in future releases. So for now we introduce the
            # `torch.__future__.get_overwrite_module_params_on_conversion()`
            # global flag to let the user control whether they want the future
            # behavior of overwriting the existing tensor or not.
            return not torch.__future__.get_overwrite_module_params_on_conversion()
        else:
            return False

    # Traverse _parameters
    for key, param in self._parameters.items():
        if param is not None:
            # Tensors stored in modules are graph leaves, and we don't want to
            # track autograd history of `param_applied`, so we have to use
            # `with torch.no_grad():`
            with torch.no_grad():
                param_applied = fn(param)                 # process the parameter with fn
            should_use_set_data = compute_should_use_set_data(param, param_applied)
            if should_use_set_data:
                param.data = param_applied                # reset the parameter in place
            else:
                assert isinstance(param, Parameter)
                assert param.is_leaf
                # reset the parameter by rebuilding it
                self._parameters[key] = Parameter(param_applied, param.requires_grad)

            if param.grad is not None:
                with torch.no_grad():
                    grad_applied = fn(param.grad)         # process the gradient with fn
                should_use_set_data = compute_should_use_set_data(param.grad, grad_applied)
                if should_use_set_data:
                    param.grad.data = grad_applied        # reset the gradient in place
                else:
                    assert param.grad.is_leaf
                    # reset the gradient by rebuilding it
                    self._parameters[key].grad = grad_applied.requires_grad_(param.grad.requires_grad)

    # Traverse _buffers
    for key, buf in self._buffers.items():
        if buf is not None:
            self._buffers[key] = fn(buf)                  # process the buffer with fn

    return self
```
Therefore, we can see that moving the model to the GPU actually moves the model's self._parameters and self._buffers to the GPU; it does not touch self._modules. Applying .cuda() to a model places the model's parameters (the tensors actually used in computation) into GPU memory.
```
        Before Module.cuda()                    After Module.cuda()
 +------------------------------+        +------------------------------+
 | CPU                          |        | CPU                          |
 |  +-------------+             |        |  +-------------+             |
 |  | Module      |             |        |  | Module      |             |
 |  | _parameters +--> Parameters        |  | _parameters +----------+  |
 |  | _buffers    +--> Buffers  |        |  | _buffers    +-------+  |  |
 |  | _modules    |             |        |  | _modules    |       |  |  |
 |  +-------------+             |        |  +-------------+       |  |  |
 +------------------------------+        +------------------------------+
                                                                  |  |
 +------------------------------+        +------------------------------+
 | GPU                          |        | GPU                   |  |   |
 |                              |        |   Buffers <-----------+  |   |
 |                              |        |   Parameters <-----------+   |
 +------------------------------+        +------------------------------+

 +--------------------------- Module.cuda() ---------------------------> Time
```
Why isn't self._modules moved? It does not need to be: _modules can be thought of as a list of sub-modules that mainly serves as a bridge, letting recursive traversal reach all the parameters of the network; nothing about it needs to change for the subsequent operations.
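A small sketch of this point (assuming a CUDA-capable machine and the ToyModel class defined above): after .cuda(), every parameter tensor lives on the GPU, while _modules still holds exactly the same Module objects.

```python
model = ToyModel()
ids_before = {name: id(m) for name, m in model._modules.items()}

model.cuda()  # equivalent to model.to('cuda')

# The parameters (and buffers, if any) are now CUDA tensors ...
print({name: p.device for name, p in model.named_parameters()})
# e.g. {'net1.weight': device(type='cuda', index=0), ...}

# ... but _modules still references the very same sub-module objects
print(all(id(m) == ids_before[name] for name, m in model._modules.items()))  # True
```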
2.4 summary
Now we can answer the first question: what exactly is going on behind moving the model to the GPU?
Calling cuda or to moves the model's self._parameters and self._buffers to the GPU; it does not move self._modules. The move is applied recursively, so every leaf tensor of the model ends up on the GPU.
0x03 Calling functions on the GPU
3.1 CUDA programming model basics
Let’s start with the basics of the CUDA programming model.
3.1.1 Heterogeneous model
The CUDA programming model is a heterogeneous model: the program runs on a heterogeneous system that consists of a CPU and a GPU separated by a bus, and the two work together at run time.
In CUDA there are two important concepts: host and device.

- Host: the CPU and its memory.
- Device: the GPU and its memory.

A CUDA program is therefore divided into two parts, host code and device code, which run on the CPU and the GPU respectively. The host and the device can communicate with each other to copy data.

- Host code: the part executed on the CPU, compiled with the usual host compilers (GNU GCC on Linux, Microsoft Visual C++ on Windows). Roughly speaking, plain C code works with the CPU and host memory.
- Device code: the part executed on the GPU, compiled with NVIDIA's NVCC compiler. Roughly speaking, CUDA C code works with the GPU and the memory on the GPU (also called device memory).
```
+-------------------+        +--------------------+
|                   |        |                    |
|   +----------+    |        |    +----------+    |
|   |   RAM    |    |        |    |   RAM    |    |
|   +----+-----+    |        |    +----+-----+    |
|        |          +--------+         |          |
|   +----+-----+    |        |    +----+-----+    |
|   |   CPU    |    |        |    |   GPU    |    |
|   +----------+    |        |    +----------+    |
|                   |        |                    |
+-------------------+        +--------------------+
        Host                        Device
```
3.1.2 The idea of parallelism
The core idea of CUDA programming is parallelism, roughly as follows:

- A large task is divided into many simple, repeatable operations, and several threads execute these operations separately, achieving parallelism.
- The data handled by the task is split in the same way: for example, a large piece of data is first divided among several GPUs, the data on each GPU is further divided among multiple thread groups, and the tensors within a thread group may need to be subdivided again into pieces that the tensor cores can process.
Therefore, a typical CUDA program consists of serial and parallel code.
- Serial code is standard C code, executed by host.
- The parallel code is CUDA C code, executed in the device.
The main CUDA program is started by the CPU; that is, the program starts with the host executing serial code. When it reaches a part that requires parallel data processing, the device executes the parallel code. The device can perform most operations independently of the host: once a kernel is launched on the device, control returns to the CPU immediately so that it can carry on with other tasks, which makes the whole process asynchronous.
Figure from docs.nvidia.com/cuda/cuda-c… .
3.1.3 Processing flow
A typical CUDA program execution flow is as follows:
- Allocate host memory space and initialize data.
- Allocate device video memory space.
- The data to be computed is copied from Host memory to device video memory.
- Call CUDA kernel function to complete the user specified operation on the device.
- Copy the calculated result from GPU memory to Host memory.
- Frees memory allocated on device and host.
See the following figure for details.
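Although this flow is described from the CUDA C point of view, the same steps are visible at the PyTorch level. The sketch below is only an illustration using ordinary torch calls on a CUDA-capable machine; each step maps onto one item of the list above.

```python
import torch

# 1. Allocate host memory and initialize the data
x_cpu = torch.randn(1024, 1024)
w_cpu = torch.randn(1024, 1024)

# 2 / 3. Allocate device memory and copy the data from host to device
x_gpu = x_cpu.cuda()
w_gpu = w_cpu.cuda()

# 4. A CUDA kernel (here, the matmul kernel) performs the computation on the device
y_gpu = x_gpu @ w_gpu

# 5. Copy the result from device memory back to host memory
y_cpu = y_gpu.cpu()

# 6. Free the memory allocated on the device (released once the tensors are dropped)
del x_gpu, w_gpu, y_gpu
torch.cuda.empty_cache()
```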
3.2 Functions
3.2.1 kernel function
Kernel functions are functions executed in parallel by device threads. In a CUDA program, before calling a kernel the main program must specify an execution configuration for it, which determines the number of thread blocks, the number of threads per block, and the size of the shared memory. The configuration is written with the <<<param1, param2>>> syntax at the call site, which specifies how many threads the kernel will use and how those threads are organized. The GPU then launches that many threads to execute the kernel in parallel, and each thread is assigned a unique thread id.
| Qualifier | Executed on | Called from | Note |
|---|---|---|---|
| `__global__` | Device | Host (and, on some devices, from device code) | Asynchronous: after the host submits the parallel task to the GPU's task queue, it continues to the next step without waiting for the kernel to finish |
| `__device__` | Device | Device | Cannot be combined with `__global__` |
| `__host__` | Host | Host | Can be omitted; cannot be combined with `__global__`, but can be combined with `__device__`, in which case the function is compiled for both the device and the host |
The calling relationship is as follows: a __host__ function runs on the host and launches __global__ kernels; a __global__ kernel executes on the device and can call __device__ functions; __device__ functions run on the device and are called from device code.
These three qualifiers correspond to three common execution scenarios in CUDA. Because __device__ and __global__ functions run on the GPU, they cannot call ordinary C/C++ library functions (those have no GPU implementation).
The following code is an NVIDIA example that uses the built-in threadIdx variable to add two vectors A and B and store the result in C. VecAdd() is executed once by each of the N threads.
```cpp
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
```
3.2.2 PyTorch sample
Let's look at an example of a kernel function from third_party/cub/cub/device/dispatch/dispatch_reduce.cuh.
```cpp
/**
* Reduce region kernel entry point (multi-block). Computes privatized reductions, one per thread block.
*/
template <
typename ChainedPolicyT, ///< Chained tuning policy
typename InputIteratorT, ///< Random-access input iterator type for reading input items \iterator
typename OutputIteratorT, ///< Output iterator type for recording the reduced aggregate \iterator
typename OffsetT, ///< Signed integer type for global offsets
typename ReductionOpT> ///< Binary reduction functor type having member <tt>T operator()(const T &a, const T &b)</tt>
__launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS))
__global__ void DeviceReduceKernel(
InputIteratorT d_in, ///< [in] Pointer to the input sequence of data items
OutputIteratorT d_out, ///< [out] Pointer to the output aggregate
OffsetT num_items, ///< [in] Total number of input data items
GridEvenShare<OffsetT> even_share, ///< [in] Even-share descriptor for mapping an equal number of tiles onto each thread block
ReductionOpT reduction_op) ///< [in] Binary reduction functor
{
// The output value type
typedef typename If<(Equals<typename std::iterator_traits<OutputIteratorT>::value_type, void>::VALUE), // OutputT = (if output iterator's value type is void) ?
typename std::iterator_traits<InputIteratorT>::value_type, // ... then the input iterator's value type,
typename std::iterator_traits<OutputIteratorT>::value_type>::Type OutputT; // ... else the output iterator's value type
// Thread block type for reducing input tiles
typedef AgentReduce<
typename ChainedPolicyT::ActivePolicy::ReducePolicy,
InputIteratorT,
OutputIteratorT,
OffsetT,
ReductionOpT>
AgentReduceT;
// Shared memory storage
__shared__ typename AgentReduceT::TempStorage temp_storage;
// Consume input tiles
OutputT block_aggregate = AgentReduceT(temp_storage, d_in, reduction_op).ConsumeTiles(even_share);
// Output result
if (threadIdx.x == 0)
d_out[blockIdx.x] = block_aggregate;
}
```
3.3 summary
Now we know that PyTorch performs parallel operations on the GPU by launching __global__ kernel functions. This answers our second question: how do we call GPU operations from the CPU?
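If you want to observe this from Python, one way is to profile a simple GPU operation and look at the CUDA kernels it launches. This is only a sketch: it assumes a CUDA-capable machine and the torch.profiler module available since PyTorch 1.8.1.

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    z = x + y                  # the add is dispatched to a CUDA kernel
    torch.cuda.synchronize()   # wait for the asynchronous kernel to finish

# The table lists the CUDA kernels that were actually launched for the add
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```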
0x04 Switching between GPU and CPU
Let’s look at how to switch between GPU/CPU.
As you can see from the sample code, once we call the cuda method to move the model onto the GPU, we can use CUDA __global__ kernels to perform parallel computation on the GPU.
```python
model = ToyModel().cuda(device_ids[0])  # move the model onto the GPU
ddp_model = DDP(model, device_ids)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
optimizer.zero_grad()
outputs = ddp_model(torch.randn(20, 10))
```
One question we missed was how PyTorch knew to call the GPU’s global kernel at this point. Why doesn’t PyTorch call CPU functions or other device functions? That’s what we need to analyze next.
4.1 The Dispatcher mechanism
This section mainly draws on blog.ezyang.com/2020/09/let… .
4.1.1 The problem
In PyTorch, the final behavior of an operator is determined by a combination of mechanisms, for example:

- The kernel that does the actual work.
- Whether reverse-mode automatic differentiation is enabled, e.g., the flags that make loss.backward() work properly.
- Whether torch.jit.trace is active.
- Whether we are inside a vmap call, in which case the operator exhibits different batching behavior.
So there are many different ways a PyTorch operator can be interpreted, and if we tried to handle all of these behaviors inside a single function called add, the implementation would quickly degenerate into an unmaintainable mess.

We therefore need a mechanism to solve this problem that is more than just an if statement: it is one of the most important abstractions inside PyTorch, and it has to do its job while degrading PyTorch's performance as little as possible. That mechanism is the Dispatcher.
4.1.2 What is the Dispatcher?
What is the dispatcher? For every operator, the Dispatcher maintains a table of function pointers that provides an implementation for each dispatch key; a dispatch key roughly corresponds to a cross-cutting concern in PyTorch. In the figure above you can see that the table contains dispatch entries for the different backends (CPU, CUDA, XLA) as well as for higher-level concepts (such as autograd and tracing). The dispatcher's job is to compute a dispatch key based on the input tensors and some other state, and then jump to the function that the table points to for that key.

Those familiar with C++ may notice that this function pointer table is very similar to a virtual table (vtable) in C++. In C++, virtual functions are implemented by associating each object with a pointer to a vtable that contains the implementations of all the virtual functions of the class in question. In PyTorch we essentially re-implemented virtual tables, but with some differences.
- Dispatch tables are allocated per operator, whereas vtables are allocated per class. This means we can extend the set of supported operators simply by allocating a new dispatch table. In contrast, with C++ objects you can extend a type by subclassing it, but you cannot easily add virtual functions. Unlike typical object-oriented systems, most of PyTorch's extensibility lies in defining new operators rather than new subclasses, so this trade-off is reasonable. In addition, the set of dispatch keys is not publicly extensible; users who want a new dispatch key are expected to submit a patch to the PyTorch core team.
- Our dispatch key computation considers all operator arguments (multiple dispatch) as well as thread-local state (TLS). This is unlike a vtable, where only the first object (the this pointer) matters.
- Finally, the Dispatcher supports boxing and unboxing as part of the op's calling convention. More on this at the end of the article.
Interesting historical note: We used virtual functions to implement dynamic dispatch, and when we realized we needed more power than virtual tables, we re-implemented dynamic Dispatch.
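As a mental model only (the real tables are C++ structures inside the Dispatcher, and the names below are made up for illustration), you can picture each operator carrying its own small table from dispatch key to kernel:

```python
# Toy mental model: one dispatch table per operator, mapping dispatch key -> kernel.

def add_cpu(a, b):    # stand-in for the real CPU kernel
    return [x + y for x, y in zip(a, b)]

def add_cuda(a, b):   # stand-in for the real CUDA kernel (which would launch a __global__ function)
    return [x + y for x, y in zip(a, b)]

dispatch_table = {
    "add": {"CPU": add_cpu, "CUDA": add_cuda},
    # "sub": {...}, "mul": {...}, ...  one table per operator, one column per dispatch key
}

def call(op_name, dispatch_key, *args):
    return dispatch_table[op_name][dispatch_key](*args)

print(call("add", "CUDA", [1, 2], [3, 4]))  # [4, 6]
```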
4.1.3 How dispatch keys are computed

So how exactly do we compute the dispatch key? We do it with a dispatch key set, a basic abstraction that is a bitset of dispatch keys. Roughly speaking, we take the union of the dispatch key sets coming from several sources (and in some cases mask some keys out) to obtain a final dispatch key set, and then pick the highest-priority key in that set (dispatch keys are implicitly ordered by priority); that is the key we call for this invocation. So what are the sources of these dispatch key sets?
- Each tensor input contributes a dispatch key set made up of all the dispatch keys on that tensor (intuitively, these are values such as "CPU", which tells us the tensor is a CPU tensor and should therefore be handled by the CPU entry of the dispatch table).
- We also have a local include set, used for "modal" functionality such as tracing, which is not associated with any tensor but is thread-local state that the user can turn on or off within certain scopes.
- Finally, there is a global set containing dispatch keys that should always be considered. (Since the slides this is based on were written, autograd has moved from the global set onto the tensors; the high-level structure of the system, however, has not changed.)
In addition to these, there is a local exclude set that excludes certain dispatch keys from dispatches. A common scenario is for a handler to process a key and then exclude itself with a local exclude set so that we don’t try to reprocess the key later.
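Again purely as an illustration (the real logic lives in C++, and the key names and priority order below are only examples), the computation can be pictured as combining several key sets, subtracting the local exclude set, and picking the highest-priority key that remains:

```python
# Toy illustration of dispatch key computation; not PyTorch's real code.
PRIORITY = ["Autograd", "Tracer", "CUDA", "CPU"]   # higher priority first (example ordering)

def compute_dispatch_key(tensor_key_sets, local_include, global_keys, local_exclude):
    candidates = set()
    for keys in tensor_key_sets:      # dispatch key set of every tensor argument
        candidates |= keys
    candidates |= local_include       # thread-local "modal" keys, e.g. tracing
    candidates |= global_keys         # keys that are always considered
    candidates -= local_exclude       # keys a handler has already processed
    for key in PRIORITY:
        if key in candidates:
            return key
    raise RuntimeError("no kernel available for this key set")

# A CUDA tensor with autograd in the global set, nothing excluded -> handled by Autograd first
print(compute_dispatch_key([{"CUDA"}], set(), {"Autograd"}, set()))        # Autograd
# After the autograd handler excludes itself and re-dispatches -> the CUDA kernel
print(compute_dispatch_key([{"CUDA"}], set(), {"Autograd"}, {"Autograd"})) # CUDA
```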
4.1.4 Registration

Let's look at how dispatch entries get registered into the dispatch table. This is done through the operator registration API, which is used in three main ways:

- Define a schema for the operator.
- Then register implementations for the corresponding dispatch keys.
- Finally, there is a fallback method, with which a user can define a single handler that applies to all operators for a given key.
To visualize operator registration at work, let’s imagine that all the OP’s dispatch tables work together to form a two-dimensional grid like this:
- On the vertical axis is each OP supported in PyTorch.
- On the horizontal axis is each dispatch key supported by the system.
The operator registration action is to fill in the corresponding implementation in the cells defined on these two axes.
When registering the kernel function for an operator on a particular Dispatch key, we fill in the contents of a cell (blue below).
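Filling in one of these cells can also be done from Python. The sketch below assumes a recent PyTorch that ships the torch.library bindings; the namespace myops and the operator my_add are made-up names for illustration.

```python
import torch
from torch.library import Library

# Define the schema for a new operator in a custom namespace (one row of the grid) ...
lib = Library("myops", "DEF")
lib.define("my_add(Tensor a, Tensor b) -> Tensor")

# ... then register one kernel per dispatch key (one cell per column)
def my_add_cpu(a, b):
    return a + b

def my_add_cuda(a, b):
    return a + b      # a real extension would call a hand-written CUDA kernel here

lib.impl("my_add", my_add_cpu, "CPU")
lib.impl("my_add", my_add_cuda, "CUDA")

x = torch.ones(3)
print(torch.ops.myops.my_add(x, x))   # dispatched to the CPU kernel
```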
4.2 Dispatcher code
Let’s take a look at the source code.
4.2.1 Virtual function table
4.2.1.1 example
We can find some examples of these dispatch tables in aten/src/ATen/native/native_functions.yaml.
```yaml
# the dispatch table for zero_
- func: zero_(Tensor(a!) self) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  variants: method, function
  dispatch:
    CPU, CUDA: zero_
    Meta: zero_meta_
    SparseCPU, SparseCUDA: zero_sparse_
    MkldnnCPU: mkldnn_zero_

# the dispatch table for sub.out
- func: sub.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    CPU, CUDA: sub_out
    SparseCPU, SparseCUDA: sub_out_sparse

# the dispatch table for sub.Tensor
- func: sub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  device_check: NoCheck   # TensorIterator
  variants: function, method
  structured_delegate: sub.out
  dispatch:
    SparseCPU, SparseCUDA: sub_sparse
```
4.2.1.2 Implementation of Operator
We can look at two implementations of zero_. Here is the MkldnnCPU implementation.
```cpp
Tensor& mkldnn_zero_(Tensor& self) {
  using Vec = vec::Vectorized<float>;

  ideep::tensor& x = itensor_from_mkldnn(self);

  auto n = x.get_nelems();
  auto* x_ = static_cast<float*>(x.get_data_handle());
  parallel_for(0, n, 2048, [x_](int64_t begin, int64_t end) {
    vec::map(
        [](Vec /* unused */) { return 0.0; },
        x_ + begin,
        x_ + begin,
        end - begin);
  });
  return self;
}
```
And here is the SparseCPU / SparseCUDA implementation:
```cpp
// --------------------------------------------------------------------
// zero_(SparseTensor)
// --------------------------------------------------------------------

// hummu hummu
SparseTensor& zero_sparse_(SparseTensor& self) {
  AT_ASSERT(self.is_sparse());
  at::zeros_out(self, get_sparse_impl(self)->sizes());
  return self._coalesced_(true);
}
```
4.2.2 Dispatcher definition

Let's look at the definition of Dispatcher; only some of its member variables are shown here.
```cpp
class TORCH_API Dispatcher final {
 private:
  // For direct access to backend fallback information
  friend class impl::OperatorEntry;

  struct OperatorDef final {
    explicit OperatorDef(OperatorName&& op_name)
    : op(std::move(op_name)) {}

    impl::OperatorEntry op;
    size_t def_count = 0;
    size_t def_and_impl_count = 0;
  };
  friend class OperatorHandle;
  template<class> friend class TypedOperatorHandle;

 public:
  static Dispatcher& realSingleton();

  // Stores all operators; each operator's entry holds its different versions,
  // e.g. CPU, CUDA, autograd, ...
  std::list<OperatorDef> operators_;

  // Lookup table from operator name to operator handle
  LeftRight<ska::flat_hash_map<OperatorName, OperatorHandle>> operatorLookupTable_;

  // Map from namespace to debug string (saying, e.g., where the library was defined)
  ska::flat_hash_map<std::string, std::string> libraries_;

  std::array<impl::AnnotatedKernel, static_cast<uint8_t>(DispatchKey::NumDispatchKeys)> backendFallbackKernels_;

  std::unique_ptr<detail::RegistrationListenerList> listeners_;
  std::mutex mutex_;
};
```
4.2.3 Registration
Next, let's look at how kernels are registered into the virtual function table.
```cpp
RegistrationHandleRAII Dispatcher::registerImpl(
  OperatorName op_name,
  c10::optional<DispatchKey> dispatch_key,
  KernelFunction kernel,
  c10::optional<impl::CppSignature> cpp_signature,
  std::unique_ptr<FunctionSchema> inferred_function_schema,
  std::string debug
) {
  std::lock_guard<std::mutex> lock(mutex_);

  auto op = findOrRegisterName_(op_name);

  // register the kernel into the operator's entry
  auto handle = op.operatorDef_->op.registerKernel(
    *this,
    dispatch_key,
    std::move(kernel),
    std::move(cpp_signature),
    std::move(inferred_function_schema),
    std::move(debug)
  );

  ++op.operatorDef_->def_and_impl_count;

  return RegistrationHandleRAII([this, op, op_name, dispatch_key, handle] {
    deregisterImpl_(op, op_name, dispatch_key, handle);
  });
}
```
4.2.3.1 registry
OperatorEntry represents an operator and its dispatch table; only the member variables are shown here.
```cpp
class TORCH_API OperatorEntry final {  // represents an operator and its dispatch table
 public:
  OperatorName name_;
  c10::optional<AnnotatedSchema> schema_;

  // The dispatch table: stores the kernel for every dispatch key
  // (CPU, CUDA, autograd, ...), i.e. every version of this operator.
  std::array<KernelFunction, static_cast<uint8_t>(DispatchKey::NumDispatchKeys)> dispatchTable_;
  DispatchKeyExtractor dispatchKeyExtractor_;

  // All registered kernels, grouped by dispatch key
  ska::flat_hash_map<DispatchKey, std::list<AnnotatedKernel>> kernels_;
};
```
4.2.3.2 Registration Behavior
The final registration action writes the kernel into dispatchTable_.
```cpp
void OperatorEntry::updateDispatchTableEntry_(const c10::Dispatcher& dispatcher, DispatchKey dispatch_key) {
  auto dispatch_ix = static_cast<uint8_t>(dispatch_key);
  dispatchTable_[dispatch_ix] = computeDispatchTableEntry(dispatcher, dispatch_key);
  dispatchKeyExtractor_.setOperatorHasFallthroughForKey(dispatch_key, dispatchTable_[dispatch_ix].isFallthrough());
}
```
4.2.4 how to dispatch
4.2.4.1 Scheduling Basis
PyTorch dispatches an operator to different kernels based on dtype, device, and layout (a small illustration follows the list below).

- Most dtypes (such as int32) can be handled directly through templates, but some operators need the Dispatcher's dynamic dispatch.
- PyTorch tensors can run not only on the CPU but also on GPU, MKL-DNN, and XLA devices, which also requires dynamic dispatch.
- Layout refers to how the elements of a tensor are arranged, for example strided versus sparse layouts, which again requires dynamic dispatch.
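The effect is easy to observe from Python: the same torch.add below ends up in different kernels depending on the device and the layout of its inputs (a simple illustration; the second line assumes a CUDA-capable machine).

```python
import torch

a = torch.randn(4, 4)          # strided CPU tensor    -> CPU kernel
b = a.cuda()                   # strided CUDA tensor   -> CUDA kernel
s = a.to_sparse()              # sparse COO CPU tensor -> SparseCPU kernel

print(torch.add(a, a).device)  # cpu
print(torch.add(b, b).device)  # cuda:0
print(torch.add(s, s).layout)  # torch.sparse_coo
```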
4.2.4.2 Scheduling code
We only show part of the dispatch code here; interested readers can keep digging on their own.
```cpp
template<class Return, class... Args>
C10_DISPATCHER_INLINE_UNLESS_MOBILE Return Dispatcher::call(const TypedOperatorHandle<Return(Args...)>& op, Args... args) const {
  detail::unused_arg_(args...);

  // compute the dispatch key set from the arguments
  auto dispatchKeySet = op.operatorDef_->op.dispatchKeyExtractor()
    .template getDispatchKeySetUnboxed<Args...>(args...);

  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!c10::isAliasDispatchKey(dispatchKeySet.highestPriorityTypeId()));

  // look up the kernel for the highest-priority dispatch key
  const KernelFunction& kernel = op.operatorDef_->op.lookup(dispatchKeySet.highestPriorityTypeId());

#ifndef PYTORCH_DISABLE_PER_OP_PROFILING
  bool pre_sampled = false;
  if (C10_UNLIKELY(at::shouldRunRecordFunction(&pre_sampled))) {
    return callWithDispatchKeySlowPath<Return, Args...>(op, pre_sampled, dispatchKeySet, kernel, std::forward<Args>(args)...);
  }
#endif  // PYTORCH_DISABLE_PER_OP_PROFILING

  // call the kernel
  return kernel.template call<Return, Args...>(op, dispatchKeySet, std::forward<Args>(args)...);
}
```
4.2.4.3 key
Let's look at the definition of DispatchKey. There are too many values to list them all, so only some are shown here.
```cpp
enum class DispatchKey : uint8_t {
  CPU,  // registered at build/aten/src/ATen/RegisterCPU.cpp
  CUDA, // registered at build/aten/src/ATen/RegisterCUDA.cpp
  HIP,  // NB: I think this is not actually used, due to Note [Masquerading as CUDA]
  FPGA, // Xilinx support lives out of tree at
        // https://gitlab.com/pytorch-complex/vitis_kernels
  MSNPU, // unused externally, but tested at
         // test/cpp_extensions/msnpu_extension.cpp
  XLA, // lives out of tree at https://github.com/pytorch/xla
  MLC, // lives out of tree at https://github.com/pytorch/MLCompute
  Vulkan,
  Metal,
  XPU, // For out of tree Intel's heterogeneous computing plug-in
  HPU, // For out of tree & closed source integration of HPU / Habana
  VE,  // For out of tree & closed source integration of SX-Aurora / NEC
  Lazy, // For lazy tensor backends

  // A meta tensor is a tensor without any data associated with it.  (They
  // have also colloquially been referred to as tensors on the "null" device).
  // A meta tensor can be used to dry run operators without actually doing any
  // computation, e.g., add on two meta tensors would give you another meta
  // tensor with the output shape and dtype, but wouldn't actually add anything.
  Meta,

  // Here are backends which specify more specialized operators
  // based on the dtype of the tensor.
  QuantizedCPU,  // registered at build/aten/src/ATen/RegisterQuantizedCPU.cpp
  QuantizedCUDA, // registered at build/aten/src/ATen/RegisterQuantizedCUDA.cpp
  QuantizedXPU,  // For out of tree Intel's heterogeneous computing plug-in

  // This backend is to support custom RNGs; it lets you go
  // to a different kernel if you pass in a generator that is not a
  // traditional CPUGeneratorImpl/CUDAGeneratorImpl.  To make use of this key:
  // 1) set it as a second parameter of at::Generator constructor call in
  //    the user-defined PRNG class.
  // 2) use it as a dispatch key while registering custom kernels
  //    (templatized kernels specialized for user-defined PRNG class)
  // intended for out of tree use; tested by aten/src/ATen/test/rng_test.cpp
  CustomRNGKeyId,

  // Here are backends which specify more specialized operators
  // based on the layout of the tensor.  Note that the sparse backends
  // are one case where ordering matters: sparse multi-dispatches with
  // the corresponding dense tensors, and must be handled before them.
  MkldnnCPU, // registered at build/aten/src/ATen/RegisterMkldnnCPU.cpp
             // NB: not to be confused with MKLDNN, which is Caffe2 only
  SparseCPU,  // registered at build/aten/src/ATen/RegisterSparseCPU.cpp
  SparseCUDA, // registered at build/aten/src/ATen/RegisterSparseCUDA.cpp
  SparseHIP,  // TODO: I think this is not actually used, due to Note
              // [Masquerading as CUDA]
  SparseXPU, // For out of tree Intel's heterogeneous computing plug-in
  SparseVE,  // For out of tree & closed source integration of SX-Aurora / NEC
  SparseCsrCPU,
  SparseCsrCUDA,

  AutogradOther,
  AutogradCPU,
  AutogradCUDA,
  AutogradXLA,
  AutogradLazy,
  AutogradXPU,
  AutogradMLC,
  AutogradHPU,
  ......
};
```
4.2.4.4 Key Usage
Due to space constraints we cannot analyze this in depth; here we only show the device-based scenario. The following function shows how a DeviceType is mapped to a DispatchKey.
```cpp
template <typename Func>
inline CppFunction dispatch(c10::DeviceType type, Func&& raw_f) {
auto deviceTypeToDispatchKey = [](c10::DeviceType t){
switch (t) {
// This list is synchronized with the k-constants in c10/core/DeviceType.h
case c10::DeviceType::CPU:
return c10::DispatchKey::CPU;
case c10::DeviceType::CUDA:
return c10::DispatchKey::CUDA;
case c10::DeviceType::XLA:
return c10::DispatchKey::XLA;
case c10::DeviceType::Lazy:
return c10::DispatchKey::Lazy;
case c10::DeviceType::MLC:
return c10::DispatchKey::MLC;
case c10::DeviceType::Meta:
return c10::DispatchKey::Meta;
case c10::DeviceType::HIP:
return c10::DispatchKey::HIP;
case c10::DeviceType::MSNPU:
return c10::DispatchKey::MSNPU;
case c10::DeviceType::HPU:
return c10::DispatchKey::HPU;
default:
TORCH_CHECK(false,
"Device type ", t, " cannot be overloaded at dispatch time, "
"please file a bug report explaining what you were trying to do.");
}
};
return dispatch(deviceTypeToDispatchKey(type), std::forward<Func>(raw_f));
}
```
4.3 summary
So far we know that, through the Dispatcher mechanism, PyTorch can dispatch an operator to different kernels according to dtype, device, and layout. This answers our third question: how do we seamlessly switch between CPU and GPU operations?

We also have the answer to the fourth question: do we need to move the loss function onto the GPU?

The arguments of the loss function are the outputs of forward propagation and the labels. The outputs are already on the GPU (because the training data is already on the GPU), and the labels were explicitly put on the GPU by the user. So all arguments of the loss function already live on the GPU, and the Dispatcher selects the corresponding GPU operators based on their device; there is no need to move the loss function itself onto the GPU.
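A quick sketch of this point (assuming a CUDA-capable machine): nn.MSELoss holds no parameters, so there is nothing to move; as long as its inputs are CUDA tensors, the Dispatcher picks the CUDA kernels.

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()             # never moved to the GPU
print(list(loss_fn.parameters()))  # []  -- nothing to move

outputs = torch.randn(20, 5, device="cuda", requires_grad=True)  # already on the GPU
labels = torch.randn(20, 5).to("cuda")                           # moved by the user

loss = loss_fn(outputs, labels)    # dispatched to CUDA kernels
print(loss.device)                 # cuda:0
loss.backward()                    # backward also runs on the GPU
```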
Let's lay out the overall logic. The sequence is:

- Move the training inputs to the GPU.
- Assume the forward computation consists of a single operator op1; it is looked up in the Dispatcher with dispatch key device='GPU'.
- The op1-gpu kernel is found and executed, producing the outputs.
- The outputs automatically reside on the GPU.
- The labels are also placed on the GPU.
- Assume the loss function consists of a single operator op2; since its arguments are already on the GPU, it is looked up in the Dispatcher with dispatch key device='GPU'.
- The op2-gpu kernel is found and executed, producing the loss.
```
+--------------+     +--------------------+     +---------------+     +--------------------+     +-----------+
| GPU: Inputs  +---->+ Forward            +---->+ GPU: Outputs  +---->+ Loss Function      +---->+ GPU: loss |
+--------------+     |  op1 -> op1-gpu()  |     +---------------+     |  op2 -> op2-gpu()  |     +-----------+
                     +---------+----------+     +---------------+     +---------+----------+
                               |                | GPU: Labels   +-------------->+
                 'device=GPU'  |                +---------------+               |  'device=GPU'
                               v                                                v
+---------------------------------------------------------------------------------------------+
| Dispatcher     |    XLA    |    CPU    |   Metal    |    GPU    |                            |
|           OP1  |  op1-xla  |  op1-cpu  | op1-metal  |  op1-gpu  |                            |
|           OP2  |  op2-xla  |  op2-cpu  | op2-metal  |  op2-gpu  |                            |
|           OP3  |  op3-xla  |  op3-cpu  | op3-metal  |  op3-gpu  |                            |
+---------------------------------------------------------------------------------------------+
```
In the next article, we will analyze DataParallel, etc.
0xEE Personal information
★★★★ Thoughts on life and technology ★★★★★
Wechat official account: Rosie’s Thoughts
0xFF References
blog.ezyang.com/2020/09/let…
pytorch.org/tutorials/a…
Summary of GPU Multi-card Parallel Training (Taking PyTorch as an example)
Parallel Training Methods for Contemporary Postgraduates (Single machine and Multi-card)
Distributed training from starter to quit
PyTorch Initialization
The dispatcher pytorch
Talk about the Pytorch Dispatcher
Pytorch: Using CUDA to implement an operator
Dynamic generation of the PyTorch ATen code
blog.csdn.net/qq_23858785…
CUDA function prefix
Introduction to CUDA C programming
CPU – GPU parallel processing – CUDA programming from want to start to give up
blog.csdn.net/weixin_4223…
blog.csdn.net/crazy_sunsh…
CPU, GPU, CUDA, CuDNN introduction
CUDA Programming (3): GPU architecture learn about!
CUDA Programming beginner’s Minimalist tutorial
What’s so hard about WRITING CUDA?
PyTorch in A Nutshell
Fully connected layer
Pytorch extension (2) : Pytorch combines C++ and Cuda extension
Pytorch is a combination of C and Cuda
PyTorch source code interpretation cpp_extension: an inside look at the C++/CUDA operator implementation and call flow