Named tensors are designed to make tensors easier to use by allowing users to associate explicit names with tensor dimensions. In most cases, operations that take dimension parameters also accept dimension names, eliminating the need to track dimensions by position. In addition, named tensors use the names to automatically check at runtime that APIs are being used correctly, providing extra safety. Names can also be used to rearrange dimensions, for example, supporting “broadcast by name” instead of “broadcast by position.”
This tutorial is intended as a guide to the named tensor features included in the 1.3 release. By the end of it, you will be able to:
Create tensors with named dimensions, and remove or rename those dimensions
Understand the basics of how operations propagate dimension names
See how naming dimensions enables cleaner code in two key areas: broadcasting operations, and flattening/unflattening dimensions
Finally, we will put named tensors into practice by building a multi-head attention module.
Named tensors in PyTorch were inspired by and developed in collaboration with Sasha Rush. Sasha presented the initial idea and proof of concept in a January 2019 blog post.

Basics: named dimensions
PyTorch now allows tensors to have named dimensions; factory functions take a new “names” argument that associates a name with each dimension. This argument is supported by many factory functions:
tensor
empty
ones
zeros
randn
rand
Now we construct a tensor with named dimensions:
import torch

imgs = torch.randn(1, 2, 2, 3, names=('N', 'C', 'H', 'W'))
print(imgs.names)
tensor.names[i] is the name of the i-th dimension of tensor.
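For example, names can be indexed like an ordinary tuple of strings (or None); a minimal sketch using the imgs tensor created above:

print(imgs.names[0])  # name of the first dimension: 'N'
print(imgs.names[3])  # name of the last dimension: 'W'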
There are two ways to rename the dimensions of a Tensor:
Method #1: set the .names attribute (this changes the names in place)
imgs.names = ['batch', 'channel', 'width', 'height']
print(imgs.names)
Method #2: Specify new names (this changes names out-of-place)
imgs = imgs.rename(channel='C', width='W', height='H')
print(imgs.names)
The preferred way to remove names is to call tensor.rename(None):
imgs = imgs.rename(None)
print(imgs.names)
Unnamed tensors (tensors with no named dimensions) still work as before, and do not show names in their repr.
unnamed = torch.randn(2, 1, 3)
print(unnamed)
print(unnamed.names)
Named tensors do not require all dimensions to be named.
imgs = torch.randn(3, 1, 1, 2, names=('N', None, None, None))
print(imgs.names)
Because named tensors can coexist with unnamed tensors, we need a nice way to write named-tensor-aware code that works with both named and unnamed tensors. Use tensor.refine_names(*names) to refine dimension names and to lift unnamed dims to named dims. Refining a dimension is a rename with the following constraints:
A None dim can be refined to have any name.
A named dim can only be refined to have the same name (the name of a dimension that already has a name cannot be changed).
imgs = torch.randn(3, 1, 1, 2)
named_imgs = imgs.refine_names('N', 'C', 'H', 'W')
print(named_imgs.names)
Refine the last two dims to 'H' and 'W'. (In Python 2, use the string '...' instead of the ... literal.)

named_imgs = imgs.refine_names(..., 'H', 'W')
print(named_imgs.names)
def catch_error(fn):
    try:
        fn()
        assert False
    except RuntimeError as err:
        err = str(err)
        if len(err) > 180:
            err = err[:180] + "..."
        print(err)
named_imgs = imgs.refine_names('N', 'C', 'H', 'W')
Try to change the name of a dimension that already has a name to something else
catch_error(lambda: named_imgs.refine_names('N', 'C', 'H', 'width'))
Most simple operations propagate names. The ultimate goal for named tensors is for all operations to propagate names in a reasonable, intuitive way. Support for many common operations, such as .abs(), was added in the 1.3 release:
print(named_imgs.abs().names)
Accessors and Reductions
You can use dimension names to refer to dimensions instead of positions. These operations also propagate names. Indexing (basic and advanced) has not been implemented yet but is on the roadmap. Using the named_imgs tensor from above, we can do the following:
output = named_imgs.sum('C')
img0 = named_imgs.select('N', 0)
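The resulting names reflect each operation; a small sketch printing them (the expected outputs in the comments assume the named_imgs tensor defined above):

print(output.names)  # ('N', 'H', 'W') -- the reduced 'C' dimension is gone
print(img0.names)    # ('C', 'H', 'W') -- selecting along 'N' removes that dimension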
Name inference
Names are propagated across operations in a two-step process called name inference:
Check names: an operator may perform automatic checks at runtime that certain dimension names must match.
Propagate names: name inference propagates names to the output tensors.
Let's start with a very small example: adding two one-dimensional tensors with no broadcasting.
x = torch.randn(3, names=('X',))
y = torch.randn(3)
z = torch.randn(3, names=('Z',))
Check names: first, we check whether the names of the two tensors match. Two names match if the strings are equal or if at least one of them is None (None can be read as a wildcard name). By this rule, the only one of these three that will fail to add is x + z:
catch_error(lambda: x + z)
Propagate names: unify the two names by returning the more refined of the two. In x + y, 'X' is more refined than None.
print((x + y).names)
Most of the rules for name inference are straightforward, but some of them can have unexpected semantics. Let's look at a couple of scenarios you are likely to encounter: broadcasting and matrix multiplication.

Broadcasting
Named tensors do not change broadcasting behavior itself; broadcasting is still done by position. However, when checking whether two dimensions can be broadcast together, PyTorch also checks that the names of those dimensions match.
As a result, named tensors prevent unintended alignment during operations that broadcast. In the example below, we apply per_batch_scale to imgs.
imgs = torch.randn(2, 2, 2, 2, names=('N', 'C', 'H', 'W'))
per_batch_scale = torch.rand(2, names=('N',))
catch_error(lambda: imgs * per_batch_scale)
Without names, the per_batch_scale tensor would be aligned with the last dimension of imgs, which is not what we intended. We actually want per_batch_scale aligned with the batch dimension of imgs. See the “explicit broadcasting by name” feature below for aligning tensors by name.

Matrix multiplication
torch.mm(A, B) performs a dot product between the second dim of A and the first dim of B, returning a tensor whose first dim is the same as the first dim of A and whose second dim is the same as the second dim of B. Other matrix-multiply functions, such as torch.matmul, torch.mv, and torch.dot, behave similarly.
markov_states = torch.randn(128, 5, names=('batch', 'D'))
transition_matrix = torch.randn(5, 5, names=('in', 'out'))
Perform a state transition process
new_state = markov_states @ transition_matrix
print(new_state.names)
As you can see, matrix multiplication does not check if contracted dimensions have the same name.
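To make this explicit, here is a small sketch (the names 'A', 'D', 'E', 'B' are chosen purely for illustration): torch.mm contracts a dim named 'D' against a dim named 'E' without complaint, and the output simply takes the non-contracted names.

a = torch.randn(3, 4, names=('A', 'D'))
b = torch.randn(4, 2, names=('E', 'B'))
print(torch.mm(a, b).names)  # expected: ('A', 'B') -- no error despite 'D' vs 'E'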
Next, we introduce two new behaviors that named tensors enable: explicit broadcasting by name, and flattening/unflattening dimensions by name.

New behavior one: explicit broadcasting by name
One of the main complaints about working with multiple dimensions is the need to unsqueeze “dummy” dimensions so that certain operations can succeed. For example, in our per-batch-scale example above, with unnamed tensors we would need to do the following:
imgs = torch.randn(2, 2, 2, 2)  # N, C, H, W
per_batch_scale = torch.rand(2)  # N
correct_result = imgs * per_batch_scale.view(2, 1, 1, 1)  # N, C, H, W
incorrect_result = imgs * per_batch_scale.expand_as(imgs)
assert not torch.allclose(correct_result, incorrect_result)
By using named tensors, we can make these operations safer (and easier to perform when the number of dimensions is not known in advance). The new tensor.align_as(other) operation permutes the tensor's dimensions to match the ordering in other.names, adding one-sized dimensions where appropriate (tensor.align_to(*names) also works):
imgs = imgs.refine_names('N', 'C', 'H', 'W')
per_batch_scale = per_batch_scale.refine_names('N')
named_result = imgs * per_batch_scale.align_as(imgs)
Note: named tensors do not yet work with allclose
assert torch.allclose(named_result.rename(None), correct_result)
New behavior two: flattening and unflattening dimensions by name
A common operation is flattening and unflattening dimensions. Currently, users perform this using view, reshape, or flatten; common uses include flattening batch dimensions to send tensors into operators that must take inputs with a certain number of dimensions (e.g., conv2d takes 4D input).
To make these operations more semantically meaningful than view or reshape, we introduce a new tensor.unflatten(dim, namedshape) method and update flatten to work with names: tensor.flatten(dims, new_dim).
flatten can only flatten adjacent dimensions but also works on non-contiguous dims. One must pass into unflatten a named shape, which is a list of (dim, size) tuples, to specify how to unflatten the dim. It is possible to save the sizes during a flatten for unflatten but we do not yet do that.
imgs = imgs.flatten(['C', 'H', 'W'], 'features')
print(imgs.names)
imgs = imgs.unflatten('features', (('C', 2), ('H', 2), ('W', 2)))
print(imgs.names)
Autograd support
Autograd currently ignores names and treats all tensors as regular, unnamed tensors. Gradient computation is still correct, but it loses the safety that names provide. Handling names in autograd is on the development roadmap.
x = torch.randn(3, names=('D',))
weight = torch.randn(3, names=('D',), requires_grad=True)
loss = (x - weight).abs()
grad_loss = torch.randn(3)
loss.backward(grad_loss)
correct_grad = weight.grad.clone()
print(correct_grad)  # unnamed for now; future versions will propagate the name
weight.grad.zero_()
grad_loss = grad_loss.refine_names('C')
loss = (x - weight).abs()
Ideally, we would check that the names of loss and grad_loss match, but we haven't implemented that yet
loss.backward(grad_loss)
print(weight.grad)  # still unnamed
assert torch.allclose(weight.grad, correct_grad)
Some other supported and unsupported features
For a detailed breakdown of what version 1.3 supports, see here.
In particular, we would like to point out three important features that are not currently supported:
Saving or loading named tensors via torch.save or torch.load
Multiprocessing via torch.multiprocessing
JIT support; for example, the following code will fail
imgs_named = torch.randn(1, 2, 2, 3, names=('N', 'C', 'H', 'W'))
@torch.jit.script
def fn(x):
    return x
catch_error(lambda: fn(imgs_named))
As a workaround, drop the names via tensor = tensor.rename(None) before using anything that does not yet support named tensors.
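For instance, one possible stopgap for saving and loading is sketched below (the file name 'imgs.pt' is arbitrary, and it assumes you know the names so they can be re-attached after loading):

torch.save(imgs_named.rename(None), 'imgs.pt')    # drop names before saving
loaded = torch.load('imgs.pt')                    # comes back unnamed
loaded = loaded.refine_names('N', 'C', 'H', 'W')  # re-attach the names by hand
print(loaded.names)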
A longer example: multi-head attention

Now we'll work through a complete example of implementing a PyTorch nn.Module: multi-head attention. We assume the reader is already familiar with multi-head attention; if you are new to it, see this explanation or this explanation.
We adapt the implementation of multi-head attention from ParlAI; in particular, the code here. Read through that code, then compare it with the code below. Notice the four comments marked (I), (II), (III), and (IV), where named tensors are used to make the code more readable; we'll dive into each of them after the code block.
import torch.nn as nn
import torch.nn.functional as F
import math
class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, dim, dropout=0):
        super(MultiHeadAttention, self).__init__()
        self.n_heads = n_heads
        self.dim = dim

        self.attn_dropout = nn.Dropout(p=dropout)
        self.q_lin = nn.Linear(dim, dim)
        self.k_lin = nn.Linear(dim, dim)
        self.v_lin = nn.Linear(dim, dim)
        nn.init.xavier_normal_(self.q_lin.weight)
        nn.init.xavier_normal_(self.k_lin.weight)
        nn.init.xavier_normal_(self.v_lin.weight)
        self.out_lin = nn.Linear(dim, dim)
        nn.init.xavier_normal_(self.out_lin.weight)

    def forward(self, query, key=None, value=None, mask=None):
        # (I)
        query = query.refine_names(..., 'T', 'D')
        self_attn = key is None and value is None
        if self_attn:
            mask = mask.refine_names(..., 'T')
        else:
            mask = mask.refine_names(..., 'T', 'T_key')  # enc attn

        dim = query.size('D')
        assert dim == self.dim, \
            f'Dimensions do not match: {dim} query vs {self.dim} configured'
        assert mask is not None, 'Mask is None, please specify a mask'
        n_heads = self.n_heads
        dim_per_head = dim // n_heads
        scale = math.sqrt(dim_per_head)

        # (II)
        def prepare_head(tensor):
            tensor = tensor.refine_names(..., 'T', 'D')
            return (tensor.unflatten('D', [('H', n_heads), ('D_head', dim_per_head)])
                          .align_to(..., 'H', 'T', 'D_head'))

        assert value is None
        if self_attn:
            key = value = query
        elif value is None:
            # key and value are the same, but query differs
            key = key.refine_names(..., 'T', 'D')
            value = key

        dim = key.size('D')

        # Distinguish between query_len (T) and key_len (T_key) dims.
        k = prepare_head(self.k_lin(key)).rename(T='T_key')
        v = prepare_head(self.v_lin(value)).rename(T='T_key')
        q = prepare_head(self.q_lin(query))

        dot_prod = q.div_(scale).matmul(k.align_to(..., 'D_head', 'T_key'))
        dot_prod.refine_names(..., 'H', 'T', 'T_key')  # just a check

        # (III)
        attn_mask = (mask == 0).align_as(dot_prod)
        dot_prod.masked_fill_(attn_mask, -float(1e20))

        attn_weights = self.attn_dropout(F.softmax(dot_prod / scale, dim='T_key'))

        # (IV)
        attentioned = (
            attn_weights.matmul(v).refine_names(..., 'H', 'T', 'D_head')
            .align_to(..., 'T', 'H', 'D_head')
            .flatten(['H', 'D_head'], 'D')
        )

        return self.out_lin(attentioned).refine_names(..., 'T', 'D')
(I) Refine the dimensions of the input tensors
def forward(self, query, key=None, value=None, mask=None):
    # (I)
    query = query.refine_names(..., 'T', 'D')
The query = query.refine_names(..., 'T', 'D') serves as enforceable documentation and lifts unnamed input dimensions to named dimensions. It checks that the last two dimensions can be refined to ['T', 'D'], preventing potentially silent or confusing size-mismatch errors later on.
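As a small illustration of that lifting step (a sketch with an arbitrary unnamed 3-D tensor standing in for query):

q = torch.randn(7, 5, 6)           # fully unnamed
q = q.refine_names(..., 'T', 'D')
print(q.names)                     # (None, 'T', 'D') -- the leading dim stays unnamed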
(II) Manipulating the dimensions of tensors in prepare_head
# (II)
def prepare_head(tensor):
    tensor = tensor.refine_names(..., 'T', 'D')
    return (tensor.unflatten('D', [('H', n_heads), ('D_head', dim_per_head)])
                  .align_to(..., 'H', 'T', 'D_head'))
The first thing to notice is how the code clearly states the input and output dimensions: input tensors must end in dimensions T and D, and output tensors must end in dimensions H, T, and D_head.
The second thing to notice is how clearly the code describes what is happening. prepare_head takes the key, query, and value, splits the embedding dim into multiple heads, and finally rearranges the dim order to be [..., 'H', 'T', 'D_head']. ParlAI's prepare_head looks like the following, using view and transpose operations:
def prepare_head(tensor):
    # input is [batch_size, seq_len, n_heads * dim_per_head]
    # output is [batch_size * n_heads, seq_len, dim_per_head]
    batch_size, seq_len, _ = tensor.size()
    tensor = tensor.view(batch_size, tensor.size(1), n_heads, dim_per_head)
    tensor = (
        tensor.transpose(1, 2)
        .contiguous()
        .view(batch_size * n_heads, seq_len, dim_per_head)
    )
    return tensor
Our named-tensor variant of prepare_head uses more verbose operations, but carries more semantic meaning than the view-and-transpose version and includes enforceable documentation in the form of names.
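To see that input/output contract concretely, here is a standalone sketch of the named version (with hypothetical sizes n_heads = 3 and dim_per_head = 2, so D = 6; the function name prepare_head_named is ours):

n_heads, dim_per_head = 3, 2

def prepare_head_named(tensor):
    tensor = tensor.refine_names(..., 'T', 'D')
    return (tensor.unflatten('D', [('H', n_heads), ('D_head', dim_per_head)])
                  .align_to(..., 'H', 'T', 'D_head'))

x = torch.randn(7, 5, 6, names=('N', 'T', 'D'))
print(prepare_head_named(x).names)  # expected: ('N', 'H', 'T', 'D_head')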
(III) Explicit broadcasting by name
def ignore():
    # (III)
    attn_mask = (mask == 0).align_as(dot_prod)
    dot_prod.masked_fill_(attn_mask, -float(1e20))
Mask usually has dimensions [N, T] (in the case of self attention) or [N, T, T_key] (in the case of encoder attention), while dot_prod has dimensions [N, H, T, T_key]. To make mask broadcast correctly with dot_prod, we would usually unsqueeze dims 1 and -1 in the case of self attention, or unsqueeze dim 1 in the case of encoder attention. With named tensors, we simply align attn_mask to dot_prod using align_as and stop worrying about where to unsqueeze dims.
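A side-by-side sketch of the self-attention case (the sizes N=2, H=3, T=5 are hypothetical; only the shapes and names matter here):

N, H, T = 2, 3, 5

# Unnamed approach: manually unsqueeze dims 1 and -1 so the mask lines up.
mask_plain = torch.ones(N, T)
dot_plain = torch.randn(N, H, T, T)
attn_mask_plain = (mask_plain == 0).unsqueeze(1).unsqueeze(-1)  # [N, 1, T, 1]
dot_plain.masked_fill_(attn_mask_plain, -float(1e20))

# Named approach: align_as figures out where the singleton dims go.
mask_named = torch.ones(N, T, names=('N', 'T'))
dot_named = torch.randn(N, H, T, T, names=('N', 'H', 'T', 'T_key'))
attn_mask_named = (mask_named == 0).align_as(dot_named)  # names ('N', 'H', 'T', 'T_key')
dot_named.masked_fill_(attn_mask_named, -float(1e20))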
(IV) More dimension manipulation using align_to and flatten
def ignore():
    # (IV)
    attentioned = (
        attn_weights.matmul(v).refine_names(..., 'H', 'T', 'D_head')
        .align_to(..., 'T', 'H', 'D_head')
        .flatten(['H', 'D_head'], 'D')
    )
Here, as in (II), align_to and flatten are more semantically meaningful (albeit more verbose) than view and transpose.

Running the example
n, t, d, h = 7, 5, 2 * 3, 3
query = torch.randn(n, t, d, names=('N', 'T', 'D'))
mask = torch.ones(n, t, names=('N', 'T'))
attn = MultiHeadAttention(h, d)
output = attn(query, mask=mask)
works as expected!
print(output.names)
The above works as expected. Also, note that the code never mentions the name of the batch dimension at all. In fact, our MultiHeadAttention module is agnostic to the existence of batch dimensions.
query = torch.randn(t, d, names=('T', 'D'))
mask = torch.ones(t, names=('T',))
output = attn(query, mask=mask)
print(output.names)