Today marks 529 days since OneFlow was open-sourced, and OneFlow v0.6.0 has been officially released. The update covers the framework, models, and OneFlow-ONNX, including:
- Performance improvements to static graphs, dynamic graphs, operators, GPU memory usage, and more
- A large number of new common operators
- More complete static graph (nn.Graph) and ConsistentTensor functionality
- Support for OneFlow as a backend of NVIDIA Triton for model serving
- A rich set of pre-trained vision models, aligned with TorchVision and TIMM
- More complete OneFlow-ONNX conversion functionality
The update details are as follows.
Framework optimizations
1. Deep optimization of nn.Graph performance
- Compared with v0.5.0, nn.Graph in v0.6.0 improves training speed by 10% on ResNet AMP and WDL models
- The new version's dynamic-to-static conversion still had room for performance optimization, so we recently focused on optimizing nn.Graph in high-frequency iterative training scenarios
- The graph scheduling instructions of nn.Graph have been redesigned, and the interaction logic between the Actor Graph and the Eager VM has been reworked, so that the Graph runtime and the Python-side input/output tensors run as asynchronously and in parallel as possible
2. Deep optimization of Eager performance
- Compared with v0.5.0, OneFlow Eager in v0.6.0 significantly improves training speed in small-batch scenarios
- Deeply optimized the VM scheduling logic
- Optimized getitem/setitem
- Optimized tensor.numel()
- Optimized oneflow.Size()
3. Deep optimization of operator performance
- Operators that were performance bottlenecks for new models have been optimized, noticeably improving the training speed of the related models
- Added fused Dropout series operators
- Added a CPU version of grouped deconvolution with improved performance
- Added inplace implementations for the following operators: mul, hardsigmoid, sin
- Optimized linalg.vector_norm for ord=2.0, now 4x faster than before
- Deeply optimized the LayerNorm op, with performance significantly ahead of the PyTorch and Apex implementations
4. Deep optimization of Eager GPU memory usage
- The GPU memory usage of some operators during network training has been optimized, so that the same device can run larger models or more data
- Optimized the backward GPU memory usage of broadcast binary operators
- Optimized the backward GPU memory usage of the Slice operator
- Optimized LayerNorm's GPU memory footprint
5. Many new utility features for the static graph nn.Graph
- nn.Graph adds many new features for efficiency, debugging, completeness, and ease of use in more scenarios (a minimal usage sketch follows this list):
- To assist static graph debugging, we added:
- Debug mode supports graph.debug(1) to print more graph-construction information
- Provided the environment variable ONEFLOW_DEBUG_PASS to show how the computation graph changes before and after compile-time graph-optimization passes
- Added human-readable thread names to Nsight profiles to make it easier to locate critical threads
- Enriched static graph test cases: nn.Graph tests are now generated automatically alongside the Eager tests
- To support deploying models with nn.Graph (Serving), graph.save() and load() interfaces are provided
- To speed up AMP on GPUs with Tensor Cores, the environment variable ONEFLOW_ENABLE_NHWC is provided to make CNN-related operators compute in channels-last (NHWC) layout
- Made nn.Graph support more usage scenarios:
- Supports sparse-update optimizers, used for sparse parameter updates in WDL scenarios
- Supports using the Sequential, ModuleList, ModuleDict, ParameterList, and ParameterDict containers under nn.Graph
- Supports creating the Optimizer in nn.Graph's __init__ function
- nn.Graph supports multiple parameters sharing the same Tensor
- Supports scenarios where the actual number of processes is greater than the number of GPU devices
- Consistent SBP inference under nn.Graph now takes inplace into account, and more inplace operations are supported
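Below is a minimal sketch of how these pieces fit together: a small training graph that registers its optimizer in __init__, turns on debug output, and runs one iteration. The model, loss, and shapes are illustrative, and graph.debug(1) follows the debug interface described above.

```python
import oneflow as flow
import oneflow.nn as nn

class LinearTrainGraph(nn.Graph):
    def __init__(self, model, loss_fn, optimizer):
        super().__init__()
        self.model = model
        self.loss_fn = loss_fn
        # Creating/registering the optimizer in __init__ is now supported.
        self.add_optimizer(optimizer)

    def build(self, x, y):
        # build() describes one training step and is compiled into a static graph.
        loss = self.loss_fn(self.model(x), y)
        loss.backward()
        return loss

model = nn.Linear(8, 1)
graph = LinearTrainGraph(model, nn.MSELoss(),
                         flow.optim.SGD(model.parameters(), lr=0.1))
graph.debug(1)  # print extra graph-construction information

x, y = flow.randn(4, 8), flow.randn(4, 1)
print(graph(x, y))
```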
6. A large number of new operators
- New operators: cumsum, meshgrid, linspace, diagonal, movedim, roi_align, nms, arccos, roll
- New operators: masked_fill, floordiv, glu, pool1d, pool2d, pool3d
- Added Unfold and Fold ops
- Implemented automatic data type promotion for ops
- Implemented the expand and repeat ops
- Models from the current TorchVision library can be switched over with a one-line change, import oneflow as torch (see the sketch after this list)
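As a rough illustration of the one-line switch together with a few of the newly added operators (values printed are not the point here):

```python
import oneflow as torch  # a PyTorch-style script can target OneFlow by swapping this import

x = torch.arange(6, dtype=torch.float32).reshape(2, 3)
print(torch.cumsum(x, dim=1))            # newly added cumsum
print(torch.roll(x, shifts=1, dims=1))   # newly added roll
grid_x, grid_y = torch.meshgrid(torch.linspace(0, 1, 3),
                                torch.linspace(0, 1, 3))
print(grid_x.shape, grid_y.shape)
```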
7. Support for user-defined autograd.Function
- Users can define a custom autograd.Function just as in PyTorch; a minimal sketch follows
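A minimal sketch of a custom Function, assuming the PyTorch-style interface the note refers to (the ctx.save_for_backward / saved_tensors names below follow the PyTorch convention and are assumptions here):

```python
import oneflow as flow

class MyReLU(flow.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)          # stash inputs needed by the backward pass
        return flow.clamp(x, min=0.0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x > 0).to(grad_output.dtype)

x = flow.randn(4, requires_grad=True)
y = MyReLU.apply(x)
y.sum().backward()
print(x.grad)
```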
8. Basic Serving functionality
- OneFlow can act as a backend of NVIDIA Triton to provide model serving
9. New Tensor functionality
- Tensors support 2-D SBP to express arbitrary hybrid parallelism, for example a Linear operation that is data-parallel along the row direction of the device matrix and model-parallel along the column direction (a sketch follows this list)
- Supports converting a Tensor from arbitrary 1-D SBP to 2-D SBP, so a network can mix 1-D and 2-D parallelism
- Supports constructing a ConsistentTensor from numpy
- Added oneflow.from_numpy()
- Added oneflow.numel()
- Added tensor.expand_as()
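A short interop sketch for the new helpers, plus a heavily hedged 2-D SBP example. The placement constructor and to_consistent call below are written from the general consistent-tensor API and assume a 4-GPU job launched with 4 ranks; treat the exact signatures as assumptions for v0.6.0.

```python
import numpy as np
import oneflow as flow

# New numpy interop helpers
x = flow.from_numpy(np.ones((3, 1), dtype=np.float32))
y = flow.zeros(3, 4)
print(flow.numel(y))             # 12
print(x.expand_as(y).shape)      # (3, 4)

# 2-D SBP sketch: a 2x2 device matrix, data-parallel along rows (split batch dim)
# and model-parallel along columns (split feature dim).
placement = flow.placement("cuda", {0: [0, 1, 2, 3]}, (2, 2))  # assumed constructor form
sbp = [flow.sbp.split(0), flow.sbp.split(1)]
t = flow.randn(8, 8).to_consistent(placement=placement, sbp=sbp)
```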
Model implementation
Released flowvision 0.0.54
(link: github.com/Oneflow-Inc…)
1. A rich set of pre-trained vision models (a loading sketch follows the model list)
Image classification
- CNN series: ResNet, DenseNet, VGG, ResNeXt, EfficientNet, etc.
- Vision Transformer series: ViT, PVT, Swin Transformer, etc.
- Vision MLP series: MLP-Mixer, ResMLP, gMLP, etc.
Object detection
- SSD, SSDLite
- Faster R-CNN
- RetinaNet
Image segmentation
- FCN
- DeepLabV3
Style transfer
- StyleNet: supports the styles sketch, candy, mosaic, rain_princess, and udnie
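For example, loading a pre-trained classifier from the model zoo (assuming flowvision mirrors the torchvision.models entry points, as the alignment above suggests):

```python
import oneflow as flow
import flowvision.models as models

model = models.resnet50(pretrained=True)   # downloads OneFlow weights
model.eval()

x = flow.randn(1, 3, 224, 224)
with flow.no_grad():
    logits = model(x)
print(logits.shape)                        # (1, 1000)
```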
2. Data augmentation operations aligned with TorchVision
Data augmentation operations aligned with TorchVision, such as CenterCrop and ColorJitter, are provided; in most scenarios, import flowvision as torchvision works as a drop-in replacement, as sketched below.
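A minimal preprocessing pipeline written against flowvision.transforms; the transform names are the TorchVision-aligned ones mentioned above:

```python
import flowvision.transforms as transforms  # drop-in for torchvision.transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
# preprocess(pil_image) returns an oneflow tensor scaled to [0, 1], ready for a model
```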
3. Advanced data augmentation aligned with TIMM
Advanced data augmentation operations are implemented in flowvision.data (a usage sketch follows this list):
- Mixup
- CutMix
- Random-Erasing
- AutoAugment
- RandAugment
- AugMix
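A usage sketch assuming flowvision.data.Mixup mirrors the timm.data.Mixup interface (the constructor arguments below are the timm ones and should be treated as assumptions):

```python
import oneflow as flow
from flowvision.data import Mixup   # assumed import path

mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, prob=1.0,
                 label_smoothing=0.1, num_classes=1000)

images = flow.randn(8, 3, 224, 224)
targets = flow.randint(0, 1000, (8,))
images, soft_targets = mixup_fn(images, targets)   # mixed images and soft labels
```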
4. A separate layers module providing plug-and-play blocks for model building
flowvision.layers.attention module
- Implements Non-Local, SELayer, CBAM, BAM, ECA, and other plug-and-play attention modules (a usage sketch follows this section)
flowvision.layers.blocks module
- Provides modules such as PatchEmb, Pooler, ConvBnAct, etc., that may be needed when building models
flowvision.layers.regularization module
- In addition, separate files such as activation and weight_init provide activation functions, initialization methods, and other components
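As a sketch of the plug-and-play idea, dropping an attention block onto an existing feature map (the import path and the SELayer(channels) constructor are assumptions based on the module list above):

```python
import oneflow as flow
from flowvision.layers.attention import SELayer   # assumed import path

se = SELayer(64)                  # assumed: takes the channel count of the feature map
feat = flow.randn(1, 64, 16, 16)  # NCHW feature map
out = se(feat)                    # same shape, channel-wise reweighted
print(out.shape)
```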
OneFlow-ONNX conversion
Updated the OneFlow-to-ONNX model conversion toolkit:
- Supports converting OneFlow models in both CPU and GPU mode to ONNX models
- Added operator and model test cases, aligning all classification models in the flowvision library
- Fixed an ONNX Runtime bug in PReLU conversion
- Compatible with ONNX Runtime 1.9.0 and later
- Installation: pip install oneflow-onnx (a conversion sketch follows this list)
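A hypothetical conversion sketch: the helper name convert_to_onnx_and_check and its arguments are assumptions based on the project's test examples and should be verified against the oneflow-onnx README.

```python
import oneflow as flow
import flowvision.models as models
from oneflow_onnx.oneflow2onnx.util import convert_to_onnx_and_check  # assumed entry point

class InferenceGraph(flow.nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def build(self, x):
        return self.model(x)

model = models.resnet18(pretrained=True)
model.eval()
graph = InferenceGraph(model)
# Run once so the static graph knows its input shape, then export to ONNX.
graph(flow.randn(1, 3, 224, 224))
convert_to_onnx_and_check(graph, onnx_model_path="./")  # assumed signature
```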
OneFlow, a new generation of open-source deep learning framework: github.com/Oneflow-Inc…