Abstract: In this article, I want to share some practical applications of THOR. Parts of the THOR algorithm are currently open source in MindSpore.
This article is shared from the Huawei Cloud community post "MindSpore self-developed high-order optimizer source code analysis and practical application"; original author: HWCloudai.
In this article, I want to share some practical applications of THOR. Part of the THOR algorithm is currently open source in MindSpore:
https://gitee.com/mindspore/m…
Training a network with THOR is very easy in MindSpore; let me first show how to use it in four lines of code.
```python
from mindspore.nn.optim import THOR

# create the network
net = Net()
# define the THOR optimizer
opt = THOR(net, lr, Tensor(damping), config.momentum, config.weight_decay, config.loss_scale,
           config.batch_size, split_indices=split_indices)
# add the extra computation graph so THOR can reuse the second-order information
model = ConvertModelUtils().convert_to_thor_model(model=model, network=net, loss_fn=loss, optimizer=opt,
                                                  loss_scale_manager=loss_scale, metrics={'acc'},
                                                  amp_level="O2", keep_batchnorm_fp32=False,
                                                  frequency=config.frequency)
# train the network
model.train(config.epoch_size, dataset, callbacks=cb,
            sink_size=dataset.get_dataset_size(), dataset_sink_mode=True)
```
- Import the package required by the second-order optimizer THOR;
- The first line of code creates the network;
- The second line defines the THOR optimizer we use;
- The third line adds an extra computation graph so that THOR achieves better performance;
- The fourth line trains the network.
Let's go into a bit more detail. First, import the second-order optimizer package provided by MindSpore, located at mindspore.nn.optim.
Then create the network you need; then define the THOR optimizer, passing in the network information and THOR's hyperparameters (such as the learning rate, the regularization-term coefficient, etc.).
Then call the convert_to_thor_model function, which lets THOR achieve better performance by adding an extra computation graph. What does that mean? A running network is a computation graph, and THOR reuses outdated (stale) second-order information: two graphs are built, one that updates the second-order matrices and one that does not, and switching between them yields better performance. (P.S. MindSpore supports both dynamic and static graphs; static graphs are used here for better performance. Readers interested in this topic can follow this link: https://mindspore-website.obs…)
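To make the stale-second-order idea concrete, here is a minimal, framework-free sketch (plain Python of my own, not the MindSpore implementation): the second-order matrices are refreshed only every `frequency` steps, and the cached inverses are reused in between. All the callables passed in are placeholders for illustration.

```python
# Conceptual sketch of THOR's "two graphs" idea, not the MindSpore implementation:
# one path refreshes the second-order matrices, the other reuses the stale ones.
def train_with_stale_second_order(num_steps, frequency, compute_grads,
                                  refresh_second_order, precondition, apply_update):
    cached_inverses = None
    for step in range(num_steps):
        grads = compute_grads(step)                   # forward + backward, first-order gradients
        if step % frequency == 0:
            cached_inverses = refresh_second_order()  # expensive: rebuild and invert A/G
        grads = precondition(grads, cached_inverses)  # cheap: reuse the cached (possibly stale) inverses
        apply_update(grads)                           # momentum/SGD-style parameter update
```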
Finally, call model.train to start training. That is a brief introduction to how to use it; now let's look at the source code.
Source code analysis
THOR currently supports GPU and Ascend, implemented as two classes: THOR_GPU(Optimizer) and THOR_Ascend(Optimizer). The main difference between the two classes is the operators they use. Let's take THOR_Ascend(Optimizer) as an example.
```python
class THOR_Ascend(Optimizer):
    def __init__(self, net, learning_rate, damping, momentum, weight_decay=0.0, loss_scale=1.0, batch_size=32,
                 decay_filter=lambda x: x.name not in [], split_indices=None):
        params = filter(lambda x: x.requires_grad, net.get_parameters())
        super(THOR_Ascend, self).__init__(learning_rate, params, weight_decay, loss_scale)
        if isinstance(momentum, float) and momentum < 0.0:
            raise ValueError("momentum should be at least 0.0, but got momentum {}".format(momentum))
        self.momentum = Parameter(Tensor(momentum, mstype.float32), name="momentum")
        self.params = self.parameters
        self.moments = self.params.clone(prefix="moments", init='zeros')
        self.hyper_map = C.HyperMap()
        self.opt = P.ApplyMomentum()
        self.net = net
        self.matrix_A_cov = ParameterTuple(filter(lambda x: 'matrix_A' in x.name, net.get_parameters()))
        self.matrix_G_cov = ParameterTuple(filter(lambda x: 'matrix_G' in x.name, net.get_parameters()))
        ...
```
All optimizers in MindSpore inherit from the Optimizer base class, which defines some basic functions (such as obtaining the learning rate, gradient scaling, etc.). THOR's initialization takes the hyperparameters passed in, stores them as class attributes for easy access later, and defines the operators that will be used in subsequent calculations.
In other words, the initialization function defines the operators and variables (Parameter, Tensor, etc.) needed for the THOR computation.
Among them, self.matrix_A_cov and self.matrix_G_cov hold the information needed to compute the second-order gradient: the covariance matrix of each layer's input and the covariance matrix of the first-order derivative of each layer's output, respectively. They are saved during the forward and backward passes at run time.
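As a concrete illustration of what those two covariance matrices contain for a single fully connected layer, here is a small NumPy sketch (the shapes and variable names are my own, not MindSpore's):

```python
import numpy as np

batch, d_in, d_out = 32, 128, 64
a = np.random.randn(batch, d_in)    # layer inputs saved during the forward pass
g = np.random.randn(batch, d_out)   # gradients w.r.t. the layer output saved during the backward pass

A = a.T @ a / batch                 # (d_in, d_in): input covariance, what matrix_A_cov holds per layer
G = g.T @ g / batch                 # (d_out, d_out): output-gradient covariance, what matrix_G_cov holds
print(A.shape, G.shape)             # (128, 128) (64, 64)
```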
Let's look at the input parameters when creating THOR (an illustrative construction follows the list):
- net: the model built for this training run;
- learning_rate: learning-rate hyperparameter;
- damping: hyperparameter for the regularization term added to the second-order matrices;
- momentum: momentum hyperparameter;
- weight_decay: weight decay, used to prevent overfitting; defaults to 0.0, i.e. no weight decay;
- loss_scale: used to scale the loss during training to prevent gradient overflow; defaults to 1.0, i.e. no scaling;
- batch_size: the amount of data used for one training step; defaults to 32;
- decay_filter: selects which layers weight decay is applied to (takes effect when weight_decay > 0);
- split_indices: used to speed up the AllReduce process by splitting parameters into groups.
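Putting these parameters together, a hedged example of constructing the optimizer might look like the following. The values are purely illustrative, and real scripts build the per-step lr/damping schedules with helpers such as get_thor_lr / get_thor_damping rather than the constant arrays used here; `net` is assumed to be an already constructed network.

```python
import numpy as np
from mindspore import Tensor
from mindspore import dtype as mstype
from mindspore.nn.optim import THOR

total_steps = 5004 * 45                                       # illustrative: steps_per_epoch * epochs
lr = Tensor(np.full(total_steps, 0.05), mstype.float32)       # per-step learning-rate schedule
damping = Tensor(np.full(total_steps, 0.03), mstype.float32)  # per-step damping schedule for A/G

opt = THOR(net, lr, damping, 0.9,                             # momentum
           weight_decay=5e-4,                                 # 0.0 would disable weight decay
           loss_scale=128.0,                                  # 1.0 would disable loss scaling
           batch_size=32,
           decay_filter=lambda x: 'bn' not in x.name and 'bias' not in x.name,
           split_indices=[26, 53])                            # fuse AllReduce into a few groups
```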
- The _get_Ainv_Ginv_Amax_Gmax_list function computes the inverses of the covariance matrices A and G and returns the inverted matrices. The specific process is to traverse all layers of the model and handle them layer by layer: add the regularization term to each layer's covariance matrix, then invert it via Cholesky decomposition. The open-source THOR currently supports processing fully connected layers and convolutional layers. (A small NumPy sketch of the damped Cholesky inverse follows the code snippet.)
```python
    def _get_Ainv_Ginv_Amax_Gmax_list(self, gradients, damping_step, matrix_a_allreduce, matrix_g_allreduce,
                                      matrix_a_max_allreduce, matrix_g_max_allreduce):
        """get matrixA inverse list, matrixG inverse list, matrixA_max list, matrixG_max list"""
        for i in range(len(self.params)):
            thor_layer_count = self.weight_fim_idx_map[i]
            conv_layer_count = self.weight_conv_idx_map[i]
            layer_type = self.weight_layerType_idx_map[i]
            if layer_type in [Conv, FC, Embedding]:
                g = gradients[i]
                matrix_A = self.matrix_A_cov[thor_layer_count]
                matrix_G = self.matrix_G_cov[thor_layer_count]
                matrix_A = F.depend(matrix_A, g)
                matrix_G = F.depend(matrix_G, g)
                A_shape = self.shape(matrix_A)
                A_eye = self.eye(A_shape[0], A_shape[0], mstype.float32)
                G_shape = self.shape(matrix_G)
                G_eye = self.eye(G_shape[0], G_shape[0], mstype.float32)
                if layer_type == Conv:
                    ...
                elif layer_type == FC:
                    matrix_A = matrix_A + damping * A_eye
                    matrix_A_inv = self.cholesky(matrix_A)
                    matrix_A_inv = self.vector_matmul(matrix_A_inv, matrix_A_inv)
```
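The FC branch above is essentially "add a damping term, then invert through a Cholesky factorization". Here is a small NumPy sketch of that computation; it is my own illustration of the math, not the custom Ascend operators behind self.cholesky / self.vector_matmul.

```python
import numpy as np

def damped_cholesky_inverse(matrix, damping):
    """Compute (matrix + damping * I)^-1 through a Cholesky factorization."""
    n = matrix.shape[0]
    damped = matrix + damping * np.eye(n)   # regularization keeps the matrix positive definite
    chol = np.linalg.cholesky(damped)       # damped = L @ L.T
    chol_inv = np.linalg.inv(chol)          # L^-1
    return chol_inv.T @ chol_inv            # (L L^T)^-1 = L^-T L^-1

A = np.random.randn(8, 8)
A = A @ A.T / 8                             # a small covariance-like matrix
A_inv = damped_cholesky_inverse(A, damping=0.03)
print(np.allclose(A_inv @ (A + 0.03 * np.eye(8)), np.eye(8)))  # True
```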
- The _get_second_gradients function computes the final parameter update direction. In the paper, the parameter update direction is the second-order direction F⁻¹∇L with F ≈ A ⊗ G (Kronecker product), so for each layer's gradient g the code actually computes G⁻¹ g A⁻¹, where A⁻¹ and G⁻¹ are the inverted covariance matrices obtained above. (A NumPy check of this identity follows the snippet.) The code is as follows:
```python
    def _get_second_gradients(self, new_grads, damping_step, gradients):
        """get second gradients for thor"""
        params_len = len(self.params)
        for i in range(params_len):
            ...
            elif layer_type == FC:
                temp_a = self.matrix_A_cov[thor_layer_count]
                temp_g = self.matrix_G_cov[thor_layer_count]
                temp_a = self.cast(temp_a, mstype.float16)
                temp_g = self.cast(temp_g, mstype.float16)
                g = self.cast(g, mstype.float16)
                g = self.matmul(temp_g, g)
                g = self.matmul(g, temp_a)
                g = self.cast(g, mstype.float32)
```
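The reason two small matrix multiplications are enough comes from the Kronecker-product identity (A ⊗ G)⁻¹ vec(g) = vec(G⁻¹ g A⁻¹) for symmetric A, so the full matrix F = A ⊗ G never has to be built or inverted. A quick NumPy check of that identity (dimensions and names are made up for the example):

```python
import numpy as np

d_in, d_out = 5, 3
A = np.random.randn(d_in, d_in)
A = A @ A.T + 0.1 * np.eye(d_in)        # damped input covariance (symmetric positive definite)
G = np.random.randn(d_out, d_out)
G = G @ G.T + 0.1 * np.eye(d_out)       # damped output-gradient covariance
g = np.random.randn(d_out, d_in)        # first-order gradient of one FC layer

# invert the full Kronecker-factored matrix F = A (x) G ...
direct = np.linalg.solve(np.kron(A, G), g.reshape(-1, order='F')).reshape(d_out, d_in, order='F')
# ... versus the two small matmuls the code performs: G^-1 @ g @ A^-1
factored = np.linalg.inv(G) @ g @ np.linalg.inv(A)
print(np.allclose(direct, factored))    # True
```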
The construct function contains the calls to the two functions above, _get_Ainv_Ginv_Amax_Gmax_list and _get_second_gradients, and is what actually gets executed during network training. It completes the computation of the second-order matrices and the adjustment of the gradient update direction. Note the F.depend calls in the FC branch; a standalone sketch of that ordering pattern follows the snippet.
```python
    def construct(self, gradients):
        params = self.params
        moments = self.moments
        damping_step = self.gather(self.damping, self.cov_step, self.axis)
        damping_step = self.cast(damping_step, mstype.float32)
        if self.thor:
            matrix_A_allreduce = ()
            matrix_G_allreduce = ()
            matrix_A_max_allreduce = ()
            matrix_G_max_allreduce = ()
            matrix_A_allreduce, matrix_G_allreduce, matrix_A_max_allreduce, matrix_G_max_allreduce = \
                self._get_Ainv_Ginv_Amax_Gmax_list(gradients, damping_step, matrix_A_allreduce,
                                                   matrix_G_allreduce, matrix_A_max_allreduce,
                                                   matrix_G_max_allreduce)  # compute the inverses of A/G
            ...
            new_grads = ()
            for i in range(len(self.params)):
                ...
                if self.conv_layer_count > 0:
                    ...
                else:
                    if layer_type == Embedding:
                        ...
                    elif layer_type == FC:
                        temp_a = matrix_A_allreduce[thor_layer_count]
                        temp_g = matrix_G_allreduce[thor_layer_count]
                        fake_A = self.assign(self.matrix_A_cov[thor_layer_count], temp_a)
                        fake_G = self.assign(self.matrix_G_cov[thor_layer_count], temp_g)
                        g = F.depend(g, fake_A)  # ensure execution order
                        g = F.depend(g, fake_G)
                        temp_a = self.cast(temp_a, mstype.float16)
                        temp_g = self.cast(temp_g, mstype.float16)
                        g = self.cast(g, mstype.float16)
                        g = self.matmul(temp_g, g)
                        g = self.matmul(g, temp_a)  # change the first-order direction into the second-order direction
                        g = self.cast(g, mstype.float32)
                    elif layer_type == LayerNorm:
                        g = self._process_layernorm(damping_step, g)
                new_grads = new_grads + (g,)
            gradients = new_grads
        else:
            new_grads = ()
            ...
            gradients = self._get_second_gradients(new_grads, damping_step, gradients)  # compute the update direction
        ...
```
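A short, standalone sketch of the assign + F.depend pattern highlighted above (a minimal cell of my own, not part of THOR): in graph mode, operations with no data dependency between them can be reordered, so the gradient is made to depend on the result of the assignment, which guarantees the covariance matrices are written back before the gradient that uses them is produced.

```python
import numpy as np
import mindspore as ms
from mindspore import nn, ops, Tensor, Parameter
from mindspore.ops import functional as F

class AssignThenUse(nn.Cell):
    """Minimal cell showing the assign + depend ordering pattern used in THOR's construct."""
    def __init__(self):
        super().__init__()
        self.cache = Parameter(Tensor(np.zeros((2, 2), np.float32)), name="cache")
        self.assign = ops.Assign()

    def construct(self, new_value, g):
        updated = self.assign(self.cache, new_value)  # write the refreshed matrix into the cache
        g = F.depend(g, updated)                      # g is numerically unchanged, but now waits for the assign
        return g
```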
The practical application of THOR
This section shares THOR's practical application with two examples, ResNet50 and BERT. The code for both examples is open source; the links are as follows: ResNet50: https://gitee.com/mindspore/m…
ResNet50[1]
The optimizer is called in the same way as shown at the beginning of the article; this example expands the training process.
First, the training set required for network training is created and the network is defined as ResNet50. Then the hyperparameter schedules needed by THOR are set; other hyperparameter values can be changed in src/config.py in this directory. Next, the THOR optimizer is created and the prepared hyperparameter values are passed in. Then the model is converted so that the second-order information is saved. Finally, the network can be trained. (An illustrative sketch of such a per-step schedule follows the code.)
```python
from mindspore.nn.optim import Momentum, THOR
from src.resnet import resnet50 as resnet
from mindspore.train.model import Model
...

if __name__ == '__main__':
    ...
    # create the dataset
    dataset = create_dataset(dataset_path=args_opt.dataset_path, do_train=True, repeat_num=1,
                             batch_size=config.batch_size, target=target,
                             distribute=args_opt.run_distribute)
    step_size = dataset.get_dataset_size()

    # create the ResNet50 model
    net = resnet(class_num=config.class_num)
    ...
    # init lr
    if cfg.optimizer == "Thor":
        from src.lr_generator import get_thor_lr
        lr = get_thor_lr(0, config.lr_init, config.lr_decay, config.lr_end_epoch, step_size, decay_epochs=39)

    # define loss and model
    if target == "Ascend":
        if args_opt.dataset == "imagenet2012":
            if not config.use_label_smooth:
                config.label_smooth_factor = 0.0
            loss = CrossEntropySmooth(sparse=True, reduction="mean",
                                      smooth_factor=config.label_smooth_factor, num_classes=config.class_num)
        else:
            loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
        loss_scale = FixedLossScaleManager(config.loss_scale, drop_overflow_update=False)
        model = Model(net, loss_fn=loss, optimizer=opt, loss_scale_manager=loss_scale, metrics={'acc'},
                      amp_level="O2", keep_batchnorm_fp32=False)

    if cfg.optimizer == "Thor" and args_opt.dataset == "imagenet2012":
        from src.lr_generator import get_thor_damping
        # set up the damping schedule
        damping = get_thor_damping(0, config.damping_init, config.damping_decay, 70, step_size)
        # split indices to accelerate AllReduce during communication
        split_indices = [26, 53]
        opt = THOR(net, lr, Tensor(damping), config.momentum, config.weight_decay, config.loss_scale,
                   config.batch_size, split_indices=split_indices)
        model = ConvertModelUtils().convert_to_thor_model(model=model, network=net, loss_fn=loss, optimizer=opt,
                                                          loss_scale_manager=loss_scale, metrics={'acc'},
                                                          amp_level="O2", keep_batchnorm_fp32=False,
                                                          frequency=config.frequency)
    ...
    # train the model
    model.train(config.epoch_size - config.pretrain_epoch_size, dataset, callbacks=cb,
                sink_size=dataset.get_dataset_size(), dataset_sink_mode=dataset_sink_mode)
```
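The lr and damping passed to THOR above are per-step arrays produced by get_thor_lr and get_thor_damping in src/lr_generator.py. As a rough illustration of what such a schedule looks like, here is a simplified polynomial decay of my own, not the exact formulas from the repository:

```python
import numpy as np

def polynomial_decay_schedule(start_value, decay_power, total_epochs, steps_per_epoch):
    """Illustrative per-step schedule decaying from start_value toward zero with a polynomial curve."""
    total_steps = total_epochs * steps_per_epoch
    values = []
    for step in range(total_steps):
        epoch = (step + 1) / steps_per_epoch
        values.append(start_value * (1.0 - epoch / total_epochs) ** decay_power)
    return np.array(values, dtype=np.float32)

# e.g. a per-step damping array that could be wrapped in a Tensor and handed to THOR
damping = polynomial_decay_schedule(0.03, 6, total_epochs=45, steps_per_epoch=5004)
```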
Finally, enter the launch command and the script can be run.
BERT[2]
The steps for BERT are similar to those for ResNet50. First, the training set required by network training is created and the network is defined as BERT. Then the hyperparameter schedules needed by THOR are set; other hyperparameter values can be changed in src/config.py in this directory. The hyperparameter values are passed in when the optimizer is created; in this example, a decay_filter is also passed in so that bias parameters and the LayerNorm layers are excluded when weight decay is applied (a quick illustration of this filter follows the code). Then the model is converted to save the second-order information. Finally, the network can be trained.
```python
from mindspore.nn.optim import Lamb, Momentum, AdamWeightDecay, THOR
from src import BertNetworkWithLoss
...

def _get_optimizer(args_opt, network):
    """get bert optimizer, support Lamb, Momentum, AdamWeightDecay."""
    if cfg.optimizer == 'Lamb':
        ...
    elif cfg.optimizer == "Thor":
        from src.utils import get_bert_thor_lr, get_bert_thor_damping
        lr = get_bert_thor_lr(cfg.Thor.lr_max, cfg.Thor.lr_min, cfg.Thor.lr_power, cfg.Thor.lr_total_steps)
        damping = get_bert_thor_damping(cfg.Thor.damping_max, cfg.Thor.damping_min, cfg.Thor.damping_power,
                                        cfg.Thor.damping_total_steps)
        split_indices = None
        if bert_net_cfg.num_hidden_layers == 12:
            if bert_net_cfg.use_relative_positions:
                split_indices = [29, 58, 87, 116, 145, 174, 203, 217]
            else:
                split_indices = [28, 55, 82, 109, 136, 163, 190, 205]
        elif bert_net_cfg.num_hidden_layers == 24:
            if bert_net_cfg.use_relative_positions:
                split_indices = [30, 90, 150, 210, 270, 330, 390, 421]
            else:
                split_indices = [38, 93, 148, 203, 258, 313, 368, 397]
        optimizer = THOR(network, lr, damping, cfg.Thor.momentum,
                         cfg.Thor.weight_decay, cfg.Thor.loss_scale, cfg.batch_size,
                         decay_filter=lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower(),
                         split_indices=split_indices)
    ...
    return optimizer

def run_pretrain():
    ...
    # create the dataset
    ds = create_bert_dataset(device_num, rank, args_opt.do_shuffle, args_opt.data_dir, args_opt.schema_dir)
    # create the network with loss
    net_with_loss = BertNetworkWithLoss(bert_net_cfg, True)
    ...
    if args_opt.load_checkpoint_path:
        param_dict = load_checkpoint(args_opt.load_checkpoint_path)
        load_param_into_net(net_with_loss, param_dict)
    # dynamic loss scaling
    if args_opt.enable_lossscale == "true":
        ...
    # fixed loss scaling
    else:
        net_with_grads = BertTrainOneStepCell(net_with_loss, optimizer=optimizer)
    # create the model
    model = Model(net_with_grads)
    # add the extra graph to improve performance
    model = ConvertModelUtils().convert_to_thor_model(model, network=net_with_grads, optimizer=optimizer)
    model.train(new_repeat_count, ds, callbacks=callback,
                dataset_sink_mode=(args_opt.enable_data_sink == "true"), sink_size=args_opt.data_sink_steps)

if __name__ == '__main__':
    set_seed(0)
    run_pretrain()
```
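To see exactly which parameters the decay_filter above keeps under weight decay, here is a tiny self-contained illustration; the parameter names are invented for the example.

```python
decay_filter = lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower()

class FakeParam:
    def __init__(self, name):
        self.name = name

params = [FakeParam("bert.encoder.layer0.attention.dense.weight"),
          FakeParam("bert.encoder.layer0.attention.dense.bias"),
          FakeParam("bert.encoder.layer0.layernorm.gamma")]
print([p.name for p in params if decay_filter(p)])
# -> ['bert.encoder.layer0.attention.dense.weight']; bias and LayerNorm parameters are excluded
```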
Finally, enter the launch command and the script can be run.
This is the end of the higher-order optimizer series. The three articles in this series covered the background of optimizers, an introduction to MindSpore's self-developed optimizer, and the source-code analysis and practical application of MindSpore's higher-order optimizer THOR. If there are any shortcomings, your criticism and corrections are welcome. You are also welcome to join and explore the MindSpore open-source community.
References:
[1]He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
[2] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
Click Follow to be the first to learn about Huawei Cloud's latest technologies~