Abstract: In the process of model development, it is often a headache when the accuracy fails to reach expectations. To help users debug and tune their models, we tailored a visual debugging and tuning component for MindSpore: MindInsight.
This article is shared from the Huawei Cloud community post “Technical goodies | Model optimization: accuracy and speed, I want them all! MindSpore model accuracy tuning in practice (2)”, original author: HWCloudAI.
Introduction: In the process of model development, it is often a headache when the accuracy fails to reach expectations. To help users debug and tune their models, we tailored a visual debugging and tuning component for MindSpore: MindInsight. We also compiled a debugging and tuning guide for common accuracy problems, which will be shared as the “MindSpore Model Accuracy Tuning in Practice” series of articles, hoping to help users easily locate accuracy problems and quickly optimize model accuracy.
To review the previous installment of the MindSpore model accuracy tuning in practice series, click the link: “Technical goodies | Locate accuracy problems faster! MindSpore model accuracy tuning in practice (1)”.
This article, the second in the series, presents common ideas for accuracy debugging and tuning. The series assumes that your script already runs and produces loss values. If the script cannot run, first modify it according to the error messages.
When encountering precision problems, common debugging and tuning ideas are as follows:
1. Check the code and hyperparameters
2. Check the model structure
3. Check the input data
4. Check the Loss curve
5. Check whether the accuracy is as expected
Code is an important source of accuracy problems, and checking the code focuses on inspecting the scripts to find problems at the source (Section 2). The model structure reflects MindSpore’s understanding of the code, and checking it focuses on verifying that this understanding is consistent with the algorithm engineer’s design (Section 3). Some problems only surface during training: checking the input data (Section 4) and the loss curve (Section 5) combines code inspection with the phenomena observed during training. Checking whether the accuracy meets expectations means re-examining the overall tuning process and considering measures such as adjusting hyperparameters, explaining the model, and optimizing the algorithm (Section 6). In addition, familiarity with the model and the tools is important (Section 1). Each of these ideas is described below.
1. Preparation for accuracy tuning
1.1 Review the algorithm design and be fully familiar with the model
Before accuracy tuning, review the algorithm design and make sure it is clear. If the model is implemented by following a paper, review all design details and hyperparameter choices described in the paper. If the model is implemented by referring to a script from another framework, make sure there is a single benchmark script whose accuracy meets the standard. If the algorithm is newly developed, the important design details and hyperparameter choices should likewise be made explicit. This information is important for the later steps of checking the script.
Before accuracy tuning, you should also become thoroughly familiar with the model. Only when you are familiar with the model can you accurately understand the information provided by MindInsight, judge whether there is a problem, and identify its source. Therefore, it is important to take the time to understand elements such as the model’s algorithm and structure, the role of its operators and the meaning of its parameters, and the characteristics of the optimizer it uses. Before analyzing the details of an accuracy problem, it is recommended to deepen your understanding of these model elements with the problem in mind.
1.2 Getting Familiar with Tools
MindInsight offers a rich set of features; we recommend briefly reading the MindInsight tutorials to understand its main capabilities. To locate accuracy problems, you are advised to enable collection of summary training information: add a SummaryCollector to the script and view the training process data in the training dashboard, as shown in the figure below. See the Summary usage guide and the training visualization usage guide for details.
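For reference, the following is a minimal, hedged sketch of attaching a SummaryCollector; it assumes a network `net`, a loss function `loss_fn`, an optimizer `opt`, and a dataset `train_ds` already exist, and the directory name is only an example (check the SummaryCollector API of your MindSpore version for the exact signature).

```python
# Hedged sketch: attach SummaryCollector so MindInsight can visualize training data.
# `net`, `loss_fn`, `opt` and `train_ds` are assumed to be defined elsewhere.
from mindspore import Model
from mindspore.train.callback import SummaryCollector

summary_collector = SummaryCollector(summary_dir='./summary_dir')
model = Model(net, loss_fn, opt, metrics={'accuracy'})
model.train(epoch=10, train_dataset=train_ds,
            callbacks=[summary_collector], dataset_sink_mode=False)
# Afterwards, the training dashboard can typically be launched with:
#   mindinsight start --summary-base-dir ./summary_dir
```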
When you need to debug the model online, refer to Enabling the Debugger.
2. Check the code and hyperparameters
Code is an important source of accuracy problems, and problems with hyperparameters, model structure, data, and algorithm design and implementation are all reflected in the script, so inspecting the script is a very effective way to locate accuracy problems. Checking the code mainly relies on code walkthroughs, and the rubber duck debugging method is recommended: during the walkthrough, patiently explain the role of each line of code to an inexperienced “rubber duck”; doing so often sparks insight and reveals problems in the code. When checking the script, verify that the implementation (including data processing, model structure, loss function, optimizer, etc.) is consistent with the design. If the script refers to another script, focus on whether the two implementations are consistent; any inconsistency should have a sufficient and reasonable justification, otherwise it should be fixed.
When checking the script, also pay attention to the hyperparameters. Hyperparameter problems are mainly caused by unreasonable values, for example (see the code sketch after this list):
1. Unreasonable learning rate setting;
2. Unreasonable loss_scale parameter;
3. Improper weight initialization parameters.
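As a reference, the following is a minimal, hedged sketch (not taken from any real model; the numeric values are placeholders) showing where these three kinds of hyperparameters typically appear in a MindSpore training script:

```python
# Hedged sketch: the three hyperparameter classes listed above in one place.
import mindspore.nn as nn
from mindspore import Model
from mindspore.common.initializer import Normal
from mindspore.train.loss_scale_manager import FixedLossScaleManager

net = nn.Dense(784, 10, weight_init=Normal(sigma=0.02))        # 3. weight initialization
loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
opt = nn.Momentum(net.trainable_params(),
                  learning_rate=0.01, momentum=0.9)            # 1. learning rate
loss_scale = FixedLossScaleManager(loss_scale=1024,
                                   drop_overflow_update=False) # 2. loss_scale
model = Model(net, loss_fn, opt, loss_scale_manager=loss_scale)
```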
In most cases, the SummaryCollector automatically records common hyperparameters. You can view them with MindInsight’s training parameter details and traceability analysis features. Combined with the code, the MindInsight model traceability module lets you confirm the hyperparameter values and identify any that are clearly unreasonable. If a benchmark script is available, compare each hyperparameter value with the benchmark script one by one; for parameters left at their defaults, also compare the default values, because different frameworks use different defaults, which can silently degrade accuracy or break training.
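One concrete example of such default differences, as a hedged sketch (verify against your framework versions): at the time of writing, MindSpore’s Conv2d uses no bias and 'same' padding by default, and the momentum of BatchNorm2d has the opposite meaning to PyTorch’s, so when porting a PyTorch benchmark it often helps to write these values out explicitly:

```python
# Hedged sketch: make framework defaults explicit when porting from PyTorch.
import mindspore.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, stride=1,
                 pad_mode='pad', padding=1,   # match the benchmark's explicit padding instead of the default 'same'
                 has_bias=True)               # PyTorch Conv2d uses bias=True by default
bn = nn.BatchNorm2d(64, momentum=0.9)         # roughly equivalent to PyTorch momentum=0.1
```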
3. Check the model structure
In terms of model structure, common problems are (a code sketch follows the list):
1. Incorrect operator usage (the operator used does not suit the target scenario, for example using integer division where floating-point division should be used);
2. Weight sharing error (sharing weight that should not be shared);
3. Weight freezing error (freezing weights that should not be frozen);
4. Node connection error (the block that should be connected to the calculation diagram is not connected);
5. Loss function error;
6. Optimizer algorithm error (if self-implemented optimizer), etc.
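The following hedged sketch (illustrative only, not from the article) shows what three of these pitfalls can look like in MindSpore code: confusing division operators, unintended weight sharing, and weight freezing:

```python
# Hedged sketch: division operators, weight sharing and weight freezing in a Cell.
import mindspore.nn as nn
import mindspore.ops as ops

class Block(nn.Cell):
    def __init__(self):
        super().__init__()
        self.div = ops.Div()             # floating-point division
        self.floor_div = ops.FloorDiv()  # integer-style division: easy to mix up with Div
        self.fc1 = nn.Dense(16, 16)
        self.fc2 = self.fc1              # weight sharing: fc2 reuses fc1's parameters
        for p in self.fc1.trainable_params():
            p.requires_grad = False      # weight freezing: fc1 will no longer be updated

    def construct(self, x):
        return self.fc2(self.div(x, 2.0))
```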
It is recommended to check the model structure by reviewing the model code. MindInsight can also assist with this check: in most cases, the SummaryCollector automatically records the computational graph, which MindInsight makes easy to view. After the model script runs, it is recommended to use the MindInsight computational graph visualization module to view the model structure, deepen your understanding of the computational graph, and confirm that the model structure meets expectations. If a benchmark script is available, you can also compare the computational graph against the benchmark script’s graph to see whether there are important differences between them.
Given that model structures are generally complex, it is unrealistic to expect every model structure problem to be found in this step; the goal is to deepen your understanding of the computational graph through visualization and to catch the obvious structural problems. In later steps, once more specific signs of accuracy problems are found, we will come back to this step to recheck and confirm.
Note 1: MindInsight supports viewing the computational graphs recorded by the SummaryCollector, as well as the pb graphs exported through the save_graphs parameter of the MindSpore context. Please refer to the “Computational Graph Visualization” section of our tutorial for more information.
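For reference, a minimal sketch of exporting these graphs is shown below; the directory name is only an example, and the available save_graphs options may depend on the MindSpore version.

```python
# Hedged sketch: ask MindSpore to dump compiled graphs for offline inspection.
from mindspore import context

context.set_context(mode=context.GRAPH_MODE,
                    save_graphs=True,
                    save_graphs_path='./graphs')
```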
Note 2: The script migration tool can convert models written in the PyTorch or TensorFlow framework into MindSpore scripts. Please see the tutorial for more information.
4. Check the input data
Whether there are problems in the data processing pipeline and the dataset can be determined by examining the data fed into the model and the data processing script. Common problems with input data are (a spot-check sketch follows the list):
1. Too many missing data values;
2. The number of samples in each category is unbalanced;
3. There are outliers in the data;
4. Data labels are incorrect.
5. Insufficient training samples;
6. The data is not standardized, and the data input to the model is not in the correct range;
7. Finetune and Pretrain have different data processing methods;
8. Data processing differs between the training stage and the inference stage;
9. Incorrect data processing parameters, etc.
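Before turning to MindInsight, a few of these issues can also be spot-checked directly with NumPy. The sketch below is a hedged example: it assumes a MindSpore `dataset` object with 'image' and 'label' columns, which may be named differently in your script.

```python
# Hedged sketch: quick NumPy spot-checks on one batch from the data pipeline.
import numpy as np

batch = next(dataset.create_dict_iterator(output_numpy=True))
images, labels = batch['image'], batch['label']

print('NaN values in batch:', np.isnan(images).sum())            # missing/abnormal values
print('class counts:', np.unique(labels, return_counts=True))    # class balance
print('value range:', images.min(), images.max())                # normalization range
```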
MindInsight helps users check the input data and the data processing pipeline. In most cases, the SummaryCollector automatically records the data entering the model (after data processing) and the data processing pipeline parameters. The data entering the model is displayed in the “data sampling” module, and the data processing pipeline parameters are displayed in the “data graph” and “data traceability” modules. With MindInsight’s data sampling module, you can examine the data as it enters the model; if the data does not meet expectations (for example, the data range is too large or the rotation angle applied during augmentation is too large), you can conclude that there is a problem with the input data. With MindInsight’s data graph and data traceability modules, you can examine the processing steps and parameter values in the data processing pipeline and detect improper processing methods.
If you have a benchmark script, you can also check whether the data output by the data processing pipeline matches that of the current script. For example, save the output of the data processing pipeline as an .npy file, and then compare the data from the benchmark script with that from the current script using numpy.allclose(). If a difference is found, there may be an accuracy problem in the data processing stage.
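A hedged sketch of that comparison is shown below; 'benchmark_batch.npy' is assumed to have been saved from the benchmark framework’s pipeline for the same input sample, and the tolerances are only examples.

```python
# Hedged sketch: compare the MindSpore pipeline output with a benchmark dump.
import numpy as np

batch = next(dataset.create_dict_iterator(output_numpy=True))
np.save('mindspore_batch.npy', batch['image'])

benchmark = np.load('benchmark_batch.npy')
current = np.load('mindspore_batch.npy')
print('pipelines match:', np.allclose(benchmark, current, rtol=1e-5, atol=1e-8))
```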
If no problem is found in the data processing pipeline, you can manually check whether the dataset itself has problems such as class imbalance, incorrect label matching, too many missing values, or insufficient training samples.
5. Check the Loss curve
Many accuracy problems will be found in the process of network training. Common problems or phenomena include:
1. The weight initialization is unreasonable (for example, the initial value is 0, the initial value range is unreasonable, etc.);
2. There are too large and too small values in the weight;
3. The weight changes too much;
4. Incorrect weight freezing;
5. Incorrect weight sharing;
6. Activation values are saturated or too weak (for example, Sigmoid outputs close to 1, or ReLU outputs all 0);
7. Gradient explosion or vanishing gradients;
8. Not enough training epochs;
9. NaN or Inf values appear in operator computation results;
10. Overflow occurs during operator computation (overflow during computation is not necessarily harmful), etc.
Some of these problems or phenomena can be observed through the loss, while others are hard to observe directly. MindInsight provides targeted features to observe these phenomena and automatically check for problems, helping you locate the root cause faster. For example:
- The parameter distribution diagram module of MindInsight shows how the model weights change over the course of training.
- The tensor visualization module of MindInsight shows the values of tensors and compares tensors with each other.
- The MindInsight debugger has a rich set of powerful built-in checks. It can check for weight problems (such as weights not updating, weights updating too much, weight values too large or too small), gradient problems (such as vanishing gradients or gradient explosion), activation value problems (such as saturated or too weak activations), all-zero tensors, NaN/Inf values, overflow during operator computation, and so on. See the debugger usage tutorial for details.
In most cases, the SummaryCollector automatically records the loss curve of the model, which can be viewed in MindInsight’s scalar visualization module. The loss curve reflects the dynamics of network training; by observing it you can tell, for example, whether the model converges or overfits.
In most cases, the SummaryCollector automatically records the changes of model parameters (five parameters by default), which can be viewed in MindInsight’s parameter distribution diagram module. To record the distribution of more parameters, see the histogram_regular option of the SummaryCollector, or refer to the HistogramSummary operator.
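For example, a hedged sketch (the regular expression and directory are placeholders, and the exact option name may depend on the MindSpore version) of asking SummaryCollector to record the distributions of additional parameters:

```python
# Hedged sketch: record histograms for parameters whose names match a regex.
from mindspore.train.callback import SummaryCollector

summary_collector = SummaryCollector(
    summary_dir='./summary_dir',
    collect_specified_data={'histogram_regular': '^conv1|^fc'})
```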
Tensor data is not recorded automatically. If you want to view specific tensor values in MindInsight, use the TensorSummary operator.
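A minimal, hedged sketch of inserting the TensorSummary operator into a network is shown below; the names and shapes are illustrative, and a SummaryCollector (or SummaryRecord) still needs to be active during training for the data to be written.

```python
# Hedged sketch: record an intermediate tensor so it can be viewed in MindInsight.
import mindspore.nn as nn
import mindspore.ops as ops

class NetWithSummary(nn.Cell):
    def __init__(self):
        super().__init__()
        self.fc = nn.Dense(32, 10)
        self.tensor_summary = ops.TensorSummary()

    def construct(self, x):
        x = self.fc(x)
        self.tensor_summary('fc_output', x)  # record under the name 'fc_output'
        return x
```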
The following describes how to use MindInsight to locate accuracy problems based on common phenomena of Loss curves.
5.1 Loss divergence
Loss divergence means that NaN, +/-Inf, or extremely large values appear in the loss. It usually indicates a problem in the algorithm design or implementation. The troubleshooting roadmap is as follows:
1. Review scripts, model structures, and data
1) Check whether any hyperparameter has an unreasonably large or small value;
2) Check whether the model structure is correct, especially whether the Loss function is correct.
3) Check whether there are any missing values and special values in the input data.
2. Observe the parameter distribution diagrams in the training dashboard to check whether there are obvious anomalies in the parameter updates. If abnormal parameter updates are found, use the debugger to locate their cause.
3. Use the debugger module to inspect the training session.
1) If NaN or +/-Inf appears in the loss value, add a global watchpoint using the “check tensor overflow” condition to locate the operator node where NaN or +/-Inf first appears, and check whether the operator’s input data can lead to abnormal computation (such as division by zero). If the problem is in the operator’s input data, a small epsilon can be added in a targeted way to avoid the abnormal computation.
2) If the loss value is extremely large, add a global watchpoint using the “check for oversized tensors” condition to locate the first operator node that produces an extremely large value, and check whether the operator’s input data can lead to abnormal computation. If the input data itself is abnormal, continue to trace back to the operator that produced that input until the specific cause is located.
3) If abnormal parameter updates or gradients are suspected, set watchpoints with conditions such as “check whether the weight change is too large”, “check for vanishing gradients”, and “check for oversized gradients” to locate the abnormal weights or gradients, and then use the tensor check view to trace back to the suspicious forward operators, backward operators, optimizer operators, and so on.
5.2 Slow loss convergence
Slow loss convergence means that the loss oscillates and converges slowly, taking a long time to reach the expected value, or converging without ever reaching it. Compared with loss divergence, the numerical signature of slow convergence is not obvious, so it is harder to locate. The troubleshooting roadmap is as follows:
1. Review scripts, model structures, and data
1) Check whether any hyperparameter has an unreasonable value, especially the learning rate: a learning rate that is too small slows down convergence, while a learning rate that is too large causes the loss to oscillate and fail to decrease.
2) Check whether the model structure is correctly realized, especially whether the Loss function and optimizer are correctly realized;
3) Check whether the range of the input data is normal, especially whether the input data values are too small;
2. Observe the parameter distribution diagrams in the training dashboard to check whether there are obvious anomalies in the parameter updates. If abnormal parameter updates are found, use the debugger to locate their cause.
3. Use the debugger module to inspect the training session.
1) Monitor the trainable (unfrozen) weights with the “check whether the weight change is too small” and “check for unchanged weights” conditions to see whether the weights change too little. If they do, further check whether the learning rate is too small, whether the optimizer algorithm is correctly implemented, and whether the gradients vanish, and fix the issue accordingly.
2) Monitor the gradients with the “check for vanishing gradients” condition to see whether the gradients vanish. If they do, further investigate the cause; for example, use the “check activation value range” condition to check for activation saturation or ReLU outputs of 0.
5.3 Other Loss Symptoms
If the loss reaches 0 on the training set, it generally indicates that the model is overfitting; try increasing the size of the training set.
6. Check whether the accuracy meets expectations
MindInsight can record the accuracy of each training run for the user. When the same SummaryCollector instance is used in model.train and model.eval, the model evaluation information is recorded automatically. After training, you can check the accuracy of the training results with the MindInsight model traceability module.
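A hedged sketch of this pattern is shown below; names such as `net`, `loss_fn`, `opt`, `train_ds`, and `eval_ds` are assumed, and the key point is reusing the same SummaryCollector instance in both calls.

```python
# Hedged sketch: reuse one SummaryCollector in training and evaluation so that
# the evaluation metrics show up in MindInsight's model traceability page.
from mindspore import Model
from mindspore.train.callback import SummaryCollector

summary_collector = SummaryCollector(summary_dir='./summary_dir')
model = Model(net, loss_fn, opt, metrics={'accuracy'})
model.train(10, train_ds, callbacks=[summary_collector], dataset_sink_mode=False)
acc = model.eval(eval_ds, callbacks=[summary_collector], dataset_sink_mode=False)
```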
6.1 Check the accuracy of the training set
If the loss and metric values of the model on the training set do not meet expectations, the following ideas can be used for locating and optimizing the problem:
1. Review code, model structure, input data and Loss curve
1) Check the script to see whether any hyperparameter has an unreasonable value
2) Check whether the model structure is correctly implemented
3) Check whether the input data is correct
4) Check whether the convergence result and convergence trend of the Loss curve are abnormal
2. Try optimizing the hyperparameters using MindInsight traceability analysis. The traceability analysis page shows the importance of each hyperparameter; give priority to adjusting the hyperparameters with high importance. The relationship between a hyperparameter and the optimization target can be observed in the scatter plot, so that its value can be adjusted accordingly.
3. Try tuning the hyperparameters with the MindInsight tuner. Note that the tuner searches for hyperparameters by running multiple complete training sessions, so it takes several times as long as a single training run; if one training run already takes a long time, the hyperparameter search will take very long. See the tuner usage tutorial.
You can also try using the MindInsight model explanation feature to optimize the model and the dataset. Through saliency maps, the model explanation feature visually shows the regions that matter most for a classification result, and its scoring system suggests which label categories should be optimized.
See the model explanation usage tutorial.
4. Try to optimize the model structure/algorithm.
6.2 Check the accuracy on the validation set
If neither the training-set accuracy nor the validation-set accuracy meets expectations, first check the training-set accuracy by referring to the previous section. If the training-set accuracy meets expectations but the validation-set accuracy does not, the model is most likely overfitting. The handling ideas are as follows:
1. Check whether the evaluation logic of the validation-set evaluation script is correct, in particular whether the data processing is consistent with that of the training set, whether the inference algorithm is wrong, and whether the correct model checkpoint is loaded.
2. Increase the amount of data, including increasing the sample size and applying data augmentation and perturbation.
3. Apply regularization. Common techniques include parameter-norm penalties (such as adding a regularization term to the objective function), parameter sharing (forcing two components of the model to use the same parameter values), and early stopping of training (see the sketch after this list for one example).
4. Appropriately reduce the model size, for example by reducing the number of convolutional layers.
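As one concrete, hedged example of the regularization idea in point 3 (a parameter-norm penalty), most MindSpore optimizers accept a weight_decay argument; the value below is only a placeholder.

```python
# Hedged sketch: L2 parameter-norm penalty via the optimizer's weight_decay argument.
import mindspore.nn as nn

opt = nn.Momentum(net.trainable_params(),
                  learning_rate=0.01,
                  momentum=0.9,
                  weight_decay=1e-4)  # penalizes large weights to reduce overfitting
```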
6.3 Check the accuracy of the test set
If neither the validation-set accuracy nor the test-set accuracy meets expectations, first check the validation-set accuracy by referring to the previous section. If the validation-set accuracy meets expectations but the test-set accuracy does not, then, since the test set consists of new data the model has never seen, the cause is usually that the distribution of the test-set data is inconsistent with that of the training data. The handling ideas are as follows:
1. Check whether the evaluation logic of the test-set evaluation script is correct, in particular whether the data processing is consistent with that of the training set, whether the inference algorithm is wrong, and whether the correct model checkpoint is loaded.
2. Check the quality of the test-set data, for example whether its distribution range obviously differs from that of the training set, and whether the data contains a lot of noise, missing values, or outliers.
7. Summary
Because the same phenomenon can have multiple possible causes, locating accuracy problems depends heavily on expert experience. We hope the locating methods and features above provide useful guidance and help you keep accumulating successful experience on the way to becoming an accuracy tuning expert.
Click Follow to be the first to learn about Huawei Cloud's latest technologies~