Abstract: We have compiled a debugging and tuning guide for common precision problems, to be shared as the “MindSpore Model Precision Tuning Practice” series of articles, to help you locate accuracy problems easily and optimize model accuracy quickly.
This article is shared from the Huawei Cloud community post “Technical Insights | Locate Accuracy Problems Faster! MindSpore Model Accuracy Tuning in Practice (Part 1)”, original author: HWCloudAI.
Introduction: In model development, failing to reach the expected accuracy is often a headache. To help you debug and tune your models, we have built a visual debugging and tuning component for MindSpore: MindInsight.
We have also compiled a debugging and tuning guide for common precision problems, which will be shared as the “MindSpore Model Precision Tuning Practice” series of articles, to help you locate accuracy problems easily and optimize model accuracy quickly.
This article, the first in the series, briefly introduces common precision problems, analyzes their typical symptoms and causes, and gives an overall tuning approach. The series assumes that your script already runs and produces a loss value; if it does not, first fix the script according to the error messages. In precision tuning practice, detecting anomalies is relatively easy, but if we are not sensitive enough to what an anomaly means, we may miss the root cause of the problem. This article explains common accuracy problems, which can sharpen your sensitivity to anomalies and help you locate accuracy problems faster.
1. Common Phenomena and Causes of Accuracy Problems
Model accuracy problems differ from general software problems in that they usually take longer to locate. In an ordinary program, output that does not match expectations indicates a bug (coding error). For a deep learning model, however, accuracy falling short of expectations has more complicated causes and more possibilities. Because the final accuracy can only be seen after a long training run, locating accuracy problems usually takes considerably longer.
1.1 Common Symptoms
The direct symptoms of accuracy problems generally show up in the loss (model loss value) and the metrics (model evaluation metrics). Typical loss symptoms are: (1) the loss diverges, producing NaN, +/-Inf, or extremely large values; (2) the loss does not converge or converges slowly; (3) the loss is 0. Typical metric symptoms are that metrics such as accuracy and precision do not reach expectations.
These direct symptoms are relatively easy to observe, and with visualization tools such as MindInsight, more symptoms can be observed on tensors such as gradients, weights, and activation values. Common phenomena include: (1) vanishing gradients; (2) exploding gradients; (3) weights not updating; (4) weight changes that are too small; (5) weight changes that are too large; (6) activation value saturation.
1.2 Common Causes
Behind these phenomena are the causes of the accuracy problem, which can be roughly divided into hyperparameter problems, model structure problems, data problems, algorithm design problems, and so on:
1.2.1 Hyperparameter Problems
Hyperparameters are the lubricant between model and data; their choice directly affects how well the model fits the data. Common hyperparameter problems are as follows:
1. The learning rate is set unreasonably (too large or too small)
2. The loss_scale parameter is incorrect
3. The weight initialization parameters are improper
4. The number of epochs is too large or too small
5. The batch size is too large
Learning rate too large or too small. The learning rate is arguably the most important hyperparameter in model training. If it is too large, the loss may oscillate and fail to converge to the expected value. If it is too small, the loss converges slowly. The learning rate strategy should be chosen based on theory and experience, for example with a decay schedule as sketched below.
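As an illustration, here is a minimal sketch (assuming MindSpore 1.x; the epoch and step counts are hypothetical) of generating a cosine-decay learning rate schedule instead of using one fixed value:

```python
import mindspore.nn as nn

# Hypothetical training configuration.
total_epochs = 90
steps_per_epoch = 625  # e.g. dataset_size / batch_size

# Generate one learning-rate value per training step: starts at max_lr
# and decays to min_lr along a cosine curve.
lr_schedule = nn.cosine_decay_lr(
    min_lr=1e-5,
    max_lr=0.1,
    total_step=total_epochs * steps_per_epoch,
    step_per_epoch=steps_per_epoch,
    decay_epoch=total_epochs,
)

# The resulting list can be passed directly to an optimizer, e.g.:
# optimizer = nn.Momentum(net.trainable_params(), lr_schedule, momentum=0.9)
```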
Number of epochs too large or too small. The number of epochs directly affects whether the model underfits or overfits. If it is too small, training stops before the model reaches a good solution, and underfitting is likely. If it is too large, training takes too long, and the model easily overfits the training set and fails to perform optimally on the test set. The number of epochs should be chosen based on how the model's performance on the validation set changes during training. Batch size too large. When the batch size is too large, the model may fail to converge to a good minimum, reducing its generalization ability.
1.2.2 Data Problems
A. Data set problems
The quality of the data set determines the upper limit of what the algorithm can achieve. If the data quality is poor, even the best algorithm will struggle to produce good results. Common data set problems are as follows:
1. Too many missing data values
2. The number of samples in each category is uneven
3. Outliers exist in the data
4. Insufficient training samples
5. Data is incorrectly labeled
If the data set contains missing values or outliers, the model will learn incorrect data relationships. In general, data with missing values or outliers should be removed from the training set, or reasonable defaults should be set (a small sketch follows). Incorrect labels are a special case of outliers, but one that is more destructive to training; such problems should be identified in advance, for example by sampling and inspecting the data fed into the model.
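For instance, here is a minimal sketch (with a hypothetical feature matrix x and labels y) of removing samples with missing values and clipping outliers before training:

```python
import numpy as np

# Hypothetical raw training data: one row contains NaN, one contains an outlier.
x = np.array([[0.5, 1.2], [np.nan, 0.3], [0.1, 250.0]], np.float32)
y = np.array([0, 1, 0], np.int32)

# Remove samples with missing values.
keep = ~np.isnan(x).any(axis=1)
x, y = x[keep], y[keep]

# Clip extreme outliers to a plausible range (the bounds here are assumptions
# and should come from domain knowledge or data statistics).
x = np.clip(x, -3.0, 3.0)
```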
An imbalanced number of samples per class means that the number of samples in each class of the data set differs greatly. For example, in an image classification data set (training set) where most classes have 1000 samples but the “cat” class has only 100, the sample counts can be considered imbalanced. Imbalanced sample counts lead to poor prediction performance on the classes with few samples. If such imbalance exists, samples of the under-represented classes should be added as appropriate. As a rule of thumb, supervised deep learning algorithms reach acceptable performance with about 5,000 labeled samples per class, and when a data set contains more than 10 million labeled samples, the model can exceed human performance.
Insufficient training samples means the training set is too small relative to the model's capacity. This leads to unstable training and makes overfitting likely. If the number of model parameters is out of proportion to the number of training samples, consider adding training samples or reducing model complexity.
B. Data processing problems
Common data processing problems are as follows:
1. Problems in the data processing algorithm itself
2. Data processing parameters are incorrect
3. Data are not normalized or standardized
4. The data processing method is inconsistent with the training set
5. The data set is not shuffled
Data not normalized or standardized means the data fed into the model is not on the proper scale. In general, models expect each dimension of the input to lie between -1 and 1 with a mean of 0. If the scales of two dimensions differ by orders of magnitude, model training may be affected, and the data needs to be normalized or standardized. Inconsistency with the training-set processing means that, when running inference, the data is processed differently than during training. For example, using different scaling, cropping, or normalization parameters for images than were used for the training set makes the data distribution differ between inference and training, which may reduce the model's inference accuracy. Note: some data augmentation operations (such as random rotation and random cropping) should generally only be applied to the training set; no augmentation is needed during inference.
Data set not shuffled means the data is not shuffled during training. If shuffling is skipped or insufficient, the model is always updated in the same data order, which severely limits the choice of gradient descent directions, leaves less room for where the convergence point can land, and makes overfitting easy. A sketch combining consistent preprocessing and shuffling follows.
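As an illustration, here is a minimal sketch (assuming MindSpore 1.x and an ImageFolderDataset-style data source; the normalization statistics, paths, and image size are hypothetical) that keeps random augmentation in the training pipeline only, applies identical normalization at training and inference, and shuffles the training set:

```python
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as CV

# Hypothetical per-channel statistics on the 0-255 pixel scale.
mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
std = [0.229 * 255, 0.224 * 255, 0.225 * 255]

train_ops = [
    CV.Decode(),
    CV.Resize((224, 224)),
    CV.RandomHorizontalFlip(),         # augmentation: training only
    CV.Normalize(mean=mean, std=std),
    CV.HWC2CHW(),
]
eval_ops = [
    CV.Decode(),
    CV.Resize((224, 224)),
    CV.Normalize(mean=mean, std=std),  # identical normalization at inference
    CV.HWC2CHW(),
]

train_set = ds.ImageFolderDataset("path/to/train", shuffle=True)  # shuffle!
train_set = train_set.map(operations=train_ops, input_columns="image")

eval_set = ds.ImageFolderDataset("path/to/val", shuffle=False)
eval_set = eval_set.map(operations=eval_ops, input_columns="image")
```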
1.2.3 Algorithm Problems
Defects in the algorithm itself can prevent the accuracy from reaching expectations.
A. API usage problems
Common API usage problems are as follows:
1. The API is used without following the MindSpore constraints
2. The graph is built without following the MindSpore construct constraints
Using an API without following the MindSpore constraints means the API used does not match the actual application scenario. For example, where a divisor may contain zeros, consider using DivNoNan instead of Div to avoid division-by-zero problems. Another example: in MindSpore, the first parameter of Dropout is the keep probability, which is the opposite of the drop probability used in some other frameworks.
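The following minimal sketch (assuming MindSpore 1.x) illustrates both points:

```python
import numpy as np
import mindspore.nn as nn
import mindspore.ops.operations as P
from mindspore import Tensor, context

context.set_context(mode=context.PYNATIVE_MODE)  # execute ops eagerly

# DivNoNan returns 0 where the divisor is 0, avoiding the NaN/Inf that a
# plain Div would produce.
div_no_nan = P.DivNoNan()
x = Tensor(np.array([2.0, 4.0], np.float32))
y = Tensor(np.array([1.0, 0.0], np.float32))
print(div_no_nan(x, y))  # [2. 0.] -- no NaN where y == 0

# In MindSpore 1.x, nn.Dropout takes keep_prob (the probability of KEEPING a
# unit), the opposite convention from frameworks whose first argument is the
# drop probability.
dropout = nn.Dropout(keep_prob=0.9)  # keeps 90% of activations in training
```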
Not following the MindSpore construct constraints means that a network in graph mode does not follow the constraints declared in the MindSpore static graph syntax support. For example, MindSpore currently does not support computing the gradient (backward) of functions with key-value pair parameters. For the full list of constraints, see: mindspore.cn/doc/note/zh…
B. Computational graph structure problems
The computational graph is the carrier of model computation, and errors in its structure are usually caused by coding mistakes when implementing the algorithm. Common problems in the computational graph structure are:
1. Operator used incorrectly (the operator used does not fit the target scenario)
2. Weight sharing error (weights shared that should not be, or vice versa)
3. Node connection error (a block that should be connected in the computational graph is not connected)
4. Node mode incorrect
5. Weight freezing error (weights frozen that should not be, or vice versa)
6. Loss function incorrect
7. Optimizer algorithm error (for self-implemented optimizers), etc.
Weight sharing error means that weights that should be shared are not shared, or weights that should not be shared are shared. This type of problem can be checked visually with the MindInsight computational graph.
Weight freezing error means that weights that should be frozen are not frozen, or weights that should not be frozen are frozen. In MindSpore, weights can be frozen by controlling the params argument passed to the optimizer: parameters not passed to the optimizer are not updated. You can verify weight freezing by checking the script, or by inspecting the parameter distribution histograms in MindInsight. A sketch follows.
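For example, a minimal sketch (with a hypothetical toy network whose “backbone” is to be frozen while the “head” keeps training):

```python
import mindspore.nn as nn

class Net(nn.Cell):
    """Toy network: a 'backbone' to freeze and a 'head' to train."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Dense(16, 8)
        self.head = nn.Dense(8, 2)

    def construct(self, x):
        return self.head(self.backbone(x))

net = Net()

# Only parameters passed to the optimizer are updated, so filtering out the
# backbone parameters effectively freezes them.
trainable = [p for p in net.trainable_params()
             if not p.name.startswith("backbone")]
optimizer = nn.Momentum(trainable, learning_rate=0.01, momentum=0.9)
```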
Node connection error means the connections between blocks in the computational graph are inconsistent with the design. If you find a node connection error, carefully check whether the script was written incorrectly.
Node mode incorrect refers to operators that behave differently in training and inference modes; the mode must be set according to the actual situation. Typical examples: (1) the BatchNorm operator, whose training mode should be turned on during training (this happens automatically when net.set_train(True) is called); (2) the Dropout operator, which should not take effect during inference.
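A minimal sketch of switching the mode (for any nn.Cell instance, such as the net defined in the freezing example above):

```python
# Training: BatchNorm updates its moving statistics, Dropout is active.
net.set_train(True)
# ... run training steps ...

# Inference: BatchNorm uses the stored statistics, Dropout is disabled.
net.set_train(False)
# ... run evaluation / prediction ...
```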
Loss function incorrect means the loss function is implemented incorrectly, or an appropriate loss function is not selected. For example, BCELoss and BCEWithLogitsLoss are different: the latter applies the sigmoid internally, so the choice depends on whether a sigmoid is already applied to the network output.
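A minimal sketch of the difference (assuming a MindSpore version that provides both losses; the logits and labels are hypothetical):

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor

logits = Tensor(np.array([[1.2, -0.8]], np.float32))  # raw network outputs
labels = Tensor(np.array([[1.0, 0.0]], np.float32))

# BCEWithLogitsLoss applies sigmoid internally: feed it raw logits.
loss_with_logits = nn.BCEWithLogitsLoss()
print(loss_with_logits(logits, labels))

# BCELoss expects probabilities in (0, 1): apply sigmoid first.
loss_plain = nn.BCELoss(reduction='mean')
print(loss_plain(nn.Sigmoid()(logits), labels))  # same value as above
```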
C. Weight initialization
The initial weight values are the starting point of model training, and unreasonable initial values affect the speed and outcome of training. Common weight initialization problems are as follows:
1. All initial weight values are 0
2. In distributed scenarios, the initial weight values differ across nodes
All initial weight values being 0 means the weights are 0 after initialization. This generally causes weight update problems; weights should be initialized with random values.
Initial weight values differing across nodes in a distributed scenario means that, after initialization, weights with the same name have different values on different nodes. Normally, MindSpore performs a global AllReduce on the gradients, ensuring that the weight update amount at the end of each step is the same, so the weights on all nodes stay consistent throughout training. If the weights differ at initialization, the nodes' weights will be in different states in subsequent training, directly affecting model accuracy. In distributed scenarios, the same random seed should be fixed to ensure consistent initial weights, as sketched below.
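A minimal sketch (the seed value and weight shape are hypothetical):

```python
import mindspore as ms
from mindspore.common.initializer import initializer, XavierUniform

# Fix the global random seed (the SAME value on every node) so that all
# nodes produce identical initial weights.
ms.set_seed(1)

# Initialize with random values rather than zeros.
weight = initializer(XavierUniform(), shape=(64, 128), dtype=ms.float32)
```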
1.3 The Same Symptom Can Have Multiple Causes, Making Accuracy Problems Hard to Locate
Take loss non-convergence as an example (figure below): any problem that can cause activation saturation, vanishing gradients, or incorrect weight updates can lead to it. For instance, some weights being wrongly frozen, the activation function mismatching the data (e.g. using the ReLU activation when all input values are less than 0), or a learning rate that is too small are all possible causes of loss non-convergence.
2. Overview of Tuning Ideas
For the symptoms and causes above, common tuning steps are: check the code and hyperparameters, check the model structure, check the input data, and check the loss curve. If none of these reveals a problem, let the training run to the end and check whether the accuracy (mainly the model metrics) meets expectations.
Among these, checking the model structure and hyperparameters focuses on static characteristics of the model, while checking the input data and loss curve combines static characteristics with dynamic training behavior. Checking whether the accuracy meets expectations means re-examining the overall tuning process and considering techniques such as adjusting hyperparameters, explaining the model, and optimizing the algorithm.
To help users apply these tuning ideas effectively, MindInsight provides the capabilities shown below. In future articles in this series, we will cover the preparations for precision tuning, the details of each tuning step, and how to carry them out with MindInsight.
3. Precision Checklist
Finally, we have put together a checklist of common precision problems for your convenience: