Article/Cainiao Technology Department – a hao
Preface
As is well known, machine learning and deep learning require large amounts of training data to improve a model's generalization ability; the data can number in the tens of thousands, millions, or even more. So why do we need so much varied data? Let's start with an example:
Overfitting and underfitting describe two states of a model during training. In general, the training process looks like this:
At the beginning of training, the model is still learning and sits in the underfitting region. As training proceeds, both the training error and the test error decrease. After a critical point, the training error keeps falling while the test error starts to rise, and the model enters the overfitting region: the trained network fits the training set too closely and performs poorly on data outside it. Because generalization error cannot be estimated from training error, blindly reducing training error does not guarantee that generalization error will also drop.
Therefore, machine learning and deep learning models should focus on reducing generalization error, which is what truly reflects predictive power.
The main reasons for generalization error are as follows:
- Training set size. Generally speaking, if the training set contains too few samples, overfitting is more likely
- Lack of sample diversity. If the sample types are not comprehensive, prediction quality suffers
- Noisy samples. Too much noise in the samples also hurts prediction performance
- Model complexity. We look for an appropriate function F(X, Y) to represent the data set. If the model's complexity is too low, it tends to underfit; if it is too high, it tends to overfit
The focus of this article is to share, from the sample perspective, how we produce high-quality samples that meet expectations in our own business.
Commonly used sample manufacturing schemes
- Manual labeling. Collect a large number of page screenshots and label the blocks, basic components, and business components on them (which components appear and where), and also collect the components under different input parameters. The whole sample-generation process costs a great deal of manpower.
- Customized sample generation through code. Pages are generated by code simulation. We also adopted this approach at the beginning, introducing a large number of components according to the characteristics of our business; it was very flexible, but the production cycle was long and the maintenance cost high.
Sample making machine
After stepping into these pitfalls in the early stage, we settled on the sample manufacturing scheme shown below; each part is described in turn:
Material center
Sample generation relies on components, and the technology stack mainly covers the React and Vue ecosystems. By business role, the materials are divided into basic components and business components. When creating samples, components can be selected from the repository or imported as custom NPM component packages.
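To make this concrete, the sketch below shows one possible shape of a material-center entry. The field names and package names are assumptions for illustration, not the actual schema.

```typescript
// Hypothetical shape of a material-center entry; field and package names are assumptions.
interface Material {
  name: string;                  // component name, e.g. 'SearchBar'
  packageName: string;           // NPM package that ships the component
  version: string;               // package version to install
  framework: 'react' | 'vue';    // technology stack of the component
  category: 'base' | 'business'; // basic component vs. business component
}

// Materials for a sample can come from the built-in repository
// or from a custom NPM package imported by the user.
const materials: Material[] = [
  { name: 'Button', packageName: '@example/base-ui', version: '1.2.0', framework: 'react', category: 'base' },
  { name: 'RfTaskCard', packageName: '@example/rf-components', version: '0.3.1', framework: 'vue', category: 'business' },
];
```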
Parameters
Deep learning models usually have high complexity. Take high-order polynomial functions as an example and denote the polynomial by P(x).
P(x) is given by the following K-th order polynomial:

$$P(x) = b + \sum_{k=1}^{K} w_k x^k$$

where the $w_k$ are the weight parameters and $b$ is the bias parameter.
It is not hard to see that a higher-order polynomial model has many parameters. Overfitting is more likely when the training set has too few samples, especially when the sample count is smaller than the number of model parameters (counted element-wise). Based on this theory, we support introducing different component configurations, properties, rules, and so on.
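As an illustration of how varying configurations and properties multiplies the sample space, the sketch below enumerates property combinations for a single component. The rule format and the Button properties are hypothetical.

```typescript
// Minimal sketch (assumed rule format): enumerate property combinations
// so that one component yields many distinct samples.
type PropRule = { name: string; candidates: Array<string | number | boolean> };

function* enumerateProps(rules: PropRule[]): Generator<Record<string, unknown>> {
  if (rules.length === 0) {
    yield {};
    return;
  }
  const [head, ...rest] = rules;
  for (const value of head.candidates) {
    for (const tail of enumerateProps(rest)) {
      yield { [head.name]: value, ...tail };
    }
  }
}

// e.g. a hypothetical Button: 3 types x 2 sizes x 2 disabled states = 12 variants
const variants = [...enumerateProps([
  { name: 'type', candidates: ['primary', 'default', 'danger'] },
  { name: 'size', candidates: ['small', 'large'] },
  { name: 'disabled', candidates: [true, false] },
])];
console.log(variants.length); // 12
```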
DSL description
We define a set of intermediate-state DSL descriptions for the sample manufacturing process to express the incoming materials and parameters. It currently supports: the component name (automatically used as the annotation label), attributes, package name, version, import type (destructured or not), and style; theme, scaffolding, and other package dependencies needed at initialization, along with initialization scripts; and global style settings.
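Purely for illustration, one DSL entry might look like the following. The field names are assumptions derived from the capabilities listed above, not the actual schema.

```typescript
// Illustrative only: a possible intermediate-state DSL entry.
// Field names and package names are assumptions, not the real schema.
const sampleDsl = {
  init: {
    dependencies: ['@example/theme', '@example/scaffold'], // theme / scaffolding packages
    scripts: ['init-env.js'],                              // initialization scripts
  },
  globalStyle: { fontSize: 14, background: '#ffffff' },    // global style settings
  components: [
    {
      name: 'SearchBar',            // also used as the annotation label
      packageName: '@example/base-ui',
      version: '1.2.0',
      importType: 'destructured',   // destructured vs. default import
      props: { placeholder: 'Search orders' },
      style: { margin: 12 },
    },
  ],
};
```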
Scheduler
The Scheduler, as the task control center, first parses the input DSL and iterates over the components to import components, properties, and so on, scheduling the corresponding plug-ins to perform different tasks. For example, the RF pages in our business use Vue components; after parsing out the Vue technology stack, the Scheduler calls the Adaptor to wrap and adapt the Vue components. Finally, it schedules the Simulator to generate the page sample.
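The flow described above might look roughly like the sketch below. The type and plug-in names are assumptions used only to show the control flow, not the actual implementation.

```typescript
// Rough sketch of the scheduling flow; type and plug-in names are assumptions.
interface ComponentDecl {
  name: string;
  framework: 'react' | 'vue';
  props: Record<string, unknown>;
}

interface Plugin {
  apply(input: ComponentDecl): Promise<ComponentDecl>;
}

async function schedule(
  components: ComponentDecl[],
  plugins: { adaptor: Plugin },
  simulator: { render(components: ComponentDecl[]): Promise<void> },
): Promise<void> {
  const prepared: ComponentDecl[] = [];
  for (const component of components) {
    // Vue components are wrapped by the Adaptor so they can be rendered
    // together with the rest of the page.
    prepared.push(
      component.framework === 'vue' ? await plugins.adaptor.apply(component) : component,
    );
  }
  // Finally hand everything to the Simulator to render the page sample.
  await simulator.render(prepared);
}
```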
Plug-in center
Responsible for executing sub-tasks. At present there are four types of plug-ins: Adaptor, Generator, Filter, and Installer.
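One way these four plug-in types could share a common contract is sketched below; the interface is a guess for illustration, not the project's actual plug-in API.

```typescript
// A guessed common contract for the four plug-in types; not the actual API.
type PluginKind = 'adaptor' | 'generator' | 'filter' | 'installer';

interface SamplePlugin {
  kind: PluginKind;
  name: string;
  // Each plug-in executes one sub-task dispatched by the Scheduler.
  run(context: Record<string, unknown>): Promise<void>;
}

// Example: a hypothetical Installer that prepares NPM packages before rendering.
const installer: SamplePlugin = {
  kind: 'installer',
  name: 'npm-installer',
  async run(context) {
    console.log('installing packages declared in the DSL', context);
  },
};
```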
The simulator
After the Scheduler finishes processing the input DSL, the Simulator collects the configured parameters, renders the page image, and automatically generates the corresponding annotation information. The general flow is as follows:
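A minimal sketch of this rendering-and-annotation step, assuming a headless browser (Puppeteer) and a data-component-name marker attribute, neither of which is specified in the original:

```typescript
import puppeteer from 'puppeteer';

// Hypothetical sketch: render a generated page and collect annotations.
// Puppeteer and the data-component-name attribute are assumptions.
async function renderAndAnnotate(url: string, imagePath: string) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 750, height: 1334 });
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Collect bounding boxes of elements tagged with the component label.
  const annotations = await page.$$eval('[data-component-name]', (els) =>
    els.map((el) => {
      const rect = el.getBoundingClientRect();
      return {
        label: el.getAttribute('data-component-name'),
        bbox: [rect.x, rect.y, rect.width, rect.height],
      };
    }),
  );

  await page.screenshot({ path: imagePath, fullPage: true });
  await browser.close();
  return annotations; // saved alongside the image as the sample's annotation file
}
```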