This article is edited and compiled from the talk “The Interpretability of Deep Learning and the Research and Application of Low-Frequency Event Learning in the Financial Field”, given by Dong Si yi, data scientist at Ping An Life Insurance Company, at the Ping An Life & DataFunTalk algorithm-themed technology salon “Research and Application Practice of Machine Learning/Deep Learning in the Financial Field”. It has been lightly edited without changing the original meaning.

Today's talk is organized in five parts. First, I will talk about the limitations of mainstream deep learning algorithms in the financial field, and then explore solutions to these limitations. In that exploration, I will mainly cover two points: interpretability and low-frequency event learning.

Let's start with the limitations of deep learning in finance. The problems faced by the financial field can differ from those faced by traditional Internet companies. Deep learning can be roughly divided into three families: convolutional neural networks, recurrent neural networks and deep neural networks, and their strengths and weaknesses are fairly clear. Simply put, convolutional neural networks are strong at exploring spatial structure correlation, recurrent neural networks are strong at exploring temporal correlation (time series), and deep neural networks are strong at exploring global correlation. Their mainstream applications are concentrated in computer vision, natural language processing and similar directions, which are characterized by distinct prior knowledge. To tell cats from dogs, the features are obvious and data sets can be constructed. In natural language, the vocabulary is limited while articles can be mined without end. In the financial field, however, there is mostly no prior knowledge: people do not know whether the features of a sample are good or not, and sometimes the knowledge only becomes clear through long-term accumulation. For example, in credit, a person who always fails to repay loans is a problem. This is good prior knowledge, but such knowledge is scarce and cannot readily be discovered. Likewise, we cannot create data. If we want to forecast macroeconomic trends or stock indexes, we cannot manufacture stock indexes, and we do not know the relationship between trading volume and the level of the index. Besides the lack of interpretability, there is also a lack of domain knowledge. Therefore, when deep learning is used to make decisions, we must know why, and we want to know how the model arrives at its decisions.

So how do we address these difficulties? First, the difficulties themselves: interpretability; low-frequency events; sparsity; the time variability of features; the validity of data; and the non-extensibility of data. The first two are the focus of today's talk, although the other four are also important. The exploration of interpretability can be divided into two complementary aspects: local feature exploration and sensitivity analysis. Local feature exploration has contributed a great deal to interpretability, and today I mainly introduce what tree models contribute there. For sensitivity analysis I focus on analysis-of-variance algorithms. Low-frequency event learning can be approached in many ways; today I mainly talk about the attention mechanism and introduce some of the latest research.

First, why does the tree model contribute so much to interpretability? I would not say it is directly interpretable within this system, but it does explain something. At present, algorithms fall mainly into deep learning and non-deep learning, the latter centering on decision trees. Deep learning has very strong fitting ability but poor interpretability; decision trees are interpretable and train quickly, but their fitting ability is limited. At first the two schools were opposed, but later they learned from each other and borrowed each other's strengths. For example, on the interpretation side, Geoffrey Hinton proposed a soft decision tree algorithm, which combines tree learning and deep learning for prediction and analysis on image tasks, and tree models have likewise been extended to improve their fitting ability. In addition, Alibaba and Ant Financial have done a lot of work on exploring tree models and combining them with deep learning. In our own case, the solution is to use tree models to explore local features with good interpretability and extract as much value from them as possible. Deep learning still has problems when exploring local features, mainly because of low-frequency features. We then learn from the sparse and dense data with a hybrid architecture such as Wide&Deep, and finally use sensitivity analysis to explain what the predicted classification is based on. Why not apply sensitivity analysis to the deep learning model directly? One problem is that the importance values then differ little from one another, so the sensitivities are not clearly separated. In addition, the financial field has no prior features, and it is difficult to find features that clearly separate the classes; this is also where the tree model helps.

Here is how we get the most out of tree models. The Wide&Deep model was proposed in 2016; it trains on sparse and dense data together to find low-frequency features. We implemented our algorithms and models based on this idea. Different tree models have different characteristics, but many are based on the GBDT algorithm. Take Chartputs as an example: with different row subsampling, column subsampling (taking local subsets of features) and seed selection, the final results are completely different, because different combinations of the data are mined from many angles. Can all of these algorithms be used? As in the gcForest proposed by Professor Zhou Zhihua, where some of the trees are completely random, diversity leaves more room for exploration. We also try to use all the different models and merge similar nodes to form a knowledge base, in which undirected edges connect related features so that the nodes form one large system. One advantage is that the particular algorithm and data no longer matter; the drawback is that most features in the knowledge base are useless and cause a lot of interference.

We work in many different areas of the financial sector, and one characteristic of the sector is that data change over time; almost no feature is stable. So we first do a simple screening, generally using rules and scoring methods to remove obviously invalid features. The core criteria are the stability of a feature's distribution and the stability and favorable trend of its importance. The scoring method makes the screening more quantitative, since it is sometimes hard to weigh distribution stability against importance stability. Training can involve hundreds of tree models with many split points and sometimes hundreds of thousands of leaf nodes. Even after screening, tens of thousands of features remain, and feeding them all into a deep network would overwhelm it. Further screening is therefore required, using auto-encoder compression to remove similar features. Although the algorithms differ, their splitting methods are very similar, so many similar leaf nodes are found, and the more models are introduced, the more similar nodes are generated. Even two GBDT models that differ only in their parameters end up with 1-2% similar leaf nodes. Duplicated important leaf nodes seriously affect model accuracy and also bias the weight estimates: two features may both look important when they are really the same feature. In finance, descriptions of a problem may appear to come from different underlying data while the logic behind them is the same.
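
The distribution-stability screening described above can be illustrated with a small sketch. The talk does not name a specific metric, so the population stability index (PSI) used here is an assumption, as are the function name `psi`, the number of bins and the synthetic data.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a reference and a recent sample.

    Assumption: PSI stands in for the unspecified stability score in the talk.
    """
    # Bin edges are the quantiles of the reference sample.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    # Use only interior edges so out-of-range values fall into the outer bins.
    e_idx = np.searchsorted(edges[1:-1], expected, side="right")
    a_idx = np.searchsorted(edges[1:-1], actual, side="right")
    e_frac = np.bincount(e_idx, minlength=bins) / len(expected)
    a_frac = np.bincount(a_idx, minlength=bins) / len(actual)
    # Avoid log(0) for empty bins.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)   # feature values in the training period
same = rng.normal(0.0, 1.0, 10_000)        # later period, same distribution
shifted = rng.normal(1.0, 1.0, 10_000)     # later period, mean has drifted

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
```

A feature whose score stays near zero across periods would pass the screening, while a drifting feature such as `shifted` would be flagged; the cutoff itself would have to be chosen per business scenario.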

Based on the above, we built Conditional Multi-Fields Deep Neural Networks on top of the Wide&Deep architecture according to our own business requirements. Sparse data is compressed with an auto-encoder, and the hidden layer in the middle of the auto-encoder is taken as input. Dense data is fed in its normal format. If there are time-continuous features, SKM produces an embedding, and a DNN is then used for training. Although the architecture contains several models, it is not necessary to use all of them; sometimes the model on the left alone suffices. If the original Google Wide&Deep architecture is used directly, different modules use different optimization algorithms. If the weight updates are not well tuned, the whole training process is heavily disturbed, fluctuates widely, and has difficulty reaching a stable state. The reason is that gradient updates are inconsistent under joint training: adjusting one side disturbs the other and destroys the stability it had reached. In the end, we limit the optimization ratio so that the update ratios of the modules are as consistent as possible.

Sensitivity analysis leans toward statistics; it is used more in industrial settings and less in pure computer science. It asks how the output changes, and by how much, when an input is perturbed. Although not widely used, it is not unfamiliar. For example, credit card scoring models bin the features and fit a linear regression; judging the impact of each input on the output by its weight is a form of sensitivity analysis. The effect of adversarial examples on image recognition is also a sensitivity issue: in deep learning, local weights can be greatly magnified so that salient features appear, and if the attack hits exactly those salient features, the result is heavily biased.
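
The basic question of sensitivity analysis, how much the output moves when one input is perturbed, can be sketched as follows. This is a minimal one-at-a-time example; `oat_sensitivity`, `toy_model` and the step size `delta` are hypothetical names and values chosen for illustration, not part of the talk.

```python
import numpy as np

def oat_sensitivity(model, x, delta=1e-2):
    """Perturb one input at a time and record the change in output per unit change."""
    x = np.asarray(x, dtype=float)
    base = model(x)
    effects = np.zeros_like(x)
    for i in range(x.size):
        perturbed = x.copy()
        perturbed[i] += delta          # disturb only the i-th input
        effects[i] = (model(perturbed) - base) / delta
    return effects

def toy_model(x):
    # Hypothetical black box: linear in x[0], quadratic in x[1], ignores x[2].
    return 3.0 * x[0] + 0.5 * x[1] ** 2 + 0.0 * x[2]

effects = oat_sensitivity(toy_model, [1.0, 2.0, 5.0])
```

Because each input is moved on its own, this kind of analysis says nothing about interactions between inputs, which matters when the variables are correlated.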

Worst-case analysis and reliability analysis are more industrial and are rarely seen in the financial field. Sensitivity analysis is widely used in industry, especially in quantitative work. We introduce sensitivity analysis in order to explain the DNN black box: the general mechanism of a DNN is known, but its internal workings are not clear. Research in this area has existed for a long time, and it became really popular in 2015. The authors of Sensitivity Analysis for Neural Networks proposed it in 2010, explaining how to use sensitivity analysis to explain neural networks. The purpose of sensitivity analysis is to quantify the sensitivity of variables and present it in a form like a linear regression model, giving importance indicators that combine by linear weighting. Commonly used methods include partial derivatives, regression models, one-at-a-time, analysis of variance, scatter plots and meta-models. The first three are first-order analyses, which measure the sensitivity of the output to changes in each variable on its own, and they assume that the variables are essentially unrelated. In finance the variables are almost always related, so analysis of variance and meta-models are used; scatter plots serve for intuitive inspection. These choices suit application scenarios that are nonlinear, locally correlated (local high order), high-dimensional, in need of quantification, and built on complex models that are difficult to explain. Today analysis of variance and the Gaussian process are presented as separate modules; how they are used together will be explained in the future.

The theoretical basis of ANOVA is that any model can be decomposed into a constant term, plus terms that depend on one variable, plus terms that depend on pairs of variables, and so on. With input X and output Y, each term fi describes how the output changes when xi is perturbed, and the variance of each term is its contribution. Putting all the variances together tells us how much of the change in the output is due to perturbing each parameter; that is what ANOVA does. The raw variances differ greatly in scale depending on the input parameters, so Sobol indices are used to normalize them: each index is the ratio of the computed variance to the total variance of the output. Advantages of ANOVA: it applies to complex nonlinear models; sensitivity is quantified, usually in the range [0, 1]; it can be refined to measure sensitivity within a value interval of a parameter or variable; and it can measure dependence relationships. Disadvantages: enough data is required to ensure accurate calculation, and the amount required grows exponentially with the dimension; and its ability to separate variables or parameters that are not significant enough is weak.
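
To make the normalized variance concrete, here is a minimal sketch of a first-order Sobol index estimate. It assumes independent inputs drawn uniformly from [0, 1] and uses a standard pick-freeze Monte Carlo estimator; the talk does not say which estimator was used, and `first_order_sobol` and `toy_model` are hypothetical names.

```python
import numpy as np

def first_order_sobol(model, dim, n=200_000, seed=0):
    """Estimate first-order Sobol indices with a pick-freeze Monte Carlo scheme.

    Assumption: inputs are independent and uniform on [0, 1].
    """
    rng = np.random.default_rng(seed)
    a = rng.random((n, dim))
    b = rng.random((n, dim))
    f_a, f_b = model(a), model(b)
    total_var = np.var(np.concatenate([f_a, f_b]))
    indices = np.empty(dim)
    for i in range(dim):
        ab = a.copy()
        ab[:, i] = b[:, i]             # take column i from B, all others from A
        # Variance explained by x_i alone, divided by the total output variance.
        indices[i] = np.mean(f_b * (model(ab) - f_a)) / total_var
    return indices

def toy_model(x):
    # Hypothetical model: y = x0 + 2 * x1, the third input is unused.
    return x[:, 0] + 2.0 * x[:, 1]

sobol = first_order_sobol(toy_model, dim=3)
```

For this toy model the analytic values are 0.2, 0.8 and 0, so the estimates should land close to those. The number of model evaluations grows with both the sample size and the dimension, which is the data requirement listed among the disadvantages.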

How can the shortcomings of variance analysis be addressed? The idea is to use one model to describe the behavior of another: a model that tracks how the original responds, learns it, and stands in for it. There are many such models. We chose the Gaussian process, which outputs the expected mean and the variance for a given input, exactly what sensitivity analysis needs. We did not use naive Bayes or similar methods because our object is a complex nonlinear model and the fitting ability of naive Bayes is limited. A Gaussian process requires specifying a number of kernel functions, which are used to fit the mean and variance of the variables in different situations, and it fits real-world distributions very well. Its principle builds on the Bayesian approach: it maps the variables into a more complex space, finds the posterior distribution of the weights, and uses inference to estimate the change in the output distribution brought about by changes in inputs or parameters. Detailed principles and ideas can be found in the paper “Probabilistic Sensitivity Analysis of System Availability Using Gaussian Process”. The core of sensitivity analysis is how a change in input changes the output, and an approximate importance can be obtained in many ways. Proving how deep learning works at the level of mechanism is very difficult, so in practice it is mostly explained by different methods chosen to suit the business situation.
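
The property that matters here, a surrogate that returns both a mean and a variance for any input, can be sketched with a small Gaussian process regression in NumPy. The RBF kernel, its hyperparameters, the noise level and the sine function standing in for the black-box model are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel for 1-D inputs (assumed kernel choice)."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(x_train, y_train, x_test, noise_var=1e-4):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    k = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_s = rbf_kernel(x_train, x_test)
    k_ss = rbf_kernel(x_test, x_test)
    chol = np.linalg.cholesky(k)
    # alpha = K^{-1} y, computed through the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_s.T @ alpha
    v = np.linalg.solve(chol, k_s)
    var = np.diag(k_ss) - np.sum(v**2, axis=0)
    return mean, var

# Stand-in for the expensive black-box model being explained.
x_train = np.linspace(-3.0, 3.0, 15)
y_train = np.sin(x_train)

x_test = np.array([0.5, 10.0])   # one point inside the data, one far outside
mean, var = gp_predict(x_train, y_train, x_test)
```

Inside the range of the training data the predicted variance is small, while far from the data it returns to the prior variance, so the surrogate also signals where its sensitivity estimates should not be trusted.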

Low-frequency events are very common in the financial field. In quantitative trading of stock index futures, inflection points must be predicted and judged, but inflection points occur very rarely and their features are very vague. In anti-money laundering, the events of interest are also rare; some of their features are obvious, yet a model trained on them directly does not produce usable results. We have tried before to learn low-frequency events with attention, but the difficulty is that approaches to low-frequency events mostly rely on prior knowledge, and there is very little prior knowledge in the financial field. Two recent studies are relevant: Attention Is All You Need, by researchers at Google, and Relational Recurrent Neural Networks, by the DeepMind team, which explores the continuity of feature memory with the attention mechanism. These two papers mainly deal with temporal correlation, but our setting does not need it; we are only interested in the data samples that matter most to the total error. To improve the learning of low-frequency events, we built an algorithm suited to our needs on top of this research. The goal is to learn the salient features of the small set of important samples that are easily misclassified, and to retain and pass on the learned features. We call this algorithm Low Frequent Events Detection with Attention Mechanism. Scaled dot-product attention compares each query with the keys, observes how strongly each query responds to each key, and uses the scaled responses to weight the values that form the output. Multi-head attention combines several such attentions, each with its own linear transformation, to obtain richer features.
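
Scaled dot-product attention as defined in Attention Is All You Need can be written in a few lines. The sketch below uses NumPy with random matrices; the shapes and variable names are illustrative only and do not reflect the configuration used in the talk.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D query, key and value matrices."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # how strongly each query responds to each key
    weights = softmax(scores, axis=-1)  # normalize the responses per query
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))   # 2 queries of dimension 4
k = rng.normal(size=(5, 4))   # 5 keys of dimension 4
v = rng.normal(size=(5, 3))   # 5 values of dimension 3

output, weights = scaled_dot_product_attention(q, k, v)
```

Each row of `weights` sums to one, so every output row is a weighted average of the value rows. Multi-head attention would run this on several linearly projected copies of `q`, `k` and `v` and concatenate the results.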

In Relational Recurrent Neural Networks, the memory core mechanism takes the previously stored distribution of data features together with the current input and learns from both. How does this relate to our low-frequency event learning? The attention mechanism matches queries against a key matrix and then computes normalized weights that influence the output values. In our model, the input passes through an MLP to produce the output. After the attention mechanism is added, the MLP module performs the binary classification and its intermediate embedding is fed to a discriminator, similar to the discriminator in a GAN. The discriminator uses these features to identify which samples were classified correctly and which were not, which shows what the model is and is not sensitive to. By setting a threshold, the samples judged to be wrong are placed in the memory core to be learned and corrected. Finding the distinctive but shared characteristics of the errors and using them to correct the output is the whole idea of the model. The embedding comes from the middle hidden layer of the MLP; the discriminator classifies it and changes the distribution of the samples in that space.
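
One possible reading of the thresholding step is sketched below: the discriminator is trained to predict whether the MLP got a sample wrong, and only samples whose predicted error probability exceeds a threshold t are handed to the memory core. The function names, the 0.5 classification cutoff and the example numbers are assumptions; the talk does not give implementation details.

```python
import numpy as np

def discriminator_targets(y_true, mlp_prob, cutoff=0.5):
    """Label each sample 1 if the MLP misclassified it, else 0 (assumed target definition)."""
    mlp_pred = (mlp_prob >= cutoff).astype(int)
    return (mlp_pred != y_true).astype(int)

def select_for_memory(disc_prob_wrong, t=0.8):
    """Indices of samples the discriminator flags as wrong with probability above t."""
    return np.flatnonzero(disc_prob_wrong > t)

y_true = np.array([1, 0, 1, 0])
mlp_prob = np.array([0.9, 0.2, 0.3, 0.7])            # MLP scores for class 1
targets = discriminator_targets(y_true, mlp_prob)     # what the discriminator learns

disc_prob_wrong = np.array([0.05, 0.10, 0.95, 0.60])  # hypothetical discriminator output
memory_idx = select_for_memory(disc_prob_wrong, t=0.8)
```

With a high threshold, a sample such as the last one (misclassified, but flagged with only moderate confidence) is left out, which keeps the memory core focused on errors the discriminator can identify reliably.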

The optimization process is divided into three parts: the normal MLP optimization, the discriminator optimization, and the memory core (MC) optimization. Correspondingly there are three loss functions: the first measures the MLP against the true labels, the second measures whether the discriminator separates correct from incorrect predictions, and the third measures the MC-corrected output against the true labels. The following points need attention in training: (1) The second loss function depends on the prediction results of the model, so the three modules are independent and trained asynchronously. (2) As each module has a different purpose, the optimization algorithms and strategies differ. (3) Choosing an appropriate threshold t is very important for training (t > 0.8 is recommended). (4) Choose an appropriate number of queries and keys. (5) The MC module only uses samples that meet the discriminator's requirement. (6) The modules start training at different times: when MLP training becomes stable, discriminator training is activated, and when discriminator training tends to be stable, the MC module is activated. The discriminator is evaluated not only by its loss value but also by monitoring the change in accuracy. The difference between loss function (1) and loss function (2) can be used to evaluate the effect of the MC module.
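
The staged activation in point (6) can be sketched as a simple schedule. The stability test used here (the loss range over a recent window falling below a tolerance) is an assumption, as are the names `is_stable` and `active_modules`; the talk only says that each module starts once the previous one has become stable.

```python
def is_stable(losses, window=5, tol=1e-2):
    """Treat a loss curve as stable when its last `window` values vary by less than `tol`."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < tol

def active_modules(mlp_losses, disc_losses):
    """Return the modules that should be training, in activation order."""
    active = ["mlp"]
    if is_stable(mlp_losses):
        active.append("discriminator")
        if is_stable(disc_losses):
            active.append("memory_core")
    return active

early = active_modules([1.0, 0.8, 0.6], [])
middle = active_modules([0.5, 0.301, 0.300, 0.300, 0.300, 0.300], [0.7, 0.6])
late = active_modules(
    [0.5, 0.301, 0.300, 0.300, 0.300, 0.300],
    [0.7, 0.402, 0.401, 0.400, 0.400, 0.400],
)
```

A fixed step count could replace the stability test for the first transition; either way, each module keeps its own optimizer and is updated asynchronously once it is active.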

Here is a case: a recommendation task in life insurance with a binary target (0 or 1). The total sample size is 1.7 million and the target (positive) sample size is 390,000, so the event is not extremely low-frequency. Algorithm architecture: the main framework is based on the Conditional Multi-Fields DNN, and the dense data part uses (1) a DNN and (2) the attention mechanism. Training details: with the attention mechanism introduced, the discriminator is activated and warmed up after the MLP has trained for 1,000 steps. After the discriminator loss stabilizes, the MC module is activated and prediction training is carried out. The model is then updated asynchronously each round. The gradual increase in accuracy indicates that the model has some effect, although it does not by itself show how large the effect is; the current criterion is that accuracy is higher than that of the MLP alone. At the same time, the widening difference indicates that the memory module is working.

Model results: (1) Because the labels were not balanced 1:1, the benchmark accuracy based on the data was 77.06%. (2) When a DNN was used on the dense data, the overall accuracy of the model was 82.35% (samples with scores in the top 23% were predicted positive); the threshold is not 0.5 but has to be set according to the actual sample distribution. (3) About 240,000 positive samples were predicted correctly, an accuracy of about 61.53%; the prediction accuracy for negative samples was 88.54%. (4) The discriminator correctly determined whether a sample had been classified right or wrong for 91.04% of the samples. In other words, 8.96% of the samples were misclassified but extremely similar to correctly classified samples in the current low-dimensional mapping space. (5) From results (2) and (4), the upper limit on the share of samples that MC has a chance to correct is 8.69% (the 17.65% that are misclassified minus the 8.96% that cannot be told apart). (6) In the final statistics, 69,700 samples were corrected by MC, 4.1% of the total. Of these, 21,000 were positive samples, which is 5.38% of all positive samples.
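
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the benchmark is the majority-class accuracy, that the 61.53% and 88.54% figures are per-class accuracies, and that the upper limit in (5) is the misclassified share minus the share the discriminator cannot separate; these readings are inferred from the numbers, not stated in the talk.

```python
total = 1_700_000
positives = 390_000
negatives = total - positives

# (1) Benchmark: always predict the majority (negative) class.
baseline = negatives / total

# (2)-(3) Overall accuracy recomputed from the per-class accuracies.
pos_accuracy = 0.6153
neg_accuracy = 0.8854
overall = (pos_accuracy * positives + neg_accuracy * negatives) / total

# (4)-(5) Misclassified share minus the share the discriminator cannot separate.
misclassified = 1 - 0.8235
indistinguishable = 1 - 0.9104
upper_limit = misclassified - indistinguishable

# (6) Share of samples corrected by the memory core.
corrected_share = 69_700 / total
corrected_positive_share = 21_000 / positives

print(f"baseline={baseline:.4f} overall={overall:.4f} upper_limit={upper_limit:.4f}")
print(f"corrected={corrected_share:.4f} corrected_positive={corrected_positive_share:.4f}")
```

Under these assumptions the recomputed values agree with the reported ones to within rounding.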

Finally, the limitations of the algorithm: the mining of low-frequency events must fit the actual situation, and not every scenario suits this algorithm. It should only be considered when the following conditions are met: (1) The low-frequency events are sufficiently numerous (their share should not be too low) and have common characteristics. (2) The low-frequency events are not too similar to the high-frequency events. (3) The accuracy of the main module (MLP) is not too low. It should be at least slightly more accurate than the model as a whole. Compared with a single attention structure, the algorithm also has several advantages: (1) A discriminator is used to separate the samples and focus on those that are wrongly classified, instead of applying the attention module directly. (2) The MC module learns only a small set of important samples rather than the whole data set, which reduces the difficulty of learning and improves efficiency. (3) Asynchronous training makes training more stable, and different optimization strategies are adopted for different network structures, data structures and functions. (4) The memory module learns the features of the whole low-frequency data set and passes on the learned information effectively, so as to distinguish the characteristics of the data better and improve generalization.