This paper has been accepted by SIGIR 2018
Authors | Guo Lin, Ye Hui, Liu Hehuan, Sun Kai, Hou Jun, et al. (Alibaba)
Editor | Natalie
AI Front Line introduction: Since human cognition and perception of the world come mainly from vision, good visualization can effectively help people understand deep neural networks and evaluate, optimize, and adjust them. The prerequisite for visualization is exposing the relevant data of the model so that it can be visually analyzed and evaluated, ultimately turning the neural network from a "black box" into a "white box". To address these challenges, the Alibaba team built DeepInsight, a visual analysis platform for industrial-scale deep learning applications. This installment of AI Front's paper-introduction series (paper 39) explains the platform and shows how Alibaba uses it to visually understand its CTR prediction model.

Background

Deep learning has made great progress in both research and application. Until now, however, deep learning algorithms have remained opaque and are used essentially as "black boxes". In recent years, much effort has gone into better understanding the complex mechanisms of deep learning, both to ensure the safety and reliability of the algorithms and to enable further optimization.

While some progress has been made on algorithm interpretability in image processing and natural language processing, there is still a gap in e-commerce and advertising, even though deep learning is already used in advertising at large scale. Advertising is a core business and an important source of cash flow for many Internet companies, and the deep neural network model is the core module of that core business, so understanding and evaluating this "black box" algorithm effectively is very important.

Since human cognition and perception of the world come mainly from vision, good visualization can effectively help people understand deep neural networks and evaluate, optimize, and adjust them. The prerequisite for visualization is exposing the relevant data of the model so that it can be visually analyzed and evaluated, ultimately turning the neural network from a "black box" into a "white box". To address these challenges, we built DeepInsight, a visual analysis platform for industrial-scale deep learning applications.

We take a simple but representative deep neural network model as an example to illustrate how visual model analysis applies to several typical and important problems: 1. evaluating generalization; 2. feature design; 3. model structure design.

Most research on visualizing models in image or natural language processing works at the granularity of individual samples. Unlike those scenarios, industrial CTR estimation faces massive data and features, biased labeled data, and sparse, complex signal patterns, while the final evaluation focuses primarily on macro business metrics. Starting from these business characteristics, we work with statistical signals to explore and understand the model's macro behavior on the entire target data set. Details of the experimental work are available in our English paper (see the link at the end of this article).

Platform Introduction

The DeepInsight platform is designed for industrial-scale deep learning algorithm development and application. It provides full lifecycle management of model training tasks and an efficient, comprehensive way to expose model data. Its core features include multi-dimensional data visualization, real-time analysis of large-scale data, and data re-modeling.

The platform is deployed as a distributed micro-service cluster composed of three subsystems: the front-end web platform, the back-end micro-services, and the deep learning components. Each micro-service instance is isolated and does not affect the others. At present, two large-scale parallel training frameworks, TensorFlow and MXNet, have been integrated, supporting complex scenarios such as multi-task learning, transfer learning, reinforcement learning, GANs, and model fusion. The goal is to improve the interpretability of neural networks and to solve problems such as model debugging and issue diagnosis by exposing and visualizing data. Training tasks are managed across their whole lifecycle to provide a one-stop visual evaluation service. Beyond enabling the business, the business teams also feed post-processed data back to the platform, building an AI visualization ecosystem around DeepInsight's data core.

Algorithm Experiments

Without loss of generality, the model adopts a simple GwEN structure [1]. For each input sample, sparse feature IDs are mapped to low-dimensional dense embedding vectors, and the embedding vector of each feature group is obtained by a sum-pooling operation within the group. The embedding vectors of all feature groups are concatenated and passed as input to the subsequent fully connected layers. The model has four fully connected hidden layers with ReLU as the activation function, and the output layer produces the estimated click-through rate (pCTR) through a sigmoid operation.
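
To make the structure concrete, here is a minimal sketch of this kind of group-wise embedding network in TensorFlow/Keras. The number of feature groups, vocabulary size, embedding dimension, IDs per group, and hidden-layer widths below are placeholder assumptions for illustration, not the values used in the actual model.

```python
import tensorflow as tf

# Placeholder sizes for illustration only -- not the values used in the paper.
NUM_GROUPS = 16        # number of feature groups
VOCAB_SIZE = 100000    # ID vocabulary size (assumed uniform across groups here)
EMB_DIM = 8            # embedding dimension
IDS_PER_GROUP = 5      # sparse IDs per group (fixed length in this sketch)

# One integer-ID input per feature group.
inputs = [tf.keras.Input(shape=(IDS_PER_GROUP,), dtype="int32", name=f"group_{g}")
          for g in range(NUM_GROUPS)]

group_vectors = []
for g, inp in enumerate(inputs):
    # Map sparse feature IDs to dense embeddings, then sum-pool within the group.
    emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, name=f"emb_{g}")(inp)
    pooled = tf.keras.layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))(emb)
    group_vectors.append(pooled)                       # (batch, EMB_DIM)

# Concatenate the group-wise embeddings and feed them to four ReLU hidden layers.
x = tf.keras.layers.Concatenate(name="concat_embeddings")(group_vectors)
for units in (512, 256, 128, 64):
    x = tf.keras.layers.Dense(units, activation="relu")(x)
pctr = tf.keras.layers.Dense(1, activation="sigmoid", name="pctr")(x)  # estimated CTR

model = tf.keras.Model(inputs=inputs, outputs=pctr)
```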

For models at different stages of training, we collect the model's state data on different data sets by dynamically exposing it; this data is the basis of the visual analysis.

Generalization and Neuron State Fluctuation

It is well known that deep neural networks have powerful fitting ability. As training goes on, the model keeps fitting the training data and becomes more and more sensitive to small differences in the input. Given a model, the state of each neuron is determined by the sample input, so variations in input across samples in a data set cause the neuron states to fluctuate, and the degree of fluctuation reflects how sensitive the model is to the input. If the model is too sensitive to the training data, its generalization ability deteriorates. Our visualization clearly shows the relationship between the model's generalization performance and the degree of neuron state fluctuation.

The figure below shows the average fluctuation degree of each neuron's state value in the fourth hidden layer and compares the statistics of models at different training stages on the training and test sets. Before over-fitting, the fluctuation of the neurons remains relatively stable and the training and test sets are fairly consistent. Once the model over-fits, the fluctuation increases significantly and is clearly stronger on the training set than on the test set, reflecting that an over-fitted model is over-sensitive to the training data.

Aggregating the mean fluctuation degree of all neurons in a hidden layer, we find that this indicator correlates with the change in model performance (AUC) on different data sets. The neuron fluctuation degree therefore gives us a means to understand and detect over-fitting. In addition, computing this metric does not require labels, so it can help us evaluate the model on data sets where click feedback is not available.
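
As a rough illustration of how such an indicator can be computed from exposed activations, the sketch below approximates the fluctuation degree of a neuron by the standard deviation of its activation across samples and aggregates the per-neuron values into a layer-level mean. This specific definition is an assumption made for the sketch; the paper's exact statistic may differ, but the label-free comparison between training-set and test-set levels follows the same idea.

```python
import numpy as np

def neuron_fluctuation(activations):
    """activations: (num_samples, num_neurons) array of one hidden layer's
    outputs collected on a single data set.

    Approximates the fluctuation degree of each neuron by the standard
    deviation of its activation across samples (an assumption made for this
    sketch), and aggregates the per-neuron values into a layer-level mean.
    """
    per_neuron = activations.std(axis=0)   # fluctuation of each neuron
    layer_mean = per_neuron.mean()         # aggregate indicator for the layer
    return per_neuron, layer_mean

# Usage (collect_hidden_outputs is a hypothetical helper exporting layer states):
# _, train_level = neuron_fluctuation(collect_hidden_outputs(model, train_set))
# _, test_level = neuron_fluctuation(collect_hidden_outputs(model, test_set))
# A training-set level far above the test-set level suggests over-fitting,
# and no click labels are needed to compute it.
```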

Feature Influence

Compared with traditional logistic regression models, deep neural networks can automatically mine nonlinear cross features from the input. Nevertheless, we find that the quality of the input features themselves greatly affects the model's performance.

Which features are important to the model? For a traditional logistic regression model, we can judge the importance of a feature from its weight. This is not possible for deep neural networks.

We use gradient information to understand the influence of each feature group on the model: the model output (pCTR) is differentiated with respect to the input of the fully connected network. The magnitude of the gradient represents how sensitive the model's output estimate is to small changes in the input, and thus reflects the input's influence on the model; the larger the gradient, the greater the influence. Aggregating the average gradient magnitude corresponding to each feature group describes that group's influence on the model.
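
A minimal sketch of this gradient-based influence measure is shown below, assuming a Keras model whose per-group embeddings are concatenated by a layer named "concat_embeddings" before a stack of Dense layers (hypothetical wiring matching the earlier sketch). It differentiates pCTR with respect to that concatenated input using tf.GradientTape and averages the absolute gradients per feature group.

```python
import tensorflow as tf

def feature_group_influence(model, batch_inputs, num_groups, emb_dim):
    """Mean absolute gradient of pCTR w.r.t. each feature group's pooled
    embedding, for one batch of inputs. Assumes the model exposes the
    concatenated embeddings through a layer named "concat_embeddings" and
    stacks only Dense layers after it (hypothetical wiring for illustration).
    """
    # Sub-model that maps raw feature IDs to the concatenated embedding vector.
    embed_fn = tf.keras.Model(model.inputs,
                              model.get_layer("concat_embeddings").output)
    mlp_in = embed_fn(batch_inputs)                    # (batch, num_groups * emb_dim)

    with tf.GradientTape() as tape:
        tape.watch(mlp_in)
        x = mlp_in
        # Re-run the dense stack (including the sigmoid output) on the watched tensor.
        for layer in model.layers:
            if isinstance(layer, tf.keras.layers.Dense):
                x = layer(x)
        pctr = x                                       # (batch, 1)

    grads = tape.gradient(pctr, mlp_in)                # d pCTR / d embedding input
    grads = tf.reshape(tf.abs(grads), (-1, num_groups, emb_dim))
    # Average over the batch and the embedding dimensions of each group.
    return tf.reduce_mean(grads, axis=[0, 2]).numpy()  # one score per feature group
```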

The figure below compares the average influence of each feature group in two models in different states (not over-fitted vs. over-fitted). The difference is clear: in the over-fitted case, the model is overly sensitive to a small number of feature groups, especially the groups numbered 1 and 11. Both are single features with a very large number of ID values, such as user ID, which require a large parameter space while carrying very little generalizable information.

The Utility of Hidden Layers and Their Information Representation

By visualizing the output vectors of the hidden layers, we show the model's integrated representation of the input information, which helps us understand the model's internal mechanism and the influence of model structure on performance. In the figure below, the output vectors of different hidden layers are projected onto a 2-dimensional plane with t-SNE. Unlike the visualization results for image classification [2], we do not observe a separation between the two classes of sample points, clicked and non-clicked; this is determined by the high noise of the sample information in our scenario. However, the clicked sample points do show spatial clustering. The clustering in the third layer is more pronounced than in the second layer, indicating that the third layer's representation is more discriminative, but the fourth layer shows no sign of further improvement.
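
For reference, a projection like this can be produced from the exposed hidden-layer outputs with scikit-learn's t-SNE, as in the sketch below; the variable names and subsampling size are placeholders, since t-SNE becomes expensive on large sample sets.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_hidden_layer(hidden_out, clicks, sample_size=5000, seed=0):
    """Project one hidden layer's output vectors to 2-D with t-SNE.

    hidden_out: (num_samples, hidden_dim) exposed hidden-layer outputs.
    clicks:     (num_samples,) 0/1 click feedback used only to color the points.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(hidden_out),
                     size=min(sample_size, len(hidden_out)),
                     replace=False)                    # t-SNE is costly; subsample
    points = TSNE(n_components=2, init="pca",
                  random_state=seed).fit_transform(hidden_out[idx])
    return points, clicks[idx]

# Plotting the returned points and coloring them by the click label (e.g. with
# matplotlib's scatter) reproduces the kind of 2-D view discussed above.
```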

This in turn guides the structural design of the model: our experiments show that a model trained without the fourth hidden layer achieves results similar to the model with four hidden layers.

The Utility of Hidden Layers and Representation Re-modeling

The previous section discussed what each hidden layer contributes to the model's classification performance. The DeepInsight platform allows us to easily re-model the exposed data to deepen our understanding of the model's structure.

We use the linear probe method proposed by Alain and Bengio [3]: for each sample, the hidden layer's representation vector is taken as the input feature and the sample's click feedback as the label to train a logistic regression probe model. Comparing the performance of probe models trained on different hidden layers helps us understand the effect of the hidden-layer structure on model performance. As shown in the figure below, from the first to the third layer the discriminative power of the hidden layer's output with respect to click behavior increases layer by layer, while the fourth layer brings no significant gain, consistent with the conclusion of the previous section.
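
A minimal sketch of such a probe, assuming the hidden-layer representations and click labels have already been exported as arrays (placeholder names), could look like the following; comparing the returned AUC across layers 1 through 4 reproduces the layer-by-layer comparison described above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def probe_auc(train_repr, train_clicks, test_repr, test_clicks):
    """Train a logistic-regression probe on one hidden layer's representations
    and report its test AUC, following the linear-probe idea of Alain & Bengio [3].

    train_repr / test_repr:     (num_samples, hidden_dim) hidden-layer outputs
    train_clicks / test_clicks: 0/1 click feedback for the same samples
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_repr, train_clicks)
    scores = probe.predict_proba(test_repr)[:, 1]      # predicted click probability
    return roc_auc_score(test_clicks, scores)

# Running probe_auc(...) on the representations of layers 1 through 4 and
# comparing the AUCs shows how discriminative each layer is; a flat AUC from
# layer 3 to layer 4 matches the finding above.
```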

Summary

We explored visualization and interpretability of deep learning in the e-commerce advertising scenario. By analyzing the internal data of the deep neural network model, we opened up the "black box" to understand the model's internal states and mechanisms in depth. These explorations have been put into production as platform services that facilitate algorithm development and business applications.

Paper link:

https://arxiv.org/abs/1806.08541

References:

[1] Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2017. Deep Interest Network for Click-Through Rate Prediction. arXiv preprint arXiv:1706.06978 (2017).

[2] Paulo Rauber, Samuel Fadel, Alexandre Falcao, and Alexandru Telea. 2017. Visualizing the hidden activity of artificial neural networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 101–110.

[3] Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016).