Article from Alibaba Xianyu (Taobao) Technology

Abstract

Accuracy or interpretability? Only children make choices; adults want both.

Background

In many business scenarios at Xianyu, there is a strong need to classify things with algorithms, such as image classification, component identification, commodity stratification, and dispute-category prediction. These scenarios often require the model's results to be interpretable: it is better to explain, during recognition, the hierarchy and rationale behind the predicted category rather than just output the category itself. How to classify images with explanations therefore became a requirement in our project R&D, and for this reason I researched the NBDT algorithm.

NBDT comes from a recent paper (April 2020) by UC Berkeley and Boston University. NBDT stands for Neural-Backed Decision Tree: note that it is "Backed", not "Boosted", and that an NBDT is a single decision tree, not an ensemble of multiple trees.

Introduction

The distinctive feature of NBDT is that it integrates a neural network (NN) into a decision tree (to be exact, in front of the decision tree), where the NN is usually a CNN, i.e. a convolutional neural network. Personally, I think the structure of NBDT can be roughly viewed as "CNN in front + DT (decision tree) behind". NBDT is currently applied to image classification. Its advantage is not higher accuracy: in fact, in the authors' experiments its accuracy is slightly lower than that of the front CNN alone. Its real advantage is a good balance between model accuracy and model interpretability. Specifically, while sacrificing only a little of the CNN's accuracy, it achieves much higher classification accuracy than any pure tree model; at the same time, because a decision tree is built in, it can explicitly give the basis of the model's inference step by step. That is to say, NBDT can not only identify a picture of a dog as "dog", it can also tell you how the picture is recognized step by step: for example, the image is identified as "animal" with 99.49% probability, then as "chordate" with 99.63%, as "carnivore" with 99.4%, and finally as "dog" with 99.88%. Such inference paths undoubtedly enhance the explanatory power of the model.

Figure 1 – Dog classification (quoted from the official demo)

Principle in Detail

NBDT uses a "pre-train + finetune" framework. The whole process can be roughly divided into the following three steps:

① Pre-train a CNN model and take the weights of its last layer as the hidden vectors of the classes.

For example, train a ResNet18 CNN on CIFAR10 (an image-classification data set with 10 categories such as "cat" and "dog"). The last layer of such a CNN is usually a fully connected (FC) layer. Suppose the output of the penultimate layer is a d-dimensional vector; then the FC layer's weight matrix W has shape $W_{d \times 10}$, and each column vector of W corresponds to exactly one category. You can think of each column as the implicit (hidden) vector of that class. This idea is somewhat similar to Word2Vec.
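As a concrete illustration of step ①, here is a minimal sketch assuming torchvision's ResNet18 stands in for the ResNet18 mentioned above (the variable names are my own, not the NBDT code's):

```python
from torchvision.models import resnet18

# Minimal sketch: read the final fully connected layer of a ResNet18 as per-class
# hidden vectors. PyTorch stores fc.weight with shape (num_classes, d), so each
# *row* is one class's d-dimensional vector (the columns of W in the text above).
model = resnet18(num_classes=10)          # e.g. a ResNet18 trained on CIFAR10
class_vectors = model.fc.weight.detach()  # shape (10, 512) for ResNet18
print(class_vectors.shape)                # torch.Size([10, 512])
```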

② Perform hierarchical clustering on the class hidden vectors and use WordNet to name the resulting hierarchical tree structure.

This tree structure is called the Induced Hierarchy. Specifically, we first run hierarchical clustering on the classes' hidden vectors; the source code directly uses the AgglomerativeClustering class from sklearn. This clustering hierarchy raises two problems: (1) when the clustering algorithm groups two child nodes together, each child node represents a class of entities, but their parent node has no entity description of its own; (2) assuming two child nodes are clustered together and both have hidden vectors, how should the hidden vector of their parent node be represented?
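As an aside on the clustering call itself, here is a minimal sketch of how sklearn's AgglomerativeClustering exposes the binary merge tree; the random class_vectors placeholder and the variable names are mine, not the repository's:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

num_classes, d = 10, 512
class_vectors = np.random.randn(num_classes, d)  # placeholder for the real FC weights

# Merge all classes into one binary tree; children_ records the merge order.
agglo = AgglomerativeClustering(n_clusters=1, linkage="ward").fit(class_vectors)

# children_[i] are the two nodes merged at step i; indices >= num_classes refer
# to internal nodes created at earlier steps (node id = num_classes + step).
for step, (left, right) in enumerate(agglo.children_):
    print(f"internal node {num_classes + step} merges nodes {left} and {right}")
```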

For problem (1), the author uses WordNet, a word network that contains hypernym/hyponym (is-a) relations between nouns. In Python, the WordNet corpus can be loaded directly from the NLTK module. Since the leaf nodes have entity descriptions (for example, the 10 categories of CIFAR10), the "nearest common ancestor" of two leaf nodes can be found through WordNet. For example, the nearest common ancestor of "cat" and "dog" in WordNet might be "mammal", so "mammal" becomes the parent of "cat" and "dog". In this way, parent nodes can be "named" bottom-up according to the hierarchical clustering result until only one root node remains, which forms the so-called induced hierarchy, i.e. "Step 1" in the figure below. This induced hierarchy is exactly the decision tree shown in the dog image above.
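The "nearest common ancestor" lookup can be done with NLTK's WordNet interface, roughly as below (a sketch; the exact ancestor synset returned depends on WordNet's taxonomy and need not be "mammal"):

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # fetch the WordNet corpus on first use

cat = wn.synset("cat.n.01")
dog = wn.synset("dog.n.01")

# The lowest common hypernym plays the role of the clustered parent's "name".
ancestor = cat.lowest_common_hypernyms(dog)[0]
print(ancestor.name())  # an ancestor synset such as 'carnivore.n.01'
```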

Figure 2 – Training and inference (quoted from the original paper)

For problem (2), the author uses the mean of the child nodes' hidden vectors to represent the parent node's hidden vector, as shown in "Step C" in the figure below.

Figure 3 – Constructing a hierarchy (quoted from the original paper)

③ Finetune the model, adding the classification loss over the induced hierarchy to the total loss.

After the induced hierarchy (the tree structure, DT) is established, the complete model is no longer just a CNN but CNN + DT. To force the model's predictions on new samples to follow the tree structure and to reason all the way from the root node down to a leaf node, the classification loss defined on the tree structure must be added to the total loss, and the model must be finetuned.

The first step here is to understand how the full model makes predictions; I think the author's thinking here is the essence of the method. When a new sample (an image) comes in, it first passes through the front CNN. Right before the last fully connected layer W, the CNN outputs a d-dimensional vector x for the image. Normally, multiplying x by W (essentially taking the inner product with each column vector) yields the sample's logits over the categories, and softmax turns them into a probability distribution. Since each column vector of W is the hidden vector of a DT leaf node, W can be replaced entirely by the DT: instead of multiplying x by W directly, we traverse the DT from its root node and take the inner product of x with the hidden vector of each child node in turn.

There are two modes for traversing the DT nodes: Hard and Soft. In Hard mode, taking a binary tree as an example, at every node x is inner-producted with the left and right children's hidden vectors, and x descends to the side with the larger score until a leaf node is reached; the leaf where x lands is the final category of x. In Soft mode, x traverses all intermediate nodes from the top down and computes the inner products; the final probability of a leaf node is the product of the probabilities of all intermediate nodes on the path to that leaf, and the category of x is determined by comparing the leaves' final probability values.
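To make the two traversal modes concrete, here is a minimal sketch under data structures of my own (a simple Node class, not the repository's API): hard mode greedily follows the best-scoring child, while soft mode multiplies softmaxed child probabilities along every root-to-leaf path.

```python
import torch

class Node:
    """Hypothetical tree node: internal nodes carry children, leaves carry a class name."""
    def __init__(self, vector, children=None, class_name=None):
        self.vector = vector              # d-dimensional hidden vector of this node
        self.children = children or []
        self.class_name = class_name      # set only for leaf nodes

def hard_inference(x, root):
    """Descend from the root, always following the child whose hidden vector
    has the largest inner product with the CNN feature x."""
    node = root
    while node.children:
        scores = torch.stack([torch.dot(x, c.vector) for c in node.children])
        node = node.children[int(torch.argmax(scores))]
    return node.class_name

def soft_inference(x, root):
    """Return {class_name: path probability}: at each internal node the child
    scores are softmaxed, and probabilities are multiplied along the path."""
    results = {}
    def walk(node, prob):
        if not node.children:
            results[node.class_name] = prob
            return
        scores = torch.stack([torch.dot(x, c.vector) for c in node.children])
        for child, p in zip(node.children, torch.softmax(scores, dim=0)):
            walk(child, prob * float(p))
    walk(root, 1.0)
    return results
```

On the dog image from Figure 1, soft inference would, in effect, multiply the per-node probabilities shown there along the path down to "dog".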

Figure 4 – Node probability calculation (quoted from the original paper)

After understanding how the full model predicts, the "classification loss over the induced hierarchy (tree structure)" can be explained. Correspondingly, the loss function also has Hard and Soft modes, as shown in the figure below. In Hard mode, the loss only accumulates the classification loss of each node on the true path of the sample's leaf in the DT (with certain weights); nodes off the true path (the dashed nodes w3/w4 in figure (a) below) are not included. The per-node classification loss is computed with cross entropy. In Soft mode, the loss is the cross entropy between the final probability distribution over the leaf nodes and the true one-hot distribution. In short, the Hard-mode loss is a "path cross entropy", while the Soft-mode loss is a "leaf-node cross entropy".

$$\text{CrossEntropy}(x, \text{class}) = -\log\left(\frac{\exp(x[\text{class}])}{\sum_j \exp(x[j])}\right) = -x[\text{class}] + \log\left(\sum_j \exp(x[j])\right)$$

The total loss of the final model also includes the classification loss of the original CNN, so the total loss optimized in the finetune stage is:

$$Loss_{total} = Loss_{original} + Loss_{hard\ or\ soft}$$

According to my reading of the source code, the CNN's network weights are still updated during backpropagation of this loss. Intuitively, this forces the output of the front CNN to conform to what the DT behind it expects, so that the category predicted by following the DT's inference path matches the sample's true category as closely as possible.

Figure 5 – Losses in Hard and Soft modes (quoted from the original paper)

Source Code Analysis

NBDT's Python code is open source on GitHub and is built on PyTorch and NetworkX. It totals roughly 4000+ lines; the core scripts are model.py, loss.py, graph.py, among others. The code has few comments and little documentation of its parameters, so reading it was laborious and took several days. Below, several core pieces of the code are analyzed.

① Generating the induced hierarchy

The core function is build_induced_graph, whose inputs are the WordNet IDs of the leaf nodes and the CNN model. It extracts the FC weights from the CNN model, performs hierarchical clustering, and uses WordNet to "name" the clustering results, producing a DT whose tree nodes carry entity meanings. This function corresponds to part ② of the principle details above.
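The repository's own listing is not reproduced here, so the following is only a hedged sketch of the idea behind build_induced_graph (the function and variable names are mine): combine the clustering merge order with WordNet naming into a NetworkX graph.

```python
import networkx as nx

def build_tree(leaf_synsets, merges):
    """Sketch of an induced-hierarchy builder.
    leaf_synsets: NLTK WordNet synsets of the leaf classes, in clustering order.
    merges: the children_ array produced by AgglomerativeClustering."""
    G = nx.DiGraph()
    node_synsets = list(leaf_synsets)  # index -> synset, leaves first
    n = len(leaf_synsets)
    for step, (left, right) in enumerate(merges):
        parent_idx = n + step
        # Name the new internal node by its children's nearest common ancestor.
        parent = node_synsets[left].lowest_common_hypernyms(node_synsets[right])[0]
        node_synsets.append(parent)
        G.add_edge(parent_idx, int(left))
        G.add_edge(parent_idx, int(right))
    return G, node_synsets
```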

② Computing node probabilities in the forward pass

As mentioned above, a new sample first goes through the CNN, which outputs a d-dimensional vector x just before the FC layer. Then x is inner-producted with the hidden vector of each DT node, where a node's hidden vector equals the mean of its children's hidden vectors. The method that obtains the node logits applies an optimization here: since the inner product with a mean of vectors equals the mean of the individual inner products (see the formula below), there is no need to explicitly compute each node's hidden vector and then take the inner product; instead, a node's logits are taken directly as the mean of its children's logits.

$$x \cdot \left(\frac{1}{k}\sum_{i=1}^{k} w_i\right) = \frac{1}{k}\sum_{i=1}^{k} x \cdot w_i$$
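A quick numeric check of this identity (illustrative only, not the repository's code):

```python
import torch

d = 8
x = torch.randn(d)
child_vectors = torch.randn(3, d)   # hypothetical hidden vectors of 3 child nodes

inner_with_mean = torch.dot(x, child_vectors.mean(dim=0))  # <x, mean of children>
mean_of_inners = (child_vectors @ x).mean()                # mean of <x, child_i>
print(torch.allclose(inner_with_mean, mean_of_inners))     # True (up to float error)
```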

③ Total loss function

As mentioned earlier, total loss = original CNN loss + tree-structure loss. Specifically, taking the Hard mode as an example, the tree-structure loss is accumulated along the decision path and then added to the total loss.
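Since the original listing is not reproduced here, below is only a hedged sketch of the idea, reusing the hypothetical Node structure from the inference sketch above (this is not the repository's loss implementation): at every internal node on the sample's true root-to-leaf path, the child logits are supervised with cross entropy toward the correct branch, and the result is added to the ordinary CNN loss.

```python
import torch
import torch.nn.functional as F

def hard_tree_loss(x, true_path, tree_weight=1.0):
    """x: CNN feature of one sample (the d-dimensional vector before the FC layer).
    true_path: list of (node, correct_child_index) pairs along the true path."""
    loss = x.new_zeros(())
    for node, correct_idx in true_path:
        child_vectors = torch.stack([c.vector for c in node.children])
        logits = child_vectors @ x                       # one logit per child
        target = torch.tensor([correct_idx])
        loss = loss + F.cross_entropy(logits.unsqueeze(0), target)
    return tree_weight * loss

# During finetuning, the total loss combines both terms (Loss_original + Loss_hard):
# total_loss = F.cross_entropy(cnn_logits, labels) + hard_tree_loss(x, true_path)
# total_loss.backward()   # gradients also flow back into the CNN weights
```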

Experiments in the Paper

On multiple data sets, the authors compare the original CNN (WideResNet28×10) with several other "interpretable" neural network models. As the table below shows, NBDT's accuracy is only slightly lower than the original CNN's, yet far exceeds the other interpretable models, i.e. NBDT reaches SOTA among them. Within NBDT, the Soft variant scores higher than the Hard variant, which is easy to understand: Soft considers global optimality over the whole path, while Hard makes a sequence of greedy local decisions.

Figure 6 – Experimental results (quoted from the original paper)

Usage

For installation and usage details, please refer to the official GitHub repository.

① Command line prediction

Invoke the nbdt command directly, followed by an image path (a URL or a local path). The first execution downloads WordNet and the official pre-trained model. Since the pre-trained model is trained on the CIFAR10 data set, try to supply a picture belonging to one of its ten categories. As the output shows, the prediction proceeds step by step down the hierarchy.

② Predict in Python

③ Complete usage

Follow-up Plan

The purpose of researching NBDT is to find a way to make classification problems explainable; it can be applied to any scenario where a decision path should accompany the classification result. Although the application scenario in the paper is image classification, as long as the front CNN is replaced with another network, essentially any classification problem can be made explainable with NBDT. For example, in Xianyu's quality-commodity stratification project, we can construct an induced hierarchy over commodities based on business knowledge (for example, the first layer splits professional sellers / individual sellers, the second layer splits high / medium / low sell-through rate… and the last layer splits goods into different quality grades), and then train an NBDT classifier on that hierarchy. Another example is the typical image-classification scenario: a seller uploads a picture on Xianyu and hopes the algorithm can automatically determine what category of goods he wants to sell. He may upload a picture containing a "chair" and a "table", while what he actually wants to sell is "furniture". A hierarchical NBDT could automatically identify the listing as "furniture", or offer the levels of the hierarchy as recommended options so the user can choose which layer he wants to sell at, saving manual input. These are all directions in which NBDT can be tried in the future.

References

  • Paper: arxiv.org/abs/2004.00…
  • Source code: github.com/alvinwan/ne…