Decision Trees for SecureBoost, the Secure Tree Model of Federated Learning

@[toc]

1 Federated learning background

Given the importance of data privacy, awareness of data protection has been steadily strengthening both at home and abroad. In 2018 the European Union issued the General Data Protection Regulation (GDPR), and the Cyberspace Administration of China drafted the Data Security Management Measures (Draft for Comments). The free flow of data under the precondition of security and compliance has therefore become the trend of the times. The introduction of these laws and regulations poses, to varying degrees, new challenges to the traditional ways in which artificial intelligence processes data.

Meanwhile, as AI develops rapidly, multi-dimensional, high-quality data has become the bottleneck restricting its further progress. As organizations attach increasing importance to data, data cooperation across organizations, and even between departments within an organization, becomes more and more cautious, leaving a large amount of data stranded in the form of isolated islands.

In essence, federated learning is a distributed machine learning technique, or machine learning framework, built on data privacy protection. Its goal is to carry out joint modeling and improve the effectiveness of AI models, and thereby enable business, while ensuring data privacy, security, and legal compliance.

As for the modeling itself, the well-known approaches in industry in recent years fall roughly into GBDT and neural networks. Because federated learning must protect the privacy of user features and labels, privacy-preserving computation techniques such as homomorphic encryption, secret sharing, and differential privacy are needed to guarantee security. This brings a considerable challenge: the complex operations of neural networks, such as exponentials and logarithms, are still very hard to handle with current hardware and software encryption technology, whereas GBDT requires only simple homomorphic operations. This article therefore introduces the secure tree model of federated learning: SecureBoost.

BTW, although it is currently difficult for neural networks to strike a good balance between computational performance and model performance behind a security barrier, after long-term thinking the author has come up with a scheme he believes is workable, which will be verified step by step later on. If it proves reliable, I will share it with everyone.

Since tree models involve a fair amount of background knowledge, SecureBoost cannot be explained clearly in one step. This series is therefore organized along the following path: Decision Tree -> Ensemble methods (Bagging & Boosting) -> GBDT -> XGBoost -> SecureBoost. It is hoped that this series of articles will give readers a comprehensive grasp of the SecureBoost approach to federated learning.

Actually, when I used to work on algorithms I used tree models a lot and thought my understanding was solid. But when I wrote up the secure tree model, I found many places I did not fully understand and many details I had never considered; the theoretical depth of what I wrote was not enough. I later spent a lot of time and energy filling the gaps, which made me realize that many things you seem to understand you actually do not, and only by truly working through them yourself do you come to understand them. Whatever you do, doing it wholeheartedly is what matters most.

2 Decision Tree

2.1 Definition of decision tree

What is a decision tree? A decision tree is a supervised learning method that can be used for both classification and regression problems.

Take the workplace as an example. Suppose the whole company faces a difficult problem in its industry and technology domain that needs a breakthrough, and colleagues who have not "lain flat" hope to be the ones to solve it. The first thing to consider is whether you have the courage to take it on; then whether you have the ability to do it; and if you do not, whether you can cooperate with a capable colleague to do it together. Of course, if that capable colleague can do it alone, there is nothing left for you to do. This chain of yes/no decisions is exactly the structure of a decision tree.

Below, the author introduces the concepts related to decision trees one by one.

2.2 Decision tree foundation

To better describe the algorithms that generate decision trees, let’s introduce some basic concepts.

2.2.1 Entropy

The word entropy is used in many disciplines. The concept was put forward by the German physicist Clausius in 1865 and broadly refers to a measure of the state of a material system and of how likely that state is to occur. In information theory and probability statistics, entropy measures the uncertainty of a random variable. Let X be a discrete random variable taking finitely many values, with probability distribution


$$P(X = x_i) = p_i, \quad i = 1, 2, \ldots, n$$

Then the entropy of random variable X is defined as:


$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

As can be seen from this definition, the greater the entropy, the greater the uncertainty of the random variable. From the perspective of feature importance, a split that leaves the data with high entropy means the feature has a weak ability to separate the samples.
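To make the definition concrete, here is a minimal Python sketch (my own illustrative code, not from any particular library) that computes the empirical entropy of a label array using base-2 logarithms:

```python
import numpy as np

def entropy(labels):
    """Empirical entropy H(X) = -sum_i p_i * log2(p_i) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

# A fair coin has maximum uncertainty (1 bit); a constant variable has none.
print(entropy([0, 1, 0, 1]))  # 1.0
print(entropy([7, 7, 7, 7]))  # -0.0, i.e. zero uncertainty
```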

2.2.2 Conditional entropy

Suppose the pair of random variables (X, Y) has the following joint distribution:


$$P(X = x_i, Y = y_j) = p_{ij}, \quad i = 1, 2, \ldots, n; \quad j = 1, 2, \ldots, m$$

The conditional entropy H(Y|X) denotes the uncertainty of random variable Y given that random variable X is known; it is defined as the expectation, over X, of the entropy of the conditional distribution of Y given X:


$$H(Y|X) = \sum_{i=1}^{n} p_i H(Y|X = x_i), \quad p_i = P(X = x_i)$$

2.2.3 Information gain

The information gain g(D, A) of feature A with respect to training data set D is defined as the difference between the empirical entropy H(D) of set D and the empirical conditional entropy H(D|A) of D given feature A:


$$g(D, A) = H(D) - H(D|A)$$

A larger information gain means a smaller entropy after splitting, i.e., the partitioned data is purer and more "stable", and such stable data is easier to predict.

Given a training data set D, let |D| denote its sample size, i.e., the number of samples.


Suppose there are K classes $C_k$, $k = 1, 2, \ldots, K$, and let $|C_k|$ denote the number of samples belonging to class $C_k$, so that $\sum_{k=1}^{K} |C_k| = |D|$.

Suppose feature A has n distinct values $\{a_1, a_2, \ldots, a_n\}$. According to the values of feature A, D is divided into n subsets $D_1, D_2, \ldots, D_n$, where $|D_i|$ is the number of samples in $D_i$ and $\sum_{i=1}^{n} |D_i| = |D|$.

Let $D_{ik}$ denote the set of samples in subset $D_i$ that belong to class $C_k$, i.e., $D_{ik} = D_i \cap C_k$, and let $|D_{ik}|$ denote the number of samples in $D_{ik}$.

Then, with the above notation, the empirical conditional entropy H(D|A) of data set D given feature A is:


$$H(D|A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$

One problem with the information gain criterion is that it favours features with many distinct values, regardless of their actual relevance to classification. For example, consider a binary classification task in which each sample carries a unique ID. If this ID is chosen as the splitting feature, it yields a very large information gain, because it classifies every training sample exactly; but such a split cannot generalize and cannot predict unseen samples. The information gain ratio is used to correct for this. The ID3 algorithm uses information gain, while C4.5 uses the information gain ratio, as described in detail later.
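To make this concrete, here is a minimal sketch (my own illustrative code, not ID3/C4.5 library code) that computes the empirical conditional entropy, the information gain g(D, A), and the C4.5-style information gain ratio for a discrete feature column, showing how a unique-ID feature wins on raw gain but loses on the gain ratio:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(feature, labels):
    """g(D, A) = H(D) - H(D|A) for a discrete feature column."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    h_d_a = sum(np.mean(feature == v) * entropy(labels[feature == v])
                for v in np.unique(feature))          # empirical H(D|A)
    return entropy(labels) - h_d_a

def info_gain_ratio(feature, labels):
    """C4.5 gain ratio: g(D, A) / H_A(D); penalizes many-valued features."""
    split_info = entropy(feature)                     # entropy of the feature itself
    return info_gain(feature, labels) / split_info if split_info > 0 else 0.0

# Toy data: a unique-ID feature vs. a coarse grouping feature.
y = [0, 0, 1, 1, 1, 1]
unique_id = [1, 2, 3, 4, 5, 6]
grouped   = ['a', 'a', 'a', 'b', 'b', 'b']
print(info_gain(unique_id, y), info_gain_ratio(unique_id, y))   # high gain, low ratio
print(info_gain(grouped, y),   info_gain_ratio(grouped, y))     # lower gain, higher ratio
```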

2.3 Pruning strategy

For decision trees, it is often observed that a tree with perfect performance on the training set generalizes worse than a tree that performs less well on the training set. This phenomenon is called "over-fitting", and its essential cause is that during learning the learner treats peculiarities of the training samples as if they were properties of the true underlying distribution; in other words, the learner learns "too well" and absorbs the noise as well.

To prevent the poor generalization caused by over-fitting, the common strategy for decision trees is to use "pruning" to remove the noise and keep the simpler, essential characteristics of the data. There are two common pruning schemes:

  1. Pre-pruning: prune while the tree is being grown, i.e., stop splitting a node early (see the sketch after this list).
  2. Post-pruning: after the full tree has been generated, examine which branches should be removed.
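As a small illustration of the pre-pruning idea, the sketch below (my own code; the threshold values are arbitrary assumptions, not values from the text) shows typical stopping conditions checked while the tree is still being grown:

```python
def should_stop_splitting(labels, depth, best_gain,
                          max_depth=5, min_samples=10, min_gain=1e-3):
    """Pre-pruning: decide whether to stop growing the tree at this node.
    The threshold defaults here are illustrative assumptions only."""
    if len(set(labels)) <= 1:        # the node is already pure
        return True
    if depth >= max_depth:           # the tree is deep enough
        return True
    if len(labels) < min_samples:    # too few samples to split reliably
        return True
    if best_gain < min_gain:         # the best available split barely helps
        return True
    return False
```

Post-pruning, by contrast, first grows the full tree and then replaces a subtree with a leaf whenever doing so does not hurt performance on held-out data.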

2.4 The ID3 algorithm

The ID3 algorithm was first proposed by J. Ross Quinlan in 1975 at the University of Sydney as a classification prediction algorithm; its core is information entropy. ID3 computes the information gain of each attribute, regards the attribute with the highest information gain as the best one, and selects it as the splitting criterion at each partition. The process is repeated until a decision tree that perfectly classifies the training samples is generated.

The technical core of a decision tree lies in how its nodes are split. ID3 uses the information gain criterion to select the splitting feature at each node and builds the whole tree recursively; it is a greedy algorithm for constructing decision trees. ID3 originated from the Concept Learning System (CLS): it takes the rate of decrease of information entropy as the criterion for selecting test attributes, i.e., at each node it selects the not-yet-used attribute with the highest information gain as the splitting criterion, and continues this process until the generated tree perfectly classifies the training samples.

2.4.1 ID3's decision tree construction procedure

  • First, starting from the root node, compute the information gain of every candidate feature for that node.
  • Then select the feature with the maximum information gain as the splitting feature of the node.
  • Then construct the child nodes according to the different values of that feature.
  • Repeat the process recursively for each child node.
  • Stop when the information gain of every remaining feature falls below a threshold, or when there are no features left to choose from (a minimal recursive sketch follows this list).
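The sketch below is my own illustrative code, not the original ID3 implementation; `epsilon` plays the role of the information-gain threshold mentioned in the last step:

```python
import numpy as np
from collections import Counter

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(x, y):
    """g(D, A) = H(D) - H(D|A) for one discrete feature column x."""
    return entropy(y) - sum(
        np.mean(x == v) * entropy(y[x == v]) for v in np.unique(x))

def id3(X, y, features, epsilon=1e-3):
    """Recursively build an ID3 tree.

    X is a dict {feature_name: np.ndarray of discrete values}, y an array of
    class labels, features the list of feature names still available, and
    epsilon the information-gain threshold below which we stop splitting.
    """
    if len(np.unique(y)) == 1:                     # node is pure -> leaf
        return y[0]
    if not features:                               # no features left -> majority leaf
        return Counter(y.tolist()).most_common(1)[0][0]
    gains = {f: info_gain(X[f], y) for f in features}
    best = max(gains, key=gains.get)
    if gains[best] < epsilon:                      # best gain too small -> majority leaf
        return Counter(y.tolist()).most_common(1)[0][0]
    children = {}
    for v in np.unique(X[best]):                   # one child per value of the feature
        mask = X[best] == v
        rest = [f for f in features if f != best]
        children[v] = id3({f: X[f][mask] for f in rest}, y[mask], rest, epsilon)
    return {best: children}

# Toy usage with made-up data: the root splits on "outlook", then on "windy".
X = {"outlook": np.array([0, 0, 1, 2, 2]), "windy": np.array([0, 1, 0, 0, 1])}
y = np.array([0, 0, 1, 1, 0])
print(id3(X, y, ["outlook", "windy"]))
```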

2.4.2 Advantages and disadvantages of ID3 algorithm

  • Model structure: a multi-way (N-ary) tree; if a feature has many values, the amount of computation is considerable.
  • Model objective: suitable for classification problems.
  • Feature handling: no mechanism for continuous features, so it is suited to discrete features; missing values are not handled either and must be resolved in the preprocessing stage.
  • Model optimization: no pruning, so the model overfits easily.
  • Memory consumption: grows rapidly with the dimensionality of the sample feature space.
  • Computational efficiency: an N-ary tree branching over the sample feature space; computation time is high.

2.5 C4.5 algorithm

The C4.5 algorithm was developed by Ross Quinlan to generate decision trees and is an extension of his earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, so the algorithm is also used for statistical classification. C4.5 is similar to ID3; as mentioned above, its main improvement over ID3 is that it uses the information gain ratio to select features during tree generation.

2.5.1 C4.5's decision tree construction procedure

Please refer to the ID3 procedure; the main additional difference is the discretization of continuous features.

2.5.2 Advantages and disadvantages of the C4.5 algorithm

  • Model structure: a multi-way (N-ary) tree; if a feature has many values, the amount of computation is considerable.

  • Model objective: Suitable for handling classification problems.

  • Feature handling: continuous features are discretized (for a continuous feature with m values, m - 1 candidate split points are produced after sorting the values; each candidate threshold is the midpoint of two consecutive sorted values, which is equivalent to adding discrete features; a small sketch of this candidate-threshold generation follows this list). Discrete features are supported, and samples with missing feature values are assigned to the different child nodes with certain probabilities.

  • Model optimization: Pessimistic pruning strategy for tree pruning.

  • Memory consumption: feature values need to be sorted and operated on in memory, which can easily cause OOM.

  • Computational efficiency: an N-ary tree branching over the sample feature space, with multiple scans and sorts required; computation time is high.
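The sketch referenced in the feature-handling item above (my own illustrative code) shows how candidate thresholds for a continuous feature can be generated as midpoints of consecutive sorted values:

```python
import numpy as np

def candidate_thresholds(values):
    """C4.5-style discretization of a continuous feature: sort the values and
    take the midpoint of each pair of consecutive distinct values, giving at
    most m - 1 candidate split points for m distinct values."""
    v = np.unique(values)            # sorts and removes duplicates
    return (v[:-1] + v[1:]) / 2.0    # midpoints of consecutive values

print(candidate_thresholds([2.0, 7.0, 4.5, 7.0, 3.0]))   # -> [2.5  3.75 5.75]
```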

2.6 CART algorithm

Anyone familiar with machine learning algorithms and frameworks will have the feeling that many algorithms seek a balance between algorithmic quality and engineering: good model performance and high computational efficiency are both indispensable.

Compared with ID3 and C4.5, the CART tree makes some simplifications. It uses a binary tree instead of a multi-way tree, which improves the efficiency of decision tree generation and reduces the amount of computation. In addition, the classification tree uses the Gini index as the impurity measure, which avoids a large number of logarithm operations.

2.6.1 Classification tree

Let me first introduce the CART classification tree. When CART is used as a classification tree, feature attributes can be either continuous or discrete, and the node splitting strategy is based on the Gini index, which is introduced first below.

2.6.1.1 Gini index

Definition: in a classification problem, suppose there are K classes and the probability that a sample point belongs to the k-th class is $p_k$; then the Gini index of the probability distribution is


$$Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

For a given sample set D, its Gini index is:


$$Gini(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$$

Then, under feature A, the Gini index of set D is defined as follows, where D1 and D2 are the two subsets obtained by splitting D according to a certain value of feature A:

$$Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$$

The Gini index Gini(D) measures the uncertainty (impurity) of the data, and Gini(D, A) measures the uncertainty of set D after it is split by feature A = a. Unlike information gain, the smaller the Gini index, the higher the purity of the data and the better the feature.
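The following minimal sketch (my own illustrative code) implements Gini(D) and the split index Gini(D, A) exactly as defined above for a binary partition D1 / D2:

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def gini_split(left_labels, right_labels):
    """Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

# A pure set has Gini 0; an evenly mixed binary set has Gini 0.5.
print(gini([1, 1, 1, 1]), gini([0, 1, 0, 1]))   # 0.0 0.5
print(gini_split([0, 0, 0], [1, 1, 1]))         # 0.0, a perfect split
```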

2.6.1.2 Generation of a classification tree

The CART classification tree uses the Gini index for binary splitting: it enumerates every possible split point of every feature and selects the split point with the lowest Gini index, thereby reducing the impurity of the model. For prediction after the tree is built, the CART classification tree uses the majority class of the samples in each leaf node as the predicted class of that node.
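As a sketch of how such a split point can be chosen (illustrative code of my own, not CART library or SecureBoost code), the function below enumerates every feature and every candidate threshold and keeps the split with the lowest weighted Gini index:

```python
import numpy as np

def gini(y):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_gini_split(X, y):
    """Enumerate every feature and every candidate threshold; return the
    (feature index, threshold, weighted Gini) of the lowest-Gini split."""
    best = (None, None, float("inf"))
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:          # midpoint candidates
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            w_gini = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if w_gini < best[2]:
                best = (j, float(t), w_gini)
    return best

# Toy data: the second feature separates the two classes perfectly.
X = np.array([[1.0, 0.2], [2.0, 0.8], [3.0, 0.3], [4.0, 0.9]])
y = np.array([0, 1, 0, 1])
print(best_gini_split(X, y))   # -> feature 1, threshold ~0.55, weighted Gini 0.0
```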

2.6.2 Regression tree

In a regression tree the features are generally continuous, and the squared-error minimization criterion is used.

The process is as follows: for each feature A, go through all possible split values in turn; each value divides the sample data into D1 and D2; compute the loss of the two parts with MSE and add them up (a regression tree does not output a category; the final prediction of a leaf is the mean or median of the samples in it); then select the feature value with the smallest loss, split the node, re-partition the data of each child node, and execute this process recursively.


$$\min\limits_{j,s}\left[\min\limits_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min\limits_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2 \right]$$

Here the splitting variable (feature) is j and the split point is s, which divide the data set into the two regions R1 and R2 with output values c1 and c2; the formula above searches for the split with the minimum loss.
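A minimal sketch of this least-squares split search follows (my own illustrative code); for each candidate pair (j, s) it uses the mean of each side as c1 and c2 and keeps the split with the smallest total squared error:

```python
import numpy as np

def best_mse_split(X, y):
    """Least-squares split for a regression tree: for every feature j and
    candidate split point s, use the mean of each side as c1 / c2 and return
    the (j, s) pair with the smallest total squared error."""
    best_j, best_s, best_loss = None, None, float("inf")
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2.0:        # candidate split points
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            loss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if loss < best_loss:
                best_j, best_s, best_loss = j, float(s), float(loss)
    return best_j, best_s, best_loss

# Toy data: y jumps from about 1 to about 3 once the single feature passes 2.5.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 1.1, 3.0, 3.1])
print(best_mse_split(X, y))   # -> (0, 2.5, ~0.01)
```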

2.6.3 Pros and cons of CART trees

  • Model structure: a binary tree is used, which improves the efficiency of decision tree generation.

  • Model objective: supports both classification trees and regression trees.

  • Feature handling: surrogate tests are used to estimate missing values, and features can be reused at different levels of the tree.

  • Model optimization: the tree is pruned based on cost complexity.

  • Computational cost: the classification tree uses the Gini index, which avoids expensive logarithm operations.

3 Self-introduction

Personal introduction: Du Baokun, builder of JD's federated learning from 0 to 1. I led the team in building JD's federated learning solutions and realized industrialized, very-large-scale federated learning solutions supporting e-commerce marketing, including very-large-scale sample PSI privacy alignment as well as secure tree models and neural network models, without restrictions on the modeling scenario. At the same time, I led the team to land the solutions on the business side, creating a new point of business growth and generating significant business and economic benefits.

Personally, I like studying technology. Driven by full-link thinking and decision-making about technology planning, my research fields are broad, ranging from architecture and big data to machine learning algorithms and frameworks. Students who like technology are welcome to reach out and exchange ideas with me. Email: [email protected]