Federated learning background

Given the importance of data privacy, awareness of data protection has been growing both in China and abroad. In 2018, the European Union enacted the General Data Protection Regulation (GDPR), and the Cyberspace Administration of China drafted the Data Security Management Measures (Draft for Comments). The free flow of data under the precondition of security and compliance has therefore become the trend of the times. The introduction of these laws and regulations poses new challenges, to varying degrees, to the traditional ways in which artificial intelligence processes data.

Nowadays, with the rapid development of AI, multi-dimensional, high-quality data has become the bottleneck restricting further progress. As organizations attach increasing importance to data, cooperation on data across organizations, and between departments within an organization, will become more and more cautious, leaving large amounts of data stranded in isolated islands.

In essence, federated learning is a distributed machine learning technique, or machine learning framework, built on data privacy protection. Its goal is to enable joint modeling and improve the effectiveness of AI models while ensuring data privacy, security, and legal compliance.

As for the models themselves, the well-known families of recent years can be roughly divided into GBDT and neural networks. Because federated learning must protect the privacy of user features and labels, privacy-preserving computation techniques such as homomorphic encryption, secret sharing, and differential privacy are needed to guarantee security. This, however, raises a bigger challenge: the computation inside a neural network is complex, and operations such as exponentials and logarithms are very difficult to carry out under current encryption hardware and software. SecureBoost, by contrast, only needs simple homomorphic operations and achieves the same model quality as XGBoost. That is why this series focuses on the secure tree model of federated learning, SecureBoost.

BTW, although neural networks currently struggle to keep a good balance between computational performance and model performance behind a security barrier, after long consideration the author has come up with a scheme he believes is workable. It will be verified step by step later, and if it proves reliable I will share it with everyone.

Since the tree-model series covers a lot of ground, SecureBoost cannot be explained clearly in one step. This series is therefore organized along the following main thread: Decision Tree -> Ensemble methods Bagging & Boosting -> GBDT -> XGBoost -> SecureBoost Tree. It is hoped that this series of articles will give readers a comprehensive grasp of the SecureBoost approach to federated learning.

To be honest, I used to work on algorithms and used tree models a great deal, and I thought my understanding was solid. But when I started writing about the secure tree model, I found many places I did not fully understand and many details I had never considered, and what I wrote lacked theoretical depth. I later spent a lot of time and energy filling the gaps, which taught me that many things you seem to understand you actually do not; only by working through them carefully yourself do you truly understand them, and that matters no matter what you do.


Federated learning tree model scheme

As is well known, machine learning in its earlier days was limited by the lack of big data, big compute, and big frameworks. Traditional statistical machine learning methods dominated, led by LR and XGBoost. LR is a linear model and cannot fully fit nonlinear data distributions, whereas GBDT is a nonlinear model: although its nonlinear fitting ability is weaker than deep learning's, it can capture nonlinear distributions to a useful extent, which formed its core competitiveness. It was a new continent at the time, and a machine learning practitioner who could use and understand Xgb was welcomed by a large number of companies.

In industrial search and recommendation scenarios, XGBoost can be seen almost everywhere; it is fair to say that being able to use and understand Xgb was enough to hold a position in the industry. Xgb has also often been the dark horse in Kaggle competitions, a formidable weapon that swept everything before it. Xgb is still alive and kicking even as deep learning takes off, and it continues to shine in areas where data volumes are not so huge.

Philosophy tells us that nothing exists in isolation; everything exists in ever-changing relationships. For federated learning, therefore, we should build both a secure deep network model and a secure Xgb model. The first two chapters introduced "Decision Tree of SecureBoost" and "Ensemble Learning of SecureBoost". Having covered decision trees, ensemble learning, Bagging, Boosting, and GBDT as the foundation, this chapter introduces XGBoost, and the following chapter will introduce SecureBoostTree.


5 XGBoost overview

XGBoost belongs to the Ensemble Learning category and the Boosting family. It trains models on residuals to fit real data, computes efficiently based on gradient histograms, and realizes boosting trees with large-scale parallel computing. It is currently among the fastest and best open-source boosting-tree frameworks, reportedly more than 10 times faster than other common frameworks. LightGBM and HistGradientBoostingClassifier take a similar approach.

The implementation reflects a deep, elegant integration of algorithm and engineering. Compared with GBDT, XGBoost is optimized in the following aspects:
  1. GBDT expands the objective function with a first-order Taylor expansion, whereas XGBoost uses a second-order Taylor expansion, so XGBoost retains more information about the objective function.
  2. GBDT finds a new fitting label for each new base model (the negative gradient of the current additive model), while XGBoost finds a new objective function for each new base model (the second-order Taylor expansion of the objective function with respect to the new base model).
  3. XGBoost adds L2 regularization, which helps the model obtain lower variance.
  4. XGBoost adds a strategy for automatically handling missing feature values: samples with missing values are tentatively assigned to the left or right subtree, the objective values of the two schemes are compared, and the better direction is chosen automatically, so no missing-value imputation preprocessing is needed.

The following is the classic example of a boosting ensemble made of two trees:

  • There are five training samples: grandpa, grandma, mom, son and daughter.
  • The first tree first splits on whether age is less than 15. Grandpa, grandma and mom fall into the right node and are all negative samples (they do not like playing games), so that node needs no further splitting. The left node is then split by gender into son and daughter. Three leaf nodes are formed, each with its own output value.
  • The second tree splits on whether the computer is used every day; further splitting brings no information gain, so it stops there and forms two leaf nodes.
  • Boosting prediction: feed the sample "son" into the ensemble of two trees. The output of the first tree is 2 and the output of the second tree is 0.9, so the final prediction is 2 + 0.9 = 2.9, as sketched below.
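
To make the additive prediction concrete, here is a minimal Python sketch of the example above. The leaf values and sample fields are illustrative placeholders mirroring the toy family example, not the output of any real training run.

```python
# Minimal sketch: a boosted ensemble predicts by summing the leaf outputs
# of every tree. Leaf values below mirror the toy family example and are
# illustrative placeholders only.

def tree_1(sample):
    """First tree: split on age < 15, then on gender."""
    if sample["age"] < 15:
        return 2.0 if sample["gender"] == "male" else 0.1  # son / daughter leaves
    return -1.0                                            # grandpa / grandma / mom leaf

def tree_2(sample):
    """Second tree: split on whether the computer is used every day."""
    return 0.9 if sample["uses_computer_daily"] else -0.9

def predict(sample):
    # Additive (boosting) prediction: sum the outputs of all trees.
    return tree_1(sample) + tree_2(sample)

son = {"age": 10, "gender": "male", "uses_computer_daily": True}
print(predict(son))  # 2.0 + 0.9 = 2.9
```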

In supervised machine learning there is a very important framing: the key trilogy of model, strategy, and algorithm. In those terms, the Xgb model is an additive sum of base models, the strategy is loss minimization (common loss functions are MSE and logistic loss), and the algorithm uses quantile-based approximate computation to enumerate candidate optimal split points.


6 The objective function of the tree

First of all, the well-known objective functions: MSE for regression and logistic loss for classification.

The loss function of XGB can be either of these two, or any loss function that supports second derivatives. At the same time, XGB uses a regularization term to prevent overfitting and improve the generalization ability of the model, so the objective function of XGBoost is defined as follows:
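
(This is the standard form from the XGBoost paper listed in the references: the loss summed over all $n$ samples plus a regularization term summed over all $K$ trees.)

$$Obj = \sum_{i=1}^{n} L\big(y_i, \hat{y}_i\big) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

where $T$ is the number of leaves of a tree, $w_j$ are its leaf weights, and $\gamma$, $\lambda$ control the strength of the regularization.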

6.1 Additive model

There is a reason why XGB is sometimes hard to understand. An ordinary model simply fits its own parameters against the labels directly. XGB, however, is an additive model that fits the labels with one base model after another, so when the first tree is trained, the initial prediction $\hat{y}^{(0)}$ used for fitting the labels is itself a hyperparameter. Everyone should pay attention to this; of course, as some gurus joke, if that hyperparameter were set perfectly, the model would not need to be trained at all.

The specific additive model is as follows:

  • Hyperparameter setting the initial value of the model: $\hat{y}_i^{(0)} = \text{initial hyperparameter}$
  • The first tree: $\hat{y}_i^{(1)} = \hat{y}_i^{(0)} + f_1(x_i)$
  • The second tree: $\hat{y}_i^{(2)} = \hat{y}_i^{(1)} + f_2(x_i)$
  • ...
  • The n-th tree: $\hat{y}_i^{(n)} = \hat{y}_i^{(n-1)} + f_n(x_i)$

Adding the regularization term, the overall objective function is:

$$Obj^{(t)} = \sum_{i=1}^{n} L\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k) + \text{constant} = \sum_{i=1}^{n} L\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \sum_{k=1}^{t} \Omega(f_k) + \text{constant}$$

When this formula is given directly, it can be a bit baffling at first. I had the same feeling when I first saw it: it looked right, yet something seemed off. Only after deriving it carefully did I fully understand it, so bear with me:

  • The only variable is $f_t(x_i)$; everything else is computable.
  • Regularization and constant terms: decompose the regularization sum. The regularization terms of the first $t-1$ trees are constants, so only the regularization term of the $t$-th tree is kept as a variable.

6.2 Taylor expansion

First, let's look at the definition of the Taylor series, from Wikipedia:

In mathematics, a Taylor series represents a function as an infinite sum of terms that are computed from the derivatives of the function at a single point. The Taylor series is named after the English mathematician Sir Brook Taylor, who published Taylor's formula in 1715. The Taylor series obtained from the derivatives of a function at zero is also called the Maclaurin series, named after the Scottish mathematician Colin Maclaurin.


Then we apply the Taylor expansion to the objective function to reduce the overall computational complexity through approximation. The objective function has already been written above, so we simply plug it into the Taylor expansion formula. Applying any mathematical formula comes down to the same routine: identify the key elements, then substitute them in. Here $x$ corresponds to the prediction of the previous $t-1$ trees, and $\Delta x$ corresponds to the $t$-th tree $f_t$, which is the only variable.
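
Concretely, the generic second-order Taylor expansion, and the result of applying it to the loss with $x = \hat{y}_i^{(t-1)}$ and $\Delta x = f_t(x_i)$ (using the $g_i$ and $h_i$ defined in the next step), read:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^{2}$$

$$L\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) \approx L\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i)$$

The term $L(y_i, \hat{y}_i^{(t-1)})$ is a constant with respect to $f_t$ and is absorbed into the constant term.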

  1. First, the first- and second-order partial derivatives of the loss with respect to the previous prediction are defined as follows


$$g_i = \partial_{\hat{y}^{(t-1)}} L\big(y_i,\, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^{2}_{\hat{y}^{(t-1)}} L\big(y_i,\, \hat{y}_i^{(t-1)}\big)$$
  2. Then, substitute the first and second derivatives, so the objective function becomes:

$$Obj^{(t)} = \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t) + \text{constant}$$

  3. Then, remove the constants. Since constant terms play no role in minimizing the function, they can be dropped directly:

$$Obj^{(t)} = \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t)$$
  4. Then, perform the variable substitution: rewrite $f_t$ in terms of the values of its leaf nodes; note that the variables to optimize are now the leaf weights, as written out below.
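
In the usual notation (consistent with the XGBoost paper), the tree is written as a map from samples to leaf weights, and the objective becomes a sum over leaves:

$$f_t(x) = w_{q(x)}, \qquad \Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

$$Obj^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \frac{1}{2}\big(H_j + \lambda\big) w_j^2 \Big] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$

where $q(x)$ maps sample $x$ to its leaf index, $T$ is the number of leaves, and $I_j$ is the set of samples falling into leaf $j$.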

  1. Because a quadratic function attains its minimum where its first derivative equals zero, applying the extremum formula for a quadratic we can easily calculate the optimal weight of each leaf node $w_j^{*}$ and the objective value $Obj^{*}$ attained at that optimum:
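
(For reference, these are the standard formulas from the XGBoost derivation, with $G_j$ and $H_j$ as defined above.)

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$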

At this point, the derivation of the objective function is complete. Thanks to the second-order Taylor approximation, the objective function is expressed entirely through the first and second derivatives of the loss function, evaluated at the actual label value and the prediction at step $t-1$.


7 Tree generation strategy

The tree generation strategy is, on the whole, fairly simple: try to split every leaf node of the tree. After each split, the original leaf node becomes left and right child leaves, and the samples in the original leaf are distributed to the left and right children according to the node's decision rule. After splitting a node, we need to check whether the split brings a gain to the loss function. The gain is defined as follows:
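
(The following is the standard XGBoost split gain, written with the $G$/$H$ sums of the child nodes.)

$$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where $G_L, H_L$ and $G_R, H_R$ are the sums of $g_i$ and $h_i$ over the samples going to the left and right child, and $\gamma$ is the cost of adding one more leaf; a split is worth keeping only if the gain is positive.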

However, the feature dimension of a real data set is very large. Even with a clear strategy, brute-force enumeration is a disaster and the required computing power is hard to come by. What algorithm should we adopt, then?


8 Tree generation algorithm

In the actual training process, when the t-th tree is built, XGBoost uses a greedy method to split tree nodes. When splitting a node there are many candidate split points, and the general steps for finding the optimal one are as follows:

  1. First, iterate over all possible values of every feature at every node;
  2. Then, for each feature, sort the samples by feature value and take those values as split candidates;
  3. Then, with a linear scan, determine the best split value of each feature as that feature's candidate;
  4. Finally, among the best split points of all features, pick the overall best split point (the feature, and its value, with the maximum gain after splitting).

The above is the exact greedy method: every split traverses all candidate split points in a global scan, as sketched below. However, when the data volume is very large and the data are widely dispersed, the greedy algorithm becomes extremely inefficient and basically unusable.
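
To make the steps above concrete, here is a minimal, self-contained sketch of the exact greedy search over a single feature, using the gain formula from the previous section. The function and variable names are my own; this is an illustration, not the actual XGBoost implementation.

```python
import numpy as np

def best_split_exact_greedy(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split search over one feature (illustrative sketch).

    x      : feature values of the samples in the current node
    g, h   : first- and second-order gradients of those samples
    lam    : L2 regularization on leaf weights (lambda)
    gamma  : complexity cost of adding a leaf
    Returns (best_gain, best_threshold).
    """
    x, g, h = np.asarray(x, float), np.asarray(g, float), np.asarray(h, float)
    order = np.argsort(x)                 # step 2: sort by feature value
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()

    def score(gs, hs):
        # Structure score of a node: G^2 / (H + lambda)
        return gs * gs / (hs + lam)

    best_gain, best_thr = -np.inf, None
    GL = HL = 0.0
    for i in range(len(x) - 1):           # step 3: linear scan over candidates
        GL += g[i]
        HL += h[i]
        if x[i] == x[i + 1]:               # cannot split between identical values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2.0
    return best_gain, best_thr
```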

To address this, XGBoost proposes a series of schemes, centered on the Weighted Quantile Sketch algorithm, to speed up the search for the optimal split point:
  • Feature pre-sorting + cache: before training, XGBoost pre-sorts each feature by its values and saves the result in an in-memory quantile-sketch structure that is reused repeatedly afterwards, reducing redundant work.
  • Quantile point (histogram) method: after each feature is sorted by value, only a limited number of quantile points are selected as that feature's representative candidate split points for the optimal-split search (sketched below).
  • Parallel search: note that trees are built serially, one after another; only the features are processed in parallel. Since each feature has already been pre-stored as a block structure, XGBoost uses a multi-threaded thread-pool mode to compute the optimal split point of each feature in parallel, which greatly speeds up node splitting, makes scaling to large training sets much easier, and better supports distributed architectures.
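
As a rough illustration of the quantile (histogram) idea: instead of scanning every distinct feature value, only a handful of quantile points are proposed as candidate thresholds and the gain is evaluated only at those points. The real Weighted Quantile Sketch additionally weights the quantiles by the second-order gradients $h_i$; the unweighted version below, with made-up function names, is only a simplified sketch.

```python
import numpy as np

def candidate_splits_by_quantile(x, num_bins=32):
    """Propose candidate thresholds at (unweighted) quantiles of one feature."""
    qs = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    return np.unique(np.quantile(x, qs))

def best_split_approx(x, g, h, lam=1.0, gamma=0.0, num_bins=32):
    """Evaluate the split gain only at the quantile candidates (sketch)."""
    x, g, h = np.asarray(x, float), np.asarray(g, float), np.asarray(h, float)
    G, H = g.sum(), h.sum()
    score = lambda gs, hs: gs * gs / (hs + lam)

    best_gain, best_thr = -np.inf, None
    for thr in candidate_splits_by_quantile(x, num_bins):
        left = x <= thr
        GL, HL = g[left].sum(), h[left].sum()
        gain = 0.5 * (score(GL, HL) + score(G - GL, H - HL) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr
```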

9 Reference Materials

  • SecureBoost: A Lossless Federated Learning Framework: arxiv.org/pdf/1901.08…
  • XGBoost: A Scalable Tree Boosting System: arxiv.org/pdf/1603.02…

10 Extras

Personal introduction: Du Baokun, a practitioner in the privacy computing industry, led the team that built JD's federated learning solution 9N-FL from 0 to 1, and led both the federated learning framework and the launch of the federated business. Framework level: delivered an industrial federated learning solution supporting very large scale in e-commerce marketing, covering very-large-sample PSI privacy alignment, tree models, neural network models, and more. Business level: achieved a good start for the business, created new growth points, and generated significant economic benefits. Personally, I like learning new things and digging into technology; driven by full-link thinking and technology planning, my interests range from engineering architecture and big data to machine learning algorithms and algorithm frameworks. Students who like technology are welcome to get in touch, email: [email protected]