Federated learning background

In view of the importance of data privacy, awareness of data protection has gradually strengthened both in China and abroad. In 2018, the European Union enacted the General Data Protection Regulation (GDPR), and the Cyberspace Administration of China drafted the Data Security Management Measures (Draft for Comments). The free flow of data under the precondition of security and compliance has therefore become the trend of the times. The introduction of these laws and regulations poses, to varying degrees, new challenges to the traditional way artificial intelligence processes data.

Nowadays, with the rapid development of AI, multi-dimensional and high-quality data has become the bottleneck restricting its further progress. As organizations attach increasing importance to data, data cooperation across organizations and between departments within an organization will become more and more cautious, leaving a large amount of data stranded in isolated islands.

The essence of federated learning is a distributed machine learning technology, or machine learning framework, built on data privacy protection. Its goal is to realize joint modeling and improve the effect of AI models while ensuring data privacy, security, and legal compliance, thereby enabling business.

Since we are talking about modeling, the well-known approaches in industry in recent years can be roughly divided into GBDT and neural networks. Because of the characteristics of federated learning, the privacy of user features and labels must be protected, so privacy-preserving computation techniques such as homomorphic encryption, secret sharing, and differential privacy are needed to ensure security. This in turn creates a bigger challenge: the complexity of neural network computation, with its exponential and logarithmic operations, makes encrypted modeling very difficult under current hardware and software encryption technology. SecureBoost, by contrast, only needs simple homomorphic operations and achieves the same model effect as Xgboost. Therefore, this article will share the secure tree model of federated learning: SecureBoost.
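To make the phrase "simple homomorphic operations" concrete, here is a minimal sketch using the python-paillier (phe) library (the library choice and toy numbers are assumptions of this example, not part of the original text). It only illustrates the additively homomorphic building block that SecureBoost relies on, where the label holder encrypts per-sample gradients and the other party sums ciphertexts without ever decrypting them; it is not the full protocol.

```python
# Minimal sketch of additive homomorphic encryption with python-paillier (phe).
# Assumes `pip install phe`; the toy gradients below are illustrative only.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

gradients = [0.31, -0.52, 0.08]                  # toy per-sample gradients
enc_grads = [public_key.encrypt(g) for g in gradients]

# Additive homomorphism: the sum of ciphertexts decrypts to the sum of the
# plaintexts, so a party without the private key can still aggregate gradients.
enc_sum = enc_grads[0] + enc_grads[1] + enc_grads[2]

print(private_key.decrypt(enc_sum))              # ~= -0.13 == sum(gradients)
```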

By the way, although it is currently difficult for neural networks to sit behind such a security barrier, because a good balance between computational performance and model performance cannot yet be achieved, after long-term thinking the author has worked out a scheme he believes is reliable, which will be verified step by step in follow-up work. If the final verification succeeds, I will share it with everyone.

Since tree models involve a fair amount of background knowledge, SecureBoost cannot be explained clearly in one step. This series is therefore divided into the following topics, and the main thread is: Decision Tree -> ensemble methods Bagging & Boosting -> GBDT -> XGBoost -> SecureBoost. It is hoped that this series of articles will give readers a comprehensive grasp of the SecureBoost approach to federated learning.

Actually, for the tree model series, I used to work on algorithms and used these models a lot, and I thought my understanding was solid. But when I wrote about the secure tree model, I found many places I did not fully understand and many details I had never considered; what I wrote lacked theoretical depth. Later I spent a lot of energy and time recharging, which made me realize that many things you seem to understand you actually do not. Only by really working through them with care do you come to understand them; whatever you do, doing it with heart is what matters most.

Federated learning tree model scheme

As we all know, in the early days of machine learning, before deep learning shone so brightly, the industry basically ran on LR and Xgboost. In search and recommendation scenarios, Xgboost could be seen almost everywhere. It could be said that being fluent with Xgb alone was enough to hold a position in the industry, and Xgb is also often the dark horse in Kaggle competitions.

Nothing exists in isolation. Since the essence of federated learning is distributed modeling based on privacy and security, it is necessary to build secure versions of Xgb, LR, and other traditional machine learning models, as well as secure deep models. This series of articles mainly introduces the secure tree model series and is planned in three chapters. Building on the previous chapter, which covered the Decision Tree basics behind SecureBoost, this chapter continues with ensemble learning and GBDT. Xgboost and the federated secure tree model are described in the next chapter.

SecureBoost belongs to Ensemble Learning and the Boosting family; it trains models on residuals to fit real data scenarios. Its implementation follows the same calculation approach as Xgboost, LightGBM, HistGradientBoostingClassifier, etc., with privacy computation added on top. Before transitioning to these ensemble learning algorithms, however, we first need to introduce GBDT.

3 Ensemble Learning

3.1 Ensemble Learning

Ensemble Learning is a very effective and widely used machine learning method in industry. It is not a single machine learning algorithm itself; rather, it completes a learning task by building and combining multiple learners. As we often say, "two heads are better than one."

Ensemble learning can be used for classification problems, regression problems, feature selection, outlier detection, and so on. It can be said that ensemble learning appears in every field of machine learning.

Ensemble learning can be divided into two modes, Bagging and Boosting, which differ mainly in how the weak learners are combined. Since the tree model is an effective machine learning model, ensemble learning is closely combined with tree models. Combining Bagging and Boosting with tree models respectively yields:

  • Bagging + decision tree = Random Forest
  • Boosting + decision tree = GBDT
  • Boosting + second-order differentiable loss function = Xgboost

3.2 Bagging and Boosting

In machine learning, there are two very important basic concepts used to evaluate a model: Variance and Bias. The two are actually in tension with each other. Bagging and Boosting address Variance and Bias respectively: Bagging focuses on obtaining an ensemble model with smaller variance, while Boosting mainly aims to produce an ensemble model with smaller bias (even though it can also reduce variance).

3.2.1 Bagging (bootstrap aggregating)

Bagging stands for bootstrap aggregating. It trains multiple weak learners in parallel and then fuses their results by voting or averaging to form the final judgment of the model. The procedure is as follows:

  1. Training sets are drawn from the original sample set. In each round, N training samples are collected from the original set using the Bootstrapping method (within a training set, some samples may be drawn multiple times while others may never be drawn at all). A total of K rounds are performed, yielding K training sets that are independent of each other.
  2. Each training set is used to train one model, so K training sets yield K models. The K models are independent of each other, with no dependence between them.
  3. The final model is produced by collective intelligence (a minimal code sketch follows this list):
    • For classification problems: the results of the K models are combined by voting to obtain the classification result;
    • For regression problems: the mean (mathematical expectation) of the K model results is taken as the final result.
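Below is a minimal sketch of this procedure for a regression problem, assuming scikit-learn and NumPy are available; the data set and hyperparameters are illustrative only.

```python
# Bagging sketch: K bootstrap samples -> K independent trees -> average.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

K = 10                                   # K rounds -> K independent training sets
models = []
for _ in range(K):
    # Bootstrapping: draw N indices with replacement from the original set.
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

# Regression: take the mean of the K model outputs as the final prediction.
y_pred = np.mean([m.predict(X) for m in models], axis=0)
```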

3.2.2 Boosting

The main idea of Boosting is to assemble weak learners into a strong learner: a group of weak classifiers is trained in a sequential, iterative process, and in each round the model is improved by adjusting the weights of the key (hard) samples.

Boosting has two core questions:

  1. How to change the probability distribution and sample weights of the training data?

The error information can be corrected by increasing the weights of the samples misjudged by the weak learner in the previous round and decreasing the weights of the samples judged correctly, or by learning the residuals.

  2. How to combine weak classifiers?

Weak classifiers are combined linearly through an additive model. For example, AdaBoost uses weighted majority voting, increasing the weight of classifiers with a small error rate and reducing the weight of classifiers with a large error rate. GBDT, on the other hand, uses an additive model based on residuals: it gradually reduces the residual by fitting it, and the models generated at each step are summed to obtain the final model.
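To make both answers concrete, here is a hand-rolled AdaBoost-style sketch (assuming scikit-learn for the decision stumps; all variable names are illustrative): sample weights are raised for misclassified points each round, and the stumps are combined by weighted majority voting.

```python
# AdaBoost-style sketch: reweight misclassified samples, combine stumps by
# weighted majority voting. Labels are mapped to {-1, +1}.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)

w = np.full(len(X), 1.0 / len(X))            # start with uniform sample weights
stumps, alphas = [], []

for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum() / w.sum()
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # small error -> big vote
    w = w * np.exp(-alpha * y * pred)                 # raise weight of mistakes
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Additive model: sign of the weighted sum of the weak classifiers.
y_hat = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```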

3.2.3 Bagging vs. Boosting

Bagging and Boosting are different in the following three ways:

  1. Sample selection:

    • Bagging: training sets are drawn from the original set by Bootstrapping (random sampling with replacement), and the training sets selected in different rounds are independent of each other.
    • Boosting: the samples used in each round of training are fixed; only the weight of each sample changes, and the weights are adjusted according to the results of the previous round.
  2. Sample weights:

    • Bagging: sampling is uniform, and every sample carries the same weight.
    • Boosting: sample weights are adjusted according to the error rate; samples with larger errors receive larger weights.
  3. Training and inference (see the sketch after this list):

    • Bagging: every weak learner has the same weight, and the learners can be trained in parallel.
    • Boosting: each weak learner has its own weight, with larger weights for better-fitting classifiers, and the weak learners must be trained sequentially and iteratively.
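As a small illustration of the contrast (a sketch assuming scikit-learn; the data set is synthetic), the two families map directly onto off-the-shelf estimators: Random Forest is Bagging plus decision trees and can parallelize training, while gradient boosting is Boosting plus decision trees and trains its trees sequentially.

```python
# Side-by-side: a Bagging-family model vs. a Boosting-family model in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagging = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)   # parallel
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)         # sequential

print("Random Forest :", cross_val_score(bagging, X, y, cv=5).mean())
print("GBDT          :", cross_val_score(boosting, X, y, cv=5).mean())
```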

4 GBDT

A lot of people confuse GBDT and Xgboost, and one of my favorite interview questions is the difference between GBDT and Xgboost. In fact it is not a very difficult question; you can find a pile of answers on the Internet, and at a surface level the answer is the difference between the first derivative and the second derivative. But the most essential distinction is often left unclear. We can certainly look things up on the Internet, but we have to screen and refine what we find and turn it into our own mastery.

4.1 GBDT definition

The following definition is given: GBDT (Gradient Boosting Decision Tree) is a Boosting algorithm, with the following constraints:

  • The weak learner used by GBDT must be CART, and specifically a regression tree.
  • GBDT is mainly used for regression prediction. Of course, it can also perform classification via a threshold, but it is primarily aimed at regression.

Therefore, we must remember that the weak learner of GBDT can only be a CART tree. Generally, a loss based on MSE is used with the least squares method, which only requires the first-order gradient. Xgb's weak learner is much more flexible and in general only requires the loss to be twice differentiable. So blurring GBDT and Xgb together, or only citing the difference between the first derivative and the second derivative, is too one-sided.
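One quick way to see the "regression tree only" constraint in practice (a sketch assuming scikit-learn; not something stated in the definition above) is to look inside a fitted gradient boosting classifier: even on a classification task, its base learners are regression trees fitted to gradients.

```python
# Even for classification, gradient boosting's base learners are regression trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = GradientBoostingClassifier(n_estimators=5, random_state=0).fit(X, y)

# estimators_ holds the per-round base learners; each is a DecisionTreeRegressor.
print(type(clf.estimators_[0][0]).__name__)   # DecisionTreeRegressor
```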

4.2 GBDT derivation process

Each iteration of GBDT aims to reduce the residual left by the previous iteration. A relatively naive idea is: why not descend along the gradient in the direction of the residual and build a new learner there? Yes, that is exactly where GBDT comes from.

  1. First, suppose the data set is $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ and the model fitting the data is $F(x)$. The residual is then defined as $y_i - F(x_i)$, i.e., the part of the data the model has not yet fitted.
  2. Then, assume the loss function is MSE (the usual loss function for regression problems): $L(y, F(x)) = \frac{1}{2}\sum_{i=1}^{n}(y_i - F(x_i))^2$.
  3. Next, take the derivative of the loss function with respect to $F(x_i)$: $\frac{\partial L(y, F(x_i))}{\partial F(x_i)} = F(x_i) - y_i$.
  4. From the formula above, the residual $y_i - F(x_i)$ equals the negative gradient $-\frac{\partial L}{\partial F(x_i)}$, so the residual points in the same direction as the negative gradient. Verification complete.
  5. Therefore, GBDT can select split nodes according to the gradient of each feature; in other words, each new tree is fitted to the negative gradient (the residual) of the loss (a from-scratch sketch follows this list).
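Below is a from-scratch sketch of this idea under the assumptions above (MSE loss, CART regression trees from scikit-learn; the learning rate and data are illustrative): each round fits a tree to the current residual, i.e., the negative gradient, and adds it to the model.

```python
# GBDT sketch: F_0 = mean(y); each round fits a CART regression tree to the
# residual y - F (the negative gradient under 1/2 * (y - F)^2) and adds it.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

n_trees, lr = 50, 0.1
F = np.full(len(y), y.mean())            # F_0: constant initial prediction
trees = []

for _ in range(n_trees):
    residual = y - F                     # negative gradient of the MSE loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    F = F + lr * tree.predict(X)         # additive model: F_m = F_{m-1} + lr * h_m
    trees.append(tree)

def predict(X_new, base=y.mean(), lr=lr, trees=trees):
    """Sum the initial constant and the scaled contributions of all trees."""
    return base + lr * sum(t.predict(X_new) for t in trees)
```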

5 Extras

Personal introduction: Du Baokun, a practitioner in the privacy computing industry, led the team that built JD's federated learning solution 9N-FL from 0 to 1, and leads both the federated learning framework and the federated learning business from scratch. At the framework level: delivered an industrial federated learning solution supporting super-large scale in the field of e-commerce marketing, covering super-large-sample PSI privacy alignment, tree models, neural network models, and more. At the business level: achieved a good start for the business, created new points of growth, and generated significant economic benefits. Personally, I like to learn new things and dig into technology. Out of consideration for full-link thinking and decision-making on technology planning, my research interests range widely, from engineering architecture and big data to machine learning algorithms and algorithm frameworks. Students who like technology are welcome to get in touch, email: [email protected]