Federated learning technology features
- Data isolation
Data stays local and is never leaked externally
- Equal status
All participants have equal standing
- Lossless
The performance of the federated model is equal or close to that of a model trained on the pooled data
- Mutual benefit
All participants benefit
Federated learning classification
- Horizontal federated learning
The participants' data share the same feature dimensions
- Vertical federated learning
The participants' data share the same sample IDs
In the traditional tabular view of data, the columns are grouped vertically: each party holds a different group of columns, while every column refers to the same sample indices/IDs
Vertical federated learning – a joint-modeling demand scenario
For example
WeBank builds a model jointly with a partner enterprise; WeBank holds Y (the overdue performance label) and wants to improve its prediction model for Y
The data parties
- The partner enterprise
- WeBank
Setting
- Only WeBank has Y (overdue performance)
- The partner cannot expose its X, which contains private data
Problems with the traditional modeling approach
- The partner enterprise lacks Y and cannot model independently
- Transmitting the full X data to WeBank is not feasible
Expected result
- A joint model is built under privacy protection
- The federated model outperforms a model built on either party's data alone
Vertical federated learning
Each participant holds the same sample IDs but different features (some participants may not hold labels)
- Participants need to exchange intermediate results
- Supports models such as XGBoost/SecureBoost
- Neural network models can be supported through split learning
- Large-scale vertical federated systems are more complex
Schematic of vertical federated learning
1. No raw data is exchanged between A and B
2. Encrypted entity alignment means aligning the data under encryption (using homomorphic encryption technology), i.e. computing the sample intersection without exposing non-overlapping samples
3. A third party (coordinator) takes part in model training:
a. The third party sends the public key used for data encryption to A and B
b. A and B exchange the encrypted intermediate results of the training process
c. A and B each compute gradient and loss values (gradient statistics for tree models) and send them to the third party
d. The third party aggregates them and sends the results back to A and B to update the model parameters
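The four sub-steps can be sketched as a toy single-round simulation for vertical logistic regression. This is a minimal sketch: the encryption is deliberately elided (comments mark where ciphertexts would flow in a real system), and the function names and numbers are illustrative, not from any specific framework.

```python
# Toy simulation of the coordinator protocol (steps a-d) for one
# training round of vertical logistic regression, with crypto elided.
import math

def party_A_partial(w_a, x_a):
    # Step b: A computes its share of the linear predictor and would
    # send it to B encrypted under the coordinator's public key.
    return sum(w * x for w, x in zip(w_a, x_a))

def party_B_residual(u_a, w_b, x_b, y):
    # B adds its own share, forms the prediction, and derives the
    # residual d = y_hat - y that both parties' gradients need (step c).
    u = u_a + sum(w * x for w, x in zip(w_b, x_b))
    y_hat = 1 / (1 + math.exp(-u))
    return y_hat - y

def coordinator_update(w, grad, lr=0.1):
    # Step d: the coordinator aggregates/decrypts the gradients and
    # returns updated parameters to each party.
    return [wi - lr * g for wi, g in zip(w, grad)]

# One sample whose features are split between the two parties.
w_a, x_a = [0.5, -0.2], [1.0, 2.0]
w_b, x_b, y = [0.3], [4.0], 1
d = party_B_residual(party_A_partial(w_a, x_a), w_b, x_b, y)
# Each party's gradient is the residual times its own features.
w_a = coordinator_update(w_a, [d * x for x in x_a])
w_b = coordinator_update(w_b, [d * x for x in x_b])
```

In a real deployment the residual and gradients travel as Paillier ciphertexts, so neither data party ever sees the other's features or the raw label statistics.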
Key technologies for federated learning
Approaches to protecting privacy and security
- Homomorphic Encryption (HE)
- Secure Multi-Party Computation (MPC), e.g. Secret Sharing
- Yao's Garbled Circuits
- Differential Privacy (DP)
Homomorphic encryption
Includes fully homomorphic and partially homomorphic schemes
Homomorphic encryption allows additions and multiplications to be carried out directly on ciphertexts, such that decrypting the result equals the result of performing the same operations on the plaintexts
Because encryption is randomized, encrypting the same plaintext twice yields different ciphertexts
Model Averaging based on homomorphic encryption
Properties of homomorphic encryption
- Addition: [[u]] + [[v]] = [[u+v]]
- Scalar multiplication: n·[[u]] = [[n·u]]
Scalar multiplication multiplies a scalar r by a vector v (or matrix M), producing a vector (matrix) whose every element is the product of r and the corresponding element of v (M)
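Both properties can be demonstrated with a textbook Paillier cryptosystem. This is a minimal sketch with toy key sizes, not production crypto; real deployments use 2048-bit moduli.

```python
# Minimal Paillier cryptosystem illustrating the two properties above:
# [[u]] + [[v]] = [[u+v]]  (multiply ciphertexts)
# n  * [[u]]   = [[n*u]]   (raise a ciphertext to a scalar power)
import math, random

p, q = 293, 433                  # toy demo primes
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
L = lambda x: (x - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    r = random.randrange(1, n)   # fresh randomness every call, so
    while math.gcd(r, n) != 1:   # encrypting the same plaintext twice
        r = random.randrange(1, n)   # yields different ciphertexts
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return L(pow(c, lam, n2)) * mu % n

def he_add(c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts.
    return c1 * c2 % n2

def he_scalar_mul(c, k):
    # Exponentiating a ciphertext multiplies the plaintext by k.
    return pow(c, k, n2)

u, v = 7, 12
assert decrypt(he_add(encrypt(u), encrypt(v))) == u + v   # 19
assert decrypt(he_scalar_mul(encrypt(u), 3)) == 3 * u     # 21
```

Only addition and scalar multiplication are supported, which is exactly why Paillier (a partially homomorphic scheme) suffices for gradient aggregation and model averaging.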
Privacy-preserving sample ID matching
A secure intersection scheme based on an RSA + hash mechanism
All vertical algorithms require sample alignment first
(Figures in the original illustrate the demand scenario, the underlying mathematical principle, and a step-by-step walkthrough of the process.)
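The RSA + hash intersection idea can be sketched as follows. Key sizes, helper names, and the sample IDs are illustrative; the point is that party B signs A's blinded hashes without learning A's IDs, and both sides compare only double-hashed tokens.

```python
# Sketch of RSA-blind-signature + hash private set intersection.
import hashlib, math, random

# --- Party B holds the RSA private key (toy primes; real keys 2048-bit) ---
p_, q_ = 104729, 1299709
n = p_ * q_
e = 65537
d = pow(e, -1, math.lcm(p_ - 1, q_ - 1))

def h1(sid):        # hash a sample ID into Z_n
    return int.from_bytes(hashlib.sha256(sid.encode()).digest(), "big") % n

def h2(x):          # outer hash of the signed value -> comparison token
    return hashlib.sha256(str(x).encode()).hexdigest()

def rand_unit():    # blinding factor coprime with n
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            return r

ids_B = ["u01", "u02", "u03", "u05"]
tokens_B = {h2(pow(h1(s), d, n)) for s in ids_B}   # B signs its own IDs

# --- Party A blinds its hashed IDs so B cannot recognize them ---
ids_A = ["u02", "u03", "u04"]
blinds = {s: rand_unit() for s in ids_A}
blinded = {s: h1(s) * pow(r, e, n) % n for s, r in blinds.items()}

# --- B signs the blinded values, learning nothing about A's IDs ---
signed = {s: pow(y, d, n) for s, y in blinded.items()}

# --- A unblinds: (h1(id)^d * r) * r^-1 = h1(id)^d mod n ---
tokens_A = {s: h2(signed[s] * pow(blinds[s], -1, n) % n) for s in ids_A}

intersection = sorted(s for s, t in tokens_A.items() if t in tokens_B)
print(intersection)   # -> ['u02', 'u03']
```

Neither side learns the other's non-overlapping IDs: B sees only blinded values, and A sees only signatures on its own IDs.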
Federated feature engineering
The problem
How can WOE and IV be computed, under privacy protection, for the features of side A (which holds X) and side B (which holds X and Y)?
The difficulties
1. WOE and IV depend on both X and Y at the same time (side B can compute WOE and IV for its own features locally)
2. Side A cannot expose X to side B, and side B cannot expose Y to side A
3. In the end, the WOE and IV of all features may be obtained only by side B
Bin each feature column; within each bin, count the samples by the value of the label column Y (e.g. 0 and 1)
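For reference, here is WOE/IV computed in the clear, as side B can do locally for its own features; in the federated setting the per-bin label counts for side A's features would instead be aggregated under homomorphic encryption. The bin counts below are made up for illustration, and the good/bad labeling convention is one common choice.

```python
# Plain (non-private) WOE/IV computation over per-bin label counts.
# WOE_i = ln( (good_i/good_total) / (bad_i/bad_total) )
# IV    = sum_i (good_i/good_total - bad_i/bad_total) * WOE_i
import math

def woe_iv(bins):
    """bins: list of (good_count, bad_count) per bin, where e.g.
    y=0 counts as 'good' and y=1 as 'bad' (conventions vary)."""
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        gd, bd = g / good_total, b / bad_total
        w = math.log(gd / bd)
        woes.append(w)
        iv += (gd - bd) * w
    return woes, iv

# Two illustrative bins of one feature column.
woes, iv = woe_iv([(40, 10), (60, 90)])
```

Note that only the per-bin counts of Y are needed, not the raw labels, which is what makes encrypted aggregation of these counts sufficient.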
Vertical logistic regression
1. In the traditional logistic regression formulation, y takes the values 0 and 1; in the formulation used here, y takes the values +1 and -1
2. w is the vector of feature weights; x is the feature vector of a sample
3. Logistic regression is essentially a linear model: the inner product w·x splits into the sum of two partial inner products, w_a·x_a + w_b·x_b
4. Prediction requires the sub-models on both sides; either side's model alone is useless
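Item 3 is the key observation, and it can be checked in a few lines (a minimal sketch with made-up weights and features): each party computes its own partial score, and only the two scalars need to be combined.

```python
# The linear part w.x of logistic regression splits exactly into
# w_a.x_a + w_b.x_b when the features are partitioned between parties.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Full model vs. the same model split between party A and party B.
w, x = [0.4, -0.1, 0.7], [2.0, 5.0, 1.0]
w_a, x_a = w[:2], x[:2]          # A's features and weights
w_b, x_b = w[2:], x[2:]          # B's features and weights

full = sigmoid(dot(w, x))
split = sigmoid(dot(w_a, x_a) + dot(w_b, x_b))
assert abs(full - split) < 1e-12  # identical predictions
```

This is why a one-sided model has no value: neither partial score alone determines the prediction.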
SecureBoost
Lossless and secure: only gradient histograms are exchanged, never the data itself
Federated computation of information gain
1. Tree models such as LightGBM and XGBoost evaluate each candidate split point by building gradient histograms:
per-bin sums of the first derivative g and the second derivative h of the loss function
2. Plugging these sums into the gain formula yields the split with the maximum gain
3. Party1 and Party3 cannot compute the loss function of the label Y (overdue performance) themselves
4. Party1 and Party3 therefore encode their bins and send the encrypted statistics to Party2, keeping the data secure
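The gain formula in step 2 can be sketched as follows. The bin values and the regularization constant `lam` are illustrative; in SecureBoost the g/h sums for the passive parties' bins arrive homomorphically encrypted and are decrypted by the label-holding party before this formula is applied.

```python
# XGBoost-style split gain from per-bin gradient sums
# (G = sum of first derivatives g, H = sum of second derivatives h):
# gain = 1/2 * ( G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
#                - (G_L+G_R)^2/(H_L+H_R+lam) )

def best_split(g_bins, h_bins, lam=1.0):
    """Return (gain, index) of the best split over the bin boundaries."""
    G, H = sum(g_bins), sum(h_bins)
    best = (float("-inf"), None)
    gl = hl = 0.0
    for i in range(len(g_bins) - 1):      # candidate: split after bin i
        gl += g_bins[i]
        hl += h_bins[i]
        gr, hr = G - gl, H - hl
        gain = 0.5 * (gl * gl / (hl + lam)
                      + gr * gr / (hr + lam)
                      - G * G / (H + lam))
        best = max(best, (gain, i))
    return best

# Toy histogram: positive gradients in the first two bins, negative after.
gain, idx = best_split([1.0, 1.0, -1.0, -1.0], [1.0, 1.0, 1.0, 1.0])
```

On this toy histogram the best boundary separates the positive-gradient bins from the negative ones, as expected.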
Suppose the candidate splitting feature is Bill Payment with threshold 5500
The passive party numbers each candidate split and sends the encrypted sums of g and h to Party2
After decryption, Party2 can determine the split with the maximum information gain
Party2 sends the result (including the split-point number) to the party that owns the best split, e.g. Party3,
telling Party3 that this split belongs to it
The structure of the tree
- For every node, each party sees only the partyId it belongs to
- The split-point number is visible only to Party2, not to the other parties
- Leaf nodes exist only at Party2: the party that provides the labels fully owns the leaf nodes
Prediction walkthrough
1. A prediction query is first issued to Party2
2. The root node belongs to Party1, so Party2 forwards the query to Party1
3. Party1 checks its split: the feature is Bill Payment with threshold 5000
4. The user's attribute value is above 4000 but below 5500, so the query goes to the left branch
5. The left child belongs to Party3, so the query is sent to Party3
6. Party3's split value is 800; the comparison sends the query on to leaf w2
7. Leaf w2 belongs to Party2, because only Party2 holds the labels and therefore owns every leaf node
8. Party2 simply reads out the leaf value and the query ends
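The walkthrough above can be replayed as a toy traversal. The node ids, the second feature name ("Savings"), the leaf weight, and the tie-breaking rule are illustrative assumptions; the point is that the tree structure is distributed, so each hop forwards the query to the party that privately owns that node's split record.

```python
# Toy replay of the distributed prediction query.
# Public view: which party owns each node, and the child links.
owner = {"root": "Party1", "left": "Party3", "w2": "Party2"}
children = {"root": ("left", "right_leaf"), "left": ("w2", "w3")}

# Private per-party split records: node -> (feature, threshold).
splits = {
    "Party1": {"root": ("Bill Payment", 5000)},
    "Party3": {"left": ("Savings", 800)},    # feature name assumed
}
leaf_weights = {"w2": 0.6}    # leaves live only at the label holder

def predict(sample):
    # Only the path actually taken needs to be resolvable here;
    # unvisited children (right_leaf, w3) are left undefined.
    node = "root"                       # query starts at Party2
    while node in children:
        party = owner[node]             # forward to the owning party
        feat, thr = splits[party][node]
        lo, hi = children[node]
        node = lo if sample[feat] <= thr else hi
    return leaf_weights[node]           # read out at Party2
```

No party ever sees another party's feature values or thresholds; each one only answers "left or right" for the nodes it owns.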
Boosting is an ensemble algorithm: each prediction in essence queries multiple trees
The retrieved leaf weights are multiplied by a scaling factor (the learning rate) and summed
- For binary classification, apply a sigmoid to the sum
- For multi-class classification, apply a softmax
- For regression, the sum itself is the predicted value
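The aggregation step can be sketched directly; the learning rate and the leaf weights below are illustrative numbers.

```python
# Combine per-tree leaf weights into a final prediction:
# scale by the learning rate, sum, then apply the task-specific link.
import math

def aggregate(leaf_weights, learning_rate=0.3, task="binary"):
    score = sum(learning_rate * w for w in leaf_weights)
    if task == "binary":
        return 1 / (1 + math.exp(-score))   # sigmoid for 2 classes
    return score                            # regression: raw sum

def softmax(scores):
    # Multi-class: one boosted score per class, normalized to sum to 1.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

prob = aggregate([0.6, -0.2, 0.4])   # leaf weights from three trees
```

Each leaf weight here is the value that one per-tree distributed query (as in the walkthrough above) returns.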