Federated learning technology features

  • Data isolation
Data stays local and is never exposed externally

  • Equal status
All participants have equal standing

  • Lossless
The performance of the federated model is equal or close to that of a model trained on the pooled data

  • Mutual benefit
All participants benefit from the collaboration

Federated Learning classification

  • Horizontal federated learning
The parties' data share the same feature dimensions (but different samples)

  • Vertical federated learning
The parties' data share the same sample IDs (but different features)



Viewing the data as a traditional table

In the vertical setting the table is split by columns (features)

The columns held by each party share the same sample indices / IDs

Vertical federated learning – a joint modeling scenario

For example

WeBank and a cooperating enterprise build a joint model: WeBank holds the label Y (business performance) and wants to improve its prediction model for Y

The data parties

  • The cooperating enterprise

  • WeBank

Setting

  • Only WeBank has the label Y (overdue performance)

  • The partner cannot expose its X, which contains private information

Problems with the traditional modeling approach

  • The cooperating enterprise lacks Y and cannot model on its own

  • Transmitting the full X data to WeBank is not feasible

Expected results

  • A joint model is built while privacy is protected

  • The federated model outperforms a model built on either party's data alone

Vertical federated learning

Participants share the same sample IDs but hold different features (some participants may not have labels)

  • Participants need to exchange intermediate results
  • Supports models such as XGBoost/SecureBoost
  • Neural network models can be supported through split learning
  • Large-scale vertical federated systems are more complex

Schematic diagram of vertical federated learning


1. No raw data is exchanged between A and B

2. Encrypted entity alignment means privacy-preserving sample alignment (using encryption techniques): the process of finding the intersection of the two parties' samples

3. A third party (coordinator) takes part in model training, as sketched below:

a. The third party sends the public key used for data encryption to A and B

b. A and B exchange the encrypted intermediate results of the training process

c. A and B each compute gradient and loss values (for tree models) and send them to the third party

d. The third party aggregates them and sends the results back to A and B, which update their model parameters

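To make steps a–d concrete, here is a compressed single-script sketch of one round of coordinator-based vertical logistic regression. It assumes the python-paillier package (`phe`) and a common Taylor approximation of the logistic gradient; the data, key size, and the omission of gradient masking are all simplifications, not the exact production protocol.

```python
# One simplified round of coordinator-based vertical logistic regression.
# Labels y are in {-1, +1}; the Taylor-approximated per-sample residual is
# d_i = 0.25*(w_A.x_A + w_B.x_B) - 0.5*y_i, which stays additive under Paillier.
import numpy as np
from phe import paillier

# step a: the third party (arbiter) generates the key pair and shares the public key
pub, priv = paillier.generate_paillier_keypair(n_length=1024)

rng = np.random.default_rng(0)
xa, xb = rng.normal(size=(4, 2)), rng.normal(size=(4, 3))   # aligned toy samples
y = np.array([1, -1, 1, -1])                                # labels held only by B
wa, wb, lr = np.zeros(2), np.zeros(3), 0.1

# step b: A and B exchange encrypted intermediate results
ua = xa @ wa
enc_ua = [pub.encrypt(float(v)) for v in ua]                # A -> B
ub = xb @ wb
enc_d = [(eua + float(u)) * 0.25 - 0.5 * float(t)           # encrypted residuals, B -> A
         for eua, u, t in zip(enc_ua, ub, y)]

# step c: each side builds its encrypted gradient and sends it to the third party
# (a real system would add random masks before sending)
enc_ga = [sum((d * float(x[j]) for d, x in zip(enc_d, xa)), pub.encrypt(0)) for j in range(2)]
enc_gb = [sum((d * float(x[j]) for d, x in zip(enc_d, xb)), pub.encrypt(0)) for j in range(3)]

# step d: the third party decrypts and returns the gradients; A and B update locally
ga = np.array([priv.decrypt(g) for g in enc_ga]) / len(y)
gb = np.array([priv.decrypt(g) for g in enc_gb]) / len(y)
wa -= lr * ga
wb -= lr * gb
print("w_A:", wa, "w_B:", wb)
```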

Key technologies for federated learning

Ways to protect privacy and security

  • Homomorphic Encryption (HE)

  • Secure Multi-Party Computation (MPC), e.g. Secret Sharing

  • Yao's Garbled Circuits

  • Differential Privacy (DP)

Homomorphic encryption

There are two kinds: fully homomorphic and partially (semi-) homomorphic encryption

A homomorphic encryption scheme lets additions and multiplications be performed on ciphertexts so that the decrypted result equals the result of performing the same operations on the plaintexts

The encryption is randomized: encrypting the same plaintext gives a different ciphertext each time

Model Averaging based on homomorphic encryption


Homomorphic encryption features

  • Addition: [[u]] + [[v]] = [[u + v]]

  • Scalar multiplication: n · [[u]] = [[n · u]]

Scalar multiplication of a scalar n with a vector v (or matrix M) yields a vector (matrix) whose elements are n times the corresponding elements of v (M)

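A quick illustration of these two properties, assuming the python-paillier package (`phe`); the numbers are arbitrary:

```python
# Minimal demo of additive homomorphism and scalar multiplication with Paillier.
# Assumes the python-paillier package: pip install phe
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

u, v, n = 3, 2, 4
enc_u = public_key.encrypt(u)          # [[u]]
enc_v = public_key.encrypt(v)          # [[v]]

enc_sum = enc_u + enc_v                # [[u]] + [[v]] = [[u + v]]
enc_scaled = enc_u * n                 # n * [[u]] = [[n * u]]

assert private_key.decrypt(enc_sum) == u + v       # 5
assert private_key.decrypt(enc_scaled) == u * n    # 12

# Encrypting the same plaintext twice gives different ciphertexts (randomized encryption)
assert public_key.encrypt(u).ciphertext() != enc_u.ciphertext()
```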

Privacy-preserving sample ID matching

A secure set-intersection scheme based on an RSA + hash mechanism

All vertical algorithms require sample alignment first

Demand scenarios


The corresponding mathematical principle


A brief walkthrough of the above process
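In outline: party B blind-signs party A's hashed IDs with its RSA private key, so A obtains signatures of its own IDs without revealing them, and both sides then compare hashes of the signatures. Below is a minimal sketch with toy RSA parameters and made-up IDs; it is illustrative only, not the production scheme.

```python
# Toy sketch of RSA-blind-signature private set intersection (PSI) for sample
# ID alignment. Key sizes and helper names are illustrative; real systems use
# 2048-bit keys and hardened hashing.
import hashlib
import random
from math import gcd

# --- party B holds the RSA key pair (toy primes; never use sizes like this)
p, q = 1000003, 1000033
n = p * q
e = 65537
d = pow(e, -1, (p - 1) * (q - 1))

def h1(s: str) -> int:
    """First hash: map an ID into Z_n."""
    return int(hashlib.sha256(s.encode()).hexdigest(), 16) % n

def h2(x: int) -> str:
    """Second hash, applied to the RSA signature."""
    return hashlib.sha256(str(x).encode()).hexdigest()

ids_a = ["u001", "u002", "u003", "u007"]   # party A's sample IDs
ids_b = ["u002", "u003", "u009"]           # party B's sample IDs

# 1. A blinds each hashed ID with a fresh random factor r and sends it to B
blinds, blinded = [], []
for u in ids_a:
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    blinds.append(r)
    blinded.append(h1(u) * pow(r, e, n) % n)      # H(u) * r^e mod n

# 2. B signs the blinded values with its private key and returns them
signed_blinded = [pow(y, d, n) for y in blinded]  # H(u)^d * r mod n

# 3. A removes the blinding factor and hashes the resulting signature
tokens_a = {h2(z * pow(r, -1, n) % n): u
            for z, r, u in zip(signed_blinded, blinds, ids_a)}

# 4. B hashes the signatures of its own IDs and sends the tokens to A
tokens_b = {h2(pow(h1(v), d, n)) for v in ids_b}

# 5. A intersects the token sets; only the common IDs are revealed
intersection = sorted(u for t, u in tokens_a.items() if t in tokens_b)
print(intersection)   # ['u002', 'u003']
```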


Federated feature engineering

The problem

How can WOE and IV be computed for the features of side A (which holds X) and side B (which holds X and Y) while privacy is protected?

The difficulties

1. WOE and IV depend on both X and Y (so side B can compute WOE and IV for its own features locally)

2. Side A cannot expose X to side B, and side B cannot expose Y to side A

3. In the end, only side B should obtain the WOE and IV of all features

Bin (group) each feature column; within each bin, count the samples by the value of the label column Y (e.g. 0 and 1)


Further explanation
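One way to do this, sketched below under the assumption of Paillier encryption (python-paillier, `phe`) and made-up data: side B encrypts its 0/1 labels and sends the ciphertexts to side A; side A adds the encrypted labels within each bin of its feature, obtaining encrypted positive counts without ever seeing Y; side B decrypts the per-bin counts and computes WOE and IV.

```python
# Sketch of privacy-preserving WOE/IV for a feature held by side A, with the
# label held by side B. Assumes python-paillier (phe); data and bins are toy values.
import numpy as np
from phe import paillier

pub, priv = paillier.generate_paillier_keypair(n_length=1024)

# --- side B: labels for the aligned samples (1 = bad / overdue, 0 = good)
y = np.array([1, 0, 0, 1, 0, 1, 0, 0])
enc_y = [pub.encrypt(int(t)) for t in y]          # [[y_i]] sent to side A

# --- side A: one feature and a binning of it (3 bins here)
x = np.array([0.2, 0.5, 0.9, 0.1, 0.4, 0.8, 0.3, 0.7])
bins = np.digitize(x, [0.33, 0.66])               # bin index 0, 1 or 2

# side A sums the encrypted labels per bin -> encrypted positive count per bin;
# it also counts total samples per bin (not sensitive)
enc_pos = [sum((e for e, b in zip(enc_y, bins) if b == k), pub.encrypt(0))
           for k in range(3)]
totals = [int(np.sum(bins == k)) for k in range(3)]
# side A sends enc_pos and totals back to side B

# --- side B: decrypt the per-bin positive counts and compute WOE / IV locally
pos = np.array([priv.decrypt(c) for c in enc_pos], dtype=float)
neg = np.array(totals, dtype=float) - pos
pos_rate = (pos + 0.5) / (pos.sum() + 0.5)        # smoothed to avoid log(0)
neg_rate = (neg + 0.5) / (neg.sum() + 0.5)
woe = np.log(pos_rate / neg_rate)
iv = float(np.sum((pos_rate - neg_rate) * woe))
print("WOE per bin:", woe, "IV:", iv)
```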


Vertical logistic regression

1. In the traditional logistic regression formulation the label y takes the values 0 and 1; in the formulation used here, y takes the values +1 and -1

2. w is the vector of feature weights; x is the sample's feature vector

3. Logistic regression is essentially a linear model w·x, and in the two-party setting the inner product splits into wa·xa + wb·xb

4. Prediction needs the sub-models of both sides; either party's sub-model is useless on its own

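A tiny numpy sketch of how the two partial scores combine (weights and data are made up; in the real protocol the partial scores are exchanged in encrypted or masked form):

```python
# Split scoring in vertical logistic regression: each party computes only the
# inner product over its own features, and the full score is their sum.
import numpy as np

xa, wa = np.array([0.4, 1.2]), np.array([0.3, -0.1])             # party A: features + weights
xb, wb = np.array([2.0, 0.5, 1.0]), np.array([0.2, 0.6, -0.4])   # party B: features + weights
y = 1                                                            # label y in {-1, +1}, held by B

ua = wa @ xa            # A's partial score
ub = wb @ xb            # B's partial score
z = ua + ub             # full score w.x = wa.xa + wb.xb

prob = 1.0 / (1.0 + np.exp(-z))          # probability that y = +1
loss = np.log(1.0 + np.exp(-y * z))      # logistic loss for y in {-1, +1}
print(prob, loss)
```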

SecureBoost

Lossless and secure: only gradient histograms are exchanged, never the raw data

Federated computation of information gain

1. Tree models such as LightGBM and XGBoost compute the information gain of each candidate split point from the gradient histogram

The histogram holds, per bin, the sums of the first derivative g and the second derivative h of the cost function

Plugging these sums into the gain formula gives the gain of each candidate split, and hence the maximum gain (see the sketch below)

Party1 and Party3 cannot compute the cost function, because they do not hold the business-performance label Y

Party1 and Party3 only number (encode) their candidate splits and send encrypted per-split sums to Party2, which keeps the data secure

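For reference, an XGBoost-style split gain computed from the aggregated sums of g and h on each side of a candidate split (lambda and gamma are the usual regularization parameters; the numbers are made up):

```python
# XGBoost-style split gain from aggregated gradient statistics.
# GL/HL are the sums of g/h on the left of the candidate split, GR/HR on the right.
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Example: pick the best of two candidate splits from their (made-up) histograms
candidates = {"BillPayment<5500": (-3.2, 4.1, 1.8, 2.9),
              "Income<8000":      (-1.0, 3.5, -0.4, 3.5)}
best = max(candidates, key=lambda k: split_gain(*candidates[k]))
print(best, split_gain(*candidates[best]))
```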

Say a candidate split is on the feature Bill Payment with a threshold of 5500

The owning party numbers this candidate (the encoding) and sends the encrypted sums of g and h for it to Party2

After decryption, Party2 can determine the candidate with the maximum information gain

Party2 sends the result (including the number of the winning split point) back to the party that owns it, for example Party3

That is, it tells Party3: this split belongs to you

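A rough sketch of that exchange, again assuming `phe` and made-up data: Party2 encrypts the per-sample g and h and distributes them; a passive party such as Party3 sums the ciphertexts on each side of its candidate split and returns only a split number plus the encrypted sums.

```python
# Passive-party side of SecureBoost-style histogram building (simplified; one
# candidate split only, no masking). Assumes python-paillier (phe).
import numpy as np
from phe import paillier

pub, priv = paillier.generate_paillier_keypair(n_length=1024)

# --- Party2 (label holder): per-sample gradients g and hessians h, encrypted
g = np.array([-0.4, 0.6, -0.3, 0.5, -0.2])
h = np.array([0.24, 0.24, 0.21, 0.25, 0.16])
enc_g = [pub.encrypt(float(v)) for v in g]   # sent to Party1 / Party3
enc_h = [pub.encrypt(float(v)) for v in h]

# --- Party3 (passive): its private feature values and one candidate split
bill_payment = np.array([5200, 6100, 4800, 7000, 5300])
threshold = 5500                              # candidate split "Bill Payment < 5500"
left = bill_payment < threshold

def enc_sum(values):                          # homomorphic sum of ciphertexts
    return sum(values[1:], values[0])

msg = {"split_id": 7,                         # Party3 reveals only a split number
       "GL": enc_sum([e for e, m in zip(enc_g, left) if m]),
       "HL": enc_sum([e for e, m in zip(enc_h, left) if m]),
       "GR": enc_sum([e for e, m in zip(enc_g, left) if not m]),
       "HR": enc_sum([e for e, m in zip(enc_h, left) if not m])}

# --- Party2: decrypt the sums and evaluate the gain (e.g. with split_gain above)
GL, HL, GR, HR = (priv.decrypt(msg[k]) for k in ("GL", "HL", "GR", "HR"))
print(msg["split_id"], GL, HL, GR, HR)
```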

The structure of the tree

Each party sees, for every node, only the partyId of the party it belongs to

The record number of the split is visible only to Party2, not to the other parties

Leaf nodes exist only at Party2: the party that provides the label fully owns the leaf nodes

Prediction starts with a query issued to Party2

The root node belongs to Party1

So Party2 forwards the query to Party1

Party1 checks its split: the feature is Bill Payment with a threshold of 5000

The user's attribute value is above 4000 and below 5500, so traversal goes to the left

The left child node belongs to Party3

So the query is sent on to Party3

Party3's split value is 800

The user's value is below the split value

So traversal arrives at leaf w2

w2 belongs to Party2, because it is a leaf node and only Party2 holds the labels

Party2 reads out the leaf value and the query ends

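A toy sketch of this query routing (the party layout, record ids and feature values are invented; there is no networking or encryption here, only the control flow):

```python
# Toy federated tree lookup: each internal node records only which party owns it
# and a local record id; the owning party resolves the comparison privately.
# Leaf weights live only at Party2 (the label holder).
tree = {                      # node_id -> (owner_party, record_id, left_child, right_child)
    0: ("Party1", 11, 1, 2),
    1: ("Party3", 42, "w2", "w1"),
    2: ("Party1", 12, "w3", "w4"),
}
leaf_weights = {"w1": -0.3, "w2": 0.8, "w3": 0.1, "w4": -0.5}   # held only by Party2

# Each passive party keeps a private lookup table: record_id -> (feature, threshold)
party_splits = {"Party1": {11: ("BillPayment", 5000), 12: ("Income", 3000)},
                "Party3": {42: ("Balance", 800)}}
# ... and its own slice of the user's feature values
party_features = {"Party1": {"BillPayment": 4600, "Income": 2500},
                  "Party3": {"Balance": 500}}

def predict(node_id=0):
    """Route the query party-to-party until it reaches a leaf held by Party2."""
    while node_id not in leaf_weights:
        owner, record_id, left, right = tree[node_id]
        feature, threshold = party_splits[owner][record_id]    # only the owner knows this
        go_left = party_features[owner][feature] < threshold   # comparison done locally at the owner
        node_id = left if go_left else right
    return leaf_weights[node_id]                               # read out by Party2

print(predict())   # 0.8 -> leaf w2
```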
Boosting is an ensemble algorithm, so a prediction actually queries multiple trees

Each tree's leaf weight is multiplied by the shrinkage (learning-rate) factor and the results are summed

For a binary classification problem, apply a sigmoid to the sum

For multi-class problems, apply a softmax

For regression problems, the sum itself is the predicted value

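A last small snippet tying the aggregation together (leaf outputs and the shrinkage factor are made up):

```python
# Combine the leaf weights returned by the individual trees of the ensemble.
import numpy as np

leaf_outputs = [0.8, -0.2, 0.5]      # one leaf value per tree for this sample
learning_rate = 0.3                  # shrinkage factor

score = learning_rate * np.sum(leaf_outputs)
prob = 1.0 / (1.0 + np.exp(-score))  # binary classification: sigmoid of the sum
print(score, prob)                   # for regression, `score` itself is the prediction
```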