Abstract: This paper introduces the vertical federated logistic regression (LR) scheme adopted by Huawei Cloud Trusted Intelligent Computing Service (TICS).
This article is from the Huawei Cloud community post "Logistic Regression (LR) in the Vertical Federated Learning Scenario", by Soda with ice.
Massive training data is an important condition for the successful application of artificial intelligence in many fields. For example, AI algorithms in computer vision and financial recommendation systems rely on large-scale, well-labeled data to achieve good inference results. However, in healthcare, banking, and some government sectors, growing data-privacy protection requirements have led to a severe shortage of usable data. To address this problem, Huawei Cloud Trusted Intelligent Computing Service (TICS) designed a multi-party federated learning scheme to break down data barriers in the banking, government, and enterprise sectors and enable secure data sharing.
1. What is logistic regression?
Regression is a statistical analysis method that describes the interdependence between independent and dependent variables. Linear regression, a common regression method, is often used to fit linear models (linear relationships).
Logistic regression, although it has "regression" in its name, is not a model-fitting method but a simple binary classification algorithm. It has the advantages of simple implementation and high efficiency.
Figure 1.1 Two-dimensional linear regression
Figure 1.2 Three-dimensional linear regression
1.1 Linear Regression
Figures 1.1 and 1.2 represent two-dimensional and three-dimensional linear regression models respectively. The fitted line in Figure 1.1 (blue) can be expressed as y = ax + b, and it makes the total Euclidean distance from all data points (red) to the line the shortest; this Euclidean distance is commonly used to build the target loss function, from which the model is solved. Similarly, in Figure 1.2 the total Euclidean distance from all data points to the two-dimensional plane is the shortest. In general, the linear regression model can be expressed as:

hθ(x) = θ^T x

where θ represents the model coefficients.
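As an illustration (not part of the original article), the two-dimensional model y = ax + b can be fitted by ordinary least squares with NumPy; the data points below are made up:

```python
import numpy as np

# Toy data lying near the line y = 2x + 1 (made-up example values)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a bias column, so theta = [a, b]
X = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: minimizes the total squared distance
# between the data points and the fitted line
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b = theta
print(round(a, 2), round(b, 2))
```

The same call generalizes to the three-dimensional case of Figure 1.2 by adding a second feature column to the design matrix.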
1.2 Logistic regression (LR)
LR is a simple supervised machine learning algorithm. For an input x, the logistic regression model gives the probability of y > 0 or y < 0, from which it infers whether the sample is a positive or a negative sample.
LR introduces the sigmoid function to infer the probability that a sample is positive. For an input sample x, the probability that x is a positive sample can be expressed as P(y|x) = g(y), where g(·) is the sigmoid function:

g(y) = 1 / (1 + e^(-y))

Its curve is shown in Figure 1.3; the output range is (0, 1):
Figure 1.3 Sigmoid curve
For a given model θ and sample x, the probability that y = 1 can be expressed as:

P(y = 1 | x; θ) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
The sigmoid function is therefore well suited to binary classification: when g(y) > 0.5, P(y|x) > 0.5 and the sample is judged positive, corresponding to y > 0; conversely, when g(y) < 0.5, P(y|x) < 0.5 and the sample is judged negative, corresponding to y < 0.
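This decision rule can be sketched in a few lines of Python (a toy illustration; the model weights and sample values are invented for the example):

```python
import math

def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + e^(-z)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(theta, x):
    """Predict positive (1) when g(theta^T x) > 0.5, else negative (0)."""
    z = sum(t * xi for t, xi in zip(theta, x))  # theta^T x
    return 1 if sigmoid(z) > 0.5 else 0

# g(z) > 0.5 exactly when z > 0, matching the y > 0 condition above
print(classify([1.0, -2.0], [3.0, 1.0]))  # z = 1  -> positive
print(classify([1.0, -2.0], [0.0, 1.0]))  # z = -2 -> negative
```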
1.3 LR loss function
LR uses the logarithmic loss function. For a training set S with samples x ∈ S, the loss function can be expressed as (reference: zhuanlan.zhihu.com/p/44591359):

L(θ) = -(1/|S|) Σ_{x∈S} [ y·log g(θ^T x) + (1 - y)·log(1 - g(θ^T x)) ]
The gradient descent algorithm is one of the classical methods for solving the LR model. The iterative model update can be expressed as:

θ := θ - α · ∂L(θ)/∂θ,  where ∂L(θ)/∂θ = (1/|S'|) Σ_{x∈S'} (g(θ^T x) - y)·x
Here:

- L(·) is the target loss function, essentially the average logarithmic loss.
- S' is a batch of data (of size batchsize). Batching introduces random perturbation, which helps the model weights approach the optimum faster.
- α is the learning rate, which directly affects the convergence speed of the model. If it is too large, the loss oscillates back and forth and cannot reach the minimum; if it is too small, the loss converges too slowly and may fail to reach the minimum in a reasonable time.
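Putting these pieces together, a minimal mini-batch gradient-descent trainer for LR might look like the following sketch (the data, batch size, learning rate, and epoch count are all illustrative assumptions, not values from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_lr(X, y, alpha=0.1, batchsize=2, epochs=200, seed=0):
    """Mini-batch gradient descent on the average log loss.
    For a batch S', the gradient is (1/|S'|) * X'^T (g(X' theta) - y')."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)            # random batches each epoch
        for start in range(0, n, batchsize):
            idx = order[start:start + batchsize]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (sigmoid(Xb @ theta) - yb) / len(idx)
            theta -= alpha * grad             # theta := theta - alpha * dL/dtheta
    return theta

# Toy linearly separable data: label 1 when x1 + x2 > 0 (made up)
X = np.array([[2.0, 1.0], [1.0, 1.5], [-1.0, -2.0], [-2.0, -0.5]])
Xb = np.column_stack([X, np.ones(len(X))])    # add a bias column
y = np.array([1.0, 1.0, 0.0, 0.0])
theta = train_lr(Xb, y)
preds = (sigmoid(Xb @ theta) > 0.5).astype(int)
print(preds)
```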
2. LR in the vertical federated learning scenario
Vertical federated learning has been introduced many times before, and many excellent products have appeared on the market, such as FATE and Huawei Cloud Trusted Intelligent Computing Service (TICS). Vertical federation enables multiple participants to jointly use their data and features to train a higher-precision model without exposing their own data, which is of great significance in industries such as finance and government affairs.
Figure 2.1 Vertical federated LR
2.1 Vertical federated implementation of LR
Participants in vertical federated learning join the federation in order to benefit from shared data without exposing their own data, so any sensitive data must be encrypted before it leaves its owner's trust domain (Figure 2.1; see arxiv.org/pdf/1711.10…
The vertical federated flow of LR is shown in Figure 2.2, where host denotes the party holding only features and guest denotes the party holding the labels.
Figure 2.2 Implementation flow of the vertical federated LR algorithm
- Before training begins, the two parties exchange homomorphic public keys.
- Each batch iteration within an epoch contains four steps: calEncryptedU -> calEncryptedGradient -> decryptGradient -> updateLrModel. Both guest and host execute them in this order (the flow chart shows only the execution on the guest, acting as the initiator).
- Adding random noise to the gradient in step A2 prevents U from being leaked and causing security problems.
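The role of the random noise in step A2 can be illustrated with a plain-number simulation (this is NOT real homomorphic encryption; the helper names and gradient values below are invented for illustration). The party requesting decryption masks the gradient first, so the decrypting party only ever sees noise-shifted numbers:

```python
import random

# Plain-number stand-in: in the real protocol these values would be
# homomorphic ciphertexts, and the addition of noise would be performed
# on ciphertexts without any decryption.
def mask_gradient(encrypted_grad, rng):
    """Requester side: add random noise before sending for decryption (step A2)."""
    noise = [rng.uniform(-100, 100) for _ in encrypted_grad]
    masked = [g + r for g, r in zip(encrypted_grad, noise)]
    return masked, noise

def unmask_gradient(decrypted_masked, noise):
    """Requester side: subtract the noise after the other party decrypts."""
    return [m - r for m, r in zip(decrypted_masked, noise)]

rng = random.Random(42)
true_grad = [0.25, -0.5, 0.125]          # made-up gradient values
masked, noise = mask_gradient(true_grad, rng)
# The decrypting party sees only `masked`, which reveals nothing about
# the true gradient without the noise vector.
recovered = unmask_gradient(masked, noise)
print([round(g, 6) for g in recovered])
```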
Since homomorphic encryption only supports addition and multiplication of integers and floating-point numbers, the exponential part of the model iteration formula in Section 1.3 is replaced by its Taylor expansion:

log(1 + e^(-z)) ≈ log 2 - z/2 + z²/8,  where z = θ^T x
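As a quick numeric check, the second-order expansion log(1 + e^(-z)) ≈ log 2 - z/2 + z²/8 (the approximation used in the federated LR literature cited above) stays close to the exact log-loss term near z = 0, while using only additions and multiplications:

```python
import math

def exact_loss_term(z):
    """Exact per-sample term log(1 + e^(-z)) from the log loss."""
    return math.log(1.0 + math.exp(-z))

def taylor_loss_term(z):
    """Second-order Taylor expansion around z = 0:
    log(1 + e^(-z)) ~ log 2 - z/2 + z^2/8. Only + and * are needed,
    so it can be evaluated under homomorphic encryption."""
    return math.log(2.0) - z / 2.0 + z * z / 8.0

for z in (0.0, 0.5, 1.0):
    err = abs(exact_loss_term(z) - taylor_loss_term(z))
    print(f"z={z}: error={err:.4f}")
```

The approximation error grows with |z|, which is one reason features are typically normalized before federated training.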