PaddleFL (Paddle Federated Learning)

PaddleFL is an open-source federated learning framework based on PaddlePaddle. Researchers can easily replicate and compare different federated learning algorithms with PaddleFL, and developers can easily deploy PaddleFL federated learning systems in large distributed clusters. PaddleFL provides a variety of federated learning strategies (horizontal and vertical federated learning) and their applications in computer vision, natural language processing, recommendation, and more. PaddleFL will also support traditional machine learning training strategies, such as multi-task learning and transfer learning, in federated learning settings. Built on PaddlePaddle's large-scale distributed training and Kubernetes' flexible scheduling of training tasks, PaddleFL can be deployed entirely on full-stack open-source software.

PaddleFL overview

Today, data is becoming more and more expensive, and it is difficult to share raw data across organizations. Federated learning aims to solve the problems of data isolation and secure sharing of knowledge derived from data between organizations. The concept of federated learning was proposed by researchers at Google [1,2,3]. PaddleFL implements federated learning on top of the PaddlePaddle framework and provides example applications in natural language processing, computer vision, and recommendation. PaddleFL supports the two mainstream federated learning strategies: horizontal and vertical federated learning [4]. Multi-task learning [7] and transfer learning [8] in federated settings will be developed and supported in the future.

  • Horizontal federated learning strategies: Federated Averaging (FedAvg) [2], differential privacy [6], and secure aggregation [11];
  • Vertical federated learning strategies: two-party training based on PrivC [5] and three-party training based on ABY3 [10];

PaddleFL architecture

PaddleFL provides two main solutions: Data Parallel and Federated Learning with MPC (PFM).

  • Through Data Parallel, each data party can complete model training based on classical horizontal federated learning strategies (such as FedAvg, DPSGD, etc.).

  • PFM is a federated learning scheme based on secure multi-party computation (MPC). As an important part of PaddleFL, PFM supports federated learning in multiple scenarios, including horizontal, vertical, and federated transfer learning. It provides both reliable security and considerable performance.

Operation Mechanism (Data Parallel)

In PaddleFL, model training is divided into two stages: the compilation stage and the operation stage. The compilation stage defines the federated learning task, while the operation stage carries out the federated training. Each stage contains the following components:

A. Compilation phase

  • FL-Strategy: Users define the federated learning strategy to use, such as FedAvg [2].

  • User-Defined-Program: A PaddlePaddle program that defines the machine learning model structure and training strategy, such as multi-task learning.

  • Distributed-Config: In federated learning, the system is deployed in a distributed environment. The distributed training configuration defines information about the distributed training nodes.

  • FL-Job-Generator: Given the FL-Strategy, User-Defined-Program, and Distributed-Config, the FL-Job-Generator produces FL jobs for the federated parameter server side and the worker side. These FL jobs are sent to the participating organizations and the federated parameter servers for joint training (see the sketch after this list).
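A minimal compile-phase sketch in the spirit of the CTR example's fl_master.py is shown below. The module paths, class names (JobGenerator, FLStrategyFactory), and method names are assumptions taken from the PaddleFL examples, and the MLP model is only a placeholder; consult fl_master.py in the repository for the exact API.

# Compile-phase sketch in the style of fl_master.py.
# Module paths and class/method names are assumptions based on the PaddleFL examples.
import paddle.fluid as fluid
from paddle_fl.paddle_fl.core.master.job_generator import JobGenerator
from paddle_fl.paddle_fl.core.strategy.fl_strategy_factory import FLStrategyFactory

# User-Defined-Program: an ordinary PaddlePaddle model (placeholder MLP).
inputs = fluid.layers.data(name="x", shape=[10], dtype="float32")
label = fluid.layers.data(name="y", shape=[1], dtype="int64")
hidden = fluid.layers.fc(input=inputs, size=64, act="relu")
predict = fluid.layers.fc(input=hidden, size=2, act="softmax")
loss = fluid.layers.mean(fluid.layers.cross_entropy(input=predict, label=label))

# FL-Strategy: choose FedAvg with a number of local steps per round.
build_strategy = FLStrategyFactory()
build_strategy.fed_avg = True
build_strategy.inner_step = 10
strategy = build_strategy.create_fl_strategy()

# FL-Job-Generator: combine strategy, program and distributed config
# into per-role FL jobs written to an output directory.
job_generator = JobGenerator()
job_generator.set_optimizer(fluid.optimizer.SGD(learning_rate=0.1))
job_generator.set_losses([loss])
job_generator.set_startup_program(fluid.default_startup_program())
job_generator.set_infer_feed_and_target_names([inputs.name], [predict.name])

# Distributed-Config: server endpoints and number of workers.
job_generator.generate_fl_job(
    strategy, server_endpoints=["127.0.0.1:8181"], worker_num=2, output="fl_job_config")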

B. Operation phase

  • FL-Server: The federated parameter server, running in the cloud or in a third-party cluster.

  • FL-Worker: Each organization participating in federated learning has one or more workers that communicate with the federated parameter server.

  • FL-Scheduler: Schedules the workers during training and decides which workers may participate in training before each update cycle (a scheduler sketch follows this list).
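The scheduler's role can be illustrated with a short sketch modeled on the example's fl_scheduler.py. The class name FLScheduler, its module path, and the method names are assumptions based on the PaddleFL examples; verify them against the example source.

# Runtime sketch of the FL-Scheduler, in the style of fl_scheduler.py.
# Module path and method names are assumptions based on the PaddleFL examples.
from paddle_fl.paddle_fl.core.scheduler.agent_master import FLScheduler

worker_num = 2   # number of FL-Workers expected to register
server_num = 1   # number of FL-Servers

# The scheduler listens on a port, waits for workers and servers to register,
# and decides which workers join each update round.
scheduler = FLScheduler(worker_num, server_num, port=9091)
scheduler.set_sample_worker_num(worker_num)  # workers sampled per round
scheduler.init_env()
scheduler.start_fl_training()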

C. Example

CTR model example source code (a trainer-side sketch follows the file list):

  • fl_master.py

  • fl_scheduler.py

  • fl_server.py

  • fl_trainer.py
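As a rough illustration of the worker side, the following trainer sketch is modeled on fl_trainer.py. Module paths, class names (FLTrainerFactory, FLRunTimeJob), endpoint attributes, and the feed format are assumptions based on the CTR example and may differ between PaddleFL versions; the random batch is only a stand-in for real training data.

# Trainer-side sketch in the style of fl_trainer.py (names are assumptions).
import sys
import numpy as np
from paddle_fl.paddle_fl.core.trainer.fl_trainer import FLTrainerFactory
from paddle_fl.paddle_fl.core.master.fl_job import FLRunTimeJob

trainer_id = int(sys.argv[1])                        # e.g. 0 or 1
job = FLRunTimeJob()
job.load_trainer_job("fl_job_config", trainer_id)    # FL job generated at compile time
job._scheduler_ep = "127.0.0.1:9091"                 # FL-Scheduler endpoint

trainer = FLTrainerFactory().create_fl_trainer(job)
trainer._current_ep = "127.0.0.1:{}".format(9000 + trainer_id)
trainer.start()

# Each round: train locally on this party's data, then sync with the FL-Server.
for step in range(10):
    batch = {"x": np.random.random((32, 10)).astype("float32"),
             "y": np.random.randint(0, 2, (32, 1)).astype("int64")}
    trainer.run(feed=batch, fetch=[])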

Federated Learning with MPC (Secure Multi-Party Computation)

The secure training and inference tasks in PaddleFL MPC are implemented on top of efficient multi-party computation protocols. PaddleFL supports the three-party secure computation protocol ABY3 [10] and the two-party protocol PrivC [5]. Two-party federated learning based on PrivC mainly supports linear/logistic regression and DNN models. Three-party federated learning based on ABY3 supports linear/logistic regression, DNN, CNN, FM, and other models.

In PaddleFL MPC, participants are divided into input parties, computing parties, and result parties. The input party holds the training data and model and is responsible for encrypting them and sending them to the computing parties (the ABY3 protocol uses three compute nodes and the PrivC protocol uses two). The computing parties execute the training and complete the training task according to the specific secure multi-party computation protocol. A computing party only ever sees encrypted data and models, which guarantees data privacy. After the computation, the result party collects the computation result and recovers the plaintext. Each participant can play multiple roles; for example, a data owner can also participate in the training as a computing party.

A. Data preparation

  • Private data alignment: PFM allows data owners to find the set of samples shared by multiple parties without exposing their own data. This feature is necessary in vertical federated learning, which requires aligning the data of multiple data parties before training while protecting user data privacy.
  • Data encryption and distribution: PFM provides both online and offline data encryption and distribution. With offline distribution, the data party encrypts the data and model by secret sharing [9] during the data preparation stage and then transfers them to the computing parties by direct transmission or through database storage. With online distribution, the data party encrypts and distributes the data and model online during training. In either case, each computing party only receives a portion of the data, so no single computing party can recover the original data (a secret-sharing sketch follows this list).
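To make concrete why a computing party that holds only its portion learns nothing, here is an illustrative additive secret-sharing sketch. It is not PaddleFL's implementation (ABY3 and PrivC use more elaborate sharing schemes and fixed-point encodings); it only shows how a value can be split into random shares that individually reveal nothing but jointly reconstruct the plaintext.

# Illustrative additive secret sharing (not PaddleFL's actual code).
import secrets

RING = 2 ** 64  # share arithmetic is done modulo this ring size

def share(value, n_parties=3):
    """Split an integer into n additive shares modulo RING."""
    shares = [secrets.randbelow(RING) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % RING
    return shares + [last]

def reconstruct(shares):
    """Recover the plaintext by summing all shares modulo RING."""
    return sum(shares) % RING

secret = 42
parts = share(secret)            # one part is sent to each computing party
assert reconstruct(parts) == 42  # only all parts together reveal the secret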

B. Training/inference

PFM follows the same programming model as PaddlePaddle. Before training, users define the MPC protocol, the training model, and the training strategy. paddle_fl.mpc provides operators that work on encrypted data; instances of these operators are created and run in turn by the executor at runtime (Gloo and gRPC are supported for ciphertext communication during training). A sketch of this workflow follows.
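Below is a minimal sketch of defining an encrypted training network, loosely modeled on the PFM linear-regression demo. The names pfl_mpc.init, pfl_mpc.data, pfl_mpc.layers, and pfl_mpc.optimizer are taken from the PaddleFL MPC examples and should be treated as assumptions; the data fed at runtime would actually be the encrypted shares prepared during the data-preparation step.

# Sketch of an encrypted linear-regression network with paddle_fl.mpc.
# Module and function names are assumptions based on the PFM examples.
import sys
import paddle.fluid as fluid
import paddle_fl.mpc as pfl_mpc

role, server_ip, port = int(sys.argv[1]), sys.argv[2], int(sys.argv[3])
pfl_mpc.init("aby3", role, "localhost", server_ip, port)  # choose the MPC protocol

# Ciphertext inputs: shares are int64 tensors produced by the data parties.
x = pfl_mpc.data(name="x", shape=[8, 13], dtype="int64")
y = pfl_mpc.data(name="y", shape=[8, 1], dtype="int64")

# Model and loss are built from MPC operators that compute on encrypted data.
y_pre = pfl_mpc.layers.fc(input=x, size=1)
cost = pfl_mpc.layers.square_error_cost(input=y_pre, label=y)
avg_loss = pfl_mpc.layers.mean(cost)
pfl_mpc.optimizer.SGD(learning_rate=0.001).minimize(avg_loss)

# The program is then run by a standard fluid.Executor, feeding encrypted
# shares; Gloo/gRPC handle the ciphertext communication between parties.
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())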

Please refer to the following documentation for more information on the training phase.

C. Result reconstruction

After the secure training and inference work is completed, the model (or prediction results) is output by the computing parties in encrypted form. The result party collects the encrypted results, decrypts them using the tools in PFM, and delivers the plaintext results to the user (data sharding and reconstruction currently support both offline and online modes).

You can obtain more information from the MPC reference sample.

Installation

Environment requirements

  • CentOS 7 (64-bit)
  • Python 3.5/3.6/3.7/3.8 (64-bit)
  • pip3 9.0.1+ (64-bit)
  • PaddlePaddle 1.8.5
  • Redis 5.0.8 (64-bit)
  • GCC/G++ 8.3.1
  • CMake 3.15+

Installation and deployment

PaddleFL provides three installation methods:

1. Use PaddleFL in Docker

## Pull and run the Docker image
docker pull paddlepaddle/paddlefl:1.2.0
docker run --name <docker_name> --net=host -it -v $PWD:/paddle <image id> /bin/bash

The Docker image already contains the configured environment with PaddlePaddle and PaddleFL installed, so you can run the sample code directly and start using PaddleFL.

2. Install from packages

  • Install environment dependencies through YUM
## Update yum source
sudo yum -y clean all
sudo yum -y makecache
sudo yum update
## Download the yum source configuration file
cd /etc/yum.repos.d
rm -rf ./*
wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-vault-8.5.2111.repo

## Install Python 3
sudo yum install python3

## Verify that the installation was successful
python3 --version

## Check the python3 installation location
whereis python3

## Set the default python version to python3
sudo alternatives --set python /usr/bin/python3

## Upgrade pip
sudo python -m pip install --upgrade pip

## Check the cmake version
cmake --version
  • Install PaddlePaddle

PaddlePaddle installation guide

PaddlePaddle quick installation

## Install PaddlePaddle: install the CPU version with pip under Linux
sudo python -m pip install paddlepaddle==2.2.2 -i https://mirror.baidu.com/pypi/simple
  • Verify that PaddlePaddle is installed successfully

Once installed, use python or python3 to enter the Python interpreter and type import paddle, followed by paddle.utils.run_check().

Verification for versions prior to Paddle 2.0: use python or python3 to enter the Python interpreter and type import paddle.fluid as fluid, followed by fluid.install_check.run_check().

If "PaddlePaddle is installed successfully!" is printed, the installation succeeded.
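For convenience, the check described above can be run directly in the interpreter (Paddle 2.x API):

# Verify the PaddlePaddle installation (Paddle 2.x API)
import paddle
paddle.utils.run_check()  # prints "PaddlePaddle is installed successfully!" on success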

  • Install PaddleFL
## Install PaddleFL
sudo pip3 install paddle_fl

The above command automatically installs the PaddleFL package corresponding to Python 3.8.

3. Simple Kubernetes deployment

Horizontal federated scheme

kubectl apply -f ./python/paddle_fl/paddle_fl/examples/k8s_deployment/master.yaml

For details, see the K8S deployment example.

You can also refer to the K8S cluster application and kubectl installation guides to configure your own K8S cluster.

API changes from Paddle 1.x to Paddle 2.0

  1. Table of Paddle 1.8 and Paddle 2.0 API mappings

  2. PyTorch-PaddlePaddle API mapping table

  3. Version migration tool

  4. Paddle 2.0 upgrade FAQs

References

[1]. Jakub Konečný, H. Brendan McMahan, Daniel Ramage, Peter Richtárik. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. 2016

[2]. H. Brendan McMahan, Eider Moore, Daniel Ramage, Blaise Agüera y Arcas. Federated Learning of Deep Networks using Model Averaging. 2017

[3]. Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, Dave Bacon. Federated Learning: Strategies for Improving Communication Efficiency. 2016

[4]. Qiang Yang, Yang Liu, Tianjian Chen, Yongxin Tong. Federated Machine Learning: Concept and Applications. 2019

[5]. Kai He, Liu Yang, Jue Hong, Jinghua Jiang, Jieming Wu, Xu Dong et al. PrivC – A Framework for Efficient Secure Two-Party Computation. In Proc. of SecureComm 2019

[6]. Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, Li Zhang. Deep Learning with Differential Privacy. 2016

[7]. Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, Ameet Talwalkar. Federated Multi-Task Learning. 2016

[8]. Yang Liu, Tianjian Chen, Qiang Yang. Secure Federated Transfer Learning. 2018

[9]. Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, Domonkos Tikk. Session-based Recommendations with Recurrent Neural Networks. 2016

[10]. En.wikipedia.org/wiki/Secret…

[11]. Payman Mohassel and Peter Rindal. ABY3: A Mixed Protocol Framework for Machine Learning. In Proc. of CCS 2018