By Davide Testuggine and Ilya Mironov, Applied Research Scientists at Facebook AI
Original link: https://ai.facebook.com/blog/…
Opacus is a library for training PyTorch models with differential privacy. It supports training with minimal code changes on the client side, has little impact on training performance, and lets the client track the privacy budget being spent online at any given moment.
The library is aimed at two target audiences:
ML practitioners will find it a gentle introduction to training a model with differential privacy, as it requires minimal code changes.
Differential privacy researchers will find it easy to experiment and tinker with, allowing them to focus on what matters.
Opacus is a new high-speed library for training PyTorch models with differential privacy (DP) that is more scalable than existing state-of-the-art approaches. Differential privacy is a mathematically rigorous framework for quantifying the anonymization of sensitive data. It is commonly used in analytics, and interest in it is growing in the machine learning (ML) community. With the release of Opacus, we hope to provide an easier path for researchers and engineers to adopt differential privacy in ML, and to accelerate DP research in the field.
Opacus offers:
- Speed: By using Autograd hooks in PyTorch, Opacus computes per-sample gradients for an entire batch at once, which can be an order of magnitude faster than existing DP libraries that rely on microbatching.
- Security: Opacus uses a cryptographically secure pseudo-random number generator for its security-critical code, and processes an entire batch of parameters at high speed on the GPU.
- Flexibility: Thanks to PyTorch, engineers and researchers can quickly prototype their ideas by mixing and matching our code with PyTorch code and pure Python code.
- Productivity: Opacus comes with tutorials, helper functions that warn about incompatible layers before training even starts, and automatic refactoring mechanisms.
- Interactivity: Opacus keeps track of how much of your privacy budget (a core mathematical concept in DP) you are spending at any given point in time, enabling early stopping and real-time monitoring.
Opacus defines a lightweight API by introducing the PrivacyEngine abstraction, which takes care of both tracking your privacy budget and working on your model's gradients. You don't need to call it directly for it to run; it attaches to a standard PyTorch optimizer and works in the background, making training with Opacus as easy as adding the following lines to the beginning of your training code:
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
    privacy_engine = PrivacyEngine(
        model,
        batch_size=32,
        sample_size=len(train_loader.dataset),
        alphas=range(2, 32),
        noise_multiplier=1.3,
        max_grad_norm=1.0,
    )
    privacy_engine.attach(optimizer)

That's it! The rest of the training proceeds as usual.
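For context, the sketch below shows roughly what the surrounding training loop could look like. The model, criterion, and train_loader objects, as well as the epochs and delta values, are placeholders standing in for your existing training code, and the privacy budget query uses the get_privacy_spent method of the 0.x-era PrivacyEngine; treat this as an illustration rather than a canonical recipe.

```python
# Illustrative training loop; model, criterion, and train_loader come from
# your existing code, and epochs/delta are example values.
epochs, delta = 10, 1e-5

for epoch in range(epochs):
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()  # the attached PrivacyEngine clips per-sample gradients and adds noise here

    # Optionally check how much privacy budget has been spent so far
    # (get_privacy_spent is the 0.x-era API; newer Opacus versions differ)
    epsilon, best_alpha = privacy_engine.get_privacy_spent(delta)
    print(f"Epoch {epoch}: ε = {epsilon:.2f} at δ = {delta} (α = {best_alpha})")
```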
After training, the resulting artifact is a standard PyTorch model, with no extra steps or hurdles to deploying a private model: if you can deploy a model today, you can deploy it after training it with DP without changing a single line of code.
The Opacus library also includes pre-trained and fine-tuned models, tutorials for large models, and infrastructure designed for privacy research experiments.
Using Opacus for high-speed, privacy-preserving training
Our goal with Opacus is to preserve the privacy of each training sample while limiting the impact on the accuracy of the final model.
Opacus does this by modifying a standard PyTorch optimizer in order to enforce (and measure) DP during training.
More specifically, our approach centers on differentially private stochastic gradient descent (DP-SGD).
The core idea behind this algorithm is that we can protect the privacy of a training dataset by intervening on the parameter gradients the model uses to update its weights, rather than on the data directly. By adding noise to the gradients at every iteration, we prevent the model from memorizing its training examples while still enabling learning in aggregate. The (unbiased) noise naturally tends to cancel out over the many batches seen during training.
Adding noise, however, requires a delicate balance: too much noise destroys the signal, too little cannot guarantee privacy. To determine the right scale, we look at the norm of the gradients. It is important to bound each sample's contribution to the gradients, because outliers have larger gradients than most samples. We need these outliers to remain private, especially since they are the ones the model is most likely to memorize. To do this, we compute the gradient for each sample in a minibatch, clip the gradients individually, accumulate them back into a single gradient tensor, and then add noise to the sum.
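To make the clip-then-noise recipe concrete, here is a deliberately naive sketch of a single DP-SGD step written in plain PyTorch. It loops over the batch one sample at a time, which is exactly the slow microbatch-style approach discussed below; the dp_sgd_step function and its arguments are illustrative and are not Opacus API.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                max_grad_norm=1.0, noise_multiplier=1.3):
    # Conceptual (and slow) DP-SGD step: clip each sample's gradient, sum them,
    # add Gaussian noise, then take an ordinary optimizer step.
    params = [p for p in model.parameters() if p.requires_grad]
    clipped_sum = [torch.zeros_like(p) for p in params]

    # 1) Compute and clip the gradient of every sample separately.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.norm() ** 2 for g in grads))
        clip_coef = (max_grad_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for acc, g in zip(clipped_sum, grads):
            acc.add_(g * clip_coef)

    # 2) Add noise to the clipped sum and average over the batch.
    for p, acc in zip(params, clipped_sum):
        noise = torch.randn_like(acc) * noise_multiplier * max_grad_norm
        p.grad = (acc + noise) / len(batch_x)

    optimizer.step()
```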
This per-sample computation was one of the biggest hurdles in building Opacus. It is more challenging than the typical PyTorch operation, in which Autograd computes the gradient tensor for the entire batch, since that is what makes sense for every other ML use case and it optimizes performance. To overcome this, we use an efficient technique that, when training a standard neural network, yields all of the required per-sample gradients: for the model parameters, we separately return the gradient of the loss for each example in a given batch.
Here is a diagram of the Opacus workflow in which we calculate the gradient for each sample.
By tracking some intermediate quantities as we run through the layers, we can train with any batch size that fits in memory, making our approach an order of magnitude faster than the alternative microbatch methods used in other packages.
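As a small illustration of what such intermediate quantities can buy you, the snippet below shows the well-known trick for a single linear layer: keep the layer's input activations from the forward pass and the gradient with respect to its output from the backward pass, and each sample's weight gradient is just an outer product of the two. This is a hand-written toy example of the general idea, not Opacus's actual hook machinery.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, d_in, d_out = 4, 3, 2
layer = nn.Linear(d_in, d_out)

x = torch.randn(batch_size, d_in)     # per-sample inputs (activations)
out = layer(x)
loss = out.pow(2).sum()               # any scalar loss

# Gradient of the loss w.r.t. the layer output, one row per sample.
grad_out = torch.autograd.grad(loss, out, retain_graph=True)[0]  # (batch, d_out)

# Per-sample weight gradients: one outer product per sample, built only from
# quantities that an ordinary forward/backward pass already produces.
per_sample_grad_w = torch.einsum('bo,bi->boi', grad_out, x)      # (batch, d_out, d_in)

# Sanity check: summing over the batch recovers the usual batched gradient.
loss.backward()
assert torch.allclose(per_sample_grad_w.sum(dim=0), layer.weight.grad, atol=1e-5)
```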
The importance of privacy-preserving machine learning
The security community encourages developers of security-critical code to use a small number of carefully vetted, professionally maintained libraries. This "don't roll your own crypto" principle helps minimize the attack surface by letting application developers focus on what they know best: building great products. As ML adoption and research continue to accelerate, it is important for ML researchers to have easy-to-use tools that provide mathematically rigorous privacy guarantees without slowing down the training process.
By developing PyTorch tools such as Opacus, we hope to democratize access to such privacy-preserving resources. We are bridging the divide between the security community and everyday ML engineers by building on PyTorch's faster, more flexible platform.
Building community
Over the past few years, the privacy-preserving machine learning (PPML) community has grown rapidly. We are excited about the ecosystem that has formed around Opacus. One of our key contributors is OpenMined, a community of thousands of developers building privacy-centric applications. OpenMined uses a number of PyTorch building blocks as the foundation of PySyft and PyGrid, which enable differential privacy and federated learning. As part of this collaboration, Opacus will become a dependency of OpenMined libraries such as PySyft. We look forward to continuing the collaboration and further growing the community.
Opacus is part of a broader effort by Facebook AI to advance secure computing techniques for machine learning and responsible artificial intelligence. Overall, it is an important stepping stone toward building privacy-first systems in the future.
- To help you gain a deeper understanding of the concepts behind differential privacy, we are creating a series of Medium posts dedicated to differentially private machine learning. The first part looks at the key foundational concepts. Read the PyTorch Medium blog here.
- We also offer comprehensive tutorials and the Opacus open source library here.
Source address: https://github.com/pytorch/op…