GitHub 21.5K Star: the “Java Engineers Becoming Gods” roadmap, really worth a look!
Recently, in addition to writing business code, I have also been working on privacy-preserving computation, and now I have some time to wrap things up. Since privacy computing may be unfamiliar to many engineers, **this article is mainly a primer: its goal is to let you know this technology exists and what its main directions are today.** There will not be much algorithmic content or underlying theory, so it should be easy to follow.
With the rapid development of information technologies such as the mobile Internet, cloud computing and the Internet of Things, the world has entered the “big data era” of exploding data. Data plays a vital role in every industry, and more and more scenarios require data to flow and be shared among multiple parties. For example, the financial department I work in needs to combine external financial data with our own business data for joint modeling, enabling joint risk control, digital marketing, intelligent anti-fraud, precise customer acquisition, and so on.
Therefore, standing at this historical node, in terms of data cooperation and sharing, there are several important issues to be solved:
1. The “data island” phenomenon is widespread;
2. Data circulation carries high security risks;
3. Data compliance regulation is increasingly strict;
4. Privacy leaks have created a trust gap.
Among these, with the Personal Information Protection Law coming into force in November 2021, the regulatory issue has become the most urgent to solve.

In recent years, with the introduction of the E-Commerce Law, the Data Security Law, the Personal Information Protection Law and other laws and regulations, we have had no choice but to take personal privacy seriously.

Although personal information protection keeps getting stricter, many of these laws and regulations can also be read as showing that, in the broad picture, the authorities are relatively supportive of compliant data use and development.
So how to solve these problems?
In fact, we can find an entry point in the Personal Information Protection Law itself. The Law defines personal information as follows:

> All kinds of information, recorded electronically or by other means, relating to identified or identifiable natural persons, excluding information that has been anonymized.
Therefore, if we can de-identify and anonymize personal information, then we are free to use it.
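As a concrete illustration, here is a minimal sketch of one common de-identification step: replacing a direct identifier with a salted hash. The field names and salt handling below are my own illustrative assumptions, not a production scheme, and note that salted hashing alone is pseudonymization rather than the full anonymization the law describes:

```python
import hashlib
import secrets

# Per-dataset salt; in practice this would live in a secrets manager.
SALT = secrets.token_bytes(16)

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier (phone number, ID card) with a salted hash."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()

record = {"phone": "13800138000", "age_band": "30-39", "city": "Hangzhou"}
safe_record = {**record, "phone": pseudonymize(record["phone"])}
print(safe_record)  # the phone field is now an opaque 64-char digest
```

The same input always maps to the same digest under one salt, so datasets can still be joined on the pseudonymized key, which matters for the sample alignment discussed later.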
As a result, a family of technologies has emerged to help organizations collaborate on data across enterprises while keeping the data “usable but invisible”. These technologies are collectively known as privacy computing.
After years of development, privacy-preserving computation has three main implementation directions:

1. Secure Multi-Party Computation (MPC), based on cryptography
2. Trusted Execution Environments (TEE), based on trusted hardware
3. Federated Learning (FL), based on a hybrid of technical solutions
The common idea behind all of them is to achieve de-identification and anonymization by making raw user data unidentifiable.

Secure multi-party computation is based mainly on cryptography: the data is encrypted, so that algorithm modeling can be done directly on the encrypted data.

A trusted execution environment, by contrast, is hardware-based: data is loaded into a piece of hardware and used only inside it, and cannot be read directly from outside.
The third technology is a mix of the above approaches: federated learning.

Because federated learning does not depend on special hardware, it has the advantage of supporting complex algorithmic modeling; its efficiency still lags behind some other schemes, but as the technology develops, the questions of how to break through performance bottlenecks, how to balance practicality and security, and how to further strengthen security are gradually being answered. This is why federated learning is regarded as “the last kilometer of artificial intelligence” and “the foundation of the next generation of collaborative AI algorithms and collaborative networks”.
Federated learning
Federated Learning is an emerging AI technique first proposed by Google in 2016, originally to let Android users update models locally on their phones. Its design goal is to carry out efficient machine learning among multiple participants or computing nodes while ensuring the security of big-data exchange, protecting the privacy of terminal and personal data, and remaining legally compliant.
We call each enterprise involved in joint modeling a participant, and according to how the data is distributed among the participants, federated learning is divided into three categories: horizontal federated learning, vertical federated learning, and federated transfer learning.
The essence of horizontal federated learning is combining samples. It suits scenarios where participants run the same kind of business but serve different customers, i.e. features overlap heavily while users overlap little. For example, banks in different regions offer similar products (similar features) to different users (different samples). It mainly solves the problem of insufficient samples.

The essence of vertical federated learning is combining features. It suits scenarios where users overlap heavily but features overlap little. For example, a supermarket chain and a bank in the same region both serve local residents (same samples) but run different businesses (different features). It mainly solves the problem of insufficient features.
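The difference between the two settings can be sketched with toy data (all IDs, field names, and values below are made up): horizontal federation unions rows over a shared schema, while vertical federation joins columns over the shared user IDs:

```python
# Horizontal FL: parties share the feature schema but hold different users.
bank_a = {"u1": {"income": 52000, "debt": 3000},
          "u2": {"income": 71000, "debt": 12000}}
bank_b = {"u3": {"income": 38000, "debt": 800}}
horizontal = {**bank_a, **bank_b}          # more rows, same columns

# Vertical FL: parties share users but hold different features.
ecommerce = {"u1": {"orders": 14}, "u2": {"orders": 3}, "u9": {"orders": 7}}
bureau    = {"u1": {"score": 680}, "u2": {"score": 540}, "u5": {"score": 700}}
shared_ids = ecommerce.keys() & bureau.keys()
vertical = {uid: {**ecommerce[uid], **bureau[uid]} for uid in shared_ids}
print(sorted(shared_ids))  # only the overlapping users are modeled jointly
```

In real federated learning this intersection is of course computed under encryption (private set intersection), so neither side learns the other's non-overlapping users; the plaintext join here is only to show the shape of the data.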
Since our business is mainly financial, our privacy-computing scenario uses federated learning with external banks and institutions for joint risk control, detecting multi-platform borrowing, and other financial use cases. Essentially, we want to combine our users’ e-commerce data with external institutions’ credit and credit-bureau data through vertical federated learning.
The main flow of vertical federated learning is as follows:
The process involves two participants A and B plus a coordinating third party C, and runs in two phases.

Phase 1: encrypted sample alignment. This happens at the system level, so users outside the intersection are never exposed to either enterprise.

Phase 2: encrypted model training:

1. C sends its public key to A and B, which they use to encrypt the data to be exchanged;
2. A and B each compute the intermediate results for their own features and exchange them in encrypted form, obtaining their respective gradients and losses;
3. A and B each compute their encrypted gradient, add a random mask, and send it to C; B also computes the encrypted loss and sends it to C;
4. C decrypts the gradients and loss and sends them back to A and B, which remove their masks and update their models.
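A heavily simplified sketch of one training round under the flow above, with toy gradient values and the homomorphic encryption elided; it keeps only the masking step, which is what prevents coordinator C from ever seeing a party's true gradient:

```python
import random

random.seed(0)

# Steps 1-2 (simplified): each party computes the gradient slice for its own
# features; the toy values below stand in for those intermediate results.
grad_a = [0.12, -0.05]   # party A's gradient over A's features
grad_b = [0.30]          # party B's gradient over B's features

# Step 3: before sending anything to coordinator C, each party adds a random
# mask, so C never learns the true gradient even after decrypting.
mask_a = [random.uniform(-1, 1) for _ in grad_a]
mask_b = [random.uniform(-1, 1) for _ in grad_b]
to_c_a = [g + m for g, m in zip(grad_a, mask_a)]
to_c_b = [g + m for g, m in zip(grad_b, mask_b)]

# Step 4: C "decrypts" and returns the masked values; each party removes its
# own mask and takes a gradient step on its local slice of the model.
lr = 0.1
weights_a = [0.0 - lr * (v - m) for v, m in zip(to_c_a, mask_a)]
weights_b = [0.0 - lr * (v - m) for v, m in zip(to_c_b, mask_b)]
print(weights_a, weights_b)
```

Note that in the real protocol the masked values sent to C are also ciphertexts under C's public key; the point of the sketch is just that mask-then-unmask leaves each party's local update mathematically unchanged.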
FATE framework
At present, many large companies are investing in federated learning; Alibaba, Ant Group, ByteDance, Tencent and others all have successful cases.

During this round of research into federated learning, different colleagues investigated different frameworks; I was mainly responsible for evaluating the open-source ones.

Because the most important thing federated learning must solve is data security, many external organizations seem more willing to accept open-source frameworks. There are quite a few on the market, such as WeBank’s open-source FATE, ByteDance’s open-source FedLearner, and Baidu’s open-source PaddleFL. Of these, FATE is the most widely used and is considered a model implementation of federated learning.
FATE (Federated AI Technology Enabler) is an open-source project initiated by the AI department of WeBank, providing a reliable secure computing framework for the federated learning ecosystem. The FATE project uses secure multi-party computation (MPC) and homomorphic encryption (HE) to build its underlying secure computation protocols, supporting secure computation for different types of machine learning, including logistic regression, tree-based algorithms, deep learning, and transfer learning.
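To see why homomorphic encryption is useful here, the following is a textbook Paillier sketch with deliberately tiny, insecure parameters, written from scratch for illustration (it is not FATE's actual implementation): multiplying two ciphertexts decrypts to the sum of the plaintexts, so an aggregator can add gradients without ever seeing them.

```python
import math
import random

# Tiny, insecure parameters purely for illustration; real deployments use
# primes of ~1024 bits or more.
p, q = 999_983, 1_000_003
n = p * q
n2 = n * n
g = n + 1                        # standard generator choice
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)             # valid shortcut because g = n + 1

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:   # r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return (((x - 1) // n) * mu) % n

c1, c2 = encrypt(17), encrypt(25)
print(decrypt((c1 * c2) % n2))   # 42: the sum, computed under encryption
```

The random factor `r` means encrypting the same value twice yields different ciphertexts, which is what keeps repeated gradient values from being linkable across rounds.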
FATE has four deployment modes: docker-compose-based deployment, standalone deployment, native cluster deployment, and KubeFATE-based deployment.

- **docker-compose**: for a quick taste of FATE; models and data run on a single machine and deployment is easy.
- **Standalone**: when you just want to develop algorithms and the development machine is not very powerful.
- **KubeFATE-based**: when large datasets and models mean FATE must scale out and a FATE cluster has to be maintained, deployment into a Kubernetes cluster via KubeFATE is the natural choice.
- **Native cluster**: usually chosen for special reasons, such as being unable to run Kubernetes internally, or needing to do secondary development on the FATE deployment itself.

For quick verification, we mainly adopted the docker-compose-based and KubeFATE-based modes. The deployment process still ran into quite a few problems.
The deployment steps for these two modes, and how we resolved the problems we hit, are not the focus of this article; I have written them up separately on my blog, which you are welcome to read if interested.
Here is a deployment architecture diagram for FATE:
Too many details to go into here.
Based on FATE, we built a federated learning environment in cooperation with an external institution: Alibaba as one party and the external institution as the other, jointly modeling roughly 100,000 records.
The end result is pretty much what we expected, and the performance cost of federated learning versus local modeling is negligible.
Other
The above is a summary of my research and practice on privacy computing & federated learning during this period.
On the one hand, it was needed for our work; on the other, it is worth learning about new technologies, especially those that matter both now and for the future.

Just like the signature I set on Alibaba’s intranet: no limits.

I have only scratched the surface of this area, and much of the content above is based on my own understanding. If there are any mistakes in the article, please help point them out. Friends with relevant experience are also very welcome to get in touch and exchange ideas.
About the author: Hollis, a person with a unique pursuit of coding, a technical expert at Alibaba, co-author of “Three Courses for Programmers”, and author of the series “Java Engineers Becoming Gods”.

Follow the public account [Hollis] and reply “into god map” in the background to download the advanced mind map for Java engineers.