1. Background

1.1 Privacy Disclosure Scenarios

As users become more aware of privacy protection and national laws and regulations such as the Data Security Law and the Personal Information Protection Law take effect, ensuring users' privacy and meeting regulatory requirements while collecting and using user data has become a challenging problem. In the day-to-day business of Internet companies, common privacy disclosure scenarios include the following:

(1) Statistical queries: the results of statistical queries over user data are returned directly to customers (for example, in customer insight services), and an attacker may recover information about an individual from the statistics through a differencing attack.

  • For example, suppose an Internet company provides a customer-profiling service to external customers. A customer queries the residence distribution of group A and of group B, where group B contains exactly one more user (user a) than group A. If the second result shows one more person living in Nanjing's Gulou district than the first, the customer can infer that user a lives in Gulou, disclosing a's private information (a minimal sketch of this attack follows this list).

(2) User data collection: mobile apps and devices typically collect a variety of user information (such as geographic location and health status) to improve service quality and user experience. However, collecting the raw data directly may disclose users' privacy, and such collection is also strictly restricted by laws and regulations.

  • For example, if user A visits a specialized hospital, the company could infer from A's collected geographic location that A suffers from a certain disease, disclosing A's privacy.
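
To make the differencing attack in scenario (1) concrete, the following is a minimal sketch with hypothetical data; the group contents and district name are illustrative only:

```python
# Hypothetical user records: (user_id, district). Group B is group A plus one extra user "a".
group_a = [("u1", "Gulou"), ("u2", "Xuanwu"), ("u3", "Gulou")]
group_b = group_a + [("a", "Gulou")]

def count_in_district(records, district):
    """Aggregate statistic returned to the customer: how many users live in `district`."""
    return sum(1 for _, d in records if d == district)

# Two seemingly harmless aggregate queries...
q1 = count_in_district(group_a, "Gulou")  # 2
q2 = count_in_district(group_b, "Gulou")  # 3

# ...whose difference reveals an individual fact about user "a".
if q2 - q1 == 1:
    print("User 'a' lives in the Gulou district")
```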

Therefore, building high-quality privacy protection services that address privacy disclosure in statistical queries, data collection, and similar scenarios, while keeping the data usable, has become important work for Internet companies, both to meet regulatory requirements and to enable business.

1.2 De-identification and Differential Privacy

Traditional privacy protection methods usually remove identifier information from user records (such as name, ID number, and device ID), or generalize and coarsen quasi-identifiers (such as street and zip code) using anonymization techniques such as k-anonymity and l-diversity, so that an attacker cannot directly or indirectly re-associate the processed data with specific users. However, the security of these traditional methods depends heavily on the background knowledge the attacker possesses, and their level of privacy protection is difficult to quantify. For example, in the query scenario above, traditional anonymization cannot achieve the expected effect because the attacker already has the background knowledge of whether user a is in the query range.

To address these problems, Differential Privacy (DP) [1] was proposed. It provides a rigorous, provable privacy guarantee that is independent of the attacker's background knowledge, and for this reason it has been widely recognized and applied in both academia and industry. The general definition of ε-differential privacy is as follows:

Given any two neighboring datasets D and D′ that differ in at most one record, a randomized algorithm M is said to provide ε-differential privacy (ε-DP) if, for every subset S of the possible outputs of M,

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S],

where the parameter ε is called the privacy budget. The strength of the protection can be controlled by adjusting ε: the smaller ε is, the less adding or deleting a single record can affect the result, so the stronger the privacy protection and the lower the utility of the computed result, and vice versa. Therefore, in practice, setting a reasonable ε according to the scenario and requirements, so as to balance privacy protection against data utility, is one of the key issues in applying differential privacy technology.
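
To illustrate how ε controls the noise in practice, here is a minimal sketch of the Laplace mechanism from [1] applied to a counting query; the count and ε values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Return a noisy answer satisfying epsilon-DP for a query with the given L1 sensitivity."""
    scale = sensitivity / epsilon  # Laplace scale b = sensitivity / epsilon
    return true_value + rng.laplace(0.0, scale)

# A COUNT query changes by at most 1 when a single record is added or removed,
# so its sensitivity is 1.
true_count = 128
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: noisy count = {laplace_mechanism(true_count, 1, eps):.1f}")
```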

1.3 Protection Services Based on Differential Privacy

To address privacy leakage in the statistical query and user data collection scenarios, the Volcano Engine security research team, building on differential privacy technology and the self-developed Jeddak data security and privacy computing platform, developed the DPSQL service (Differentially Private SQL Query Service) and the LDPDC service (Locally Differentially Private Data Collection Service). The two services protect user privacy during querying and collection while keeping the data highly usable. They are described in turn below.

2. DPSQL Query Protection Service

DPSQL adopts the Centralized Differential Privacy (CDP) [1] model: deployed as middleware, it receives SQL statistical query requests and returns query results that satisfy differential privacy. Because query requests in real-world scenarios are highly diverse, building the DPSQL service faces the following key challenges:

  1. How to support the query dialects of different types of databases, so as to reduce adoption cost and preserve the customer's query experience?
  2. How to compute appropriate differential privacy noise for complex SQL statements while guaranteeing both the privacy protection effect and data utility?

The following sections describe how DPSQL addresses these challenges from the perspectives of service architecture and key design, and briefly introduce its deployment in practice.

2.1 Service Architecture

The DPSQL service consists of three components:

  1. DPSQL core service: takes the original SQL statistical query as input and outputs results that satisfy differential privacy; it includes modules for SQL parsing and rewriting, differential privacy noise addition, and others;
  2. Metadata management service: maintains database metadata and data table attributes, which are used for sensitivity analysis of table columns;
  3. Privacy budget management service: maintains the privacy budget allocation and consumption records of each data table, and provides remaining-budget queries, reporting, and auditing so that query requests can be kept under privacy control.

A typical query request processing flow is as follows:

  • First, the core service accepts the SQL query submitted by the customer and parses and rewrites the statement into a form for which privacy noise is easier to calibrate (for example, rewriting AVG as SUM/COUNT).
  • Then, the core service invokes the metadata management service to calculate the sensitivity of the rewritten query over the corresponding data table, and executes the rewritten query against the database to obtain the raw result.
  • Finally, the core service invokes the privacy budget management service to obtain the privacy budget allocated to the query, adds noise calibrated to the sensitivity and budget to the raw result, and returns the noisy result; a simplified end-to-end sketch follows this list.
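
The sketch below walks through this flow for a single COUNT query; `run_sql` is a hypothetical stand-in for the database client, and the fixed ε stands in for the budget returned by the privacy budget management service:

```python
import numpy as np

rng = np.random.default_rng()

def handle_count_query(run_sql, sql, epsilon):
    """Simplified DPSQL-style flow: rewrite -> sensitivity -> execute -> add noise."""
    # 1. Parse/rewrite the query (a plain COUNT needs no rewriting).
    rewritten_sql = sql
    # 2. Sensitivity of COUNT when each user contributes at most one row.
    sensitivity = 1
    # 3. Execute the rewritten query to obtain the raw result.
    raw_result = run_sql(rewritten_sql)
    # 4. Add Laplace noise calibrated to sensitivity and budget, then return it.
    return raw_result + rng.laplace(0.0, sensitivity / epsilon)

# Toy "database": pretend the query matches 500 users.
noisy = handle_count_query(lambda sql: 500,
                           "SELECT COUNT(*) FROM users WHERE city = 'Nanjing'",
                           epsilon=1.0)
print(noisy)
```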

2.2 Key Design

To address the challenges of SQL dialect compatibility and noise calculation described above, the team implemented a multi-source heterogeneous SQL parsing and rewriting mechanism and an adaptive differential privacy noising mechanism in DPSQL.

2.2.1 SQL Parsing and Rewriting Mechanism for Multi-source Heterogeneous Databases

  • A flexible and extensible SQL parsing mechanism (parser) supports a variety of SQL dialects, so that issuing a query feels no different from querying a traditional database.
  • A customized SQL rewriting mechanism (rewriter) supports multiple syntax features, such as aggregate functions, multi-level subqueries, JOIN, and GROUP BY; an AVG-to-SUM/COUNT rewrite is sketched after this list.
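
As an illustration only (not DPSQL's actual implementation), the open-source sqlglot library can parse a query written in one dialect, rewrite AVG into SUM/COUNT, and emit it in the dialect of the target database:

```python
import sqlglot
from sqlglot import exp

sql = "SELECT AVG(salary) FROM employees WHERE dept = 'sales'"
tree = sqlglot.parse_one(sql, read="mysql")

# Rewrite every AVG(x) into SUM(x) / COUNT(x) so that noise can later be
# calibrated separately for the numerator and the denominator.
for avg in list(tree.find_all(exp.Avg)):
    arg = avg.this
    avg.replace(exp.Div(this=exp.Sum(this=arg.copy()),
                        expression=exp.Count(this=arg.copy())))

# Emit the rewritten query in the dialect of the target database.
print(tree.sql(dialect="hive"))
# e.g. SELECT SUM(salary) / COUNT(salary) FROM employees WHERE dept = 'sales'
```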

2.2.2 Adaptive differential privacy noising mechanism

  • Based on the types of aggregate functions contained in the SQL query, the privacy budget is allocated adaptively across the query to reduce overall budget consumption.
  • Based on the aggregate function type, the sensitivity of each aggregate is analyzed efficiently even in multi-table join queries, multi-level subqueries, and other complex scenarios, and an appropriate differential privacy noising algorithm is selected, improving both service performance and the utility of query results; a small sketch of per-aggregate budget and noise allocation follows this list.
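
For intuition, the sketch below splits a query's budget evenly across its aggregates and calibrates Laplace noise per aggregate; the sensitivity values and the even split are simplifying assumptions, not DPSQL's actual allocation policy:

```python
import numpy as np

rng = np.random.default_rng()

# Assumed per-aggregate L1 sensitivities: COUNT changes by 1 per user; SUM is
# bounded by the maximum contribution of a single user (assumed to be 100 here).
SENSITIVITY = {"COUNT": 1.0, "SUM": 100.0}

def noisy_aggregates(raw_results, total_epsilon):
    """Split the query's budget evenly across aggregates and noise each one."""
    eps_each = total_epsilon / len(raw_results)
    return {
        agg: value + rng.laplace(0.0, SENSITIVITY[agg] / eps_each)
        for agg, value in raw_results.items()
    }

# Example: a query computing both SUM(spend) and COUNT(*) under total epsilon = 1.
print(noisy_aggregates({"SUM": 52000.0, "COUNT": 480.0}, total_epsilon=1.0))
```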

2.3 Deployment in Practice

The DPSQL service has been integrated into Volcano Engine's customer data platform, providing privacy-preserving customer group insight services for clients in banking, automotive, retail, and other industries. For its performance in privacy protection and business compliance, the DPSQL service was selected for the "Top 10 Outstanding Application Cases of Privacy Computing 2021" released by OpenMPC, the first open privacy computing community in China.

3. LDPDC Collection Protection Service

The LDPDC service is built on Local Differential Privacy (LDP) [2]. It provides an on-device LDP-SDK that perturbs data on the user's device before it leaves, together with a server-side computing service that aggregates and analyzes the data collected via the LDP-SDK. LDPDC likewise faces the following challenges:

  1. How to reduce communication cost while meeting users’ personalized privacy protection requirements?
  2. How to reduce the noise in the collected data and improve its utility for downstream analysis tasks?

As before, LDPDC's responses are described from the perspectives of service architecture and key design, followed by a brief introduction to its deployment in practice.

3.1 Service Architecture

The LDPDC service consists of two modules:

  1. Client: a built-in LDP-SDK with a personalized perturbation mechanism that accepts the user's privacy protection preference and perturbs the user's data accordingly, providing differential privacy protection on the device;
  2. Server: collects the perturbed data transmitted by clients and provides a customized denoising and aggregation mechanism that reduces noise while aggregating the data, improving its utility. The processed data can then be used by recommendation, statistical query, machine learning, and other data analysis services.

3.2 Key Design

To address the challenges of on-device perturbation and server-side denoising, LDPDC provides a personalized perturbation mechanism and a customized denoising and aggregation mechanism.

3.2.1 Personalized Perturbation Mechanism

  • Users can choose among privacy protection strength levels (low, medium, and high) to meet their personal privacy protection requirements.
  • Efficient encoding and interaction mechanisms, such as GRR (Generalized Randomized Response) and OLH (Optimized Local Hashing), reduce the amount of data transmitted and the number of interactions between clients and servers, lowering communication cost; a minimal GRR sketch follows this list.
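
For illustration, here is a minimal GRR sketch for a categorical attribute with k possible values; the value domain is hypothetical, and the mapping from the low/medium/high settings to ε is an assumption rather than LDPDC's actual configuration:

```python
import math
import random

# Hypothetical mapping from the user-facing strength setting to a privacy budget.
STRENGTH_TO_EPSILON = {"high": 0.5, "medium": 1.0, "low": 2.0}

def grr_perturb(true_value, domain, epsilon, rng=random):
    """Generalized Randomized Response: report the true value with probability p,
    otherwise report one of the other k-1 values uniformly at random."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return true_value
    others = [v for v in domain if v != true_value]
    return rng.choice(others)

domain = ["Gulou", "Xuanwu", "Qinhuai", "Jianye"]
eps = STRENGTH_TO_EPSILON["medium"]
print(grr_perturb("Gulou", domain, eps))  # locally differentially private report
```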

3.2.2 Customized Denoising and Aggregation Mechanism

  • Customized denoising and aggregation procedures are applied to different types of personal data to make efficient use of the collected data.
  • An unbiased estimation mechanism ensures that, in expectation, the aggregated statistics equal the statistics of the real data.
  • A consistency post-processing mechanism makes the aggregated statistics consistent with public background knowledge, for example by clipping negative frequency estimates to 0; the sketch after this list shows unbiased estimation and this clipping step for GRR reports.
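
Continuing the GRR sketch above, the server side can estimate value frequencies with the standard unbiased GRR estimator and then clip negative estimates for consistency; for simplicity this assumes all users report under the same ε:

```python
import math
from collections import Counter

def grr_frequency_estimates(reports, domain, epsilon):
    """Unbiased frequency estimation for GRR reports, followed by consistency clipping."""
    n = len(reports)
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)  # prob. of reporting truthfully
    q = 1.0 / (math.exp(epsilon) + k - 1)                # prob. of reporting a given other value
    counts = Counter(reports)
    estimates = {}
    for v in domain:
        observed = counts.get(v, 0) / n
        unbiased = (observed - q) / (p - q)  # in expectation equals the true frequency
        estimates[v] = max(unbiased, 0.0)    # consistency: frequencies cannot be negative
    return estimates

reports = ["Gulou", "Xuanwu", "Gulou", "Qinhuai", "Gulou", "Jianye"]  # perturbed reports
print(grr_frequency_estimates(reports, ["Gulou", "Xuanwu", "Qinhuai", "Jianye"], epsilon=1.0))
```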

3.3 Deployment in Practice

The LDPDC service is beginning to be applied in scenarios such as geographic location collection, helping business teams govern user information collection in a compliant way and providing strategy support for advertising and recommendation services.

4. Conclusion

The DPSQL and LDPDC services are successful applications of differential privacy technology in Volcano Engine scenarios. In the future, differential privacy related services will appear in the Volcano Engine cloud security product matrix for Volcano Engine cloud customers. The Volcano Engine security research team will continue to explore business scenarios, dig into users' data privacy protection needs, and study and deploy cutting-edge privacy protection technology, providing a strong guarantee for user data privacy and security.

References

[1] Dwork C., McSherry F., Nissim K., et al. Calibrating Noise to Sensitivity in Private Data Analysis. In: Theory of Cryptography, Third Theory of Cryptography Conference (TCC 2006), New York, NY, USA, March 4-7, 2006: 265–284.

[2] Kasiviswanathan S.P., Lee H.K., Nissim K., et al. What Can We Learn Privately? In: 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2008), Philadelphia, PA, USA, October 25-28, 2008: 531–540.