Abstract: How to strengthen the technical level of data security and privacy protection, the data warehouse products themselves put forward more functional requirements, is also the most effective way to build data security.

This article is shared from huawei cloud community “GaussDB(DWS) Security: A Real Way to protect privacy — Data desensitization”, original author: WO Dada.

The introduction

The arrival of the era of big data has subverted the operation mode of traditional industries and stimulated new production potential. Data becomes an important factor of production and the carrier of information, and the flow between data also hides the value information of higher dimensions. For data controllers and data processors, how to maximize the value of data flow is the original intention and significance of data mining. However, the exposure of a series of information leaks makes data security more and more widely concerned.

Countries and regions have gradually established and improved laws and regulations related to data security and privacy protection to provide legal guarantee for user privacy protection. How to strengthen the technical level of data security and privacy protection, put forward more functional requirements for the data warehouse product itself, is also the most effective way of data security construction.

What is data desensitization?

Data Masking, as its name implies, shields sensitive Data and deforms some sensitive information (such as id card number, mobile phone number, card number, customer name, customer address, email address, salary, etc.) through desensitization rules to achieve reliable protection of private Data. Common desensitization rules include replace, rearrange, encrypt, truncate, and mask. Users can also customize desensitization rules based on the desired desensitization algorithm.

Generally, good data desensitization implementation, need to follow the following two principles, first, as far as possible for the application of desensitization, retain meaningful information before desensitization; Second, prevent hackers from cracking to the greatest extent possible.

Data desensitization is divided into static data desensitization and dynamic data desensitization. Static data desensitization is the “relocation and simulation replacement” of data, which is to extract data for desensitization and send it down to the downstream link for free access and reading and writing. After desensitization, data is isolated from the production environment to meet business needs and ensure the security of production database. Dynamic data desensitization, access sensitive data at the same time real-time desensitization processing, can be different roles, different permissions, different data types to perform different desensitization schemes, so as to ensure the availability and security of returned data.

The data desensitization function of GaussDB (DWS) provides a complete, secure, flexible, transparent, and friendly data desensitization solution for dynamic data desensitization, eliminating pain points such as high dependence and high cost of desensitization at the service application layer and internalizing data desensitization as the security capability of database products. After the user identifies the sensitive field, the desensitization policy can be created by binding built-in desensitization function based on the target field. The desensitization Policy is one-to-one with the table object. A desensitization strategy contains three key elements: table object, effective conditions, and desensitization column – desensitization function pair. It is a collection of all desensitization columns on the table object. Different fields can adopt different desensitization functions according to data characteristics. The query statement will trigger the desensitization of sensitive data only when the validity condition is true. The desensitization process is implemented inside the SQL engine and is transparent and invisible to the users in the generation environment.

How to desensitize with data?

Dynamic data desensitization is a real-time desensitization process in the process of query statement execution according to whether the effective conditions are met. A valid condition is usually a judgment based on the current user role. The visible range of sensitive data is preset for different users. System administrator, with maximum authority, visible to any field in any table at any time. Identifying restricted user roles is the first step in creating a desensitization policy.

Sensitive information depends on actual service scenarios and security dimensions. For example, sensitive fields of a user include name, ID number, mobile phone number, and email address. In the banking system, as a customer, it may also involve the bank card number, expiration date, payment password and so on; In the company system, employees may also be involved in salary, education background, etc. In the health care system, as a patient, it may also involve information about visits and so on. Therefore, identifying and sorting out sensitive fields in specific business scenarios is the second step in creating a desensitization strategy.

Product built-in a series of common desensitization function interface, can be for different data types and data characteristics, specify parameters, so as to achieve different desensitization effect. The desensitization function can use the following built-in interfaces and supports custom desensitization functions. The three built-in desensitization functions can cover the desensitization effects of most scenarios. Custom desensitization functions are not recommended.

  • MASK_NONE: does not perform desensitization and is only used for internal testing.
  • MASK_FULL: full desensitization to a fixed value.
  • MASK_PARTIAL: partial desensitization of contents within the desensitization range using the specified desensitization character.

Different desensitization columns can use different desensitization functions. For example, mobile phone numbers usually display the last four digits followed by an “*”; The amount is uniformly displayed as a fixed value of 0, etc. Determining the desensitization function that needs to be bound to the desensitization column is the third step to create a desensitization policy.

Taking emP of employee table of a company, Alice, Matu and July as examples, this paper briefly introduces the application process of data desensitization. Among them, the table EMP contains the employee’s name, mobile phone number, email, paycard number, salary and other private data. Users Alice is the hr manager, and users Matu and July are ordinary employees.

Assume that the table, the user, and the user have permissions to view the emP of the table.

1. Create the desensitization strategy mask_EMp, which only allows Alice to view all information of employees, while Matu and July are invisible to the paycard number and salary. The field card_no is a numeric type, and MASK_FULL is fully desensitized to a fixed value of 0. Card_string is a character field. MASK_PARTIAL is used to partially desensitize the original data according to the specified input/output format. The salary field is a numeric type, using all digit values before the penultimate of the number 9 partial desensitization.

postgres=# CREATE REDACTION POLICY mask_emp ON emp WHEN (current_user ! = 'alice') ADD COLUMN card_no WITH mask_full(card_no), ADD COLUMN card_string WITH mask_partial(card_string, 'VVVVFVVVVFVVVVFVVVV','VVVV-VVVV-VVVV-VVVV','#',1,12), ADD COLUMN salary WITH mask_partial(salary, '9', 1) length(salary) - 2);Copy the code

Switch to Matu and July to view the employee table EMP.

postgres=> SET ROLE matu PASSWORD 'Gauss@123'; postgres=> SELECT * FROM emp; id | name | phone_no | card_no | card_string | email | salary | birthday - + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - | 1 Anny | | 13420002340 | 0 # # # # - # # # # # # # # - 1234 | [email protected] | 2 | | 1999-10-02 00:00:00 Bob | 99999.9990 18299023211 | | 0 # # # # # # # # - # # # # - 3456 | 66 [email protected] | 3 | | 1989-12-12 00:00:00 9999.9990 cici | 15512231233 | | | [email protected] | | 1992-11-06 00:00:00 (3 rows) postgres=> SET ROLE july PASSWORD 'Gauss@123'; postgres=> SELECT * FROM emp; id | name | phone_no | card_no | card_string | email | salary | birthday - + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - | 1 Anny | | 13420002340 | 0 # # # # - # # # # # # # # - 1234 | [email protected] | 2 | | 1999-10-02 00:00:00 Bob | 99999.9990 18299023211 | | 0 # # # # # # # # - # # # # - 3456 | 66 [email protected] | 3 | | 1989-12-12 00:00:00 9999.9990 cici | 15512231233 | | | [email protected] | | 1992-11-06 00:00:00 (3 rows)Copy the code

2. Due to the job adjustment, Matu entered the HR department to participate in the recruitment of the company, and also made all the information of employees visible, and modified the effective conditions of the strategy.

postgres=> ALTER REDACTION POLICY mask_emp ON emp WHEN(current_user NOT IN ('alice', 'matu'));
Copy the code

Switch to the users Matu and July and review the employee table EMP.

postgres=> SET ROLE matu PASSWORD 'Gauss@123'; postgres=> SELECT * FROM emp; id | name | phone_no | card_no | card_string | email | salary | birthday ----+------+-------------+------------------+---------------------+----------------------+------------+----------------- - 1 | anny | | 13420002340 | 1234123412341234 | 1234-1234-1234-1234 [email protected] | | 10000.0000 1999-10-02 Bob 00:00:00 2 | | 18299023211 | 3456345634563456 | 3456-3456-3456-3456 | 66 [email protected] | | 9999.9900 3456-3456-3456 00:00:00 3 | cici | 15512231233 | | | [email protected] | | 1992-11-06 00:00:00 (3 rows) postgres=> SET ROLE july PASSWORD 'Gauss@123'; postgres=> SELECT * FROM emp; id | name | phone_no | card_no | card_string | email | salary | birthday - + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - | 1 Anny | | 13420002340 | 0 # # # # - # # # # # # # # - 1234 | [email protected] | 2 | | 1999-10-02 00:00:00 Bob | 99999.9990 18299023211 | | 0 # # # # # # # # - # # # # - 3456 | 66 [email protected] | 3 | | 1989-12-12 00:00:00 9999.9990 cici | 15512231233 | | | [email protected] | | 1992-11-06 00:00:00 (3 rows)Copy the code

3. Employee information phone_no, email and birthday are also private data. Update desensitization policy mask_EMp and add three desensitization columns.

postgres=> ALTER REDACTION POLICY mask_emp ON emp ADD COLUMN phone_no WITH mask_partial(phone_no, '*', 4);
postgres=> ALTER REDACTION POLICY mask_emp ON emp ADD COLUMN email WITH mask_partial(email, '*', 1, position('@' in email));
postgres=> ALTER REDACTION POLICY mask_emp ON emp ADD COLUMN birthday WITH mask_full(birthday);
Copy the code

Switch to user July to view the employee table EMP.

postgres=> SET ROLE july PASSWORD 'Gauss@123'; postgres=> SELECT * FROM emp; id | name | phone_no | card_no | card_string | email | salary | birthday - + -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - | 1 Anny 134 * * * * * * * * | | | 0 # # # # - # # # # # # # # - 1234 | * * * * * * * * 163. com 2 Bob | | | | 1970-01-01 00:00:00 99999.9990 182 * * * * * * * * | | 0 # # # # - # # # # # # # # - 3456 | * * * * * * * * * * * qq.com | 3 | | 1970-01-01 00:00:00 9999.9990 cici | 155 * * * * * * * * | | | ************sina.com | | 1970-01-01 00:00:00 (3 rows)Copy the code

4. For user friendliness, GaussDB (DWS) provides system views redaction_policies and Redaction_Columns for users to view more desensitization information.

postgres=> SELECT * FROM redaction_policies; object_schema | object_owner | object_name | policy_name | expression | enable | policy_description ---------------+--------------+-------------+-------------+-----------------------------------+--------+---------------- ---- public | alice | emp | mask_emp | ("current_user"() = 'july'::name) | t | (1 row) postgres=> SELECT object_name, column_name, function_info FROM redaction_columns; object_name | column_name | function_info -------------+-------------+-------------------------------------------------------------------------------------------- ----------- emp | card_no | mask_full(card_no) emp | card_string | mask_partial(card_string, 'VVVVFVVVVFVVVVFVVVV'::text, 'VVVV-VVVV-VVVV-VVVV'::text, '#'::text, 1, 12) emp | email | mask_partial(email, '*'::text, 1, "position"(email, '@'::text)) emp | salary | mask_partial(salary, '9'::text, 1, (length((salary)::text) - 2)) emp | birthday | mask_full(birthday) emp | phone_no | mask_partial(phone_no, '*'::text, 4) (6 rows)Copy the code

5. When employee information can be shared within the company, you can delete the desensitization policy mask_EMP in the EMP table.

postgres=> DROP REDACTION POLICY mask_emp ON emp;
Copy the code

For more details, see GaussDB (DWS) 8.1.1 product documentation.

The secret behind data desensitization implementation

The GaussDB (DWS) data desensitization function is based on the existing FRAMEWORK of the SQL engine to implement real-time desensitization when restricted users execute query statements. For its internal implementation, see the figure above. We regard the desensitization policy as the rule bound on the table object. During the optimizer Query rewrite phase, each TargetEntry of the TargetList in the Query Tree is traversed. If a desensitization column of the base table is involved, And the current desensitization rule takes effect (that is, it meets the effective conditions of the desensitization policy and is enabled), then it is concluded that the Var object to be desensitized is involved in this TargetEntry. At this point, the desensitization column system table PG_redaction_column is traversed to find the desensitization function bound to the corresponding desensitization column. Replace it with the corresponding FuncExpr. After the above rewrite of Query Tree, the optimizer will automatically generate a new execution plan, the executor executes according to the new plan, and the Query results will be desensitized to sensitive data.

Statement execution with data desensitization increases the logical processing of data desensitization compared to the original statement, which is bound to bring extra overhead to the query. This part of overhead is mainly affected by three factors: the data scale of the table, the number of desensitized columns involved in the query target column, and the desensitized function used by the desensitized column.

For a simple query statement, the TPCH table Customer is taken as an example to test the above factors, as shown in the figure below.

The base table CUSTOMER in Figures (a) and (b) adopts MASK_FULL desensitization function and MASK_PARTIAL desensitization function according to field types and characteristics. MASK_FULL is only desensitized to a fixed value for any length and type of raw data, so the output results are quite different from the original data. Figure (a) shows the execution time of simple query statements in desensitized and non-desensitized scenarios under different data scales. The solid icon indicates the non-desensitization scenario, and the hollow icon indicates the restricted user, that is, the desensitization scenario. It can be seen that the larger the data size, the greater the difference between the query time with desensitization and the original statement. Figure (b) shows the impact of the number of desensitized columns involved in query at 10x data scale on statement execution performance. When one desensitization column was involved, the query with desensitization was slower than the original statement. It was found that the MASK_PARTIAL desensitization function was used for this column. The query result only changed the format of the result, but the length of the result content did not change, which was in line with the theoretical guess that “statement execution with desensitization will have corresponding performance deterioration”. As the number of desensitized columns involved in the query increased, we found a strange phenomenon that the desensitized scene actually executed faster than the original statement. Further tracing the desensitization function associated with desensitization columns in multi-column scenarios, it was found that the desensitization columns using MASK_FULL desensitization function saved a lot of time in the output result set compared with the original data, so the simple query with data desensitization under multi-column queries actually sped up a lot.

In order to support the above speculation, we adjusted the desensitization function, and all desensitization columns used MASK_PARTIAL to perform partial desensitization on the original data, so as to retain the external readability of the original data on the desensitization results. Therefore, as shown in Figure (c), when all desensitization columns are associated with partial desensitization functions, the statement with data desensitization deteriorates by about 10% compared with the original statement. Theoretically, such deterioration is within the acceptable range. The above tests are for simple queries only, but the performance degradation may be more pronounced when statements are complex enough to have aggregate functions or complex expression operations.

conclusion

The data desensitization function of GaussDB (DWS) is an important technical breakthrough in internalizing database products and strengthening data security capabilities. It covers the following three aspects:

1. A set of simple and easy-to-use data desensitization strategy syntax;

2. A series of flexible built-in desensitization functions that can cover common privacy data desensitization effects;

3. A complete and convenient desensitization strategy application scheme enables real-time, transparent and efficient desensitization of original statements in the execution process.

In a word, this data desensitization function can fully meet the data desensitization demands of customers’ business scenarios, support the desensitization effect of common private data, and realize the reliable protection of sensitive data.

Click to follow, the first time to learn about Huawei cloud fresh technology ~