I. Background

Facebook’s data breach has become the focus of the Internet industry, wiping out tens of billions of dollars in market value: a price that could fund a vast security team, or even buy some of the largest security companies outright.

Despite heavy criticism in the media, Facebook is realistically facing an industry-wide problem. Few billion-dollar Internet companies could resist this kind of incident much better; differences in laws and national conditions around the world simply determine who ends up at the center of public opinion. The global trend, however, is toward ever greater attention to privacy, and within security the sub-field of data security has been elevated to new prominence. I will take this opportunity to discuss how to build data security. (Sensitive information in this article has been omitted or redacted as usual.)

II. Concepts

Let me stress up front that “privacy” and “data security” are two completely different concepts. Privacy protection is, for security professionals, largely a compliance matter: making data collection and data use conform to laws and regulations and preventing excess and abuse. For the many Internet companies whose profit model is built on data, this is particularly challenging. Some companies even define themselves as data companies; if they stop using data, either the user experience suffers or the business value is cut in half. The imminent enforcement of GDPR, and the possibility that some companies will leave Europe because of it, shows how hard this is. There are, of course, companies on the market that loudly advocate privacy protection. To a large extent they do not really represent the wishes of users; they talk this way because they have little or no data of their own.

Data security is one of the most important means of achieving privacy protection. Readers with some background in security will know that data security is not an independent element: it must be combined with network security, system security, business security, and other factors, and only when all of them are done well is data security finally achieved. This article therefore keeps data security at its core and avoids enumerating every traditional protection mechanism only weakly related to it, trying to treat the topic systematically without becoming long-winded. I also plan to cover other sub-areas in separate articles over the summer and autumn, such as intrusion prevention at massive IDC scale; please look forward to them.

III. Full Life-Cycle Construction

Some people in the industry argue that data has no boundaries, and that chasing leakage paths treats the symptoms rather than the disease; in practice, however, boundary-less data security is impossible with current technology. The following figure summarizes data security measures across the full life cycle:

IV. Data Collection

Part of data leakage is caused by replication of user session traffic. Although this has a technical threshold, it is one of the more frequent kinds of security incident, yet many enterprises remain unaware of it. The following sections describe data protection in the collection stage from several dimensions.

Traffic protection

Site-wide HTTPS is the mainstream trend on the Internet. It addresses sniffing, traffic mirroring, and third-party data theft on the link between users and servers. These problems are genuinely serious: insider fraud at telecom operators occurs from time to time; hijacking is used to inject advertising (and can just as easily capture data or inject trojans); even AWS has had DNS requests hijacked. For anyone who controls link-level resources this is practically a nuclear option: even if the target IDC does intrusion prevention well, an attacker can copy traffic directly, or even mount an APT, without any frontal penetration. In the end it only comes down to whether the payoff from manipulating traffic justifies the cost.

HTTPS is only the surface of a deeper principle: any unencrypted traffic on the Internet has neither privacy nor data security, and even HTTPS is not automatically secure. HTTPS itself has a variety of security issues, such as the use of insecure protocol versions (TLS 1.0, SSLv3), outdated weak cipher suites, implementation vulnerabilities such as Heartbleed, and the many problems caused by digital certificates themselves.
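The protocol floor and cipher policy described above can be enforced in code. A minimal sketch using Python's standard `ssl` module (the cipher string is one reasonable modern choice, not a mandate from the article):

```python
import ssl

def harden_tls(ctx: ssl.SSLContext) -> ssl.SSLContext:
    """Refuse the legacy versions and weak suites discussed above."""
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2    # no TLS 1.0/1.1, no SSLv3
    ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")  # forward secrecy, AEAD only
    return ctx

# Certificate loading is left to the caller (paths are deployment-specific).
server_ctx = harden_tls(ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER))
```

The same hardening applies to client contexts used for CDN back-to-origin or inter-IDC connections.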

A side effect of site-wide HTTPS involves CDNs and high-defense (anti-DDoS) IPs. Historically, a large Internet company had user data sniffed by the NSA because its CDN did not encrypt traffic back to the origin: the user’s browser spoke encrypted HTTPS to the CDN, but the CDN spoke plaintext to the origin site in the IDC. Encrypting from the CDN to the origin traditionally means handing the site’s certificate private key to the CDN vendor, a serious risk for any company that has not built its CDN entirely in-house. Keyless CDN technology was later developed to allow encrypted back-to-origin traffic without surrendering one’s own private key.

Unencrypted WAN traffic must also be avoided in the “backyard”: traffic replication and backup synchronization between IDCs. The corresponding solutions are automatic encryption of inter-IDC traffic and TLS tunneling.

Business security attributes

Two business security topics also sit between the user and the server. The first is account security. Once account leakage (credential stuffing and brute forcing) reaches a certain order of magnitude, aggregating the data behind those accounts inevitably produces the effect of a bulk data breach.

The second is anti-crawling. Crawlers appear wherever data can be obtained through pages or interfaces, and crawling millions of records in roughly an hour is no problem at all. Actively or passively leaked accounts, combined with crawler technology, have fed a large black and grey market for data acquisition.

UUID

A UUID is used to establish an intermediate mapping layer that masks the relationship between the UUID and real user information. On an open platform, for example, third-party applications can read only the UUID and can never obtain a user’s WeChat ID directly. The deeper purpose is to strip data of its power to identify individuals: phone numbers increasingly stand for personal identity, are bound to all kinds of accounts, and are very costly to change, so a single phone number can be matched to a person. Any individually identifying data therefore needs a “bridging” layer, anonymization, and desensitization. For example, once a merchant ID uniquely identifies a brand and store name, a data structure originally intended for program retrieval has suddenly become individually identifying data and must also be brought into the protection scope.
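The mapping layer can be sketched as a small service that hands out opaque UUIDs and keeps the reverse lookup strictly internal. The class and method names below are illustrative, not from any real platform:

```python
import uuid

class IdMapper:
    """Intermediate mapping layer: real identifiers never leave this store."""

    def __init__(self):
        self._real_to_uuid = {}
        self._uuid_to_real = {}

    def to_uuid(self, real_id: str) -> str:
        # Issue a stable opaque UUID per real identifier (phone, WeChat ID, ...).
        if real_id not in self._real_to_uuid:
            u = str(uuid.uuid4())
            self._real_to_uuid[real_id] = u
            self._uuid_to_real[u] = real_id
        return self._real_to_uuid[real_id]

    def resolve(self, opaque: str) -> str:
        # Internal-only reverse lookup; third-party apps never get this call.
        return self._uuid_to_real[opaque]
```

In production the store would be a persistent, access-controlled service, but the shape is the same: third parties see only `to_uuid` output.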

V. Front-End Business Processing

Authentication model

In many enterprise application architectures, authentication happens only once, at login, at the start of the business logic; user identity never reappears in subsequent transaction processing, which leads to a series of broken-access-control vulnerabilities. And such vulnerabilities are not the only harm of this model: K/V stores, RDS (relational databases), message queues, and similar components reached over unauthenticated RPC can be read arbitrarily.

The data layer knows only that a request came from some data-access-layer middleware or some RPC call; it does not know which user it represents, or even which upstream application (a customer-service system, say), so it cannot decide whether the caller should have full access to the current data object. The vast majority of Internet companies run open-source software, or modified open-source software, that has essentially no security features, or only very weak ones, and is completely unsuited to a 4A model (authentication, authorization, administration, audit) at massive IDC scale. Arbitrary read and write on the Intranet behind a well-defended perimeter is probably the norm in the Internet industry. The principal contradiction is authentication granularity versus elastic compute; for solutions, see my earlier article “A Preliminary Study on Next Generation Network Isolation and Access Control”. Google’s approach, for instance, is to authenticate RPC at the network layer; because Google’s Intranet speaks only one protocol, RPC, most of these security problems are avoided.

For the authentication model of business flows, Data and App must be separated in essence, establishing a model in which Data does not trust App by default. A ticket carried through the whole request flow, with authentication at every level, is the concrete implementation of this idea.

Servitization

Servitization is not itself a security mechanism, but security is one of its beneficiaries. As the Bezos mandate put it:

  • 1. All teams will henceforth expose their data and functionality through service interfaces.
  • 2. Teams must communicate with each other through these interfaces.
  • 3. No other form of interprocess communication is allowed: no direct links, no direct reads of another team’s data store, no shared-memory model, no back doors whatsoever. The only communication allowed is via service interface calls over the network.
  • 4. It doesn’t matter what technology they use: HTTP, Corba, Pubsub, custom protocols. Bezos doesn’t care.
  • 5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
  • 6. Anyone who doesn’t do this will be fired.

The security significance of servitization is that data must be accessed through interfaces. With all direct access to data shielded off, control and audit at the API become much more convenient.

Network encryption

Some of the industry’s top companies even encrypt on the IDC Intranet, that is, data transmission between back-end components is encrypted, for example Google’s RPC encryption and Amazon’s internal TLS. Because Intranet traffic in an IDC is far larger than public-network traffic, this is a real test of engineering capability, and the requirement may be harsh for companies still under the pressure of major business iteration; I therefore find it a reasonable yardstick for where a company’s security capability sits on the spectrum. Does a proprietary protocol count? If its cryptographic strength is not at least that of standard TLS (e.g., SHA-256 with proper asymmetric key exchange), or if it amounts to little more than a hash or obfuscation, then in my view it does not.

Database Audit

Database audit / database firewall is an intrusion detection/prevention component, a product area defined by strong adversarial pressure. Its significance for data security is nonetheless obvious: preventing SQL injection from pulling data in bulk, and detecting API authentication vulnerabilities and successful crawler access.

Database audit has a second meaning as well: auditing internal personnel’s operations on the database, guarding against dangerous actions such as an RD or DBA dumping the database, or deleting it in a fit of anger. Large Internet companies typically have a database access-layer component through which such dangerous operations can be audited and controlled.
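One simple signal such an audit layer can compute is per-account read volume, flagging bulk pulls that look like injection or crawler success. A toy sketch (the log format and threshold are illustrative assumptions):

```python
from collections import defaultdict

def flag_bulk_reads(audit_log, row_threshold=1000):
    """Flag accounts whose audited queries returned too many rows in total.

    audit_log: iterable of (account, rows_returned) tuples, as a stand-in
    for real database-audit records.
    """
    totals = defaultdict(int)
    for account, rows in audit_log:
        totals[account] += rows
    return sorted(a for a, n in totals.items() if n > row_threshold)
```

Real deployments would window by time and baseline per role, but the shape of the detection is the same.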

VI. Data Storage

The biggest topic in data storage for data security is data encryption. Amazon CTO Werner Vogels once put it this way: “All new AWS services are designed with data encryption in mind at the prototype stage.” Internet companies abroad generally take data encryption more seriously.

HSM/KMS

A common problem in the industry is not encrypting at all, or encrypting the wrong way: home-grown UDFs, wrong algorithm choice or encryption strength, bad randomness, keys with no rotation mechanism, keys not stored in a KMS. Proper data encryption follows the ideas of trusted computing: the root of trust lives in an HSM, and encryption uses a layered key hierarchy so keys can be rotated and expired dynamically. As Intel CPUs gain general support for SGX, data such as keys, fingerprints, and credentials will be processed with chip-level isolation similar to TrustZone in a much more commodity fashion.

Structured data

This refers mainly to at-rest encryption of structured data: symmetric encryption of fields such as phone numbers, ID card numbers, and bank card numbers that must be kept confidential when persisted; data warehouse encryption is similar to database encryption. In the Amazon Redshift service, for example, each data block is encrypted with a random key, and those keys are in turn encrypted and stored under a master key. The user can supply this master key, which ensures that only the user can access the confidential or sensitive data. Since this is a fairly common technique, I will not expand on it further.

File encryption

To encrypt individual files independently, block (chunk) encryption is generally adopted. A typical scenario, mentioned in “Advanced Guide to Internet Enterprise Security”, is iCloud: phone backups are chunked, encrypted, and stored in AWS S3. Each file chunk is encrypted with a random key stored in the file’s metadata; the metadata is wrapped with a file key, the file key is encrypted with a class-specific key (tied to data type and access permissions), and that key is in turn wrapped with the master key.

File system encryption

File system encryption is transparent: as long as an application has access rights, the user perceives nothing. It mainly solves the problem of storage media being readable after cold data is persisted; pulling a hard disk out of the machine room, or trying to recover data from a discarded disk, yields nothing. Against an API authentication vulnerability or SQL injection, however, file system encryption is equally transparent: whatever permissions the App has, the exploit has too.

VII. Access and Operations

This stage mainly covers measures to prevent internal personnel from exceeding their authority.

Role separation

R&D and operations should be separated; key custodians and data operations should be separated; operations roles and audit roles should be separated. Privileged accounts must be reclaimed to satisfy the principles of least privilege and separation-of-duties auditing.

Operational audit

The bastion host (jump server) is the conventional audit mechanism for human operations. As operations automation deepens in large IDCs, operations become API-driven, so calls to these APIs must also be brought into the audit scope; at larger orders of magnitude, data-mining methods are needed to analyze them.

Toolchain desensitization

Typical toolchain desensitization covers monitoring systems and debug tools/logs. Operational and security monitoring systems carry the user traffic of the whole site and therefore usually need user tokens and sensitive data desensitized; they may also derive operational data through simple computation, such as approximate transaction volumes, which likewise needs desensitization. Serious incidents, such as CVV codes appearing in debug logs, show that these are leak points deserving close attention.

Production-to-test data

The production environment and test environment must be strictly defined and separated. If production data must be brought into the test environment in special circumstances, it must first be desensitized and anonymized.

VIII. Back-End Data Processing

Data warehouse security

Big data processing is by now a necessity for essentially every Internet company. The warehouse usually carries all of the company’s user data, and some companies even spend more compute on data processing than on foreground transaction processing. The open-source platforms represented by Hadoop have weak security capabilities and require heavy modification before they can become public cloud services. A relatively small company can choose an internal trust model and not agonize over the security of the open-source platform itself; but once the company grows to tens of thousands of data RDs and BI analysts, the internal trust model must be abandoned. What is needed then is a one-stop platform for authorization and audit, visibility into data lineage, and encryption of highly sensitive data. At that scale, the maturity of the toolchain determines how much data must land locally: the more mature the toolchain, the less data needs to land on developers’ machines, which greatly improves security. At the same time, all computation should be mechanized, programmatic, and automated, avoiding manual operation as far as possible.

Classifying, identifying, and tracking the distribution, processing, and access of data requires a global view, along with “situational awareness” built on the behavior of data users.

Because the data warehouse is the largest center for data aggregation and distribution, each company’s values around data ownership will also shape the form its data security solution takes: laissez-faire plus detection, or isolation plus control.

Anonymization algorithms

The greater significance of anonymization algorithms actually lies in privacy protection rather than data security (I plan a separate article on privacy protection). Where anonymization does matter to data security, it is in reducing the possibility of data abuse and shrinking the impact surface after a leak.
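One family of such algorithms generalizes quasi-identifiers until every individual hides in a group of at least k records (the k-anonymity idea). A toy sketch, with illustrative field names:

```python
from collections import Counter

def generalize_age(age: int) -> str:
    # Coarsen a quasi-identifier: exact age -> decade bucket.
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def k_anonymous(records, quasi_keys, k) -> bool:
    """True if every quasi-identifier combination appears at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_keys) for r in records)
    return all(c >= k for c in counts.values())
```

In practice generalization is applied iteratively (age, zip prefix, etc.) until the k threshold holds, trading data utility against re-identification risk.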

IX. Display and Use

This stage covers the back ends of the many application systems, operational reports, and every other place where data can be displayed and seen; these may well be the disaster zone of data leakage.

Display desensitization

Sensitive information shown on pages needs desensitization. One type is complete desensitization: fields are partially masked and the full value is never displayed. The other is incomplete desensitization: the masked value is shown by default, but a “view details” button (API) is retained, so that every full view leaves a log to satisfy the audit requirement. The specific masking scheme should be chosen by weighing the work scenario against efficiency.
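Both types can be sketched in a few lines: a masking helper for default display, and a full-view path that always leaves an audit record. The masking pattern and names are illustrative choices, not a standard:

```python
audit_trail = []  # stand-in for a real audit log sink

def mask_phone(phone: str) -> str:
    # Partial masking: keep enough to recognise, hide the middle digits.
    if len(phone) < 7:
        return "*" * len(phone)
    return phone[:3] + "****" + phone[-4:]

def view_full(operator: str, field: str, value: str) -> str:
    # Incomplete desensitization: the "view details" API logs every access.
    audit_trail.append((operator, field))
    return value
```

The audit record, not the mask, is what makes the incomplete-desensitization mode acceptable: every full view is attributable.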

Watermarking

Watermarking is mainly aimed at screenshot scenarios and divides into visible and invisible watermarks: a visible watermark can be seen with the naked eye, while an invisible one hides identifying information in the image. Watermarks come in many forms; some resist screenshots, some even resist photographing the screen, and there is a strong adversarial element throughout.

Boundary security

The boundary here is the company’s data boundary, consisting of the office network and the production network. As office mobility grows, this boundary blurs further; it is logical rather than physical, equivalent to the office network, the production network, and authenticated mobile devices under MDM. For data inside this boundary, DLP is used for detection. The term DLP has been around for a long time, but its product form and technology have changed; it now serves a protection mode of heavy detection and light blocking in large-scale environments.

Beyond DLP, the entire office network adopts a BeyondCorp-style “zero trust” architecture: dynamic access control over all OA applications, anonymous access fully removed, HTTPS everywhere, and role-based least privilege, so that even a leaked account can access only a limited scope. At the same time, raise the cost of account leakage (multi-factor authentication), add detection capability, and provide remote wipe once a leak is detected.

Bastion hosts

As a fallback, the bastion host mainly addresses constrained local-operation scenarios and keeps developers from downloading sensitive data to their own machines. Like VDI, the approach is heavyweight and cumbersome to use, and is not suitable for broad rollout.

X. Sharing and Redistribution

For companies with large business footprints, data does not flow only within their own systems; they usually have open platforms and upstream/downstream data applications across the whole industry chain. The Facebook debacle was exactly this kind of issue, and simply not opening up is impossible, because openness touches the core on which the company lives: its business value.

The solution therefore amounts to: 1) limited compromise at the core (sacrificing some business interest to protect user privacy); 2) one-stop data security services.

Preventing downstream data accumulation

First, all data handed to third-party calls should be desensitized and encrypted unless there is a genuine need for detail. Where certain scenarios do require querying detailed data, set up a dedicated API and apply risk control to both account behavior and the API queries.
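Risk control on such a detail API usually starts with a per-account rate limit whose violations feed the risk engine. A minimal sliding-window sketch (class name and thresholds are illustrative):

```python
import time
from collections import defaultdict, deque

class QueryRateGuard:
    """Sliding-window limit on detail-API calls per third-party account."""

    def __init__(self, max_calls=100, window_s=60.0):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = defaultdict(deque)  # account -> timestamps of recent calls

    def allow(self, account: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.calls[account]
        while q and now - q[0] > self.window_s:  # drop calls outside the window
            q.popleft()
        if len(q) >= self.max_calls:
            return False  # over quota: deny and raise a risk event upstream
        q.append(now)
        return True
```

Denials here are themselves a signal: an account that keeps hitting the limit is a candidate for the crawler/abuse models discussed later.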

Second, a company with cloud infrastructure can push third parties onto its public cloud platform. This (1) lends them security capability, avoiding problems caused by their own lack of it, and (2) centralizes the data, making it easy to deploy a one-stop overall security solution on the cloud (data encryption, risk control, anti-crawling, and data-leak detection services), greatly reducing external risk and, to a degree, the problems of misuse and insider theft.

Anti-crawling

Anti-crawling here targets public pages and information pulled through interfaces. Because desensitization can never be done thoroughly in every link, even large amounts of “public” information can be aggregated and mined, eventually reconstructing user relationship chains, business data, or decision-support data, producing an information leak that exceeds expectations.

Authorization audit

Set up a dedicated team to perform machine and manual audits of third parties on the open platform, ban “unlicensed operation” and fake third parties, raise the bar for malicious third-party access, and provide a basis for credit-rating developers and partner companies.

Legal provisions

All third-party access must be covered by a strict user agreement that specifies data usage rights, data disclosure restrictions, and privacy protection requirements, and, as under GDPR, clarifies the role of the data processor and the penalty clauses.

XI. Data Destruction

Data destruction mainly means secure deletion. In particular, while the master instance of the data is easy to account for, backup copies tend to be overlooked. For fast, secure deletion it is best to store data encrypted: fully overwriting every copy cannot be completed in a short time, but secure deletion of encrypted data requires only deleting the key.
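This crypto-erase idea can be sketched directly: each record gets its own key; deleting the key renders the record, and every backup of its ciphertext, unreadable. As in the earlier storage sketch, the XOR keystream is a stdlib stand-in for a real cipher and the class is purely illustrative:

```python
import hashlib
import secrets

def _xor_stream(key: bytes, data: bytes) -> bytes:
    # Toy stand-in for a real cipher. Illustration ONLY.
    out, i = bytearray(), 0
    while len(out) < len(data):
        out += hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        i += 1
    return bytes(a ^ b for a, b in zip(data, out))

class CryptoShredStore:
    """Per-record keys: destroying a key 'deletes' all copies of the data."""

    def __init__(self):
        self.keys = {}   # lives in the KMS; never replicated with the data
        self.blobs = {}  # ciphertext; backups and replicas hold only this

    def put(self, record_id, plaintext: bytes):
        key = secrets.token_bytes(32)
        self.keys[record_id] = key
        self.blobs[record_id] = _xor_stream(key, plaintext)

    def get(self, record_id) -> bytes:
        return _xor_stream(self.keys[record_id], self.blobs[record_id])

    def shred(self, record_id):
        # Secure delete: destroy only the key; every ciphertext copy is now garbage.
        del self.keys[record_id]
```

The ciphertext can stay in backups indefinitely; without the key it carries no information, which is what makes deletion fast.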

XII. Data Boundaries

Data governance always involves “boundaries”. Like it or not, boundaries always exist; they are merely expressed differently. Without boundaries there is no data security.

Inside the enterprise

Provided it stays within cybersecurity law and privacy regulations, an enterprise legally has absolute control over its internal data, which means internal data security construction ultimately becomes an operational effort, its main challenge being no more than the cost of pushing implementation through each business unit. For larger companies, however, internal autonomy may not be enough, so data security creates demand for a closed loop across the industry chain.

Ecosystem construction

To flatten the part of data security construction that lies outside the enterprise’s internal value chain, a large enterprise may need to acquire data control and standard-setting power over upstream and downstream companies through investment and acquisition, pushing its own data security standards through the wider ecosystem. Data you cannot control is data you cannot secure. Where such leverage is insufficient, the realistic choice is to provide partners with more tooling, which is itself an extension of data control.

XIII. ROI and Construction Priorities

For many small companies, this catalogue of data security measures may well be too much; smaller companies may not be able to digest so much demand at all, since most open-source development frameworks lack these capabilities and the DIY component is very high. So let us sort out the preconditions, priorities, and ROI. Keeping data secure is a goal anyone can accept, and there is room for entrepreneurship here as well.

Foundations

Accounts, permissions, logging, desensitization, and encryption are the foundations of data security. Beyond these, two things are not strictly foundational but show up as decisive advantages: a unified infrastructure and a unified application architecture. Where both are highly unified, data security construction achieves twice the result with half the effort.

Log collection

Logging is the foundation of data risk control, and two elements matter here:

  • 1. Whether the office network is built on BeyondCorp, which greatly simplifies data risk control.
  • 2. Servitization: with all data calls taking the form of APIs, logging gains a unified shape.

Data risk control

Within data security, the “universally applicable” work is data risk control, suitable for every kind of enterprise: build risk models combining device information, account behavior, and query/crawl (read) behavior. It applies equally to 2C user accounts, 2B third-party partners, and OA employee accounts. I intend to describe the concrete strategies in a later article on building intrusion prevention systems.

About the author

Zhao Yan, Senior Director of the Security Department at Meituan-Dianping Group, is responsible for information security and privacy protection across all of the Group’s businesses. Before joining Meituan-Dianping he served as chief architect for Huawei Cloud security, Director of Enterprise Security Technology at Qihoo 360, Director of Security at Jiuyou.com, and a security expert at NSFOCUS (Green Alliance Technology). In his white-hat days he was a core member of the Ph4nt0m Security Team, and he is among the first generation of senior practitioners in Internet security.

About the team

Most core members of the Meituan-Dianping Group security department have years of hands-on experience in the Internet and security fields. Many took part in building the security systems of large Internet companies, and the team includes global security operations talent with offensive and defensive experience at million-server IDC scale, CVE discoverers, and speakers invited to top international conferences such as Black Hat. Members span penetration testing, Web, binary, kernel, global compliance and privacy protection, distributed systems development, big data analytics, and algorithms, along with, of course, many capable operations colleagues.

We aim to build an adaptive security system for an IDC of millions of servers and a mobile office network of hundreds of thousands of endpoints, including a zero-trust architecture and an automated security event awareness system based on big data and machine learning that spans the cloud infrastructure: the network layer, virtualization/container layer, server software layer (kernel and user mode), language VM layer (JVM / JS V8), Web application layer, and data access layer. We strive to build an industry-leading built-in security architecture and defense in depth. The rapid growth and business complexity of Meituan-Dianping give security practitioners a broad platform to implement the industry’s best security practices and explore emerging fields.

A small plug

The security department of Meituan-Dianping Group is recruiting partners of all kinds: Web & binary attack and defense, back-end & systems development, machine learning & algorithms, and more. If you would like to join us, please send your resume to [email protected]

For specific job information, please refer to the link “View here”.

For the Meituan-Dianping Security Emergency Response Center (MDSRC) homepage, click “View here”.

Enterprise Security series: dedicated to providing practical large-scale Internet solutions

Enterprise Security Best Practices from Google Whitepaper

Port Monitoring for Internet Enterprise Security

A Preliminary Study on Next Generation Network Isolation and Access Control

New Exploration on Data Security Protection of Internet Companies

Coming soon

Construction of a Large Internet Intrusion Perception and Defense System
Construction of an Enterprise Security Governance System
GDPR Compliance and Privacy Protection Practice