New Explorations in Data Security Protection at Internet Companies

In recent years the data security situation has become increasingly serious, and data security incidents keep emerging. In this climate, Internet companies have largely reached a consensus: attacks cannot be stopped completely, but the bottom line is that sensitive data must not leak. In other words, a server may be compromised, but sensitive data must not be carried off. Losing a server is an acceptable loss for an Internet company, whereas a leak of sensitive data can do significant reputational and financial damage.

In the data security field at Internet companies, both the data security life cycle proposed by traditional theory and the solutions offered by security vendors are hard to put into practice. The core reason is that they are difficult to operate against massive data volumes and complex application environments.

For example, the data security life cycle says data should be classified and graded before it is protected. But most Internet companies grew unchecked and discovered their data security problems only after they had become large. By then the existing data is already in place, and tens of thousands of new tables are added every day. How can data be classified and graded under these conditions? Sorting it by hand is clearly unrealistic, because manual review cannot keep up with the rate at which data grows.

As another example, the data audit solutions offered by security vendors are hardware appliances built for traditional relational databases. What is the audit solution for Hadoop? And with massive data volumes, many companies simply cannot afford to buy that many appliances.

Internet companies therefore urgently need measures that fit their own characteristics. To this end, the Meituan-Dianping Information Security Center has carried out some concrete explorations. At the IT level these explorations fall mainly into two areas, application systems and the data warehouse, which are described in turn below.

First, application system

One part of application system security is defending against external attacks. Most companies have this security awareness, but awareness is not the same as capability, and this capability is a basic skill for any responsible enterprise. The traditional problems include unauthorized access (privilege overreach), traversal, SQL injection, insecure configuration, and low-level vulnerabilities, all of which are covered by the OWASP Top 10 risks. In practice they are addressed mainly through SDL, security operations, and red-blue team exercises, with the main problems solved in productized form. They are not the focus here.

1.1 Account scanning and crawlers

In the new situation there are also the problems of account scanning and crawlers. Account scanning means credential stuffing or weak-password guessing. Credential stuffing tries leaked account and password pairs against a site and, where a login succeeds, steals user data or user funds. Weak passwords are simply passwords that are too easy to guess. The industry keeps exploring countermeasures, including device fingerprinting, complex CAPTCHAs, human-machine recognition, and IP reputation, and has tried many of them. But the black market's counter-techniques are improving as well, including one-click device reset tools, emulators, IP proxies, and imitation of human behavior, so this is a continuing contest.

For example, some companies check sensors such as the accelerometer when a user logs in, because tapping a phone screen inevitably changes the angle and gravity readings. If these sensors show no change while the user is tapping, a script is suspected. Adding another dimension, the device's recent battery level changes, helps determine whether the phone is being used by a person or sits in a black-market studio. Once the black market discovers that a company uses this strategy, it is easy to defeat: all of these readings can be forged, and many tools for doing so are sold on online marketplaces.
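The sensor check described above might look roughly like the following. This is only a sketch: the function name, the input shapes, and the threshold are assumptions made for illustration, not details from any real product, and as the text notes such readings can be forged.

```python
from statistics import pstdev


def suspicious_signals(accel_samples, battery_levels, min_accel_std=0.01):
    """Return the names of signals that look too static for a hand-held phone.

    accel_samples: list of (x, y, z) accelerometer readings taken around a tap.
    battery_levels: recent battery percentages reported by the device.
    Both inputs and the threshold are hypothetical, for illustration only.
    """
    signals = []
    if accel_samples:
        axes = zip(*accel_samples)  # regroup the readings per axis
        if all(pstdev(axis) < min_accel_std for axis in axes):
            signals.append("static_accelerometer")
    if battery_levels and len(set(battery_levels)) == 1:
        signals.append("static_battery")
    return signals


# A device lying perfectly still with a battery level that never moves.
scripted = suspicious_signals([(0.0, 0.0, 9.8)] * 5, [100, 100, 100])
# A device whose readings move the way a hand-held phone's would.
human = suspicious_signals(
    [(0.0, 0.1, 9.8), (0.3, -0.2, 9.6), (0.1, 0.4, 9.9)], [81, 80, 78]
)
```

Returning the list of triggered signals rather than a single yes or no lets a downstream rule engine weigh them together with other dimensions.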

Fighting crawlers is another new problem. One earlier article reported that at some companies crawlers account for more than 75% of data access traffic. This traffic brings no business value, consumes a large amount of resources, and creates a risk of data leakage.

With the rise of Internet finance, crawlers have changed again, from unauthorized crawling to user-authorized crawling. For example, Xiao Zhang is short of money and applies for a small loan on an Internet finance company's website. The company does not know whether Xiao Zhang qualifies for a loan or can repay it, so it asks him to provide his account passwords for shopping sites, email, or other applications, and crawls his daily spending data as a reference for credit scoring. To obtain the loan, Xiao Zhang hands over the passwords, which constitutes authorization to crawl. This is a major change from earlier unauthorized crawling: Internet finance companies can reach more sensitive information, which not only adds to the resource burden but may also expose user passwords.

Countering crawlers is also a comprehensive problem, and no single technique solves all of it. Beyond the device fingerprinting and IP reputation mentioned above, the measures include machine learning models that separate normal behavior from abnormal behavior, and association models can be used as well. But this too is an adversarial process, and the black market is gradually learning to mimic human behavior. In the future it will be machine against machine, and cost will decide who wins.

1.2 Watermarking

In recent years the industry has also seen incidents in which sensitive internal documents and screenshots were leaked. Some were picked up by the media and affected public opinion of the company, so the source of such leaks needs to be traceable. The robustness problems a watermarking technique has to solve include spatial filtering, Fourier transforms, and geometric distortion. Put simply, it is a technique for recovering the embedded information after the content has been transformed under adverse conditions.

1.3 Data honeypot

A data honeypot is a fake data set created to lure visitors and detect attacks. Companies abroad have built products of this kind. The implementation can be roughly understood as embedding a "Trojan horse" in a data file: whenever anyone opens the file, a record is sent back to the server, and through this Trojan the attacker's details can be traced. We built something similar, but unfortunately the data file sat there for a long time without anyone accessing it. The lack of access has to do with how we position the honeypot: at this stage we prefer to treat it as an experimental gadget rather than roll it out widely, because the Trojan itself may carry some risk.

1.4 Big Data Behavior Audit

Big data opens up more possibilities for correlation-based auditing, in which abnormal behavior is found by correlating many kinds of data. Traditional security audit vendors have made some attempts here, but objectively their products are still fairly basic and cannot handle behavior auditing in the complex environment of a large Internet company. This is not meant as criticism of those vendors: it comes down to business, and business pursues profit. Under these circumstances, Internet companies have to do more themselves.

For example, to guard against insiders, correlation analysis over various data can be combined with a rule such as "shares a device with a known bad actor" to detect them. By extension, more rules for catching insiders can be derived from the main flows (information, goods, and funds) in line with the characteristics of a company's own data.

In addition, anomalies can be found through UEBA (user and entity behavior analytics). This requires collecting data at instrumentation points in every step of the process, and a rule engine, a data platform, and an algorithm platform are needed to support it on the back end.

Take a common clustering approach: a few users behave differently from the majority, and those users may be abnormal. A concrete scenario: a normal user first opens the page, selects a product, and then logs in and places an order. Abnormal behavior might be logging in, then changing the password, then choosing a newly opened store and using a large coupon. Various variables can be derived from each of these data fields, and from those variables an anomaly judgment can be reached.
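The idea of flagging users who sit far from the majority can be sketched as follows. This is not a full clustering algorithm but a simpler stand-in, a standardized distance from the centroid of all sessions. The three derived variables and the threshold are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def outlier_sessions(features, threshold=2.0):
    """Flag sessions whose derived variables lie far from the centroid.

    features: dict of session id -> list of numeric variables, here
    [password_changed, new_store, coupon_value] (all hypothetical).
    """
    ids = list(features)
    columns = list(zip(*(features[i] for i in ids)))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    flagged = []
    for i in ids:
        z = [(v - m) / s for v, m, s in zip(features[i], means, stds)]
        if sqrt(sum(x * x for x in z)) > threshold:
            flagged.append(i)
    return flagged


# Nine ordinary sessions and one that changes the password, picks a new
# store, and redeems a large coupon.
sessions = {f"s{n}": [0, 0, 5] for n in range(9)}
sessions["s9"] = [1, 1, 50]
flagged = outlier_sessions(sessions)
```

In a real deployment a proper clustering or density-based method over many more variables would replace the centroid distance, but the principle of measuring deviation from the majority is the same.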

Another example is the association model: members of a fraud gang are usually connected to one another. The connecting dimensions can include IP address, device, WiFi MAC address, GPS location, delivery address, and flow of funds. Combined with a company's own data, these can tie a gang together. Once one member is blacklisted, the reputation of the accounts around it is downgraded according to the strength of their connection to it.
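A minimal sketch of such an association model is shown below. It assumes link strength is simply the number of shared attributes, and it applies a fixed penalty per shared attribute to the direct neighbors of a blacklisted account. The attribute encoding, the penalty value, and the one-hop limit are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def link_strengths(accounts):
    """accounts: dict of account -> set of attributes (IP, device id, ...).

    Returns a dict of (account_a, account_b) -> number of shared attributes.
    """
    by_attr = defaultdict(set)
    for account, attrs in accounts.items():
        for attr in attrs:
            by_attr[attr].add(account)
    strength = defaultdict(int)
    for members in by_attr.values():
        for a, b in combinations(sorted(members), 2):
            strength[(a, b)] += 1
    return dict(strength)


def demote(accounts, blacklisted, penalty_per_link=0.2):
    """Lower the reputation of accounts directly linked to a blacklisted one."""
    scores = {account: 1.0 for account in accounts}
    scores[blacklisted] = 0.0
    for (a, b), shared in link_strengths(accounts).items():
        if blacklisted in (a, b):
            other = b if a == blacklisted else a
            scores[other] = max(0.0, scores[other] - penalty_per_link * shared)
    return scores


accounts = {
    "a": {"ip:1.2.3.4", "device:x"},
    "b": {"ip:1.2.3.4", "device:x"},  # shares two attributes with "a"
    "c": {"ip:1.2.3.4"},              # shares one attribute with "a"
    "d": {"ip:9.9.9.9"},              # unrelated
}
scores = demote(accounts, "a")
```

Accounts that share more attributes with the blacklisted one lose more reputation, which matches the "strength of connection" idea in the text.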

The foundation of UEBA is sufficient data. The data can come from external suppliers: Tencent and Alibaba, for example, offer external data services, including IP reputation scoring, and using them gives a joint prevention and control effect. It can also come from inside. An Internet company usually has several lines of business serving the same customer, and which of that data can be put to use depends on how attuned the security staff are to data.

1.5 Data desensitization

Application systems always hold a great deal of sensitive user data. They can be divided into internal and external systems. Desensitization in external systems mainly guards against credential stuffing and crawlers, while desensitization in internal systems mainly guards against insiders leaking information.

Desensitization in external systems can be handled in layers. By default, key information such as bank card numbers, ID numbers, mobile phone numbers, and addresses is forcibly desensitized, with the key positions replaced by ****. Even if an account is hit by credential stuffing or a crawler, the information cannot be obtained, which protects user data. But some customers will always need to view or modify their complete information, and this is where layered protection comes in. The decision is based mainly on whether the device is one the user commonly uses: on a commonly used device, the full value is shown after a click with no further obstacle; on a device that is not commonly used, a strong verification step is pushed first.
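The layered behavior for external systems can be sketched as below. The function names, the number of characters kept at each end, and the way "commonly used device" and "strong verification" are passed in as flags are assumptions for illustration.

```python
def mask_middle(value, keep_head=3, keep_tail=4, mask_char="*"):
    """Replace the middle of a value, keeping a few leading and trailing characters."""
    hidden = len(value) - keep_head - keep_tail
    if hidden <= 0:
        # Too short to keep anything safely: mask the whole value.
        return mask_char * len(value)
    return value[:keep_head] + mask_char * hidden + value[len(value) - keep_tail:]


def render_field(value, reveal_requested, is_common_device, passed_strong_check=False):
    """Decide what an external page shows for a sensitive field."""
    if not reveal_requested:
        return mask_middle(value)  # masked by default
    if is_common_device or passed_strong_check:
        return value               # commonly used device, or strong verification passed
    return mask_middle(value)      # unfamiliar device: stay masked until verified


masked_phone = mask_middle("13912340011")
```

With these defaults an 11-digit mobile number keeps its first three and last four digits, which is the common partial-masking layout for phone numbers.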

Meituan-Dianping's daily business has another distinctive feature. When a delivery rider cannot find the exact location, the rider needs to contact the buyer, which exposes at least two pieces of information: the address and the mobile phone number. We have also explored ways to protect this buyer information. For the phone number, we use an alias-number mechanism: the rider is given a temporary relay number for contacting the buyer, and the real number is never visible. For the address, the system displays it as an image, and once the order is completed the address is no longer visible.

Desensitization in internal systems can be broken into several steps in practice. The first is detecting sensitive information in the internal systems, which can be done either from logs or from the JavaScript front end. Each approach has advantages and disadvantages. The log-based approach depends on the company having a unified logging standard; otherwise every system has its own log format, and integration takes a long time and a lot of work. The front-end JavaScript approach is relatively lightweight, but its performance impact on the business has to be considered.

The purpose of detection is to keep tracking changes in sensitive information, because in a complex internal environment systems are constantly being modified and upgraded. Without continuous monitoring, the work turns into a one-off campaign whose results cannot be sustained.

After detection comes the desensitization itself. This requires clear agreement with the business side on which fields must be fully desensitized and which can be semi-desensitized. Where an application's permission model is well structured, role-based desensitization can be considered. For example, risk control case handlers must see a user's complete bank card information, so their role can be granted an exemption. Customer service staff do not need the complete information, so desensitization is forced for them. Between exemption and forced desensitization sits semi-desensitization: the value is masked by default, the full number can be revealed with a click when needed, and each click is recorded.
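The three levels described above (exempt, semi-desensitized with recorded clicks, and forced) can be sketched as a small policy table. The role names, field names, and in-memory audit log are hypothetical; a real system would load the policy from its permission service and write clicks to durable storage.

```python
from datetime import datetime, timezone

# Hypothetical policy table: (role, field) -> "exempt" | "semi" | "forced".
POLICY = {
    ("risk_control", "bank_card"): "exempt",
    ("operations", "bank_card"): "semi",
    ("customer_service", "bank_card"): "forced",
}
AUDIT_LOG = []  # stand-in for a real audit store


def mask(value):
    """Keep the first and last four characters; mask everything if too short."""
    if len(value) <= 8:
        return "*" * len(value)
    return value[:4] + "*" * (len(value) - 8) + value[-4:]


def view_field(user, role, field, value, clicked=False):
    """Return what this user may see; unknown roles get the strictest policy."""
    policy = POLICY.get((role, field), "forced")
    if policy == "exempt":
        return value
    if policy == "semi" and clicked:
        # Revealing the full value is allowed, but every click is recorded.
        AUDIT_LOG.append({
            "user": user,
            "field": field,
            "time": datetime.now(timezone.utc).isoformat(),
        })
        return value
    return mask(value)
```

Defaulting unknown role and field combinations to forced desensitization keeps newly added fields safe until someone explicitly decides otherwise.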

There should also be a global view of desensitization as a whole: how many pieces of sensitive user information are accessed each day, how many of them are desensitized, and why the rest are not. This makes it possible to track changes overall, with the goal of steadily reducing the access rate of sensitive information. An abnormal fluctuation in the view means the business has changed and the cause needs to be traced.
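The daily numbers behind such a global view could be computed along these lines. The event format, with a `masked` flag and an optional `reason`, is an assumption made for this sketch.

```python
from collections import Counter


def daily_summary(events):
    """Summarize one day's accesses to sensitive fields.

    events: iterable of dicts with a boolean 'masked' key and, for
    unmasked accesses, an optional 'reason' key (hypothetical format).
    """
    events = list(events)
    total = len(events)
    masked = sum(1 for e in events if e["masked"])
    reasons = Counter(
        e.get("reason", "unknown") for e in events if not e["masked"]
    )
    return {
        "total": total,
        "masked": masked,
        "masked_rate": masked / total if total else 1.0,
        "unmasked_reasons": dict(reasons),
    }


summary = daily_summary([
    {"masked": True},
    {"masked": True},
    {"masked": True},
    {"masked": False, "reason": "role_exempt"},
])
```

Comparing `masked_rate` and the reason counts from day to day is one simple way to spot the abnormal fluctuations mentioned above.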

Second, data warehouse

The data warehouse is the heart of a company's data, and a problem there carries enormous risk. Governing a data warehouse is a long-term, gradual effort in which security is only a small part; most of the work lies in data governance. This article discusses some of the tools built for the security part, including data desensitization, privacy protection, big data behavior audit, the asset map, and the data scanner.

2.1 Data desensitization

Desensitization in the data warehouse means transforming sensitive data in order to protect it. It is used mainly where data analysts and developers explore data they are not familiar with. In practice it can take several forms, including obfuscating or substituting the data so that it remains usable without changing how the data is represented. With the data volumes of a large Internet company, obfuscation and substitution are very costly. The method commonly used in practice is the simpler one of partial masking, for example masking a mobile phone number as 139****0011. The rules are simple and the method offers a certain degree of protection.

In some scenarios, however, simple masking does not meet business needs and other methods have to be considered, such as tokenization for credit card numbers, bucketing for range data, diversity of cases, and even Base64 masking for images. Different services therefore have to be offered for different scenarios, as a trade-off among cost, efficiency, and usage.
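Two of the alternatives mentioned above, tokenization for card numbers and segmentation of range data, can be sketched as follows. The in-memory vault, the `tok_` prefix, and the bucket width are assumptions for illustration. A production token vault would be a separately secured service.

```python
import secrets


class TokenVault:
    """Vault-style tokenization: a random token stands in for the real value."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        # Reuse the existing token so joins on the tokenized column still work.
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token):
        return self._value_by_token[token]


def to_range(value, width=100):
    """Replace an exact integer with the bucket it falls into."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


vault = TokenVault()
token = vault.tokenize("4111111111111111")
bucket = to_range(157)
```

Because the token is random rather than derived from the card number, it reveals nothing about the original value, yet analysts can still count and join on it.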

Masking has to consider both the original table and the desensitized table. A copy of the original data must exist; beyond that, either a separate desensitized table is copied out or desensitization is applied visually on top of the original data, and the two options carry different costs. Copying out a desensitized table is the more thorough approach, but it means duplicating every sensitive table, which is a storage cost problem. Visual desensitization masks data dynamically at display time through rules; it achieves the effect at lower cost, but it can potentially be bypassed.

2.2 Privacy Protection

In the field of privacy protection, academia has proposed methods such as k-anonymity, edge anonymity, and differential privacy, which aim to protect privacy when data is aggregated. For example, some companies publish part of their data for algorithm competitions after removing sensitive information. In that case one has to consider whether different data sets, once aggregated, can be linked back to an individual's identity. Currently, Google's DLP API is in production use, but it is complicated to use and tied to specific scenarios. The key to privacy protection is being able to engineer it at scale. In the big data era these are all new topics, and at present no complete method solves every privacy protection problem.
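As a small illustration of one of these methods, the sketch below measures the k in k-anonymity: the size of the smallest group of rows that share the same quasi-identifier values. The column names and sample rows are made up for the example.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0


def violating_groups(rows, quasi_identifiers, k):
    """List the quasi-identifier combinations shared by fewer than k rows."""
    groups = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return sorted(key for key, size in groups.items() if size < k)


rows = [
    {"age": "20-29", "zip": "1000**", "spend": 10},
    {"age": "20-29", "zip": "1000**", "spend": 30},
    {"age": "30-39", "zip": "1000**", "spend": 5},
]
k = k_anonymity(rows, ["age", "zip"])
risky = violating_groups(rows, ["age", "zip"], k=2)
```

A released data set with k equal to 1 contains at least one person who is unique on the chosen columns, so that row would need further generalization or suppression before publication.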

2.3 Big data asset map

The asset map is a platform for analyzing and visualizing the data assets on the big data platform. The most common case is department A applying for department B's data. As the data owner, B naturally wants to know how A uses the data once it has been handed over and whether A passes it on to others, so an asset map is needed to track the flow and use of data assets. The security department, for its part, needs to know which highly sensitive data assets exist on the platform, how they are used, and who holds which permissions. A visual asset map is therefore built from metadata, data lineage, and operation logs. The map alone is not enough: by extension, interventions such as timely alerts and revoking permissions are needed as well.
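The tracking question above, where a table's data ends up and who can read it, reduces to a graph traversal over lineage records. The sketch below assumes lineage is available as a mapping from each table to the tables derived from it, and permissions as a mapping from table to grantees. Both structures and all names are hypothetical.

```python
from collections import deque


def downstream(lineage, source):
    """Return every table derived, directly or indirectly, from source.

    lineage: dict of table -> list of tables derived from it.
    """
    seen = set()
    queue = deque([source])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def exposure(lineage, grants, source):
    """Map the source table and all of its descendants to who can read them."""
    tables = {source} | downstream(lineage, source)
    return {t: sorted(grants.get(t, [])) for t in sorted(tables)}


lineage = {
    "b.orders": ["a.orders_copy"],
    "a.orders_copy": ["a.report"],
}
grants = {
    "b.orders": {"dept_b"},
    "a.orders_copy": {"dept_a"},
    "a.report": {"dept_a", "vendor_x"},
}
report = exposure(lineage, grants, "b.orders")
```

Here department B can see that its data reached a report readable by a third party, which is the kind of finding that would trigger an alert or a permission review.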

2.4 Database scanner

The database scanner scans data on the big data platform. Its purpose is to discover sensitive data there so that the appropriate protection can be applied. A large Internet company may generate tens of thousands of tables directly every day, with even more tables derived from them. By the traditional definition, the first step in data security is classification and grading, but that step is hard to carry out. With such a massive stock of tables, how can they be classified and graded? Sorting them by hand is clearly unrealistic, since manual review cannot keep pace with new tables, so automated tools are needed to label the data. The database scanner can use regular expressions to find basic highly sensitive data in structured fields such as mobile phone numbers and bank card numbers. Unstructured fields need machine learning plus manual labeling for confirmation.
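A regular-expression scanner of this kind might be sketched as follows. The mobile pattern assumes 11-digit mainland China numbers, the bank card check combines a length test with the standard Luhn checksum, and the 80% match ratio for labeling a column is an arbitrary illustrative threshold.

```python
import re

# Assumed pattern: 11 digits starting with 1, second digit 3 to 9.
MOBILE = re.compile(r"1[3-9]\d{9}")


def luhn_ok(digits):
    """Standard Luhn checksum used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_bank_card(value):
    return value.isdigit() and 16 <= len(value) <= 19 and luhn_ok(value)


def classify_column(samples, min_ratio=0.8):
    """Label a column from sampled values, or return None if nothing matches."""
    samples = [s.strip() for s in samples if s and s.strip()]
    if not samples:
        return None
    checks = {
        "mobile": lambda s: MOBILE.fullmatch(s) is not None,
        "bank_card": is_bank_card,
    }
    for label, check in checks.items():
        hits = sum(1 for s in samples if check(s))
        if hits / len(samples) >= min_ratio:
            return label
    return None
```

Sampling a few hundred values per column and labeling by match ratio keeps the scan cheap enough to run against tens of thousands of new tables a day, and columns that return None are the ones left for machine learning and manual labeling.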

To sum up, data security becomes more and more important once a business has grown to a certain scale. At the micro level, tooling is the support that improves efficiency while minimizing disruption to the business. At the macro level, beyond the data security of a company's own systems, the data security of partners, invested companies, logistics providers, riders, merchants, outsourcers, and other organizations also affects its own security; as the saying goes, when the lips are gone, the teeth grow cold. Security levels across these organizations are currently uneven, so established Internet companies need to take on more responsibility, helping their partners raise their security level and building a joint defense together.

About the author

Peng Fei, head of data security in the Security Department of Meituan-Dianping Group, is responsible for data security and privacy protection across all of the Group's businesses.

About the team

The Security Department of Meituan-Dianping Group brings together leading security experts and outstanding technical talent from across China, and follows the principle of "professionalism, operation, and service" in safeguarding the rapid development of all of the Group's businesses. The team is committed to building, for a massive IDC environment, an automated security incident detection system based on big data and machine learning that spans the network layer, the virtualization layer, the server software layer (kernel and user mode), the language runtime layer (JVM, Zend, JavaScript V8), the web application layer, and the data access layer (DAL), and is working toward a built-in security architecture and a defense-in-depth system. Drawing on the broad platform and opportunities the Group offers, the team focuses on the practice of building enterprise security and aims to keep moving in the best direction for a security team.

A small advertisement

The Security Department of Meituan-Dianping Group is recruiting for roles including web and binary attack and defense, back-end and system development, and machine learning and algorithms. It should be a good opportunity for anyone pursuing security and engineering technology.

