Introduction: The so-called black grey industry, including network black production, grey production two industrial chains, with the rapid development of the Internet, network black grey industry is constantly evolving, the current network black grey industry has tended to platform, professional, refined operation. Based on the characteristics of black-gray attack, we propose a black-gray attack identification method based on community coding. The community discovery part is based on graph relation, and the coding part introduces large-scale graph embedding representation learning. Compared with traditional graph relation mining, unknown attacks can be better identified and measured. Moreover, we also propose an engineering implementation based on asynchronous quasi-real time, which has stronger strain flexibility against frequently changing black ash production attacks.

The full text is 4424 words and the expected reading time is 12 minutes.

The background,


The so-called black gray industry includes two industrial chains, network black gray industry, with the rapid development of the Internet, network black gray industry is also developing, the current network black gray industry has formed a platform, professional, refined, independent and closely cooperative industry chain. From major network security accidents in recent years, black ash to produce pure attack mode is no longer confined to half open, but converted to accumulate their tools and bad means of commercial competition, according to incomplete statistics, the current network of black ash to produce the size of the market has more than one hundred billion yuan, under the market size of the billions of level, developed very much niche, such as trojans, viruses, Keep number brush sheet, wool pulling, telecom fraud, knowledge piracy, traffic hijacking, etc.

The Internet platform to prevent black ash produced by network attack, developed a lot of defense and recognition technology, most often by the masses of users to perceive the verification code technology is one of them, in the verification code technology behind, there are a lot of recognition method, for example, by rule engine according to the analysis of rules of anti attack to intercept, through behavioral sequence modeling to black ash to produce a single request behavior decision, The correlation between users is mined through the graph relationship to identify black ash production groups, and unsupervised identification of black ash production attacks is carried out based on hierarchical clustering, mean clustering, gaussian mixture and other clustering models. All these methods can play a role in identification and defense of black ash production to varying degrees.

Based on the above characteristics of black gray attack, we propose a community coding based black gray attack identification method. The community discovery part is based on graph relation, and the coding part introduces a large-scale graph embedding representation learning method. Compared with the existing graph relation mining, unknown attacks can be better identified and measured. Moreover, we also propose an asynchronous quasi-real-time engineering algorithm, which has stronger strain flexibility against the frequently changing black ash production attack.

Second, community structure


Based on community coding, the black ash attack identification method introduces large-scale graph embedding representation learning technology on the basis of the original map mining, which can not only dig out the association of the black ash itself, but also identify the potential network structure of the black ash, making the identification process more accurate and stable.

This method is based on two types of correlation graph, namely isomorphic graph and isomorphic graph, which are often encountered in the mining process of black ash production. Isograph indicates that all nodes in the network diagram are of the same type. For example, for an account association, all nodes in the network diagram are user account ids. Heterogeneous graphs indicate that the types of all nodes in the network diagram may be different. For example, in an account association network, nodes in the network diagram may have other types of nodes, such as IP addresses, device numbers, and mobile phone numbers, in addition to account IDS.

The following figure shows the isomorphism and heterogeneity. The isomorphism diagram shows the network structure composed of all account ids, and the heterogeneity diagram shows the network structure composed of account ids, device ids, mobile phone numbers, and IP addresses.

For graph structure network, in addition to edge relations, nodes themselves also have many inherent attributes. For example, for UGC scene of an account ID, there will be different active time, different business scenes (such as browsing pictures and texts, browsing videos, publishing pictures and texts, etc.) and different operation types (such as Posting, commenting, liking, forwarding, etc.). The attributes of such nodes are shown in the table below.

In order to explain the general situation, all isomeric graph structures are used in the following explanation by default, because isomeric graph is a special case of isomeric graph, even if isomeric graph is used in actual promotion, it does not affect the method of using isomeric graph for analysis. The following figure shows the network relationship in the actual scenario.

In order to identify the correlation structure of community, there are more recognition method, commonly used statistical characteristics based on node, based on the in and out of the degree of node distribution change, based on the edge of the associated custom weight, methods of manual annotation, this method can identify many associated community, but the map is difficult to define the edge weights in the association, there are more false call, Therefore, in practice, we encode based on the existing community mining results to improve the identification effect of black and gray production. At the same time, embedded graph coding can also carry out unsupervised learning of similarity based on the neighbor relationship of nodes.

Figure embedded coding


Graph embedded coding is a node2VEc method to encode nodes into vectors. The GraphSAGE proposed by William L. Hamilton, Rex Ying and Jure Leskovec et al., Stanford University in 2016 is the graph embedded coding method adopted by us. In contrast to Node2vec, which is embedded at the node level of the graph, GraphSAGE is embedded at the level of the entire graph. GraphSAGE also uses node feature information and structure information to get the map of Graph Embedding. Compared with the previous method of saving mapping results, GraphSAGE saves the map generated by Embedding, which has stronger scalability and better performance in node classification and link prediction.

The GraphSAGE algorithm flow consists of three steps:

(1) The neighbor nodes of each node in the figure are sampled. Since the degree of each node is inconsistent, a fixed number of neighbors are sampled for each node in order to calculate efficiently.

(2) Aggregate the information contained in neighbor vertices according to the aggregation function.

(3) The vector representation of each vertex in the graph is obtained for use by downstream tasks.

The following figure shows the sampling and aggregation diagram provided by the author of the algorithm in the paper.

Sample nodes using the method shown below. Node of each layer is generated by the upper layer and has nothing to do with this layer. In this way, account ID 1 of layer 1 has aggregated the information of device ID 1 and mobile phone number 2 of layer 0. At layer 2, mobile phone number 2 has aggregated the information of IP address 1 after two layers of sampling. The second-order neighbor of account ID 1 contains all information about device ID 1, mobile phone number 1, device ID 2, mobile phone number 2, account ID 2 and IP address 1.

In the sampling process, the fixed number of sampling layers (2 layers are used in this practice) and the number of nodes at each sampling point (for example, the upper limit of the number of neighbor nodes is 200) can control the memory consumption and computing time of each sampling process. This method is suitable for large-scale data sets and is very effective for mining black ash communities under large data sets.

The aggregation function adopted by us is mean aggregation, which directly averages each dimension in emebdding of the target node and all neighbors, and then non-linear transformation. The corresponding function expression in the original paper is as follows:

The main idea is to splicing the KTH −1 layer vectors of the target vertex and neighbor vertices, then calculate the mean value of each dimension of the vector, and make a nonlinear transformation of the obtained results to generate the KTH layer representation vector of the target vertex. The calculation methods of different aggregation functions are different. In addition to the mean aggregation function, there are also pooled aggregators and LSTM aggregators, etc. After testing, there is no obvious difference between different aggregation functions for the mining of black ash industry community.

In this way, the vector representation of each node on the network graph can be obtained through the sampling and neighbor aggregation of the black ash community mentioned above. For example, the vector of account ID 1 in the figure above contains the network structure information of device ID 1, mobile phone number 1, device ID 2, mobile phone number 2, account ID 2 and IP address 1, as well as the behavior characteristics of these ids at different time points, different business scenarios and different operation types.

4. Model training


Based on the above sampling and aggregation function, parameter learning began. Different loss functions of GraphSAGE represent different parameter learning methods. As shown below, loss function is a kind of unsupervised loss, which tends to make adjacent vertices have similar representations, while the representations of vertices far from each other become more different.

The above formula indicates that node U has similar embedding representation with neighbor node V that randomly walks to, and dissimilar embedding representation with non-neighboring node VN that is obtained by negative sampling.

As for the node embedding learned without supervised loss, it can continue to be used by downstream tasks. This practice adopts this method. Of course, specific loss functions can also be used for specific classification tasks, such as using cross entropy for classification prediction.

The positive and negative samples were determined manually by statistical features of nodes, and the coding vector of nodes was used as features for classification model training. The following figure shows the visualization of partial data mining results.

Intuitively speaking, it is reasonable to mine community groups, which are generally divided into three categories: red is a community, yellow is a community, and the rest (green) is automatically classified into one category.

Five, engineering realization


In order to give full play to the value of coding in practical applications, we propose an asynchronous quasi – real – time black-ash identification scheme. The following diagram illustrates the flow structure of this identification scheme.

Start a user request, the client can collect key factor to the user’s information, such as account ID, IP address, device number, phone number, etc., to write this portion of the log information into the staging area, the staging area to store all over the last 10 minutes (also can be some other time period, in general, log volume is larger, the shorter the time registers, Key factor information of users who have requested this client.

Key information over 10 minutes will be stored in the offline partition log library, and map construction will be carried out based on the partition log library, community mining of black ash production, community coding, and vector representation of training classification model. The classification model can use the partition log training of the past period of time regularly instead of real-time training.

For is in the key factor of the staging area, will the request of real-time map construction, user to the user as the center is sampled node vectors, and do use has trained classification model for the representation of the user vector black ash production forecast, if prediction for normal users, allows users on the client side of the operation, if the users to black ash, The user is denied client operations.

Below is a more detailed process of the training section:

Below is a more detailed process for the prediction part:


Sixth, innovation


In this practice, a community coding based black ash attack identification method is proposed. The main innovative technical points include:

Coding based on the community to have black ash production supervision identification method, compared with the existing pattern mining algorithm, the method does not rely on a single node properties directly, but the relations of the whole community structure of code into a vector said, said of black ash to produce more accurate, but also for does not appear in the history of black ash to produce accounts, also can through the similarity between the network structure, Identification by vector representation. At the same time, it can effectively avoid the wrong identification caused by noise (such as the black gray account connected to the shopping mall wifi).

As the result of graph coding itself has strong similarity between neighbors, weak similarity between non-neighbors. Therefore, unsupervised learning (such as density clustering and hierarchical clustering) can also be carried out by using the encoded vector to identify the internal relationships among IP, account ID, mobile phone number and device number of some black ash production, and mining clusters composed of similar black ash production.

The large scale graph embedding representation learning method is combined with the asynchronous prediction of small data sets by using the staging area, and the supervised model after coding is used for fast prediction, which provides a reference method for practical engineering applications.


Seven, part of the practical effect


The factor coding model is constructed including IP (IP address), cookie_id (cookie information), device_id (device ID), userid (userid), mobile (mobile phone number) and other 5 key factors. The coding features include scence (business scenario), Page (page), and risk (risk degree) (the coding features depend on the specific scenario).

According to 10 minutes as a window of temporary storage area, the number of factors is obtained in the order of millions, the number of coded relations is nearly ten million, and the length of coded vectors is about 300. After manual verification, it is risky to routinely produce IP dimension by using supervised method for EMbding results. The accuracy of this strategy can reach about 95% by using other strategies for cross verification. Unsupervised model has poor effect in practice and has not been used in practice.


Development and thinking

Based on the above practice findings, community coding method has a good positive effect in the identification of black ash production, but in the actual production process, there are still some problems, the following is a brief discussion.

Computing problems: Graph computing consumes computing resources and requires a lot of prediction in a short time during the complete process. Currently, combined with mini Batch training prediction, it takes about half an hour to complete the calculation of a log partition in an independent GPU environment.

Selection of key factors and coding features: The results of graph algorithms depend very much on graph structure and feature attributes, which brings great challenges to factor selection and feature engineering in the early stage.

Identify process automation issues: GraphSAGE algorithm itself have the ability to unsupervised recognition, but black ash identification in the process of production found it very vulnerable to the influence of center node (which can be solved by weight optimization to get a degree), so the current definition, still need to do some early community and need manual or other modeling method involved in this process, difficult to achieve full automation.

Job Information:

Baidu mobile ecology division MEG, user center recruiting r&d positions (PHP/GO/C++). We are mainly responsible for the company’s Passport, user assets, attributes, Baidu APP membership and other core business directions, and are committed to creating an efficient, convenient and safe user system. Join us if you are interested in Passport, Mega QPS services, distributed design & Governance.

Baidu Geek, a public account of the same name, said, “Just enter the internal tweet, we look forward to your joining us!”

Recommended reading:

The limit optimization of baidu C++ engineers (concurrent)

The limit optimization of baidu C++ engineers (memory)

Baidu large-scale Service Mesh implementation practice

The invention relates to a system and method based on real time bit calculation

———- END ———-

Baidu said Geek

Baidu official technology public number online!

Technical dry goods, industry information, online salon, industry conference

Recruitment information · Internal push information · technical books · Baidu surrounding

Welcome to your attention