
As black-market activity on the Internet has developed, cheating has become large-scale and industrialized, and gang cheating is increasingly rampant. To further improve the security and user experience of Baidu accounts and protect the company's core interests, the Baidu account security strategy team drew on its strengths in account security to build an association-graph capability for mining black-market gangs. It handles massive data volumes, scales well, and has been put into production and extended to more scenarios. The team is also exploring the value of cutting-edge techniques such as graph neural networks in risk-control and anti-cheating scenarios, and is committed to building an efficient and complete graph-based risk-control and anti-cheating capability.

The full text is 3770 words, and the expected reading time is 14 minutes.

1. Overview

According to a statistical report on Internet development in China, the number of Internet users in China reached 1.011 billion as of June 2021. With such a large user base, Internet businesses continue to grow rapidly. The growing Internet ecosystem has naturally spawned a series of black and grey market businesses hiding in its corners. As technology has advanced, the black and grey market has moved from small-workshop cheating to pipelined, large-scale, industrialized cheating. It has grown to a scale in the billions or more and cheats in a wide range of business scenarios. It first breaks into the account system and then moves into specific business scenarios, where it engages in fake-volume brushing, bonus hunting ("wool pulling"), traffic diversion, fraud, money laundering and other fraudulent activity. The black and grey market not only causes financial losses to Internet companies, but in the long run also harms users' service experience and property security and threatens the sustainable, healthy development of the business.

To crack down effectively on black-market cheating gangs and safeguard the company's basic security, the security strategy team started from the account dimension, built a graph-based anti-cheating framework, and has kept exploring how graph techniques apply to risk-control and anti-cheating scenarios. So far, the graph-based work mainly covers two capabilities: gang mining and graph node representation.

2. Gang mining

In real business scenarios, black-market cheating gangs are limited by resources, cost and other practical constraints, so they often share resources. This sharing is the starting point for mining black-market gangs. The traditional method screens out some related accounts by filtering on statistical feature factors, but it struggles to dig out the whole related gang and can only handle the problem case by case. The following compares the traditional method with mining cheating gangs on a basic association graph (the case data has been desensitized):

Figure 1 Case analysis

The left side of Figure 1 shows that the case account has used a considerable number of feature factors and one device. Converting these relationships into a graph gives the structure on the right side of the figure (accounts in blue, feature factors in red, devices in green). By following these associated factors we can retrieve a batch of accounts that have used the same suspicious factors, which is the core idea of graph-based gang mining. Digging out the whole gang this way, one case at a time, takes a lot of time. In fact, as Figure 2 shows, this account is only the tip of the iceberg of the whole black-market gang, which is hard to uncover with traditional methods.

Figure 2 The gang to which the case account belongs
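The core idea described above, linking accounts through the feature factors and devices they share, can be sketched as a connected-components computation. This is a minimal illustration, not the team's production pipeline: the edge list, the `acct:`/`factor:`/`device:` node naming and the function `mine_gangs` are hypothetical, and an in-memory union-find stands in for whatever distributed computation is used on billions of edges.

```python
from collections import defaultdict


def mine_gangs(edges, min_accounts=2):
    """Group accounts that are connected through shared factors or devices.

    edges: iterable of (account_id, resource_id) pairs (hypothetical format).
    Returns a list of account sets, one per candidate gang.
    """
    parent = {}

    def find(x):
        # Find the root of x, then compress the path to it.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    accounts = set()
    for account, resource in edges:
        accounts.add(account)
        union(account, resource)

    # Accounts with the same root belong to the same connected component.
    groups = defaultdict(set)
    for account in accounts:
        groups[find(account)].add(account)
    return [g for g in groups.values() if len(g) >= min_accounts]


# Toy data: a1 and a2 share a device, a2 and a3 share a factor, a4 is isolated.
edges = [
    ("acct:a1", "device:d1"),
    ("acct:a2", "device:d1"),
    ("acct:a2", "factor:f1"),
    ("acct:a3", "factor:f1"),
    ("acct:a4", "factor:f2"),
]
gangs = mine_gangs(edges)
```

Starting from the single case account `acct:a1`, the component also pulls in `acct:a3`, which shares nothing with `acct:a1` directly. That transitive reach is what case-by-case factor screening misses.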

The example above shows the advantage of association graphs in gang mining. Building on existing business scenarios, the team constructed an association graph framework that covers different scenarios, different time granularities (day, week, month) and different graph types (homogeneous and heterogeneous graphs), involving many node types and many complex edge relationship features. Figure 3 shows the framework.

Figure 3 Association graph framework

In production, the graph has billions of nodes and edges to process, which is a huge challenge. The team redesigned and optimized the entire computation pipeline so that the architecture handles massive data and is highly extensible: different heterogeneous gangs can be mined through simple configuration, and new business scenarios can be added. Working across scenarios, and combining the original business data with the account security information unique to the account system, allows a more comprehensive mining and analysis of black-market gangs. In addition, case analysis and recall expansion based on the association graph are already in use in real business.

In real business, the association graph is used for gang mining, both to find suspected gangs related to a case and to monitor abnormal gang cheating. In newly connected business scenarios, all of the suspected gangs mined through the association graph showed gang cheating to varying degrees. However, new techniques also bring new challenges. Because the graph links different accounts through shared features, a form of hard binding, the association between accounts is not always reliable, which often leads to the following problems:

  1. Gangs linked by hard associations such as device information are not necessarily black-market cheating gangs. Ordinary accounts may also share devices and use public networks. Since not every gang mined by association is a black-market cheating gang, mined gangs need to be classified and characterized.

  2. In real business, dirty data, long time spans, resources shared across black-market gangs, account trading and other factors can produce an oversized gang graph, which may contain some normal accounts or accounts from different gangs.

These problems led to further practice and exploration of graph techniques.

3. Gang node representation

Some of the problems above can be mitigated by filtering with limiting conditions and by defining weights, which reduces their impact on the association graph as a whole. However, such a one-size-fits-all approach struggles with the complex edge relationships and multiple node types in the graph. The team therefore explored a further graph technique: representing the nodes within a gang.
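To make the "limiting conditions and weights" filtering concrete, here is a minimal sketch under assumed rules. The thresholds, the per-edge weight field and the function `filter_edges` are all hypothetical. It drops resources shared by too many accounts (for example a public network) and edges whose weight is too low.

```python
from collections import Counter


def filter_edges(edges, max_accounts_per_resource=3, min_weight=1.0):
    """Drop edges that are likely to create false associations.

    edges: iterable of (account_id, resource_id, weight) triples.
    The thresholds and the weight field are illustrative assumptions.
    """
    edges = list(edges)
    # Count how many distinct accounts touch each resource.
    accounts_per_resource = Counter()
    for _account, resource in {(a, r) for a, r, _ in edges}:
        accounts_per_resource[resource] += 1

    kept = []
    for account, resource, weight in edges:
        if accounts_per_resource[resource] > max_accounts_per_resource:
            continue  # e.g. a public network shared by many ordinary users
        if weight < min_weight:
            continue  # association too weak to trust
        kept.append((account, resource, weight))
    return kept


edges = [
    ("a1", "device:d1", 2.0),
    ("a2", "device:d1", 2.0),
    ("a1", "ip:public", 1.0),
    ("a2", "ip:public", 1.0),
    ("a3", "ip:public", 1.0),
    ("a4", "ip:public", 1.0),
    ("a3", "factor:f1", 0.5),
]
kept = filter_edges(edges)
```

A single global threshold like this treats every resource type and edge type the same way, which is the bluntness the paragraph above points out.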

Node representation uses deep learning to abstract the features of a single account node into a vector of fixed dimension. Once an account's features are vectorized in this way, the vector can support further downstream work, such as predicting associations between nodes, node clustering and node classification. Node representation on a graph considers not only the feature information of the account node itself but also the structural information of the graph around it, mainly the node's neighbor information and edge relationship information.

The team surveyed various node representation methods, including graph embedding methods such as DeepWalk [1], LINE [2] and node2vec [3], most of which are based on random walks, and graph neural network methods such as GCN [4], GAT [5], GraphSAGE [6] and PinSAGE [7].

In the account business scenario, account features are sparse, the number of nodes is huge and there are no explicit labels, so the node representation model is trained on a link prediction task. Given the overall data volume and the fact that the graph changes dynamically, the team adapted the GraphSAGE model for link prediction between nodes. First, the neighborhood of each target node is sampled locally using random walks to obtain its neighbor nodes. A two-layer GraphSAGE structure then aggregates the two-hop neighbor information of the target node, and the prediction is obtained by combining the representation vectors of the two target nodes. The model is trained in a semi-supervised manner with cross entropy as the loss function and mini-batch training. The model architecture is shown in Figure 4 below.

Figure 4 Link prediction framework
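The two-layer aggregation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the team's model. Uniform sampling without replacement stands in for the random-walk-based sampling, identity matrices stand in for learned weights, and the names `sample_neighbors`, `matvec` and `sage_sum_layer` are hypothetical.

```python
import random


def sample_neighbors(adj, node, k, rng):
    """Return up to k neighbors of node, sampled uniformly without replacement."""
    neighbors = adj.get(node, [])
    return sorted(rng.sample(neighbors, min(k, len(neighbors))))


def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sage_sum_layer(adj, feats, w_self, w_neigh, k, rng):
    """One GraphSAGE-style layer: sum sampled neighbor vectors, then ReLU."""
    out = {}
    for node, h in feats.items():
        agg = [0.0] * len(h)
        for nb in sample_neighbors(adj, node, k, rng):
            agg = [a + b for a, b in zip(agg, feats[nb])]
        z = [s + n for s, n in zip(matvec(w_self, h), matvec(w_neigh, agg))]
        out[node] = [max(0.0, v) for v in z]
    return out


# Toy path graph a - b - c with 2-dimensional node features.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
feats = {"a": [1.0, 0.0], "b": [0.0, 0.0], "c": [0.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights
rng = random.Random(0)

h1 = sage_sum_layer(adj, feats, identity, identity, k=2, rng=rng)
h2 = sage_sum_layer(adj, h1, identity, identity, k=2, rng=rng)
```

After the second layer, the vector for `a` has a nonzero second component that originated in `c`, two hops away. A single layer would only see `b`.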

As shown in Formula (1), the model takes the node features as input, together with the subgraph structure of the target nodes and the relationship pairs of the target nodes. Formulas (2) to (4) describe how the first layer of the model aggregates each node with its neighbor nodes.

The model obtains the final link prediction from the dot product of the representation vectors of the target node pair, as shown in Formula (5), and its parameters are optimized with stochastic gradient descent.

score = \sigma(e_i \cdot e_j) \quad (5)
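Formula (5) and the cross-entropy loss mentioned earlier can be written out in a few lines. This is a minimal sketch that takes \sigma to be the logistic sigmoid. The function names are hypothetical, and labelling unlinked pairs with 0 is an assumption, since the article does not say how negative pairs are chosen.

```python
import math


def link_score(e_i, e_j):
    """Formula (5): sigmoid of the dot product of two node vectors."""
    dot = sum(a * b for a, b in zip(e_i, e_j))
    # Numerically stable sigmoid for large positive or negative inputs.
    if dot >= 0:
        return 1.0 / (1.0 + math.exp(-dot))
    exp_dot = math.exp(dot)
    return exp_dot / (1.0 + exp_dot)


def cross_entropy(pairs, labels, eps=1e-12):
    """Mean binary cross-entropy over (e_i, e_j) pairs with 0/1 labels."""
    total = 0.0
    for (e_i, e_j), y in zip(pairs, labels):
        p = link_score(e_i, e_j)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(labels)


linked = ([1.0, 2.0], [2.0, 1.0])     # dot product 4, score close to 1
unlinked = ([1.0, 0.0], [-3.0, 5.0])  # dot product -3, score close to 0
loss = cross_entropy([linked, unlinked], [1, 0])
```

Pairs whose vectors point the same way get a score near 1, so minimizing this loss pulls linked accounts together in the vector space and pushes unlinked ones apart.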

For comparison, MLP and GCN models were also implemented, and all three models used the same base vectors and parameter settings to generate representation vectors for the same set of accounts. To show intuitively how well the generated vectors separate nodes, the account nodes belonging to the top 25 gangs in the association graph were selected, with the gang number used as the color label, and t-SNE and UMAP dimensionality reduction were applied for visual comparison. The t-SNE results are shown below.

Figure 5 shows the three-dimensional distribution, after t-SNE dimensionality reduction, of the node representation vectors generated by GraphSAGE-sum. Figures 6 and 7 show the corresponding distributions for MLP and GCN. The GraphSAGE-sum vectors separate the gangs noticeably better than the others. Points with the same color number belong to the same gang. Because the gang labels of the association graph are used as the reference, gangs with different labels may in fact be the same gang, which shows up as overlapping color numbers. In the GraphSAGE plot, points with the same color label cluster more tightly, different gangs are clearly separated, and fewer gangs with different color labels overlap. (Note: there are more gang labels than available colors, so color and label number should be read together to distinguish gangs.)

Figure 5 t-SNE dimensionality reduction of node representations generated by GraphSAGE

Figure 6 t-SNE dimensionality reduction of node representations generated by MLP

Figure 7 t-SNE dimensionality reduction of node representations generated by GCN

Once the node representation model is available, it supports a variety of downstream tasks, including predicting associations between nodes, node classification, generating gang representation vectors and node clustering. Take gang characterization in real business as an example: compared with an XGBoost [8] classification model that uses only basic account-dimension statistical features, adding the node vector features brought preliminary test results to the 90%+ level. We expect the effect to improve further once gang-level data is used for training and the model is applied to classify and characterize all real gangs.
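Two of the downstream uses mentioned above, feeding node vectors to a classifier alongside statistical features and generating a vector for a whole gang, come down to simple vector operations. The sketch below is illustrative only. Mean pooling for the gang vector is an assumption rather than the article's stated method, the names and toy values are hypothetical, and the classifier itself is omitted.

```python
def build_features(stat_features, node_vectors, account):
    """Concatenate an account's statistical features with its node vector."""
    return list(stat_features[account]) + list(node_vectors[account])


def gang_vector(node_vectors, members):
    """Average the member node vectors (mean pooling is an assumption)."""
    dim = len(node_vectors[members[0]])
    sums = [0.0] * dim
    for member in members:
        for i, value in enumerate(node_vectors[member]):
            sums[i] += value
    return [s / len(members) for s in sums]


# Toy inputs: two statistical features and a 2-dimensional node vector per account.
stat_features = {"a1": [3.0, 0.5], "a2": [7.0, 0.1]}
node_vectors = {"a1": [1.0, 3.0], "a2": [3.0, 5.0]}

row = build_features(stat_features, node_vectors, "a1")
pooled = gang_vector(node_vectors, ["a1", "a2"])
```

Rows built this way could be passed to any tabular classifier, so a statistics-only baseline and a statistics-plus-vector model can be compared on the same accounts.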

4. Outlook

This article has introduced the practice and exploration of graph techniques in risk control and anti-cheating. Some of them have been put into use with good results, but a number of problems still need to be solved:

  1. What downstream tasks can be designed on top of the node representation model to address oversized gangs and gang characterization in the association graph?

  2. The node representation model is currently heavily constrained by GPU resources. How can node representation vectors be produced efficiently on the graph, and how can the model's generalization be further improved?

The overall graph-based risk-control and anti-cheating framework still needs improvement. Beyond the capabilities described above, more techniques remain to be explored, studied and applied, such as graph sampling, graph representation, graph visualization and real-time graph processing.

References:

[1] Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014: 701-710.

[2] Tang J, Qu M, Wang M, et al. LINE: Large-scale information network embedding[C]//Proceedings of the 24th international conference on world wide web. 2015: 1067-1077.

[3] Grover A, Leskovec J. node2vec: Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016: 855-864.

[4] Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks[J]. arXiv preprint arXiv:1609.02907, 2016.

[5] Veličković P, Cucurull G, Casanova A, et al. Graph attention networks[J]. arXiv preprint arXiv:1710.10903, 2017.

[6] Hamilton W L, Ying R, Leskovec J. Inductive representation learning on large graphs[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 1025-1035.

[7] Ying R, He R, Chen K, et al. Graph convolutional neural networks for web-scale recommender systems[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 974-983.

[8] Chen T, He T, Benesty M, et al. Xgboost: extreme gradient boosting[J]. R package version 0.4-2, 2015, 1(4): 1-4.


