At present, spammers on Zhihu tend to produce large quantities of similar spam content, or to perform specific behaviors in concentrated bursts, in order to achieve quick results. In response to this large-scale, similar, and relatively clustered nature, we have recently begun using clustering to discover and mine spammers. Anti-spam currently relies mainly on content-oriented and behavior-oriented clustering.
The purpose of clustering is to bring similar content and behaviors together. Common clustering methods include K-means and hierarchical clustering; there are also cluster analysis schemes based on density and on graphs.
Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that objects within a group are similar (related) to each other, while objects in different groups are different (unrelated). The greater the similarity (homogeneity) within groups and the greater the difference between groups, the better the clustering. (Introduction to Data Mining)
From the above definition, the measurement of similarity is one of the keys to clustering. Common similarity algorithms include edit distance, cosine similarity, Jaccard similarity, and the Pearson correlation coefficient. In this clustering work we used text similarity algorithms, mainly Jaccard and sim-hash.
Jaccard
Jaccard similarity takes the ratio of the size of the intersection of two sets to the size of their union as the similarity of the two sets. The similarity J(A, B) of sets A and B can be expressed as:

J(A, B) = |A ∩ B| / |A ∪ B|
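As an illustration, here is a minimal Scala sketch of Jaccard similarity over token sets (not the ZSimilarity implementation used in the timing test below):

```scala
object JaccardExample {
  // Jaccard similarity of two token sequences, treated as sets.
  def jaccard(a: Seq[String], b: Seq[String]): Double = {
    val (sa, sb) = (a.toSet, b.toSet)
    if (sa.isEmpty && sb.isEmpty) 1.0
    else (sa intersect sb).size.toDouble / (sa union sb).size
  }

  def main(args: Array[String]): Unit = {
    // Three tokens are shared and the union has five, so the similarity is 3/5 = 0.6.
    println(jaccard("this is a test".split(" "), "this is another test".split(" ")))
  }
}
```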
sim-hash
However, Jaccard does not perform well in large-data scenarios, so we also tried sim-hash. Sim-hash was proposed by Charikar in 2002 and was later applied at Google to detect near-duplicate web pages [1]. Sim-hash generates an N-bit fingerprint for the input text. Unlike traditional hash functions such as MD5 and SHA-1, sim-hash produces similar fingerprints for similar inputs: the more similar two texts are, the fewer binary digits (denoted k) differ between their fingerprints. For two texts, the steps for computing and comparing sim-hash are as follows:
- Word segmentation: segment the text into words and remove stop words and other common words (e.g., "is", "in"). TF-IDF (term frequency-inverse document frequency) is then used as the weight of each word, where TF is the frequency of the word in the text and IDF reflects how common the word is (i.e., the number of texts containing it); the more common a word is, the lower its IDF. The TF-IDF weight is the product of TF and IDF, so words with relatively high TF-IDF weight become the keywords of the text. See reference [2] for a more detailed explanation.
- Hash: compute the hash value of each word, converting strings into numbers to improve computation efficiency.
- Weighting: for each bit of a word's hash, a 1 contributes the word's weight positively and a 0 contributes it negatively.
- Merging: add the weighted values of all words column by column (bit by bit) to obtain a sequence of numbers.
- Dimensionality reduction: convert the sequence from the previous step into bits: positions greater than 0 become 1, and the rest become 0.
- Comparison: compute the Hamming distance between the two generated sim-hash values, i.e., the number of bit positions in which the two fingerprints differ. A code sketch of the whole process is given below the diagram.
Diagram of the sim-hash generation process
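To make the steps concrete, here is a minimal, self-contained Scala sketch of fingerprint generation and comparison. It is an illustration only: it assumes whitespace tokenization with uniform weights instead of real word segmentation plus TF-IDF, uses MurmurHash3 as the per-word hash, and is not the ZSimHash implementation used in the test below.

```scala
import scala.util.hashing.MurmurHash3

object SimHashSketch {
  val Bits = 64

  // Generate a 64-bit sim-hash fingerprint for a text.
  def simHash(text: String): Long = {
    // Segmentation (simplified): whitespace tokens with a uniform weight of 1 instead of TF-IDF.
    val tokens = text.toLowerCase.split("\\s+").filter(_.nonEmpty)
    val acc = new Array[Int](Bits)                      // column-wise accumulator
    for (token <- tokens) {
      // Hash: build a 64-bit value from two 32-bit MurmurHash3 hashes.
      val h = (MurmurHash3.stringHash(token).toLong << 32) |
              (MurmurHash3.stringHash(token.reverse).toLong & 0xffffffffL)
      val weight = 1
      // Weighting + merging: a 1-bit adds the weight, a 0-bit subtracts it.
      for (i <- 0 until Bits) {
        if (((h >>> i) & 1L) == 1L) acc(i) += weight else acc(i) -= weight
      }
    }
    // Dimensionality reduction: positive columns become 1, the rest 0.
    (0 until Bits).foldLeft(0L) { (fp, i) => if (acc(i) > 0) fp | (1L << i) else fp }
  }

  // Comparison: Hamming distance = number of differing bits between two fingerprints.
  def hammingDistance(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)

  def main(args: Array[String]): Unit = {
    val f1 = simHash("this is a test")
    val f2 = simHash("this is another test")
    println(s"k = ${hammingDistance(f1, f2)}")
  }
}
```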
We tested the time consumption of the two methods over 1,000,000 (100W) comparisons; the test code is as follows:
```scala
def test(): Unit = {
  val s1 = ""
  val s2 = "this is a test"

  var t1 = System.currentTimeMillis()
  for (i <- 0 to 1000000) {
    val dis = ZSimilarity.jaccard(s1.split(""), s2.split(""))
  }
  var t2 = System.currentTimeMillis()
  println("jaccard elapsed time: " + (t2 - t1) + " ms")

  // With sim-hash, the fingerprints are computed once and only the
  // Hamming-distance comparison is repeated.
  t1 = System.currentTimeMillis()
  val hash_s1 = ZSimHash.hash(s1, 64)
  val hash_s2 = ZSimHash.hash(s2, 64)
  for (i <- 0 to 1000000) {
    val dis = ZSimHash.hammingDistance(hash_s1, hash_s2, 64)
  }
  t2 = System.currentTimeMillis()
  println("sim-hash elapsed time: " + (t2 - t1) + " ms")
}
```
The results show:
```
Jaccard: 21772 ms
Sim-hash: 9981 ms
```
For short text, sim-hash cuts detection time by more than half, and the gap is even more pronounced for long text. Using sim-hash therefore effectively shortens the time needed for similarity detection.
According to our tests, for 64-bit sim-hash fingerprints, k=3 is a reasonable threshold for judging whether two texts are similar: at k=3, both recall and precision are at a satisfactory level (about 75%). To obtain both high recall and high precision, in practice we use k=4 to ensure recall, and then run Jaccard similarity again within each group produced by sim-hash to improve precision.
Precision and recall curves under different thresholds
At present, nearly ten million write actions are generated on the site's web side every day. Take private messages as an example: if a single process compares the similarity of 100,000 (10W) private messages per day, each message must be compared pairwise with the other 99,999. According to the experimental data above, one such pass takes nearly 1 s with sim-hash, so detecting all 100,000 messages once takes about 27 hours. In this scenario, how can the full data set be clustered quickly and effectively?
Spark is our current answer. Spark is a high-performance distributed computing framework whose computation is memory-based: intermediate results and data sets can be kept in memory, greatly reducing disk I/O and network communication time. Compared with the map-reduce model, it provides richer operators (e.g. filter, flatMap, etc.), which fall into two classes: transformations and actions. All transformations are lazily evaluated: calling a transformation does not trigger computation immediately; computation is only triggered when an action is called. This also helps Spark's fault tolerance: if a task fails on one node, recomputing it on a new node does not cost much.
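A minimal sketch of the transformation/action distinction (an illustration run locally, not our production job):

```scala
import org.apache.spark.sql.SparkSession

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("a b", "b c", "c d"))
    // Transformations: only the lineage is recorded, nothing is computed yet.
    val words    = lines.flatMap(_.split(" "))
    val filtered = words.filter(_ != "a")
    // Action: triggers the actual computation.
    println(filtered.count())

    spark.stop()
  }
}
```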
Content clustering
Before using Spark, data was prepared with HiveQL and Python scripts, which was inefficient for large amounts of data. Spark integrates seamlessly with Hive, improving data-processing efficiency. On Hive, HiveQL is converted into a series of map-reduce jobs that run on the Hadoop platform; when the same statements are executed on Spark, HiveQL is converted into a series of transformations and actions. Spark reads the Hive metadata, converts the underlying data into Spark RDDs, and computes on those. Thanks to Spark's rich operators and in-memory computation, and because these operators can replace the data-cleaning work previously done by Python scripts, execution with Spark SQL is at least 10 times faster than with HiveQL.
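As an example of this preparation step, a sketch of reading and lightly cleaning data from Hive with Spark SQL (the table and column names are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

object HivePrepare {
  def main(args: Array[String]): Unit = {
    // A Hive-enabled session; dw.private_message and its columns are hypothetical names.
    val spark = SparkSession.builder()
      .appName("content-prepare")
      .enableHiveSupport()
      .getOrCreate()

    // Read private messages from Hive and do the light cleaning that
    // previously required separate Python scripts.
    val messages = spark.sql("SELECT id, trim(content) AS content FROM dw.private_message")
      .na.drop(Seq("content"))

    messages.cache()            // keep the intermediate result in memory for later stages
    println(messages.count())
    spark.stop()
  }
}
```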
Content clustering is implemented through graph partitioning: we construct a similarity graph G=(V, E), with each document as a vertex and the similarity between two documents as the weight of the edge connecting them. For sim-hash, the weight of the edge between two vertices is the Hamming distance of the two fingerprints. As shown in the figure below, taking k=3 as the threshold, keeping only the edges that satisfy the threshold as a new subgraph (the black edges in the figure), and computing the connected components of that subgraph yields two clusters, (1, 2, 3) and (4, 5, 6, 7).
In actual use, we first tried the connectedComponents interface provided by GraphX (Spark's API for graphs and graph-parallel computation, see GraphX | Apache Spark), but found that with large amounts of data its repeated iteration caused significant performance problems. Therefore, exploiting the propagation of text similarity (if A is similar to B and B is similar to C, then A is similar to C), we used Spark SQL to transform the problem into finding, for each node, the smallest node it is similar to. For example, in the figure below, the smallest node similar to 1 is 1, the smallest node similar to 2 is 1, and the smallest node similar to 3 is also 1; since the smallest similar node of all three is 1, they belong to the same cluster. Nodes 4, 5, 6 and 7 belong to another independent cluster, because their smallest similar node is 4.
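A simplified sketch of this "smallest similar node" idea with Spark SQL, using the pairs from the figure; labels are propagated iteratively until each node carries the smallest node id it is transitively similar to (the structure and names are assumptions for illustration, not our production job):

```scala
import org.apache.spark.sql.SparkSession

object MinLabelClustering {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("min-label").master("local[*]").getOrCreate()
    import spark.implicits._

    // Similar pairs from the figure, stored in both directions plus self-loops.
    val pairs = Seq((1, 2), (2, 3), (4, 5), (5, 6), (6, 7), (5, 7))
    val edges = (pairs ++ pairs.map(_.swap) ++ pairs.flatMap(p => Seq((p._1, p._1), (p._2, p._2))))
      .distinct.toDF("src", "dst")
    edges.createOrReplaceTempView("edges")

    // Every node starts labelled with itself.
    edges.selectExpr("src AS node", "src AS label").distinct().createOrReplaceTempView("labels")

    var changed = true
    while (changed) {
      // New label of a node = the smallest current label among its similar neighbours (and itself).
      val updated = spark.sql(
        """SELECT e.src AS node, MIN(l.label) AS label
          |FROM edges e JOIN labels l ON e.dst = l.node
          |GROUP BY e.src""".stripMargin).cache()
      changed = updated
        .join(spark.table("labels").withColumnRenamed("label", "old"), "node")
        .where($"label" < $"old").count() > 0
      updated.createOrReplaceTempView("labels")
    }

    spark.table("labels").orderBy("node").show()  // nodes 1-3 get label 1, nodes 4-7 get label 4
    spark.stop()
  }
}
```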
Cluster segmentation diagram
Because Jaccard similarity is expensive to compute, in practice sim-hash is used first to improve the efficiency of similarity computation: the input data is pre-grouped using 64-bit fingerprints with k=4 as the threshold. Under these conditions recall is high but precision is relatively low, so Jaccard is then used to subdivide each group and improve precision. This approach greatly reduces the number of Jaccard similarities that must be computed, and also makes up for the lower precision of sim-hash when recall is high. In addition, to avoid wasting computing resources, the similarity of two connected nodes is computed only once. Even so, the similarity comparison is still close to a Cartesian-product computation; to improve the efficiency of this part, Spark's broadcast mechanism is used to cache a read-only variable in the memory of all worker nodes, reducing communication overhead during the computation.
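A minimal sketch of how broadcasting can serve this kind of pairwise comparison (the fingerprints and names are made up for illustration; the pre-grouping and Jaccard refinement steps are omitted):

```scala
import org.apache.spark.sql.SparkSession

object BroadcastCompare {
  // Hamming distance between two 64-bit sim-hash fingerprints.
  def hamming(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-compare").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // (document id, sim-hash fingerprint); in practice this comes from the prepared Hive data.
    val docs = sc.parallelize(Seq((1L, 0x0fL), (2L, 0x0eL), (3L, 0xf0L)))

    // Broadcast the fingerprint list so every executor keeps one read-only copy in memory
    // instead of shipping it with every task.
    val all = sc.broadcast(docs.collect())

    // For each document, emit similar pairs (k <= 4); each pair is computed only once (id < otherId).
    val similarPairs = docs.flatMap { case (id, fp) =>
      all.value
        .filter { case (otherId, otherFp) => id < otherId && hamming(fp, otherFp) <= 4 }
        .map { case (otherId, _) => (id, otherId) }
    }

    similarPairs.collect().foreach(println)   // expect (1,2): those fingerprints differ by one bit
    spark.stop()
  }
}
```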
Currently, for private messages, clustering can be completed within 1-3 minutes; Spark has greatly improved the efficiency of data processing.
Behavior clustering
The main idea of behavior clustering is to express a user's behavior path as text, turning behavior clustering into content clustering, so that similar behaviors are grouped together through text-similarity clustering.
Behavior path expression: for a user's write (post) behavior, take the surrounding requests (at least two before and after), compute the time interval between consecutive requests, and express the user's behavior sequence as text by concatenating items of the form "request method | request path | time interval from the previous request", as in the sketch below.
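For illustration, a small sketch of how such a behavior text might be assembled (the field layout is an assumption based on the description above):

```scala
object BehaviorPath {
  // One logged request: HTTP method, path, and timestamp in milliseconds.
  case class Request(method: String, path: String, ts: Long)

  // Express a request sequence as "method|path|interval-from-previous-request" tokens.
  def toBehaviorText(requests: Seq[Request]): String = {
    val sorted = requests.sortBy(_.ts)
    sorted.zip(Request("", "", sorted.head.ts) +: sorted)
      .map { case (cur, prev) => s"${cur.method}|${cur.path}|${cur.ts - prev.ts}" }
      .mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    val reqs = Seq(
      Request("GET",  "/question/123",  1000L),
      Request("POST", "/answer",        4000L),
      Request("GET",  "/notifications", 6000L))
    println(toBehaviorText(reqs))
    // GET|/question/123|0 POST|/answer|3000 GET|/notifications|2000
  }
}
```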
Compared with content clustering, behavior clustering faces a larger amount of data: even after data cleaning there are roughly 300,000 (30W) key business write actions per day. Since comparing the similarity of write behaviors across different business types is not meaningful, clustering is performed separately for each business, reducing the 30W x 30W computation to 1W x 1W + 3W x 3W + ... . Since the implementation is largely the same as the logic of content clustering, it is not described further here.
Summary: Clustering is an effective alternative to manual strategies against spammers who post content and perform behaviors in bulk. At present, both behavior clustering and content clustering run online as offline (batch) jobs. Clustering is anti-spam's first attempt at using Spark; going forward we will continue to optimize and improve its processing efficiency, and Spark Streaming will also be used to improve the real-time performance of clustering.
Author: Zhou Aote, Sun Xian and Chen Lei
We would also like to thank the other students on the anti-cheating team for their help.
Reference
[1] Detecting Near-Duplicates for Web Crawling
[2] TF-IDF and cosine similarity application