I haven’t updated for more than half a month, because I’ve been a little busy (actually lazy) lately.

I recently built a comment feature: once a user posts a comment, it shows up under the article. But as a successor of socialism who practices the socialist core values, I couldn’t skip a sensitive-word filter for those comments. After searching online, I found there are already very mature solutions. Two approaches are common:

  1. Full-text search, matching word by word. This doesn’t sound very sophisticated, and with a large amount of data its efficiency becomes a real concern.
  2. The DFA algorithm (Deterministic Finite Automaton).

DFA algorithm introduction

DFA is a kind of computational model. Its data source is a finite set, and the next state is determined entirely by the current state plus an event (state + event => next state), which gradually builds up a directed graph whose nodes are the states. Searching with a DFA therefore only involves lookups and judgments, with no complicated computation, which is what makes the algorithm efficient.
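A minimal sketch of that idea, with a hypothetical transition table that recognizes the single word 'Jap' (this only illustrates the model, not the implementation used later):

```typescript
type State = string;

// Hypothetical transition table: (current state, event) determines the next state.
const transitions: Record<State, Record<string, State>> = {
  start: { J: 'sawJ' },
  sawJ: { a: 'sawJa' },
  sawJa: { p: 'accept' }, // 'accept' is the end state
};

// Stepping is a pure lookup, no computation.
function step(state: State, event: string): State | undefined {
  return transitions[state]?.[event];
}

console.log(step('start', 'J')); // 'sawJ'
console.log(step('sawJ', 'a'));  // 'sawJa'
console.log(step('sawJa', 'p')); // 'accept'
```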

See the article Implementing sensitive word filtering in Java.

Implementation logic

Constructing data structures

We convert the sensitive words into a tree structure. For example, if the sensitive words are ['Jap', 'Japanese', 'Japanese man'], the data structure looks like the following (picture cited from the reference article).
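A rough text sketch of that tree: the three words share one chain, with an end marker on the last character of each word.

```
J
└─ a
   └─ p                 ← end of "Jap"
      └─ a
         └─ n
            └─ e
               └─ s
                  └─ e   ← end of "Japanese"
                     └─ (space)
                        └─ m
                           └─ a
                              └─ n   ← end of "Japanese man"
```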

Each character is a node, and consecutive nodes form a word; 'Japanese' corresponds to a chain in the middle of the tree. We can use an object or a Map to build the tree; in this example a Map is used to construct the nodes. Each node carries a status flag that marks whether it is the last character of a word, and each link points to a destination node. Let’s look at the flow for building a node.

Judgment logic

Start checking from the first character of the text. Take the text 'you are Japanese' as an example: the first character, 'y', is not found in the first layer of tree nodes, so we move on to the next character. When 'J' is found in the first layer, we look for the following character in the next layer of nodes, and at each step we also check whether the current node is an end node. If an end node is reached, the match succeeds; otherwise we keep matching.
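As a concrete trace, here is a minimal sketch of that scan, assuming the lexicon contains only the word 'Japanese':

```
Text after filtering: "youareJapanese"

'y' → not in the first layer of nodes, move to the next start position
'o' → not found, move on
... (same for 'u', 'a', 'r', 'e')
'J' → found in the first layer, descend one level
'a' → found under 'J', descend
'p' → found, descend
... (continue through 'a', 'n', 'e', 's')
'e' → found, and the node is flagged as an end node → match: "Japanese"
```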

Code implementation

Constructing the data structure

/**
 * @description Build the sensitive word map
 * @private
 * @returns
 */
private makeSensitiveMap(sensitiveWordList) {
    // Construct the root node
    const result = new Map();
    for (const word of sensitiveWordList) {
        let map = result;
        for (let i = 0; i < word.length; i++) {
            // Get the characters one by one
            const char = word.charAt(i);
            // Check whether this character already exists
            if (map.get(char)) {
                // Get the node at the next level
                map = map.get(char);
            } else {
                // The current node is no longer an end node
                if (map.get('laster') === true) {
                    map.set('laster', false);
                }
                const item = new Map();
                // A new node is an end node by default
                item.set('laster', true);
                map.set(char, item);
                map = map.get(char);
            }
        }
    }
    return result;
}

The final map structure is as follows
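For example, the structure built for makeSensitiveMap(['Jap', 'Japanese']) would be the following nested Maps (a sketch, printed in Node-console style):

```
Map {
  'J' => Map {
    'laster' => false,
    'a' => Map {
      'laster' => false,
      'p' => Map {
        'laster' => false,   // reset when 'Japanese' extends 'Jap'
        'a' => Map {
          'laster' => false,
          'n' => Map { ... 'e' => Map { 'laster' => true } }  // end of "Japanese"
        }
      }
    }
  }
}
```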

Finding sensitive words

/**
 * @description Check whether a sensitive word starts at the given index
 * @private
 * @param {any} sensitiveMap
 * @param {any} txt
 * @param {any} index
 * @returns
 */
private checkSensitiveWord(sensitiveMap, txt, index) {
    let currentMap = sensitiveMap;
    let flag = false;
    let wordNum = 0; // Number of matched characters
    let sensitiveWord = ''; // The sensitive word that was matched
    for (let i = index; i < txt.length; i++) {
        const word = txt.charAt(i);
        currentMap = currentMap.get(word);
        if (currentMap) {
            wordNum++;
            sensitiveWord += word;
            if (currentMap.get('laster') === true) {
                // Reached the end of a word
                flag = true;
                break;
            }
        } else {
            break;
        }
    }
    // A match only counts as a sensitive word if it is at least two characters long
    if (wordNum < 2) {
        flag = false;
    }
    return { flag, sensitiveWord };
}
/**
 * @description Check whether the text contains sensitive words
 * @param {any} txt
 * @returns
 */
public filterSensitiveWord(txt, sensitiveMap) {
    let matchResult = { flag: false, sensitiveWord: '' };
    // Strip everything except Chinese characters, English letters, and digits
    const txtTrim = txt.replace(/[^\u4e00-\u9fa5\u0030-\u0039\u0061-\u007a\u0041-\u005a]+/g, '');
    for (let i = 0; i < txtTrim.length; i++) {
        matchResult = this.checkSensitiveWord(sensitiveMap, txtTrim, i);
        if (matchResult.flag) {
            console.log(`sensitiveWord: ${matchResult.sensitiveWord}`);
            break;
        }
    }
    return matchResult;
}
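A minimal usage sketch, assuming the methods above sit on a class (hypothetically named SensitiveWordFilter here) and are callable, with a placeholder lexicon:

```typescript
const filter = new SensitiveWordFilter();

// Build the word map once, then reuse it for every comment.
const sensitiveMap = filter.makeSensitiveMap(['Jap', 'fool']);

const result = filter.filterSensitiveWord('he is a Jap!', sensitiveMap);
console.log(result); // { flag: true, sensitiveWord: 'Jap' }
```

Note that because makeSensitiveMap clears the end flag on a node when a longer word extends it, a word that is a strict prefix of another (e.g. 'Jap' alongside 'Japanese') is only matched through the longer word; the placeholder lexicon above avoids that case.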

Efficiency

To see how efficient DFA is, I ran a simple test: a text of 5,095 Chinese characters against a lexicon of 2,000 sensitive words, comparing the DFA algorithm with the indexOf API provided by the native String object.

// Simple string matching - indexOf
sensitiveWords.forEach((word) => {
    if (ss.indexOf(word) !== -1) { // ss is the test text
        console.log(word);
    }
});
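The timing itself can be collected with a small harness; a minimal sketch, assuming the `filter`, `sensitiveMap`, `ss`, and `sensitiveWords` variables from the snippets above (the 5,095-character text and 2,000-word lexicon are not reproduced here):

```typescript
// Run a function `runs` times and report the average and maximum duration.
function time(label: string, fn: () => void, runs = 100): void {
  const durations: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    fn();
    durations.push(Date.now() - start);
  }
  const avg = durations.reduce((a, b) => a + b, 0) / runs;
  console.log(`${label}: avg ${avg.toFixed(2)}ms, max ${Math.max(...durations)}ms`);
}

// time('DFA', () => filter.filterSensitiveWord(ss, sensitiveMap));
// time('indexOf', () => sensitiveWords.forEach((w) => ss.indexOf(w)));
```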

I ran the two algorithms 100 times each and got the following results.

It is easy to see that DFA averages about 1ms with a maximum of 5ms, while indexOf averages around 9ms with a maximum of 14ms, so DFA has a clear advantage in efficiency.