Sensitive word filtering is needed in many places. In principle, anywhere a web application accepts user input, the text should be validated: XSS checks, SQL injection checks, sensitive word filtering, and so on. Today's focus is how to implement sensitive word filtering gracefully and efficiently.

Sensitive word filtering scheme one

Let me start with how I implemented sensitive word filtering at my last company. I was young then, so I used the simplest possible scheme: for the text to be checked, iterate over all sensitive words and test, one by one, whether the input text contains each of them. This is the most basic filtering scheme and is trivial to implement; example code below:


   @Test
    public void test1() {
        Set<String> sensitiveWords = new HashSet<>();
        sensitiveWords.add("shit");
        sensitiveWords.add("idiot");
        sensitiveWords.add("fool");
        String text = "You're an idiot.";
        for (String sensitiveWord : sensitiveWords) {
            if (text.contains(sensitiveWord)) {
                System.out.println("Sensitive word found in the input text: " + sensitiveWord);
                break;
            }
        }
    }

The code is simple and does the trick. However, there is a big problem with this scheme: as the number of sensitive words grows, detection time grows linearly with it. Since my previous project had only a few dozen sensitive words, this caused no noticeable performance issues. But with thousands of sensitive words in a project, this approach becomes CPU intensive.

Sensitive word filtering scheme two

After searching online for sensitive word filtering schemes, I found an approach based on the DFA (Deterministic Finite Automaton). Its basic idea is to match sensitive words through state transitions: all sensitive words can be detected in a single scan of the text to be checked, so it is far more efficient than scheme one.

Suppose we have five sensitive words to detect: 傻逼 ("dumbass"), 傻子 ("fool"), 傻大个 ("big fool"), 坏人 ("bad guy"), and 坏蛋 ("scoundrel"). We can first merge the words that share a common prefix into one tree structure, with words of different prefixes belonging to different trees. Taking these five sensitive words as an example, they can be initialized into the following two trees:
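(The original article shows the two trees as an image; the sketch below reconstructs them in text from the five words above.)

    傻 ("silly")              坏 ("bad")
    ├── 逼 (end)              ├── 人 (end)
    ├── 子 (end)              └── 蛋 (end)
    └── 大
        └── 个 (end)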

What do we gain by organizing the sensitive words into trees? The biggest benefit is fewer lookups. We traverse the text to be checked only once, and for each character we look for a corresponding subtree in the sensitive lexicon. If there is no corresponding subtree, the current character cannot start a sensitive word, so we move straight on to the next character. If there is a subtree, we check whether the next character of the text is a child node of that subtree; iterating this way tells us whether the text contains a sensitive word.

Take the text 你是不是傻逼 ("aren't you a dumbass") as an example. We examine each character in turn. The first four characters, 你是不是, are not in the sensitive lexicon and have no corresponding subtree, so they are skipped directly. When we reach 傻, we find it does have a corresponding subtree in the lexicon; call it tree-1. We then check whether the next character, 逼, is a child node of tree-1, and it is; finally we check whether 逼 is a leaf node, and since it is, a sensitive word has been matched: 傻逼 is successfully retrieved. Notice that throughout the search we scanned the text only once, and the sensitive words absent from it, such as 坏人 and 坏蛋 in this example, were never scanned at all, so efficiency is greatly improved over scheme one.

In Java, we can use a HashMap to store the tree structure above. Taking the 傻-prefixed words as an example, we break each sensitive word into individual characters and store them in nested HashMaps like this:


{
    "傻": {
        "逼": {
            "isEnd": "Y"
        },
        "子": {
            "isEnd": "Y"
        },
        "大": {
            "个": {
                "isEnd": "Y"
            }
        }
    }
}

The first character of each word is a key whose value is another HashMap; the keys of that HashMap are the possible second characters, and if there is a third character it is stored in the second character's value, which is again a HashMap, and so on down to the last character. The last character also maps to a HashMap, but that one only needs to hold an end marker: in the example above, the entry {"isEnd": "Y"} indicates that the key it belongs to is the final character of a sensitive word.
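As a minimal hand-built illustration of this nesting for the single word 傻逼 (note that the actual code below marks word ends with the key END_FLAG = "end" and a null value, rather than the "isEnd"/"Y" pair used in the JSON sketch above):

    Map<Object, Object> root = new HashMap<>();  // the whole sensitive lexicon
    Map<Object, Object> sha = new HashMap<>();   // subtree under 傻
    Map<Object, Object> bi = new HashMap<>();    // subtree under 逼
    bi.put("isEnd", "Y");                        // 逼 ends the word 傻逼
    sha.put('逼', bi);
    root.put('傻', sha);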

Similarly, the sensitive words 坏人 and 坏蛋 are stored in the same way; they are not listed here.

What do we gain by using a HashMap? A HashMap lookup takes O(1) time in the ideal case, so while traversing the text to be checked we can determine in O(1) whether the current character is in the sensitive lexicon, which is far more efficient than scheme one.

Next, the code.

The first is to initialize the sensitive lexicon:


   private Map<Object, Object> sensitiveWordsMap;

    private static final String END_FLAG = "end";

    private void initSensitiveWordsMap(Set<String> sensitiveWords) {
        if (sensitiveWords == null || sensitiveWords.isEmpty()) {
            throw new IllegalArgumentException("Sensitive words must not be empty!");
        }
        sensitiveWordsMap = new HashMap<>(sensitiveWords.size());
        Map<Object, Object> currentMap;
        Map<Object, Object> subMap;
        for (String currentWord : sensitiveWords) {
            // A sensitive word must be at least 2 characters long
            if (currentWord == null || currentWord.trim().length() < 2) {
                continue;
            }
            currentMap = sensitiveWordsMap;
            for (int i = 0; i < currentWord.length(); i++) {
                char c = currentWord.charAt(i);
                subMap = (Map<Object, Object>) currentMap.get(c);
                if (subMap == null) {
                    subMap = new HashMap<>();
                    currentMap.put(c, subMap);
                }
                currentMap = subMap;
                // Only the last character needs an end flag; omitting it on
                // intermediate characters saves space.
                if (i == currentWord.length() - 1) {
                    currentMap.put(END_FLAG, null);
                }
            }
        }
    }

The logic is exactly what was described above: loop over the set of sensitive words and insert each one, character by character, into the nested HashMap. I won't go through it again.

Next, the code that scans for sensitive words:


public enum MatchType {

    MIN_MATCH("Minimum matching rule"), MAX_MATCH("Maximum matching rule");

    String desc;

    MatchType(String desc) {
        this.desc = desc;
    }
}

    public Set<String> getSensitiveWords(String text, MatchType matchType) {
        if (text == null || text.trim().length() == 0) {
            throw new IllegalArgumentException("The input text must not be empty.");
        }
        Set<String> sensitiveWords = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            int sensitiveWordLength = getSensitiveWordLength(text, i, matchType);
            if (sensitiveWordLength > 0) {
                String sensitiveWord = text.substring(i, i + sensitiveWordLength);
                sensitiveWords.add(sensitiveWord);
                if (matchType == MatchType.MIN_MATCH) {
                    break;
                }
                // Jump past the matched word before scanning on
                i = i + sensitiveWordLength - 1;
            }
        }
        return sensitiveWords;
    }

    public int getSensitiveWordLength(String text, int startIndex, MatchType matchType) {
        if (text == null || text.trim().length() == 0) {
            throw new IllegalArgumentException("The input text must not be empty.");
        }
        Map<Object, Object> currentMap = sensitiveWordsMap;
        int wordLength = 0;     // characters matched so far along the current branch
        int matchedLength = 0;  // length of the last complete sensitive word seen
        for (int i = startIndex; i < text.length(); i++) {
            char currentChar = text.charAt(i);
            Map<Object, Object> subMap = (Map<Object, Object>) currentMap.get(currentChar);
            if (subMap == null) {
                break;
            }
            wordLength++;
            if (subMap.containsKey(END_FLAG)) {
                // A complete sensitive word ends at this character; record its
                // length so a longer branch that never completes isn't counted.
                matchedLength = wordLength;
                if (matchType == MatchType.MIN_MATCH) {
                    break;
                }
            }
            currentMap = subMap;
        }
        return matchedLength;
    }

MatchType is the matching rule. Sometimes we only need to find one sensitive word, and sometimes we need to know every sensitive word the text contains; the former corresponds to the minimum matching rule and the latter to the maximum matching rule. (The rule also applies within a single position: MIN_MATCH stops at the shortest sensitive word starting there, while MAX_MATCH extends to the longest.)
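As a quick usage sketch of the difference, reusing the textFilter instance and text from the test below:

        // MIN_MATCH: return as soon as one sensitive word is found
        Set<String> first = textFilter.getSensitiveWords(text, MatchType.MIN_MATCH); // at most one entry
        // MAX_MATCH: keep scanning and collect every sensitive word in the text
        Set<String> all = textFilter.getSensitiveWords(text, MatchType.MAX_MATCH);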

The getSensitiveWordLength method computes, from the given text, a starting index, and the matching rule, the length of the sensitive word that starts at that index in the text. It returns 0 if no sensitive word starts there, and the matched word's length otherwise.

The getSensitiveWords method scans the text, checking character by character whether a sensitive word begins at each position, then cuts the detected sensitive words out of the text, collects them into a set, and returns the set to the caller.

Finally, write a test case to test it:


   public static void main(String[] args) {
        Set<String> sensitiveWords = new HashSet<>();
        sensitiveWords.add("你是傻逼");       // "you're a dumbass"
        sensitiveWords.add("你是傻子");       // "you're a fool"
        sensitiveWords.add("你是坏人");       // "you're a bad guy"
        sensitiveWords.add("你个大傻子");     // "you big fool"
        sensitiveWords.add("我去年买了个表"); // a well-known euphemistic curse
        sensitiveWords.add("shit");

        TextFilter textFilter = new TextFilter();
        textFilter.initSensitiveWordsMap(sensitiveWords);
        String text = "你你你你是傻逼啊你，说你呢，你个大傻子。";
        System.out.println(textFilter.getSensitiveWords(text, MatchType.MAX_MATCH));
    }

The output is as follows (the set's iteration order may vary):

[你是傻逼, 你个大傻子]

As you can see, we've managed to filter out the sensitive words.

Sensitive word filtering scheme three

Scheme two performs well enough, but it is easy to bypass: for example, inserting a space between the characters of a sensitive word in the text defeats it. Solving this is not hard. One simple method is to initialize a library of invalid characters, such as space, *, #, and @, and strip them from the text before detection. Since the text being checked then no longer contains these invalid characters, scheme two can still be applied as-is. As long as the text is not too long, the one extra pass over it to remove invalid characters costs little on top of scheme two, and the performance loss is acceptable.
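Here is a minimal sketch of such a preprocessing step; the character set and the removeInvalidChars name are illustrative assumptions, not from the original code:

    private static final Set<Character> INVALID_CHARS =
            new HashSet<>(Arrays.asList(' ', '*', '#', '@'));

    private String removeInvalidChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!INVALID_CHARS.contains(c)) { // drop filler characters
                sb.append(c);
            }
        }
        return sb.toString();
    }

One thing to keep in mind with this design: once characters are stripped, the positions of matched words refer to the cleaned text, so if you need to highlight matches in the original input you must keep a mapping back to the original indexes.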

If sensitive words can be English, case must also be considered. A simple solution is to lowercase every sensitive word during initialization, and likewise lowercase the text to be checked before detection; that takes care of the case problem.
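A short sketch of this, building on the assumed removeInvalidChars helper above:

        // Normalize case once during initialization...
        Set<String> normalized = new HashSet<>();
        for (String word : sensitiveWords) {
            normalized.add(word.toLowerCase());
        }
        textFilter.initSensitiveWordsMap(normalized);
        // ...and once per detection, after stripping invalid characters.
        Set<String> hits = textFilter.getSensitiveWords(
                removeInvalidChars(text).toLowerCase(), MatchType.MAX_MATCH);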

The trickiest part is mixed Chinese and pinyin. For example, the sensitive word 傻逼 is easily bypassed by writing it half in pinyin, as sha逼. I haven't come up with a good solution for this case yet; readers with ideas are welcome to leave a comment at the end of this article.

Complete code: github.com/hzjjames/Te…