What is a prefix tree

Prefix tree is divided into Trie, dictionary tree and search tree, which is characterized by high search efficiency but large memory consumption. It is often used in string retrieval, word frequency statistics and string sorting.

Here we use it to make a sensitive word filter.

Prefix tree is used to record sensitive words, assuming that the sensitive words at this time: ABC, BD, bg

Start by drawing a prefix tree:

Root is the Root and no characters are recorded.

Then we assume a string and perform sensitive word recognition on the string.

Assume the string is “asdabcbdaas”.

We have three arrows, as follows:

Prepare a StringBuilder to store the filtered strings.

They are called arrows no. 1, no. 2, and No. 3.

We started filtering the text for sensitive words.

Arrows 1 and 2 point to A, and we can look for the presence of A in the first row of the prefix tree.

2. Arrow # 1 goes down and discovers the presence of A.

Arrows 3 and 2 stand still, arrow 3 moves one character down, and arrow 3 points to S.

4. Arrow no. 1 goes down and looks for S in the next node of the prefix tree A, but does not find it.

5. The string as between arrows 2 and 3 is not a sensitive word. Both arrows 2 and 3 move to the next d. Add “as” to the filtered string StirngBuilder.

At this point the StringBuilder = “as”;

6. The arrow # 1 moves back to the node in the first row, finds no D, and continues to retrieve the next character.

7. Node 2 continues on and encounters A.

8. The prefix tree has a in the first row of nodes.

Pointer 3 moves to the next character of pointer 2 and reads character B.

10, also look for b in the second line of the prefix tree A, find.

11. Pointer 3 goes ahead and finds C.

12. Pointer 1 continues to look for C in the next row of node B, and finds C.

At this point, the string “ABC” has been searched completely. It can be judged that “ABC” between pointer 2 and pointer 3 is a sensitive word. Replace it with *** and add StringBuilder.

The StringBuilder = “as * * *”;

And so on, until StringBuilder= “as*** ***aas”;

Let’s write Java.

We’ll create a new sensitive word file named TXT. We’ll call it “sensitive-word.txt”. Enter sensitive words in a one-word line format. As follows:

The handsome fish man made a fortune this year and his glory lives on for the tribeCopy the code

Then implement a Java filter.

package com.nowcoder.community.util;

import org.apache.commons.lang3.CharUtils;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

@Component
public class SensitiveFilter {

    private static final Logger logger = LoggerFactory.getLogger(SensitiveFilter.class);

    / / replace operator
    private static final String REPLACEMENT = "* * *";

    / / the root node
    private TrieNode rootNode = new TrieNode();

    @PostConstruct
    public void init(a) {
        try (
                InputStream is = this.getClass().getClassLoader().getResourceAsStream("sensitive-words.txt");
                BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        ) {
            String keyword;
            while((keyword = reader.readLine()) ! =null) {
                // Add to the prefix tree
                this.addKeyword(keyword); }}catch (IOException e) {
            logger.error("Failed to load sensitive word file:"+ e.getMessage()); }}// Add a sensitive word to the prefix tree
    private void addKeyword(String keyword) {
        TrieNode tempNode = rootNode;
        for (int i = 0; i < keyword.length(); i++) {
            char c = keyword.charAt(i);
            TrieNode subNode = tempNode.getSubNode(c);

            if (subNode == null) {
                // Initialize the child node
                subNode = new TrieNode();
                tempNode.addSubNode(c, subNode);
            }

            // Point to the child node and enter the next loop
            tempNode = subNode;

            // Set the end flag
            if (i == keyword.length() - 1) {
                tempNode.setKeywordEnd(true); }}}/** * filter sensitive words **@paramText Indicates the text to be filtered@returnFiltered text */
    public String filter(String text) {
        if (StringUtils.isBlank(text)) {
            return null;
        }

        / / pointer to 1
        TrieNode tempNode = rootNode;
        2 / / pointer
        int begin = 0;
        3 / / pointer
        int position = 0;
        / / the result
        StringBuilder sb = new StringBuilder();

        while (position < text.length()) {
            char c = text.charAt(position);

            // skip symbols
            if (isSymbol(c)) {
                // If pointer 1 is at the root, count the symbol in the result and let pointer 2 take a step down
                if (tempNode == rootNode) {
                    sb.append(c);
                    begin++;
                }
                // Whether the symbol is at the beginning or in the middle, pointer 3 takes a step down
                position++;
                continue;
            }

            // Check the child node
            tempNode = tempNode.getSubNode(c);
            if (tempNode == null) {
                // Strings starting with begin are not sensitive words
                sb.append(text.charAt(begin));
                // Go to the next position
                position = ++begin;
                // Redirect to the root node
                tempNode = rootNode;
            } else if (tempNode.isKeywordEnd()) {
                // If sensitive words are found, replace the begin~position string
                sb.append(REPLACEMENT);
                // Go to the next position
                begin = ++position;
                // Redirect to the root node
                tempNode = rootNode;
            } else {
                // Check the next characterposition++; }}// Count the last batch of characters in the result
        sb.append(text.substring(begin));

        return sb.toString();
    }

    // Check whether it is a symbol
    private boolean isSymbol(Character c) {
        // 0x2E80~0x9FFF are east Asian characters
        return! CharUtils.isAsciiAlphanumeric(c) && (c <0x2E80 || c > 0x9FFF);
    }

    / / prefix tree
    private class TrieNode {

        // Keyword end identifier
        private boolean isKeywordEnd = false;

        // Child node (key is the child character,value is the child node)
        private Map<Character, TrieNode> subNodes = new HashMap<>();

        public boolean isKeywordEnd(a) {
            return isKeywordEnd;
        }

        public void setKeywordEnd(boolean keywordEnd) {
            isKeywordEnd = keywordEnd;
        }

        // Add child nodes
        public void addSubNode(Character c, TrieNode node) {
            subNodes.put(c, node);
        }

        // Get the child node
        public TrieNode getSubNode(Character c) {
            returnsubNodes.get(c); }}}Copy the code