What is a prefix tree
Prefix tree (Trie), also known as dictionary tree, is a multi-fork tree structure, which is summarized from the figure above
- Basic properties: from the root to a node, concatenate a long string; The root node is empty
- Features: high search efficiency and large memory consumption
- Application: string retrieval word frequency statistics string sort, etc
We can declare a Hashmap in each node. Use a Hashmap to store child nodes.
The keys of a Hashmap are characters, and the values are the corresponding child nodes.
Principle of filtering sensitive words
By the principle of the prefix tree, we’re essentially turning it into a finder
If a string in the text matches the node string in the tree, it is regarded as a sensitive word and replaced with ***.
The specific implementation
- First, we need to have a text act
Sensitive word library
And we willSensitive word library
Converted toThe prefix tree
- Create an object to store filtered articles
StringBuilder
- The input
text
One by one withThe prefix tree
Child node comparison of the root node (which is empty)- Start search if there is a match, otherwise take
text
The next character of the - When searching, replace with *** if there is a matching string, otherwise skip to the next character
- Start search if there is a match, otherwise take
How to judge the conditions?
We need to add an identifier isWordEnd to each node. When this identifier is true, it means that the sensitive string ends, exactly.
Lookup logic
A pointer, tempNode, is required to identify the current node
Two Pointers, BEGIN and position, are required to identify the length of the string retrieved from text
That is, we rely on the movement of these three Pointers to process sensitive words in text and store them in StringBuilder
Having discussed their roles above, the following are the conditions for their respective movements:
The following use c to refer to the character extracted by text
According to the logic of judgment as shown in the figure above
Sensitive word filter code
The prefix tree class
Private class TrieNode {private Map<Character, TrieNode> subNodes = new HashMap<>(); Private Boolean isWordEnd = false; Public Boolean isWordEnd() {return isWordEnd; Public void setWordEnd(Boolean wordEnd) {isWordEnd = wordEnd; } // Add child nodes public void addSubNode(Character c, TrieNode node) {subNodes. Put (c, node); } public TrieNode getSubNode(Character c) {return subnodes.get (c); }}Copy the code
Adds the sensitive thesaurus to the prefix tree
Private TrieNode rootNode = new TrieNode(); @postconstruct public void init() {try (// Use the classloader to read the sensitive thesaurus and cache InputStream is = this.getClass().getClassLoader().getResourceAsStream("sensitive-words.txt"); BufferedReader reader = new BufferedReader(new InputStreamReader(is)); ) { String SensitiveWord; while ((SensitiveWord = reader.readLine()) ! = null) {// Add to the prefix tree this.addSensitiveWord(SensitiveWord); }} catch (IOException e) {logger.error(" failed to load sensitive words: "+ LLDB message ()); Private void addSensitiveWord(String sensitiveWord) {TrieNode tempNode = rootNode; For (int I = 0; i < sensitiveWord.length(); i++) { char c = sensitiveWord.charAt(i); TrieNode subNode = tempNode.getSubNode(c); TrieNode subNode = tempNode.getSubNode(c); If (subNode == null) {// Initialize subNode = new TrieNode(); tempNode.addSubNode(c, subNode); } // point to the child node and enter the next loop tempNode = subNode; // Set end flag if (I == sensitiveWord.length() -1) {tempNode.setWordEnd(true); }}}Copy the code
Text filtering
public String filter(String text) { if (StringUtils.isBlank(text)) { return null; } // TrieNode tempNode = rootNode; // text start position int begin = 0; Int position = 0; StringBuilder sb = new StringBuilder(); while (position < text.length()) { char c = text.charAt(position); // Skip the symbol if (isSymbol(c)) {if (tempNode == rootNode) {sb.append(c); // If (tempNode == rootNode) {sb.append(c); begin++; } // whether the symbol is at the beginning or in the middle, the current position is one step down position++; continue; } // Check the child node tempNode = tempNode.getSubNode(c); If (tempNode == null) {// A string beginning with begin is not a sensitive word sb.append(text.charat (begin)); Position = ++begin; // redirect to rootNode tempNode = rootNode; } else if (tempnode.iswordend ()) {begin~position (sb.append(REPLACEMENT); Begin = ++position; // redirect to rootNode tempNode = rootNode; } else {// check next character position++; Sb.append (text.substring(begin)); return sb.toString(); } private Boolean isSymbol(Character c) {// 0x2E80~0x9FFF CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF); }Copy the code
The complete code
@Component public class SensitiveFilter { private static final Logger logger = LoggerFactory.getLogger(SensitiveFilter.class); Private static final String REPLACEMENT = "***"; Private TrieNode rootNode = new TrieNode(); @postconstruct public void init() {try (// Use the classloader to read the sensitive thesaurus and cache InputStream is = this.getClass().getClassLoader().getResourceAsStream("sensitive-words.txt"); BufferedReader reader = new BufferedReader(new InputStreamReader(is)); ) { String SensitiveWord; while ((SensitiveWord = reader.readLine()) ! = null) {// Add to the prefix tree this.addSensitiveWord(SensitiveWord); }} catch (IOException e) {logger.error(" failed to load sensitive words: "+ LLDB message ()); Private void addSensitiveWord(String sensitiveWord) {TrieNode tempNode = rootNode; For (int I = 0; i < sensitiveWord.length(); i++) { char c = sensitiveWord.charAt(i); TrieNode subNode = tempNode.getSubNode(c); TrieNode subNode = tempNode.getSubNode(c); If (subNode == null) {// Initialize subNode = new TrieNode(); tempNode.addSubNode(c, subNode); } // point to the child node and enter the next loop tempNode = subNode; // Set end flag if (I == sensitiveWord.length() -1) {tempNode.setWordEnd(true); }}} /** * @return */ public String filter(String text) {if (stringutils.isblank (text)) {return null; } // TrieNode tempNode = rootNode; // text start position int begin = 0; Int position = 0; StringBuilder sb = new StringBuilder(); while (position < text.length()) { char c = text.charAt(position); // Skip the symbol if (isSymbol(c)) {if (tempNode == rootNode) {sb.append(c); // If (tempNode == rootNode) {sb.append(c); begin++; } // whether the symbol is at the beginning or in the middle, the current position is one step down position++; continue; } // Check the child node tempNode = tempNode.getSubNode(c); If (tempNode == null) {// A string beginning with begin is not a sensitive word sb.append(text.charat (begin)); Position = ++begin; // redirect to rootNode tempNode = rootNode; } else if (tempnode.iswordend ()) {begin~position (sb.append(REPLACEMENT); Begin = ++position; // redirect to rootNode tempNode = rootNode; } else {// check next character position++; Sb.append (text.substring(begin)); return sb.toString(); } private Boolean isSymbol(Character c) {// 0x2E80~0x9FFF CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF); } private class TrieNode {private Map<Character, TrieNode> subNodes = new HashMap<>(); Private Boolean isWordEnd = false; Public Boolean isWordEnd() {return isWordEnd; Public void setWordEnd(Boolean wordEnd) {isWordEnd = wordEnd; } // Add child nodes public void addSubNode(Character c, TrieNode node) {subNodes. Put (c, node); } public TrieNode getSubNode(Character c) {return subnodes.get (c); }}}Copy the code
The resources
What is a prefix tree?