Write a sensitive word filter using the prefix tree

What is a prefix tree

Prefix tree (Trie), also known as dictionary tree, is a multi-fork tree structure, which is summarized from the figure above

Basic properties: from the root to a node, concatenate a long string; The root node is empty
Features: high search efficiency and large memory consumption
Application: string retrieval word frequency statistics string sort, etc

We can declare a Hashmap in each node. Use a Hashmap to store child nodes.

The keys of a Hashmap are characters, and the values are the corresponding child nodes.

Principle of filtering sensitive words

By the principle of the prefix tree, we’re essentially turning it into a finder

If a string in the text matches the node string in the tree, it is regarded as a sensitive word and replaced with ***.

The specific implementation

First, we need to have a text actSensitive word libraryAnd we willSensitive word libraryConverted toThe prefix tree
Create an object to store filtered articlesStringBuilder
The inputtextOne by one withThe prefix treeChild node comparison of the root node (which is empty)
- Start search if there is a match, otherwise taketextThe next character of the
- When searching, replace with *** if there is a matching string, otherwise skip to the next character

How to judge the conditions?

We need to add an identifier isWordEnd to each node. When this identifier is true, it means that the sensitive string ends, exactly.

Lookup logic

A pointer, tempNode, is required to identify the current node

Two Pointers, BEGIN and position, are required to identify the length of the string retrieved from text

That is, we rely on the movement of these three Pointers to process sensitive words in text and store them in StringBuilder

Having discussed their roles above, the following are the conditions for their respective movements:

The following use c to refer to the character extracted by text

According to the logic of judgment as shown in the figure above

Sensitive word filter code

The prefix tree class

Private class TrieNode {private Map<Character, TrieNode> subNodes = new HashMap<>(); Private Boolean isWordEnd = false; Public Boolean isWordEnd() {return isWordEnd; Public void setWordEnd(Boolean wordEnd) {isWordEnd = wordEnd; } // Add child nodes public void addSubNode(Character c, TrieNode node) {subNodes. Put (c, node); } public TrieNode getSubNode(Character c) {return subnodes.get (c); }}Copy the code

Adds the sensitive thesaurus to the prefix tree

Private TrieNode rootNode = new TrieNode(); @postconstruct public void init() {try (// Use the classloader to read the sensitive thesaurus and cache InputStream is = this.getClass().getClassLoader().getResourceAsStream("sensitive-words.txt"); BufferedReader reader = new BufferedReader(new InputStreamReader(is)); ) { String SensitiveWord; while ((SensitiveWord = reader.readLine()) ! = null) {// Add to the prefix tree this.addSensitiveWord(SensitiveWord); }} catch (IOException e) {logger.error(" failed to load sensitive words: "+ LLDB message ()); Private void addSensitiveWord(String sensitiveWord) {TrieNode tempNode = rootNode; For (int I = 0; i < sensitiveWord.length(); i++) { char c = sensitiveWord.charAt(i); TrieNode subNode = tempNode.getSubNode(c); TrieNode subNode = tempNode.getSubNode(c); If (subNode == null) {// Initialize subNode = new TrieNode(); tempNode.addSubNode(c, subNode); } // point to the child node and enter the next loop tempNode = subNode; // Set end flag if (I == sensitiveWord.length() -1) {tempNode.setWordEnd(true); }}}Copy the code

Text filtering

public String filter(String text) { if (StringUtils.isBlank(text)) { return null; } // TrieNode tempNode = rootNode; // text start position int begin = 0; Int position = 0; StringBuilder sb = new StringBuilder(); while (position < text.length()) { char c = text.charAt(position); // Skip the symbol if (isSymbol(c)) {if (tempNode == rootNode) {sb.append(c); // If (tempNode == rootNode) {sb.append(c); begin++; } // whether the symbol is at the beginning or in the middle, the current position is one step down position++; continue; } // Check the child node tempNode = tempNode.getSubNode(c); If (tempNode == null) {// A string beginning with begin is not a sensitive word sb.append(text.charat (begin)); Position = ++begin; // redirect to rootNode tempNode = rootNode; } else if (tempnode.iswordend ()) {begin~position (sb.append(REPLACEMENT); Begin = ++position; // redirect to rootNode tempNode = rootNode; } else {// check next character position++; Sb.append (text.substring(begin)); return sb.toString(); } private Boolean isSymbol(Character c) {// 0x2E80~0x9FFF CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF); }Copy the code

The complete code

@Component public class SensitiveFilter { private static final Logger logger = LoggerFactory.getLogger(SensitiveFilter.class); Private static final String REPLACEMENT = "***"; Private TrieNode rootNode = new TrieNode(); @postconstruct public void init() {try (// Use the classloader to read the sensitive thesaurus and cache InputStream is = this.getClass().getClassLoader().getResourceAsStream("sensitive-words.txt"); BufferedReader reader = new BufferedReader(new InputStreamReader(is)); ) { String SensitiveWord; while ((SensitiveWord = reader.readLine()) ! = null) {// Add to the prefix tree this.addSensitiveWord(SensitiveWord); }} catch (IOException e) {logger.error(" failed to load sensitive words: "+ LLDB message ()); Private void addSensitiveWord(String sensitiveWord) {TrieNode tempNode = rootNode; For (int I = 0; i < sensitiveWord.length(); i++) { char c = sensitiveWord.charAt(i); TrieNode subNode = tempNode.getSubNode(c); TrieNode subNode = tempNode.getSubNode(c); If (subNode == null) {// Initialize subNode = new TrieNode(); tempNode.addSubNode(c, subNode); } // point to the child node and enter the next loop tempNode = subNode; // Set end flag if (I == sensitiveWord.length() -1) {tempNode.setWordEnd(true); }}} /** * @return */ public String filter(String text) {if (stringutils.isblank (text)) {return null; } // TrieNode tempNode = rootNode; // text start position int begin = 0; Int position = 0; StringBuilder sb = new StringBuilder(); while (position < text.length()) { char c = text.charAt(position); // Skip the symbol if (isSymbol(c)) {if (tempNode == rootNode) {sb.append(c); // If (tempNode == rootNode) {sb.append(c); begin++; } // whether the symbol is at the beginning or in the middle, the current position is one step down position++; continue; } // Check the child node tempNode = tempNode.getSubNode(c); If (tempNode == null) {// A string beginning with begin is not a sensitive word sb.append(text.charat (begin)); Position = ++begin; // redirect to rootNode tempNode = rootNode; } else if (tempnode.iswordend ()) {begin~position (sb.append(REPLACEMENT); Begin = ++position; // redirect to rootNode tempNode = rootNode; } else {// check next character position++; Sb.append (text.substring(begin)); return sb.toString(); } private Boolean isSymbol(Character c) {// 0x2E80~0x9FFF CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF); } private class TrieNode {private Map<Character, TrieNode> subNodes = new HashMap<>(); Private Boolean isWordEnd = false; Public Boolean isWordEnd() {return isWordEnd; Public void setWordEnd(Boolean wordEnd) {isWordEnd = wordEnd; } // Add child nodes public void addSubNode(Character c, TrieNode node) {subNodes. Put (c, node); } public TrieNode getSubNode(Character c) {return subnodes.get (c); }}}Copy the code

The resources

What is a prefix tree?