Sensitive words are not the platform
Recently, a Canadian rapper was publicly accused of sexual misconduct, possibly including luring underage girls.
As for whether it is true, and whether someone really is a "Neptune" (a womanizer), the WeChat and QQ chat records remember it all clearly. In a criminal case, Tencent (TX) can pull up the full chat history. Tencent Video and "an electric eel" have already ended their cooperation with him, so the accusations are not necessarily groundless; after all, business interests are at stake.
But there were two things that I noticed in the process that were really remarkable:
(1) His rumored girlfriend, Xiao Yi, was abused so heavily that she cleared all of her social media accounts. As the biggest home of gossip, can X blog do nothing but let its servers crash? Does it not know about sensitive-word filtering?
(2) The whistleblower, Du Meizhu, received many blood-red photos. The big platforms boast about artificial intelligence every day; do they not have a filtering function either?
Of course, I know nothing about artificial intelligence.
But as for sensitive words, I recently wrote a small tool. It is already open source, so if the major platforms need it, they are welcome to help themselves.
https://github.com/houbb/sensitive-word
At the very least, it can mask "beautiful Chinese words" like this:
You * * *, I * * you * *, you * * *, * *! XXX!
Writing purpose
The tool is based on the DFA algorithm. The sensitive-word lexicon contains 60,000+ entries (the source file contained 180,000+ before one round of pruning).
Later on, I will keep optimizing and extending the lexicon and further improve the performance of the algorithm.
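To make "based on the DFA algorithm" a bit more concrete, here is a minimal sketch of trie-based matching, where each trie node plays the role of one DFA state and each character is a transition. This is only an illustration under my own assumptions, not the library's actual implementation; the class name TrieWordMatcher and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of sensitive-word.
class TrieWordMatcher {

    // One trie node corresponds to one DFA state.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if some word ends in this state
    }

    private final Node root = new Node();

    TrieWordMatcher(List<String> words) {
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // an empty word would match everywhere
            }
            Node node = root;
            for (char c : word.toCharArray()) {
                Node child = node.next.get(c);
                if (child == null) {
                    child = new Node();
                    node.next.put(c, child);
                }
                node = child;
            }
            node.terminal = true;
        }
    }

    // Length of the longest word that starts at index start, or 0 if none does.
    private int matchLength(String text, int start) {
        Node node = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.terminal) {
                longest = i - start + 1;
            }
        }
        return longest;
    }

    // Collect every non-overlapping match, scanning left to right.
    List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                found.add(text.substring(i, i + len));
                i += len;
            } else {
                i++;
            }
        }
        return found;
    }

    boolean contains(String text) {
        return !findAll(text).isEmpty();
    }

    // Overwrite every matched character with the mask character.
    String replace(String text, char mask) {
        StringBuilder sb = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                for (int j = i; j < i + len; j++) {
                    sb.setCharAt(j, mask);
                }
                i += len;
            } else {
                i++;
            }
        }
        return sb.toString();
    }
}
```

With the words "bad", "badge" and "evil" loaded, findAll("a badge for evil") in this sketch yields [badge, evil], because the longest match at each position wins, and replace("so bad", '*') yields "so ***".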
I hope to refine the classification of sensitive words, but I feel it is too much work, so I haven’t done it yet.
As for the vision: to become the number one tool for sensitive words.
Of course, first place is always vacant.
Features
- 60,000+ word lexicon, continuously optimized and updated
- Based on the DFA algorithm, for good performance
- Fluent API, elegant and simple to use
- Supports common operations: checking for, returning, and masking sensitive words
- Supports full-width / half-width conversion
- Supports English upper / lower case conversion
- Supports conversion between common ways of writing digits
- Supports conversion between traditional and simplified Chinese characters
- Supports conversion between common ways of writing English letters
- Supports user-defined sensitive words and whitelists
- Supports dynamic data updates that take effect in real time
Quick start
Prerequisites
- JDK 1.7+
- Maven 3.x+
Maven dependency
<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.0.15</version>
</dependency>
API overview
SensitiveWordHelper, as a tool class for sensitive words, has the following core methods:
Method | Parameter | Return value | Description |
---|---|---|---|
contains(String) | The string to check | boolean | Checks whether the string contains sensitive words |
findAll(String) | The string to check | List of strings | Returns all sensitive words in the string |
replace(String, char) | The string to check and the replacement char | String | Returns the string with sensitive words replaced by the specified char |
replace(String) | The string to check | String | Returns the string with sensitive words replaced by * |
Usage examples
See SensitiveWordHelperTest for all test cases
Check whether the text contains sensitive words
final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";
Assert.assertTrue(SensitiveWordHelper.contains(text));
Return the first sensitive word
final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("Five-Starred Red Flag", word);
Return all sensitive words
final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Five-Starred Red Flag, Chairman Mao, Tian'anmen Square]", wordList.toString());
Default replacement strategy
final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";
String result = SensitiveWordHelper.replace(text);
Assert.assertEquals("**** is waving in the wind, and the portrait of *** stands in front of ***.", result);
Specify the replacement character
final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";
String result = SensitiveWordHelper.replace(text, '0');
Assert.assertEquals("0000 is waving in the wind, and the portrait of 000 stands in front of 000.", result);
More features
The features below mostly deal with the many tricks people use to dodge a filter, so that the hit rate for sensitive words is as high as possible.
This is a long battle of attack and defense.
Ignoring case
final String text = "fuCK the bad words.";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuCK", word);
Ignore full-width and half-width forms
final String text = "ｆｕｃｋ the bad words.";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("ｆｕｃｋ", word);
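One plausible way to get the case and width behavior above is to normalize every character before it is looked up in the lexicon. The sketch below is my own illustration, not the library's code, and CharNormalizer is a hypothetical name. It relies on the fact that the full-width ASCII variants U+FF01 to U+FF5E sit exactly 0xFEE0 above their half-width counterparts.

```java
// Hypothetical illustration class; not part of sensitive-word.
class CharNormalizer {

    // Map a full-width ASCII variant to its half-width form,
    // and the ideographic space U+3000 to an ordinary space.
    static char toHalfWidth(char c) {
        if (c == '\u3000') {
            return ' ';
        }
        if (c >= '\uFF01' && c <= '\uFF5E') {
            return (char) (c - 0xFEE0);
        }
        return c;
    }

    // Normalize width first, then case, one character at a time.
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(Character.toLowerCase(toHalfWidth(text.charAt(i))));
        }
        return sb.toString();
    }
}
```

Because the mapping is one character to one character, index i in the normalized string still points at index i in the original, so a matcher can search the normalized text and then return the original substring, which is consistent with the examples above returning the word as it was written.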
Ignore the way digits are written
This converts between the common ways of writing digits.
final String text = "This is my WeChat: 9⓿二肆⁹₈③⑸⒋➃㈤㊄";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[9⓿二肆⁹₈③⑸⒋➃㈤㊄]", wordList.toString());
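As a rough idea of what converting digit styles can look like, here is a small sketch that maps three Unicode digit ranges back to ASCII digits. It covers only full-width, circled, and parenthesized digits; the real tool clearly handles more forms than this, and I am not claiming it works this way. DigitStyleNormalizer is a hypothetical name.

```java
// Hypothetical illustration class; not part of sensitive-word.
class DigitStyleNormalizer {

    static char toAsciiDigit(char c) {
        if (c >= '\uFF10' && c <= '\uFF19') {
            // Full-width digits 0-9
            return (char) ('0' + (c - '\uFF10'));
        }
        if (c >= '\u2460' && c <= '\u2468') {
            // Circled digits 1-9 (U+2460 is circled one)
            return (char) ('1' + (c - '\u2460'));
        }
        if (c >= '\u2474' && c <= '\u247C') {
            // Parenthesized digits 1-9 (U+2474 is parenthesized one)
            return (char) ('1' + (c - '\u2474'));
        }
        return c;
    }

    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(toAsciiDigit(text.charAt(i)));
        }
        return sb.toString();
    }
}
```

Other forms such as superscripts, subscripts, or Chinese numerals would need their own table entries; a lookup map is probably easier to maintain than range arithmetic once the list grows.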
Ignore traditional and simplified Chinese forms
final String text = "I love my motherland and the Five-Starred Red Flag";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Five-Starred Red Flag]", wordList.toString());
Ignore the way English letters are written
final String text = "Ⓕⓤc⒦ the bad words";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Ⓕⓤc⒦]", wordList.toString());
Ignore repeated characters
final String text = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ the bad words";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦]", wordList.toString());
Email detection
final String text = "[email protected]";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[[email protected]]", wordList.toString());
Feature configuration
Description
All of the features above are enabled by default, but a business sometimes needs the flexibility to configure them.
So v0.0.14 exposes these properties for configuration.
Configuration method
To keep usage elegant, configuration is defined uniformly through a fluent API.
Users can define SensitiveWordBs as follows:
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .ignoreCase(true)
        .ignoreWidth(true)
        .ignoreNumStyle(true)
        .ignoreChineseStyle(true)
        .ignoreEnglishStyle(true)
        .ignoreRepeat(true)
        .enableNumCheck(true)
        .enableEmailCheck(true)
        .enableUrlCheck(true)
        .init();
final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";
Assert.assertTrue(wordBs.contains(text));
Configuration instructions
The description of each configuration is as follows:
No. | Method | Description |
---|---|---|
1 | ignoreCase | Ignore case |
2 | ignoreWidth | Ignore full-width and half-width forms |
3 | ignoreNumStyle | Ignore the way digits are written |
4 | ignoreChineseStyle | Ignore the way Chinese is written (traditional / simplified) |
5 | ignoreEnglishStyle | Ignore the way English letters are written |
6 | ignoreRepeat | Ignore repeated characters |
7 | enableNumCheck | Whether number detection is enabled. By default, 8 consecutive digits are treated as a sensitive word |
8 | enableEmailCheck | Whether email detection is enabled |
9 | enableUrlCheck | Whether link detection is enabled |
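For enableNumCheck, the table only says that 8 consecutive digits count as a sensitive word. A minimal sketch of that kind of check might look like the following. I am assuming here that "8" means "at least 8" and that only ASCII digits are counted after normalization; both are my assumptions, and DigitRunFinder is a hypothetical name rather than anything in the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of sensitive-word.
class DigitRunFinder {

    // Return every run of consecutive ASCII digits whose length is at least minLength.
    static List<String> findDigitRuns(String text, int minLength) {
        List<String> runs = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!isAsciiDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && isAsciiDigit(text.charAt(i))) {
                i++;
            }
            if (i - start >= minLength) {
                runs.add(text.substring(start, i));
            }
        }
        return runs;
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

The idea is that long digit runs are usually phone numbers or account IDs, so flagging them catches contact details even when no dictionary word is present.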
Dynamic loading (user-defined)
Scenario
Sometimes we want sensitive words to be loaded dynamically, for example edited in an admin console and then taking effect in real time.
v0.0.13 supports this feature.
Interface specification
To implement this feature and to be compatible with previous functionality, we defined two interfaces.
IWordDeny
The interface is as follows, and you can customize your own implementation.
It returns a list; every word in the list is treated as a sensitive word.
/**
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordDeny {

    /**
     * Get the result
     * @return the result
     * @since 0.0.13
     */
    List<String> deny();

}
For example:
import java.util.Arrays;
import java.util.List;

public class MyWordDeny implements IWordDeny {

    @Override
    public List<String> deny() {
        return Arrays.asList("my custom sensitive words");
    }

}
IWordAllow
The interface is as follows, and you can customize your own implementation.
It returns a list; every word in the list is treated as not sensitive.
/**
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordAllow {

    /**
     * Get the result
     * @return the result
     * @since 0.0.13
     */
    List<String> allow();

}
For example:
import java.util.Arrays;
import java.util.List;

public class MyWordAllow implements IWordAllow {

    @Override
    public List<String> allow() {
        return Arrays.asList("Five-Starred Red Flag");
    }

}
Using the configuration
Once the interfaces are implemented, they of course have to be registered before they take effect.
To keep usage elegant, we designed the bootstrap class SensitiveWordBs.
You can specify sensitive words through wordDeny(), non-sensitive words through wordAllow(), and build the sensitive-word dictionary through init().
System default configuration
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(WordDenys.system())
        .wordAllow(WordAllows.system())
        .init();
final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";
Assert.assertTrue(wordBs.contains(text));
Note: init() builds the sensitive-word DFA, which is time-consuming. It is generally recommended to initialize once per application rather than repeatedly.
Specify your own implementation
We can test our custom implementation as follows:
String text = "This is a test of my custom sensitive words.";
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(new MyWordDeny())
        .wordAllow(new MyWordAllow())
        .init();
Assert.assertEquals("[my custom sensitive words]", wordBs.findAll(text).toString());
Here only "my custom sensitive words" is a sensitive word, and "test" is not.
Of course, this uses nothing but our custom implementations. In general it is recommended to combine the system default configuration with a custom configuration.
You can do it in the following way.
Configure multiple at once
- Multiple sensitive-word sources
The WordDenys.chains() method merges multiple implementations into a single IWordDeny.
- Multiple whitelists
The WordAllows.chains() method merges multiple implementations into a single IWordAllow.
Example:
String text = "This is a test. My custom sensitive words.";
IWordDeny wordDeny = WordDenys.chains(WordDenys.system(), new MyWordDeny());
IWordAllow wordAllow = WordAllows.chains(WordAllows.system(), new MyWordAllow());
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(wordDeny)
        .wordAllow(wordAllow)
        .init();
Assert.assertEquals("[My custom sensitive words]", wordBs.findAll(text).toString());
This uses both the system default configuration and the custom configuration.
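To show what merging several sources into one can mean, here is a small self-contained sketch of a chain helper. It does not use the library: DenySource is a local stand-in with the same shape as the IWordDeny interface shown earlier, and DenySources.chain() and DenySources.of() are hypothetical helpers, so nothing here should be read as how WordDenys.chains() is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Local stand-in with the same shape as IWordDeny; hypothetical.
interface DenySource {
    List<String> deny();
}

// Hypothetical helper class; not part of sensitive-word.
class DenySources {

    // Combine several sources into one whose list is the concatenation of all of them.
    static DenySource chain(final DenySource... sources) {
        return new DenySource() {
            @Override
            public List<String> deny() {
                List<String> all = new ArrayList<>();
                for (DenySource source : sources) {
                    all.addAll(source.deny());
                }
                return all;
            }
        };
    }

    // Convenience source backed by a fixed list of words.
    static DenySource of(final String... words) {
        return new DenySource() {
            @Override
            public List<String> deny() {
                return Arrays.asList(words);
            }
        };
    }
}
```

Because the combined source asks each underlying source again every time deny() is called, a source backed by a database would hand back its current rows on the next rebuild, which fits the dynamic-loading scenario.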
Spring integration
Background
In real-world use, words are often edited on an admin page, and the change should then take effect in real time.
The data is stored in a database. The following is a pseudocode example; see SpringSensitiveWordConfig.java for reference.
Requires v0.0.15 or later.
Custom data source
The simplified pseudocode is as follows. The source of the data is the database.
MyDdWordAllow and MyDdWordDeny are custom implementation classes that use the database as their source.
@Configuration
public class SpringSensitiveWordConfig {

    @Autowired
    private MyDdWordAllow myDdWordAllow;

    @Autowired
    private MyDdWordDeny myDdWordDeny;

    /**
     * Initialize the bootstrap class
     * @return the initialized bootstrap class
     * @since 1.0.0
     */
    @Bean
    public SensitiveWordBs sensitiveWordBs() {
        SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
                .wordAllow(WordAllows.chains(WordAllows.system(), myDdWordAllow))
                .wordDeny(myDdWordDeny)
                // various other configuration
                .init();

        return sensitiveWordBs;
    }

}
Building the sensitive-word lexicon is time-consuming. It is recommended to call init() once when the program starts.
Dynamic changes
To make sure that changes to sensitive words can take effect in real time while keeping the interface as simple as possible, there are no add/remove methods here.
Instead, when sensitiveWordBs.init() is called, the sensitive-word lexicon is rebuilt from IWordDeny + IWordAllow.
Because initialization can take a while (on the order of seconds), init() is optimized so that the old lexicon keeps working until the rebuild finishes; once it completes, the new lexicon takes over.
@Component
public class SensitiveWordService {

    @Autowired
    private SensitiveWordBs sensitiveWordBs;

    /**
     * Update the lexicon
     *
     * Whenever the database data changes, first call the method that updates the sensitive words in the database.
     * If the change needs to take effect, then call this method.
     *
     * Note: re-initialization does not affect the old lexicon while it runs. Once it completes, the new one takes over.
     */
    public void refresh() {
        // After each database change, update the sensitive words in the database first, then call this method.
        sensitiveWordBs.init();
    }

}
As above, whenever the lexicon in the database changes and the change needs to take effect, you can actively trigger sensitiveWordBs.init().
Other uses remain the same without restarting the application.
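The "old lexicon keeps working until the new one is ready" behavior described above is commonly achieved by building the replacement off to the side and then publishing it with a single reference assignment. The sketch below illustrates that pattern with a plain word set; it is my own assumption about how such a swap can be done, not the library's code. ReloadableWordSet is a hypothetical name, and treating the whitelist as "deny minus allow" is also an assumption.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of sensitive-word.
class ReloadableWordSet {

    // Readers only ever see a fully built set; volatile makes the swap visible to other threads.
    private volatile Set<String> words = Collections.emptySet();

    // Build the new set on the side, then publish it in one assignment.
    // Lookups that run during the rebuild keep using the previous set.
    void init(List<String> deny, List<String> allow) {
        Set<String> fresh = new HashSet<>(deny);
        fresh.removeAll(allow); // assumption: whitelisted words are simply removed
        words = Collections.unmodifiableSet(fresh);
    }

    boolean isSensitive(String word) {
        return words.contains(word);
    }
}
```

With this shape there is no need for add/remove methods or locks on the read path: a refresh is just another call to init() with the latest lists.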
The crown xi elder brother smiled slightly, want to work, first life.
Further reading
Sensitive word tool implementation ideas
DFA algorithm
Sensitive lexicon optimization process
Notes on thinking about stop words
Summary
Again, we use the law to defend ourselves, but we must not allow some people to turn everything into entertainment, thinking that money can buy everything.
On the occasion of the centenary, we must not let our ancestors’ blood flow in vain.
Not to mention that this is a "three-no" Canadian entertainer. I suggest he be dealt with according to the law, and then (ノ ‘pas) (beautiful Chinese words)