Sensitive words don’t have a platform
Recently, a Canadian rapper was exposed for an indecent private life, possibly including seducing underage girls.
Of course, whether it is true, and whether someone is a womanizer, is recorded clearly in the WeChat and QQ chat logs. If this rises to a criminal case, TX can review the complete chat history. Tencent Video terminating its contract with a certain "electric eel" is not necessarily without basis; after all, there are interests at stake.
But there were two things I noticed about the whole process:
(1) Xiao Yi, his rumored girlfriend, was abused online until she deleted all her social media accounts. Can a platform as big as X blog only let its servers go down? Does it not know about sensitive word filtering?
(2) The whistleblower, Du Meizhu, received many bloody photos. Do the big platforms that brag about artificial intelligence every day not have a filtering function either?
Of course, I know nothing about artificial intelligence.
But as for sensitive words, I recently wrote a small tool. It is already open source, and any platform that needs it is welcome to take it.
Github.com/houbb/sensi…
At the very least, it can mask "beautiful Chinese" like the following:
You XXX, I XXX you XXX, you XXXX, XXX!! XXX!
Writing purpose
The tool is based on the DFA algorithm. The sensitive word lexicon contains 60,000+ entries (the source file had 180,000+ entries before pruning).
Going forward, the lexicon will be continuously optimized and supplemented, and the performance of the algorithm will be improved further.
I would like to classify sensitive words in finer detail, but the workload is heavy, so I haven’t done it yet.
As for the vision: to be the number one sensitive word tool.
Of course, first place is always left vacant.
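The DFA approach mentioned above can be pictured as a character trie: each node is a state, each character is a transition, and a flag marks the states where a word ends. The following is a minimal sketch under that assumption, not the library's actual implementation; the class `DfaSketch` and all of its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative trie-based matcher (hypothetical; not the library's code). */
class DfaSketch {

    /** One automaton state: outgoing transitions plus an end-of-word flag. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<Character, Node>();
        boolean end;
    }

    private final Node root = new Node();

    /** Adds one word by walking, or creating, one state per character. */
    void addWord(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            Node child = node.next.get(c);
            if (child == null) {
                child = new Node();
                node.next.put(c, child);
            }
            node = child;
        }
        node.end = true;
    }

    /** Length of the longest lexicon word starting at index start, or 0 if none. */
    private int matchLength(String text, int start) {
        Node node = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.end) {
                longest = i - start + 1;
            }
        }
        return longest;
    }

    /** Returns every matched word, scanning left to right without overlaps. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                hits.add(text.substring(i, i + len));
                i += len;
            } else {
                i++;
            }
        }
        return hits;
    }

    /** Replaces every character of each matched word with the mask character. */
    String replace(String text, char mask) {
        StringBuilder sb = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            for (int j = i; j < i + len; j++) {
                sb.setCharAt(j, mask);
            }
            i += Math.max(len, 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        DfaSketch dfa = new DfaSketch();
        dfa.addWord("bad");
        dfa.addWord("badword");
        System.out.println(dfa.findAll("a badword here"));      // prints [badword]
        System.out.println(dfa.replace("a badword here", '*')); // prints a ******* here
    }
}
```

Each character of the text is followed through the trie at most once per start position, so lookup cost depends on the text and the longest word rather than on the number of words in the lexicon, which is the usual reason for choosing this structure with a large word list.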
Features
- 60,000+ word lexicon, continuously optimized and updated
- Based on the DFA algorithm, with good performance
- Fluent API implementation, elegant and concise to use
- Supports common operations on sensitive words: checking, returning, and masking
- Supports full-width/half-width conversion
- Supports English upper/lower case conversion
- Supports conversion between common forms of digits
- Supports traditional/simplified Chinese conversion
- Supports conversion between common forms of English letters
- Supports user-defined sensitive words and whitelists
- Supports dynamic data updates that take effect in real time
Quick start
Prerequisites
- JDK 1.7+
- Maven 3.x+
Maven dependency
<dependency>
<groupId>com.github.houbb</groupId>
<artifactId>sensitive-word</artifactId>
<version>0.0.15</version>
</dependency>
API overview
The core methods of SensitiveWordHelper are as follows:
Method | Parameter | Return value | Description |
---|---|---|---|
contains(String) | String to be checked | boolean | Checks whether the string contains sensitive words |
findAll(String) | String to be checked | List of strings | Returns all sensitive words in the string |
replace(String, char) | String to be masked, and the replacement char | String | Returns the masked string, with sensitive words replaced by the specified char |
replace(String) | String to be masked | String | Returns the masked string, with sensitive words replaced by * |
Usage examples
See SensitiveWordHelperTest for all test cases.
Check whether text contains sensitive words
// The string literals in these examples are English renderings of the original Chinese test data.
final String text = "The Five-star Red Flag flutters in the wind, and the portrait of Chairman Mao stands in front of Tiananmen.";
Assert.assertTrue(SensitiveWordHelper.contains(text));
Return the first sensitive word
final String text = "The Five-star Red Flag flutters in the wind, and the portrait of Chairman Mao stands in front of Tiananmen.";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("Five-star Red Flag", word);
Return all sensitive words
final String text = "The Five-star Red Flag flutters in the wind, and the portrait of Chairman Mao stands in front of Tiananmen.";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Five-star Red Flag, Chairman Mao, Tiananmen]", wordList.toString());
Default replacement policy
// Each sensitive word is masked with '*'; the mask lengths (4, 3, 3) follow the original Chinese words.
final String text = "The Five-star Red Flag flutters in the wind, and the portrait of Chairman Mao stands in front of Tiananmen.";
String result = SensitiveWordHelper.replace(text);
Assert.assertEquals("The **** flutters in the wind, and the portrait of *** stands in front of ***.", result);
Specify the replacement character
final String text = "The Five-star Red Flag flutters in the wind, and the portrait of Chairman Mao stands in front of Tiananmen.";
String result = SensitiveWordHelper.replace(text, '0');
Assert.assertEquals("The 0000 flutters in the wind, and the portrait of 000 stands in front of 000.", result);
More features
The following features handle the various ways text can be disguised, to raise the hit rate for sensitive words as much as possible.
This is a long battle.
Ignore case
final String text = "fuCK the bad words.";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuCK", word);
Ignore full-width/half-width
final String text = "ｆｕｃｋ the bad words.";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("ｆｕｃｋ", word);
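One plausible way to get the "ignore case" and "ignore width" behavior shown above is to normalize every character before it is looked up, while still reporting the original substring to the caller. The sketch below assumes exactly that; `WidthCaseNormalizer` is a hypothetical helper, and the library may implement the feature differently. It relies on the Unicode layout in which the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII U+0021 to U+007E.

```java
/** Illustrative normalizer (hypothetical; the library may normalize differently). */
class WidthCaseNormalizer {

    /** Maps a full-width character to its half-width form, then lower-cases it. */
    static char normalize(char c) {
        if (c == '\u3000') {
            c = ' ';                 // ideographic space -> ASCII space
        } else if (c >= '\uFF01' && c <= '\uFF5E') {
            c = (char) (c - 0xFEE0); // full-width ASCII block -> U+0021..U+007E
        }
        return Character.toLowerCase(c);
    }

    /** Normalizes a whole string character by character. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(normalize(text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The argument is "fuCK" written with full-width letters (U+FF46 U+FF55 U+FF23 U+FF2B).
        System.out.println(normalize("\uFF46\uFF55\uFF23\uFF2B")); // prints fuck
    }
}
```

Because the mapping is one character to one character, positions in the normalized text line up with positions in the original, which makes it easy to return the original spelling (such as "fuCK") after a match.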
Ignore number styles
Common variant forms of digits are normalized here.
final String text = "This is my WeChat: 9⓿二肆⁹₈③⑸⒋➃㈤㊄";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[9⓿二肆⁹₈③⑸⒋➃㈤㊄]", wordList.toString());
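The digit conversion described above can be sketched as a lookup from variant forms to plain ASCII digits. This is an illustration under stated assumptions, not the library's mapping table: `DigitNormalizer` is a hypothetical class, and it covers only four families (full-width digits, circled digits, parenthesized digits, and the basic Chinese numerals), whereas a real table would need many more forms.

```java
/** Illustrative digit normalizer (hypothetical; covers only a few variant families). */
class DigitNormalizer {

    // Basic Chinese numerals for 0..9, written as escapes: U+3007, U+4E00, U+4E8C, ...
    private static final String CHINESE =
            "\u3007\u4E00\u4E8C\u4E09\u56DB\u4E94\u516D\u4E03\u516B\u4E5D";

    /** Maps one digit variant to '0'..'9'; other characters are returned unchanged. */
    static char normalize(char c) {
        if (c >= '\uFF10' && c <= '\uFF19') {
            return (char) ('0' + (c - '\uFF10')); // full-width digits 0-9
        }
        if (c >= '\u2460' && c <= '\u2468') {
            return (char) ('1' + (c - '\u2460')); // circled digits 1-9
        }
        if (c >= '\u2474' && c <= '\u247C') {
            return (char) ('1' + (c - '\u2474')); // parenthesized digits 1-9
        }
        int index = CHINESE.indexOf(c);
        return index >= 0 ? (char) ('0' + index) : c;
    }

    /** Normalizes a whole string character by character. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(normalize(text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Full-width 9, circled 8, Chinese 7, parenthesized 6.
        System.out.println(normalize("\uFF19\u2467\u4E03\u2479")); // prints 9876
    }
}
```

Once every variant is folded to an ASCII digit, a disguised phone or account number becomes an ordinary run of digits that a length-based check can catch.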
Ignore traditional/simplified Chinese
// In the original test the flag's name is written in traditional Chinese characters.
final String text = "I love my country and the Five-star Red Flag.";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Five-star Red Flag]", wordList.toString());
Ignore English letter styles
final String text = "Ⓕⓤc⒦ the bad words.";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Ⓕⓤc⒦]", wordList.toString());
Ignore repeated characters
final String text = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ the bad words.";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦]", wordList.toString());
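A simple way to think about "ignore repeated characters" is to collapse each run of identical characters into one before matching, after the other normalizations have been applied. The sketch below shows only that collapsing step; `RepeatCollapser` is a hypothetical name, and the library's own handling of repeats may work differently (for example, by skipping repeats during the automaton walk).

```java
/** Illustrative run collapser (hypothetical; not the library's code). */
class RepeatCollapser {

    /** Collapses every run of identical consecutive characters into a single character. */
    static String collapse(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (sb.length() == 0 || sb.charAt(sb.length() - 1) != c) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("ffffuuuucck")); // prints fuck
    }
}
```

Note the trade-off: collapsing also shortens legitimate doubled letters ("good" becomes "god"), so with this approach the lexicon entries would have to be collapsed the same way, and false positives become more likely.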
Mail detection
final String text = "Good man, email [email protected]";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[[email protected]]", wordList.toString());
Feature configuration
Description
All of the features above are enabled by default, but a business sometimes needs to switch them on or off flexibly.
For this reason, v0.0.14 exposes them as configurable properties.
Configuration method
To keep usage elegant, configuration is defined through a fluent API.
Users can configure SensitiveWordBs as follows:
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
.ignoreCase(true)
.ignoreWidth(true)
.ignoreNumStyle(true)
.ignoreChineseStyle(true)
.ignoreEnglishStyle(true)
.ignoreRepeat(true)
.enableNumCheck(true)
.enableEmailCheck(true)
.enableUrlCheck(true)
.init();
final String text = "The Five-star Red Flag flutters in the wind, and the portrait of Chairman Mao stands in front of Tiananmen.";
Assert.assertTrue(wordBs.contains(text));
Configuration description
Each option is described below:
No. | Method | Description |
---|---|---|
1 | ignoreCase | Ignore case |
2 | ignoreWidth | Ignore full-width/half-width |
3 | ignoreNumStyle | Ignore number styles |
4 | ignoreChineseStyle | Ignore Chinese writing styles (traditional/simplified) |
5 | ignoreEnglishStyle | Ignore English letter styles |
6 | ignoreRepeat | Ignore repeated characters |
7 | enableNumCheck | Whether number detection is enabled. By default, 8 consecutive digits are treated as a sensitive word |
8 | enableEmailCheck | Whether email detection is enabled |
9 | enableUrlCheck | Whether URL detection is enabled |
Dynamic loading (user-defined)
Scenario
Sometimes we want sensitive words to be loaded dynamically, for example edited in an admin console and then taking effect in real time.
v0.0.13 supports this feature.
Interface description
To implement this feature while staying compatible with earlier functionality, two interfaces are defined.
IWordDeny
The interface is as follows, and you can provide your own implementation.
It returns a list; every word in the list is treated as a sensitive word.
/**
 * Deny list: the returned content is treated as sensitive words.
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordDeny {
    /**
     * @return the list of sensitive words
     * @since 0.0.13
     */
    List<String> deny();
}
For example:
public class MyWordDeny implements IWordDeny {
    @Override
    public List<String> deny() {
        return Arrays.asList("My custom sensitive words");
    }
}
IWordAllow
The interface is as follows, and you can provide your own implementation.
It returns a list; every word in the list is treated as not sensitive.
/**
 * Allow list: the returned content is not treated as sensitive words.
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordAllow {
    /**
     * @return the list of allowed words
     * @since 0.0.13
     */
    List<String> allow();
}
For example:
public class MyWordAllow implements IWordAllow {
    @Override
    public List<String> allow() {
        return Arrays.asList("Five-star Red Flag");
    }
}
Configuration and usage
After the interfaces are implemented, they must be registered to take effect.
To keep usage elegant, we designed the bootstrap class SensitiveWordBs.
You can specify sensitive words with wordDeny(), non-sensitive words with wordAllow(), and initialize the sensitive word dictionary with init().
System default configuration
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
.wordDeny(WordDenys.system())
.wordAllow(WordAllows.system())
.init();
final String text = "The Five-star Red Flag flutters in the wind, and the portrait of Chairman Mao stands in front of Tiananmen.";
Assert.assertTrue(wordBs.contains(text));
Note: init() builds the sensitive word DFA, which is time-consuming. Call it only once during application startup rather than initializing repeatedly!
Specify your own implementation
We can test the custom implementation as follows:
String text = "This is a test. My custom sensitive words.";
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
.wordDeny(new MyWordDeny())
.wordAllow(new MyWordAllow())
.init();
Assert.assertEquals("[My custom sensitive words]", wordBs.findAll(text).toString());
Here only "My custom sensitive words" is a sensitive word, while "test" is not.
Of course, this uses only our custom implementations; it is generally recommended to combine the system default configuration with a custom configuration.
You can do so as follows.
Configure multiple sources
- Multiple sensitive word sources
The WordDenys.chains() method merges multiple implementations into a single IWordDeny.
- Multiple whitelists
The WordAllows.chains() method merges multiple implementations into a single IWordAllow.
Example:
String text = "This is a test. My custom sensitive words.";
IWordDeny wordDeny = WordDenys.chains(WordDenys.system(), new MyWordDeny());
IWordAllow wordAllow = WordAllows.chains(WordAllows.system(), new MyWordAllow());
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
.wordDeny(wordDeny)
.wordAllow(wordAllow)
.init();
Assert.assertEquals("[My custom sensitive words]", wordBs.findAll(text).toString());
Both the default configuration and the custom configuration are used.
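To make the idea of chaining concrete, here is a small self-contained sketch. It does not use the library: `WordSource` is a hypothetical stand-in for IWordDeny and IWordAllow, and the rule that the effective lexicon is "all denied words minus all allowed words" is an assumption made for illustration, not something confirmed here about the library's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative chaining of word sources (hypothetical; not the library's code). */
class ChainSketch {

    /** Hypothetical stand-in for IWordDeny / IWordAllow: anything that yields words. */
    interface WordSource {
        List<String> words();
    }

    /** A source backed by a fixed list of words. */
    static WordSource fixed(final String... words) {
        return new WordSource() {
            @Override
            public List<String> words() {
                return Arrays.asList(words);
            }
        };
    }

    /** Merges several sources into one, dropping duplicates and keeping first-seen order. */
    static WordSource chains(final WordSource... sources) {
        return new WordSource() {
            @Override
            public List<String> words() {
                Set<String> merged = new LinkedHashSet<String>();
                for (WordSource source : sources) {
                    merged.addAll(source.words());
                }
                return new ArrayList<String>(merged);
            }
        };
    }

    /** Assumed combination rule: denied words minus allowed words. */
    static Set<String> effective(WordSource deny, WordSource allow) {
        Set<String> result = new LinkedHashSet<String>(deny.words());
        result.removeAll(allow.words());
        return result;
    }

    public static void main(String[] args) {
        WordSource deny = chains(fixed("alpha", "beta"), fixed("beta", "gamma"));
        WordSource allow = fixed("alpha");
        System.out.println(effective(deny, allow)); // prints [beta, gamma]
    }
}
```

Because a chained source is itself just another source, system defaults and any number of custom sources can be combined without the consumer knowing how many there are.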
Spring integration
Background
In practice, we may want to edit the configuration on an admin page and have it take effect in real time.
The data is stored in a database. The following is a pseudo-code example; see SpringSensitiveWordConfig.java for reference.
Requires v0.0.15 or later.
Custom data source
The simplified pseudo-code is as follows; the data source is a database.
MyDdWordAllow and MyDdWordDeny are custom implementation classes backed by the database.
@Configuration
public class SpringSensitiveWordConfig {
    @Autowired
    private MyDdWordAllow myDdWordAllow;
    @Autowired
    private MyDdWordDeny myDdWordDeny;

    /**
     * Initialize the bootstrap class.
     * @return the initialized bootstrap class
     * @since 1.0.0
     */
    @Bean
    public SensitiveWordBs sensitiveWordBs() {
        SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
                .wordAllow(WordAllows.chains(WordAllows.system(), myDdWordAllow))
                .wordDeny(myDdWordDeny)
                // Various other configurations
                .init();
        return sensitiveWordBs;
    }
}
Initializing the sensitive lexicon is time-consuming, so it is recommended to call init() once when the program starts.
Dynamic changes
To let sensitive words change in real time while keeping the interface as simple as possible, no add/remove methods are provided.
Instead, calling sensitiveWordBs.init() rebuilds the sensitive lexicon from IWordDeny + IWordAllow.
Because initialization can take a long time (on the order of seconds), init is optimized so that the old lexicon keeps working until the rebuild completes, after which the new lexicon takes over.
@Component
public class SensitiveWordService {
    @Autowired
    private SensitiveWordBs sensitiveWordBs;

    /**
     * Refresh the lexicon.
     * Whenever the database changes, first update the sensitive words in the database,
     * then call this method for the change to take effect.
     * Note: re-initialization does not affect the old lexicon; once it completes, the new one takes over.
     */
    public void refresh() {
        sensitiveWordBs.init();
    }
}
As shown above, whenever a change to the database lexicon needs to take effect, you can actively trigger a re-initialization with sensitiveWordBs.init().
Everything else stays the same, and the application does not need to be restarted.
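The behavior described in this section, where the old lexicon keeps serving requests until the rebuild finishes, can be sketched with a build-then-swap pattern. This is an assumption about one reasonable way to achieve it, not a description of the library's internals; `SwappableLexicon` is a hypothetical class and a plain set stands in for the DFA.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative build-then-swap lexicon holder (hypothetical; not the library's code). */
class SwappableLexicon {

    // Readers always see a complete lexicon: either the old one or the new one.
    private final AtomicReference<Set<String>> current =
            new AtomicReference<Set<String>>(Collections.<String>emptySet());

    /** Builds a fresh lexicon off to the side, then publishes it with one atomic swap. */
    void init(List<String> words) {
        Set<String> fresh = new HashSet<String>(words);  // the slow build step
        current.set(Collections.unmodifiableSet(fresh)); // old lexicon served until here
    }

    /** Looks the word up in whichever lexicon is currently published. */
    boolean isSensitive(String word) {
        return current.get().contains(word);
    }

    public static void main(String[] args) {
        SwappableLexicon lexicon = new SwappableLexicon();
        lexicon.init(Arrays.asList("old"));
        lexicon.init(Arrays.asList("new"));
        System.out.println(lexicon.isSensitive("old")); // prints false
        System.out.println(lexicon.isSensitive("new")); // prints true
    }
}
```

The key design choice is that the published lexicon is never mutated in place, so lookups need no locking and a slow rebuild never exposes a half-built structure.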
Edison smiled, to do things, to be a man.
Further reading
Implementation ideas for the sensitive word tool
DFA algorithm
Optimizing the sensitive word lexicon
Notes on stop words
Summary
Once again: we defend ourselves with the law, but we must not let certain people treat everything as entertainment and believe that money can buy everything.
On the occasion of the centenary, we must not let the blood of our ancestors have been shed in vain.
Moreover, is a Canadian entertainer of the three, proposed to dispose of the blue, and then (Blue ‘daredevil) blue (beautiful Chinese)