Simple requirements
Close to work, Xiao Ming finished today’s task, is ready to go home from work.
A message flashed up.
“Recently found that the public spelling check function is good, help users find typos, good experience. Make one for our system.”
Looking at the news, Xiao Ming silently greeting a sentence in his heart.
“I TND of can do this, go to somebody else headquarters to go to work directly, suffer your gas in this.”
“Ok,” Xiao Ming replied. “I’ll see first.”
Today, the king of Heaven came I also have to work, Jesus also can’t stay.
Xiao Ming thought, and went home.
Calm analysis
When it comes to spelling check, Xiao Ming actually knows.
I have never eaten pork, or seen pigs run.
Usually saw some public number big guy to share, said that the public number launched a spell check function, there will be no more typos.
Later, Xiao Ming still saw many typos in their articles. Then, there was no later.
Why not ask the all-powerful github?
Xiaoming opened Github and found that there seems to be no mature Java-related open source projects, and some of them have a few stars. It is not safe to use them.
NLP is supposed to do more python, Java to implement Chinese and English spelling check and error correction? But I only know how to write CRUD!
Xiao Ming silently picked up a thread…
The night outside the window is like water, and I can’t help thinking, where do I come from? Where to? What is the meaning of life?
Still the ashes of residual heat fell in the xiaoming east buy slippers, his mind runaway horse hot a clever.
Without any thought, without any clue, or first wash and sleep.
That night, Xiao Ming had a long dream. There were no typos, all the words were in the right place…
transit
The next day, Xiao Ming opened the search box, input spelling correct.
Gratifying is, found an English spelling correction algorithm explanation.
I have thought about it all day long, rather than learning it for a moment. Xiao Ming sighed and looked up.
Algorithm ideas
English words are mainly composed of 26 English letters, so there may be mistakes in spelling.
First, you can get the correct English words, excerpts are as follows:
apple,16192 applecart,41 applecarts,1 appledrain,1 appledrains,1 applejack,571 applejacks,4 appleringie,1 appleringies,1 apples,5914 applesauce,378 applesauces,1 applet,2Copy the code
Each line is separated by a comma, followed by the frequency of the word.
For example, if the user enters appL, if the word does not exist, you can insert/delete/replace it to find the closest word. (Essentially finding the word with the smallest edit distance)
If the entered word exists, it is correct and no processing is required.
Access to thesaurus
So where do you get an English thesaurus?
Xiao Ming thought about it, then went to various places to look up a circle, finally found a relatively perfect English word frequency word library, a total of 27W+ words.
Excerpts:
aa,1831
aah,45774
aahed,1
aahing,30
aahs,23
...
zythums,1
zyzzyva,2
zyzzyvas,1
zzz,76
zzzs,2
Copy the code
The core code
Get all possible cases of the user’s current input. The core code is as follows:
/** * Builds all possible error cases for the current word **@paramWord Enter the word *@returnReturn result *@since 0.0.1
* @authorAn old horse whistles against the west wind
private List<String> edits(String word) {
List<String> result = new LinkedList<>();
for (int i = 0; i < word.length(); ++i) {
result.add(word.substring(0, i) + word.substring(i + 1));
}
for (int i = 0; i < word.length() - 1; ++i) {
result.add(word.substring(0, i) + word.substring(i + 1, i + 2) + word.substring(i, i + 1) + word.substring(i + 2));
}
for (int i = 0; i < word.length(); ++i) {
for (char c = 'a'; c <= 'z'; ++c) {
result.add(word.substring(0, i) + c + word.substring(i + 1)); }}for (int i = 0; i <= word.length(); ++i) {
for (char c = 'a'; c <= 'z'; ++c) {
result.add(word.substring(0, i) + c + word.substring(i)); }}return result;
}
Copy the code
Then compare them to the correct words in the thesaurus:
List<String> options = edits(formatWord);
List<CandidateDto> candidateDtos = new LinkedList<>();
for (String option : options) {
if(wordDataMap.containsKey(option)) { CandidateDto dto = CandidateDto.builder() .word(option).count(wordDataMap.get(option)).build(); candidateDtos.add(dto); }}Copy the code
In the end, the result returned needs to be compared according to the frequency of the word, which is relatively simple overall.
Chinese spelling
A miss
The spelling of Chinese looks similar to English at first glance, but there is something very special about Chinese.
Because the spelling of all Chinese characters is fixed, there are no typo when users type, only other characters.
It makes no sense to say that a single word is another word. There must be a word, or a context.
That makes it much harder to correct.
Xiao Ming helplessly shook his head, Chinese culture, extensive and profound.
Algorithm ideas
There are many ways to correct Chinese characters:
(1) Confusion set.
For example, the commonly used other words, it is written as a mistake to change.
(2) N – “gramm
This is the context of a single word. The most widely used is the 2-gram. Sougou’s lab has the data.
In other words, when the first word is fixed, the second word will have the corresponding probability. The higher the probability, the more likely the user intended to input.
Like running really fast, actually running really fast might be the right thing to do.
Error correction
Of course, another difficulty with Chinese is that you can’t directly insert/delete/replace one character into another.
But similarly, there are many ways:
(1) Homophones/homonyms
(2) The shape of the word
(3) Synonyms
(4) Words and words out of order, words and words
Algorithm implementation
Due to the difficulty of implementation, Xiao Ming chose the simplest set of puzzles.
First, find a dictionary of common words. Here are some excerpts:
A crane, a bird of a feather, a bird of a feather, as usual, as usual. Dejected soul dejected soul stand to help and help and into the noise and into the dragon plate tiger according to the dragonCopy the code
The first is the other word, and the second is the correct usage.
Take the other characters as the dictionary, and then carry out the fast-forward participle of the Chinese text to get the corresponding correct form.
Of course, we can start simple, let the user input a fixed phrase, the implementation is to directly parse the corresponding map
public List<String> correctList(String word, int limit, IWordCheckerContext context) {
final Map<String, List<String>> wordData = context.wordData().correctData();
// Check whether it is incorrect
if(isCorrect(word, context)) {
return Collections.singletonList(word);
}
List<String> allList = wordData.get(word);
final int minLimit = Math.min(allList.size(), limit);
List<String> resultList = Guavas.newArrayList(minLimit);
for(int i = 0; i < minLimit; i++) {
resultList.add(allList.get(i));
}
return resultList;
}
Copy the code
Long mixed Chinese and English text
Algorithm ideas
Actual articles are generally mixed in Chinese and English.
To make it easier for users to use, you can’t just type one phrase at a time.
So what do we do?
The answer is participle, the input sentence, participle into a word. Then distinguish the Chinese and English, the corresponding processing.
For word segmentation, open source projects are recommended:
Github.com/houbb/segme…
Algorithm implementation
The modified core algorithm can be reused in Both Chinese and English.
@Override
public String correct(String text) {
if(StringUtil.isEnglish(text)) {
return text;
}
StringBuilder stringBuilder = new StringBuilder();
final IWordCheckerContext zhContext = buildChineseContext();
final IWordCheckerContext enContext = buildEnglishContext();
// The first step is to execute the participle
List<String> segments = commonSegment.segment(text);
// Only when all is true is considered true.
for(String segment : segments) {
// If it is English
if(StringUtil.isEnglish(segment)) {
String correct = enWordChecker.correct(segment, enContext);
stringBuilder.append(correct);
} else if(StringUtil.isChinese(segment)) {
String correct = zhWordChecker.correct(segment, zhContext);
stringBuilder.append(correct);
} else {
// Ignore othersstringBuilder.append(segment); }}return stringBuilder.toString();
}
Copy the code
The default implementation of the participle is as follows:
import com.github.houbb.heaven.util.util.CollectionUtil;
import com.github.houbb.nlp.common.segment.ICommonSegment;
import com.github.houbb.nlp.common.segment.impl.CommonSegments;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
/** * Default mixed participle, support Chinese and English. * *@author binbin.hou
* @since0.0.8 * /
public class DefaultSegment implements ICommonSegment {
@Override
public List<String> segment(String s) {
// Separated by Spaces
List<String> strings = CommonSegments.defaults().segment(s);
if(CollectionUtil.isEmpty(strings)) {
return Collections.emptyList();
}
List<String> results = new ArrayList<>();
ICommonSegment chineseSegment = InnerCommonSegments.defaultChinese();
for(String text : strings) {
// Perform Chinese word segmentation
List<String> segments = chineseSegment.segment(text);
results.addAll(segments);
}
returnresults; }}Copy the code
First, the word segmentation is carried out for the blank space, and then the fast-forward word segmentation is done for Chinese with the other characters of the puzzle set.
Of course, it’s not hard to say.
It is really more troublesome to achieve, Xiao Ming to complete the implementation has been open source:
Github.com/houbb/word-…
If you feel helpful, you can fork/star a wave
Quick start
Word-checker is used to check the spelling of words. Support English word spelling detection, and Chinese spelling detection.
Without further ado, let’s go through the experience of using the utility class directly.
Features that
-
Can quickly determine if the current word is misspelled
-
You can return the best match
-
You can return a list of correct matches, with support for specifying the size of the returned list
-
Error message I18N is supported
-
Support case, full corner and half corner formatting
-
Support for custom thesaurus
-
Built-in 27W+ English thesaurus
-
Support basic Chinese spelling detection
Quick start
Maven is introduced into
<dependency>
<groupId>com.github.houbb</groupId>
<artifactId>word-checker</artifactId>
<version>0.0.8</version>
</dependency>
Copy the code
The test case
Based on the input, the best corrective results are automatically returned.
final String speling = "speling";
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));
Copy the code
Core API Introduction
The core API is under the EnWordCheckers utility class.
function | methods | parameter | The return value | note |
---|---|---|---|---|
Determine if the spelling is correct | isCorrect(string) | The words to be tested | boolean | |
Returns the best corrective result | correct(string) | The words to be tested | String | If no word can be corrected is found, the word itself is returned |
Determine if the spelling is correct | correctList(string) | The words to be tested | List | Returns a list of all matching corrections |
Determine if the spelling is correct | correctList(string, int limit) | The word to be tested, returning the size of the list | Returns a list of corrections of the specified size | The size of the list is less than or equal to LIMIT |
Test cases
See EnWordCheckerTest. Java
Is it spelled correctly
final String hello = "hello";
final String speling = "speling";
Assert.assertTrue(EnWordCheckers.isCorrect(hello));
Assert.assertFalse(EnWordCheckers.isCorrect(speling));
Copy the code
Returns the best match
final String hello = "hello";
final String speling = "speling";
Assert.assertEquals("hello", EnWordCheckers.correct(hello));
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));
Copy the code
The match list is corrected by default
final String word = "goox";
List<String> stringList = EnWordCheckers.correctList(word);
Assert.assertEquals("[good, goo, goon, goof, gook, goop, goos, gox, goog, gool, goor]", stringList.toString());
Copy the code
Specifies the size of the correct match list
final String word = "goox";
final int limit = 2;
List<String> stringList = EnWordCheckers.correctList(word, limit);
Assert.assertEquals("[good, goo]", stringList.toString());
Copy the code
Chinese spelling correction
Core API
To reduce learning costs, the core API and ZhWordCheckers are consistent with English spelling detection.
Is it spelled correctly
final String right = "Right";
final String error = "Everything changes.";
Assert.assertTrue(ZhWordCheckers.isCorrect(right));
Assert.assertFalse(ZhWordCheckers.isCorrect(error));
Copy the code
Returns the best match
final String right = "Right";
final String error = "Everything changes.";
Assert.assertEquals("Right", ZhWordCheckers.correct(right));
Assert.assertEquals("All change is the same.", ZhWordCheckers.correct(error));
Copy the code
The match list is corrected by default
final String word = "Everything changes.";
List<String> stringList = ZhWordCheckers.correctList(word);
Assert.assertEquals("Every change is the same.", stringList.toString());
Copy the code
Specifies the size of the correct match list
final String word = "Everything changes.";
final int limit = 1;
List<String> stringList = ZhWordCheckers.correctList(word, limit);
Assert.assertEquals("Every change is the same.", stringList.toString());
Copy the code
Long text mixed in Chinese and English
scenario
For actual spelling correction, the best experience is when the user enters a long text, possibly in a mix of Chinese and English.
Then implement the corresponding functions described above.
Core method
The WordCheckers utility class provides autocorrect for long text mixed in Chinese and English.
function | methods | parameter | The return value | note |
---|---|---|---|---|
The spelling of the text is correct | isCorrect(string) | The text to be detected | boolean | Return true if all are true |
Returns the best corrective result | correct(string) | The words to be tested | String | If no text can be corrected is found, it returns itself |
Determine if the text is spelled correctly | correctMap(string) | The words to be tested | Map | Returns a list of all matching corrections |
Determine if the text is spelled correctly | correctMap(string, int limit) | The text to be detected, returning the size of the list | Returns a list of corrections of the specified size | The size of the list is less than or equal to LIMIT |
Is the spelling correct
final String hello = "Hello, hello";
final String speling = "Speling, you use poison to poison.";
Assert.assertTrue(WordCheckers.isCorrect(hello));
Assert.assertFalse(WordCheckers.isCorrect(speling));
Copy the code
Returns the best corrective result
final String hello = "Hello, hello";
final String speling = "Speling, you use poison to poison.";
Assert.assertEquals("Hello, hello", WordCheckers.correct(hello));
Assert.assertEquals("Hey, Spelling, let's fight fire with fire.", WordCheckers.correct(speling));
Copy the code
Determine if the text is spelled correctly
For each word, the corresponding corrective result.
final String hello = "Hello, hello";
final String speling = "Speling, you use poison to poison.";
Assert.assertEquals("{hello=[hello], =[], you =[you], good =[good]}", WordCheckers.correctMap(hello).toString());
Assert.assertEquals("{=[], speling=[spelling, spewing, sperling, spelding, spelding], you =[you], ok =[ok], Set fire to fire =[set fire to fire]}", WordCheckers.correctMap(speling).toString());
Copy the code
Determine if the text is spelled correctly
As above, specify the maximum number of returns.
final String hello = "Hello, hello";
final String speling = "Speling, you use poison to poison.";
Assert.assertEquals("{hello=[hello], =[], you =[you], good =[good]}", WordCheckers.correctMap(hello, 2).toString());
Assert.assertEquals("{= [], speling = [spelling, spewing], you = [you], good = [good], to poison poison = work [with]}", WordCheckers.correctMap(speling, 2).toString());
Copy the code
Formatting process
Sometimes the user input is varied, and the tool supports formatting.
case
Uppercase will be uniformly formatted as lowercase.
final String word = "stRing";
Assert.assertTrue(EnWordCheckers.isCorrect(word));
Copy the code
The Angle of half Angle
The full Angle will be uniformly formatted as a half Angle.
final String word = "string";
Assert.assertTrue(EnWordCheckers.isCorrect(word));
Copy the code
Custom English thesaurus
Configuration file
You can create a file resources/data/define_word_checker_en.txt in the project resources directory
The contents are as follows:
my-long-long-define-word,2
my-long-long-define-word-two
Copy the code
Different words stand on separate lines.
The first column of each line represents the word, and the second column represents the number of occurrences, separated by a comma.
The greater the number of times, the higher the priority to return when correcting, which defaults to 1.
User – defined thesaurus takes precedence over system – built thesaurus.
The test code
After we specify the corresponding word, the spelling test will take effect.
final String word = "my-long-long-define-word";
final String word2 = "my-long-long-define-word-two";
Assert.assertTrue(EnWordCheckers.isCorrect(word));
Assert.assertTrue(EnWordCheckers.isCorrect(word2));
Copy the code
Custom Chinese thesaurus
Configuration file
You can create a file resources/data/define_word_checker_zh.txt in the project resource directory
The contents are as follows:
Stick to the rulesCopy the code
Use the English space to separate, error followed by correct.
summary
The correction of Chinese and English spelling has always been a hot and difficult topic.
In recent years, commercial applications have been successful because of advances in NLP and artificial intelligence.
The main implementation is based on the traditional algorithm, the core in thesaurus.
Xiaoming put the complete implementation has been open source:
Github.com/houbb/word-…
Welcome to fork/star
subsequent
After several days of hard work, Xiao Ming finally finished a simple spell-checking tool.
“Last time and I said the public spelling check function still want?”
“No, YOU don’t say I have forgotten.” “The product appears somewhat surprised. It doesn’t matter whether we do that or not. We’ve been squeezing a bunch of business needs lately. You’re on top of it.”
“……”
“I recently saw a function on XXX is also very good, you give our system to do one.”
“……”