Word error correction

When using Word or other text editing software, we often come across the spelling correction feature. For example, in Word:

Misspelled words

Word correction algorithm

First, we need a corpus, as almost every NLP task does. The corpus used for word correction here is big.txt, which contains the following:

  • Gutenberg corpus data;
  • Wiktionary;
  • A list of the most commonly used words in the British National Corpus.

The corpus can be downloaded from github.com/percent4/-w… .

Python implementation

The full Python code for word correction (spelling_correcter.py) is as follows:

# -*- coding: utf-8 -*-
import re
import collections

def tokens(text):
    """ Get all words from the corpus """
    return re.findall('[a-z]+', text.lower())

with open('E://big.txt', 'r') as f:
    WORDS = tokens(f.read())
WORD_COUNTS = collections.Counter(WORDS)

def known(words):
    """ Return the subset of words that are actually in our WORD_COUNTS dictionary. """
    return {w for w in words if w in WORD_COUNTS}


def edits0(word):
    """ Return all strings that are zero edits away from the input word (i.e., the word itself). """
    return {word}


def edits1(word):
    """ Return all strings that are one edit away from the input word. """
    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def splits(word):
        """ Return a list of all possible (first, rest) pairs that the input word is made of. """
        return [(word[:i], word[i:]) for i in range(len(word) + 1)]

    pairs = splits(word)
    deletes = [a + b[1:] for (a, b) in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for (a, b) in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for (a, b) in pairs for c in alphabet if b]
    inserts = [a + c + b for (a, b) in pairs for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """ Return all strings that are two edits away from the input word. """
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def correct(word):
    """ Get the best correct spelling for the input word """
    # Priority is for edit distance 0, then 1, then 2
    # else defaults to the input word itself.
    candidates = (known(edits0(word)) or
                  known(edits1(word)) or
                  known(edits2(word)) or
                  [word])
    return max(candidates, key=WORD_COUNTS.get)


def correct_match(match):
    """ Spell-correct word in match, and preserve proper upper/lower/title case. """

    word = match.group()

    def case_of(text):
        """ Return the case-function appropriate for text: upper, lower, title, or just str. """
        return (str.upper if text.isupper() else
                str.lower if text.islower() else
                str.title if text.istitle() else
                str)

    return case_of(word)(correct(word.lower()))


def correct_text_generic(text):
    """ Correct all the words within a text, returning the corrected text. """
    return re.sub('[a-zA-Z]+', correct_match, text)
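To get a feel for how many candidates a single edit step produces and how the frequency counts decide between them, here is a minimal, self-contained sketch. It re-implements the one-edit generator in compact form and uses a tiny made-up count table: TOY_COUNTS, one_edit_lists and toy_correct are hypothetical names for illustration only, and the counts are not taken from big.txt.

```python
import collections

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

# Hypothetical toy frequencies, standing in for the counts built from big.txt.
TOY_COUNTS = collections.Counter({'castle': 5, 'cast': 9, 'caste': 1, 'case': 20})


def one_edit_lists(word):
    """Return the four lists of one-edit strings (duplicates kept)."""
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in pairs if b for c in ALPHABET]
    inserts = [a + c + b for a, b in pairs for c in ALPHABET]
    return deletes, transposes, replaces, inserts


def toy_correct(word):
    """Return the word if known, else its most frequent known one-edit neighbour."""
    if word in TOY_COUNTS:
        return word
    deletes, transposes, replaces, inserts = one_edit_lists(word)
    known = {w for w in deletes + transposes + replaces + inserts if w in TOY_COUNTS}
    return max(known, key=TOY_COUNTS.get) if known else word


deletes, transposes, replaces, inserts = one_edit_lists('castel')
# For a word of length n: n deletes, n - 1 transposes, 26n replaces, 26(n + 1) inserts.
print(len(deletes), len(transposes), len(replaces), len(inserts))  # 6 5 156 182
print(toy_correct('castel'))  # castle
```

Both 'castle' (one transposition) and 'caste' (one deletion) are a single edit away from 'castel'; the more frequent one wins. 'cast' and 'case' have higher toy counts but are two edits away, so they are never considered once a one-edit candidate exists.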

Test

With the word correction program above, let’s test some words and sentences:

original_word_list = ['fianlly', 'castel', 'case', 'monutaiyn', 'foresta',
                      'helloa', 'forteen', 'persreve', 'kisss', 'forteen helloa',
                      'phons forteen Doora. This is from Chinab.']

for original_word in original_word_list:
    correct_word = correct_text_generic(original_word)
    print('Original word: %s\nCorrect word: %s' % (original_word, correct_word))

The following output is displayed:

Original word: fianlly
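The last test sentence mixes lowercase and capitalized words, which is where the case handling in correct_match matters. The following sketch isolates that step. TOY_FIXES and fix_keep_case are hypothetical stand-ins for the corpus-based correct function, used only so the example runs on its own; the mappings are made up and are not results of the real program.

```python
def case_of(text):
    """Return the str method that reproduces the casing of text."""
    return (str.upper if text.isupper() else
            str.lower if text.islower() else
            str.title if text.istitle() else
            str)


# Hypothetical stand-in for correct(): a fixed lookup instead of the corpus model.
TOY_FIXES = {'doora': 'door', 'chinab': 'china', 'forteen': 'fourteen'}


def fix_keep_case(word):
    """Correct the lowercased word, then re-apply the original casing."""
    fixed = TOY_FIXES.get(word.lower(), word.lower())
    return case_of(word)(fixed)


for w in ['forteen', 'Doora', 'CHINAB', 'iPhone']:
    print(w, '->', fix_keep_case(w))
```

Note that a mixed-case word such as 'iPhone' matches none of the three tests, so case_of falls back to str and the lowercased result is returned as-is.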

Next, we test the following Word document (Spelling Error.docx), available at github.com/percent4/-w… :

Word documents with Word errors

The Python code for word-correcting this document is as follows:

from docx import Document
from docx.shared import RGBColor
from nltk import sent_tokenize, word_tokenize
from spelling_correcter import correct_text_generic

# Number of words changed in the document
COUNT_CORRECT = 0
# Open the document to be corrected
file = Document("E://Spelling Error.docx")

#print("Paragraph Number :"+str(len(file.paragraphs)))

# Punctuation tokens written back unchanged (the leading comma is assumed)
punkt_list = r",.?\"'!()/\\-<>:@#$%^&*~"

# New document that receives the corrected text
document = Document()

def write_correct_paragraph(i):
    global COUNT_CORRECT

    # Text of the i-th paragraph
    paragraph = file.paragraphs[i].text.strip()
    # Split the paragraph into sentences
    sentences = sent_tokenize(text=paragraph)
    # Split each sentence into words
    words_list = [word_tokenize(sentence) for sentence in sentences]

    p = document.add_paragraph(' '*7)  # handle to the new paragraph

    for word_list in words_list:
        for word in word_list:
            # First word of the paragraph's first sentence: capitalize its first letter
            if word_list.index(word) == 0 and words_list.index(word_list) == 0:
                if word not in punkt_list:
                    p.add_run(' ')
                    # Correct the word; a correctly spelled word is returned unchanged
                    correct_word = correct_text_generic(word)
                    # If the word was changed, color it (this branch uses blue)
                    if correct_word != word:
                        colored_word = p.add_run(correct_word[0].upper() + correct_word[1:])
                        font = colored_word.font
                        font.color.rgb = RGBColor(0x00, 0x00, 0xFF)
                        COUNT_CORRECT += 1
                    else:
                        p.add_run(correct_word[0].upper() + correct_word[1:])
                else:
                    p.add_run(word)
            else:
                p.add_run(' ')
                # Correct the word; a correctly spelled word is returned unchanged
                correct_word = correct_text_generic(word)
                if word not in punkt_list:
                    # If the word was changed, color it red
                    if correct_word != word:
                        colored_word = p.add_run(correct_word)
                        font = colored_word.font
                        font.color.rgb = RGBColor(0xFF, 0x00, 0x00)
                        COUNT_CORRECT += 1
                    else:
                        p.add_run(correct_word)
                else:
                    p.add_run(word)

for i in range(len(file.paragraphs)):
    write_correct_paragraph(i)

document.save('E://correct_document.docx')

print('Modify and save the file!')
print('changed %d altogether.' % COUNT_CORRECT)

The output is as follows:

Modify and save the file!

The modified Word document is as follows:

Word document after Word correction

The words shown in red are the ones that were misspelled in the original document and have now been corrected. In total, 19 changes were made.
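The bookkeeping behind the highlighted words and the change count can be separated from python-docx and NLTK. The sketch below is a simplified illustration under stated assumptions: it tokenizes with a plain regular expression rather than NLTK's word_tokenize, and mark_changes and the toy corrector are hypothetical names that do not appear in the original script.

```python
import re


def mark_changes(text, corrector):
    """Return (token, changed) pairs for text, plus the number of changed words."""
    marked = []
    count = 0
    # Regex tokenization is an assumption here; the article's script uses NLTK.
    for m in re.finditer(r"(?P<word>[A-Za-z]+)|[^A-Za-z\s]", text):
        token = m.group()
        if m.group('word'):
            fixed = corrector(token)
            changed = fixed != token
            count += changed
            marked.append((fixed, changed))
        else:
            marked.append((token, False))  # punctuation is copied through untouched
    return marked, count


# Hypothetical corrector used only for this demonstration.
toy = {'forteen': 'fourteen', 'helloa': 'hello'}
marked, count = mark_changes('forteen helloa, world!', lambda w: toy.get(w, w))
print(marked)
print(count)  # 2
```

In the real script, every pair flagged True corresponds to a run whose font color is set, and the counter plays the role of COUNT_CORRECT.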

Conclusion

Word correction is not as difficult as one might expect, but it is not trivial either. The code and files used in this article are at github.com/percent4/-w… .