Use Python to correct English words

Word error correction

When working in Word or other text editing software, we often come across the spelling correction feature. For example, in Word:

Misspelled words

Word correction algorithm

First, we need a corpus, as almost every NLP task does. The corpus used for word correction is big.txt, which contains the following:

Gutenberg corpus data;
Wiktionary;
A list of the most commonly used words in the British National Corpus.

Download it from github.com/percent4/-w… .
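To make the role of the corpus concrete, here is a minimal sketch of how word frequencies can be derived from raw text. It is an illustration only: sample_text is a tiny stand-in for big.txt, and the helper name probability is an assumption that does not appear in the article's code.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and extract runs of the letters a-z."""
    return re.findall('[a-z]+', text.lower())

# Tiny stand-in corpus (an assumption for illustration; the article uses big.txt).
sample_text = "The cat sat on the mat. The cat saw the rat."

counts = Counter(tokens(sample_text))
total = sum(counts.values())

def probability(word):
    """Relative frequency of a word in the sample corpus (hypothetical helper)."""
    return counts[word] / total

print(counts.most_common(2))  # [('the', 4), ('cat', 2)]
print(probability('dog'))     # 0.0, because 'dog' never occurs
```

The more often a word occurs in the corpus, the more likely it is to be chosen when several candidate corrections are available.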

Python implementation

The full Python code for word correction (spelling_correcter.py) is as follows:

# -*- coding: utf-8 -*-
import re
import collections

def tokens(text):
    """ Get all words from the corpus """
    return re.findall('[a-z]+', text.lower())

with open('E://big.txt', 'r') as f:
    WORDS = tokens(f.read())
WORD_COUNTS = collections.Counter(WORDS)

def known(words):
    """ Return the subset of words that are actually in our WORD_COUNTS dictionary. """
    return {w for w in words if w in WORD_COUNTS}


def edits0(word):
    """ Return all strings that are zero edits away from the input word (i.e., the word itself). """
    return {word}


def edits1(word):
    """ Return all strings that are one edit away from the input word. """
    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def splits(word):
        """ Return a list of all possible (first, rest) pairs that the input word is made of. """
        return [(word[:i], word[i:]) for i in range(len(word) + 1)]

    pairs = splits(word)
    deletes = [a + b[1:] for (a, b) in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for (a, b) in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for (a, b) in pairs for c in alphabet if b]
    inserts = [a + c + b for (a, b) in pairs for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """ Return all strings that are two edits away from the input word. """
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def correct(word):
    """ Get the best correct spelling for the input word """
    # Priority is for edit distance 0, then 1, then 2
    # else defaults to the input word itself.
    candidates = (known(edits0(word)) or
                  known(edits1(word)) or
                  known(edits2(word)) or
                  [word])
    return max(candidates, key=WORD_COUNTS.get)


def correct_match(match):
    """ Spell-correct word in match, and preserve proper upper/lower/title case. """

    word = match.group()

    def case_of(text):
        """ Return the case-function appropriate for text: upper, lower, title, or just str. """
        return (str.upper if text.isupper() else
                str.lower if text.islower() else
                str.title if text.istitle() else
                str)

    return case_of(word)(correct(word.lower()))


def correct_text_generic(text):
    """ Correct all the words within a text, returning the corrected text. """
    return re.sub('[a-zA-Z]+', correct_match, text)
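To get a feel for how many candidates a single edit step produces, the sketch below repeats the four edit types from edits1 and counts each of them for a short word. The function name edit_breakdown is hypothetical and is not part of spelling_correcter.py; the per-type formulas in the comments follow directly from the list comprehensions, where n is the length of the word.

```python
import string

def edit_breakdown(word):
    """Return per-type candidate counts (duplicates included) and the unique set."""
    letters = string.ascii_lowercase
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in pairs if b]                            # n
    transposes = [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1]  # n - 1
    replaces = [a + c + b[1:] for a, b in pairs for c in letters if b]      # 26 * n
    inserts = [a + c + b for a, b in pairs for c in letters]                # 26 * (n + 1)
    counts = {
        'deletes': len(deletes),
        'transposes': len(transposes),
        'replaces': len(replaces),
        'inserts': len(inserts),
    }
    return counts, set(deletes + transposes + replaces + inserts)

counts, candidates = edit_breakdown('cat')
print(counts)                # {'deletes': 3, 'transposes': 2, 'replaces': 78, 'inserts': 104}
print(sum(counts.values()))  # 187, i.e. 54 * n + 25 for n = 3
print(len(candidates) < sum(counts.values()))  # True: duplicates collapse in the set
```

Because edits2 applies the same step to every one of these candidates, the search space grows quickly, which is why the result is filtered through known before a correction is picked.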

Test

With the word correction program above in place, let's test it on some words and sentences:

original_word_list = ['fianlly', 'castel', 'case', 'monutaiyn', 'foresta',
                      'helloa', 'forteen', 'persreve', 'kisss', 'forteen helloa',
                      'phons forteen Doora. This is from Chinab.']

for original_word in original_word_list:
    correct_word = correct_text_generic(original_word)
    print('Orginial word: %s\nCorrect word: %s'%(original_word, correct_word))

The following output is displayed:

Orginial word: fianlly
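The selection rule in correct (smallest edit distance first, then highest corpus count) can also be checked on a small scale. The sketch below is an illustration only: TOY_COUNTS and toy_correct are hypothetical names with made-up counts, not part of spelling_correcter.py, and edits1 is repeated so that the block runs on its own.

```python
import string
from collections import Counter

# Hypothetical toy vocabulary with made-up counts (not taken from big.txt).
TOY_COUNTS = Counter({'case': 10, 'cast': 7, 'castle': 2})

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set(
        [a + b[1:] for a, b in pairs if b] +
        [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1] +
        [a + c + b[1:] for a, b in pairs for c in letters if b] +
        [a + c + b for a, b in pairs for c in letters]
    )

def known(words):
    """Keep only the words that occur in the toy vocabulary."""
    return {w for w in words if w in TOY_COUNTS}

def toy_correct(word):
    """Prefer distance 0, then 1, then 2; break ties by count."""
    one = edits1(word)
    two = {e2 for e1 in one for e2 in edits1(e1)}
    # An empty set is falsy, so `or` falls through to the next distance.
    candidates = known({word}) or known(one) or known(two) or [word]
    return max(candidates, key=TOY_COUNTS.get)

print(toy_correct('case'))    # case: already a known word
print(toy_correct('castel'))  # castle: one transposition beats the more frequent 'case'
print(toy_correct('cas'))     # case: 'case' and 'cast' are both one edit away, 'case' is more frequent
print(toy_correct('zzzzzz'))  # zzzzzz: nothing known within two edits
```

The second call shows that edit distance takes priority over frequency: 'case' and 'cast' have higher counts, but both are two edits away from 'castel', so they are never considered.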

Next, we test the following Word document (Spelling Error.docx), available at github.com/percent4/-w…

Word document with spelling errors

The Python code for word-correcting this document is as follows:

from docx import Document
from nltk import sent_tokenize, word_tokenize
from spelling_correcter import correct_text_generic
from docx.shared import RGBColor

# Number of words corrected in the document
COUNT_CORRECT = 0
# Open the source document
file = Document("E://Spelling Error.docx")

#print("Paragraph Number :"+str(len(file.paragraphs)))

# Punctuation tokens that are written back unchanged
punkt_list = r",.?\"'!()/\\-<>:@#$%^&*~"

# Output document that receives the corrected text
document = Document()

def write_correct_paragraph(i):
    global COUNT_CORRECT

    # Text of the i-th paragraph
    paragraph = file.paragraphs[i].text.strip()
    # Split the paragraph into sentences
    sentences = sent_tokenize(text=paragraph)
    # Split each sentence into words
    words_list = [word_tokenize(sentence) for sentence in sentences]

    p = document.add_paragraph(' '*7)  # paragraph handle, starting with an indent

    for word_list in words_list:
        for word in word_list:
            # First word of the paragraph's first sentence: capitalize its first letter
            if word_list.index(word) == 0 and words_list.index(word_list) == 0:
                if word not in punkt_list:
                    p.add_run(' ')
                    # Correct the word; a correctly spelled word is returned unchanged
                    correct_word = correct_text_generic(word)
                    # If the word was modified, color it (0x00, 0x00, 0xFF here)
                    if correct_word != word:
                        colored_word = p.add_run(correct_word[0].upper()+correct_word[1:])
                        font = colored_word.font
                        font.color.rgb = RGBColor(0x00, 0x00, 0xFF)
                        COUNT_CORRECT += 1
                    else:
                        p.add_run(correct_word[0].upper() + correct_word[1:])
                else:
                    p.add_run(word)
            else:
                p.add_run(' ')
                # Correct the word; a correctly spelled word is returned unchanged
                correct_word = correct_text_generic(word)
                if word not in punkt_list:
                    # If the word was modified, color it red
                    if correct_word != word:
                        colored_word = p.add_run(correct_word)
                        font = colored_word.font
                        font.color.rgb = RGBColor(0xFF, 0x00, 0x00)
                        COUNT_CORRECT += 1
                    else:
                        p.add_run(correct_word)
                else:
                    p.add_run(word)

for i in range(len(file.paragraphs)):
    write_correct_paragraph(i)

document.save('E://correct_document.docx')

print('Modify and save the file! ')
print('changed %d altogether. '%COUNT_CORRECT)

The output is as follows:

Modify and save the file!

The modified Word document is as follows:

Word document after spelling correction

In the corrected document, the words in red are the ones that were misspelled in the original, shown here after correction; 19 changes were made in total.
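One detail of the document script is worth a closer look: it decides whether a word is the first word by calling list.index, which returns the position of the first equal element, so a later repetition of the opening word is also treated as the first word. The sketch below is a possible alternative, not the author's method; the generator name first_word_flags and the sample sentences are hypothetical.

```python
def first_word_flags(words_list):
    """Yield (word, is_first), tracking positions with enumerate instead of list.index."""
    for s_idx, word_list in enumerate(words_list):
        for w_idx, word in enumerate(word_list):
            yield word, (s_idx == 0 and w_idx == 0)

# Hypothetical tokenized paragraph in which the opening word is repeated.
sentences = [['he', 'said', 'he', 'agreed', '.'], ['he', 'left', '.']]

# Lookup by value, as in the script: the second 'he' also maps to index 0.
by_index = [w for wl in sentences for w in wl
            if wl.index(w) == 0 and sentences.index(wl) == 0]

# Lookup by position: only the true first word is flagged.
by_position = [w for w, first in first_word_flags(sentences) if first]

print(by_index)     # ['he', 'he']
print(by_position)  # ['he']
```

Tracking the position also avoids a linear search for every word, which matters little for short paragraphs but keeps the loop simple.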

Conclusion

Word correction is not as hard as one might expect, but it is not trivial either. See github.com/percent4/-w… .
