Introduction
Continuing from the previous bi-gram article, this post implements a character-level tri-gram model in Python.
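Before diving in, a quick note on the math as I read it from the code below: a tri-gram model conditions each character on the two characters preceding it, and with the add-one (Laplace) smoothing used here the estimate is

P(c3 | c1 c2) = (count(c1 c2 c3) + 1) / (count(c1 c2) + N)

where N is the vocabulary size (single characters plus two-character contexts, as the code counts them). A candidate string is then scored by summing math.log(p + 1) over all of its tri-grams, so a higher score means a more plausible string.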
Data
The data is stored in the token.txt file, with one tokenized word per line:
```
Hangzhou County, Zhejiang Province
```
Of course, each line can also be a sentence.
Implementation
```python
import datetime
import math

START = 'S'
END = 'E'

class Corrector():
    def __init__(self):
        # frequencies and the word set
        self.wordcount, self.words = self.init()
        # mapping dictionaries between tokens and matrix indices
        self.c2n, self.n2c = self.createDict(self.wordcount)
        self.N = len(self.wordcount)
        self.matrix = self.calMatrix()

    # Read the tokenized words, then collect frequencies and the word set
    def init(self):
        start = datetime.datetime.now()
        result = []
        with open('token.txt', 'r') as f:
            for line in f.readlines():
                result.append(line.strip())
        wordcount = {}
        words = []
        for word in result:
            if word and len(word) > 2:
                word = START + word + END
                words.append(word)
                # count single characters
                for c in word:
                    if c not in wordcount:
                        wordcount[c] = 1
                    else:
                        wordcount[c] += 1
                # count two-character contexts
                for i, c in enumerate(word):
                    if i < 2:
                        continue
                    if word[i - 2:i] not in wordcount:
                        wordcount[word[i - 2:i]] = 1
                    else:
                        wordcount[word[i - 2:i]] += 1
        print("Time taken to get token_freq and words : " + str(datetime.datetime.now() - start))
        return wordcount, words

    # Build the token <-> index mapping dictionaries
    def createDict(self, CHAR_FREQ):
        start = datetime.datetime.now()
        c2n = {}
        n2c = {}
        count = 0
        for c in CHAR_FREQ:
            c2n[c] = count
            n2c[count] = c
            count += 1
        print("Time taken to build the mapping dictionaries : " + str(datetime.datetime.now() - start))
        return c2n, n2c

    # Count transitions and calculate the tri-gram probability matrix
    def calMatrix(self):
        start = datetime.datetime.now()
        matrix = [[0] * len(self.wordcount) for i in range(len(self.wordcount))]
        # count how often each two-character context is followed by each character
        for key in self.words:
            for i, c in enumerate(key):
                if i < 2:
                    continue
                matrix[self.c2n[key[i - 2:i]]][self.c2n[c]] += 1
        # turn counts into probabilities with add-one (Laplace) smoothing
        for i, line in enumerate(matrix):
            for j, freq in enumerate(line):
                matrix[i][j] = round((matrix[i][j] + 1) / (self.wordcount[self.n2c[i]] + self.N), 10)
        print("Time taken to compute the probability matrix : " + str(datetime.datetime.now() - start))
        return matrix

    # Score candidate strings by summing log values over their tri-grams
    def predict(self, strings):
        result = {}
        for s in strings:
            r = 0
            s = START + s + END
            for i, c in enumerate(s):
                if i < 2:
                    continue
                if s[i - 2:i] in self.c2n and s[i] in self.c2n:
                    r += math.log(self.matrix[self.c2n[s[i - 2:i]]][self.c2n[s[i]]] + 1)
                else:
                    # an unseen context or character zeroes out the score
                    r = 0
                    break
            t = s.lstrip(START).rstrip(END)
            if t not in result:
                result[t] = r
        return result


c = Corrector()
print(c.predict(['浙江', '杭州', '这杭州']))
```
Output:
```
Time taken to get token_freq and words : 0:00:00.000100
Time taken to build the mapping dictionaries : 0:00:00.000004
Time taken to compute the probability matrix : 0:00:00.000226
{'浙江': 0.3520474483650825, '杭州': 0.1978967684980622, '这杭州': 0}
```
The third candidate scores 0 because it contains a context that never appears in the training data, which triggers the early exit in predict.
Experience
Because this article focuses on implementing the code, the data is very simple and the results are computed quickly. But when I ran the calculation on more data it became quite slow, and when I finally tried to save the matrix to disk it had reached 3 GB; only after this hands-on exercise did I get an intuitive feel for how explosively the matrix grows. The bi-gram matrix from the previous article took only about 30 MB. In real projects, pruning the model to keep it from growing too large is the most common approach, as in the SRILM toolkit.
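Another way to keep memory in check, before reaching for a toolkit, is simply not to materialize the dense V x V matrix at all and store only the contexts that actually occur. The sketch below is my own illustration under that assumption; the class name SparseTrigram and its methods are hypothetical, not part of the original code:

```python
from collections import defaultdict

# Hypothetical sketch: store only observed (context -> next char) counts
# instead of a dense V x V matrix; smoothed probabilities are computed
# lazily on lookup.
class SparseTrigram:
    def __init__(self):
        self.trans = defaultdict(dict)        # context -> {char: count}
        self.context_total = defaultdict(int) # context -> total transitions
        self.vocab = set()

    def add(self, word):
        word = 'S' + word + 'E'
        self.vocab.update(word)
        for i in range(2, len(word)):
            ctx, c = word[i - 2:i], word[i]
            self.trans[ctx][c] = self.trans[ctx].get(c, 0) + 1
            self.context_total[ctx] += 1

    def prob(self, ctx, c):
        # add-one smoothing, same formula as the dense version
        count = self.trans.get(ctx, {}).get(c, 0)
        return (count + 1) / (self.context_total.get(ctx, 0) + len(self.vocab))
```

Since most context/character pairs never occur in a real corpus, the sparse map stays far smaller than the dense matrix while yielding the same smoothed estimates.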
Also, without smoothing the count matrix would be very sparse, which is another point to pay attention to. There are many common smoothing methods; I just list them here, and you can look up the details online:
- Laplace (add-one) smoothing (the one used in this article)
- Interpolation and backoff (see the sketch after this list)
- Absolute Discounting
- Kneser-Ney Smoothing
- ...
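For a concrete taste of one of these, here is a minimal sketch of simple linear interpolation (the Jelinek-Ney/Jelinek-Mercer flavor), which mixes tri-gram, bi-gram, and uni-gram estimates. Everything in it is my illustration: the function name and the lambda weights are assumptions, and in practice the weights would be tuned on held-out data:

```python
# Minimal sketch of linear interpolation smoothing (illustrative, not from
# the original article). p3, p2, p1 are maximum-likelihood estimates at the
# tri-gram, bi-gram, and uni-gram level for the same next character.
def interpolated_prob(p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas  # weights sum to 1; tuned on held-out data
    return l3 * p3 + l2 * p2 + l1 * p1

# Usage: the mixed estimate stays nonzero even when the tri-gram count is 0
print(interpolated_prob(0.0, 0.05, 0.001))  # 0.0151
```

The design idea is that lower-order models act as fallbacks: when the tri-gram has never been seen, the bi-gram and uni-gram terms still contribute probability mass instead of zeroing out the whole string, as happened with '这杭州' above.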