This article is participating in Python Theme Month. See the link for details
This article implements the n-gram model in its simplest form, the bi-gram; see the reference at the end for a detailed explanation of the underlying principle.
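As a quick illustration of the bi-gram principle itself (a toy sketch, independent of the article's code; the corpus and the `bigram_probs` helper here are made up for demonstration), the model estimates P(next char | previous char) as count(ab) / count(a), with 's' and 'e' padding each word the same way the article's code does:

```python
from collections import Counter

def bigram_probs(corpus):
    """Maximum-likelihood bi-gram estimates from a list of words.

    Each word is padded with 's' (start) and 'e' (end) markers,
    mirroring the padding used in the article's code.
    """
    unigrams = Counter()
    bigrams = Counter()
    for word in corpus:
        word = 's' + word + 'e'
        unigrams.update(word)                 # count single characters
        bigrams.update(zip(word, word[1:]))   # count adjacent pairs
    # P(b | a) = count(ab) / count(a)
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

probs = bigram_probs(['abc', 'abd'])
print(probs[('a', 'b')])  # 'a' is always followed by 'b' -> 1.0
print(probs[('b', 'c')])  # 'b' is followed by 'c' half the time -> 0.5
```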
code
```python
import math

texts = []      # training corpus: a list of strings, assumed loaded elsewhere
words = []      # padded training words
wordcount = {}  # character frequency table

def init():
    for word in texts:
        if word and len(word) > 2:
            word = 's' + word + 'e'   # add start/end markers
            words.append(word)
            for c in word:
                if c not in wordcount:
                    wordcount[c] = 1
                else:
                    wordcount[c] += 1

def createDict(CHAR_FREQ):
    # Map each character to an index and back
    c2n = {}
    n2c = {}
    count = 0
    for c in CHAR_FREQ:
        c2n[c] = count
        n2c[count] = c
        count += 1
    return c2n, n2c

def calMatrix():
    matrix = [[0] * len(wordcount) for i in range(len(wordcount))]
    # Count bi-gram frequencies over the padded training words
    for key in words:
        for i, c in enumerate(key):
            if i == 0 or i == len(key) - 1:
                continue
            else:
                matrix[c2n[key[i - 1]]][c2n[c]] += 1
    # Calculate the bi-gram probability matrix
    for i, line in enumerate(matrix):
        for j, freq in enumerate(line):
            matrix[i][j] = round(matrix[i][j] / wordcount[n2c[i]], 7)
    return matrix

def predict(strings):
    result = []
    for s in strings:
        r = 0
        s = 's' + s + 'e'
        for i, c in enumerate(s):
            if i == 0:
                continue
            if c in c2n:
                # log(p + 1) keeps zero-probability bi-grams from breaking the sum
                r += math.log(matrix[c2n[s[i - 1]]][c2n[s[i]]] + 1)
            else:
                r = 0
                break
        result.append(r)
    return result

init()
c2n, n2c = createDict(wordcount)
matrix = calMatrix()
# The original candidate strings (Chinese place names) did not survive
# translation of this article; ' ZDQ ' is what remains of them.
print(predict([' ZDQ ', ' ZDQ ', ' ZDQ ']))
```
Results print:
[0.28768204745178066, 0.9808292280117259, 2.3671235891316167, 0.9808292280117259]
The correct answer is the string "political yuan district". Even without knowing the answer in advance, given the four candidates "political district district", "political dynasty district", "political yuan district", and "political home district", the results show that "political yuan district" receives the highest score, so it is the first choice. Since this matches the answer known in advance, when using bi-gram we can simply take the string with the largest score as the most probable result.
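The selection step itself is just an argmax over the scores. A minimal sketch with hypothetical candidates and scores (the article's actual candidate strings are Chinese place names that did not survive translation):

```python
# Hypothetical candidates and scores, e.g. what predict(candidates) returned
candidates = ['AB', 'AC', 'AD']
scores = [0.29, 0.98, 0.12]

# Tuples compare element-wise, so max picks the highest score first
best = max(zip(scores, candidates))[1]
print(best)  # -> 'AC', the highest-scoring candidate
```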
expand
In fact, on the basis of bi-gram, the model can be extended to tri-gram, which is also commonly used. Going further to four-gram and beyond gives stronger discriminative power over the next word or character and carries more constraint information, but the resulting matrix also becomes much sparser: the total number of possible n-grams is V^n, where V is the dictionary (vocabulary) size.
To adapt the code above, you only need to modify the calMatrix method.
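A sketch of what that modification could look like (an assumption, not the article's code): instead of a dense V x V matrix, a tri-gram variant can key counts by the two preceding characters, estimating P(c3 | c1, c2) = count(c1 c2 c3) / count(c1 c2). The function name `calTriMatrix` and the use of a dict are choices made here for illustration; `words` is assumed to be the padded training list built by init().

```python
from collections import defaultdict

def calTriMatrix(words):
    pair_count = defaultdict(int)  # count of each (c1, c2) context
    tri_count = defaultdict(int)   # count of each (c1, c2, c3) tri-gram
    for key in words:
        for i in range(2, len(key)):
            pair_count[(key[i - 2], key[i - 1])] += 1
            tri_count[(key[i - 2], key[i - 1], key[i])] += 1
    # P(c3 | c1, c2) = count(c1 c2 c3) / count(c1 c2)
    return {tri: n / pair_count[tri[:2]] for tri, n in tri_count.items()}

probs = calTriMatrix(['sabce', 'sabde'])
print(probs[('s', 'a', 'b')])  # ('s','a') is always followed by 'b' -> 1.0
```

Note that with a single start marker the first real character has only one context character before it; a fuller implementation might pad with two start markers ('ss' + word) so every position has a complete two-character context.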
reference
- blog.csdn.net/songbinxu/a…