After BosonNLP fully opened its word segmentation and part-of-speech tagging engine in early September, many users, especially those working in data processing and natural language research, tried it out and wondered how BosonNLP achieves such high accuracy. Hopefully this article will help you understand the implementation principles behind BosonNLP's word segmentation.
As we all know, Chinese text is not separated by spaces the way English text is, so word segmentation and part-of-speech tagging are usually the first step of Chinese natural language processing. A good word segmentation system is an important foundation for effective Chinese data analysis and product development.
The structured prediction model used by BosonNLP is a variation of the traditional linear-chain CRF. In the past few years, research on word segmentation has been dominated by models that encode the input in character units and jointly predict segmentation and POS tags. Although such models are easier to implement, they have difficulty capturing relationships between higher-order output variables. For example, the tri-gram features used in traditional word-level part-of-speech tagging yield highly accurate results, but it is difficult for a first-order or even higher-order character-based CRF to establish such correlations. So BosonNLP adds word-level information on top of the character encoding, making this kind of higher-order effect equally capturable; a minimal sketch of the idea is given below.
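BosonNLP's actual feature templates are not public, so the following is only an illustrative sketch: a character-based tagger's feature function, assuming a BMES-style tag scheme, with a hypothetical lexicon feature that injects word-level information on top of the character features.

```python
# Illustrative sketch only; BosonNLP's real feature templates are not public.
LEXICON = {"回答", "问题", "得出", "结论"}  # hypothetical word list

def char_features(sent, i):
    """Feature strings for the character at position i of sent."""
    feats = [
        "c0=" + sent[i],                                        # current char
        "c-1=" + (sent[i - 1] if i > 0 else "<S>"),             # previous char
        "c+1=" + (sent[i + 1] if i + 1 < len(sent) else "</S>"),  # next char
        "c-1c0=" + (sent[i - 1] if i > 0 else "<S>") + sent[i],   # char bigram
    ]
    # Word-level feature: does a known lexicon word end at this character?
    for length in (2, 3, 4):
        if i + 1 >= length and sent[i + 1 - length:i + 1] in LEXICON:
            feats.append("word_end=%d" % length)
    return feats
```

For instance, `char_features("回答这个问题", 1)` would fire `word_end=2` because the lexicon word "回答" ends at position 1.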
In word segmentation and part-of-speech tagging, new word recognition and combinatorial segmentation ambiguity are the two core challenges. BosonNLP has done a lot of optimization here, including the handling of special characters and the capture of regular word-formation patterns. For example, semi-supervised approaches have become popular in recent years, in which statistics gathered over large-scale unlabeled data are used to improve supervised learning on labeled data; we also apply this in our word segmentation implementation. By using accessor variety as a feature, for instance, new words in different domains can be found effectively and generalization ability improved. A sketch of the accessor-variety statistic follows.
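Accessor variety counts how many distinct characters can appear immediately to the left and to the right of a candidate string in a large corpus; strings that can combine freely with many neighbors on both sides are good word candidates. A minimal sketch, assuming the unlabeled corpus is already loaded as one string (not BosonNLP's production pipeline):

```python
def accessor_variety(corpus, candidate):
    """Accessor variety of `candidate`: the smaller of the number of distinct
    characters seen immediately to its left and to its right in the corpus.
    A high value suggests the string can stand alone as a word. Sketch only;
    a production version would precompute this for all candidates at once."""
    left, right = set(), set()
    start = corpus.find(candidate)
    while start != -1:
        end = start + len(candidate)
        left.add(corpus[start - 1] if start > 0 else "<S>")
        right.add(corpus[end] if end < len(corpus) else "</S>")
        start = corpus.find(candidate, start + 1)
    return min(len(left), len(right))
```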
We all know that context information is an important means of resolving combinatorial segmentation ambiguity. For an algorithm aimed at real commercial environments, besides accuracy, attention must also be paid to the time complexity of the model, which has to be efficient enough. For example, compared with an ordinary linear-chain CRF, a skip-chain CRF can achieve better accuracy because it incorporates more context information; however, its training and decoding procedures, whether exact or approximate, cannot meet our speed requirements, so it was not used in our final implementation. One interesting segmentation improvement is that we capture the fixed collocation pairs that are common in Chinese, for example "arrive at ... conclusion" or "answer ... question". If "arrive at" appears before "conclusion", then segmenting "arrive at" as a single word is highly likely, and a conflicting segmentation scheme becomes very unlikely. Modeling such fixed collocations can fix a class of segmentation errors.
How do we determine whether two words form a fixed collocation? We compute the normalized pointwise mutual information (NPMI) between the two words to measure their collocation strength. Pointwise mutual information (PMI), often used in natural language processing, measures how strongly two events are associated. NPMI is a normalized form of PMI that rescales the value into the range [-1, 1]. Two words are said to co-occur if they appear within a certain distance of each other. Word pairs with high NPMI are selected as fixed collocations, and this set of collocations is then added to the segmentation model as combination features. For example, "answer" and "question" are a fixed collocation: whenever "answer" is tagged, the model checks whether "question" appears within a certain distance afterwards, and if so, the feature fires.
Normalized pointwise mutual information (NPMI):

$$\mathrm{NPMI}(x, y) = \frac{\mathrm{PMI}(x, y)}{-\log p(x, y)}$$

where the pointwise mutual information (PMI) is:

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$
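As a concrete illustration, here is a minimal sketch of computing NPMI from co-occurrence counts over tokenized sentences; the window size and counting scheme are illustrative assumptions, not BosonNLP's exact procedure:

```python
import math
from collections import Counter

def npmi_scores(token_sentences, window=5):
    """Compute NPMI for every ordered word pair co-occurring within `window`
    tokens. The counting scheme is an illustrative assumption."""
    word_counts, pair_counts = Counter(), Counter()
    for sent in token_sentences:
        word_counts.update(sent)
        for i, w in enumerate(sent):
            for v in sent[i + 1:i + 1 + window]:
                pair_counts[(w, v)] += 1
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (w, v), c in pair_counts.items():
        p_xy = c / n_pairs
        p_x, p_y = word_counts[w] / n_words, word_counts[v] / n_words
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(w, v)] = pmi / -math.log(p_xy)  # normalize into [-1, 1]
    return scores
```

Pairs whose score exceeds a chosen threshold would then be kept as fixed collocations.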
As can be seen, if we extracted fixed collocations without limiting the distance, the chance that a word merely happens to occur later would rise and the statistic would become less stable. In our implementation we therefore require that the two words of a fixed collocation appear within a constant distance of each other in the original text. Concretely, an inverted index can be used to look up the positions of a word and then check whether any position falls within the acceptable range. A big problem with this naive implementation is that, on a specially constructed text, it may have to traverse the whole position array to decide whether two words form a fixed collocation, taking O(n) time per query; with binary search this can be reduced to O(log n), as in the sketch below.
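A minimal sketch of the inverted-index check with binary search (the function name and interface are illustrative):

```python
import bisect

def occurs_within(positions, start, max_dist):
    """Given the sorted position list of a word (one inverted-index entry),
    check whether it occurs in the window (start, start + max_dist].
    Binary search makes each query O(log n) instead of O(n)."""
    i = bisect.bisect_right(positions, start)
    return i < len(positions) and positions[i] <= start + max_dist
```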
In fact, there is a much more efficient algorithmic implementation of this lookup problem. We adopt a sliding-window approach: while enumerating words, we maintain a word table holding the candidate words within a fixed distance before and after the current position; as the enumerated position moves backwards through the sentence, the window moves with it. This way, when the traversal reaches "answer", we can look up the table to check whether "question" appears behind it, and likewise, when we reach "question", look up the table to check whether "answer" appears in front of it. When the next word is enumerated, the word table is adjusted accordingly. Using a hash table for the word-table lookups, the time to evaluate one fixed-collocation feature drops to O(1). A sketch follows.
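A minimal sketch of the sliding-window table, assuming the sentence is already tokenized and `COLLOCATIONS` is the NPMI-selected pair set (both hypothetical here, not BosonNLP's internal data structures):

```python
from collections import Counter

# Hypothetical set of NPMI-selected collocations: (first word, second word).
COLLOCATIONS = {("answer", "question"), ("arrive at", "conclusion")}

def collocation_features(tokens, max_dist=5):
    """For each position, fire a feature if its collocation partner occurs
    within max_dist tokens. `ahead` and `behind` are multisets (hash tables)
    of the words currently inside the window, so each check is O(1)."""
    ahead = Counter(tokens[1:1 + max_dist])   # words within max_dist after pos 0
    behind = Counter()                        # words within max_dist before pos
    feats = [[] for _ in tokens]
    for i, w in enumerate(tokens):
        for a, b in COLLOCATIONS:
            if w == a and ahead[b] > 0:
                feats[i].append("colloc_fwd=%s+%s" % (a, b))
            if w == b and behind[a] > 0:
                feats[i].append("colloc_bwd=%s+%s" % (a, b))
        # Slide the window: the current word enters `behind`, the word that
        # falls out of range leaves it; `ahead` is updated symmetrically.
        behind[w] += 1
        if i - max_dist >= 0:
            behind[tokens[i - max_dist]] -= 1
        if i + 1 + max_dist < len(tokens):
            ahead[tokens[i + 1 + max_dist]] += 1
        if i + 1 < len(tokens):
            ahead[tokens[i + 1]] -= 1
    return feats
```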
By introducing the contextual information above, the accuracy of word segmentation and part-of-speech tagging improved by nearly 1%, while the time complexity of the algorithm stayed the same. We keep iterating to make the engine more accurate, more versatile, and easier to use. In the future we will share more of our experience in natural language processing on the BosonNLP WeChat account. You are welcome to follow us!