Selected from GitHub. Authors: Ruixuan Luo, Jingjing Xu, Xu Sun. Compiled by Machine Heart editors.
Recently, Peking University released an open-source Chinese word segmentation toolkit that achieves very high accuracy on multiple segmentation datasets. On two sample datasets, the widely used jieba segmenter has error rates as high as 18.55% and 20.42%, while Peking University's pkuseg reaches only 3.25% and 4.32%.
Pkuseg is a new Chinese word segmentation toolkit developed by the Language Computing and Machine Learning Research Group of Peking University. It is easy to use, supports multi-domain segmentation, and greatly improves segmentation accuracy on data from different domains.
- Project address: github.com/lancopku/PK…
Pkuseg has the following characteristics:
- Higher segmentation accuracy: compared with other segmentation toolkits, pkuseg greatly improves accuracy on data from different domains. According to the Peking University group's tests, pkuseg reduces the segmentation error rate by 79.33% and 63.67% on the sample datasets (MSRA and CTB8), respectively.
- Multi-domain segmentation: the research team trained segmentation models for several domains, and users can freely choose the model that matches the domain of the text to be segmented.
- Support for user-trained models: users can train their own models on new annotated data.
In addition, the team selected the THULAC and jieba toolkits for comparison with pkuseg. They used Linux as the test environment and measured the accuracy of the different toolkits on news data (MSRA) and mixed text (CTB8). The tests used the word segmentation evaluation script provided by the Second International Chinese Word Segmentation Bakeoff. The results are as follows:
We can see that jieba, the most widely used toolkit, has the lowest accuracy, and Tsinghua's THULAC is also less accurate than pkuseg. Of course, pkuseg was trained on these datasets, so it naturally has an advantage on these particular tasks.
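The comparison is based on word-level accuracy: a segmenter's output is aligned with a gold-standard segmentation and scored by the evaluation script. As a rough illustration of the idea (a simplified sketch only, not the official Bakeoff script, with a hypothetical toy prediction), word-level precision, recall, and F1 can be computed like this:
# a simplified sketch of word-level scoring: map each token to a character span
# and count the spans shared by the gold and predicted segmentations
def spans(tokens):
    result, pos = set(), 0
    for tok in tokens:
        result.add((pos, pos + len(tok)))
        pos += len(tok)
    return result

def evaluate(gold_tokens, pred_tokens):
    gold, pred = spans(gold_tokens), spans(pred_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy example: gold segmentation vs. a hypothetical wrong prediction
print(evaluate(['我', '爱', '北京', '天安门'], ['我', '爱', '北', '京天安门']))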
Pre-trained models
In segmentation mode, the user needs to load a pre-trained model. The research team provides three models trained on different types of data, and users can choose a pre-trained model according to their specific needs. The pre-trained models are described below:
- MSRA: a model trained on MSRA (a news corpus). The new version of the code uses this model.
- CTB8: a model trained on CTB8 (a mixed corpus of news text and web text).
- WEIBO: a model trained on WEIBO (a web text corpus).
Among them, the MSRA data comes from the Second International Chinese Word Segmentation Bakeoff, the CTB8 data is provided by LDC, and the WEIBO data is provided by the NLPCC word segmentation competition. All three pre-trained models can be downloaded from the GitHub project.
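As a rough sketch of how switching between these models might look (model loading is shown officially in Code Example 3 below; the folder names './msra' and './weibo' here are assumptions and depend on where the downloaded archives are unpacked):
import pkuseg
# hypothetical local paths for the downloaded pre-trained models
seg_news = pkuseg.pkuseg(model_name='./msra')   # for news-style text
seg_web = pkuseg.pkuseg(model_name='./weibo')   # for web text
print(seg_news.cut('我爱北京天安门'))
print(seg_web.cut('我爱北京天安门'))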
Installation and use
Pkuseg can be installed with pip or downloaded directly from GitHub:
pip install pkuseg
It’s also easy to segment text with pkuseg, and the usage is basically the same as with other segmentation tools:
'''Code Example 1: segmentation with the default model and default dictionary'''
import pkuseg
# load the model with the default configuration
seg = pkuseg.pkuseg()
# segment the text
text = seg.cut('我爱北京天安门')  # "I love Beijing Tiananmen"
print(text)
'''Code Example 2: segmentation with a user-defined dictionary'''
import pkuseg
# words in the user dictionary are kept intact during segmentation
lexicon = ['北京大学', '北京天安门']  # "Peking University", "Beijing Tiananmen"
# load the model with the given user dictionary
seg = pkuseg.pkuseg(user_dict=lexicon)
text = seg.cut('我爱北京天安门')
print(text)
'''Code Example 3: segmentation with a different pre-trained model'''
import pkuseg
# assuming the user has downloaded the CTB8 model and placed it in the './ctb8' directory,
# load the model by setting model_name
seg = pkuseg.pkuseg(model_name='./ctb8')
text = seg.cut('我爱北京天安门')
print(text)
For large text files, pkuseg can also segment in multi-process mode for faster throughput:
'''Code Example 4: segment a file with multiple processes'''
import pkuseg
# segment input.txt and write the result to output.txt,
# using the default model and dictionary with 20 processes
pkuseg.test('input.txt', 'output.txt', nthread=20)
Finally, pkuseg can also be used to train a new segmentation model:
'''Code Example 5: train a new model'''
import pkuseg
# train on 'msr_training.utf8', evaluate on 'msr_test_gold.utf8',
# save the trained model to './models', and train with 20 processes
pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
These examples all come from the GitHub project; please refer to it for further details such as parameter descriptions and the accompanying papers.
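As one final sketch that is not among the project's own examples: since train saves the learned model to the directory given as its third argument ('./models' in Code Example 5), that directory can presumably be loaded back through model_name, just like the downloaded pre-trained models:
import pkuseg
# load the model trained in Code Example 5 from its output directory
seg = pkuseg.pkuseg(model_name='./models')
print(seg.cut('我爱北京天安门'))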