1 Introduction
pkuseg-python is a new Chinese word segmentation toolkit developed by the Language Computing and Machine Learning Research Group of Peking University. It is easy to use and supports multi-domain word segmentation, which greatly improves segmentation accuracy across different domains of data. pkuseg has the following characteristics:
- Higher segmentation accuracy. Compared to other word segmentation toolkits, pkuseg significantly improves segmentation accuracy on data from different domains. According to the authors' test results, pkuseg reduces the segmentation error rate by 79.33% and 63.67% on the sample datasets MSRA and CTB8, respectively.
- Multi-domain segmentation. The authors trained segmentation models on many different domains, so users can freely choose the model that matches the domain of the text to be segmented.
- Support for user-trained models. Users can train their own models with new annotated data.
2 Compilation and Installation
- Install via pip (includes the model files): run `pip install pkuseg`, then use the toolkit via `import pkuseg`.
- Download from GitHub (does not include the model files; see the pre-trained models): place the pkuseg directory in your project and use it via `import pkuseg`; the model must be downloaded or trained by yourself.
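Either way, a quick smoke test can confirm the setup (a minimal sketch; it assumes the default model is available, i.e. a pip install or a manual model download):

```python
import pkuseg

# if the package and the default model are set up correctly,
# this prints a list of segmented words
seg = pkuseg.pkuseg()
print(seg.cut('我爱北京天安门'))
```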
3 Performance Comparison
The pkuseg authors compare it against THULAC from Tsinghua University and jieba, the current mainstream Chinese word segmentation tool, and report that pkuseg's accuracy is much higher than both.
The researchers chose Linux as the test environment, measured the accuracy of the different toolkits on news data (MSRA) and mixed text (CTB8), and used the word segmentation evaluation script from the Second International Chinese Word Segmentation Bakeoff. The results are as follows:
MSRA | F-score | Error Rate
---|---|---
jieba | 81.45 | 18.55
THULAC | 85.48 | 14.52
pkuseg | 96.75 (+13.18%) | 3.25 (-77.62%)

CTB8 | F-score | Error Rate
---|---|---
jieba | 79.58 | 20.42
THULAC | 87.77 | 12.23
pkuseg | 95.64 (+8.97%) | 4.36 (-64.35%)
As the tables show, pkuseg significantly outperforms the other two toolkits in both F-score and error rate.
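For reference, here is a minimal sketch of how a word-level F-score of this kind can be computed from gold and predicted segmentations (an illustration only; the actual bakeoff score script handles more details):

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def f_score(gold_words, pred_words):
    """Word-level F1: a predicted word is correct if its span matches gold."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# a perfect segmentation scores 1.0; merging '北京' and '天安门' lowers it
print(f_score(['我', '爱', '北京', '天安门'], ['我', '爱', '北京天安门']))
```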
4 Usage Tutorial
Code example 1: segmenting with the default model and the default dictionary
```python
import pkuseg

seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # perform segmentation
print(text)
```

Output:

```
loading model
finish
['我', '爱', '北京', '天安门']
```
Code example 2: setting a user-defined dictionary
```python
import pkuseg

lexicon = ['北京大学', '北京天安门']      # words in the user dictionary are kept intact during segmentation
seg = pkuseg.pkuseg(user_dict=lexicon)  # load the model with the given user dictionary
text = seg.cut('我爱北京天安门')          # perform segmentation
print(text)
```

Output:

```
loading model
finish
['我', '爱', '北京天安门']
```
Code example 3: specifying the model
By default, pkuseg uses the pre-trained MSRA model. To load a different one:
```python
import pkuseg

# assuming the user has downloaded the ctb8 model and placed it in the
# './ctb8' directory, load it by setting model_name
seg = pkuseg.pkuseg(model_name='./ctb8')
text = seg.cut('我爱北京天安门')  # perform segmentation
print(text)
```

Output:

```
loading model
finish
['我', '爱', '北京', '天安门']
```
Code example 4: multithreading support
You can specify input and output files; pkuseg reads the text from the input file and writes the segmentation results to the output file:
```python
import pkuseg

# segment the file data/input.txt and write the results to data/output.txt,
# using the default model and dictionary, with 20 processes
pkuseg.test('data/input.txt', 'data/output.txt', nthread=20)
```

Output:

```
loading model
finish
Total time: 128.30054664611816
```
With only one sentence in input.txt, the run above took about two minutes. Suspecting that the runtime does not grow linearly with the number of sentences, I added an article to input.txt; segmenting the resulting nearly 100 sentences also took about two minutes.
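A simple way to check this observation is to time the call on inputs of different sizes (a hypothetical sketch; the file paths are placeholders):

```python
import time
import pkuseg

# time pkuseg.test to see how much of the total runtime is fixed
# overhead (e.g. model loading) versus actual segmentation work
start = time.time()
pkuseg.test('data/input.txt', 'data/output.txt', nthread=20)
print(f'elapsed: {time.time() - start:.1f}s')
```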
Code example 5: training a model
Since I could not determine the format of the data in msr_training.utf8 and had no training set, I did not test this example myself.
```python
import pkuseg

# train on 'msr_training.utf8', test on 'msr_test_gold.utf8',
# save the trained model to './models', using 20 processes
pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
```
5 Parameter Description
```python
pkuseg.pkuseg(model_name='msra', user_dict='safe_lexicon')
```

- model_name: the model path. Defaults to 'msra', the pre-trained model (for pip users only). Users can instead pass the path of a model they downloaded or trained themselves, e.g. model_name='./models'.
- user_dict: the user dictionary. Defaults to 'safe_lexicon', the bundled Chinese dictionary (pip only). Users can instead pass an iterable containing custom words.
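For example, loading a self-trained model together with a custom dictionary (a hypothetical combination of the two parameters above; './models' is the save directory from code example 5):

```python
import pkuseg

lexicon = ['北京大学', '北京天安门']  # custom words to keep intact
# load a self-trained model from './models' with a user dictionary
seg = pkuseg.pkuseg(model_name='./models', user_dict=lexicon)
print(seg.cut('我爱北京天安门'))
```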
```python
pkuseg.test(readFile, outputFile, model_name='msra', user_dict='safe_lexicon', nthread=10)
```

- readFile: input file path
- outputFile: output file path
- model_name: same as in pkuseg.pkuseg
- user_dict: same as in pkuseg.pkuseg
- nthread: number of processes started during testing
```python
pkuseg.train(trainFile, testFile, savedir, nthread=10)
```

- trainFile: training file path
- testFile: test file path
- savedir: path for saving the trained model
- nthread: number of processes started during training
6 Related Papers
This toolkit is based on the following literature:
- Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL 2012: 253-262
- Jingjing Xu, Xu Sun. Dependency-based Gated Recursive Neural Network for Chinese Word Segmentation. ACL 2016: 567-572
7 Objective Remarks
- Is the performance comparison with the other word segmentation toolkits fair? Doubts about this have been raised in the project's issues; take a look if you are interested. I will not comment further here.
- Second, pkuseg does not seem to support part-of-speech tagging. One workaround could be combining it with jieba: let pkuseg segment the text into words, then have jieba perform part-of-speech tagging on them (unverified); see the sketch below.
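A minimal sketch of that unverified idea, tagging each pkuseg token individually with jieba's POS tagger (jieba may re-split unknown tokens, so the output should be checked):

```python
import pkuseg
import jieba.posseg as pseg

seg = pkuseg.pkuseg()
words = seg.cut('我爱北京天安门')  # pkuseg performs the segmentation

# run jieba's POS tagger on each pkuseg token; each `pair` has .word and .flag
tagged = [(pair.word, pair.flag) for w in words for pair in pseg.cut(w)]
print(tagged)
```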
Feel free to follow me on Jianshu (简书).