1 Introduction
pkuseg-python is a new Chinese word segmentation toolkit developed by the Language Computing and Machine Learning Research Group of Peking University. It is easy to use and supports multi-domain word segmentation, which greatly improves segmentation accuracy across different domains of data. pkuseg has the following characteristics:
- Higher segmentation accuracy. Compared to other word segmentation toolkits, pkuseg significantly improves segmentation accuracy on data from different domains. According to the authors' test results, pkuseg reduces the segmentation error rate by 79.33% and 63.67% on the sample datasets MSRA and CTB8, respectively.
- Multi-domain segmentation. The authors trained segmentation models on many different domains, so users can freely choose the model that matches the domain of the text to be segmented.
- Support for user-trained models. Users can train their own models with new annotated data.
2 Compilation and Installation
- Install via pip (includes the model files): run `pip install pkuseg`, then use the toolkit via `import pkuseg`.
- Download from GitHub (does not include the model files; see the pre-trained models): place the pkuseg directory in your project and use it via `import pkuseg`; the model must be downloaded or trained by yourself.
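Either way, a quick smoke test can confirm the setup (a minimal sketch; it assumes the default model is available, i.e. a pip install or a manual model download):

```python
import pkuseg

# if the package and the default model are set up correctly,
# this prints a list of segmented words
seg = pkuseg.pkuseg()
print(seg.cut('我爱北京天安门'))
```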
3 Performance Comparison
The pkuseg authors compare it against THULAC from Tsinghua University and jieba, the current mainstream Chinese word segmentation tool, and report that pkuseg's accuracy is much higher than both.
The researchers chose Linux as the test environment, measured the accuracy of the different toolkits on news data (MSRA) and mixed text (CTB8), and used the word segmentation evaluation script from the Second International Chinese Word Segmentation Bakeoff. The results are as follows:
MSRA | F-score | Error Rate
---|---|---
jieba | 81.45 | 18.55
THULAC | 85.48 | 14.52
pkuseg | 96.75 (+13.18%) | 3.25 (-77.62%)

CTB8 | F-score | Error Rate
---|---|---
jieba | 79.58 | 20.42
THULAC | 87.77 | 12.23
pkuseg | 95.64 (+8.97%) | 4.36 (-64.35%)
As the tables show, pkuseg significantly outperforms the other two toolkits in both F-score and error rate.
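For reference, here is a minimal sketch of how a word-level F-score of this kind can be computed from gold and predicted segmentations (an illustration only; the actual bakeoff score script handles more details):

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def f_score(gold_words, pred_words):
    """Word-level F1: a predicted word is correct if its span matches gold."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# a perfect segmentation scores 1.0; merging '北京' and '天安门' lowers it
print(f_score(['我', '爱', '北京', '天安门'], ['我', '爱', '北京天安门']))
```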
4 Usage Tutorial
Code example 1: segmenting with the default model and the default dictionary
```python
import pkuseg

seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # perform segmentation
print(text)
```

Output:

```
loading model
finish
['我', '爱', '北京', '天安门']
```
Code example 2: setting a user-defined dictionary
```python
import pkuseg

lexicon = ['北京大学', '北京天安门']      # words in the user dictionary are kept intact during segmentation
seg = pkuseg.pkuseg(user_dict=lexicon)  # load the model with the given user dictionary
text = seg.cut('我爱北京天安门')          # perform segmentation
print(text)
```

Output:

```
loading model
finish
['我', '爱', '北京天安门']
```
Code example 3: specifying the model
By default, pkuseg uses the pre-trained MSRA model. To load a different one:
```python
import pkuseg

# assuming the user has downloaded the ctb8 model and placed it in the
# './ctb8' directory, load it by setting model_name
seg = pkuseg.pkuseg(model_name='./ctb8')
text = seg.cut('我爱北京天安门')  # perform segmentation
print(text)
```

Output:

```
loading model
finish
['我', '爱', '北京', '天安门']
```
Code example 4: multithreading support
You can specify input and output files; pkuseg reads the text from the input file and writes the segmentation results to the output file:
```python
import pkuseg

# segment the file data/input.txt and write the results to data/output.txt,
# using the default model and dictionary, with 20 processes
pkuseg.test('data/input.txt', 'data/output.txt', nthread=20)
```

Output:

```
loading model
finish
Total time: 128.30054664611816
```
With only one sentence in input.txt, the run above took about two minutes. Suspecting that the runtime does not grow linearly with the number of sentences, I added an article to input.txt; segmenting the resulting nearly 100 sentences also took about two minutes.
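A simple way to check this observation is to time the call on inputs of different sizes (a hypothetical sketch; the file paths are placeholders):

```python
import time
import pkuseg

# time pkuseg.test to see how much of the total runtime is fixed
# overhead (e.g. model loading) versus actual segmentation work
start = time.time()
pkuseg.test('data/input.txt', 'data/output.txt', nthread=20)
print(f'elapsed: {time.time() - start:.1f}s')
```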
Code example 5: training a model
Since I could not determine the format of the data in msr_training.utf8 and had no training set, I did not test this example myself.
```python
import pkuseg

# train on 'msr_training.utf8', test on 'msr_test_gold.utf8',
# save the trained model to './models', using 20 processes
pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
```
5 Parameter Description
```python
pkuseg.pkuseg(model_name='msra', user_dict='safe_lexicon')
```

- model_name: the model path. Defaults to 'msra', the pre-trained model (for pip users only). Users can instead pass the path of a model they downloaded or trained themselves, e.g. model_name='./models'.
- user_dict: the user dictionary. Defaults to 'safe_lexicon', the bundled Chinese dictionary (pip only). Users can instead pass an iterable containing custom words.
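For example, loading a self-trained model together with a custom dictionary (a hypothetical combination of the two parameters above; './models' is the save directory from code example 5):

```python
import pkuseg

lexicon = ['北京大学', '北京天安门']  # custom words to keep intact
# load a self-trained model from './models' with a user dictionary
seg = pkuseg.pkuseg(model_name='./models', user_dict=lexicon)
print(seg.cut('我爱北京天安门'))
```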
```python
pkuseg.test(readFile, outputFile, model_name='msra', user_dict='safe_lexicon', nthread=10)
```

- readFile: input file path
- outputFile: output file path
- model_name: same as in pkuseg.pkuseg
- user_dict: same as in pkuseg.pkuseg
- nthread: number of processes started during testing
```python
pkuseg.train(trainFile, testFile, savedir, nthread=10)
```

- trainFile: training file path
- testFile: test file path
- savedir: path for saving the trained model
- nthread: number of processes started during training
6 Related Papers
This toolkit is based on the following literature:
- Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL 2012: 253-262
- Jingjing Xu, Xu Sun. Dependency-based Gated Recursive Neural Network for Chinese Word Segmentation. ACL 2016: 567-572
7 Objective Remarks
- Is the performance comparison with the other word segmentation toolkits fair? Doubts about this have been raised in the project's issues; take a look if you are interested. I will not comment further here.
- Second, pkuseg does not seem to support part-of-speech tagging. One workaround could be combining it with jieba: let pkuseg segment the text into words, then have jieba perform part-of-speech tagging on them (unverified); see the sketch below.
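A minimal sketch of that unverified idea, tagging each pkuseg token individually with jieba's POS tagger (jieba may re-split unknown tokens, so the output should be checked):

```python
import pkuseg
import jieba.posseg as pseg

seg = pkuseg.pkuseg()
words = seg.cut('我爱北京天安门')  # pkuseg performs the segmentation

# run jieba's POS tagger on each pkuseg token; each `pair` has .word and .flag
tagged = [(pair.word, pair.flag) for w in words for pair in pseg.cut(w)]
print(tagged)
```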
Feel free to follow me on Jianshu (简书).