FoolNLTK

A Chinese Processing Toolkit

A GitHub user has open-sourced a Chinese processing toolkit built on a bidirectional LSTM (BiLSTM). It performs word segmentation, part-of-speech tagging, and named entity recognition, and can improve segmentation results with a user-defined dictionary.

Features

  • Perhaps not the fastest open-source Chinese word segmenter, but quite possibly the most accurate one
  • Trained with a BiLSTM model
  • Word segmentation, part-of-speech tagging, and entity recognition, all with relatively high accuracy
  • Support for user-defined dictionaries
  • The ability to train your own models
  • Batch processing

Dependencies (tested successfully on Windows):

  • Python 3.5+
  • TensorFlow >= 1.0.0

Installation

pip install foolnltk
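
As a quick sanity check after installation, importing the package from the command line should succeed without errors:

python -c "import fool"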

Usage

Word segmentation
import fool

text = "A fool in Beijing."
print(fool.cut(text))
# [' a ', 'fool ',' in ', 'Beijing ']

Command-line word segmentation. You can specify the -b parameter to set how many lines are cut per batch, which can speed up segmentation:

python -m fool [filename]
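
For example, a hypothetical invocation that segments input.txt in batches of 1000 lines (the file name and batch size here are placeholders, assuming -b takes the line count as its value):

python -m fool -b 1000 input.txt
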
User-defined dictionary

The dictionary format is as follows: one word per line, followed by its weight. The higher a word's weight and the longer the word, the more likely it is to appear in the segmentation output. Weight values should be greater than 1.

难受香菇 10
什么鬼 10
分词工具 10
北京 10
北京天安门 10
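
For illustration, here is a minimal sketch that generates such a dictionary file from Python; the file name user_dict.txt and the entries are arbitrary examples:

# Write a user dictionary: one "word weight" entry per line, UTF-8 encoded.
entries = {"难受香菇": 10, "什么鬼": 10, "分词工具": 10}
with open("user_dict.txt", "w", encoding="utf-8") as f:
    for word, weight in entries.items():
        f.write("{} {}\n".format(word, weight))

The resulting file path can then be passed to fool.load_userdict(), as shown next.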

Loading the dictionary

import fool

fool.load_userdict(path)  # path is the path to your dictionary file
text = ["我在北京天安门看你难受香菇", "我在北京晒太阳你在非洲看雪"]
print(fool.cut(text))
# [['我', '在', '北京', '天安门', '看', '你', '难受', '香菇'],
#  ['我', '在', '北京', '晒太阳', '你', '在', '非洲', '看', '雪']]

Delete the dictionary

fool.delete_userdict()

Part-of-speech tagging

import fool

text = ["一个傻子在北京"]
print(fool.pos_cut(text))
# [[('一个', 'm'), ('傻子', 'n'), ('在', 'p'), ('北京', 'ns')]]
# The tags follow common Chinese POS conventions:
# 'm' numeral, 'n' noun, 'p' preposition, 'ns' place name.

Entity recognition

import fool

text = ["一个傻子在北京", "你好啊"]  # "A fool is in Beijing", "Hello"
words, ners = fool.analysis(text)  # returns per-sentence tokens and entities
print(ners)
# [[(5, 8, 'location', '北京')]]
# Each entity is a (start, end, type, text) tuple.

Note: if model files are reported as missing, check sys.prefix, which usually points to /usr/local. Open source address: github.com/rockyzhengw…
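
A minimal way to check the install prefix from Python, using only the standard library:

import sys

# Model files are typically installed under the interpreter's prefix,
# e.g. /usr/local on many systems.
print(sys.prefix)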