I. GitHub star comparison of several projects

Many articles on Chinese word segmentation can be found online, but it is often unclear which tool to choose. Generally speaking, there is no best tool, only the most suitable one; factors to weigh include segmentation quality and support for Traditional Chinese. The number of stars on GitHub can also serve as one basis for choosing an open source project.

  • HanLP: 21.4k GitHub stars

[github.com/hankcs/HanL…](link.zhihu.com/?target=htt…)

  • Jieba: 24.9k GitHub stars

[fxsjy/jieba (github.com)](link.zhihu.com/?target=htt…)

  • ik-analyzer: 589 GitHub stars

ik-analyzer integrates with Elasticsearch, Solr, and similar systems, so 589 stars looks low. However, ik-analyzer was mainly hosted on code.google, where the last update was the 2012 version.

[wks/ik-analyzer (github.com)](link.zhihu.com/?target=htt…)

  • Ansj_seg: 5.7k GitHub stars

[NLPchina/ansj_seg (github.com)](link.zhihu.com/?target=htt…)

I recommend jieba.

II. Detailed descriptions

(1) HanLP word segmentation

[hankcs/HanLP (github.com)](link.zhihu.com/?target=htt…)

Provides Chinese word segmentation (including shortest-path segmentation), part-of-speech tagging, new word recognition, named entity recognition, automatic summarization, text clustering, sentiment analysis, word2vec word vectors, and other functions; supports custom dictionaries;

Uses HMM, CRF, TextRank, word2vec, clustering, neural networks, and other algorithms;

Supports Java, C++, and Python;
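HMM-based segmentation, one of the approaches listed above, is usually framed as tagging every character with B (begin), M (middle), E (end), or S (single-character word) and decoding the best tag sequence with the Viterbi algorithm. The following is a minimal sketch of that idea, not HanLP's actual implementation: the `START`, `TRANS`, and `EMIT` tables and the `FLOOR` value are hypothetical toy numbers chosen only for illustration.

```python
import math

STATES = "BMES"  # Begin, Middle, End of a word; Single-character word
# Hypothetical toy parameters, not taken from any real model.
START = {"B": 0.6, "S": 0.4}
TRANS = {"B": {"M": 0.3, "E": 0.7}, "M": {"M": 0.3, "E": 0.7},
         "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.6, "S": 0.4}}
EMIT = {"S": {"我": 0.5, "爱": 0.4}, "B": {"北": 0.6}, "E": {"京": 0.6}, "M": {}}
FLOOR = 0.01  # emission probability assumed for unseen (state, character) pairs


def log_p(p):
    """Log of a probability, with log(0) mapped to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_tags(sentence):
    """Return the most probable B/M/E/S tag sequence for the characters."""
    if not sentence:
        return []
    scores = [{s: log_p(START.get(s, 0)) + log_p(EMIT[s].get(sentence[0], FLOOR))
               for s in STATES}]
    back = []
    for ch in sentence[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # Best previous state for reaching state s at this character.
            prev = max(STATES, key=lambda p: scores[-1][p] + log_p(TRANS[p].get(s, 0)))
            row[s] = (scores[-1][prev] + log_p(TRANS[prev].get(s, 0))
                      + log_p(EMIT[s].get(ch, FLOOR)))
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    # A sentence must end on a complete word, so the last tag is E or S.
    tags = [max("ES", key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return tags[::-1]


def tags_to_words(sentence, tags):
    """Cut the sentence after every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(sentence[start:i + 1])
            start = i + 1
    return words


text = "我爱北京"
print(tags_to_words(text, viterbi_tags(text)))  # ['我', '爱', '北京']
```

In a real system the three probability tables are estimated from a segmented corpus rather than written by hand.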

(2) Jieba word segmentation

[github.com/fxsjy/jieba…](link.zhihu.com/?target=htt…)

Finds the most probable segmentation based on word frequency; provides Chinese word segmentation, keyword extraction, and part-of-speech tagging; supports custom dictionaries;

Uses an HMM decoded with the Viterbi algorithm;

Supports Java, C++, and Python;
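The frequency-based step described above can be sketched as a dynamic program over every dictionary match in a sentence. This is a simplified illustration under stated assumptions, not jieba's source code: `FREQ` is a hypothetical toy word list, and `build_dag` and `segment` are names chosen for this example.

```python
import math

# Hypothetical toy dictionary: word -> frequency (not jieba's real data).
FREQ = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "生": 8,
        "的": 200, "起源": 30, "起": 6, "源": 4, "研": 2, "究": 2}


def build_dag(sentence, freq):
    """For each start index, list every end index where a dictionary word ends."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in freq]
        # Fall back to a single character so every position can be left.
        dag[i] = ends or [i + 1]
    return dag


def segment(sentence, freq=FREQ):
    """Return the segmentation with the highest sum of log word probabilities."""
    n = len(sentence)
    total = sum(freq.values()) or 1
    dag = build_dag(sentence, freq)
    # best[i] = (best log-probability of sentence[i:], end index of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            # Unknown single characters are treated as if seen once.
            (math.log(freq.get(sentence[i:j], 1) / total) + best[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print(segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

Passing a different `freq` mapping, for example one extended with domain terms, plays the role of a custom dictionary in this sketch.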

(3) LTP, Harbin Institute of Technology

[HIT-SCIR/ltp (github.com)](link.zhihu.com/?target=htt…)

Chinese word segmentation, part-of-speech tagging, syntactic parsing, and other functions;

Commercial use requires a paid license; calls to the hosted API are limited in the number of requests per second;

Available in C++, Python, and Java;

(4) THULAC, Tsinghua University

[thunlp/THULAC (github.com)](link.zhihu.com/?target=htt…)

Chinese word segmentation and part-of-speech tagging;

Available in Java, Python, and C++;

(5) PKUSEG, Peking University

[lancopku/pkuseg-python (github.com)](link.zhihu.com/?target=htt…)

Supports domain-specific word segmentation and part-of-speech tagging; users can train their own models;

Based on a CRF model trained with the self-developed ADF training method;

Python version available;

(6) Stanford word segmenter

[The Stanford Natural Language Processing Group (nlp.stanford.edu)](link.zhihu.com/?target=htt…)

Supports segmentation for multiple languages, including Chinese and English; provides an interface for training models and can also use the existing models, but it is relatively slow;

CRF algorithm implemented in Java;

(7) KCWS word segmentation

[koth/kcws (github.com)](link.zhihu.com/?target=htt…)

It has Chinese word segmentation and part-of-speech tagging functions, and supports custom dictionaries.

Uses word2vec, Bi-LSTM, and CRF;

(8) ZPar

[frcchang/zpar (github.com)](link.zhihu.com/?target=htt…)

Word segmentation and part-of-speech tagging for Chinese, English, and Spanish;

Written in C++;

(9) IKAnalyzer

[wks/ik-analyzer (github.com)](link.zhihu.com/?target=htt…)

Chinese word segmentation; supports custom dictionaries;

(10) Jcseg

[Spirit of the Lion/JCSEG (gitee.com)](link.zhihu.com/?target=htt…)

It has Chinese word segmentation, keyword extraction, automatic summarization, part-of-speech tagging, entity recognition and other functions, and supports custom dictionaries;

Based on MMSEG, TextRank, BM25, and other algorithms;
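MMSEG-style segmenters build on dictionary maximum matching. The sketch below shows only that simplest ingredient, greedy forward maximum matching; MMSEG's full algorithm adds further disambiguation rules, and none of this is Jcseg's actual code. The `DICTIONARY` contents, the `max_len` default, and the function name are hypothetical choices for this example.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest window first and shrink it one character at a time.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a custom dictionary simply adds entries to it.
DICTIONARY = {"中文", "分词", "中文分词", "工具", "自定义", "词典"}

print(forward_max_match("中文分词工具", DICTIONARY))  # ['中文分词', '工具']
```

Because the match is greedy, the result depends heavily on dictionary coverage, which is one reason custom dictionaries matter for this family of tools.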

(11) FudanNLP

[FudanNLP/fnlp (github.com)](link.zhihu.com/?target=htt…)

Chinese word segmentation, part-of-speech tagging, named entity recognition, and keyword extraction;

(12) SnowNLP

[isnowfy/snownlp (github.com)](link.zhihu.com/?target=htt…)

Chinese word segmentation, part-of-speech tagging, sentiment analysis, text classification, keyword extraction and other functions;

Based on HMM, Naive Bayes, TextRank, TF-IDF and other algorithms;

Python library;
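Of the algorithms listed above, TF-IDF is the easiest to illustrate for keyword extraction. The sketch below assumes the documents have already been segmented into word lists; `tfidf_keywords` and the sample `docs` are hypothetical, and SnowNLP's own weighting and smoothing may differ.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of one pre-segmented document by TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    length = len(docs[doc_index])
    scores = {
        word: (count / length) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical pre-segmented documents.
docs = [
    ["分词", "工具", "支持", "词典"],
    ["分词", "算法", "基于", "词典"],
    ["情感", "分析", "工具", "情感"],
]

print(tfidf_keywords(docs, 2, top_k=2))  # ['情感', '分析']
```

Words that appear in every document get a score of zero here, which is the intended effect: they carry no information about what makes one document distinctive.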

(13) Ansj word segmentation

[NLPchina/ansj_seg (github.com)](link.zhihu.com/?target=htt…)

Chinese word segmentation, person-name recognition, part-of-speech tagging, user-defined dictionaries, and other functions;

Based on n-gram + CRF + HMM;

(14) NLTK

[nltk/nltk (github.com)](link.zhihu.com/?target=htt…)

NLTK is good at English word segmentation and also supports Chinese. However, for a Chinese corpus it is recommended to segment the text with another tool first and then use NLTK's processing functions.

Python library;

(15) Paoding word segmentation

[code.google.com/p/paoding](link.zhihu.com/?target=htt…)

3.2 Other tools

(1) NLPIR, Institute of Computing Technology, Chinese Academy of Sciences

Provides word segmentation, part-of-speech tagging, new word recognition, named entity recognition, sentiment analysis, keyword extraction, and other functions; supports custom dictionaries;

(2) Tencent Wenzhi

(3) BosonNLP

(4) Baidu NLP

(5) Aliyun NLP

(6) Sina Cloud

(7) Pangu word segmentation

This article is excerpted from:

[Hua Tianqing: Summary notes on Chinese word segmentation methods and software tools (zhuanlan.zhihu.com)](zhuanlan.zhihu.com/p/86322679)