How to choose a Chinese word segmenter: jieba, ik-analyzer, ansj_seg, or HanLP

I. GitHub star comparison of several projects

There are plenty of articles about Chinese word segmentation online, but few make it clear which tool to pick. There is no single best tool, only the one that fits your needs: segmentation quality, support for Traditional Chinese, and similar factors are all worth weighing. GitHub stars are another reasonable signal when choosing an open source project.

HanLP: 21.4k stars on GitHub

[github.com/hankcs/HanL…](link.zhihu.com/?target=htt…)

jieba: 24.9k stars on GitHub

[fxsjy/jieba, github.com](link.zhihu.com/?target=htt…)

ik-analyzer: 589 stars on GitHub

ik-analyzer integrates with Elasticsearch, Solr, and similar search engines, so 589 stars looks low. The likely reason is that the project lived mainly on code.google.com, where the last update was the 2012 version.

[wks/ik-analyzer, github.com](link.zhihu.com/?target=htt…)

ansj_seg: 5.7k stars on GitHub

[NLPchina/ansj_seg, github.com](link.zhihu.com/?target=htt…)

I suggest using jieba.

II. Details of each tool

(1) HanLP

[hankcs/HanLP, github.com](link.zhihu.com/?target=htt…)

Provides shortest-path word segmentation, Chinese word segmentation, part-of-speech tagging, new-word recognition, named entity recognition, automatic summarization, text clustering, sentiment analysis, word2vec word vectors, and other functions; supports custom dictionaries;

Uses HMM, CRF, TextRank, word2vec, clustering, neural networks, and other algorithms;

Supports Java, C++, and Python;
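To make the idea of shortest-path segmentation concrete, here is a minimal Python sketch. It is not HanLP's implementation: the dictionary `TOY_DICT`, the function name `shortest_path_segment`, and the `max_len` limit are all assumptions made for illustration. The sketch treats every dictionary word (and every single character) as an edge between character positions and picks the path that uses the fewest edges.

```python
# Toy dictionary (an assumption for illustration, not HanLP's lexicon).
TOY_DICT = {"研究", "研究生", "生命", "命", "的", "起源"}


def shortest_path_segment(text, dictionary=TOY_DICT, max_len=4):
    """Segment text into the fewest words; single characters are always allowed."""
    n = len(text)
    # best[i] = minimal number of words needed to cover text[:i]
    best = [0] + [n + 1] * n
    # back[i] = start index of the last word on the best path ending at i
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if len(word) == 1 or word in dictionary:
                if best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(shortest_path_segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

In this example two four-word paths exist (the other one starts with 研究生), and the sketch simply keeps the first one it finds. A real system would break such ties with word frequencies or a language model.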

(2) jieba ("stutter" segmenter)

[github.com/fxsjy/jieba…](link.zhihu.com/?target=htt…)

Finds the most probable segmentation based on word frequency; provides Chinese word segmentation, keyword extraction, and part-of-speech tagging; supports custom dictionaries;

Uses an HMM decoded with the Viterbi algorithm;

Supports Java, C++, and Python;
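The frequency-based search described above can be sketched in a few lines of Python. This is a simplified illustration, not jieba's source code: the frequency table `TOY_FREQ` holds made-up counts, and the function names `build_dag` and `max_prob_segment` are assumptions. The sketch lists every dictionary word starting at each position, then uses dynamic programming from right to left to pick the split with the highest total log-probability.

```python
import math

# Toy word counts (assumed values for illustration, not a real dictionary).
TOY_FREQ = {
    "研究": 50, "研究生": 10, "生命": 40, "命": 5, "的": 200, "起源": 20,
    "研": 2, "究": 2, "生": 8, "起": 6, "源": 3,
}
TOTAL = sum(TOY_FREQ.values())


def build_dag(text, freq=TOY_FREQ):
    """Map each start index to the inclusive end indices of words starting there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i, len(text)) if text[i:j + 1] in freq]
        dag[i] = ends or [i]  # unknown character: fall back to a one-char word
    return dag


def max_prob_segment(text, freq=TOY_FREQ, total=TOTAL):
    """Return the segmentation with the highest product of word probabilities."""
    dag = build_dag(text, freq)
    n = len(text)
    log_total = math.log(total)
    # route[i] = (best log-probability of text[i:], end index of its first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(text[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j + 1])
        i = j + 1
    return words


print(max_prob_segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

With the jieba package installed, the corresponding library call is `jieba.lcut(text)`, and `jieba.load_userdict(path)` loads a custom dictionary file.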

(3) LTP, Harbin Institute of Technology (HIT)

[HIT-SCIR/ltp, github.com](link.zhihu.com/?target=htt…)

Provides Chinese word segmentation, part-of-speech tagging, syntactic parsing, and other functions;

Commercial use requires a paid license; when calling the hosted API, the number of requests per second is limited;

Available in C++, Python, and Java;

(4) THULAC, Tsinghua University

[thunlp/THULAC, github.com](link.zhihu.com/?target=htt…)

Chinese word segmentation and part-of-speech tagging;

Available in Java, Python, and C++;

(5) pkuseg, Peking University

[lancopku/pkuseg-python, github.com](link.zhihu.com/?target=htt…)

Supports domain-specific word segmentation and part-of-speech tagging, and lets users train their own models;

Based on a CRF model trained with the self-developed ADF training method;

Available for Python;

(6) Stanford Word Segmenter

[The Stanford Natural Language Processing Group, nlp.stanford.edu](link.zhihu.com/?target=htt…)

Supports segmentation for multiple languages, including Chinese and English; provides an interface for training models and can also use the existing models, but it is relatively slow;

Uses the CRF algorithm, implemented in Java;

(7) KCWS

[koth/kcws, github.com](link.zhihu.com/?target=htt…)

Provides Chinese word segmentation and part-of-speech tagging, and supports custom dictionaries;

Uses word2vec, Bi-LSTM, and CRF;

(8) ZPar

[frcchang/zpar, github.com](link.zhihu.com/?target=htt…)

Word segmentation and part-of-speech tagging for Chinese, English, and Spanish;

Written in C++;

(9) IKAnalyzer

[wks/ik-analyzer, github.com](link.zhihu.com/?target=htt…)

Provides Chinese word segmentation and supports custom dictionaries;

(10) Jcseg

[Spirit of the Lion/JCSEG, gitee.com](link.zhihu.com/?target=htt…)

Provides Chinese word segmentation, keyword extraction, automatic summarization, part-of-speech tagging, entity recognition, and other functions; supports custom dictionaries;

Based on MMSEG, TextRank, BM25, and other algorithms;
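MMSEG builds on dictionary-based maximum matching and adds disambiguation rules on top. The sketch below shows only the simplest building block, forward maximum matching, and is not Jcseg's code: the dictionary `TOY_DICT`, the function name `forward_max_match`, and the `max_len` limit are assumptions for illustration.

```python
# Toy dictionary (an assumption for illustration).
TOY_DICT = {"中文", "分词", "中文分词", "工具", "词"}


def forward_max_match(text, dictionary=TOY_DICT, max_len=4):
    """Greedily take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


print(forward_max_match("中文分词工具"))  # ['中文分词', '工具']
```

Greedy matching is fast but can pick the wrong word when candidates overlap, which is the kind of case disambiguation rules are meant to handle.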

(11) FudanNLP

[FudanNLP/fnlp, github.com](link.zhihu.com/?target=htt…)

Provides Chinese word segmentation, part-of-speech tagging, named entity recognition, and keyword extraction;

(12) SnowNLP

[isnowfy/snownlp, github.com](link.zhihu.com/?target=htt…)

Provides Chinese word segmentation, part-of-speech tagging, sentiment analysis, text classification, keyword extraction, and other functions;

Based on HMM, Naive Bayes, TextRank, TF-IDF, and other algorithms;

Python library;
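Of the algorithms listed, TF-IDF is the easiest to show in a few lines. The sketch below uses the textbook formula (term frequency times the log of inverse document frequency) on text that has already been segmented. It is not SnowNLP's implementation; the function name `tfidf_top_keywords` and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_top_keywords(docs, doc_index, top_k=1):
    """Rank the tokens of docs[doc_index] by TF-IDF; docs is a list of token lists."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    tokens = docs[doc_index]
    term_freq = Counter(tokens)
    scores = {
        tok: (count / len(tokens)) * math.log(n_docs / doc_freq[tok])
        for tok, count in term_freq.items()
    }
    # Sort by descending score; break ties by token for a stable order.
    ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
    return ranked[:top_k]


docs = [
    ["分词", "工具", "分词", "很", "好"],
    ["工具", "很", "多"],
    ["天气", "很", "好"],
]
print(tfidf_top_keywords(docs, 0))  # ['分词']
```

The token 很 appears in every document, so its inverse document frequency is zero and it never ranks as a keyword.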

(13) ansj_seg

[NLPchina/ansj_seg, github.com](link.zhihu.com/?target=htt…)

Provides Chinese word segmentation, personal-name recognition, part-of-speech tagging, user-defined dictionaries, and other functions;

Based on n-gram + CRF + HMM;
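HMM-based segmenters usually label each character as B (begin), M (middle), E (end), or S (single-character word) and decode the best label sequence with the Viterbi algorithm. The sketch below shows that decoding step only. It is not ansj_seg's code, and every probability in `START`, `TRANS`, `EMIT`, and `FLOOR` is a toy value assumed for illustration.

```python
import math

STATES = "BMES"
NEG_INF = float("-inf")

# Toy log-probabilities (assumed values). Only transitions that are legal in
# the BMES scheme are listed; anything else counts as impossible.
START = {"B": math.log(0.6), "S": math.log(0.4)}
TRANS = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.3), "E": math.log(0.7)},
    "E": {"B": math.log(0.6), "S": math.log(0.4)},
    "S": {"B": math.log(0.6), "S": math.log(0.4)},
}
EMIT = {
    "B": {"中": math.log(0.8)},
    "M": {},
    "E": {"文": math.log(0.8)},
    "S": {"好": math.log(0.8)},
}
FLOOR = math.log(1e-6)  # log-probability for characters unseen in a state


def viterbi_tags(text):
    """Return the most likely BMES tag sequence for a non-empty string."""
    score = {s: START.get(s, NEG_INF) + EMIT[s].get(text[0], FLOOR) for s in STATES}
    pointers = []
    for ch in text[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            # Best previous state for reaching state s at this character.
            best, prev = max((score[p] + TRANS[p].get(s, NEG_INF), p) for p in STATES)
            new_score[s] = best + EMIT[s].get(ch, FLOOR)
            ptr[s] = prev
        score = new_score
        pointers.append(ptr)
    # A sentence must end on E (end of word) or S (single-character word).
    tag = max("ES", key=lambda s: score[s])
    tags = [tag]
    for ptr in reversed(pointers):
        tag = ptr[tag]
        tags.append(tag)
    return tags[::-1]


def tags_to_words(text, tags):
    """Cut the text after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


tags = viterbi_tags("中文好")
print(tags, tags_to_words("中文好", tags))  # ['B', 'E', 'S'] ['中文', '好']
```

In practice the start, transition, and emission probabilities are estimated from a segmented training corpus rather than written by hand.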

(14) NLTK

[nltk/nltk, github.com](link.zhihu.com/?target=htt…)

NLTK is strong at English word segmentation and also supports Chinese. For a Chinese corpus, however, it is recommended to segment the text with another tool first and then use NLTK's processing functions.

Python library;

(15) Paoding

[code.google.com/p/paoding](link.zhihu.com/?target=htt…)

3.2 Other tools

(1) NLPIR, Institute of Computing Technology, Chinese Academy of Sciences

Provides word segmentation, part-of-speech tagging, new-word recognition, named entity recognition, sentiment analysis, keyword extraction, and other functions; supports custom dictionaries;

(2) Tencent Wenzhi

(3) BosonNLP

(4) Baidu NLP

(5) Aliyun NLP

(6) Sina Cloud

(7) Pangu segmenter

This article is excerpted from:

[Hua Tianqing: Summary notes on Chinese word segmentation methods and software tools, zhuanlan.zhihu.com](zhuanlan.zhihu.com/p/86322679)
