
1. Jieba overview

In natural language processing tasks, Chinese text must first be segmented into individual words, which is where the Chinese word segmentation tool jieba comes in.

Jieba is an open source project hosted at github.com/fxsjy/jieba.

It performs well in both segmentation accuracy and speed.

2. Jieba installation

  1. Fully automatic installation
pip install jieba / pip3 install jieba
  2. Semi-automatic installation
  • Download the package from pypi.python.org/pypi/jieba/
  • After decompressing it, run python setup.py install
  3. Manual installation
  • Place the entire jieba directory in Python's site-packages directory
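After installing by any of these methods, you can verify the installation (assuming a recent jieba release, which exposes a __version__ attribute):

import jieba

# quick sanity check that jieba is importable, and report its version
print(jieba.__version__)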

3. How jieba word segmentation works

  • Initialization. Load the dictionary file to get each word and its frequency.
  • Sentence splitting. Using regular expressions, the text is split into blocks, and the blocks are then segmented into words.
  • Build a DAG. Through string matching against the dictionary, a directed acyclic graph (DAG) of all possible segmentations is constructed.
  • Compute the maximum-probability path. For each character position, calculate the maximum probability over all paths from that position to the end of the sentence, and record the word-end position corresponding to that maximum (a simplified sketch follows this list).
  • Assemble the segmentation. Walking the recorded path yields the final word segmentation.
  • Handle new words with an HMM. Words not in jieba's dictionary are handled statistically: jieba uses a hidden Markov model (HMM) to recognize them.
  • Return the results. The segmented words are yielded one by one; a generator saves memory compared with building a list.

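To make the DAG and maximum-probability steps concrete, here is a simplified, self-contained sketch of the idea. This is not jieba's actual code; the toy dictionary and its frequencies are invented for illustration:

import math

# toy dictionary: word -> frequency (hypothetical values)
FREQ = {'我': 5, '爱': 4, '酸辣粉': 3, '酸': 2, '辣': 1, '粉': 2}
TOTAL = sum(FREQ.values())

def get_dag(sentence):
    # for each start index, list every end index that forms a dictionary word
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start + 1, len(sentence) + 1)
                if sentence[start:end] in FREQ]
        dag[start] = ends or [start + 1]  # unknown character: cut it alone
    return dag

def best_cut(sentence):
    # dynamic programming from right to left, maximizing total log-probability
    dag = get_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, n)}  # the empty tail has log-probability 0
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i])
    # follow the recorded best end positions to assemble the words
    i, words = 0, []
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(best_cut('我爱酸辣粉'))  # -> ['我', '爱', '酸辣粉']

Jieba does the same thing with its full dictionary and log frequencies, which is why frequent dictionary words win over character-by-character cuts.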

4. Jieba features

1. Three word segmentation modes

Jieba supports three word segmentation modes:

  • Precise mode: the text is cut precisely, with no redundant words.
  • Full mode: all words that can possibly be formed from the text are scanned out, so the result contains redundancy.
  • Search engine mode: on top of precise mode, long words are segmented again, which is suitable for search engine indexing.

Let’s demonstrate the three modes of jieba with an example:

import jieba

sentence = 'I love Chongqing Hot and Sour powder and old hot pot'

seg_list = jieba.cut(sentence, cut_all=True)
print('Full mode: {}'.format('/'.join(seg_list)))

seg_list = jieba.cut(sentence, cut_all=False)
print('Precise mode: {}'.format('/'.join(seg_list)))

seg_list = jieba.cut_for_search(sentence)
print('Search engine mode: {}'.format('/'.join(seg_list)))

Running this prints the segmentation produced by each of the three modes.

  • jieba.cut accepts three arguments: the string to be segmented; cut_all, which controls whether full mode is used (the default is False, i.e. precise mode); and HMM, which controls whether the HMM model is used to recognize unknown words.
  • jieba.cut_for_search accepts two arguments: the string to be segmented, and a flag controlling whether the HMM model is used.
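Note that jieba.cut and jieba.cut_for_search return generators. If a plain list is more convenient, jieba.lcut and jieba.lcut_for_search take the same arguments but return lists:

import jieba

# same arguments as jieba.cut, but returns a list instead of a generator
print(jieba.lcut('I love Chongqing Hot and Sour powder and old hot pot'))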

2. Jieba supports segmenting traditional Chinese text

3. Jieba supports custom dictionaries

4. Jieba is released under the MIT license (an open source software license)

5. Loading a custom dictionary

Jieba's basic usage covers basic segmentation needs, but real-world text is more complicated.

Text often contains new words that are not defined in jieba's lexicon, so jieba cannot segment it the way the developer intends. How can we solve this problem?

The answer is to load a custom dictionary.

A custom dictionary is loaded with the jieba.load_userdict function:

  • This function takes only one argument,
  • which is a file path or a file-like object containing the custom dictionary.
  • Note that the file must be UTF-8 encoded.

The format of a custom dictionary is as follows (an example file follows this list):

  • Each word occupies one line, and each line is made up of up to three parts,
  • separated by spaces: the word, its frequency, and its part of speech.
  • The frequency and the part of speech can be omitted, but the order cannot be changed.
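For illustration, a userdict.txt (UTF-8) might contain entries like these, in the style of jieba's own example dictionary (the words and frequencies here are hypothetical):

云计算 5
创新办 3 i
easy_install 3 eng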

Now run the following code to see the custom dictionary in action:

import jieba

sentence = 'I love Chongqing Hot and Sour powder and Chongqing old hot pot'

seg_list = jieba.cut(sentence)
print('No custom dictionary: {}'.format('/'.join(seg_list)))

jieba.load_userdict('./userdict.txt')
seg_list = jieba.cut(sentence)
print('With custom dictionary: {}'.format('/'.join(seg_list)))

Running this prints the segmentation before and after the custom dictionary is loaded.

We can see that, with the custom dictionary loaded, the words are segmented the way we intended.

6. Jieba dynamic dictionary adjustment

Much of the time, maintaining a full custom dictionary adds unnecessary development overhead, so jieba also provides functions for adjusting the dictionary at runtime:

  • jieba.add_word(word, freq=None, tag=None) and jieba.del_word(word) dynamically add and delete words.
  • jieba.suggest_freq(segment, tune=True) adjusts the frequency of a single word so that it can (or cannot) be segmented out.

For example:

import jieba

sentence = 'I eat hot and sour noodles.'
word = 'hot and sour noodles'

jieba.add_word(word)                  # add the new word to the dictionary
jieba.suggest_freq(word, tune=True)   # tune its frequency so it stays whole

seg_list = jieba.cut(sentence)
print('/'.join(seg_list))

Running this prints the sentence segmented with the adjusted dictionary.

7. Jieba part-of-speech tagging

Jieba provides not only word segmentation but also part-of-speech tagging, which is very helpful for text mining.

Each word of the segmented sentence is tagged with a part of speech, using a tagging scheme compatible with ICTCLAS.

  • ICTCLAS: Institute of Computing Technology, Chinese Lexical Analysis System

Take the following example

import jieba.posseg as pseg

words = pseg.cut('I eat hot and sour noodles.')
for word, flag in words:
    print('{} {}'.format(word, flag))

Running this prints each word followed by its part-of-speech tag.

The jieba part-of-speech tag table (ICTCLAS-compatible) is as follows:

  • a adjective; ad adverbial adjective; an nominal adjective; ag adjective morpheme; al adjectival idiom
  • b distinguishing word; bl distinguishing-word idiom
  • c conjunction; cc coordinating conjunction
  • d adverb
  • e interjection
  • f locality word
  • h prefix
  • k suffix
  • m numeral; mq numeral-classifier compound
  • n noun; nr person name; nr1 Chinese surname; nr2 Chinese given name; nrj Japanese personal name; nrf transliterated personal name; ns place name; nsf transliterated place name; nt organization name; nz other proper noun; nl nominal idiom; ng nominal morpheme
  • o onomatopoeia
  • p preposition; pba the preposition 把 (ba); pbei the preposition 被 (bei)
  • q quantifier; qv verbal quantifier; qt temporal quantifier
  • r pronoun; rr personal pronoun; rz demonstrative pronoun; rzt temporal demonstrative pronoun; rzs locative demonstrative pronoun; rzv predicative demonstrative pronoun; ry interrogative pronoun; ryt temporal interrogative pronoun; rys locative interrogative pronoun; ryv predicative interrogative pronoun; rg pronominal morpheme
  • s place word
  • t time word; tg temporal morpheme
  • u particle; uzhe 着; ule 了 / 喽; uguo 过; ude1 的 / 底; ude2 地; ude3 得; usuo 所; udeng 等 / 等等; uyy 一样 / 一般; udh 的话; uls 来讲 / 来说; uzhi 之; ulian 连 (as in "even a schoolboy can do it")
  • v verb; vd adverbial verb; vshi the verb 是 ("to be"); vyou the verb 有 ("to have"); vf directional verb; vx pro-verb; vi intransitive verb; vl verbal idiom; vg verbal morpheme
  • w punctuation; wkz left bracket; wky right bracket; wyz left quotation mark; wyy right quotation mark; wj full stop; ww question mark; wt exclamation mark; wd comma; wf semicolon; wn enumeration comma (、); wm colon; ws ellipsis; wp dash; wb percent / per-mille sign (% ‰); wh unit symbol (¥ $ £ °C)
  • x string; xx non-morpheme character; xu URL
  • y modal particle
  • z status word
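As a practical use of these tags, the following sketch keeps only the nouns of a sentence by filtering on tags that start with n:

import jieba.posseg as pseg

# keep only words whose tag starts with 'n' (nouns and their subtypes)
words = pseg.cut('I eat hot and sour noodles.')
nouns = [word for word, flag in words if flag.startswith('n')]
print(nouns)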

8. Jieba parallel word segmentation

Jieba can split the target text by line, distribute the lines across multiple Python processes for parallel segmentation, and then merge the results, which gives a significant speed-up.

Usage:

  • jieba.enable_parallel(5) # enable parallel segmentation; the argument is the number of parallel processes (see the sketch below)
  • jieba.disable_parallel() # disable parallel segmentation
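A minimal sketch, assuming a UTF-8 text file named corpus.txt (a hypothetical file) and a non-Windows platform, since parallel mode is built on multiprocessing and does not support Windows:

import jieba

jieba.enable_parallel(4)  # segment with 4 worker processes
with open('corpus.txt', encoding='utf-8') as f:
    words = list(jieba.cut(f.read()))
jieba.disable_parallel()
print(len(words))  # number of tokens produced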

Jieba provides a set of performance test data on its Github page:

Precise-mode segmentation of the complete works of Jin Yong on a 4-core 3.4 GHz Linux machine reached a speed of 1 MB/s, 3.3 times that of the single-process version.

9. Jieba Tokenize

The jieba.tokenize function returns each word together with its start and end positions in the original text.

Take the following example:

import jieba

words = jieba.tokenize('I eat hot and sour noodles.')
for tk in words:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))

Running this prints each word with its start and end offsets.
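jieba.tokenize also accepts a mode parameter; in search mode, long words are additionally split into shorter ones, mirroring cut_for_search:

import jieba

# search mode also reports the positions of sub-words inside long words
for tk in jieba.tokenize('I eat hot and sour noodles.', mode='search'):
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))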
