
1. Jieba overview

In natural language processing tasks, Chinese text must first be segmented into individual words, which is where the Chinese word segmentation tool jieba comes in.

Jieba is an open source project hosted at github.com/fxsjy/jieba.

It performs well in both segmentation accuracy and speed.

2. Jieba installation

  1. Fully automatic installation
pip install jieba / pip3 install jieba
  2. Semi-automatic installation
  • Download the package from pypi.python.org/pypi/jieba/
  • After decompressing it, run python setup.py install
  3. Manual installation
  • Place the entire jieba directory in Python's site-packages directory
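After installing by any of these methods, you can verify the installation (assuming a recent jieba release, which exposes a __version__ attribute):

import jieba

# quick sanity check that jieba is importable, and report its version
print(jieba.__version__)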

3. How jieba word segmentation works

  • Initialization. Load the dictionary file to get each word and its frequency.
  • Sentence splitting. Using regular expressions, the text is split into blocks, and the blocks are then segmented into words.
  • Build a DAG. Through string matching against the dictionary, a directed acyclic graph (DAG) of all possible segmentations is constructed.
  • Compute the maximum-probability path. For each character position, calculate the maximum probability over all paths from that position to the end of the sentence, and record the word-end position corresponding to that maximum (a simplified sketch follows this list).
  • Assemble the segmentation. Walking the recorded path yields the final word segmentation.
  • Handle new words with an HMM. Words not in jieba's dictionary are handled statistically: jieba uses a hidden Markov model (HMM) to recognize them.
  • Return the results. The segmented words are yielded one by one; a generator saves memory compared with building a list.

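To make the DAG and maximum-probability steps concrete, here is a simplified, self-contained sketch of the idea. This is not jieba's actual code; the toy dictionary and its frequencies are invented for illustration:

import math

# toy dictionary: word -> frequency (hypothetical values)
FREQ = {'我': 5, '爱': 4, '酸辣粉': 3, '酸': 2, '辣': 1, '粉': 2}
TOTAL = sum(FREQ.values())

def get_dag(sentence):
    # for each start index, list every end index that forms a dictionary word
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start + 1, len(sentence) + 1)
                if sentence[start:end] in FREQ]
        dag[start] = ends or [start + 1]  # unknown character: cut it alone
    return dag

def best_cut(sentence):
    # dynamic programming from right to left, maximizing total log-probability
    dag = get_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, n)}  # the empty tail has log-probability 0
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i])
    # follow the recorded best end positions to assemble the words
    i, words = 0, []
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(best_cut('我爱酸辣粉'))  # -> ['我', '爱', '酸辣粉']

Jieba does the same thing with its full dictionary and log frequencies, which is why frequent dictionary words win over character-by-character cuts.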

4. Jieba features

1. Three word segmentation modes

Jieba supports three word segmentation modes:

  • Precise mode: the text is cut precisely, with no redundant words.
  • Full mode: all words that can possibly be formed from the text are scanned out, so the result contains redundancy.
  • Search engine mode: on top of precise mode, long words are segmented again, which is suitable for search engine indexing.

Let’s demonstrate the three modes of jieba with an example:

import jieba

sentence = 'I love Chongqing Hot and Sour powder and old hot pot'

seg_list = jieba.cut(sentence, cut_all=True)
print('Full mode: {}'.format('/'.join(seg_list)))

seg_list = jieba.cut(sentence, cut_all=False)
print('Precise mode: {}'.format('/'.join(seg_list)))

seg_list = jieba.cut_for_search(sentence)
print('Search engine mode: {}'.format('/'.join(seg_list)))

Running this prints the segmentation produced by each of the three modes.

  • jieba.cut accepts three arguments: the string to be segmented; cut_all, which controls whether full mode is used (the default is False, i.e. precise mode); and HMM, which controls whether the HMM model is used to recognize unknown words.
  • jieba.cut_for_search accepts two arguments: the string to be segmented, and a flag controlling whether the HMM model is used.
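Note that jieba.cut and jieba.cut_for_search return generators. If a plain list is more convenient, jieba.lcut and jieba.lcut_for_search take the same arguments but return lists:

import jieba

# same arguments as jieba.cut, but returns a list instead of a generator
print(jieba.lcut('I love Chongqing Hot and Sour powder and old hot pot'))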

2. Jieba supports segmenting traditional Chinese text

3. Jieba supports custom dictionaries

4. Jieba is released under the MIT license (an open source software license)

5. Loading a custom dictionary

Jieba's basic usage covers basic segmentation needs, but real-world text is more complicated.

Text often contains new words that are not defined in jieba's lexicon, so jieba cannot segment it the way the developer intends. How can we solve this problem?

The answer is to load a custom dictionary.

A custom dictionary is loaded with the jieba.load_userdict function:

  • This function takes only one argument,
  • which is a file path or a file-like object containing the custom dictionary.
  • Note that the file must be UTF-8 encoded.

The format of a custom dictionary is as follows (an example file follows this list):

  • Each word occupies one line, and each line is made up of up to three parts,
  • separated by spaces: the word, its frequency, and its part of speech.
  • The frequency and the part of speech can be omitted, but the order cannot be changed.
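For illustration, a userdict.txt (UTF-8) might contain entries like these, in the style of jieba's own example dictionary (the words and frequencies here are hypothetical):

云计算 5
创新办 3 i
easy_install 3 eng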

Now run the following code to see the custom dictionary in action:

import jieba

sentence = 'I love Chongqing Hot and Sour powder and Chongqing old hot pot'

seg_list = jieba.cut(sentence)
print('No custom dictionary: {}'.format('/'.join(seg_list)))

jieba.load_userdict('./userdict.txt')
seg_list = jieba.cut(sentence)
print('With custom dictionary: {}'.format('/'.join(seg_list)))

Running this prints the segmentation before and after the custom dictionary is loaded.

We can see that, with the custom dictionary loaded, the words are segmented the way we intended.

6. Jieba dynamic dictionary adjustment

Much of the time, maintaining a full custom dictionary adds unnecessary development overhead, so jieba also provides functions for adjusting the dictionary at runtime:

  • jieba.add_word(word, freq=None, tag=None) and jieba.del_word(word) dynamically add and delete words.
  • jieba.suggest_freq(segment, tune=True) adjusts the frequency of a single word so that it can (or cannot) be segmented out.

For example:

import jieba

sentence = 'I eat hot and sour noodles.'
word = 'hot and sour noodles'

jieba.add_word(word)                  # add the new word to the dictionary
jieba.suggest_freq(word, tune=True)   # tune its frequency so it stays whole

seg_list = jieba.cut(sentence)
print('/'.join(seg_list))

Running this prints the sentence segmented with the adjusted dictionary.

7. Jieba part-of-speech tagging

Jieba provides not only word segmentation but also part-of-speech tagging, which is very helpful for text mining.

Each word of the segmented sentence is tagged with a part of speech, using a tagging scheme compatible with ICTCLAS.

  • ICTCLAS: Institute of Computing Technology, Chinese Lexical Analysis System

Take the following example

import jieba.posseg as pseg

words = pseg.cut('I eat hot and sour noodles.')
for word, flag in words:
    print('{} {}'.format(word, flag))

Running this prints each word followed by its part-of-speech tag.

The jieba part-of-speech tag table (ICTCLAS-compatible) is as follows:

  • a adjective; ad adverbial adjective; an nominal adjective; ag adjective morpheme; al adjectival idiom
  • b distinguishing word; bl distinguishing-word idiom
  • c conjunction; cc coordinating conjunction
  • d adverb
  • e interjection
  • f locality word
  • h prefix
  • k suffix
  • m numeral; mq numeral-classifier compound
  • n noun; nr person name; nr1 Chinese surname; nr2 Chinese given name; nrj Japanese personal name; nrf transliterated personal name; ns place name; nsf transliterated place name; nt organization name; nz other proper noun; nl nominal idiom; ng nominal morpheme
  • o onomatopoeia
  • p preposition; pba the preposition 把 (ba); pbei the preposition 被 (bei)
  • q quantifier; qv verbal quantifier; qt temporal quantifier
  • r pronoun; rr personal pronoun; rz demonstrative pronoun; rzt temporal demonstrative pronoun; rzs locative demonstrative pronoun; rzv predicative demonstrative pronoun; ry interrogative pronoun; ryt temporal interrogative pronoun; rys locative interrogative pronoun; ryv predicative interrogative pronoun; rg pronominal morpheme
  • s place word
  • t time word; tg temporal morpheme
  • u particle; uzhe 着; ule 了 / 喽; uguo 过; ude1 的 / 底; ude2 地; ude3 得; usuo 所; udeng 等 / 等等; uyy 一样 / 一般; udh 的话; uls 来讲 / 来说; uzhi 之; ulian 连 (as in "even a schoolboy can do it")
  • v verb; vd adverbial verb; vshi the verb 是 ("to be"); vyou the verb 有 ("to have"); vf directional verb; vx pro-verb; vi intransitive verb; vl verbal idiom; vg verbal morpheme
  • w punctuation; wkz left bracket; wky right bracket; wyz left quotation mark; wyy right quotation mark; wj full stop; ww question mark; wt exclamation mark; wd comma; wf semicolon; wn enumeration comma (、); wm colon; ws ellipsis; wp dash; wb percent / per-mille sign (% ‰); wh unit symbol (¥ $ £ °C)
  • x string; xx non-morpheme character; xu URL
  • y modal particle
  • z status word
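As a practical use of these tags, the following sketch keeps only the nouns of a sentence by filtering on tags that start with n:

import jieba.posseg as pseg

# keep only words whose tag starts with 'n' (nouns and their subtypes)
words = pseg.cut('I eat hot and sour noodles.')
nouns = [word for word, flag in words if flag.startswith('n')]
print(nouns)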

8. Jieba parallel word segmentation

Jieba can split the target text by line, distribute the lines across multiple Python processes for parallel segmentation, and then merge the results, which gives a significant speed-up.

Usage:

  • jieba.enable_parallel(5) # enable parallel segmentation; the argument is the number of parallel processes (see the sketch below)
  • jieba.disable_parallel() # disable parallel segmentation
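A minimal sketch, assuming a UTF-8 text file named corpus.txt (a hypothetical file) and a non-Windows platform, since parallel mode is built on multiprocessing and does not support Windows:

import jieba

jieba.enable_parallel(4)  # segment with 4 worker processes
with open('corpus.txt', encoding='utf-8') as f:
    words = list(jieba.cut(f.read()))
jieba.disable_parallel()
print(len(words))  # number of tokens produced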

Jieba provides a set of performance test data on its Github page:

Precise-mode segmentation of the complete works of Jin Yong on a 4-core 3.4 GHz Linux machine reached a speed of 1 MB/s, 3.3 times that of the single-process version.

9. Jieba Tokenize

The jieba.tokenize function returns each word together with its start and end positions in the original text.

Take the following example:

import jieba

words = jieba.tokenize('I eat hot and sour noodles.')
for tk in words:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))

Running this prints each word with its start and end offsets.
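jieba.tokenize also accepts a mode parameter; in search mode, long words are additionally split into shorter ones, mirroring cut_for_search:

import jieba

# search mode also reports the positions of sub-words inside long words
for tk in jieba.tokenize('I eat hot and sour noodles.', mode='search'):
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))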
