1. Jieba overview

In natural language processing, Chinese text has no spaces between words, so individual words must be obtained through word segmentation; the Chinese word segmentation tool jieba does exactly this.
Jieba is an open source project hosted at github.com/fxsjy/jieba.
It performs well in both segmentation accuracy and speed.
2. Jieba installation

- Fully automatic installation:

```
pip install jieba   # or: pip3 install jieba
```

- Semi-automatic installation:
  - Download the package from pypi.python.org/pypi/jieba/
  - Unpack it and run:

```
python setup.py install
```

- Manual installation:
  - Place the entire jieba directory in Python's site-packages directory.
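To verify the installation, a quick sanity check (assuming a recent jieba release, which exposes a version string):

```python
import jieba

# If this prints a version number, jieba was installed correctly
print(jieba.__version__)
```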
3. How jieba word segmentation works

- Initialization: load the dictionary file and record each word together with the number of times it appears.
- Sentence splitting: using regular expressions, split the text into sentences, then segment each sentence into words.
- DAG construction: by string matching against the dictionary, build a directed acyclic graph (DAG) covering every possible segmentation of the sentence.
- Maximum-probability path: for each character position, compute the maximum probability over all paths from that position to the end of the sentence, and record in the DAG the end position of the word that achieves it.
- Assembling the segments: follow the recorded path through the nodes to obtain the final segmentation.
- HMM handling of new words: words not in jieba's dictionary are handled statistically; jieba uses a hidden Markov model (HMM, decoded with the Viterbi algorithm) for this.
- Returning the results: the words segmented in the previous steps are yielded one by one; a generator saves memory compared to building a full list.

A toy sketch of the DAG and maximum-probability steps is shown below.
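This is a minimal sketch, not jieba's actual code: a made-up toy dictionary, DAG construction by string matching, and right-to-left dynamic programming for the maximum log-probability path.

```python
import math

# Toy dictionary: word -> frequency (both invented for illustration;
# jieba ships a dictionary with hundreds of thousands of entries)
FREQ = {'我': 100, '爱': 80, '重庆': 50, '酸辣粉': 30, '酸辣': 5, '辣粉': 2}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index i, list every end index j where sentence[i:j] is a word."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown character: fall back to a single-char cut
    return dag

def best_route(sentence, dag):
    """Right-to-left dynamic programming over the DAG for the max log-probability path."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    return route

def cut(sentence):
    route = best_route(sentence, build_dag(sentence))
    i = 0
    while i < len(sentence):
        j = route[i][1]
        yield sentence[i:j]  # yield one word at a time, as jieba does
        i = j

print('/'.join(cut('我爱重庆酸辣粉')))  # 我/爱/重庆/酸辣粉
```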
4. Jieba features

1. Three word segmentation modes

Jieba supports three segmentation modes:

- Precise mode: the text is cut into the most precise segmentation, with no redundant words.
- Full mode: all possible words in the text are scanned out, with redundancy.
- Search engine mode: on top of precise mode, long words are segmented again.

Let's demonstrate the three modes with an example:
```python
import jieba

sentence = '我爱重庆酸辣粉和老火锅'  # "I love Chongqing hot and sour noodles and old-style hot pot"

# Full mode: scan out every possible word
seg_list = jieba.cut(sentence, cut_all=True)
print('Full mode: {}'.format('/'.join(seg_list)))

# Precise mode (the default)
seg_list = jieba.cut(sentence, cut_all=False)
print('Precise mode: {}'.format('/'.join(seg_list)))

# Search engine mode: long words are cut again
seg_list = jieba.cut_for_search(sentence)
print('Search engine mode: {}'.format('/'.join(seg_list)))
```
Running this prints the segmentation produced by each mode.

- `jieba.cut` accepts three arguments: the string to be segmented; `cut_all`, which controls whether full mode is used (the default is `False`, i.e. precise mode); and `HMM`, which controls whether the HMM model is used to recognize unknown words.
- `jieba.cut_for_search` accepts two arguments: the string to be segmented, and `HMM`, which controls whether the HMM model is used.
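For instance, disabling the HMM changes how out-of-vocabulary words are handled. A small sketch using the sentence from jieba's README (where 杭研 is not in the dictionary):

```python
import jieba

sentence = '他来到了网易杭研大厦'  # "He came to the NetEase Hangyan Building"

# HMM enabled (the default): the unseen word 杭研 can be recognized as one token
print('/'.join(jieba.cut(sentence, HMM=True)))

# HMM disabled: out-of-vocabulary spans fall back to single characters
print('/'.join(jieba.cut(sentence, HMM=False)))
```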
2. Jieba supports traditional Chinese segmentation
3. Jieba supports custom dictionaries
4. Jieba is released under the MIT License (an open source software license)
5. Loading a custom dictionary

Jieba's basic usage covers everyday segmentation needs, but real situations are more complicated.
Text often contains new words that are not in jieba's dictionary, which prevents jieba from segmenting the text the way the developer intends. How can we solve this problem?
The answer is to load a custom dictionary with the `jieba.load_userdict` function:

- The function takes a single argument: a file-like object, or the path to a custom dictionary file.
- Note that the file must be UTF-8 encoded.

A custom dictionary is defined as follows:

- Each word occupies one line; each line consists of up to three parts separated by spaces: the word, its frequency, and its part of speech.
- The frequency and part of speech can be omitted, but the order cannot be changed.
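For example, a hypothetical `userdict.txt` for the demo below might look like this (the words match the example sentence; the frequencies and noun tags are made up):

```
酸辣粉 5 n
重庆老火锅 5 n
```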
Then run the following code to see it in action:

```python
import jieba

sentence = '我爱重庆酸辣粉和重庆老火锅'  # "I love Chongqing hot and sour noodles and Chongqing old-style hot pot"

seg_list = jieba.cut(sentence)
print('Without custom dictionary: {}'.format('/'.join(seg_list)))

# Load the custom dictionary, then segment again
jieba.load_userdict('./userdict.txt')
seg_list = jieba.cut(sentence)
print('With custom dictionary: {}'.format('/'.join(seg_list)))
```
The output shows that, with the custom dictionary loaded, the words are segmented exactly as we intended.
6. Adjusting the dictionary programmatically

Much of the time we don't want to maintain a custom dictionary file, which makes development more cumbersome. For such cases jieba provides functions to adjust the dictionary at runtime:

- `jieba.add_word(word, freq=None, tag=None)` and `jieba.del_word(word)` dynamically add and delete dictionary entries.
- `jieba.suggest_freq(segment, tune=True)` adjusts the frequency of a single word so that it can (or cannot) be segmented out as one token.

For example:
```python
import jieba

sentence = '我吃酸辣粉'  # "I eat hot and sour noodles"
word = '酸辣粉'          # "hot and sour noodles"

# Add the word, then tune its frequency so it is kept as one token
jieba.add_word(word)
jieba.suggest_freq(word, tune=True)

seg_list = jieba.cut(sentence)
print('/'.join(seg_list))
```
Running this prints the segmentation with the new word kept whole.
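`suggest_freq` also works in the opposite direction: passing a tuple of segments tells jieba that they should not be joined. A sketch based on the example in jieba's README:

```python
import jieba

sentence = '如果放到post中将出错。'  # "If it is put in a post, it will go wrong."

# By default, '中将' may wrongly be joined into one word
print('/'.join(jieba.cut(sentence, HMM=False)))

# Tell jieba that '中' and '将' should stay separate here
jieba.suggest_freq(('中', '将'), tune=True)
print('/'.join(jieba.cut(sentence, HMM=False)))
```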
7. Jieba part-of-speech tagging

Jieba provides not only word segmentation but also part-of-speech tagging, which is very helpful for text mining.
Each word of the segmented sentence is tagged using a notation compatible with ICTCLAS.

- ICTCLAS: Institute of Computing Technology, Chinese Lexical Analysis System
Take the following example
```python
import jieba.posseg as pseg

# Segment and tag '我吃酸辣粉' ("I eat hot and sour noodles")
words = pseg.cut('我吃酸辣粉')
for word, flag in words:
    print('{} {}'.format(word, flag))
```
Each line of the output is a word followed by its part-of-speech tag.
The jieba part-of-speech tag reference tables are as follows:
adjectives | distinguishing words | conjunctions | adverbs | interjections | locative words |
---|---|---|---|---|---|
a adjective | b distinguishing word | c conjunction | d adverb | e interjection | f locative word |
ad adverbial adjective | bl distinguishing idiom | cc coordinating conjunction | | | |
an nominal adjective | | | | | |
ag adjective morpheme | | | | | |
al adjectival idiom | | | | | |
prefixes | suffixes | numerals | nouns | onomatopoeia | prepositions |
---|---|---|---|---|---|
h prefix | k suffix | m numeral | n noun | o onomatopoeia | p preposition |
 | | mq numeral-quantifier compound | nr person name | | pba preposition 把 ("ba") |
 | | | nr1 Chinese surname | | pbei preposition 被 ("bei") |
 | | | nr2 Chinese given name | | |
 | | | nrj Japanese person name | | |
 | | | nrf transliterated person name | | |
 | | | ns place name | | |
 | | | nsf transliterated place name | | |
 | | | nt organization name | | |
 | | | nz other proper noun | | |
 | | | nl nominal idiom | | |
 | | | ng nominal morpheme | | |
quantifiers | pronouns | place words | time words | particles | verbs |
---|---|---|---|---|---|
q quantifier | r pronoun | s place word | t time word | u particle | v verb |
qv verbal quantifier | rr personal pronoun | | tg temporal morpheme | uzhe 着 | vd adverbial verb |
qt temporal quantifier | rz demonstrative pronoun | | | ule 了 / 喽 | vshi the verb 是 ("to be") |
 | rzt temporal demonstrative pronoun | | | uguo 过 | vyou the verb 有 ("to have") |
 | rzs locative demonstrative pronoun | | | ude1 的 / 底 | vf directional verb |
 | rzv predicative demonstrative pronoun | | | ude2 地 | vx pro-verb (formal verb) |
 | ry interrogative pronoun | | | ude3 得 | vi intransitive verb |
 | ryt temporal interrogative pronoun | | | usuo 所 | vl verbal idiom |
 | rys locative interrogative pronoun | | | udeng 等 / 等等 / 云云 | vg verbal morpheme |
 | ryv predicative interrogative pronoun | | | uyy 一样 / 一般 / 似的 / 般 | |
 | rg pronominal morpheme | | | udh 的话 | |
 | | | | uls 来讲 / 来说 / 而言 / 说来 | |
 | | | | uzhi 之 | |
 | | | | ulian 连 ("连小学生都会") | |
punctuation | strings | modal particles | status words |
---|---|---|---|
w punctuation mark | x string | y modal particle (yg merged into y) | z status word |
wkz left bracket, full-width: （ 〔 ［ ｛ 《 【 half-width: ( [ { < | xx non-morpheme character | | |
wky right bracket, full-width: ） 〕 ］ ｝ 》 】 half-width: ) ] } > | xu URL | | |
wyz left quotation mark, full-width: “ ‘ 『 | | | |
wyy right quotation mark, full-width: ” ’ 』 | | | |
wj period, full-width: 。 | | | |
ww question mark, full-width: ？ half-width: ? | | | |
wt exclamation mark, full-width: ！ half-width: ! | | | |
wd comma, full-width: ， half-width: , | | | |
wf semicolon, full-width: ； half-width: ; | | | |
wn enumeration comma, full-width: 、 | | | |
wm colon, full-width: ： half-width: : | | | |
ws ellipsis, full-width: …… | | | |
wp dash, full-width: —— half-width: --- | | | |
wb percent / per mille sign, full-width: ％ ‰ half-width: % | | | |
wh unit symbol, full-width: ￥ ＄ ￡ ° ℃ half-width: $ | | | |
8. Jieba parallel word segmentation

Jieba can split the target text by line, distribute the lines among multiple Python processes for parallel segmentation, and then merge the results, which gives a significant speed-up.
Usage:
```python
jieba.enable_parallel(5)   # enable parallel mode; the argument is the number of processes
jieba.disable_parallel()   # disable parallel mode
```

Note that parallel mode is based on Python's multiprocessing module and currently does not support Windows.
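A minimal timing sketch, assuming a large UTF-8 text file of your own (`corpus.txt` here is a placeholder):

```python
import time
import jieba

jieba.enable_parallel(4)  # use 4 worker processes

with open('corpus.txt', 'rb') as f:
    content = f.read().decode('utf-8')

t1 = time.time()
words = '/'.join(jieba.cut(content))
t2 = time.time()
print('segmentation speed: %.2f characters/second' % (len(content) / (t2 - t1)))

jieba.disable_parallel()
```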
Jieba publishes a set of performance figures on its GitHub page:
exact-mode segmentation of the complete works of Jin Yong on a 4-core 3.4 GHz Linux machine reached a speed of 1 MB/s, 3.3 times that of the single-process version.
9. Jieba Tokenize

The `jieba.tokenize` method returns each word together with its start and end position in the original text.
Take the following example:

```python
import jieba

# tokenize yields (word, start, end) tuples
words = jieba.tokenize('我吃酸辣粉')  # "I eat hot and sour noodles"
for tk in words:
    print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))
```
Each line of the output shows a word with its start and end offsets.
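`jieba.tokenize` also accepts a `mode='search'` argument, which additionally yields the sub-words of long words, mirroring search engine mode:

```python
import jieba

# Search mode: long words are reported together with their sub-words
for tk in jieba.tokenize('我吃酸辣粉', mode='search'):
    print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))
```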