
Jieba library

The jieba library is a third-party library with powerful word-segmentation capabilities and good performance on Chinese text. Its working mechanism is as follows: a Chinese lexicon is used to determine the association probability between Chinese characters, and groups of characters with high probability are combined into words to form the segmentation result. Besides the segmentation given by the built-in dictionary, user-defined phrases can also be added. jieba supports three segmentation modes:

1. Precise mode: jieba.lcut(string) returns a segmentation result of list type.

2. Full mode: jieba.lcut(string, cut_all=True) returns a segmentation result of list type, with redundancy.

3. Search engine mode: jieba.lcut_for_search(string) returns a segmentation result of list type, with redundancy.

import jieba

str = "结巴库测试时用的字符串示例"   # roughly: "an example string used when testing the jieba library"

eg1 = jieba.lcut(str)                # precise mode
eg1

['结', '巴库', '测试', '时', '用', '的', '字符串', '示例']

eg2 = jieba.lcut(str, cut_all=True)  # full mode, with redundancy
eg2

['结巴', '巴库', '测试', '时', '用', '的', '字符', '字符串', '示例']

eg3 = jieba.lcut_for_search(str)     # search engine mode, with redundancy
eg3

['结', '巴库', '测试', '时', '用', '的', '字符', '字符串', '示例']

In addition, the jieba library allows users to add user-defined phrases for segmentation.

In the example above, '结巴库' should, in our view, be segmented as a single word, but the default segmentation does not achieve this. The desired result can be obtained after running jieba.add_word(word):

jieba.add_word("结巴库")
eg4 = jieba.lcut(str)
eg4

['结巴库', '测试', '时', '用', '的', '字符串', '示例']

Wordcloud library

The wordcloud library is an excellent third-party word cloud library. By operating on a WordCloud object, it can generate a word cloud image composed of the high-frequency words in a piece of text.

The basic steps for using the wordcloud library are as follows (a minimal sketch follows the list):

1. Read the contents of a .txt text file and preprocess them.

2. Create a WordCloud object: w = wordcloud.WordCloud(<parameters>).

3. Use the .generate() and .to_file() methods provided by the wordcloud library to load the text and output it as an image file (.png or .jpg format).
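For example, a minimal sketch of these three steps (the file names text.txt and cloud.png are placeholders, not files from this article):

import wordcloud

# step 1: read the text to be visualized
text = open("text.txt", "r", encoding="utf-8").read()

# step 2: create a WordCloud object with default parameters
w = wordcloud.WordCloud()

# step 3: load the text and write the result out as an image
w.generate(text)
w.to_file("cloud.png")

Note that for Chinese text a font_path pointing to a Chinese font must also be passed in step 2; the relevant parameters are listed below.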

WordCloud object parameters:

width: specifies the width of the image generated by the word cloud object; the default is 400 pixels.

height: specifies the height of the generated image; the default is 200 pixels.

min_font_size: specifies the minimum font size used in the word cloud; the default is 4.

max_font_size: specifies the maximum font size; by default it is adjusted automatically according to the height.

font_step: specifies the step interval between font sizes in the word cloud; the default is 1.

font_path: specifies the path to the font file; the default is None.

max_words: specifies the maximum number of words displayed in the word cloud; the default is 200.

stopwords: specifies the list of excluded words, i.e. words that will not be displayed in the cloud.

mask: specifies the shape of the word cloud; the shape is rectangular by default, and a custom shape requires loading a shape image with an imread() function.

A short sketch that puts several of these parameters together is given below.
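This is a sketch under the following assumptions: the imageio package is available for imread(), msyh.ttf is an accessible Chinese font, and mask.png and text.txt are placeholder files rather than files provided by this article:

import imageio
import wordcloud

text = open("text.txt", "r", encoding="utf-8").read()   # placeholder input text

mask = imageio.imread("mask.png")    # shape image: white areas are left blank in the cloud

w = wordcloud.WordCloud(font_path="msyh.ttf",        # required for Chinese characters
                        max_words=100,               # show at most 100 words
                        stopwords={"的", "了"},      # example excluded words
                        mask=mask,                   # draw the cloud inside the shape
                        background_color="white")
w.generate(text)
w.to_file("shaped_cloud.png")

When mask is given, the cloud takes its width and height from the shape image rather than from the width and height parameters.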

Application example: analyzing the frequency of character appearances in Romance of the Three Kingdoms and drawing a word cloud.

import jieba
import wordcloud

def wordStatistics(txt, excludeWord=None):
    words = jieba.lcut(txt)
    counts = {}                                   # create an empty dictionary for the counts
    for word in words:
        if len(word) == 1:                        # skip single characters
            continue
        else:
            counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)  # sort by frequency, descending
    for i in range(15):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# exclude = {}
wordStatistics(txt)

As you can see, the output contains some irrelevant terms as well as different names that refer to the same character. The code needs to be further optimized: add a check that skips irrelevant words and merge the counts of a character's different names. The implementation is as follows:

import jieba
import wordcloud

def wordStatistics(txt, excludeWord=None):
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if word in excludeWord or len(word) == 1:
            continue
        elif word == '孔明' or word == '诸葛亮':                  # merge Zhuge Liang's names
            counts['孔明'] = counts.get('孔明', 0) + 1
        elif word == '曹操' or word == '孟德' or word == '丞相':   # merge Cao Cao's names
            counts['曹操'] = counts.get('曹操', 0) + 1
        elif word == '刘备' or word == '玄德' or word == '玄德曰' or word == '大耳':   # merge Liu Bei's names
            counts['刘备'] = counts.get('刘备', 0) + 1
        # (the original merged the names of several more characters with further elif branches here)
        else:
            counts[word] = counts.get(word, 0) + 1

    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)  # sort by frequency, descending
    for i in range(15):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# word frequency statistics with an exclusion list
exclude = {'却说', '不能', '如此', '不可', '二人', '荆州'}
wordStatistics(txt, excludeWord=exclude)

The new results are as follows:

As you can see, after the change there are some other new irrelevant terms, but the alias cases are almost gone. Of course, when merging names, the accuracy of the statistics depends on how familiar you are with the Three Kingdoms. I have read Romance of the Three Kingdoms several times, and my impression is that Cao Cao is usually addressed as the prime minister, and fairly often as 'my lord'; he never became emperor before he died. As for Liu Bei's nickname 'big ears': that came from the day Lu Bu was put to death by Cao Cao. After capturing Lu Bu, Cao Cao asked Liu Bei what should be done with him. Liu Bei said, 'Do you not remember the fate of Ding Yuan and Dong Zhuo?' So Cao Cao decided to execute Lu Bu. At that moment Lu Bu shouted at Liu Bei, 'Big ears, do you not remember the halberd shot at the camp gate?' (When Liu Bei was being bullied by Yuan Shao, or someone like that, he went to Lu Bu for help. Lu Bu set up a halberd at the camp gate and declared that if he could hit it with an arrow from a hundred feet away, the matter would be settled, and he did. You can see from this that Lu Bu was no mere brute: he neither refused Liu Bei nor offended the other side, since in their eyes the shot was impossible anyway.) Of course, the word 'general' cannot be counted toward any character's frequency, since it covers far too many officers. (A digression.)

Further add irrelevant items to the exclusion list, optimize, and get the final result.

import jieba
import wordcloud

def wordStatistics(txt, excludeWord=None):
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if word in excludeWord or len(word) == 1:
            continue
        elif word == '孔明' or word == '诸葛亮':                  # merge Zhuge Liang's names
            counts['孔明'] = counts.get('孔明', 0) + 1
        elif word == '曹操' or word == '孟德' or word == '丞相':   # merge Cao Cao's names
            counts['曹操'] = counts.get('曹操', 0) + 1
        elif word == '刘备' or word == '玄德' or word == '玄德曰' or word == '大耳' or word == '主公':   # merge Liu Bei's names
            counts['刘备'] = counts.get('刘备', 0) + 1
        # (the original merged the names of several more characters with further elif branches here)
        else:
            counts[word] = counts.get(word, 0) + 1

    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)  # sort by frequency, descending
    for i in range(15):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# enlarged exclusion list of frequent but irrelevant words (abridged; the original list was longer)
exclude = {'却说', '不能', '如此', '不可', '二人', '荆州', '将军', '商议', '引兵',
           '军马', '如何', '军士', '左右', '今日', '魏兵', '东吴', '陛下', '一人',
           '人马', '不知', '汉中', '只见', '投降', '蜀兵', '大叫', '太守', '此人',
           '夫人', '后人', '背后', '城中', '何不'}
wordStatistics(txt, excludeWord=exclude)

The next step is to draw a simple word cloud from the word-frequency statistics. For this we use the function that the wordcloud library conveniently provides for drawing a cloud directly from frequencies:

.generate_from_frequencies(frequencies), where frequencies is a dictionary mapping each word to its count.
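As a quick illustration of this call on its own, here is a sketch with a made-up frequency dictionary (the words and counts below are invented for the example, not taken from the novel):

import wordcloud

# hypothetical word frequencies, just to show the shape of the input
freq = {"曹操": 900, "孔明": 800, "刘备": 700}

w = wordcloud.WordCloud(font_path="msyh.ttf",   # a Chinese font is needed to render Chinese words
                        background_color="white")
w.generate_from_frequencies(freq)
w.to_file("freq_cloud.png")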

import jieba
import wordcloud

def wordStatistics(txt, excludeWord=None):
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if word in excludeWord or len(word) == 1:
            continue
        elif word == '孔明' or word == '诸葛亮':                  # merge Zhuge Liang's names
            counts['孔明'] = counts.get('孔明', 0) + 1
        elif word == '曹操' or word == '孟德' or word == '丞相':   # merge Cao Cao's names
            counts['曹操'] = counts.get('曹操', 0) + 1
        elif word == '刘备' or word == '玄德' or word == '玄德曰' or word == '大耳' or word == '主公':   # merge Liu Bei's names
            counts['刘备'] = counts.get('刘备', 0) + 1
        # (the original merged the names of several more characters with further elif branches here)
        else:
            counts[word] = counts.get(word, 0) + 1

    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)  # sort by frequency, descending
    for i in range(15):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))
    return counts

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# exclusion list of frequent but irrelevant words (abridged; the original list was longer)
exclude = {'却说', '不能', '如此', '不可', '二人', '荆州', '将军', '商议', '引兵',
           '军马', '如何', '军士', '左右', '今日', '魏兵', '东吴', '陛下', '一人',
           '人马', '不知', '汉中', '只见', '投降', '蜀兵', '大叫', '太守', '此人',
           '夫人', '后人', '背后', '城中', '何不'}
result = wordStatistics(txt, excludeWord=exclude)

w = wordcloud.WordCloud(font_path="msyh.ttf", max_words=15, background_color="white", font_step=2)
w.generate_from_frequencies(result)
w.to_file("threeKingdoms.png")

The word cloud part, which looks like just three lines of code, actually held me up for a while. I simply did not know that the library can draw a cloud directly from frequencies, that is, from the values of a dictionary, until I read the sample code on the official website. Lesson learned.

The required .txt file can be fetched from Baidu Cloud: pan.baidu.com/s/1SWYAXq0n… Stay tuned for the next installment ~