Preface

You can’t get around wordcloud when doing text analysis in Python: it’s the go-to library for drawing word clouds. But can you really use it? You have probably already produced a decent word cloud by following some tutorial on the web, but I think today’s post will give you a real sense of the principles behind wordcloud.

A first look

First you need to install the third-party library with pip. Then let’s briefly compare English and Chinese word clouds.
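A command along these lines should work (matplotlib and jieba are used by the examples below):

pip install wordcloud matplotlib jieba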

from matplotlib import pyplot as plt
from wordcloud import WordCloud

text = 'my is luopan. he is zhangshan'

wc = WordCloud()
wc.generate(text)

plt.imshow(wc)
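plt.imshow(wc) draws the cloud but keeps the axis ticks around it. A couple of optional lines, continuing the snippet above, tidy that up and save the image (to_file is a real WordCloud method; the file name here is just an example):

plt.axis('off')  # hide the axis ticks around the word cloud
plt.show()

wc.to_file('wordcloud_en.png')  # save to disk; example file name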

from matplotlib import pyplot as plt
from wordcloud import WordCloud

text = '我叫罗攀，他叫张三，我叫罗攀'

# set a Chinese font, otherwise the characters will not render
wc = WordCloud(font_path=r'/System/Library/Fonts/Supplemental/Songti.ttc')
wc.generate(text)

plt.imshow(wc)

You will immediately notice that the Chinese word cloud is not what we want, and that is because wordcloud fails to split the Chinese text into words. With the source-code analysis below, I think you will be able to figure out why.

WordCloud source code analysis

We will mainly look at the WordCloud class. I won’t paste all of the source code here; instead I’ll walk through the overall process of building a word cloud.

class WordCloud(object):
    def __init__(self, **kwargs):
        # abridged: the real __init__ accepts many keyword arguments
        # (font_path, stopwords, regexp, min_word_length, ...)
        pass

    def fit_words(self, frequencies):
        return self.generate_from_frequencies(frequencies)

    def generate_from_frequencies(self, frequencies, max_font_size=None):
        # normalize word frequencies and create the drawing object (abridged)
        pass

    def process_text(self, text):
        flags = (re.UNICODE if sys.version < '3' and type(text) is unicode  # noqa: F821
                 else 0)
        pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
        regexp = self.regexp if self.regexp is not None else pattern

        words = re.findall(regexp, text, flags)
        # remove 's
        words = [word[:-2] if word.lower().endswith("'s") else word
                 for word in words]
        # remove numbers
        if not self.include_numbers:
            words = [word for word in words if not word.isdigit()]
        # remove short words
        if self.min_word_length:
            words = [word for word in words if len(word) >= self.min_word_length]

        stopwords = set([i.lower() for i in self.stopwords])
        if self.collocations:
            word_counts = unigrams_and_bigrams(words, stopwords,
                                               self.normalize_plurals,
                                               self.collocation_threshold)
        else:
            # remove stopwords
            words = [word for word in words if word.lower() not in stopwords]
            word_counts, _ = process_tokens(words, self.normalize_plurals)

        return word_counts

    def generate_from_text(self, text):
        words = self.process_text(text)
        self.generate_from_frequencies(words)
        return self

    def generate(self, text):
        return self.generate_from_text(text)

When we use the generate method, the calls happen in the following order:

generate
    generate_from_text
        process_text               # preprocess the text
        generate_from_frequencies  # normalize word frequencies, create the drawing object

Note: so whether you create the word cloud with generate or with generate_from_text, the generate_from_text method is what ultimately runs.
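A minimal sketch of that equivalence, reusing the English sample text from above:

from wordcloud import WordCloud

text = 'my is luopan. he is zhangshan'

# generate() simply delegates to generate_from_text(),
# so these two calls build the same word cloud:
wc1 = WordCloud().generate(text)
wc2 = WordCloud().generate_from_text(text)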

So the functions that really matter here are process_text and generate_from_frequencies. Let’s go through them in turn.

The process_text function

The process_text function simply splits and cleans the text, finally returning a dictionary that maps each token to its count. Let’s try it:

text = 'my is luopan. he is zhangshan'
wc = WordCloud()
cut_word = wc.process_text(text)
print(cut_word)
# {'luopan': 1, 'zhangshan': 1}

text = '我叫罗攀，他叫张三，我叫罗攀'
wc = WordCloud()
cut_word = wc.process_text(text)
print(cut_word)
# {'我叫罗攀': 2, '他叫张三': 1}

As you can see, the process_text function cannot segment Chinese properly. Leaving aside how it cleans the tokens, let’s focus on how it splits the text into words in the first place.

def process_text(self, text):
    flags = (re.UNICODE if sys.version < '3' and type(text) is unicode  # noqa: F821
             else 0)
    pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
    regexp = self.regexp if self.regexp is not None else pattern

    words = re.findall(regexp, text, flags)

The key here is that the splitting is done with a regular expression (by default "\w[\w']+"). If you know regular expressions, \w[\w']+ matches a run of two or more word characters: letters, digits, underscores, and, in Python regular expressions, Chinese characters as well (\w matches Chinese characters).

Therefore, Chinese text is never split into words: it is only broken at punctuation marks, which does not match the logic of Chinese word segmentation at all. English text, on the other hand, already separates words with spaces, so English words fall out easily.
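You can verify this outside wordcloud with its default pattern; a quick sketch:

import re

# wordcloud's default pattern: two or more word characters
pattern = r"\w[\w']+"

print(re.findall(pattern, 'my is luopan. he is zhangshan'))
# ['my', 'is', 'luopan', 'he', 'is', 'zhangshan']

# Chinese has no spaces, so the text only breaks at punctuation:
print(re.findall(pattern, '我叫罗攀，他叫张三，我叫罗攀'))
# ['我叫罗攀', '他叫张三', '我叫罗攀']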

To sum up, wordcloud itself is built for making word clouds from English text. To make a Chinese word cloud, the Chinese text has to be segmented first.

The generate_from_frequencies function

Last but not least, this function normalizes the word frequencies and creates the drawing object.

The drawing code is long and not what we are here for today; we only need to understand what data the word cloud is drawn from. Below is the frequency-normalization part, which should be easy to follow.

from operator import itemgetter

def generate_from_frequencies(frequencies):
    frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    if len(frequencies) <= 0:
        raise ValueError("We need at least 1 word to plot a word cloud, "
                         "got %d." % len(frequencies))
    max_frequency = float(frequencies[0][1])
    frequencies = [(word, freq / max_frequency)
                   for word, freq in frequencies]
    return frequencies

generate_from_frequencies({'我叫罗攀': 2, '他叫张三': 1})  # test
# [('我叫罗攀', 1.0), ('他叫张三', 0.5)]

The correct way to make a word cloud from Chinese text

We first segment the text with jieba and join the tokens with spaces, so that the process_text function returns a dictionary with the correct word counts.

from matplotlib import pyplot as plt
from wordcloud import WordCloud
import jieba

text = '我叫罗攀，他叫张三，我叫罗攀'
cut_word = " ".join(jieba.cut(text))

wc = WordCloud(font_path=r'/System/Library/Fonts/Supplemental/Songti.ttc')
wc.generate(cut_word)

plt.imshow(wc)
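If you are curious what jieba actually hands to wordcloud, print the segmented string (the exact tokens can vary with your jieba version and dictionary):

import jieba

print(" ".join(jieba.cut('我叫罗攀，他叫张三，我叫罗攀')))
# e.g. 我 叫 罗攀 ， 他 叫 张三 ， 我 叫 罗攀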

Of course, if you already have a dictionary of word counts, you don’t need to call generate at all; just call the generate_from_frequencies function directly.

text = {'罗攀': 2, '张三': 1}
wc = WordCloud(font_path=r'/System/Library/Fonts/Supplemental/Songti.ttc')
wc.generate_from_frequencies(text)

plt.imshow(wc)

Conclusion

(1) As the analysis of the process_text function shows, wordcloud itself is a third-party library for making word clouds from English text.

(2) To make a Chinese word cloud, the Chinese text first needs to be segmented with jieba or another Chinese word-segmentation library.

Finally, the Chinese word cloud above is still not the ideal result: words such as 我 (I) and 他 (he) shouldn’t be shown, and the cloud could be made prettier. I’ll cover all of that next time~