An overview of jieba
Sometimes we miss the point of an article, and Python's jieba library offers a good way to get at it. Here is a practical example to illustrate.
First of all, we need to request and parse the content of the web page. Here is the path of the government work report:
www.gov.cn/premier/201…
Using the get() method of the requests library to fetch the page, we find that most of the report's text sits inside paragraph p tags. BeautifulSoup's find_all() can grab every p tag so we can extract their contents. Let's encapsulate this:
```python
def extract_text(url):
    # Send the request and get the response
    page_source = requests.get(url).content
    bs_source = BeautifulSoup(page_source, "lxml")
    # Parse out all p tags
    report_text = bs_source.find_all('p')
    text = ''
    # Save the contents of every p tag into one string
    for p in report_text:
        text += p.get_text()
        text += '\n'
    return text
```
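For illustration, the same p-tag extraction can be sketched without BeautifulSoup, using only the standard library's html.parser. This is a dependency-free stand-in, not the article's approach; the `ParagraphExtractor` name and the sample HTML are made up here:

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside every <p> tag."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        # Only keep text that appears inside a p tag
        if self.in_p:
            self.paragraphs[-1] += data


parser = ParagraphExtractor()
parser.feed('<html><body><p>first</p><div>skip</div><p>second</p></body></html>')
text = '\n'.join(parser.paragraphs)
print(text)
```

BeautifulSoup is still the more convenient choice for real pages; this only shows the underlying idea.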
The use of word clouds
To use the word cloud, you need to prepare a background image; here I use the popular Peppa Pig image. To read the picture in, use the imread() method of matplotlib.pyplot.
```python
import matplotlib.pyplot as plt

back_img = plt.imread('/peiqi.jpg')  # pass the image path into the method
```
Here are the basic uses of word clouds:
```python
cloud = WordCloud(
    font_path='/simhei.ttf',  # required when there are Chinese characters; otherwise they render as boxes
    background_color="white",  # background color
    max_words=5000,  # maximum number of words displayed in the word cloud
    mask=back_img,  # set the background image
    max_font_size=100,  # maximum font size
    random_state=42,
    width=360, height=591, margin=2,  # default canvas size; with a mask, the saved image follows the mask's size; margin is the spacing between words
)
```
Because WordCloud does not support Chinese by default, we need to find a font online that does, then put the downloaded simhei.ttf into the project.
Jieba word segmentation
Next, the extracted text is segmented with jieba's accurate mode and fed into the word cloud to generate the image.
```python
for li in content:
    comment_txt += ' '.join(jieba.cut(li, cut_all=False))
wc = cloud.generate(comment_txt)
image_colors = ImageColorGenerator(back_img)
plt.figure("wordc")
plt.imshow(wc.recolor(color_func=image_colors))
wc.to_file('Report on the Work of the Government for March 2019.png')
```
Here is the complete code:
```python
import matplotlib.pyplot as plt
import jieba
import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud, ImageColorGenerator


def extract_text(url):
    # Send the request and get the response
    page_source = requests.get(url).content
    bs_source = BeautifulSoup(page_source, "lxml")
    # Parse out all p tags
    report_text = bs_source.find_all('p')
    text = ''
    # Save the contents of every p tag into one string
    for p in report_text:
        text += p.get_text()
        text += '\n'
    return text


def word_cloud(content):
    comment_txt = ''
    back_img = plt.imread('/peiqi.jpg')
    cloud = WordCloud(
        font_path='/simhei.ttf',  # required when there are Chinese characters; otherwise they render as boxes
        background_color="white",  # background color
        max_words=5000,  # maximum number of words displayed in the word cloud
        mask=back_img,  # set the background image
        max_font_size=100,  # maximum font size
        random_state=42,
        width=360, height=591, margin=2,  # default canvas size; with a mask, the saved image follows the mask's size; margin is the spacing between words
    )
    for li in content:
        comment_txt += ' '.join(jieba.cut(li, cut_all=False))
    wc = cloud.generate(comment_txt)
    image_colors = ImageColorGenerator(back_img)
    plt.figure("wordc")
    plt.imshow(wc.recolor(color_func=image_colors))
    wc.to_file('Report on the Work of the Government for March 2019.png')


if __name__ == "__main__":
    url = 'http://www.gov.cn/premier/2019-03/16/content_5374314.htm'
    text = extract_text(url)
    word_cloud(text)
```
The word cloud is not especially intuitive, so let's also plot the counts of the top 10 keywords in a matplotlib bar chart.
```python
from collections import Counter

import numpy as np


def word_frequency(text):
    word_list = []
    count_list = []
    # Keep only words with a length greater than or equal to 2
    words = [word for word in jieba.cut(text, cut_all=True) if len(word) >= 2]
    c = Counter(words)
    # c.most_common(10) returns a list of (word, frequency) tuples
    for word_freq in c.most_common(10):
        word, freq = word_freq
        word_list.append(word)
        count_list.append(freq)
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
    colors = ['#00FFFF', '#7FFFD4', '#F08080', '#90EE90', '#AFEEEE',
              '#98FB98', '#B0E0E6', '#00FF7F', '#FFFF00', '#9ACD32']
    index = np.arange(10)
    plt.bar(index, count_list, color=colors, width=0.5, align='center')
    plt.xticks(np.arange(10), word_list)
    # Label each bar with its count
    for x, y in enumerate(count_list):
        plt.text(x, y + 1.2, '%s' % y, ha='center', fontsize=10)
    plt.ylabel('Frequency')  # set the y-axis label
    plt.title('Top 10 keywords in the government report')  # set the title
    plt.savefig('/Top 10 keywords in the government report.png')  # save the image
    plt.show()  # show the chart
```
Here, Counter from collections counts the 10 most frequent words. We traverse the results into two lists, which then supply the bar chart's labels and values.
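Counter's most_common() can be shown in isolation; the sample word list below is invented for illustration:

```python
from collections import Counter

# Hypothetical word list standing in for the jieba output
words = ['发展', '改革', '发展', '经济', '发展', '改革', '就业']
c = Counter(words)
# most_common(n) returns (word, frequency) tuples, highest count first
top2 = c.most_common(2)
print(top2)  # [('发展', 3), ('改革', 2)]
```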
Replace the code for the word cloud with the function call above.
```python
if __name__ == "__main__":
    url = 'http://www.gov.cn/premier/2019-03/16/content_5374314.htm'
    text = extract_text(url)
    # word_cloud(text)
    word_frequency(text)
```
In fact, pyecharts can also be used for visualization:

```shell
pip install pyecharts
```
The code is modified as follows:
```python
from collections import Counter

from pyecharts import options as opts
from pyecharts.charts import Pie


def word_frequency(text):
    word_list = []
    count_list = []
    # Keep only words with a length greater than or equal to 2
    words = [word for word in jieba.cut(text, cut_all=True) if len(word) >= 2]
    # Counter is a simple counter that tallies how often each word occurs
    c = Counter(words)
    # c.most_common(10) returns a list of (word, frequency) tuples
    for word_freq in c.most_common(10):
        word, freq = word_freq
        word_list.append(word)
        count_list.append(freq)
        print(word, freq, sep=':')

    def pie_base() -> Pie:
        c = (
            Pie()
            .add("", [list(z) for z in zip(word_list, count_list)])
            .set_global_opts(title_opts=opts.TitleOpts(title="Keywords in Government Reports"))
            # .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
        )
        return c

    pie_base().render('Temp/Government Report Keywords.html')
```
The pyecharts output is an interactive chart, which displays better than matplotlib's static image.