An overview of jieba word segmentation and word clouds

Sometimes we miss the point of an article, and Python's jieba library gives us a good way to find it. Here is a practical example to illustrate.

First of all, we need to request and parse the content of the web page. Here is the path of the government work report:

www.gov.cn/premier/201…

The get() method of the Requests library was used to fetch the response, and inspecting it shows that the text of the report mostly sits in paragraph p tags. We can use BeautifulSoup's find_all() to collect all the p tags and extract their contents. Let's encapsulate this:

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    # Send the request and get the response body
    page_source = requests.get(url).content
    bs_source = BeautifulSoup(page_source, "lxml")

    # Parse out all p tags
    report_text = bs_source.find_all('p')

    text = ''
    # Save the contents of every p tag into one string
    for p in report_text:
        text += p.get_text()
        text += '\n'

    return text
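As a quick sanity check of the find_all() pattern used above, here is a self-contained sketch against an inline HTML snippet (html.parser is used here instead of lxml to avoid the extra dependency):

```python
from bs4 import BeautifulSoup

html = '<div><p>First paragraph</p><p>Second paragraph</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all('p') returns every p tag; get_text() strips the markup
text = '\n'.join(p.get_text() for p in soup.find_all('p'))
print(text)
```

The real page simply has many more p tags, so the same loop accumulates the whole report.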

The use of word clouds

To use the word cloud, you need to prepare a background image; here we use the popular image of Peppa Pig.



To read in the picture, use the imread() method of matplotlib's pyplot module.

import matplotlib.pyplot as plt
back_img = plt.imread('/peiqi.jpg')  # Pass the image path into the method
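Note that imread() returns a numpy array, which is exactly what WordCloud's mask parameter expects; pure-white pixels in the mask are treated as "outside" and no words are placed there. A synthetic mask, just for illustration:

```python
import numpy as np

# Build a 200x200 all-white mask (255 = masked out, nothing drawn there)
mask = np.full((200, 200), 255, dtype=np.uint8)

# Carve out a dark square; words will only be placed inside it
mask[50:150, 50:150] = 0
print(mask.shape)
```

Passing an image of Peppa Pig works the same way: the white background is masked out and the words fill the figure.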

Here is the basic usage of WordCloud:

    cloud = WordCloud(font_path='/simhei.ttf',  # Required for Chinese text; without it, characters render as boxes
                      background_color="white",  # Background color
                      max_words=5000,  # Maximum number of words displayed in the cloud
                      mask=back_img,  # Set the background image
                      max_font_size=100,  # Maximum font size
                      random_state=42,
                      width=360, height=591, margin=2,  # Default image size; with a mask, the saved image follows the mask's size. margin is the spacing between words
                      )

Because WordCloud does not support Chinese by default, we need to find a font that does, then put the downloaded font file simhei.ttf into the project.

Jieba word segmentation

Next we feed the parsed text through jieba's exact segmentation mode and pass the result to WordCloud to generate the image.

    for li in content.splitlines():
        comment_txt += ' '.join(jieba.cut(li, cut_all=False))
    wc = cloud.generate(comment_txt)
    image_colors = ImageColorGenerator(back_img)
    plt.figure("wordc")
    plt.imshow(wc.recolor(color_func=image_colors))
    wc.to_file('Report on the Work of the Government for March 2019.png')

Here is the complete code:

import matplotlib.pyplot as plt
import jieba
import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud, ImageColorGenerator

def extract_text(url):
    # Send the request and get the response body
    page_source = requests.get(url).content
    bs_source = BeautifulSoup(page_source, "lxml")

    # Parse out all p tags
    report_text = bs_source.find_all('p')

    text = ''
    # Save the contents of every p tag into one string
    for p in report_text:
        text += p.get_text()
        text += '\n'

    return text

def word_cloud(content):
    comment_txt = ''
    back_img = plt.imread('/peiqi.jpg')
    cloud = WordCloud(font_path='/simhei.ttf',  # Required for Chinese text; without it, characters render as boxes
                      background_color="white",  # Background color
                      max_words=5000,  # Maximum number of words displayed in the cloud
                      mask=back_img,  # Set the background image
                      max_font_size=100,  # Maximum font size
                      random_state=42,
                      width=360, height=591, margin=2,  # Default image size; with a mask, the saved image follows the mask's size. margin is the spacing between words
                      )
    for li in content.splitlines():
        comment_txt += ' '.join(jieba.cut(li, cut_all=False))
    wc = cloud.generate(comment_txt)
    image_colors = ImageColorGenerator(back_img)
    plt.figure("wordc")
    plt.imshow(wc.recolor(color_func=image_colors))
    wc.to_file('Report on the Work of the Government for March 2019.png')

if __name__ == "__main__":
    url = 'http://www.gov.cn/premier/2019-03/16/content_5374314.htm'
    text = extract_text(url)
    word_cloud(text)



The word cloud is not very intuitive on its own, so below the counts of the top 10 keywords are plotted as a matplotlib bar chart. First we need to extract the word frequencies.

from collections import Counter
import numpy as np

def word_frequency(text):
    word_list = []
    count_list = []
    # Keep only the words with a length greater than or equal to 2
    words = [word for word in jieba.cut(text, cut_all=True) if len(word) >= 2]
    c = Counter(words)
    # c.most_common(10) returns a list in which each element is a (word, freq) tuple
    for word_freq in c.most_common(10):
        word, freq = word_freq
        word_list.append(word)
        count_list.append(freq)

    plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # Display minus signs correctly
    colors = ['#00FFFF', '#7FFFD4', '#F08080', '#90EE90', '#AFEEEE',
              '#98FB98', '#B0E0E6', '#00FF7F', '#FFFF00', '#9ACD32']
    index = np.arange(10)
    plt.bar(index, count_list, color=colors, width=0.5, align='center')
    plt.xticks(np.arange(10), word_list)
    for x, y in enumerate(count_list):
        plt.text(x, y + 1.2, '%s' % y, ha='center', fontsize=10)
    plt.ylabel('Count')  # Set the axis label
    plt.savefig('/govt report Top10 keywords.png')  # Save the image
    plt.show()  # Show the chart

Here, Counter from the collections module counts the top 10 words and their frequencies; we traverse them into two lists and use those lists for the axes of the bar chart. The result looks like this:
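The counting step in isolation, with a tiny hand-made word list for clarity:

```python
from collections import Counter

# A toy word list standing in for jieba's output
words = ['发展', '改革', '发展', '经济', '发展', '改革']

c = Counter(words)
top = c.most_common(2)  # list of (word, frequency) tuples, highest first
print(top)  # → [('发展', 3), ('改革', 2)]
```

most_common(n) already returns the words sorted by frequency, so no extra sorting is needed before plotting.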



Replace the code for the word cloud with the function call above.

if __name__ == "__main__":
    url = 'http://www.gov.cn/premier/2019-03/16/content_5374314.htm'
    text = extract_text(url)
    # word_cloud(text)
    word_frequency(text)


In fact, pyecharts can also be used for the visualization:

pip install pyecharts

The code is modified as follows:

def word_frequency(text):
    word_list = []
    count_list = []
    # Keep only the words with a length greater than or equal to 2
    words = [word for word in jieba.cut(text, cut_all=True) if len(word) >= 2]
    # Counter is a simple counter that tallies the occurrences of each word
    c = Counter(words)

    for word_freq in c.most_common(10):  # c.most_common(10) is a list in which each element is a (word, freq) tuple
        word, freq = word_freq
        word_list.append(word)
        count_list.append(freq)
        print(word, freq, sep=':')

    from pyecharts import options as opts
    from pyecharts.charts import Pie

    def pie_base() -> Pie:
        c = (
            Pie()
                .add("", [list(z) for z in zip(word_list, count_list)])
                .set_global_opts(title_opts=opts.TitleOpts(title="Keywords in Government Reports"))
            # .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
        )
        return c

    pie_base().render('Temp/Government Report Keywords.html')

Pyecharts is dynamic and displays better than Matplotlib.