“This is the second day of my participation in the Gwen Challenge in November. See details: The Last Gwen Challenge in 2021”
pyecharts word cloud
Library | Reference URL |
---|---|
jieba word segmentation library | github.com/fxsjy/jieba |
pyecharts | pyecharts.org/#/zh-cn/ |
pandas | Pandas.pydata.org/pandas-docs… |
Make a word cloud from the contents of the 19th National Congress report .txt file
- Specific requirements:
- Only words with two or more characters are counted
- Only the 100 most frequent of those words are shown
- Objective:
- Master the pyecharts word cloud chart
- Master word segmentation with the jieba library
- Master grouped statistics with pandas
1. Data acquisition
Data acquisition ideas:
- The text comes from the 19th National Congress report file; any other text file can be substituted
- Work out what form of data the word cloud needs
For details on specific problems, check the official pyecharts manual (in Chinese).
import jieba
from pyecharts.charts import WordCloud
import pandas as pd
from pyecharts import options as opts
# Define a list to store data
wordlist = []
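# Read the file (replace the name with the path to your own text file;
# UTF-8 encoding is an assumption here)
with open("19 report.txt", encoding="utf-8") as file:
    s = file.read()
# Segment the text with jieba's default (accurate) mode
words = jieba.lcut(s)
# Go through the segmented words one by one
for word in words:
    # Skip words shorter than 2 characters
    if len(word) > 1:
        # Record each occurrence with a count of 1
        wordlist.append({"word": word, "count": 1})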
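To see what the length filter does without needing the report file, here is a minimal sketch. The `tokens` list is a made-up stand-in for what `jieba.lcut()` might return, not real output from the report.

```python
# Hypothetical tokens standing in for the result of jieba.lcut(s).
tokens = ["发展", "的", "人民", "，", "发展", "中国特色社会主义", "是"]

wordlist = []
for token in tokens:
    # len() counts characters, so this keeps words of two or more characters
    # and drops single characters and punctuation marks.
    if len(token) > 1:
        wordlist.append({"word": token, "count": 1})

for record in wordlist:
    print(record)
```

Each kept word becomes its own record with a count of 1, so a word that appears twice produces two records. The totals are worked out in the next step.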
2. Count word frequency
pandas can do the counting: sum the occurrences of each word, sort the totals in descending order, and keep the top 100.
pandas DataFrame data structure
- A DataFrame is a labeled two-dimensional data structure whose columns can hold different data types. You can think of it as an Excel spreadsheet or a database table.
2.1 Create a DataFrame
Common ways to create a DataFrame:
- From a dictionary whose values are one-dimensional lists, Series, or similar objects;
- From a 2D ndarray object;
- By reading a two-dimensional table of data from a file or database
pandas documentation: pandas.pydata.org/pandas-docs…
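The three creation routes listed above can be sketched roughly as follows. All values and column names are made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import io

import numpy as np
import pandas as pd

# 1) From a dictionary whose values are one-dimensional lists (Series work too).
df_from_dict = pd.DataFrame({"word": ["development", "people"], "count": [3, 2]})

# 2) From a 2D ndarray, with column names supplied explicitly.
df_from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

# 3) From a file-like source; io.StringIO stands in for a real CSV file here.
csv_text = "word,count\ndevelopment,3\npeople,2\n"
df_from_file = pd.read_csv(io.StringIO(csv_text))

print(df_from_dict.shape, df_from_array.shape, df_from_file.shape)
```

In this article the first route is closest to what we need, except that we pass a list of dictionaries (one per word occurrence) rather than a dictionary of lists.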
2.2 Group by word and sum
Group the rows by word and sum each group's count, giving data of the form word: count (e.g. development: 1).
2.3 Sort in descending order and take the top 100
The word-frequency code is as follows:
# wordlist is a list of dicts: [{"word": "development", "count": 1}, ...]
df = pd.DataFrame(wordlist)
# Group the rows by the value of 'word', then sum 'count' within each group
# (groupby is the DataFrame grouping function)
dfword = df.groupby('word')['count'].sum()
# sort_values sorts by value; ascending=False gives descending order
dfword2 = dfword.sort_values(ascending=False)
# Convert the first 100 entries of dfword2 to a DataFrame
dfword3 = pd.DataFrame(dfword2.head(100), columns=['count'])
# 'word' is currently the index; copy it into a regular column
dfword3['word'] = dfword3.index
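If you want to convince yourself that the group, sum, sort, and head chain really counts word frequency, a small sketch on toy data helps. The records below are made up, and `collections.Counter` is used only as an independent cross-check, not as part of the original approach.

```python
from collections import Counter

import pandas as pd

# Toy records in the same shape as wordlist; the words are made up.
sample = ["development", "people", "development", "reform", "development", "people"]
records = [{"word": w, "count": 1} for w in sample]

df = pd.DataFrame(records)
totals = df.groupby("word")["count"].sum()         # one total per distinct word
top = totals.sort_values(ascending=False).head(2)  # keep the 2 most frequent

# Cross-check with the standard library: Counter tallies the same words directly.
counter_top = Counter(r["word"] for r in records).most_common(2)

print(top.to_dict())
print(counter_top)
```

Both routes should agree on which words come out on top. For a one-off count `Counter` alone would do, but going through pandas leaves you with a DataFrame that is easy to inspect and reshape.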
This produces the result we want: the 100 most frequent words with their counts.
3. Making the word cloud
With the data ready, drawing the chart is straightforward, so let's go straight to the code
# Turn the word column into a list
word = dfword3['word'].tolist()
# Turn the count column into a list
count = dfword3['count'].tolist()
# Pair each word with its count
a = [list(z) for z in zip(word, count)]
c = (
    # Instantiate the WordCloud class
    WordCloud()
    # Arguments: series name, data, font size range, shape
    .add("", a, word_size_range=[20, 100], shape="diamond")
    # More chart settings are available through the global options;
    # some of them are interesting, but I won't go into them here
    .set_global_opts(title_opts=opts.TitleOpts(title="Word cloud of the 19th National Congress report"))
)
# Display in Jupyter
c.render_notebook()
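The word and count lists above exist only to build the (word, count) pairs that get passed to the chart. As a possible shortcut, the pairs can be taken straight from the sorted Series. In this sketch `dfword2` is a small made-up Series standing in for the real one, and whether you prefer tuples or lists for the pairs is a matter of taste.

```python
import pandas as pd

# Hypothetical word totals standing in for dfword2 (already sorted, descending).
dfword2 = pd.Series({"development": 3, "people": 2, "reform": 1}, name="count")

# Series.items() yields (index, value) pairs, i.e. (word, count).
# int() turns NumPy integers into plain Python ints.
pairs = [(word, int(count)) for word, count in dfword2.head(2).items()]

print(pairs)
```

With the real data you would use `head(100)` instead of `head(2)`.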
Let’s have a look at the final product
Let me go with the party and seek development for the motherland