This is the 7th day of my participation in the August More Text Challenge
Preface
- A word cloud is a graphical display of the words that occur most frequently in a text, and it is a common text-mining technique. Many data analysis tools can produce one, including Matlab, SPSS, SAS, R, and Python, and many websites can generate word clouds online.
1. Website analysis and source data acquisition: bullet comments (danmaku) on the Bilibili video "What is Peppa?"
- Analysis
- First, the figure shows that this video has 882 bullet comments.
- Next, press F12 to open the browser's developer tools and locate the source data.
- Note: Bilibili exposes a ready-made interface for bullet-comment data, so you only need to find the CID value of the corresponding video.
- There are 882 bullet comments in total, and cid=72342029.
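As a minimal sketch of the note above, the interface URL can be built from the CID alone. The helper name `danmaku_url` is hypothetical; the URL pattern is the one used in the data-capture script in the next section.

```python
def danmaku_url(cid: int) -> str:
    """Build the bullet-comment XML URL for a video's CID (hypothetical helper)."""
    return f'http://comment.bilibili.com/{cid}.xml'


# The CID found for this video in the developer tools
print(danmaku_url(72342029))
```

Keeping the CID as a parameter means the same script can be pointed at another video by changing one number.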
2. Data capture: fetch the data through this interface, bypassing the one that requires cookies
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup  # parses HTML and XML documents
import numpy as np
import pandas as pd
import requests

# Fetch data
# Target URL
url = 'http://comment.bilibili.com/72342029.xml'
# Send a GET request to the target URL
html = requests.get(url).content
html_data = str(html, 'utf-8')
# Each bullet comment is stored in a <d> element
soup = BeautifulSoup(html_data, 'lxml')
results = soup.find_all('d')
comments = [comment.text for comment in results]
comments_dict = {'comments': comments}
# Save the comments to a CSV file
df = pd.DataFrame(comments_dict)
df.to_csv('bilibili.csv', encoding='utf-8')
All 882 bullet comments were fetched successfully.
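One way to check the count without a network call is to parse a saved copy of the XML and count the `<d>` elements. The sketch below uses the standard library instead of BeautifulSoup, and the sample XML is made up for illustration; only the `<d>` tag name comes from the script above. The function name `extract_comments` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the downloaded XML (an assumption);
# the real response is expected to hold one <d> element per comment.
SAMPLE_XML = """<i>
  <d p="1">first comment</d>
  <d p="2">second comment</d>
  <d p="3"></d>
</i>"""


def extract_comments(xml_text):
    """Return the text of every non-empty <d> element in the XML."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter('d') if d.text and d.text.strip()]


comments = extract_comments(SAMPLE_XML)
print(len(comments), comments)
```

Dropping empty elements at this stage avoids missing values in the CSV later on.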
3. Data visualization
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import jieba

# Segment the bullet comments with jieba, then generate a word cloud
df = pd.read_csv('bilibili.csv')
# Join words and lines with spaces so adjacent words do not run together
text = ' '.join(
    ' '.join(jieba.cut(str(line), cut_all=False))
    for line in df['comments'].dropna()
)

# background_Image = plt.imread('peiqi_1.jpg')
background_Image = np.array(Image.open("peiqi_1.jpg"))
wc = WordCloud(
    background_color='white',
    mask=background_Image,
    font_path=r'C:\Windows\Fonts\simhei.ttf',  # raw string keeps the backslashes literal
    max_words=2000,
    max_font_size=80,
    random_state=30,
)
wc.generate_from_text(text)

# Look at the most frequent words to spot useless ones to remove
process_word = wc.process_text(text)
word_freq = sorted(process_word.items(), key=lambda e: e[1], reverse=True)
print(word_freq[:50])

# Recolor the cloud with the colors of the background image
img_colors = ImageColorGenerator(background_Image)
wc.recolor(color_func=img_colors)
plt.imshow(wc)
plt.axis('off')
wc.to_file("peggy.jpg")
print('Word cloud saved successfully!')
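The script prints the most frequent words so that useless ones can be spotted, but it does not show the removal step. A minimal sketch of one way to do it follows, using `collections.Counter` on an already segmented token list. The function name `top_words`, the sample tokens, and the stopword set are all made up for illustration and are not from the original post.

```python
from collections import Counter


def top_words(tokens, stopwords, n=50):
    """Count tokens, skipping stopwords and empty or whitespace-only tokens."""
    counts = Counter(
        tok for tok in (t.strip() for t in tokens)
        if tok and tok not in stopwords
    )
    return counts.most_common(n)


# Made-up tokens standing in for jieba output, and a made-up stopword set
tokens = ['佩奇', '的', '佩奇', '爷爷', '的', '哈哈', '爷爷', '佩奇']
stopwords = {'的'}
print(top_words(tokens, stopwords, n=2))
```

The filtered counts could then be passed to the word cloud as a dictionary through `wc.generate_from_frequencies(dict(top_words(tokens, stopwords)))`. Alternatively, `WordCloud` accepts a `stopwords` argument that is applied when generating from text.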