Simple steps: crawling Bilibili's anime ranking data with Python, plus visual analysis
Now, let’s get started!
PS: I'm a beginner at Python web crawling, so if anything here is wrong, please kindly point it out in the comments.
This project crawls Bilibili's anime (bangumi) ranking page and performs data visualization analysis on the results.
First, prepare relevant libraries
Requests, Pandas, BeautifulSoup, Matplotlib, and more
Since these are third-party libraries, they need to be installed separately. There are two ways to do so (requests is used as the example below; the other libraries are installed the same way):
(1) Enter a command on the command line
Prerequisite: pip is installed (pip is Python's package management tool; it can find, download, install, and uninstall Python packages).
pip install requests
(2) Download by PyCharm
Step 1: In the top-left corner of the IDE, go to File -> Settings...
Step 2: Click the plus sign in the upper-right corner, search for the library name (requests) at the top of the window, then click Install in the lower-left corner. When "Successfully installed" is displayed, the installation is complete.
After the preparatory work is done, the implementation of the project begins
1. Get web content
def get_html(url):
    try:
        r = requests.get(url)  # use get() to fetch the page data
        r.raise_for_status()  # raise an exception if the status code is not 200
        r.encoding = r.apparent_encoding  # detect the page's encoding
        return r.text  # return the fetched content
    except:
        return 'error'
Let's run a crawl and see whether the result contains what we want:
def main():
    url = 'https://www.bilibili.com/v/popular/rank/bangumi'  # target URL
    html = get_html(url)  # get the returned page
    print(html)  # print what we got

if __name__ == '__main__':  # entry point
    main()
The crawling results are as follows:
Success!
2. Information parsing stage:
The first step is to build a BeautifulSoup instance
soup = BeautifulSoup(html, 'html.parser')  # specify 'html.parser' as the parser for BeautifulSoup
Second, initialize the containers in which you want to store the information
TScore = []  # overall score
name = []  # anime title
play = []  # play count
review = []  # comment count
favorite = []  # favorite count
Step 3: start extracting the information. We grab the show titles first and put them in a list.
# ******************************************** store anime titles
for tag in soup.find_all('div', class_='info'):
    # print(tag)
    bf = tag.a.string
    name.append(str(bf))
print(name)
Here we use BeautifulSoup's find_all() to parse the page. Its first argument is the tag name, and the second is the class value of the tag (note the underscore in class_='info'; plain class is a reserved word in Python). Press F12 in the browser to open the page's source code; once we find the corresponding element, the information we want is clearly visible:
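The find_all(tag, class_=...) pattern above can be reduced to a tiny self-contained sketch. The HTML string here is an invented stand-in for the real ranking page, not Bilibili's actual markup:

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the structure described above: the titles live in
# <a> tags inside <div class="info"> blocks.
html = '''
<ul>
  <li><div class="info"><a href="/a">Show A</a></div></li>
  <li><div class="info"><a href="/b">Show B</a></div></li>
  <li><div class="other"><a href="/c">Not a title</a></div></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')
# only the divs whose class is exactly "info" match
names = [str(tag.a.string) for tag in soup.find_all('div', class_='info')]
print(names)  # ['Show A', 'Show B']
```

The class="other" block is skipped, which is exactly why the crawler can pull titles without picking up unrelated links.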
We then use almost the same method to extract the play counts, comment counts, favorite counts, and overall scores.
# ******************************************** store play counts
for tag in soup.find_all('div', class_='detail'):
    # print(tag)
    bf = tag.find('span', class_='data-box').get_text()
    # normalize the unit to 万 (ten thousand)
    if '亿' in bf:  # 亿 = hundred million = 10000 万
        num = float(re.search(r'\d+(\.\d+)?', bf).group()) * 10000
        # print(num)
        bf = num
    else:
        bf = re.search(r'\d*(\.)?\d', bf).group()
    play.append(float(bf))
print(play)
# ******************************************** store comment counts
for tag in soup.find_all('div', class_='detail'):
    # pl = tag.span.next_sibling.next_sibling
    pl = tag.find('span', class_='data-box').next_sibling.next_sibling.get_text()
    # ********* normalize the unit
    if '万' not in pl:
        pl = '%.1f' % (float(pl) / 10000)
        # print(123, pl)
    else:
        pl = re.search(r'\d*(\.)?\d', pl).group()
    review.append(float(pl))
print(review)
# ******************************************** favorite counts
for tag in soup.find_all('div', class_='detail'):
    sc = tag.find('span', class_='data-box').next_sibling.next_sibling.next_sibling.next_sibling.get_text()
    sc = re.search(r'\d*(\.)?\d', sc).group()
    favorite.append(float(sc))
print(favorite)
# ******************************************** overall scores
for tag in soup.find_all('div', class_='pts'):
    zh = tag.find('div').get_text()
    TScore.append(int(zh))
print('综合评分', TScore)
The next_sibling attribute is used here to step to the next tag at the same level. Without it, once find() locates the first 'span' tag it would not continue to the later ones (chain it as many times as your situation requires).
Regular expressions are also used to extract the numbers (import the 're' library).
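The unit-normalization idea behind those snippets can be isolated into a small helper. This is a hedged sketch: the function name and sample strings are my own invention, but the logic mirrors the code above, converting strings like '1.2亿' (hundred million) or '853.1万' (ten thousand) into a float in units of 万:

```python
import re

def to_wan(text):
    # pull out the leading number, e.g. '853.1' from '853.1万'
    num = float(re.search(r'\d+(\.\d+)?', text).group())
    if '亿' in text:
        return num * 10000   # 1 亿 = 10000 万
    if '万' in text:
        return num           # already in units of 万
    return num / 10000       # a raw count, scale down to 万

print(to_wan('1.2亿'))    # 12000.0
print(to_wan('853.1万'))  # 853.1
print(to_wan('4807'))     # 0.4807
```

Keeping every series in the same unit is what makes the later comparison charts meaningful.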
Finally, we store the extracted information in an Excel spreadsheet and return the result set
info = {'动漫名': name, '播放量(万)': play, '评论数(万)': review, '收藏数(万)': favorite, '综合评分': TScore}
dm_file = pandas.DataFrame(info)
dm_file.to_excel('Dongman.xlsx', sheet_name="动漫数据分析")
# return all the lists
return name, play, review, favorite, TScore
We can open the file to see the format of the information stored (double click to open it)
Success!
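The dict-of-lists to DataFrame step above can be sketched on its own with invented sample data (note that to_excel() additionally requires the openpyxl package to be installed; here we just inspect the frame):

```python
import pandas

# Invented sample rows standing in for the crawled data; each dict key
# becomes a column, each list a column's values.
info = {'动漫名': ['Show A', 'Show B'],
        '播放量(万)': [853.1, 502.4],
        '综合评分': [966, 929]}
dm_file = pandas.DataFrame(info)
print(dm_file.shape)           # (2, 3): two rows, three columns
print(list(dm_file.columns))   # the dict keys, in insertion order
```

Because plain dicts preserve insertion order in modern Python, the Excel columns come out in the same order the keys were written.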
3. Data visualization analysis
Let’s do some basic setup first
Prepare a font file: STHeiti Medium.ttc (note its location inside the project)
my_font = font_manager.FontProperties(fname='./data/STHeiti Medium.ttc')  # set a Chinese font (so the charts can display Chinese)
plt.rcParams['font.sans-serif'] = ['SimHei']  # so the axes can display Chinese
plt.rcParams['axes.unicode_minus'] = False
dm_name = info[0]  # titles
dm_play = info[1]  # play counts
dm_review = info[2]  # comment counts
dm_favorite = info[3]  # favorite counts
dm_com_score = info[4]  # overall scores
# print(dm_com_score)
Then we start using matplotlib to plot the charts and analyze the data visually. The code below is commented in detail, so it should be easy to follow.
# ********************************************************************** overall score vs. play count
# ******* overall score bar chart
fig, ax1 = plt.subplots()
plt.bar(dm_name, dm_com_score, color='red')  # bar chart
plt.title('综合评分和播放量数据分析', fontproperties=my_font)  # chart title
ax1.tick_params(labelsize=6)
plt.xlabel('番剧名')  # x-axis label
plt.ylabel('综合评分')  # y-axis label
plt.xticks(rotation=90, color='green')  # rotate and color the x tick labels
# ******* play count line chart
ax2 = ax1.twinx()  # required for a combined chart
ax2.plot(dm_play, color='cyan')  # line weight and node style
plt.ylabel('播放量')  # y-axis label
plt.plot(1, label='综合评分', color="red", linewidth=5.0)  # legend entry
plt.plot(1, label='播放量', color="cyan", linewidth=1.0, linestyle="-")  # legend entry
plt.legend()
plt.savefig(r'E:1.png', dpi=1000, bbox_inches='tight')  # save locally
plt.show()
Let’s see what happens
Doesn't that feel satisfying all of a sudden (heh heh)?
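Stripped to its essentials, the bar-plus-line combination above works like this. This sketch uses made-up data and English labels so that no font file is needed, and a headless backend so it runs anywhere:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: render without a display
import matplotlib.pyplot as plt

# Invented sample data standing in for the crawled series
names = ['A', 'B', 'C']
scores = [966, 929, 903]
plays = [853.1, 502.4, 1203.0]

fig, ax1 = plt.subplots()
ax1.bar(names, scores, color='red')   # bar chart on the left y-axis
ax1.set_ylabel('overall score')
ax2 = ax1.twinx()                     # second y-axis sharing the same x-axis
ax2.plot(names, plays, color='cyan')  # line chart on the right y-axis
ax2.set_ylabel('play count (10k)')
fig.savefig('combo.png', dpi=100, bbox_inches='tight')
```

twinx() is what lets two series with very different scales (scores vs. play counts) share one chart without one of them flattening into the x-axis.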
Then we use the same method to draw several more comparison diagrams:
# ********************************************************************** comment count vs. favorite count
# ******** comment count bar chart
fig, ax3 = plt.subplots()
plt.bar(dm_name, dm_review, color='green')
plt.title('番剧评论数和收藏数分析')
plt.ylabel('评论数(万)')
ax3.tick_params(labelsize=6)
plt.xticks(rotation=90, color='green')
# ******* favorite count line chart
ax4 = ax3.twinx()  # required for a combined chart
ax4.plot(dm_favorite, color='yellow')  # line weight and node style
plt.ylabel('收藏数(万)')
plt.plot(1, label='评论数', color="green", linewidth=5.0)
plt.plot(1, label='收藏数', color="yellow", linewidth=1.0, linestyle="-")
plt.legend()
plt.savefig(r'E:2.png', dpi=1000, bbox_inches='tight')
# ********************************************************************** overall score vs. favorite count
# ******* overall score bar chart
fig, ax5 = plt.subplots()
plt.bar(dm_name, dm_com_score, color='red')
plt.title('综合评分和收藏数量数据分析')
plt.ylabel('综合评分')
ax5.tick_params(labelsize=6)
plt.xticks(rotation=90, color='green')
# ******* favorite count line chart
ax6 = ax5.twinx()  # required for a combined chart
ax6.plot(dm_favorite, color='yellow')  # line weight and node style
plt.ylabel('收藏数(万)')
plt.plot(1, label='综合评分', color="red", linewidth=5.0)
plt.plot(1, label='收藏数', color="yellow", linewidth=1.0, linestyle="-")
plt.legend()
plt.savefig(r'E:3.png', dpi=1000, bbox_inches='tight')
# ********************************************************************** play count vs. comment count
# ******* play count bar chart
fig, ax7 = plt.subplots()
plt.bar(dm_name, dm_play, color='cyan')
plt.title('播放量和评论数 数据分析')
plt.ylabel('播放量(万)')
ax7.tick_params(labelsize=6)
plt.xticks(rotation=90, color='green')
# ******* comment count line chart
ax8 = ax7.twinx()  # required for a combined chart
ax8.plot(dm_review, color='green')  # line weight and node style
plt.ylabel('评论数(万)')
plt.plot(1, label='播放量', color="cyan", linewidth=5.0)
plt.plot(1, label='评论数', color="green", linewidth=1.0, linestyle="-")
plt.legend()
plt.savefig(r'E:4.png', dpi=1000, bbox_inches='tight')
plt.show()
Let’s look at the end result
Nice! You can combine and compare the data in whatever pairings you like using the same approach.
Finally, here is the complete code:
import re
import pandas
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from matplotlib import font_manager
def get_html(url):
    try:
        r = requests.get(url)  # use get() to fetch the page data
        r.raise_for_status()  # raise an exception if the status code is not 200
        r.encoding = r.apparent_encoding  # detect the page's encoding
        return r.text  # return the fetched content
    except:
        return 'error'

def save(html):
    # parse the page
    soup = BeautifulSoup(html, 'html.parser')  # specify 'html.parser' as the parser for BeautifulSoup
    with open('./data/B_data.txt', 'w', encoding='UTF-8') as f:  # 'w' creates the file if it doesn't exist
        f.write(soup.text)
    # set up the lists that will hold the information
    TScore = []  # overall score
    name = []  # anime title
    bfl = []  # play count
    pls = []  # comment count
    scs = []  # favorite count
    # ******************************************** store anime titles
    for tag in soup.find_all('div', class_='info'):
        # print(tag)
        bf = tag.a.string
        name.append(str(bf))
    print(name)
    # ******************************************** store play counts
    for tag in soup.find_all('div', class_='detail'):
        # print(tag)
        bf = tag.find('span', class_='data-box').get_text()
        # normalize the unit to 万 (ten thousand)
        if '亿' in bf:  # 亿 = hundred million = 10000 万
            num = float(re.search(r'\d+(\.\d+)?', bf).group()) * 10000
            # print(num)
            bf = num
        else:
            bf = re.search(r'\d*(\.)?\d', bf).group()
        bfl.append(float(bf))
    print(bfl)
    # ******************************************** store comment counts
    for tag in soup.find_all('div', class_='detail'):
        # pl = tag.span.next_sibling.next_sibling
        pl = tag.find('span', class_='data-box').next_sibling.next_sibling.get_text()
        # ********* normalize the unit
        if '万' not in pl:
            pl = '%.1f' % (float(pl) / 10000)
            # print(123, pl)
        else:
            pl = re.search(r'\d*(\.)?\d', pl).group()
        pls.append(float(pl))
    print(pls)
    # ******************************************** favorite counts
    for tag in soup.find_all('div', class_='detail'):
        sc = tag.find('span', class_='data-box').next_sibling.next_sibling.next_sibling.next_sibling.get_text()
        sc = re.search(r'\d*(\.)?\d', sc).group()
        scs.append(float(sc))
    print(scs)
    # ******************************************** overall scores
    for tag in soup.find_all('div', class_='pts'):
        zh = tag.find('div').get_text()
        TScore.append(int(zh))
    print('综合评分', TScore)
    # save to an Excel spreadsheet
    info = {'动漫名': name, '播放量(万)': bfl, '评论数(万)': pls, '收藏数(万)': scs, '综合评分': TScore}
    dm_file = pandas.DataFrame(info)
    dm_file.to_excel('Dongman.xlsx', sheet_name="动漫数据分析")
    # return all the lists
    return name, bfl, pls, scs, TScore

def view(info):
    my_font = font_manager.FontProperties(fname='./data/STHeiti Medium.ttc')  # set a Chinese font (so the charts can display Chinese)
    dm_name = info[0]  # titles
    dm_play = info[1]  # play counts
    dm_review = info[2]  # comment counts
    dm_favorite = info[3]  # favorite counts
    dm_com_score = info[4]  # overall scores
    # print(dm_com_score)
    # so the axes can display Chinese
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    # ********************************************************************** overall score vs. play count
    # ******* overall score bar chart
    fig, ax1 = plt.subplots()
    plt.bar(dm_name, dm_com_score, color='red')  # bar chart
    plt.title('综合评分和播放量数据分析', fontproperties=my_font)  # chart title
    ax1.tick_params(labelsize=6)
    plt.xlabel('番剧名')  # x-axis label
    plt.ylabel('综合评分')  # y-axis label
    plt.xticks(rotation=90, color='green')  # rotate and color the x tick labels
    # ******* play count line chart
    ax2 = ax1.twinx()  # required for a combined chart
    ax2.plot(dm_play, color='cyan')  # line weight and node style
    plt.ylabel('播放量')  # y-axis label
    plt.plot(1, label='综合评分', color="red", linewidth=5.0)  # legend entry
    plt.plot(1, label='播放量', color="cyan", linewidth=1.0, linestyle="-")  # legend entry
    plt.legend()
    plt.savefig(r'E:1.png', dpi=1000, bbox_inches='tight')  # save locally
    # plt.show()
    # ********************************************************************** comment count vs. favorite count
    # ******** comment count bar chart
    fig, ax3 = plt.subplots()
    plt.bar(dm_name, dm_review, color='green')
    plt.title('番剧评论数和收藏数分析')
    plt.ylabel('评论数(万)')
    ax3.tick_params(labelsize=6)
    plt.xticks(rotation=90, color='green')
    # ******* favorite count line chart
    ax4 = ax3.twinx()  # required for a combined chart
    ax4.plot(dm_favorite, color='yellow')  # line weight and node style
    plt.ylabel('收藏数(万)')
    plt.plot(1, label='评论数', color="green", linewidth=5.0)
    plt.plot(1, label='收藏数', color="yellow", linewidth=1.0, linestyle="-")
    plt.legend()
    plt.savefig(r'E:2.png', dpi=1000, bbox_inches='tight')
    # ********************************************************************** overall score vs. favorite count
    # ******* overall score bar chart
    fig, ax5 = plt.subplots()
    plt.bar(dm_name, dm_com_score, color='red')
    plt.title('综合评分和收藏数量数据分析')
    plt.ylabel('综合评分')
    ax5.tick_params(labelsize=6)
    plt.xticks(rotation=90, color='green')
    # ******* favorite count line chart
    ax6 = ax5.twinx()  # required for a combined chart
    ax6.plot(dm_favorite, color='yellow')  # line weight and node style
    plt.ylabel('收藏数(万)')
    plt.plot(1, label='综合评分', color="red", linewidth=5.0)
    plt.plot(1, label='收藏数', color="yellow", linewidth=1.0, linestyle="-")
    plt.legend()
    plt.savefig(r'E:3.png', dpi=1000, bbox_inches='tight')
    # ********************************************************************** play count vs. comment count
    # ******* play count bar chart
    fig, ax7 = plt.subplots()
    plt.bar(dm_name, dm_play, color='cyan')
    plt.title('播放量和评论数 数据分析')
    plt.ylabel('播放量(万)')
    ax7.tick_params(labelsize=6)
    plt.xticks(rotation=90, color='green')
    # ******* comment count line chart
    ax8 = ax7.twinx()  # required for a combined chart
    ax8.plot(dm_review, color='green')  # line weight and node style
    plt.ylabel('评论数(万)')
    plt.plot(1, label='播放量', color="cyan", linewidth=5.0)
    plt.plot(1, label='评论数', color="green", linewidth=1.0, linestyle="-")
    plt.legend()
    plt.savefig(r'E:4.png', dpi=1000, bbox_inches='tight')
    plt.show()

def main():
    url = 'https://www.bilibili.com/v/popular/rank/bangumi'  # target URL
    html = get_html(url)  # get the returned page
    # print(html)
    info = save(html)
    view(info)

if __name__ == '__main__':
    main()
As for analyzing the charts and drawing conclusions, I won't go into that here. A thousand readers have a thousand Hamlets: everyone has their own way of analyzing and describing the data, and I'm sure you can come up with deeper insights of your own.
That's it for crawling and data visualization analysis; I hope it helps!
You can go to Github to view the source file: github.com/Lemon-Sheep…
Please remember to leave a like~