This is the 15th day of my participation in the August Text Challenge.
01 Preface
The last article used "B station" (Bilibili) as a hands-on case for the Scrapy framework. This article refines that code, crawls more pages, and saves the results to CSV.
In total, 1,907 "course learning" records were crawled to analyze which kinds of learning resources are most popular with college students, and the results are presented visually!
02 Data Acquisition
The program continues the previous hands-on case on "B station", which walked through the essentials of the Scrapy framework. If anything here is unclear, read that article first (a detailed introduction to Scrapy, using "B station" as the programming example).
1. The Scrapy project files
The items file
import scrapy

class BiliItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # video title
    title = scrapy.Field()
    # video link
    url = scrapy.Field()
    # view count
    watchnum = scrapy.Field()
    # bullet-comment (danmaku) count
    dm = scrapy.Field()
    # upload time
    uptime = scrapy.Field()
    # uploader name
    upname = scrapy.Field()
Four new fields were added: view count, bullet-comment count, upload time, and uploader.
The spider file
import scrapy
from Bili.items import BiliItem  # adjust to your project's package name

class LycSpider(scrapy.Spider):
    name = 'lyc'
    allowed_domains = ['bilibili.com']
    # search keyword: 大学课程 ("university courses"), starting from page 40
    start_urls = ['https://search.bilibili.com/all?keyword=大学课程&page=40']

    # crawl method
    def parse(self, response):
        item = BiliItem()
        # match each search-result entry
        for jobs_primary in response.xpath('//*[@id="all-list"]/div[1]/ul/li'):
            item['title'] = jobs_primary.xpath('./a/@title').extract()[0]
            item['url'] = jobs_primary.xpath('./a/@href').extract()[0]
            item['watchnum'] = jobs_primary.xpath('./div/div[3]/span[1]/text()').extract()[0].replace("\n", "").replace(" ", "")
            item['dm'] = jobs_primary.xpath('./div/div[3]/span[2]/text()').extract()[0].replace("\n", "").replace(" ", "")
            item['uptime'] = jobs_primary.xpath('./div/div[3]/span[3]/text()').extract()[0].replace("\n", "").replace(" ", "")
            item['upname'] = jobs_primary.xpath('./div/div[3]/span[4]/a/text()').extract()[0]
            # must yield, not return, so the generator keeps producing items
            yield item

        # get the URL of the current page
        url = response.request.url
        # page + 1 (note: this only handles single-digit page numbers)
        new_link = url[0:-1] + str(int(url[-1]) + 1)
        # send a new request for the next page
        yield scrapy.Request(new_link, callback=self.parse)
XPath parsing was added for the four new fields, and the spider pages through the results by incrementing the page number at the end of the URL (a more robust variant is sketched below).
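The url[0:-1] trick stops working once the page number reaches two digits. Not from the original article, but a minimal sketch of a sturdier alternative using Python's standard urllib.parse to rewrite the page query parameter (the helper name next_page_url is my own):

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url: str) -> str:
    # parse the query string, increment the 'page' parameter, rebuild the URL
    parts = urlparse(url)
    query = parse_qs(parts.query)
    page = int(query.get('page', ['1'])[0])
    query['page'] = [str(page + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

# next_page_url('https://search.bilibili.com/all?keyword=test&page=9')
# -> 'https://search.bilibili.com/all?keyword=test&page=10'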
Pipelines file
import csv

class BiliPipeline:
    def __init__(self):
        # open the file in append mode; newline="" stops the csv module
        # from writing blank lines between rows on Windows
        self.f = open("lyc University course.csv", "a", newline="")
        # field names for the header row, matching the keys of the item
        # passed in by the spider
        self.fieldnames = ["title", "url", "watchnum", "dm", "uptime", "upname"]
        # CSV DictWriter: first argument is the file, second the field names
        self.writer = csv.DictWriter(self.f, fieldnames=self.fieldnames)
        # write the header row; it is only needed once, so do it in __init__
        self.writer.writeheader()

    def process_item(self, item, spider):
        print("title:", item['title'])
        print("url:", item['url'])
        print("watchnum:", item['watchnum'])
        print("dm:", item['dm'])
        print("uptime:", item['uptime'])
        print("upname:", item['upname'])
        # write the row passed in by the spider
        self.writer.writerow(dict(item))
        # return the item so any later pipeline can still process it
        return item

    # Scrapy calls close_spider (not close) when the spider finishes
    def close_spider(self, spider):
        self.f.close()
Saves the crawled data to a CSV file (lyc University course.csv). Remember that the pipeline only runs if it is enabled in settings.py, as shown below.
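The article does not show the settings, but for the pipeline to be invoked it must be registered in the project's settings.py. A minimal sketch, assuming the project package is named Bili (the exact dotted path depends on your project layout):

# settings.py
ITEM_PIPELINES = {
    # dotted path to the pipeline class and its priority (lower runs first)
    'Bili.pipelines.BiliPipeline': 300,
}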
2. Run the spider
scrapy crawl lyc
The command above starts the spider. As an alternative, the same crawl can be launched from a plain Python script, sketched below.
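Not in the original article, but Scrapy also supports launching a spider programmatically with CrawlerProcess; a sketch, to be run from the project root so the project settings can be found:

# run.py, placed in the project root
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('lyc')   # the spider's name attribute
process.start()        # blocks until the crawl finishes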
3. Crawl results
A total of 1,914 records were crawled; after simple cleaning, 1,907 usable records remained!
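The cleaning step itself is not shown in the article. A minimal sketch of what such a cleanup might look like with pandas, assuming duplicates and incomplete rows are what gets dropped (file name as in the pipeline above):

import pandas as pd

df = pd.read_csv('lyc University course.csv', encoding='gbk')
df = df.drop_duplicates(subset=['url'])  # one row per video
df = df.dropna()                         # drop rows with missing fields
df.to_csv('lyc University course.csv', index=False, encoding='gbk')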
03 Data Analysis
1. Ranking of college students' learning videos by view count
Read the data
import pandas as pd

dataset = pd.read_csv('Bili\\lyc University course.csv', encoding="gbk")
title = dataset['title'].tolist()
url = dataset['url'].tolist()
watchnum = dataset['watchnum'].tolist()
dm = dataset['dm'].tolist()
uptime = dataset['uptime'].tolist()
upname = dataset['upname'].tolist()
Data processing
# Analysis 1 & Analysis 2 share the same preprocessing
def getdata1_2():
    watchnum_dict = {}
    dm_dict = {}
    for i in range(0, len(watchnum)):
        # convert counts like "202.1万" to integers (万 = 10,000)
        if "万" in watchnum[i]:
            watchnum[i] = int(float(watchnum[i].replace("万", "")) * 10000)
        else:
            watchnum[i] = int(watchnum[i])
        if "万" in dm[i]:
            dm[i] = int(float(dm[i].replace("万", "")) * 10000)
        else:
            dm[i] = int(dm[i])
        watchnum_dict[title[i]] = watchnum[i]
        dm_dict[title[i]] = dm[i]
    # sort from smallest to largest (by value, then by title)
    watchnum_dict = sorted(watchnum_dict.items(), key=lambda kv: (kv[1], kv[0]))
    dm_dict = sorted(dm_dict.items(), key=lambda kv: (kv[1], kv[0]))
    # Analysis 1: ranking by view count
    analysis1(watchnum_dict, "Ranking of college students' learning videos by view count")
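The 万 conversion is the heart of this step, so here it is pulled out into a small helper with a worked example (the helper name to_int is mine, with round() added to guard against float truncation; "202.1万" means 202.1 × 10,000 = 2,021,000, i.e. the 2.02 million views mentioned below):

def to_int(count: str) -> int:
    # "202.1万" -> 2021000 (万 = 10,000); plain digits pass through
    if "万" in count:
        # round() guards against float artifacts like 202.1 * 10000 == 2020999.99...
        return int(round(float(count.replace("万", "")) * 10000))
    return int(count)

assert to_int("202.1万") == 2021000
assert to_int("3456") == 3456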
Data visualization
from pyecharts import options as opts
from pyecharts.charts import Pie

def pie(name, value, picname, tips):
    c = (
        Pie()
        .add(
            "",
            [list(z) for z in zip(name, value)],
            # center of the pie: [x, y]; when given as percentages, the
            # first is relative to container width, the second to height
            center=["35%", "50%"],
        )
        # set the slice colors
        .set_colors(["blue", "green", "yellow", "red", "pink", "orange", "purple"])
        .set_global_opts(
            title_opts=opts.TitleOpts(title="" + str(tips)),
            # scrolling legend placed to the right of the chart
            legend_opts=opts.LegendOpts(type_="scroll", pos_left="70%", orient="vertical"),
        )
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
        .render(str(picname) + ".html")
    )
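The analysis1 helper that getdata1_2 calls is never shown in the article. A minimal sketch of what it plausibly does, assuming it takes the ascending-sorted (title, count) pairs, keeps the top 10, and hands them to pie():

def analysis1(sorted_items, tips):
    # the list is sorted ascending, so the largest entries are at the end
    top = sorted_items[-10:]
    names = [t for t, _ in top]
    values = [v for _, v in top]
    pie(names, values, tips, tips)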
Analysis

- "[Patches] Human Classroom" is the most-watched video, with 2.02 million views.
- Learning university course content on "B station" is evidently far more attractive than merely covering interesting topics in class.
2. Ranking of college students' learning videos by bullet-comment count
Data processing
The preprocessing is identical to Analysis 1: getdata1_2 converts the "万" counts and sorts both dictionaries. The only new line is the final call, which ranks by bullet-comment count instead:

# Analysis 2: ranking by bullet-comment count
analysis1(dm_dict, "Ranking of college students' learning videos by bullet-comment count")
Data visualization
The same pie() helper from Analysis 1 is reused to render this chart.
Analysis

- In the bullet-comment ranking, "Data Structure and Algorithm Foundation **" has the most bullet comments: 33,000.
- The bullet-comment ranking shows which kinds of classroom videos viewers most like to comment on.
- Compared with view counts, college students are far more vocal on classroom-content learning videos!
3. Number of college students' learning videos per uploader
Data processing
# Analysis 3: number of learning videos per uploader
def getdata3():
    upname_dict = {}
    # count how many videos each uploader has
    for key in upname:
        upname_dict[key] = upname_dict.get(key, 0) + 1
    # sort from smallest to largest
    upname_dict = sorted(upname_dict.items(), key=lambda kv: (kv[1], kv[0]))
    itemNames = []
    datas = []
    # walk the sorted list backwards to take the top 20 uploaders
    for i in range(len(upname_dict) - 1, len(upname_dict) - 21, -1):
        itemNames.append(upname_dict[i][0])
        datas.append(upname_dict[i][1])
    # draw the bar chart
    bars(itemNames, datas)
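Counting and slicing by hand works, but Python's standard collections.Counter does the same job in a couple of lines. A sketch of the equivalent (essentially the same result, with ties broken by insertion order instead of by name):

from collections import Counter

def getdata3_counter():
    # most_common(20) returns the 20 (name, count) pairs, largest first
    top20 = Counter(upname).most_common(20)
    itemNames = [name for name, _ in top20]
    datas = [count for _, count in top20]
    bars(itemNames, datas)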
Data visualization
from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

# bar chart
def bars(name, dict_values):
    # chained calls
    c = (
        Bar(
            init_opts=opts.InitOpts(  # initial configuration items
                theme=ThemeType.MACARONS,
                animation_opts=opts.AnimationOpts(
                    # initial animation delay and easing effect
                    animation_delay=1000, animation_easing="cubicOut"
                ),
            )
        )
        .add_xaxis(xaxis_data=name)  # x axis
        .add_yaxis(series_name="Up", y_axis=dict_values)  # y axis
        .set_global_opts(
            # title configuration: style and position
            title_opts=opts.TitleOpts(
                title='Li Yunchen', subtitle='Up video count',
                title_textstyle_opts=opts.TextStyleOpts(
                    font_family='SimHei', font_size=25, font_weight='bold', color='red',
                ),
                pos_left="90%", pos_top="10",
            ),
            # x-axis name, with labels rotated 45 degrees so they don't overlap
            xaxis_opts=opts.AxisOpts(name='up nickname', axislabel_opts=opts.LabelOpts(rotate=45)),
            yaxis_opts=opts.AxisOpts(name='number of learning videos'),
        )
        .render("Up college students' learning video count.html")
    )
Analysis

- Among the uploaders of university course videos, the chart ranks each uploader by how many class-related videos they have published.
- The uploader with the most university course videos is "Xiao Bai in learning".
4. Word cloud of university course names
Data processing
text = "".join(title)
with open("stopword.txt"."r", encoding='UTF-8') as f:
stopword = f.readlines()
for i in stopword:
print(i)
i = str(i).replace("\r\n"."").replace("\r"."").replace("\n"."")
text = text.replace(i, "")
Copy the code
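Stripping stopwords by raw string replacement can also eat characters inside longer words. Not from the article, but a sketch of a gentler variant that segments first with jieba and then filters whole tokens against the stopword set (if you use it, the next step's own jieba.cut becomes unnecessary):

import jieba

with open("stopword.txt", "r", encoding="UTF-8") as f:
    stopwords = {line.strip() for line in f}

# keep only non-empty tokens that are not stopwords
tokens = [w for w in jieba.cut(text) if w.strip() and w not in stopwords]
result = " ".join(tokens)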
Data visualization
import jieba
from stylecloud import gen_stylecloud

# segment the text and separate the tokens with spaces
word_list = jieba.cut(text)
result = " ".join(word_list)

# generate the Chinese word cloud
icon_name = 'fab fa-qq'  # qq logo
# other icon options: 'fas fa-flag' (flag), 'fas fa-dragon' (dragon),
# 'fas fa-dog' (dog), 'fas fa-cat' (cat), 'fas fa-dove' (dove)
gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc',
               output_name="University course name word cloud.png")
# a Chinese font must be supplied, otherwise the characters render incorrectly
Analysis

- Courses from Peking University and Tsinghua University dominate; most course titles mention one of these two universities.
- Most of the video titles revolve around keywords such as: foundation, open course, courseware, postgraduate entrance examination, university physics, and so on.
04 Summary
1. Used the Scrapy framework to crawl 1,907 "B station" college course learning resource records.
2. Visualized the data and gave a concise analysis.