
01 Preface

The last article used Bilibili ("B station") as a hands-on case for the Scrapy framework. This article refines that code, crawls more pages, and saves the results to a CSV file.

In total, 1,907 "course learning" records were crawled to analyze which kinds of learning resources are most popular among college students, and the results are presented visually!

02 Data Acquisition

This program continues the Bilibili case from the previous article, which walked through the essentials of the Scrapy framework. If anything here is unclear, read that article first (it introduces Scrapy in detail and uses Bilibili as the programming example).

1. The Scrapy project files

The items file

import scrapy


class BiliItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Video title
    title = scrapy.Field()
    # Video link
    url = scrapy.Field()
    # Number of views
    watchnum = scrapy.Field()
    # Number of bullet comments (danmaku)
    dm = scrapy.Field()
    # Upload time
    uptime = scrapy.Field()
    # Uploader (author)
    upname = scrapy.Field()

Four fields were added: number of views, number of bullet comments (danmaku), upload time, and uploader.

The spider file

import scrapy

from ..items import BiliItem


class LycSpider(scrapy.Spider):
    name = 'lyc'
    allowed_domains = ['bilibili.com']
    start_urls = ['https://search.bilibili.com/all?keyword=大学课程&page=40']

    # Parse the search-result page
    def parse(self, response):
        # Match each search-result entry
        for jobs_primary in response.xpath('//*[@id="all-list"]/div[1]/ul/li'):
            item = BiliItem()
            item['title'] = jobs_primary.xpath('./a/@title').extract()[0]
            item['url'] = jobs_primary.xpath('./a/@href').extract()[0]
            item['watchnum'] = jobs_primary.xpath('./div/div[3]/span[1]/text()').extract()[0].replace("\n", "").replace(" ", "")
            item['dm'] = jobs_primary.xpath('./div/div[3]/span[2]/text()').extract()[0].replace("\n", "").replace(" ", "")
            item['uptime'] = jobs_primary.xpath('./div/div[3]/span[3]/text()').extract()[0].replace("\n", "").replace(" ", "")
            item['upname'] = jobs_primary.xpath('./div/div[3]/span[4]/a/text()').extract()[0]

            # yield, not return, so the generator keeps producing items
            yield item

        # Get the URL of the current page
        url = response.request.url
        # page + 1 (note: this only bumps the last digit of the page number)
        new_link = url[0:-1] + str(int(url[-1]) + 1)
        # Send another request for the next page
        yield scrapy.Request(new_link, callback=self.parse)

XPath parsing of the search-result page was added for the four new fields.
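Before running the whole spider, each XPath can be sanity-checked interactively. A quick session with Scrapy's shell, reusing the search URL from the spider (if Bilibili renders this page with JavaScript, the selectors will simply come back empty):

scrapy shell "https://search.bilibili.com/all?keyword=大学课程&page=1"
>>> # first three titles and view counts, to confirm the selectors match
>>> response.xpath('//*[@id="all-list"]/div[1]/ul/li/a/@title').extract()[:3]
>>> response.xpath('//*[@id="all-list"]/div[1]/ul/li/div/div[3]/span[1]/text()').extract()[:3]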

The pipelines file

import csv


class BiliPipeline:

    def __init__(self):
        # Open the file in append mode; newline="" avoids the blank lines CSV writing produces on Windows
        self.f = open("lyc University course.csv", "a", newline="")
        # Header field names; they must match the keys of the item passed in by the spider
        self.fieldnames = ["title", "url", "watchnum", "dm", "uptime", "upname"]
        # Initialize the CSV dict writer: argument 1 is the file, argument 2 the field names
        self.writer = csv.DictWriter(self.f, fieldnames=self.fieldnames)
        # Write the header row; since it is only written once, this is done in __init__
        self.writer.writeheader()

    def process_item(self, item, spider):
        print("title:", item['title'])
        print("url:", item['url'])
        print("watchnum:", item['watchnum'])
        print("dm:", item['dm'])
        print("uptime:", item['uptime'])
        print("upname:", item['upname'])

        # Write the values passed in by the spider
        self.writer.writerow(dict(item))
        # Return the item so any later pipeline can still use it
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes; close the file here
        self.f.close()

The crawled records are saved to a CSV file (lyc University course.csv).
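For the pipeline to receive items at all, it has to be enabled in the project's settings.py. A minimal sketch, assuming the project package is named Bili (the folder name that appears later in the CSV path):

# settings.py (excerpt)
ITEM_PIPELINES = {
    # "package.module.ClassName": priority (lower numbers run first)
    "Bili.pipelines.BiliPipeline": 300,
}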

2. Start the spider

scrapy crawl lyc


The above command runs the lyc spider.
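As a side note, Scrapy's built-in feed exports could save the items to CSV without a custom pipeline; the article uses its own pipeline instead, but the one-liner is handy for quick checks:

scrapy crawl lyc -o check.csv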

3. Crawl results

A total of 1,914 records were crawled; after a simple cleaning pass, 1,907 usable records remained!
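The cleaning step itself is not shown in the article. A minimal pandas sketch of what "simple cleaning" could mean here (dropping duplicate links and incomplete rows; this is my assumption, not the author's code):

import pandas as pd

# Hypothetical cleaning pass: deduplicate by video link and drop incomplete rows
df = pd.read_csv("Bili\\lyc University course.csv", encoding="gbk")
df = df.drop_duplicates(subset="url")   # repeated videos
df = df.dropna()                        # rows with missing fields
df.to_csv("Bili\\lyc University course.csv", index=False, encoding="gbk")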

03 Data Analysis

1. Ranking of college students' learning videos by number of views

Read the data

import pandas as pd

dataset = pd.read_csv('Bili\\lyc University course.csv', encoding="gbk")
title = dataset['title'].tolist()
url = dataset['url'].tolist()
watchnum = dataset['watchnum'].tolist()
dm = dataset['dm'].tolist()
uptime = dataset['uptime'].tolist()
upname = dataset['upname'].tolist()

Data processing

# Analysis 1 & Analysis 2
def getdata1_2():
    watchnum_dict = {}
    dm_dict = {}
    for i in range(0, len(watchnum)):
        # Convert counts like "2.5万" (25,000) into plain integers
        if "万" in watchnum[i]:
            watchnum[i] = int(float(watchnum[i].replace("万", "")) * 10000)
        else:
            watchnum[i] = int(watchnum[i])

        if "万" in dm[i]:
            dm[i] = int(float(dm[i].replace("万", "")) * 10000)
        else:
            dm[i] = int(dm[i])

        watchnum_dict[title[i]] = watchnum[i]
        dm_dict[title[i]] = dm[i]

    # Rank from smallest to largest (sorted returns a list of (title, count) tuples)
    watchnum_dict = sorted(watchnum_dict.items(), key=lambda kv: (kv[1], kv[0]))
    dm_dict = sorted(dm_dict.items(), key=lambda kv: (kv[1], kv[0]))

    # Analysis 1: ranking by number of views
    analysis1(watchnum_dict, "Ranking of college students' learning videos by number of views")
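The analysis1 function is not shown in the article. My assumption is that it takes the largest entries from the sorted list and hands names and counts to the pie() helper shown below; a hypothetical sketch:

# Hypothetical helper (not from the original article)
def analysis1(sorted_items, tips):
    top = sorted_items[-10:]        # list is sorted ascending, so the last 10 are the largest
    names = [t[0] for t in top]     # video titles
    values = [t[1] for t in top]    # view / bullet-comment counts
    pie(names, values, tips, tips)  # reuse the caption as the output file name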

Data visualization

from pyecharts import options as opts
from pyecharts.charts import Pie


# Pie chart
def pie(name, value, picname, tips):
    c = (
        Pie()
        .add(
            "",
            [list(z) for z in zip(name, value)],
            # Center of the pie chart: the first item is the x-coordinate, the second the y-coordinate.
            # When given as %, they are relative to the container width and height respectively.
            center=["35%", "50%"],
        )
        .set_colors(["blue", "green", "yellow", "red", "pink", "orange", "purple"])  # set the colors
        .set_global_opts(
            title_opts=opts.TitleOpts(title="" + str(tips)),
            # Adjust the legend position
            legend_opts=opts.LegendOpts(type_="scroll", pos_left="70%", orient="vertical"),
        )
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
        .render(str(picname) + ".html")
    )

Analysis

  1. "[Patches] Human Classroom" is the most-watched video, with 2.02 million views.

  2. On Station B, learning videos built on real university course content are far more attractive than merely entertaining classroom topics.

2. Ranking of college students' learning videos by number of bullet comments

Data processing


The count conversion is identical to the loop in getdata1_2 shown above; the only difference is the final call:

# Analysis 2: ranking by number of bullet comments
analysis1(dm_dict, "Ranking of college students' learning videos by number of bullet comments")

Data visualization

The pie() helper shown in Analysis 1 is reused unchanged to render this ranking.

Analysis

  1. In the bullet-comment ranking, "Data Structure and Algorithm Foundation **" has the highest count: 33,000 bullet comments.

  2. The bullet-comment ranking shows which kinds of course videos viewers most like to comment on.

  3. Compared with view counts, bullet comments show that college students prefer to speak up on videos about course content!

3. Number of learning videos per uploader (up主)

Data processing

# Analysis 3: number of learning videos per uploader
def getdata3():
    upname_dict = {}
    # Count how many videos each uploader has
    for key in upname:
        upname_dict[key] = upname_dict.get(key, 0) + 1
    # Rank from smallest to largest
    upname_dict = sorted(upname_dict.items(), key=lambda kv: (kv[1], kv[0]))
    itemNames = []
    datas = []
    # Walk the sorted list backwards to take the top 20 uploaders
    for i in range(len(upname_dict) - 1, len(upname_dict) - 21, -1):
        itemNames.append(upname_dict[i][0])
        datas.append(upname_dict[i][1])
    # Draw the bar chart
    bars(itemNames, datas)
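As a design note, the counting loop in getdata3 can be written more compactly with collections.Counter, whose most_common() already returns entries in descending order (tie-breaking may differ slightly from the lambda sort above); the result feeds the same bars() helper defined next:

from collections import Counter

def getdata3_counter():
    top20 = Counter(upname).most_common(20)   # [(uploader, video count), ...], largest first
    itemNames = [name for name, _ in top20]
    datas = [count for _, count in top20]
    bars(itemNames, datas)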

Data visualization

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType


# Bar chart
def bars(name, dict_values):
    # Chained calls
    c = (
        Bar(
            init_opts=opts.InitOpts(  # initial configuration
                theme=ThemeType.MACARONS,
                animation_opts=opts.AnimationOpts(
                    # initial animation delay and easing effect
                    animation_delay=1000, animation_easing="cubicOut"
                ),
            )
        )
        .add_xaxis(xaxis_data=name)  # x-axis data
        .add_yaxis(series_name="Up", y_axis=dict_values)  # y-axis data
        .set_global_opts(
            # Title configuration and position
            title_opts=opts.TitleOpts(
                title='Li Yunchen', subtitle='Up video count',
                title_textstyle_opts=opts.TextStyleOpts(
                    font_family='SimHei', font_size=25, font_weight='bold', color='red',
                ), pos_left="90%", pos_top="10",
            ),
            # x-axis name, with labels rotated 45 degrees
            xaxis_opts=opts.AxisOpts(name='up nickname', axislabel_opts=opts.LabelOpts(rotate=45)),
            yaxis_opts=opts.AxisOpts(name='Number of student learning videos'),
        )
        .render("Up main college student learning video video number.html")
    )

Analysis

  1. The chart ranks uploaders of university course videos by how many such videos they have published.

  2. The uploader with the most university course videos is "Xiao Bai in learning".

4. Word cloud of university course titles

Data processing

text = "".join(title)
with open("stopword.txt"."r", encoding='UTF-8') as f:
    stopword = f.readlines()
for i in stopword:
    print(i)
    i = str(i).replace("\r\n"."").replace("\r"."").replace("\n"."")
    text = text.replace(i, "")
Copy the code

Data visualization

import jieba
from stylecloud import gen_stylecloud

# Tokenize with jieba and separate the tokens with spaces
word_list = jieba.cut(text)
result = " ".join(word_list)

# Generate the Chinese word cloud
icon_name = 'fab fa-qq'  # qq logo
"""
Other icon shapes that could be used instead:
icon_name = 'fas fa-flag'    # flag
icon_name = 'fas fa-dragon'  # dragon
icon_name = 'fas fa-dog'     # dog
icon_name = 'fas fa-cat'     # cat
icon_name = 'fas fa-dove'    # dove
"""
# A Chinese font must be supplied, otherwise the characters render incorrectly
gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc',
               output_name="University course name word cloud.png")

Analysis

  1. Courses from Peking University and Tsinghua University dominate; most video titles mention one of these two universities.

  2. Most of the video titles revolve around keywords such as: foundation, open course, courseware, postgraduate entrance examination, university physics, and so on.

04 Summary

1. Used the Scrapy framework to crawl 1,907 records of university course learning resources from Bilibili ("B station").

2. Presented the data visually with concise analysis.