Author: HeoiJin, determined to see the world through data; focuses on crawlers, data analysis, and data product planning.

Blog: me.csdn.net/weixin_4067…

Because of its length, this article is split into two parts: part one covers data crawling, and part two covers data analysis. Today we bring you part one: crawling Bilibili (B station) data!


I. Project background

Recently I read an article interpreting B station's 2019 data, which concluded that the site's ACG ("二次元") character has been diluted and that it is gradually going mainstream.

So, after the 2020 Spring Festival, how far has that dilution gone? Which sections lead B station's rankings? What kinds of video tags do B station's mainstream users like? What social value does the situation in each section reflect? This project will take you through the changes at B station via data.

Project Features:

  1. Web page crawling with the Scrapy framework
  2. Data analysis with pandas and NumPy
  3. Data visualization with Pyecharts
  4. Correlation analysis with SciPy

II. Tools and environment

  • Language: Python 3.7
  • IDE: PyCharm
  • Browser: Chrome
  • Crawler framework: Scrapy 1.8.0

III. Requirements analysis

Bilibili (B station) is the bullet-screen (danmaku) video-sharing site we all know. According to Baidu Baike, its main businesses include live streaming, games, advertising, e-commerce, comics, and e-sports.

Across these businesses it is not hard to find a common thread: B station's main revenue model depends heavily on its users, and secondarily on streamers and UP主 (uploaders).

Therefore, to analyze how B station is changing, we need to analyze how users' preferences are changing. This project will collect the following data:

  1. The partition (section) name of each leaderboard
  2. Ranking page: video title, author, overall score, rank, and video link
  3. Details page: play count, triple counts (likes, coins, favorites), comment count, danmaku count, share count, and hot tags

IV. Page analysis

4.1 Leaderboard page analysis

First, analyze the leaderboard page. With JavaScript disabled, the information we need is still present in the static HTML, so we can locate and extract it with XPath when writing the code.

After analyzing the leaderboard page of a single partition, covering multiple partitions only requires finding the URL of each partition's leaderboard. Checking the page source, each partition appears only as text with no associated URL, so we construct the request URLs by analyzing how the URL changes.

URL pattern (the number corresponds to the partition): www.bilibili.com/ranking/all…

The following are the corresponding numbers for each category:

We can put these numbers in a list and loop over them to batch-generate the URLs:

from pprint import pprint

labels_num=[0, 1, 168, 3, 129, 4, 36, 188, 160, 119, 155, 5, 181]
url_list=[f'https://www.bilibili.com/ranking/all/{i}/0/30' for i in labels_num]
pprint(url_list)

4.2 Details page API analysis

We also need each video's play count, triple counts, comment count, danmaku count, and share count, plus its hot tags, but these do not appear on the leaderboard page, so we have to request the video's details page as well.

Opening a video's details page, again with JavaScript disabled, shows that the information we want is loaded asynchronously via Ajax. Here we capture the API responses directly, which greatly improves parsing efficiency and also makes it harder to get our IP blocked.

After a round of analysis, the play count, triple counts, comment count, danmaku count, and share count can all be found in the stat?aid= response; the number at the end of that URL is the video's ID. So it is enough to slice the video link to get the ID and then splice together the request URL.

Visiting this request URL returns standard JSON data.

To parse it with JSON, we only need to take the data under the ['data'] key.
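To make this concrete, here is a minimal standalone sketch (the actual crawler sends these requests through Scrapy, as shown later). It assumes the requests library is available; the video link and ID below are purely hypothetical examples, and the API URL and stat keys are the ones used later in the spider code:

import json
import requests

# Hypothetical video link, for illustration only: slice it to get the video ID (aid)
video_link = 'https://www.bilibili.com/video/av92656264'
aid = video_link.split('/av')[-1]

# Request the stat API and read the fields under the 'data' key
resp = requests.get(f'https://api.bilibili.com/x/web-interface/archive/stat?aid={aid}')
stat = json.loads(resp.text)['data']
print(stat['view'], stat['danmaku'], stat['reply'], stat['favorite'], stat['coin'], stat['share'], stat['like'])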

We still lack the hot tag data. Continuing to capture packets turns up another API URL, which also has to be constructed from the video's ID.



But visiting this URL directly reports that the page does not exist. Looking at the URL, there are many parameters after the '?'. If we keep only the key video-ID parameter and request it again, we get the required information, again as very neat JSON data.



After parsing the JSON, simply collect every ['tag_name'] under the ['data'] key.
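The same kind of standalone check works for the tag API (again, the crawler itself requests it through Scrapy; the aid below is a hypothetical example and the URL is the one used later in the spider):

import json
import requests

aid = '92656264'  # hypothetical video ID, for illustration only
resp = requests.get(f'https://api.bilibili.com/x/tag/archive/tags?aid={aid}')
tags = json.loads(resp.text)['data']
# Collect every tag_name under the 'data' key into a comma-separated string
print(', '.join(tag['tag_name'] for tag in tags))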

At this point, all the required URLs and locators have been found, and we can start writing the crawler files.

V. Crawler analysis

5.1 Introduction to the Scrapy framework

Scrapy is an application framework designed to crawl site data and extract structured data. It can be used in a range of applications including data mining, information processing or storing historical data.

Scrapy architecture diagram (green arrows indicate the data flow)

The components involved in this project:

The Scrapy Engine is responsible for controlling the flow of data across all components of the system and firing events when corresponding actions occur.

The Scheduler receives requests from the engine and dispatches them to the engine later when the engine requests them.

The Downloader takes the page data and feeds it to the engine, which then feeds it to the spider.

Spiders are classes written by the Scrapy user to parse responses and extract items (the scraped data) or additional URLs to follow. Each spider is responsible for one site or a group of sites.

The Item Pipeline processes the items extracted by spiders. Typical tasks include cleaning, validation, and persistence (such as saving to a database).
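To make the division of labour concrete, here is a minimal, hypothetical sketch (names like DemoSpider, DemoItem, and DemoPipeline are made up for illustration; the real project code follows in section VI) showing which component each class corresponds to:

import scrapy


class DemoItem(scrapy.Item):
    # An Item defines the structured data that the pipeline will receive
    title = scrapy.Field()


class DemoSpider(scrapy.Spider):
    # A Spider parses responses and yields items or further requests
    name = 'demo'
    start_urls = ['https://www.bilibili.com/ranking/all/0/0/30']

    def parse(self, response):
        for title in response.xpath('//ul[@class="rank-list"]/li/div/div[@class="info"]/a/text()').getall():
            yield DemoItem(title=title)


class DemoPipeline(object):
    # The Item Pipeline cleans, validates, and stores each item
    def process_item(self, item, spider):
        return item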

5.2 Why use the Scrapy framework

Scrapy uses an asynchronous networking framework to handle network communication. Compared with an ordinary requests-based crawler or a multi-threaded crawler, Scrapy crawls pages more efficiently (for a detailed efficiency comparison, see "The N Postures of Python Crawlers": www.cnblogs.com/jclian91/p/.)

At the same time, a mature framework means a crawler can be built just by customizing the relevant modules, following a clear logical path.

VI. Writing the crawler

If you haven't installed Scrapy yet, you can do so from cmd with pip:

pip3 install Scrapy

6.1 Creating a crawler project

Go to the folder where you want to create the crawler project, type cmd in the file explorer's address bar to open a command prompt there, and run:

scrapy startproject blbl
cd blbl
scrapy genspider bl "bilibili.com"

Command interpretation:

  • scrapy startproject blbl: create a crawler project named blbl
  • cd blbl: enter the project directory
  • scrapy genspider bl "bilibili.com": create a spider file named bl (the name must differ from the project name and be unique within the project) and restrict the domains that may be crawled

The crawler project has now been created; the directory structure is as follows:
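For reference, the standard layout produced by scrapy startproject, together with the spider we just generated, looks roughly like this:

blbl/
    scrapy.cfg
    blbl/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            bl.py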



A brief overview of the files this project uses:

  • scrapy.cfg: the project's configuration file
  • blbl/blbl/: the project's Python module; code will be referenced from here
  • items.py: the project's item definition file
  • pipelines.py: the project's item pipelines
  • settings.py: the project's settings file
  • spiders/: the directory that holds the spider code
  • bl.py: the spider file we created with the genspider command

6.2 Creating and writing start.py

A Scrapy crawler is usually started from a shell or cmd command. To make starting and debugging the crawler easier, create a start.py that launches it for us.

Goal:

  • Run the scrapy crawl command from within a .py file
from scrapy import cmdline

cmdline.execute('scrapy crawl bl'.split())

Once it is created, we simply run this file whenever we want to start or debug the crawler.

6.3 Writing settings.py

Goal:

  • Turn off compliance with robots.txt (the "gentlemen's agreement")
  • Set a download delay (a well-behaved crawler should not strain other people's servers)
  • Set default request headers
  • Enable the item pipeline (uncomment it so the data can be stored)
ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # Use your own browser's full User-Agent string here
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/... Safari/537.36',
}

ITEM_PIPELINES = {
    'blbl.pipelines.BlblPipeline': 300,
}

6.4 Writing bl.py

bl.py is the spider file we created with the cmd command. It parses the site content and passes the parsed data on to the item pipeline.

Goal:

  • Get ranking, video title, author, score
  • Get the video ID and construct the API link
  • Send the request to the API link
  • Get the triple counts (likes, coins, favorites), danmaku count, comment count, and popular tags
import scrapy
from blbl.items import BlblItem
import json 


class BlSpider(scrapy.Spider):
    name = 'bl'
    allowed_domains = ['bilibili.com']
    # start_urls defaults to 'http://' + allowed_domains[0]
    # Here we override start_urls with the list of leaderboard-page URLs
    start_urls = [
        'https://www.bilibili.com/ranking/all/0/0/30',
        'https://www.bilibili.com/ranking/all/1/0/30',
        'https://www.bilibili.com/ranking/all/168/0/30',
        'https://www.bilibili.com/ranking/all/3/0/30',
        'https://www.bilibili.com/ranking/all/129/0/30',
        'https://www.bilibili.com/ranking/all/4/0/30',
        'https://www.bilibili.com/ranking/all/36/0/30',
        'https://www.bilibili.com/ranking/all/188/0/30',
        'https://www.bilibili.com/ranking/all/160/0/30',
        'https://www.bilibili.com/ranking/all/119/0/30',
        'https://www.bilibili.com/ranking/all/155/0/30',
        'https://www.bilibili.com/ranking/all/5/0/30',
        'https://www.bilibili.com/ranking/all/181/0/30'
    ]


    def parse(self, response):
        # Get the name of the leaderboard currently being crawled
        rank_tab=response.xpath('//ul[@class="rank-tab"]/li[@class="active"]/text()').getall()[0]
        print('='*50,'Currently crawling leaderboard:',rank_tab,'='*50)


        # The video information is placed in the Li tag. Here we get all the Li tags first
        # Then go through rank_lists to get the information of each video
        rank_lists=response.xpath('//ul[@class="rank-list"]/li')
        for rank_list in rank_lists:
            rank_num=rank_list.xpath('div[@class="num"]/text()').get()
            title=rank_list.xpath('div/div[@class="info"]/a/text()').get()
            # Grab the video URL and slice it to get the video ID
            id=rank_list.xpath('div/div[@class="info"]/a/@href').get().split('/av')[-1]
            # Splice together the API URLs for the detail page and the tags
            Detail_link=f'https://api.bilibili.com/x/web-interface/archive/stat?aid={id}'
            Labels_link=f'https://api.bilibili.com/x/tag/archive/tags?aid={id}'
            author=rank_list.xpath('div/div[@class="info"]/div[@class="detail"]/a/span/text()').get()
            score=rank_list.xpath('div/div[@class="info"]/div[@class="pts"]/div/text()').get()
            # Sending the requests with the requests library would mean writing another request header,
            # so we keep using Scrapy to send the requests to the APIs.
            # A plain dictionary is used here to hold the data captured so far.
            # This ensures the detail data and ranking data stay matched one-to-one without a later merge.
            # (If a Scrapy Item were passed at this point, data would end up missing at the end.)
            items={
                'rank_tab':rank_tab,
                'rank_num' : rank_num ,
                'title' :title ,
                'id' : id ,
                'author' : author ,
                'score' : score ,
                'Detail_link':Detail_link
            }
            # Hand the API request to the scheduler and pass the ranking-page data along via meta
            yield scrapy.Request(url=Labels_link,callback=self.Get_labels,meta={'item':items},dont_filter=True)


    def Get_labels(self,response):
        # Get the popular tag data
        items=response.meta['item']
        Detail_link=items['Detail_link']
        # Parse the JSON data
        html=json.loads(response.body)
        Tags=html['data'] # Video tag data
        # Join the tag names into a comma-separated string
        tag_name=', '.join([i['tag_name'] for i in Tags])
        items['tag_name']=tag_name
        yield scrapy.Request(url=Detail_link,callback=self.Get_detail,meta={'item':items},dont_filter=True)


    def Get_detail(self,response):
        # Get the ranking-page data passed along via meta
        items=response.meta['item']
        rank_tab=items['rank_tab']
        rank_num=items['rank_num']
        title=items['title']
        id=items['id']
        author=items['author']
        score=items['score']
        tag_name=items['tag_name']


        # Parse the JSON data
        html=json.loads(response.body)


        # Get the detailed playback statistics
        stat=html['data']


        view=stat['view']
        danmaku =stat['danmaku']
        reply =stat['reply']
        favorite =stat['favorite']
        coin =stat['coin']
        share =stat['share']
        like =stat['like']


        # Pass all the crawled information to the Item
        item=BlblItem(
            rank_tab=rank_tab,
            rank_num = rank_num ,
            title = title ,
            id = id ,
            author = author ,
            score = score ,
            view = view ,
            danmaku = danmaku ,
            reply = reply ,
            favorite = favorite ,
            coin = coin ,
            share = share ,
            like = like ,
            tag_name = tag_name
        )
        yield item

6.5 Writing items.py

Declare the names of the fields that Scrapy will collect.

Goal:

  • Collect crawl data
import scrapy


class BlblItem(scrapy.Item):
    rank_tab=scrapy.Field()
    rank_num=scrapy.Field()
    id=scrapy.Field()
    title=scrapy.Field()
    author=scrapy.Field()
    score=scrapy.Field()
    view=scrapy.Field()
    danmaku=scrapy.Field()
    reply=scrapy.Field()
    favorite=scrapy.Field()
    coin=scrapy.Field()
    share=scrapy.Field()
    like=scrapy.Field()
    tag_name=scrapy.Field()

6.6 Writing pipelines.py

Scrapy's built-in CsvItemExporter spares us from repeatedly writing header rows and writerow statements, which is much simpler than writing the CSV by hand.

Goal:

  • Use CsvItemExporter to write data to CSV files
from scrapy.exporters import CsvItemExporter


class BlblPipeline(object):
    def __init__(self):
        self.fp=open('bilibili.csv','ab')
        # include_headers_line defaults to True
        self.exporter=CsvItemExporter(self.fp,include_headers_line=True,encoding='utf-8-sig')

    def process_item(self,item,spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self,spider):
        self.fp.close()

Finally, open bilibili.csv and you can see that all the data has been crawled!
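As a quick sanity check of the output (pandas is used in part two anyway), something along these lines should show the crawled rows, assuming the CSV was written with the utf-8-sig encoding configured above:

import pandas as pd

df = pd.read_csv('bilibili.csv', encoding='utf-8-sig')
print(df.shape)   # number of crawled rows and columns
print(df.head())  # a peek at the first few records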

VII. Summary

Finally, review the key content of this crawler:

  • For pages loaded asynchronously with Ajax, capture the packets and access the asynchronously loaded data through the captured request URLs
  • Use the Scrapy framework to collect the data
  • Use scrapy.Request to send requests to the APIs and pass the crawled ranking-page data via meta
  • Use Scrapy's built-in CsvItemExporter to store the data in a CSV file

Stay tuned for the next part of this article: data analysis in action

Source code address: github.com/heoijin/Bil…

Solemn declaration: this project and all related articles are for technical exchange only. Applying the techniques involved in improper ways is forbidden; any risk arising from misuse of these techniques has nothing to do with the author.

Email: [email protected]

