This article is participating in Python Theme Month; see the event link for more details.

Prerequisites

  • A little basic Python
  • PyCharm as the programming tool

Results

Let me show you what our crawler produced first.

The C/C++ hot list yielded 101 blog entries. Congratulations!

Getting started

Requirements analysis

Our goal is to crawl the information of the articles on the CSDN hot list.

The specific fields are:

  • Content category
  • Post title
  • Post link
  • View count
  • Comment count
  • Favorite count
  • Heat score
  • Blogger name
  • Blogger's blog link

Going further, we could also crawl the content of each popular post, the blogger's follower count, and so on, but we won't do that this time.

Page analysis

The most important part of page analysis is not technique but good tools. In fact, most pages are quite simple (special encryption aside), and even a complete beginner can pick this up. After all, a page is made for people to read: what you see is what you get, and we are simply replacing human eyes with code.

Nowadays most websites use a front/back-end separated architecture: the front end renders the data and the back end serves it, mostly as JSON. If the data is JSON, writing a crawler is very easy. But to demonstrate the usual crawler analysis workflow, Achen will first walk through page-element analysis, then show you how to hit the interface directly.

Tools

Achen lists the tools used here.

  • Chrome browser

  • XPath Helper

An XPath debugging gem. Installation address: chrome.google.com/webstore/de…

  • Postman, an interface testing tool

Very convenient for debugging interfaces or page elements. Sometimes a page uses encryption that is hard to analyze visually; debugging it with Postman gets twice the result with half the effort!

OK, let Achen walk you through the analysis of the CSDN hot list (Achen will also share his own exploration and thinking without reservation).

Page element analysis

1. Select the element we want -> right click -> Inspect

The most common approach is Chrome's element inspection. The hot list is rendered as a paginated list, and every field we want lives inside each list item.

What we need to do in this step is check whether every field in the requirements can be located with XPath.

2. In the right-hand console, analyze the element structure

In this step, we work out the XPath for each field.

However, the auto-generated XPath is generally not directly usable, so we need to analyze it further and then test it repeatedly with XPath Helper.

//*[@id="floor-rank_460"]/div[2]/div[1]/div/div[2]/div[2]/div[1]/div/div/div[2]/div/div[1]/a

Achen will demonstrate with the simplest field: the article title.

3. We notice that on this page, the post title has a unique identifier: class="hosetitem-title"

Let's try writing the XPath as:

//*[@id="floor-rank_460"]//*[@class="hosetitem-title"]

It works in one shot!

This way, we get the complete titles of all the posts on this page. Using XPath Helper in the same fashion, you can write the XPaths for the other fields as well.

Content category=//*[@class="host-move"]/ul/li[@class="active"]/text()
Post title=//*[@id="floor-rank_460"]//*[@class="hosetitem-title"]/a/text()
Post link=//*[@id="floor-rank_460"]//*[@class="hosetitem-title"]/a/@href
View count=//*[@id="floor-rank_460"]//*[@class="hosetitem-dec"]/span[1]/text()
Comment count=//*[@id="floor-rank_460"]//*[@class="hosetitem-dec"]/span[2]/text()
Favorite count=//*[@id="floor-rank_460"]//*[@class="hosetitem-dec"]/span[3]/text()
Heat score=//*[@id="floor-rank_460"]//*[@class="hostitem-item-right"]//span[@class="num"]/text()
Blogger name=//*[@id="floor-rank_460"]//*[@class="hostitem-item-right"]/div[@class="right"]/a/text()
Blogger's blog link=//*[@id="floor-rank_460"]//*[@class="hostitem-item-right"]/div[@class="right"]/a/@href
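To sanity-check these XPaths outside the browser, here is a minimal sketch using the parsel library (the same selector engine Scrapy uses underneath). It assumes you have saved the rendered hot-list page as hotlist.html; that file name is just an illustration, not something from the original walkthrough.

import parsel  # pip install parsel

# Load the rendered page (saved from the browser, since the list is
# rendered client-side) and build a selector over it.
with open("hotlist.html", encoding="utf-8") as f:
    selector = parsel.Selector(text=f.read())

# Reuse the XPaths from the table above for post titles and links.
titles = selector.xpath(
    '//*[@id="floor-rank_460"]//*[@class="hosetitem-title"]/a/text()'
).getall()
links = selector.xpath(
    '//*[@id="floor-rank_460"]//*[@class="hosetitem-title"]/a/@href'
).getall()

for title, link in zip(titles, links):
    print(title.strip(), link)

If this prints 25 title/link pairs per page, the XPaths are good to go.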

Interface analysis

In fact, Achen noticed from the very beginning that CSDN uses a fairly modern front/back-end separated architecture, and its interface is well designed!

Moreover, interface analysis is not only simple and fast, it's practically effortless!

1. Open the console and visit the page

Press F12 to open the console, switch to the Network tab, and refresh the page.

2. Analyze the interfaces

In the previous step we noticed that, while refreshing, the page called several interfaces. This one is the one we want: https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=25&child_channel=c%2Fc%2B%2B

3. Use Postman to analyze the interface further

This step confirms whether the interface has any encryption measures and whether its parameters can be tweaked to simplify crawler development.

The debugging procedure is simple: keep modifying the interface parameters and check whether the response stays normal.

4. Document the interface

Achen will share the interface documentation for the hot list.

Request method=GET
Request path=https://blog.csdn.net/phoenix/web/blog/hotRank
# Request parameters
page=Current page number, starting from 0: 0 is the first page, 1 is the second
pageSize=Page size. This parameter can greatly simplify crawler development (change it to 1000, for example, and you get all the data in one request), but modify it with care: it may get you flagged as a crawler by the target site.
child_channel=Content category
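If you'd rather verify the interface from code than from Postman, here is a minimal sketch with the requests library. The User-Agent value is just an illustrative placeholder, not something the original analysis says CSDN requires.

import requests

url = "https://blog.csdn.net/phoenix/web/blog/hotRank"
params = {
    "page": 0,                 # first page
    "pageSize": 25,            # items per page
    "child_channel": "c/c++",  # content category (URL-encoded automatically)
}
# A browser-like User-Agent; the exact string is only an example.
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, params=params, headers=headers, timeout=10)
resp.raise_for_status()
body = resp.json()
print(body["code"], len(body["data"]))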

The interface response

{
    "code": 200,
    "message": "success",
    "data": [{
        "hotRankScore": "88956",    // heat score
        "pcHotRankScore": "8.9w",   // heat score, display form
        "loginUserIsFollow": false, // whether the logged-in user follows this blogger
        "nickName": "C and CPP Programming",  // blogger name
        "avatarUrl": "https://profile.csdnimg.cn/2/3/9/3_weixin_41055260",  // avatar
        "userName": "weixin_41055260",        // blogger id
        "articleTitle": "10W+ words C language hardcore summary (a), worth reading collection!",  // post title
        "articleDetailUrl": "https://blog.csdn.net/weixin_41055260/article/details/118947036",    // post link
        "commentCount": "35",   // comment count
        "favorCount": "1399",   // favorite count
        "viewCount": "13057",   // view count
        "hotComment": null      // hot comment
    }]
}

Coding

We'll write our crawler based on the results of the interface analysis above.

1. Create a new project with PyCharm

2. Install Scrapy

Open the console in the project root directory and type

$ pip install Scrapy
$ scrapy version
Scrapy 2.5.0

3. Create a new Scrapy project

Open the console in the project root directory and type

$ scrapy startproject csdnHot
New Scrapy project 'csdnHot', using template directory 'd:\devtools\python\python39\lib\site-packages\scrapy\templates\project', created in:
    D:\WorkSpace\Personal\my-scrapy\csdnHot

You can start your first spider with:
    cd csdnHot
    scrapy genspider example example.com

At this point, my project directory looks like this:
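The original post shows a screenshot here; for reference, a freshly generated Scrapy project follows the standard layout below, so your listing should match.

csdnHot/
    scrapy.cfg            # deploy configuration
    csdnHot/
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py
        settings.py       # project settings
        spiders/
            __init__.py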

4. Create a new Spider

Open the console in the project root directory and type

$ cd csdnHot/csdnHot/spiders
$ scrapy genspider hotList blog.csdn.net

1. Write the crawler's data class (Item)

# items.py
from scrapy import Field, Item


class HotList(Item):
    hotRankScore = Field()
    nickName = Field()
    avatarUrl = Field()
    userName = Field()
    articleTitle = Field()
    articleDetailUrl = Field()
    commentCount = Field()
    favorCount = Field()
    viewCount = Field()

2. Write the spider

# hotList.py
import json
import scrapy

from csdnHot.items import HotList


class HotlistSpider(scrapy.Spider):
    name = 'hotList'
    allowed_domains = ['blog.csdn.net']
    current_page = 0
    start_urls = [
        f'https://blog.csdn.net/phoenix/web/blog/hotRank?page={current_page}&pageSize=25&child_channel=c%2Fc%2B%2B'
    ]

    def parse(self, response):
        items = json.loads(response.body)["data"]
        if len(items) > 0:
            for item in items:
                hot_list = HotList()
                hot_list["hotRankScore"] = item["hotRankScore"]
                hot_list["nickName"] = item["nickName"]
                hot_list["avatarUrl"] = item["avatarUrl"]
                hot_list["userName"] = item["userName"]
                hot_list["articleTitle"] = item["articleTitle"]
                hot_list["articleDetailUrl"] = item["articleDetailUrl"]
                hot_list["commentCount"] = item["commentCount"]
                hot_list["favorCount"] = item["favorCount"]
                hot_list["viewCount"] = item["viewCount"]
                yield hot_list
            # If data came back, assume there is a next page
            self.current_page = self.current_page + 1
            next_page = f'https://blog.csdn.net/phoenix/web/blog/hotRank?page={self.current_page}&pageSize=25&child_channel=c%2Fc%2B%2B'
            yield scrapy.Request(next_page, callback=self.parse)
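One optional addition of my own, not part of the original walkthrough: a few throttling settings in the project's settings.py make the spider politer and less likely to be blocked. The values below are illustrative defaults, untested against CSDN.

# settings.py (excerpt) -- illustrative values, adjust to taste
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
DOWNLOAD_DELAY = 1                   # wait one second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # limit parallel requests to the site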

3. Run the crawler and save the results to a file

Open the console in the project root directory and type

$ cd csdnHot/csdnHot/spiders
$ scrapy runspider hotList.py -o csdn_hotList.csv
...
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2021, 7, 24, 13, 17, 51, 934035)}
2021-07-24 21:17:53 [scrapy.core.engine] INFO: Spider closed (finished)
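To take a quick look at the exported file, a small sketch with Python's csv module works. It assumes the run above produced csdn_hotList.csv with the item fields as column headers, which is Scrapy's default CSV export behavior.

import csv

# Read the CSV that Scrapy exported and print a quick summary.
with open("csdn_hotList.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows), "posts crawled")
print(rows[0]["articleTitle"], rows[0]["viewCount"])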

If you see this, our crawler ran successfully. Take a look at the results: 101 posts on the C/C++ hot list! Congratulations!

If this blog has been helpful to you, please remember to comment, like, and bookmark.

I am Achen; on the road of technology, let's forge ahead together!