
3 Crawler initialization

Initialize the crawler with the two commands below (the two lines highlighted in red in the original screenshot):

$ cd MySpider
$ scrapy genspider example example.com

The program and directory structure after initialization are as follows:
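As a reference, a typical layout after these commands looks roughly like the tree below. This is a sketch based on Scrapy's default startproject/genspider output, assuming the project directory is named MySpider (as in the cd command above); the names in the original screenshot may differ.

MySpider/
    scrapy.cfg            # deploy configuration
    MySpider/             # project Python package
        __init__.py
        items.py          # item definitions (edited in the next step)
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            example.py    # spider created by scrapy genspider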

4 Define the fields we want in items.py

import scrapy


class ChinanewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # site name
    sourceName = scrapy.Field()

    # site URL
    sourceUrl = scrapy.Field()

    # article URL
    articleUrl = scrapy.Field()

    # article title
    title = scrapy.Field()

    # publish time
    publishTime = scrapy.Field()

    # article category
    articleCategory = scrapy.Field()

    # article label
    articalLabel = scrapy.Field()

    # article content
    articleContent = scrapy.Field()
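As a quick sanity check (not from the original post), an item declared this way behaves like a dictionary keyed by the declared fields:

from MySpider.items import ChinanewsItem  # module path is an assumption

item = ChinanewsItem()
item['title'] = 'example title'
item['articleUrl'] = 'http://www.chinanews.com/cul/2021/06-15/9500187.shtml'
print(dict(item))  # prints the assigned fields as a plain dict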

5 Get the page links we want to crawl

In this experiment, the pages we want to crawl (static pages) on chinanews.com follow these URL rules:

Scrolling news: http://domain/scroll-news/year/month/news.shtml, for example: www.chinanews.com/scroll-news…

News under a specific category: http://domain/category folder/year/month-day/article id.shtml, for example: http://www.chinanews.com/cul/2021/06-15/9500187.shtml, where cul is the folder for news under the culture category.
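A minimal sketch (not from the original post) of how scrolling-news page URLs following the first rule could be generated for a range of dates; the exact format of the date segment ("year/MMDD" here) is an assumption, since the example URL above is truncated:

from datetime import date, timedelta

def scroll_news_urls(start, end):
    """Yield one scrolling-news list URL per day from start to end (inclusive)."""
    day = start
    while day <= end:
        # assumed pattern: http://www.chinanews.com/scroll-news/<year>/<MMDD>/news.shtml
        yield "http://www.chinanews.com/scroll-news/%d/%02d%02d/news.shtml" % (
            day.year, day.month, day.day)
        day += timedelta(days=1)

# example: every scrolling-news page for the first half of June 2021
for url in scroll_news_urls(date(2021, 6, 1), date(2021, 6, 15)):
    print(url)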

Find the URLs we want by right-clicking the page and choosing Inspect, or by pressing F12.

Now extract the URLs from the <a> tags:

hrefs = response.xpath("//div[@class='content_list']//div[@class='dd_bt']/a/@href").extract()
  1. //div[@class='content_list'] selects the div whose class is content_list
  2. //div[@class='dd_bt'] then selects every div with class dd_bt inside content_list
  3. /a/@href selects the href attribute of the a tag under each of those divs
  4. .extract() returns the matched URLs as a list of strings (see the sketch after this list)
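A minimal sketch of how this XPath might sit inside the spider's parse method. The spider name news matches the crawl command in the next section; the start URL and the parse_detail callback name are assumptions, not taken from the original post:

import scrapy


class NewsSpider(scrapy.Spider):
    name = "news"
    allowed_domains = ["chinanews.com"]
    # start URL is an assumption; the post builds scrolling-news URLs by date
    start_urls = ["http://www.chinanews.com/scroll-news/news.shtml"]

    def parse(self, response):
        # collect every article link from the scrolling-news list
        hrefs = response.xpath(
            "//div[@class='content_list']//div[@class='dd_bt']/a/@href").extract()
        for href in hrefs:
            # follow each link to the news detail page (handled in section 7)
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # detail-page parsing is shown in section 7
        pass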

6 Run the crawler

$ scrapy crawl news
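As a side note (standard Scrapy, not specific to this post), the scraped items can also be written straight to a file with the built-in feed export:

$ scrapy crawl news -o news.json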

Crawler results:

As you can see from the above, we can now get links to the scrolling-news articles for different dates.

7 Request the obtained links to get the news detail page data

Extract the title with the crawler:

item['title'] = response.xpath("//h1/text()").extract()[0].strip()

.extract()[0] takes the first item in the list, and .strip() removes the surrounding whitespace.
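Only the title XPath above comes from the post. The sketch below (an assumption, not the original code) shows how it could be dropped into the parse_detail callback from section 5, with the remaining fields from items.py filled in the same way once their XPaths are worked out via Inspect/F12:

from MySpider.items import ChinanewsItem  # module path is an assumption


# inside the NewsSpider class sketched in section 5
def parse_detail(self, response):
    item = ChinanewsItem()
    # the only XPath given in the post: the article title
    item['title'] = response.xpath("//h1/text()").extract()[0].strip()
    # the detail-page URL itself is available on the response
    item['articleUrl'] = response.url
    # the remaining fields (publishTime, articleCategory, ...) would be
    # extracted here with XPaths found the same way (Inspect / F12)
    yield item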