This is the 13th day of my participation in Gwen Challenge
3 Crawler initialization
These are the two commands highlighted in red above:
$ cd MySpider
$ scrapy genspider example example.com
The program and directory structure after initialization are as follows:
4 Write the fields we want in items.py
import scrapy


class ChinanewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # site name
    sourceName = scrapy.Field()
    # site address
    sourceUrl = scrapy.Field()
    # article URL
    articleUrl = scrapy.Field()
    # title
    title = scrapy.Field()
    # publication time
    publishTime = scrapy.Field()
    # article category
    articleCategory = scrapy.Field()
    # article tags
    articalLabel = scrapy.Field()
    # article content
    articleContent = scrapy.Field()
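Fields declared this way are filled with dict-style assignment (as with item['title'] later in this post), and Scrapy raises KeyError for a field name that was not declared. The sketch below imitates that behaviour with a plain dict subclass so that it runs without Scrapy installed. ItemStandIn and the sample values are hypothetical, not part of the project.

```python
# Field names copied from ChinanewsItem above (including the original spelling).
FIELDS = (
    "sourceName", "sourceUrl", "articleUrl", "title",
    "publishTime", "articleCategory", "articalLabel", "articleContent",
)


class ItemStandIn(dict):
    """Plain-dict stand-in that rejects undeclared field names."""

    def __setitem__(self, key, value):
        if key not in FIELDS:
            raise KeyError(f"undeclared field: {key}")
        super().__setitem__(key, value)


item = ItemStandIn()
item["sourceName"] = "chinanews"
item["articleUrl"] = "http://www.chinanews.com/cul/2021/06-15/9500187.shtml"

try:
    # Spelled differently from the declared name articalLabel, so it is rejected.
    item["articleLabel"] = "culture"
except KeyError as err:
    print("rejected:", err)
```

The point of the last assignment: the declared name is articalLabel, so the spider has to use exactly that spelling when it fills the item.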
5 Get the page links we want to crawl
In this experiment, the pages we want to crawl are static pages on chinanews.com. Their URLs follow these rules:
Rolling news: http://<domain>/scroll-news/<year>/<month-day>/news.shtml, for example www.chinanews.com/scroll-news…
News in a given category: http://<domain>/scroll-news/<category folder>/<year>/<month-day>/news.shtml. For example, in http://www.chinanews.com/cul/2021/06-15/9500187.shtml, cul is the folder for the culture category.
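To make the URL structure concrete, here is a minimal sketch that splits the example article URL above into its parts using only the standard library. The function name split_article_url is hypothetical, and the layout /<category>/<year>/<month-day>/<id>.shtml is inferred from that single example rather than confirmed by the site.

```python
from urllib.parse import urlparse


def split_article_url(url):
    """Split an article URL into (category, year, month_day, article_id).

    Assumes the path looks like /<category>/<year>/<month-day>/<id>.shtml;
    any other layout raises ValueError.
    """
    path = urlparse(url).path              # "/cul/2021/06-15/9500187.shtml"
    parts = path.strip("/").split("/")
    if len(parts) != 4 or not parts[3].endswith(".shtml"):
        raise ValueError("unexpected article URL layout: " + url)
    category, year, month_day, filename = parts
    article_id = filename[: -len(".shtml")]
    return category, year, month_day, article_id


print(split_article_url("http://www.chinanews.com/cul/2021/06-15/9500187.shtml"))
# ('cul', '2021', '06-15', '9500187')
```

The category value could be used to fill articleCategory without another request, if the assumed layout holds.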
Find the URLs we want by right-clicking the page and choosing Inspect, or by pressing F12.
Now get the URL from the href attribute of each <a> tag:
hrefs = response.xpath("//div[@class='content_list']//div[@class='dd_bt']/a/@href").extract()
- //div[@class='content_list'] finds the div whose class is content_list
- //div[@class='dd_bt'] finds the divs whose class is dd_bt inside content_list
- /a/@href takes the href attribute of the a tag directly under each of those divs
- .extract() returns the matched URLs as a list of strings
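The sketch below walks the same path on a small, made-up HTML fragment. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's response.xpath so that it runs without a crawl. ElementTree supports only a subset of XPath and cannot select @href directly, so the attribute is read with .get("href") instead. The sample markup and the second link are assumptions for illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that mimics the list page layout.
SAMPLE = """
<html><body>
  <div class="content_list">
    <ul>
      <li><div class="dd_bt"><a href="/cul/2021/06-15/9500187.shtml">Story A</a></div></li>
      <li><div class="dd_bt"><a href="/example/b.shtml">Story B</a></div></li>
    </ul>
  </div>
  <div class="footer">
    <div class="dd_bt"><a href="/ignored.shtml">Outside content_list</a></div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Same steps as the bullets above: content_list, then dd_bt below it, then <a>.
links = root.findall(".//div[@class='content_list']//div[@class='dd_bt']/a")

# ElementTree has no /@href step, so read the attribute from each element.
hrefs = [a.get("href") for a in links]
print(hrefs)
```

The link inside the footer div is skipped because it is not a descendant of content_list. Note that [@class='content_list'] is an exact string comparison, so an element with several class names in its class attribute would not match.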
6 Run the crawler
$ scrapy crawl news
Crawler results:
As you can see from the above, we can now get links to rolling news stories on different dates.
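One detail worth checking before requesting those links: extracted href values are often relative or protocol-relative rather than full URLs. Whether that is the case on this site is an assumption here. A minimal standard-library sketch of resolving them against the list page's URL follows (inside a spider, Scrapy's response.urljoin does the same job). The page URL and the hrefs are illustrative.

```python
from urllib.parse import urljoin

# Hypothetical list-page URL and extracted hrefs, for illustration only.
page_url = "http://www.chinanews.com/scroll-news/news.shtml"
hrefs = [
    "/cul/2021/06-15/9500187.shtml",                          # root-relative
    "//www.chinanews.com/cul/2021/06-15/9500187.shtml",       # protocol-relative
    "http://www.chinanews.com/cul/2021/06-15/9500187.shtml",  # already absolute
]

# urljoin fills in whatever the href is missing from the page URL.
absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

All three forms resolve to the same absolute URL, so the request step does not need to care which form the page used.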
7 Request each obtained link to get the data from the news detail page
Get the title with the crawler:
item['title'] = response.xpath("//h1/text()").extract()[0].strip()
.extract()[0] takes the first item in the list, and .strip() removes leading and trailing whitespace.
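Note that indexing with [0] raises IndexError when the XPath matches nothing, for example on a page without an <h1>. Below is a minimal sketch of a more forgiving helper that works on a plain list of strings such as the one .extract() returns. The name first_stripped is hypothetical.

```python
def first_stripped(values, default=""):
    """Return the first string in values with surrounding whitespace removed,
    or default when the list is empty."""
    return values[0].strip() if values else default


print(first_stripped(["\n  Some headline  \r\n"]))  # Some headline
print(repr(first_stripped([])))                     # ''
```

Inside the spider this could be used as item['title'] = first_stripped(response.xpath("//h1/text()").extract()). Scrapy selectors also provide .extract_first(), which returns None instead of raising when nothing matches.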