Tool environment
Language: python3.6  Database: MongoDB (install and start it with the following commands)

```
python3 -m pip install pymongo
brew install mongodb
mongod --config /usr/local/etc/mongod.conf
```

Framework: Scrapy 1.5.1 (install command below)

```
python3 -m pip install Scrapy
```
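Before going further, it can be worth a quick sanity check that the pieces talk to each other. The snippet below is not part of the project, just a throwaway check that pymongo can reach the local mongod and that Scrapy imports cleanly:

```py
import pymongo
import scrapy

# Connect to the local MongoDB instance started above
client = pymongo.MongoClient("localhost", 27017)
print(client.server_info()["version"])  # MongoDB server version
print(scrapy.__version__)               # should print something like 1.5.1
```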
Create a crawler project using the Scrapy framework

Execute the following command in the terminal to create a crawler project named myspider:

```
scrapy startproject myspider
```
This gives you a project directory with the structure shown below.
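For reference, the standard Scrapy 1.5 startproject template typically generates a layout like this:

```
myspider/
    scrapy.cfg            # deploy configuration file
    myspider/             # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders go here
            __init__.py
```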
Create a crawl-type crawler

Scrapy provides several spider types: CrawlSpider, XMLFeedSpider, CSVFeedSpider, and SitemapSpider. CrawlSpider is the most common choice for crawling an entire site, and it is the one used in the example below.
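If you want to confirm which spider templates your Scrapy installation ships with, `scrapy genspider -l` lists them; with Scrapy 1.5 the output typically looks like:

```
$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
```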
Go to the spiders directory on the command line:

```
cd myspider/myspider/spiders
```

Then create a crawler from the crawl template:

```
scrapy genspider -t crawl zgmlxc www.zgmlxc.com.cn
```
Parameter description:

-t crawl specifies the spider template type
zgmlxc is the name I gave this crawler
www.zgmlxc.com.cn is the site I want to crawl
Complete the zgmlxc crawler

Open the zgmlxc.py file and you'll see a basic crawler template. Now it's time to configure it so the crawler collects information the way I want.
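For reference, the file generated by `scrapy genspider -t crawl` usually looks roughly like the sketch below (the class name and the placeholder rule come from the standard template; the exact contents may differ slightly):

```py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZgmlxcSpider(CrawlSpider):
    name = 'zgmlxc'
    allowed_domains = ['www.zgmlxc.com.cn']
    start_urls = ['http://www.zgmlxc.com.cn/']

    rules = (
        # Placeholder rule from the template; we replace it in the next step
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item
```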
Configure the page-following rules

```py
rules = (
    # Follow the listing page www.zgmlxc.com.cn/node/72.jspx
    Rule(LinkExtractor(allow=r'.72\.jspx')),
    # Crawl URLs matching this pattern and hand each response to parse_item()
    Rule(LinkExtractor(allow=r'./info/\d+\.jspx'), callback='parse_item'),
)
```
The last Rule must be followed by a comma, otherwise an error will be reported:

```py
rules = (
    Rule(LinkExtractor(allow=r'./info/\d+\.jspx'), callback='parse_item', follow=True),
)
```
Define the fields we need to extract in items.py
```py
import scrapy


class CrawlspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    piclist = scrapy.Field()
    shortname = scrapy.Field()
```
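As a side note, a Scrapy item behaves much like a dict, which is what lets the MongoDB pipeline further down call dict(item). A tiny illustration (the import path assumes the myspider project package):

```py
from myspider.items import CrawlspiderItem  # assumed project package name

item = CrawlspiderItem()
item['title'] = 'Example title'
item['shortname'] = 'example'
print(dict(item))  # -> {'title': 'Example title', 'shortname': 'example'}
```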
Complete the parse_item function

Here we take the responses matched by the rules configured above and extract the information we want. Joining the extracted values into a single string makes it easier to import them into the database later.

```py
def parse_item(self, response):
    yield {
        'title': ' '.join(response.xpath("//div[@class='head']/h3/text()").getall()).strip(),
        'shortname': ' '.join(response.xpath("//div[@class='body']/p/strong/text()").getall()).strip(),
        'piclist': ' '.join(response.xpath("//div[@class='body']/p/img/@src").getall()).strip(),
        'content': ' '.join(response.css("div.body").extract()).strip(),
    }
```
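Putting the rules and parse_item together, the finished zgmlxc.py might look like the sketch below (the class name and the start URL are assumptions based on the genspider command and the listing page used earlier):

```py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZgmlxcSpider(CrawlSpider):
    name = 'zgmlxc'
    allowed_domains = ['www.zgmlxc.com.cn']
    start_urls = ['http://www.zgmlxc.com.cn/node/72.jspx']  # assumed entry page

    rules = (
        Rule(LinkExtractor(allow=r'./info/\d+\.jspx'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': ' '.join(response.xpath("//div[@class='head']/h3/text()").getall()).strip(),
            'shortname': ' '.join(response.xpath("//div[@class='body']/p/strong/text()").getall()).strip(),
            'piclist': ' '.join(response.xpath("//div[@class='body']/p/img/@src").getall()).strip(),
            'content': ' '.join(response.css("div.body").extract()).strip(),
        }
```

If you prefer the CrawlspiderItem defined in items.py over a plain dict, you can populate and yield that instead; the pipeline below handles either, since it converts the item with dict(item).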
PS: Here are a couple of extraction patterns that come up often, summarized here:

Grab an image's src attribute: //img[@class='photo-large']/@src
Grab the full HTML of a node: response.css("div.body").extract()
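These selectors are easiest to try out interactively in the Scrapy shell before putting them into parse_item, for example (using the listing page and the selectors from this article):

```
$ scrapy shell "http://www.zgmlxc.com.cn/node/72.jspx"
>>> response.xpath("//img[@class='photo-large']/@src").getall()
>>> response.css("div.body").extract()
>>> response.xpath("//div[@class='head']/h3/text()").getall()
```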
Save the information to the MongoDB database

Set the database information

Open settings.py and add the following information:
```py
# Connect the crawler to the database pipeline
ITEM_PIPELINES = {
    'myspider.pipelines.MongoDBPipeline': 300,
}

# Set database information
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = 'spider_world'
MONGODB_COLLECTION = 'zgmlxc'

# Be a polite crawler: wait 5 seconds between requests, which is friendly to the site
# and also helps prevent getting blacklisted
DOWNLOAD_DELAY = 5
```
In pipelines.py:

```py
import pymongo

from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class MongoDBPipeline(object):

    def __init__(self):
        # Connect to MongoDB using the values configured in settings.py
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop items that are missing any of the expected fields
        valid = True
        for data in item:
            if not item[data]:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert_one(dict(item))
            spider.logger.debug("Question added to MongoDB database!")
        return item
```
Run the crawler in the terminal

```
scrapy crawl zgmlxc
```
View the information in Navicat

Create a new connection to the MongoDB database as shown below, fill in the configuration information above, and if everything goes well you will see that the desired information is already in the database.
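If you would rather check from Python instead of Navicat, a quick pymongo query against the database and collection configured above would look something like this:

```py
import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["spider_world"]["zgmlxc"]

print(collection.count_documents({}))   # how many items were stored
for doc in collection.find().limit(3):  # peek at a few documents
    print(doc.get("title"))
```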
This completes the whole process, from writing a custom crawler to getting the data into the database.