Tool environment

Language: Python 3.6. Database: MongoDB (install and start it with the following commands)

```
python3 -m pip install pymongo
brew install mongodb
mongod --config /usr/local/etc/mongod.conf
```
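
Optionally, here is a quick sanity check that pymongo can talk to the local mongod (this assumes MongoDB is running on the default host and port):

```py
import pymongo

# Connect to the local MongoDB instance started above (default host and port assumed)
client = pymongo.MongoClient("localhost", 27017)
print(client.server_info()["version"])  # prints the server version if the connection works
```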

Framework: Scrapy 1.5.1 (install command below)

```
python3 -m pip install Scrapy
```

Create a crawler project with the Scrapy framework

Execute the following command in the terminal to create a crawler project named myspider:

```
scrapy startproject myspider
```

This gives you a project directory with the following structure.
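
The exact files vary a little between Scrapy versions; the layout below is what Scrapy 1.5 typically generates:

```
myspider/
├── scrapy.cfg          # deploy configuration
└── myspider/           # the project's Python package
    ├── __init__.py
    ├── items.py        # item definitions (edited below)
    ├── middlewares.py
    ├── pipelines.py    # item pipelines (the MongoDB pipeline goes here)
    ├── settings.py     # project settings
    └── spiders/        # spider code lives here
        └── __init__.py
```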

Create a crawl style crawler

Scrapy ships with several spider types: Spider (the basic one), CrawlSpider (the most common choice for crawling an entire site, and the one used in the example below), XMLFeedSpider, CSVFeedSpider, and SitemapSpider.

Go to the spiders directory on the command line:

```
cd myspider/myspider/spiders
```

Then create a crawler template of type Crawl

```
scrapy genspider -t crawl zgmlxc www.zgmlxc.com.cn
```

Parameter Description:

-t crawl specifies the crawler template to use

zgmlxc is the name I gave this crawler

www.zgmlxc.com.cn is the site I want to crawl

Complete the zgmlxc crawler

Open the zgmlxc.py file and you’ll see a basic crawler template. Now it’s time to configure it so the crawler collects exactly the information we want.
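
The generated zgmlxc.py looks roughly like this (the exact boilerplate varies slightly between Scrapy versions); the rules tuple and parse_item() are what we fill in next:

```py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZgmlxcSpider(CrawlSpider):
    name = 'zgmlxc'
    allowed_domains = ['www.zgmlxc.com.cn']
    start_urls = ['http://www.zgmlxc.com.cn/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
```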

Configure the link-following rules

```py
rules = (
    # Locate the listing page www.zgmlxc.com.cn/node/72.jspx and follow it
    Rule(LinkExtractor(allow=r'.72\.jspx')),
    # Follow links matching this pattern, crawl them, and pass the response to parse_item()
    Rule(LinkExtractor(allow=r'./info/\d+\.jspx'), callback='parse_item'),
)
```

Note that the last Rule must be followed by a comma, otherwise an error is reported: when there is only one Rule, the trailing comma is what makes rules a tuple. Adding follow=True tells the spider to keep following links from the pages matched by that rule:

```py
rules = (
    Rule(LinkExtractor(allow=r'./info/\d+\.jspx'), callback='parse_item', follow=True),
)
```

Define the fields we need to extract in items.py

```py
import scrapy


class CrawlspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    piclist = scrapy.Field()
    shortname = scrapy.Field()
```

Complete the parse_item function

Here we take the pages matched by the rules above and extract the fields we want. The join() calls flatten each list of matches into a single string, which makes the database import later straightforward.

```py
def parse_item(self, response):
    yield {
        'title': ' '.join(response.xpath("//div[@class='head']/h3/text()").getall()).strip(),
        'shortname': ' '.join(response.xpath("//div[@class='body']/p/strong/text()").getall()).strip(),
        'piclist': ' '.join(response.xpath("//div[@class='body']/p/img/@src").getall()).strip(),
        'content': ' '.join(response.css("div.body").extract()).strip(),
    }
```
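
parse_item() above yields a plain dict, which the pipeline later in this post accepts as-is. If you would rather go through the CrawlspiderItem defined in items.py, an equivalent version might look like this (the myspider package name is assumed from the project created earlier):

```py
from myspider.items import CrawlspiderItem  # package name assumed from the project created above


def parse_item(self, response):
    item = CrawlspiderItem()
    item['title'] = ' '.join(response.xpath("//div[@class='head']/h3/text()").getall()).strip()
    item['shortname'] = ' '.join(response.xpath("//div[@class='body']/p/strong/text()").getall()).strip()
    item['piclist'] = ' '.join(response.xpath("//div[@class='body']/p/img/@src").getall()).strip()
    item['content'] = ' '.join(response.css("div.body").extract()).strip()
    yield item
```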

PS: here are a few common patterns for extracting content, summarized for reference:

Extract an image URL: response.xpath("//img[@class='photo-large']/@src")

Extract the full HTML of a block: response.css("div.body").extract()
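
These selectors are easiest to fine-tune interactively with scrapy shell before wiring them into the spider. A quick session against one of the detail pages (the URL below is only illustrative) looks like this:

```py
# In the terminal (illustrative URL):
#   scrapy shell "http://www.zgmlxc.com.cn/info/12345.jspx"
# Then, at the shell prompt:
response.xpath("//img[@class='photo-large']/@src").getall()   # list of image URLs
response.css("div.body").extract()                            # full HTML of the matched blocks
```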

Save the information to the MongoDB database

Set the database information

Open settings.py and add the following information:

```py
# Hook the crawler up to the database through the item pipeline
# (the prefix must match your project package name)
ITEM_PIPELINES = {
    'myspider.pipelines.MongoDBPipeline': 300,
}

# Database connection settings
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = 'spider_world'
MONGODB_COLLECTION = 'zgmlxc'

# Be a polite crawler: wait 5 seconds between requests. This is friendly to the
# site and also helps keep the crawler off a blacklist.
DOWNLOAD_DELAY = 5
```

Complete pipelines.py. The version below reads the settings with get_project_settings() and logs through spider.logger, which avoids the deprecated scrapy.conf and scrapy.log modules:

```py
import pymongo

from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class MongoDBPipeline(object):
    def __init__(self):
        # Open one connection when the pipeline is created
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop items with any empty field rather than writing incomplete records
        for field in item:
            if not item[field]:
                raise DropItem("Missing {0}!".format(field))
        self.collection.insert_one(dict(item))
        spider.logger.debug("Question added to MongoDB database!")
        return item
```

Run the crawler from the terminal. Note that the argument is the spider's name (zgmlxc), not the project name:

```
scrapy crawl zgmlxc
```

View the information in Navicat

Create a new connection to the MongoDB database, fill in the configuration information above, and if everything went well you will see the desired information already in the database.
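
If Navicat is not at hand, the same check can be done from a Python shell with pymongo, using the connection settings configured earlier:

```py
import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["spider_world"]["zgmlxc"]

print(collection.count_documents({}))   # how many items the spider stored
print(collection.find_one())            # peek at one stored document
```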

This completes the whole process, from writing a custom crawler to getting the data into the database.
