Learning Objectives:
  1. Master the use of Scrapy pipelines

We touched on Scrapy pipelines in the introduction to Scrapy section; now we'll look at them in more depth.

1. Commonly used methods in a pipeline:

  1. process_item(self, item, spider):
    • A mandatory method in a pipeline class
    • Implements the processing of item data
    • Must return the item
  2. open_spider(self, spider): executed only once, when the spider is opened
  3. close_spider(self, spider): executed only once, when the spider is closed
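
Putting the three methods together, a pipeline class looks roughly like this (a minimal sketch; the class name is just a placeholder):

class DemoPipeline(object):
    def open_spider(self, spider):
        # executed only once, when the spider is opened, e.g. open a file or a database connection
        pass

    def close_spider(self, spider):
        # executed only once, when the spider is closed, e.g. close the file or connection
        pass

    def process_item(self, item, spider):
        # process the item here, then return it so that later pipelines can receive it
        return item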

2. Modifying the pipelines file

Continue to improve the wangyi crawler's pipelines.py code:

import json
from pymongo import MongoClient

class WangyiFilePipeline(object):
    def open_spider(self, spider):  # executed only once, when the spider is opened
        if spider.name == 'itcast':
            self.f = open('json.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):  # executed only once, when the spider is closed
        if spider.name == 'itcast':
            self.f.close()

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            self.f.write(json.dumps(dict(item), ensure_ascii=False, indent=2) + ',\n')
        # without this return, pipelines with a lower priority (larger weight value) will not receive the item
        return item

class WangyiMongoPipeline(object):
    def open_spider(self, spider):  # executed only once, when the spider is opened
        # we could also use isinstance() to distinguish between spiders (see the sketch below)
        if spider.name == 'itcast':
            con = MongoClient(host='127.0.0.1', port=27017)  # instantiate MongoClient
            self.collection = con.itcast.teachers  # get a collection object (database itcast, collection teachers)

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # the item must be converted to a dictionary before it is inserted: dict(item)
            self.collection.insert_one(dict(item))
        # without this return, pipelines with a lower priority (larger weight value) will not receive the item
        return item
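As mentioned in the comment above, isinstance() can be used instead of spider.name to distinguish between spiders. A minimal sketch of that alternative, assuming the spider class ItcastSpider lives in myspider/spiders/itcast.py (both names are assumptions here):

from pymongo import MongoClient
from myspider.spiders.itcast import ItcastSpider  # assumed spider class and module path

class MongoByClassPipeline(object):  # illustrative name
    def open_spider(self, spider):
        # distinguish spiders by their class instead of by spider.name
        if isinstance(spider, ItcastSpider):
            con = MongoClient(host='127.0.0.1', port=27017)
            self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        if isinstance(spider, ItcastSpider):
            self.collection.insert_one(dict(item))
        return item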

3. Enabling the pipelines

Enable the pipelines in settings.py:

ITEM_PIPELINES = {
    'myspider.pipelines.WangyiFilePipeline': 400,   # 400 is the weight; the smaller the value, the higher the priority
    'myspider.pipelines.WangyiMongoPipeline': 500,
}

Start MongoDB with sudo service mongodb start, then use the mongo shell to check the data written to the database.
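
Besides the mongo shell, the inserted data can also be checked with a few lines of pymongo (a quick sketch, assuming the same itcast database and teachers collection as above, and pymongo 3.7+ for count_documents):

from pymongo import MongoClient

con = MongoClient(host='127.0.0.1', port=27017)
collection = con.itcast.teachers
print(collection.count_documents({}))   # number of documents inserted so far
for doc in collection.find().limit(3):  # peek at a few documents
    print(doc)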

Question to consider: why enable more than one pipeline in settings.py?

  1. Different pipelines can process data from different crawlers, distinguished by the spider argument
  2. Different pipelines can carry out different data-processing operations for one or more crawlers, such as one for data cleaning and one for data storage (see the sketch after this list)
  3. The same pipeline class can also process data from different crawlers, distinguished by the spider.name attribute
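
As an illustration of point 2, here is a sketch of two cooperating pipelines (the class names and the title field are made up for the example). The cleaning pipeline would be registered with a smaller weight so it runs first, and it must return the item, otherwise the storage pipeline would receive None:

import json

class CleanPipeline(object):           # e.g. weight 300 in ITEM_PIPELINES
    def process_item(self, item, spider):
        # data cleaning: strip whitespace from a hypothetical 'title' field
        if item.get('title'):
            item['title'] = item['title'].strip()
        return item                    # without this return, the next pipeline receives None

class SavePipeline(object):            # e.g. weight 400, runs after CleanPipeline
    def open_spider(self, spider):
        self.f = open('cleaned.json', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # data storage: items arriving here have already been cleaned
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + ',\n')
        return item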

4. Notes on using pipelines

  1. To use a pipeline, you must enable it in settings.py
  2. In the ITEM_PIPELINES setting, the key is the pipeline's import path (its location in the project can be customized), and the value represents its distance from the engine: the smaller the weight value, the closer it is to the engine and the earlier items pass through it
  3. When there are multiple pipelines, the process_item method must return the item, otherwise the next pipeline will receive None
  4. A pipeline must implement the process_item method, otherwise it cannot receive and process items
  5. The process_item method accepts both the item and the spider, where the spider is the one that produced the item currently being passed
  6. open_spider(self, spider): executed once, when the spider is opened
  7. close_spider(self, spider): executed once, when the spider is closed
  8. The two methods above are often used for interaction between the crawler and a database: establish the database connection when the spider is opened and close it when the spider is closed

Summary

  • Pipelines can be used to clean and store data, and multiple pipelines can be defined to perform different functions. A pipeline class has three methods:
    • process_item(self, item, spider): implements the processing of item data
    • open_spider(self, spider): executed only once, when the spider is opened
    • close_spider(self, spider): executed only once, when the spider is closed