Learning Objectives:
  1. Master the use of Scrapy pipelines

We touched on Scrapy pipelines in the introduction to Scrapy section; now we'll look at them in more depth.

1. Commonly used methods in a pipeline:

  1. process_item(self, item, spider):
    • A mandatory method in a pipeline class
    • Implements the processing of item data
    • Must return the item
  2. open_spider(self, spider): executed only once, when the spider is opened
  3. close_spider(self, spider): executed only once, when the spider is closed
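
Putting the three methods together, a pipeline class looks roughly like this (a minimal sketch; the class name is just a placeholder):

class DemoPipeline(object):
    def open_spider(self, spider):
        # executed only once, when the spider is opened, e.g. open a file or a database connection
        pass

    def close_spider(self, spider):
        # executed only once, when the spider is closed, e.g. close the file or connection
        pass

    def process_item(self, item, spider):
        # process the item here, then return it so that later pipelines can receive it
        return item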

2. Modifying the pipelines file

Continue to improve the wangyi crawler's pipelines.py code:

import json
from pymongo import MongoClient

class WangyiFilePipeline(object):
    def open_spider(self, spider):  # executed only once, when the spider is opened
        if spider.name == 'itcast':
            self.f = open('json.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):  # executed only once, when the spider is closed
        if spider.name == 'itcast':
            self.f.close()

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            self.f.write(json.dumps(dict(item), ensure_ascii=False, indent=2) + ',\n')
        # without this return, pipelines with a lower priority (larger weight value) will not receive the item
        return item

class WangyiMongoPipeline(object):
    def open_spider(self, spider):  # executed only once, when the spider is opened
        # we could also use isinstance() to distinguish between spiders (see the sketch below)
        if spider.name == 'itcast':
            con = MongoClient(host='127.0.0.1', port=27017)  # instantiate MongoClient
            self.collection = con.itcast.teachers  # get a collection object (database itcast, collection teachers)

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # the item must be converted to a dictionary before it is inserted: dict(item)
            self.collection.insert_one(dict(item))
        # without this return, pipelines with a lower priority (larger weight value) will not receive the item
        return item
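As mentioned in the comment above, isinstance() can be used instead of spider.name to distinguish between spiders. A minimal sketch of that alternative, assuming the spider class ItcastSpider lives in myspider/spiders/itcast.py (both names are assumptions here):

from pymongo import MongoClient
from myspider.spiders.itcast import ItcastSpider  # assumed spider class and module path

class MongoByClassPipeline(object):  # illustrative name
    def open_spider(self, spider):
        # distinguish spiders by their class instead of by spider.name
        if isinstance(spider, ItcastSpider):
            con = MongoClient(host='127.0.0.1', port=27017)
            self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        if isinstance(spider, ItcastSpider):
            self.collection.insert_one(dict(item))
        return item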

3. Enabling the pipelines

Enable the pipelines in settings.py:

ITEM_PIPELINES = {
    'myspider.pipelines.WangyiFilePipeline': 400,   # 400 is the weight; the smaller the value, the higher the priority
    'myspider.pipelines.WangyiMongoPipeline': 500,
}

Start MongoDB with sudo service mongodb start, then use the mongo shell to check the data written to the database.
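
Besides the mongo shell, the inserted data can also be checked with a few lines of pymongo (a quick sketch, assuming the same itcast database and teachers collection as above, and pymongo 3.7+ for count_documents):

from pymongo import MongoClient

con = MongoClient(host='127.0.0.1', port=27017)
collection = con.itcast.teachers
print(collection.count_documents({}))   # number of documents inserted so far
for doc in collection.find().limit(3):  # peek at a few documents
    print(doc)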

Question to consider: why enable more than one pipeline in settings.py?

  1. Different pipelines can process data from different crawlers, distinguished by the spider argument
  2. Different pipelines can carry out different data-processing operations for one or more crawlers, such as one for data cleaning and one for data storage (see the sketch after this list)
  3. The same pipeline class can also process data from different crawlers, distinguished by the spider.name attribute
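
As an illustration of point 2, here is a sketch of two cooperating pipelines (the class names and the title field are made up for the example). The cleaning pipeline would be registered with a smaller weight so it runs first, and it must return the item, otherwise the storage pipeline would receive None:

import json

class CleanPipeline(object):           # e.g. weight 300 in ITEM_PIPELINES
    def process_item(self, item, spider):
        # data cleaning: strip whitespace from a hypothetical 'title' field
        if item.get('title'):
            item['title'] = item['title'].strip()
        return item                    # without this return, the next pipeline receives None

class SavePipeline(object):            # e.g. weight 400, runs after CleanPipeline
    def open_spider(self, spider):
        self.f = open('cleaned.json', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # data storage: items arriving here have already been cleaned
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + ',\n')
        return item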

4. Notes on using pipelines

  1. To use a pipeline, you must enable it in settings.py
  2. In the ITEM_PIPELINES setting, the key is the pipeline's import path (its location in the project can be customized), and the value represents its distance from the engine: the smaller the weight value, the closer it is to the engine and the earlier items pass through it
  3. When there are multiple pipelines, the process_item method must return the item, otherwise the next pipeline will receive None
  4. A pipeline must implement the process_item method, otherwise it cannot receive and process items
  5. The process_item method accepts both the item and the spider, where the spider is the one that produced the item currently being passed
  6. open_spider(self, spider): executed once, when the spider is opened
  7. close_spider(self, spider): executed once, when the spider is closed
  8. The two methods above are often used for interaction between the crawler and a database: establish the database connection when the spider is opened and close it when the spider is closed

Summary

  • Pipelines can be used to clean and store data, and multiple pipelines can be defined to perform different functions. A pipeline class has three methods:
    • process_item(self, item, spider): implements the processing of item data
    • open_spider(self, spider): executed only once, when the spider is opened
    • close_spider(self, spider): executed only once, when the spider is closed