1. Enabling pipelines
ITEM_PIPELINES = {
    # 'jingxi.pipelines.JingxiPipeline': 200,
    'jingxi.pipelines.BaiduPipeline': 300,
    'jingxi.pipelines.TencentPipeline': 100,
}
Once several pipelines are enabled, every yielded item flows through all of them in turn. The order is determined by the assigned number: the smaller the number, the earlier the pipeline runs. Here TencentPipeline (100) processes each item before BaiduPipeline (300).
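As a quick illustration, here is a minimal sketch of two trivial pipelines (the print statements are illustrative only; the priority numbers come from the settings above) showing how an item passes from one process_item to the next:

# pipelines.py -- minimal sketch
class TencentPipeline:
    def process_item(self, item, spider):
        print('TencentPipeline (priority 100) sees the item first')
        return item  # returning the item hands it to the next pipeline

class BaiduPipeline:
    def process_item(self, item, spider):
        print('BaiduPipeline (priority 300) sees the item second')
        return item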
2. How to distinguish different pipelines
class BaiduPipeline:
    def open_spider(self, spider):
        if spider.name != 'baidu':
            return
        print('BaiduPipeline opened')

    def process_item(self, item, spider):
        if spider.name != 'baidu':
            return item  # not this pipeline's spider, pass the item along
        print('&&&&&&&&&', spider.name, item)
        return item

    def close_spider(self, spider):
        if spider.name != 'baidu':
            return
        print('BaiduPipeline closed')


class TencentPipeline:
    def process_item(self, item, spider):
        if spider.name != 'tencent':
            return item
        print(spider.name, item)
        return item

    def close_spider(self, spider):
        if spider.name != 'tencent':
            return
        print('run')
In simple terms, a pipeline tells crawlers apart through the spider argument (typically spider.name). When an item flows into a pipeline that does not belong to its spider, process_item simply returns it so it keeps moving down the chain; in its target pipeline the item is actually processed, and the flow can also be cut short there by dropping the item, as sketched below.
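A minimal sketch of cutting off the flow: raising DropItem (Scrapy's standard exception for this) stops the item from reaching any later pipeline. The duplicate-check logic and the 'url' key here are hypothetical:

from scrapy.exceptions import DropItem

class DedupPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item.get('url')  # hypothetical dedup key
        if key in self.seen:
            # DropItem ends the flow here; no later pipeline sees the item
            raise DropItem(f'duplicate item: {key}')
        self.seen.add(key)
        return item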
3. Introducing settings in a pipeline
import pymongo

class MongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings gives the pipeline access to the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces the deprecated insert() in modern pymongo
        self.db[self.collection_name].insert_one(dict(item))
        return item
Simple and direct: the from_crawler class method receives the running Crawler object, so the pipeline can read values such as MONGO_URI and MONGO_DATABASE from the project settings before the instance is even created.
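For completeness, a sketch of the matching settings.py entries (the URI value is a placeholder and the pipeline path assumes the jingxi project used earlier; adjust both to your project):

# settings.py -- placeholder values
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'items'

ITEM_PIPELINES = {
    'jingxi.pipelines.MongoPipeline': 400,
}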