This is my 36th day of participating in the First Challenge 2022. For details, see: First Challenge 2022.
This article focuses on the use of scrapy pipelines (Pipelines).
Target site analysis
The target site for this collection is www.zaih.com/falcon/ment… , and the target data is expert (mentor) data.
The data is saved in a MySQL database, and the table structure is designed around the target data.
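A minimal SQL sketch of the users table, consistent with the insert statement used later in the pipeline (the column names match the code; the column types and lengths are assumptions):

CREATE TABLE users (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255),      -- expert name
    city VARCHAR(255),      -- city
    industry VARCHAR(255),  -- industry
    price DECIMAL(10, 1),   -- price
    chat_nums INT,          -- number of chats
    score DECIMAL(3, 1)     -- score
);

Matching that structure, the items.py file of the scrapy project can be written directly: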
import scrapy

class ZaihangItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()       # name
    city = scrapy.Field()       # city
    industry = scrapy.Field()   # industry
    price = scrapy.Field()      # price
    chat_nums = scrapy.Field()  # number of chats
    score = scrapy.Field()      # score
Coding time
The project-creation steps can follow the previous examples; this article starts directly from developing the collection file, zh.py. This time the paging address of the target data has to be concatenated manually, so an instance variable (field) named page is declared in advance. After each response, check whether the data is empty; if it is not, increment page by 1.
The request address template is as follows:
https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=心理&page={}
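The Chinese tag name 心理 must be URL-encoded when the request is actually sent; the encoded form that appears in the spider's url_format below can be reproduced with Python's standard library:

from urllib.parse import quote

print(quote("心理"))  # prints %E5%BF%83%E7%90%86, the form used in url_format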
If the page number exceeds the maximum, the page shows an empty state, which means the data is empty. So you only need to check whether a section with class=empty exists on the page. Parsing and cleaning the data can be read directly from the following code.
import scrapy
from zaihang_spider.items import ZaihangItem

class ZhSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['www.zaih.com']
    page = 1  # start page number
    url_format = 'https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=%E5%BF%83%E7%90%86&page={}'  # address template
    start_urls = [url_format.format(page)]

    def parse(self, response):
        empty = response.css("section.empty")  # check whether the data is empty
        if len(empty) > 0:
            return  # the empty tag exists, return directly

        mentors = response.css(".mentor-board a")  # hyperlinks of all experts
        for m in mentors:
            item = ZaihangItem()  # instantiate an object
            name = m.css(".mentor-card__name::text").extract_first()
            city = m.css(".mentor-card__location::text").extract_first()
            industry = m.css(".mentor-card__title::text").extract_first()
            price = self.replace_space(m.css(".mentor-card__price::text").extract_first())
            chat_nums = self.replace_space(m.css(".mentor-card__number::text").extract()[0])
            score = self.replace_space(m.css(".mentor-card__number::text").extract()[1])

            # format the data
            item["name"] = name
            item["city"] = city
            item["industry"] = industry
            item["price"] = price
            item["chat_nums"] = chat_nums
            item["score"] = score
            yield item

        # issue the next request
        self.page += 1
        next_url = self.url_format.format(self.page)
        yield scrapy.Request(url=next_url, callback=self.parse)

    def replace_space(self, in_str):
        # strip line breaks and the currency symbol
        in_str = in_str.replace("\n", "").replace("\r", "").replace("¥", "")
        return in_str.strip()
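As a quick illustration (the raw string below is hypothetical, not captured from the site), replace_space strips the line breaks and the currency symbol so that the pipeline can later convert the value to a number:

raw = "\n  ¥399\n"  # hypothetical raw text of a price node
clean = raw.replace("\n", "").replace("\r", "").replace("¥", "").strip()
print(clean)  # 399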
Enable ITEM_PIPELINES in the settings.py file:
ITEM_PIPELINES = {
    'zaihang_spider.pipelines.ZaihangMySQLPipeline': 300,
}
The pipeline class implements the classmethod from_crawler, which receives the global crawler object and uses it to read the configuration items in settings.py. In addition, there is a from_settings method, which is usually seen in official plugins, as in the following example.
@classmethod
def from_settings(cls, settings):
    host = settings.get('HOST')
    return cls(host)

@classmethod
def from_crawler(cls, crawler):
    # FIXME: for now, stats are only supported from this constructor
    return cls.from_settings(crawler.settings)
Before writing the following code, you need to add the configuration items to settings.py. The settings.py file code:
HOST = "127.0.0.1"
PORT = 3306
USER = "root"
PASSWORD = "123456"
DB = "zaihang"
The pipelines.py file code:
import pymysql

class ZaihangMySQLPipeline:
    def __init__(self, host, port, user, password, db):
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.db = db
        self.conn = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('HOST'),
            port=crawler.settings.get('PORT'),
            user=crawler.settings.get('USER'),
            password=crawler.settings.get('PASSWORD'),
            db=crawler.settings.get('DB')
        )

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                    password=self.password, db=self.db)

    def process_item(self, item, spider):
        # store into MySQL
        name = item["name"]
        city = item["city"]
        industry = item["industry"]
        price = item["price"]
        chat_nums = item["chat_nums"]
        score = item["score"]
        sql = "insert into users(name,city,industry,price,chat_nums,score) values ('%s','%s','%s',%.1f,%d,%.1f)" % (
            name, city, industry, float(price), int(chat_nums), float(score))
        print(sql)
        self.cursor = self.conn.cursor()  # get a cursor
        try:
            self.cursor.execute(sql)  # execute the SQL
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        if self.cursor:  # guard: the cursor is None if no item was ever processed
            self.cursor.close()
        self.conn.close()
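One caveat about the insert statement above: it is built with Python string formatting, so a quote character inside a value breaks the SQL, and the pattern is open to SQL injection. A safer variant, shown here as a sketch rather than the author's original code, is pymysql's parameter binding, which lets the driver quote the values:

# sketch: a parameterized version of the insert inside process_item
sql = ("insert into users(name, city, industry, price, chat_nums, score) "
       "values (%s, %s, %s, %s, %s, %s)")
self.cursor.execute(sql, (name, city, industry, float(price), int(chat_nums), float(score)))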
The three important methods in the pipeline file are open_spider, process_item, and close_spider.
# executed once when the spider is opened
def open_spider(self, spider):
    # instance variables can be dynamically added to the spider object here;
    # their values can then be read in the spider module, e.g. via self in parse(self, response)
    # some initialization actions
    pass

# processes the extracted data; this is where the data-saving code is written
def process_item(self, item, spider):
    pass

# executed only once, when the spider closes normally;
# close_spider is not executed if the crawler crashes abnormally while running
def close_spider(self, spider):
    # close the database and release resources
    pass
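To make the open_spider comment concrete, here is a minimal sketch; the attribute name started_at is hypothetical and chosen only for illustration:

import time

class ExamplePipeline:
    def open_spider(self, spider):
        # dynamically attach a value to the spider object
        spider.started_at = time.time()

# inside the spider module, the value is then readable through self, e.g.:
#   def parse(self, response):
#       print(self.started_at)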
Crawl results display
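The results are produced by running the spider. Assuming the project layout above, the crawl can be started with the scrapy crawl zh command, or programmatically through Scrapy's CrawlerProcess:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# run from the project root so settings.py (and the pipeline registration) is picked up
process = CrawlerProcess(get_project_settings())
process.crawl('zh')  # the spider's name attribute
process.start()      # blocks until the crawl finishes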
Written at the end
Today is day 246/365 of continuous writing. Looking forward to your follows, likes, comments, and favorites.