
The Scrapy crawler framework

What is Scrapy?

Scrapy is a fast and powerful open-source web crawler framework (the source is hosted on GitHub). Official site: Scrapy.org/

Installing Scrapy

Run in CMD:

pip install scrapy

Test the installation:

scrapy -h

Reference: Blog.csdn.net/qq_42543250…

Output of scrapy -h:

Scrapy framework structure

“5 + 2” structure

Framework components:

Scrapy Engine: the engine, which handles the data flow of the entire framework
Scheduler: the scheduler, which receives requests from the engine, queues them, and returns them when the engine asks again
Downloader: the downloader, which downloads all requests sent by the engine and returns the obtained page source to the engine, which then hands it to the spiders
Spiders: the crawlers, which receive and process the page source handed over by the engine, extract the data required by the item fields, and submit the URLs that need to be followed up back to the engine so they enter the scheduler again
Item Pipeline: the pipeline, responsible for post-processing the items obtained from the spiders
Downloader Middlewares: download middleware, components for custom extension of the download functionality
Spider Middlewares: spider middleware, components for custom extension of the communication between the engine and the spiders

Scrapy crawler data types

  • The Request class
  • The Response class
  • The Item class

Scrapy data processing flow:

  1. When a site needs to be crawled, the spider produces the first URL(s) to crawl and returns them to the engine
  2. The engine passes the URL to the scheduler as a request
  3. The engine asks the scheduler for the next request and receives the request that was queued previously
  4. The engine passes the request to the downloader
  5. When the download is complete, the downloader returns the result to the engine as a response
  6. The engine hands the response to the spider, which processes it further; the result is two kinds of output, the URLs to follow up and the extracted item data, and both are returned to the engine
  7. The engine passes the URLs that need to be followed up to the scheduler and the extracted item data to the pipeline
  8. The loop then repeats from step 2; the program stops only when there are no more requests in the scheduler

Basic use of Scrapy crawlers

Use of the yield keyword

  • A function that contains a yield statement is a generator
  • A generator produces one value at a time (at the yield statement), then the function freezes until it is awakened to produce the next value
  • In other words, a generator is a function that keeps producing values on demand (see the sketch below)
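
A minimal standalone sketch of this behavior (count_up is a made-up example, not part of Scrapy):

def count_up(n):
    # Each yield hands one value back to the caller, then the function freezes here
    for i in range(n):
        yield i

gen = count_up(3)     # calling the function returns a generator; nothing runs yet
print(next(gen))      # 0 -- runs until the first yield
print(next(gen))      # 1 -- resumes exactly where it froze
print(list(gen))      # [2] -- consumes the remaining values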

Common Scrapy crawler commands

startproject: create a new project. Format: scrapy startproject projectName
genspider: create a crawler. Format: scrapy genspider [options] name domain
settings: get crawler configuration information. Format: scrapy settings [options]
crawl: run a crawler. Format: scrapy crawl spider
list: list all crawlers in the project. Format: scrapy list
shell: start the URL debugging command line. Format: scrapy shell [url]

Use of Scrapy crawlers

  1. scrapy startproject xxx: create a new crawler project

Create the project: scrapy startproject mydemo

Directory tree:
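
For a project created with scrapy startproject mydemo, the generated layout typically looks like the sketch below (exact contents can vary slightly between Scrapy versions):

mydemo/
    scrapy.cfg            # deployment/configuration file
    mydemo/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # middleware definitions
        pipelines.py      # pipeline definitions
        settings.py       # project settings
        spiders/          # spider code lives here
            __init__.py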

The role of each file in the project directory:

scrapy.cfg: the project configuration file
spiders/: stores the spider files, i.e. the .py files that do the crawling
items.py: a container for the scraped data, much like a dictionary
middlewares.py: defines the implementation of Downloader Middlewares and Spider Middlewares
pipelines.py: defines the implementation of the Item Pipeline, which handles data cleaning, storage, and validation
settings.py: global configuration
  2. Identify the target (write items.py): define the data you want to capture

items.py file contents:
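
A freshly generated items.py looks roughly like this (the class name follows the project name, here assumed to be mydemo):

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
import scrapy


class MydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass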

  3. Make the spider (spiders/xxspider.py): write the spider and start crawling web pages

Spider template:
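
The template that scrapy genspider generates looks roughly like this (the spider name and domain below are placeholders, not from the original project):

# -*- coding: utf-8 -*-
import scrapy


class XxspiderSpider(scrapy.Spider):
    name = 'xxspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # parsing logic goes here
        pass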

  4. Store the content (write pipelines.py): design the pipeline that stores the crawled data

Douban Top250 movie information with a Scrapy crawler

Create the project:

scrapy startproject douban

Create the spider:

scrapy genspider douban_scrapy douban.com

Identify the target

We’re going to grab the serial numbers, titles, introductions, ratings, reviews, and descriptions of all the movies on movie.douban.com/top250

Open items.py in the douban directory. An Item defines structured data fields that are used to hold the crawled data, much like a dict in Python, but with some additional protection to reduce errors. An Item is defined by creating a class that inherits from scrapy.Item and declaring class attributes of type scrapy.Field. Next, create a DoubanItem class and build the Item model.

items.py

import scrapy  
class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Movie number
    serial_number = scrapy.Field()
    # Movie title
    movie_name = scrapy.Field()
    # Movie Introduction
    introduce = scrapy.Field()
    # Movie stars
    star = scrapy.Field()
    # Movie reviews
    evaluate = scrapy.Field()
    # Movie description
    describe = scrapy.Field()

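To illustrate the dict-like behavior and the extra protection mentioned above, here is a quick sketch (the sample value is made up):

item = DoubanItem()
item['movie_name'] = 'The Shawshank Redemption'   # works like a dict
print(item['movie_name'])

# Assigning to a field that was not declared raises KeyError,
# which is the "additional protection" compared with a plain dict:
# item['year'] = 1994  # KeyError: 'DoubanItem does not support field: year'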

spiders/douban_scrapy.py

Enter the following command in the project directory to create a spider named douban_scrapy in the douban/spiders directory and specify the domain scope of the crawl:

scrapy genspider douban_scrapy movie.douban.com

Open douban_scrapy.py in the douban/spiders directory (a default skeleton is generated there) and complete it with the following code:

#douban_scrapy.py
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanScrapySpider(scrapy.Spider):
    name = 'douban_scrapy'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):  # parsing method
        # movie_list is of type scrapy.selector.unified.SelectorList
        movie_list = response.xpath("//ol[@class='grid_view']/li")
        # Data extraction
        # self.log('movie_list type: {}'.format(type(movie_list)))
        for i_item in movie_list:
            # Instantiate the item defined in items.py
            douban_item = DoubanItem()
            # Data filtering
            # extract(): returns a list of strings, e.g. ['ABC'] if there is only one match
            # extract_first(): returns the first string of that list
            douban_item['serial_number'] = i_item.xpath(".//div[@class='pic']/em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(".//div[@class='hd']/a/span[1]/text()").extract_first()
            content = i_item.xpath(".//div[@class='bd']/p[1]/text()").extract()
            for i_content in content:
                # Strip the whitespace inside each line of the introduction
                contents = "".join(i_content.split())
                douban_item['introduce'] = contents
            douban_item['star'] = i_item.xpath(".//div[@class='star']/span[2]/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='star']/span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(".//p[@class='quote']/span/text()").extract_first()
            # Return the item to the pipeline, using a generator
            yield douban_item
        # Parse the next page: take its href with XPath
        next_link = response.xpath("//span[@class='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            # The spider generates the Request for the next page
            yield scrapy.Request('https://movie.douban.com/top250' + next_link, callback=self.parse)
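
For reference, a small standalone sketch of how extract() and extract_first() behave, using a Selector built from a hard-coded HTML string rather than the Douban page:

from scrapy.selector import Selector

html = "<ul><li>first</li><li>second</li></ul>"
sel = Selector(text=html)

# extract() returns a list of all matching strings
print(sel.xpath("//li/text()").extract())        # ['first', 'second']

# extract_first() returns the first match only, or None if there is no match
print(sel.xpath("//li/text()").extract_first())  # 'first'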

Save the data (pipelines.py)

1. The simplest way for Scrapy to save the crawled data is the -o option, which exports a file in one of four main formats. The commands are as follows:

JSON format, Unicode encoding by default:

scrapy crawl douban_scrapy -o douban.json

JSON Lines format, Unicode encoding by default:

scrapy crawl douban_scrapy -o douban.jsonl

CSV (comma-separated values, can be opened in Excel):

scrapy crawl douban_scrapy -o douban.csv

XML format:

scrapy crawl douban_scrapy -o douban.xml

2. Store the data in MySQL through the pipeline

pipelines.py

import pymysql
from twisted.enterprise import adbapi


class DoubanPipeline(object):
    # Use Twisted for asynchronous storage
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = {
            'host': "localhost",
            'user': "root",
            'port': 3306,
            'passwd': "root",
            'db': "mystudy",
            'charset': 'utf8',
            'cursorclass': pymysql.cursors.DictCursor,
            'use_unicode': True
        }

        dbpool = adbapi.ConnectionPool('pymysql', **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use Twisted to run the MySQL insert asynchronously
        # The first argument of runInteraction is a function
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert
        insert_sql = '''insert into douban (serial_number, movie_name, introduce, star, evaluate, Mdescribe)
                        values (%s, %s, %s, %s, %s, %s);'''
        cursor.execute(insert_sql,
                       (item['serial_number'], item['movie_name'], item['introduce'], item['star'],
                        item['evaluate'], item['describe']))
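
The insert statement in do_insert() assumes that the douban table already exists in the mystudy database. A minimal sketch for creating it with pymysql (the table and column definitions here are assumptions chosen to match the insert statement, not part of the original project):

import pymysql

# Connection parameters mirror the ones used in DoubanPipeline.from_settings()
conn = pymysql.connect(host="localhost", user="root", port=3306,
                       passwd="root", db="mystudy", charset="utf8")
try:
    with conn.cursor() as cursor:
        # Column names match the insert statement in do_insert()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS douban (
                id INT AUTO_INCREMENT PRIMARY KEY,
                serial_number VARCHAR(16),
                movie_name VARCHAR(255),
                introduce VARCHAR(255),
                star VARCHAR(16),
                evaluate VARCHAR(64),
                Mdescribe VARCHAR(255)
            ) DEFAULT CHARSET=utf8
        """)
    conn.commit()
finally:
    conn.close()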

Some configuration

#settings.py  
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'  
ROBOTSTXT_OBEY = False  
ITEM_PIPELINES = {   
   'douban.pipelines.DoubanPipeline':10  
}
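
With ROBOTSTXT_OBEY disabled, the custom USER_AGENT set, and DoubanPipeline registered in ITEM_PIPELINES, the crawler is started the same way as before:

scrapy crawl douban_scrapy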