Scrapy crawler framework
What is Scrapy?
Scrapy is a fast and powerful open source web crawler framework, with its source hosted on GitHub. Official site: Scrapy.org/
Installation of Scrapy
Run in CMD:
pip install scrapy
Test the installation: scrapy -h
Reference: Blog.csdn.net/qq_42543250…
Output of scrapy -h:
Scrapy framework structure
“5 + 2” structure
Framework components:
component | role |
---|---|
Scrapy Engine | The engine, which handles the data flow of the entire framework |
Scheduler | The scheduler, which receives requests from the engine, queues them, and returns them when the engine asks again |
Downloader | The downloader, which fetches all requests sent by the engine and returns the downloaded page source to the engine, which hands it to the spider |
Spiders | The spider, which receives and processes all responses sent by the engine, analyzes and extracts the data needed for the item fields, and submits follow-up URLs to the engine, which sends them back to the scheduler |
Item Pipeline | The item pipeline, which takes the items produced by the spider and performs post-processing |
Downloader Middlewares | Downloader middleware, a component for customizing and extending the download functionality |
Spider Middlewares | Spider middleware, a component for customizing and extending the communication between the engine and the spiders |
Data types of a Scrapy crawler
- The Request class
- The Response class
- The Item class (a minimal sketch of all three follows below)
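A minimal sketch of how these three types typically show up in spider code (the URL, class name, and field below are illustrative, not taken from the original):

```python
import scrapy

# Request: describes a page to fetch; callback names the method that will parse the response
request = scrapy.Request("https://example.com", callback=None)

# Item: a dict-like container whose fields are declared with scrapy.Field()
class DemoItem(scrapy.Item):
    title = scrapy.Field()

item = DemoItem(title="hello")
print(item["title"])  # items support dictionary-style access

# Response: produced by the downloader and passed to the spider's parse(self, response)
# method; it exposes the page source plus helpers such as response.xpath(...)
```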
Scrapy's data processing flow:
- When a site needs to be crawled, the spider starts from the first URL and hands it to the engine
- The engine passes the URL to the scheduler as a request
- The engine asks the scheduler for the next request and receives the request that was queued earlier
- The engine passes the request to the downloader
- When the download is complete, the result is returned to the engine as a response
- The engine gives the response to the spider, which processes it further. The processing produces two kinds of output: the URLs to follow up and the item data that was extracted, both of which are returned to the engine
- The engine passes the follow-up URLs to the scheduler and the extracted item data to the pipeline
- The loop then repeats from step 2 until all the information has been retrieved; the program stops only when there are no requests left in the scheduler
Basic use of Scrapy crawlers
Use of the yield keyword
- A function that contains a yield statement is a generator
- A generator produces one value at a time (at the yield statement), then the function freezes and produces the next value only when it is awakened again
- A generator is thus a function that continuously produces values; see the short example below
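A plain-Python sketch of this behaviour:

```python
def count_up(n):
    """A generator: yields one value at a time, freezing between values."""
    i = 0
    while i < n:
        yield i  # execution pauses here and resumes when the next value is requested
        i += 1

for value in count_up(3):
    print(value)  # prints 0, 1, 2, one value per iteration
```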
Common Scrapy crawler commands
Command | Description | Format |
---|---|---|
startproject | Create a new project | scrapy startproject projectName |
genspider | Create a crawler | scrapy genspider [options] name domain |
settings | Get crawler configuration information | scrapy settings [options] |
crawl | Run a crawler | scrapy crawl <spider> |
list | List all crawlers in the project | scrapy list |
shell | Start the URL debugging command line | scrapy shell [url] |
Use of Scrapy crawlers
- scrapy startproject xxx: create a new crawler project
Create a project: scrapy startproject mydemo
Directory tree:
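Roughly, the layout generated by `scrapy startproject mydemo` looks like this (comments added for orientation):

```
mydemo/
    scrapy.cfg            # project configuration file
    mydemo/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory for spider modules
            __init__.py
```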
The function of each file in the project directory
file | role |
---|---|
scrapy.cfg | The configuration file |
spiders | Stores your spider files, i.e. the .py files that do the crawling |
items.py | Defines the structure of the scraped data; an Item is a container that behaves much like a dictionary |
middlewares.py | Defines the implementation of Downloader Middlewares and Spider Middlewares |
pipelines.py | Defines the implementation of Item Pipelines for data cleaning, storage and validation |
settings.py | Global configuration |
- Define the target (write items.py): decide what data you want to capture
items.py file contents:
- Write the spider (spiders/xxspider.py): the spider that starts crawling the web pages
Spider template:
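A minimal sketch of what the generated template looks like (spider name, domain, and URL below are placeholders):

```python
import scrapy


class XxSpider(scrapy.Spider):
    name = 'xxspider'                      # unique name used by `scrapy crawl`
    allowed_domains = ['example.com']      # restrict crawling to these domains
    start_urls = ['http://example.com/']   # first URLs handed to the engine

    def parse(self, response):
        # extract item data and/or yield follow-up Requests here
        pass
```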
Douban Top250 movie information - Scrapy crawler
Create the project:
scrapy startproject douban
Generate the spider:
scrapy genspider douban_scrapy douban.com
Define the goal
We’re going to grab the serial numbers, titles, introductions, ratings, reviews, and descriptions of all the movies on movie.douban.com/top250
Open items.py in the douban directory. An Item defines structured data fields that hold the crawled data, much like a dict in Python, but with some additional protection to reduce errors. An Item is defined by creating a class that inherits from scrapy.Item and declaring class attributes of type scrapy.Field. Next, create a DoubanItem class and build the Item model.
items.py
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Movie number
    serial_number = scrapy.Field()
    # Movie title
    movie_name = scrapy.Field()
    # Movie introduction
    introduce = scrapy.Field()
    # Movie stars
    star = scrapy.Field()
    # Movie reviews
    evaluate = scrapy.Field()
    # Movie description
    describe = scrapy.Field()
spiders/douban_scrapy.py
Enter the following command in the project directory to create a crawler named douban_scrapy in the douban/spiders directory and specify the domain to crawl:
scrapy genspider douban_scrapy movie.douban.com
Open douban_scrapy.py in the douban/spiders directory, which already contains the default spider code, and extend it as follows:
# douban_scrapy.py
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanScrapySpider(scrapy.Spider):
    name = 'douban_scrapy'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):  # parsing method
        # movie_list is of type scrapy.selector.unified.SelectorList
        movie_list = response.xpath("//ol[@class ='grid_view']/li")
        # Data extraction
        # self.log('movie_list type: {}'.format(type(movie_list)))
        for i_item in movie_list:
            # Instantiate the item imported from items.py
            douban_item = DoubanItem()
            # Data filtering
            # extract(): returns a list of strings, e.g. ['ABC'] if there is only one match.
            # extract_first(): returns a single string, the first element of that list.
            douban_item['serial_number'] = i_item.xpath(".//div[@class='pic']/em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(".//div[@class='hd']/a/span[1]/text()").extract_first()
            # The introduction may span several text nodes; strip whitespace and join each one
            content = i_item.xpath(".//div[@class='bd']/p[1]/text()").extract()
            for i_content in content:
                contents = "".join(i_content.split())
                douban_item['introduce'] = contents
            douban_item['star'] = i_item.xpath(".//div[@class='star']/span[2]/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='star']/span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(".//p[@class= 'quote']/span/text()").extract_first()
            # Yield the item to the pipeline (parse is a generator)
            yield douban_item
        # Parse the next-page link with xpath
        next_link = response.xpath("//span[@class ='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            # The spider generates the Request for the next page
            yield scrapy.Request('https://movie.douban.com/top250' + next_link, callback=self.parse)
Save the data (pipelines.py)
1. The simplest way for Scrapy to save the scraped data is to output it with -o to a file in one of four formats; the commands are as follows:
JSON format, Unicode encoding by default
scrapy crawl douban_scrapy -o douban.json
JSON lines format, Unicode encoding by default
scrapy crawl douban_scrapy -o douban.jsonl
CSV comma-separated format, can be opened in Excel
scrapy crawl douban_scrapy -o douban.csv
XML format
scrapy crawl douban_scrapy -o douban.xml
2. Store the data into MySQL through a pipeline
pipelines.py
import pymysql
from twisted.enterprise import adbapi


class DoubanPipeline(object):
    # Use Twisted for asynchronous storage
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = {
            'host': "localhost",
            'user': "root",
            'port': 3306,
            'passwd': "root",
            'db': "mystudy",
            'charset': 'utf8',
            'cursorclass': pymysql.cursors.DictCursor,
            'use_unicode': True
        }
        dbpool = adbapi.ConnectionPool('pymysql', **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use Twisted to run the MySQL insert asynchronously
        # The first argument to runInteraction is a function
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions from the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert
        insert_sql = '''
            insert into douban (serial_number, movie_name, introduce, star, evaluate, Mdescribe)
            values (%s, %s, %s, %s, %s, %s);
        '''
        cursor.execute(insert_sql,
                       (item['serial_number'], item['movie_name'], item['introduce'],
                        item['star'], item['evaluate'], item['describe']))
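The insert statement above assumes that a douban table already exists in the mystudy database. A one-off sketch for creating it with pymysql (the column types and sizes below are assumptions, not taken from the original, and can be adjusted as needed):

```python
import pymysql

# Create the table the pipeline writes to; column names follow the insert statement above,
# while the VARCHAR sizes are assumptions.
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       passwd='root', db='mystudy', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute('''
        create table if not exists douban (
            serial_number varchar(16),
            movie_name    varchar(128),
            introduce     varchar(256),
            star          varchar(16),
            evaluate      varchar(64),
            Mdescribe     varchar(256)
        ) default charset=utf8;
    ''')
conn.commit()
conn.close()
```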
Some configuration
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 10
}