Scrapy crawler framework
What is Scrapy?
Scrapy is a fast and powerful open source web crawler framework, with its source hosted on GitHub. Official site: Scrapy.org/
Installation of Scrapy
Run in CMD:
pip install scrapy
Test the installation: scrapy -h
Reference: Blog.csdn.net/qq_42543250…
Output of scrapy -h:
Scrapy framework structure
“5 + 2” structure
Framework components:
component | role |
---|---|
Scrapy Engine | The engine, which handles the data flow of the entire framework |
Scheduler | The scheduler, which receives requests from the engine, queues them, and returns them when the engine asks again |
Downloader | The downloader, which fetches all requests sent by the engine and returns the downloaded page source to the engine, which hands it to the spider |
Spiders | The spider, which receives and processes all responses sent by the engine, analyzes and extracts the data needed for the item fields, and submits follow-up URLs to the engine, which sends them back to the scheduler |
Item Pipeline | The item pipeline, which takes the items produced by the spider and performs post-processing |
Downloader Middlewares | Downloader middleware, a component for customizing and extending the download functionality |
Spider Middlewares | Spider middleware, a component for customizing and extending the communication between the engine and the spiders |
Data types of a Scrapy crawler
- The Request class
- The Response class
- The Item class (a minimal sketch of all three follows below)
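A minimal sketch of how these three types typically show up in spider code (the URL, class name, and field below are illustrative, not taken from the original):

```python
import scrapy

# Request: describes a page to fetch; callback names the method that will parse the response
request = scrapy.Request("https://example.com", callback=None)

# Item: a dict-like container whose fields are declared with scrapy.Field()
class DemoItem(scrapy.Item):
    title = scrapy.Field()

item = DemoItem(title="hello")
print(item["title"])  # items support dictionary-style access

# Response: produced by the downloader and passed to the spider's parse(self, response)
# method; it exposes the page source plus helpers such as response.xpath(...)
```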
Scrapy's data processing flow:
- When a site needs to be crawled, the spider starts from the first URL and hands it to the engine
- The engine passes the URL to the scheduler as a request
- The engine asks the scheduler for the next request and receives the request that was queued earlier
- The engine passes the request to the downloader
- When the download is complete, the result is returned to the engine as a response
- The engine gives the response to the spider, which processes it further. The processing produces two kinds of output: the URLs to follow up and the item data that was extracted, both of which are returned to the engine
- The engine passes the follow-up URLs to the scheduler and the extracted item data to the pipeline
- The loop then repeats from step 2 until all the information has been retrieved; the program stops only when there are no requests left in the scheduler
Basic use of Scrapy crawlers
Use of the yield keyword
- A function that contains a yield statement is a generator
- A generator produces one value at a time (at the yield statement), then the function freezes and produces the next value only when it is awakened again
- A generator is thus a function that continuously produces values; see the short example below
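A plain-Python sketch of this behaviour:

```python
def count_up(n):
    """A generator: yields one value at a time, freezing between values."""
    i = 0
    while i < n:
        yield i  # execution pauses here and resumes when the next value is requested
        i += 1

for value in count_up(3):
    print(value)  # prints 0, 1, 2, one value per iteration
```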
Common Scrapy crawler commands
Command | Description | Format |
---|---|---|
startproject | Create a new project | scrapy startproject projectName |
genspider | Create a crawler | scrapy genspider [options] name domain |
settings | Get crawler configuration information | scrapy settings [options] |
crawl | Run a crawler | scrapy crawl <spider> |
list | List all crawlers in the project | scrapy list |
shell | Start the URL debugging command line | scrapy shell [url] |
Use of Scrapy crawlers
- scrapy startproject xxx: create a new crawler project
Create a project: scrapy startproject mydemo
Directory tree:
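Roughly, the layout generated by `scrapy startproject mydemo` looks like this (comments added for orientation):

```
mydemo/
    scrapy.cfg            # project configuration file
    mydemo/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory for spider modules
            __init__.py
```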
The function of each file in the project directory
file | role |
---|---|
scrapy.cfg | The configuration file |
spiders | Stores your spider files, i.e. the .py files that do the crawling |
items.py | Defines the structure of the scraped data; an Item is a container that behaves much like a dictionary |
middlewares.py | Defines the implementation of Downloader Middlewares and Spider Middlewares |
pipelines.py | Defines the implementation of Item Pipelines for data cleaning, storage and validation |
settings.py | Global configuration |
- Define the target (write items.py): decide what data you want to capture
items.py file contents:
- Write the spider (spiders/xxspider.py): the spider that starts crawling the web pages
Spider template:
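A minimal sketch of what the generated template looks like (spider name, domain, and URL below are placeholders):

```python
import scrapy


class XxSpider(scrapy.Spider):
    name = 'xxspider'                      # unique name used by `scrapy crawl`
    allowed_domains = ['example.com']      # restrict crawling to these domains
    start_urls = ['http://example.com/']   # first URLs handed to the engine

    def parse(self, response):
        # extract item data and/or yield follow-up Requests here
        pass
```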
Douban Top250 movie information - Scrapy crawler
Create the project:
scrapy startproject douban
Generate the spider:
scrapy genspider douban_scrapy douban.com
Define the goal
We’re going to grab the serial numbers, titles, introductions, ratings, reviews, and descriptions of all the movies on movie.douban.com/top250
Open items.py in the douban directory. An Item defines structured data fields that hold the crawled data, much like a dict in Python, but with some additional protection to reduce errors. An Item is defined by creating a class that inherits from scrapy.Item and declaring class attributes of type scrapy.Field. Next, create a DoubanItem class and build the Item model.
items.py
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Movie number
    serial_number = scrapy.Field()
    # Movie title
    movie_name = scrapy.Field()
    # Movie introduction
    introduce = scrapy.Field()
    # Movie stars
    star = scrapy.Field()
    # Movie reviews
    evaluate = scrapy.Field()
    # Movie description
    describe = scrapy.Field()
spiders/douban_scrapy.py
Enter the following command in the project directory to create a crawler named douban_scrapy in the douban/spiders directory and specify the domain to crawl:
scrapy genspider douban_scrapy movie.douban.com
Open douban_scrapy.py in the douban/spiders directory, which already contains the default spider code, and extend it as follows:
# douban_scrapy.py
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanScrapySpider(scrapy.Spider):
    name = 'douban_scrapy'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):  # parsing method
        # movie_list is of type scrapy.selector.unified.SelectorList
        movie_list = response.xpath("//ol[@class ='grid_view']/li")
        # Data extraction
        # self.log('movie_list type: {}'.format(type(movie_list)))
        for i_item in movie_list:
            # Instantiate the item imported from items.py
            douban_item = DoubanItem()
            # Data filtering
            # extract(): returns a list of strings, e.g. ['ABC'] if there is only one match.
            # extract_first(): returns a single string, the first element of that list.
            douban_item['serial_number'] = i_item.xpath(".//div[@class='pic']/em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(".//div[@class='hd']/a/span[1]/text()").extract_first()
            # The introduction may span several text nodes; strip whitespace and join each one
            content = i_item.xpath(".//div[@class='bd']/p[1]/text()").extract()
            for i_content in content:
                contents = "".join(i_content.split())
                douban_item['introduce'] = contents
            douban_item['star'] = i_item.xpath(".//div[@class='star']/span[2]/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='star']/span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(".//p[@class= 'quote']/span/text()").extract_first()
            # Yield the item to the pipeline (parse is a generator)
            yield douban_item
        # Parse the next-page link with xpath
        next_link = response.xpath("//span[@class ='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            # The spider generates the Request for the next page
            yield scrapy.Request('https://movie.douban.com/top250' + next_link, callback=self.parse)
Save the data (pipelines.py)
1. The simplest way for Scrapy to save the scraped data is to output it with -o to a file in one of four formats; the commands are as follows:
JSON format, Unicode encoding by default
scrapy crawl douban_scrapy -o douban.json
JSON lines format, Unicode encoding by default
scrapy crawl douban_scrapy -o douban.jsonl
CSV comma-separated format, can be opened in Excel
scrapy crawl douban_scrapy -o douban.csv
XML format
scrapy crawl douban_scrapy -o douban.xml
2. Store the data into MySQL through a pipeline
pipelines.py
import pymysql
from twisted.enterprise import adbapi


class DoubanPipeline(object):
    # Use Twisted for asynchronous storage
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = {
            'host': "localhost",
            'user': "root",
            'port': 3306,
            'passwd': "root",
            'db': "mystudy",
            'charset': 'utf8',
            'cursorclass': pymysql.cursors.DictCursor,
            'use_unicode': True
        }
        dbpool = adbapi.ConnectionPool('pymysql', **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use Twisted to run the MySQL insert asynchronously
        # The first argument to runInteraction is a function
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions from the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert
        insert_sql = '''
            insert into douban (serial_number, movie_name, introduce, star, evaluate, Mdescribe)
            values (%s, %s, %s, %s, %s, %s);
        '''
        cursor.execute(insert_sql,
                       (item['serial_number'], item['movie_name'], item['introduce'],
                        item['star'], item['evaluate'], item['describe']))
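The insert statement above assumes that a douban table already exists in the mystudy database. A one-off sketch for creating it with pymysql (the column types and sizes below are assumptions, not taken from the original, and can be adjusted as needed):

```python
import pymysql

# Create the table the pipeline writes to; column names follow the insert statement above,
# while the VARCHAR sizes are assumptions.
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       passwd='root', db='mystudy', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute('''
        create table if not exists douban (
            serial_number varchar(16),
            movie_name    varchar(128),
            introduce     varchar(256),
            star          varchar(16),
            evaluate      varchar(64),
            Mdescribe     varchar(256)
        ) default charset=utf8;
    ''')
conn.commit()
conn.close()
```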
Some configuration
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 10
}