This is the 23rd day of my participation in the August More Text Challenge
Life is short, let's learn Python together
Introduction
Scrapy is an open-source and collaborative framework originally designed for page scraping (more precisely, web scraping) to extract data from websites in a fast, simple, and extensible way. Scrapy is now also widely used in areas such as data mining, monitoring, and automated testing, as well as for extracting data from APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
Scrapy is based on the Twisted framework, a popular event-driven Python networking framework. So Scrapy uses non-blocking (aka asynchronous) code to implement concurrency.
Scrapy execution process
Developers only need to write their own code in a few fixed places (most commonly in the spiders).
- The five components
ENGINE: the central coordinator that controls the flow of data between all the other components;
SCHEDULER: decides which URL to crawl next;
DOWNLOADER: downloads web content and returns it to the ENGINE; it is built on Twisted's efficient asynchronous model;
SPIDERS: developer-defined classes used to parse responses, extract items, or send new requests;
ITEM PIPELINES: process items after they have been extracted, including cleaning, validation, and persistence (e.g. saving to a database).
- The two middlewares
Spider middleware: sits between the ENGINE and the SPIDERS; its main job is to handle the input and output of the SPIDERS (rarely used).
Downloader middleware: sits between the ENGINE and the DOWNLOADER; used to add proxies, add headers, or integrate Selenium (see the sketch below).
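For illustration, here is a minimal downloader-middleware sketch. The class name, proxy address, and priority value are assumptions for the example, not something from the original post.

```python
# middlewares.py -- minimal downloader middleware sketch (names and proxy are illustrative)
class CustomHeadersProxyMiddleware:
    def process_request(self, request, spider):
        # add or override a request header before the download happens
        request.headers['User-Agent'] = (
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
            '(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
        )
        # route the request through a proxy (address is a placeholder)
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        # returning None lets Scrapy keep processing the request normally
        return None
```

Enable it in settings.py (the module path assumes a project named myscrapy):

```python
DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.CustomHeadersProxyMiddleware': 543,
}
```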
Installing Scrapy
Installation on Linux/macOS
pip3 install scrapy
Installation on Windows
pip3 install scrapy may work directly on Windows; if it does not, use the following steps.
1. pip3 install wheel # once wheel is installed, packages can be installed from .whl files; wheel files are available at www.lfd.uci.edu/~gohlke/pyt…
2. pip3 install lxml
3. pip3 install pyopenssl
4. Download and install pywin32: sourceforge.net/projects/py…
5. Download the Twisted wheel file: www.lfd.uci.edu/~gohlke/pyt…
6. pip3 install Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7. pip3 install scrapy
After the installation is complete, open cmd and type scrapy to verify that it succeeded (the Scripts folder of the Python interpreter must be added to the PATH environment variable). Once the installation succeeds, you can create Scrapy projects in a directory of your choice.
Creating and running a Scrapy project
Create a project
On the command line, cd into the target directory, then create the crawler project.
Command to create a project: scrapy startproject <project name>
scrapy startproject myscrapy
Creating a crawler file
To create a spider from the command line, first cd myscrapy into the project folder, then run the command that creates the spider.
Create crawler files directly using the terminal in PyCharm.
Command to create a spider file: scrapy genspider <spider name> <start domain>
scrapy genspider chouti dig.chouti.com
This creates a Python file named chouti.py in the spiders folder (a skeleton of it is shown below).
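For reference, the generated chouti.py roughly looks like the following; the exact template output can vary slightly between Scrapy versions.

```python
import scrapy

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'                       # spider name used with `scrapy crawl chouti`
    allowed_domains = ['dig.chouti.com']  # domains the spider is allowed to crawl
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        # parsing logic goes here
        pass
```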
Running the crawler from the terminal
- With run logs
scrapy crawl chouti
- Without run logs
scrapy crawl chouti --nolog
Running crawlers by right-clicking a file
Create a new file, e.g. main.py (any name works), at the same level as the spiders folder.
If you want to execute multiple crawlers, add them one by one.
To run the crawlers listed in the file, right-click main.py and run it directly.
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'chouti', '--nolog'])
execute(['scrapy', 'crawl', 'baidu'])  # add more crawlers one by one as needed
Project directory overview
File descriptions (a typical layout is sketched after the list):
- scrapy.cfg: the project's main configuration, used when deploying Scrapy; the crawler's own settings live in settings.py
- items.py: defines the data models for structured data, similar to Django's Model
- pipelines.py: data-handling behavior, e.g. general persistence of structured data
- settings.py: the configuration file, e.g. recursion depth, concurrency, download delay, etc. Note: option names in the configuration file must be uppercase, otherwise they are ignored
- spiders/: the directory where spider files are created and crawling rules are written
Note: generally, crawler files are named after the domain name of the website
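Assuming the myscrapy project created above, the generated layout typically looks like this:

```
myscrapy/
├── scrapy.cfg
└── myscrapy/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```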
settings.py overview
- By default, Scrapy obeys the robots.txt crawling protocol
- You can change this setting to crawl data without obeying the protocol
ROBOTSTXT_OBEY = False
- Configure the USER_AGENT
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
- If no log level is specified, all run logs are printed; to show only errors, set
LOG_LEVEL = 'ERROR'
Scrapy data parsing ⭐⭐⭐⭐⭐
- XPath selectors
  - Extract text: response.xpath('//a[contains(@class,"link-title")]/text()').extract()
  - Extract an attribute: response.xpath('//a[contains(@class,"link-title")]/@href').extract()
- CSS selectors
  - Extract text: response.css('.link-title::text').extract()
  - Extract an attribute: response.css('.link-title::attr(href)').extract_first()
- Extracting the results (combined in the sketch below)
  - Extract everything into a list: response.xpath('... ').extract()
  - Extract the first match: response.xpath('... ').extract()[0] or response.xpath('... ').extract_first()
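Putting the selectors together, a parse method might look like the following sketch; the link-title class comes from the examples above, and the yielded field names are illustrative.

```python
def parse(self, response):
    # iterate over every matching link and pull out its text and href
    for a in response.css('a.link-title'):
        yield {
            'title': a.css('::text').extract_first(),
            'url': a.css('::attr(href)').extract_first(),
        }
```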
Scrapy persistent storage ⭐⭐⭐⭐⭐
- Scheme 1: the parse method in the spider returns a list of dictionaries (just for understanding)
The data can only be exported to a limited set of file formats, e.g.:
scrapy crawl chouti -o chouti.csv
- Scheme 2: pipelines, storing items in Redis, MySQL, or files
- Write a class in items.py
- Import it in the spider file and instantiate the Item object in the parse method
- Put the data into the item object and return it with the yield keyword (see the sketch after the settings example below)
- Configure the pipeline in settings.py (a lower number means higher priority)
ITEM_PIPELINES = {'firstscrapy.pipelines.ChoutiFilePipeline': 300}
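As a sketch of the item/yield steps above: the ChoutiItem class and the spider code below are assumptions built around the field names used in the MySQL example, and the firstscrapy package name is taken from the ITEM_PIPELINES entry.

```python
# items.py
import scrapy

class ChoutiItem(scrapy.Item):
    # field names match the MySQL pipeline example below
    title = scrapy.Field()
    url = scrapy.Field()
    photo_url = scrapy.Field()
```

```python
# spiders/chouti.py
import scrapy
from firstscrapy.items import ChoutiItem  # package name taken from the ITEM_PIPELINES example

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        for a in response.xpath('//a[contains(@class,"link-title")]'):
            item = ChoutiItem()
            item['title'] = a.xpath('./text()').extract_first()
            item['url'] = a.xpath('./@href').extract_first()
            item['photo_url'] = ''  # placeholder value for illustration
            yield item
```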
- Saving items to MySQL with a pipeline:
import pymysql

class ChoutiMysqlPipeline(object):
    # executed only once, when the spider opens
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host='127.0.0.1',
            user='root',
            password="123",
            database='chouti',
            port=3306)

    # executed only once, when the spider closes
    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        sql = 'insert into article (title,url,photo_url) values(%s,%s,%s)'
        cursor.execute(sql, [item['title'], item['url'], item['photo_url']])
        self.conn.commit()
        # if there are multiple pipelines, the item must be returned here,
        # otherwise the later pipelines will not receive it
        return item
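The ChoutiFilePipeline referenced in ITEM_PIPELINES above is not shown in the post; a minimal file-based version might look like this sketch (the output file name is arbitrary):

```python
# pipelines.py -- a minimal sketch of the file pipeline named in ITEM_PIPELINES
class ChoutiFilePipeline(object):
    def open_spider(self, spider):
        # open the output file once when the spider starts
        self.f = open('chouti.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.f.write('%s %s %s\n' % (item['title'], item['url'], item['photo_url']))
        # return the item so that any later pipelines still receive it
        return item
```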
Conclusion
This article was first published on the WeChat official account Program Yuan Xiaozhuang, and simultaneously on Juejin (Nuggets).
Writing is not easy; please credit the source when reposting. If you are passing by, please give a little like before you go (╹▽╹)