This is the 23rd day of my participation in the August More Text Challenge
Life is short, let's learn Python together
Introduction
Scrapy is an open-source and collaborative framework originally designed for page scraping (more precisely, web scraping) to extract data from websites in a fast, simple, and extensible way. Scrapy is now also widely used in areas such as data mining, monitoring, and automated testing, as well as for extracting data from APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
Scrapy is based on the Twisted framework, a popular event-driven Python networking framework. So Scrapy uses non-blocking (aka asynchronous) code to implement concurrency.
Scrapy execution process
Developers only need to write their own code in a few fixed places (most commonly in the spiders).
- The five components
ENGINE: the central coordinator that controls the flow of data between all the other components;
SCHEDULER: decides which URL to crawl next;
DOWNLOADER: downloads web content and returns it to the ENGINE; it is built on Twisted's efficient asynchronous model;
SPIDERS: developer-defined classes used to parse responses, extract items, or send new requests;
ITEM PIPELINES: process items after they have been extracted, including cleaning, validation, and persistence (e.g. saving to a database).
- The two middlewares
Spider middleware: sits between the ENGINE and the SPIDERS; its main job is to handle the input and output of the SPIDERS (rarely used).
Downloader middleware: sits between the ENGINE and the DOWNLOADER; used to add proxies, add headers, or integrate Selenium (see the sketch below).
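For illustration, here is a minimal downloader-middleware sketch. The class name, proxy address, and priority value are assumptions for the example, not something from the original post.

```python
# middlewares.py -- minimal downloader middleware sketch (names and proxy are illustrative)
class CustomHeadersProxyMiddleware:
    def process_request(self, request, spider):
        # add or override a request header before the download happens
        request.headers['User-Agent'] = (
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
            '(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
        )
        # route the request through a proxy (address is a placeholder)
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        # returning None lets Scrapy keep processing the request normally
        return None
```

Enable it in settings.py (the module path assumes a project named myscrapy):

```python
DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.CustomHeadersProxyMiddleware': 543,
}
```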
Installing Scrapy
Installation on Linux/macOS
pip3 install scrapy
Installation on Windows
pip3 install scrapy may work directly on Windows; if it does not, use the following steps.
1. pip3 install wheel # once wheel is installed, packages can be installed from .whl files; wheel files are available at www.lfd.uci.edu/~gohlke/pyt…
2. pip3 install lxml
3. pip3 install pyopenssl
4. Download and install pywin32: sourceforge.net/projects/py…
5. Download the Twisted wheel file: www.lfd.uci.edu/~gohlke/pyt…
6. pip3 install Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7. pip3 install scrapy
After the installation is complete, open cmd and type scrapy to verify that it succeeded (the Scripts folder of the Python interpreter must be added to the PATH environment variable). Once the installation succeeds, you can create Scrapy projects in a directory of your choice.
Creating and running a Scrapy project
Create a project
On the command line, cd into the target directory, then create the crawler project.
Command to create a project: scrapy startproject <project name>
scrapy startproject myscrapy
Creating a crawler file
To create a spider from the command line, first cd myscrapy into the project folder, then run the command that creates the spider.
Create crawler files directly using the terminal in PyCharm.
Command to create a spider file: scrapy genspider <spider name> <start domain>
scrapy genspider chouti dig.chouti.com
This creates a Python file named chouti.py in the spiders folder (a skeleton of it is shown below).
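For reference, the generated chouti.py roughly looks like the following; the exact template output can vary slightly between Scrapy versions.

```python
import scrapy

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'                       # spider name used with `scrapy crawl chouti`
    allowed_domains = ['dig.chouti.com']  # domains the spider is allowed to crawl
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        # parsing logic goes here
        pass
```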
Running the crawler from the terminal
- With run logs
scrapy crawl chouti
- Without run logs
scrapy crawl chouti --nolog
Running crawlers by right-clicking a file
Create a new file, e.g. main.py (any name works), at the same level as the spiders folder.
If you want to execute multiple crawlers, add them one by one.
To run the crawlers listed in the file, right-click main.py and run it directly.
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'chouti', '--nolog'])
execute(['scrapy', 'crawl', 'baidu'])  # add more crawlers one by one as needed
Project directory overview
File descriptions (a typical layout is sketched after the list):
- scrapy.cfg: the project's main configuration, used when deploying Scrapy; the crawler's own settings live in settings.py
- items.py: defines the data models for structured data, similar to Django's Model
- pipelines.py: data-handling behavior, e.g. general persistence of structured data
- settings.py: the configuration file, e.g. recursion depth, concurrency, download delay, etc. Note: option names in the configuration file must be uppercase, otherwise they are ignored
- spiders/: the directory where spider files are created and crawling rules are written
Note: generally, crawler files are named after the domain name of the website
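Assuming the myscrapy project created above, the generated layout typically looks like this:

```
myscrapy/
├── scrapy.cfg
└── myscrapy/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```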
settings.py overview
- By default, Scrapy obeys the robots.txt crawling protocol
- You can change this setting to crawl data without obeying the protocol
ROBOTSTXT_OBEY = False
- Configure the USER_AGENT
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
- If no log level is specified, all run logs are printed; to show only errors, set
LOG_LEVEL = 'ERROR'
Scrapy data parsing ⭐⭐⭐⭐⭐
- XPath selectors
  - Extract text: response.xpath('//a[contains(@class,"link-title")]/text()').extract()
  - Extract an attribute: response.xpath('//a[contains(@class,"link-title")]/@href').extract()
- CSS selectors
  - Extract text: response.css('.link-title::text').extract()
  - Extract an attribute: response.css('.link-title::attr(href)').extract_first()
- Extracting the results (combined in the sketch below)
  - Extract everything into a list: response.xpath('... ').extract()
  - Extract the first match: response.xpath('... ').extract()[0] or response.xpath('... ').extract_first()
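Putting the selectors together, a parse method might look like the following sketch; the link-title class comes from the examples above, and the yielded field names are illustrative.

```python
def parse(self, response):
    # iterate over every matching link and pull out its text and href
    for a in response.css('a.link-title'):
        yield {
            'title': a.css('::text').extract_first(),
            'url': a.css('::attr(href)').extract_first(),
        }
```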
Scrapy persistent storage ⭐⭐⭐⭐⭐
- Scheme 1: the parse method in the spider returns a list of dictionaries (just for understanding)
The data can only be exported to a limited set of file formats, e.g.:
scrapy crawl chouti -o chouti.csv
- Scheme 2: pipelines, storing items in Redis, MySQL, or files
- Write a class in items.py
- Import it in the spider file and instantiate the Item object in the parse method
- Put the data into the item object and return it with the yield keyword (see the sketch after the settings example below)
- Configure the pipeline in settings.py (a lower number means higher priority)
ITEM_PIPELINES = {'firstscrapy.pipelines.ChoutiFilePipeline': 300}
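As a sketch of the item/yield steps above: the ChoutiItem class and the spider code below are assumptions built around the field names used in the MySQL example, and the firstscrapy package name is taken from the ITEM_PIPELINES entry.

```python
# items.py
import scrapy

class ChoutiItem(scrapy.Item):
    # field names match the MySQL pipeline example below
    title = scrapy.Field()
    url = scrapy.Field()
    photo_url = scrapy.Field()
```

```python
# spiders/chouti.py
import scrapy
from firstscrapy.items import ChoutiItem  # package name taken from the ITEM_PIPELINES example

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        for a in response.xpath('//a[contains(@class,"link-title")]'):
            item = ChoutiItem()
            item['title'] = a.xpath('./text()').extract_first()
            item['url'] = a.xpath('./@href').extract_first()
            item['photo_url'] = ''  # placeholder value for illustration
            yield item
```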
- Saving items to MySQL with a pipeline:
import pymysql

class ChoutiMysqlPipeline(object):
    # executed only once, when the spider opens
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host='127.0.0.1',
            user='root',
            password="123",
            database='chouti',
            port=3306)

    # executed only once, when the spider closes
    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        sql = 'insert into article (title,url,photo_url) values(%s,%s,%s)'
        cursor.execute(sql, [item['title'], item['url'], item['photo_url']])
        self.conn.commit()
        # if there are multiple pipelines, the item must be returned here,
        # otherwise the later pipelines will not receive it
        return item
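The ChoutiFilePipeline referenced in ITEM_PIPELINES above is not shown in the post; a minimal file-based version might look like this sketch (the output file name is arbitrary):

```python
# pipelines.py -- a minimal sketch of the file pipeline named in ITEM_PIPELINES
class ChoutiFilePipeline(object):
    def open_spider(self, spider):
        # open the output file once when the spider starts
        self.f = open('chouti.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.f.write('%s %s %s\n' % (item['title'], item['url'], item['photo_url']))
        # return the item so that any later pipelines still receive it
        return item
```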
Conclusion
This article was first published on the WeChat official account Program Yuan Xiaozhuang, and simultaneously on Juejin (Nuggets).
Writing is not easy; please credit the source when reposting. If you are passing by, please give a little like before you go (╹▽╹)