This is the 12th day of my participation in the Gwen Challenge.
Scrapy is a Python framework for crawling websites and extracting structured data.
How the Scrapy framework works
Scrapy Engine: the engine controls the flow of data among all components of the system and triggers events when the corresponding actions occur.
Scheduler: it receives Requests from the engine, sorts and enqueues them, and returns them to the engine when the engine asks for them.
Downloader: it downloads all Requests sent by the Scrapy Engine and returns the fetched Responses to the Scrapy Engine, which hands them to the Spider for processing.
Spider: it processes all Responses, analyzes them and extracts data, obtains the data needed for the Item fields, and submits any URLs to be followed up to the engine, which puts them into the Scheduler again.
Item Pipeline: this is where Items produced by the Spider are post-processed (detailed analysis, filtering, storage, etc.).
Downloader Middlewares: specific hooks between the engine and the Downloader that process the Responses the Downloader passes to the engine.
Spider Middlewares: specific hooks between the engine and the Spider that process the Spider's input (Responses) and output (Items and Requests).
The whole crawling process is shown as follows:
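To make these roles concrete, here is a minimal spider sketch; the site (quotes.toscrape.com), CSS selectors, and field names are assumptions for illustration only, not part of the project built below:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # The engine turns these URLs into Requests and hands them to the Scheduler.
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # The Downloader fetched this Response; the Spider now extracts data from it.
        for quote in response.css("div.quote"):
            # Yielded items go back to the engine and on to the Item Pipeline.
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow-up Requests go back to the engine and into the Scheduler again.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

You could run this single file with scrapy runspider; the yielded dicts would then flow through any enabled Item Pipeline.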
Hands-on walkthrough
1. Install Scrapy
pip install Scrapy
Check whether the installation is successful:
scrapy
If Scrapy's version information is displayed, the installation was successful.
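You can also confirm the installation from Python itself; this is just a quick sanity check that the package imports:

```python
import scrapy

# Prints the installed Scrapy version string.
print(scrapy.__version__)
```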
2. Create a Scrapy project
==scrapy startproject <project name>==
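For example, to create the mySpider project used in the rest of this article:

scrapy startproject mySpider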
As you can see from the command's output, the new project is created at /Users/tjm/mySpider. The project directory looks like this:
mySpider/
|-- mySpider/            (the project's Python module)
|   |-- spiders/
|   |   `-- __init__.py
|   |-- __init__.py
|   |-- items.py          (defines the fields to be crawled)
|   |-- middlewares.py    (the project's middlewares)
|   |-- pipelines.py      (process the crawled data here: save it to a MySQL database, export it as a .csv file, etc.)
|   `-- settings.py       (project settings file)
`-- scrapy.cfg            (project configuration file)
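items.py typically defines an Item class listing the fields you plan to extract. A minimal sketch, assuming hypothetical title and link fields (the generated class name may differ from MyspiderItem):

```python
import scrapy


class MyspiderItem(scrapy.Item):
    # Each scrapy.Field() declares one piece of data the spider will collect.
    title = scrapy.Field()
    link = scrapy.Field()
```

The spider then yields instances of this item, and pipelines.py decides how they are stored (a MySQL database, a .csv file, etc.).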