Introduction to the Scrapy framework
Scrapy is a fast, high-level web crawling and web scraping framework written in Python, used to crawl websites and extract structured data from their pages.
Scrapy is versatile and can be used for data mining, monitoring, and automated testing.
One of the best things about Scrapy is that it is a framework anyone can easily modify to fit their own needs. It also provides several spider base classes, such as BaseSpider and a sitemap spider, and newer versions add better support for crawling modern (Web 2.0) sites.
For more about Scrapy, see scrapy.org.
How the Scrapy framework works
You may remember that when we wrote crawlers by hand, we usually split the work into three functions:
# Get web page content
def get_html():
    pass

# Parse the web page
def parse_html():
    pass

# Save the data
def save_data():
    pass
These three functions do not call each other directly; they are tied together and invoked by a main function.
Scrapy essentially does the same thing, except that it puts each of these parts into its own file and coordinates them through the Scrapy engine.
Here is a brief description of each component:
- **Scrapy Engine**: controls the flow of data between all components of the system and triggers events when certain actions occur.
- **Scheduler**: receives requests from the engine and enqueues them so they can be handed back to the engine when it asks for them.
- **Downloader**: fetches the page data and feeds it to the engine, which passes it on to the spiders.
- **Spiders**: user-written classes that parse responses and extract items or additional URLs to follow. Each spider typically handles one specific website.
- **Item Pipeline**: processes the items extracted by the spiders. Typical tasks are cleaning, validation, and persistence (for example, saving items to a database).
- **Downloader Middlewares**: specific hooks between the engine and the downloader that process requests on their way out and responses on their way back, providing a convenient mechanism to extend Scrapy with custom code. A minimal spider sketch showing how these pieces fit together follows this list.
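To make the division of labour concrete, here is a minimal spider sketch (the site, selectors, and field name are illustrative assumptions, not taken from the original article). Notice that the spider only yields items and new requests; the engine, scheduler, downloader, and pipelines take care of everything else.

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # The spider only describes WHAT to crawl; the engine, scheduler,
    # downloader, and pipelines decide how and when it happens.
    name = "example"
    start_urls = ["https://example.com/"]  # illustrative entry URL

    def parse(self, response):
        # An extracted item: the engine hands it to the Item Pipeline.
        yield {"title": response.css("title::text").get()}

        # Follow-up links: the engine hands them back to the Scheduler.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```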
Scrapy workflow
When we write code with Scrapy and run it, you can imagine the following conversation taking place between its components.
Engine: Hey Spider, I'm bored. Let's go crawl something!
Spider: Sounds good, I've been meaning to. How about we crawl the XXX website today?
Engine: No problem, send over the entry URL!
Spider: Sure, the entry URL is www.xxx.com.
Engine: Scheduler, here are some requests. Please sort them into your queue for me.
Scheduler: Here you go, Engine, these are the requests I have processed.
Engine: Downloader, please download these requests for me according to the downloader middleware settings.
Downloader: OK, here is the downloaded content. Oh, and one request failed to download; Engine, please tell the Scheduler so we can retry it later.
Engine: Spider, here is the downloaded content, already processed by the downloader middleware. Please handle it.
Spider: Engine, I've finished processing my data, and I have two kinds of results: these are the URLs I need to follow up on, and this is the item data I extracted.
Engine: Pipeline, I have an item here. Please process it for me.
Engine: Scheduler, these are the URLs that need to be followed up. Please queue them for me. (The cycle then repeats from the scheduler step until there is nothing left to fetch.)
Installing the Scrapy framework
Scrapy requires Python 3.6+. If you are using Anaconda or Miniconda, you can install Scrapy by running:
conda install -c conda-forge scrapy
Alternatively, if you are already familiar with installing Python packages, you can install it with pip:
pip install scrapy
Scrapy's dependencies
Scrapy is written in pure Python and relies on a few key Python packages:
- lxml: an efficient XML and HTML parser.
- parsel: an HTML/XML data extraction library written on top of lxml.
- w3lib: a multi-purpose helper for dealing with URLs and web page encodings.
- Twisted: an asynchronous networking framework.
- cryptography and pyOpenSSL: handle various network-level security needs.
After the installation is complete, you can verify it by entering the following at the command line:
scrapy version
Using the Scrapy framework
Introduction to Scrapy commands
Scrapy commands are classified into two types: global commands and project commands.
Global commands: can be used anywhere.
Project commands: can only be used inside a crawler project.
Common commands for scrapy are:
scrapy startproject <name>        # create a new crawler project
scrapy genspider <name> <domain>  # create a spider file from a predefined template
scrapy crawl <spider>             # run a spider class inside a project
scrapy runspider <file.py>        # run a standalone spider file
For more commands, we can type the following at the command line:
scrapy -h
Command usage format
scrapy <command> [options] [args]

bench         # run a quick benchmark to test local hardware performance
genspider     # create a new spider file from a predefined template
runspider     # run a standalone spider file, e.g. scrapy runspider abc.py
settings      # get project settings values
shell         # open an interactive shell for crawler debugging, e.g. scrapy shell https://www.baidu.com
startproject  # create a new crawler project
view          # download a web page and open it in the default browser (as Scrapy sees it), e.g. scrapy view https://www.baidu.com
Basic steps for using Scrapy
- Create a crawler project with the scrapy startproject command.
- Enter the project directory and create a spider class.
- Open items.py and define your own Item container class.
- In the custom spider class, parse the Response and package the extracted data into Items.
- Use the Item Pipeline to clean, validate, de-duplicate, and store the parsed Item data.
- Run the crawl command to start crawling.
Scrapy in practice
No amount of theory beats practice, so let's learn how to write a Scrapy crawler through a simple example.
In this hands-on example, we will scrape housing listings from 5i5j ("I love my home"), a real-estate listings site.
Here’s the link:
https://fang.5i5j.com/
First, in the directory where you want to create the crawler project, type:
scrapy startproject fangdemo
fangdemo
├── fangdemo
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py        # Item definitions: the data structures to scrape
│   ├── middlewares.py  # Spider and Downloader middleware implementations
│   ├── pipelines.py    # Item Pipeline implementations: the data pipelines
│   ├── settings.py     # global project configuration
│   └── spiders         # contains the spiders; each spider gets its own file
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg          # deployment configuration: config file path and deployment info
Once the project is created, Scrapy prints a hint about the next steps (changing into the project directory and generating a spider with something like scrapy genspider fang fang.5i5j.com), and we simply follow it.
After doing this, you will see a fang.py file appear under the spiders folder; this is our crawler file.
Next, we can write some code in fang.py to test it.
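The original shows this code as a screenshot. The sketch below is an assumption of what fang.py might contain at this stage: a spider generated from the default template that simply prints the response status and page title to confirm the crawl works.

```python
import scrapy


class FangSpider(scrapy.Spider):
    name = "fang"  # the name used by "scrapy crawl fang"
    allowed_domains = ["fang.5i5j.com"]
    start_urls = ["https://fang.5i5j.com/"]

    def parse(self, response):
        # A quick sanity check: print the status code and the page title.
        print(response.status)
        print(response.css("title::text").get())
```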
After writing, you can enter:
scrapy crawl fang
In the output you will see that Scrapy does a lot of preparation before the requests are made, then performs the crawl, and finally closes the spider.
Of course, once your code runs without errors, you can hide the log output with the following command:
scrapy crawl fang --nolog
If you remember the earlier diagram, a running spider produces Items, so we now open items.py and define the Item class whose objects we will instantiate.
The code looks like this:
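The original shows this code as a screenshot; a plausible sketch, assuming we want the title, address, and price fields used later, would be:

```python
import scrapy


class FangdemoItem(scrapy.Item):
    # Container for the fields we want to scrape (field names are illustrative).
    title = scrapy.Field()    # listing title
    address = scrapy.Field()  # location / address
    price = scrapy.Field()    # price
```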
For example, we now want to get the title, address, and price of the page, as shown below:
Then we can simply use CSS selectors to extract the data.
The code looks like this:
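Again, the original code is shown only as a screenshot. The sketch below is an assumption: the CSS selectors are placeholders and would need to be replaced with the class names actually used on fang.5i5j.com.

```python
import scrapy

from ..items import FangdemoItem


class FangSpider(scrapy.Spider):
    name = "fang"
    allowed_domains = ["fang.5i5j.com"]
    start_urls = ["https://fang.5i5j.com/"]

    def parse(self, response):
        # NOTE: the selectors below are placeholders; inspect the real page
        # and replace them with the site's actual markup.
        for house in response.css("li.house-item"):
            item = FangdemoItem()
            item["title"] = house.css("h3.title::text").get()
            item["address"] = house.css("p.address::text").get()
            item["price"] = house.css("span.price::text").get()
            print(item)
```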
In this way, we can extract the data we need.
Next, we change the print(item) on the last line to yield item, so that the items are sent on to the pipeline.
However, the Item Pipeline is disabled by default, so we need to open settings.py and enable it.
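A minimal sketch of that step, assuming the default pipeline class generated by startproject: uncomment (or add) the ITEM_PIPELINES setting in settings.py and give the pipeline something simple to do, such as logging each item.

```python
# settings.py: enable the pipeline (the number is its priority, 0-1000).
ITEM_PIPELINES = {
    "fangdemo.pipelines.FangdemoPipeline": 300,
}


# pipelines.py: a minimal pipeline that just logs each item it receives.
class FangdemoPipeline:
    def process_item(self, item, spider):
        spider.logger.info("Got item: %s", dict(item))
        return item  # return the item so later pipelines can also use it
```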
That wraps up this introduction to Scrapy; by now you should have a basic understanding of how it works.
Final words
Nothing can be accomplished overnight; that is true of life, and it is true of learning!
So what good is a three-day or seven-day crash course?
Only persistence leads to success!
Biting Books says:
Every word of this article was typed out with my whole heart; I only hope to live up to everyone who follows me. Click "like" at the end of the article to let me know that you are also working hard on your studies.
The way ahead is so long without ending, yet high and low I’ll search with my will unbending.
I am Biting Books, someone who concentrates on learning. The more you know, the more you realize you don't know. See you next time for more exciting content!