This is the 8th day of my participation in the August Text Challenge.
I have received feedback from many fans in private chat: "The Scrapy framework is difficult to learn. I have covered nearly all the basic crawler libraries and done plenty of hands-on practice, but many outsourcing jobs and employers require skilled use of Scrapy, and I don't know where to start!"
So I worked hard for three days and three nights to compile and summarize this 20,000-word article, with a complete Scrapy learning route attached at the end. If you read this article carefully and form a clear impression of Scrapy, then work through the learning route at the end, the Scrapy framework will come easily to you!
Design purpose: Scrapy is a framework for crawling network data and extracting structured data. It uses the Twisted asynchronous networking framework to speed up downloads.
Official documentation: https://docs.scrapy.org/
Installation: pip install scrapy
Scrapy project development process
1. Create a project
2. Generate a crawler
3. Extract data: implement the data collection in the spider according to the structure of the website
4. Save data: use pipelines for follow-up processing and saving of the data
Description (the specific role of each Scrapy object):
- Request object: consists of url, method, post_data, headers, and so on
- Response object: consists of url, body, status, headers, and so on
- Item data object: essentially a dictionary
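As a rough illustration (a minimal sketch of my own, not code from this article; the header value and HTML body are made up, and the URL is the one used later in this tutorial), these objects can be constructed and inspected directly:

```python
import scrapy
from scrapy.http import HtmlResponse

# A Request carries the url, method, post data (body), headers, and so on
req = scrapy.Request(
    url='http://www.itcast.cn/channel/teacher.shtml',
    method='GET',
    headers={'User-Agent': 'Mozilla/5.0'},
)

# A Response carries the url, body, status, headers, and so on
resp = HtmlResponse(url=req.url, status=200, body=b'<html></html>', request=req)

# An item is essentially a dictionary: read and written by key
item = {'name': 'example teacher'}

print(req.url, req.method, resp.status, item['name'])
```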
Description of the running flow:
- The start URL in the spider is constructed into a Request object -> spider middleware -> engine -> scheduler
- The scheduler takes out the request -> engine -> downloader middleware -> downloader
- The downloader sends the request and gets a response -> downloader middleware -> engine -> spider middleware -> spider
- The spider extracts URL addresses and assembles them into Request objects -> spider middleware -> engine -> scheduler; repeat step 2
- The spider extracts data -> engine -> pipeline, which processes and saves the data
The spider middleware and the downloader middleware differ only in where their logic runs in the flow; their functions overlap, for example replacing the User-Agent.
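A minimal sketch of that idea (my own illustration, not code from this article; the class name and UA strings are made up) is a downloader middleware in middlewares.py that swaps in a random User-Agent:

```python
# middlewares.py (hypothetical example)
import random


class RandomUserAgentMiddleware:
    """Downloader middleware that replaces the User-Agent of every request."""

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # Called for every request that passes through the downloader middleware
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        # Returning None lets the request continue on to the downloader
        return None
```

To take effect it would also need to be registered under DOWNLOADER_MIDDLEWARES in settings.py.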
1. Create a project
Example:

```
scrapy startproject myspider
```
The resulting directories and files are as follows:
2. Create a crawler file
```
cd myspider
scrapy genspider itcast itcast.cn
```
Crawler name: used as a parameter when running the crawler. Allowed domains: the crawling scope set for the crawler; once set, it is used to filter the URLs to be crawled. If a URL to be crawled does not belong to the allowed domains, it is filtered out.
3. Run the scrapy crawler
Example: scrapy crawl itcast
Write the itcast.py crawler file:
```python
# -*- coding: utf-8 -*-
import scrapy


class ItcastSpider(scrapy.Spider):
    # Crawler name, used as a parameter when running the crawler
    name = 'itcast'
    # Check the domains allowed to crawl
    allowed_domains = ['itcast.cn']
    # Modify the start url in the settings
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#ajacaee']

    # Data extraction method: receives the response from the downloader middleware
    # and defines the operations related to the site
    def parse(self, response):
        # Get all teacher nodes
        t_list = response.xpath('//div[@class="li_txt"]')
        print(t_list)
        # Traverse the list of teacher nodes
        for teacher in t_list:
            # The xpath method returns a list of Selector objects; extract_first()
            # pulls the data out of the first Selector in the list
            tea_dist = {}
            tea_dist['name'] = teacher.xpath('./h3/text()').extract_first()
            tea_dist['title'] = teacher.xpath('./h4/text()').extract_first()
            tea_dist['desc'] = teacher.xpath('./p/text()').extract_first()
            yield tea_dist
```
Run it and you will find that it works!
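As a side note, because parse yields dictionaries, Scrapy's built-in feed export can also dump the results straight to a file from the command line, e.g. scrapy crawl itcast -o teachers.json (the output filename here is just an example).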
4. After data modeling (once the data to be crawled has been defined, the pipeline is used for persistence), rewrite the spider to yield Item objects:
```python
# -*- coding: utf-8 -*-
import scrapy
from ..items import UbuntuItem


class ItcastSpider(scrapy.Spider):
    # Crawler name, used as a parameter when running the crawler
    name = 'itcast'
    # Check the domains allowed to crawl
    allowed_domains = ['itcast.cn']
    # Modify the start url in the settings
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#ajacaee']

    # Data extraction method: receives the response from the downloader middleware
    # and defines the operations related to the site
    def parse(self, response):
        # Get all teacher nodes
        t_list = response.xpath('//div[@class="li_txt"]')
        print(t_list)
        # Traverse the list of teacher nodes
        for teacher in t_list:
            # The xpath method returns a list of Selector objects; extract_first()
            # pulls the data out of the first Selector in the list
            item = UbuntuItem()
            item['name'] = teacher.xpath('./h3/text()').extract_first()
            item['title'] = teacher.xpath('./h4/text()').extract_first()
            item['desc'] = teacher.xpath('./p/text()').extract_first()
            yield item
```
Notes:
1. There must be a parsing method named parse in the scrapy.Spider subclass.
2. If the site structure is complex, you can also define additional custom parsing methods.
3. If a URL extracted in a parsing method is to be requested, it must belong to allowed_domains, but the URLs in start_urls are not subject to this restriction.
5. parse() returns data with yield; note that the only objects yield can pass out of a parsing method are BaseItem, Request, dict, and None. A parsing method that yields both items and follow-up requests is sketched below.
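A minimal sketch of that last note (my own illustration, not code from this article; it reuses the UbuntuItem from this tutorial, while the "next page" XPath is invented purely for demonstration):

```python
import scrapy

from ..items import UbuntuItem  # the Item defined later in this article


class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        for teacher in response.xpath('//div[@class="li_txt"]'):
            item = UbuntuItem()
            item['name'] = teacher.xpath('./h3/text()').extract_first()
            yield item  # a dict or Item is handed to the pipelines

        # Invented "next page" selector, purely for illustration
        next_url = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_url is not None:
            # A Request goes back to the scheduler; its URL must match allowed_domains
            yield response.follow(next_url, callback=self.parse)
```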
How to locate elements and extract data and attribute values in a Scrapy crawler:
1. The response.xpath method returns a list-like type containing Selector objects; it can be operated on like a list, but it has some additional methods.
2. Additional method extract(): returns a list of strings.
3. Additional method extract_first(): returns the first string in the list; if the list is empty, it returns None.
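A quick sketch of the difference (the HTML fragment below is made up to stand in for the real teacher page):

```python
from scrapy import Selector

# A tiny HTML fragment standing in for the real teacher page
html = '''
<div class="li_txt"><h3>Teacher A</h3></div>
<div class="li_txt"><h3>Teacher B</h3></div>
'''

names = Selector(text=html).xpath('//div[@class="li_txt"]/h3/text()')

print(names.extract())        # ['Teacher A', 'Teacher B']  -> a list of strings
print(names.extract_first())  # 'Teacher A'                 -> the first string, or None if empty
```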
Common attributes of the response object:
- response.url: URL of the current response
- response.request.url: URL of the request corresponding to the current response
- response.headers: response headers
- response.request.headers: headers of the request corresponding to the current response
- response.body: the response body, i.e. the HTML code (bytes type)
- response.status: the response status code
5. Save data with pipelines
Define operations on the data in the pipelines.py file:
1. Define a pipeline class.
2. Override the pipeline class's process_item method.
3. process_item must return the item to the engine after processing it.
Updated pipelines.py:
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class UbuntuPipeline(object):

    def __init__(self):
        self.file = open('itcast.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # This fixed-name method runs once for every item yielded by the spider
        # Cast the item object to a dict (this conversion is specific to Scrapy items)
        item = dict(item)
        # 1. Serialize the dictionary data
        # ensure_ascii=False keeps non-ASCII text readable instead of escaping it (defaults to True)
        json_data = json.dumps(item, ensure_ascii=False, indent=2) + ',\n'
        # 2. Write the data to the file
        self.file.write(json_data)
        # By default, once the pipeline is done the item must be returned to the engine
        return item

    def __del__(self):
        self.file.close()
```
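As an aside, Scrapy pipelines also provide open_spider and close_spider hooks, which are a more explicit place to open and close the file than __init__ and __del__. The variant below is my own sketch of that approach, not the original author's code:

```python
import json


class UbuntuJsonPipeline(object):
    """Hypothetical variant of the pipeline above using Scrapy's lifecycle hooks."""

    def open_spider(self, spider):
        # Called once when the spider is opened
        self.file = open('itcast.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider is closed
        self.file.close()

    def process_item(self, item, spider):
        json_data = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(json_data)
        return item
```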
6. Configure settings.py to enable the pipeline
In the settings.py file, uncomment the ITEM_PIPELINES code as follows:
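A minimal sketch of what that looks like, assuming the project name myspider and the UbuntuPipeline class used in this article (the number is a priority; lower numbers run earlier, and 300 is the conventional default):

```python
# settings.py
ITEM_PIPELINES = {
    'myspider.pipelines.UbuntuPipeline': 300,
}
```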
7. Scrapy data modeling and requests
(Data modeling is usually done in items.py during the project)
Why model the data?
1. Defining the item means planning in advance which fields to crawl, which prevents typos: once the definition is done, the system checks automatically at runtime and reports an error if a non-matching field is used.
2. Together with the comments, you can see clearly which fields are being crawled; fields that are not defined cannot be crawled. When there are only a few target fields, a dict can be used instead.
3. Some Scrapy-specific components require Item support, such as Scrapy's ImagesPipeline class.
Operate in the items.py file:
```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class UbuntuItem(scrapy.Item):
    # Name of the lecturer
    name = scrapy.Field()
    # Title of the lecturer
    title = scrapy.Field()
    # Motto of the lecturer
    desc = scrapy.Field()
```
Note: in the line `from ..items import UbuntuItem`, make sure the import path of the Item is correct, and you can ignore the error flagged by PyCharm.
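To illustrate the automatic field check mentioned in "Why model?" above (a small sketch of my own, not from the original article; the import assumes the myspider project layout used in this tutorial), assigning to a field that was never declared on the Item raises a KeyError:

```python
from myspider.items import UbuntuItem  # assumes this article's project layout

item = UbuntuItem()
item['name'] = 'Teacher Wang'   # fine: 'name' is a declared Field

try:
    item['age'] = 30            # 'age' was never declared on the Item
except KeyError as e:
    print(e)                    # Scrapy rejects the undeclared field
```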
8. Set the user-agent
Find the following code in the settings.py file, uncomment it, and add the UA:
```python
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36',
}
```
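Alternatively (a standard Scrapy setting, mentioned here as an aside rather than something from the original article), the UA can be set globally on its own:

```python
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
```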
9. At this point, an entry-level Scrapy crawler is complete.
Now go to your project directory and run scrapy crawl itcast to start the crawler!
Summary of development process:
- Create a project: scrapy startproject <project name>
- Define the goal: do the modeling in the items.py file
- Create the crawler:
  - Create: scrapy genspider <crawler name> <allowed domain>
  - Complete the crawler: modify start_urls, check and modify allowed_domains, write the parsing method
- Store data: define pipelines for data processing in the pipelines.py file; register the enabled pipeline in the settings.py file
In The End!
Start from now on, stick with it, and make a little progress every day; in the near future, you will thank yourself for your efforts!
This blogger will keep updating the basic crawler column and the hands-on crawler column. Friends who have read this article carefully are welcome to like, bookmark, and comment with your thoughts, and to follow this blogger to read more crawler articles in the days ahead!
If there are any mistakes or inappropriate wording, please point them out in the comments, thank you! If you want to reprint this article, please contact me for consent and credit the source and this blogger's name, thank you!