The settings.py file is one of the most important files in a Scrapy project because it holds most of the project's configuration. This post explains how to configure settings.py based on the official manual and adds some extra notes.
Five levels of Settings
- Highest priority – the command line, e.g. `scrapy crawl my_spider -s LOG_LEVEL=WARNING`;
- Priority 2 – the spider's own settings, i.e. the `custom_settings` attribute set in the spider file `xxx.py`;
- Priority 3 – the project settings module, i.e. the configuration in `settings.py`;
- Priority 4 – the `default_settings` attribute of a command;
- Priority 5 – the configuration in the `default_settings.py` file.
Settings can be read through the `from_crawler` class method, which is available in spiders as well as in middlewares, pipelines, and extensions.
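As a quick illustration, here is a minimal sketch of reading a setting inside an item pipeline via `from_crawler`; the pipeline class and the `MONGO_URI` setting name are made up for this example:

```python
class SaveToDbPipeline:
    # hypothetical pipeline that pulls its connection string from Settings

    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the merged Settings object;
        # get()/getint()/getbool() read individual values
        return cls(mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'))
```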
Reading a setting from the command line is also very simple and was covered in the previous post. The command format is as follows:

```
scrapy settings --get <setting name>
```
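For example, in the `lq` project built later in this post, querying the bot name would look like this (the output simply echoes the configured value):

```
scrapy settings --get BOT_NAME
# prints: lq
```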
Common Settings options
Basic configuration
- `BOT_NAME`: the crawler (bot) name;
- `SPIDER_MODULES`: the list of modules where Scrapy looks for spiders;
- `NEWSPIDER_MODULE`: the module in which the `genspider` command creates new spiders.
Logging
Scrapy logging uses the same five levels as Python's `logging` module. The threshold is set with `LOG_LEVEL`; from lowest to highest the levels are DEBUG (the default), INFO, WARNING, ERROR, and CRITICAL. Other logging settings are as follows:
- `LOGSTATS_INTERVAL`: the interval at which crawl statistics are logged; the default is 60 seconds and can be lowered, e.g. to 5 seconds;
- `LOG_FILE`: path of the log file;
- `LOG_ENABLED`: whether logging is enabled; with it disabled, the crawler outputs nothing;
- `LOG_ENCODING`: log encoding;
- `LOG_FORMAT`: log format string; see the `logging` module for the syntax;
- `LOG_DATEFORMAT`: same idea, responsible for formatting the date/time.
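As an illustration, a settings.py fragment that exercises these options might look like the sketch below; the values are examples, not recommendations:

```python
# settings.py -- logging sketch (illustrative values)
LOG_ENABLED = True
LOG_LEVEL = 'INFO'                 # hide DEBUG noise
LOG_FILE = 'crawl.log'             # write the log to a file instead of the console
LOG_ENCODING = 'utf-8'
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
LOGSTATS_INTERVAL = 5              # report crawl statistics every 5 seconds instead of 60
```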
Statistics
- `STATS_DUMP`: enabled by default; when the crawl finishes, the collected statistics are dumped to the log;
- `DOWNLOADER_STATS`: enables downloader middleware statistics;
- `DEPTH_STATS` and `DEPTH_STATS_VERBOSE`: settings related to depth statistics;
- `STATSMAILER_RCPTS`: the list of email addresses the statistics are sent to once the crawl finishes.
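The collected statistics can also be read from code through the stats collector. A minimal sketch, assuming a throwaway spider that just logs the built-in `item_scraped_count` counter when it closes:

```python
import scrapy


class DemoStatsSpider(scrapy.Spider):
    # hypothetical spider used only to show stats access
    name = 'demo_stats'

    def closed(self, reason):
        # the stats collector is reachable through the crawler object
        stats = self.crawler.stats.get_stats()
        self.logger.info('items scraped: %s', stats.get('item_scraped_count', 0))
```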
Performance
- `CONCURRENT_REQUESTS`: the maximum number of concurrent requests across all websites being crawled; the default is 16. If one request takes 0.2 seconds, that caps throughput at roughly 16 / 0.2 = 80 requests per second;
- `CONCURRENT_REQUESTS_PER_DOMAIN` and `CONCURRENT_REQUESTS_PER_IP`: the maximum number of concurrent requests for a single domain or a single IP address;
- `CONCURRENT_ITEMS`: the maximum number of items processed concurrently per response; with `CONCURRENT_REQUESTS=16` and `CONCURRENT_ITEMS=100`, up to about 1600 items per second could be written to the database;
- `DOWNLOAD_TIMEOUT`: how long the downloader waits before timing out;
- `DOWNLOAD_DELAY`: download delay used to throttle the crawl; combined with `RANDOMIZE_DOWNLOAD_DELAY`, the actual delay becomes a random multiple of `DOWNLOAD_DELAY`;
- `CLOSESPIDER_TIMEOUT`, `CLOSESPIDER_ITEMCOUNT`, `CLOSESPIDER_PAGECOUNT`, `CLOSESPIDER_ERRORCOUNT`: four similar settings that close the spider early, based respectively on elapsed time, number of items scraped, number of responses crawled, and number of errors.
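Put together, a throttling-oriented settings.py fragment might look like the following sketch; every value is illustrative and should be tuned per site:

```python
# settings.py -- performance / politeness sketch (illustrative values)
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0       # 0 means no per-IP limit; a non-zero value overrides the per-domain limit
CONCURRENT_ITEMS = 100
DOWNLOAD_TIMEOUT = 30                # seconds
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True      # actual delay becomes 0.5x to 1.5x DOWNLOAD_DELAY
CLOSESPIDER_TIMEOUT = 3600           # stop after one hour...
CLOSESPIDER_ITEMCOUNT = 5000         # ...or after 5000 items, whichever comes first
```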
Crawling-related
- `USER_AGENT`: the user agent;
- `DEPTH_LIMIT`: the maximum crawl depth; useful for deep crawls;
- `ROBOTSTXT_OBEY`: whether to obey the `robots.txt` convention;
- `COOKIES_ENABLED`: whether cookies are enabled; disabling them can sometimes speed up collection;
- `DEFAULT_REQUEST_HEADERS`: default request headers;
- `IMAGES_STORE`: the image storage path when using `ImagesPipeline`;
- `IMAGES_MIN_WIDTH` and `IMAGES_MIN_HEIGHT`: filter out images that are too small;
- `IMAGES_THUMBS`: thumbnail settings;
- `FILES_STORE`: the file storage path;
- `FILES_URLS_FIELD` and `FILES_RESULT_FIELD`: field names used when configuring the `FilesPipeline`;
- `URLLENGTH_LIMIT`: the maximum length of a URL that will be crawled.
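For reference, here is a sketch of how the fetch- and media-related options commonly appear in settings.py; the paths, sizes, and headers are assumptions for illustration only:

```python
# settings.py -- fetch and media pipeline sketch (illustrative values)
USER_AGENT = 'Mozilla/5.0 (compatible; demo-bot)'
ROBOTSTXT_OBEY = True
COOKIES_ENABLED = False              # disabling cookies can sometimes speed up collection
DEPTH_LIMIT = 3
URLLENGTH_LIMIT = 2083
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/json',
    'Accept-Language': 'en',
}
IMAGES_STORE = 'images'              # used by ImagesPipeline
IMAGES_MIN_WIDTH = 110               # drop images smaller than 110 x 110 pixels
IMAGES_MIN_HEIGHT = 110
IMAGES_THUMBS = {'small': (50, 50), 'big': (270, 270)}
FILES_STORE = 'files'                # used by FilesPipeline
```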
Extensions and components
- `ITEM_PIPELINES`: pipeline configuration;
- `COMMANDS_MODULE`: module containing user-defined commands;
- `DOWNLOADER_MIDDLEWARES`: downloader middlewares;
- `SCHEDULER`: the scheduler;
- `EXTENSIONS`: extensions;
- `SPIDER_MIDDLEWARES`: spider middlewares;
- `RETRY_*`: settings of the retry middleware;
- `REDIRECT_*`: settings of the redirect middleware;
- `METAREFRESH_*`: settings of the meta-refresh middleware;
- `MEMUSAGE_*`: memory usage settings.
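Enabling these components amounts to mapping their dotted class paths to integer priorities (0-1000). A sketch, in which the custom class paths are hypothetical:

```python
# settings.py -- enabling components (the lq.* class paths are hypothetical)
ITEM_PIPELINES = {
    'lq.pipelines.LqPipeline': 300,                    # lower numbers run earlier
}
DOWNLOADER_MIDDLEWARES = {
    'lq.middlewares.RandomUserAgentMiddleware': 543,   # hypothetical custom middleware
}
RETRY_ENABLED = True
RETRY_TIMES = 3                      # retry a failed request up to 3 extra times
REDIRECT_ENABLED = True
METAREFRESH_ENABLED = True
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 1024             # close the spider if memory use exceeds 1 GB
```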
Settings configuration tips
- Write configuration shared across the project in the project's `settings.py` file;
- Write spider-specific settings in the spider's `custom_settings` attribute (see the sketch after this list);
- Settings that differ between crawler runs are best passed on the command line.
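A minimal sketch of how the layers combine; the spider and values are illustrative:

```python
import scrapy


class DemoSpider(scrapy.Spider):
    # per-spider settings (priority 2) override settings.py (priority 3)
    name = 'demo'
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'LOG_LEVEL': 'INFO',
    }
```

A one-off override from the command line (priority 1) then beats both:

```
scrapy crawl demo -s LOG_LEVEL=WARNING
```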
The crawler example for this post
This time the spider collects the Lanqiao training-camp course list. After testing the page, the request URL turns out to be:

```
https://www.lanqiao.cn/api/v2/courses/?page_size=20&page=2&include=html_url,name,description,students_count,fee_type,picture_url,id,label,online_type,purchase_seconds_info,level
```
Besides the `page_size` and `page` parameters, there is an `include` parameter, which is common in API interfaces: its value lists the fields (attributes) the interface should return. Next we implement the crawler with Scrapy and save the results to a JSON file.
lanqiao.py code
```python
import json

import scrapy

from lq.items import LqItem


class LanqiaoSpider(scrapy.Spider):
    name = 'lanqiao'
    allowed_domains = ['lanqiao.cn']

    def start_requests(self):
        url_format = 'https://www.lanqiao.cn/api/v2/courses/?page_size=20&page={}&include=html_url,name,description,students_count,fee_type,picture_url,id,label,online_type,purchase_seconds_info,level'
        # request pages 1 to 33 of the course listing API
        for page in range(1, 34):
            url = url_format.format(page)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.text)
        for ret_item in json_data["results"]:
            item = LqItem(**ret_item)
            yield item
```
Each `ret_item` dictionary is unpacked straight into the `LqItem` constructor, which assigns all of the item's fields in one step.
items.py code
This class mainly declares and restricts the item's data fields.
```python
import scrapy


class LqItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    html_url = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    students_count = scrapy.Field()
    fee_type = scrapy.Field()
    picture_url = scrapy.Field()
    id = scrapy.Field()
    label = scrapy.Field()
    online_type = scrapy.Field()
    purchase_seconds_info = scrapy.Field()
    level = scrapy.Field()
```
Configuration enabled in settings.py
```python
BOT_NAME = 'lq'

SPIDER_MODULES = ['lq.spiders']
NEWSPIDER_MODULE = 'lq.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'

ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 3
```
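The export step is not shown above; one simple way to produce the JSON file, assuming Scrapy's built-in feed export, is to pass an output file on the command line (the file name is arbitrary):

```
scrapy crawl lanqiao -o lanqiao_courses.json
```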
Crawler run result: information on **600+** courses was collected.
In closing
Today is day 254 of 365 in my streak of continuous writing. Follows, likes, comments, and favorites are all appreciated.