The settings.py file is one of the most important files in a Scrapy project because it holds most of the project's configuration. This post explains how to configure settings.py based on the official manual and adds some extra notes.
Five levels of Settings
- Highest priority – the command line, e.g. `scrapy crawl my_spider -s LOG_LEVEL=WARNING`;
- Priority 2 – the spider's own settings, i.e. the `custom_settings` attribute set in the spider file `xxx.py`;
- Priority 3 – the project settings module, i.e. the configuration in `settings.py`;
- Priority 4 – the `default_settings` attribute of a command;
- Priority 5 – the configuration in the `default_settings.py` file.
Settings can be read through the `from_crawler` class method, which is available in spiders as well as in middlewares, pipelines, and extensions.
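As a quick illustration, here is a minimal sketch of reading a setting inside an item pipeline via `from_crawler`; the pipeline class and the `MONGO_URI` setting name are made up for this example:

```python
class SaveToDbPipeline:
    # hypothetical pipeline that pulls its connection string from Settings

    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the merged Settings object;
        # get()/getint()/getbool() read individual values
        return cls(mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'))
```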
Reading a setting from the command line is also very simple and was covered in the previous post. The command format is as follows:

```
scrapy settings --get <setting name>
```
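For example, in the `lq` project built later in this post, querying the bot name would look like this (the output simply echoes the configured value):

```
scrapy settings --get BOT_NAME
# prints: lq
```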
Common Settings options
Basic configuration
- `BOT_NAME`: the crawler (bot) name;
- `SPIDER_MODULES`: the list of modules where Scrapy looks for spiders;
- `NEWSPIDER_MODULE`: the module in which the `genspider` command creates new spiders.
Logging
Scrapy logging uses the same five levels as Python's `logging` module. The threshold is set with `LOG_LEVEL`; from lowest to highest the levels are DEBUG (the default), INFO, WARNING, ERROR, and CRITICAL. Other logging settings are as follows:
- `LOGSTATS_INTERVAL`: the interval at which crawl statistics are logged; the default is 60 seconds and can be lowered, e.g. to 5 seconds;
- `LOG_FILE`: path of the log file;
- `LOG_ENABLED`: whether logging is enabled; with it disabled, the crawler outputs nothing;
- `LOG_ENCODING`: log encoding;
- `LOG_FORMAT`: log format string; see the `logging` module for the syntax;
- `LOG_DATEFORMAT`: same idea, responsible for formatting the date/time.
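As an illustration, a settings.py fragment that exercises these options might look like the sketch below; the values are examples, not recommendations:

```python
# settings.py -- logging sketch (illustrative values)
LOG_ENABLED = True
LOG_LEVEL = 'INFO'                 # hide DEBUG noise
LOG_FILE = 'crawl.log'             # write the log to a file instead of the console
LOG_ENCODING = 'utf-8'
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
LOGSTATS_INTERVAL = 5              # report crawl statistics every 5 seconds instead of 60
```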
Statistics
- `STATS_DUMP`: enabled by default; when the crawl finishes, the collected statistics are dumped to the log;
- `DOWNLOADER_STATS`: enables downloader middleware statistics;
- `DEPTH_STATS` and `DEPTH_STATS_VERBOSE`: settings related to depth statistics;
- `STATSMAILER_RCPTS`: the list of email addresses the statistics are sent to once the crawl finishes.
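The collected statistics can also be read from code through the stats collector. A minimal sketch, assuming a throwaway spider that just logs the built-in `item_scraped_count` counter when it closes:

```python
import scrapy


class DemoStatsSpider(scrapy.Spider):
    # hypothetical spider used only to show stats access
    name = 'demo_stats'

    def closed(self, reason):
        # the stats collector is reachable through the crawler object
        stats = self.crawler.stats.get_stats()
        self.logger.info('items scraped: %s', stats.get('item_scraped_count', 0))
```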
Performance
- `CONCURRENT_REQUESTS`: the maximum number of concurrent requests across all websites being crawled; the default is 16. If one request takes 0.2 seconds, that caps throughput at roughly 16 / 0.2 = 80 requests per second;
- `CONCURRENT_REQUESTS_PER_DOMAIN` and `CONCURRENT_REQUESTS_PER_IP`: the maximum number of concurrent requests for a single domain or a single IP address;
- `CONCURRENT_ITEMS`: the maximum number of items processed concurrently per response; with `CONCURRENT_REQUESTS=16` and `CONCURRENT_ITEMS=100`, up to about 1600 items per second could be written to the database;
- `DOWNLOAD_TIMEOUT`: how long the downloader waits before timing out;
- `DOWNLOAD_DELAY`: download delay used to throttle the crawl; combined with `RANDOMIZE_DOWNLOAD_DELAY`, the actual delay becomes a random multiple of `DOWNLOAD_DELAY`;
- `CLOSESPIDER_TIMEOUT`, `CLOSESPIDER_ITEMCOUNT`, `CLOSESPIDER_PAGECOUNT`, `CLOSESPIDER_ERRORCOUNT`: four similar settings that close the spider early, based respectively on elapsed time, number of items scraped, number of responses crawled, and number of errors.
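Put together, a throttling-oriented settings.py fragment might look like the following sketch; every value is illustrative and should be tuned per site:

```python
# settings.py -- performance / politeness sketch (illustrative values)
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0       # 0 means no per-IP limit; a non-zero value overrides the per-domain limit
CONCURRENT_ITEMS = 100
DOWNLOAD_TIMEOUT = 30                # seconds
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True      # actual delay becomes 0.5x to 1.5x DOWNLOAD_DELAY
CLOSESPIDER_TIMEOUT = 3600           # stop after one hour...
CLOSESPIDER_ITEMCOUNT = 5000         # ...or after 5000 items, whichever comes first
```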
Crawling-related
- `USER_AGENT`: the user agent;
- `DEPTH_LIMIT`: the maximum crawl depth; useful for deep crawls;
- `ROBOTSTXT_OBEY`: whether to obey the `robots.txt` convention;
- `COOKIES_ENABLED`: whether cookies are enabled; disabling them can sometimes speed up collection;
- `DEFAULT_REQUEST_HEADERS`: default request headers;
- `IMAGES_STORE`: the image storage path when using `ImagesPipeline`;
- `IMAGES_MIN_WIDTH` and `IMAGES_MIN_HEIGHT`: filter out images that are too small;
- `IMAGES_THUMBS`: thumbnail settings;
- `FILES_STORE`: the file storage path;
- `FILES_URLS_FIELD` and `FILES_RESULT_FIELD`: field names used when configuring the `FilesPipeline`;
- `URLLENGTH_LIMIT`: the maximum length of a URL that will be crawled.
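For reference, here is a sketch of how the fetch- and media-related options commonly appear in settings.py; the paths, sizes, and headers are assumptions for illustration only:

```python
# settings.py -- fetch and media pipeline sketch (illustrative values)
USER_AGENT = 'Mozilla/5.0 (compatible; demo-bot)'
ROBOTSTXT_OBEY = True
COOKIES_ENABLED = False              # disabling cookies can sometimes speed up collection
DEPTH_LIMIT = 3
URLLENGTH_LIMIT = 2083
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/json',
    'Accept-Language': 'en',
}
IMAGES_STORE = 'images'              # used by ImagesPipeline
IMAGES_MIN_WIDTH = 110               # drop images smaller than 110 x 110 pixels
IMAGES_MIN_HEIGHT = 110
IMAGES_THUMBS = {'small': (50, 50), 'big': (270, 270)}
FILES_STORE = 'files'                # used by FilesPipeline
```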
Extensions and components
- `ITEM_PIPELINES`: pipeline configuration;
- `COMMANDS_MODULE`: module containing user-defined commands;
- `DOWNLOADER_MIDDLEWARES`: downloader middlewares;
- `SCHEDULER`: the scheduler;
- `EXTENSIONS`: extensions;
- `SPIDER_MIDDLEWARES`: spider middlewares;
- `RETRY_*`: settings of the retry middleware;
- `REDIRECT_*`: settings of the redirect middleware;
- `METAREFRESH_*`: settings of the meta-refresh middleware;
- `MEMUSAGE_*`: memory usage settings.
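Enabling these components amounts to mapping their dotted class paths to integer priorities (0-1000). A sketch, in which the custom class paths are hypothetical:

```python
# settings.py -- enabling components (the lq.* class paths are hypothetical)
ITEM_PIPELINES = {
    'lq.pipelines.LqPipeline': 300,                    # lower numbers run earlier
}
DOWNLOADER_MIDDLEWARES = {
    'lq.middlewares.RandomUserAgentMiddleware': 543,   # hypothetical custom middleware
}
RETRY_ENABLED = True
RETRY_TIMES = 3                      # retry a failed request up to 3 extra times
REDIRECT_ENABLED = True
METAREFRESH_ENABLED = True
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 1024             # close the spider if memory use exceeds 1 GB
```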
Settings configuration tips
- Write configuration shared across the project in the project's `settings.py` file;
- Write spider-specific settings in the spider's `custom_settings` attribute (see the sketch after this list);
- Settings that differ between crawler runs are best passed on the command line.
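A minimal sketch of how the layers combine; the spider and values are illustrative:

```python
import scrapy


class DemoSpider(scrapy.Spider):
    # per-spider settings (priority 2) override settings.py (priority 3)
    name = 'demo'
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'LOG_LEVEL': 'INFO',
    }
```

A one-off override from the command line (priority 1) then beats both:

```
scrapy crawl demo -s LOG_LEVEL=WARNING
```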
The crawler example for this post
This time the spider collects the Lanqiao training-camp course list. After testing the page, the request URL turns out to be:

```
https://www.lanqiao.cn/api/v2/courses/?page_size=20&page=2&include=html_url,name,description,students_count,fee_type,picture_url,id,label,online_type,purchase_seconds_info,level
```
Besides the `page_size` and `page` parameters, there is an `include` parameter, which is common in API interfaces: its value lists the fields (attributes) the interface should return. Next we implement the crawler with Scrapy and save the results to a JSON file.
lanqiao.py code
```python
import json

import scrapy

from lq.items import LqItem


class LanqiaoSpider(scrapy.Spider):
    name = 'lanqiao'
    allowed_domains = ['lanqiao.cn']

    def start_requests(self):
        url_format = 'https://www.lanqiao.cn/api/v2/courses/?page_size=20&page={}&include=html_url,name,description,students_count,fee_type,picture_url,id,label,online_type,purchase_seconds_info,level'
        # request pages 1 to 33 of the course listing API
        for page in range(1, 34):
            url = url_format.format(page)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        json_data = json.loads(response.text)
        for ret_item in json_data["results"]:
            item = LqItem(**ret_item)
            yield item
```
Each `ret_item` dictionary is unpacked straight into the `LqItem` constructor, which assigns all of the item's fields in one step.
items.py code
This class mainly declares and restricts the item's data fields.
```python
import scrapy


class LqItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    html_url = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    students_count = scrapy.Field()
    fee_type = scrapy.Field()
    picture_url = scrapy.Field()
    id = scrapy.Field()
    label = scrapy.Field()
    online_type = scrapy.Field()
    purchase_seconds_info = scrapy.Field()
    level = scrapy.Field()
```
Configuration enabled in settings.py
```python
BOT_NAME = 'lq'

SPIDER_MODULES = ['lq.spiders']
NEWSPIDER_MODULE = 'lq.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'

ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 3
```
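The export step is not shown above; one simple way to produce the JSON file, assuming Scrapy's built-in feed export, is to pass an output file on the command line (the file name is arbitrary):

```
scrapy crawl lanqiao -o lanqiao_courses.json
```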
Crawler run result: information on **600+** courses was collected.
In closing
Today is day 254 of 365 in my streak of continuous writing. Follows, likes, comments, and favorites are all appreciated.