Requirements analysis

Basic users:

  • You have a single development host and want to deploy and run your Scrapy crawler projects directly from the browser

Advanced users:

  • You have one cloud host and want to add authentication
  • You want crawler tasks to start automatically on a schedule to monitor web pages

Professional users:

  • You have N cloud hosts running distributed crawlers built with scrapy-redis
  • You want to view the running status of all cloud hosts on a single page
  • You want to freely select any subset of cloud hosts and deploy and run projects on them in batches, i.e. cluster management
  • You want automatic log analysis so you can keep track of crawl progress
  • You want to be notified when a particular type of exception appears in the logs, and to have the current crawler task stopped automatically when it does

Installation and configuration

  1. Make sure Scrapyd has been installed and started on every host. If remote access to Scrapyd is required, change bind_address in the Scrapyd configuration file to bind_address = 0.0.0.0 and restart the Scrapyd service (a quick reachability check is sketched after this list).
  2. Install ScrapydWeb on the development host (or any other host): pip install scrapydweb
  3. Run the command scrapydweb to start ScrapydWeb (the first start automatically generates a configuration file in the current working directory).
  4. Enable basic HTTP authentication (optional):
ENABLE_AUTH = True
USERNAME = 'username'
PASSWORD = 'password'
  5. Add your Scrapyd servers; both string and tuple formats are supported, and you can attach authentication info and a group/label:
SCRAPYD_SERVERS = [
    '127.0.0.1',
    # 'username:password@localhost:6801#group',
    ('username', 'password', 'localhost', '6801', 'group'),
]
  6. Run the command scrapydweb again to restart ScrapydWeb.
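
Before (or after) adding servers, it can help to confirm that each Scrapyd instance is actually reachable. Below is a minimal standalone sketch, not part of ScrapydWeb, that queries Scrapyd's daemonstatus.json endpoint; the hosts, ports, and credentials are placeholders to adapt to your own setup.

import requests

# Placeholder list; use the same addresses/credentials you plan to configure
SCRAPYD_HOSTS = [
    'http://127.0.0.1:6800',
    # 'http://username:password@localhost:6801',
]

for host in SCRAPYD_HOSTS:
    try:
        # daemonstatus.json reports the pending/running/finished job counts
        resp = requests.get(f'{host}/daemonstatus.json', timeout=5)
        resp.raise_for_status()
        print(host, resp.json())
    except requests.RequestException as exc:
        print(f'{host} is not reachable: {exc}')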

Access to the Web UI

Visit http://127.0.0.1:5000 in your browser and log in.

  • The Servers page automatically displays the running status of all Scrapyd servers.
  • Through the tabs you can call any of Scrapyd's HTTP JSON API endpoints and execute them against multiple servers in one batch (see the sketch after this list).

  • Thanks to the integrated LogParser, the Jobs page automatically shows the pages and items counts of each crawler task.
  • By default, ScrapydWeb periodically snapshots the crawler job list into its database, so job information is not lost even if a Scrapyd server is restarted. (issue 12)
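
As a rough illustration of what those batch calls boil down to, the sketch below issues plain requests against Scrapyd's listprojects.json and listjobs.json endpoints on several servers in one loop. It is not ScrapydWeb internals, and the addresses are hypothetical.

import requests

SERVERS = ['http://127.0.0.1:6800', 'http://127.0.0.1:6801']  # hypothetical

for server in SERVERS:
    projects = requests.get(f'{server}/listprojects.json', timeout=10).json()
    for project in projects.get('projects', []):
        jobs = requests.get(f'{server}/listjobs.json',
                            params={'project': project}, timeout=10).json()
        print(f"{server} {project}: {len(jobs.get('running', []))} running, "
              f"{len(jobs.get('finished', []))} finished")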

Deploying projects

  • Supports one-click deployment of a project to a cluster of Scrapyd servers.
  • Set SCRAPY_PROJECTS_DIR to your Scrapy project development directory; ScrapydWeb automatically lists every project under that path, selects the most recently edited one by default, and packages and deploys the chosen project automatically.
  • If ScrapydWeb runs on a remote server, then besides uploading a regular egg file from your development host, you can also compress the whole project folder into a zip/tar/tar.gz archive and upload it directly, with no need to build an egg file by hand (a small packaging sketch follows this list).
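
For the archive route, here is a short sketch of packaging a project folder with the standard library; the paths are placeholders, and the resulting tar.gz is what you would upload on ScrapydWeb's deploy page.

import shutil

# Produces /tmp/myproject.tar.gz from the project folder (paths are placeholders)
archive = shutil.make_archive(
    base_name='/tmp/myproject',               # output path without extension
    format='gztar',                           # .tar.gz
    root_dir='/home/user/projects/myproject',
)
print('Upload this file on the deploy page:', archive)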

Run the crawler

  • Select the project, version, and spider directly from drop-down boxes.
  • Supports passing Scrapy settings and spider arguments (see the schedule.json sketch after this list).
  • Supports creating scheduled crawler tasks based on APScheduler. (Adjust the max_proc option in the Scrapyd configuration file if you need to start a large number of crawler tasks at the same time.)
  • Supports one-click launching of distributed crawlers on a cluster of Scrapyd servers.
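
For reference, the settings and arguments fields map roughly onto Scrapyd's schedule.json API. The sketch below fires a single spider that way; the project, spider, and argument names are hypothetical.

import requests

resp = requests.post(
    'http://127.0.0.1:6800/schedule.json',
    data={
        'project': 'myproject',          # hypothetical project name
        'spider': 'myspider',            # hypothetical spider name
        # Scrapy settings: the 'setting' field may be repeated
        'setting': ['DOWNLOAD_DELAY=2', 'CLOSESPIDER_TIMEOUT=3600'],
        # any extra field is passed to the spider as an argument (-a key=value)
        'category': 'news',
    },
    timeout=10,
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}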

Log analysis and visualization

  • If Scrapyd and ScrapydWeb run on the same host, it is recommended to set SCRAPYD_LOGS_DIR and ENABLE_LOGPARSER (a minimal snippet follows this list); when ScrapydWeb starts, it automatically runs LogParser as a subprocess, which periodically and incrementally parses the Scrapy log files in the specified directory. This speeds up the generation of the Stats page and avoids the memory and network cost of repeatedly requesting the raw log files.
  • When managing a cluster of Scrapyd servers, it is recommended to install and start LogParser on the remaining hosts as well, for the same reasons.
  • If the installed Scrapy version is 1.5.1 or lower, LogParser can read the Crawler.stats and Crawler.engine data through Scrapy's built-in Telnet Console, giving you insight into what is going on inside Scrapy.
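
A minimal sketch of the two settings mentioned above, as they would appear in the ScrapydWeb configuration file; the log path is a placeholder and should point at the directory Scrapyd actually writes its logs to (its logs_dir):

ENABLE_LOGPARSER = True                  # run LogParser as a ScrapydWeb subprocess
SCRAPYD_LOGS_DIR = '/home/user/logs'     # placeholder; point this at Scrapyd's logs_dir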

Timer tasks

  • Supports viewing the parameters of each scheduled crawler task and tracing its execution history.
  • Supports pausing, resuming, triggering, stopping, editing, and deleting tasks (a rough sketch of the underlying APScheduler mechanism follows this list).
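
ScrapydWeb builds these timer tasks on APScheduler. As a rough, standalone illustration of that mechanism (this is not ScrapydWeb code), the sketch below registers a cron-style job that would fire every workday at 9:00:

from apscheduler.schedulers.blocking import BlockingScheduler

def start_crawl():
    # In ScrapydWeb the corresponding step posts to Scrapyd's schedule.json;
    # here we just print to keep the illustration self-contained.
    print('fire crawler task')

scheduler = BlockingScheduler()
# Cron-style trigger: every Monday-Friday at 09:00
scheduler.add_job(start_crawl, 'cron', day_of_week='mon-fri', hour=9, minute=0)
scheduler.start()  # blocks; jobs can also be paused, resumed, or removed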

Email notification

A polling subprocess periodically visits the Stats page in the background; when a specific trigger condition is met, ScrapydWeb automatically stops the crawler task and sends a notification email whose body contains the statistics of the current crawler task.

  1. Add an email account:
SMTP_SERVER = 'smtp.qq.com'
SMTP_PORT = 465
SMTP_OVER_SSL = True
SMTP_CONNECTION_TIMEOUT = 10

EMAIL_USERNAME = ''  # defaults to FROM_ADDR
EMAIL_PASSWORD = 'password'
FROM_ADDR = '[email protected]'
TO_ADDRS = [FROM_ADDR]
  2. Set the working days, working hours, and basic triggers. The example below means: every hour, or whenever a task finishes, if the current time is 9, 12, or 17 o'clock on a working day, ScrapydWeb will send a notification email.
EMAIL_WORKING_DAYS = [1, 2, 3, 4, 5]
EMAIL_WORKING_HOURS = [9, 12, 17]
ON_JOB_RUNNING_INTERVAL = 3600
ON_JOB_FINISHED = True
  3. In addition to the basic triggers, ScrapydWeb provides multiple triggers for handling specific types of log entries, including 'CRITICAL', 'ERROR', 'WARNING', 'REDIRECT', 'RETRY', and 'IGNORE'.
LOG_CRITICAL_THRESHOLD = 3
LOG_CRITICAL_TRIGGER_STOP = True
LOG_CRITICAL_TRIGGER_FORCESTOP = False
#...
LOG_IGNORE_TRIGGER_FORCESTOP = False

With the settings above, when three or more CRITICAL entries appear in a job's log, ScrapydWeb automatically stops the task, and it sends a notification email if the current time falls within the configured working days and hours.
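
For context, the SMTP settings above amount to an ordinary SSL SMTP session. The sketch below shows the equivalent with the standard library, reusing the example address from the configuration; it is a standalone illustration, not ScrapydWeb's own sending code.

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'ScrapydWeb notification (example)'
msg['From'] = '[email protected]'
msg['To'] = '[email protected]'
msg.set_content('Crawler task statistics would go in the body.')

# Mirrors SMTP_SERVER / SMTP_PORT / SMTP_OVER_SSL / SMTP_CONNECTION_TIMEOUT above
with smtplib.SMTP_SSL('smtp.qq.com', 465, timeout=10) as server:
    server.login('[email protected]', 'password')
    server.send_message(msg)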

Open source on GitHub

The project is open source on GitHub. Star it so you can find it again, and feel free to submit feature requests!

my8100 / scrapydweb