Learning goals
  1. Know how to use Scrapyd

1. Introduction to Scrapyd

Scrapyd is an application for deploying and running Scrapy crawlers. It allows you to deploy crawler projects and control crawler runs through a JSON API. Scrapyd is a daemon that listens for requests to run crawlers and then starts a process to execute each one.

The JSON API is essentially a web API driven mostly by HTTP POST requests.
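For example, once the service is running (section 3), a quick way to see the JSON API in action is Scrapyd's daemonstatus.json endpoint, which reports whether the daemon is up and how many jobs are pending, running, or finished:

curl http://localhost:6800/daemonstatus.json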

2. Scrapyd installation

pip install scrapyd

pip install scrapyd-client
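The scrapyd package provides the server itself, while scrapyd-client provides the scrapyd-deploy command used in section 4. To verify that both installed correctly:

pip show scrapyd scrapyd-client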

3. Start the Scrapyd service

  1. Command to start scrapyd under the project path: sudo scrapyd or scrapyd

  2. When you start it, scrapyd runs on your local machine. You can view the monitoring screen of your local scrapyd by visiting port 6800 in your browser (http://localhost:6800)

  • Click Jobs to view the task monitoring page
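If port 6800 is already taken, or you need the service reachable from other machines, Scrapyd reads its settings from a scrapyd.conf file. A minimal sketch, assuming the default option names from the Scrapyd documentation (verify against your installed version):

[scrapyd]
# listen on all interfaces instead of only the local machine
bind_address = 0.0.0.0
# port that serves the monitoring page and the JSON API
http_port = 6800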

4. Scrapy project deployment

4.1 Configuring projects to be deployed

Edit the scrapy.cfg file of the project to be deployed (whichever crawler project needs to be deployed to scrapyd must have this file configured):

[deploy:deployment_name]
url = http://localhost:6800/
project = project_name

Here deployment_name can be any name you choose, and project_name is the name used when the crawler project was created.
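As a concrete (hypothetical) illustration, a project created as myspider and deployed under the name demo would be configured like this:

[deploy:demo]
url = http://localhost:6800/
project = myspider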

4.2 Deploy items to Scrapyd

Also execute under the scrapy project path:

scrapyd-deploy deployment_name -p project_name

where deployment_name is the name specified in the configuration file.

Once the deployment succeeds, you can see the deployed project.
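To confirm the deployment from the command line, you can also query the listprojects endpoint (covered in section 5):

curl http://localhost:6800/listprojects.json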

4.3 Manage scrapy projects

  • Start a crawler: curl http://localhost:6800/schedule.json -d project=project_name -d spider=spider_name

  • Stop a crawler: curl http://localhost:6800/cancel.json -d project=project_name -d job=jobid
Note: curl is a command-line utility that requires separate installation if you don't have it
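On success, schedule.json returns a JSON body containing the jobid that cancel.json expects. The exact fields can vary by Scrapyd version, but the response looks roughly like:

{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}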

4.4 Use the Requests module to control scrapy projects

import requests

# start a crawler
url = 'http://localhost:6800/schedule.json'
data = {
    'project': 'project_name',
    'spider': 'spider_name',
}
resp = requests.post(url, data=data)

# stop a crawler
url = 'http://localhost:6800/cancel.json'
data = {
    'project': 'project_name',
    'job': 'jobid',
}
resp = requests.post(url, data=data)
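Putting the two calls together, here is a minimal runnable sketch (assuming a project named myspider with a spider named tencent, matching the examples in this article) that starts the spider, reads the jobid out of the JSON response, and then cancels that same job:

import requests

BASE = 'http://localhost:6800'

# start the spider and capture the job id from the JSON response
resp = requests.post(BASE + '/schedule.json',
                     data={'project': 'myspider', 'spider': 'tencent'})
jobid = resp.json()['jobid']  # assumes a successful "status": "ok" response

# cancel that same job using the captured id
resp = requests.post(BASE + '/cancel.json',
                     data={'project': 'myspider', 'job': jobid})
print(resp.json())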

5. Learn about other Scrapyd web APIs

  • curl http://localhost:6800/listprojects.json (list projects)
  • curl http://localhost:6800/listspiders.json?project=myspider (list spiders)
  • curl http://localhost:6800/listjobs.json?project=myspider (list jobs)
  • curl http://localhost:6800/cancel.json -d project=myspider -d job=jobid (terminate a crawler; this call can be delayed or fail to stop the crawler, in which case kill the process with kill -9)
  • Scrapyd offers other web APIs as well; see the Scrapyd documentation to learn more
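For reference, two more endpoints from the same API (endpoint names as documented by Scrapyd; myspider is the example project used above):

  • curl http://localhost:6800/listversions.json?project=myspider (list the deployed versions of a project)
  • curl http://localhost:6800/delproject.json -d project=myspider (delete a project and all of its versions)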

Summary

  1. Execute sudo scrapyd or scrapyd in the scrapy project path to start the scrapyd service, or start it as a background process: nohup scrapyd > scrapyd.log 2>&1 &
  2. Deploy the scrapy crawler project: scrapyd-deploy -p myspider
  3. Start a crawler in the crawler project: curl http://localhost:6800/schedule.json -d project=myspider -d spider=tencent