Learning goals
- Know how to use Scrapyd
1. Introduction to Scrapyd
Scrapyd is an application for deploying and running Scrapy crawlers. It allows you to deploy crawler projects and control crawler runs through a JSON API. Scrapyd runs as a daemon that listens for requests to run crawlers and then starts a process to execute each one.
The JSON API is essentially a web API driven mostly by POST requests.
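For example, once the service is running (see section 3 below), a single request to the daemonstatus.json endpoint returns the daemon's state as JSON. A minimal sketch using the requests library:

```python
import requests

# Ask the scrapyd daemon for its status; the JSON answer looks like
# {"status": "ok", "pending": 0, "running": 0, "finished": 0, "node_name": "..."}
resp = requests.get('http://localhost:6800/daemonstatus.json')
print(resp.json())
```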
2. Scrapyd installation
```
pip install scrapyd
pip install scrapyd-client
```
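To quickly verify that both packages are installed, here is a minimal sketch using only the standard library (assumes Python 3.8+ for importlib.metadata):

```python
# Print the installed versions of scrapyd and scrapyd-client;
# raises PackageNotFoundError if either one is missing.
from importlib.metadata import version

print(version('scrapyd'))
print(version('scrapyd-client'))
```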
3. Start the Scrapyd service
- Command to start scrapyd under the project path: `sudo scrapyd` or `scrapyd`
- Once it is running, you can view the monitoring page of your local scrapyd by visiting port 6800 in your browser: http://localhost:6800
- Click Jobs to view the task monitoring page
4. Scrapy project deployment
4.1 Configuring projects to be deployed
Edit the scrapy.cfg file of the project to be deployed (whichever crawler needs to be deployed to scrapyd, configure this file in that project):
```
# deploy_name can be customized;
# project is the name used when the crawler project was created
[deploy:deploy_name]
url = http://localhost:6800/
project = project_name
```
4.2 Deploy the project to Scrapyd
Also execute under the scrapy project path:
```
scrapyd-deploy deploy_name -p project_name
```
where `deploy_name` is the deployment name specified in the configuration file and `project_name` is the crawler project name.
Once the deployment succeeds, you can see the deployed project in the scrapyd web interface.
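You can also confirm the deployment programmatically by asking scrapyd for its project list. A minimal sketch (`project_name` is a placeholder for your own project name):

```python
import requests

# listprojects.json returns {"status": "ok", "projects": [...]};
# a successful deployment adds your project name to that list.
resp = requests.get('http://localhost:6800/listprojects.json')
print('project_name' in resp.json()['projects'])
```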
4.3 Manage scrapy projects
- Start a crawler:
curl http://localhost:6800/schedule.json -d project=project_name -d spider=spider_name
- Stop a crawler:
curl http://localhost:6800/cancel.json -d project=project_name -d job=jobid
Note: curl is a command-line tool that needs to be installed separately if you don't already have it.
4.4 Use the Requests module to control scrapy projects
```python
import requests

# start a crawler
url = 'http://localhost:6800/schedule.json'
data = {'project': 'project_name', 'spider': 'spider_name'}
resp = requests.post(url, data=data)

# stop a crawler
url = 'http://localhost:6800/cancel.json'
data = {'project': 'project_name', 'job': 'jobid'}
resp = requests.post(url, data=data)
```
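In practice the two calls are often chained: schedule.json returns the jobid of the newly started job in its JSON response, and that jobid is exactly what cancel.json expects. A minimal end-to-end sketch, assuming a deployed project named myspider with a spider named tencent (the example names used later in this tutorial):

```python
import requests

BASE_URL = 'http://localhost:6800'

# Start the spider; the response carries the id of the new job,
# e.g. {"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511"}
resp = requests.post(BASE_URL + '/schedule.json',
                     data={'project': 'myspider', 'spider': 'tencent'})
jobid = resp.json()['jobid']

# Later, stop that same job by passing the jobid back to cancel.json
resp = requests.post(BASE_URL + '/cancel.json',
                     data={'project': 'myspider', 'job': jobid})
print(resp.json())
```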
5. Learn about other Scrapyd web APIs
- curl http://localhost:6800/listprojects.json (list projects)
- curl http://localhost:6800/listspiders.json?project=myspider (list spiders)
- curl http://localhost:6800/listjobs.json?project=myspider (list jobs; see the polling sketch after this list)
- curl http://localhost:6800/cancel.json -d project=myspider -d job=tencent (terminate a crawler; this call is sometimes delayed or fails to stop the crawler, in which case you can kill -9 the scrapy process)
- For other Scrapyd web APIs, search baidu.com to learn more
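As a programmatic example of the listjobs.json endpoint above, here is a sketch that polls it until all jobs of a project have finished (myspider is a placeholder project name, and the 5-second interval is arbitrary):

```python
import time
import requests

# listjobs.json groups a project's jobs into "pending",
# "running" and "finished" lists.
while True:
    resp = requests.get('http://localhost:6800/listjobs.json',
                        params={'project': 'myspider'})
    jobs = resp.json()
    print('pending:', len(jobs['pending']),
          'running:', len(jobs['running']),
          'finished:', len(jobs['finished']))
    if not jobs['pending'] and not jobs['running']:
        break  # everything has finished
    time.sleep(5)
```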
Summary
- Execute in the scrapy project path
sudo scrapyd or scrapyd
to start the scrapyd service, or start it as a background process:
nohup scrapyd > scrapyd.log 2>&1 &
- Deploy the scrapy crawler project
scrapyd-deploy -p myspider
- Start a crawler in the crawler project
curl http://localhost:6800/schedule.json -d project=myspider -d spider=tencent