pip install scrapyd

Scrapyd principle

When you install and run scrapyd on a server, the scrapyd daemon listens for crawler run requests and, on receiving one, starts a process to execute the crawler

Scrapyd also provides a web interface that makes it convenient to view and manage crawlers. By default, scrapyd listens on port 6800; visit http://localhost:6800 in a browser on the machine to view the currently running projects
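
For reference, once installed the daemon is started with a single command, which serves both the JSON API and the web UI:

# start the scrapyd daemon (runs in the foreground, listening on port 6800 by default)
scrapyd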

Project deployment

Use the scrapyd-deploy tool provided by scrapyd-client to deploy

Principle

Scrapyd runs on the server, and scrapyd-client runs on the client. The client deploys crawler projects with scrapyd-client, which calls scrapyd's JSON API

pip install scrapyd-client

Configure the server information for the project

Modify the scrapy.cfg file in the project directory

If HTTP basic authentication is configured on your server, you also need to configure the username and password used to log in to the server

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = toutiao.settings

# "server" is a name for this deploy target; I chose server
[deploy:server]
# address of the scrapyd server to deploy to; we deploy locally here,
# so change the IP if you deploy to another machine
url = http://localhost:6800/
# toutiao is the project name
project = toutiao
# only needed if the server uses HTTP basic authentication
username = yourusername
password = yourpassword
Deploy crawler

Run the following command in the crawler project root directory

scrapyd-deploy <target> -p <project>

Here target is the server name configured in scrapy.cfg in the previous step, and project is the project name. Taking the current crawler as an example:

scrapyd-deploy server -p toutiao

Deployment packs up your current project. If there is a setup.py file in the project root, it is used; if not, one is created automatically. (If you need to package the project later you can edit the generated file as needed; for now you can ignore it.) From the returned result we can see the deployment status, the project name, the version number, the number of spiders, and the current host name.

The running results are as follows:
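
A successful run of scrapyd-deploy prints output along the following lines; the version number, node name, and spider count shown here are illustrative and will differ on your machine:

Packing version 1620371421
Deploying to project "toutiao" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "local", "status": "ok", "project": "toutiao", "version": "1620371421", "spiders": 1}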

Use the following command to view the deployment result:

scrapyd-deploy -L <server name>

e.g.:

scrapyd-deploy -L server
toutiao
default

Refresh the http://localhost:6800 page and you can also see Available projects: toutiao, default

Manage crawlers using the API

The web interface of Scrapyd is relatively simple and is mainly used for monitoring; all scheduling work relies on the JSON API. Using curl to manage crawlers is recommended

Install curl

Start a crawler (schedule)

Start the crawler by running the following command in the project root directory
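
A minimal sketch against the toutiao project deployed above; the spider name news is a hypothetical placeholder, so substitute one of your own spiders (see listspiders.json below):

# start (schedule) the hypothetical "news" spider of the toutiao project
curl http://localhost:6800/schedule.json -d project=toutiao -d spider=news

# sample response: "jobid" identifies the run and is needed to cancel it later
{ "status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444" }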

Cancel a crawler
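
A sketch of cancelling a run; the job ID is the one returned by schedule.json above:

# cancel a pending or running job of the toutiao project
curl http://localhost:6800/cancel.json -d project=toutiao -d job=6487ec79947edab326d6db28a2d86511e8247444
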
List projects
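
A sketch listing the projects on the local scrapyd server:

# list all projects uploaded to this scrapyd server
curl http://localhost:6800/listprojects.json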

List crawlers, versions, and job information
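
Sketches of the three listing endpoints, again using the toutiao project as an example:

# list the spiders in the latest version of the project
curl http://localhost:6800/listspiders.json?project=toutiao

# list the uploaded versions of the project
curl http://localhost:6800/listversions.json?project=toutiao

# list pending, running, and finished jobs of the project (pretty-printed)
curl http://localhost:6800/listjobs.json?project=toutiao | python -m json.tool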

Delete a crawler project
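
A sketch of deletion; the version r23 is a placeholder taken from the examples below. Deleting the last remaining version removes the project itself, and delproject.json removes a project outright:

# delete one version of the toutiao project
curl http://localhost:6800/delversion.json -d project=toutiao -d version=r23

# delete the whole project
curl http://localhost:6800/delproject.json -d project=toutiao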

The full API documentation is available at scrapyd.readthedocs.io; the main endpoints are summarized below.

daemonstatus.json

Check the load status of the server.

# sample request
curl http://localhost:6800/daemonstatus.json

# sample response
{ "status": "ok", "running": "0", "pending": "0", "finished": "0", "node_name": "node-name" }

addversion.json

Adds a version to a project; the project is created automatically if it doesn't exist

Supported request method: POST

Parameters:

project: the project name

version: the project version

egg: a Python egg containing the project code

# sample request
curl http://localhost:6800/addversion.json -F project=myproject -F version=r23 -F [email protected]

# sample response
{ "status": "ok", "spiders": 3 }

schedule.json

Schedules a spider run (also known as a job) and returns the job ID

# sample request
curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider

# sample response
{ "status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444" }

cancel.json

Cancels a spider run (cancels a job). If the job is pending, it is removed; if it is running, it is terminated

# sample request
curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444

# sample response
{ "status": "ok", "prevstate": "running" }

listprojects.json

Gets the list of projects uploaded to this Scrapyd server.
# sample request
curl http://localhost:6800/listprojects.json

# sample response
{ "status": "ok", "projects": ["myproject", "otherproject"] }

listversions.json

Gets the list of versions available for a project.

# sample request
curl http://localhost:6800/listversions.json?project=myproject

# sample response
{ "status": "ok", "versions": ["r99", "r156"] }

listspiders.json

Gets the list of spiders available in the latest (unless overridden) version of a project

# sample request
curl http://localhost:6800/listspiders.json?project=myproject

# sample response
{ "status": "ok", "spiders": ["spider1", "spider2", "spider3"] }

listjobs.json

Gets a list of pending, running, and completed jobs for a project
# sample request
curl http://localhost:6800/listjobs.json?project=myproject | python -m json.tool

# sample response
{
    "status": "ok",
    "pending": [
        {
            "project": "myproject",
            "spider": "spider1",
            "id": "78391cc0fcaf11e1b0090800272a6d06"
        }
    ],
    "running": [
        {
            "id": "422e608f9f28cef127b3d5ef93fe9399",
            "project": "myproject",
            "spider": "spider2",
            "start_time": "2012-09-12 10:14:03.594664"
        }
    ],
    "finished": [
        {
            "id": "2f16646cfcaf11e1b0090800272a6d06",
            "project": "myproject",
            "spider": "spider3",
            "start_time": "2012-09-12 10:14:03.594664",
            "end_time": "2012-09-12 10:24:03.594664"
        }
    ]
}

delversion.json

Deletes a version of a project. If no more versions remain for the project, the project is also deleted
# sample request
curl http://localhost:6800/delversion.json -d project=myproject -d version=r99

# sample response
{ "status": "ok" }