```bash
pip install scrapyd
```
How Scrapyd works
When you install and run Scrapyd on a server, the Scrapyd daemon listens for crawl requests and then spawns processes to execute the crawlers.
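Starting the service is just a matter of running the scrapyd command (a minimal sketch, with no custom configuration assumed):

```bash
# start the Scrapyd daemon; it runs in the foreground
# and listens on port 6800 by default
$ scrapyd
```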
Scrapyd also provides a web interface that makes it convenient to view and manage crawlers. By default, Scrapyd listens on port 6800, so you can visit http://localhost:6800 in a browser on the same machine to view the currently running projects.
Project deployment
Use the scrapyd-deploy tool provided by scrapyd-client to deploy.
How it works
Scrapyd runs on the server, while scrapyd-client runs on the client. The client deploys crawler projects with scrapyd-client, which calls Scrapyd's JSON API.
```bash
pip install scrapyd-client
```
Configure the server information for the project
Modify the scrapy.cfg file in the project directory.
If HTTP basic authentication is enabled on the server, you also need to configure the username and password used to log in to the server.
```ini
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = toutiao.settings

[deploy:server]                 # "server" is the name I chose for this deploy target
url = http://localhost:6800/    # address of the Scrapyd server; the project is deployed
                                # locally here, so change the IP to deploy to another machine
project = toutiao               # the project name
username = ''
password = ''
```
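To check that the deploy target has been picked up from scrapy.cfg, scrapyd-deploy can list the configured targets (the output line below is illustrative):

```bash
# list the deploy targets defined in scrapy.cfg
$ scrapyd-deploy -l
server               http://localhost:6800/
```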
Deploy crawler
Run the following command in the project root directory:
```bash
scrapyd-deploy <target> -p <project>
```
Here target is the server name configured in scrapy.cfg in the previous step, and project is the project name. Taking the current crawler as an example:
```bash
scrapyd-deploy server -p toutiao
```
Deployment packages up your current project. If there is a setup.py file in the project directory it will be used; otherwise one is created automatically. (If the project needs to be packaged again later you can adjust the information in it as needed; for now it can be ignored.) From the returned result we can see the deployment status, the project name, the version number, the number of crawlers, and the current host name.
The running results are as follows:
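(The exact output depends on your scrapyd-client version; this is an illustrative sketch, with the version number and host name made up.)

```bash
Packing version 1623345678
Deploying to project "toutiao" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "toutiao", "version": "1623345678", "spiders": 1, "node_name": "localhost"}
```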
Use the following command to view the deployment result:
```bash
# list the projects available on a target: scrapyd-deploy -L <server name>
$ scrapyd-deploy -L server
toutiao
default
```
Refresh the http://localhost:6800 page and you will also see Available projects: toutiao, default.
Manage crawlers using the API
Scrapyd's web interface is relatively simple and mainly used for monitoring; all scheduling work relies on the JSON API. Using curl to manage crawlers is recommended.
Install curl
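(Assuming a Debian/Ubuntu server; on other systems use the corresponding package manager, and many systems ship curl by default.)

```bash
sudo apt-get install curl
```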
With curl and the JSON API you can, among other things (each operation is detailed with sample requests below):

- Schedule a crawler run, i.e. start a crawler (schedule.json)
- Cancel a crawler run (cancel.json)
- List the projects (listprojects.json)
- List crawlers, versions, and job information (listspiders.json, listversions.json, listjobs.json)
- Delete a crawler project or one of its versions (delproject.json, delversion.json)
Full API documentation: https://scrapyd.readthedocs.io
daemonstatus.json
Check the load status of the server.
```bash
# Sample request
$ curl http://localhost:6800/daemonstatus.json

# Sample response
{ "status": "ok", "running": "0", "pending": "0", "finished": "0", "node_name": "node-name" }
```
addversion.json
Add a version to a project, creating the project if it doesn't exist.
Supported request method: POST
Parameters:

- project: the project name
- version: the project version
- egg: a Python egg containing the project's code
```bash
# Sample request
$ curl http://localhost:6800/addversion.json -F project=myproject -F version=r23 -F [email protected]

# Sample response
{ "status": "ok", "spiders": 3 }
```
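The egg uploaded above has to be built first. One way, assuming the project has a setuptools setup.py (such as the one scrapyd-deploy auto-generates), is:

```bash
# build a Python egg of the project; the result lands in dist/
$ python setup.py bdist_egg
```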
schedule.json
Schedule a spider run (also known as a job), and return the job ID.
```bash
# Sample request
$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider

# Sample response
{ "status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444" }
```
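schedule.json also accepts a setting parameter to override a Scrapy setting for this run, and any other parameter is passed through as a spider argument; for example:

```bash
# run with a custom DOWNLOAD_DELAY and pass arg1=val1 to the spider
$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider \
       -d setting=DOWNLOAD_DELAY=2 -d arg1=val1
```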
cancel.json
Cancel a spider run (i.e. a job). If the job is pending, it is removed; if it is running, it is terminated.
```bash
# Sample request
$ curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444

# Sample response
{ "status": "ok", "prevstate": "running" }
```
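In practice the job ID comes from the response of schedule.json or from listjobs.json (described below). A sketch that chains the two, assuming the jq JSON tool is installed:

```bash
# schedule a run, capture its job id, then cancel it
JOBID=$(curl -s http://localhost:6800/schedule.json -d project=myproject -d spider=somespider | jq -r .jobid)
curl http://localhost:6800/cancel.json -d project=myproject -d job="$JOBID"
```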
listprojects.json
Get the list of projects uploaded to this Scrapyd server.
```bash
# Sample request
$ curl http://localhost:6800/listprojects.json

# Sample response
{ "status": "ok", "projects": ["myproject", "otherproject"] }
```
listversions.json
Get the list of versions available for a project, in order; the last one is the version currently in use.
```bash
# Sample request
$ curl http://localhost:6800/listversions.json?project=myproject

# Sample response
{ "status": "ok", "versions": ["r99", "r156"] }
```
listspiders.json
Get the list of spiders available in the last (unless overridden) version of a project.
```bash
# Sample request
$ curl http://localhost:6800/listspiders.json?project=myproject

# Sample response
{ "status": "ok", "spiders": ["spider1", "spider2", "spider3"] }
```
listjobs.json
Get the list of pending, running, and finished jobs for a project.
```bash
# Sample request
$ curl http://localhost:6800/listjobs.json?project=myproject | python -m json.tool

# Sample response
{
    "status": "ok",
    "pending": [
        {"project": "myproject", "spider": "spider1", "id": "78391cc0fcaf11e1b0090800272a6d06"}
    ],
    "running": [
        {"id": "422e608f9f28cef127b3d5ef93fe9399", "project": "myproject", "spider": "spider2",
         "start_time": "2012-09-12 10:14:03.594664"}
    ],
    "finished": [
        {"id": "2f16646cfcaf11e1b0090800272a6d06", "project": "myproject", "spider": "spider3",
         "start_time": "2012-09-12 10:14:03.594664", "end_time": "2012-09-12 10:24:03.594664"}
    ]
}
```
delversion.json
Delete a project version. If there are no more versions available for the project, the project itself is also deleted.
```bash
# Sample request
$ curl http://localhost:6800/delversion.json -d project=myproject -d version=r99

# Sample response
{ "status": "ok" }
```
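The operation list above also mentions deleting a whole project; Scrapyd exposes that as delproject.json, which follows the same pattern as the other endpoints:

```bash
# Sample request
$ curl http://localhost:6800/delproject.json -d project=myproject

# Sample response
{ "status": "ok" }
```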