ScrapydManage
GitHub: github.com/kanadebliss… Gitee: gitee.com/kanadebliss…
ScrapydManage is a Windows client for managing Scrapyd; the software is simply the Scrapyd API wrapped into an EXE file. It is written in Aardio. The source code is on GitHub and can be compiled yourself, or you can download the precompiled EXE from the GitHub releases page.
Host Management Page
Right-click menu:
Add host
Adding a host means adding a Scrapyd API address, such as 127.0.0.1:6800. If you are not familiar with Scrapyd, refer to the official documentation: scrapyd.readthedocs.io/en/stable/i… Install Scrapyd and type `scrapyd` on the command line, or create a scrapyd.conf in the current directory, adjust the configuration, and then run `scrapyd`. Reference configuration:
```ini
[scrapyd]
eggs_dir = D:/scrapyd/eggs
logs_dir = D:/scrapyd/logs
items_dir = D:/scrapyd/items
jobs_to_keep = 5
dbs_dir = D:/scrapyd/dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
node_name = localhost

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
```
At minimum, change the three directories eggs_dir, logs_dir, and dbs_dir to suit your machine; adjust the other options as needed.
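Once a host is configured, the client only needs its `ip:port` to talk to it. A minimal sketch of verifying a host is reachable, using Scrapyd's daemonstatus.json endpoint and only the Python standard library (the helper names `status_url` and `check_host` are illustrative, not part of the client):

```python
import json
import urllib.request


def status_url(host: str) -> str:
    # Build the daemonstatus.json URL for a host given as "ip:port".
    return f"http://{host}/daemonstatus.json"


def check_host(host: str, timeout: float = 5.0) -> dict:
    # Fetch and decode the daemon status, e.g.
    # {"status": "ok", "pending": 0, "running": 0, "finished": 0, "node_name": "localhost"}
    with urllib.request.urlopen(status_url(host), timeout=timeout) as resp:
        return json.loads(resp.read())
```

With a local Scrapyd running as configured above, `check_host("127.0.0.1:6800")` would return the status dict used to fill the status and node name columns.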
Refresh list status
This sends a request to every host to update the status and node name columns, which should be clear from the first screenshot.
Synchronize all projects to all hosts
As the name implies; the default version number is the current timestamp.
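Under the hood, uploading a project to a host goes through Scrapyd's addversion.json endpoint, which takes the project name, a version, and the packaged egg file as multipart form data. A sketch using only the standard library (the function names and the timestamp-as-version default mirror the behavior described above; they are not the client's actual code):

```python
import time
import urllib.request
import uuid


def default_version() -> str:
    # Default version number: the current Unix timestamp, as described above.
    return str(int(time.time()))


def add_version(host: str, project: str, egg_path: str, version: str = "") -> bytes:
    # POST the project egg to addversion.json as multipart/form-data.
    version = version or default_version()
    boundary = uuid.uuid4().hex
    with open(egg_path, "rb") as f:
        egg = f.read()
    parts = []
    for name, value in (("project", project), ("version", version)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="egg"; '
        f'filename="{project}.egg"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + egg + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    req = urllib.request.Request(
        f"http://{host}/addversion.json",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()
```

"Synchronize all projects to all hosts" then amounts to calling `add_version` for every project egg against every configured host.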
Viewing the Task Queue
This calls listjobs.json, which returns information about pending, running, and finished crawl jobs on the Scrapyd server.
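A minimal sketch of querying the task queue via listjobs.json and grouping the result the way the queue view does (the `summarize` helper is illustrative, not part of the client):

```python
import json
import urllib.request


def list_jobs(host: str, project: str) -> dict:
    # listjobs.json returns {"pending": [...], "running": [...], "finished": [...]}.
    url = f"http://{host}/listjobs.json?project={project}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())


def summarize(jobs: dict) -> str:
    # Count jobs per state, the three states the queue view distinguishes.
    return ", ".join(
        f"{state}: {len(jobs.get(state, []))}"
        for state in ("pending", "running", "finished")
    )
```

For example, `summarize(list_jobs("127.0.0.1:6800", "myproject"))` would produce a line like `pending: 0, running: 1, finished: 2`.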
Delete the host
As the name implies
Project Management Interface
Project management essentially reads the projects folder in the same directory as the EXE file. The right-click menu has three functions: refresh the project list, synchronize all projects to a host, and synchronize a single project to a host (right-click on a project first).
Creating a Task Page
The right-click menu has two functions: create a task and cancel a task. Note that the software works in these steps: select a host -> the software asks the server for all projects on that host -> select a project -> the software asks the server for all spiders in that project.
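The host -> project -> spider cascade above maps onto two Scrapyd endpoints, listprojects.json and listspiders.json. A sketch of the same two-step flow (helper names are illustrative):

```python
import json
import urllib.request


def get_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())


def spiders_url(host: str, project: str) -> str:
    return f"http://{host}/listspiders.json?project={project}"


def spiders_by_project(host: str) -> dict:
    # Step 1: ask the selected host for all of its projects.
    projects = get_json(f"http://{host}/listprojects.json")["projects"]
    # Step 2: ask for the spiders of each project.
    return {p: get_json(spiders_url(host, p))["spiders"] for p in projects}
```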
The run time can be a string in the format shown in the figure, which schedules the crawler for that specific time, or a number (in seconds), which runs it after that delay. The time interval controls whether the crawler runs repeatedly; it only accepts a number of seconds, for example 86400 to run the crawler once a day.
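Scrapyd itself only starts a crawl immediately via schedule.json; delayed and repeated runs as described above therefore have to be driven by the client. A sketch of both pieces (the `run_repeatedly` loop is an illustrative stand-in for the client's scheduling, not its actual code):

```python
import time
import urllib.parse
import urllib.request


def schedule_payload(project: str, spider: str) -> bytes:
    # Form body for schedule.json.
    return urllib.parse.urlencode({"project": project, "spider": spider}).encode()


def schedule(host: str, project: str, spider: str) -> bytes:
    # POST schedule.json; the response contains the new job's id.
    req = urllib.request.Request(
        f"http://{host}/schedule.json", data=schedule_payload(project, spider)
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read()


def run_repeatedly(host: str, project: str, spider: str,
                   delay: float, interval: float) -> None:
    # Client-side scheduling: wait `delay` seconds before the first run,
    # then fire the spider again every `interval` seconds (e.g. 86400 for daily).
    time.sleep(delay)
    while True:
        schedule(host, project, spider)
        time.sleep(interval)
```

Cancelling a task correspondingly goes through the cancel.json endpoint with the project name and job id.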
Because the software must wait for the server's response (even with multithreading the return value is still needed), there is a brief freeze after selecting a host or project. How long it lasts depends on the response delay; on a local machine it is very fast.
If you need other features, you can develop them yourself; the source code is available and secondary development should not be difficult. Aardio's syntax is similar to other languages, so it is easy to pick up quickly.