In the first two articles, we discussed the pain points of backend crawling in depth and argued for the feasibility of a client-side solution. Today we introduce the world's first client-side crawling platform and lift the veil on it!
The first two articles: "Understand the current state of crawler technology in one article" and "Crawler technology (II): client-side crawlers".
DSpider platform
DSpider is a client-side crawler platform. As described on the official website, it consists of three main parts: a cloud management platform, an SDK, and a crawler store. Let's briefly go over their respective responsibilities:
Cloud Management Platform
DSpider crawl scripts are delivered dynamically. The cloud management platform is used to configure script parameters, update scripts, collect statistics about the crawl status of scripts, and analyze errors. If you’re a developer, the cloud management platform is also a place to publish and manage your own scripts.
SDK
The SDK requests scripts from the cloud, executes them, and passes the crawl results back to the host app. (SDKs for both iOS and Android are planned officially, but at the moment only the Android SDK is available.)
The crawler store
Similar to an app store, it is a repository of crawlers where developers can pick the scripts they need or publish their own.
Integrating into an app
Taking Android as an example, the official site provides complete documentation and a demo:
- Android integration document: dspider.dtworkroom.com/document/an…
- Android demo: github.com/wendux/DSpi…
Let’s take a look at the official demo:
Explicit crawl
Crawl all article titles and links from the Jianshu homepage:
Implicit crawl (silent)
An implicit crawl shows no progress bar; in the demo, a loading dialog is displayed as an indication:
Crawl script
The crawler script is quite simple. Let's take a look:
```javascript
/**
 * Created by du on 16/11/21.
 */
dSpider("jianshu", function(session, env, $) {
    session.showProgress();
    var $items = $("div.title");
    var count = $items.length;
    session.log("Total " + count + " articles");
    session.setProgressMax(count);
    session.setProgressMsg("Initializing");
    var i = 0;
    // Simulate progress and push one item to the native side every 200 ms
    var timer = setInterval(function() {
        session.setProgress(i + 1);
        var title = $items.eq(i).text();
        session.setProgressMsg(title);
        session.push({title: title, url: $items.eq(i).parent().attr("href")});
        if (++i >= count) {
            clearInterval(timer);
            session.finish();
        }
    }, 200);
});
```
As you can see, the crawl script is very simple: it parses the web page with jQuery and then interacts with the native side via the session object. For the full API, see the dSpider JavaScript API documentation.
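To make the core contract clearer, here is a minimal, silent sketch that drops the progress simulation entirely. This is our assumption of what an implicit-crawl script could look like, not an official example; it relies only on the `session` calls already used in the demo above, and `div.title` is kept as a placeholder selector:

```javascript
// A minimal, silent variant of the demo script above.
// Assumption: omitting the progress-UI calls is all that is needed
// for a crawl that shows no progress bar.
dSpider("jianshu", function(session, env, $) {
    // Collect every matching title element in one pass
    $("div.title").each(function() {
        var $item = $(this);
        // Push each result to the native side as it is found
        session.push({
            title: $item.text(),
            url: $item.parent().attr("href")
        });
    });
    // Signal the SDK that the crawl is complete
    session.finish();
});
```

The progress calls (`showProgress`, `setProgressMax`, `setProgress`, `setProgressMsg`) appear to be purely cosmetic; the essential steps are a `session.push()` for each result followed by a final `session.finish()`.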
Things to note
- Before integrating, you need to register on the official website and, after logging in, create an application.
- Once the application is created successfully, you will get an appid, which the SDK requires.
- After the application is created, crawlers must be added to it manually. By default, the system adds a test crawler with sid 1 to every new application; its details are at dspider.dtworkroom.com/spider/1.
- The sid is the ID of a crawler. You get one when you create a crawler in the management console, or you can pick an existing crawler from the script store.