Background
WebMagic is a simple and flexible Java crawler framework. Based on WebMagic, you can quickly develop efficient, easy-to-maintain crawler applications.
I have implemented two projects using this framework:
- Crawl bidding information from a bidding website and email it to the sales staff
- Crawl vulnerability information from a vulnerability website as the base data for security analysis
This article shares my thoughts and considerations on using the crawler queue RedisScheduler.
RedisScheduler
Scheduler is a component of WebMagic for URL management. Generally speaking, Scheduler has two functions:
- Manage the queue of URLs to be crawled
- Deduplicate crawled URLs
WebMagic has several commonly used Schedulers built in. In project development we always choose RedisScheduler, which supports distributed deployment: it saves the crawled URL information in Redis and uses the Task's UUID as part of the key, so that multiple machines can crawl cooperatively at the same time.
To manage URLs and deduplicate them, RedisScheduler stores three types of information in Redis, each keyed with the task UUID as a suffix:
- Queue of URLs to crawl: queue_taskUuid
- Set of crawled URLs: set_taskUuid, used for deduplication
- Additional Request information: item_taskUuid, a hash that stores extra Request information serialized as JSON
For example, this code caches data to Redis:
protected void pushWhenNoDuplicate(Request request, Task task) {
    Jedis jedis = pool.getResource();
    try {
        // append the URL to the queue of URLs to crawl
        jedis.rpush(getQueueKey(task), request.getUrl());
        if (checkForAdditionalInfo(request)) {
            // store the extra Request information in the item_ hash, keyed by the SHA-1 digest of the URL
            String field = DigestUtils.shaHex(request.getUrl());
            String value = JSON.toJSONString(request);
            jedis.hset((ITEM_PREFIX + task.getUUID()), field, value);
        }
    } finally {
        jedis.close();
    }
}
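To make the three keys concrete: if several machines start a Spider with the same RedisScheduler and the same task UUID, they all read and write the same queue_, set_ and item_ keys and therefore crawl cooperatively. Below is a minimal sketch of such a setup; the Redis host, the UUID value "bid-task", the start URL and DemoProcessor are assumptions for illustration, not code from my projects.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.RedisScheduler;

public class DistributedCrawlerDemo {

    // placeholder PageProcessor so the sketch compiles; real extraction logic goes in process()
    static class DemoProcessor implements PageProcessor {
        private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // parse the page and add newly discovered links, e.g. page.addTargetRequests(...)
        }

        @Override
        public Site getSite() {
            return site;
        }
    }

    public static void main(String[] args) {
        Spider.create(new DemoProcessor())
                // every machine points at the same Redis instance (host is an assumption)
                .setScheduler(new RedisScheduler("redis.example.com"))
                // fixing the UUID means all machines share queue_bid-task, set_bid-task and item_bid-task
                .setUUID("bid-task")
                .addUrl("https://www.example.com/bid/list")
                .thread(5)
                .run();
    }
}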
Redis cache invalidation problem
Since RedisScheduler caches these three types of data in Redis during a crawl task, how to clean up invalid data in Redis needs to be considered according to how the project works.
For example, if every crawl task uses the same task UUID and the project requires incremental crawling, the cached data does not need to be cleaned.
If a different task UUID is used each time a website is crawled, the cached data of the previous task becomes invalid data that will never be used again. To reclaim Redis storage, this invalid data must be cleared after the task finishes.
The code to clean up the cache data for each task is simple:
/**
 * Clear the Redis cached data of a crawl task; a preparation step before starting a new task.
 * @param taskUuid the UUID of the finished task
 */
public static void clearRedisData(String taskUuid) {
    Jedis jedis = null;
    try {
        jedis = getJedis();
        jedis.del("queue_" + taskUuid);  // queue of URLs to crawl
        jedis.del("set_" + taskUuid);    // set used for deduplication
        jedis.del("item_" + taskUuid);   // extra Request information
    } catch (Exception e) {
        logger.error("Failed to clear crawler cache data (" + taskUuid + ")!", e);
    } finally {
        if (jedis != null) {
            jedis.close();
        }
    }
}
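As a usage sketch, the cleanup is typically called between two runs, before the next crawl is started. The task UUIDs and the startNewCrawl method below are hypothetical names for illustration only.

// hedged sketch: clear the previous run's Redis keys before starting the next run
String previousTaskUuid = "bid-task-old";   // hypothetical UUID of the finished task
String nextTaskUuid = "bid-task-new";       // hypothetical UUID of the next task

clearRedisData(previousTaskUuid);   // deletes the queue_/set_/item_ keys of the old task
startNewCrawl(nextTaskUuid);        // hypothetical method that starts a Spider with the new UUID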
Problem with incremental crawlers
Although RedisScheduler deduplicates URLs, it has a weakness when crawling dynamically changing websites: two requests may have the same URL but different content, and because duplicates are judged only by the URL, the already-seen URL is skipped and the task may end prematurely.
public boolean isDuplicate(Request request, Task task) {
    Jedis jedis = pool.getResource();
    try {
        // sadd returns 0 when the URL is already in the set, i.e. the request is a duplicate
        return jedis.sadd(getSetKey(task), request.getUrl()) == 0;
    } finally {
        pool.returnResource(jedis);
    }
}
For crawlers of such websites, the Scheduler needs to be customized to limit which URLs are deduplicated. For example, only the lowest-level detail-page URLs need deduplication; paging URLs should not be added to the deduplication cache.
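One way to do this is to subclass RedisScheduler and override isDuplicate so that only detail-page URLs go through the Redis set check, while paging URLs are always accepted into the queue. This is only a sketch under assumptions: the "/detail/" URL pattern and the class name are made up for illustration and have to be adapted to the target site.

import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.scheduler.RedisScheduler;

/**
 * Sketch of a customized scheduler: only detail-page URLs take part in deduplication,
 * while paging / list URLs are re-queued on every run so that newly published items
 * can still be discovered by an incremental crawl.
 */
public class SelectiveRedisScheduler extends RedisScheduler {

    public SelectiveRedisScheduler(String host) {
        super(host);
    }

    @Override
    public boolean isDuplicate(Request request, Task task) {
        // paging / list URLs: never treated as duplicates, so they are fetched on every run
        if (!request.getUrl().contains("/detail/")) {
            return false;
        }
        // detail URLs: fall back to the normal Redis set based check
        return super.isDuplicate(request, task);
    }
}

With this in place, list pages are re-crawled on every run and new detail pages reachable from them are picked up, while detail pages that were already crawled are still skipped.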
The above are my recent thoughts from using WebMagic!