Java Web crawler (3)

This article continues to explore the use of pipelines in the NetDiscovery framework, this time combined with PicCrawler, a framework specialized in crawling images, to batch-download pictures and store related information. It also gives a brief introduction to basic MongoDB operations with the Vert.x framework.

1) Target task

  • Find a website with lots of pictures of beautiful women
  • Parse out the links to the images we want to download and collect them in a list
  • Hand the list to the image crawler framework, which takes only a few lines of code
  • Store the required information in MongoDB
    // On top of the dependencies from the previous article, add:
    implementation 'io.vertx:vertx-mongo-client:3.5.0'
    implementation 'com.cv4j.piccrawler:crawler:1.0.0'

2) Parse the web page

package com.sinkinka.parser;

import com.cv4j.netdiscovery.core.domain.Page;
import com.cv4j.netdiscovery.core.domain.ResultItems;
import com.cv4j.netdiscovery.core.parser.Parser;
import com.cv4j.netdiscovery.core.parser.selector.Selectable;

import java.util.ArrayList;
import java.util.List;

public class GirlParser implements Parser {

    @Override
    public void process(Page page) {

        String xpath = "//div[@class='contLeftA']/ul[@class='artCont cl']/li";
        List<Selectable> liList = page.getHtml().xpath(xpath).nodes();
        List<String> imgUrlList = new ArrayList<>();
        for(Selectable li : liList) {
            String imageUrl = li.xpath("//img/@src").get();
            imgUrlList.add(imageUrl);
        }

        ResultItems resultItems = page.getResultItems();
        resultItems.put("needDownloadImage", imgUrlList);
    }
}
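
A quick way to verify the XPath before any downloading happens is a throwaway pipeline that just prints what the parser extracted. Below is a minimal sketch using only the Pipeline interface shown in this article; PrintImageUrls is a hypothetical helper, not a NetDiscovery class.

package com.sinkinka.pipeline;

import com.cv4j.netdiscovery.core.domain.ResultItems;
import com.cv4j.netdiscovery.core.pipeline.Pipeline;

import java.util.List;

public class PrintImageUrls implements Pipeline {

    @Override
    public void process(ResultItems resultItems) {
        // Print whatever GirlParser stored under "needDownloadImage"
        List<String> urls = resultItems.get("needDownloadImage");
        urls.forEach(System.out::println);
    }
}

While debugging, register it with .pipeline(new PrintImageUrls()) in place of the real pipelines.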

3) Download pictures

package com.sinkinka.pipeline;

import com.cv4j.netdiscovery.core.domain.ResultItems;
import com.cv4j.netdiscovery.core.pipeline.Pipeline;
import com.cv4j.piccrawler.PicCrawlerClient;
import com.cv4j.piccrawler.download.strategy.FileGenType;
import com.cv4j.piccrawler.download.strategy.FileStrategy;

import java.util.List;

public class SaveGirlImage implements Pipeline {

    @Override
    public void process(ResultItems resultItems) {

        // 1. Get the image URLs that GirlParser collected
        List<String> urls = resultItems.get("needDownloadImage");

        PicCrawlerClient.get()
                .timeOut(5000)
                .fileStrategy(new FileStrategy() {
                    @Override
                    public String filePath() {
                        return "temp";
                    }

                    @Override
                    public String picFormat() {
                        return "jpg";
                    }

                    @Override
                    public FileGenType genType() {
                        return FileGenType.AUTO_INCREMENT;
                    }
                })
                .build()
                .autoReferer()      // automatically set the Referer header
                .downloadPics(urls);

        // 2. Pass information on to the next pipeline, SaveGirlImageLog
        resultItems.put("savecount", urls.size());
    }
}
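
Many image hosts reject requests that carry no Referer header as anti-hotlinking protection; the autoReferer() call in the chain above guards against that by filling the header in automatically.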

4) Save information

package com.sinkinka.pipeline;

import com.cv4j.netdiscovery.core.domain.ResultItems;
import com.cv4j.netdiscovery.core.pipeline.Pipeline;
import io.vertx.core.AsyncResult;
import io.vertx.core.Handler;
import io.vertx.core.json.JsonObject;
import io.vertx.ext.mongo.MongoClient;

import java.util.Date;

public class SaveGirlImageLog implements Pipeline {

    private MongoClient mongoClient;   // Vert.x-based client
    private String collectionName;

    public SaveGirlImageLog(MongoClient mongoClient, String collectionName) {
        this.mongoClient = mongoClient;
        this.collectionName = collectionName;
    }

    @Override
    public void process(ResultItems resultItems) {

        // Build the document to be saved
        JsonObject jsonObject = new JsonObject();
        jsonObject.put("savecount", Integer.parseInt(resultItems.get("savecount").toString()));
        jsonObject.put("savetime", new Date().getTime());

        // Style 1: anonymous inner class as the result handler
        mongoClient.save(collectionName, jsonObject, new Handler<AsyncResult<String>>() {
            @Override
            public void handle(AsyncResult<String> response) {
                if (response.succeeded()) {
                    System.out.println("save success, new id=" + response.result());
                } else {
                    System.out.println("save failure");
                    response.cause().printStackTrace();
                }
            }
        });

        // Style 2: the equivalent lambda version
//        mongoClient.save(collectionName, jsonObject, response -> {
//            if (response.succeeded()) {
//                System.out.println("save success, new id=" + response.result());
//            } else {
//                System.out.println("save failure");
//                response.cause().printStackTrace();
//            }
//        });
    }
}
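
A note on save(): in Vert.x's MongoClient it inserts the document when no _id field is present and replaces the matching document otherwise; on a fresh insert the result handler receives the newly generated document id, which is what the println above reports.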

5) Run the program

  • One parser class: GirlParser
  • Two pipeline classes: SaveGirlImage and SaveGirlImageLog
  • The Vert.x MongoClient, whose methods are asynchronous and non-blocking
package com.sinkinka;

import com.cv4j.netdiscovery.core.Spider;
import com.sinkinka.parser.GirlParser;
import com.sinkinka.pipeline.SaveGirlImage;
import com.sinkinka.pipeline.SaveGirlImageLog;
import io.vertx.core.Vertx;
import io.vertx.core.json.JsonObject;
import io.vertx.ext.mongo.MongoClient;

public class GirlSpider {

    public static void main(String[] args) {
        String url = "http://www.woyaogexing.com/touxiang/nv/2018/586210.html"; MongoClient = mongoclient.createshared (vertx.vertx (), getDatabaseConfig())); Spider.create() .name("getGirlImage")
                .url(url)
                .parser(new GirlParser())
                .pipeline(new SaveGirlImage())
                .pipeline(new SaveGirlImageLog(mongoClient, "SaveLog"))
                .run();
    }

    public static JsonObject getDatabaseConfig() {
        JsonObject jsonObject = new JsonObject();
        jsonObject.put("connection_string"."Mongo: / / 127.0.0.1:27017");
        jsonObject.put("db_name"."test");
//        jsonObject.put("username"."");
//        jsonObject.put("password"."");
        returnjsonObject; }}Copy the code
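
Pipelines run in the order they are registered, so SaveGirlImage downloads the images first and SaveGirlImageLog then reads the savecount value it left in resultItems.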

6) MongoDB operations based on Vert.x

The class used for MongoDB operations here is io.vertx.ext.mongo.MongoClient. The methods the Vert.x MongoClient provides are all asynchronous and non-blocking, and very flexible.
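
As a quick illustration, here is a minimal sketch that reads the SaveLog collection back using two of those asynchronous methods, find() and count(). MongoQueryDemo is a hypothetical class; it reuses getDatabaseConfig() from GirlSpider above, and an empty JsonObject acts as a match-all query.

package com.sinkinka;

import io.vertx.core.Vertx;
import io.vertx.core.json.JsonObject;
import io.vertx.ext.mongo.MongoClient;

public class MongoQueryDemo {

    public static void main(String[] args) {

        MongoClient mongoClient = MongoClient.createShared(Vertx.vertx(), GirlSpider.getDatabaseConfig());

        // find() returns immediately; the handler fires once the documents arrive
        mongoClient.find("SaveLog", new JsonObject(), res -> {
            if (res.succeeded()) {
                res.result().forEach(doc -> System.out.println(doc.encodePrettily()));
            } else {
                res.cause().printStackTrace();
            }
        });

        // count() follows the same non-blocking pattern
        mongoClient.count("SaveLog", new JsonObject(), res -> {
            if (res.succeeded()) {
                System.out.println("log records so far: " + res.result());
            } else {
                res.cause().printStackTrace();
            }
        });
    }
}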

7) Summary

Using these frameworks, we can quickly implement an image crawler; in a local development environment it takes only a few minutes. The above example is just a primer, so feel free to build on it.

PicCrawler has many other powerful features. If you are interested, check out the details on GitHub.
