Java crawler: Use Jvppeteer (Puppeteer for Java) to easily crawl Taobao products

If you try to crawl a product page with HttpURLConnection, the failure rate is very high. To keep the success rate up, you generally have to drive a real browser.

The usual solutions used to be Selenium or PhantomJS, but both are cumbersome to set up and unfriendly to programmers. Since Google launched Puppeteer, it has rapidly gained popularity and acclaim. Puppeteer is a NodeJS library, but today we will crawl Taobao not with Puppeteer itself, but with Jvppeteer, a Java library implemented on the same principles.

Approach

  1. Use multithreading: each thread handles one page crawl at a time (the Page objects do the actual crawling).

  2. Create a queue of Page objects the same size as the thread pool, stored in a LinkedBlockingQueue. Each crawl task takes a Page from the queue and puts it back when the task completes. This reuses pages and reduces how often they are created; just be careful not to use any single page for too long or for too many tasks, or it may crash.

  3. Intercept the loading of images and multimedia resources. Loading multimedia and images greatly slows down the page, and therefore the crawler, so intercepting them is worthwhile (optional).

  4. Fetch the entire page content, then parse it to extract the product information.

    Code implementation

    1. Start the browser
     // Specify the local Chrome executable path and launch the browser
        String path = "F:\\Java tutorial\\49\\vuejs\\puppeteer\\.local-chromium\\win64-722234\\chrome-win\\chrome.exe";
        ArrayList<String> argList = new ArrayList<>();
        argList.add("--no-sandbox");
        argList.add("--disable-setuid-sandbox");
        LaunchOptions options = new OptionsBuilder().withArgs(argList).withHeadless(false).withExecutablePath(path).build();
        Browser browser = Puppeteer.launch(options);
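    By the way, if you don't want to hard-code a local Chrome path, Jvppeteer's documentation describes a BrowserFetcher that can download a matching Chromium on first run. A minimal sketch, assuming the no-arg BrowserFetcher.downloadIfNotExist() shown in the library's README (verify the exact API against your version):
        // Assumed API from Jvppeteer's README: downloads Chromium on the first
        // run and reuses it afterwards; check this against your library version
        BrowserFetcher.downloadIfNotExist();
        ArrayList<String> argList = new ArrayList<>();
        argList.add("--no-sandbox");
        argList.add("--disable-setuid-sandbox");
        // No withExecutablePath(): the downloaded Chromium is used instead
        LaunchOptions options = new OptionsBuilder().withArgs(argList).withHeadless(false).build();
        Browser browser = Puppeteer.launch(options);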
    2. Create page queues and thread pools
    // Start a thread pool for multi-threaded fetching
        int threadCount = 5;
        ThreadPoolExecutor executor = new ThreadPoolExecutor(threadCount, threadCount, 30, TimeUnit.SECONDS, new LinkedBlockingDeque<>());
        CompletionService<Object> service = new ExecutorCompletionService<>(executor);
        // Open 5 pages at once; reusing them across tasks cuts down the cost of creating pages
        LinkedBlockingQueue<Page> pages = new LinkedBlockingQueue<>();
        for (int i = 0; i < threadCount; i++) {
            Page page = browser.newPage();
            // Intercept requests (optional): interception adds heavy thread switching,
            // so leave it off if that becomes a bottleneck
//            page.onRequest(request -> {
//                if ("image".equals(request.resourceType()) || "media".equals(request.resourceType())) {
//                    // Abort multimedia and image requests to speed up page loading
//                    request.abort();
//                } else {
//                    // Allow all other resources through
//                    request.continueRequest();
//                }
//            });
//            page.setRequestInterception(true);
            pages.put(page); // Insert at the tail of the queue (blocking put)
        }
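    For reference, here is what the optional interception looks like when enabled. It reuses only the calls already shown in the commented snippet above; note that setRequestInterception(true) is what actually puts the onRequest handler in charge of each request (mirroring Puppeteer's behavior):
            page.onRequest(request -> {
                if ("image".equals(request.resourceType()) || "media".equals(request.resourceType())) {
                    request.abort();           // drop heavy resources to speed up loading
                } else {
                    request.continueRequest(); // let everything else through
                }
            });
            page.setRequestInterception(true); // enables interception for this page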
    3. Define the crawler task as a static inner class
    static class CrawlerCallable implements Callable<Object> {
    
            private LinkedBlockingQueue<Page> pages;
    
            public CrawlerCallable(LinkedBlockingQueue<Page> pages) {
                this.pages = pages;
            }
    
            @Override
        public Object call() {
            Page page = null;
            try {
                page = pages.take();
                PageNavigateOptions navigateOptions = new PageNavigateOptions();
                // Wait only for domcontentloaded: when image requests are intercepted the
                // page never fires the full load event, so goTo would otherwise time out
                navigateOptions.setWaitUntil(Arrays.asList("domcontentloaded"));
                page.goTo("https://item.taobao.com/item.htm?id=541605195654", navigateOptions);
                String content = page.content();
                return parseItem(content);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (page != null) {
                    try {
                        pages.put(page); // Return the page to the queue when the task is done
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
            return null;
        }
    }
    4. Parse the product and get the results
    // Result set
        List<Future<Object>> futures = new ArrayList<>();
        // Crawl 100 times
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            Future<Object> future = service.submit(new CrawlerCallable(pages));
            futures.add(future);
        }

        // Shut down the thread pool
        executor.shutdown();
        // Collect the results
        int i = 0;
        for (Future<Object> result : futures) {
            Object item = result.get();
            i++;
            System.out.println(i + ":" + Constant.OBJECTMAPPER.writeValueAsString(item));
        }
        long end = System.currentTimeMillis();
        System.out.println("Time:" + (end - start));
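    One detail worth noting: the loop above reads the futures in submission order, so one slow task delays the printing of everything behind it. Since a CompletionService is already in place, its take() method (a standard java.util.concurrent API) hands back futures in completion order instead:
        // Variation: consume results as tasks finish rather than in submission order
        for (int i = 0; i < 100; i++) {
            Future<Object> done = service.take(); // blocks until some task completes
            System.out.println(Constant.OBJECTMAPPER.writeValueAsString(done.get()));
        }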

In testing, crawling was very fast: 100 tasks completed in about 15 seconds. Results will vary with machine configuration and bandwidth, since a crawler is heavy on both.

Summary

In addition to the page queue, you can also create a browser queue, because a Browser instance can crash at any time. Start a fixed number of Browsers at project startup; when one has served a certain number of tasks (say 2000) or run for a certain time (say 2 hours), close it, launch a fresh Browser, and put that back into the queue.
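A rough sketch of that idea follows. BrowserSlot, MAX_USES, and MAX_AGE_MS are names invented here, not part of Jvppeteer; the sketch also assumes Browser exposes a close() method mirroring Puppeteer's:
    // Hypothetical browser-recycling helper built around a blocking queue
    static class BrowserSlot {
        final Browser browser;
        final long createdAt = System.currentTimeMillis();
        final AtomicInteger uses = new AtomicInteger();
        BrowserSlot(Browser browser) { this.browser = browser; }
    }

    static final int MAX_USES = 2000;                   // recycle after 2000 tasks
    static final long MAX_AGE_MS = 2 * 60 * 60 * 1000L; // or after 2 hours

    static BrowserSlot checkout(LinkedBlockingQueue<BrowserSlot> browsers, LaunchOptions options) throws Exception {
        BrowserSlot slot = browsers.take();
        if (slot.uses.incrementAndGet() > MAX_USES
                || System.currentTimeMillis() - slot.createdAt > MAX_AGE_MS) {
            slot.browser.close();                              // retire the worn-out browser
            slot = new BrowserSlot(Puppeteer.launch(options)); // launch a replacement
        }
        return slot; // caller uses slot.browser, then puts the slot back into the queue
    }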

Full code address: demo