Java Web crawler (4)

Hi, everyone. The URLs in the previous articles all returned HTML, and we parsed the desired data out of the HTML string. However, with the evolution of front-end technologies, Ajax, JSON and related techniques have been mainstream for at least a decade. Much of the data we see on web pages is requested asynchronously from the server via Ajax, returned as JSON, and then rendered into the page.

The goal of this article is to fetch the JSON data we want with the NetDiscovery crawler framework, using both GET and POST requests.

1) Get the city names

  • The drop-down box for city selection lists the names of major cities in each province:

  • Open your browser and find the link that provides this data source:

  • Now write the code based on NetDiscovery (the code is only meant to show how to get the data)

The Main class

package com.cv4j.netdiscovery.example;

import com.cv4j.netdiscovery.core.Spider;
import com.cv4j.netdiscovery.core.domain.HttpMethod;
import com.cv4j.netdiscovery.core.domain.Request;

public class TestSpider {
    public static void main(String[] args) {
        String url = "https://www.zhipin.com/common/data/city.json";

        Request request = new Request(url)
                .httpMethod(HttpMethod.GET);

        Spider.create()
                .name("getcitys")
                .request(request)
                .parser(new TestParser())
                .run();
    }
}

Parser class

package com.cv4j.netdiscovery.example;

import com.cv4j.netdiscovery.core.config.Constant;
import com.cv4j.netdiscovery.core.domain.Page;
import com.cv4j.netdiscovery.core.parser.Parser;

public class TestParser implements Parser {
    @Override
    public void process(Page page) {
        try {
            String response = page.getField(Constant.RESPONSE_JSON).toString();
            System.out.println("response = "+response);
        } catch(Exception e) {
            e.printStackTrace(); // do not swallow parse errors silently
        }
    }
}

  • Program execution result
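The parser above only prints the raw JSON string. To actually pull the city names out of it you would normally use a JSON library (Gson, Jackson, fastjson); as a minimal dependency-free sketch, a regex can grab every `"name"` value. The sample string below is a made-up fragment shaped like a typical city-list response, not the real zhipin.com payload:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CityNameExtractor {
    // Pull every "name" value out of a JSON string with a regex.
    // Fine for a quick look at the data; prefer a real JSON library
    // for production parsing.
    public static List<String> extractNames(String json) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile("\"name\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical fragment; the real response structure may differ.
        String sample = "{\"data\":[{\"name\":\"Beijing\",\"code\":101},{\"name\":\"Suzhou\",\"code\":305}]}";
        System.out.println(extractNames(sample)); // [Beijing, Suzhou]
    }
}
```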

2) Get the job listings

  • In the same way, first use the browser to analyze the target object:

  • Let’s take a look at the parameters we’re passing

Note the difference between how GET and POST pass parameters, and be familiar with the common POST body types: application/json, application/x-www-form-urlencoded, and so on.
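To make the distinction concrete: both GET query strings and `application/x-www-form-urlencoded` POST bodies use the same `key=value&key=value` encoding; the difference is only where the string travels (after the `?` in the URL vs. in the request body). A small sketch using only the JDK's `URLEncoder`; the parameter names mirror the ones in the article:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParamEncoding {
    // Encode a parameter map as application/x-www-form-urlencoded.
    // For GET the result goes after '?' in the URL; for POST it is
    // sent as the request body with that Content-Type header.
    public static String formEncode(Map<String, Object> params) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Object> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(String.valueOf(e.getValue()), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("pn", 1);
        params.put("kd", "数据工程师"); // non-ASCII values get percent-encoded
        System.out.println(formEncode(params));
        // pn=1&kd=%E6%95%B0%E6%8D%AE%E5%B7%A5%E7%A8%8B%E5%B8%88
    }
}
```

This is also why the URLs in the article contain sequences like `city=%E8%8B%8F%E5%B7%9E`: they are the UTF-8 percent-encoding of Chinese characters.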

  • Start writing code for the Main class
package com.cv4j.netdiscovery.example;

import com.cv4j.netdiscovery.core.Spider;
import com.cv4j.netdiscovery.core.config.Constant;
import com.cv4j.netdiscovery.core.domain.HttpMethod;
import com.cv4j.netdiscovery.core.domain.HttpRequestBody;
import com.cv4j.netdiscovery.core.domain.Request;

import java.util.HashMap;
import java.util.Map;

public class TestSpider {
    public static void main(String[] args) {
        String url = "https://www.lagou.com/jobs/positionAjax.json?city=%E8%8B%8F%E5%B7%9E&needAddtionalResult=false&isSchoolJob=0";

        Map<String,Object> postParams = new HashMap<>();
        postParams.put("first", true);
        postParams.put("pn", 1);
        postParams.put("kd", "Data Engineer");

        Request request = new Request(url)
                .httpMethod(HttpMethod.POST)
                .httpRequestBody(HttpRequestBody.form(postParams, Constant.UTF_8));

        Spider.create()
                .name("getpositions")
                .request(request)
                .parser(new TestParser())
                .run();
    }
}

The Parser class is the same TestParser as before.

However, the result is:

Why is that? Do not be misled by the prompt text: this is clearly our first visit, so it cannot be caused by operating too frequently. This response is an anti-crawler measure on the web server. The server has recognized that the request did not come from a browser, so it returns this result instead of data. Therefore, the program should simulate a browser's behavior as closely as possible, so that the server believes it is being accessed by a browser.

How do we simulate a browser as realistically as possible? Copy as much information as possible from the real browser request into the program's request.

As a rule of thumb, set the Referer and User-Agent headers first (see the HTTP protocol).
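The same idea applies outside NetDiscovery: whatever HTTP client you use, you attach the header values that DevTools shows for a real page visit. A small sketch that just assembles the header set (the referer and User-Agent values are placeholders; copy yours from the browser's Network tab):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserHeaders {
    // Build the minimal header set that makes a request look like it
    // came from a browser. Servers often check Referer and User-Agent
    // before deciding whether to serve data.
    public static Map<String, String> browserHeaders(String referer, String userAgent) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Referer", referer);
        headers.put("User-Agent", userAgent);
        headers.put("Content-Type", "application/x-www-form-urlencoded");
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> h = browserHeaders(
                "https://example.com/jobs/list", // hypothetical referer
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36");
        // With java.net.HttpURLConnection you would apply each entry via
        //   conn.setRequestProperty(key, value);
        h.forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```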

The new Main class

package com.cv4j.netdiscovery.example;

import com.cv4j.netdiscovery.core.Spider;
import com.cv4j.netdiscovery.core.config.Constant;
import com.cv4j.netdiscovery.core.domain.HttpMethod;
import com.cv4j.netdiscovery.core.domain.HttpRequestBody;
import com.cv4j.netdiscovery.core.domain.Request;

import java.util.HashMap;
import java.util.Map;

public class TestSpider {
    public static void main(String[] args) {
        String url = "https://www.lagou.com/jobs/positionAjax.json?city=%E8%8B%8F%E5%B7%9E&needAddtionalResult=false&isSchoolJob=0";

        Map<String,Object> postParams = new HashMap<>();
        postParams.put("first", true);
        postParams.put("pn", 1);
        postParams.put("kd", "Data Engineer");

        Request request = new Request(url)
                .httpMethod(HttpMethod.POST)
                .referer("https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=sug&fromSearch=true&suginput=%E6%95%B0%E6%8D%AE%E5%B7%A5%E7%A8%8B")
                .ua("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36")
                .httpRequestBody(HttpRequestBody.form(postParams, Constant.UTF_8));

        Spider.create()
                .name("getpositions")
                .request(request)
                .parser(new TestParser())
                .run();
    }
}

The server finally returns a result with data (whether the data is useful or not requires further analysis):

3) Summary

Understand the concepts of asynchronous Ajax execution and the JSON data format, and get familiar with debugging tools such as Chrome's Developer Tools.

The most important thing is to understand the HTTP protocol.

If you want to do it yourself, please visit NetDiscovery on Github. Your likes are the power to improve the framework!

This article is intended only to demonstrate programming techniques; frequent requests to other people's production servers are not recommended.
