Jsoup crawler

Pipeline: data source → crawl the data → collect it with Flume → big data analysis → migrate the results to MySQL with Sqoop.

1. Data collection function description

Request address: https://movie.douban.com/subject/1292052/reviews?start=0

Task: crawl the movie reviews of The Shawshank Redemption.

Fields to crawl: review title, reviewer (critic), rating, review time, useful count, not-useful count, reply count.

Storage of crawled content: logback.

Parser: Jsoup (development guide: https://jsoup.org/).

Requirements (a minimal sketch follows this list):

  • Imitate multiple clients: set request-header information so that requests appear to come from different clients.
  • Long-running task: the crawler can keep executing, but it should sleep intermittently between crawls. Requesting continuously occupies the site's concurrent connections, gets noticed, and can get your IP address banned (note that wired and wireless connections usually have different IP addresses).
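A minimal sketch of the last two requirements, using only the libraries already in this project. The class name, the User-Agent pool and the 3-10 second pause are illustrative assumptions, not part of the original code:

    import java.util.Random;

    import org.apache.http.client.methods.HttpGet;

    public class PoliteCrawling {
        // A small pool of User-Agent strings; picking one at random per request
        // imitates different clients (the values here are illustrative).
        private static final String[] USER_AGENTS = {
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15"
        };
        private static final Random RANDOM = new Random();

        /** Set a randomly chosen User-Agent header on an outgoing request. */
        public static void imitateClient(HttpGet httpGet) {
            httpGet.setHeader("User-Agent", USER_AGENTS[RANDOM.nextInt(USER_AGENTS.length)]);
        }

        /** Pause between requests; the 3-10 second range is an assumption. */
        public static void politePause() throws InterruptedException {
            Thread.sleep(3000 + RANDOM.nextInt(7000));
        }
    }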

2. Data acquisition based on crawler

2.1 Network Access Tools

  • Browser
  • Postman
  • Packet capture tool: Fiddler
  • Programmatic implementation: Apache HttpClient (simulates a browser to send requests and receive the response data). We set the request headers so that Douban treats our program as normal browser traffic.

2.2 Access to network resources

  • HttpClient

2.3 Parsing Network Resource Data

  • JSoup

Jsoup parses the page data

Document / CSS selectors / regular expressions
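A minimal, self-contained example of those three steps, using an illustrative HTML snippet rather than the real Douban markup:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupDemo {
        public static void main(String[] args) {
            String html = "<div class='item'><a class='reply'>12 replies</a></div>";

            // 1. Document: parse the raw HTML into a DOM tree
            Document document = Jsoup.parse(html);

            // 2. Selector: extract an element with a CSS selector
            String text = document.selectFirst("div.item > a.reply").text(); // "12 replies"

            // 3. Regular expression: post-process the extracted text, e.g. keep only the digits
            Matcher matcher = Pattern.compile("\\d+").matcher(text);
            System.out.println(matcher.find() ? matcher.group() : "0");      // prints 12
        }
    }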

2.4 Storage of extracted data

  • logback

2.5 Programming Implementation

  • Creating a Maven project

  • Add dependencies to pom.xml

    
            
    
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
    
      <groupId>cn.com.chinahitech.spider</groupId>
      <artifactId>film_spider</artifactId>
      <version>1.0-SNAPSHOT</version>
    
      <name>film_spider</name>
      <!-- FIXME change it to the project's website -->
      <url>http://www.example.com</url>
    
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
      </properties>
      <!-- Specify the Maven repositories (Aliyun mirror and Maven Central) -->
      <repositories>
        <repository>
          <id>ali-maven</id>
          <url>http://maven.aliyun.com/nexus/content/groups/public</url>
          <releases>
            <enabled>true</enabled>
          </releases>
          <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
            <checksumPolicy>fail</checksumPolicy>
          </snapshots>
        </repository>
    
        <repository>
          <id>central</id>
          <name>Maven Repository Switchboard</name>
          <layout>default</layout>
          <url>http://repo1.maven.org/maven2</url>
          <snapshots>
            <enabled>false</enabled>
          </snapshots>
        </repository>
      </repositories>
      <!-- Add dependencies on httpclient, jsoup and logback -->
      <dependencies>
        <!-- httpclient -->
        <dependency>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpcore</artifactId>
          <version>4.4.10</version>
        </dependency>
        <dependency>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpclient</artifactId>
          <version>4.5.6</version>
        </dependency>
    
        <!-- Jsoup for HTML parsing -->
        <dependency>
          <groupId>org.jsoup</groupId>
          <artifactId>jsoup</artifactId>
          <version>1.12.1</version>
        </dependency>
    
        <!-- logback to store logs -->
        <dependency>
          <groupId>ch.qos.logback</groupId>
          <artifactId>logback-classic</artifactId>
          <version>1.2.3</version>
        </dependency>
    
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.13</version>
        </dependency>
      </dependencies>
    
     
    </project>
    
  • Add the packaging plugin to pom.xml

     <!-- Packaging plugin -->
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.2.0</version>
            <configuration>
              <archive>
                <manifest><!-- Configure the application's main class -->
                  <mainClass>cn.com.chinahitech.spider.FilmSpider</mainClass>
                </manifest>
              </archive>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
              <encoding>utf-8</encoding>
            </configuration>
            <executions>
              <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                  <goal>single</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
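    Running mvn package then builds, in addition to the regular jar, a fat jar containing all dependencies under target/ (by default named film_spider-1.0-SNAPSHOT-jar-with-dependencies.jar, following the artifactId and version above); that is the jar to deploy on Linux.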
  • Add the logback.xml file under resources

    
            
    <configuration>
        <property name="LOG_PATTERN"
                  value="%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} -%msg%n" />
        <!-- Format of the crawled data: a year-month-day hour:minute:second timestamp, then the message, with commas as the delimiter (the text fields are also comma-separated) -->
        <property name="DATA_PATTERN" value="%d{yyyy-MM-dd HH:mm:ss},%msg%n" />
        <property name="LOG_LEVEL" value="INFO"/>
    
        <!-- Standard output log (console output) -->
        <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
            <encoder>
                <pattern>${LOG_PATTERN}</pattern>
            </encoder>
            <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
                <level>${LOG_LEVEL}</level>
            </filter>
        </appender>
    
        <!-- Data collection log, saved to file!! This is the important one -->
        <appender name="COLLECT_ROLLING"
                  class="ch.qos.logback.core.rolling.RollingFileAppender">
            <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
                <!-- Specify the location and filename of the generated log files -->
                <fileNamePattern>data/collect/data_file.%d{yyyy-MM-dd}_%i.log</fileNamePattern>
                <maxFileSize>10MB</maxFileSize>
                <maxHistory>10</maxHistory>
                <!-- Cap the total size of retained log files at 10 GB -->
                <!-- A new log file is rolled once the current file exceeds maxFileSize -->
                <totalSizeCap>10GB</totalSizeCap>
                <cleanHistoryOnStart>true</cleanHistoryOnStart>
            </rollingPolicy>
            <encoder>
                <pattern>${DATA_PATTERN}</pattern>
            </encoder>
        </appender>
    
        <!-- Specify the log output level -->
        <!-- The logger name matches the project's root package -->
        <logger name="cn.com.chinahitech.spider" level="${LOG_LEVEL}" additivity="false">
            <appender-ref ref="COLLECT_ROLLING" />
        </logger>
        <root level="${LOG_LEVEL}">
            <appender-ref ref="STDOUT" />
        </root>
    </configuration>
  • Implement the code to crawl the first page

    https://movie.douban.com/subject/1292052/reviews?start=0

    start=0 is the first page. Each page holds 20 reviews, so start=20 is the first review of the second page and start=40 is the third page.
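    The snippets below refer to a logger and to the constants BASE_URL and PAGE_SIZE without showing their declarations. A minimal class skeleton consistent with those references (the constant names and values are inferred from the pagination note above and the later snippets, so treat them as assumptions):

        package cn.com.chinahitech.spider;

        import org.slf4j.Logger;
        import org.slf4j.LoggerFactory;

        public class FilmSpider {
            // The logger name falls under "cn.com.chinahitech.spider", so its output
            // goes to the COLLECT_ROLLING appender configured in logback.xml
            private static final Logger logger = LoggerFactory.getLogger(FilmSpider.class);

            // Review-list URL; the value of the "start" parameter is appended per page
            private static final String BASE_URL =
                    "https://movie.douban.com/subject/1292052/reviews?start=";

            // Douban shows 20 reviews per page
            private static final int PAGE_SIZE = 20;

            // requestByUrl(...), parseHtml(...) and requestByPages(...) are shown below
        }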
        /**
         * Request the specified URL address.
         * @param requestUrl the URL of the request
         */
        public void requestByUrl(String requestUrl) {
            System.out.println(requestUrl);
    
            // 1. Create client objects. A fake client was created using HttpClientBuilder
            CloseableHttpClient httpClient = HttpClientBuilder.create().build();
    
            // 2. Create an Http request object, an HttpGet object, and give the client an address
            HttpGet httpGet = new HttpGet(requestUrl);
            // Set a User-Agent header so the request looks like it comes from a real browser
            // (replace the value with your own browser's User-Agent)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36");
    
            // 3. Execute the request and obtain the response
            try {
                CloseableHttpResponse response = httpClient.execute(httpGet);
    
                // 4. Obtain the status line information of the response status
                StatusLine statusLine = response.getStatusLine();
    
                if (statusLine.getStatusCode() != 200) {
                    System.out.println("Request failed");
                    return;
                }
    
                // 5. Get the data in the request
                HttpEntity httpEntity = response.getEntity();
    
                // 6. Convert httpEntity to String
                String content = EntityUtils.toString(httpEntity);
             // System.out.println(content);
    
                // 7. Parse current page data
                 parseHtml(content);
    
    
            } catch (ClientProtocolException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

    You can copy your own browser's User-Agent from the request headers shown in the browser's developer tools.

  • Parse the current page data

    Before parsing, let's check what data we actually get by printing it directly to the console: it is simply the raw HTML of the page.

        /**
         * Parse the HTML content.
         * @param content the HTML content to be parsed
         */
        public void parseHtml(String content) {
            // 1. Encapsulate the response data with Jsoup and convert it into a Document object
            Document document = Jsoup.parse(content);
    
            // 2. Use the Jsoup API to extract 20 pieces of comment information from this page
            Elements elements = document.select("div.article > div.review-list > div > div.main.review-item");
    
            // 3. Parse data (review title, critics, ratings, post review time, useful, useless, reply)
            for(Element item: elements) {
                // 1. header
                Element header = item.selectFirst("header");
                String username = header.selectFirst("a.name").text();     // reviewer (critic)
                String rating = header.selectFirst("span").attr("title");  // rating
                String date = header.selectFirst("span.main-meta").text(); // review time
                // 2. body
                Element body = item.selectFirst("div.main-bd");
                String title = body.selectFirst("h2 > a").text();          // review title
                // Action bar of the review (useful / useless / reply counts)
                Element ratings = body.selectFirst("div.action");
                String usefulCount = ratings.selectFirst("a.action-btn.up > span").text();
                String uselessCount = ratings.selectFirst("a.action-btn.down > span").text();
                String replyCount = ratings.selectFirst("a.reply").text();
                // 3. Use a regular expression to extract the numeric reply count
                Pattern pattern = Pattern.compile("\\d*");
                Matcher matcher = pattern.matcher(replyCount);
                replyCount = matcher.find() ? matcher.group() : "0";
    
                // 4. String content concatenation
                StringBuilder str = new StringBuilder();
                str.append(title).append(",").append(username).append(",").append(rating).append(",")
                        .append(date).append(",").append(usefulCount).append(",").append(uselessCount).append(",")
                        .append(replyCount);
    
                // 5. Write the crawled record to the log
                logger.info(str.toString());
            }
        }
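    With the DATA_PATTERN configured in logback.xml, each review therefore becomes one comma-delimited line in data/collect/: the log timestamp first, then title, reviewer, rating, review time, useful count, useless count and reply count, which is the flat format the downstream Flume and Sqoop steps can consume.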

    For details of how Jsoup parsing works, see https://jsoup.org/

  • Crawl N pages of data

        /**
         * Crawl multiple pages.
         * @param page the number of pages to crawl
         */
        public void requestByPages(int page){
            int beginIndex = 0;
            for (int i = 0; i < page; ++i) {
                beginIndex = i * PAGE_SIZE;
                requestByUrl(BASE_URL + beginIndex);
                System.out.println();
            }
        }

Process: main() calls requestByPages() with the number of pages to fetch. requestByPages() builds the URL for each page and passes it to requestByUrl(), which sends the request, receives the page HTML as a string, and hands it to parseHtml(), which extracts the fields and writes them to the log in the target format.
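For completeness, a minimal entry point matching the mainClass configured in the assembly plugin; the page count, the endless loop and the pause between rounds are assumptions that merely illustrate the "long-running task with intermittent sleeps" requirement from section 1:

    public static void main(String[] args) throws InterruptedException {
        FilmSpider spider = new FilmSpider();
        while (true) {
            spider.requestByPages(5);      // crawl a batch of pages per round
            Thread.sleep(60 * 60 * 1000L); // then sleep (the interval is an assumption)
        }
    }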

Package the crawler as a JAR and deploy it on Linux

Run the Maven package goal (double-click package in the IDE's Maven panel).

Upload to Linux:

[Get movie review data based on background crawler]

After the command is executed, local data files are generated; the log files are stored under data/collect/.