Jsoup crawler

Pipeline: data source → crawl the data → collect it with Flume → big data analysis → migrate the results to MySQL with Sqoop.

1. Data collection function description

Request address: https://movie.douban.com/subject/1292052/reviews?start=0

Task: crawl the movie reviews of The Shawshank Redemption.

Fields to crawl: review title, reviewer (critic), rating, review time, useful count, not-useful count, reply count.

Storage of crawled content: logback.

Parser: Jsoup (development guide: https://jsoup.org/).

Requirements (a minimal sketch follows this list):

  • Imitate multiple clients: set request-header information so that requests appear to come from different clients.
  • Long-running task: the crawler can keep executing, but it should sleep intermittently between crawls. Requesting continuously occupies the site's concurrent connections, gets noticed, and can get your IP address banned (note that wired and wireless connections usually have different IP addresses).
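A minimal sketch of the last two requirements, using only the libraries already in this project. The class name, the User-Agent pool and the 3-10 second pause are illustrative assumptions, not part of the original code:

    import java.util.Random;

    import org.apache.http.client.methods.HttpGet;

    public class PoliteCrawling {
        // A small pool of User-Agent strings; picking one at random per request
        // imitates different clients (the values here are illustrative).
        private static final String[] USER_AGENTS = {
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15"
        };
        private static final Random RANDOM = new Random();

        /** Set a randomly chosen User-Agent header on an outgoing request. */
        public static void imitateClient(HttpGet httpGet) {
            httpGet.setHeader("User-Agent", USER_AGENTS[RANDOM.nextInt(USER_AGENTS.length)]);
        }

        /** Pause between requests; the 3-10 second range is an assumption. */
        public static void politePause() throws InterruptedException {
            Thread.sleep(3000 + RANDOM.nextInt(7000));
        }
    }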

2. Data acquisition based on crawler

2.1 Network Access Tools

  • Browser
  • Postman
  • Packet capture tool: Fiddler
  • Programmatic implementation: Apache HttpClient (simulates a browser to send requests and receive the response data). We set the request headers so that Douban treats our program as normal browser traffic.

2.2 Access to network resources

  • HttpClient

2.3 Parsing Network Resource Data

  • JSoup

Jsoup parses the page data

Document / CSS selectors / regular expressions
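A minimal, self-contained example of those three steps, using an illustrative HTML snippet rather than the real Douban markup:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupDemo {
        public static void main(String[] args) {
            String html = "<div class='item'><a class='reply'>12 replies</a></div>";

            // 1. Document: parse the raw HTML into a DOM tree
            Document document = Jsoup.parse(html);

            // 2. Selector: extract an element with a CSS selector
            String text = document.selectFirst("div.item > a.reply").text(); // "12 replies"

            // 3. Regular expression: post-process the extracted text, e.g. keep only the digits
            Matcher matcher = Pattern.compile("\\d+").matcher(text);
            System.out.println(matcher.find() ? matcher.group() : "0");      // prints 12
        }
    }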

2.4 Storage of extracted data

  • logback

2.5 Programming Implementation

  • Creating a Maven project

  • Add dependencies to pom.xml

    
            
    
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
    
      <groupId>cn.com.chinahitech.spider</groupId>
      <artifactId>film_spider</artifactId>
      <version>1.0-SNAPSHOT</version>
    
      <name>film_spider</name>
      <!-- FIXME change it to the project's website -->
      <url>http://www.example.com</url>
    
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
      </properties>
      <!-- Specify the Maven repositories (Aliyun mirror and Maven Central) -->
      <repositories>
        <repository>
          <id>ali-maven</id>
          <url>http://maven.aliyun.com/nexus/content/groups/public</url>
          <releases>
            <enabled>true</enabled>
          </releases>
          <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
            <checksumPolicy>fail</checksumPolicy>
          </snapshots>
        </repository>
    
        <repository>
          <id>central</id>
          <name>Maven Repository Switchboard</name>
          <layout>default</layout>
          <url>http://repo1.maven.org/maven2</url>
          <snapshots>
            <enabled>false</enabled>
          </snapshots>
        </repository>
      </repositories>
      <!-- Add dependencies on httpclient, jsoup and logback -->
      <dependencies>
        <!-- httpclient -->
        <dependency>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpcore</artifactId>
          <version>4.4.10</version>
        </dependency>
        <dependency>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpclient</artifactId>
          <version>4.5.6</version>
        </dependency>
    
        <!-- Jsoup for HTML parsing -->
        <dependency>
          <groupId>org.jsoup</groupId>
          <artifactId>jsoup</artifactId>
          <version>1.12.1</version>
        </dependency>
    
        <!-- logback to store logs -->
        <dependency>
          <groupId>ch.qos.logback</groupId>
          <artifactId>logback-classic</artifactId>
          <version>1.2.3</version>
        </dependency>
    
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.13</version>
        </dependency>
      </dependencies>
    
     
    </project>
    
  • Add the packaging plugin to pom.xml

     <!-- Packaging plugin -->
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.2.0</version>
            <configuration>
              <archive>
                <manifest><!-- Configure the application's main class -->
                  <mainClass>cn.com.chinahitech.spider.FilmSpider</mainClass>
                </manifest>
              </archive>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
              <encoding>utf-8</encoding>
            </configuration>
            <executions>
              <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                  <goal>single</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
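    Running mvn package then builds, in addition to the regular jar, a fat jar containing all dependencies under target/ (by default named film_spider-1.0-SNAPSHOT-jar-with-dependencies.jar, following the artifactId and version above); that is the jar to deploy on Linux.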
  • Add the logback.xml file under resources

    
            
    <configuration>
        <property name="LOG_PATTERN"
                  value="%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} -%msg%n" />
        <!-- Format of the crawled data: a year-month-day hour:minute:second timestamp, then the message, with commas as the delimiter (the text fields are also comma-separated) -->
        <property name="DATA_PATTERN" value="%d{yyyy-MM-dd HH:mm:ss},%msg%n" />
        <property name="LOG_LEVEL" value="INFO"/>
    
        <!-- Standard output log (console output) -->
        <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
            <encoder>
                <pattern>${LOG_PATTERN}</pattern>
            </encoder>
            <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
                <level>${LOG_LEVEL}</level>
            </filter>
        </appender>
    
        <!-- Data collection log, saved to file!! This is the important one -->
        <appender name="COLLECT_ROLLING"
                  class="ch.qos.logback.core.rolling.RollingFileAppender">
            <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
                <!-- Specify the location and filename of the generated log files -->
                <fileNamePattern>data/collect/data_file.%d{yyyy-MM-dd}_%i.log</fileNamePattern>
                <maxFileSize>10MB</maxFileSize>
                <maxHistory>10</maxHistory>
                <!-- Cap the total size of retained log files at 10 GB -->
                <!-- A new log file is rolled once the current file exceeds maxFileSize -->
                <totalSizeCap>10GB</totalSizeCap>
                <cleanHistoryOnStart>true</cleanHistoryOnStart>
            </rollingPolicy>
            <encoder>
                <pattern>${DATA_PATTERN}</pattern>
            </encoder>
        </appender>
    
        <!-- Specify the log output level -->
        <!-- The logger name matches the project's root package -->
        <logger name="cn.com.chinahitech.spider" level="${LOG_LEVEL}" additivity="false">
            <appender-ref ref="COLLECT_ROLLING" />
        </logger>
        <root level="${LOG_LEVEL}">
            <appender-ref ref="STDOUT" />
        </root>
    </configuration>
  • Implement the code to crawl the first page

    https://movie.douban.com/subject/1292052/reviews?start=0

    start=0 is the first page. Each page holds 20 reviews, so start=20 is the first review of the second page and start=40 is the third page.
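    The snippets below refer to a logger and to the constants BASE_URL and PAGE_SIZE without showing their declarations. A minimal class skeleton consistent with those references (the constant names and values are inferred from the pagination note above and the later snippets, so treat them as assumptions):

        package cn.com.chinahitech.spider;

        import org.slf4j.Logger;
        import org.slf4j.LoggerFactory;

        public class FilmSpider {
            // The logger name falls under "cn.com.chinahitech.spider", so its output
            // goes to the COLLECT_ROLLING appender configured in logback.xml
            private static final Logger logger = LoggerFactory.getLogger(FilmSpider.class);

            // Review-list URL; the value of the "start" parameter is appended per page
            private static final String BASE_URL =
                    "https://movie.douban.com/subject/1292052/reviews?start=";

            // Douban shows 20 reviews per page
            private static final int PAGE_SIZE = 20;

            // requestByUrl(...), parseHtml(...) and requestByPages(...) are shown below
        }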
        /**
         * Request the specified URL address.
         * @param requestUrl the URL of the request
         */
        public void requestByUrl(String requestUrl) {
            System.out.println(requestUrl);
    
            // 1. Create client objects. A fake client was created using HttpClientBuilder
            CloseableHttpClient httpClient = HttpClientBuilder.create().build();
    
            // 2. Create an Http request object, an HttpGet object, and give the client an address
            HttpGet httpGet = new HttpGet(requestUrl);
            // Set a User-Agent header so the request looks like it comes from a real browser
            // (replace the value with your own browser's User-Agent)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36");
    
            // 3. Execute the request and obtain the response
            try {
                CloseableHttpResponse response = httpClient.execute(httpGet);
    
                // 4. Obtain the status line information of the response status
                StatusLine statusLine = response.getStatusLine();
    
                if (statusLine.getStatusCode() != 200) {
                    System.out.println("Request failed");
                    return;
                }
    
                // 5. Get the data in the request
                HttpEntity httpEntity = response.getEntity();
    
                // 6. Convert httpEntity to String
                String content = EntityUtils.toString(httpEntity);
             // System.out.println(content);
    
                // 7. Parse current page data
                 parseHtml(content);
    
    
            } catch (ClientProtocolException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

    You can copy your own browser's User-Agent from the request headers shown in the browser's developer tools.

  • Parse the current page data

    Before parsing, let's check what data we actually get by printing it directly to the console: it is simply the raw HTML of the page.

        /**
         * Parse the HTML content.
         * @param content the HTML content to be parsed
         */
        public void parseHtml(String content) {
            // 1. Encapsulate the response data with Jsoup and convert it into a Document object
            Document document = Jsoup.parse(content);
    
            // 2. Use the Jsoup API to extract 20 pieces of comment information from this page
            Elements elements = document.select("div.article > div.review-list > div > div.main.review-item");
    
            // 3. Parse data (review title, critics, ratings, post review time, useful, useless, reply)
            for(Element item: elements) {
                // 1. header
                Element header = item.selectFirst("header");
                String username = header.selectFirst("a.name").text();     // reviewer (critic)
                String rating = header.selectFirst("span").attr("title");  // rating
                String date = header.selectFirst("span.main-meta").text(); // review time
                // 2. body
                Element body = item.selectFirst("div.main-bd");
                String title = body.selectFirst("h2 > a").text();          // review title
                // Action bar of the review (useful / useless / reply counts)
                Element ratings = body.selectFirst("div.action");
                String usefulCount = ratings.selectFirst("a.action-btn.up > span").text();
                String uselessCount = ratings.selectFirst("a.action-btn.down > span").text();
                String replyCount = ratings.selectFirst("a.reply").text();
                // 3. Use a regular expression to extract the numeric reply count
                Pattern pattern = Pattern.compile("\\d*");
                Matcher matcher = pattern.matcher(replyCount);
                replyCount = matcher.find() ? matcher.group() : "0";
    
                // 4. String content concatenation
                StringBuilder str = new StringBuilder();
                str.append(title).append(",").append(username).append(",").append(rating).append(",")
                        .append(date).append(",").append(usefulCount).append(",").append(uselessCount).append(",")
                        .append(replyCount);
    
                // 5. Write the crawled record to the log
                logger.info(str.toString());
            }
        }
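    With the DATA_PATTERN configured in logback.xml, each review therefore becomes one comma-delimited line in data/collect/: the log timestamp first, then title, reviewer, rating, review time, useful count, useless count and reply count, which is the flat format the downstream Flume and Sqoop steps can consume.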

    For details of how Jsoup parsing works, see https://jsoup.org/

  • Crawl N pages of data

        /**
         * Crawl multiple pages.
         * @param page the number of pages to crawl
         */
        public void requestByPages(int page){
            int beginIndex = 0;
            for (int i = 0; i < page; ++i) {
                beginIndex = i * PAGE_SIZE;
                requestByUrl(BASE_URL + beginIndex);
                System.out.println();
            }
        }

Process: main() calls requestByPages() with the number of pages to fetch. requestByPages() builds the URL for each page and passes it to requestByUrl(), which sends the request, receives the page HTML as a string, and hands it to parseHtml(), which extracts the fields and writes them to the log in the target format.
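For completeness, a minimal entry point matching the mainClass configured in the assembly plugin; the page count, the endless loop and the pause between rounds are assumptions that merely illustrate the "long-running task with intermittent sleeps" requirement from section 1:

    public static void main(String[] args) throws InterruptedException {
        FilmSpider spider = new FilmSpider();
        while (true) {
            spider.requestByPages(5);      // crawl a batch of pages per round
            Thread.sleep(60 * 60 * 1000L); // then sleep (the interval is an assumption)
        }
    }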

Package the crawler as a JAR and deploy it on Linux

Run the Maven package goal (double-click package in the IDE's Maven panel).

Upload to Linux:

[Get movie review data based on background crawler]

After the command is executed, local data files are generated; the log files are stored under data/collect/.