1. Overview

What does the Java crawler series cover?

Introduction to the Java crawler framework WebMagic
Use WebMagic to crawl movie resources from ady01.com (action movie list page, movie download addresses, etc.)
Use WebMagic to crawl Geek Time course resources (article course series and video course series)

The main content of this article:

An introduction to Java crawler frameworks
An introduction to the Java crawler framework WebMagic
Use WebMagic to crawl action movie list information

2. An easy-to-use Java crawler framework

How do you judge whether a framework is good?

It is easy to learn and use, with plenty of thorough learning material available online
It has many users, so most of the pitfalls have already been found and fixed by others, and it is comfortable to work with
It is updated quickly and has an active community, so you can try improved features early and interact with the authors
It is stable and easy to extend

Based on these criteria, I recommend WebMagic, a very handy Java crawler framework.

3. Introducing WebMagic

WebMagic is a simple and flexible Java crawler framework. With WebMagic you can quickly develop an efficient, easy-to-maintain crawler.
WebMagic official website: webmagic.io/
WebMagic Chinese documentation: webmagic.io/docs/zh/

4. Use WebMagic to crawl the action movie list

Use WebMagic to crawl the movie list information from the Love Movie site (ady01.com)

Example source code address

1. Create the Spring Boot project java-Pachong

2. Add the Maven dependencies

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>

    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>

    <!-- webmagic start -->
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
        <exclusions>
            <exclusion>
                <artifactId>fastjson</artifactId>
                <groupId>com.alibaba</groupId>
            </exclusion>
            <exclusion>
                <artifactId>commons-io</artifactId>
                <groupId>commons-io</groupId>
            </exclusion>
            <exclusion>
                <artifactId>log4j</artifactId>
                <groupId>log4j</groupId>
            </exclusion>
            <exclusion>
                <artifactId>slf4j-log4j12</artifactId>
                <groupId>org.slf4j</groupId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-selenium</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>net.minidev</groupId>
        <artifactId>json-smart</artifactId>
        <version>2.2.1</version>
    </dependency>
    <!-- webmagic end -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.49</version>
    </dependency>
    <dependency>
        <groupId>commons-lang</groupId>
        <artifactId>commons-lang</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>commons-codec</groupId>
        <artifactId>commons-codec</artifactId>
        <version>1.11</version>
    </dependency>
    <dependency>
        <groupId>commons-collections</groupId>
        <artifactId>commons-collections</artifactId>
        <version>3.2.2</version>
    </dependency>
</dependencies>

3. Write code to capture movie data

Open the Love Movie (ady01.com) action movie list page in Google Chrome.
Opening the developer tools (F12) shows that the data on the list page is loaded through an Ajax request, so we take that request address:

M.ady01.com/rs/film/lis…
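The full request address used in the code further down ends in a `_=1555726508180` query parameter. A minimal sketch follows, assuming that this value is an epoch-millisecond timestamp added to defeat caching (a common Ajax pattern, but not something the site documents). The class `ListUrlBuilder` and its methods are hypothetical helpers, not part of WebMagic or of the example project.

```java
import java.time.Instant;

// Hypothetical helper for illustration; not part of WebMagic or the article's project.
class ListUrlBuilder {
    // Base address copied from the request used in this article's crawler code.
    static final String BASE = "http://m.ady01.com/rs/film/listJson/1/2";

    // Appends "_=<epoch millis>" so every request gets a unique URL
    // (assumption: the "_" parameter is only a cache-buster).
    static String withTimestamp(String baseUrl, long epochMillis) {
        return baseUrl + "?_=" + epochMillis;
    }

    // Interprets a "_" value as epoch milliseconds (assumption) and returns the UTC instant.
    static Instant decode(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    public static void main(String[] args) {
        // The value from the article's URL, read as a timestamp.
        System.out.println(decode(1555726508180L)); // prints 2019-04-20T02:15:08.180Z
        // The same address with a fresh timestamp.
        System.out.println(withTimestamp(BASE, System.currentTimeMillis()));
    }
}
```

If the assumption holds, the crawler can pass `withTimestamp(BASE, System.currentTimeMillis())` to the spider instead of a hard-coded address.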
Write the crawl code:

package com.ady01.demo1;

import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * description: first crawler
 * time: 2019/4/20 10:58
 * author: ready
 */
@Slf4j
public class Ady01comPageProcessor implements PageProcessor {
    @Override
    public void process(Page page) {
        log.info("Crawl succeeded!");
        log.info("Crawled content: " + page.getRawText());
    }

    @Override
    public Site getSite() {
        return Site.me().setSleepTime(1000).setRetryTimes(3);
    }

    public static void main(String[] args) {
        String url = "http://m.ady01.com/rs/film/listJson/1/2?_=1555726508180";
        Spider.create(new Ady01comPageProcessor()).addUrl(url).thread(1).run();
    }
}
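In `getSite()`, `setSleepTime(1000)` and `setRetryTimes(3)` ask WebMagic to pause between pages and to retry failed downloads. To give a feel for what the framework handles for us, here is a rough hand-rolled sketch of a retry-with-pause loop in plain Java. It is an illustration under simplifying assumptions, not WebMagic's actual implementation; `RetryingFetcher` and `fetchWithRetry` are hypothetical names, and the download is simulated rather than sent over the network.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hand-rolled illustration only; this is NOT how WebMagic is implemented.
class RetryingFetcher {
    // Runs the task once and, on failure, sleeps and retries up to retryTimes more times.
    // Rethrows the last failure when every attempt has failed.
    static <T> T fetchWithRetry(Callable<T> task, int retryTimes, long sleepMillis) throws Exception {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < retryTimes) {
                    Thread.sleep(sleepMillis); // pause before the next attempt
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated download: fails twice, then returns a fake response body.
        String body = fetchWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure");
            }
            return "{\"ok\":true}";
        }, 3, 10);
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

With WebMagic none of this needs to be written by hand: the two setter calls on `Site` are enough.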

4. Run the crawler code

Run the main method in Ady01comPageProcessor. The result is as follows:

5. Summary

This article uses one example to show how little code WebMagic needs to complete a data-capture job. As the code shows, WebMagic hides the complex parts for us, so we only need to focus on writing the business code.
This article does not explain how to use WebMagic in detail, mainly because WebMagic already provides very thorough documentation; see the WebMagic Chinese documentation. If you need a deeper understanding, study the WebMagic source code, which is very helpful for writing crawlers.
Tomorrow we will crawl the detail page of each action movie and collect the movie's download address from it.
Import the example code into IDEA to run it; IDEA needs Maven and Lombok support installed.
For more technical articles, please follow our official account: Javacode2018


Java Crawlers lecture 1 – Getting started with crawlers
