1. Overview
What does the Java crawler series cover?
- An introduction to the Java crawler framework WebMagic
- Using WebMagic to crawl movie resources from ady01.com (the action movie list page, movie download locations, etc.)
- Using WebMagic to crawl Geek Time course resources (the article course series and the video course series)
What this article covers:
- An introduction to Java crawler frameworks
- An introduction to the Java crawler framework WebMagic
- Using WebMagic to crawl the action movie list
2. An easy-to-use Java crawler framework
How do you judge whether a framework is good?
- It is easy to learn and use, with plenty of thorough learning material available online
- It has many users, so most of the pitfalls have already been found and fixed by others, and it works smoothly in practice
- It is updated quickly and has an active community, so you get better features sooner and can interact with the authors
- It is stable and easy to extend
By these criteria, WebMagic is a very useful Java crawler framework and is the one recommended here.
3. Introduction to WebMagic
- WebMagic is a simple and flexible Java crawler framework. With WebMagic you can quickly develop an efficient, easy-to-maintain crawler.
- WebMagic official website: webmagic.io/
- WebMagic Chinese documentation: webmagic.io/docs/zh/
4. Use WebMagic to crawl the list of action movies
Use WebMagic to crawl the movie list information from the Love Movie site (ady01.com).
Example source code address
1. Create the Spring Boot project java-Pachong
2. Add the Maven dependencies
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<!-- webmagic start -->
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.7.3</version>
<exclusions>
<exclusion>
<artifactId>fastjson</artifactId>
<groupId>com.alibaba</groupId>
</exclusion>
<exclusion>
<artifactId>commons-io</artifactId>
<groupId>commons-io</groupId>
</exclusion>
<exclusion>
<artifactId>log4j</artifactId>
<groupId>log4j</groupId>
</exclusion>
<exclusion>
<artifactId>slf4j-log4j12</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.7.3</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-selenium</artifactId>
<version>0.7.3</version>
</dependency>
<dependency>
<groupId>net.minidev</groupId>
<artifactId>json-smart</artifactId>
<version>2.2.1</version>
</dependency>
<!-- webmagic end -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.49</version>
</dependency>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.6</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.6</version>
</dependency>
<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
<version>1.11</version>
</dependency>
<dependency>
<groupId>commons-collections</groupId>
<artifactId>commons-collections</artifactId>
<version>3.2.2</version>
</dependency>
</dependencies>
3. Write code to capture movie data
- Open the Love Movie action movie list page in Google Chrome.
- Press F12 to open the developer tools. The data in the list page is loaded through an Ajax request, and this is the request address:
M.ady01.com/rs/film/lis…
- Write the crawl code:
package com.ady01.demo1;

import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * description : first crawler
 * time : 2019/4/20 10:58
 * author : ready
 */
@Slf4j
public class Ady01comPageProcessor implements PageProcessor {

    // Called for each page the spider downloads.
    @Override
    public void process(Page page) {
        log.info("Crawl succeeded!");
        log.info("Crawled content: " + page.getRawText());
    }

    // Crawl settings: 1000 ms sleep time and 3 retries.
    @Override
    public Site getSite() {
        return Site.me().setSleepTime(1000).setRetryTimes(3);
    }

    public static void main(String[] args) {
        // Ajax request address found with F12 on the list page.
        String url = "http://m.ady01.com/rs/film/listJson/1/2?_=1555726508180";
        Spider.create(new Ady01comPageProcessor()).addUrl(url).thread(1).run();
    }
}
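The crawler configures its Site with a sleep time of 1000 ms and 3 retries. To give a feel for the kind of plumbing such settings save you from writing, here is a minimal plain-Java sketch of a retry loop with a pause between attempts. It is not WebMagic's implementation: the class RetrySketch and the method callWithRetry are hypothetical names, the download is simulated, and pausing only before a retry is a simplification chosen for this example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; WebMagic handles this internally in its own way.
class RetrySketch {

    // Runs the task once, then retries up to retryTimes more times on failure,
    // sleeping sleepMillis before each retry. Rethrows the last failure.
    static <T> T callWithRetry(Callable<T> task, int retryTimes, long sleepMillis) throws Exception {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            if (attempt > 0) {
                Thread.sleep(sleepMillis); // pause before trying again
            }
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // remember the failure and fall through to the next attempt
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for a page download: fails twice, then succeeds.
        String body = callWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IOException("simulated timeout");
            }
            return "simulated response";
        }, 3, 10);
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

With WebMagic none of this has to be written by hand; the two setter calls in getSite() are all the crawler code needs.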
4. Run crawler code
Run the main method in Ady01comPageProcessor. The log shows the success message followed by the raw text of the response.
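One detail worth noting when re-running the crawler: the request URL in main ends with _=1555726508180. That value reads as a Unix timestamp in milliseconds, which looks like the cache-busting parameter that Ajax libraries such as jQuery append to requests. Whether the server actually requires it is not stated here, so the following is only a sketch, assuming you want to send a fresh timestamp on each run. CacheBusterUrl and withTimestamp are hypothetical names, not part of WebMagic or the example project.

```java
import java.time.Instant;

// Hypothetical helper, not part of WebMagic or of the example project.
class CacheBusterUrl {

    // Appends "_=<epochMillis>" as a query parameter, using "?" or "&" as needed.
    static String withTimestamp(String baseUrl, long epochMillis) {
        String separator = baseUrl.contains("?") ? "&" : "?";
        return baseUrl + separator + "_=" + epochMillis;
    }

    public static void main(String[] args) {
        // Same list address as in the crawler, without the "_" parameter.
        String baseUrl = "http://m.ady01.com/rs/film/listJson/1/2";

        // The value used in the crawler, decoded to a readable UTC instant.
        System.out.println(Instant.ofEpochMilli(1555726508180L));

        // A URL stamped with the current time.
        System.out.println(withTimestamp(baseUrl, System.currentTimeMillis()));
    }
}
```

The resulting string can be passed to addUrl(...) in place of the hard-coded URL.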
5. Summary
- This article uses one example to show how simple WebMagic makes data crawling. As the code shows, WebMagic hides the complex parts for us, so we only need to focus on writing the business code.
- This article does not explain how to use WebMagic in detail, mainly because WebMagic already provides very thorough documentation. See the WebMagic Chinese documentation; for a deeper understanding, study the WebMagic source code, which is very helpful when writing crawlers.
- Tomorrow we will crawl each action movie's detail page and collect the movie download locations from it.
- The example code can be imported into IntelliJ IDEA and run; IDEA needs Maven and Lombok support installed.
- For more technical articles, please follow our official account: Javacode2018