Hello, everyone. When it comes to web crawlers, I believe many programmers have heard of them. Simply put, a crawler is a program that automatically fetches information from the network in batches. In this article, I use NetDiscovery, a crawler framework hosted on GitHub, to demonstrate.
1) Why use a framework?
A framework handles the groundwork that is not directly related to the target task, so we can stay focused on the task itself. For crawler beginners in particular, a framework delivers quick results and a sense of achievement without the distraction of plumbing code. Once you have found your way in, try writing a crawler program from scratch without relying on a framework, and then study how crawler frameworks built by others work. When you can read a crawler framework's source code comfortably, you can fairly claim some real understanding of web crawling.
2) Demonstration environment
Java JDK8, IntelliJ IDEA, Google Chrome
Crawler Framework NetDiscovery: github.com/fengzhizi71…
3) Define the crawler task
Fetch specific job information from a recruitment website: company name and position name.
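To make the target data concrete, the two fields can be modeled as a simple value class. Note that this JobInfo class is purely illustrative and not part of NetDiscovery:

```java
package com.sinkinka;

// Hypothetical value class describing one row of target data.
// Not part of the NetDiscovery framework; introduced here only to
// make explicit what the crawler is supposed to collect.
public class JobInfo {

    private final String companyName;
    private final String positionName;

    public JobInfo(String companyName, String positionName) {
        this.companyName = companyName;
        this.positionName = positionName;
    }

    public String getCompanyName() {
        return companyName;
    }

    public String getPositionName() {
        return positionName;
    }
}
```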
4) Analyze the webpage manually
Open the target page in Chrome, enter the search criteria, and find the page that lists the job information:
The text in the red box is the information we plan to retrieve automatically with a program.
This analysis step is very important: we need a clear understanding of the target web page and the target data. The human eye can already see the information; the next step is to write a program that teaches the computer to grab it for us.
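Before writing any parser, it is also worth confirming that the data actually appears in the static HTML rather than being rendered by JavaScript, since a plain HTTP crawler would not see JavaScript-rendered content. Here is a minimal sanity check, assuming the org.jsoup:jsoup library is added as a dependency (the class name and selector are my own for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Fetch the raw HTML and count the company-name cells.
// If the count is zero, the page is probably rendered by JavaScript
// and a different strategy (e.g. a headless browser) would be needed.
public class StaticHtmlCheck {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.szrc.cn/HrMarket/WLZP/ZP/0/%E6%95%B0%E6%8D%AE")
                .userAgent("Mozilla/5.0")
                .get();
        System.out.println(doc.select("td.td_companyName").size() + " company cells found");
    }
}
```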
5) Create a Java project
Create a Gradle Java project:
Add the two JAR packages of the crawler framework NetDiscovery to the project. The current version is 0.0.9.3. The version number is still low, but it iterates quickly, and I believe it is a framework with real momentum.
```groovy
group 'com.sinkinka'
version '1.0-SNAPSHOT'

apply plugin: 'java'

sourceCompatibility = 1.8

repositories {
    maven { url 'http://maven.aliyun.com/nexus/content/groups/public/' }
    mavenCentral()
    jcenter()
}

dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.12'
    implementation 'com.cv4j.netdiscovery:netdiscovery-core:0.0.9.3'
    implementation 'com.cv4j.netdiscovery:netdiscovery-extra:0.0.9.3'
}
```
If the dependencies fail to download, add the Aliyun mirror address, as in the maven { ... } entry of the repositories block above: http://maven.aliyun.com/nexus/content/groups/public/
6) Code implementation
See the example code under the example module in the framework, as well as another example project: github.com/fengzhizi71…
- Create an entry class with a main method and start the crawler in main
- To parse the target web page, create a class that implements the Parser interface
- There are many ways to extract target content from an HTML string, such as XPath, Jsoup, and regular expressions; see the Jsoup sketch after this list
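As an illustration of the Jsoup option mentioned above, here is a standalone sketch that extracts the same two fields with CSS selectors instead of XPath. It assumes org.jsoup:jsoup is on the classpath, and the sample HTML is a made-up fragment mirroring the td classes used later in TestParser:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Extract company and position names from an HTML string using Jsoup CSS selectors.
public class JsoupExtractDemo {
    public static void main(String[] args) {
        String html = "<table><tbody><tr>"
                + "<td class='td_companyName'>Some Company</td>"
                + "<td class='td_positionName'><a>Data Engineer</a></td>"
                + "</tr></tbody></table>";

        Document doc = Jsoup.parse(html);
        for (Element tr : doc.select("tbody tr")) {
            String companyName = tr.select("td.td_companyName").text();
            String positionName = tr.select("td.td_positionName a").text();
            System.out.println(companyName + " --- " + positionName);
        }
    }
}
```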
In the entry class's main method, write the following code:
```java
package com.sinkinka;

import com.cv4j.netdiscovery.core.Spider;
import com.sinkinka.parser.TestParser;

public class TestSpider {

    public static void main(String[] args) {
        String url = "http://www.szrc.cn/HrMarket/WLZP/ZP/0/%E6%95%B0%E6%8D%AE";

        // Using NetDiscovery, we only need to write a parser class to implement basic crawler functions
        Spider.create()
                .name("spider-1")
                .url(url)
                .parser(new TestParser())   // parser class
                .run();
    }
}
```
TestParser class code:
```java
package com.sinkinka.parser;

import com.cv4j.netdiscovery.core.domain.Page;
import com.cv4j.netdiscovery.core.parser.Parser;
import com.cv4j.netdiscovery.core.parser.selector.Selectable;

import java.util.List;

public class TestParser implements Parser {

    @Override
    public void process(Page page) {
        // XPath locating each row (tr) of the job listing table
        String xpathStr = "//*[@id=\"grid\"]/div/div[1]/table/tbody/tr";
        List<Selectable> trList = page.getHtml().xpath(xpathStr).nodes();

        for (Selectable tr : trList) {
            String companyName = tr.xpath("//td[@class='td_companyName']/text()").get();
            String positionName = tr.xpath("//td[@class='td_positionName']/a/text()").get();

            if (null != companyName && null != positionName) {
                System.out.println(companyName + "------" + positionName);
            }
        }
    }
}
```
Running results:
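Printing to the console is enough for a demo, but in practice you would usually collect the parsed rows for later storage. The sketch below is one possible variation: it assumes the same NetDiscovery 0.0.9.3 Parser interface as TestParser above and reuses the illustrative JobInfo class from section 3, which is not a framework type:

```java
package com.sinkinka.parser;

import com.cv4j.netdiscovery.core.domain.Page;
import com.cv4j.netdiscovery.core.parser.Parser;
import com.cv4j.netdiscovery.core.parser.selector.Selectable;
import com.sinkinka.JobInfo;

import java.util.ArrayList;
import java.util.List;

// Variant of TestParser that collects rows instead of printing them.
public class CollectingParser implements Parser {

    // Parsed rows, ready to hand off to storage (database, CSV file, ...)
    private final List<JobInfo> jobs = new ArrayList<>();

    @Override
    public void process(Page page) {
        String xpathStr = "//*[@id=\"grid\"]/div/div[1]/table/tbody/tr";
        for (Selectable tr : page.getHtml().xpath(xpathStr).nodes()) {
            String companyName = tr.xpath("//td[@class='td_companyName']/text()").get();
            String positionName = tr.xpath("//td[@class='td_positionName']/a/text()").get();
            if (companyName != null && positionName != null) {
                jobs.add(new JobInfo(companyName, positionName));
            }
        }
    }

    public List<JobInfo> getJobs() {
        return jobs;
    }
}
```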
7) Summary
This article uses a crawler framework to demonstrate, as simply as possible, one way to capture web page information. More practical content will be published later for your reference.
Related reading:
- The basic schematic of the NetDiscovery crawler framework
- Java Web Crawler (2)