Java Web crawler (2)
This article introduces some practical uses of the pipeline pattern in the NetDiscovery framework.
1) What is a pipeline
The pipeline is a common pattern for recurring, time-consuming work. If each task has to finish completely before the next one can start, a lot of time is wasted waiting.
So break the time-consuming task into stages: as soon as one stage finishes a block of work, the next stage can start processing it, without waiting for the whole task to end.
This makes the pattern a good fit for processing the data a crawler collects.
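To make the idea concrete, here is a minimal sketch in plain Java, independent of NetDiscovery (all names here are made up for illustration): three stages are composed into one pipeline, and each piece of input flows through all stages in order, so the first result is fully processed without waiting for the rest.

import java.util.Arrays;
import java.util.function.Function;

public class PipelineIdea {

    public static void main(String[] args) {
        // Three stages; each handles one small block of work and passes it on.
        Function<String, String> fetch = url -> "<html>" + url + "</html>";      // stage 1: download
        Function<String, String> parse = html -> html.replaceAll("<[^>]*>", ""); // stage 2: extract text
        Function<String, String> store = text -> {                               // stage 3: persist
            System.out.println("stored: " + text);
            return text;
        };

        // Compose the stages; each URL flows through fetch -> parse -> store in order.
        Function<String, String> pipeline = fetch.andThen(parse).andThen(store);
        Arrays.asList("a.example", "b.example").forEach(pipeline::apply);
    }
}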
2) The role of pipeline in the framework
From the schematic provided by the framework, you can see what role the Pipeline object plays:
- A Downloader object fetches the URL
- The successfully fetched Page object is handed to a Parser object for page parsing
- The parsed results are handed to the Pipeline objects, which process them in sequence: deduplication, encapsulation, storage, message sending, and so on (a minimal sketch of the Pipeline contract follows this list)
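Judging from the implementations shown in the sections below, a Pipeline only has to fulfil a single method; roughly the following (a sketch inferred from the code in this article; the actual interface ships with NetDiscovery as com.cv4j.netdiscovery.core.pipeline.Pipeline):

// Sketch of the Pipeline contract, inferred from the implementations below.
public interface Pipeline {
    // Called once per page, after the Parser has filled the ResultItems.
    void process(ResultItems resultItems);
}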
3) Target tasks
Task steps:
- Visit the Lagou website
- Find the image links on the page
- Download the images from those links to local disk
- Save the image information to a MySQL database
4) Create pipeline objects
Pipeline class: DownloadImage
package com.sinkinka.pipeline;
import com.cv4j.netdiscovery.core.domain.ResultItems;
import com.cv4j.netdiscovery.core.pipeline.Pipeline;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.Map;
public class DownloadImage implements Pipeline {

    @Override
    public void process(ResultItems resultItems) {
        Map<String, Object> map = resultItems.getAll();
        for (String key : map.keySet()) {
            // The key is the company short name; the value is the image URL.
            String filePath = "./temp/" + key + ".png";
            saveRemoteImage(map.get(key).toString(), filePath);
        }
    }

    private boolean saveRemoteImage(String imgUrl, String filePath) {
        InputStream in = null;
        OutputStream out = null;
        try {
            URL url = new URL(imgUrl);
            URLConnection connection = url.openConnection();
            connection.setConnectTimeout(5000);
            in = connection.getInputStream();

            // Copy the remote image to the local file in 1 KB blocks.
            byte[] bs = new byte[1024];
            int len;
            out = new FileOutputStream(filePath);
            while ((len = in.read(bs)) != -1) {
                out.write(bs, 0, len);
            }
        } catch (Exception e) {
            return false;
        } finally {
            try {
                if (out != null) {
                    out.flush();
                    out.close();
                }
                if (in != null) {
                    in.close();
                }
            } catch (IOException e) {
                return false;
            }
        }
        return true;
    }
}
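As an aside, on Java 7+ the same download can be written with try-with-resources, which closes the stream automatically. A sketch under that assumption (saveRemoteImage2 is a hypothetical name; it needs java.nio.file.Files, Paths, and StandardCopyOption imported):

// Sketch: alternative to saveRemoteImage using try-with-resources (Java 7+).
// Files.copy drains the stream to the target path; the stream closes automatically.
private boolean saveRemoteImage2(String imgUrl, String filePath) {
    try {
        URLConnection connection = new URL(imgUrl).openConnection();
        connection.setConnectTimeout(5000);
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, Paths.get(filePath), StandardCopyOption.REPLACE_EXISTING);
        }
        return true;
    } catch (Exception e) {
        return false;
    }
}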
Pipeline class: SaveImage
package com.sinkinka.pipeline;
import com.cv4j.netdiscovery.core.domain.ResultItems;
import com.cv4j.netdiscovery.core.pipeline.Pipeline;
import com.safframework.tony.common.utils.Preconditions;
import java.sql.*;
import java.util.Map;
public class SaveImage implements Pipeline {

    @Override
    public void process(ResultItems resultItems) {
        Map<String, Object> map = resultItems.getAll();
        for (String key : map.keySet()) {
            saveCompanyInfo(key, map.get(key).toString());
        }
    }

    private boolean saveCompanyInfo(String shortName, String logoUrl) {
        int insertCount = 0;
        Connection conn = getMySqlConnection();
        Statement statement = null;
        if (Preconditions.isNotBlank(conn)) {
            try {
                statement = conn.createStatement();
                String insertSQL = "INSERT INTO company(shortname, logourl) VALUES('" + shortName + "', '" + logoUrl + "')";
                insertCount = statement.executeUpdate(insertSQL);
            } catch (SQLException e) {
                return false;
            } finally {
                try {
                    if (statement != null) statement.close();
                } catch (SQLException e) {
                    // ignore close failure
                }
                try {
                    if (conn != null) conn.close();
                } catch (SQLException e) {
                    // ignore close failure
                }
            }
        }
        return insertCount > 0;
    }

    // Demo code, not recommended for a production environment.
    // Uses MySQL Connector/J 5; database: test, account/password: root/123456
    private Connection getMySqlConnection() {
        final String JDBC_DRIVER = "com.mysql.jdbc.Driver";
        final String DB_URL = "jdbc:mysql://localhost:3306/test";
        final String USER = "root";
        final String PASS = "123456";

        Connection conn = null;
        try {
            Class.forName(JDBC_DRIVER);
            conn = DriverManager.getConnection(DB_URL, USER, PASS);
        } catch (Exception e) {
            return null;
        }
        return conn;
    }
}
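One more note on the demo code: concatenating shortName and logoUrl directly into the SQL string is vulnerable to SQL injection. A sketch of the same insert using a standard JDBC PreparedStatement instead (saveCompanyInfoSafely is a hypothetical name; the table and getMySqlConnection() helper are the ones above):

// Sketch: the same insert with bound parameters instead of string concatenation.
private boolean saveCompanyInfoSafely(String shortName, String logoUrl) {
    Connection conn = getMySqlConnection();
    if (conn == null) return false;
    String insertSQL = "INSERT INTO company(shortname, logourl) VALUES(?, ?)";
    try (Connection c = conn;
         PreparedStatement ps = c.prepareStatement(insertSQL)) {
        ps.setString(1, shortName);
        ps.setString(2, logoUrl);
        return ps.executeUpdate() > 0; // true if one row was inserted
    } catch (SQLException e) {
        return false;
    }
}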
5) Run the program
The Main class
package com.sinkinka;
import com.cv4j.netdiscovery.core.Spider;
import com.sinkinka.parser.LagouParser;
import com.sinkinka.pipeline.DownloadImage;
import com.sinkinka.pipeline.SaveImage;
public class PipelineSpider {

    public static void main(String[] args) {
        String url = "https://xiaoyuan.lagou.com/";

        Spider.create()
              .name("lagou")
              .url(url)
              .parser(new LagouParser())
              .pipeline(new DownloadImage()) // 1. first download the images to local disk
              .pipeline(new SaveImage())     // 2. then store the image information in the database
              .run();
    }
}
The Parser class
package com.sinkinka.parser;
import com.cv4j.netdiscovery.core.domain.Page;
import com.cv4j.netdiscovery.core.domain.ResultItems;
import com.cv4j.netdiscovery.core.parser.Parser;
import com.cv4j.netdiscovery.core.parser.selector.Selectable;
import java.util.List;
public class LagouParser implements Parser {

    @Override
    public void process(Page page) {
        ResultItems resultItems = page.getResultItems();
        List<Selectable> liList = page.getHtml().xpath("//li[@class='nav-logo']").nodes();
        for (Selectable li : liList) {
            String logoUrl = li.xpath("//img/@src").get();
            String companyShortName = li.xpath("//div[@class='company-short-name']/text()").get();
            // Key: company short name; value: logo URL. The pipelines consume these pairs.
            resultItems.put(companyShortName, logoUrl);
        }
    }
}
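Note how the pieces fit together: LagouParser stores each logo URL in ResultItems keyed by the company short name; DownloadImage then uses that key as the local file name, and SaveImage writes the same key/value pair into the shortname and logourl columns.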
Image files saved locally by DownloadImage:
Company data stored in the database by SaveImage:
6) Summary
The code above briefly demonstrates the use of the pipeline pattern. Remember that pipelines execute in the order in which they are added to the Spider. In a high-volume, high-frequency production environment, the benefits of the pipeline pattern become even more apparent.