1. The background

This article describes how to write a simple crawler in Java, using Jsoup to fetch HTML and extract data from it.

2. Knowledge

A web crawler (also known as a web spider or web robot) is a program or script that automatically collects information from the World Wide Web according to certain rules.

Put simply, it is a script that fetches information from the network and then parses it.

Main steps:

1) Send a request to get the HTML text
2) Parse the HTML text and retrieve the desired data from specific HTML tags

Breaking the process down:

1) Java sends the network request
2) The Jsoup library parses the response and locates the desired content

Jsoup is a Java library for working with HTML. It provides a very convenient API for fetching URLs and for extracting and manipulating data, using HTML5 DOM methods and CSS selectors.

Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

Main abilities:

Fetch and parse HTML from URLs, files, or strings
Find and extract data using DOM traversal or CSS selectors
Manipulate HTML elements, attributes, and text
Clean user-submitted content against a safelist (whitelist) to prevent XSS attacks
Output clean HTML
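
The abilities listed above can be sketched in a few lines. This is a minimal illustration, not code from the original project: the HTML string, the class name JsoupAbilities, and its method names are made up for the example, and Safelist assumes jsoup 1.14.2 or later (older releases call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupAbilities {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
            "<html><body><h1 class=\"title\">Hello</h1>"
            + "<a href=\"/docs\">Docs</a></body></html>";

    /** Parses HTML from a string and extracts text with a CSS selector. */
    static String titleText() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("h1.title").first().text();
    }

    /** Manipulates an element: changes an attribute and reads it back. */
    static String rewrittenHref() {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.select("a").first();
        link.attr("href", "/guide");
        return link.attr("href");
    }

    /** Cleans untrusted HTML against a safelist, which does not allow script tags. */
    static String cleaned(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(titleText());
        System.out.println(rewrittenHref());
        System.out.println(cleaned("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Safelist.basic() permits only simple text formatting tags, so it is a reasonable default for untrusted input; a more permissive safelist should be chosen deliberately.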

Official website: jsoup.org/

3. Example

Here is a hands-on example: obtaining information about a fund from a fund website.

1) Send a request to get the HTML text

The following code demonstrates making an HTTP request to get the HTML text.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class HttpClient {
        /** Fetches the page at urlStr and returns its body, or null if the request fails. */
        public static String readHtml(String urlStr) {
            HttpURLConnection conn = null;
            try {
                URL url = new URL(urlStr);
                conn = (HttpURLConnection) url.openConnection();
                // A GET request has no body, so setDoOutput(true) is not needed
                conn.setRequestMethod("GET");
                // try-with-resources closes the reader and the underlying stream
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                    StringBuilder sb = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Keep the line breaks so text on adjacent lines is not glued together
                        sb.append(line).append('\n');
                    }
                    return sb.toString();
                }
            } catch (IOException ex) {
                ex.printStackTrace();
            } finally {
                if (conn != null) conn.disconnect();
            }
            return null;
        }
    }
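
Jsoup can also perform the request itself, which folds the fetch and parse steps into one call. The following is a sketch of that alternative rather than the approach used in this article; the class name JsoupFetch, the user-agent string, and the 10-second timeout are arbitrary choices for illustration.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetch {
    /** Fetches a URL and returns the parsed document, or null if the request fails. */
    public static Document fetch(String url) {
        try {
            return Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (example crawler)") // arbitrary value
                    .timeout(10_000)                            // milliseconds
                    .get();
        } catch (IOException ex) {
            ex.printStackTrace();
            return null;
        }
    }

    public static void main(String[] args) {
        Document doc = fetch("https://jsoup.org/");
        if (doc != null) {
            System.out.println(doc.title());
        }
    }
}
```

Doing the request by hand, as in the HttpClient class, keeps full control over the connection; Jsoup.connect is shorter when the defaults are acceptable.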

2) Parse the HTML text to get the desired data from specific HTML tags

Pass the HTML text to Jsoup.parse(html) to get a Document object.
Then use doc.select("h1.fund_name").first().text() to locate the target element and read its text.

"h1.fund_name" selects the h1 element whose class is fund_name.
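
To make the selector syntax concrete, here is a small sketch run against an inline HTML fragment. The markup is only a guess at the general shape of such a page, written for this example, and the class SelectorDemo is hypothetical. It uses selectFirst, which returns null when nothing matches, so a missing element can be handled instead of causing a NullPointerException.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorDemo {
    // Made-up fragment; the real page's markup may differ.
    static final String HTML =
            "<h1 class=\"fund_name\">Sample Fund</h1>"
            + "<div id=\"fund_info_blk2\">"
            + "<span class=\"fund_data\">1.2340</span>"
            + "<span class=\"fund_data\">+0.52%</span>"
            + "</div>";

    /** Returns the text of the first match, or a fallback when nothing matches. */
    static String textOrDefault(Document doc, String cssQuery, String fallback) {
        Element el = doc.selectFirst(cssQuery);
        return el != null ? el.text() : fallback;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // tag.class: an h1 element with class "fund_name"
        System.out.println(textOrDefault(doc, "h1.fund_name", "n/a"));
        // #id followed by a descendant selector: span.fund_data inside #fund_info_blk2
        Elements data = doc.select("#fund_info_blk2 span.fund_data");
        System.out.println(data.get(0).text());
        System.out.println(data.get(1).text());
        // No match: the fallback is returned instead of throwing
        System.out.println(textOrDefault(doc, "span.fund_code", "n/a"));
    }
}
```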

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    /**
     * Gets the latest fund information.
     * Selector expression reference: https://www.cnblogs.com/zhangyinhua/p/8037599.html
     * Official site reference: https://jsoup.org/
     *
     * @return the fund information read from the page
     */
    public static FundInfo getInfo() {
        String urlStr = "http://finance.sina.com.cn/fund/quotes/001643/bc.shtml";
        String html = HttpClient.readHtml(urlStr);
        Document doc = Jsoup.parse(html);

        String name = doc.select("h1.fund_name").first().text();
        String funCode = doc.select("span.fund_code").first().text();
        // Keep only the text inside the parentheses
        // (RegularExpression is a helper class that is not shown here)
        funCode = RegularExpression.findByFirst(funCode, "\\((.*?)\\)");

        // The span.fund_data elements inside #fund_info_blk2 are read by position
        Element ele3 = doc.select("#fund_info_blk2").first();
        String worth = ele3.select("span.fund_data").get(0).text(); // unit net value
        String upAndDown = ele3.select("span.fund_data").get(1).text();
        String up_3month = ele3.select("span.fund_data").get(2).text();
        String up_1year = ele3.select("span.fund_data").get(3).text();
        String up_3year = ele3.select("span.fund_data").get(4).text();
        String dataDate = doc.select("div.fund_data_date").first().text();

        FundInfo f = new FundInfo();
        f.name = name;
        f.fundCode = funCode;
        f.worth = Float.parseFloat(worth);
        f.upAndDown = upAndDown;
        f.up_3month = up_3month;
        f.up_1year = up_1year;
        f.up_3year = up_3year;
        f.dataDate = dataDate;
        return f;
    }
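
The RegularExpression.findByFirst helper called in the code above is not shown in this article. A helper like it could be written roughly as follows. This is a guess at its behavior (return the first capture group of the first match, or null when nothing matches), not the author's actual implementation, and the sample input string is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    /**
     * Returns capture group 1 of the first match of regex in input,
     * or null when there is no match. Assumed behavior, for illustration.
     * The regex must contain at least one capture group.
     */
    public static String findByFirst(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Extract the text between the first pair of parentheses.
        System.out.println(findByFirst("(001643)", "\\((.*?)\\)")); // prints 001643
    }
}
```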

4. Extension

My example code can be found at: github.com/vir56k/java…

5. Reference:

jsoup.org/
