1. Background
This article describes how to write a simple web crawler in Java: fetch a page over HTTP, then use jsoup to parse the HTML and extract data from it.
2. Background knowledge
A web crawler (also known as a web spider or web robot) is a program or script that automatically captures information on the World Wide Web according to certain rules.
Put simply, it is a script that fetches information from the network and parses it.
Main steps:
- Send a request to get the HTML text
- Parse the HTML text and retrieve the desired data from specific HTML tags
In this article, those steps break down as follows:
- Java sends network requests
- Use the jsoup library to parse and locate the desired content
jsoup is a Java library for working with HTML. It provides a very convenient API for fetching URLs and for extracting and manipulating data, using HTML5 DOM methods and CSS selectors.
jsoup implements the WHATWG HTML5 specification and parses HTML into the same DOM that modern browsers produce.
Main abilities:
- Fetch and parse HTML from URLs, files, or strings
- Find and extract data using DOM traversal or CSS selectors
- Manipulate HTML elements, attributes, and text
- Clean user-submitted content against a safelist (called a whitelist in older jsoup versions) to prevent XSS attacks
- Output clean HTML
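To make a few of these abilities concrete, here is a minimal sketch that parses a string, finds an element with a CSS selector, changes an attribute, and cleans untrusted markup. It assumes jsoup 1.14.2 or later is on the classpath (the `Safelist` class was named `Whitelist` in earlier versions). The class name, method names, and sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

class JsoupAbilitiesSketch {
    // Parses the HTML, finds the first link via a CSS selector,
    // and adds rel="nofollow" to it. Returns the resulting body HTML.
    static String markFirstLinkNofollow(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("a[href]"); // null when nothing matches
        if (link != null) {
            link.attr("rel", "nofollow");
        }
        return doc.body().html();
    }

    // Removes everything that jsoup's basic safelist does not allow,
    // for example <script> elements.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Illustrative markup only.
        System.out.println(markFirstLinkNofollow(
                "<p><a href=\"https://jsoup.org/\">jsoup</a></p>"));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Which tags survive cleaning depends on the safelist you choose; `Safelist.basic()` is only one of the presets.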
Official website: jsoup.org/
3. Example
Let's work through a hands-on example: obtaining information about a fund from a fund website.
1) Send a request to get the HTML text
The following code demonstrates making an HTTP request to get the HTML text.
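    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class HttpClient {
        // Fetches urlStr with an HTTP GET and returns the response body as a string,
        // or null if the request fails.
        public static String readHtml(String urlStr) {
            HttpURLConnection conn = null;
            InputStream inputStream = null;
            try {
                URL url = new URL(urlStr);
                conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("GET");
                // setDoOutput(true) is only needed when sending a request body, so it is omitted for GET.
                inputStream = conn.getInputStream();
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(inputStream, StandardCharsets.UTF_8));
                StringBuilder sb = new StringBuilder();
                String line = null;
                // Line breaks are dropped here, which does not matter for HTML parsing.
                while ((line = reader.readLine()) != null) {
                    sb.append(line);
                }
                return sb.toString();
            } catch (IOException ex) {
                ex.printStackTrace();
            } finally {
                // Close the stream first, then release the connection.
                if (inputStream != null) {
                    try {
                        inputStream.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                if (conn != null) conn.disconnect();
            }
            return null;
        }
    }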
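The reading loop in readHtml can also be written with try-with-resources, which closes the reader automatically. The sketch below is one possible variant and not part of the original example. It takes any InputStream so that it can be tried without a network connection, and it keeps a line break after each line. The class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamText {
    // Reads the whole stream as UTF-8 text, appending '\n' after every line.
    // The reader (and the wrapped stream) is closed when the try block exits.
    static String readAll(InputStream in) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        } catch (IOException e) {
            // Rethrow unchecked so callers are not forced to handle IOException.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // An in-memory stream stands in for conn.getInputStream().
        byte[] bytes = "<h1>Title</h1>\n<p>Body</p>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(bytes)));
    }
}
```

Throwing an exception instead of returning null makes a failed request harder to overlook, at the cost of requiring the caller to decide how to handle it.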
2) Parse the HTML text to get the desired data from specific HTML tags
- Pass the HTML text to Jsoup.parse(html) to get a Document object.
- Then call doc.select("h1.fund_name").first().text() to locate the target element and read its text.
"h1.fund_name" selects the h1 element whose class is fund_name.
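One thing to watch with this pattern: when no element matches the selector, first() returns null, so chaining .text() onto it throws a NullPointerException. The following sketch shows a null-safe lookup on a small inline HTML string. It assumes jsoup 1.11.1 or later for selectFirst. The helper name and the sample markup are hypothetical and not taken from the fund page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {
    // Returns the text of the first element matching cssQuery,
    // or the fallback when nothing matches.
    static String textOrDefault(Document doc, String cssQuery, String fallback) {
        Element el = doc.selectFirst(cssQuery); // null when there is no match
        return el != null ? el.text() : fallback;
    }

    public static void main(String[] args) {
        // Hypothetical markup for illustration only.
        String html = "<div><h1 class=\"fund_name\">Demo Fund</h1>"
                + "<span class=\"fund_code\">(000000)</span></div>";
        Document doc = Jsoup.parse(html);
        System.out.println(textOrDefault(doc, "h1.fund_name", "n/a")); // prints Demo Fund
        System.out.println(textOrDefault(doc, "h2.missing", "n/a"));   // prints n/a
    }
}
```

A fallback like this keeps the crawler running when the site changes its markup, although logging the miss is usually worthwhile too.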
    // Imports needed by the enclosing class:
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    /**
     * Gets the latest fund information.
     * Selector expression reference: https://www.cnblogs.com/zhangyinhua/p/8037599.html
     * Official website reference: https://jsoup.org/
     *
     * @return the fund information parsed from the page
     */
    public static FundInfo getInfo() {
        String urlStr = "http://finance.sina.com.cn/fund/quotes/001643/bc.shtml";
        String html = HttpClient.readHtml(urlStr);
        Document doc = Jsoup.parse(html);
        String name = doc.select("h1.fund_name").first().text();
        String funCode = doc.select("span.fund_code").first().text();
        // Keep only the text between the parentheses.
        funCode = RegularExpression.findByFirst(funCode, "\\((.*?)\\)");
        Element ele3 = doc.select("#fund_info_blk2").first();
        String worth = ele3.select("span.fund_data").get(0).text(); // unit net value
        String upAndDown = ele3.select("span.fund_data").get(1).text();
        String up_3month = ele3.select("span.fund_data").get(2).text();
        String up_1year = ele3.select("span.fund_data").get(3).text();
        String up_3year = ele3.select("span.fund_data").get(4).text();
        String dataDate = doc.select("div.fund_data_date").first().text();

        // Copy the extracted values into the result object.
        FundInfo f = new FundInfo();
        f.name = name;
        f.fundCode = funCode;
        f.worth = Float.parseFloat(worth);
        f.upAndDown = upAndDown;
        f.up_3month = up_3month;
        f.up_1year = up_1year;
        f.up_3year = up_3year;
        f.dataDate = dataDate;
        return f;
    }
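The RegularExpression.findByFirst helper is called above but not shown in this article. The sketch below is a hypothetical stand-in, not the author's implementation. It assumes the helper returns the first capture group of the first match, and that the pattern is meant to capture the text between a pair of parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSketch {
    // Hypothetical stand-in for RegularExpression.findByFirst:
    // returns capture group 1 of the first match, or null when nothing matches.
    // The regex must contain at least one capture group.
    static String findByFirst(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // \( and \) match literal parentheses; (.*?) lazily captures what is between them.
        System.out.println(findByFirst("(001643)", "\\((.*?)\\)")); // prints 001643
    }
}
```

If the page wraps the code in a different kind of bracket, the pattern would need to be adjusted to match.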
4. Extension
The example code for this article can be found at: github.com/vir56k/java…
5. References
jsoup.org/
END