Many people think that only Python can be used to write crawlers. In fact, C++ and Java work just as well, because the principle of a crawler is simple: analyze the HTTP(S) requests a site uses, then simulate the browser in code to send those requests. I chose Apache's HttpClient as the framework for sending network requests, since assembling HTTP requests by hand is a lot of work. Once you have the web page, you need to extract its key content. This is where Jsoup comes in: node selectors plus expressions make it very convenient to get the data you want.
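As a rough illustration of "simulating the browser", the sketch below builds a GET request that carries browser-like headers. It deliberately uses the JDK 11+ built-in `java.net.http` client rather than a third-party library so that it runs without dependencies; the class name, the User-Agent string, and the 5-second timeout are assumptions made for the example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class BrowserLikeRequest {
    // Hypothetical User-Agent value; any realistic browser string would do.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)";

    /** Builds a GET request with the headers a browser would normally send. */
    static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html")
                .GET()
                .build();
    }

    /** Sends the request and returns the response body as text (needs network access). */
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(build(url), HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://zouchanglin.cn");
        // Only prints the request line; call fetch(...) to actually download the page.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The returned HTML string is what would then be handed to a parser.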
HttpClient
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
Jsoup
Once we have fetched a page, we need to parse it. You could use string-processing tools or regular expressions, but both are costly to develop and maintain, so we need a dedicated technique for parsing HTML pages.
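To make the cost of the regular-expression approach concrete, here is a minimal sketch that extracts only the page title. The class and method names are made up for the example, and the pattern is an assumption that ignores comments, entities, and malformed markup, which is exactly the kind of detail a real parser handles for you.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTitle {
    // Matches <title ...>text</title>, case-insensitively and across line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed title text, or an empty Optional when no title tag is found. */
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Demo page </title></head><body></body></html>";
        System.out.println(extractTitle(html).orElse("(no title)"));
    }
}
```

Every additional field would need its own hand-written pattern, which is why a proper HTML parser is usually the better choice.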
Introduction to Jsoup
Jsoup is a Java HTML parser that can parse a URL or HTML text directly. It provides a very convenient, labor-saving API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
Key functions of Jsoup
- Parsing HTML from a URL, file, or string;
- Using DOM traversal or CSS selectors to find and retrieve data;
- Manipulating HTML elements, attributes, and text.
Jsoup in practice
<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.2</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.25</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.7</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
</dependencies>
// Parse a page fetched directly from a URL (5-second timeout)
@Test
public void testUrl() throws Exception {
    Document document = Jsoup.parse(new URL("http://zouchanglin.cn"), 5000);
    String title = document.getElementsByTag("title").first().text();
    System.out.println(title);
}

// Parse HTML that has already been read into a String
@Test
public void testString() throws Exception {
    String html = FileUtils.readFileToString(new File("C:\\Users\\15291\\Desktop\\index.html"), "UTF-8");
    Document document = Jsoup.parse(html);
    String title = document.getElementsByTag("title").first().text();
    System.out.println(title);
}

// Parse an HTML file from disk
@Test
public void testFile() throws Exception {
    Document document = Jsoup.parse(new File("C:\\Users\\15291\\Desktop\\index.html"), "UTF-8");
    String title = document.getElementsByTag("title").first().text();
    System.out.println(title);
}
Although Jsoup can take the place of HttpClient and fetch a page itself before parsing it, it is rarely used that way, because real-world development needs multithreading, connection pooling, proxies, and so on, and Jsoup does not support these well. So we generally use Jsoup only as an HTML parsing tool.
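The division of labor described here can be sketched with a plain thread pool: downloading runs concurrently through a pluggable fetcher, and parsing happens afterwards on the returned HTML. This is only an illustration using the JDK's `ExecutorService`; the class name, the `fetchAll` method, and the fake fetcher are assumptions for the example, and a real crawler would plug an HTTP client with a connection pool into the `fetcher` parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ConcurrentCrawl {
    /** Fetches every URL on a fixed-size thread pool and returns the bodies in input order. */
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            List<String> bodies = new ArrayList<>();
            for (Future<String> future : futures) {
                bodies.add(future.get()); // waits until that download has finished
            }
            return bodies;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in fetcher so the sketch runs offline; it just echoes the URL as a title.
        Function<String, String> fake = url -> "<title>" + url + "</title>";
        System.out.println(fetchAll(List.of("a", "b", "c"), fake, 2));
    }
}
```

Each returned body could then be passed to `Jsoup.parse(html)`, which keeps the parser independent of how the page was downloaded.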
Traversing the document with the DOM
Finding elements:
- by id: getElementById
- by tag: getElementsByTag
- by class: getElementsByClass
- by attribute: getElementsByAttribute
Getting data from an element:
- the id: id()
- the class name: className()
- the value of one attribute: attr(name)
- all attributes: attributes()
- the text content: text()
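A small self-contained sketch of these calls, run against an inline HTML string instead of a file. The HTML fragment and the class name are invented for illustration; the Jsoup methods used are the ones listed above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DomTraversalDemo {
    // Hypothetical page fragment, used only for this example.
    static final String HTML =
            "<div id=\"city_bj\" class=\"city\" data-code=\"010\"><a href=\"/bj\">Beijing</a></div>";

    /** Parses the sample fragment into a Jsoup document. */
    static Document page() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = page();

        // Finding elements
        Element city = document.getElementById("city_bj");
        Element link = document.getElementsByTag("a").first();
        System.out.println(document.getElementsByClass("city").size());
        System.out.println(document.getElementsByAttribute("data-code").size());

        // Getting data from an element
        System.out.println(city.id());
        System.out.println(city.className());
        System.out.println(city.attr("data-code"));
        System.out.println(city.attributes().size());
        System.out.println(city.text());
        System.out.println(link.attr("href"));
    }
}
```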
Selectors
- tagname: finds elements by tag, for example: span
- #id: finds elements by ID, for example: #city_bj
- .class: finds elements by class name, for example: .class_a
- [attribute]: finds elements that have an attribute, for example: [abc]
- [attr=value]: finds elements by attribute value, for example: [class=s_name]
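Combining selectors
- el#id: element + ID, for example: h3#city_bj
- el.class: element + class, for example: li.class_a
- el[attr]: element + attribute name, for example: span[abc]
- any combination, for example: span[abc].s_name
- ancestor child: finds descendants of an element, for example: .city_con li finds every li under city_con
- parent > child: finds direct children of a parent, for example: .city_con > ul > li finds the ul elements directly under city_con and then the li elements directly under those ul elements
- parent > *: finds all direct children of a parent element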
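The difference between "ancestor child" and "parent > child" is easiest to see on a tiny document. The sketch below runs several of the selectors above against an inline HTML string; the fragment and the class name are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorDemo {
    // Hypothetical fragment: a city_con block with a list and a span, plus a heading.
    static final String HTML =
            "<div class=\"city_con\">"
            + "<ul><li class=\"class_a\">Beijing</li><li>Shanghai</li></ul>"
            + "<p><span abc=\"1\" class=\"s_name\">Xi'an</span></p>"
            + "</div>"
            + "<h3 id=\"city_bj\">Capital</h3>";

    /** Returns how many elements of the sample document match the CSS query. */
    static int count(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "h3#city_bj", "li.class_a", "span[abc].s_name",
            ".city_con li",        // any li below city_con
            ".city_con > ul > li", // li directly under a ul directly under city_con
            ".city_con > li",      // no li is a direct child of city_con
            ".city_con > *"        // the ul and the p
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```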
package jsoup;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsoupFirstTest {

    @Test
    public void testUrl() throws Exception {
        Document document = Jsoup.parse(new URL("http://zouchanglin.cn"), 5000);
        String title = document.getElementsByTag("title").first().text();
        System.out.println(title);
    }

    @Test
    public void testString() throws Exception {
        String html = FileUtils.readFileToString(new File("C:\\Users\\15291\\Desktop\\index.html"), "UTF-8");
        Document document = Jsoup.parse(html);
        String title = document.getElementsByTag("title").first().text();
        System.out.println(title);
    }

    @Test
    public void testFile() throws Exception {
        Document document = Jsoup.parse(new File("C:\\Users\\15291\\Desktop\\index.html"), "UTF-8");
        String title = document.getElementsByTag("title").first().text();
        System.out.println(title);
    }

    @Test
    public void testDom() throws Exception {
        /**
         * 1. Find an element by ID: getElementById
         * 2. Find elements by tag: getElementsByTag
         * 3. Find elements by class: getElementsByClass
         * 4. Find elements by attribute: getElementsByAttribute
         */
        Document document = Jsoup.parse(new File("C:\\Users\\15291\\Desktop\\index.html"), "UTF-8");
        Element element = document.getElementById("threeSpan");
        System.out.println("threeSpan content: " + element);
        System.out.println(element.getElementsByTag("a").first().attr("href"));

        Elements spans = document.getElementsByTag("span");
        for (Element el : spans) {
            System.out.println(el);
        }

        System.out.println(document.getElementsByAttributeValue("type", "button").first());
        System.out.println(document.getElementsByAttributeValue("type", "button").first().attr("value"));

        /**
         * 1. Get the id from an element: id
         * 2. Get the class name from an element: className
         * 3. Get the value of an attribute from an element: attr
         * 4. Get all attributes from an element: attributes
         * 5. Get the text content from an element: text
         */
        Element button = document.getElementsByAttributeValue("type", "button").first();
        Attributes attributes = button.attributes();
        System.out.println(attributes);
    }

    @Test
    public void testSelector() throws Exception {
        /**
         * `tagname`: finds elements by tag, e.g. `span`
         * `#id`: finds elements by ID, e.g. `#city_bj`
         * `.class`: finds elements by class name, e.g. `.class_a`
         * `[attribute]`: finds elements that have an attribute, e.g. `[abc]`
         * `[attr=value]`: finds elements by attribute value, e.g. `[class=s_name]`
         */
        Document document = Jsoup.parse(new File("C:\\Users\\15291\\Desktop\\index.html"), "UTF-8");
        Elements span = document.select("span");
        System.out.println(span);
        System.out.println("===============================");
        System.out.println(document.select("#threeSpan").first());
        System.out.println(document.select("#threeSpan").first().text());
    }

    @Test
    public void testSelectorTwo() throws Exception {
        /**
         * `el#id`: element + ID, e.g. `h3#city_bj`
         * `el.class`: element + class, e.g. `li.class_a`
         * `el[attr]`: element + attribute name, e.g. `span[abc]`
         * any combination, e.g. `span[abc].s_name`
         * `ancestor child`: finds descendants of an element, e.g. `.city_con li` finds every li under city_con
         * `parent > child`: finds direct children of a parent, e.g. `.city_con > ul > li`
         * `parent > *`: finds all direct children of a parent element
         */
        Document document = Jsoup.parse(new File("C:\\Users\\15291\\Desktop\\index.html"), "UTF-8");
        System.out.println(document.select("span#oneSpan").first());
        System.out.println(document.select("span[style]").first());
        System.out.println(document.select(".my_div div"));
        System.out.println("======================================");
        System.out.println(document.select(".my_div > div"));
        System.out.println("======================================");
        System.out.println(document.select(".my_div *"));
    }

    // Walk every cell of the timetable and process the ones that contain a course entry
    @Test
    public void course() throws IOException {
        Document document = Jsoup.parse(new File("C:\\Users\\15291\\Desktop\\new 10.html"), "UTF-8");
        Element tbody = document.select("table.blacktab > tbody").first();
        Elements tds = tbody.select("tr td");
        for (Element td : tds) {
            String tdContent = td.text();
            if (tdContent.contains("{")) {
                System.out.println(tdContent);
                handel(tdContent);
            }
        }
    }

    // Assumes split[1] looks like 周一第1,2节{第1-16周}: weekday, periods, then weeks.
    private void handel(String tdContent) {
        String[] split = tdContent.split(" ");
        System.out.print(split[0] + "\t");
        System.out.print(split[1].substring(0, 2) + "\t");
        System.out.print(split[1].substring(split[1].indexOf("第") + 1, split[1].indexOf("节")) + "\t");
        System.out.print(split[1].substring(split[1].indexOf("{") + 2, split[1].indexOf("}") - 1) + "\t");
        System.out.print(split[2] + "\t");
        System.out.print(split[3] + "\t");
        System.out.println("");
    }
}
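The index arithmetic in handel is easy to get wrong, so a regular expression is one possible alternative for splitting a timetable cell. The sketch below is not the author's code: the class name, the sample cell "高等数学 周一第1,2节{第1-16周} 张三 A101", and the assumed field layout (course, time, teacher, room) are all hypothetical and would need adjusting to the real page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimetableCell {
    // Assumed time field: weekday, then "第" + periods + "节", then "{第" + weeks + "周}".
    private static final Pattern TIME =
            Pattern.compile("(周.)第([\\d,]+)节\\{第([\\d-]+)周\\}");

    /** Returns {course, weekday, periods, weeks, teacher, room}, or null if the cell does not match. */
    static String[] parse(String cell) {
        String[] parts = cell.trim().split("\\s+");
        if (parts.length < 4) {
            return null;
        }
        Matcher m = TIME.matcher(parts[1]);
        if (!m.matches()) {
            return null;
        }
        return new String[] {parts[0], m.group(1), m.group(2), m.group(3), parts[2], parts[3]};
    }

    public static void main(String[] args) {
        // Hypothetical cell text, made up for the example.
        String[] fields = parse("高等数学 周一第1,2节{第1-16周} 张三 A101");
        System.out.println(String.join("\t", fields));
    }
}
```

Returning null for cells that do not match avoids the StringIndexOutOfBoundsException that substring would throw on unexpected input.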
A crawler for the Zhengfang educational administration system
1. Add the dependency
<dependencies>
    <dependency>
        <groupId>com.gitee.zouchanglin</groupId>
        <artifactId>spider_xpu</artifactId>
        <version>1.2</version>
    </dependency>
</dependencies>

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
2. Usage example
import cn.zouchanglin.spider_xpu.SpiderResult;
import cn.zouchanglin.spider_xpu.cache.ResultCache;
import cn.zouchanglin.spider_xpu.core.SpiderCore;

import javax.security.auth.login.LoginException;
import java.awt.*;
import java.net.URI;
import java.util.Scanner;

public class Main {
    public static void main(String[] args) throws Exception {
        // TODO fill in key, userId, password, etc.
        String key = "";
        String userId = "";
        String password = "";

        // 1. Get the verification code URL
        String url = SpiderCore.getCheckCodeUrl(key);

        // Open a browser to view the verification code, then type it in
        if (Desktop.isDesktopSupported()) {
            Desktop desktop = Desktop.getDesktop();
            if (desktop.isSupported(Desktop.Action.BROWSE)) {
                URI uri = new URI(url);
                desktop.browse(uri);
            }
        }
        Scanner scanner = new Scanner(System.in);
        String code = scanner.nextLine();

        // 2. Log in and start crawling
        long millis = System.currentTimeMillis();
        SpiderResult spiderResult = null;
        try {
            spiderResult = SpiderCore.go(userId, password, code, key);
        } catch (LoginException e) {
            // Failed to log in
            System.out.println(e.toString());
        }
        System.out.println("The synchronous call returns only user information + class list: " + spiderResult);
        System.out.println("Execution time: " + (System.currentTimeMillis() - millis));

        // Block (busy-wait) until the result object exists in the cache pool
        while (!ResultCache.SPIDER_RESULT_CACHE.containsKey(key));
        SpiderResult result = ResultCache.SPIDER_RESULT_CACHE.get(key);
        System.out.println(result);

        // TODO persistence
        System.out.println("Persistence complete....");

        // Remove the result from the cache
        ResultCache.SPIDER_RESULT_CACHE.remove(key);
    }
}
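The empty `while` loop above spins a CPU core until the result arrives and never gives up. One possible alternative is to poll with a short sleep and a timeout, as in the sketch below. It uses a plain `ConcurrentHashMap` as a stand-in because the actual type of `ResultCache.SPIDER_RESULT_CACHE` is not shown here; the class name, the `await` helper, the 20 ms interval, and the timeout values are assumptions for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CacheWait {
    /** Polls the map until the key appears, sleeping between checks; fails after timeoutMillis. */
    static <V> V await(Map<String, V> cache, String key, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            V value = cache.get(key);
            if (value != null) {
                return value;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("no result for key " + key);
            }
            Thread.sleep(20); // yield the CPU instead of spinning
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> cache = new ConcurrentHashMap<>(); // stand-in for the library's cache
        new Thread(() -> {
            try {
                Thread.sleep(100); // simulate the crawler finishing a little later
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            cache.put("k", "done");
        }).start();
        System.out.println(await(cache, "k", 2000));
    }
}
```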
It fetches the student's information and crawls all of the student's timetables, and the results are good!
- Author: Tim
- Link to this article: zouchanglin.cn/2020/08/19/…
- Copyright Notice: All articles on this blog are subject to a CC BY-SA 4.0 license unless otherwise stated. Reprint please indicate the source!