This is day 8 of my participation in the November Writing Challenge. Event details: The Last Writing Challenge of 2021

【 Tip of the Day 】

Edamame and soybeans are actually the same plant. Edamame is the bean harvested young; soybeans are the mature form.

One, foreword

Recently a reader asked me for help. He was putting together a data survey report, but the pages he tried to fetch kept returning 403, the page content would not load, and Vue pages had him ready to give up. What? You can't fix a little thing like that? Vue pages not supported? I immediately sent him this code, which handles both Vue pages and 403 pages.

Two, code analysis

The code before the change (cannot fetch Vue pages or pages that return 403)

	import java.io.InputStream;
	import java.net.URL;
	import java.net.URLConnection;

	public static void main(String[] args) {
		// A Vue-rendered page (NiuTrans) used as the test target
		String nowHtml = "https://www.niutrans.com";
		try {
			URL url = new URL(nowHtml);
			URLConnection openConnection = url.openConnection();
			// try-with-resources closes the stream even if reading fails
			try (InputStream inputStream = openConnection.getInputStream()) {
				byte[] b = new byte[1024];
				int len;
				while ((len = inputStream.read(b)) != -1) {
					System.out.println(new String(b, 0, len));
				}
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
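Why might the plain URLConnection version get a 403 at all? One common reason (an assumption here, I have not confirmed it for the sites in this article) is that the JDK sends a default User-Agent beginning with "Java/", and some servers reject it. The sketch below reproduces that situation locally with the JDK's built-in HttpServer, so it runs without touching any real site. The class name UserAgentProbe, its two helper methods, and the server's filtering rule are all made up for this illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

// Hypothetical demo class; not part of the original article.
class UserAgentProbe {

	// Starts a local server that answers 403 to the JDK's default User-Agent
	// (which begins with "Java/") and 200 to anything else.
	static HttpServer startPickyServer() {
		try {
			HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
			server.createContext("/", exchange -> {
				String ua = exchange.getRequestHeaders().getFirst("User-Agent");
				int status = (ua == null || ua.startsWith("Java/")) ? 403 : 200;
				exchange.sendResponseHeaders(status, -1); // -1 means no response body
				exchange.close();
			});
			server.start();
			return server;
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	// Returns the HTTP status code; a null userAgent keeps the JDK default.
	static int statusOf(String address, String userAgent) {
		try {
			HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
			if (userAgent != null) {
				conn.setRequestProperty("User-Agent", userAgent);
			}
			try {
				return conn.getResponseCode(); // does not throw on 403, unlike getInputStream()
			} finally {
				conn.disconnect();
			}
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		HttpServer server = startPickyServer();
		String address = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
		System.out.println(statusOf(address, null));          // default Java User-Agent
		System.out.println(statusOf(address, "Mozilla/5.0")); // browser-like User-Agent
		server.stop(0);
	}
}
```

Reading the status with getResponseCode() is also a handy first diagnostic against a real site, because getInputStream() only throws an IOException on a 403 and hides the actual cause.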

Dependencies required by the changed code

		<!-- Dependency for fetching page content -->
		<dependency>
			<groupId>net.sourceforge.htmlunit</groupId>
			<artifactId>htmlunit</artifactId>
			<version>2.43.0</version>
		</dependency>

The modified code


	import com.gargoylesoftware.htmlunit.BrowserVersion;
	import com.gargoylesoftware.htmlunit.WebClient;
	import com.gargoylesoftware.htmlunit.html.HtmlPage;

	public static void main(String[] args) {
		// Vue-rendered pages (NiuTrans, Bilibili) used as test targets
		String nowHtml = "https://www.niutrans.com";
		// String nowHtml = "https://www.bilibili.com";
		getWebBody(nowHtml);
	}

	public static void getWebBody(String nowHtml) {
		// try-with-resources closes the WebClient even if loading fails
		try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
			webClient.getOptions().setActiveXNative(false); // Disable ActiveX
			webClient.getOptions().setCssEnabled(false); // The page is never displayed, so CSS is not needed
			webClient.getOptions().setUseInsecureSSL(true); // Accept connections to any host, whether or not its certificate is valid
			webClient.getOptions().setJavaScriptEnabled(true); // Important: enable JS
			webClient.getOptions().setDownloadImages(false); // Do not download images
			webClient.getOptions().setThrowExceptionOnScriptError(false); // Do not throw when JS execution fails
			webClient.getOptions().setThrowExceptionOnFailingStatusCode(false); // Do not throw when the HTTP status indicates failure (e.g. 403)
			webClient.getOptions().setTimeout(15 * 1000); // Wait at most 15 s
			webClient.getOptions().setConnectionTimeToLive(15 * 1000);

			HtmlPage page = webClient.getPage(nowHtml); // Load the page
			// Asynchronous JS takes time, so block for up to 10 s after loading until background JS finishes
			webClient.waitForBackgroundJavaScript(10 * 1000);

			String htmlStr = page.getBody().asXml();
			System.out.println(htmlStr);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
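A fixed wait for asynchronous JS is a guess: too short and the content is not there yet, too long and every fetch is slow. A possible alternative, sketched below in plain Java, is to poll for a condition until it holds or a deadline passes. Poller and waitUntil are hypothetical names, not part of HtmlUnit, and the condition in main is only a stand-in.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; not part of the original article or of HtmlUnit.
class Poller {

	// Re-checks `condition` every `intervalMs` until it holds or `timeoutMs` elapses.
	// Returns true if the condition held before the deadline, false otherwise.
	static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
		long deadline = System.currentTimeMillis() + timeoutMs;
		while (true) {
			if (condition.getAsBoolean()) {
				return true;
			}
			if (System.currentTimeMillis() >= deadline) {
				return false;
			}
			try {
				Thread.sleep(intervalMs);
			} catch (InterruptedException e) {
				Thread.currentThread().interrupt(); // preserve the interrupt flag
				return false;
			}
		}
	}

	public static void main(String[] args) {
		long start = System.currentTimeMillis();
		// Stand-in condition: becomes true after roughly 300 ms.
		boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2000, 50);
		System.out.println("ready = " + ready);
	}
}
```

With HtmlUnit, the condition could be something like checking that page.asXml() contains a piece of text you expect once rendering is done. That text depends entirely on the target page, so treat it as a placeholder you have to choose yourself.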

The result: the console prints the body of the page as XML.

Three, conclusion

When fetching pages, you can run into HTTPS sites whose certificates are not valid, pages that answer with 403 (Forbidden), and pages that are rendered dynamically by JS (Vue pages). So there are several things to consider when obtaining page content. With the options shown above, HtmlUnit handled all three cases here, and it does not require installing a browser the way browser-driven tools do. In general, HtmlUnit is versatile and easy to use!
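To make the point about JS-rendered pages concrete: a plain fetch of a Vue page typically returns only an empty mount element plus script tags, and the real content appears only after the scripts run. Below is a rough heuristic for spotting that situation in raw HTML. It assumes the mount point uses id="app", which is a common Vue convention but by no means guaranteed, and ShellDetector and its sample strings are invented for this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical heuristic; not part of the original article.
class ShellDetector {

	// Matches an element with id="app" that contains nothing but whitespace.
	// id="app" is only the conventional Vue mount point, so this is an assumption.
	private static final Pattern EMPTY_MOUNT = Pattern.compile(
			"<(\\w+)[^>]*\\bid\\s*=\\s*[\"']app[\"'][^>]*>\\s*</\\1>",
			Pattern.CASE_INSENSITIVE);

	// Returns true if the HTML looks like an unrendered single-page-app shell.
	static boolean looksLikeEmptyShell(String html) {
		return EMPTY_MOUNT.matcher(html).find();
	}

	public static void main(String[] args) {
		// What a plain fetch might return: an empty mount point and a script tag.
		String rawHtml = "<html><body><div id=\"app\"></div>"
				+ "<script src=\"/js/app.js\"></script></body></html>";
		// What the DOM might look like after JS has run.
		String renderedHtml = "<html><body><div id=\"app\"><h1>Hello</h1></div></body></html>";
		System.out.println(looksLikeEmptyShell(rawHtml));
		System.out.println(looksLikeEmptyShell(renderedHtml));
	}
}
```

A check like this could decide whether the cheap URLConnection fetch is enough or whether the page needs HtmlUnit with JS enabled. A regex over HTML is fragile, so it is only a first filter.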

【 The End 】

Thank you for reading to the end. If you see things differently, you are welcome to leave a comment below this article. I am a southerner who loves computers and loves the motherland. This article is for learning and reference only; if anything in it infringes your rights, I am very sorry, and please contact the author so it can be deleted immediately.