This is the 8th day of my participation in the November Gengwen Challenge. Event details: The Last Gengwen Challenge of 2021.
【A Little Fact a Day】
Edamame and soybeans are actually the same plant: edamame is the young bean, and soybeans are the mature one.
One, Foreword
Recently a netizen asked me for help: he was building a data-scraping report, but his requests kept coming back as 403 and the content of Vue pages would not load at all, so he was ready to give up. What, stuck on a little thing like that? Vue pages not supported? I immediately sent him the code below, which handles both the Vue pages and the 403 pages.
Two, Code Analysis
The code before the change (it cannot fetch Vue pages and gets 403 responses):
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public static void main(String[] args) {
    // Use a Vue page (the NiuTrans translation site) as the test target
    String nowHtml = "https://www.niutrans.com";
    URL url;
    try {
        url = new URL(nowHtml);
        URLConnection openConnection = url.openConnection();
        InputStream inputStream = openConnection.getInputStream();
        byte[] b = new byte[1024];
        int len;
        while ((len = inputStream.read(b)) != -1) {
            System.out.println(new String(b, 0, len));
        }
        inputStream.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
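A side note before the fix: a 403 from a plain URLConnection is often just the server rejecting a request that carries no browser-like headers. Here is a minimal sketch of that idea, assuming the 403 is triggered only by the missing User-Agent (the class name and the header value below are purely illustrative). It may get past a simple header check, but it still cannot execute the JavaScript that renders a Vue page, which is why the HtmlUnit approach below is the real solution.

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class PlainFetchWithUserAgent {
    public static void main(String[] args) throws Exception {
        // Same plain-URLConnection approach, plus a browser-like User-Agent header
        URLConnection conn = new URL("https://www.niutrans.com").openConnection();
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/96.0 Safari/537.36"); // illustrative UA string
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = in.read(buf)) != -1) {
                System.out.print(new String(buf, 0, len, StandardCharsets.UTF_8));
            }
        }
        // Even with the header set, a Vue page served as an empty app shell
        // still comes back without its JS-rendered content.
    }
}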
Dependency required by the modified code
<!-- Dependency for fetching the page content -->
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.43.0</version>
</dependency>
The modified code
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static void main(String[] args) {
    // Vue pages to test with: the NiuTrans translation site and the Bilibili home page
    String nowHtml = "https://www.niutrans.com";
    // String nowHtml = "https://www.bilibili.com";
    getWebBody(nowHtml);
}

public static void getWebBody(String nowHtml) {
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setActiveXNative(false);                     // disable ActiveX
    webClient.getOptions().setCssEnabled(false);                        // the page is never displayed, so CSS is not needed
    webClient.getOptions().setUseInsecureSSL(true);                     // accept connections to any host, even with an invalid certificate
    webClient.getOptions().setJavaScriptEnabled(true);                  // enable JS (the key setting for Vue pages)
    webClient.getOptions().setDownloadImages(false);                    // do not download images
    webClient.getOptions().setThrowExceptionOnScriptError(false);       // do not throw when a script fails
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false); // do not throw when the HTTP status is not 200
    webClient.getOptions().setTimeout(15 * 1000);                       // connection/read timeout: 15 s
    webClient.getOptions().setConnectionTimeToLive(15 * 1000);
    try {
        HtmlPage page = webClient.getPage(nowHtml);                     // load the page
        webClient.waitForBackgroundJavaScript(10 * 1000);               // block up to 10 s so asynchronous JS can finish
        String htmlStr = page.getBody().asXml();
        System.out.println(htmlStr);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        webClient.close();
    }
}
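Because getPage returns a full DOM after the JavaScript has run, the page can also be queried directly instead of dumping the whole body. Here is a minimal sketch of that, assuming the same test URL (the class name, the XPath expression, and the link dump are only an example, not part of the fix above):

import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ExtractLinksExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

            HtmlPage page = webClient.getPage("https://www.niutrans.com");
            webClient.waitForBackgroundJavaScript(10 * 1000); // let asynchronous JS finish

            // Title as rendered (including titles set by JS)
            System.out.println(page.getTitleText());

            // Example: list every link on the rendered page via XPath
            List<?> anchors = page.getByXPath("//a");
            for (Object node : anchors) {
                HtmlAnchor a = (HtmlAnchor) node;
                System.out.println(a.getTextContent().trim() + " -> " + a.getHrefAttribute());
            }
        }
    }
}

The XPath //a here is just a placeholder; swap in whatever selector the actual report needs.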
The result: the console prints the fully rendered page HTML.
Three, Conclusion
Page requests come in several flavors: some sites involve HTTPS certificate checks, some respond with redirects or 403, and some pages are rendered dynamically by JavaScript (Vue pages). All of these have to be handled when fetching page content. HtmlUnit deals with every one of them and, unlike some other tools, does not require a real browser to be set up. All in all, HtmlUnit is versatile and easy to use!
【The End】
Thank you for reading to the end. If you see things differently, feel free to leave a comment below this article. I am a southerner who loves computers and loves the motherland. The content of this article is for learning and reference only; if anything here infringes on your rights, I sincerely apologize, please contact me and I will delete it immediately.