Preface
Recently I spent several days looking for some learning materials and finally found them on a resource forum, but downloading the attachment required gold coins. Topping up is not expensive, but that would hardly live up to "Write the Code. Change the World." In this forum's points system, you earn gold coins whenever someone clicks your invitation link. Many forums work the same way, so I decided to write a small demo that visits the invitation link through proxies to farm gold coins, once and for all.
Collecting proxies
Since the idea is to visit the invitation link through proxies, we obviously need proxy IPs. This website has them. The quality may not be great, but that doesn't matter: we're not looking for a girlfriend, so there is no need to be picky. Quantity is what counts.
I originally intended to copy the proxies by hand, save them locally, and read them from code. In the end I decided to use jsoup to parse the HTML and collect the proxy IPs automatically.
jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
Methods such as getElementsByTag(String tagName), getElementsByClass(String className), attr(String key), text(), and html() are used to find elements and read their data.
Getting the detail page addresses
The href of every a tag under the newslist_line element is the address of a proxy detail page.
Only the first few pages are parsed here. Parsing more would be a waste of time.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Get the URLs of the proxy detail pages from the list pages.
 */
private static List<String> getProxyList() {
    List<String> lists = new ArrayList<>();
    // Parse the first MAX_PAGE pages
    for (int i = 1; i <= MAX_PAGE; i++) {
        String path = "http://www.youdaili.net/Daili/http/list_" + i + ".html";
        try {
            Document document = Jsoup.connect(path).timeout(3000).get();
            Elements newsListElements = document.getElementsByClass("newslist_line");
            // Only one element has this class
            Element list = newsListElements.first();
            if (list == null) {
                continue; // the page does not have the expected layout
            }
            Elements links = list.getElementsByTag("a");
            for (Element link : links) {
                String href = link.attr("href");
                if (href != null && href.length() > 0) {
                    lists.add(href);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return lists;
}
Parsing the proxies from the detail page
On the detail page, all proxy IP addresses can be read from the span inside the p tag under the cont_font element.
getElementsByClass often returns an Elements collection with more than one entry, and other APIs could be used to narrow the selection. Here it is simpler, because Elements.size() is 1.
/**
 * Parse the proxy addresses from a detail page.
 */
private static List<String> getProxyAddr(String url) {
    List<String> lists = new ArrayList<>();
    try {
        // class cont_font --> tag p --> tag span --> br
        Document document = Jsoup.connect(url).timeout(3000).get();
        Elements contentElements = document.getElementsByClass("cont_font");
        // Only one
        Element content = contentElements.first();
        Elements pElements = content.getElementsByTag("p");
        // Only one
        Element p = pElements.first();
        Element span = p.child(0);
        // Each proxy entry is separated by a <br> inside the span
        String[] split = span.html().split("<br>");
        for (String line : split) {
            lists.add(line.trim());
            System.out.println(line.trim());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return lists;
}
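Splitting on the literal string "<br>" only works if the serialized HTML uses exactly that form. As a minimal sketch, assuming the span's inner HTML might also contain <br/> or <br />, plus blank entries, a regex split using only the JDK could look like the following. BrSplitter and splitLines are hypothetical names, not part of the original demo, and the second sample address is made up.

```java
import java.util.ArrayList;
import java.util.List;

class BrSplitter {
    // Hypothetical helper: splits inner HTML on any <br> variant and drops blank entries.
    static List<String> splitLines(String innerHtml) {
        List<String> lines = new ArrayList<>();
        // (?i) makes the match case-insensitive; \s*/? accepts <br>, <br/> and <br />.
        for (String part : innerHtml.split("(?i)<br\\s*/?>")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // First entry is the sample row from the article; the second is invented for illustration.
        String html = "222.45.196.46:8118@HTTP#<br> 10.0.0.1:80@HTTP#<BR/>\n<br />";
        for (String line : splitLines(html)) {
            System.out.println(line);
        }
    }
}
```

Dropping empty entries here means the later parsing step never sees a blank row, which the plain split can produce when the span ends with a trailing line break.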
Sending the requests
The next step is to create a thread pool, start multiple threads, and send the HTTP requests through the proxies.
- Initialize the thread pool. Tuning the thread count for best performance is not considered here.
private static ExecutorService cachedThreadPool = Executors.newFixedThreadPool(10);
- Earlier we collected every row of the proxy detail pages (for example 222.45.196.46:8118@HTTP#). Next, we extract the address and port from each row to build a list of InetSocketAddress objects.
/**
 * Build the proxy addresses.
 */
private static List<InetSocketAddress> generateInetScoketAddr(List<String> proxyList) {
    List<InetSocketAddress> inetSocketAddresses = new ArrayList<>();
    for (String proxyStr : proxyList) {
        String addr;
        int port;
        int firstKeyIndex = proxyStr.indexOf(":");
        if (firstKeyIndex != -1) {
            addr = proxyStr.substring(0, firstKeyIndex);
            int secondKeyIndex = proxyStr.indexOf("@");
            if (secondKeyIndex != -1) {
                try {
                    port = Integer.parseInt(proxyStr.substring(firstKeyIndex + 1, secondKeyIndex).trim());
                } catch (NumberFormatException e) {
                    continue; // skip rows whose port is not a number
                }
                System.out.printf("addr:%s port:%d%n", addr, port);
                inetSocketAddresses.add(new InetSocketAddress(addr, port));
            }
        }
    }
    return inetSocketAddresses;
}
- Finally, the requests are sent through the thread pool.
/**
 * Start the tasks.
 *
 * @param proxyLists proxy addresses to send requests through
 */
private static void excuteTask(List<InetSocketAddress> proxyLists) {
    for (InetSocketAddress intSocketAddrees : proxyLists) {
        cachedThreadPool.execute(new Runnable() {
            @Override
            public void run() {
                doGet(intSocketAddrees);
            }
        });
    }
}

/**
 * Execute the request.
 *
 * @param intSocketAddrees address of the proxy to use
 */
private static void doGet(InetSocketAddress intSocketAddrees) {
    HttpURLConnection conn = null;
    try {
        URL url = new URL(Main.url);
        // Set the proxy
        Proxy proxy = new Proxy(Proxy.Type.HTTP, intSocketAddrees);
        conn = (HttpURLConnection) url.openConnection(proxy);
        // Set the User-Agent
        conn.setRequestProperty("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
        // Disable caching
        conn.setRequestProperty("Cache-Control", "no-store");
        conn.setRequestProperty("Pragma", "no-cache");
        conn.setUseCaches(false);
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(6000);
        conn.setReadTimeout(6000);
        conn.connect();
        if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
            // Most proxies will probably fail, so getting here only means it may have worked
            System.out.println("may success");
        }
    } catch (MalformedURLException e) {
        System.out.println("Exception");
    } catch (IOException e) {
        System.out.println("Exception");
    } finally {
        if (conn != null) {
            conn.disconnect();
        }
    }
}
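The three steps above can be tried end to end without touching the network. The following is only a sketch under stated assumptions: ProxyPipelineSketch, parseLine, and runAll are hypothetical names, the task body just prints instead of opening a connection, and createUnresolved is used in place of the article's new InetSocketAddress(addr, port) so that parsing never triggers a DNS lookup. It also shuts the pool down and waits for it to drain, which the demo above does not do.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ProxyPipelineSketch {
    // Hypothetical helper: turns "host:port@TYPE#" into an address, or null if the row is malformed.
    static InetSocketAddress parseLine(String line) {
        int colon = line.indexOf(':');
        int at = line.indexOf('@');
        if (colon <= 0 || at <= colon + 1) {
            return null; // missing host, missing port, or missing '@'
        }
        try {
            int port = Integer.parseInt(line.substring(colon + 1, at).trim());
            if (port < 1 || port > 65535) {
                return null;
            }
            // Assumption: an unresolved address is enough at parse time.
            return InetSocketAddress.createUnresolved(line.substring(0, colon).trim(), port);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Submits one task per proxy, waits for the pool to drain, and returns how many tasks ran.
    static int runAll(List<InetSocketAddress> addresses) {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        AtomicInteger completed = new AtomicInteger();
        for (InetSocketAddress address : addresses) {
            pool.execute(() -> {
                Proxy proxy = new Proxy(Proxy.Type.HTTP, address);
                // A real task would open the connection through this proxy here.
                System.out.println("would connect via " + proxy.address());
                completed.incrementAndGet();
            });
        }
        pool.shutdown(); // stop accepting new tasks; already submitted ones still run
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return completed.get();
    }

    public static void main(String[] args) {
        List<InetSocketAddress> addresses = new ArrayList<>();
        // The first row is the sample from the article; the second is deliberately malformed.
        for (String row : new String[] {"222.45.196.46:8118@HTTP#", "not a proxy line"}) {
            InetSocketAddress address = parseLine(row);
            if (address != null) {
                addresses.add(address);
            }
        }
        System.out.println("tasks run: " + runAll(addresses));
    }
}
```

Without the shutdown call, the ten worker threads of a fixed pool are non-daemon and keep the JVM alive after the last request has finished.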
And finally, take a look at the result. (A friendly tip from an old hand: this method can also be used to farm 1024! 1024! 1024! As for exactly how, try it yourself!)