Preface

Recently I found some learning materials on a resource forum that I had been hunting for over several days, but downloading the attachment costs gold coins. Topping up is not expensive, but the motto is "Write the Code. Change the world." Under this forum's points system, you earn gold coins when someone clicks your invitation link. Many forums work the same way, so I decided to write a small demo that visits the invitation link through proxies to farm gold coins, once and for all.

Collecting proxies

Since the idea is to visit the invitation link through proxies, proxy IPs are obviously required. This website lists them. The quality may not be great, but that does not matter: we are not looking for a girlfriend, so passable is good enough as long as there are plenty of them.

I originally intended to copy the proxies by hand, save them locally, and read them in from code. In the end I decided to use jsoup to parse the HTML and collect the proxy IPs.

jsoup is a Java HTML parser that can parse a URL or HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

Methods such as getElementsByTag(String tagName), getElementsByClass(String className), attr(String key), text(), and html() are used to find elements and read their data.

Get the details page address

The href of every a tag under the element with class newslist_line is the address of a proxy details page.

Only the first few pages are parsed here; parsing more is a waste of time.

    /**
     * Get the URLs of the proxy details pages from the home page.
     */
    private static List<String> getProxyList() {
        List<String> lists = new ArrayList<>();
        // Parse the first MAX_PAGE pages
        for (int i = 1; i <= MAX_PAGE; i++) {
            String path = "http://www.youdaili.net/Daili/http/list_" + i + ".html";
            try {
                Document document = Jsoup.connect(path).timeout(3000).get();
                Elements newsListElements = document.getElementsByClass("newslist_line");
                // Only one
                Element list = newsListElements.first();
                Elements links = list.getElementsByTag("a");
                for (Element link : links) {
                    String href = link.attr("href");
                    if (href != null && href.length() > 0) {
                        lists.add(href);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return lists;
    }


Parse the proxies on the details page

On the details page, all proxy IP addresses can be read from the p tag under the element with class cont_font.

getElementsByClass often returns an Elements collection with more than one entry. Other APIs could be used to pick out exactly what we want, but here things are simpler because Elements.size() is 1.

    /**
     * Parse the proxy addresses from a details page.
     */
    private static List<String> getProxyAddr(String url) {
        List<String> lists = new ArrayList<>();
        try {
            // class cont_font --> tag p --> tag span --> br
            Document document = Jsoup.connect(url).timeout(3000).get();
            Elements contentElements = document.getElementsByClass("cont_font");
            // Only one
            Element content = contentElements.first();
            Elements pElements = content.getElementsByTag("p");
            // Only one
            Element p = pElements.first();
            Element span = p.child(0);
            String[] split = span.html().split("<br>");
            for (int i = 0; i < split.length; i++) {
                lists.add(split[i].trim());
                System.out.println(split[i].trim());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return lists;
    }
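
To make the split step concrete, here is a minimal sketch that uses only the standard library and shows what happens to the span's HTML. The sample markup and the names BrSplitSketch and splitProxyLines are assumptions for illustration, not taken from the real page. Unlike the method above, the sketch also drops empty entries.

```java
import java.util.ArrayList;
import java.util.List;

public class BrSplitSketch {

    /** Splits the span's inner HTML on <br> and keeps the trimmed, non-empty entries. */
    static List<String> splitProxyLines(String spanHtml) {
        List<String> lines = new ArrayList<>();
        for (String part : spanHtml.split("<br>")) {
            String line = part.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical markup: one "address:port@protocol#" entry per <br>-separated line.
        String sample = "222.45.196.46:8118@HTTP#<br>\n10.0.0.1:80@HTTP#<br>";
        System.out.println(splitProxyLines(sample));
    }
}
```

Trimming matters because the HTML serializer may put a line break after each br, so the raw pieces can carry leading whitespace.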


Sending the requests

The next step is to create a thread pool, open multiple threads, and set up a proxy for HTTP requests.

  1. Initialize the thread pool. The number of threads for optimal performance is not considered here.

        private static ExecutorService cachedThreadPool = Executors.newFixedThreadPool(10);
  2. Earlier we collected each line of the proxy details page (for example 222.45.196.46:8118@HTTP#). Next we extract the address and port from each line to build a list of InetSocketAddress objects.

        /**
         * Build the proxy addresses.
         */
        private static List<InetSocketAddress> generateInetScoketAddr(List<String> proxyList) {
            List<InetSocketAddress> inetSocketAddresses = new ArrayList<>();
            for (String proxyStr : proxyList) {
                String addr;
                int port;
                int firstKeyIndex = proxyStr.indexOf(":");
                if (firstKeyIndex != -1) {
                    addr = proxyStr.substring(0, firstKeyIndex);
                    int secondKeyIndex = proxyStr.indexOf("@");
                    if (secondKeyIndex != -1) {
                        port = Integer.parseInt(proxyStr.substring(firstKeyIndex + 1, secondKeyIndex));
                        System.out.printf("addr:%s port:%d%n", addr, port);
                        inetSocketAddresses.add(new InetSocketAddress(addr, port));
                    }
                }
            }
            return inetSocketAddresses;
        }
  3. Finally, the request is made through the thread pool

        /**
         * Start the tasks.
         *
         * @param proxyLists proxy addresses to send requests through
         */
        private static void excuteTask(List<InetSocketAddress> proxyLists) {
            for (final InetSocketAddress intSocketAddrees : proxyLists) {
                cachedThreadPool.execute(new Runnable() {
                    @Override
                    public void run() {
                        doGet(intSocketAddrees);
                    }
                });
            }
        }

        /**
         * Execute the request.
         *
         * @param intSocketAddrees proxy address to send the request through
         */
        private static void doGet(InetSocketAddress intSocketAddrees) {
            HttpURLConnection conn = null;
            try {
                URL url = new URL(Main.url);
                // Set the proxy
                Proxy proxy = new Proxy(Proxy.Type.HTTP, intSocketAddrees);
                conn = (HttpURLConnection) url.openConnection(proxy);
                // Set the User-Agent
                conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
                // Disable caching
                conn.setRequestProperty("Cache-Control", "no-store");
                conn.setRequestProperty("Pragma", "no-cache");
                conn.setUseCaches(false);
                conn.setRequestMethod("GET");
                conn.setConnectTimeout(6000);
                conn.setReadTimeout(6000);
                conn.connect();
                if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
                    // It will probably fail
                    System.out.println("may success");
                }
            } catch (MalformedURLException e) {
                System.out.println("Exception");
            } catch (IOException e) {
                System.out.println("Exception");
            } finally {
                if (conn != null) {
                    conn.disconnect();
                }
            }
        }
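
The three steps above can be sketched end to end without touching the network. This is only a sketch under assumptions: the class ProxyPoolSketch, the regular expression, and the parse and runAll helpers are hypothetical names, and the regex is a swapped-in alternative to the indexOf parsing in step 2. The sketch also shuts the pool down after submitting the tasks, which the code above leaves out. Without a shutdown, the pool's non-daemon worker threads keep the JVM alive after the work is done.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProxyPoolSketch {

    // Matches lines such as "222.45.196.46:8118@HTTP#": host, ':', port, '@'.
    private static final Pattern LINE = Pattern.compile("^([^:@\\s]+):(\\d{1,5})@");

    /** Returns the address for one line, or null if the line is malformed. */
    static InetSocketAddress parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.find()) {
            return null;
        }
        int port = Integer.parseInt(m.group(2));
        if (port > 65535) {
            return null;
        }
        // createUnresolved avoids any name lookup while parsing.
        return InetSocketAddress.createUnresolved(m.group(1), port);
    }

    /** Runs one placeholder task per address on a fixed pool; returns how many finished. */
    static int runAll(List<InetSocketAddress> addrs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        for (InetSocketAddress addr : addrs) {
            // Placeholder: the real task would call doGet(addr).
            pool.execute(() -> {
                System.out.println("would request via " + addr.getHostString() + ":" + addr.getPort());
                done.incrementAndGet();
            });
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return done.get();
    }

    public static void main(String[] args) {
        List<InetSocketAddress> addrs = new ArrayList<>();
        for (String line : new String[] {"222.45.196.46:8118@HTTP#", "not a proxy"}) {
            InetSocketAddress addr = parse(line);
            if (addr != null) {
                addrs.add(addr);
            }
        }
        System.out.println(runAll(addrs, 2));
    }
}
```

Returning null for malformed lines keeps one bad entry from aborting the whole run, which matters when the list comes from a scraped page.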

And finally, take a look at the result. (By the way, a friendly tip from an old hand: this method can also be used to farm 1024! 1024! 1024! How exactly is left for you to try.)