Why use a proxy IP
Many websites place restrictions on crawlers grabbing their data; anyone who has written a crawler knows this all too well, the most common being your IP getting thrown into the little black house (banned). So, to play it safe, you can't use your own real IP to crawl someone else's site; that's when you need proxy IPs to do the job...
Why use a highly anonymous proxy
We can compare the differences between the various types of proxies. By degree of anonymity, proxies can be classified as follows (a quick way to probe the difference yourself is sketched after the list):
- Highly anonymous proxy: forwards the packet unchanged, so to the server it looks like an ordinary client, and the IP the server records is the proxy server's.
- Ordinary anonymous proxy: makes some changes to the packet, so the server may detect that it is talking to a proxy, and there is some chance the client's real IP can be traced.
- Transparent proxy: not only changes the packet, but also tells the server the client's real IP address.
- Spy proxy: a proxy server set up by an organization or individual to record the data users transmit through it, for research or monitoring purposes.
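To see these differences in practice, here is a minimal sketch (my own illustration, not from the original article) that sends a request through a proxy to httpbin.org/headers, which echoes back the headers the server received; a transparent or ordinary anonymous proxy typically reveals itself through headers such as X-Forwarded-For or Via, while a highly anonymous proxy adds neither:

import requests

# Hypothetical proxy address; replace with a real proxy you want to probe
proxy = 'http://1.2.3.4:8080'
proxies = {'http': proxy, 'https': proxy}

# httpbin.org/headers echoes the request headers the server received,
# so any X-Forwarded-For or Via header injected by the proxy shows up here
response = requests.get('http://httpbin.org/headers', proxies=proxies, timeout=5)
print(response.json())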
Runtime environment
Use pip install xxx to install whatever third-party dependencies the script needs.
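For this script, that means the requests and BeautifulSoup libraries (the package names below are the usual PyPI ones):

pip install requests
pip install beautifulsoup4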
Crawling the Xici proxy IPs
Here I'm only crawling 50 pages of Xici's high-anonymity proxy data; of course, crawling 100 pages, or all of them, is also fine; it's up to you.
def run(self):
    """Execution entry point"""
    page_list = range(1, 51)
    with open("ip.json", "w") as write_file:
        for page in page_list:
            # Crawl the data page by page
            print('Start crawling page ' + str(page) + ' of IP data')
            ip_url = self.base_url + str(page)
            html = self.get_url_html(ip_url)
            soup = BeautifulSoup(html, 'html.parser')
            # The IP list (note: only table rows with class 'odd' are selected)
            ip_list = soup.select('#ip_list .odd')
            for ip_tr in ip_list:
                # Data for a single IP
                td_list = ip_tr.select('td')
                ip_address = td_list[1].get_text()
                ip_port = td_list[2].get_text()
                ip_type = td_list[5].get_text()
                info = {'ip': ip_address, 'port': ip_port, 'type': ip_type}
                # Verify that the IP works before storing it
                check_res = self.check_ip(info)
                if check_res:
                    print('Valid IP:', info)
                    self.json_data.append(info)
                else:
                    print('Invalid IP:', info)
        # Write all valid IPs to the file once crawling is done
        json.dump(self.json_data, write_file)
Check whether the proxy IP address is valid
A crawled proxy IP may well be unusable. To make the list convenient to use and avoid a pile of exceptions later, we need to check whether each IP actually works as a valid proxy before keeping it. Here are three websites that make it easy to test whether a proxy IP is usable (a small sketch of the idea follows the list):
- icanhazip.com/ This site returns the proxy's visible IP address directly, as plain text
- www.ip.cn/ Queries the proxy's IP address and location
- ip.chinaz.com/ The Chinaz webmaster tools can also look up an IP address and its location
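As a small sketch of the idea (the helper name is mine, not from the original code): using icanhazip.com from the list above, you can compare the IP the site reports with the proxy's IP, which confirms the request really went out through the proxy. Note that some proxies exit through a different IP than the one advertised, so this check is stricter than a plain status-code test:

import requests

def check_proxy_exit(ip, port):
    """Return True if icanhazip.com sees the proxy's IP instead of ours."""
    proxy = 'http://{}:{}'.format(ip, port)
    proxies = {'http': proxy, 'https': proxy}
    try:
        # icanhazip.com returns the visible client IP as plain text
        response = requests.get('http://icanhazip.com/', proxies=proxies, timeout=3)
        return response.text.strip() == ip
    except Exception:
        return False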
def check_ip(self, ip_info):
    """Test whether the proxy IP is valid"""
    ip_url = ip_info['ip'] + ':' + str(ip_info['port'])
    proxies = {'http': 'http://' + ip_url, 'https': 'https://' + ip_url}
    res = False
    try:
        request = requests.get(url=self.check_url, headers=self.header, proxies=proxies, timeout=3)
        if request.status_code == 200:
            res = True
    except Exception as error_info:
        res = False
    return res
Storing the proxy IPs
I'm not doing anything fancy here: I simply store all the valid proxy IP data in JSON format in a file. Of course, you could also store them in MongoDB or a MySQL database; either way, the point is to make it convenient to grab a random IP later.
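For example, a minimal sketch of using the stored file (assuming the ip.json written by run() above; the target URL is only a placeholder) to send a request through a randomly chosen proxy:

import json
import random
import requests

# Load the proxy list written by run()
with open('ip.json') as read_file:
    ip_list = json.load(read_file)

# Pick one proxy at random and build the proxies dict that requests expects
info = random.choice(ip_list)
proxy = 'http://{}:{}'.format(info['ip'], info['port'])
proxies = {'http': proxy, 'https': proxy}

# Placeholder target URL; replace with the site you actually want to crawl
response = requests.get('https://example.com/', proxies=proxies, timeout=5)
print(response.status_code)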
The complete code
I have uploaded the code to GitHub (see the GitHub source address). But, as a warm-hearted brick-mover, to save those who'd rather be lazy a trip to the "dating site" (as we jokingly call GitHub), I'm also posting the source code here. If you hit any problems, it's best to come find me on GitHub; now, please take the code...
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Author: Gxcuizy
Date: 2020-06-19
"""

import requests
from bs4 import BeautifulSoup
import json


class GetIpData(object):
    """Crawl 50 pages of domestic high-anonymity proxy IPs"""

    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'}
    base_url = 'https://www.xicidaili.com/nn/'
    check_url = 'https://www.ip.cn/'
    json_data = []

    def get_url_html(self, url):
        """Request the page HTML"""
        request = requests.get(url=url, headers=self.header, timeout=5)
        html = False
        if request.status_code == 200:
            html = request.content
        return html

    def check_ip(self, ip_info):
        """Test whether the proxy IP is valid"""
        ip_url = ip_info['ip'] + ':' + str(ip_info['port'])
        proxies = {'http': 'http://' + ip_url, 'https': 'https://' + ip_url}
        res = False
        try:
            request = requests.get(url=self.check_url, headers=self.header, proxies=proxies, timeout=3)
            if request.status_code == 200:
                res = True
        except Exception as error_info:
            res = False
        return res

    def run(self):
        """Execution entry point"""
        page_list = range(1, 51)
        with open("ip.json", "w") as write_file:
            for page in page_list:
                # Crawl the data page by page
                print('Start crawling page ' + str(page) + ' of IP data')
                ip_url = self.base_url + str(page)
                html = self.get_url_html(ip_url)
                soup = BeautifulSoup(html, 'html.parser')
                # The IP list (note: only table rows with class 'odd' are selected)
                ip_list = soup.select('#ip_list .odd')
                for ip_tr in ip_list:
                    # Data for a single IP
                    td_list = ip_tr.select('td')
                    ip_address = td_list[1].get_text()
                    ip_port = td_list[2].get_text()
                    ip_type = td_list[5].get_text()
                    info = {'ip': ip_address, 'port': ip_port, 'type': ip_type}
                    # Verify that the IP works before storing it
                    check_res = self.check_ip(info)
                    if check_res:
                        print('Valid IP:', info)
                        self.json_data.append(info)
                    else:
                        print('Invalid IP:', info)
            # Write all valid IPs to the file once crawling is done
            json.dump(self.json_data, write_file)


# Program main entry
if __name__ == '__main__':
    # Instantiate the crawler
    ip = GetIpData()
    # Execute the script
    ip.run()
Finally
As always, if you have any questions, leave a message or reach me through whatever channel works for you. Let's learn from each other, share, and grow together...