01 Foreword

I often hear people complain that their IP gets blocked by websites because they crawl too heavily, forcing them to switch proxy IPs all the time. Most free proxies found online simply don't work, and paid VIP proxies cost money and effort and still end up blocked sooner or later. This article shows how to build a proxy pool in Python that automatically collects working proxy IPs, saving both time and effort.

02 Operation Principle

One, acquiring proxies from websites

1. Crawl the IP lists of free proxy sites and test whether each proxy works and is highly anonymous (elite).

2. If yes, add them to the database. Otherwise, discard them.

3. Repeat Step 2

Two, proxy pool maintenance: ensure that proxies that have stopped working are removed from the pool as soon as possible

1. Fetch proxy IPs from the pool database

2. Test the availability and anonymity of IP addresses

3. If it is available and anonymous, retain it; otherwise, discard it.

4. Repeat Step 1

Note ①: you can run the crawler as a daemon process; if you are not familiar with daemons, Google the term yourself.
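As a rough illustration of the daemon idea, here is a minimal sketch. It is not part of the original project: the function name and interval are assumptions, and it relies on the ProxyPool class shown in section 03.

```python
import time

def run_daemon(pool, crawl_once, interval=600):
    # pool: a ProxyPool instance (see section 03)
    # crawl_once: a function that crawls the proxy sites and stores new proxies
    # interval: seconds to sleep between rounds (assumed value)
    while True:
        crawl_once(pool)        # crawl, test and store fresh proxies
        pool.cleanNonWorking()  # remove proxies that have stopped working
        time.sleep(interval)
```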

Note ②: you can also expose the pool through a simple external interface; it doesn't matter whether you write it in NodeJS, Flask/Django, or PHP. I won't go into that here.
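For instance, here is a minimal sketch of such an interface, assuming Flask is chosen and the SQLite table created in section 03 is used; the route and database file name are made up for illustration.

```python
import random
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/proxy")
def get_proxy():
    # Return one random proxy from the pool as JSON
    conn = sqlite3.connect("ProxyPool.db")  # assumed database file name
    rows = conn.execute("SELECT ip, port, protocol FROM TB_ProxyPool").fetchall()
    conn.close()
    if not rows:
        return jsonify(error="proxy pool is empty"), 503
    ip, port, protocol = random.choice(rows)
    return jsonify(ip=ip, port=port, protocol=protocol)

if __name__ == "__main__":
    app.run()
```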

03 Implementation

Suggested libraries: requests, BeautifulSoup (bs4), re, sqlite3.

requests is used to fetch the proxy list pages, BeautifulSoup and re are used to extract the proxy information, and sqlite3 is used to store it.

If necessary (for example, when a proxy site uses anti-crawler measures), PhantomJS can be used instead of requests, and the corresponding library can be used for data cleaning (for example, base64 decoding).

Here’s a quick look at the code for each part:

The first step is to pick a few sites whose proxy lists are easy to crawl and whose proxy IPs are not blocked too quickly; proxy-list.org is used as the example here:

```python
import re
import base64
import requests
from bs4 import BeautifulSoup as bs

BASE_URL = "https://proxy-list.org/english/index.php?p="

# Regular expressions for extracting the IP address and the port
Re_Pattern_IP = re.compile("(.*):")
Re_Pattern_PORT = re.compile(":(.*)")

# startingURL_Param (the page number) and the surrounding loop are defined in the full code
HTML_ProxyPage = requests.get(BASE_URL + str(startingURL_Param)).content
soup = bs(HTML_ProxyPage, "html.parser")
for Raw_ProxyInfo in soup.find_all("ul", {"class": None}):
    # This site obfuscates the proxy string with simple Base64 encoding, so decode it first
    ip_port = base64.b64decode(Raw_ProxyInfo.find("li", {"class": "proxy"}).text.replace("Proxy('", "").replace("')", ""))
    IP = re.findall(Re_Pattern_IP, ip_port)[0]
    PORT = re.findall(Re_Pattern_PORT, ip_port)[0]
    TYPE = Raw_ProxyInfo.find("li", {"class": "https"}).text
```

The following is a simple proxy pool framework class that supports adding proxies to and deleting them from the database, as well as connectivity and anonymity checks:

```python
import sqlite3
import requests

# REQ_TIMEOUT is a module-level request timeout (in seconds) defined in the full code

class ProxyPool:
    # Open the SQLite database and create the proxy table if it does not exist
    def __init__(self, ProxyPoolDB):
        self.ProxyPoolDB = ProxyPoolDB
        self.conn = sqlite3.connect(self.ProxyPoolDB, isolation_level=None)
        self.cursor = self.conn.cursor()
        self.TB_ProxyPool = "TB_ProxyPool"
        self.cursor.execute("CREATE TABLE IF NOT EXISTS " + self.TB_ProxyPool + "(ip TEXT UNIQUE, port INTEGER, protocol TEXT)")

    # Add a proxy record, ignoring duplicates
    def addProxy(self, IP, PORT, PROTOCOL):
        self.cursor.execute("INSERT OR IGNORE INTO " + self.TB_ProxyPool + "(ip, port, protocol) VALUES (?,?,?)", [IP, PORT, PROTOCOL])

    # Check whether the proxy is reachable and hides the real IP
    def testConnection(self, IP, PORT, PROTOCOL):
        proxies = {PROTOCOL: IP + ":" + PORT}
        try:
            OriginalIP = requests.get("http://icanhazip.com", timeout=REQ_TIMEOUT).content
            MaskedIP = requests.get("http://icanhazip.com", timeout=REQ_TIMEOUT, proxies=proxies).content
            if OriginalIP != MaskedIP:
                return True
            else:
                return False
        except:
            return False

    # Delete a proxy record by IP
    def delRecord(self, IP):
        self.cursor.execute("DELETE FROM " + self.TB_ProxyPool + " WHERE ip=?", (IP,))
```
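A hypothetical usage sketch of the class above (the IP, port, and database file name are made up for illustration):

```python
pool = ProxyPool("ProxyPool.db")
pool.addProxy("1.2.3.4", 8080, "http")
# testConnection expects the port as a string because it concatenates it into the proxy URL
if pool.testConnection("1.2.3.4", "8080", "http"):
    print("proxy is alive and hides the real IP")
else:
    pool.delRecord("1.2.3.4")
```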

Here is the code that removes invalid IPs from the proxy pool:

```python
# Walk through the pool and drop proxies that are dead or not anonymous
def cleanNonWorking(self):
    for info in self.cursor.execute("SELECT * FROM " + self.TB_ProxyPool).fetchall():
        IP = info[0]
        PORT = str(info[1])
        PROTOCOL = info[2].lower()
        isAnonymous = self.testConnection(IP, PORT, PROTOCOL)
        if isAnonymous == False:
            # Remove the record when the proxy is dead or not anonymous
            self.delRecord(IP)

# Check whether the proxy is reachable and hides the real IP
def testConnection(self, IP, PORT, PROTOCOL):
    proxies = {PROTOCOL: IP + ":" + PORT}
    try:
        OriginalIP = requests.get("http://icanhazip.com", timeout=REQ_TIMEOUT).content
        MaskedIP = requests.get("http://icanhazip.com", timeout=REQ_TIMEOUT, proxies=proxies).content
        if OriginalIP != MaskedIP:
            return True
        else:
            return False
    except:
        return False
```

04 Reflections

I hand-wrote this project in Python at the beginning of the year. Looking back at it with my current skill level, the logic is not rigorous enough, the functions are too tightly coupled, and many parts need to be rewritten.

Detecting proxy anonymity by comparing the IP returned by icanhazip.com works to a degree, but it ignores HTTP headers such as X-Forwarded-For, through which a transparent proxy can still leak the real IP, so this check is risky and needs to be improved.
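One possible improvement, sketched here as an assumption rather than the project's actual code, is to fetch a header-echo service such as httpbin.org/headers through the proxy and make sure the real IP does not appear in any forwarded header:

```python
import requests

REQ_TIMEOUT = 10  # assumed timeout in seconds

def isElite(real_ip, proxy_url):
    # real_ip: your own public IP; proxy_url: e.g. "1.2.3.4:8080" (hypothetical)
    proxies = {"http": proxy_url}
    try:
        echoed = requests.get("http://httpbin.org/headers",
                              proxies=proxies, timeout=REQ_TIMEOUT).text
        # A transparent proxy typically leaks the client IP via X-Forwarded-For or Via
        return real_ip not in echoed
    except requests.RequestException:
        return False
```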

Validating the proxies in the pool should be done with multiple threads; the current single-threaded scheme is too inefficient.
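A minimal sketch of how the validation could be parallelized (this is not the original code; it assumes concurrent.futures, which is standard in Python 3 and available on Python 2.7 via the futures backport):

```python
from concurrent.futures import ThreadPoolExecutor

def cleanNonWorkingConcurrent(pool, max_workers=20):
    rows = pool.cursor.execute("SELECT * FROM " + pool.TB_ProxyPool).fetchall()

    def check(row):
        ip, port, protocol = row[0], str(row[1]), row[2].lower()
        return ip, pool.testConnection(ip, port, protocol)

    # Network checks run in worker threads; database writes stay in the main thread
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for ip, ok in executor.map(check, rows):
            if not ok:
                pool.delRecord(ip)
```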

05 Complete Code

This article only shows the core code of the proxy pool, as a reference for readers who want to implement their own. The full code can be found on the author's GitHub page; it targets Python 2.7 and was tested on Ubuntu 16.04 and Kali.


The original post was published on November 28, 2016

Author: Cangming

This article is from the Python Chinese Community, a Cloud Community partner. For more information, follow the WeChat official account of the Python Chinese Community.