Anti-crawling technology is increasingly mature. To crawl data from a target site, the crawler's requests must be disguised to deceive the target system: the target judges request frequency and request parameters, bans IPs it suspects of being crawlers, or forces them through security verification. Randomly generated request headers (via the third-party fake_useragent library) together with a deliberately slow crawl rate can evade most anti-crawling mechanisms, but against systems with a higher security level the IP is still very likely to be banned. When that happens, switching to a proxy IP lets the crawl continue, so a pool of valid proxy IPs is essential. There are many commercial dynamic-IP proxy providers online, but it is also worth maintaining your own free IP proxy pool.
IP proxy pool development roadmap:
1. Use a crawler to collect free IP addresses published on the Internet
2. Verify each address and save only the valid ones
Free IP proxy providers:
| IP provider | URL |
| --- | --- |
| Kuaidaili | [https://www.kuaidaili.com/free/inha](https://www.kuaidaili.com/free/inha) |
| 89 Free Proxy | [https://www.89ip.cn/index_1.html](https://www.89ip.cn/index_1.html) |
| High-availability global free proxy IP pool | [https://ip.jiangxianli.com/](https://ip.jiangxianli.com/) |
| 66 Proxy | [http://www.66ip.cn/2.html](http://www.66ip.cn/2.html) |
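
The getDate function below iterates over these provider pages through a list of URL templates named urlNode, whose definition does not appear in the excerpt. A plausible sketch, assuming `@` is the page-number placeholder and one paginated template per provider:

```python
# Assumed definition (not shown in the original post): one paginated URL
# template per provider, with '@' standing in for the page number.
urlNode = [
    'https://www.kuaidaili.com/free/inha/@/',
    'https://www.89ip.cn/index_@.html',
    'https://ip.jiangxianli.com/?page=@',
    'http://www.66ip.cn/@.html',
]
```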
Encapsulate the request method. If a request fails, the wrapper pauses for 3 seconds and tries again, up to 3 attempts in total. The fake_useragent library generates a random User-Agent header for every request:
```python
import time

import requests
from fake_useragent import UserAgent


def getConnect(url):
    i = 0
    while i < 3:
        try:
            # disguise the request with a random User-Agent on every attempt
            headers = {'User-Agent': str(UserAgent().random)}
            response = requests.get(url, headers=headers, timeout=5)
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            time.sleep(3)  # back off before retrying
        i += 1
    # falls through to None after 3 failed attempts
```
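
A quick smoke test of the wrapper (the Kuaidaili page is just an example target):

```python
response = getConnect('https://www.kuaidaili.com/free/inha/1/')
if response is not None:  # None means all three attempts failed
    print(response.status_code)
```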
Fetch the IPs from each page, crawling the free IPs published by the providers listed above. Each page presents its data as an HTML table, so the rows are located with XPath:
```python
from lxml import etree


def getDate():
    for i in range(0, len(urlNode)):
        # fill the '@' page placeholder (page 1 of each provider here)
        url = urlNode[i].replace('@', str(1))
        print(url)
        response = getConnect(url)
        if response is None:
            continue
        content = response.text
        html = etree.HTML(content)
        tr = html.xpath('//tr')
        for j in range(1, len(tr) + 1):
            ip = html.xpath('//tr[' + str(j) + ']/td[1]/text()')
            port = html.xpath('//tr[' + str(j) + ']/td[2]/text()')
            ipType = html.xpath('//tr[' + str(j) + ']/td[4]/text()')
            # the first row of the 66ip table is a header row, skip it
            if len(ip) > 1:
                continue
            # default to HTTP when the type column is missing or malformed
            if len(ipType) == 0 or not ipType[0].isalpha():
                ipType = 'HTTP'
            else:
                ipType = ipType[0]
            if len(ip) != 0 and len(port) != 0:
                checkIp(wash(ip[0]) + ':' + wash(port[0]), wash(ipType))
```
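
The wash helper is called throughout but never defined in the excerpt; presumably it just cleans scraped cell text. A minimal assumed implementation:

```python
def wash(text):
    # assumed helper (not shown in the original): strip the stray
    # whitespace and newlines that come with scraped table cells
    return text.strip()
```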
Verify each IP by sending a request through it as a proxy to http://icanhazip.com/, a service that simply echoes the caller's public IP; if the proxied request succeeds, the IP is considered valid:
```python
def checkIp(ip, ipType):
    url = 'http://icanhazip.com/'
    try:
        headers = {'User-Agent': str(UserAgent().random)}
        # requests expects a scheme-to-proxy-URL mapping,
        # e.g. {'http': 'http://1.2.3.4:8080'}
        proxy = {ipType.lower(): ipType.lower() + '://' + ip}
        response = requests.get(url, headers=headers, proxies=proxy, timeout=5)
        if response.status_code == 200:
            write(ip, ipType)
    except Exception:
        pass  # unreachable or slow proxies are simply discarded
```
Write each valid IP to a file for later use by crawlers:
```python
def write(ip, ipType):
    with open('ip.txt', 'a', encoding='utf-8') as f:
        # one "ip:port type" entry per line
        f.write(wash(ip) + ' ' + wash(ipType) + '\n')
```
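
Once getDate() has populated ip.txt, a crawler can rotate through the pool and fall back to a different proxy whenever one is banned. A minimal sketch, assuming the "ip:port type" line format written above (httpbin.org/ip is just an example target):

```python
import random

import requests


def load_pool(path='ip.txt'):
    # parse each "ip:port type" line back into a requests-style proxies dict
    pool = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            addr, ipType = line.split()
            pool.append({ipType.lower(): ipType.lower() + '://' + addr})
    return pool


def fetch_with_pool(url, pool, attempts=5):
    # pick a random proxy per attempt; a banned or dead proxy just
    # triggers another attempt with a different one
    for _ in range(attempts):
        proxy = random.choice(pool)
        try:
            response = requests.get(url, proxies=proxy, timeout=5)
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            continue
    return None


resp = fetch_with_pool('http://httpbin.org/ip', load_pool())
if resp is not None:
    print(resp.text)  # the proxy's IP, not yours
```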
Follow the WeChat public account [rookie a] and reply "IP pool" to get the full source code.