Anti-crawling technology is increasingly mature. To crawl data from a target site, the crawler's requests must be disguised to deceive the target system: the target judges request frequency and request parameters, bans IPs it suspects of being crawlers, or forces them through security verification. Randomly generated request headers (via the third-party fake_useragent library) together with a deliberately slow crawl rate can evade most anti-crawling mechanisms, but against systems with a higher security level the IP is still very likely to be banned. When that happens, switching to a proxy IP lets the crawl continue, so a pool of valid proxy IPs is essential. There are many commercial dynamic-IP proxy providers online, but it is also worth maintaining your own free IP proxy pool.
IP proxy pool development roadmap:
1. Use a crawler to collect free IP addresses published on the Internet
2. Verify each address and save only the valid ones
Free IP proxy providers:
| IP provider | URL |
| --- | --- |
| Kuaidaili | [https://www.kuaidaili.com/free/inha](https://www.kuaidaili.com/free/inha) |
| 89 Free Proxy | [https://www.89ip.cn/index_1.html](https://www.89ip.cn/index_1.html) |
| High-availability global free proxy IP pool | [https://ip.jiangxianli.com/](https://ip.jiangxianli.com/) |
| 66 Proxy | [http://www.66ip.cn/2.html](http://www.66ip.cn/2.html) |
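
The getDate function below iterates over these provider pages through a list of URL templates named urlNode, whose definition does not appear in the excerpt. A plausible sketch, assuming `@` is the page-number placeholder and one paginated template per provider:

```python
# Assumed definition (not shown in the original post): one paginated URL
# template per provider, with '@' standing in for the page number.
urlNode = [
    'https://www.kuaidaili.com/free/inha/@/',
    'https://www.89ip.cn/index_@.html',
    'https://ip.jiangxianli.com/?page=@',
    'http://www.66ip.cn/@.html',
]
```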
Encapsulate the request method. If a request fails, the wrapper pauses for 3 seconds and tries again, up to 3 attempts in total. The fake_useragent library generates a random User-Agent header for every request:
```python
import time

import requests
from fake_useragent import UserAgent


def getConnect(url):
    i = 0
    while i < 3:
        try:
            # disguise the request with a random User-Agent on every attempt
            headers = {'User-Agent': str(UserAgent().random)}
            response = requests.get(url, headers=headers, timeout=5)
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            time.sleep(3)  # back off before retrying
        i += 1
    # falls through to None after 3 failed attempts
```
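
A quick smoke test of the wrapper (the Kuaidaili page is just an example target):

```python
response = getConnect('https://www.kuaidaili.com/free/inha/1/')
if response is not None:  # None means all three attempts failed
    print(response.status_code)
```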
Fetch the IPs from each page, crawling the free IPs published by the providers listed above. Each page presents its data as an HTML table, so the rows are located with XPath:
```python
from lxml import etree


def getDate():
    for i in range(0, len(urlNode)):
        # fill the '@' page placeholder (page 1 of each provider here)
        url = urlNode[i].replace('@', str(1))
        print(url)
        response = getConnect(url)
        if response is None:
            continue
        content = response.text
        html = etree.HTML(content)
        tr = html.xpath('//tr')
        for j in range(1, len(tr) + 1):
            ip = html.xpath('//tr[' + str(j) + ']/td[1]/text()')
            port = html.xpath('//tr[' + str(j) + ']/td[2]/text()')
            ipType = html.xpath('//tr[' + str(j) + ']/td[4]/text()')
            # the first row of the 66ip table is a header row, skip it
            if len(ip) > 1:
                continue
            # default to HTTP when the type column is missing or malformed
            if len(ipType) == 0 or not ipType[0].isalpha():
                ipType = 'HTTP'
            else:
                ipType = ipType[0]
            if len(ip) != 0 and len(port) != 0:
                checkIp(wash(ip[0]) + ':' + wash(port[0]), wash(ipType))
```
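
The wash helper is called throughout but never defined in the excerpt; presumably it just cleans scraped cell text. A minimal assumed implementation:

```python
def wash(text):
    # assumed helper (not shown in the original): strip the stray
    # whitespace and newlines that come with scraped table cells
    return text.strip()
```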
Verify each IP by sending a request through it as a proxy to http://icanhazip.com/, a service that simply echoes the caller's public IP; if the proxied request succeeds, the IP is considered valid:
```python
def checkIp(ip, ipType):
    url = 'http://icanhazip.com/'
    try:
        headers = {'User-Agent': str(UserAgent().random)}
        # requests expects a scheme-to-proxy-URL mapping,
        # e.g. {'http': 'http://1.2.3.4:8080'}
        proxy = {ipType.lower(): ipType.lower() + '://' + ip}
        response = requests.get(url, headers=headers, proxies=proxy, timeout=5)
        if response.status_code == 200:
            write(ip, ipType)
    except Exception:
        pass  # unreachable or slow proxies are simply discarded
```
Write each valid IP to a file for later use by crawlers:
```python
def write(ip, ipType):
    with open('ip.txt', 'a', encoding='utf-8') as f:
        # one "ip:port type" entry per line
        f.write(wash(ip) + ' ' + wash(ipType) + '\n')
```
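
Once getDate() has populated ip.txt, a crawler can rotate through the pool and fall back to a different proxy whenever one is banned. A minimal sketch, assuming the "ip:port type" line format written above (httpbin.org/ip is just an example target):

```python
import random

import requests


def load_pool(path='ip.txt'):
    # parse each "ip:port type" line back into a requests-style proxies dict
    pool = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            addr, ipType = line.split()
            pool.append({ipType.lower(): ipType.lower() + '://' + addr})
    return pool


def fetch_with_pool(url, pool, attempts=5):
    # pick a random proxy per attempt; a banned or dead proxy just
    # triggers another attempt with a different one
    for _ in range(attempts):
        proxy = random.choice(pool)
        try:
            response = requests.get(url, proxies=proxy, timeout=5)
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            continue
    return None


resp = fetch_with_pool('http://httpbin.org/ip', load_pool())
if resp is not None:
    print(resp.text)  # the proxy's IP, not yours
```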
Follow the WeChat public account [rookie a] and reply "IP pool" to get the full source code.