Why use a proxy IP
Many websites place restrictions on crawlers grabbing their data; anyone who has written a crawler knows this all too well, the most common being your IP getting thrown into the little black house (banned). So, to play it safe, you can't use your own real IP to crawl someone else's site; that's when you need proxy IPs to do the job...
Why use a highly anonymous proxy
We can compare the differences between the various types of proxies. By degree of anonymity, proxies can be classified as follows (a quick way to probe the difference yourself is sketched after the list):
- Highly anonymous proxy: forwards the packet unchanged, so to the server it looks like an ordinary client, and the IP the server records is the proxy server's.
- Ordinary anonymous proxy: makes some changes to the packet, so the server may detect that it is talking to a proxy, and there is some chance the client's real IP can be traced.
- Transparent proxy: not only changes the packet, but also tells the server the client's real IP address.
- Spy proxy: a proxy server set up by an organization or individual to record the data users transmit through it, for research or monitoring purposes.
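To see these differences in practice, here is a minimal sketch (my own illustration, not from the original article) that sends a request through a proxy to httpbin.org/headers, which echoes back the headers the server received; a transparent or ordinary anonymous proxy typically reveals itself through headers such as X-Forwarded-For or Via, while a highly anonymous proxy adds neither:

import requests

# Hypothetical proxy address; replace with a real proxy you want to probe
proxy = 'http://1.2.3.4:8080'
proxies = {'http': proxy, 'https': proxy}

# httpbin.org/headers echoes the request headers the server received,
# so any X-Forwarded-For or Via header injected by the proxy shows up here
response = requests.get('http://httpbin.org/headers', proxies=proxies, timeout=5)
print(response.json())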
Runtime environment
Use pip install xxx to install whatever third-party dependencies the script needs.
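For this script, that means the requests and BeautifulSoup libraries (the package names below are the usual PyPI ones):

pip install requests
pip install beautifulsoup4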
Crawling the Xici proxy IPs
Here I'm only crawling 50 pages of Xici's high-anonymity proxy data; of course, crawling 100 pages, or all of them, is also fine; it's up to you.
def run(self):
    """Execution entry point"""
    page_list = range(1, 51)
    with open("ip.json", "w") as write_file:
        for page in page_list:
            # Crawl the data page by page
            print('Start crawling page ' + str(page) + ' of IP data')
            ip_url = self.base_url + str(page)
            html = self.get_url_html(ip_url)
            soup = BeautifulSoup(html, 'html.parser')
            # The IP list (note: only table rows with class 'odd' are selected)
            ip_list = soup.select('#ip_list .odd')
            for ip_tr in ip_list:
                # Data for a single IP
                td_list = ip_tr.select('td')
                ip_address = td_list[1].get_text()
                ip_port = td_list[2].get_text()
                ip_type = td_list[5].get_text()
                info = {'ip': ip_address, 'port': ip_port, 'type': ip_type}
                # Verify that the IP works before storing it
                check_res = self.check_ip(info)
                if check_res:
                    print('Valid IP:', info)
                    self.json_data.append(info)
                else:
                    print('Invalid IP:', info)
        # Write all valid IPs to the file once crawling is done
        json.dump(self.json_data, write_file)
Check whether the proxy IP address is valid
A crawled proxy IP may well be unusable. To make the list convenient to use and avoid a pile of exceptions later, we need to check whether each IP actually works as a valid proxy before keeping it. Here are three websites that make it easy to test whether a proxy IP is usable (a small sketch of the idea follows the list):
- icanhazip.com/ This site returns the proxy's visible IP address directly, as plain text
- www.ip.cn/ Queries the proxy's IP address and location
- ip.chinaz.com/ The Chinaz webmaster tools can also look up an IP address and its location
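As a small sketch of the idea (the helper name is mine, not from the original code): using icanhazip.com from the list above, you can compare the IP the site reports with the proxy's IP, which confirms the request really went out through the proxy. Note that some proxies exit through a different IP than the one advertised, so this check is stricter than a plain status-code test:

import requests

def check_proxy_exit(ip, port):
    """Return True if icanhazip.com sees the proxy's IP instead of ours."""
    proxy = 'http://{}:{}'.format(ip, port)
    proxies = {'http': proxy, 'https': proxy}
    try:
        # icanhazip.com returns the visible client IP as plain text
        response = requests.get('http://icanhazip.com/', proxies=proxies, timeout=3)
        return response.text.strip() == ip
    except Exception:
        return False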
def check_ip(self, ip_info):
    """Test whether the proxy IP is valid"""
    ip_url = ip_info['ip'] + ':' + str(ip_info['port'])
    proxies = {'http': 'http://' + ip_url, 'https': 'https://' + ip_url}
    res = False
    try:
        request = requests.get(url=self.check_url, headers=self.header, proxies=proxies, timeout=3)
        if request.status_code == 200:
            res = True
    except Exception as error_info:
        res = False
    return res
Storing the proxy IPs
I'm not doing anything fancy here: I simply store all the valid proxy IP data in JSON format in a file. Of course, you could also store them in MongoDB or a MySQL database; either way, the point is to make it convenient to grab a random IP later.
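For example, a minimal sketch of using the stored file (assuming the ip.json written by run() above; the target URL is only a placeholder) to send a request through a randomly chosen proxy:

import json
import random
import requests

# Load the proxy list written by run()
with open('ip.json') as read_file:
    ip_list = json.load(read_file)

# Pick one proxy at random and build the proxies dict that requests expects
info = random.choice(ip_list)
proxy = 'http://{}:{}'.format(info['ip'], info['port'])
proxies = {'http': proxy, 'https': proxy}

# Placeholder target URL; replace with the site you actually want to crawl
response = requests.get('https://example.com/', proxies=proxies, timeout=5)
print(response.status_code)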
The complete code
I have uploaded the code to GitHub (see the GitHub source address). But, as a warm-hearted brick-mover, to save those who'd rather be lazy a trip to the "dating site" (as we jokingly call GitHub), I'm also posting the source code here. If you hit any problems, it's best to come find me on GitHub; now, please take the code...
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Author: Gxcuizy
Date: 2020-06-19
"""

import requests
from bs4 import BeautifulSoup
import json


class GetIpData(object):
    """Crawl 50 pages of domestic high-anonymity proxy IPs"""

    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'}
    base_url = 'https://www.xicidaili.com/nn/'
    check_url = 'https://www.ip.cn/'
    json_data = []

    def get_url_html(self, url):
        """Request the page HTML"""
        request = requests.get(url=url, headers=self.header, timeout=5)
        html = False
        if request.status_code == 200:
            html = request.content
        return html

    def check_ip(self, ip_info):
        """Test whether the proxy IP is valid"""
        ip_url = ip_info['ip'] + ':' + str(ip_info['port'])
        proxies = {'http': 'http://' + ip_url, 'https': 'https://' + ip_url}
        res = False
        try:
            request = requests.get(url=self.check_url, headers=self.header, proxies=proxies, timeout=3)
            if request.status_code == 200:
                res = True
        except Exception as error_info:
            res = False
        return res

    def run(self):
        """Execution entry point"""
        page_list = range(1, 51)
        with open("ip.json", "w") as write_file:
            for page in page_list:
                # Crawl the data page by page
                print('Start crawling page ' + str(page) + ' of IP data')
                ip_url = self.base_url + str(page)
                html = self.get_url_html(ip_url)
                soup = BeautifulSoup(html, 'html.parser')
                # The IP list (note: only table rows with class 'odd' are selected)
                ip_list = soup.select('#ip_list .odd')
                for ip_tr in ip_list:
                    # Data for a single IP
                    td_list = ip_tr.select('td')
                    ip_address = td_list[1].get_text()
                    ip_port = td_list[2].get_text()
                    ip_type = td_list[5].get_text()
                    info = {'ip': ip_address, 'port': ip_port, 'type': ip_type}
                    # Verify that the IP works before storing it
                    check_res = self.check_ip(info)
                    if check_res:
                        print('Valid IP:', info)
                        self.json_data.append(info)
                    else:
                        print('Invalid IP:', info)
            # Write all valid IPs to the file once crawling is done
            json.dump(self.json_data, write_file)


# Program main entry
if __name__ == '__main__':
    # Instantiate the crawler
    ip = GetIpData()
    # Execute the script
    ip.run()
Finally
As always, if you have any questions, leave a message or reach me through whatever channel works for you. Let's learn from each other, share, and grow together...