If, while running a crawler, you have encountered messages like "your request is too frequent, please try again later", or your code is completely correct but suddenly can no longer access the site, then congratulations: your crawler has been caught. In mild cases you get a friendly warning; in serious ones your IP may be banned. That is why proxy IPs are so important. Today we will talk about proxy IPs and solve the problem of crawler blocking.
There are many proxy IP sources available online, both free and paid. Most company crawlers buy the professional versions; for individuals, the free ones basically meet our needs. However, free proxies have a drawback: they are short-lived and unstable, so we need to run a simple validation over the IPs we collect.
1. Target collection
This article targets the Xici proxy site (xicidaili.com). I used this site long ago, back when it still offered a free API; since the API is no longer available, we will write a simple crawler to collect the IPs ourselves.
Open the Xici proxy site: there are several categories, and we decisively choose the high-anonymity proxies (the /nn/ pages).
Right-click and Inspect in Chrome, and it is not hard to find that every IP address sits in a <td> tag, which makes things much easier for us. The initial idea is to grab all the IPs, then check their availability and remove the unusable ones.
- Defining matching rules
import re
ip_compile = re.compile(r'<td>(\d+\.\d+\.\d+\.\d+)</td>') # matching IP
port_compile = re.compile(r'<td>(\d+)</td>') # match port
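As a quick sanity check, here is a minimal sketch of how these patterns pull IPs and ports out of the page source (the HTML fragment is made up for illustration):
import re
ip_compile = re.compile(r'<td>(\d+\.\d+\.\d+\.\d+)</td>')
port_compile = re.compile(r'<td>(\d+)</td>')
sample = "<td>110.52.235.114</td><td>9999</td>"  # made-up fragment
print(re.findall(ip_compile, sample))    # ['110.52.235.114']
print(re.findall(port_compile, sample))  # ['9999'] (the dots keep it from matching the IP)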
2. Verification
Here I use the Taobao IP address database to verify availability.
2.1 About Taobao IP address database
The service provides: 1. Fast lookup of the geographic information of a user-supplied IP address, including country, province, city, and carrier. 2. The ability for users to update the service content according to their location and IP address. Its advantages: 1. Comprehensive information at the country, province, city, county, and carrier level, with wide coverage and a standard format. 2. Complete statistical analysis reports, with province-level accuracy above 99.8% and city-level accuracy above 96.8%, ensuring data quality.
2.2 Interface description
- Request interface (GET): http://ip.taobao.com/service/getIpInfo2.php?ip=[ip address], e.g.:
http://ip.taobao.com/service/getIpInfo2.php?ip=111.177.181.44
- Response information: country, province (autonomous region or municipality), city (county), and carrier (JSON format)
- Return data format:
{"code": 0, "data": {"ip": "210.75.225.254", "country": "\u4e2d\u56fd", "area": "\u534e\u5317", "region": "\u5317\u4eac\u5e02", "city": "\u5317\u4eac\u5e02", "county": "", "isp": "\u7535\u4fe1", "country_id": "86", "area_id": "100000", "region_id": "110000", "city_id": "110000", "county_id": "1", "isp_id": "100017"}}
The code field is 0 on success and 1 on failure.
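Note that an HTTP request can succeed even when the lookup itself fails, so a stricter check parses the JSON and inspects the code field. A minimal sketch (my own addition, not from the original article):
import requests

def taobao_lookup(ip):
    """Query the Taobao IP database; return the data dict, or None on failure."""
    api = "http://ip.taobao.com/service/getIpInfo2.php?ip=" + ip
    try:
        resp = requests.get(api, timeout=2)
        payload = resp.json()
    except Exception:
        return None
    # code == 0 means success; anything else means the lookup failed
    return payload["data"] if payload.get("code") == 0 else None

print(taobao_lookup("111.177.181.44"))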
Note: to keep the service running normally, each user's access frequency must stay below 10 qps. Let's test it in the browser:
- Enter the address http://ip.taobao.com/service/getIpInfo2.php?ip=111.177.181.44
- Enter another address http://ip.taobao.com/service/getIpInfo2.php?ip=112.85.168.98
Now let's test it from code:
import requests

ip = "111.177.181.44"  # the IP to check
check_api = "http://ip.taobao.com/service/getIpInfo2.php?ip="
api = check_api + ip
api_headers = {"User-Agent": "Mozilla/5.0"}
try:
    # If the request completes without raising, treat the IP as available
    response = requests.get(url=api, headers=api_headers, timeout=2)
    print("IP: %s available" % ip)
except Exception as e:
    print("This IP %s is invalid: %s" % (ip, e))
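Since the API asks for fewer than 10 qps per user, it is worth throttling batch checks. A minimal sketch, assuming single-threaded use (the sleep interval is my own choice, not from the original article):
import time
import requests

check_api = "http://ip.taobao.com/service/getIpInfo2.php?ip="

def check_batch(ips):
    """Check a batch of IPs while staying safely under the 10 qps limit."""
    usable = []
    for ip in ips:
        try:
            requests.get(check_api + ip, timeout=2)
            usable.append(ip)
        except Exception as e:
            print("This IP %s is invalid: %s" % (ip, e))
        time.sleep(0.15)  # at most ~6-7 requests per second
    return usable

print(check_batch(["111.177.181.44", "112.85.168.98"]))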
3. Code
A note on the exception handling in the code below: for a quick handwritten demo you could leave it out, but to make debugging easier for others, it is recommended to add exception handling wherever exceptions may occur.
import requests
import re
import random
from bs4 import BeautifulSoup

ua_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
]

def ip_parse_xici(page):
    """:param page: number of pages to collect :return: None"""
    ip_list = []
    for pg in range(1, int(page) + 1):  # +1 so that entering "1" collects one page
        url = 'http://www.xicidaili.com/nn/' + str(pg)
        user_agent = random.choice(ua_list)
        my_headers = {
            'Accept': 'text/html, application/xhtml+xml, application/xml;',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Referer': 'http://www.xicidaili.com/nn',
            'User-Agent': user_agent,
        }
        try:
            r = requests.get(url, headers=my_headers)
            soup = BeautifulSoup(r.text, 'html.parser')
        except requests.exceptions.ConnectionError:
            print('ConnectionError')
        else:
            data = soup.find_all('td')
            # Define the IP and port pattern rules
            ip_compile = re.compile(r'<td>(\d+\.\d+\.\d+\.\d+)</td>')  # match IP
            port_compile = re.compile(r'<td>(\d+)</td>')  # match port
            ips = re.findall(ip_compile, str(data))      # get all IPs
            ports = re.findall(port_compile, str(data))  # get all ports
            check_api = "http://ip.taobao.com/service/getIpInfo2.php?ip="
            ips_usable = []  # (ip, port) pairs that passed the check
            for ip, port in zip(ips, ports):
                api = check_api + ip
                api_headers = {'User-Agent': user_agent}
                try:
                    requests.get(url=api, headers=api_headers, timeout=2)
                    print("IP: %s available" % ip)
                    ips_usable.append((ip, port))
                except Exception as e:
                    print("This IP %s is invalid: %s" % (ip, e))
            ip_list += [':'.join(pair) for pair in ips_usable]  # join as "ip:port"
            print('Page {} IP collection completed'.format(pg))
    print(ip_list)

if __name__ == '__main__':
    xici_pg = input("Please enter the number of pages to be collected:")
    ip_parse_xici(page=xici_pg)
Run the code:
4. Add proxy IP for your crawler
It is suggested to store the collected IPs in a database so that each crawler can fetch them directly (a minimal storage sketch follows at the end of this section). By the way, here is how to add a proxy IP in your code:
import requests

url = 'http://www.baidu.com'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36",
}
# requests picks the proxy whose key matches the URL scheme
proxies = {
    "http": "http://111.177.181.44:9999",
    # "https": "https://111.177.181.44:9999",
}
res = requests.get(url=url, headers=headers, proxies=proxies)
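And, as suggested above, a minimal sketch of persisting the collected "ip:port" strings with sqlite3 so that other crawlers can reuse them (the database file and table names are my own, hypothetical choices):
import sqlite3

conn = sqlite3.connect("proxies.db")  # hypothetical file name
conn.execute("CREATE TABLE IF NOT EXISTS proxy (addr TEXT PRIMARY KEY)")
ip_list = ["111.177.181.44:9999"]  # e.g. the output of ip_parse_xici
conn.executemany("INSERT OR IGNORE INTO proxy (addr) VALUES (?)",
                 [(a,) for a in ip_list])
conn.commit()
# Pick a random stored proxy for the next request
addr = conn.execute("SELECT addr FROM proxy ORDER BY RANDOM() LIMIT 1").fetchone()[0]
proxies = {"http": "http://" + addr}
conn.close()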
Well, Mom won't have to worry about my crawler getting blocked anymore!