Crawling free proxy IPs with Python and testing whether they work

To avoid having their IP banned while crawling, many people look for free proxy IPs on public listing sites. Since not every listed address actually works, checking them one by one by hand is far too slow. I ran into the same problem, so I wrote a script that crawls the free proxy IPs straight from the site, tests each one, and finally keeps only the valid addresses.

Here I crawl the Fanqie proxy site (www.fanqieip.net/) for free proxy IPs.

1. Preparation

Import the required packages and set the request header.

import requests
from bs4 import BeautifulSoup

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
}

2. Extract the web page source code

This function fetches the page and returns the HTML source of the whole page.

def getHtml(url):
    try:
        response = requests.get(url, headers=header)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        return response.text
    except:
        return "Error extracting webpage source code"
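A quick usage sketch, assuming the listing URL pattern used in the main function below (page number 1 is just an illustration):

# Hypothetical quick check: fetch the first listing page and peek at the HTML.
page_url = "https://www.fanqieip.net/index_1.html"  # assumed URL pattern, see the main function
html = getHtml(page_url)
print(len(html))   # rough sanity check that something came back
print(html[:200])  # first 200 characters of the page source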

3. Parse the HTML and extract the IPs

This function takes the HTML and a list in which to store the extracted IP addresses.

# parse the web page and extract the IPs
def getIp(html, ip_list):
    try:
        soup = BeautifulSoup(html, "html.parser")
        tr = soup.find("tbody").find_all_next("tr")
        for ip in tr:
            # extract the IP
            td = ip.find_next("td").string
            td = str(td).replace(" ", "").replace("\n", "").replace("\t", "")
            # extract the port number
            dk = ip.find_all_next("td")[1].string
            dk = str(dk).replace(" ", "").replace("\n", "").replace("\t", "")
            # join the IP and the port number
            ip = td + ":" + dk
            ip_list.append(ip)  # add the address to the given list
    except:
        print("Failed to obtain IP addresses")
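To see what the parser produces, here is a small self-contained sketch; the table markup is a made-up stand-in for the real page, whose column layout may differ:

# Hypothetical miniature table in the same tbody/tr/td layout the parser expects.
sample_html = """
<table><tbody>
  <tr><td> 1.2.3.4 </td><td> 8080 </td><td>HTTPS</td></tr>
  <tr><td> 5.6.7.8 </td><td> 3128 </td><td>HTTPS</td></tr>
</tbody></table>
"""
ips = []
getIp(sample_html, ips)
print(ips)  # expected to print something like ['1.2.3.4:8080', '5.6.7.8:3128']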

4. Test whether the IP address is available

The idea here is to request the Baidu homepage with requests while passing in the proxy IP. If the request comes back with status code 200, the proxy is considered valid; otherwise it is treated as invalid.

# test which proxy IPs are usable
def ip_text(ip_list, valid_IP):
    try:
        url = "https://www.baidu.com/"
        for ip in ip_list:
            try:
                rep = requests.get(url, proxies={'https': ip}, headers=header, timeout=0.5)
                if rep.status_code == 200:
                    valid_IP.append(ip)
                    print("This proxy IP is valid: " + ip)
                else:
                    print("This proxy IP is invalid: " + ip)
            except:
                print("This proxy IP is invalid: " + ip)
    except:
        print("IP test failed")
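One detail worth noting: requests picks the proxy entry whose key matches the scheme of the URL being requested, so with an https:// test URL only the 'https' key above is consulted. A small optional sketch that covers both schemes and spells out the proxy scheme explicitly (the build_proxies helper is my own name, not part of the original script):

# Hypothetical variant: build a proxies dict that works for both http and https test URLs.
def build_proxies(ip):
    # 'ip' is a "host:port" string scraped above; prefix it with a scheme so nothing is guessed.
    return {
        "http": "http://" + ip,
        "https": "http://" + ip,
    }

# usage: rep = requests.get(url, proxies=build_proxies(ip), headers=header, timeout=0.5)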

5. Main function

The main block ties everything together: for each page number it builds the page URL, crawls and tests the listed IPs, and prints all valid addresses at the end.

if __name__ == '__main__':
    valid_IP = []  # valid IP addresses
    for i in range(1, 90):
        ip_list = []  # IPs scraped from the current page
        url = "https://www.fanqieip.net/index_" + str(i) + ".html"
        print(url)
        html = getHtml(url)
        getIp(html, ip_list)
        ip_text(ip_list, valid_IP)

    print("=" * 30)
    print("Test done, valid IPs:")
    print("-" * 30)
    for a in valid_IP:
        print(a)
    print("=" * 30)
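Once the run finishes, any surviving address can be plugged straight back into requests; a minimal sketch reusing the first valid proxy (the target URL and timeout are just examples):

# Hypothetical follow-up: reuse the first valid proxy for a real request.
if valid_IP:
    proxy_ip = valid_IP[0]
    resp = requests.get("https://www.baidu.com/",
                        proxies={"https": proxy_ip},
                        headers=header,
                        timeout=3)
    print(proxy_ip, resp.status_code)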

That completes the overall structure of the script; the full code is listed below.

The complete code

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
}


# extract the source code of the web page
def getHtml(url):
    try:
        response = requests.get(url, headers=header)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        return response.text
    except:
        return "Error extracting webpage source code"


# parse the web page and extract the IPs
def getIp(html, ip_list):
    try:
        soup = BeautifulSoup(html, "html.parser")
        tr = soup.find("tbody").find_all_next("tr")
        for ip in tr:
            # extract the IP
            td = ip.find_next("td").string
            td = str(td).replace(" ", "").replace("\n", "").replace("\t", "")
            # extract the port number
            dk = ip.find_all_next("td")[1].string
            dk = str(dk).replace(" ", "").replace("\n", "").replace("\t", "")
            # join the IP and the port number
            ip = td + ":" + dk
            ip_list.append(ip)  # add the address to the given list
    except:
        print("Failed to obtain IP addresses")


# test which proxy IPs are usable
def ip_text(ip_list, valid_IP):
    try:
        url = "https://www.baidu.com/"
        for ip in ip_list:
            try:
                rep = requests.get(url, proxies={'https': ip}, headers=header, timeout=0.5)
                if rep.status_code == 200:
                    valid_IP.append(ip)
                    print("This proxy IP is valid: " + ip)
                else:
                    print("This proxy IP is invalid: " + ip)
            except:
                print("This proxy IP is invalid: " + ip)
    except:
        print("IP test failed")


if __name__ == '__main__':
    valid_IP = []  # valid IP addresses
    for i in range(1, 90):
        ip_list = []  # IPs scraped from the current page
        url = "https://www.fanqieip.net/index_" + str(i) + ".html"
        print(url)
        html = getHtml(url)
        getIp(html, ip_list)
        ip_text(ip_list, valid_IP)

    print("=" * 30)
    print("Test done, valid IPs:")
    print("-" * 30)
    for a in valid_IP:
        print(a)
    print("=" * 30)