Python Crawler Notes: Use a Proxy to Keep Your Own IP from Being Banned

This is the 13th day of my participation in the November Gwen Challenge. Check out the event details: The last Gwen Challenge 2021

Using proxies is a common way to get around anti-crawler mechanisms. Many websites track information such as how many times an external IP address visits the server within a given period. If the number of visits or the access pattern violates the site's security policy, the server blocks that IP address. Crawler authors can therefore route requests through proxy servers, which hide their real IP address and keep it from being banned.
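
To make the detection described above concrete, here is a minimal sketch of a per-IP sliding-window counter of the kind a server might run. The class name `VisitLimiter`, the limit of 3 visits, the 10-second window, and the sample IP addresses are all assumptions for illustration; real sites apply their own, usually undisclosed, policies.

```python
from collections import defaultdict, deque


class VisitLimiter:
    """Hypothetical server-side check: at most `limit` visits per IP per `window` seconds."""

    def __init__(self, limit=3, window=10.0):
        self.limit = limit
        self.window = window
        self.visits = defaultdict(deque)  # ip -> timestamps of recent visits

    def allow(self, ip, now):
        recent = self.visits[ip]
        # Drop timestamps that have fallen out of the window
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # this IP is over the limit and gets blocked
        recent.append(now)
        return True


# Four quick visits from a single IP: the fourth one is refused
limiter = VisitLimiter(limit=3, window=10.0)
single_ip = [limiter.allow('203.0.113.5', t) for t in (0.0, 1.0, 2.0, 3.0)]

# The same four visits spread over two IPs (as if sent through a proxy) all pass
limiter2 = VisitLimiter(limit=3, window=10.0)
two_ips = [limiter2.allow(ip, t) for ip, t in (
    ('203.0.113.5', 0.0), ('203.0.113.6', 1.0),
    ('203.0.113.5', 2.0), ('203.0.113.6', 3.0),
)]

print(single_ip)
print(two_ips)
```

Spreading requests over several addresses keeps each address under the threshold, which is why a proxy list helps against this kind of check.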

Setting up a proxy server in urllib with ProxyHandler

There are generally two kinds of proxies on the network: free proxies and paid proxies. Free proxies can be found by searching Baidu or Google, or on sites such as Xici free proxy IP, Kuaidaili (fast proxy) free proxies, Proxy360, and the whole-network proxy IP site…

Free public proxies are usually shared by many people, and they tend to be short-lived, slow, not very anonymous, and unreliable in their HTTP/HTTPS support (as the saying goes, you get what you pay for).
Professional crawler engineers and crawling companies use high-quality private proxies, which are usually bought from a specialized proxy provider and authenticated with a username and password.

You can maintain a list of proxies and switch between them at random over a period of time, so that the server does not block your access.
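
One way to read "switch at random over a period of time" is to keep one proxy for a fixed interval and then pick another. The sketch below assumes that reading; `TimedProxyRotator`, its `interval` parameter, and the placeholder addresses are hypothetical names and values, not part of urllib.

```python
import random
import time


class TimedProxyRotator:
    """Hypothetical helper: keep one proxy for `interval` seconds, then pick another at random."""

    def __init__(self, proxies, interval=60.0, clock=time.monotonic, rng=random):
        self.proxies = list(proxies)
        self.interval = interval
        self.clock = clock      # injectable so the rotation can be tested without waiting
        self.rng = rng
        self.current = None
        self.chosen_at = None

    def get(self):
        now = self.clock()
        expired = self.chosen_at is None or now - self.chosen_at >= self.interval
        if expired:
            self.current = self.rng.choice(self.proxies)
            self.chosen_at = now
        return self.current


# Placeholder addresses from the documentation range, not real proxies
rotator = TimedProxyRotator(
    [{'http': '203.0.113.10:8080'}, {'http': '203.0.113.11:8080'}],
    interval=60.0,
)
proxy = rotator.get()   # a dict that can be passed to urllib.request.ProxyHandler
print(proxy)
```

Calling `rotator.get()` before every request returns the same proxy until the interval has elapsed, so the choice changes over time rather than on every single request.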

# Use proxy
# Demo 1: Use ProxyHandler to access the target site through a free proxy
import urllib.request

request = urllib.request.Request('http://www.baidu.com')

# Example free proxy that I found; it may expire at any time
proxy_support = urllib.request.ProxyHandler({'http':'210.1.58.212:8080'})

opener = urllib.request.build_opener(proxy_support)
response = opener.open(request)
print(response.read().decode('utf-8'))
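
The dictionary given to `ProxyHandler` maps a URL scheme to a proxy, so the `'http'` entry in Demo 1 only applies to `http://` URLs. As a hedged sketch, the snippet below registers the same proxy for both schemes and installs the opener globally with `install_opener`. The address is a placeholder from the documentation range, and whether a given free proxy really supports HTTPS has to be checked case by case.

```python
import urllib.request

# Placeholder address (documentation range); substitute a proxy you have verified
proxy_address = '203.0.113.10:8080'

# One entry per URL scheme: without an 'https' key, https:// requests skip this proxy
proxy_support = urllib.request.ProxyHandler({
    'http': proxy_address,
    'https': proxy_address,
})
opener = urllib.request.build_opener(proxy_support)

# Optional: make urllib.request.urlopen() use this opener from now on
urllib.request.install_opener(opener)

# urllib.request.urlopen('http://www.baidu.com', timeout=5) would now go through the proxy
print(proxy_support.proxies)
```

Installing the opener is convenient in a small script; in larger programs, calling `opener.open()` explicitly keeps the proxy choice local to the code that needs it.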

# Use proxy
# Demo 2: Use an authenticated proxy

import urllib.request

# The username, password, and proxy below are placeholders; replace them with your own before use.
# What matters here is the workflow and the method. Thanks for understanding.
username = 'leo'
password = 'leo'
proxydict = {'http':'106.185.26.199:25'}

# Prepend the credentials to get 'username:password@host:port'
proxydict['http'] = username + ':' + password + '@' + proxydict['http']
httpWithProxyHandler = urllib.request.ProxyHandler(proxydict)

opener = urllib.request.build_opener(httpWithProxyHandler)
request = urllib.request.Request('http://www.baidu.com')

resp = opener.open(request)
print(resp.read().decode('utf-8'))
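
Plain string concatenation as in Demo 2 works for simple credentials, but a password containing characters such as `@` or `:` would confuse the `user:password@host:port` format. A minimal sketch, assuming such credentials: percent-encode both parts with `urllib.parse.quote` (as far as I know, urllib decodes them again when it builds the Proxy-Authorization header). The helper name `proxy_with_auth` and the sample password are made up for illustration.

```python
from urllib.parse import quote
import urllib.request


def proxy_with_auth(username, password, hostport):
    """Hypothetical helper: return 'user:password@host:port' with credentials percent-encoded."""
    # safe='' also encodes '/', ':' and '@', which would otherwise confuse the URL parser
    return '%s:%s@%s' % (quote(username, safe=''), quote(password, safe=''), hostport)


proxy_url = proxy_with_auth('leo', 'p@ss:word', '106.185.26.199:25')
print(proxy_url)

# The result can be used exactly like the concatenated string in Demo 2
handler = urllib.request.ProxyHandler({'http': proxy_url})
```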

# Use proxy
# Demo 3: Improve the process above using the handlers urllib recommends
import urllib.request

username = 'leo'
password = 'leo'
proxyserver = '106.185.26.199:25'

# 1. Build a password manager object to hold the usernames and passwords to be used
passwordMgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# 2. Add the account information. The first parameter is the realm associated with the remote server;
#    it is usually passed as None (the actual realm can be seen in the response headers).
#    The next three parameters are the proxy server, the username, and the password.
passwordMgr.add_password(None, proxyserver, username, password)

# 3. Build a proxy basic-authentication handler from the password manager
proxyauth_handler = urllib.request.ProxyBasicAuthHandler(passwordMgr)

# 4. Build the opener with build_opener(). A ProxyHandler is added as well,
#    because ProxyBasicAuthHandler only answers the proxy's authentication challenge;
#    it does not route requests through the proxy by itself.
proxy_handler = urllib.request.ProxyHandler({'http': proxyserver})
opener = urllib.request.build_opener(proxy_handler, proxyauth_handler)

# 5. Construct the request
request = urllib.request.Request('http://www.baidu.com')

# 6. Send the request using the custom opener
response = opener.open(request)

# 7. Print the response content
print(response.read().decode('utf-8'))
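
The role of the `None` realm in Demo 3 can be checked without any network access. The sketch below queries the password manager directly through `find_user_password`; the realm string and the second address are arbitrary values chosen for illustration.

```python
import urllib.request

# Same placeholder credentials and proxy address as in Demo 3
proxyserver = '106.185.26.199:25'
passwordMgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passwordMgr.add_password(None, proxyserver, 'leo', 'leo')

# Whatever realm the proxy announces, the entry stored under None acts as the fallback
known = passwordMgr.find_user_password('Some Proxy Realm', proxyserver)
print(known)

# A different proxy address has no stored credentials
unknown = passwordMgr.find_user_password('Some Proxy Realm', '203.0.113.10:8080')
print(unknown)
```

This is why the demo does not need to know the realm in advance: the default-realm manager matches on the proxy address alone.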

# Use proxy
# Demo 4: Use a proxy list taken from http://www.goubanjia.com/ or www.kuaidaili.com/dps
# You can use Kuaidaili (fast proxy) to test online whether a proxy works

import random
import urllib.request

# Free proxies found on the Kuaidaili site on October 18.
proxylist = [
    {'http':'210.1.58.212:8080'},
    {'http':'106.185.26.199:25'},
    {'http':'124.206.192.210:38621'},
    {'http':'222.249.224.61:48114'},
    {'http':'115.218.217.184:9000'},
    {'http':'183.129.244.17:10010'},
    {'http':'120.26.199.103:8118'},
]

def randomTryProxy(retry):
    ''' Function : choose a proxy from the proxy list at random. retry : number of retries left '''
    # Strategy 1: select at random
    try:
        proxy = random.choice(proxylist)
        
        print('Try %s : %s' %(retry, proxy))
        
        httpProxyHandler = urllib.request.ProxyHandler(proxy)
        opener = urllib.request.build_opener(httpProxyHandler)
        request = urllib.request.Request('http://www.baidu.com')
        response = opener.open(request,timeout = 5)
        
        print('Worked ! ')
        
    except Exception:
        print('Connect error:Please retry')
        if retry > 0:
            randomTryProxy(retry-1)
        
def inorderTryProxy(proxy):
    ''' Function : try the given proxy from the proxy list. proxy : the proxy dict to try '''
    # Strategy 2: try each proxy in turn
    try:

        print('Try %s ' %(proxy))
        
        httpProxyHandler = urllib.request.ProxyHandler(proxy)
        opener = urllib.request.build_opener(httpProxyHandler)
        request = urllib.request.Request('http://www.baidu.com')
        response = opener.open(request,timeout = 5)
        
        print('Worked ! ')
        
    except Exception:
        print('Connect error:Please retry')
        
        
if __name__ == '__main__':
    # Random selection suits the case where most proxies in the list are available
    randomTryProxy(5)
    print('-'*20)
    # Trying each proxy in turn suits the case where most proxies in the list are unavailable
    for p in proxylist:
        inorderTryProxy(p)
        

Try 5 : {'http': '115.218.217.184:9000'}
Connect error:Please retry
Try 4 : {'http': '115.218.217.184:9000'}
Connect error:Please retry
Try 3 : {'http': '222.249.224.61:48114'}
Connect error:Please retry
Try 2 : {'http': '210.1.58.212:8080'}
Worked !
--------------------
Try {'http': '210.1.58.212:8080'}
Worked !
Try {'http': '106.185.26.199:25'}
Connect error:Please retry
Try {'http': '124.206.192.210:38621'}
Connect error:Please retry
Try {'http': '222.249.224.61:48114'}
Connect error:Please retry
Try {'http': '115.218.217.184:9000'}
Connect error:Please retry
Try {'http': '183.129.244.17:10010'}
Connect error:Please retry
Try {'http': '120.26.199.103:8118'}
Connect error:Please retry
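
In the output above, the random strategy tried 115.218.217.184:9000 twice in a row, because `random.choice` may return the same dead proxy again. A possible variation, sketched under the assumption that each proxy should be tried at most once, shuffles the list and walks through it. The names `check_proxy` and `first_working_proxy` are hypothetical helpers, not part of urllib.

```python
import random
import urllib.request


def check_proxy(proxy, url='http://www.baidu.com', timeout=5):
    """Return True if `url` can be fetched through `proxy` within `timeout` seconds."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    try:
        opener.open(url, timeout=timeout)
        return True
    except Exception:
        return False


def first_working_proxy(proxies, check=check_proxy, rng=random):
    """Hypothetical helper: try each proxy once, in random order; return the first that works."""
    candidates = list(proxies)
    rng.shuffle(candidates)          # random order, but no proxy is tried twice
    for proxy in candidates:
        if check(proxy):
            return proxy
    return None                      # every proxy failed


# Usage with the list from Demo 4 (needs network access):
# working = first_working_proxy(proxylist)
```

Returning the working proxy, rather than only printing a message, lets the caller reuse it for the actual crawl. The loop also replaces the recursion in `randomTryProxy`, so the number of attempts is bounded by the length of the list.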
