Life is short. I use Python
Previous articles in this series:
Learning Python crawlers (1) : The Beginning
Python crawler (2) : Preparation (1) basic class library installation
Learn Python crawler (3) : Pre-preparation (2) Linux basics
Learn Python crawler (4) : Pre-preparation (3) Docker basics
Learn Python crawler (5) : pre-preparation (4) database foundation
Python crawler (6) : Pre-preparation (5) crawler framework installation
Python crawler (7) : HTTP basics
Little White learning Python crawler (8) : Web basics
Learning Python crawlers (9) : Crawler basics
Python crawler (10) : Session and Cookies
Python crawler (11) : Urllib
Python crawler (12) : Urllib
Python crawler (13) : Urllib
Python crawler (14) : Urllib
Python crawler (15) : Urllib
Python crawler (16) : Urllib
Python crawler (17) : Basic usage for Requests
Python crawler (18) : Requests advanced operations
Python crawler (19) : Xpath base operations
Learn Python crawler (20) : Advanced Xpath
Python crawler (21) : Parsing library Beautiful Soup
Python crawler (22) : Beautiful Soup
Python crawler (23) : Getting started parsing pyQuery
Python Crawler (24) : 2019 douban movie Rankings
Python crawler (25) : Crawls stock information
You can’t even afford to buy a second-hand house in Shanghai
Selenium, an Automated Testing Framework, goes from Getting Started to Giving up
Selenium, an Automated Testing Framework, goes from Starter to Quit
Selenium obtains commodity information on a large e-commerce site
Introduction
When running a crawler, we often hit the following situation: at first everything runs as smooth as silk, but within the time it takes to drink a cup of tea, restrictions start to appear, such as 403 Forbidden or 429 Too Many Requests.
In this case, it is very likely that our IP has been restricted.
These problems are generally caused by security restrictions on the website or in the data center. Sometimes the detection is done on the server itself, sometimes at the gateway. Once the number of visits from a single IP within a unit of time exceeds a certain threshold, further requests are simply refused. We refer to this kind of situation collectively as IP banning.
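To make the threshold idea concrete, here is a minimal sketch of how a site might count visits per IP within a time window. It is only an illustration under stated assumptions: the class name `IPRateLimiter`, the threshold of 3 requests per 60 seconds, and the example addresses are all hypothetical, and real sites use their own (usually more elaborate) rules.

```python
from collections import defaultdict, deque


class IPRateLimiter:
    """Toy per-IP limiter: at most `threshold` requests per `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def status_for(self, ip, now):
        """Return the HTTP status this toy server would answer with."""
        recent = self.hits[ip]
        # Forget requests that have fallen out of the time window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            return 429  # Too Many Requests
        recent.append(now)
        return 200


# Hypothetical values: 3 requests allowed per 60 seconds.
limiter = IPRateLimiter(threshold=3, window=60)
# Five requests from the same (made-up) IP, one second apart.
statuses = [limiter.status_for("203.0.113.7", now=t) for t in range(5)]
print(statuses)
```

In this sketch the first three requests pass and the rest are refused until the window slides forward, while a request from a different IP is counted separately. That separate counting is exactly why switching the source IP helps.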
Do we simply accept this? Of course not!
Proxies exist to solve exactly this problem.
A proxy solves the problem by adding a proxy server in the middle of the request path to do the forwarding. Originally, client A sends its request directly to server C (A -> C); after adding proxy B, the path becomes A -> B -> C, so the request reaches C from B's address rather than A's.
Obtaining proxies
Before we get hands-on, let's look at how to obtain proxies.
First, search for "proxy" on Baidu. You will see that there are many proxy services; most of them charge, of course, but some also offer a few free proxies.
Naturally, free proxies have all kinds of pitfalls, such as frequent disconnections and high network latency.
But when something is free, you can't ask for too much.
If you want a stable, low-latency proxy service, a paid one is recommended; after all, a little money saves a lot of trouble.
There are too many proxy sites to list here, so let's just open one free proxy website:
We can see that the proxies seem to fall into two kinds: high-anonymity proxies and transparent proxies. What is the difference between the two?
In fact, besides high-anonymity and transparent proxies, there is an intermediate form called the anonymous proxy.
The difference between these proxies lies in the header parameters they send when forwarding requests.
Transparent proxy
The target server can tell that we are using a proxy and also knows our real IP. When a transparent proxy accesses the target server, the server sees the following values:
- REMOTE_ADDR = IP address of the proxy server
- HTTP_VIA = IP address of the proxy server
- HTTP_X_FORWARDED_FOR = our real IP
A transparent proxy still passes our real IP address on to the target server, so it cannot hide our identity.
Anonymous proxy
The target server can tell that we are using a proxy, but does not know our real IP. When an anonymous proxy accesses the target server, the server sees the following values:
- REMOTE_ADDR = IP address of the proxy server
- HTTP_VIA = IP address of the proxy server
- HTTP_X_FORWARDED_FOR = IP address of the proxy server
An anonymous proxy hides our real IP address, but still reveals to the target server that we are accessing it through a proxy.
High-anonymity proxy
The target server does not know that we are using a proxy, let alone our real IP. When a high-anonymity proxy accesses the target server, the server sees the following values:
- REMOTE_ADDR = IP address of the proxy server
- HTTP_VIA = not sent
- HTTP_X_FORWARDED_FOR = not sent
A high-anonymity proxy hides our real IP, and the target server cannot tell that a proxy is in use, so it is the most covert option.
As we can see, the anonymous proxy is an in-between state: it only does half the job, but that does not make it useless.
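The three cases above can be summarized in a small sketch. It assumes we are on the receiving side and have the variables listed above in a dictionary; the function name `classify_proxy` and the example addresses are hypothetical, and real servers may see other header combinations that this toy logic does not cover.

```python
def classify_proxy(server_vars, real_ip):
    """Classify a proxy from the variables the target server observed.

    server_vars: dict that may contain HTTP_VIA and HTTP_X_FORWARDED_FOR
    real_ip: our own public IP address (the one we want to hide)
    """
    via = server_vars.get("HTTP_VIA")
    forwarded = server_vars.get("HTTP_X_FORWARDED_FOR")

    # Neither proxy header is present: the server cannot tell a proxy is in use.
    if not via and not forwarded:
        return "high-anonymity"
    # Our real IP leaked through the forwarded-for value.
    if forwarded and real_ip in [ip.strip() for ip in forwarded.split(",")]:
        return "transparent"
    # Proxy headers are present, but our real IP is not among them.
    return "anonymous"


real_ip = "203.0.113.7"     # hypothetical client address
proxy_ip = "198.51.100.20"  # hypothetical proxy address

transparent = {"REMOTE_ADDR": proxy_ip, "HTTP_VIA": proxy_ip,
               "HTTP_X_FORWARDED_FOR": real_ip}
anonymous = {"REMOTE_ADDR": proxy_ip, "HTTP_VIA": proxy_ip,
             "HTTP_X_FORWARDED_FOR": proxy_ip}
high = {"REMOTE_ADDR": proxy_ip}

for observed in (transparent, anonymous, high):
    print(classify_proxy(observed, real_ip))
```

The check order matters in this sketch: the absence of both headers is tested first, and only then is the forwarded-for value inspected for our real address.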
Proxy settings
Now that we have covered the kinds of proxy services, let's look at how to set a proxy in the various HTTP request libraries:
urllib
We start by testing with urllib. The test link is https://httpbin.org/get, the same one we used before. Visiting this site returns some information about the request, and the origin field in the response is the IP address the request came from. For the proxy, I picked a free high-anonymity proxy more or less at random from the Internet. The code is as follows:
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# Map each protocol to the proxy that should carry requests for it
proxy_handler = ProxyHandler({
    'http': 'http://182.34.37.0:9999',
    'https': 'https://117.69.150.84:9999'
})
# Build an opener that sends requests through the proxies above
opener = build_opener(proxy_handler)

try:
    response = opener.open('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
The code is very simple, let’s look at the result:
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7"
  },
  "origin": "117.69.150.84, 117.69.150.84",
  "url": "https://httpbin.org/get"
}
As you can see, the target server already believes the request comes from the proxy: the origin field shows the IP address of our proxy server.
Note: here we use ProxyHandler to set the proxy. ProxyHandler takes a dictionary whose keys are protocol names and whose values are the proxies to use. When we request an HTTP link, the HTTP proxy is selected automatically; when we request an HTTPS link, the HTTPS proxy is selected automatically.
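The selection rule described in the note can be sketched in a few lines. This is not urllib's actual implementation, only an illustration of the idea; the helper name `select_proxy` is hypothetical, and the proxy addresses are the ones from the example above.

```python
from urllib.parse import urlsplit


def select_proxy(url, proxies):
    """Return the proxy configured for the URL's protocol, or None if there is none."""
    scheme = urlsplit(url).scheme.lower()
    return proxies.get(scheme)


proxies = {
    "http": "http://182.34.37.0:9999",
    "https": "https://117.69.150.84:9999",
}

# An HTTPS link picks the entry stored under the "https" key.
print(select_proxy("https://httpbin.org/get", proxies))
# An HTTP link picks the entry stored under the "http" key.
print(select_proxy("http://httpbin.org/get", proxies))
```

A URL whose protocol has no entry in the dictionary gets no proxy at all in this sketch, which is worth keeping in mind when only one of the two keys is filled in.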
Requests
For Requests, setting a proxy is simpler and more straightforward. Example code is as follows:
import requests

# Map each protocol to the proxy that should carry requests for it
proxies = {
    'http': 'http://59.52.186.117:9999',
    'https': 'https://222.95.241.6:3000'
}

try:
    response = requests.get('https://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)
The results are as follows:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0"
  },
  "origin": "222.95.241.6, 222.95.241.6",
  "url": "https://httpbin.org/get"
}
I again chose a high-anonymity proxy here, so the IP shown is still the IP of our proxy.
Selenium
Selenium can also set up a proxy, and it is very simple, as shown in the following example:
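from selenium import webdriver

proxy = '222.95.241.6:3000'
chrome_options = webdriver.ChromeOptions()
# Tell Chrome to route its traffic through the proxy
chrome_options.add_argument('--proxy-server=https://' + proxy)
# Recent Selenium releases take the options object via `options=`
# (older releases used `chrome_options=`)
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://httpbin.org/get')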
The results are as follows:
Setting up Firefox is almost the same as setting up Chrome: initialize a Firefox driver and pass a FirefoxOptions() object as the startup parameters. One caveat is that Firefox normally takes its proxy from preferences such as network.proxy.http and network.proxy.http_port rather than from a --proxy-server argument.
Free proxies
Because the connectivity and stability of free proxies are really not high, here are a few free proxy sites, for reference only:
http://www.ip3366.net/
https://www.kuaidaili.com/free/
https://www.xicidaili.com/
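Since free proxies fail so often, it can help to test each one before using it. Below is a minimal sketch using only the standard library. The function names `check_proxy` and `filter_working`, the 5-second timeout, and the choice to send both HTTP and HTTPS traffic through an `http://` proxy URL are all assumptions made for illustration, not something the sites above prescribe.

```python
import http.client
from urllib.request import ProxyHandler, build_opener


def check_proxy(proxy, test_url="https://httpbin.org/get", timeout=5):
    """Return True if a request through `proxy` ("host:port") succeeds in time."""
    # Assumption: the proxy speaks plain HTTP and tunnels HTTPS traffic.
    handler = ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    opener = build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, timeouts and connection resets are all OSError subclasses.
        return False


def filter_working(candidates, check=check_proxy):
    """Keep only the proxies for which `check` returns True, preserving order."""
    return [proxy for proxy in candidates if check(proxy)]
```

A call such as `filter_working(['182.34.37.0:9999', '117.69.150.84:9999'])` would then return only the proxies that answered in time. The `check` parameter is injectable so that the filtering logic can be exercised without touching the network.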
Sample code
All of the code in this series will be available on GitHub and Gitee.
Example code - GitHub
Example code - Gitee
Reference
https://www.jianshu.com/p/bb00a288ee5f