Life is short. I use Python

Previous articles in this series:

Learn Python Crawler (1): The Beginning

Learn Python Crawler (2): Pre-preparation (1) Basic library installation

Learn Python Crawler (3): Pre-preparation (2) Linux basics

Learn Python Crawler (4): Pre-preparation (3) Docker basics

Learn Python Crawler (5): Pre-preparation (4) Database basics

Learn Python Crawler (6): Pre-preparation (5) Crawler framework installation

Learn Python Crawler (7): HTTP basics

Learn Python Crawler (8): Web basics

Learn Python Crawler (9): Crawler basics

Learn Python Crawler (10): Session and Cookies

Learn Python Crawler (11): Urllib

Learn Python Crawler (12): Urllib

Learn Python Crawler (13): Urllib

Learn Python Crawler (14): Urllib

Learn Python Crawler (15): Urllib

Learn Python Crawler (16): Urllib

Learn Python Crawler (17): Basic usage of Requests

Learn Python Crawler (18): Advanced Requests operations

Learn Python Crawler (19): Basic Xpath operations

Learn Python Crawler (20): Advanced Xpath

Learn Python Crawler (21): The Beautiful Soup parsing library

Learn Python Crawler (22): Beautiful Soup

Learn Python Crawler (23): Getting started with the pyQuery parsing library

Learn Python Crawler (24): 2019 Douban movie rankings

Learn Python Crawler (25): Crawling stock information

You can't even afford a second-hand house in Shanghai

The Selenium automated testing framework: from getting started to giving up (1)

The Selenium automated testing framework: from getting started to giving up (2)

Using Selenium to fetch product information from a large e-commerce site

Introduction

When we run a crawler, we often run into the following situation: at first everything goes as smooth as silk, but in the time it takes to drink a cup of tea, various restrictions start to appear, such as 403 Forbidden or 429 Too Many Requests.

When this happens, it is very likely that our IP has been restricted.

These problems usually occur because the website, or the machine room hosting it, has put security limits in place. Sometimes the detection is done on the server itself, sometimes at the gateway; either way, once an IP is found to exceed a certain threshold of requests per unit of time, service is simply refused. We refer to this situation collectively as IP blocking.

So do we just give up when we run into this? Of course not!

Proxies exist to solve exactly this problem.

The proxy approach is to put a proxy server in the middle of the request to do the forwarding. Originally, client A requests server C directly, like this: A -> C. After adding proxy B, it becomes: A -> B -> C.
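To make the idea concrete, here is a minimal sketch of handling those status codes with Requests (which is covered in more detail later in this article); the proxy address below is a placeholder, not a real working proxy:

import requests

URL = 'https://httpbin.org/get'
# placeholder proxy address, substitute a working one
PROXIES = {
    'http': 'http://127.0.0.1:9999',
    'https': 'http://127.0.0.1:9999'
}

def fetch(url):
    # first try the direct route: A -> C
    response = requests.get(url)
    if response.status_code in (403, 429):
        # we appear to be rate limited or blocked,
        # so retry through the proxy: A -> B -> C
        response = requests.get(url, proxies=PROXIES)
    return response

print(fetch(URL).status_code)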

Proxy acquisition

Before we get into action, let's first take a look at how to obtain proxies.

Start by searching Baidu for "proxy". You will see plenty of proxy services; most of them, of course, are paid, but some providers also offer a number of free proxies.

Of course, free proxies come with all kinds of pitfalls, such as frequent disconnections and high network latency.

But hey, it's free; what more could you ask for?

Of course, if you want a proxy service that is stable and has low network latency, paying for one is recommended. After all, you get what you pay for.

I won't list proxy sites one by one here (there are far too many); let's just open one of the free proxy websites and look at its listing:

You will notice that the proxies seem to be divided into two kinds: high-anonymity proxies and transparent proxies. What is the difference between the two?

In fact, besides high-anonymity and transparent proxies, there is also an intermediate form called the anonymous proxy.

The difference between these proxy types lies in the HTTP headers they use when forwarding requests (a quick way to check what a given proxy actually forwards is sketched at the end of this comparison).

Transparent proxy

The target server knows that we are using a proxy and also knows our real IP. A transparent proxy accesses the target server with the following HTTP headers:

  • REMOTE_ADDR = IP address of the proxy server
  • HTTP_VIA = IP address of the proxy server
  • HTTP_X_FORWARDED_FOR = our real IP address

The transparent proxy still passes our real IP address on to the target server, so it cannot hide our identity.

Anonymous proxy

The target server knows that we are using a proxy, but does not know our real IP. An anonymous proxy accesses the target server with the following HTTP headers:

  • REMOTE_ADDR = IP address of the proxy server
  • HTTP_VIA = IP address of the proxy server
  • HTTP_X_FORWARDED_FOR = IP address of the proxy server

An anonymous proxy hides our real IP address, but it still reveals to the target server that we are accessing it through a proxy.

High-anonymity proxy

The target server does not know that we are using a proxy, let alone our real IP. A high-anonymity proxy accesses the target server with the following HTTP headers:

  • REMOTE_ADDR = IP address of the proxy server
  • HTTP_VIA is not sent
  • HTTP_X_FORWARDED_FOR is not sent

A high-anonymity proxy hides our real IP, and the target server does not even know that a proxy is being used, so it is the most covert option.

As you can see, the anonymous proxy sits in the middle: it hides our real IP, but it still tells the target server that a proxy is in use, so it only does half the job and is not very useful.
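To check what a given proxy actually forwards, you can request httpbin.org/headers through it and look at the headers the server echoes back. Below is a minimal sketch with Requests; the proxy address is a placeholder, and the request uses plain HTTP on purpose, because over HTTPS the proxy only tunnels the traffic and cannot add headers:

import requests

# placeholder proxy address, replace with the proxy you want to inspect
proxy = 'http://127.0.0.1:9999'
proxies = {'http': proxy}

# httpbin.org/headers echoes back the headers it received, so
# Via / X-Forwarded-For will show up here if the proxy added them
response = requests.get('http://httpbin.org/headers', proxies=proxies, timeout=10)
print(response.json()['headers'])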

Proxy settings

Now that we know where proxies come from, let's look at how the various HTTP request libraries set up a proxy:

urllib

We will use urllib for this test. The test link is https://httpbin.org/get, the same one we have used before; visiting it returns some information about the request, and the origin field shows the IP address the request came from. I casually picked a free high-anonymity proxy from the Internet, as follows:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://182.34.37.0:9999',
    'https': 'https://117.69.150.84:9999'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

The code is very simple, let’s look at the result:

{ "args": {}, "headers": { "Accept-Encoding": "identity", "Host": "httpbin.org", "User-Agent": "Python - urllib / 3.7"}, "origin" : "117.69.150.84, 117.69.150.84", "url" : "https://httpbin.org/get"}Copy the code

As you can see, the origin field now shows the IP address of our proxy server rather than our real IP; as far as the target server is concerned, the request came from the proxy.

Note: here we use ProxyHandler to set the proxy. ProxyHandler takes a dictionary whose keys are protocol names and whose values are the proxy addresses. The HTTP proxy is selected automatically when we request an HTTP link, and the HTTPS proxy is selected automatically when we request an HTTPS link.
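If the proxy requires authentication, the username and password can be embedded directly in the proxy URL and ProxyHandler will pick them up. A minimal sketch; the credentials and address below are placeholders:

from urllib.request import ProxyHandler, build_opener

# placeholder credentials and address, for illustration only
proxy = 'http://username:password@127.0.0.1:9999'
proxy_handler = ProxyHandler({
    'http': proxy,
    'https': proxy
})
opener = build_opener(proxy_handler)
response = opener.open('https://httpbin.org/get')
print(response.read().decode('utf-8'))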

Requests

For Requests, the proxy setup is much simpler and more straightforward. Example code is as follows:

import requests

proxies = {
    'http': 'http://59.52.186.117:9999',
    'https': 'https://222.95.241.6:3000'
}
try:
    response = requests.get('https://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)

The results are as follows:

{ "args": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "Python - requests / 2.22.0"}, "origin" : "222.95.241.6, 222.95.241.6", "url" : "https://httpbin.org/get"}Copy the code

I again chose a high-anonymity proxy here, so the IP shown is the proxy's IP rather than our own.
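Requests also accepts credentials embedded in the proxy URL, and it can use SOCKS proxies once the optional socks extra is installed (pip install "requests[socks]"). A minimal sketch; the addresses and credentials below are placeholders:

import requests

# placeholder HTTP proxy with basic authentication
proxies = {
    'http': 'http://username:password@127.0.0.1:9999',
    'https': 'http://username:password@127.0.0.1:9999'
}
# or, with the requests[socks] extra installed, a SOCKS5 proxy:
# proxies = {
#     'http': 'socks5://127.0.0.1:1080',
#     'https': 'socks5://127.0.0.1:1080'
# }

response = requests.get('https://httpbin.org/get', proxies=proxies)
print(response.text)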

Selenium

Selenium can also be configured to use a proxy, and it is very simple, as shown in the following example:

from selenium import webdriver

proxy = '222.95.241.6:3000'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=https://' + proxy)
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://httpbin.org/get')

The result is the same as in the Requests example above: the page that opens in Chrome shows the proxy server's IP as the origin.

Setting up Firefox is almost the same as setting up Chrome; the only difference is that you initialize a Firefox driver and use FirefoxOptions() for the startup options, as sketched below. Otherwise there is no difference.
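A minimal sketch of the Firefox variant, using the same sample proxy as above. Note that Firefox is usually told about the proxy through preferences set on FirefoxOptions rather than a --proxy-server argument:

from selenium import webdriver

proxy_host = '222.95.241.6'
proxy_port = 3000

firefox_options = webdriver.FirefoxOptions()
# 1 means "manual proxy configuration"
firefox_options.set_preference('network.proxy.type', 1)
firefox_options.set_preference('network.proxy.http', proxy_host)
firefox_options.set_preference('network.proxy.http_port', proxy_port)
firefox_options.set_preference('network.proxy.ssl', proxy_host)
firefox_options.set_preference('network.proxy.ssl_port', proxy_port)

driver = webdriver.Firefox(options=firefox_options)
driver.get('https://httpbin.org/get')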

Free proxies

Because the connectivity rate and stability of free proxies are really not great, I have only collected a few free proxy sites here for your reference (a quick connectivity check is sketched after the list):

http://www.ip3366.net/

https://www.kuaidaili.com/free/

https://www.xicidaili.com/
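Since free proxies die frequently, it is worth checking whether one still works before using it. A minimal sketch with Requests; the candidate addresses below are only examples for illustration:

import requests

# example candidate proxies, e.g. copied by hand from one of the sites above
candidates = ['59.52.186.117:9999', '222.95.241.6:3000']

def check_proxy(address, timeout=5):
    """Return True if httpbin can be reached through the proxy."""
    proxies = {'http': 'http://' + address, 'https': 'http://' + address}
    try:
        response = requests.get('https://httpbin.org/get', proxies=proxies, timeout=timeout)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

usable = [address for address in candidates if check_proxy(address)]
print(usable)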

Sample code

All of the code in this series will be available on Github and Gitee.

Example code -Github

Example code -Gitee

Reference

https://www.jianshu.com/p/bb00a288ee5f