Geek Monkey is a public account that shares original, practical Python content every week, covering beginner basics, advanced techniques, web crawlers, data analysis, and web application development. You are welcome to follow it.
When crawling certain sites, we often set a proxy IP to keep the crawler from being blocked. We usually obtain proxy IP addresses from the free lists of well-known domestic proxy providers (such as Xitong agent, quick agent, 51worry agent, etc.). These providers generally offer transparent, anonymous, and high-anonymity proxies. So what is the difference between them, and how do we choose? This article explains the principles behind the various types of proxy IP.
1 Proxy Type
There are four types of proxy. In addition to the transparent, anonymous, and high-anonymity proxies mentioned earlier, there is also the distorting proxy. In terms of security, the order of the four types is high-anonymity > distorting > anonymous > transparent.
2 Proxy Principle
The proxy type depends on the configuration of the proxy server. Different configurations form different proxy types. REMOTE_ADDR, HTTP_VIA, and HTTP_X_FORWARDED_FOR are the decisive factors in the configuration.
REMOTE_ADDR
REMOTE_ADDR indicates the client's IP address. Its value is not supplied by the client; the server sets it from the address of the connection it receives.
If you access a website directly with a browser, the site's web server (Nginx, Apache, etc.) sets REMOTE_ADDR to the client's IP address.
If we configure a proxy for the browser, our request to the target site goes through the proxy server, which forwards the request to the target site. The site's web server then sets REMOTE_ADDR to the proxy server's IP.
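To see these values from the server's side, here is a minimal sketch of a Python WSGI application that echoes the three values this article discusses. It assumes a WSGI server, where REMOTE_ADDR is filled in by the server and request headers such as Via and X-Forwarded-For appear as HTTP_VIA and HTTP_X_FORWARDED_FOR; the function name show_proxy_fields is made up for this illustration.

```python
import json


def show_proxy_fields(environ, start_response):
    """WSGI app (illustrative name) that echoes the proxy-related values."""
    fields = {
        # Set by the web server from the incoming connection.
        "REMOTE_ADDR": environ.get("REMOTE_ADDR"),
        # Present only if the request carried a Via header.
        "HTTP_VIA": environ.get("HTTP_VIA"),
        # Present only if the request carried an X-Forwarded-For header.
        "HTTP_X_FORWARDED_FOR": environ.get("HTTP_X_FORWARDED_FOR"),
    }
    body = json.dumps(fields).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"),
         ("Content-Length", str(len(body)))],
    )
    return [body]
```

One way to try it locally is the standard library's wsgiref.simple_server.make_server("127.0.0.1", 8000, show_proxy_fields).serve_forever(), then requesting the page once directly and once through a proxy to compare the output.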
X-Forwarded-For (XFF)
X-Forwarded-For is an HTTP extension header that carries the real IP address of the client that made the HTTP request. When a client uses a proxy, the web server sees only the proxy's address and does not know the client's real IP. To address this, proxy servers usually add an X-Forwarded-For header containing the client's IP address.
The format of the X-Forwarded-For header is as follows:
X-Forwarded-For: client, proxy1, proxy2
client is the IP address of the client. proxy1 is the IP of the proxy farthest from the server, and proxy2 is the IP of the next proxy in the chain. As the format shows, there can be multiple layers of proxies between the client and the server.
If an HTTP request passes through three proxies Proxy1, Proxy2 and Proxy3 with IP addresses of IP1, IP2 and IP3 respectively, and the user’s real IP address is IP0, then according to XFF standard, the server will receive the following information:
X-Forwarded-For: IP0, IP1, IP2
Proxy3 connects directly to the server and appends IP2 to XFF, indicating that it is forwarding the request on behalf of Proxy2. IP3 does not appear in the list; the server obtains it from the Remote Address field. HTTP connections run on top of TCP, and the HTTP protocol itself has no concept of an IP address. Remote Address comes from the TCP connection and is the IP address of the device that established the TCP connection with the server, which in this case is IP3.
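The three-proxy example can be replayed with a small simulation. This is only a sketch: it assumes every proxy follows the XFF convention and appends the address it received the request from, and the function names forward_through_proxy and simulate_chain are made up for this illustration.

```python
def forward_through_proxy(xff, remote_addr, proxy_ip):
    """Simulate one proxy hop that follows the XFF convention.

    The proxy appends the address it received the request from to XFF.
    The next hop then sees the proxy's own IP as the remote address.
    """
    new_xff = remote_addr if not xff else f"{xff}, {remote_addr}"
    return new_xff, proxy_ip


def simulate_chain(client_ip, proxy_ips):
    """Pass a request from client_ip through each proxy in order."""
    xff, remote_addr = None, client_ip
    for proxy_ip in proxy_ips:
        xff, remote_addr = forward_through_proxy(xff, remote_addr, proxy_ip)
    return xff, remote_addr


xff, remote_addr = simulate_chain("IP0", ["IP1", "IP2", "IP3"])
print(xff)          # IP0, IP1, IP2
print(remote_addr)  # IP3
```

The last proxy's address never appears in XFF because no later proxy exists to append it; the server only learns it from the TCP connection.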
HTTP_VIA
Via is an HTTP header that records the proxies and gateways an HTTP request passes through. Each proxy server on the path adds one entry: a request that passes through one proxy carries one entry, and a request that passes through two proxies carries two.
3 Differences Between Proxy Types
1) Transparent Proxy
The configuration of the proxy server is as follows:
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP
A transparent proxy "hides" the client's IP address only on the surface: the server can still read the client's real IP address from HTTP_X_FORWARDED_FOR.
2) Anonymous Proxy
The configuration of the proxy server is as follows:
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Proxy IP
An anonymous proxy hides the client's IP address. The server can tell that the client is using a proxy, but it cannot learn the client's real IP address.
3) Distorting Proxy
The configuration of the proxy server is as follows:
REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Random IP address
The principle is similar to that of an anonymous proxy, but the disguise goes one step further. If the client uses a distorting proxy, the server still knows that a proxy is in use, but it receives a fake client IP address.
4) High-Anonymity Proxy (Elite Proxy)
The configuration of the proxy server is as follows:
REMOTE_ADDR = Proxy IP
HTTP_VIA = not set
HTTP_X_FORWARDED_FOR = not set
A high-anonymity proxy not only prevents the server from obtaining the client's real IP address, it also leaves the server unable to tell whether the client is using a proxy at all.
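The four configurations above can be condensed into a small classifier. This is a simplified sketch that follows this article's tables only; it assumes you already know the client's real IP (for example, on a test page you control for checking your own proxies), and the function name classify_proxy is made up for this illustration. Real proxies may send other headers that this check ignores.

```python
def classify_proxy(remote_addr, via, xff, real_client_ip):
    """Classify a request using the three values from the tables above.

    remote_addr, via, xff are what the server received (via and xff may
    be None); real_client_ip is assumed to be known to the tester.
    """
    forwarded = [ip.strip() for ip in (xff or "").split(",") if ip.strip()]

    if not via and not forwarded:
        # No proxy headers: either a direct visit or a high-anonymity proxy.
        return "no proxy" if remote_addr == real_client_ip else "high-anonymity"
    if real_client_ip in forwarded:
        # The real client IP leaks through X-Forwarded-For.
        return "transparent"
    if forwarded and any(ip != remote_addr for ip in forwarded):
        # X-Forwarded-For holds an address that is neither the client
        # nor the proxy, i.e. a fake one.
        return "distorting"
    # X-Forwarded-For repeats the proxy IP. A request with Via but no
    # X-Forwarded-For is also treated as anonymous here; the tables do
    # not cover that case, so this is an assumption.
    return "anonymous"


proxy, me = "198.51.100.1", "203.0.113.7"
print(classify_proxy(proxy, proxy, me, me))            # transparent
print(classify_proxy(proxy, proxy, proxy, me))         # anonymous
print(classify_proxy(proxy, proxy, "192.0.2.55", me))  # distorting
print(classify_proxy(proxy, None, None, me))           # high-anonymity
```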
4 Choosing a Proxy
An ordinary anonymous proxy hides the client's real IP, but it alters the request, so the server can tell that a proxy is in use. The visited site does not learn the client's IP address from the headers, although some pages that specialize in IP detection may still be able to uncover it.
A high-anonymity proxy does not alter the client's request, so to the server it looks as if a real client browser is visiting. In this case the client's real IP is hidden and the server does not suspect that a proxy is in use.
Therefore, when a crawler needs a proxy IP, choose an ordinary anonymous proxy or a high-anonymity proxy where possible. In addition, HTTPS proxies are recommended so that the proxy server cannot read the data being transferred.
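Once a proxy is chosen, a crawler has to route its requests through it. The sketch below uses only the standard library's urllib.request; the helper names build_proxy_opener and fetch_via_proxy are made up for this illustration, and the proxy address in the usage note is a placeholder, not a working proxy.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url, timeout=10):
    """Fetch url through the given proxy and return the raw response body."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

A call might look like fetch_via_proxy("https://example.com/", "http://127.0.0.1:8888"), with the second argument replaced by the address of the proxy you obtained. Free proxies fail often, so a real crawler would wrap this call in error handling and retry with another proxy.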
This article was first published on WeChat by the account "Geek Monkey". Reprints are welcome; please contact the account first to be added to the whitelist, and respect the author's original work.