Misconceptions about using HTTP proxy IPs for crawler data collection

Most people know that a crawler that scrapes the same website repeatedly will often have its IP address banned by the site's anti-crawler mechanism. Proxy IPs are the usual way around this.

However, some people misunderstand what HTTP proxy IPs can do. They assume that using a proxy IP solves every problem, but a proxy IP is not a panacea. It is only a tool, and if it is used improperly, the proxy IP will be blocked as well.

There are three types of proxy IP: transparent proxies, ordinary anonymous proxies, and high-anonymity (advanced anonymous) proxies.

The main difference among transparent, anonymous, and high-anonymity proxies lies in the values the target server sees for REMOTE_ADDR, HTTP_X_FORWARDED_FOR, and HTTP_VIA.

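REMOTE_ADDR is taken from the network connection itself rather than from a request header, so it is very hard to forge. It always shows the address that actually connected to the server, which is the proxy's address when a proxy is used.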

With a transparent proxy, the target server knows that you are using a proxy and also knows your real IP address. REMOTE_ADDR = ProxyIP, HTTP_VIA = ProxyIP, HTTP_X_FORWARDED_FOR = YourIP

With an anonymous proxy, the target server knows that you are using a proxy but does not know your real IP address. REMOTE_ADDR = ProxyIP, HTTP_VIA = ProxyIP, HTTP_X_FORWARDED_FOR = ProxyIP

With a high-anonymity proxy, the target server knows neither that you are using a proxy nor your real IP address. REMOTE_ADDR = ProxyIP, HTTP_VIA = NULL, HTTP_X_FORWARDED_FOR = NULL
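
The three cases above can be sketched as a small classifier written from the target server's point of view. This is only an illustration under stated assumptions: the function name classify_proxy, the extra "no proxy" result, the handling of a comma-separated HTTP_X_FORWARDED_FOR value, and the example addresses (taken from the reserved documentation ranges) are not part of the original description.

```python
from typing import Optional


def classify_proxy(real_ip: str,
                   remote_addr: str,
                   http_via: Optional[str],
                   http_x_forwarded_for: Optional[str]) -> str:
    """Classify a request by the three values described in the text.

    real_ip is the client's true address, known only in a test setup;
    a real server would not have it. Hypothetical helper for illustration.
    """
    # Neither header is present: the server sees no sign of a proxy.
    if not http_via and not http_x_forwarded_for:
        if remote_addr == real_ip:
            return "no proxy"          # assumption: direct connection
        return "high anonymous"

    # HTTP_X_FORWARDED_FOR may hold several comma-separated addresses.
    forwarded = [ip.strip() for ip in (http_x_forwarded_for or "").split(",")]
    if real_ip in forwarded:
        return "transparent"           # proxy is visible and real IP leaks

    return "anonymous"                 # proxy is visible, real IP is hidden


if __name__ == "__main__":
    your_ip, proxy_ip = "203.0.113.10", "198.51.100.7"
    print(classify_proxy(your_ip, proxy_ip, proxy_ip, your_ip))
    print(classify_proxy(your_ip, proxy_ip, proxy_ip, proxy_ip))
    print(classify_proxy(your_ip, proxy_ip, None, None))
```

In practice, a check like this is usually run against an endpoint you control that echoes the values back, so that a proxy's real anonymity level can be confirmed before it is used for crawling.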

With a transparent proxy or an ordinary anonymous proxy, the target website can tell that a proxy IP is in use and will naturally restrict it. A high-anonymity proxy does not reveal this, so pay attention to the proxy type when choosing proxy IPs.

Even when a proxy IP is used to crawl the target website, many other factors can still get the IP blocked, such as cookies and the User-Agent. Once the site's detection threshold is reached, the IP is blocked. Requesting pages too quickly will also get the IP blocked, because a normal human visitor comes nowhere near that frequency, so the traffic is easily identified by the target website's anti-crawler strategy.
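
One way to keep the request frequency closer to that of a human visitor is to pause for a random interval between requests and to send a browser-like User-Agent. The sketch below is a minimal illustration, not a complete anti-blocking solution: the name polite_fetch_all, the delay range, and the User-Agent string are assumptions, and the fetch callable is supplied by the caller (for example, a function that sends the request through the chosen proxy).

```python
import random
import time
from typing import Callable, Dict, Iterable, List

# Illustrative browser-like User-Agent; replace with one that suits your setup.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def polite_fetch_all(urls: Iterable[str],
                     fetch: Callable[[str, Dict[str, str]], str],
                     min_delay: float = 2.0,
                     max_delay: float = 6.0,
                     sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch each URL in turn, waiting a random interval between requests.

    fetch(url, headers) is a caller-supplied function (hypothetical here)
    that performs the actual request and returns the response body.
    """
    headers = {"User-Agent": DEFAULT_USER_AGENT}
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause so requests do not arrive at a fixed, fast rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url, headers))
    return results
```

The delay range should be tuned to the target website. The values here are placeholders, and cookies would need to be handled inside the fetch function.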

Only by simulating the normal access behavior of real users can you keep IP addresses from being blocked as far as possible.