When you write crawlers in Python, you usually have to watch out for one thing: if a single IP sends requests too frequently, the site may block it.

In this case, you need to use an IP proxy pool.

Here we use a free proxy site from the web: Kuaidaili's domestic high-anonymity proxy list.

Code:

import requests
import time
import random
from lxml import etree


def get_ip_list(headers, page):
    ip_list = []
    for i in range(int(page)):
        # Fetch free IPs, one listing page at a time
        url = 'https://www.kuaidaili.com/free/inha/{}/'.format(i + 1)
        # print("Crawl url is:", url)
        web_data = requests.get(url, headers=headers)
        if web_data.status_code == 200:
            tree0 = etree.HTML(web_data.text)
            ip_lists = tree0.xpath('//table/tbody/tr/td[@data-title="IP"]/text()')
            port_lists = tree0.xpath('//table/tbody/tr/td[@data-title="PORT"]/text()')
            type_lists = tree0.xpath('//table/tbody/tr/td[@data-title="类型"]/text()')  # "类型" = the type column on the site
            # print(ip_lists)
            # print(port_lists)
            for x,y in zip(ip_lists, port_lists):
                ip_list.append(x + ":" + y)
            time.sleep(3)
    # print(len(ip_list))
    return ip_list


def get_random_ip(ip_list):
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    ip_list = get_ip_list(headers=headers, page=3)
    print(ip_list)
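
With the pool in place, using it is just a matter of passing the dict returned by get_random_ip to requests via the proxies parameter. Below is a minimal usage sketch; the fetch_via_pool helper and the httpbin.org test URL are illustrative additions, not part of the original code. Note that because get_random_ip only registers an 'http' proxy, plain http:// URLs are the ones that actually go through it:

def fetch_via_pool(url, headers, ip_list, retries=5):
    # Free proxies die quickly, so try several random ones before giving up.
    for _ in range(retries):
        proxies = get_random_ip(ip_list)
        try:
            resp = requests.get(url, headers=headers,
                                proxies=proxies, timeout=5)
            if resp.status_code == 200:
                return resp
        except requests.RequestException:
            continue  # dead or blocked proxy; pick another one
    return None

# http://httpbin.org/ip echoes the IP a request came from, which is a
# handy way to confirm the proxy is actually being used.
resp = fetch_via_pool('http://httpbin.org/ip', headers, ip_list)
if resp is not None:
    print(resp.text)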