When debugging crawlers, beginners often run into IP-related errors: a program that worked fine suddenly starts failing. The IP access error itself is easy to fix, but how do you recognize it in the first place? How do you know it is a proxy IP problem? Since my background is Java, some of my interpretations may differ from those of the Python experts, because I look at Python from a Java perspective. That should also make this article easy for Java developers to read and understand.
Where does the proxy IP logic go?
The structure of a Scrapy project looks like this:

```
scrapydownloadertest       # project folder
│  items.py                # defines the data structures that store the crawl results
│  middlewares.py          # middleware (think of Java's filters/interceptors)
│  pipelines.py            # data pipeline, operates on the scraped data
│  settings.py             # the project's configuration file
│  __init__.py             # initialization logic
├─ spiders                 # the spiders folder
│  │  httpProxyIp.py       # the class that processes the crawled result
│  │  __init__.py          # spider initialization logic
scrapy.cfg                 # project configuration file
```
As you can see from the structure above, the proxy IP must be set before the request is sent, so the only place that fits is middlewares.py; that is where the proxy logic goes. Add the following code directly to it:
Scrapy's built-in Downloader Middlewares provide its basic functionality; here we define a custom one:

```python
import random


# Define a class; writing (object) explicitly is optional, the effect is the same
class SimpleProxyMiddleware(object):
    # Declare a list of proxy IPs
    proxyList = ['http://218.75.158.153:3128', 'http://188.226.141.61:8080']

    # One of Downloader Middleware's core methods. Implementing one or more
    # of these methods is all it takes to customize a Downloader Middleware
    def process_request(self, request, spider):
        # Pick one at random and strip surrounding whitespace
        proxy = random.choice(self.proxyList).strip()
        # Print it so we can observe the result
        print("this is request ip:" + proxy)
        # Set the request's proxy to the chosen proxy IP
        request.meta['proxy'] = proxy

    # Another of Downloader Middleware's core methods
    def process_response(self, request, response, spider):
        # A status other than 200 means the request failed
        if response.status != 200:
            # Pick a new proxy IP
            proxy = random.choice(self.proxyList).strip()
            print("this is response ip:" + proxy)
            # Set the new proxy IP and retry the request
            request.meta['proxy'] = proxy
            return request
        return response
```
Each Downloader Middleware is a class that defines one or more methods. The core methods are:

- process_request(request, spider)
- process_response(request, response, spider)
- process_exception(request, exception, spider)
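The middleware above implements the first two. For completeness, here is a minimal sketch (my own illustration, not part of the original project) of what a process_exception hook could look like, so that connection failures, such as a dead free proxy, also trigger a proxy switch. It would sit inside SimpleProxyMiddleware alongside the other two methods:

```python
    # Called when downloading raises an exception (e.g. the proxy is dead
    # and the connection fails); returning the request makes Scrapy retry it
    def process_exception(self, request, exception, spider):
        # Swap in a different proxy and reschedule the request
        proxy = random.choice(self.proxyList).strip()
        print("this is exception ip:" + proxy)
        request.meta['proxy'] = proxy
        return request
```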
Then register the middleware in the settings.py file, and create a new spider, httpProxyIp, against icanhazip.com, a site that simply echoes back the IP address it was visited from; a sketch of both steps follows.
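The registration presumably looks like this (a sketch; the entry is inferred from the DOWNLOADER_MIDDLEWARES block shown later in this article, where SimpleProxyMiddleware appears with priority 100):

```python
# settings.py: enable the custom middleware (sketch, entry inferred as above)
DOWNLOADER_MIDDLEWARES = {
    'scrapydownloadertest.middlewares.SimpleProxyMiddleware': 100,
}
```

The spider itself can be created with Scrapy's standard generator:

```
scrapy genspider httpProxyIp icanhazip.com
```

which produces the following skeleton: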
```python
# -*- coding: utf-8 -*-
import scrapy


class HttpproxyipSpider(scrapy.Spider):
    name = 'httpProxyIp'
    allowed_domains = ['icanhazip.com']
    start_urls = ['http://icanhazip.com/']

    def parse(self, response):
        pass
```
Let’s modify it, and the final code looks like this:
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.cmdline import execute


class HttpproxyipSpider(scrapy.Spider):
    # The spider's task name
    name = 'httpProxyIp'
    # Domains the spider is allowed to visit
    allowed_domains = ['icanhazip.com']
    # The URLs crawling starts from
    start_urls = ['http://icanhazip.com/']

    # The spider's parse method; the response is handled here.
    # self is a reference to the instance, response is the crawl result
    def parse(self, response):
        print('Proxy IP:', response.text)


# The idiomatic "main" entry point for the whole program
if __name__ == '__main__':
    execute(['scrapy', 'crawl', 'httpProxyIp'])
```
Run the program with `scrapy crawl httpProxyIp` and watch the output: the proxy printed should be one of the addresses in proxyList ('http://218.75.158.153:3128' or 'http://188.226.141.61:8080'). Keep in mind these are free proxy IPs, which are unstable and often already dead; if the run fails, swap in fresh ones.
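Given the print statements above, the console output should look roughly like this (illustrative only; the exact values depend on which proxy was picked and whether it is still alive):

```
this is request ip:http://218.75.158.153:3128
Proxy IP: 218.75.158.153
```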
This completes the Scrapy proxy setup and verification.
How do I configure a dynamic proxy IP address
This part uses paid proxy IPs. You can use services such as Kuaidaili (快代理) or Abuyun (阿布云): once you register and pay, they give you an access URL plus a username and password. Let's go straight to the code and create another class in middlewares.py.
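The class body is not spelled out above, so here is a minimal sketch of what such a middleware typically looks like; the proxy server address, username, and password are placeholders that you must replace with the values your provider gives you:

```python
import base64


class AbuyunProxyMiddleware(object):
    # Placeholders: use the access URL and credentials from your provider
    proxy_server = "http://http-dyn.abuyun.com:9020"
    proxy_user = "your-username"
    proxy_pass = "your-password"

    def process_request(self, request, spider):
        # Route every request through the paid dynamic proxy
        request.meta['proxy'] = self.proxy_server
        # The proxy gateway authenticates via HTTP Basic auth
        credentials = (self.proxy_user + ":" + self.proxy_pass).encode()
        request.headers['Proxy-Authorization'] = \
            b'Basic ' + base64.b64encode(credentials)
```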
Then modify the DOWNLOADER_MIDDLEWARES section of settings.py, swapping the new middleware in for the simple one (the number is the middleware's order in the chain):
```python
DOWNLOADER_MIDDLEWARES = {
    # Use AbuyunProxyMiddleware instead of SimpleProxyMiddleware
    # 'scrapydownloadertest.middlewares.SimpleProxyMiddleware': 100,
    'scrapydownloadertest.middlewares.AbuyunProxyMiddleware': 100,
}
```
Nothing else changes, but this time we start the spider a different way: since we're developing in PyCharm, we can run the file directly, thanks to the `if __name__ == '__main__'` entry point we added earlier. The page we request is still http://icanhazip.com/, which echoes back the IP it sees, so the output should now show an address from the paid proxy pool rather than your own.
Pay attention and don't get lost!

Articles are updated weekly. You can search for "ten minutes to learn programming" on WeChat to read them first and push me for more. If this article is well written and you got something out of it, please like 👍, follow ❤️, and share ❤️. Your support and recognition is the biggest motivation for my writing. See you in the next article!