When debugging crawlers, beginners often run into IP-related errors: a program that worked fine suddenly starts failing. The IP access error itself is easy to fix, but how do you recognize it in the first place? How do you know it is a proxy IP problem? Since my background is Java, some of my interpretations may differ from those of the Python experts, because I look at Python from a Java perspective. That should also make this article easy for Java developers to read and understand.
Where does the proxy IP logic go?
The structure of a Scrapy project looks like this:

```
scrapydownloadertest       # project folder
│  items.py                # defines the data structures that store the crawl results
│  middlewares.py          # middleware (think of Java's filters/interceptors)
│  pipelines.py            # data pipeline, operates on the scraped data
│  settings.py             # the project's configuration file
│  __init__.py             # initialization logic
├─ spiders                 # the spiders folder
│  │  httpProxyIp.py       # the class that processes the crawled result
│  │  __init__.py          # spider initialization logic
scrapy.cfg                 # project configuration file
```
As you can see from the structure above, the proxy IP must be set before the request is sent, so the only place that fits is middlewares.py; that is where the proxy logic goes. Add the following code directly to it:
Scrapy's built-in Downloader Middlewares provide its basic functionality; here we define a custom one:

```python
import random


# Define a class; writing (object) explicitly is optional, the effect is the same
class SimpleProxyMiddleware(object):
    # Declare a list of proxy IPs
    proxyList = ['http://218.75.158.153:3128', 'http://188.226.141.61:8080']

    # One of Downloader Middleware's core methods. Implementing one or more
    # of these methods is all it takes to customize a Downloader Middleware
    def process_request(self, request, spider):
        # Pick one at random and strip surrounding whitespace
        proxy = random.choice(self.proxyList).strip()
        # Print it so we can observe the result
        print("this is request ip:" + proxy)
        # Set the request's proxy to the chosen proxy IP
        request.meta['proxy'] = proxy

    # Another of Downloader Middleware's core methods
    def process_response(self, request, response, spider):
        # A status other than 200 means the request failed
        if response.status != 200:
            # Pick a new proxy IP
            proxy = random.choice(self.proxyList).strip()
            print("this is response ip:" + proxy)
            # Set the new proxy IP and retry the request
            request.meta['proxy'] = proxy
            return request
        return response
```
Each Downloader Middleware is a class that defines one or more methods. The core methods are:

- process_request(request, spider)
- process_response(request, response, spider)
- process_exception(request, exception, spider)
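The middleware above implements the first two. For completeness, here is a minimal sketch (my own illustration, not part of the original project) of what a process_exception hook could look like, so that connection failures, such as a dead free proxy, also trigger a proxy switch. It would sit inside SimpleProxyMiddleware alongside the other two methods:

```python
    # Called when downloading raises an exception (e.g. the proxy is dead
    # and the connection fails); returning the request makes Scrapy retry it
    def process_exception(self, request, exception, spider):
        # Swap in a different proxy and reschedule the request
        proxy = random.choice(self.proxyList).strip()
        print("this is exception ip:" + proxy)
        request.meta['proxy'] = proxy
        return request
```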
Then register the middleware in the settings.py file, and create a new spider, httpProxyIp, against icanhazip.com, a site that simply echoes back the IP address it was visited from; a sketch of both steps follows.
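The registration presumably looks like this (a sketch; the entry is inferred from the DOWNLOADER_MIDDLEWARES block shown later in this article, where SimpleProxyMiddleware appears with priority 100):

```python
# settings.py: enable the custom middleware (sketch, entry inferred as above)
DOWNLOADER_MIDDLEWARES = {
    'scrapydownloadertest.middlewares.SimpleProxyMiddleware': 100,
}
```

The spider itself can be created with Scrapy's standard generator:

```
scrapy genspider httpProxyIp icanhazip.com
```

which produces the following skeleton: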
```python
# -*- coding: utf-8 -*-
import scrapy


class HttpproxyipSpider(scrapy.Spider):
    name = 'httpProxyIp'
    allowed_domains = ['icanhazip.com']
    start_urls = ['http://icanhazip.com/']

    def parse(self, response):
        pass
```
Let’s modify it, and the final code looks like this:
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.cmdline import execute


class HttpproxyipSpider(scrapy.Spider):
    # The spider's task name
    name = 'httpProxyIp'
    # Domains the spider is allowed to visit
    allowed_domains = ['icanhazip.com']
    # The URLs crawling starts from
    start_urls = ['http://icanhazip.com/']

    # The spider's parse method; the response is handled here.
    # self is a reference to the instance, response is the crawl result
    def parse(self, response):
        print('Proxy IP:', response.text)


# The idiomatic "main" entry point for the whole program
if __name__ == '__main__':
    execute(['scrapy', 'crawl', 'httpProxyIp'])
```
Run the program with `scrapy crawl httpProxyIp` and watch the output: the proxy printed should be one of the addresses in proxyList ('http://218.75.158.153:3128' or 'http://188.226.141.61:8080'). Keep in mind these are free proxy IPs, which are unstable and often already dead; if the run fails, swap in fresh ones.
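Given the print statements above, the console output should look roughly like this (illustrative only; the exact values depend on which proxy was picked and whether it is still alive):

```
this is request ip:http://218.75.158.153:3128
Proxy IP: 218.75.158.153
```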
This completes the Scrapy proxy setup and verification.
How do I configure a dynamic proxy IP address
This part uses paid proxy IPs. You can use services such as Kuaidaili (快代理) or Abuyun (阿布云): once you register and pay, they give you an access URL plus a username and password. Let's go straight to the code and create another class in middlewares.py.
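The class body is not spelled out above, so here is a minimal sketch of what such a middleware typically looks like; the proxy server address, username, and password are placeholders that you must replace with the values your provider gives you:

```python
import base64


class AbuyunProxyMiddleware(object):
    # Placeholders: use the access URL and credentials from your provider
    proxy_server = "http://http-dyn.abuyun.com:9020"
    proxy_user = "your-username"
    proxy_pass = "your-password"

    def process_request(self, request, spider):
        # Route every request through the paid dynamic proxy
        request.meta['proxy'] = self.proxy_server
        # The proxy gateway authenticates via HTTP Basic auth
        credentials = (self.proxy_user + ":" + self.proxy_pass).encode()
        request.headers['Proxy-Authorization'] = \
            b'Basic ' + base64.b64encode(credentials)
```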
Then modify the DOWNLOADER_MIDDLEWARES section of settings.py, swapping the new middleware in for the simple one (the number is the middleware's order in the chain):
```python
DOWNLOADER_MIDDLEWARES = {
    # Use AbuyunProxyMiddleware instead of SimpleProxyMiddleware
    # 'scrapydownloadertest.middlewares.SimpleProxyMiddleware': 100,
    'scrapydownloadertest.middlewares.AbuyunProxyMiddleware': 100,
}
```
Nothing else changes, but this time we start the spider a different way: since we're developing in PyCharm, we can run the file directly, thanks to the `if __name__ == '__main__'` entry point we added earlier. The page we request is still http://icanhazip.com/, which echoes back the IP it sees, so the output should now show an address from the paid proxy pool rather than your own.
Pay attention and don't get lost!

Articles are updated weekly. You can search for "ten minutes to learn programming" on WeChat to read them first and push me for more. If this article is well written and you got something out of it, please like 👍, follow ❤️, and share ❤️. Your support and recognition is the biggest motivation for my writing. See you in the next article!