This is the second day of my participation in the November Gwen Challenge. See the event details here: The Last Gwen Challenge 2021.
Experiment 2
2.2 Approach
2.2.1 settings.py
- Remove the robots.txt restriction
ROBOTSTXT_OBEY = False
- Set the path to save the image
IMAGES_STORE = r'.\images'  # path where downloaded images are saved
- Enable the item pipeline
ITEM_PIPELINES = {
    'weatherSpider.pipelines.WeatherspiderPipeline': 300,
}
- Set the request header
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36',
}
2.2.2 items.py
- Define the fields to crawl
import scrapy

class WeatherspiderItem(scrapy.Item):
    number = scrapy.Field()
    pic_url = scrapy.Field()
2.2.3 wt_Spider.py
- Send the request
def start_requests(self):
    yield scrapy.Request(self.start_url, callback=self.parse)
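The snippets in this section reference self.start_url and self.total without showing the spider's class header. A minimal sketch of how the pieces might fit together is below; the spider name, target URL, and exact imports are assumptions for illustration, not taken from the original code.

import re
import scrapy
from weatherSpider.items import WeatherspiderItem

class WtSpider(scrapy.Spider):
    name = 'wt_Spider'
    start_url = 'http://www.weather.com.cn/'  # assumed target site; replace with the page actually being scraped
    total = 0  # counter used in picParse to stop after enough images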
- Get all the a-tag links on the page
def parse(self, response):
    html = response.text
    urlList = re.findall('<a href="(.*?)"', html, re.S)
    for url in urlList:
        self.url = url
        try:
            yield scrapy.Request(self.url, callback=self.picParse)
        except Exception as e:
            print("err:", e)
            pass
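One caveat with the loop above: the hrefs captured by the regex may be relative paths, and scrapy.Request needs absolute URLs. An optional adjustment (not part of the original code) is to resolve them with response.urljoin:

for url in urlList:
    # resolve relative hrefs against the current page URL
    absolute = response.urljoin(url)
    try:
        yield scrapy.Request(absolute, callback=self.picParse)
    except Exception as e:
        print("err:", e)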
- Request each URL found under the a tags, then extract all the picture links from those pages
def picParse(self, response):
    # the image regex was lost in the original formatting; this pattern is a reasonable reconstruction
    imgList = re.findall(r'<img src="(.*?)"', response.text, re.S)
    for k in imgList:
        if self.total > 102:
            return
        try:
            item = WeatherspiderItem()
            item['pic_url'] = k
            item['number'] = self.total
            self.total += 1
            yield item
        except Exception as e:
            pass
- Data processing, much like storing results in a database, is all handled in pipelines.py; here the built-in ImagesPipeline is used to download and save the images.
2.2.4 pipelines.py
- Import the settings information
from weatherSpider.settings import IMAGES_STORE as images_store  # read the image save path from the settings file
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings
from scrapy import Request

settings = get_project_settings()
- Write the save function
class WeatherspiderPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        image_url = item["pic_url"]
        yield Request(image_url)
- When saving, it is better to rename the file, as sketched below.
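A minimal sketch of the rename, added inside the WeatherspiderPipeline class above and assuming the standard ImagesPipeline.file_path override (the item argument requires Scrapy 2.4+); naming each file by the item's number field is an assumption for illustration, not the original author's code:

    def file_path(self, request, response=None, info=None, *, item=None):
        # the returned path is relative to IMAGES_STORE (imported above as images_store)
        # name each image by its sequence number, e.g. .\images\1.jpg
        return '%s.jpg' % item['number']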