This is the third day of my participation in the November Gwen Challenge. Check out the event details: The Last Gwen Challenge 2021

Scrapy overview 🎨

Scrapy is an application framework designed to crawl site data and extract structured data.

The Scrapy framework is made up of five main components: the Scheduler, the Downloader, the Spider, the Item Pipeline, and the Scrapy Engine. The Scheduler receives URLs and queues up the requests; the Downloader fetches the page content, which is passed back through the Engine and the middleware (where useless data is filtered out) to the Spider, where we extract the data we need. If a request succeeds, the extracted data is handed to the Item Pipeline to be saved; if it fails, the URL is sent back to the Scheduler to be requested again.

Create a Scrapy project 💤

pip install scrapy

Create a Scrapy project: scrapy startproject movieScrapy (my project name is movieScrapy). Then create a crawler file under the spiders directory: scrapy genspider spiderData <domain of the site you want to crawl>. This is what the created directory structure looks like.
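
For reference, a freshly generated project typically has the following layout (spiderData.py is the file added by the genspider command):

movieScrapy/
    scrapy.cfg
    movieScrapy/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spiderData.py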

items.py acts as a wrapper class that encapsulates the data to be passed to the pipeline; middlewares.py contains the middleware and generally doesn't need to be touched; pipelines.py is where the items are processed and saved; settings.py holds the project configuration.

First, let's change a few options in the settings.py configuration file (a sketch of the file follows the list below):

- ROBOTSTXT_OBEY = False — the default is True; we set it to False because strictly obeying the robots protocol would keep us from crawling a lot of content;
- CONCURRENT_REQUESTS = 32 — maximum number of concurrent requests, 16 by default; larger is not always better;
- DOWNLOAD_DELAY = 2 — download delay, which helps prevent the IP from being blocked for requesting too often;
- COOKIES_ENABLED = True — by default the cookies from the previous visit are carried along;
- DEFAULT_REQUEST_HEADERS = {} — default request headers such as User-Agent, Cookie and Referer.
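
Putting these together, a minimal sketch of the relevant part of settings.py might look like this (the header values are placeholders, not taken from the original project):

# settings.py -- only the options discussed above, everything else keeps its default
ROBOTSTXT_OBEY = False            # default is True
CONCURRENT_REQUESTS = 32          # default is 16; higher is not always better
DOWNLOAD_DELAY = 2                # seconds between requests, helps avoid IP bans
COOKIES_ENABLED = True            # keep cookies between requests
DEFAULT_REQUEST_HEADERS = {
    # placeholder headers -- fill in real values for the target site
    "User-Agent": "Mozilla/5.0 (placeholder)",
    "Referer": "https://x.com/",
}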

Crawling images ❤

Today's task is to crawl some images, let's go!!

items.py

Define the fields we need in items.py so they can be passed to the pipeline conveniently.

import scrapy
class MoviescrapyItem(scrapy.Item):
    name = scrapy.Field()   # image name
    url = scrapy.Field()    # image url (filled in by the spider below)

spiderData.py (the crawler file)

Here is the complete code of the crawler file under the spiders directory; I've replaced the real URLs with x/xxx.

import scrapy
from movieScrapy.items import MoviescrapyItem  # the item class defined in items.py


class SpiderdataSpider(scrapy.Spider):
    name = 'spiderData'
    allowed_domains = ['x.com']
    start_urls = ['https://x.com/x/x/']

    def parse(self, response):
        articles = response.xpath("//div[@class='galleryWrapper']/section/section/section/section/article")
        for article in articles:
            url = article.xpath("./section/section/a/@href").extract()[0]
            yield scrapy.Request(url, callback=self.MainUrl)

    def MainUrl(self, response):
        data = response.text
        name = response.xpath("//div[@class='image_name']/text()").extract()[0]
        url = response.xpath("//section[@class='image-card']/section/figure/div[@class='unzoomed']/img/@src").extract()[0]
        url = "https:" + url
        item = MoviescrapyItem()
        item['name'] = name
        item['url'] = url
        yield item

- name = 'spiderData' is the crawler name; we run the crawler with scrapy crawl spiderData;
- allowed_domains = ['x.com'] means links outside this domain are filtered out automatically; name, allowed_domains and start_urls are generated when the file is created;
- in the parse function, response already contains the downloaded page, which you can inspect with response.text; xpath() selects the data we need from the document, and extract()[0] pulls out the text of the first match (a small example follows this list).
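
To make the last point concrete, here is a tiny sketch reusing the XPath from MainUrl: extract() returns a list of strings, while extract_first() gives just the first match (or None if nothing matched).

names = response.xpath("//div[@class='image_name']/text()").extract()        # list of matched strings
name = response.xpath("//div[@class='image_name']/text()").extract_first()   # first match or None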

yield scrapy.Request(url, callback=self.MainUrl): inside the loop we keep extracting the link to each image's detail page and yield a Request for it, using MainUrl as the callback to extract the data we need from that response. yield hands back a result on every iteration of the loop.

item = MoviescrapyItem() instantiates the item class; after filling in the scraped data we yield the item, which sends it to the pipeline.

pipelines.py

import urllib.request

class MoviescrapyPipeline:
    def __init__(self):
        pass

    def process_item(self, item, spider):
        name = item['name']
        # save the image; the ./Loadimg directory must already exist
        urllib.request.urlretrieve(item['url'], "./Loadimg/" + name + '.jpg')
        return item

    def close_spider(self, spider):
        pass

The __init__ function is executed first; we can use it for setup, such as opening the file we want to store data in (a CSV file, for example). process_item is where the data is handled, and in our case where the image is saved; item is the value we yielded from the spider. close_spider runs when the crawl ends, and that is where we could close that CSV file.
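
One thing worth noting: the pipeline only runs if it is enabled in settings.py. A minimal sketch, assuming the default module layout that startproject generates:

# settings.py -- register the pipeline so process_item gets called
ITEM_PIPELINES = {
    'movieScrapy.pipelines.MoviescrapyPipeline': 300,   # lower number = runs earlier
}

With that in place, every item yielded by the spider passes through process_item and the image is downloaded into ./Loadimg.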

Thus the image is obtained 🎈🎈🎈