
Preface

Scrapy is a pure-Python crawler framework. Its simplicity, ease of use, and extensibility have made it the mainstream crawling tool in Python. This article is based on the latest official release, 1.6, and moves from basic usage to a discussion of the underlying principles.

One thing to say up front: no tutorial explains things better than the official documentation! If you are still interested in Scrapy by the end of this article and want to learn more, make sure you get in the habit of consulting the official documentation whenever you need to.

Scrapy

Contents

This article covers the following topics:

  • Why Scrapy?
  • Hello Scrapy! (practice)
  • How does Scrapy work?

The first section, “Why Scrapy?”, is recommended reading; in it I analyze my understanding of the scenarios Scrapy is suited for.

As for the other two sections, I originally wanted to explain how Scrapy works before “Hello Scrapy”, but considering that not everyone wants to dive into theory right away, I present the practical Demo first in the hope of piquing the reader’s interest. Interest helps us understand things more deeply, so I leave the section on how Scrapy works to the end and will pick up the discussion of Scrapy’s principles in the next article!

Why Scrapy?

Although Scrapy is designed to handle most crawling needs, there are a few scenarios where it is simply not the best fit.

  • When is Scrapy not preferred?
  1. Scrapy is not the best option when you only have a small number of pages to crawl and the site is small. For example, pulling a list of movies or a handful of news items can be done with Requests + PyQuery without writing any Scrapy code. Requests and PyQuery are also superior to Scrapy in terms of network request efficiency and page-parsing speed for jobs this small.

  2. When you don’t need a general, extensible crawler, Scrapy is optional. In my opinion, Scrapy’s real strength is the ability to customize “Spider behavior” for many different kinds of sites, plus the powerful “ItemLoader”, which lets you define a series of input and output actions for your data (a minimal sketch follows this list). If you have no need to keep expanding your data sources, Scrapy does not show its best side.

  3. Scrapy is weak when you need incremental crawling. Scrapy has no built-in support for incremental crawls, because incremental requirements vary widely: a simple requirement can be met with a small workaround, but a demanding one is very hard to implement.

Note: in these scenarios Scrapy is merely not the best choice, not unusable! I hope readers understand that choosing a framework or technology is not about following the crowd; thinking it through carefully at the design stage pays off greatly later in the project.
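
To give a flavor of what I mean by ItemLoader defining input and output actions, here is a minimal, hypothetical sketch. MovieItem, the processors, and the selector are illustrative only and are not part of the demo that follows:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class MovieItem(scrapy.Item):
    # declare per-field input/output behaviour once and reuse it across sites
    name = scrapy.Field(
        input_processor=MapCompose(str.strip),   # strip whitespace on the way in
        output_processor=TakeFirst()             # keep only the first value on the way out
    )

# inside some parse method (illustrative):
#     loader = ItemLoader(item=MovieItem(), response=response)
#     loader.add_css('name', 'div.pl2 > a::text')
#     yield loader.load_item()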

  • When does Scrapy work?
  1. Scrapy works well when you need a distributed design, thanks to scrapy-redis, an unofficial Scrapy component. Scrapy itself does not implement a distributed mechanism, but scrapy-redis does; I’ll cover it later.

  2. Scrapy is a great tool when you need extensibility. The reasons have already been described above, so I won’t repeat them here.

Note: all of the above comes from my own experience with Scrapy and is for reference only!

Hello Scrapy

The Demo takes Douban (a perennial victim of crawlers), specifically its hot-movie rankings and all of their comments, as the experimental target, and walks through Scrapy’s basic features one by one. I believe that after working through this Demo, readers will be able to use Scrapy well.

Project setup

Required installation:

  • Python (3.7 was used in this article)
  • scrapy

Installation environment

  • Install Scrapy

Type pip install scrapy on the command line.

Create Scrapy projects

Type scrapy startproject douban_demo on the command line.

After the project is created, Scrapy tells us that we can use the genspider command to create our crawler file. Before doing that, let’s take a look at what startproject has generated.

Looking at the project directory, we can see the following structure:

douban_demo
├── douban_demo
│   ├── items.py         # data model file
│   ├── middlewares.py   # middleware file, where all middleware is configured
│   ├── pipelines.py     # pipeline file, for processing data output
│   ├── settings.py      # configuration file for douban_demo
│   └── spiders          # Spider folder, where all spiders are stored
└── scrapy.cfg           # overall Scrapy configuration file, generated automatically by Scrapy

With a general understanding of the purpose of each file, let’s begin our crawler tour.

Writing the crawler

Use scrapy genspider douban douban.com to create a crawler file. The file will be placed under douban_demo/spiders.

PS: the usage of genspider is scrapy genspider [options] <name> <domain>

A douban.py file then appears under the spiders directory, with the following initial content:

# -*- coding: utf-8 -*-
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'                       # crawler name
    allowed_domains = ['douban.com']      # list of domain names allowed to be crawled
    start_urls = ['http://douban.com/']   # list of resource links to start crawling

    def parse(self, response):            # method that parses the data
        pass

Every Spider class must inherit from scrapy.Spider, and name, start_urls, and the parse method must be declared by each Spider class. More Spider attributes and methods can be found here.

Put the links you want to crawl into start_urls. We will use https://movie.douban.com/chart as the experimental target.

In DoubanSpider, change the value of start_urls to start_urls = ['https://movie.douban.com/chart'].
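
The spider should then look roughly like this (a sketch of just this change, everything else left as generated):

# -*- coding: utf-8 -*-
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/chart']   # replaced start URL

    def parse(self, response):
        pass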

Use shell to test the page

Scrapy also provides a shell command for testing page data extraction interactively, which is much more efficient than testing with Requests + PyQuery.

scrapy shell [url]

Type scrapy shell at the command line to enter shell mode.

Note: don’t be in a hurry to add a URL yet, because our target site checks the UA, and a 403 will appear if you fetch the test link directly. There is no restriction on which directory you run this command from.

The output is as follows:

(venv) ➜ douban_demo scrapy shell --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x106c5c550>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x108e18898>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser

To keep Douban from returning 403, we add the DEFAULT_REQUEST_HEADERS setting. It is a dictionary of request headers; whenever Scrapy detects this setting, it adds these values to the request headers.

Values are as follows:

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 '
                '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

To add the default request headers, type the following in the shell:

>>> settings.DEFAULT_REQUEST_HEADERS = {
...     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
...     'Accept-Language': 'en',
...     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 '
...                   '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
... }

Enter settings.DEFAULT_REQUEST_HEADERS again to check whether the headers were added successfully.

Once this is configured, we can use the fetch(url) command to fetch the page we need to test.

Type fetch('https://movie.douban.com/chart') and you will see output like this:

2019-06-03 23:06:13 [scrapy.core.engine] INFO: Spider opened
2019-06-03 23:06:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/robots.txt> (referer: None)
2019-06-03 23:06:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/chart> (referer: None)

Notice that before fetching the chart page, Scrapy first fetched the site’s robots.txt file; this is a good crawling habit. By default, every page Scrapy fetches follows the rules in robots.txt. If you don’t want to obey them, you can set ROBOTSTXT_OBEY = False in the settings.
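
For reference, this is roughly what the relevant line in settings.py would look like if you chose to ignore robots.txt (not needed for this demo):

# settings.py
# Set to False to stop Scrapy from downloading and obeying robots.txt
ROBOTSTXT_OBEY = False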

At this point you can use response.text to check whether we got the source of the whole page. All of Scrapy’s resource-parsing operations are integrated into the Response object; more information about Response is available here.
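
As an illustrative sanity check in the shell (the exact output depends on the page, so none is shown here):

>>> response.status                                # 200 if the fetch succeeded
>>> len(response.text)                             # length of the page source
>>> response.css('title::text').extract_first()    # the page title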

Analysis of the page

Movie Rankings page

Element inspection of the page

You can see that the data we need is inside table elements. Since the page contains multiple tables, we just need to iterate over them.

In the shell, response.css('table') returns all table elements. This article uses CSS selectors for element selection; you can switch to XPath if you prefer.

The information for each movie is in the tr.item under the table tag.

The movie’s detail link can be obtained with a.nbg::attr(href)

The movie image can be obtained with a.nbg > img::attr(src)

Handling the movie name is a little more complicated. As the image above shows, a movie may have several names: the main one is wrapped under div.pl2 > a, and the other names are wrapped under div.pl2 > a > span. We therefore need to do some formatting on the names, such as removing spaces and line breaks.

The selectors are div.pl2 > a::text and div.pl2 > a > span::text. Each may match several nodes; we only need the first, so we use the extract_first() method to take the content of the first Selector element and convert it to str.

The movie synopsis is available in p.pl::text
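
Putting the selectors above together, here is a rough sketch of what you might run in the shell after fetch('https://movie.douban.com/chart'); the selectors come from the analysis above, and Douban’s markup may of course change:

>>> item = response.css('table tr.item')[0]                  # first movie entry
>>> item.css('a.nbg::attr(href)').extract_first()            # detail link
>>> item.css('a.nbg > img::attr(src)').extract_first()       # poster image
>>> name = item.css('div.pl2 > a::text').extract_first()     # main name (first text node)
>>> name.replace('\n', '').replace(' ', '')                  # strip line breaks and spaces
>>> item.css('div.pl2 > a > span::text').extract_first()     # other names
>>> item.css('p.pl::text').extract_first()                   # synopsis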

Movie review page

Appending comments?status=P to the corresponding movie detail link takes you to the movie’s review page.

You can see that the review data consists of multiple comment-item blocks, and the review text is wrapped under div.comment, so the selectors for each field can be worked out with the same analysis as above. I won’t elaborate here.
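
As a rough sketch, the corresponding shell extraction (selector names taken from the analysis above) might look like this:

>>> comment = response.css('.comment-item')[0]                    # first review block
>>> comment.css('span.comment-info > a::text').extract_first()    # reviewer name
>>> comment.css('span.short::text').extract_first()               # review text
>>> response.css('a.next::attr(href)').extract_first()            # query string of the next page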

Implementation approach

  1. Create two parse methods, parse_rank and parse_comments: parse_rank handles the movie rankings page and parse_comments handles the review pages.

  2. Override the Spider’s start_requests method and fill in the url and callback attributes. Since we can’t get the review address until we have the detail links from the movie rankings page, the callback of the Request returned by start_requests should be self.parse_rank.

  3. Process the returned response in parse_rank, parse the data as described in “Analysis of the page”, and use yield to emit Requests for the review pages, with the callback attribute set to self.parse_comments.

  4. Process the returned review pages in the parse_comments method, yielding both the data and a Request for the next page.

Note: every parse method must return either an Item (for now, think of it as a data item) or a Request (for the next request). “Parse method” here does not refer specifically to the Spider class’s parse method; every parsing callback should return Items or Requests.

Code sample

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.request import Request


class DoubanSpider(scrapy.Spider):
    name = 'douban'

    def start_requests(self):
        yield Request(url='https://movie.douban.com/chart', callback=self.parse_rank)

    def parse_rank(self, response):
        for item in response.css('tr.item'):
            detail_url = item.css('a.nbg::attr(href)').extract_first()
            img_url = item.css('a.nbg > img::attr(src)').extract_first()
            main_name = item.css('div.pl2 > a::text').extract_first()
            other_name = item.css('div.pl2 > a > span::text').extract_first()
            brief = item.css('p.pl::text').extract_first()
            main_name = main_name.replace('\n', '').replace(' ', '')

            yield {
                'detail_url': detail_url,
                'img_url': img_url,
                'name': main_name+other_name,
                'brief': brief
            }

            yield Request(url=detail_url+'comments?status=P',
                          callback=self.parse_comments,
                          meta={'movie': main_name})

    def parse_comments(self, response):
        for comments in response.css('.comment-item'):
            username = comments.css('span.comment-info > a::text').extract_first()
            comment = comments.css('span.short::text').extract_first()

            yield {
                'movie': response.meta['movie'],
                'username': username,
                'comment': comment
            }
        nexturl = response.css('a.next::attr(href)').extract_first()
        if nexturl:
            yield Request(url=response.url[:response.url.find('?')]+nexturl,
                          callback=self.parse_comments,
                          meta=response.meta)


Start the crawler

With everything in place, you can type scrapy crawl douban in the douban_demo (top-level) directory and watch a large amount of log output scroll by, along with plenty of movie information and reviews.

So far we have completed a preliminary crawl of the Douban movie rankings and their reviews. Of course, Douban limits how many reviews non-logged-in users can see, detects crawler behavior, and so on; we will talk about these anti-crawling mechanisms another time.

Now the question is: I need to save the data, so how do I do that?

Scrapy provides Feed exports, which can save the output data as JSON, JSON lines, CSV, or XML.

To save the file as JSON, add -o xx.json to the end of the startup command.

scrapy crawl douban -o result.json

By default Scrapy’s JSON encoder escapes all data to ASCII, so we need to set the feed export encoding to UTF-8.

Just add FEED_EXPORT_ENCODING = 'utf-8' in settings.py.

The Chinese text in the data now displays correctly.

About 2000 pieces of data are generated at this point.

Summary

At this point we’ve completed a preliminary crawl of Douban movies and reviews. But even though we’ve done it, it feels like “I only wrote the code that parses the page and typed the command to start the crawler, and Scrapy did everything else, from requesting the pages to producing the data.” So we need to explore what Scrapy actually does when we run scrapy crawl douban -o result.json.

How does Scrapy work?

If you are interested in digging deeper into Scrapy, please save the diagram below; it is particularly important for understanding how Scrapy works.

When we type scrapy crawl douban -o result.json, Scrapy does the following:

  1. The Crawler receives the crawl command, finds the Spider whose name is douban, creates the Engine, and launches our DoubanSpider.

  2. When the DoubanSpider is created, the Engine detects the Spider’s request queue, which comes from either the start_urls attribute or the start_requests method. Both must be iterable, which explains why the start_requests method in our sample code uses yield. At this point Request objects are generated, and every Request object first passes through the Spider Middlewares. For now, just think of middleware as a bridge; we don’t need to dig into what sits on the bridge.

  3. The Request objects generated by the Spider are sent through the Engine to the Scheduler, which queues all Requests. When a Request can be scheduled, it reaches the Downloader through the Downloader Middlewares, and the Downloader asynchronously fetches the requested Internet resource.

  4. When the Downloader completes a Request, it wraps the resource into a Response, which contains information about the original Request, a ready-to-use parser, and so on. In our example, the Request yielded in parse_rank carries meta data, and that meta is still available in the response received by parse_comments.

  5. The Response travels back across the Downloader Middlewares, the Engine, and the Spider Middlewares to the corresponding Spider, which activates the corresponding callback and finally executes the code we wrote in the parse method. Whenever a parse method yields another Request, steps 3–5 repeat.

  6. When the Spider yields an Item, it goes through the Spider Middlewares again to the Item Pipeline. We haven’t configured any action on the Item Pipeline, so the Item simply passes through, the logger prints it, and, because we used the -o option, the exporter writes the item in the chosen format, giving us the result.json data set we asked for (see the pipeline sketch below).
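
For item 6 above: if we had wanted to act on Items, a minimal, hypothetical pipeline could look like the sketch below. DoubanDemoPipeline is simply the name startproject tends to generate for this project; it would also need to be enabled via ITEM_PIPELINES in settings.py.

# pipelines.py (illustrative sketch)
class DoubanDemoPipeline(object):
    def process_item(self, item, spider):
        # Clean, validate, or store the item here (e.g. write it to a database),
        # then return it so later pipelines and the exporter still see it.
        return item

# settings.py (illustrative)
ITEM_PIPELINES = {
    'douban_demo.pipelines.DoubanDemoPipeline': 300,   # lower number = runs earlier
}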

Conclusion

Now that we’ve covered how to use Scrapy to build a simple crawler and how Scrapy works, we’ll take a closer look at Scrapy’s other components and how to use them to get past anti-crawling measures.

If anything above is wrong, corrections are welcome!