Introduction to Scrapy

Scrapy is a Python-based crawling framework that provides a complete set of crawler building blocks, including cookie handling, Referer spoofing, Ajax simulation, and support for proxies, retries, and other common crawling needs. It lets you focus on the data itself instead of the underlying plumbing.

This post collects a few tips for quick reference and a quick start.

Creating a project

Since scrapy is a framework, we need to create the project folder first

scrapy startproject tutorial 

The last argument is the project name. It is best to create a separate project per website, because a project shares one set of configuration: the data export format, which middlewares are enabled, and so on. If spiders for different websites live in the same project, that shared configuration has to be tweaked back and forth, which quickly gets confusing. A single project can, however, contain multiple spiders, such as a spider for lists, a spider for details, a spider for tags, and so on; the shared settings live in settings.py, as sketched below.
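A minimal sketch of what such shared configuration looks like (the option names are standard Scrapy settings; the values here are only illustrative):

# settings.py (excerpt) -- illustrative values
BOT_NAME = 'tutorial'

ROBOTSTXT_OBEY = True           # respect robots.txt
COOKIES_ENABLED = True          # keep cookies across requests
RETRY_ENABLED = True
RETRY_TIMES = 2                 # retry failed requests up to twice
FEED_EXPORT_ENCODING = 'utf-8'  # encoding used when exporting data

# enable a downloader middleware from the generated middlewares.py
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
}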

For reference, the startproject command above generates a layout like this:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # the project pipeline classes
        settings.py       # project settings
        spiders/          # place the various spiders here
            __init__.py
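Because one project can host several spiders, additional ones can be scaffolded with scrapy genspider (the spider names below are made up for illustration):

cd tutorial
scrapy genspider quotes_list quotes.toscrape.com      # a spider for list pages
scrapy genspider quotes_detail quotes.toscrape.com    # a spider for detail pages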

The first example

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

This example, taken from the official documentation, shows the two basic parts of a crawler: sending requests and parsing the data, which correspond to start_requests and parse. These two pieces are what I meant by Scrapy letting you focus on the data itself, and the improvements that follow all revolve around them.

Another key attribute is name, which represents the name of the crawler and is the unique identifier of the crawler. We’ll need it when we start the crawler.

scrapy crawl quotes

This command runs the crawler. start_requests initiates the crawl and must yield Request objects; each Request specifies a callback, here parse. parse receives the response object, which contains all of the response information. In this example we simply save the returned data as HTML without further processing. The next section shows how to extract the information we need from the response.
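Callbacks do not have to stop at a single parse method: a list spider can hand each detail page off to a second callback. A minimal sketch, where the spider name, callback names, and CSS selectors are assumptions for illustration against quotes.toscrape.com:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = "authors"  # illustrative name

    def start_requests(self):
        yield scrapy.Request('http://quotes.toscrape.com/', callback=self.parse_list)

    def parse_list(self, response):
        # follow every author link with a different callback
        for href in response.css('.author + a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # the author page title (assumed selector)
        yield {'author': response.css('h3.author-title::text').get()}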

Selectors

The official documentation suggests learning data extraction interactively in shell mode:

scrapy shell 'http://quotes.toscrape.com/page/1/'

This lets you write code interactively in the shell and see the results of each extraction immediately. Let's start with CSS selectors, since anyone with a little front-end knowledge already knows their syntax.

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
>>> quote = response.css("div.quote")[0]
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
>>> response.css('li.next a::attr(href)').get()
'/page/2/'

Scrapy extends CSS selectors with two pseudo-elements: ::text extracts a tag's text and ::attr(name) extracts one of its attributes. Attributes can also be obtained this way:

>>> response.css('li.next a').attrib['href']
'/page/2/'

The get() method can also take a default value:

response.css(query).get(default='').strip()

CSS selectors make it easy to locate the data we need. Scrapy also provides XPath selectors, another classic, though in practice they seem to be needed far less often than CSS, so anyone interested can pick up the syntax on their own.
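For reference, the earlier CSS queries translate into XPath roughly like this (a minimal sketch against the same page):

>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
>>> response.xpath('//li[@class="next"]/a/@href').get()
'/page/2/'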

Returning data and pagination

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

The parse method yields a dictionary to return data, and we can direct the output to a file with the -o argument when running the crawler. The format is inferred from the file extension and can be JSON Lines, JSON, CSV, or XML. The difference between JSON and JSON Lines is that JSON Lines writes one JSON object per line, while JSON outputs a single JSON array containing all the results.

scrapy crawl quotes -o quotes.json
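The same crawl can be exported in the other formats simply by changing the extension, for example:

scrapy crawl quotes -o quotes.jl   # JSON Lines, one object per line
scrapy crawl quotes -o quotes.csv  # CSV
scrapy crawl quotes -o quotes.xml  # XML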

To keep crawling the next page, parse can also yield a request for it. Pay attention to relative paths, though: the URL obtained in the example above is relative, so response.urljoin is used to convert it into an absolute URL before requesting it.
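Scrapy also offers response.follow, which accepts relative URLs directly, so the pagination step can be written without urljoin; a minimal sketch of the same tail of parse:

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # response.follow resolves the relative URL against response.url
            yield response.follow(next_page, callback=self.parse)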

Conclusion

We can now crawl the data from a paginated list and save it to a file. For a site with no anti-crawling measures in place, this is already enough.

Reference


Scrapy official documentation: https://docs.scrapy.org/