Scrapy

Scrapy is an application framework designed to crawl websites and extract structured data. It can be used for a wide range of purposes, including data mining, information processing, and archiving historical data. Although it was originally designed for web scraping, it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

II. Architecture Overview

1. Scrapy Engine

The engine is responsible for controlling the flow of data through all the components of the system and for triggering events when certain actions occur. See the Data Flow section below for details. This component is the "brain" of the crawler and the scheduling center of the whole system.

2. Scheduler

The scheduler accepts requests from the engine and enqueues them so that they can be handed back to the engine later, when the engine asks for them.

The initial URLs to crawl, as well as subsequent URLs extracted from crawled pages, are placed in the scheduler to wait their turn. The scheduler also filters out duplicate URLs automatically (this can be disabled for specific requests that should not be de-duplicated, such as POST requests, by setting dont_filter=True on the Request).

3. Downloader

The downloader fetches pages and feeds the page data to the engine, which in turn passes it on to the Spider.

4. Spiders

A Spider is a class that Scrapy users write to analyze responses and extract items (that is, scraped data) or additional URLs to follow. Each spider is responsible for a specific site (or set of sites).

5. Item Pipeline

The Item Pipeline processes the items extracted by the spider. Typical tasks are cleaning, validation, and persistence (such as storing the item in a database).

Once the data a spider needs from a page has been parsed and stored in an Item, the item is sent to the Item Pipeline, passes through its components in a defined order, and is finally stored in a local file or a database.
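
As a rough sketch of what such a pipeline component can look like (the items.jl file name and the validation rule here are illustrative assumptions, not taken from any project in this article), and remembering that a pipeline only runs if it is enabled under ITEM_PIPELINES in settings.py:

    import json

    from scrapy.exceptions import DropItem


    class QuotesPipeline:
        """Minimal pipeline sketch: validate each item, then persist it as JSON Lines."""

        def open_spider(self, spider):
            # Hypothetical output file, opened once when the crawl starts.
            self.file = open("items.jl", "w", encoding="utf-8")

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            if not item.get("text"):
                # Validation: drop items that are missing the field we care about.
                raise DropItem("missing quote text")
            # Persistence: write one JSON record per line.
            self.file.write(json.dumps(dict(item)) + "\n")
            return item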

6. Downloader Middlewares

Downloader middlewares are specific hooks that sit between the engine and the downloader and process the requests the engine sends to the downloader as well as the responses the downloader passes back to the engine. They provide an easy mechanism for extending Scrapy functionality by plugging in custom code; for example, a downloader middleware can make the crawler rotate its user agent or proxy IP automatically, as sketched below.
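
A minimal sketch of such a middleware, assuming a hypothetical RandomUserAgentMiddleware class with a made-up list of user agents; it would still need to be enabled under DOWNLOADER_MIDDLEWARES in settings.py:

    import random


    class RandomUserAgentMiddleware:
        """Downloader middleware sketch that picks a random User-Agent for every request."""

        USER_AGENTS = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Mozilla/5.0 (X11; Linux x86_64)",
        ]

        def process_request(self, request, spider):
            # Runs as the request travels from the engine to the downloader.
            request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
            return None  # None means: continue processing this request normally.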

7. Spider Middlewares

Spider middlewares are specific hooks that sit between the engine and the Spider and process the Spider's input (responses) and output (items and requests). They provide an easy mechanism for extending Scrapy functionality by plugging in custom code.

8. Data Flow

1) The engine opens a website (opens a domain), finds the Spider that handles that site, and asks the Spider for the first URL(s) to crawl.

2) The engine gets the first URL to crawl from the Spider and schedules it as a Request with the Scheduler.

3) The engine asks the scheduler for the next URL to crawl.

4) The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the Downloader through the downloader middleware (request direction).

5) Once the page has been downloaded, the Downloader generates a Response for the page and sends it to the engine through the downloader middleware (response direction).

6) The engine receives the Response from the Downloader and sends it to the Spider through the spider middleware (input direction) for processing.

7) The Spider processes the Response and returns the extracted Item(s) and new follow-up Request(s) to the engine.

8) The engine sends the crawled Item(s) to the Item Pipeline and the Request(s) to the scheduler.

9) The process repeats (from step 2) until there are no more requests in the scheduler, at which point the engine closes the site.

Create a project

Before you can start scraping, you must create a new Scrapy project. Enter the directory where you want to store the code and run the command shown below; a tutorial directory will be created with the contents listed underneath it.
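
The command and the directory layout it generates in recent Scrapy versions look like this (the exact set of files may differ slightly between versions):

    scrapy startproject tutorial

    tutorial/
        scrapy.cfg            # deploy configuration file
        tutorial/             # project's Python module
            __init__.py
            items.py          # project items definition file
            middlewares.py    # project middlewares file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # directory where your spiders live
                __init__.py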


Create the first spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, and, optionally, how to follow links in the pages and how to parse the downloaded page content to extract data.

Here is the code for our first spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory of the project:
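
A version along the lines of the official Scrapy tutorial; it downloads two pages of the quotes.toscrape.com demo site and saves them to disk:

    from pathlib import Path

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                "https://quotes.toscrape.com/page/1/",
                "https://quotes.toscrape.com/page/2/",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Save each downloaded page to a local file such as quotes-1.html.
            page = response.url.split("/")[-2]
            filename = f"quotes-{page}.html"
            Path(filename).write_bytes(response.body)
            self.log(f"Saved file {filename}")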


Our spider subclasses scrapy.Spider and defines some attributes and methods:

name: identifies the spider. It must be unique within a project; that is, you cannot use the same name for different spiders.

start_requests(): must return an iterable of Requests (either a list of requests or a generator function) from which the spider will start crawling. Subsequent requests are generated successively from these initial requests.

parse(): the method that will be called to process the response downloaded for each request. The response parameter is an instance of TextResponse, which holds the page content and has further useful methods for handling it.

The parse() method typically parses the response, extracts the scraped data into dictionaries, and also looks for new URLs to follow, creating new Requests for them.

Now run the spider you created. The following command runs the quotes spider we just added, which will send some requests to the quotes.toscrape.com domain and print crawl logs to the console:
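
From the top-level directory of the project (the log output is omitted here):

    scrapy crawl quotes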

Now, check the files in the current directory. You should notice that two new files have been created, quotes-1.html and quotes-2.html, containing the content of the respective URLs, as our parse() method instructs.


V. Extract Data

The best way to learn how Scrapy extracts data is to try selectors in the Scrapy shell:
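
For example, against the first page of the demo site used in this tutorial:

    scrapy shell "https://quotes.toscrape.com/page/1/"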

Using the shell, you can try selecting elements with CSS on the response object:
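
For example (the way the shell prints selectors varies slightly between Scrapy versions):

    >>> response.css("title")
    [<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]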

To extract the text from the title above, do the following:
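
Roughly as in the official tutorial:

    >>> response.css("title::text").getall()
    ['Quotes to Scrape']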


There are two things to note here. One is that we added ::text to the CSS query, which means we want to select only the text elements directly inside the <title> element. If we do not specify ::text, we get the full title element, including its tags:
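
For instance:

    >>> response.css("title").getall()
    ['<title>Quotes to Scrape</title>']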


The other thing is that the result of calling .getall() is a list: a selector may return multiple results, so we extract all of them. When you only want the first result, as in this case, you can write:
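
Along the lines of the tutorial:

    >>> response.css("title::text").get()
    'Quotes to Scrape'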


As an alternative, you could write:
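
That alternative indexes into the SelectorList first and gives the same result:

    >>> response.css("title::text")[0].get()
    'Quotes to Scrape'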


However, using .get() directly on a SelectorList instance avoids an IndexError and returns None when it cannot find any element matching the selection.

In addition to the getall() and get() methods, you can also use the re() method to extract data with regular expressions:
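
The following examples mirror the official tutorial:

    >>> response.css("title::text").re(r"Quotes.*")
    ['Quotes to Scrape']
    >>> response.css("title::text").re(r"Q\w+")
    ['Quotes']
    >>> response.css("title::text").re(r"(\w+) to (\w+)")
    ['Quotes', 'Scrape']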


In addition to CSS, Scrapy selectors also support XPath expressions:
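
For example (again, the printed selector representation depends on the Scrapy version):

    >>> response.xpath("//title")
    [<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath("//title/text()").get()
    'Quotes to Scrape'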


XPath expressions are very powerful, and they are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood. XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath you can, for example, select the link that contains the text "Next Page", which makes XPath very well suited to scraping tasks.
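
As an illustrative sketch (not part of the tutorial), a query of this shape selects anchors by their visible text, which plain CSS cannot do:

    >>> response.xpath('//a[contains(text(), "Next")]')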

Extract quotes and authors

Now write the code to extract the quotes from the web page and complete the spider. Each quote on quotes.toscrape.com is represented by an HTML element that looks like this:
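
An abridged version of the markup served by the site (some attributes and tags trimmed):

    <div class="quote">
        <span class="text">“The world as we have created it is a process of our
        thinking. It cannot be changed without changing our thinking.”</span>
        <span>
            by <small class="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>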

Open the Scrapy shell and extract the required data:
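
Launch the shell against the site and query the quote elements (output elided):

    scrapy shell "https://quotes.toscrape.com"

    >>> response.css("div.quote")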

We get back a list of selectors for the quote HTML elements.

Each of the selectors returned by the query above allows us to run further queries over its child elements. Assign the first selector to a variable so that we can run CSS selectors directly on a specific quote:
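
For example:

    >>> quote = response.css("div.quote")[0]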


Now extract the text, author, and tags from that quote; the tags are a list of strings, so we can use the .getall() method to get all of them:
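
The values shown below correspond to the first quote on the demo site:

    >>> text = quote.css("span.text::text").get()
    >>> text
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    >>> author = quote.css("small.author::text").get()
    >>> author
    'Albert Einstein'
    >>> tags = quote.css("div.tags a.tag::text").getall()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']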

Having figured out how to extract each bit, we can iterate over all the quote elements and put them together into Python dictionaries:
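
Still in the shell, something like:

    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text::text").get()
    ...     author = quote.css("small.author::text").get()
    ...     tags = quote.css("div.tags a.tag::text").getall()
    ...     print(dict(text=text, author=author, tags=tags))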

To extract the data in our spider, we use the yield Python keyword in the callback, as follows:
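
A version along the lines of the official tutorial, yielding one dict per quote:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }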

The easiest way to store the scraped data is by using Feed exports, with the following command:
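
In recent Scrapy versions (2.1 and later), the overwriting variant of the export looks like this:

    scrapy crawl quotes -O quotes.json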

This will generate a quotes.json file containing all scraped items, serialized in JSON.

The -O command-line switch overwrites any existing file; use -o instead to append new content to an existing file. However, appending to a JSON file makes its contents invalid JSON. When appending to a file, consider using a different serialization format, such as JSON Lines:
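
For example:

    scrapy crawl quotes -o quotes.jsonl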

JSON Lines is like a stream: you can easily append new records to it, and it does not suffer from the JSON problem described above when you run the crawl twice. Also, since each record is a single line, you can process large files without having to fit everything in memory; there are tools such as JQ that help you do this from the command line.

Follow links

Next, let's scrape quotes from all the pages of the site instead of just the first ones. The first thing to do is extract the link to the page you want to follow. Examining the page, you can see there is a link to the next page with the following markup:
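
The relevant markup on the demo site looks roughly like this:

    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
        </li>
    </ul>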

Extract it in the shell:
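
Something like:

    >>> response.css("li.next a").get()
    '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'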

This gets the anchor element, but we want its href attribute. For that, Scrapy supports a CSS extension that lets you select attribute contents, as shown below:
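
For example:

    >>> response.css("li.next a::attr(href)").get()
    '/page/2/'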


There is also an attrib property available:
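
For example:

    >>> response.css("li.next a").attrib["href"]
    '/page/2/'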


Now the spider is modified to recursively follow the link to the next page and extract data from it:
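
A version along the lines of the official tutorial:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/page/1/"]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }

            # Follow the pagination link, if there is one.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)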

After extracting the data, the parse() method looks for the link to the next page, builds an absolute URL with the urljoin() method (since the link can be relative), and yields a new request to the next page, registering itself as the callback to handle the data extraction for that page. This keeps the crawl going through all the pages.

This is Scrapy's mechanism for following links: when you yield a Request in a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when the request finishes.

It allows you to build complex crawlers that follow links according to defined rules and extract different types of data based on the pages they visit.

In the example, we create a sort of loop that follows all the links to the next page until none is found, which is handy for crawling blogs, forums, and other sites with pagination.

A shortcut for creating Requests

As a shortcut for creating Request objects, you can use response.follow:
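
The end of the parse() method above can then be rewritten roughly as:

    next_page = response.css("li.next a::attr(href)").get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)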


Unlike scrapy.Request, response.follow supports relative URLs directly; there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield that Request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:
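
For example:

    for href in response.css("ul.pager a::attr(href)"):
        yield response.follow(href, callback=self.parse)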


For <a> elements there is a shortcut: response.follow uses their href attribute automatically, so the code can be shortened further:
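
For example:

    for a in response.css("ul.pager a"):
        yield response.follow(a, callback=self.parse)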


To create multiple requests from an iterable, use response.follow_all instead:
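
For example (response.follow_all is available in Scrapy 2.0 and later):

    anchors = response.css("ul.pager a")
    yield from response.follow_all(anchors, callback=self.parse)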


It can be shortened even further:
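
For example:

    yield from response.follow_all(css="ul.pager a", callback=self.parse)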

Here is another spider that demonstrates callbacks and following links, this time scraping author information:
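
A version along the lines of the official tutorial:

    import scrapy


    class AuthorSpider(scrapy.Spider):
        name = "author"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Follow the "(about)" link next to each author name.
            author_page_links = response.css(".author + a")
            yield from response.follow_all(author_page_links, self.parse_author)

            # Follow the pagination links with the same parse callback.
            pagination_links = response.css("li.next a")
            yield from response.follow_all(pagination_links, self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                return response.css(query).get(default="").strip()

            yield {
                "name": extract_with_css("h3.author-title::text"),
                "birthdate": extract_with_css(".author-born-date::text"),
                "bio": extract_with_css(".author-description::text"),
            }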

This spider starts from the main page, follows all the links to author pages, calling the parse_author callback for each of them, and also follows the pagination links with the parse callback, as we saw before. Here we pass callbacks to response.follow_all as positional arguments to make the code shorter; this also works for response.follow.

The parse_author callback defines a helper function to extract and clean the data from a CSS query and yields a Python dict with the author data.

Another interesting thing this spider demonstrates is that, even though there are many quotes by the same author, we do not need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hitting the server too often because of a programming mistake. This behavior can be configured with the DUPEFILTER_CLASS setting.

Use spider parameters

You can provide command-line arguments to your spiders by running them with the -a option. These arguments are passed to the spider's __init__ method and become spider attributes by default:
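
For example, in recent Scrapy versions:

    scrapy crawl quotes -O quotes-humor.json -a tag=humor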

In this case, the value provided for the tag argument is available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL from the argument:
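
A version along the lines of the official tutorial:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            url = "https://quotes.toscrape.com/"
            # The tag attribute is set automatically when -a tag=... is passed.
            tag = getattr(self, "tag", None)
            if tag is not None:
                url = url + "tag/" + tag
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, self.parse)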

If you pass tag=humor to this spider, you will notice that it only visits URLs for the humor tag, such as https://quotes.toscrape.com/tag/humor.