Life is short. I use Python
Links to the previous articles in this series:
Learning Python crawlers (1): The beginning
Learning Python crawlers (2): Pre-preparation (1) Basic library installation
Learning Python crawlers (3): Pre-preparation (2) Linux basics
Learning Python crawlers (4): Pre-preparation (3) Docker basics
Learning Python crawlers (5): Pre-preparation (4) Database basics
Learning Python crawlers (6): Pre-preparation (5) Crawler framework installation
Learning Python crawlers (7): HTTP basics
Learning Python crawlers (8): Web basics
Learning Python crawlers (9): Crawler basics
Learning Python crawlers (10): Session and Cookies
Learning Python crawlers (11): Urllib
Learning Python crawlers (12): Urllib
Learning Python crawlers (13): Urllib
Learning Python crawlers (14): Urllib
Learning Python crawlers (15): Urllib
Learning Python crawlers (16): Urllib
Learning Python crawlers (17): Basic usage of Requests
Learning Python crawlers (18): Advanced Requests operations
Learning Python crawlers (19): Basic XPath operations
Learning Python crawlers (20): Advanced XPath
Learning Python crawlers (21): The Beautiful Soup parsing library
Learning Python crawlers (22): Beautiful Soup
Learning Python crawlers (23): Getting started with the pyQuery parsing library
Learning Python crawlers (24): The 2019 Douban movie rankings
Learning Python crawlers (25): Crawling stock information
Learning Python crawlers (26): You can't even afford a second-hand house in Shanghai
Learning Python crawlers (27): The Selenium automated testing framework, from getting started to giving up (part 1)
Learning Python crawlers (28): The Selenium automated testing framework, from getting started to giving up (part 2)
Learning Python crawlers (29): Using Selenium to get product information from a large e-commerce site
Learning Python crawlers (30): Proxy basics
Learning Python crawlers (31): Building a simple proxy pool yourself
Learning Python crawlers (32): Introduction to the asynchronous request library AIOHTTP
Learning Python crawlers (33): The Scrapy crawler framework
Introduction
In the previous article, we simply used the Spider to grab the information we needed and printed it to the console with print().
In a real crawling project, however, we need to save the data rather than just dump it to the console, so this article goes on to show how to save the information that the Spider scrapes.
Item
The main goal of scraping is to extract structured data from unstructured sources, typically web pages.
A Scrapy Spider can return the extracted data as plain Python dictionaries. Dictionaries, while convenient and familiar, lack structure: it is easy to mistype a field name or return inconsistent data, especially in a larger project with many spiders.
To define a common output data format, Scrapy provides the Item class. Item objects are simple containers for the scraped data; they provide a dictionary-like API with a convenient syntax for declaring their available fields.
Next, let’s create an Item.
An Item is created by subclassing scrapy.Item and declaring each field as a scrapy.Field.
In the previous article, the fields we wanted to extract were text, author, and tags.
So, we define the Item class as follows, where we directly modify the items.py file:
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
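As a quick illustration of that dictionary-like API, here is a throwaway snippet (assuming it is run from the project root so that first_scrapy is importable; it is not part of the project code): assigning to a declared field behaves exactly like a dict entry, while a misspelled field name raises a KeyError instead of silently creating a new key.
from first_scrapy.items import QuoteItem

item = QuoteItem()
item['text'] = 'Some quote'   # declared field: behaves like a dict entry
print(item['text'])
item['txet'] = 'oops'         # typo in the field name: raises KeyError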
We are going to use this Item in our first_scrapy project, modifying the QuotesSpider as follows:
import scrapy
from first_scrapy.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
The simplest way to save the data we just scraped to a JSON file is a single command:
scrapy crawl quotes -o quotes.json
After the command finishes, you can see that a file named quotes.json has been generated in the current directory:
Many other output formats are supported, such as CSV, XML, pickle, and marshal; Scrapy infers the format from the file extension. The corresponding commands look like this:
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
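As a side note, if you would rather not pass -o on every run, newer Scrapy releases (2.1 and later) can be configured to export automatically through the FEEDS setting; a minimal sketch, assuming such a version:
# settings.py -- FEEDS maps an output URI to its export options (Scrapy >= 2.1)
FEEDS = {
    'quotes.json': {'format': 'json'},
    'quotes.csv': {'format': 'csv'},
}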
At this point we have simply exported the scraped data to a JSON file. Is that the end of it?
Of course not. In the previous article we only scraped the content of the first page. What if we want the content of the subsequent pages as well?
The first step, of course, is to take a look at the link of the next page: http://quotes.toscrape.com/page/2.
Next, we need to construct a Request to access the next page, using scrapy.Request.
Here we simply pass in two arguments using Request(). There are actually many more arguments that can be passed in, but we’ll talk about that later.
- url: the URL of the request
- callback: the callback function. When the request completes and a response is received, the engine passes the response as an argument to this callback function.
So what we need to do is extract the link to the next page with a selector, build a Request for it with scrapy.Request, and let the same parsing logic handle the new page, kicking off another round of scraping.
Add the following code:
next = response.css('.pager .next a::attr("href")').extract_first()
url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse)
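One caveat: on the last page there is no 'next' link, so extract_first() returns None, response.urljoin(None) falls back to the current page's URL, and we end up relying on Scrapy's duplicate filter to drop the resulting request. A slightly more defensive sketch of the same logic (using next_href instead of next, which shadows a Python built-in) could look like this:
next_href = response.css('.pager .next a::attr("href")').extract_first()
if next_href is not None:
    url = response.urljoin(next_href)
    yield scrapy.Request(url=url, callback=self.parse)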
The overall code for the Spider class is now as follows:
# -*- coding: utf-8 -*-
import scrapy
from first_scrapy.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        next = response.css('.pager .next a::attr("href")').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)
Run the Spider again with the same command, and the result is as follows (note: if quotes.json was generated by the earlier run, delete it first, otherwise the new output will simply be appended to it):
As you can see, the data has increased a lot, indicating that we have successfully captured the data of the subsequent pages.
Is this the end of it? Of course not. So far we have only saved the data to a JSON file, which is not convenient to query; next, let's store the data in a database instead.
Item Pipeline
When we want to store the data in a database, we can use an Item Pipeline (project pipeline).
Typical uses of an Item Pipeline are:
- Cleaning HTML data
- Validating the scraped data and checking that the expected fields are present
- Checking for duplicates and discarding them
- Storing the scraped results in a database
In this example we choose MongoDB as the storage, so next we will save the data scraped above into MongoDB.
If you are wondering how to install MongoDB: the short answer is to run it with Docker, which only takes a few commands:
docker images
docker run -p 27017:27017 -td mongo
docker ps
If nothing goes wrong, these few commands are all it takes. As a client you can use Navicat to connect to MongoDB.
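If you would rather verify the connection from Python instead, a minimal check (assuming pymongo is installed and the container above exposes the default port 27017) could be:
import pymongo

# connect to the MongoDB instance running in the Docker container
client = pymongo.MongoClient('localhost', 27017)
# print the server version to confirm the connection works
print(client.server_info()['version'])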
First, let's define a TextPipeline class in pipelines.py that performs a simple sanity check on the scraped items:
# -*- coding: utf-8 -*-
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def process_item(self, item, spider):
        if item['text']:
            return item
        else:
            raise DropItem('Missing Text')
Here we implement the process_item() method, whose arguments are item and spider.
It simply checks whether the text field of the current item exists: if it is missing, a DropItem exception is raised; if it is present, the item is returned.
Next, to store the processed items in MongoDB, we define another pipeline. Still in pipelines.py, we implement another class, MongoPipeline, as follows:
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # open the database connection when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # use the item class name (QuoteItem) as the collection name
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.client.close()
The MongoPipeline class implements several other methods defined by the Item Pipeline API:
- from_crawler: a class method, marked with @classmethod, that acts as a form of dependency injection. Its parameter is crawler, through which we can access the global configuration. We define MONGO_URI and MONGO_DB in the global settings.py to specify the MongoDB connection address and database name; after reading them, the method returns an instance of the class. In other words, this method exists mainly to pick up the configuration from settings.py.
- open_spider: called when the Spider is opened; this is where initialization, such as opening the database connection, happens.
- close_spider: called when the Spider is closed; this is where we close the database connection. The main method, process_item(), performs the actual data insertion.
With the TextPipeline and MongoPipeline classes defined, we need to enable them in settings.py, and the MongoDB connection information has to be defined there as well.
Add the following to settings.py (the integer assigned to each pipeline is its priority: the lower the value, the earlier that pipeline runs):
ITEM_PIPELINES = {
    'first_scrapy.pipelines.TextPipeline': 300,
    'first_scrapy.pipelines.MongoPipeline': 400,
}

MONGO_URI = 'localhost'
MONGO_DB = 'first_scrapy'
Execute the crawl command again:
scrapy crawl quotes
The result is as follows:
As you can see, a collection named QuoteItem has been created in MongoDB, holding the data we just scraped.
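If you prefer to check the result from Python rather than a GUI client, a quick query sketch (assuming the MONGO_URI and MONGO_DB values from settings.py above):
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['first_scrapy']
# print a few of the stored quotes from the QuoteItem collection
for doc in db['QuoteItem'].find().limit(3):
    print(doc['author'], '-', doc['text'])
print('total documents:', db['QuoteItem'].count_documents({}))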
Sample code
All of the code in this series will be available on GitHub and Gitee.
Example code - GitHub
Example code - Gitee
Reference
https://docs.scrapy.org/en/latest/topics/request-response.html
https://docs.scrapy.org/en/latest/topics/items.html
https://cuiqingcai.com/8337.html