Life is short. I use Python
Links to the previous articles in this series:
Learning Python crawlers (1): The beginning
Learning Python crawlers (2): Pre-preparation (1) Basic library installation
Learning Python crawlers (3): Pre-preparation (2) Linux basics
Learning Python crawlers (4): Pre-preparation (3) Docker basics
Learning Python crawlers (5): Pre-preparation (4) Database basics
Learning Python crawlers (6): Pre-preparation (5) Crawler framework installation
Learning Python crawlers (7): HTTP basics
Learning Python crawlers (8): Web basics
Learning Python crawlers (9): Crawler basics
Learning Python crawlers (10): Session and Cookies
Learning Python crawlers (11): Urllib
Learning Python crawlers (12): Urllib
Learning Python crawlers (13): Urllib
Learning Python crawlers (14): Urllib
Learning Python crawlers (15): Urllib
Learning Python crawlers (16): Urllib
Learning Python crawlers (17): Basic usage of Requests
Learning Python crawlers (18): Advanced Requests operations
Learning Python crawlers (19): Basic XPath operations
Learning Python crawlers (20): Advanced XPath
Learning Python crawlers (21): The Beautiful Soup parsing library
Learning Python crawlers (22): Beautiful Soup
Learning Python crawlers (23): Getting started with the pyQuery parsing library
Learning Python crawlers (24): The 2019 Douban movie rankings
Learning Python crawlers (25): Crawling stock information
Learning Python crawlers (26): You can't even afford a second-hand house in Shanghai
Learning Python crawlers (27): The Selenium automated testing framework, from getting started to giving up (part 1)
Learning Python crawlers (28): The Selenium automated testing framework, from getting started to giving up (part 2)
Learning Python crawlers (29): Using Selenium to get product information from a large e-commerce site
Learning Python crawlers (30): Proxy basics
Learning Python crawlers (31): Building a simple proxy pool yourself
Learning Python crawlers (32): Introduction to the asynchronous request library AIOHTTP
Learning Python crawlers (33): The Scrapy crawler framework
Introduction
In the previous article, we simply used the Spider to grab the information we needed and printed it to the console with print().
In a real crawling project, however, we need to save the data rather than just dump it to the console, so this article goes on to show how to save the information that the Spider scrapes.
Item
The main goal of scraping is to extract structured data from unstructured sources, typically web pages.
A Scrapy Spider can return the extracted data as plain Python dictionaries. Dictionaries, while convenient and familiar, lack structure: it is easy to mistype a field name or return inconsistent data, especially in a larger project with many spiders.
To define a common output data format, Scrapy provides the Item class. Item objects are simple containers for the scraped data; they provide a dictionary-like API with a convenient syntax for declaring their available fields.
Next, let’s create an Item.
An Item is created by subclassing scrapy.Item and declaring each field as a scrapy.Field.
In the previous article, the fields we wanted to extract were text, author, and tags.
So, we define the Item class as follows, where we directly modify the items.py file:
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
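As a quick illustration of that dictionary-like API, here is a throwaway snippet (assuming it is run from the project root so that first_scrapy is importable; it is not part of the project code): assigning to a declared field behaves exactly like a dict entry, while a misspelled field name raises a KeyError instead of silently creating a new key.
from first_scrapy.items import QuoteItem

item = QuoteItem()
item['text'] = 'Some quote'   # declared field: behaves like a dict entry
print(item['text'])
item['txet'] = 'oops'         # typo in the field name: raises KeyError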
We are going to use this Item in our first_scrapy project, modifying the QuotesSpider as follows:
import scrapy
from first_scrapy.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
The simplest way to save the data we just scraped to a JSON file is a single command:
scrapy crawl quotes -o quotes.json
After the command finishes, you can see that a file named quotes.json has been generated in the current directory:
Many other output formats are supported, such as CSV, XML, pickle, and marshal; Scrapy infers the format from the file extension. The corresponding commands look like this:
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
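As a side note, if you would rather not pass -o on every run, newer Scrapy releases (2.1 and later) can be configured to export automatically through the FEEDS setting; a minimal sketch, assuming such a version:
# settings.py -- FEEDS maps an output URI to its export options (Scrapy >= 2.1)
FEEDS = {
    'quotes.json': {'format': 'json'},
    'quotes.csv': {'format': 'csv'},
}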
At this point we have simply exported the scraped data to a JSON file. Is that the end of it?
Of course not. In the previous article we only scraped the content of the first page. What if we want the content of the subsequent pages as well?
The first step, of course, is to take a look at the link of the next page: http://quotes.toscrape.com/page/2.
Next, we need to construct a Request to access the next page, using scrapy.Request.
Here we simply pass in two arguments using Request(). There are actually many more arguments that can be passed in, but we’ll talk about that later.
- url: the URL of the request
- callback: the callback function. When the request completes and a response is received, the engine passes the response as an argument to this callback function.
So what we need to do is extract the link to the next page with a selector, build a Request for it with scrapy.Request, and let the same parsing logic handle the new page, kicking off another round of scraping.
Add the following code:
next = response.css('.pager .next a::attr("href")').extract_first()
url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse)
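One caveat: on the last page there is no 'next' link, so extract_first() returns None, response.urljoin(None) falls back to the current page's URL, and we end up relying on Scrapy's duplicate filter to drop the resulting request. A slightly more defensive sketch of the same logic (using next_href instead of next, which shadows a Python built-in) could look like this:
next_href = response.css('.pager .next a::attr("href")').extract_first()
if next_href is not None:
    url = response.urljoin(next_href)
    yield scrapy.Request(url=url, callback=self.parse)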
The overall code for the Spider class is now as follows:
# -*- coding: utf-8 -*-
import scrapy
from first_scrapy.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        next = response.css('.pager .next a::attr("href")').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)
Run the Spider again with the same command, and the result is as follows (note: if quotes.json was generated by the earlier run, delete it first, otherwise the new output will simply be appended to it):
As you can see, the data has increased a lot, indicating that we have successfully captured the data of the subsequent pages.
Is this the end of it? Of course not. So far we have only saved the data to a JSON file, which is not convenient to query; next, let's store the data in a database instead.
Item Pipeline
When we want to store the data in a database, we can use an Item Pipeline (project pipeline).
Typical uses of an Item Pipeline are:
- Cleaning HTML data
- Validating the scraped data and checking that the expected fields are present
- Checking for duplicates and discarding them
- Storing the scraped results in a database
In this example we choose MongoDB as the storage, so next we will save the data scraped above into MongoDB.
If you are wondering how to install MongoDB: the short answer is to run it with Docker, which only takes a few commands:
docker images
docker run -p 27017:27017 -td mongo
docker ps
If nothing goes wrong, these few commands are all it takes. As a client you can use Navicat to connect to MongoDB.
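If you would rather verify the connection from Python instead, a minimal check (assuming pymongo is installed and the container above exposes the default port 27017) could be:
import pymongo

# connect to the MongoDB instance running in the Docker container
client = pymongo.MongoClient('localhost', 27017)
# print the server version to confirm the connection works
print(client.server_info()['version'])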
First, let's define a TextPipeline class in pipelines.py that performs a simple sanity check on the scraped items:
# -*- coding: utf-8 -*-
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def process_item(self, item, spider):
        if item['text']:
            return item
        else:
            raise DropItem('Missing Text')
Here we implement the process_item() method, whose arguments are item and spider.
It simply checks whether the text field of the current item exists: if it is missing, a DropItem exception is raised; if it is present, the item is returned.
Next, to store the processed items in MongoDB, we define another pipeline. Still in pipelines.py, we implement another class, MongoPipeline, as follows:
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # open the database connection when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # use the item class name (QuoteItem) as the collection name
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.client.close()
The MongoPipeline class implements several other methods defined by the Item Pipeline API:
- from_crawler: a class method, marked with @classmethod, that acts as a form of dependency injection. Its parameter is crawler, through which we can access the global configuration. We define MONGO_URI and MONGO_DB in the global settings.py to specify the MongoDB connection address and database name; after reading them, the method returns an instance of the class. In other words, this method exists mainly to pick up the configuration from settings.py.
- open_spider: called when the Spider is opened; this is where initialization, such as opening the database connection, happens.
- close_spider: called when the Spider is closed; this is where we close the database connection. The main method, process_item(), performs the actual data insertion.
With the TextPipeline and MongoPipeline classes defined, we need to enable them in settings.py, and the MongoDB connection information has to be defined there as well.
Add the following to settings.py (the integer assigned to each pipeline is its priority: the lower the value, the earlier that pipeline runs):
ITEM_PIPELINES = {
    'first_scrapy.pipelines.TextPipeline': 300,
    'first_scrapy.pipelines.MongoPipeline': 400,
}

MONGO_URI = 'localhost'
MONGO_DB = 'first_scrapy'
Execute the crawl command again:
scrapy crawl quotes
The result is as follows:
As you can see, a collection named QuoteItem has been created in MongoDB, holding the data we just scraped.
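If you prefer to check the result from Python rather than a GUI client, a quick query sketch (assuming the MONGO_URI and MONGO_DB values from settings.py above):
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['first_scrapy']
# print a few of the stored quotes from the QuoteItem collection
for doc in db['QuoteItem'].find().limit(3):
    print(doc['author'], '-', doc['text'])
print('total documents:', db['QuoteItem'].count_documents({}))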
Sample code
All of the code in this series will be available on GitHub and Gitee.
Example code - GitHub
Example code - Gitee
Reference
https://docs.scrapy.org/en/latest/topics/request-response.html
https://docs.scrapy.org/en/latest/topics/items.html
https://cuiqingcai.com/8337.html