This post is a systematic review of how Scrapy saves data, that is, of its exporters.
Exporting data with exporters
It is easy to write a crawler and save its results quickly. Just add the following option on the command line when running the crawler:
scrapy crawl <spider_name> -o <output_file>
In Scrapy, the component that exports data is called an exporter, and it ships with six built-in export formats: JSON, JSON lines, CSV, XML, Pickle, and Marshal.
These built-in formats generally cover the needs of most scenarios. To save a file, pass -o <file name> on the command line; Scrapy picks the storage format from the file name's suffix.
You can also select the format explicitly with -t <file type>. For example, run the following command:
scrapy crawl <spider_name> -t csv -o mydata.data
default_settings.py is Scrapy's global default settings file. In it you can see the six built-in exporters, all provided by the scrapy.exporters module.
FEED_EXPORTERS = {}
FEED_EXPORTERS_BASE = {
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
FEED_EXPORT_INDENT = 0
Because the export formats are configurable, we can also write custom exporters and register them with Scrapy.
Write a crawler to implement the export operation
This time we go straight into a hands-on case to explain the export-related techniques.
Crawler code
import scrapy


class YySpider(scrapy.Spider):
    name = 'yy'
    allowed_domains = ['pharmnet.com.cn']
    start_urls = ['http://www.pharmnet.com.cn/product/1111/1/1.html']

    def parse(self, response):
        all_items = response.css('a.green.fb.f13::text').getall()
        for item in all_items:
            yield {
                "name": item
            }
Running either of the following two commands runs the crawler and produces a CSV file:
scrapy crawl yy -o data.csv
scrapy crawl yy -t csv -o data.d
There are also two special placeholders you can use in the output file name, as follows:
scrapy crawl yy -t csv -o data/%(name)s/%(time)s.csv
%(name)s is replaced with the crawler's name, and %(time)s with the creation time of the file, so with the spider above the command writes to a path like data/yy/<timestamp>.csv.
If you don't want to configure this on the command line every time you export a file, you can write it into settings.py using the FEEDS setting.
FEEDS = {
    'items.csv': {
        'format': 'csv',
        'encoding': 'utf8',
    },
}
This setting is a dictionary: each key is an output file name and each value is the detailed configuration for that file. The common options are:
- format: the export format, i.e. one of the built-in types;
- batch_item_count: if set to an integer, each output file stores that many items and Scrapy generates multiple files; the file name must then contain a generation rule, shown in the code further below;
- encoding: the file encoding;
- fields: the fields to export (see the sketch after this list);
- indent: the number of spaces used for indentation at each level.
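To illustrate the fields and indent options, here is a minimal sketch, not from the original post, assuming we only want the name field from the spider above exported as indented JSON; it could go into settings.py:

FEEDS = {
    'items.json': {
        'format': 'json',     # built-in JSON exporter
        'encoding': 'utf8',   # output encoding
        'fields': ['name'],   # export only the "name" field
        'indent': 4,          # indent nested JSON by 4 spaces per level
    },
}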
File names used with batch_item_count must contain a generation rule: %(batch_time)s is replaced with a timestamp and %(batch_id)d with the batch's sequence number. The configuration below stores two items per file, producing files such as 1.csv, 2.csv, and so on.
FEEDS = {
    '%(batch_id)d.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'batch_item_count': 2,
    },
}
For reference, these are the feed-related defaults configured in default_settings.py:
FEED_TEMPDIR = None
FEEDS = {}
FEED_URI_PARAMS = None # a function to extend uri arguments
FEED_STORE_EMPTY = False
FEED_EXPORT_ENCODING = None
FEED_EXPORT_FIELDS = None
FEED_STORAGES = {}
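Several of these defaults can be overridden globally in your project's settings.py. A minimal sketch with illustrative values that are not from the original post (the field list assumes the name field used by the spider above):

# Hypothetical global overrides in settings.py
FEED_EXPORT_ENCODING = 'utf-8'   # default encoding for all exported feeds
FEED_EXPORT_FIELDS = ['name']    # export only the "name" field, in this order
FEED_STORE_EMPTY = False         # do not write output files for empty feeds
FEED_EXPORT_INDENT = 4           # indentation for JSON/XML output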
Implement an exporter yourself
If Scrapy's six built-in exporters don't fit your needs, you can create a custom one. The implementation is very simple: write an exporter class that subclasses BaseItemExporter.
A custom exporter mainly involves the following three methods:
- start_exporting(): called when the export starts, used for initialization;
- finish_exporting(): called when the export finishes;
- export_item(): the core method, called once for every exported item.
Next, let's implement a custom exporter.
Create a my_ext.py file in the same directory as settings.py, then create a TXTItemExporter class that inherits from BaseItemExporter.
The rest of the code looks like this:
from scrapy.exporters import BaseItemExporter


class TXTItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        super().__init__(dont_fail=True, **kwargs)
        self.file = file

    def export_item(self, item):
        # _get_serialized_fields retrieves all fields of the item and returns an iterator
        print(self._get_serialized_fields(item, default_value=''))
        print(self.file)
        for name, value in self._get_serialized_fields(item, default_value=''):
            self.file.write(bytes("\nname:" + value, encoding="utf-8"))
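The post stops here, but to actually use this exporter it would typically be registered under a format key in settings.py via FEED_EXPORTERS. The package name my_project below is an assumption; replace it with your own project package:

# settings.py -- register the custom exporter ("my_project" is a placeholder)
FEED_EXPORTERS = {
    'txt': 'my_project.my_ext.TXTItemExporter',
}

After registering it, a command such as scrapy crawl yy -t txt -o data.txt (or a FEEDS entry with 'format': 'txt') should route the output through TXTItemExporter.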
Closing words
Today is day 257/365 of writing every day. Your follows, likes, comments, and favorites are appreciated.