Introduction to Scrapy
Scrapy is a Python-based crawling framework that provides a complete set of crawler facilities, including cookie simulation, Referer simulation, Ajax simulation, and support for proxies, retries, and other common crawling operations. It lets you focus on the data itself without dealing with much of the underlying plumbing.
This post collects a few tips for quick reference and a quick start.
The project
Since Scrapy is a framework, we first need to create the project folder:
scrapy startproject tutorial
The last argument, tutorial, is the project name. It is recommended to create a separate project per website, because a project shares one set of configuration, including the data export format, which middleware to use, and so on (a minimal settings sketch follows the project layout below). Putting crawlers for different websites in the same project means modifying these settings over and over, which quickly leads to confusion. A single project can, however, contain multiple crawlers: one for lists, one for details, one for tags, and so on.
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # the project pipeline classes
        settings.py       # project settings
        spiders/          # place the various crawlers here
            __init__.py
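Because everything in a project shares this configuration, most crawler-wide behaviour (retries, cookies, download delays, default headers, pipelines) lives in settings.py. A rough sketch of the kind of options it holds, with purely illustrative values:

# settings.py -- illustrative values only, not recommendations
RETRY_ENABLED = True          # retry failed requests automatically
RETRY_TIMES = 2               # extra attempts after the first failure
COOKIES_ENABLED = True        # keep cookies between requests
DOWNLOAD_DELAY = 0.5          # seconds to wait between requests to the same site
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://quotes.toscrape.com/',    # illustrative fixed Referer
}
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,  # assumes the generated pipeline class
}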
The first example
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
This example from the official documentation shows the two basic parts of a crawler: sending requests and parsing the data, which correspond to start_requests and parse. This is what I meant about Scrapy letting you focus on just the data; the improvements that follow also revolve around these two parts.
Another key attribute is name, the crawler's name and unique identifier; we will need it to start the crawler:
scrapy crawl quotes
This command runs the spider. start_requests initiates the requests and must yield Request objects, each specifying parse as its callback. parse receives the response object, which contains all the response information. In this example we simply save the returned data as HTML without further processing; the next section shows how to extract the information we need from the response.
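Before moving on, note that start_requests is also the place to attach per-request details such as headers, cookies, or a proxy, which Request accepts directly. A minimal sketch (the spider name, header, cookie, and proxy values below are made up):

import scrapy

class ProxiedQuotesSpider(scrapy.Spider):
    # hypothetical spider, not part of the tutorial project
    name = "quotes_proxied"

    def start_requests(self):
        yield scrapy.Request(
            url='http://quotes.toscrape.com/page/1/',
            callback=self.parse,
            headers={'Referer': 'http://quotes.toscrape.com/'},  # made-up Referer
            cookies={'session': 'example-value'},                 # made-up cookie
            meta={'proxy': 'http://127.0.0.1:8080'},              # made-up local proxy
        )

    def parse(self, response):
        self.log('Fetched %s' % response.url)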
The selector
The official documentation recommends learning data extraction interactively in shell mode:
scrapy shell 'http://quotes.toscrape.com/page/1/'
This lets you work interactively in the shell and see the result of each extraction immediately. Let's start with CSS selectors, since anyone with a little front-end knowledge already knows their syntax.
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
>>> quote = response.css("div.quote")[0]
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
CSS selectors get two extensions here: ::text and ::attr, which extract a tag's text and one of its attributes, respectively. Attributes can also be obtained this way:
>>> response.css('li.next a').attrib['href']
'/page/2/'
The get method can also take a default value:
response.css(query).get(default='').strip()
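To see why the default matters: without it, get() returns None when nothing matches, and calling .strip() on None would raise an error. A quick check in the same shell session, using a selector that intentionally matches nothing:

>>> response.css('span.does-not-exist::text').get() is None
True
>>> response.css('span.does-not-exist::text').get(default='')
''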
CSS selectors make it easy to locate the data we need. Scrapy also provides another selector language, XPath. It is a classic in its own right, but it seems to be needed in far fewer scenarios than CSS, so anyone interested can learn its syntax on their own.
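For reference, a couple of the CSS queries above written in XPath would look roughly like this:

>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
>>> response.xpath('//li[@class="next"]/a/@href').get()
'/page/2/'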
Returning data and paging
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
The parse method yields dictionaries to return data, and the -o argument specifies an output file when running the spider. If no format is specified, Scrapy defaults to JSON Lines, one JSON string per line; CSV, JSON, and XML are also supported. JSON differs from JSON Lines in that JSON writes a single JSON array containing all the results.
scrapy crawl quotes -o quotes.json
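Instead of passing -o on every run, the export can also be configured per project. A sketch of what that might look like in settings.py on newer Scrapy versions (2.1+), with an arbitrary file name:

# settings.py
FEEDS = {
    'quotes.jsonl': {'format': 'jsonlines'},  # one JSON object per line
}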
To keep crawling subsequent pages, parse can also yield a request for the next page. Pay attention to relative paths, though: the URL extracted in the example above is relative, so response.urljoin is used to turn it into an absolute URL.
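Scrapy also provides response.follow, which resolves relative URLs itself, so the urljoin step can be skipped. A sketch of the same pagination logic using it (method body only, inside the spider above):

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # response.follow builds the absolute URL from the relative href for us
            yield response.follow(next_page, callback=self.parse)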
Conclusion
We can now crawl the data from a paginated list and save it to a file. For a site with no anti-crawling measures in place, this is entirely sufficient.
Reference
Official documentation >>