
Continuing from the last post, today we'll take a look at selectors in Scrapy!

1. Scrapy selector

Scrapy provides its own parsing mechanism, built on the lxml library, called selectors, because they "select" parts of an HTML document specified by XPath or CSS expressions. The Scrapy selector API is small and simple.

(1) Constructing a selector

A Scrapy Selector is an instance constructed by passing text or a TextResponse object to the scrapy.Selector class. (It automatically chooses the best parsing rules, XML or HTML, based on the input type.)
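As a minimal sketch of both input modes (the sample strings here are placeholders of my own, not part of the tutorial's data):

from scrapy.selector import Selector

# HTML text: parsed with HTML rules (the default for plain text input)
Selector(text="<div><span>hi</span></div>")

# XML text: the type can also be forced explicitly when auto-detection is not wanted
Selector(text="<root><item>hi</item></root>", type="xml")

The sample HTML used throughout the rest of this post is defined next.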

html_str="""
<div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1292052/" class="">
                            <span class="title">The Shawshank Redemption</span>
                                    <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
                                <span class="other">&nbsp;/&nbsp;Moon High Rise (HK)/Thrill 1995(Taiwan)</span>
                        </a>


                            <span class="playable">[Playable]</span>
                    </div>
                    <div class="bd">
                        <p class="">Directed by Frank Darabont&nbsp;&nbsp;&nbsp;Tim Robbins /...<br>
                            1994&nbsp;/&nbsp;The United States&nbsp;/&nbsp;Crime drama</p>

                        
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.7</span>
                                <span property="v:best" content="10.0"></span>
                                <span>1983356 evaluation</span>
                        </div>

                            <p class="quote">
                                <span class="inq">Hope can set a man free.</span>
                            </p>
                    </div>
</div>
"""

Part 1: Constructing a Selector object

(1) Constructing a Selector object by passing in text:

from scrapy.selector import Selector

sss = Selector(text=html_str)
print("Selector built from text:", sss.extract())
# Note: the selector automatically completes the <html> and <body> tags for us

With the Selector object built, we can run XPath expressions on it directly; extract()[0] and extract_first() both pull out the text:

print("extract()[0]:", sss.xpath('./body/div/div/a/span[1]/text()').extract()[0])
print("extract_first():", sss.xpath('./body/div/div/a/span[1]/text()').extract_first())

(2) Constructing a Selector object from a Response

from scrapy.http import HtmlResponse

# Build an HtmlResponse object from the HTML text
response = HtmlResponse(url="http://www.spider.com", body=html_str.encode())

There are three ways to run the same XPath match once we have the response; all three have the same effect:

# Method 1: build a Selector explicitly from the response
print("Method 1:", Selector(response=response).xpath('./body/div/div/a/span[1]/text()').extract()[0])
# Method 2: use the response's .selector attribute
print("Method 2:", response.selector.xpath('./body/div/div/a/span[1]/text()').extract()[0])
# Method 3: call .xpath() on the response directly (shortcut)
print("Method 3:", response.xpath('./body/div/div/a/span[1]/text()').extract()[0])

Part 2: Using selectors

Selectors provide two methods for extracting tags:
(1) xpath(): based on XPath syntax rules
(2) css(): based on CSS selector syntax rules

Shortcut usage: response.xpath() and response.css()

Both return a list of selectors (a SelectorList). To extract text, use extract_first() or extract()[0]; both return the text of the first matched selector. The difference: extract_first() returns None when nothing matches, while extract()[0] raises an error (IndexError)!
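A quick sketch of that difference, using an expression that (by assumption) matches nothing in the sample HTML:

# The class name below is made up so that the selector list is empty
empty = response.xpath('//span[@class="no-such-class"]/text()')
print(empty.extract_first())        # -> None, no exception
try:
    print(empty.extract()[0])       # indexing an empty list...
except IndexError as err:
    print("extract()[0] raised:", err)   # -> list index out of range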

# 1. Using the css() method
print("css() result:", response.css("a"))
# 2. Using the xpath() method
print("xpath() result:", response.xpath('./body/div/div/a/span[1]/text()'))
# 3. Nested selectors -- note: only Selector objects can be chained like this!
print("Nested selection:", response.css("a").xpath('./span[@class="title"]/text()').extract()[0])
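Since the examples above lean on XPath, here is a small css()-only sketch against the same response, using Scrapy's ::text and ::attr() extensions (the exact selectors are my own choice for the sample HTML):

# ::text extracts the node text, ::attr(name) extracts an attribute value
print(response.css('span.title::text').extract_first())   # the first title span
print(response.css('a::attr(href)').extract_first())      # the link address of the <a> tag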

Extension: Selector also has a .re() method that extracts data with regular expressions. It returns a list of strings and is typically chained after the xpath() and css() methods to filter text data. re_first() returns the first matching string.

# Using re()
print("Combined with re():", response.css("a").xpath('./span[@class="title"]/text()').re("(..)"))
# re_first() takes the first item and returns None (no error) when there is no match;
# indexing the list returned by re() would raise an error instead!
print("Shortcut re_first():", response.css("a").xpath('./span[@class="title"]/text()').re_first("(..)"))
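As a slightly more realistic example, here is a hypothetical pattern of my own (not from the original tutorial) that pulls the four-digit year out of the description paragraph of the sample HTML:

# r"(\d{4})" matches the first run of four digits in the paragraph text
print(response.css('div.bd p::text').re_first(r"(\d{4})"))   # should give '1994' for the sample above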

2. The scrapy.Spider class

(1) Name of spider (name)

A string that defines the name of the spider. The spider name is how Scrapy locates (and instantiates) the spider, so it must be unique. This is the most important spider attribute, and it is required.

(2) Start URLs (start_urls)

The list of URLs the spider starts crawling from. The first pages downloaded will be those listed here, and subsequent requests are generated successively from the data contained in these initial URLs.

(3) Custom settings (custom_settings)

A dictionary of settings that overrides the project-wide configuration when running this spider. It must be defined as a class attribute, because the settings are updated before the spider is instantiated. It overrides the settings in settings.py!

(4) Logger

A Python logger created with the spider's name; you can use it to send log messages, as in the sketch below.
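Pulling the four attributes together, here is a minimal sketch of a spider class; the spider name, URL, and settings values are placeholders of my own, not from the original post:

import scrapy

class DoubanSpider(scrapy.Spider):
    name = "douban_movie"                      # must be unique within the project
    start_urls = ["https://movie.douban.com/top250"]
    custom_settings = {                        # overrides settings.py for this spider only
        "DOWNLOAD_DELAY": 1,
    }

    def parse(self, response):
        self.logger.info("Parsing %s", response.url)   # the spider's built-in logger
        yield {"title": response.css("span.title::text").extract_first()}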

🔆 In The End!

Start now and stick with it: a little progress every day, and in the near future you will thank yourself for your efforts!

I will keep updating the crawler basics column and the crawler-in-practice column. Friends who read this article carefully are welcome to like, bookmark, and comment with their thoughts, and to follow this blogger to read more crawler articles in the days ahead!

If there are any mistakes or inappropriate wording, please point them out in the comments, thank you! If you want to reprint this article, please contact me for consent first and credit the source and the blogger's name, thank you!