Scrapy Selector: a detailed introduction
In the previous article, I briefly introduced the Selector so that you could start using it right away. This article covers the Scrapy shell and XPath selectors in more detail.
scrapy shell
We can use the Scrapy shell to simulate the request process; it then exposes variables we can manipulate, such as request, response, and so on.
PS C:\Users\admin\Desktop> scrapy shell https://www.baidu.com --nolog
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x00000142EEF861F0>
[s] item {}
[s] request <GET https://www.baidu.com>
[s] response <200 https://www.baidu.com>
[s] settings <scrapy.settings.Settings object at 0x00000142EEF864C0>
[s] spider <DefaultSpider 'default' at 0x142ef446400>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
As you can see from the above code, a number of variables are returned that we can manipulate.
Let’s see what the returned URL value is:
In [1]: response.url
Out[1]: 'https://www.baidu.com'
Retrieve the text of Baidu's page title:
In [2]: response.xpath('/html/head/title/text()').get()
Out[2]: 'Baidu it, and you will know.'
Retrieve all the link text of the current page:
In [3]: response.xpath('//a/text()').getall()
Out[3]: ['news', 'hao123', 'map', 'video', 'Tieba', 'login', 'More products', 'About Baidu', 'About Baidu', 'Must read before using Baidu', 'Feedback']
Retrieve all hyperlinks from the current page:
In [4]: response.xpath('//a/@href').getall()
Out[4]: ['http://news.baidu.com', 'https://www.hao123.com', 'http://map.baidu.com', 'http://v.baidu.com', 'http://tieba.baidu.com', 'http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1', '//www.baidu.com/more/', 'http://home.baidu.com', 'http://ir.baidu.com', 'http://www.baidu.com/duty/', 'http://jianyi.baidu.com/']
The get() method returns the first of all matched results, while getall() returns every result.
Equivalently, we can use extract_first() to return the first result and extract() to return all results.
Using XPath selectors
The response.selector attribute is equivalent to constructing a Selector object from the body of the response.
The Selector object's xpath() method can then be called to parse and extract information.
Now let’s grab the information from Taobao –> Commodity Categories –> Featured Markets:
PS C:\Users\admin\Desktop> scrapy shell https://huodong.taobao.com/wow/tbhome/act/special-markets --nolog
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x00000186B0D061F0>
[s] item {}
[s] request <GET https://huodong.taobao.com/wow/tbhome/act/special-markets>
[s] response <200 https://huodong.taobao.com/wow/tbhome/act/special-markets>
[s] settings <scrapy.settings.Settings object at 0x00000186B0D06280>
[s] spider <DefaultSpider 'default' at 0x186b11b4670>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
After opening the page, it will look like the picture below:
Now I want to get the title information of each section.
In the figure above, you can see that this information exists in the dt tag. The specific code is as follows:
In [1]: response.selector.xpath('//dt/text()')
Out[1]:
[<Selector xpath='//dt/text()' data='The Fashion Spotter'>,
<Selector xpath='//dt/text()' data='Quality Life'>,
<Selector xpath='//dt/text()' data='Characteristic Playaholic'>,
<Selector xpath='//dt/text()' data='Affordable Professionals'>]
In [2]: response.selector.xpath('//dt/text()').extract()
Out[2]: ['The Fashion Spotter', 'Quality Life', 'Characteristic Playaholic', 'Affordable Professionals']
What if we want to extract both the title of each section and the link titles inside it? How would we write that code?
As you can see from the image above, each large section sits inside a dl tag.
The specific code is as follows:
In [6]: dllist = response.selector.xpath('//dl')

In [7]: for dl in dllist:
   ...:     print(dl.xpath('./dt/text()').extract_first())
   ...:     print("=" * 50)
   ...:     alist = dl.xpath('.//a')
   ...:     for a in alist:
   ...:         print(a.xpath('./@href').extract_first(), ':', end=' ')
   ...:         print(a.xpath('.//span[@class="market-list-title"]/text()').extract_first())
   ...:
The Fashion Spotter
==================================================
https://if.taobao.com/ : ...
https://guang.taobao.com/ : ...
https://mei.taobao.com/ : ...
...
Quality Life
==================================================
https://chi.taobao.com/chi/ : ...
...
Characteristic Playaholic
==================================================
https://china.taobao.com : ...
...
Affordable Professionals
==================================================
//tejia.taobao.com/ : ...
...
(output truncated)
Using CSS selectors
Next, we use the Scrapy shell to extract information from Taobao –> Commodity Categories –> Theme Markets.
Similarly, we need to get the title information of each section, the specific code is as follows:
for dd in dlist:
    print(dd.css('a.category-name-level1::text').get())

Women's & men's clothing
Footwear & luggage
Mother & baby products
Skin care & cosmetics
Food
Jewelry & accessories
Home decoration & building materials
Home furnishings & textiles
Department store
Automobiles & accessories
Mobile phones & digital
Appliances & office
More services
Life services
Sports & outdoor
Flowers, birds & entertainment
Agricultural supplies
Using the same pattern as above, I'm sure you can also scrape the title text under each section.
Give it a try to get a better feel for how Scrapy extracts data.
Finally
Nothing can be accomplished overnight; so it is with life, and so it is with learning!
There is no such thing as a three-day or seven-day crash course.
Only persistence leads to success!
Biting books says:
Every word of this article was typed out with care; I only hope to live up to everyone who follows me. Click "like" at the end of the article to let me know that you are also working hard for your studies.
The way ahead is so long without ending, yet high and low I’ll search with my will unbending.
I am someone who concentrates on learning. The more you know, the more you realize you don't know. See you next time for more exciting content!