Python crawlers - XPath and lxml - Moment For Technology

1. Basic syntax of XPath

1.1 What is XPath

XPath (XML Path Language) is a language for locating information in XML documents. It was originally designed for searching XML documents, but it is also suitable for searching HTML documents.

1.2 Common Rules

Get the text

Expression	Description
a/text()	Gets the text directly under a
a//text()	Gets all text under a, including the text of nested elements
//a[text()='next']	Selects the a elements whose text is exactly "next" (for example, a next-page link)
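
As a rough illustration of the three expressions above, the sketch below runs them with lxml against a small made-up HTML fragment. The markup and the variable names are assumptions for the example, not part of the original article.

```python
from lxml import etree

# Hypothetical HTML fragment, used only for illustration.
doc = etree.HTML(
    '<div><a href="/page/1">previous <b>page</b></a>'
    '<a href="/page/3">next</a></div>'
)
div = doc.xpath('//div')[0]

# a/text(): only the text nodes that are direct children of each a
direct_text = div.xpath('a/text()')

# a//text(): every text node under each a, including nested elements
all_text = div.xpath('a//text()')

# //a[text()="next"]: the a elements whose text is exactly "next"
next_links = doc.xpath('//a[text()="next"]/@href')

print(direct_text)  # ['previous ', 'next']
print(all_text)     # ['previous ', 'page', 'next']
print(next_links)   # ['/page/3']
```

Note that the word "page" sits inside the nested b element, so only the a//text() form returns it.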

Select nodes and attributes

Expression	Description
nodename	Selects the child elements named nodename
/	Selects from the root node, or a direct child node when used between steps
//	Selects descendant nodes at any depth
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes
*	Matches any element node
@*	Matches any attribute node
node()	Matches any type of node
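
The selectors in this table can be tried out with a short sketch like the following. The XML document is a made-up example, and the variable names are only illustrative.

```python
from lxml import etree

# Hypothetical XML document, used only for illustration.
root = etree.fromstring(
    '<bookstore>'
    '<book category="web"><title lang="en">Learning XML</title></book>'
    '<book category="kids"><title lang="en">Harry Potter</title></book>'
    '</bookstore>'
)

title = root.xpath('//title')[0]        # // searches descendants at any depth
same_node = title.xpath('.')[0]         # . is the current node
parent = title.xpath('..')[0]           # .. is the parent of the current node
lang = title.xpath('@lang')             # @ selects an attribute
children = root.xpath('/bookstore/*')   # * matches any element node
book_attrs = root.xpath('//book/@*')    # @* matches any attribute node

print(same_node.tag, parent.tag, lang)   # title book ['en']
print([c.tag for c in children])         # ['book', 'book']
print(book_attrs)                        # ['web', 'kids']
```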

1.3 Examples

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore (note that indexes start at 1)
/bookstore/book[last()]	Selects the last book element that is a child of bookstore
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore
//title[@lang]	Selects all title elements that have an attribute named lang
//title[@lang='eng']	Selects all title elements whose lang attribute is eng
/bookstore/book[price>35.00]	Selects all book elements under bookstore whose price element has a value greater than 35.00
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements under bookstore whose price element has a value greater than 35.00
/bookstore/*	Selects all child elements of bookstore
//*	Selects all elements in the document
//node()/meta/@*	Selects all attributes of every meta element that is a child of any node in the HTML
//title[@*]	Selects all title elements that have at least one attribute
//book/title | //book/price	Selects all title and price elements of book elements
//title | //price	Selects all title and price elements in the document
//bookstore/book/title | //price	Selects all title elements of book elements under bookstore, plus all price elements in the document
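
To see how several of these expressions behave, here is a minimal sketch that evaluates them with lxml. The bookstore document, its titles, and its prices are invented for the example.

```python
from lxml import etree

# Hypothetical bookstore document; titles and prices are made up.
xml = '''<bookstore>
  <book><title lang="eng">Book A</title><price>29.99</price></book>
  <book><title lang="eng">Book B</title><price>39.95</price></book>
  <book><title>Book C</title><price>49.99</price></book>
</bookstore>'''
root = etree.fromstring(xml)

first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
penultimate = root.xpath('/bookstore/book[last()-1]/title/text()')
first_two = root.xpath('/bookstore/book[position()<3]/title/text()')
with_lang = root.xpath('//title[@lang]/text()')
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')
# The | operator returns the union of both node sets in document order.
union_tags = [el.tag for el in root.xpath('//title | //price')]

print(first, last, penultimate)  # ['Book A'] ['Book C'] ['Book B']
print(first_two)                 # ['Book A', 'Book B']
print(with_lang)                 # ['Book A', 'Book B']
print(expensive)                 # ['Book B', 'Book C']
print(union_tags)
```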

2. Using lxml

2.1 Usage notes

lxml repairs malformed HTML automatically, but the repaired result may not match what you expect.
- Use etree.tostring to inspect the repaired HTML, and write your XPath against that repaired string.
Approach for extracting page data:
- Group first: select the list of elements that each wrap one record.
- Then iterate over the groups and extract the fields from each one, so that values from different records are never mismatched.
lxml accepts both bytes and str input.
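
The repair behaviour and the two input types can be checked with a small sketch like the one below. The broken fragment is a made-up example. The exact serialized output may vary slightly between lxml versions, so inspect it rather than rely on a fixed string.

```python
from lxml import etree

# Hypothetical malformed fragment: the second <li> is never closed.
broken = '<ul><li class="item-1">one</li><li class="item-0">two</ul>'

from_str = etree.HTML(broken)                    # str input
from_bytes = etree.HTML(broken.encode('utf-8'))  # bytes input

# tostring returns bytes by default; decode it to read it as text.
repaired = etree.tostring(from_str).decode()
print(repaired)

# Both inputs produce a tree that can be queried the same way.
items_from_str = from_str.xpath('//li/text()')
items_from_bytes = from_bytes.xpath('//li/text()')
print(items_from_str, items_from_bytes)
```

Printing the repaired markup shows the wrapper elements the parser added and where it closed the open tag, which is the structure your XPath has to match.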

2.2 Simple Example

    from lxml import etree

    text = ''' <div> <ul>
            <li class="item-1"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth  item</a>
            </ul> </div> '''

    html = etree.HTML(text)
    print(html)  # <Element html at 0x1f1007c9d08>

    # Inspect the HTML after lxml has repaired it
    print(etree.tostring(html).decode())

    # href of the a under every li whose class is item-1
    ret1 = html.xpath('//li[@class="item-1"]/a/@href')
    print(ret1)

    # Text of the a under every li whose class is item-1
    ret2 = html.xpath("//li[@class='item-1']/a/text()")
    print(ret2)

    # Approach 1: pair the two lists by position
    for href in ret1:
        item = {}
        item['url'] = href
        item['title'] = ret2[ret1.index(href)]
        print(item)

    # Approach 2: group by li first, then extract from each group
    ret3 = html.xpath('//li[@class="item-1"]')
    for i in ret3:
        item = {}
        # ./a/@href is evaluated relative to the current node
        item['url'] = i.xpath('./a/@href')[0] if len(i.xpath('./a/@href')) else None
        item['title'] = i.xpath('./a/text()')[0] if len(i.xpath('./a/text()')) else None
        print(item)
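
The reason for grouping first becomes visible when a record is missing a field. The sketch below uses a made-up list in which the second link has no text. It pairs the two result lists with zip, which is a stand-in for positional pairing rather than the exact code of the example above.

```python
from lxml import etree

# Hypothetical list where the second entry has a link but no link text.
page = etree.HTML(
    '<ul>'
    '<li class="item"><a href="a.html">A</a></li>'
    '<li class="item"><a href="b.html"></a></li>'
    '<li class="item"><a href="c.html">C</a></li>'
    '</ul>'
)

urls = page.xpath('//li[@class="item"]/a/@href')
titles = page.xpath('//li[@class="item"]/a/text()')
# titles has only two entries, so positional pairing puts "C" on b.html.
misaligned = list(zip(urls, titles))

# Grouping by li first keeps each url with its own title (or None).
grouped = []
for li in page.xpath('//li[@class="item"]'):
    href = li.xpath('./a/@href')
    text = li.xpath('./a/text()')
    grouped.append({
        'url': href[0] if href else None,
        'title': text[0] if text else None,
    })

print(misaligned)  # [('a.html', 'A'), ('b.html', 'C')]
print(grouped)
```

In the grouped result, b.html gets a title of None and c.html keeps "C", so nothing is shifted onto the wrong record.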
