1. Basic syntax of XPath
1.1 What is XPath
XPath (XML Path Language) is a language for locating information in XML documents. It was originally designed for querying XML, but it works equally well on HTML documents.
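As a minimal sketch of the point above, the same kind of path expression can be run against both an XML document and an HTML document. The two snippets and the variable names are made up for illustration; the example assumes the lxml library that section 2 introduces.

```python
from lxml import etree

# Hypothetical XML snippet, used only for illustration.
xml_doc = etree.fromstring(
    "<bookstore><book><title>Learning XML</title></book></bookstore>"
)
xml_titles = xml_doc.xpath("/bookstore/book/title/text()")

# Hypothetical HTML snippet; etree.HTML adds the missing <html>/<body> wrappers.
html_doc = etree.HTML("<div><p>Hello</p></div>")
html_texts = html_doc.xpath("//div/p/text()")

print(xml_titles)  # ['Learning XML']
print(html_texts)  # ['Hello']
```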
1.2 Common Rules
- Get the text

Expression | Description |
---|---|
a/text() | Gets the text directly under a |
a//text() | Gets all text under a, including text inside nested elements |
//a[text()='next'] | Selects the a elements whose text is "next" (for example, a next-page link) |

- Select nodes and attributes

Expression | Description |
---|---|
nodename | Selects the child elements with this name |
/ | Selects from the root node when it starts the expression; between steps, selects direct child nodes |
// | Selects descendant nodes from the current node, at any depth |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
* | Matches any element node |
@* | Matches any attribute node |
node() | Matches any type of node |
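
The rules above can be tried out with a short sketch. The HTML fragment, the `nav` id, and the variable names are hypothetical and exist only for this illustration; the example assumes lxml is installed.

```python
from lxml import etree

# Hypothetical HTML fragment, used only for illustration.
html = etree.HTML('''
<div id="nav">
  <a href="/page/1">prev <b>page</b></a>
  <a href="/page/3">next</a>
</div>
''')

first_a = html.xpath('//a')[0]

# text(): text nodes directly under the first a
direct_text = first_a.xpath('./text()')
# //text(): text nodes at any depth under the first a
all_text = first_a.xpath('.//text()')
# [text()="next"]: the a whose text is exactly "next", then its href attribute
next_href = html.xpath('//a[text()="next"]/@href')
# ..: step up to the parent of a and read its id attribute
parent_id = first_a.xpath('../@id')
# @*: every attribute of the div
div_attrs = html.xpath('//div/@*')

print(direct_text)  # ['prev ']
print(all_text)     # ['prev ', 'page']
print(next_href)    # ['/page/3']
print(parent_id)    # ['nav']
print(div_attrs)    # ['nav']
```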
1.3 Examples
Path expression | Result |
---|---|
/bookstore/book[1] | Selects the first book element that is a child of bookstore (note that indexing starts at 1) |
/bookstore/book[last()] | Selects the last book element that is a child of bookstore |
/bookstore/book[last()-1] | Selects the penultimate book element that is a child of bookstore |
/bookstore/book[position()<3] | Selects the first two book elements that are children of bookstore |
//title[@lang] | Selects all title elements that have an attribute named lang |
//title[@lang='eng'] | Selects all title elements whose lang attribute is eng |
/bookstore/book[price>35.00] | Selects all book elements under bookstore whose price element has a value greater than 35.00 |
/bookstore/book[price>35.00]/title | Selects the title elements of all book elements under bookstore whose price element has a value greater than 35.00 |
/bookstore/* | Selects all child elements of bookstore |
//* | Selects all elements in the document |
//node()/meta/@* | Selects all attributes of the meta elements under any node in the HTML document |
//title[@*] | Selects all title elements that have at least one attribute |
//book/title \| //book/price | Selects all title and price elements of book elements |
//title \| //price | Selects all title and price elements in the document |
//bookstore/book/title \| //price | Selects all title elements of book elements under bookstore, plus all price elements in the document |
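
A few of the expressions in the table can be checked against a small document. The bookstore XML below is a made-up sample (three books with invented titles and prices), so the results apply only to this sample; the sketch assumes lxml is installed.

```python
from lxml import etree

# Hypothetical bookstore document, used only for illustration.
doc = etree.fromstring("""<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
</bookstore>""")

first = doc.xpath('/bookstore/book[1]/title/text()')
last = doc.xpath('/bookstore/book[last()]/title/text()')
first_two = doc.xpath('/bookstore/book[position()<3]/title/text()')
eng = doc.xpath("//title[@lang='eng']/text()")
expensive = doc.xpath('/bookstore/book[price>35.00]/title/text()')
# The | operator returns the union of both node sets.
titles_and_prices = doc.xpath('//title | //price')

print(first)                   # ['Harry Potter']
print(last)                    # ['XQuery Kick Start']
print(first_two)               # ['Harry Potter', 'Learning XML']
print(eng)                     # ['Harry Potter', 'Learning XML']
print(expensive)               # ['Learning XML', 'XQuery Kick Start']
print(len(titles_and_prices))  # 6
```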
2. Using lxml
2.1 Usage notes
- lxml can repair broken HTML, but the repaired result may not be what you expect.
  - Use etree.tostring to see what the repaired HTML looks like, and write your XPath against that repaired HTML string.
- Ideas for extracting page data
  - Group first: select the list of elements that each wrap one group of data.
  - Then iterate over the groups and extract the data from each one, so that values from different groups cannot get mixed up.
- lxml accepts both bytes and str input.
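
A minimal sketch of the first and last notes above: parse a deliberately broken fragment from both str and bytes, then inspect what lxml turned it into. The fragment and the variable names are hypothetical.

```python
from lxml import etree

# Hypothetical broken HTML: the <li> elements are never closed.
broken = '<ul><li class="item">one<li class="item">two</ul>'

# lxml accepts a str ...
from_str = etree.HTML(broken)
# ... and also bytes.
from_bytes = etree.HTML(broken.encode('utf-8'))

# etree.tostring returns bytes; decode it to read the repaired markup.
repaired = etree.tostring(from_str).decode()
print(repaired)

# Write the XPath against the repaired structure, not the raw input.
items_str = from_str.xpath('//li[@class="item"]/text()')
items_bytes = from_bytes.xpath('//li[@class="item"]/text()')
print(items_str, items_bytes)
```

Printing the repaired markup before writing any XPath is cheap insurance, since the parser may add wrapper elements or close tags in places the page author did not intend.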
2.2 Simple Example
    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-1"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''

    html = etree.HTML(text)
    print(html)  # <Element html at 0x1f1007c9d08>

    # View the HTML after lxml has repaired it
    print(etree.tostring(html).decode())

    # Get the href of the a under every li whose class is item-1
    ret1 = html.xpath('//li[@class="item-1"]/a/@href')
    print(ret1)

    # Get the text of the a under every li whose class is item-1
    ret2 = html.xpath('//li[@class="item-1"]/a/text()')
    print(ret2)

    # Pair each url with the title at the same position
    for i in ret1:
        item = {}
        item['url'] = i
        item['title'] = ret2[ret1.index(i)]
        print(item)

    # Group first, then extract from each group
    ret3 = html.xpath('//li[@class="item-1"]')
    for i in ret3:
        item = {}
        # ./a/@href is evaluated relative to the current node
        item['url'] = i.xpath('./a/@href')[0] if len(i.xpath('./a/@href')) else None
        item['title'] = i.xpath('./a/text()')[0] if len(i.xpath('./a/text()')) else None
        print(item)
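
To see why grouping first matters, here is a small sketch in which one link has no text. The HTML and the names `flat_pairs` and `grouped` are hypothetical. Pairing two flat result lists by position misaligns the data, while extracting inside each group does not.

```python
from lxml import etree

# Hypothetical list in which the second link has no text.
html = etree.HTML('''
<ul>
  <li class="item"><a href="a.html">A</a></li>
  <li class="item"><a href="b.html"></a></li>
  <li class="item"><a href="c.html">C</a></li>
</ul>
''')

# Two flat queries: the lists differ in length, so positions no longer match.
urls = html.xpath('//li[@class="item"]/a/@href')
titles = html.xpath('//li[@class="item"]/a/text()')
flat_pairs = list(zip(urls, titles))

# Group first, then extract relative to each group.
grouped = []
for li in html.xpath('//li[@class="item"]'):
    href = li.xpath('./a/@href')
    title = li.xpath('./a/text()')
    grouped.append({
        'url': href[0] if href else None,
        'title': title[0] if title else None,
    })

print(flat_pairs)  # [('a.html', 'A'), ('b.html', 'C')]  <- b.html got C's title
print(grouped)
```

In the grouped result, the entry for `b.html` has a title of `None` instead of borrowing the title of the next item.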