1. Basic syntax of XPath

1.1 What is XPath

XPath (XML Path Language) is a language for locating information in XML documents. It was designed for XML, but it works equally well for querying HTML documents.

1.2 Common Rules

  • Get the text

    Expression            Description
    a/text()              Gets the text directly under a
    a//text()             Gets the text of a and of all elements under a
    //a[text()='next']    Selects the a element whose text is "next" (for example, a next-page link)
  • Select nodes and attributes

    Expression    Description
    nodename      Selects the child elements named nodename
    /             Selects from the root node at the start of a path; between steps, selects direct children
    //            Selects descendant nodes at any depth
    .             Selects the current node
    ..            Selects the parent of the current node
    @             Selects attributes
    *             Matches any element node
    @*            Matches any attribute node
    node()        Matches a node of any type
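
The rules above can be tried out with lxml. The following is a minimal sketch; the HTML snippet, its id and its link target are invented for illustration and do not come from any real page.

```python
from lxml import etree

# Hypothetical snippet, invented for illustration only.
doc = etree.fromstring(
    '<div id="box">'
    '<a href="/page/2">next</a>'
    '<p>intro <b>bold</b> tail</p>'
    '</div>'
)

# a/text(): the text directly under each child <a> of the current node
direct_text = doc.xpath('a/text()')
print(direct_text)        # ['next']

# p//text(): the text of <p> and of every element below it
all_text = doc.xpath('p//text()')
print(all_text)           # ['intro ', 'bold', ' tail']

# //a[text()="next"]: the <a> element whose text is exactly "next"
link = doc.xpath('//a[text()="next"]')[0]

# @ selects attributes, .. the parent, * any element, @* any attribute
hrefs = doc.xpath('//a/@href')
parent_tag = link.xpath('..')[0].tag
child_tags = [el.tag for el in doc.xpath('*')]
all_attrs = doc.xpath('//@*')
print(hrefs, parent_tag, child_tags, all_attrs)
```

Note that text() returns the text nodes themselves, while an expression without text() or @ returns element objects.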

1.3 Examples

Path expression                     Result
/bookstore/book[1]                  Selects the first book element that is a child of bookstore (indices start at 1)
/bookstore/book[last()]             Selects the last book element that is a child of bookstore
/bookstore/book[last()-1]           Selects the second-to-last book element that is a child of bookstore
/bookstore/book[position()<3]       Selects the first two book elements that are children of bookstore
//title[@lang]                      Selects all title elements that have an attribute named lang
//title[@lang='eng']                Selects all title elements whose lang attribute equals eng
/bookstore/book[price>35.00]        Selects all book elements under bookstore whose price element has a value greater than 35.00
/bookstore/book[price>35.00]/title  Selects the title elements of all book elements under bookstore whose price is greater than 35.00
/bookstore/*                        Selects all child elements of bookstore
//*                                 Selects all elements in the document
//node()/meta/@*                    Selects all attributes of every meta element that is a child of any node in the document
//title[@*]                         Selects all title elements that have at least one attribute
//book/title | //book/price         Selects all title and price elements of book elements
//title | //price                   Selects all title and price elements in the document
//bookstore/book/title | //price    Selects all title elements of book elements under bookstore, plus all price elements in the document
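
To check a few of these expressions, here is a small sketch that runs them with lxml. The bookstore document, its titles and its prices are made up for this example, and the texts helper is a hypothetical convenience function.

```python
from lxml import etree

# Hypothetical bookstore document, invented for illustration only.
xml = (
    '<bookstore>'
    '<book><title lang="eng">Book A</title><price>29.99</price></book>'
    '<book><title lang="eng">Book B</title><price>39.95</price></book>'
    '<book><title>Book C</title><price>49.00</price></book>'
    '</bookstore>'
)
root = etree.fromstring(xml)


def texts(expr):
    """Return the text of every element matched by expr."""
    return [el.text for el in root.xpath(expr)]


first = texts('/bookstore/book[1]/title')                # index starts at 1
last = texts('/bookstore/book[last()]/title')
penultimate = texts('/bookstore/book[last()-1]/title')
first_two = texts('/bookstore/book[position()<3]/title')
with_lang = texts('//title[@lang]')                      # has a lang attribute
expensive = texts('/bookstore/book[price>35.00]/title')  # price above 35.00
union = texts('//book/title | //book/price')             # | combines node sets

print(first, last, penultimate)
print(first_two, with_lang, expensive)
print(union)
```

The union result comes back in document order, so titles and prices alternate rather than being grouped by expression.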

2. Using lxml

2.1 Caveats

  • lxml repairs malformed HTML automatically, but the repaired markup may not be what you expect.

    • Use etree.tostring to inspect the repaired HTML, and write your XPath against that repaired string.
  • A strategy for extracting page data

    • Group first: select the list of elements that each wrap one record.
    • Then iterate over the groups and extract the fields from each one, so that values from different records cannot get mixed up.
  • lxml accepts both bytes and str input.
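
The first and last points can be seen in a short sketch. The broken fragment below is invented for illustration; the exact repaired markup depends on the parser, so the example prints it rather than assuming it.

```python
from lxml import etree

# Hypothetical fragment: the <li> and <p> tags are never closed.
broken = '<ul><li>one<li>two</ul><p>tail'

# etree.HTML accepts str as well as bytes
tree_from_str = etree.HTML(broken)
tree_from_bytes = etree.HTML(broken.encode('utf-8'))

# tostring returns bytes by default, so decode it before reading
repaired = etree.tostring(tree_from_str).decode()
print(repaired)

# Write XPath against the repaired structure, not the raw input
items = tree_from_str.xpath('//ul/li/text()')
items_from_bytes = tree_from_bytes.xpath('//ul/li/text()')
print(items, items_from_bytes)
```

Printing the repaired string first shows which wrapper elements and closing tags the parser added, which is what the XPath has to match.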

2.2 Simple Example

    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-1"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth  item</a>
        </ul>
    </div>
    '''

    html = etree.HTML(text)
    print(html)  # e.g. <Element html at 0x1f1007c9d08>

    # Inspect the HTML after lxml has repaired it (the last <li> is unclosed)
    print(etree.tostring(html).decode())

    # href of every <a> under an <li> whose class is "item-1"
    ret1 = html.xpath('//li[@class="item-1"]/a/@href')
    print(ret1)

    # Text of every <a> under an <li> whose class is "item-1"
    ret2 = html.xpath("//li[@class='item-1']/a/text()")
    print(ret2)

    # Pair each URL with its title by position in the two lists
    for href in ret1:
        item = {}
        item['url'] = href
        item['title'] = ret2[ret1.index(href)]
        print(item)

    # Group first: select the <li> elements, then extract from each one
    ret3 = html.xpath('//li[@class="item-1"]')
    for i in ret3:
        item = {}
        # ./a/@href is evaluated relative to the current <li> node
        item['url'] = i.xpath('./a/@href')[0] if len(i.xpath('./a/@href')) else None
        item['title'] = i.xpath('./a/text()')[0] if len(i.xpath('./a/text()')) else None
        print(item)
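
To show why grouping first is the safer of the two approaches, here is a minimal sketch with a hypothetical list in which one record has no link. The markup and field names are invented for illustration.

```python
from lxml import etree

# Hypothetical list: the second <li> has no <a> element.
html = etree.HTML(
    '<ul>'
    '<li><span>A</span><a href="a.html">a</a></li>'
    '<li><span>B</span></li>'
    '<li><span>C</span><a href="c.html">c</a></li>'
    '</ul>'
)

# Two independent queries give lists of different lengths,
# so pairing them by index attaches c.html to record B.
names = html.xpath('//li/span/text()')
hrefs = html.xpath('//li/a/@href')
paired_by_index = list(zip(names, hrefs))
print(paired_by_index)

# Grouping first keeps every field tied to its own <li>.
grouped = []
for li in html.xpath('//li'):
    name = li.xpath('./span/text()')
    href = li.xpath('./a/@href')
    grouped.append({
        'name': name[0] if name else None,
        'url': href[0] if href else None,
    })
print(grouped)
```

With index pairing, a single missing field shifts every later value by one position; with grouping, the missing field simply becomes None for that record.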