Using XPath

XPath (XML Path Language) is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works just as well on HTML documents.

The previous article covered regular expressions, which are fairly difficult and stay difficult unless you spend enough time practicing them. Today's topic is simpler than regular expressions and should be easier for everyone to pick up.

Common Rules of XPath

XPath has far too many rules to cover in one article, so here are only the most commonly used ones.

Expression   Description
nodename     Selects all child nodes of the current node with this name
/            Selects direct child nodes of the current node
//           Selects descendant nodes of the current node
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
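
To make the table concrete, here is a minimal sketch that tries each rule once. The sample markup is an assumption made up for illustration, not the article's original snippet; the only library calls used are lxml's etree.HTML() and xpath().

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
sample = '''<div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul></div>'''
root = etree.HTML(sample)  # the HTML parser wraps the fragment in <html><body>

li_nodes = root.xpath('/html/body/div/ul/li')  # / walks direct children step by step
a_nodes = root.xpath('//a')                    # // finds descendants at any depth
first_a = a_nodes[0]
current = first_a.xpath('.')                   # . is the current node
parent = first_a.xpath('..')                   # .. is its parent
href = first_a.xpath('@href')                  # @ selects an attribute

print(len(li_nodes), len(a_nodes))    # 2 2
print(current[0].tag, parent[0].tag)  # a li
print(href)                           # ['link1.html']
```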

Preparation

Install the lxml library before using it. If you have not installed it yet, use the following command.

pip install lxml

A first example

Let's start with an example that uses XPath to parse a web page.

from lxml import etree


text = ''' 
       '''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

Here we import the etree module from the lxml library, declare a piece of HTML text, and pass it to etree.HTML() to parse it. The result is an element tree that can be queried with XPath.

Careful readers will notice that the ul tag in the snippet above is not closed, but after running the code you will see that it has been closed and that html and body tags have been added as well.

This is because the lxml HTML parser repairs the markup automatically, and tostring() then serializes the corrected tree. Note that tostring() returns bytes, so we call decode('utf-8') to turn the result into a string before printing the corrected HTML.
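
The repair step can be observed with a very small input. This is only a sketch: the broken markup below is a made-up assumption, not the snippet from this article.

```python
from lxml import etree

# Hypothetical broken markup: neither <ul> nor <div> is ever closed.
broken = '<div><ul><li class="item-0"><a href="link1.html">first item</a></li>'

root = etree.HTML(broken)    # the HTML parser repairs the markup while parsing
raw = etree.tostring(root)   # serialization returns bytes
fixed = raw.decode('utf-8')  # decode to get a regular string

print(type(raw).__name__)    # bytes
print(fixed)                 # the markup now has html/body tags and closing tags
```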

Of course, the etree module can also read text files directly for parsing, as shown below:

from lxml import etree


html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

The contents of the file test.html are the HTML code for the example above.
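
If you want to try the file-based variant without preparing test.html by hand, the sketch below writes a small file into a temporary directory first. The file contents are a hypothetical stand-in for the sample. Note that etree.parse() returns an ElementTree object, while etree.HTML() returns the root Element; both support xpath().

```python
import os
import tempfile

from lxml import etree

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.html')
    # Hypothetical file contents, standing in for the article's sample.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<div><ul><li><a href="link1.html">first item</a></li></ul></div>')
    tree = etree.parse(path, etree.HTMLParser())  # returns an ElementTree

root_tag = tree.getroot().tag      # the root element of the parsed document
links = tree.xpath('//a/@href')

print(root_tag)  # html
print(links)     # ['link1.html']
```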

Get all nodes

We usually use an XPath rule starting with // to select all nodes that meet the requirements. To get every node in the document, the code looks like this:

from lxml import etree


html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

A brief explanation of the code above: * matches every node, so all nodes are retrieved, and the return value is a list.

Each item in the list is an Element object, and its printed form shows the node name after the word Element.

The running result is as follows:

[<Element html at 0x1a0334c39c0>, <Element body at 0x1a0334c3a80>, <Element div at 0x1a0334c3ac0>, <Element ul at 0x1a0334c3b00>, <Element li at 0x1a0334c3b40>, <Element a at 0x1a0334c3bc0>, <Element li at 0x1a0334c3c00>, <Element a at 0x1a0334c3c40>, <Element li at 0x1a0334c3c80>, <Element a at 0x1a0334c3b80>, <Element li at 0x1a0334c3cc0>, <Element a at 0x1a0334c3d00>, <Element li at 0x1a0334c3d40>, <Element a at 0x1a0334c3d80>]

From the results above you can see the html, body, div, ul, li, and a nodes.
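
The memory addresses in that output are not very readable. A small sketch, assuming a made-up two-item sample, that prints only the tag names in document order:

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
sample = '''<div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul></div>'''
root = etree.HTML(sample)

# //* returns every element; .tag gives the node name of each one.
tags = [el.tag for el in root.xpath('//*')]
print(tags)  # ['html', 'body', 'div', 'ul', 'li', 'a', 'li', 'a']
```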

Get the specified node

What if I only want all the li nodes? That is simple too, as the following code shows:

from lxml import etree


html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)

After these examples, the meaning of // should be clearer: it selects matching nodes anywhere in the document, at any depth.

A single leading /, by contrast, starts from the root of the current HTML document and selects only direct children.

Suppose you change one line of the code above to this:

result = html.xpath('/li')

After you run it, you will find that the list is empty. The li node is a descendant of the document, not a direct child; the only direct child of the document is html.

So modify the code like this:

result = html.xpath('/html')
# another way to write it
result = html.xpath('.')

After you run it, you will find that you have successfully obtained the html node.
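
The difference between / and // can be checked side by side. This is a sketch on a made-up one-item sample, parsed with etree.HTML() instead of a file so that it is self-contained:

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
root = etree.HTML('<div><ul><li><a href="link1.html">first item</a></li></ul></div>')

from_root = root.xpath('/li')     # li is not a direct child of the document
anywhere = root.xpath('//li')     # li found as a descendant
top = root.xpath('/html')         # html is the document's only child
current = root.xpath('.')         # the current node, here the html element

print(from_root)                  # []
print([el.tag for el in anywhere])  # ['li']
print(top[0].tag, current[0].tag)   # html html
```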

Child and descendant nodes

If you want to select all a nodes that are direct children of li nodes, you can do so as shown in the following code:

from lxml import etree


html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

Here //li selects all li nodes, and /a then selects all a nodes that are direct children of those li nodes.

You could also write it another way: first select all ul nodes, and then select every a node among their descendants.

The code is as follows:

from lxml import etree


html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul//a')	# note the //a
print(result)


Run the above code and you will see the same result.
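
The two expressions only agree when every a node sits directly inside an li. A short sketch with hypothetical markup in which one link is wrapped in a span shows where / and // diverge:

```python
from lxml import etree

# Hypothetical markup: the second link is nested one level deeper.
sample = '''<ul>
<li><a href="a.html">direct</a></li>
<li><span><a href="b.html">nested</a></span></li>
</ul>'''
root = etree.HTML(sample)

children = root.xpath('//li/a/text()')      # only a nodes that are direct children of li
descendants = root.xpath('//ul//a/text()')  # a nodes at any depth below ul

print(children)     # ['direct']
print(descendants)  # ['direct', 'nested']
```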

Get the parent node

From the examples above, you should know what child and descendant nodes are. So how do you find the parent node? That is what .. is for.

For example, suppose I want to select the a node whose href attribute is link4.html, then get its parent node, and then get the parent's class attribute. That sounds like a lot, but we will take it one step at a time.

The code is as follows:

from lxml import etree


html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

The result:

['item-1']
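
For comparison, here is a sketch of three ways to reach the parent: the .. shorthand, the explicit parent:: axis, and lxml's getparent() method on an Element. The markup is a hypothetical sample chosen to agree with the result above.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
sample = '''<ul>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>'''
root = etree.HTML(sample)

short = root.xpath('//a[@href="link4.html"]/../@class')         # .. shorthand
axis = root.xpath('//a[@href="link4.html"]/parent::*/@class')   # explicit parent axis
link = root.xpath('//a[@href="link4.html"]')[0]
via_api = link.getparent().get('class')                         # lxml Element API

print(short, axis, via_api)  # ['item-1'] ['item-1'] item-1
```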

Attribute matching

When selecting nodes, you can filter by attribute with the @ symbol. For example, the li nodes whose class attribute is item-0 can be selected as follows:

from lxml import etree


html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

Try running the code above, and you will find that it matches two nodes.
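
To see which nodes were matched, you can inspect each returned Element. A minimal sketch, assuming a hypothetical five-item sample in which the first and last li carry class item-0:

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
sample = '''<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>'''
root = etree.HTML(sample)

matches = root.xpath('//li[@class="item-0"]')
# For each matched li, read its class and the href of its a child.
summary = [(li.get('class'), li.xpath('a/@href')[0]) for li in matches]
print(summary)  # [('item-0', 'link1.html'), ('item-0', 'link5.html')]
```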

Getting text

An HTML document contains a lot of text, and some of it is exactly what you need. How do you get it?

You can get the text inside a node with text().

The code is as follows:

from lxml import etree


html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

Run the code above and you will get the text of the a nodes under every li node whose class attribute is item-0.
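
One detail worth knowing is that /text() returns only the text that belongs directly to the selected node, while //text() also collects text from its descendants. A sketch on a made-up li that has both a link and some trailing text:

```python
from lxml import etree

# Hypothetical markup: the li holds a link followed by plain text.
root = etree.HTML('<ul><li class="item-0"><a href="link1.html">first item</a> (new)</li></ul>')

link_text = root.xpath('//li/a/text()')  # text inside the a node
own_text = root.xpath('//li/text()')     # text directly inside li only
all_text = root.xpath('//li//text()')    # text of li and all its descendants
joined = root.xpath('string(//li)')      # string value of the first li

print(link_text)  # ['first item']
print(own_text)   # [' (new)']
print(all_text)   # ['first item', ' (new)']
print(joined)     # first item (new)
```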

Getting attribute values

When writing a crawler, the data we need is often an attribute value, so we need to learn how to obtain the attribute values we want.

For example, to get the href attribute of every a node under li, the code looks like this:

from lxml import etree


html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

We get the href value of each node via @href, and the values are returned as a list.
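
In a crawler you often want the link and its text together. A minimal sketch, on hypothetical markup, that pairs them by iterating over the a elements and reading get('href') and .text:

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
sample = '''<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>'''
root = etree.HTML(sample)

hrefs = root.xpath('//li/a/@href')  # attribute values only
pairs = [(a.get('href'), a.text) for a in root.xpath('//li/a')]

print(hrefs)  # ['link1.html', 'link2.html']
print(pairs)  # [('link1.html', 'first item'), ('link2.html', 'second item')]
```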

Attribute multi-value matching

In front-end code, an attribute such as class often holds several values for convenience. In that case, use the contains() function, for example:

from lxml import etree

text = '''
<li class="li li-first"><a>first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

If you would rather not remember functions like this, you can match the full attribute value instead.

The code is as follows:

from lxml import etree

text = '''
<li class="li li-first"><a>first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li li-first"]/a/text()')
print(result)

See the difference?

Run the two pieces of code above and you will find that the result is the same.
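
The two approaches stop agreeing as soon as the markup varies. Note that contains() is a plain substring test, so contains(@class, "li") also matches a class such as list-last. The sketch below uses hypothetical markup and adds a common idiom, concat() with normalize-space(), that matches li only as a whole class token.

```python
from lxml import etree

# Hypothetical markup: the second class merely contains the letters "li".
sample = '''<ul>
<li class="li li-first"><a>first item</a></li>
<li class="list-last"><a>last item</a></li>
</ul>'''
root = etree.HTML(sample)

substring = root.xpath('//li[contains(@class, "li")]/a/text()')
exact = root.xpath('//li[@class="li li-first"]/a/text()')
# Pad the class list with spaces so that " li " matches only a whole token.
token = root.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(substring)  # ['first item', 'last item']
print(exact)      # ['first item']
print(token)      # ['first item']
```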

Multi-attribute matching

Another situation we run into when writing crawlers is a tag with several attributes. In that case we can join the conditions with the and operator.

The code is as follows:

from lxml import etree

text = '''
<li class="li li-first" name="item"><a>first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
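
To see the and condition actually filter something, the sample needs a node that satisfies only one of the two tests. A sketch with hypothetical markup; it also shows that chaining two predicates gives the same result as and:

```python
from lxml import etree

# Hypothetical markup: both li nodes have class "li", only one has name="item".
sample = '''<ul>
<li class="li li-first" name="item"><a>first item</a></li>
<li class="li li-second" name="other"><a>second item</a></li>
</ul>'''
root = etree.HTML(sample)

class_only = root.xpath('//li[contains(@class, "li")]/a/text()')
both = root.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
chained = root.xpath('//li[contains(@class, "li")][@name="item"]/a/text()')

print(class_only)  # ['first item', 'second item']
print(both)        # ['first item']
print(chained)     # ['first item']
```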

A brief introduction to XPath operators

As the example above shows, and is an XPath operator. XPath has many more operators, so here is a brief overview.

Operator   Description
or         Logical or
and        Logical and
|          Union of two node sets; //li | //a returns the set of all li and a nodes
+          Addition
-          Subtraction
*          Multiplication
div        Division
=          Equal to
!=         Not equal to
<          Less than
>          Greater than
>=         Greater than or equal to
<=         Less than or equal to
mod        Remainder of a division
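
A few of these operators in action. This is a sketch on a hypothetical three-item sample; note that lxml returns numeric XPath results as Python floats.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
sample = '''<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>'''
root = etree.HTML(sample)

union = [el.tag for el in root.xpath('//li | //a')]  # | merges two node sets
either = root.xpath('//li[@class="item-0" or @class="item-1"]/a/text()')
not_zero = root.xpath('//li[@class != "item-0"]/a/text()')
odd = root.xpath('//li[position() mod 2 = 1]/a/text()')  # 1st and 3rd li
doubled = root.xpath('count(//li) * 2')                  # arithmetic on a count

print(union)     # ['li', 'a', 'li', 'a', 'li', 'a']
print(either)    # ['first item', 'second item']
print(not_zero)  # ['second item', 'third item']
print(odd)       # ['first item', 'third item']
print(doubled)   # 6.0
```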

Selecting by position

Sometimes when we write a crawler, an expression matches several li nodes, but we only need the first or the last one. What should we do then?

In this case, you can pass an index in square brackets to select the node at that position.

The code is as follows:

from lxml import etree

text = ''' 
       '''
html = etree.HTML(text)
# Get the first li node
result = html.xpath('//li[1]/a/text()')
print(result)
# Get the last li node
result = html.xpath('//li[last()]/a/text()')
print(result)
# Get the li nodes whose position is less than 3
result = html.xpath('//li[position()<3]/a/text()')
print(result)
# Get the third-to-last li node
result = html.xpath('//li[last()-2]/a/text()')
print(result)
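
Here is the same set of queries as a self-contained sketch with expected output. The five-item list is a hypothetical sample. Keep in mind that XPath positions start at 1, not 0, and that //li[1] means the first li under each parent, while (//li)[1] means the first li in the whole document.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
sample = '''<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>'''
root = etree.HTML(sample)

first = root.xpath('//li[1]/a/text()')
last = root.xpath('//li[last()]/a/text()')
first_two = root.xpath('//li[position()<3]/a/text()')
third_from_end = root.xpath('//li[last()-2]/a/text()')
first_overall = root.xpath('(//li)[1]/a/text()')  # first li in the whole document

print(first)           # ['first item']
print(last)            # ['fifth item']
print(first_two)       # ['first item', 'second item']
print(third_from_end)  # ['third item']
print(first_overall)   # ['first item']
```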

Finally

Today is the last day of 2020. I hope everyone earns more money in the new year, and if you are single, I hope you find a partner to spend your life with.

The way ahead is so long without ending, yet high and low I'll search with my will unbending.

I am book-learning, a person who concentrates on learning. The more you know, the more you realize you don't know. See you next time!