First, a brief introduction to XPath
What exactly is XPath? Simply put, XPath is a language for finding information in an XML document.
An XML document is a tree consisting of a series of nodes. For example, here is a simple XML document:
<html>
<body>
<div>
<p>Hello world</p>
<a href="/home">Click here</a>
</div>
</body>
</html>
Common nodes in XML documents include:
- Root node: the document itself; its top-level element is html
- Element nodes: html, body, div, p, a
- Attribute node: href
- Text nodes: Hello world, Click here
Common relationships between nodes in XML documents include:
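- Parent/child: For example, <div> is the parent of <p> and <a>, and <p> and <a> are children of <div>
- Siblings: For example, <p> and <a>
- Ancestor/descendant: For example, <html> and <body> are ancestors of <div>, and <p> and <a> are descendants of <body>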
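To make these node types and relationships concrete, here is a minimal sketch that parses the sample document with lxml and walks the tree. It assumes lxml is installed and that the <p> tag is closed properly; the variable names are only for illustration.

```python
from lxml import etree

# The sample document from above, with the <p> tag closed properly.
doc = """<html>
<body>
<div>
<p>Hello world</p>
<a href="/home">Click here</a>
</div>
</body>
</html>"""

root = etree.fromstring(doc)   # the top-level element, <html>
div = root.find("body/div")    # follow the path html -> body -> div
p, a = list(div)               # the two element children of <div>

print(root.tag)                                 # html
print([child.tag for child in div])             # ['p', 'a']
print(p.getparent().tag)                        # div  (parent of <p>)
print(p.getnext().tag)                          # a    (next sibling of <p>)
print([anc.tag for anc in a.iterancestors()])   # ['div', 'body', 'html']
print(a.get("href"), a.text)                    # /home Click here
```

Note that lxml exposes attributes and text through the element that owns them (get() and text) rather than as separate node objects.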
XPath is usually more convenient and concise than regular expressions (re) for parsing web pages. In Python, XPath support is available through the third-party lxml library and its lxml.etree module
It can be installed with the pip install lxml command
Second, XPath usage
Before we dive into XPath in earnest, let’s construct a simple XML document for testing
In a real crawler, this document would be the source code of the web page that was fetched
>>> sc = '''
Example website
'''
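The examples in this section rely on only a few features of the test document: a title whose text is Example website, and several <a> links, one of which has the href image1.html and the text Image1. A minimal stand-in with those features might look like the sketch below. Everything else in it (the layout, the second and third links) is an assumption made for illustration.

```python
# Hypothetical test document. Only the title text "Example website",
# the href value "image1.html" and the link text "Image1" are taken from
# the examples in this article; all other markup is assumed.
sc = '''
<html>
  <head>
    <title>Example website</title>
  </head>
  <body>
    <div>
      <a href="image1.html">Image1</a>
      <a href="image2.html">Image2</a>
      <a href="image3.html">Image3</a>
    </div>
  </body>
</html>
'''
```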
1. Import modules
>>> from lxml import etree
2. Construct objects
>>> html = etree.HTML(sc)  # construct an lxml.etree._Element object
>>> # etree.HTML() also completes the markup for us:
>>> # if the document we get is not well formed, the parser automatically adds the missing closing tags
>>> # we can use the tostring() method to serialize the object to bytes
>>> # and decode('utf-8') to convert the bytes to str
>>> print(etree.tostring(html).decode('utf-8'))
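The automatic completion is easiest to see on a fragment that is deliberately broken. The sketch below uses a made-up fragment (an assumption, not the article's test document) in which <p> and <div> are never closed and there is no <html> or <body> wrapper.

```python
from lxml import etree

# A deliberately broken fragment: <p> and <div> are never closed,
# and the <html>/<body> wrapper is missing entirely.
broken = '<div><p>Hello world<a href="/home">Click here</a>'

html = etree.HTML(broken)                      # lenient HTML parser
fixed = etree.tostring(html).decode('utf-8')   # bytes -> str

print(type(html).__name__)   # _Element
print(html.tag)              # html
print(fixed)                 # the repaired markup, with closing tags added
```

The parser wraps the fragment in <html> and <body> and closes the open tags, so later XPath expressions can start from /html even when the source page is incomplete.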
3. Matching data
We can use the xpath() method to match nodes
(1) XPath matching syntax
The xpath() method takes a string argument, which must be a valid XPath expression
Here’s a look at the XPath matching syntax:
- / selects a child node. For example, /E matches the E element that is a child of the root node
>>> test = html.xpath('/html/head/title')
- // selects descendant nodes. For example, //E matches every E element among the descendants of the root node
>>> test = html.xpath('//a')
- * matches all element nodes. For example, E/* matches all child elements of the E element
>>> test = html.xpath('/html/*')
- text() matches text nodes. For example, E/text() matches the text nodes that are children of the E element
>>> test = html.xpath('/html/head/title/text()')
- @attr matches an attribute node. For example, E/@attr matches the attr attribute of the E element
>>> test = html.xpath('//a/@href')
- A predicate, written in square brackets, narrows the match to specific tags:
- Select the second <a> tag (counted among the <a> children of the same parent)
>>> test = html.xpath('//a[2]')
- Select the first two <a> tags
>>> test = html.xpath('//a[position()<=2]')
- Select the <a> tags that have an href attribute
>>> test = html.xpath('//a[@href]')
- Select the <a> tags whose href attribute has the value image1.html
>>> test = html.xpath('//a[@href="image1.html"]')
- Select the <a> tags whose href attribute contains image
>>> test = html.xpath('//a[contains(@href,"image")]')
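The expressions above can be tried together in one short script. This is only a sketch: the document is a hypothetical stand-in (its third link without an href is an assumption added so that the predicates give different results), and the variable names are made up.

```python
from lxml import etree

# Hypothetical stand-in document, not the article's original markup.
sc = '''
<html>
  <head><title>Example website</title></head>
  <body>
    <div>
      <a href="image1.html">Image1</a>
      <a href="image2.html">Image2</a>
      <a name="top">No link</a>
    </div>
  </body>
</html>
'''
html = etree.HTML(sc)

titles = html.xpath('/html/head/title/text()')       # ['Example website']
top_level = [e.tag for e in html.xpath('/html/*')]   # ['head', 'body']
hrefs = html.xpath('//a/@href')                      # ['image1.html', 'image2.html']
second = html.xpath('//a[2]/text()')                 # ['Image2']
first_two = html.xpath('//a[position()<=2]/text()')  # ['Image1', 'Image2']
with_href = len(html.xpath('//a[@href]'))            # 2
exact = html.xpath('//a[@href="image1.html"]/text()')       # ['Image1']
partial = html.xpath('//a[contains(@href,"image")]/@href')  # ['image1.html', 'image2.html']

for name in ('titles', 'top_level', 'hrefs', 'second',
             'first_two', 'with_href', 'exact', 'partial'):
    print(name, '=', globals()[name])
```

One detail worth knowing: //a[2] counts positions among siblings, so it returns every <a> that is the second <a> child of its parent. To get the second <a> in the whole document, wrap the path in parentheses first, as in (//a)[2].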
(2) _Element object
The xpath() method returns a list of matches. When the expression selects elements, each match is an lxml.etree._Element object; when it selects text or attribute values, each match is a string
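The return type depends on what the expression selects. The sketch below, run on a small made-up fragment, shows the cases that come up most often; count() and string() are standard XPath functions, and for them lxml returns a single value rather than a list.

```python
from lxml import etree

# Small hypothetical fragment used only to show the return types.
html = etree.HTML(
    '<div><a href="image1.html">Image1</a><a href="image2.html">Image2</a></div>'
)

elements = html.xpath('//a')         # list of _Element objects
texts = html.xpath('//a/text()')     # list of strings
attrs = html.xpath('//a/@href')      # list of strings
count = html.xpath('count(//a)')     # a float
joined = html.xpath('string(//a)')   # a single string (text of the first match)
nothing = html.xpath('//img')        # an empty list when nothing matches

print([type(e).__name__ for e in elements])   # ['_Element', '_Element']
print(texts, attrs)
print(count, joined, nothing)
```

Because a path that matches nothing gives an empty list, it is safer to check the length of the result before indexing into it with [0].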
Here are some common attributes and methods for the _Element object:
As an example, we’ll use the xpath() method to get a list of matches called test, each of which is an _Element object
>>> test = html.xpath('//a[@href="image1.html"]')
>>> obj = test[0]
tag
Returns the tag name
>>> obj.tag
'a'
attrib
Returns a dictionary of attributes and values
>>> obj.attrib
{'href': 'image1.html'}
get()
Returns the value of the specified attribute
>>> obj.get('href')
'image1.html'
text
Returns the text content of the element
>>> obj.text
'Image1'
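In a crawler these attributes are usually read in a loop over all matches. Here is a minimal sketch of that pattern; the markup, including the link without an href, is hypothetical and only there to show what happens when an attribute is missing.

```python
from lxml import etree

# Hypothetical markup; the last link has no href on purpose.
html = etree.HTML('''
<div>
  <a href="image1.html">Image1</a>
  <a href="image2.html" title="second">Image2</a>
  <a>No target</a>
</div>
''')

links = []
for a in html.xpath('//a'):
    links.append({
        'tag': a.tag,              # tag name
        'href': a.get('href'),     # None when the attribute is missing
        'attrs': dict(a.attrib),   # copy of all attributes as a plain dict
        'text': a.text,            # text before the first child element
    })

for link in links:
    print(link)
```

get() returns None for a missing attribute instead of raising an error, which makes it more forgiving than indexing attrib directly. Also note that text only holds the text that comes before the first child element, so for nested markup an expression such as string(.) may be a better fit.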