Crawler Series (9) Basic use of xpath

First, a brief introduction to xpath

What exactly is xpath? Simply put, xpath is a language for locating information in an XML document.

An XML document is a tree consisting of a series of nodes. For example, here is a simple XML document:

<html>
	<body>
		<div>
			<p>Hello world</p>
			<a href="/home">Click here</a>
		</div>
	</body>
</html>

Common nodes in XML documents include:

Root node: html
Element nodes: html, body, div, p, a
Attribute node: href
Text nodes: Hello world, Click here

Common relationships between nodes in XML documents include:

Parent/child: For example, <p> and <a> are children of <div>, and conversely, <div> is the parent of <p> and <a>
Siblings: For example, <p> and <a> are called sibling nodes
Ancestor/descendant: For example, <body>, <div>, <p>, and <a> are all descendants of <html>, and conversely, <html> is the ancestor of <body>, <div>, <p>, and <a>
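
The node types and relationships above can be checked in code. The following is a minimal sketch using lxml.etree (the module this article introduces next); it assumes the sample document with the <p> tag properly closed, and the variable names are purely illustrative.

```python
from lxml import etree

# The sample document from above, with <p> closed properly
doc = """
<html>
  <body>
    <div>
      <p>Hello world</p>
      <a href="/home">Click here</a>
    </div>
  </body>
</html>
"""

root = etree.fromstring(doc)   # root element node: <html>
p = root.find("body/div/p")    # element node <p>
a = root.find("body/div/a")    # element node <a>

parent_tag = p.getparent().tag                             # parent of <p>
sibling_tag = p.getnext().tag                              # next sibling of <p>
ancestor_tags = [e.tag for e in a.iterancestors()]         # nearest ancestor first
descendant_tags = [e.tag for e in root.iterdescendants()]  # document order

print(parent_tag)       # div
print(sibling_tag)      # a
print(ancestor_tags)    # ['div', 'body', 'html']
print(descendant_tags)  # ['body', 'div', 'p', 'a']
print(a.get("href"))    # value of the href attribute node: /home
print(p.text)           # text node of <p>: Hello world
```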

For parsing web pages, xpath is usually more convenient and concise than regular expressions. In Python, xpath support is available through the lxml.etree module of the third-party lxml library

It can be installed with the pip install lxml command

Second, xpath usage

Before we dive into xpath in earnest, let’s construct a simple XML document for testing

In a normal crawler, the XML document is the source code of the web page that is crawled back

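>>> sc = '''
<html>
    <head>
        <title>Example website</title>
    </head>
    <body>
        <a href="image1.html">Image1</a>
        <a href="image2.html">Image2</a>
        <a href="image3.html">Image3</a>
    </body>
</html>
'''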

1. Import modules

>>> from lxml import etree

2. Construct objects

>>> html = etree.HTML(sc)  # construct an lxml.etree._Element object
>>> # etree.HTML() also repairs the markup while parsing:
>>> # if the document is not well-formed, missing closing tags are completed automatically
>>> # etree.tostring() converts the object to a string of bytes
>>> # decode('utf-8') converts the bytes to str
>>> print(etree.tostring(html).decode('utf-8'))
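
To make the automatic completion concrete, here is a small sketch. The broken fragment is an illustrative input (not from the original test document) with an unclosed <p> and no <html> or <body> wrapper; passing encoding="unicode" to tostring() is an alternative to calling decode('utf-8') on the bytes.

```python
from lxml import etree

# Illustrative input: <p> is never closed and there is no <html>/<body> wrapper
broken = '<div><p>Hello world<a href="/home">Click here</a></div>'

root = etree.HTML(broken)  # the HTML parser repairs the markup while parsing

# encoding="unicode" makes tostring() return str directly instead of bytes
repaired = etree.tostring(root, encoding="unicode")
print(root.tag)   # the parser wraps the fragment in a root <html> element
print(repaired)   # the serialized tree now contains a closing </p>
```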

3. Matching data

We can use the xpath() method to match

(1) xpath matching syntax

The xpath() method takes one string parameter, which must be a valid xpath expression

Here’s a look at xpath matching syntax:

/ selects child nodes. For example, /E matches the E element that is a child of the root node
```
>>> test = html.xpath('/html/head/title')
```
// selects descendant nodes. For example, //E matches every E element among the descendants of the root node
```
>>> test = html.xpath('//a')
```
* matches any element node. For example, E/* matches all element nodes that are children of the E element
```
>>> test = html.xpath('/html/*')
```
text() selects text nodes. For example, E/text() matches the text nodes that are children of the E element
```
>>> test = html.xpath('/html/head/title/text()')
```
@attr selects an attribute node. For example, E/@attr matches the attr attribute of the E element
```
>>> test = html.xpath('//a/@href')
```
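
The five expressions above can be run together as a script to see what each one returns. This is a sketch under the assumption that the test document has one <title> inside <head> and three <a> links inside <body>; the result variable names are illustrative.

```python
from lxml import etree

# Assumed test document: one <title> in <head>, three <a> links in <body>
sc = """
<html>
  <head><title>Example website</title></head>
  <body>
    <a href="image1.html">Image1</a>
    <a href="image2.html">Image2</a>
    <a href="image3.html">Image3</a>
  </body>
</html>
"""
html = etree.HTML(sc)

titles = html.xpath('/html/head/title')             # child steps from the root
links = html.xpath('//a')                           # all <a> descendants
children = html.xpath('/html/*')                    # every element child of <html>
title_text = html.xpath('/html/head/title/text()')  # text nodes
hrefs = html.xpath('//a/@href')                     # attribute values

print([e.tag for e in titles])    # ['title']
print(len(links))                 # 3
print([e.tag for e in children])  # ['head', 'body']
print(title_text)                 # ['Example website']
print(hrefs)                      # ['image1.html', 'image2.html', 'image3.html']
```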

A predicate, written in square brackets after a node name, narrows the match to specific tags

Specify the second <a> tag

>>> test = html.xpath('//a[2]')

Specify the first two <a> tags

>>> test = html.xpath('//a[position()<=2]')

Specify the <a> tag with the href attribute

>>> test = html.xpath('//a[@href]')

Specify the <a> tag whose href attribute equals image1.html
```
>>> test = html.xpath('//a[@href="image1.html"]')
```
Specify the <a> tags whose href attribute contains image
```
>>> test = html.xpath('//a[contains(@href,"image")]')
```
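
The predicate examples can be compared side by side with a short script. This sketch assumes a document with three links that have an href plus one extra <a> without href (added here only so that [@href] visibly filters something); the texts() helper is hypothetical.

```python
from lxml import etree

# Assumed document: three links with href plus one <a> without href
html = etree.HTML("""
<body>
  <a href="image1.html">Image1</a>
  <a href="image2.html">Image2</a>
  <a href="image3.html">Image3</a>
  <a name="top">Top</a>
</body>
""")

def texts(expr):
    """Return the text of every element matched by an xpath expression."""
    return [e.text for e in html.xpath(expr)]

print(texts('//a[2]'))                        # ['Image2']
print(texts('//a[position()<=2]'))            # ['Image1', 'Image2']
print(texts('//a[@href]'))                    # ['Image1', 'Image2', 'Image3']
print(texts('//a[@href="image1.html"]'))      # ['Image1']
print(texts('//a[contains(@href,"image")]'))  # ['Image1', 'Image2', 'Image3']
```

Note that positions in predicates start at 1, and //a[2] counts position among the <a> children of each parent rather than across the whole document.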

(2) _Element object

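The xpath() method usually returns a list of matches. When the expression selects elements, each match is an lxml.etree._Element object; when it selects text or attributes, each match is a string. Expressions that compute a single value, such as string(...), return that value directly instead of a list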
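
A quick way to see the different return types is to inspect the results directly. This is a minimal sketch on a two-link document assumed for illustration; count() and string() are standard xpath functions, used here only to show non-list results.

```python
from lxml import etree

# Assumed two-link document, for illustration only
html = etree.HTML(
    '<body><a href="image1.html">Image1</a><a href="image2.html">Image2</a></body>'
)

elements = html.xpath('//a')       # list of _Element objects
hrefs = html.xpath('//a/@href')    # list of strings
texts = html.xpath('//a/text()')   # list of strings
count = html.xpath('count(//a)')   # a float, not a list
first = html.xpath('string(//a)')  # a single string: text of the first match
missing = html.xpath('//img')      # no match gives an empty list

print(type(elements[0]).__name__)  # _Element
print(hrefs)                       # ['image1.html', 'image2.html']
print(texts)                       # ['Image1', 'Image2']
print(count)                       # 2.0
print(first)                       # Image1
print(missing)                     # []
```

Because an expression with no match returns an empty list, indexing with [0] should be guarded in crawler code.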

Here are some common attributes and methods for the _Element object:

As an example, we’ll use the xpath() method to get a list of matches called test, each of which is an _Element object

>>> test = html.xpath('//a[@href="image1.html"]')
>>> obj = test[0]

tag: returns the tag name

>>> obj.tag
'a'

attrib: returns a dictionary of attributes and values

>>> obj.attrib
{'href': 'image1.html'}

get(): returns the value of the specified attribute

>>> obj.get('href')
'image1.html'

text: returns the text value

>>> obj.text
'Image1'
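
In a crawler these attributes are typically read in a loop over all matches. The following sketch collects one record per <a> tag; the document, the records list, and the dictionary keys are assumptions made for illustration.

```python
from lxml import etree

# Assumed document: two links with href and one <a> without it
html = etree.HTML("""
<body>
  <a href="image1.html">Image1</a>
  <a href="image2.html">Image2</a>
  <a name="top">Top</a>
</body>
""")

records = []
for el in html.xpath('//a'):
    records.append({
        'tag': el.tag,
        'href': el.get('href', ''),  # get() accepts a default for missing attributes
        'text': el.text,
        'attrs': dict(el.attrib),    # copy attrib into a plain dict
    })

for record in records:
    print(record)
```

Passing a default to get() avoids a None value when a tag lacks the attribute, which keeps later string handling simple.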
