Life is short, I use Python.

Previous articles in this series:

Learn Python crawler (1): The beginning

Learn Python crawler (2): Pre-preparation (1) Basic class library installation

Learn Python crawler (3): Pre-preparation (2) Linux basics

Learn Python crawler (4): Pre-preparation (3) Docker basics

Learn Python crawler (5): Pre-preparation (4) Database basics

Learn Python crawler (6): Pre-preparation (5) Crawler framework installation

Learn Python crawler (7): HTTP basics

Learn Python crawler (8): Web basics

Learn Python crawler (9): Crawler basics

Learn Python crawler (10): Session and Cookies

Learn Python crawler (11): Urllib

Learn Python crawler (12): Urllib

Learn Python crawler (13): Urllib

Learn Python crawler (14): Urllib

Learn Python crawler (15): Urllib

Learn Python crawler (16): Urllib

Learn Python crawler (17): Basic usage of Requests

Learn Python crawler (18): Requests advanced operations

Introduction

In the previous two articles we introduced the use of Requests. From this article on, we turn to parsing libraries, starting with XPath.

Introduction to XPath

XPath (XML Path Language) is a language for finding information in XML documents. It was originally designed to search XML documents, but it can also be used to search HTML documents.

First, the official XPath specification can be found at: https://www.w3.org/TR/xpath/all/.

I would also like to recommend two good tutorial sites:

w3school: https://www.w3school.com.cn/xpath/index.asp

Runoob tutorial: https://www.runoob.com/xpath/xpath-tutorial.html

Common path expressions

The most useful path expressions are listed below:

nodename : selects all child nodes of the named node.

/ : selects from the root node.

// : selects matching nodes in the document from the current node, regardless of their position.

. : selects the current node.

.. : selects the parent of the current node.

@ : selects attributes.
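As a quick sketch of how these expressions behave, here is a minimal example using lxml (the library used throughout this article) on an invented HTML fragment; the fragment and the node names in it are made up purely for illustration:

```python
from lxml import etree

# An invented fragment, purely to exercise the expressions above.
html = etree.HTML('<div id="box"><ul><li class="a">one</li><li class="b">two</li></ul></div>')

print(html.xpath('//li'))         # // : all li nodes, anywhere in the document
print(html.xpath('//ul/li'))      # /  : li nodes that are direct children of ul
print(html.xpath('//li/..'))      # .. : the parent of the li nodes (the ul node)
print(html.xpath('//li/@class'))  # @  : the class attributes of the li nodes
```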

Note that before using XPath, you need to make sure the lxml library is installed. If it is not, you can install it by referring back to the earlier setup article in this series.

XPath demo

First, we import the etree module from the lxml library, together with the Requests module, using my own blog as the example page.

from lxml import etree
import requests

response = requests.get('https://www.geekdigging.com/')
html_str = response.content.decode('UTF-8')
html = etree.HTML(html_str)
result = etree.tostring(html, encoding='UTF-8').decode('UTF-8')
print(result)

The results are as follows:

As you can see, the crawl succeeded. Here we first use Requests to retrieve the home page's source as a byte stream, then decode it with decode(). After decoding, we pass the string into etree.HTML() to build an lxml.etree._Element object, then convert that object back into a string with tostring() and print it.

Note: when converting with tostring(), you need to pass the encoding argument, otherwise Chinese characters will be printed as Unicode escape sequences.
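As a side note (shown here with a deliberately broken, made-up fragment), etree.HTML() does more than parse: it also repairs incomplete HTML, closing open tags and wrapping the fragment in html and body nodes, which is handy when crawling real-world pages:

```python
from lxml import etree

# A deliberately broken fragment: unclosed li tags, no html/body wrapper.
broken = '<ul><li>first<li>second'
html = etree.HTML(broken)

# lxml closes the li tags and adds the missing html/body wrapper.
fixed = etree.tostring(html, encoding='UTF-8').decode('UTF-8')
print(fixed)
```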

All nodes

Now that we have built the Element object, we are ready for the fun part: learning XPath.

XPath rules starting with // select all nodes in the document that meet the criteria. For example (again using the HTML above):

result_1 = html.xpath('//*')
print(result_1)

The results are as follows:

[<Element html at 0x2a2810ea088>, <Element head at 0x2a2810e0788>, <Element meta at 0x2a2810d8048>, <Element meta at 0x2a2810d8088>, <Element meta at 0x2a280124188>,......

The result is too long, so only part of it is shown here.

Here * matches all nodes, so every node in the HTML document is fetched. As you can see, the return value is a list: each item is of type Element, followed by the node name, such as html, head, meta, and so on. All the nodes are included in the list.

Of course, matching here can also specify node names, such as getting all meta nodes:

result_2 = html.xpath('//meta')
print(result_2)

The results are as follows:

[<Element meta at 0x1fc9107a2c8>, <Element meta at 0x1fc9107a6c8>, <Element meta at 0x1fc91ff8188>, <Element meta at 0x1fc91ff8108>, <Element meta at 0x1fc91ff8088>, <Element meta at 0x1fc91fc2d88>, <Element meta at 0x1fc91e73988>, <Element meta at 0x1fc91ff81c8>, <Element meta at 0x1fc91f93f08>, <Element meta at 0x1fc9203d2c8>, <Element meta at 0x1fc9203d308>, <Element meta at 0x1fc9203d348>, <Element meta at 0x1fc9203d408>, <Element meta at 0x1fc9203db08>, <Element meta at 0x1fc9203d388>, <Element meta at 0x1fc9203d3c8>, <Element meta at 0x1fc92025c08>, <Element meta at 0x1fc92025b88>, <Element meta at 0x1fc92025c48>, <Element meta at 0x1fc92025cc8>]

To select all meta nodes, we use // plus the node name, passed to the xpath() method. Since the call returns a list, you can index it with [] to get a particular meta node, such as [0].
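For instance, with a small invented fragment, indexing the returned list gives a single Element whose tag name and attributes can then be read:

```python
from lxml import etree

# An invented head fragment with two meta nodes.
html = etree.HTML('<head><meta charset="UTF-8"><meta name="author" content="demo"></head>')
metas = html.xpath('//meta')

first = metas[0]             # index into the returned list
print(first.tag)             # the node name: meta
print(first.get('charset'))  # read an attribute: UTF-8
print(len(metas))            # 2
```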

Child nodes

To obtain the children of a node, use /; to obtain descendants at any depth (children, grandchildren, and so on), use //.

For example, suppose we now want to get all of the article blocks on the page, like this:

As the red box in the screenshot indicates, the article nodes are nested under a main node, so we can write:


result_3 = html.xpath('//main/article')
print(result_3)

The results are as follows:

[<Element article at 0x225ef371c08>, <Element article at 0x225ef372208>, <Element article at 0x225ef3727c8>, <Element article at 0x225ef372d88>, <Element article at 0x225ef373388>, <Element article at 0x225ef373948>, <Element article at 0x225ef373f08>, <Element article at 0x225ef374508>, <Element article at 0x225ef374ac8>, <Element article at 0x225ef3750c8>, <Element article at 0x225ef375688>, <Element article at 0x225ef375c48>]

Here / gets the direct child nodes. If you want descendant nodes instead, for example the div nodes nested below main, as shown below, you can write:

result_4 = html.xpath('//main//div')
print(result_4)

Again, the result is too long to show in full.
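The difference between / and // is easy to see on a small invented fragment where one p node is a direct child and another is nested one level deeper:

```python
from lxml import etree

# Invented fragment: one p directly under the outer div, one nested deeper.
html = etree.HTML('<div id="outer"><p>direct child</p><div><p>grandchild</p></div></div>')

# / matches direct children only.
print(len(html.xpath('//div[@id="outer"]/p')))   # 1

# // matches descendants at any depth.
print(len(html.xpath('//div[@id="outer"]//p')))  # 2
```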

The parent node

Since / and // let us move down to child nodes, there must also be syntax for moving up to the parent node; otherwise we could only ever search downwards, never upwards.

The parent node is selected with .. For example, suppose we first locate the image of an article and now want to move up from it, as shown below:

Here we match the img node whose alt attribute is "Python crawler (16): Urllib crawler", then get its parent node and print its href attribute. The code is as follows:

result_5 = html.xpath('//img[@alt="Python crawler (16): Urllib crawler"]/../@href')
print(result_5)

The results are as follows:

['/2019/12/09/1691033431/']

We can also get the parent node with the parent:: axis, which gives the same result:

result_6 = html.xpath('//img[@alt="Python crawler (16): Urllib crawler"]/parent::*/@href')
print(result_6)

Attribute filter

When selecting nodes, we can use the @ symbol for attribute filtering.

Let's say we want to pick out the div node whose class attribute is container. On the front page, there is only one such div among the child nodes of the section node, and the code is as follows:

result_7 = html.xpath('//section/div[@class="container"]')
print(result_7)

The running results are as follows:

[<Element div at 0x251501c2c88>]
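One caveat (not covered above, shown with an invented fragment): [@class="container"] is an exact string match. If a node carries several classes, for example class="container main", the exact match no longer finds it, and the contains() function can be used instead:

```python
from lxml import etree

# Invented fragment whose div carries two classes.
html = etree.HTML('<div id="wrap"><div class="container main">hello</div></div>')

# Exact match fails because the attribute value is "container main".
print(html.xpath('//div[@class="container"]'))                  # []

# contains() matches nodes whose class attribute contains "container".
print(len(html.xpath('//div[contains(@class, "container")]')))  # 1
```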

Sample code

All of the code in this series will be available on GitHub and Gitee.

Example code -Github

Example code -Gitee

If my articles help you, please scan the QR code to follow the author's official account and get the latest posts pushed to you :)