1. Introduction to XPath

XPath is a language for finding information in XML documents.

Advantages:

  • You can find information in XML
  • HTML lookup is supported
  • You can navigate through elements and attributes

To use XPath from Python we need an XML library, so we install lxml.

Install the lxml library

The lxml library has to be installed before anything else. It can be installed directly in PyCharm:
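Once the installation has finished, one quick way to confirm that the library is importable is to print its version. This is only a sanity check, and it assumes lxml was installed into the same interpreter that the PyCharm project uses:

```python
from lxml import etree

# LXML_VERSION is a tuple of integers; the exact numbers depend on
# whichever release happens to be installed.
version = etree.LXML_VERSION
print("lxml version:", ".".join(str(part) for part in version))
```

If the import fails, the package most likely went into a different interpreter than the one running the script.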

XML tree structure:

Root element – element – attribute – text
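To make these four parts concrete, the sketch below parses a small XML document with lxml and prints the root element, a child element, an attribute, and the text. The bookstore document and its tag names are invented for illustration only.

```python
from lxml import etree

# A tiny XML document made up for this example.
xml = b"""<bookstore>
  <book lang="en">
    <title>Learning XPath</title>
  </book>
</bookstore>"""

root = etree.fromstring(xml)   # root element: <bookstore>
book = root[0]                 # element: first child of the root, <book>
title = book[0]                # element: first child of <book>, <title>

print(root.tag)                # name of the root element
print(book.tag)                # name of the child element
print(book.get("lang"))        # attribute value of lang
print(title.text)              # text inside <title>
```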

Select nodes using XPath:

  • nodename: selects the child elements of the current node that have this name
  • / selects from the root node
  • // selects matching nodes anywhere in the document, regardless of their location
  • . selects the current node
  • .. selects the parent node of the current node (this is two dots, even if the browser displays them as an ellipsis)
  • /text() gets the text content at the current path
  • /@xxx extracts the value of attribute xxx from the tag at the current path

An example of an expression for selecting a node:
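As a minimal sketch, the expressions listed above can be tried with lxml on a small HTML fragment. The fragment, its class name, and its links are invented for this example.

```python
from lxml import etree

# Sample HTML made up for this example.
html = """
<html><body>
  <div class="item"><a href="/book/1">First book</a></div>
  <div class="item"><a href="/book/2">Second book</a></div>
</body></html>
"""
tree = etree.HTML(html)

root_tag = tree.xpath('/html')[0].tag                # /  : start from the root node
links = tree.xpath('//div[@class="item"]/a')         # // : match anywhere in the document
texts = tree.xpath('//div[@class="item"]/a/text()')  # /text() : text content
hrefs = tree.xpath('//div[@class="item"]/a/@href')   # /@href  : attribute value
current = links[0].xpath('.')[0]                     # .  : the current node
parent = links[0].xpath('..')[0]                     # .. : parent of the current node

print(root_tag)
print(texts)
print(hrefs)
print(current.tag, parent.tag, parent.get("class"))
```

Note that xpath() always returns a list here, which is why the single-node lookups are indexed with [0].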

2. Scraping the Qidian (Starting Point) fiction website

Test the title and author expressions in the browser

Install an XPath plugin in Google Chrome, then look for book-mid-info in the HTML:

//div[@class='book-mid-info']/h4/a/text()

Add another expression to get the author:

//p[@class='author']/a[1]/text()
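The title and author expressions can also be tried outside the browser. The sketch below runs them with lxml against a hand-written HTML fragment. The fragment only imitates the structure the expressions assume (a div with class book-mid-info containing h4/a, and a p with class author); the titles and names in it are placeholders, not real data from the site.

```python
from lxml import etree

# Hand-written fragment that imitates the assumed page structure.
sample = """
<ul>
  <li>
    <div class="book-mid-info">
      <h4><a href="#">Example Title A</a></h4>
      <p class="author"><a href="#">Author A</a><a href="#">Genre A</a></p>
    </div>
  </li>
  <li>
    <div class="book-mid-info">
      <h4><a href="#">Example Title B</a></h4>
      <p class="author"><a href="#">Author B</a><a href="#">Genre B</a></p>
    </div>
  </li>
</ul>
"""
doc = etree.HTML(sample)

# Same expressions as in the browser test.
names = doc.xpath('//div[@class="book-mid-info"]/h4/a/text()')
# a[1] keeps only the first link inside each author paragraph.
authors = doc.xpath('//p[@class="author"]/a[1]/text()')

print(names)
print(authors)
```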

Use XPath to get data from the Qidian fiction website:

# Author: Internet Lao Xin
# Development time: 2021/4/8/0008 8:24

import requests
from lxml import etree

url = "https://www.qidian.com/rank/yuepiao"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400'}
# Send the request. headers must be passed by keyword, because the
# second positional argument of requests.get is params.
resp = requests.get(url, headers=headers)
# Convert the str into an lxml.etree._Element
e = etree.HTML(resp.text)
print(type(e))
# Book titles
names = e.xpath('//div[@class="book-mid-info"]/h4/a/text()')
# Authors (first link inside each author paragraph)
authors = e.xpath('//p[@class="author"]/a[1]/text()')
print(names)
print(authors)
# Pair each name with its author
for name, author in zip(names, authors):
    print(name, ":", author)
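One possible refinement, sketched under assumptions rather than taken from the script above: keep the parsing in its own function so the XPath logic can be checked without network access, and extract title and author per book block instead of zipping two separate lists. The function names parse_rank and fetch_rank, the shortened user-agent string, and the 10-second timeout are all hypothetical choices. The sketch also assumes that the author paragraph sits directly inside the book-mid-info div, which may not match the live page.

```python
import requests
from lxml import etree

URL = "https://www.qidian.com/rank/yuepiao"
HEADERS = {"user-agent": "Mozilla/5.0"}  # shortened placeholder value


def parse_rank(html):
    """Return a list of (title, author) tuples found in the HTML text."""
    tree = etree.HTML(html)
    pairs = []
    # Work block by block so a title is never paired with the wrong author.
    for info in tree.xpath('//div[@class="book-mid-info"]'):
        name = info.xpath('./h4/a/text()')
        # Assumption: p.author is a direct child of the book-mid-info div.
        author = info.xpath('./p[@class="author"]/a[1]/text()')
        if name and author:  # skip blocks that lack either piece
            pairs.append((name[0].strip(), author[0].strip()))
    return pairs


def fetch_rank(url=URL):
    """Download the page and parse it; raises on HTTP error status codes."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return parse_rank(resp.text)
```

Calling fetch_rank() performs the real request, while parse_rank can be fed any saved or hand-written HTML string. With zip, a single book that has no author would shift every later pair by one; the per-block loop simply skips such a book.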