Public account: You and the cabin by: Peter Editor: Peter
Powerful XPath
In the past, crawlers almost always parsed data with regular expressions. Regular expressions are powerful, but they are cumbersome to write and relatively slow. This article shows you how to get started quickly with a data parsing tool: XPath.
Introduction to XPath
XPath (XML Path Language) is a language for finding information in XML documents. XPath can be used to traverse the elements and attributes of an XML document.
XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.
- XPath is a query language
- It finds nodes in the tree structure of XML (Extensible Markup Language) and HTML documents
- XPath is a language for "finding people" by their "address"
Quick start website: www.w3schools.com/xml/default…
Xpath installation
MacOS installation is very simple:
pip install lxml
On Linux, using Ubuntu as an example:
sudo apt-get install python-lxml
For Windows, search online for an installation tutorial; there are plenty, although the process is somewhat more involved.
How do you check whether the installation succeeded? Run import lxml in Python: if no error is raised, the installation succeeded.
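The check described above can also be scripted. This is only a minimal sketch: the helper name lxml_is_installed is made up for illustration, and it relies on the standard library's importlib to look for the package without importing it first.

```python
import importlib.util


def lxml_is_installed() -> bool:
    """Return True when the interpreter can locate the lxml package."""
    return importlib.util.find_spec("lxml") is not None


if lxml_is_installed():
    from lxml import etree
    # LXML_VERSION is a tuple of integers describing the installed version.
    print("lxml is installed, version:", etree.LXML_VERSION)
else:
    print("lxml is missing; install it with: pip install lxml")
```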
XPath parsing principles
- Instantiate an etree object and load the page source to be parsed into it
- Call the object's xpath() method with an XPath expression to locate tags and capture their content
How do I instantiate an etree object?
- Load the source of a local HTML document into an etree object: etree.parse(filePath)
- Load source code fetched from the Internet into the object: etree.HTML(page_text), where page_text is the source code content we retrieved
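The two ways of building the object can be sketched as follows. The page_text string is an invented stand-in for downloaded source, and io.StringIO plays the role of a local file so that the example needs nothing on disk; passing etree.HTMLParser() to etree.parse() is an optional extra that lets lxml accept HTML that is not well-formed XML.

```python
import io

from lxml import etree

# Invented page source, standing in for text downloaded from the web.
page_text = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Case 1: the source text is already in memory.
tree_from_text = etree.HTML(page_text)

# Case 2: a file. etree.parse() accepts a path or a file-like object;
# io.StringIO is used here in place of a real file path.
tree_from_file = etree.parse(io.StringIO(page_text), etree.HTMLParser())

title_from_text = tree_from_text.xpath("//title/text()")
title_from_file = tree_from_file.xpath("//title/text()")
print(title_from_text, title_from_file)
```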
XPath usage
Three special symbols
- / : starts parsing from the root node and moves down one level at a time
- // : spans multiple levels, so intermediate levels can be skipped; at the start of an expression it means matching from any position
- . : a dot represents the current node
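A small sketch of the three symbols in action. The HTML string is a trimmed, made-up fragment in the spirit of the test page used later in the article, not the full file.

```python
from lxml import etree

html = """
<html><body>
  <div class="name"><p>Li bai</p><p>Du fu</p></div>
  <div class="tang"><ul><li><b>Su shi</b></li></ul></div>
</body></html>
"""
tree = etree.HTML(html)

# "/" walks down one level at a time, starting from the root node.
step_by_step = tree.xpath("/html/body/div/p/text()")

# "//" skips intermediate levels and can start at any position.
any_depth = tree.xpath("//b/text()")

# "." is the current node: this query is relative to one div element.
name_div = tree.xpath('//div[@class="name"]')[0]
relative = name_div.xpath("./p/text()")

print(step_by_step)  # ['Li bai', 'Du fu']
print(any_depth)     # ['Su shi']
print(relative)      # ['Li bai', 'Du fu']
```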
Common path expressions
Here are the common XPath path expressions:
Expression | Description |
---|---|
nodename | Selects the child elements named nodename of the current node. |
/ | Selects from the root node. |
// | Selects matching nodes in the document from the current node onward, regardless of their position. |
. | Selects the current node. |
.. | Selects the parent of the current node. |
@ | Selects attributes. |
For example,
Expression | Description |
---|---|
books | Selects the child elements named books of the current node |
/books | Selects the root element books |
books//title | Selects all title elements that are descendants of books |
//price | Selects all price elements |
books/book[3] | Selects the third book element that is a child of books; the index starts at 1 |
/books/book[price>55.0] | Selects all book elements whose price is greater than 55 |
//@category | Selects all attributes named category |
/books/book/title/text() | Selects the text of every title element under /books/book |
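To make the table concrete, here is a sketch that runs several of those expressions against a tiny books document. The XML content (titles A, B, C, the prices, and the categories) is invented purely for illustration.

```python
from lxml import etree

# Invented sample data; only the element and attribute names follow the table.
xml = """
<books>
  <book category="web"><title>A</title><price>30.0</price></book>
  <book category="cooking"><title>B</title><price>60.0</price></book>
  <book category="web"><title>C</title><price>75.5</price></book>
</books>
"""
root = etree.fromstring(xml)

all_titles = root.xpath("/books/book/title/text()")
third_title = root.xpath("/books/book[3]/title/text()")          # XPath index starts at 1
expensive = root.xpath("/books/book[price>55.0]/title/text()")   # predicate on a child value
categories = root.xpath("//@category")                           # attribute values

print(all_titles)   # ['A', 'B', 'C']
print(third_title)  # ['C']
print(expensive)    # ['B', 'C']
print(categories)   # ['web', 'cooking', 'web']
```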
XPath operators
Operators are supported directly in XPath expressions:
Operator | Description | Example |
---|---|---|
\| | Union of two node-sets (combines two results) | //book \| //cd |
+ | Addition | 6 + 4 |
- | Subtraction | 6 - 4 |
* | Multiplication | 6 * 4 |
div | Division | 8 div 4 |
= | Equal | price = 9.80 |
!= | Not equal | price != 9.80 |
< | Less than | price < 9.80 |
<= | Less than or equal to | price <= 9.80 |
> | Greater than | price > 9.80 |
>= | Greater than or equal to | price >= 9.80 |
or | Logical or | price = 9.80 or price = 9.70 |
and | Logical and | price > 9.00 and price < 9.90 |
mod | Modulus (division remainder) | 5 mod 2 |
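A few of these operators can be tried directly through lxml. The shop document below is invented for the example; the arithmetic expressions do not depend on it, since XPath evaluates them as plain numbers (which lxml returns as Python floats).

```python
from lxml import etree

# Invented sample document with two books and one cd.
root = etree.fromstring(
    "<shop>"
    "<book><price>9.80</price></book>"
    "<book><price>9.70</price></book>"
    "<cd><price>12.00</price></cd>"
    "</shop>"
)

# Arithmetic operators produce XPath numbers.
total = root.xpath("6 + 4")
quotient = root.xpath("8 div 4")
remainder = root.xpath("5 mod 2")

# "|" merges two node-sets into a single result.
merged_tags = [node.tag for node in root.xpath("//book | //cd")]

# Comparison and boolean operators are mostly used inside predicates.
either = root.xpath("//book[price = 9.80 or price = 9.70]/price/text()")
between = root.xpath("//*[price > 9.00 and price < 9.90]/price/text()")

print(total, quotient, remainder)  # 10.0 2.0 1.0
print(merged_tags)                 # ['book', 'book', 'cd']
print(either, between)
```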
HTML elements
An HTML element is all the code from the start tag to the end tag. Basic syntax:
- An HTML element begins with a start tag and ends with an end tag
- The content of the element is between the start tag and the end tag
- Some HTML elements have empty content
- Empty elements are closed in the start tag (they end where the start tag ends)
- Most HTML elements can have attributes; lowercase attribute names are recommended
Regarding empty elements: adding a slash to the start tag, such as <br />, is the proper way to close an empty element, and is accepted by HTML, XHTML, and XML.
Common attributes
Attribute | Value | Description |
---|---|---|
class | classname | Specifies the class name of the element |
id | id | Specifies the unique ID of the element |
style | style_definition | Specifies the inline style of the element |
title | text | Specifies additional information about the element (can be displayed as a tooltip) |
HTML headings
There are six levels of headings in HTML.
Headings are defined with the <h1> to <h6> tags.
<h1> defines the largest heading, and <h6> defines the smallest heading.
Case analysis
The original data
Before using Xpath to parse the data, we need to import it and instantiate an ETree object:
# import the library
from lxml import etree
# instantiate the parsing object
tree = etree.parse("test.html")
tree
Here is the raw data to be parsed: test.html
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>Ancient poets and works</title>
</head>
<body>
    <div>
        <p>The poet's name</p>
    </div>
    <div class="name">
        <p>Li bai</p>
        <p>Bai juyi</p>
        <p>Li qingzhao</p>
        <p>Du fu</p>
        <p>Wang anshi</p>
        <a href="http://wwww.tang.com" title="Li Shimin" target="_self">
            <span> this is span </span>The poems written by ancient poets are really wonderful</a>
        <a href="" class="du">In front of the bed there was moonlight, and I thought it was frost on the ground</a>
        <img src="http://www.baidu.com/tang.jpg" alt="" />
    </div>
    <div class="tang">
        <ul>
            <li><a href="http://www.baidu.com" title="Baidu">Bai Di chao ci clouds, thousands of miles jiangling a day also</a></li>
            <li><a href="http://www.sougou.com" title="Sogou">Rain falls heavily during Qingming festival, and passers-by are dying on the road</a></li>
            <li><a href="http://www.360.com" alt="360">Qin Mingyue Han guan, thousands of long March people have not yet</a></li>
            <li><a href="http://www.sina.com" title="Bing">A gentleman gives speech to others, and a concubine gives wealth to others</a></li>
            <li><b>Su shi</b></li>
            <li><i>Su Xun</i></li>
            <li><a href="http://www.google.cn" id="Google">Welcome to Chrome</a></li>
        </ul>
    </div>
</body>
</html>
Get the content of a single tag
For example, to get the content of the title tag: Ancient poets and works
title = tree.xpath("/html/head/title")
title
As the result above shows, every XPath query returns a list.
To retrieve the text content of the tag, use text():
# extract the content from the list
title = tree.xpath("/html/head/title/text()")[0]  # index 0 retrieves the first element
title
Get the content of multiple tags
For example, to retrieve the div tags: the raw data contains three pairs of div tags, so the result is a list of three elements. There are three ways to write the expression:
1. A single slash / : positioning starts at the root node html and moves down one level at a time
2. A double slash // in the middle: skips the intermediate levels, spanning multiple levels
3. A double slash // at the beginning: starts from any position
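The three variants can be sketched like this. To keep the example self-contained, the HTML string is a shortened copy of the three div blocks from test.html rather than the full file.

```python
from lxml import etree

# Shortened copy of the three div blocks in test.html.
html = """
<html><body>
  <div><p>The poet's name</p></div>
  <div class="name"><p>Li bai</p></div>
  <div class="tang"><ul><li><b>Su shi</b></li></ul></div>
</body></html>
"""
tree = etree.HTML(html)

divs_single = tree.xpath("/html/body/div")  # 1. one level at a time from the root
divs_middle = tree.xpath("/html//div")      # 2. // in the middle skips levels
divs_start = tree.xpath("//div")            # 3. // at the start matches from anywhere

print(len(divs_single), len(divs_middle), len(divs_start))  # 3 3 3
```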
Locating by attribute
Use [@attribute_name="attribute value"]:
name = tree.xpath('//div[@class="name"]') # Locate the class attribute with the value name
name
Locating by index
Indexes in XPath start at 1, unlike Python, where they start at 0. For example, to locate all the p tags under the div tag whose class attribute is name: there are 5 pairs of p tags, so the result should contain 5 elements.
# get all the data
index = tree.xpath('//div[@class="name"]/p')
index
If we want to retrieve the third p tag:
# get a single specified item: the index starts at 1
index = tree.xpath('//div[@class="name"]/p[3]')
index
Get text content
Method 1: text()
1. Get the content below a specific tag:
# 1. / : a single level at a time
class_text = tree.xpath('//div[@class="tang"]/ul/li/b/text()')
class_text
# 2. // : multiple levels
class_text = tree.xpath('//div[@class="tang"]//b/text()')
class_text
2. The content of multiple tags
For example, to get the content of all the p tags:
# get all the data
p_text = tree.xpath('//div[@class="name"]/p/text()')
p_text
For example, to get the content of the third p tag:
# get the content of the third tag
p_text = tree.xpath('//div[@class="name"]/p[3]/text()')
p_text
If you first fetch the content of all the p tags, the result is a list, and you can then use a Python index to pick out one item. Note that the index for the third item is 2, because Python indexes start at 0:
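A short sketch of the two indexing styles side by side, using only the five p tags from test.html so that the example runs on its own:

```python
from lxml import etree

# The five p tags from the div with class "name" in test.html.
html = (
    '<div class="name">'
    "<p>Li bai</p><p>Bai juyi</p><p>Li qingzhao</p><p>Du fu</p><p>Wang anshi</p>"
    "</div>"
)
tree = etree.HTML(html)

# Fetch everything, then pick the third item with a Python index (starts at 0).
p_text = tree.xpath('//div[@class="name"]/p/text()')
third_by_python = p_text[2]

# Pick the third item inside the XPath expression (starts at 1).
third_by_xpath = tree.xpath('//div[@class="name"]/p[3]/text()')[0]

print(third_by_python, "|", third_by_xpath)  # Li qingzhao | Li qingzhao
```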
Getting content that is not directly under a tag:
Fetching the text of the li tags: the result is empty, because the li tags hold no text of their own, only child tags
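The empty result is easy to reproduce. In this sketch the HTML is a three-item excerpt of the list in test.html; li//text() is shown as one possible way to reach the nested text, alongside the approach the article describes next.

```python
from lxml import etree

# Three-item excerpt of the list in test.html.
html = """
<div class="tang"><ul>
<li><b>Su shi</b></li>
<li><i>Su Xun</i></li>
<li><a href="http://www.google.cn" id="Google">Welcome to Chrome</a></li>
</ul></div>
"""
tree = etree.HTML(html)

# The li tags have no text of their own, so /text() finds nothing.
direct_text = tree.xpath('//div[@class="tang"]//li/text()')

# //text() also descends into the child tags of each li.
nested_text = tree.xpath('//div[@class="tang"]//li//text()')

print(direct_text)  # []
print(nested_text)  # ['Su shi', 'Su Xun', 'Welcome to Chrome']
```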
If you want the full content of the li tags, combine the a, b, and i tags below them with the vertical bar |
# select the text of the a, b, and i tags at the same time
abi_text = tree.xpath('//div[@class="tang"]//li/a/text() | //div[@class="tang"]//li/b/text() | //div[@class="tang"]//li/i/text()')
abi_text
Understanding direct and non-direct content
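The difference shows up clearly on a tag that mixes its own text with a child tag. This sketch uses the first a tag of test.html (written on one line here); the results are stripped of surrounding whitespace before printing so the comparison does not depend on how the parser keeps spaces.

```python
from lxml import etree

# The first a tag from test.html, written on a single line.
html = (
    '<div class="name"><a href="http://wwww.tang.com" title="Li Shimin">'
    "<span> this is span </span>"
    "The poems written by ancient poets are really wonderful</a></div>"
)
tree = etree.HTML(html)

# /text(): only the text that sits directly inside the a tag.
direct = [t.strip() for t in tree.xpath('//a[@title="Li Shimin"]/text()')]

# //text(): the text of the a tag and of everything nested inside it.
everything = [t.strip() for t in tree.xpath('//a[@title="Li Shimin"]//text()')]

print(direct)
print(everything)
```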
Get attribute content
To get the value of an attribute, append @ plus the attribute name to the end of the expression.
1. Get the value of a single attribute
2. Get the values of multiple attributes
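Both cases might look like the sketch below. The HTML is an excerpt of test.html with the link texts shortened to placeholders ("first link", "second link"), since only the attributes matter here.

```python
from lxml import etree

# Excerpt of test.html; the link texts are shortened placeholders.
html = """
<div class="name"><img src="http://www.baidu.com/tang.jpg" alt="" /></div>
<div class="tang"><ul>
<li><a href="http://www.baidu.com" title="Baidu">first link</a></li>
<li><a href="http://www.sougou.com" title="Sogou">second link</a></li>
</ul></div>
"""
tree = etree.HTML(html)

# 1. A single attribute value: the src of the only img tag.
img_src = tree.xpath('//div[@class="name"]/img/@src')

# 2. Several attribute values: the href of every a tag in the list.
hrefs = tree.xpath('//div[@class="tang"]//li/a/@href')

print(img_src)  # ['http://www.baidu.com/tang.jpg']
print(hrefs)    # ['http://www.baidu.com', 'http://www.sougou.com']
```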
Attributes that start with or contain a string
XPath expressions can match attribute values that start with a given string or contain given characters. XPath 1.0, the version lxml implements, has no function for values that end with a string.
- Starts with: starts-with
- Contains: contains
The syntax can be written as:
//tag[starts-with(@attribute_name, "common prefix")]
//tag[contains(@attribute_name, "common substring")]
1. Starts with a string
Get the text content of the a tags whose href starts with http
2. Contains a string
Get the text content of the a tags whose title attribute contains Baidu:
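Both queries can be sketched as follows. The HTML is an excerpt of test.html with the link texts shortened; note that contains() is case-sensitive, so the search string is written as Baidu to match the title attribute in the data.

```python
from lxml import etree

# Excerpt of test.html with shortened link texts.
html = """
<div class="name">
<a href="http://wwww.tang.com" title="Li Shimin">Poems by ancient poets</a>
<a href="" class="du">In front of the bed there was moonlight</a>
</div>
<div class="tang"><ul>
<li><a href="http://www.baidu.com" title="Baidu">Bai Di chao ci clouds</a></li>
<li><a href="http://www.sougou.com" title="Sogou">Rain falls heavily</a></li>
</ul></div>
"""
tree = etree.HTML(html)

# 1. a tags whose href starts with "http" (the empty href is skipped).
http_links = tree.xpath('//a[starts-with(@href, "http")]/text()')

# 2. a tags whose title contains "Baidu".
baidu_links = tree.xpath('//a[contains(@title, "Baidu")]/text()')

print(http_links)
print(baidu_links)
```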
Summary
Here is a summary of how to use XPath:
- // : gets content at any depth, including content that is not directly under the tag
- / : gets only the direct content of the tag
- An index inside an XPath expression starts at 1; once the expression has returned a list, use a Python index, which starts at 0, to pick out items
Hands-on practice
Use XPath to crawl the names and URLs of all of Gu Long's novels from a novel website.
Gu Long (real name Xiong Yaohua) was born in Jiangxi and graduated from Tamkang English College in Taiwan. When he was young, he loved reading classical and modern martial arts novels and Western literature. It is generally believed that he was influenced by the novels of Eiji Yoshikawa, Dumas, Hemingway, Jack London, and Steinbeck, and even by Western philosophers such as Nietzsche and Sartre. ("I like to steal moves from modern Japanese and Western novels," he said.) As a result, his work kept renewing itself, overtook its predecessors, and opened up a new realm for martial arts novels.
Web data analysis
Crawl the information on this website: www.kanunu8.com/zj/10867.ht…
When we click on a specific novel, such as "Two Pride", we reach the page with that novel's details:
By looking at the page source, we find that the names and URLs all sit inside the tag below:
Each tr tag contains three td tags representing three novels, and each td holds one novel's address and name
Get the source code of the web page
Send a web request to get the source code:
import requests
from lxml import etree
import pandas as pd
url = 'https://www.kanunu8.com/zj/10867.html'
headers = {'user-agent': 'Request header'}  # replace the placeholder with your own User-Agent string
response = requests.get(url=url, headers=headers)
result = response.content.decode('gbk')  # this page is GBK-encoded, so decode it as GBK
result
Extract the information
1. Get the link address of each novel
tree = etree.HTML(result)
href_list = tree.xpath('//tbody/tr//a/@href')  # @href selects the attribute value
href_list[:5]
2. Get the name of each novel
name_list = tree.xpath('//tbody/tr//a/text()')  # text() selects the text inside the tag
name_list[:5]
3. Generate the DataFrame
# combine the names and addresses of Gu Long's novels
gulong = pd.DataFrame({
    "name": name_list,
    "url": href_list
})
gulong
4. Complete the URL addresses
In fact, every novel's URL needs a prefix; for example, the complete address of "handsome siblings" is www.kanunu8.com/book/4573/
gulong['url'] = 'https://www.kanunu8.com/book' + gulong['url']  # add the common prefix
gulong
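As an alternative to hard-coding a prefix, relative links can be resolved against the page URL with urljoin from the standard library, which follows the same rules a browser uses. This is only a sketch: the article does not show the raw href values, so the two sample_hrefs below are assumed for illustration and should be replaced by the real href_list.

```python
from urllib.parse import urljoin

page_url = "https://www.kanunu8.com/zj/10867.html"

# Assumed href values for illustration; the real ones come from href_list.
sample_hrefs = ["/book/4573/", "/book/4574/"]

# Resolve each relative link against the URL of the page it was found on.
full_urls = [urljoin(page_url, href) for href in sample_hrefs]

print(full_urls)
# ['https://www.kanunu8.com/book/4573/', 'https://www.kanunu8.com/book/4574/']
```

Before writing the result back to the DataFrame, it is worth printing a few values of href_list to see which form the links actually take.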
Export to an Excel file
gulong.to_excel("gulong.xlsx",index=False)