The previous chapter covered using the BeautifulSoup object to parse a web page downloaded during the crawling process. This chapter uses another extension library, LXML, to parse the page. Like BeautifulSoup, the LXML library can parse both HTML and XML documents, and it also handles large documents with relatively fast parsing speed.

Using LXML requires knowing XPath: the LXML extension library is built on XPath, so the main focus of this chapter is explaining XPath syntax.
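
For context, here is a minimal sketch of how a page downloaded in the crawling process could be handed to LXML. The URL http://example.com is only an illustrative assumption; the rest of the chapter works on a hard-coded HTML string instead.

# -*- coding: UTF-8 -*-
# Minimal sketch (assumes network access to the illustrative URL http://example.com);
# the chapter itself uses a fixed html_doc string rather than a live download
from urllib.request import urlopen
from lxml import etree

page = urlopen('http://example.com').read()   # raw HTML bytes from the downloader
root = etree.HTML(page)                       # parse the bytes into an Element tree
print(root.xpath('//title/text()'))           # e.g. ['Example Domain']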

1. Import the LXML extension library and create objects
# -*- coding: UTF-8 -*-

# Import etree from lxml
from lxml import etree

# First get the source code of the web page downloaded by the page loader;
# the official example document is used directly here
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse the html_doc string from the page loader and return an lxml Element object
html = etree.HTML(html_doc)
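
As a quick sanity check (a small sketch, not part of the original example), the parsed tree can be serialized back to text with etree.tostring(); LXML's HTML parser also repairs the markup, adding the closing </body> and </html> tags that the string above omits.

# Serialize the parsed tree back to HTML to inspect what lxml built;
# tostring() returns bytes, so decode it before printing
print(etree.tostring(html, pretty_print=True).decode('utf-8'))
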
2. Extract web elements using xpath syntax

Get elements as nodes

# xpath() fetches elements as tag nodes
print(html.xpath('/html/body/p'))
# [<Element p at 0x2ebc908>, <Element p at 0x2ebc8c8>, <Element p at 0x2eb9a48>]
print(html.xpath('/html'))
# [<Element html at 0x34bc948>]

# Find the a nodes among the descendants of the current node
print(html.xpath('//a'))

# Find the html node among the children of the document root
print(html.xpath('/html'))
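
The lists returned by xpath() contain lxml Element objects, not plain strings. A short follow-up sketch showing how their tag names and attributes can be inspected; .tag and .get() are standard lxml Element accessors:

# Each result is an Element; .tag gives the tag name and .get() reads an attribute
for p in html.xpath('/html/body/p'):
    print(p.tag, p.get('class'))
# p title
# p story
# p story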

Get elements by filtering on attributes

 1"2 Gets element 3" based on a single attribute
 4Get a tag from class=bro
 5print html.xpath('//a[@class="bro"]')
 6
 7Get a tag from descendant node where id=link3
 8print html.xpath('//a[@id="link3"]')
 9
10"11 Gets element 12" based on multiple attributes
13Get the class attribute equal to sister and id equal to the A tag of link3
14print html.xpath('//a[contains(@class,"sister") and contains(@id,"link1")]')
15
16Get class = bro, or id = link1
17print html.xpath('//a[contains(@class,"bro") or contains(@id,"link1")]')
18
19Get the last a tag of the descendant's A tag using the last() function
20print html.xpath('//a[last()]')
21Get the first a tag of the descendant's A tag using the 1 function
22print html.xpath('//a[1]')
23Position () retrieves the first two A tags of the descendant's A tag
24print html.xpath('//a[position() < 3]')
25
26To obtain multiple elements by means of computation.
29Position () retrieves the first and third tags of the descendant's A tag
30>, <, =, >=, <=, +, -, and, or
31print html.xpath('//a[position() = 1 or position() = 3]')
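
XPath 1.0 offers a few more predicate functions than the ones shown above; for example, starts-with() and direct text() comparisons also work in lxml. A short illustrative sketch:

# starts-with() matches an attribute prefix
print(html.xpath('//a[starts-with(@id, "link")]'))
# text() can be compared directly against a string
print(html.xpath('//a[text()="Elsie"]'))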

Get the attributes and text of an element

# Use @ to get an attribute value and text() to get the tag text

# Get the attribute value
print(html.xpath('//a[position() = 1]/@class'))
# ['sister']

# Get the text value of the tag
print(html.xpath('//a[position() = 1]/text()'))
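
Putting attribute and text extraction together, a typical pattern is to loop over the matched elements and read both at once. A small sketch (not from the original); .get('href') and .text are standard lxml Element accessors:

# Collect the link target and link text for every sister tag
for link in html.xpath('//a[@class="sister"]'):
    print(link.get('href'), link.text)
# http://example.com/elsie Elsie
# http://example.com/lacie Lacie
# http://example.com/tillie Tillie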

More to come on the WeChat public account "Python Concentration Camp", which focuses on the Python technology stack, useful resources, and community discussion. Looking forward to having you join!