The first technique: LXML
This section focuses on the use of XPath and the parsing library LXML.
XPath & LXML
XPath (XML Path Language) is a language designed to locate information in XML documents, and it works for HTML as well.
When crawling, we can use XPath to extract the information we need.
⚠️ [Note] LXML must be installed.
Common XPath Rules
Expression | Description
---|---
nodename | Selects all child nodes of this node
/ | Selects direct children of the current node
// | Selects descendants of the current node
. | Selects the current node
.. | Selects the parent of the current node
@ | Selects attributes
We often use XPath rules beginning with // to select all the nodes that meet the requirements.
In addition, for commonly used operators, see XPath Operators.
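For example, a minimal sketch using the or operator and the union operator |, assuming the same test.html used in the examples below:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# 'or' matches li nodes whose class is either value
result = html.xpath('//li[@class="item-0" or @class="item-1"]')
# '|' computes the union of two node sets
result2 = html.xpath('//a | //span')
print(result, result2)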
Import the HTML
Import HTML from strings
By importing the etree module of the LXML library, declaring a piece of HTML text, and calling the HTML class to initialize it, we construct an XPath parsing object.
⚠️ [Note] the etree module can automatically correct the HTML text.
Calling the tostring() method outputs the corrected HTML code, which is of type bytes (it can be converted to str with the decode() method):
from lxml import etree
# Sample HTML used throughout this section; note the unclosed final <li>
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
Import HTML from a file
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
Accessing nodes
Get all nodes
To get all nodes in an HTML document, use the rule //* :
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)
We get a list of Element types.
Get all tags of a specified type
If we want to get all the li tags, we can change the rule in html.xpath() to '//li':
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
If there is no match, html.xpath() returns an empty list [].
Obtaining child nodes
To select all direct a children of li nodes, use the rule '//li/a':
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)
To retrieve all descendant a nodes instead, use //li//a.
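A minimal sketch of the descendant rule, assuming the same test.html:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# '//' matches a nodes at any depth below li, not just direct children
result = html.xpath('//li//a')
print(result)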
Getting nodes by attribute
Use the @ sign for attribute filtering: the predicate syntax node[...] selects only the nodes that satisfy the condition inside the brackets.
For example, //a[@href="link4.html"]:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]')
print(result)
Obtaining the parent node
If we want to get the parent of the a node in the example above, and then get its class attribute:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
# You can also use the parent axis: '//a[@href="link4.html"]/parent::*/@class'
print(result)
See XPath Axes for the use of node axes.
Getting text from a node
The text() function in XPath retrieves the direct text of a node (excluding the text in its children):
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/text()')
print(result)
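If you also want the text inside descendant nodes, one option is to put // before text(); a sketch assuming the same test.html:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# //text() collects the text of the node and of all its descendants
result = html.xpath('//li[@class="item-0"]//text()')
print(result)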
Retrieve attributes
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)
Matching an attribute with @ only works when the attribute has a single value. Consider the following:
<li class="li li-first"><a href="link.html">first item</a></li>
Here the class attribute of the li node has two values, so this kind of exact match fails. We can use the contains() function instead:
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
We can also use the and operator to combine multiple conditions:
from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
Supplement
For more detail, see the linked XPath tutorial and LXML library documentation.
The second technique: BeautifulSoup
This section focuses on the use of BeautifulSoup, a parsing library.
BeautifulSoup
BeautifulSoup provides simple, Pythonic functions to handle navigation, searching, modifying the parse tree, and more. It parses documents and provides users with the data they need to grab, which improves parsing efficiency.
BeautifulSoup has complete official Chinese documentation; see the BeautifulSoup official documentation.
⚠️ [Note] BeautifulSoup and LXML need to be installed.
BeautifulSoup can use several parsers, the main ones are as follows:
Parser | Usage | Advantages | Disadvantages
---|---|---|---
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; good fault tolerance | Versions earlier than Python 2.7.3 or 3.2.2 have poor fault tolerance
LXML HTML parser | BeautifulSoup(markup, "lxml") | Fast; good fault tolerance | The C library needs to be installed
LXML XML parser | BeautifulSoup(markup, "xml") | Fast; the only XML-capable parser | The C library needs to be installed
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; generates valid HTML5 | Slow; does not rely on external extensions
We usually use an LXML parser for parsing, as follows:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)
Initialization of a BeautifulSoup object
Use the following code to import the HTML, complete the initialization of the BeautifulSoup object, and auto-correct the document (for example, closing unclosed tags).
soup = BeautifulSoup(markup, "lxml")  # markup is the HTML str
After initialization, we can also output the string being parsed in a standard indented format:
print(soup.prettify())
Node selector
Selecting tags
When selecting an element, you can select a node element by calling its tag name directly, and call the string attribute to get the text inside the node.
from bs4 import BeautifulSoup
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><! -- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">... </p> """
soup = BeautifulSoup(html, 'lxml')
print(soup.title) # <title>The Dormouse's story</title>
print(type(soup.title)) # <class 'bs4.element.Tag'>
print(soup.title.string) # The Dormouse's story
print(soup.head) # <head><title>The Dormouse's story</title></head>
print(soup.p) # <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Nested selection
We can also make nested selections, chaining like parent.child.grandchild:
print(soup.head.title.string)
Associated selection
Sometimes it is hard to select the desired node element in one step. In that case, we can select one node element first, and then select its child nodes, parent node, sibling nodes, and so on from it.
Getting child nodes
Once a node element is selected, if you want to retrieve its immediate children, you can call the contents property, which returns a list of all the children in turn.
A node such as a p tag may contain both text and child nodes; contents returns them all in a single list.
soup.p.contents  # Note that the text is split into several parts
'''(result)
[
'Once upon a time ... were\n',
<a class="sister" href="..." id="link1"><!-- Elsie --></a>,
',\n',
<a class="sister" href="..." id="link2">Lacie</a>,
' and\n',
<a class="sister" href="..." id="link3">Tillie</a>,
';\nand ... well.'
]
'''
In addition, we can use the children property, which returns a list_iterator object. Converted to a list, it is identical to contents:
>>> soup.p.children
<list_iterator object at 0x109d6a8d0>
>>> a = list(soup.p.children)
>>> b = soup.p.contents
>>> a == b
True
We can also enumerate the child nodes one by one:
for i, child in enumerate(soup.p.children):
print(i, child)
To get all descendants (all subordinate nodes), call the descendants property; it recursively queries all children (depth-first) and returns every descendant as a generator (<generator object Tag.descendants at 0x109d297c8>):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, d in enumerate(soup.p.descendants):
print(i, d)
Get the parent and ancestor nodes
If we want to get the parent of a node element, we can call the parent property and return a node:
>>> soup.span.parent
...
If we want to retrieve all ancestor nodes (working our way up through the entire HTML), we can call the parents attribute, which returns a generator:
>>> soup.span.parents
<generator object PageElement.parents at 0x109d29ed0>
>>> list(soup.span.parents)
# Result: [<p>...</p>, <div>...</div>, <body>...</body>, <html>...</html>]
⚠️ [Note] Mind the difference: parent returns only the direct parent node, while parents returns all ancestors.
Getting sibling nodes
To retrieve sibling nodes, we can call four different properties that do different things:
- next_sibling: gets the next sibling node and returns a node.
- previous_sibling: gets the previous sibling node and returns a node.
- next_siblings: gets all following siblings and returns a generator.
- previous_siblings: gets all preceding siblings and returns a generator.
>>> from bs4 import BeautifulSoup
>>> html = """
... <html>
... <body>
... <p class="story">
... Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">
... <span>Elsie</span>
... </a>
... Hello
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
... and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
... and they lived at the bottom of a well.
... </p>
... """
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
>>> soup.a.next_sibling
'\n Hello\n '
>>> soup.a.previous_sibling
'\n Once upon a time there were three little sisters; and their names were\n '
>>> soup.a.next_siblings
<generator object PageElement.next_siblings at 0x...>
>>> soup.a.previous_siblings
<generator object PageElement.previous_siblings at 0x...>
>>> for i in soup.a.previous_siblings:
...     print(i)
...
Once upon a time there were three little sisters; and their names were
>>> for i in soup.a.next_siblings:
...     print(i)
...
Hello
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
Method selector
Sometimes it is difficult to find the desired node directly with the node selector. In such cases we can use methods like find_all() and find(): pass in the corresponding parameters to query flexibly and get the desired nodes, then use associated selection to easily obtain the required information.
find()
find() takes attributes or text to match against, and returns the first matching element.
find(name , attrs , recursive , text , **kwargs)
Copy the code
The following is an example:
from bs4 import BeautifulSoup
import re

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(soup.find(attrs={'class': 'element'}))
print(soup.find(text=re.compile('.*?o.*?', re.S)))  # Returns the text of the first node matching the regex (the result is not a node)
find_all()
find_all() is similar to find(), except that it queries all matching elements and returns a list of them.
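A minimal sketch, reusing the soup built from the HTML in the find() example above:
print(soup.find_all(name='ul'))                   # all ul tags, as a list
print(soup.find_all(attrs={'class': 'element'}))  # all tags with class="element"
print(soup.find_all(text=re.compile('o')))        # all text strings matching the regex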
More
Other finds such as find_parents(), find_next_siblings(), and find_previous_siblings() are generally used in the same way, but with different search ranges. See the documentation for details.
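For instance, a sketch of find_next_siblings(), again reusing the soup from the example above:
first_li = soup.find(name='li')
# Same query interface as find_all(), but only searches this node's following siblings
print(first_li.find_next_siblings('li'))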
CSS selectors
BeautifulSoup also provides CSS selectors. To use CSS selectors, simply call the select() method and pass in the corresponding CSS selectors. The result is a list of nodes that match the CSS selectors:
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
Extracting information
Get the full tag
To get the full HTML code for a tag, just write its node selector:
soup.title
Getting the tag name
Use the name attribute to get the type of a node (p, a, title, pre, etc.):
print(soup.title.name)
Get tag content
As we said earlier, we call the string property to get the text inside the node:
soup.title.string
⚠️ [Note] .string does not work if there are other tags nested inside the tag; it returns None:
>>> from bs4 import BeautifulSoup
>>> html = '<p>Foo<a href="#None">Bar</a></p>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.string)
None
To get the content, you can also use the node’s get_text() method:
soup.p.get_text()
Unlike .string, get_text() works even when the tag contains other tags; it returns all the text inside the node:
>>> from bs4 import BeautifulSoup
>>> html = '<p>Foo<a href="#None">Bar</a></p>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.string)
None
>>> print(soup.p.get_text())
FooBar
Retrieve attributes
Each node may have multiple attributes, such as id and class. We can call attrs to get all of them, and then get a specific attribute the way we would from a dictionary (square brackets with the attribute name, or the get() method):
print(soup.p.attrs)
print(soup.p.attrs['name'])
'''(results)
{'class': ['title'], 'name': 'dromouse'}
dromouse
'''
You can also use brackets and the attribute name directly on the node:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
The two lines of code in the body of the loop are equivalent
The third technique: PyQuery
This section focuses on the use of the parse library PyQuery.
PyQuery
pyquery: a jquery-like library for python
PyQuery is used in much the same way as jQuery. If you come from a front-end background and know jQuery, give it a try.
⚠️ [Note] PyQuery needs to be installed.
Initialize the
PyQuery can be initialized with multiple forms of data sources, such as a string whose content is HTML, the URL of the source, a local file name, and so on.
String initialization
from pyquery import PyQuery as pq
html = '''
<div id="container">
    <h1>Header</h1>
    <p>Something</p>
    <p>Other thing</p>
    <div class="inner">
        <p>In div</p>
    </div>
</div>
'''
doc = pq(html) # Pass the HTML string
print(doc('p')) # Pass CSS selectors
'''(results)
<p>Something</p>
<p>Other thing</p>
<p>In div</p>
'''
URL initialization
from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com', encoding='utf-8')  # specify the encoding to avoid garbled text
print(doc('title'))
"" " "
Copy the code
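You can also initialize from a local file by passing the filename parameter; a sketch assuming a local test.html:
from pyquery import PyQuery as pq

doc = pq(filename='./test.html')  # pyquery reads and parses the file itself
print(doc('li'))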
CSS selectors
See CSS selectors table for details.
Find nodes
- To find theThe childNodes with
children('css-selector')
Method, all if the parameter is empty. - To find theThe childrenNodes with
find('css-selector')
Method,Parameter cannot be empty! - To find theThe fatherNodes with
parent('css-selector')
Method, all if the parameter is empty. - To find theThe ancestorsNodes with
parents('css-selector')
Method, all if the parameter is empty. - To find thebrotherNodes with
siblings('css-selector')
Method, all if the parameter is empty.
>>> p = doc('div')
>>> p
[<div#wrapper>, <div#head>, <div.head_wrapper>, <div.s_form>, <div.s_form_wrapper>, <div#lg>, <div#u1>, <div#ftCon>, <div#ftConw>]
>>> type(p)
<class 'pyquery.pyquery.PyQuery'>
>>> p.find('#head')
[<div#head>]
>>> print(p.find('#head'))
<div id="head">...</div>
Traversal
The results selected by PyQuery can be traversed:
>>> for i in p.parent():
...     print(i, type(i))
...
<Element a at 0x1055332c8> <class 'lxml.html.HtmlElement'>
<Element a at 0x105533368> <class 'lxml.html.HtmlElement'>
<Element a at 0x105533458> <class 'lxml.html.HtmlElement'>
Note that each item here is an LXML Element, so you handle it with LXML methods.
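If you would rather keep working with PyQuery objects while traversing, the items() method yields one PyQuery object per matched node; a minimal sketch:
for item in p.parent().items():
    # each item is a PyQuery object, so attr(), text(), etc. are available
    print(type(item), item.text())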
Access to information
attr()
Retrieve attributes
a = doc('a')
print(a.attr('href'))
attr() must be passed the name of the attribute to select. If the object contains more than one node, calling attr() returns the result only for the first node; to get every value, you need to traverse the nodes.
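So, to get the href of every matched a node, one option is to traverse with items(); a minimal sketch:
a = doc('a')
for item in a.items():
    # attr() on a single-node PyQuery object returns that node's attribute
    print(item.attr('href'))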
text()
Get the text
a = doc('a')
a.text()
This returns the text of all the matched nodes, joined together.
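To keep the text of each node separate instead of one joined string, traverse with items() as above; a minimal sketch:
a = doc('a')
print(a.text())          # all texts joined into one string
for item in a.items():
    print(item.text())   # one node's text at a time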
Node operation
PyQuery can also modify nodes, but that is not the focus of this article and will not be covered here.