The first technique: lxml

This section focuses on the use of XPath and the parsing library LXML.

XPath & LXML

XPath (XML Path Language) is a language designed to find information in XML documents, and it works for HTML as well.

When writing a crawler, we can use XPath to extract the information we need.

⚠️ [Note] lxml must be installed (pip install lxml).

Common XPath Rules

expression   description
nodename     selects all child nodes with the given tag name
/            selects direct children of the current node
//           selects descendants of the current node
.            selects the current node
..           selects the parent of the current node
@            selects attributes

We often use XPath rules beginning with // to select all the nodes that meet the requirements.

In addition, for commonly used operators, see the XPath operators reference; a small sketch of one of them follows.
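As a quick illustration (a minimal sketch added here, not part of the original tutorial), the or operator combines two predicates:

from lxml import etree

# 'or' matches nodes satisfying either predicate
text = '<ul><li class="item-0">first</li><li class="item-1">second</li></ul>'
html = etree.HTML(text)
print(html.xpath('//li[@class="item-0" or @class="item-1"]/text()'))
# ['first', 'second']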

Import the HTML

Import HTML from strings

By importing the etree module of the lxml library, declaring a piece of HTML text, and calling the HTML class to initialize it, we construct an XPath parsing object.

⚠️ [Note] the etree module automatically corrects malformed HTML text.

Calling the tostring() method outputs the corrected HTML code as bytes (which can be converted to str with the decode() method):

from lxml import etree

# sample HTML (elided in the original); this fragment is consistent with the
# queries used below -- note the last <li> is deliberately left unclosed
text = '''
<div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul></div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
Import HTML from a file
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

Accessing nodes

Get all nodes

To get all nodes in the HTML, use the rule //*:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())

result = html.xpath('//*')

print(result)

We get back a list of Element objects.
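Each entry can be inspected further; for instance (a small sketch continuing the snippet above):

# every item in the list is an lxml Element; .tag gives its tag name
print(result[0].tag)                  # e.g. 'html'
print([node.tag for node in result])  # tag names of all matched nodes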

Getting all tags of a given name

If we want to get all the li tags, we change the rule in html.xpath() to '//li':

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)

If nothing matches, html.xpath() returns an empty list [].

Getting child nodes

To select all direct a children of li nodes, use the rule '//li/a':

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

To retrieve all descendant a nodes instead, use //li//a. The difference is shown in the sketch below.
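A minimal sketch (added for illustration) contrasting / and //:

from lxml import etree

# '/' matches direct children only; '//' matches any descendant
text = '<ul><li><span><a href="#">deep</a></span></li></ul>'
html = etree.HTML(text)
print(html.xpath('//li/a'))    # [] -- the <a> is not a direct child of <li>
print(html.xpath('//li//a'))   # [<Element a at 0x...>] -- found via the <span>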

Getting nodes with a specific attribute

Use the @ sign for attribute filtering: a predicate in square brackets, expr[...], restricts which expr nodes match.

For example, //a[@href="link4.html"]:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]')
print(result)
Getting the parent node

If we want to get the parent of the above example and then get its class attribute:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
# you can also use the parent axis: '//a[@href="link4.html"]/parent::*/@class'

print(result)

See XPath Axes for how to use node axes; a small sketch of two axes follows.
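For instance, the standard ancestor:: and attribute:: axes (a minimal sketch added here):

from lxml import etree

text = '<div><ul><li class="item-0"><a href="link1.html">first</a></li></ul></div>'
html = etree.HTML(text)
# ancestor::* selects every ancestor of the matched node
print([node.tag for node in html.xpath('//a/ancestor::*')])
# ['html', 'body', 'div', 'ul', 'li']
print(html.xpath('//a/attribute::href'))   # ['link1.html'], same as //a/@href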

Getting the text in a node

The text() function in XPath retrieves a node's own text (excluding text inside its children):

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/text()')
print(result)
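If you also want the text inside descendant nodes, use //text() instead of text(). A quick sketch continuing the snippet above:

# //text() also collects text from all descendants (here, the <a> children)
result = html.xpath('//li[@class="item-0"]//text()')
print(result)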
Retrieve attributes
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

This kind of exact match only works when the attribute has a single value. For HTML like the following:

<li class="li li-first"><a href="link.html">first item</a></li>

The class attribute of the li node has two values (li and li-first), so exact matching fails. We can use the contains() function instead:

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

We can also use the and operator to combine conditions:

from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

Supplement

Click the links for a detailed XPath tutorial and the lxml library documentation.

The second technique: BeautifulSoup

This section focuses on the use of BeautifulSoup, a parsing library.

BeautifulSoup

BeautifulSoup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It parses documents and hands us the data we need to grab, which makes parsing much more efficient.

BeautifulSoup has complete official documentation, including a Chinese translation; see the BeautifulSoup official documentation.

⚠️ [Note] BeautifulSoup and LXML need to be installed.

BeautifulSoup can use several parsers, the main ones are as follows:

  • Python standard library, BeautifulSoup(markup, "html.parser"): built into Python, moderate speed, reasonably fault tolerant. Versions before Python 2.7.3 / 3.2.2 handle malformed documents poorly.
  • lxml HTML parser, BeautifulSoup(markup, "lxml"): fast and fault tolerant. Requires the C library.
  • lxml XML parser, BeautifulSoup(markup, "xml"): fast, and the only XML-capable parser. Requires the C library.
  • html5lib, BeautifulSoup(markup, "html5lib"): best fault tolerance, parses documents the way a browser does, and generates valid HTML5. Very slow, with an external Python dependency.

We usually use the lxml parser, as follows:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)

Initialization of a BeautifulSoup object

Use the following code to import the HTML, complete the initialization of the BeautifulSoup object, and auto-correct the document (e.g. close unclosed tags).

soup = BeautifulSoup(markup, "lxml")   # markup is the HTML string (str)

After initialization, we can also output the parsed document in a standard indented format:

print(soup.prettify())

Node selector

Selecting tags

When selecting, you can pick a node element by calling the node's name directly, and call its string attribute to get the text inside the node.

from bs4 import BeautifulSoup

html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><! -- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">... </p> """

soup = BeautifulSoup(html, 'lxml')

print(soup.title)           # <title>The Dormouse's story</title>
print(type(soup.title))     # <class 'bs4.element.Tag'>
print(soup.title.string)    # The Dormouse's story
print(soup.head)            # <head><title>The Dormouse's story</title></head>
print(soup.p)               # <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Nested selection

We can also nest selections, chaining them like parent.child.grandchild:

print(soup.head.title.string)
Associated selection

Sometimes it is difficult to select the desired node element in one step. In that case, we can select one node first and then, starting from it, select its child nodes, parent node, sibling nodes, and so on.

Getting child nodes

Once a node element is selected, call its contents attribute to retrieve the direct children; it returns them all as a list, in order.

A node such as a p tag may contain both text and element nodes; contents returns them all in one list.

soup.p.contents     # note that the text is split into several pieces

'''(result)
[
    'Once upon a time ... were\n',
    <a class="sister" href="..." id="link1"><!-- Elsie --></a>,
    ',\n',
    <a class="sister" href="..." id="link2">Lacie</a>,
    ' and\n',
    <a class="sister" href="..." id="link3">Tillie</a>,
    ';\nand ... well.'
]
'''

Alternatively, the children attribute returns a list_iterator over the same direct children; converting it to a list gives exactly the contents list:

>>> soup.p.children
<list_iterator object at 0x109d6a8d0>
>>> a = list(soup.p.children)
>>> b = soup.p.contents
>>> a == b
True

We can enumerate the child nodes one by one:

for i, child in enumerate(soup.p.children):
    print(i, child)

To get all descendants (every node below), call the descendants attribute, which recursively walks all children depth-first and returns a generator (<generator object Tag.descendants at 0x109d297c8>):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, d in enumerate(soup.p.descendants):
    print(i, d)
Get the parent and ancestor nodes

If we want to get the parent of a node element, we can call the parent attribute, which returns a single node:

>>> soup.span.parent
...   # the node that directly encloses the <span>

If we want to retrieve all the ancestor nodes (working our way up through the entire HTML), we can call the parents attribute, which returns a generator:

>>> soup.span.parents
<generator object PageElement.parents at 0x109d29ed0>
>>> list(soup.span.parents)
# result: [<p>...</p>, <div>...</div>, <body>...</body>, <html>...</html>]

⚠️ [Note] parent returns only the direct parent node, while parents returns all ancestors.

Getting a sibling node

To retrieve a sibling node, we can call four different properties that do different things:

  • next_sibling: gets the node's next sibling and returns a node.
  • previous_sibling: gets the node's previous sibling and returns a node.
  • next_siblings: gets all following siblings and returns a generator.
  • previous_siblings: gets all preceding siblings and returns a generator.
>>> from bs4 import BeautifulSoup
>>> html = """
... <html>
...     <body>
...         <p class="story">
...             Once upon a time there were three little sisters; and their names were
...             <a href="http://example.com/elsie" class="sister" id="link1">
...                 <span>Elsie</span>
...             </a>
...             Hello
...             <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
...             and
...             <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
...             and they lived at the bottom of a well.
...         </p>
... """
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
>>> soup.a.next_sibling
'\n Hello\n '
>>> soup.a.previous_sibling
'\n Once upon a time there were three little sisters; and their names were\n '
>>> soup.a.next_siblings
<generator object PageElement.next_siblings at 0x...>
>>> soup.a.previous_siblings
<generator object PageElement.previous_siblings at 0x...>
>>> for i in soup.a.previous_siblings:
...     print(i)
...
Once upon a time there were three little sisters; and their names were
>>> for i in soup.a.next_siblings:
...     print(i)
...
Hello
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.

Method selector

Sometimes it is hard to find the desired node directly with the node selectors. In that case we can use methods such as find_all() and find(): pass in the right parameters to query flexibly, get the node we want, and then use associated selection to reach the required information.

find()

find() takes tag names, attributes, or text, and returns the first element that matches the criteria.

find(name, attrs, recursive, text, **kwargs)

The following is an example:

import re
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(soup.find(attrs={'class': 'element'}))
print(soup.find(text=re.compile('.*?o.*?', re.S)))   # returns the first matching text itself, not a node
find_all()

find_all() is similar to find(), except that it queries all matching elements and returns them as a list, as in the sketch below.
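A small sketch (added for illustration), reusing the panel HTML from the find() example above:

from bs4 import BeautifulSoup

# html is the same panel fragment used in the find() example
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))                   # every <ul> tag, as a list
print(soup.find_all(attrs={'class': 'element'}))  # every tag with class="element"
print(soup.find_all(class_='element'))            # equivalent shorthand (note the underscore)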

More methods

Other find variants such as find_parents(), find_next_siblings(), and find_previous_siblings() are used the same way but search different ranges. See the documentation for details; a minimal sketch follows.
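A minimal self-contained sketch (the HTML here is made up for illustration):

from bs4 import BeautifulSoup

html = '<p>text <a id="link1">one</a> <a id="link2">two</a></p>'
soup = BeautifulSoup(html, 'lxml')
a = soup.find(id='link1')
print(a.find_parents('p'))        # all <p> ancestors of the node
print(a.find_next_siblings('a'))  # all later <a> siblings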


CSS selectors

BeautifulSoup also provides CSS selectors. Simply call the select() method and pass in a CSS selector; the result is a list of nodes matching it:

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))    # <class 'bs4.element.Tag'>

Extracting information

Getting the whole tag

To get the full HTML code for a tag, just write its node selector:

soup.title

Getting the tag name

Use the name attribute to get the tag's name (p, a, title, pre, etc.):

print(soup.title.name)

Get tag content

As we said earlier, we call the string property to get the text inside the node:

soup.title.string

⚠️ [Note] .string does not work if there are other tags inside the tag; it returns None:

>>> from bs4 import BeautifulSoup
>>> html = '<p>Foo<a href="#None">Bar</a></p>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.string)
None

To get the content, you can also use the node’s get_text() method:

soup.p.get_text()

Unlike .string, get_text() returns all the text inside the node, including the text of its descendants:

>>> from bs4 import BeautifulSoup
>>> html = '<p>Foo<a href="#None">Bar</a></p>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.string)
None
>>> print(soup.p.get_text())
FooBar
Retrieve attributes

Each node may have multiple attributes, such as id and class. We can call attrs to get all of them as a dict, and then fetch a specific attribute with the dict's bracket indexing or its get() method:

print(soup.p.attrs)
print(soup.p.attrs['name'])

'''(results)
{'class': ['title'], 'name': 'dromouse'}
dromouse
'''

You can also index the tag directly with brackets and the attribute name:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
    # the two lines above are equivalent

The third technique: PyQuery

This section focuses on the use of the parse library PyQuery.

PyQuery

pyquery: a jquery-like library for python

PyQuery is used in much the same way as jQuery. If you have played with jQuery on the front end, give it a try.

⚠️ [Note] PyQuery needs to be installed.

Initialization

PyQuery can be initialized from several kinds of data sources, such as an HTML string, the URL of a page, a local file name, and so on.

String initialization
from pyquery import PyQuery as pq

# sample HTML, reconstructed to be consistent with the output below
html = '''
<div>
    <h1>Header</h1>
    <p>Something</p>
    <p>Other thing</p>
    <div>
        <p>In div</p>
    </div>
</div>
'''
doc = pq(html)      # pass in the HTML string
print(doc('p'))     # pass in a CSS selector

'''(result)
<p>Something</p>
<p>Other thing</p>
<p>In div</p>
'''
URL initialization
from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com', encoding='utf-8')   # pass the page's encoding
print(doc('title'))   # prints the page's <title> element
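File initialization

PyQuery can also read a local file directly (a small sketch; test.html is a hypothetical local file):

from pyquery import PyQuery as pq

# filename= reads and parses a local HTML file (test.html is hypothetical)
doc = pq(filename='test.html')
print(doc('li'))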

CSS selectors

See CSS selectors table for details.
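A brief sketch of the selector syntax in practice (added for illustration):

from pyquery import PyQuery as pq

html = '<div id="container"><ul class="list"><li class="item">first</li></ul></div>'
doc = pq(html)
# '#' selects by id, '.' selects by class; whitespace means descendant
print(doc('#container .list .item'))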

Find nodes

  • To find child nodes, use the children('css-selector') method; with no argument it returns all children.
  • To find descendant nodes, use the find('css-selector') method; the argument must not be empty!
  • To find the parent node, use the parent('css-selector') method; with no argument it returns the direct parent.
  • To find ancestor nodes, use the parents('css-selector') method; with no argument it returns all ancestors.
  • To find sibling nodes, use the siblings('css-selector') method; with no argument it returns all siblings.
>>> p = doc('div')
>>> p
[<div#wrapper>, <div#head>, <div.head_wrapper>, <div.s_form>, <div.s_form_wrapper>, <div#lg>, <div#u1>, <div#ftCon>, <div#ftConw>]
>>> type(p)
<class 'pyquery.pyquery.PyQuery'>
>>> p.find('#head')
[<div#head>]
>>> print(p.find('#head'))
<div id="head">... </div>

Traversal

The results selected by PyQuery can be traversed:

>>> for i in p.parent():
...     print(i, type(i))
...
<Element a at 0x1055332c8> <class 'lxml.html.HtmlElement'>
<Element a at 0x105533368> <class 'lxml.html.HtmlElement'>
<Element a at 0x105533458> <class 'lxml.html.HtmlElement'>

Note that each item here is an lxml HtmlElement, so you handle it with lxml's methods.

Extracting information

attr() retrieves attributes
a = doc('a')
print(a.attr('href'))

attr() must be passed the name of the attribute to select. If the object contains more than one node, calling attr() returns the result for the first node only; traverse the selection to get each one, as sketched below.
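A traversal sketch using items(), which yields each matched node as its own PyQuery object (continuing the snippet above):

# items() yields one PyQuery object per matched node
for a in doc('a').items():
    print(a.attr('href'))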

text() gets the text
a = doc('a')
a.text()

This returns the text of all matched nodes, joined together. To get each node's text separately, traverse with items(), as sketched below.
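A small sketch of the contrast (continuing the snippet above):

print(doc('a').text())    # the text of all matched nodes, joined together
for a in doc('a').items():
    print(a.text())       # one node's text at a time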

Node operations

PyQuery can also modify nodes, but that is not the focus of this article and will not be covered here.