Make writing a habit together! This is the 14th day of my participation in the "Gold Digging Day New Plan · April More Text Challenge".
Hello everyone, I am [Dream Eraser], with 10 years of research experience, dedicated to spreading the Python technology stack.
Last updated: April 14, 2022
Practical scenario
If you're new to Python crawlers, chances are you'll start by collecting web pages, so quickly locating content within a page is the first hurdle you'll face. This post explains the easiest ways to locate elements on a web page.
We've recently been adding a series of Python crawler basics posts to get you up to speed quickly.
This post centers on the Beautiful Soup module, so the site we use for test collection is its official website. (These days anti-crawling measures are getting stricter and stricter; many sites can no longer be scraped without quickly getting blocked, so we have to be careful about what we collect.)
The official site: www.crummy.com/software/BeautifulSoup/
Beautiful Soup is a Python parsing library that converts HTML tags into a Tree of Python objects and lets us extract data from that tree.
Module installation is extremely simple:
pip install bs4

For any module you install in the future, try to use a domestic mirror; it is faster and more stable.

The package name of this module is bs4, so pay special attention when installing.
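A domestic mirror can either be passed per command with pip's -i flag or set once in pip's configuration file, so every later install uses it. The Tsinghua mirror below is just one common choice shown for illustration, not one prescribed by this article:

```ini
# ~/.pip/pip.conf on Linux/macOS (%APPDATA%\pip\pip.ini on Windows)
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
```

With this in place, a plain `pip install bs4` resolves packages through the mirror automatically.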
The basic usage is shown below:
import requests
from bs4 import BeautifulSoup


def ret_html():
    """Get the HTML text of the page."""
    res = requests.get('https://www.crummy.com/software/BeautifulSoup/', timeout=3)
    return res.text


if __name__ == '__main__':
    html_str = ret_html()
    soup = BeautifulSoup(html_str, 'lxml')
    print(soup)
Note the module import line and the two arguments passed to the BeautifulSoup constructor when instantiating the soup object: the string to be parsed and a parser name. The official documentation recommends lxml for its parsing speed (it is a separate package, installed with pip install lxml).
The output of the above code is shown below and looks like a normal HTML code file.
We can also call the soup object's soup.prettify() method to format the HTML tags, so that when you store the result in an external file, the HTML code looks pretty.
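As a minimal sketch of prettify() plus saving to a file, assuming a small inline snippet in place of the real page, and Python's built-in 'html.parser' in place of the lxml parser used above so nothing extra needs installing:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page,
# so the sketch runs without a network request.
html_str = "<html><body><h1>Beautiful Soup</h1><p>demo</p></body></html>"

soup = BeautifulSoup(html_str, "html.parser")

pretty = soup.prettify()  # one tag per line, indented
print(pretty)

# Writing the formatted markup to an external file:
with open("page.html", "w", encoding="utf-8") as f:
    f.write(pretty)
```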
Objects in the BeautifulSoup module
The BeautifulSoup class parses HTML text into a Tree of Python objects, including the four most important ones, Tag, NavigableString, BeautifulSoup, and Comment, which we’ll cover next.
The BeautifulSoup object
This object represents the entire HTML page; incomplete HTML is automatically completed when the object is instantiated.
html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
print(type(soup))
The Tag object
A Tag object is a page tag, i.e., a page element object. For example, to get the h1 Tag object from the bs4 official site, the code is as follows:
if __name__ == '__main__':
    html_str = ret_html()
    soup = BeautifulSoup(html_str, 'lxml')
    # print(soup.prettify())
    print(soup.h1)
The result is also the h1 tag in the page:
<h1>Beautiful Soup</h1>
Use the type function in Python to check its type as follows:
print(soup.h1)
print(type(soup.h1))
Instead of a string, you get a Tag object.
<h1>Beautiful Soup</h1>
<class 'bs4.element.Tag'>
Since it is a Tag object, it also has some specific attributes.

Get the tag name:

print(soup.h1)
print(type(soup.h1))
print(soup.h1.name)  # get the tag name
Get an attribute value from the Tag object:

print(soup.img)         # get the first img tag on the page
print(soup.img['src'])  # get the DOM attribute value of the element
Get all attributes of the tag through the attrs attribute:

print(soup.img)        # get the first img tag on the page
print(soup.img.attrs)  # get all attributes of the element as a dictionary
All of the output from the above code is shown below; you can practice with any tag you choose.
<h1>Beautiful Soup</h1>
<class 'bs4.element.Tag'>
h1
<img align="right" src="10.1.jpg" width="250"/>
{'align': 'right', 'src': '10.1.jpg', 'width': '250'}
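The lookups above can be reproduced on a small inline snippet. The markup below is a hypothetical stand-in for the real page, and the built-in 'html.parser' stands in for lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html_str = '<h1>Beautiful Soup</h1><img align="right" src="10.1.jpg" width="250"/>'
soup = BeautifulSoup(html_str, 'html.parser')

print(soup.h1.name)     # tag name: h1
print(soup.img['src'])  # single attribute value: 10.1.jpg
print(soup.img.attrs)   # all attributes as a dictionary
```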
The NavigableString object
A NavigableString object holds the text inside a tag. For the p tag below, it would retrieve "I'm an eraser":

<p>I'm an eraser</p>
Getting the object is also easy: use the string attribute of the Tag object.
nav_obj = soup.h1.string
print(type(nav_obj))
The following output is displayed
<class 'bs4.element.NavigableString'>
If the target tag is a self-closing (single) tag, string returns None.
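A small sketch of both cases, using hypothetical markup and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one tag with text, one self-closing tag.
soup = BeautifulSoup("<p>I'm an eraser</p><br/>", 'html.parser')

nav_obj = soup.p.string
print(type(nav_obj))   # <class 'bs4.element.NavigableString'>
print(nav_obj)         # I'm an eraser
print(soup.br.string)  # None: a self-closing tag has no text
```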
Besides the string attribute of an object, you can also use the text attribute and the get_text() method to get the tag content.
print(soup.h1.text)
print(soup.p.get_text())
print(soup.p.get_text('&'))
text returns a combined string of the contents of all child nodes; get_text() has the same effect, except that get_text() accepts a separator argument, such as the ampersand in the code above, and a strip=True argument to strip surrounding whitespace.
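A sketch of the difference, on hypothetical nested markup with the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup to show how child texts are combined.
soup = BeautifulSoup('<p> Beautiful <b>Soup</b> </p>', 'html.parser')

print(repr(soup.p.text))                 # all child texts joined: ' Beautiful Soup '
print(repr(soup.p.get_text('&')))        # same, with '&' between the pieces
print(soup.p.get_text('&', strip=True))  # whitespace stripped: Beautiful&Soup
```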
The Comment object
It retrieves comment content from a web page. It is rarely useful, so we will skip it.
Both BeautifulSoup objects and Tag objects support the tag lookup methods described below.
The find() and find_all() methods
The find() method on BeautifulSoup and Tag objects is called to find the specified object in a web page. The syntax for this method is as follows:
obj.find(name, attrs, recursive, text, **kwargs)
The method returns the first element found, or None if nothing is found. The parameters are:

name: the tag name;
attrs: the tag attributes;
recursive: whether to search all descendant elements (the default);
text: the tag content.
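The parameters above can be sketched on hypothetical inline markup (using the built-in 'html.parser'):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links; find() returns only the first match.
html_str = '<a href="/one">one</a><a class="cta" href="/two">two</a>'
soup = BeautifulSoup(html_str, 'html.parser')

print(soup.find('a'))                          # first <a> tag
print(soup.find('a', attrs={'class': 'cta'}))  # first <a> with class="cta"
print(soup.find('nav'))                        # None: no such tag on the page
```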
For example, to continue searching for the a tag in the page requested above, the code is as follows:
html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
print(soup.find('a'))
You can also use the attrs parameter as follows:
html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
# print(soup.find('a'))
print(soup.find(attrs={'class': 'cta'}))
The find() method also supports special keyword arguments for direct lookups, such as id='xxx' to find a tag whose id attribute matches, or class_='xxx' to find a tag whose class attribute matches (the trailing underscore avoids clashing with Python's class keyword).
print(soup.find(class_='cta'))
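A quick sketch of both keyword shortcuts on hypothetical markup, with the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical markup; id= and class_= are keyword shortcuts of find().
html_str = '<div id="main"><p class="cta">Sign up</p></div>'
soup = BeautifulSoup(html_str, 'html.parser')

print(soup.find(id='main').name)     # div
print(soup.find(class_='cta').text)  # Sign up
```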
The find_all() method returns all matching tags, as shown in the following syntax:
obj.find_all(name, attrs, recursive, text, limit)
The parameter worth highlighting is limit, the maximum number of matches to return; find() can be thought of as find_all() with limit=1, which makes it easier to understand.
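A sketch of limit in action, again on hypothetical markup with the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links to show the limit argument.
html_str = '<a>1</a><a>2</a><a>3</a>'
soup = BeautifulSoup(html_str, 'html.parser')

print(len(soup.find_all('a')))           # 3: all matches
print(len(soup.find_all('a', limit=2)))  # 2: at most two matches
# find() returns the same first element as find_all(..., limit=1)[0]
print(soup.find('a').text)               # 1
```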
If you find any errors in this article, please point them out in the comments section.