Make writing a habit together! This is the 14th day of my participation in the "Gold Digging Day New Plan · April More Text Challenge".
Hello everyone, I am [Dream Eraser], with 10 years of research experience, dedicated to spreading the Python technology stack.
Last updated: April 14, 2022
Practical scenario
If you're new to Python crawlers, chances are you'll start by collecting web pages, so quickly locating content within a page is the first hurdle you'll face. This post explains the easiest ways to locate elements on a web page.
We've recently been adding a series of Python crawler basics posts to get you up to speed quickly.
This post centers on the Beautiful Soup module, so the site we use for test collection is its official website. (These days anti-crawling measures are getting stricter and stricter; many sites can no longer be scraped without quickly getting blocked, so we have to be careful about what we collect.)
The official site: www.crummy.com/software/BeautifulSoup/
Beautiful Soup is a Python parsing library that converts HTML tags into a Tree of Python objects and lets us extract data from that tree.
Module installation is extremely simple:
pip install bs4

For any module you install in the future, try to use a domestic mirror; it is faster and more stable.

The package name of this module is bs4, so pay special attention when installing.
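A domestic mirror can either be passed per command with pip's -i flag or set once in pip's configuration file, so every later install uses it. The Tsinghua mirror below is just one common choice shown for illustration, not one prescribed by this article:

```ini
# ~/.pip/pip.conf on Linux/macOS (%APPDATA%\pip\pip.ini on Windows)
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
```

With this in place, a plain `pip install bs4` resolves packages through the mirror automatically.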
The basic usage is shown below:
import requests
from bs4 import BeautifulSoup


def ret_html():
    """Get the HTML text of the page."""
    res = requests.get('https://www.crummy.com/software/BeautifulSoup/', timeout=3)
    return res.text


if __name__ == '__main__':
    html_str = ret_html()
    soup = BeautifulSoup(html_str, 'lxml')
    print(soup)
Note the module import line and the two arguments passed to the BeautifulSoup constructor when instantiating the soup object: the string to be parsed and a parser name. The official documentation recommends lxml for its parsing speed (it is a separate package, installed with pip install lxml).
The output of the above code is shown below and looks like a normal HTML code file.
We can also call the soup object's soup.prettify() method to format the HTML tags, so that when you store the result in an external file, the HTML code looks pretty.
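As a minimal sketch of prettify() plus saving to a file, assuming a small inline snippet in place of the real page, and Python's built-in 'html.parser' in place of the lxml parser used above so nothing extra needs installing:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page,
# so the sketch runs without a network request.
html_str = "<html><body><h1>Beautiful Soup</h1><p>demo</p></body></html>"

soup = BeautifulSoup(html_str, "html.parser")

pretty = soup.prettify()  # one tag per line, indented
print(pretty)

# Writing the formatted markup to an external file:
with open("page.html", "w", encoding="utf-8") as f:
    f.write(pretty)
```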
Objects in the BeautifulSoup module
The BeautifulSoup class parses HTML text into a Tree of Python objects, including the four most important ones, Tag, NavigableString, BeautifulSoup, and Comment, which we’ll cover next.
The BeautifulSoup object
This object represents the entire HTML page; incomplete HTML is automatically completed when the object is instantiated.
html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
print(type(soup))
The Tag object
A Tag object is a page tag, i.e., a page element object. For example, to get the h1 Tag object from the bs4 official site, the code is as follows:
if __name__ == '__main__':
    html_str = ret_html()
    soup = BeautifulSoup(html_str, 'lxml')
    # print(soup.prettify())
    print(soup.h1)
The result is also the h1 tag in the page:
<h1>Beautiful Soup</h1>
Use the type function in Python to check its type as follows:
print(soup.h1)
print(type(soup.h1))
Instead of a string, you get a Tag object.
<h1>Beautiful Soup</h1>
<class 'bs4.element.Tag'>
Since it is a Tag object, it also has some specific attributes.

Get the tag name:

print(soup.h1)
print(type(soup.h1))
print(soup.h1.name)  # get the tag name
Get an attribute value from the Tag object:

print(soup.img)         # get the first img tag on the page
print(soup.img['src'])  # get the DOM attribute value of the element
Get all attributes of the tag through the attrs attribute:

print(soup.img)        # get the first img tag on the page
print(soup.img.attrs)  # get all attributes of the element as a dictionary
All of the output from the above code is shown below; you can practice with any tag you choose.
<h1>Beautiful Soup</h1>
<class 'bs4.element.Tag'>
h1
<img align="right" src="10.1.jpg" width="250"/>
{'align': 'right', 'src': '10.1.jpg', 'width': '250'}
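The lookups above can be reproduced on a small inline snippet. The markup below is a hypothetical stand-in for the real page, and the built-in 'html.parser' stands in for lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html_str = '<h1>Beautiful Soup</h1><img align="right" src="10.1.jpg" width="250"/>'
soup = BeautifulSoup(html_str, 'html.parser')

print(soup.h1.name)     # tag name: h1
print(soup.img['src'])  # single attribute value: 10.1.jpg
print(soup.img.attrs)   # all attributes as a dictionary
```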
The NavigableString object
A NavigableString object holds the text inside a tag. For the p tag below, it would retrieve "I'm an eraser":

<p>I'm an eraser</p>
Getting the object is also easy: use the string attribute of the Tag object.
nav_obj = soup.h1.string
print(type(nav_obj))
The following output is displayed
<class 'bs4.element.NavigableString'>
If the target tag is a self-closing (single) tag, string returns None.
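A small sketch of both cases, using hypothetical markup and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one tag with text, one self-closing tag.
soup = BeautifulSoup("<p>I'm an eraser</p><br/>", 'html.parser')

nav_obj = soup.p.string
print(type(nav_obj))   # <class 'bs4.element.NavigableString'>
print(nav_obj)         # I'm an eraser
print(soup.br.string)  # None: a self-closing tag has no text
```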
Besides the string attribute of an object, you can also use the text attribute and the get_text() method to get the tag content.
print(soup.h1.text)
print(soup.p.get_text())
print(soup.p.get_text('&'))
text returns a combined string of the contents of all child nodes; get_text() has the same effect, except that get_text() accepts a separator argument, such as the ampersand in the code above, and a strip=True argument to strip surrounding whitespace.
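A sketch of the difference, on hypothetical nested markup with the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup to show how child texts are combined.
soup = BeautifulSoup('<p> Beautiful <b>Soup</b> </p>', 'html.parser')

print(repr(soup.p.text))                 # all child texts joined: ' Beautiful Soup '
print(repr(soup.p.get_text('&')))        # same, with '&' between the pieces
print(soup.p.get_text('&', strip=True))  # whitespace stripped: Beautiful&Soup
```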
The Comment object
It retrieves comment content from a web page. It is rarely useful, so we will skip it.
Both BeautifulSoup objects and Tag objects support the tag lookup methods described below.
The find() and find_all() methods
The find() method on BeautifulSoup and Tag objects is called to find the specified object in a web page. The syntax for this method is as follows:
obj.find(name, attrs, recursive, text, **kwargs)
The method returns the first element found, or None if nothing is found. The parameters are:

name: the tag name;
attrs: the tag attributes;
recursive: whether to search all descendant elements (the default);
text: the tag content.
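The parameters above can be sketched on hypothetical inline markup (using the built-in 'html.parser'):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links; find() returns only the first match.
html_str = '<a href="/one">one</a><a class="cta" href="/two">two</a>'
soup = BeautifulSoup(html_str, 'html.parser')

print(soup.find('a'))                          # first <a> tag
print(soup.find('a', attrs={'class': 'cta'}))  # first <a> with class="cta"
print(soup.find('nav'))                        # None: no such tag on the page
```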
For example, to continue searching for the a tag in the page requested above, the code is as follows:
html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
print(soup.find('a'))
You can also use the attrs parameter as follows:
html_str = ret_html()
soup = BeautifulSoup(html_str, 'lxml')
# print(soup.find('a'))
print(soup.find(attrs={'class': 'cta'}))
The find() method also supports special keyword arguments for direct lookups, such as id='xxx' to find a tag whose id attribute matches, or class_='xxx' to find a tag whose class attribute matches (the trailing underscore avoids clashing with Python's class keyword).
print(soup.find(class_='cta'))
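A quick sketch of both keyword shortcuts on hypothetical markup, with the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical markup; id= and class_= are keyword shortcuts of find().
html_str = '<div id="main"><p class="cta">Sign up</p></div>'
soup = BeautifulSoup(html_str, 'html.parser')

print(soup.find(id='main').name)     # div
print(soup.find(class_='cta').text)  # Sign up
```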
The find_all() method returns all matching tags, as shown in the following syntax:
obj.find_all(name, attrs, recursive, text, limit)
The parameter worth highlighting is limit, the maximum number of matches to return; find() can be thought of as find_all() with limit=1, which makes it easier to understand.
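A sketch of limit in action, again on hypothetical markup with the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links to show the limit argument.
html_str = '<a>1</a><a>2</a><a>3</a>'
soup = BeautifulSoup(html_str, 'html.parser')

print(len(soup.find_all('a')))           # 3: all matches
print(len(soup.find_all('a', limit=2)))  # 2: at most two matches
# find() returns the same first element as find_all(..., limit=1)[0]
print(soup.find('a').text)               # 1
```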
If you find any errors in this article, please point them out in the comments section.