1. Introduction
What is Beautiful Soup 4?
Beautiful Soup 4 (BS4) is a third-party Python library that lets crawlers parse HTML pages and pick out exactly the data they need. With BS4, writing a crawler is quick and pleasant.
BS4 is powerful and easy to use. Compared with extracting data using regular expressions alone, it takes far less effort.
2. Install Beautiful Soup 4
BS4 is a third-party Python library, so it must be installed before use.
pip install beautifulsoup4
2.1 Working principle of BS4
To really understand and master BS4, you need some understanding of how it works underneath.
Before BS4 can look up page data, it loads an HTML file or HTML fragment and builds a tree of objects in memory that mirrors the HTML document (similar to W3C DOM parsing). This process is called parsing.
BS4 itself does not implement parsing. Instead, it provides an interface to third-party parsers, which makes BS4 highly extensible. Regardless of the parser used, BS4 hides the underlying differences and provides a unified set of operations (query, traverse, modify, add, …).
Start by constructing a BeautifulSoup object. The BeautifulSoup object refers to the entire document tree and serves as the entry point into it.
A look at the BeautifulSoup constructor shows that many parameters can be passed when constructing the object, but generally only the first two matter. The remaining parameters can keep their default values and BS4 works fine (convention over configuration).
def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             element_classes=None, **kwargs):
- markup: the HTML document. It can be an HTML fragment given as a string, or a file object.
from bs4 import BeautifulSoup

# Use an HTML snippet (the <h1> wrapper around the text is illustrative)
html_code = "<h1>BeautifulSoup 4 Introduction</h1>"
bs = BeautifulSoup(html_code, "lxml")
print(bs)
The following uses a file object as a parameter.
from bs4 import BeautifulSoup

# Open the file with an explicit encoding; the with block closes it afterwards
with open("d:/hello.html", encoding="utf-8") as file:
    bs = BeautifulSoup(file, "lxml")
print(bs)
When passing a file object, open it with the correct encoding (UTF-8, one of the Unicode encodings, is the usual choice).
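As a minimal sketch of the two ways to hand encoded content to BS4, the example below writes a small temporary file (a hypothetical stand-in for hello.html), then parses it once through a text-mode file object and once as raw bytes with the constructor's from_encoding parameter. It uses the built-in html.parser so that no extra parser has to be installed.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample file, created here only so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "hello.html"
    path.write_text("<p>Café menu</p>", encoding="utf-8")

    # Option 1: open in text mode with an explicit encoding.
    with open(path, encoding="utf-8") as f:
        soup_from_text = BeautifulSoup(f, "html.parser")

    # Option 2: pass raw bytes and name the encoding via from_encoding.
    raw = path.read_bytes()
    soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

print(soup_from_text.p.string)
print(soup_from_bytes.p.string)
```

BS4 reads the whole file object inside the constructor, so the file can be closed as soon as the BeautifulSoup object exists.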
- features: specifies the parser. The parser is the heart of BS4; without one, BS4 is an empty shell.
BS4 supports Python's built-in HTML parser as well as third-party parsers such as lxml and html5lib.
Anyone can write a custom parser, as long as it follows the BS4 interface specification.
So however powerful the parsing engine in Google's browser may be, it cannot be used with BS4 because it does not implement the BS4 interface.
If you want to use a third-party parser, install it first:
Install lxml:
pip install lxml
Install html5lib:
pip install html5lib
A side-by-side comparison of the parsers:
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in Python versions before 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library |
lxml XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library |
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses the document the way a browser does; produces HTML5-format documents; pure Python, no C extension required | Slow |
Each parser has its strengths. html5lib, for example, has excellent fault tolerance, but lxml is generally preferred because speed usually matters more.
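Because the third-party parsers may not be installed everywhere, one possible approach is to try the preferred parser first and fall back to the built-in one. The sketch below assumes a helper named make_soup (a hypothetical name, not part of BS4) and relies on BS4 raising FeatureNotFound when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, used_parser = make_soup("<p>hello</p>")
print(used_parser, soup.p.string)
```

Since html.parser ships with Python, keeping it last in the list means the helper should always find a usable parser.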
2.2 Differences in parsers
The parser's job is to load HTML (or XML) code and build a hierarchical object tree in memory (hereafter called the BS tree). Although BS4 unifies how the various parsers are used at the application level, each parser has its own underlying implementation.
For well-formed, standards-compliant HTML documents, all the parsers do the job well and differ only in speed. That is, after all, the most basic functionality they should provide.
When the document is malformed, however, each parser follows its own design, and the results differ in subtle ways.
BS4 cannot control these differences in underlying logic.
2.2.1 lxml
Parsing an HTML snippet with lxml.
from bs4 import BeautifulSoup

html_code = "<a><p><p>"
bs = BeautifulSoup(html_code, "lxml")
print(bs)
# Output:
# <html><body><a><p></p><p></p></a></body></html>
When parsing, lxml automatically adds the html and body tags and completes tags that are missing their closing tag. As shown above, the a tag is the parent of the two p tags, and the first p tag is a sibling of the second.
Now parse the following HTML snippet with lxml.
from bs4 import BeautifulSoup

html_code = "<a></p>"
bs = BeautifulSoup(html_code, "lxml")
print(bs)
# Output:
# <html><body><a></a></body></html>
lxml treats a closing tag that has no matching opening tag as invalid and discards it instead of trying to repair it. Here the stray </p> simply disappears from the result.
2.2.2 html5lib
Parsing an incomplete HTML snippet with html5lib.
from bs4 import BeautifulSoup

html_code = "<a><p><p>"
bs = BeautifulSoup(html_code, "html5lib")
print(bs)
# Output:
# <html><head></head><body><a><p></p><p></p></a></body></html>
When parsing, html5lib automatically adds the html, head, and body tags. Apart from the extra head tag, the result is the same as with lxml: the missing closing tags are completed.
Use html5lib to parse the following HTML snippet.
from bs4 import BeautifulSoup

html_code = "<a></p>"
bs = BeautifulSoup(html_code, "html5lib")
print(bs)
# Output:
# <html><head></head><body><a><p></p></a></body></html>
html5lib adds an opening tag for a closing tag that has no opening tag. html5lib follows part of the HTML5 standard and completes as much of the structure as it can.
2.2.3 Python built-in parser
from bs4 import BeautifulSoup

html_code = "<a><p><p>"
bs = BeautifulSoup(html_code, "html.parser")
print(bs)
# Output:
# <a><p><p></p></p></a>
Unlike the previous two parsers, html.parser does not add html, head, or body tags. It does complete missing closing tags, but the resulting structure is different: the a tag is the ancestor of both p tags, and the first p tag is the parent of the second p tag, not its sibling.
lxml, html5lib, and html.parser all recognize tags that lack a closing tag and complete them. Now look at how html.parser handles a closing tag with no opening tag.
from bs4 import BeautifulSoup

html_code = "<a></p>"
bs = BeautifulSoup(html_code, "html.parser")
print(bs)
# Output:
# <a></a>
Like lxml, html.parser discards a closing tag that has no opening tag.
The results above show that html5lib is the most fault-tolerant, so consider html5lib when the input documents may be badly formed. When documents are expected to be well formed, lxml is a good choice.
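To check these differences on your own machine, a small comparison loop can help. The sketch below feeds the same malformed snippet to each parser and records the serialized result; parse_with_each is a hypothetical helper name, and parsers that are not installed are recorded as None rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to its serialized tree, or None if not installed."""
    results = {}
    for parser in parsers:
        try:
            results[parser] = str(BeautifulSoup(markup, parser))
        except FeatureNotFound:
            results[parser] = None
    return results


results = parse_with_each("<a></p>")
for parser, output in results.items():
    print(f"{parser:12} {output}")
```

Running the same loop with "<a><p><p>" shows the nesting difference between html.parser and the other two parsers.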
3. BS4 tree object
The BS4 in-memory tree is a mapping of an HTML document or snippet and consists of four types of Python objects: BeautifulSoup, Tag, NavigableString, and Comment.
- The BeautifulSoup object maps the entire HTML document structure and provides global methods and properties for operating on the whole BS4 tree. It is also the entry object.
class BeautifulSoup(Tag): pass
- A Tag object maps a tag in the HTML document, that is, a node (the object's name is the same as the tag's name). It provides methods and attributes for manipulating page tags. A BeautifulSoup object is essentially a Tag object too.
The key to parsing page data is to find the Tag object that contains the content, and BS4 provides many flexible and concise methods for doing so.
Using BS4 means starting from the BeautifulSoup object and working step by step toward the target Tag objects.
- A NavigableString object maps the text content contained in an HTML tag and provides methods and properties that operate on text.
Analyzing a page is ultimately about capturing data, so it is important to understand the methods and properties of this object.
It is obtained through the string property of a Tag object.
- A Comment object maps a comment in the document. This object is not used much.
To recap: the key to using BS4 is to take one Tag object (node object) as a reference and find the other Tag objects related to it. The search always starts from the BeautifulSoup object.
To find other nodes from a given node, you need to understand the relationships between nodes, mainly parent-child and sibling relationships.
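The four object types can be seen side by side in a tiny document. The snippet below is a sketch on made-up markup (the class name and text are illustrative only) and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = "<p class='title'>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p_tag = soup.p                    # a Tag
text, comment = p_tag.contents    # a NavigableString and a Comment

print(type(soup).__name__, isinstance(soup, Tag))
print(type(p_tag).__name__, p_tag.name)
print(type(text).__name__, repr(text))
print(type(comment).__name__, repr(comment))
```

Note that Comment is itself a kind of NavigableString, so code that must treat comments differently needs to check for Comment first.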
Now let's work through a case to see the role of each object.
Case description: crawl the latest movie information from the Douban movie chart (movie.douban.com/chart) and save the movie information in CS… document format.
3.1 Searching for the target Tag
The key to getting the data you need is finding the target Tag. BS4 provides a rich variety of methods that help developers find the desired Tag objects quickly and flexibly. The following case shows several of them.
The entry page of the Douban movie chart is movie.douban.com/chart.
Open the page in Google Chrome and use the browser's developer tools to analyze the HTML of the movie information on the page. Start by downloading the information for the first movie.
The chart changes over time, so the first movie you see may be different.
The page uses a table layout. A table layout is very regular, which makes the structure easy to analyze.
First download the poster image and the title of the first movie. The image uses the img tag, so after parsing, the BS4 tree contains a corresponding img Tag object.
The tree contains many img Tag objects. How do you find the one for the first movie's image?
from bs4 import BeautifulSoup
import requests

# Server address
url = "https://movie.douban.com/chart"
# Disguise the request as coming from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
# Send the request
resp = requests.get(url, headers=headers)
html_code = resp.text
# Get the BeautifulSoup object: the first step of a long journey
bs = BeautifulSoup(html_code, "lxml")
# The easiest way to get a Tag object from the BS4 tree is by tag name
img_tag = bs.img
# bs.img returns the first img Tag object in the BS4 tree
print(type(img_tag))
print(img_tag)
There is an element of luck here: bs.img returns the image tag of the first movie only because that image happens to be the first img tag on the whole page.
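To see why the tag-name shortcut depends on document order, consider a small made-up page in which a logo comes before the poster. The markup and file names below are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<div id="banner"><img src="logo.png"></div>
<table><tr><td><img src="poster-1.jpg"></td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

first_img = soup.img          # the first img in document order
table_img = soup.table.img    # shortcuts can be chained to narrow the search
missing = soup.video          # no such tag in the document

print(first_img["src"])
print(table_img["src"])
print(missing)
```

When the tag does not exist the shortcut yields None, so checking the result before indexing it avoids a TypeError on pages whose layout has changed.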
The image path is stored in the src attribute of the img tag, so all that remains is to read the src attribute value of the img Tag object.
Tag objects provide the attrs attribute, which makes it easy to get any attribute value of a Tag object.
Syntax:
Use Tag["attribute name"] to read a single attribute, or Tag.attrs to retrieve all attributes of the Tag object.
The following uses attrs to get all the attribute information of the tag object. It returns a Python dictionary.
# The code above is omitted
img_tag_attrs = img_tag.attrs
print(img_tag_attrs)
# Output: all attributes of the img Tag object as a dictionary
# {'src': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2670448229.jpg', 'width': '75', 'alt': 'Metamorphosis of Youth', 'class': []}
A single-valued attribute returns a single value. The class attribute is multi-valued (a tag can carry several CSS classes), so it is returned as a list. For now, only the image path is needed.
img_tag_attrs = img_tag.attrs
# Option 1: look the attribute up in the attrs dictionary
img_tag_src = img_tag_attrs["src"]
# Option 2: index the Tag object directly
img_tag_src = img_tag["src"]
print(img_tag_src)
# Output:
# https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2670448229.jpg
The two options in the code above are essentially the same. With the image path in hand, the rest is easy.
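The attribute access patterns can be tried without a network connection. The sketch below uses a made-up img tag (file name and class names are illustrative) to contrast single-valued and multi-valued attributes, and shows get() as a way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

html = '<img src="poster.jpg" width="75" class="cover large">'
img_tag = BeautifulSoup(html, "html.parser").img

print(img_tag.attrs)         # every attribute as a dictionary
print(img_tag["src"])        # single-valued attribute: a string
print(img_tag["class"])      # multi-valued attribute: a list of strings
print(img_tag.get("alt"))    # missing attribute: None instead of a KeyError
```

Indexing with square brackets raises KeyError for a missing attribute, so get() is the safer choice when pages are inconsistent.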
Complete code:
from bs4 import BeautifulSoup
import requests

# Server address
url = "https://movie.douban.com/chart"
# Disguise the request as coming from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
# Send the request
resp = requests.get(url, headers=headers)
html_code = resp.text
bs = BeautifulSoup(html_code, "lxml")
img_tag = bs.img
# img_tag_attrs = img_tag.attrs
# img_tag_src = img_tag_attrs["src"]
img_tag_src = img_tag["src"]
# Download the image from the image path and save it locally
img_resp = requests.get(img_tag_src, headers=headers)
with open("D:/movie/movie01.jpg", "wb") as f:
    f.write(img_resp.content)
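The script above assumes the page always contains an img tag with an absolute src. As a hedged sketch of a more defensive variant, the hypothetical helper first_image_url below returns None when no usable image is found and resolves relative paths against the page address with the standard library's urljoin. The example.com addresses are placeholders, not the real site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def first_image_url(html, base_url):
    """Return the absolute URL of the first <img> in html, or None."""
    img_tag = BeautifulSoup(html, "html.parser").img
    if img_tag is None or not img_tag.get("src"):
        return None
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return urljoin(base_url, img_tag["src"])


page = '<td><img src="/img/poster.jpg" alt="poster"></td>'
print(first_image_url(page, "https://example.com/chart"))
print(first_image_url("<p>no image here</p>", "https://example.com/chart"))
```

The returned URL can then be passed to requests.get() exactly as in the complete code.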
3.2 Filtering method
With the picture in hand, how do you get the movie's name and its introduction? The following is the HTML snippet that contains the movie name.
<a href="https://movie.douban.com/subject/35284253/" class="">Metamorphosis of Youth /<span style="font-size:13px;">Bear Embrace youth (Hong Kong)/Nurture Youth (Taiwan)</span></a>
The movie name is contained in an a tag. As mentioned above, bs.<tag name> returns the first tag object with that name in the whole page.
Clearly, the a tag holding the first movie's title is unlikely to be the first a tag on the page (that would be too lucky), so it cannot be obtained directly with bs.a. Nor does this a tag have any obvious feature that distinguishes it from other a tags.
So take another approach: move up from this a tag to its parent div tag.
<div class="pl2">
<a href="https://movie.douban.com/subject/35284253/" class="">Metamorphosis of Youth /<span style="font-size:13px;">Bear Embrace youth (Hong Kong)/Nurture Youth (Taiwan)</span>
</a>
<p class="pl">2022-03-11(US Network)/Jin An Kang/Sandra Oh/Ava Moses/Metriy Ramakrishnan/Park Hyein/Oren Lee/Ho Wai-ching/Tristan Eric Chan/Wu Han-chang/Phineas O 'Connell/Jordan Fisher/Tofei-Engo / Grayson Villanueva/Josh Levy/Lori Tan Zien...</p>
<div class="star clearfix">
<span class="allstar40"></span>
<span class="rating_nums">8.2</span>
<span class="pl">(45,853 respondents)</span>
</div>
</div>
The page also contains many div tags, so how do you pick out the div that holds the movie name? Analysis shows that this div has an attribute that sets it apart from the others: class="pl2". You can filter div tags by this attribute.
What is a filtering method?
A filtering method is a method of a BS4 Tag object that filters the nodes beneath it.
BS4 provides filtering methods such as find() and find_all(). As the names suggest, they pick individual nodes out of a group (all descendant nodes) based on their characteristics.
Tip: if you call such a method on the BeautifulSoup object, it filters the nodes of the entire BS4 tree.
If you call it on a specific Tag object, it filters the nodes beneath that Tag.
The find() and find_all() methods take almost the same arguments (find_all() additionally accepts limit). The difference: find() returns the first object that matches, while find_all() returns all matching objects.
find_all(name, attrs, recursive, string, limit, **kwargs)
find(name, attrs, recursive, string, **kwargs)
Parameter description:
- name: can be a tag name, a regular expression, a list, a Boolean, or a custom function, so it is very flexible.
import re

# Tag name: find the first div tag object in the page
div_tag = bs.find("div")
# Regular expression: search for all tags whose name starts with d
div_tag = bs.find_all(re.compile("^d"))
# List: search for all div or a tags
div_tag = bs.find_all(["div", "a"])
# Boolean: find all tags
bs.find_all(True)
# Custom function: search for tag objects that have a class attribute but no id attribute
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
bs.find_all(has_class_but_no_id)
- attrs: accepts a dictionary that describes the attributes of the tag object to search for as key-value pairs.
# Find the first tag whose class attribute is pl2
div_tag = bs.find(attrs={"class": "pl2"})
Tip: when using this parameter, you can narrow the search with the name parameter.
div_tag = bs.find("div", attrs={"class": "pl2"})
This looks for a div tag object whose class attribute value is pl2.
- string: can be a string, a regular expression, a list, or a Boolean. It searches by matching the tag's text content.
# Search for span tag objects whose text matches the pattern.
# BS4 applies the pattern with search(), so 'youth' may appear anywhere; prefix it with ^ to require it at the start.
span_tags = bs.find_all("span", string=re.compile(r"youth.*"))
- limit: limits the number of results returned by find_all().
- recursive: whether to search all descendants of a node. The default is True. When set to False, only direct children are searched.
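The parameters described above can be exercised together on a small, self-contained document. The markup below is hypothetical, loosely modeled on the pl2 structure discussed in this section, and the variable names are chosen only for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the structure discussed above.
html = """
<div class="pl2" id="first"><a href="/m/1">Movie One</a><p class="pl">intro one</p></div>
<div class="pl2"><a href="/m/2">Movie Two</a><p class="pl">intro two</p></div>
<div class="other"><span>footer</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("div")                            # tag name
by_regex = soup.find_all(re.compile("^d"))                # names starting with d
by_list = soup.find_all(["a", "span"])                    # any name in the list
by_attrs = soup.find_all("div", attrs={"class": "pl2"})   # attribute filter
by_string = soup.find_all("a", string=re.compile("Two"))  # match on text
limited = soup.find_all("div", limit=1)                   # stop after one hit
top_level_p = soup.find_all("p", recursive=False)         # direct children only


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


custom = soup.find_all(has_class_but_no_id)

for label, found in [("name", by_name), ("regex", by_regex), ("list", by_list),
                     ("attrs", by_attrs), ("string", by_string),
                     ("limit", limited), ("recursive=False", top_level_p),
                     ("custom", custom)]:
    print(label, len(found))
```

The recursive=False search finds nothing here because the p tags are nested inside div tags rather than sitting directly under the document root.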
With the filtering methods introduced, return to the problem: getting the name and introduction of the first movie. With flexible use of the filtering methods, the required tag objects are easy to find.
from bs4 import BeautifulSoup
import requests

# Server address
url = "https://movie.douban.com/chart"
# Disguise the request as coming from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
# Send the request
resp = requests.get(url, headers=headers)
html_code = resp.text
# Build the BeautifulSoup object with the lxml parser
bs = BeautifulSoup(html_code, "lxml")
# Use a filter method to find div objects whose class attribute is pl2.
# There are several in the tree; find() returns the first one.
div_tag = bs.find("div", class_="pl2")
# Find the first a tag under the div tag object
div_a = div_tag.find("a")
# Get all the child nodes of the a tag
name = div_a.contents
# Get the text
print(name[0].replace("/", "").strip())
# Output: Metamorphosis of Youth
Code analysis:
- Use bs.find("div", class_="pl2") to find the div tag that contains the first movie.
- The movie name is contained in the a tag, a child of the div tag. Continue with div_tag.find("a") to find that a tag.
<a href="https://movie.douban.com/subject/35284253/" class="">Metamorphosis of Youth /<span style="font-size:13px;">Bear Embrace youth (Hong Kong)/Nurture Youth (Taiwan)</span>
</a>
- The content of the a tag is the movie name. BS4 provides the string attribute on tag objects, which returns a NavigableString object. However, if a tag contains both text and child tags, string cannot be used: for the a tag above, string returns None.
- Text is also a node in the BS4 tree and can be obtained as a child node. Tag objects have the contents and children properties for getting child nodes: the former returns a list, the latter an iterator. The descendants property iterates over all descendants (children, grandchildren, and so on).
- Use the contents property and take the first child node, the text node, from the returned list. A text node is itself a NavigableString, so string methods such as replace() and strip() can be called on it directly.
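The behavior described in these points can be checked on a reduced version of the a tag. The titles in the markup below are placeholders.

```python
from bs4 import BeautifulSoup

html = '<a href="/m/1">Movie One /<span>Other Title</span></a>'
a_tag = BeautifulSoup(html, "html.parser").a

print(a_tag.string)               # None: the tag holds text plus a child tag
print(a_tag.span.string)          # a single text child, so the text is returned
print(a_tag.contents)             # list of direct children
print(list(a_tag.children))       # the same nodes, via an iterator
print(list(a_tag.descendants))    # children plus the text inside <span>

# The first child is the text node; clean it up with ordinary str methods.
title = a_tag.contents[0].replace("/", "").strip()
print(title)
```

Indexing contents[0] assumes the text comes before any child tag, which holds for this layout but may not hold on other pages.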
Getting the movie synopsis is relatively easy: the content is contained in the p child tag of the div tag.
# Get a synopsis of the movie
div_p = div_tag.find("p")
movie_desc = div_p.string.strip()
print(movie_desc)
Next, save the movie name and movie introduction in a CSV file. Complete code:
from bs4 import BeautifulSoup
import requests
import csv

# Server address
url = "https://movie.douban.com/chart"
# Disguise the request as coming from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
# Send the request
resp = requests.get(url, headers=headers)
html_code = resp.text
bs = BeautifulSoup(html_code, "lxml")
div_tag = bs.find("div", class_="pl2")
div_a = div_tag.find("a")
div_a_name = div_a.contents
# The movie name
movie_name = div_a_name[0].replace("/", "").strip()
# Get the movie synopsis
div_p = div_tag.find("p")
movie_desc = div_p.string.strip()
# newline='' stops the csv module from writing blank lines between rows
with open("d:/movie/movies.csv", "w", newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Movie title", "Movie Introduction"])
    csv_writer.writerow([movie_name, movie_desc])
Time to summarize the basic flow of using BS4:
- Create the BeautifulSoup object, specifying the parser.
- Get a tag object by tag name. If the desired tag object cannot be reached directly, use the filter methods to narrow the search layer by layer.
- Once the target tag object is found, use its string attribute to get the text inside it, or attrs to get its attribute values.
- Use the data obtained.
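The four steps can be sketched end to end without touching the network. The markup below is a hypothetical imitation of one chart entry, and the CSV is written to an in-memory buffer instead of a file so the example leaves nothing behind.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup imitating one entry of the chart page.
html = """
<div class="pl2">
  <a href="/subject/1/">Movie One /<span>Other Title</span></a>
  <p class="pl">2022-03-11 / Some Director</p>
</div>
"""

# 1. Build the BeautifulSoup object with a chosen parser.
soup = BeautifulSoup(html, "html.parser")
# 2. Narrow down to the target tags with the filter methods.
div_tag = soup.find("div", class_="pl2")
a_tag = div_tag.find("a")
p_tag = div_tag.find("p")
# 3. Read text and attribute values from the target tags.
movie_name = a_tag.contents[0].replace("/", "").strip()
movie_link = a_tag["href"]
movie_desc = p_tag.string.strip()
# 4. Use the data: here, write it as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Movie title", "Link", "Movie Introduction"])
writer.writerow([movie_name, movie_link, movie_desc])
print(buffer.getvalue())
```

Swapping io.StringIO() for an open file handle gives the same result on disk.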
3.3 Iterating over all targets
The code above found only the first movie's information. To get the information for every movie, just add a loop to the code above.
from bs4 import BeautifulSoup
import requests
import csv

all_movies = []
# Server address
url = "https://movie.douban.com/chart"
# Disguise the request as coming from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
# Send the request
resp = requests.get(url, headers=headers)
html_code = resp.text
bs = BeautifulSoup(html_code, "lxml")
# find_all() returns every div whose class attribute is pl2
div_tags = bs.find_all("div", class_="pl2")
for div in div_tags:
    div_a = div.find("a")
    div_a_name = div_a.contents
    # The movie name
    movie_name = div_a_name[0].replace("/", "").strip()
    # Get the movie synopsis
    div_p = div.find("p")
    movie_desc = div_p.string.strip()
    all_movies.append([movie_name, movie_desc])
with open("d:/movie/movies.csv", "w", newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Movie title", "Movie Introduction"])
    for movie in all_movies:
        csv_writer.writerow(movie)
This article focuses on using BS4, so it crawls only the first page of the movie chart. Once the data is in hand, how it is used depends on the application scenario.
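Real pages are rarely uniform, and find() returns None when an entry lacks the expected tag. As a sketch under that assumption, the hypothetical extract_movies function below skips entries without a title link and tolerates a missing introduction. The markup is invented to show both cases.

```python
from bs4 import BeautifulSoup

# Hypothetical entries: one complete, one without <p>, one without <a>.
html = """
<div class="pl2"><a href="/1">Movie One /<span>alt</span></a><p class="pl">intro one</p></div>
<div class="pl2"><a href="/2">Movie Two</a></div>
<div class="pl2"><p class="pl">orphan intro</p></div>
"""


def extract_movies(markup):
    """Return [name, description] rows, skipping entries without a title link."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for div in soup.find_all("div", class_="pl2"):
        a_tag = div.find("a")
        if a_tag is None:
            continue  # no title link: nothing useful to record
        # First non-empty piece of text inside the link, or "" if there is none.
        first_text = next(a_tag.stripped_strings, "")
        name = first_text.replace("/", "").strip()
        p_tag = div.find("p")
        desc = p_tag.get_text(strip=True) if p_tag is not None else ""
        rows.append([name, desc])
    return rows


movies = extract_movies(html)
for row in movies:
    print(row)
```

The rows have the same shape as all_movies in the script above, so they can be written with the same csv.writer loop.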
4. Summary
BS4 also provides many methods for finding parent, child, and sibling nodes starting from the current node, but the principle is the same: once you find the tag (node) object that holds the content, the rest is straightforward.
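As a brief illustration of that node-to-node navigation, the sketch below starts from an a tag and reaches its parent, a sibling, and a nested descendant. The markup is a hypothetical single entry written on one line so that no whitespace text nodes sit between the tags.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="pl2">'
    '<a href="/1">Movie One</a>'
    '<p class="pl">intro one</p>'
    '<div class="star"><span class="rating_nums">8.2</span></div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("a")

parent = a_tag.parent                        # the enclosing div.pl2
next_p = a_tag.find_next_sibling("p")        # the sibling after the a tag
prev_a = next_p.find_previous_sibling("a")   # and back to the a tag
child_names = [child.name for child in parent.children]
rating = parent.find("span", class_="rating_nums").string

print(parent["class"], child_names)
print(next_p.string, rating)
```

On pretty-printed HTML the children also include whitespace text nodes, which is why find_next_sibling("p") with a tag name is usually more dependable than the bare next_sibling property.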