1. Preface
In the previous article we printed a web page's source with r.text, and the result was good: the request succeeded.
The output was, of course, UTF-8 encoded.
Although the text is readable in UTF-8, I wondered whether there is a better way to improve the reading experience.
The answer is yes. Let's take a look.
2. Installation of BeautifulSoup library
2.1. Introduction to the official website
Beautiful Soup: "We called him Tortoise because he taught us." (crummy.com)
It's all in English, and I can't read it.
The name means "delicious soup". It is a third-party library that parses HTML and extracts the relevant information. The idea is that it turns the text you give it into a pot of soup, ready for cooking. No wonder it is called delicious soup.
2.2. Installation
This is easy. Just run the command:
pip install beautifulsoup4
That installs the library.
2.3. Testing
2.3.1. Analyzing the web page
Was the installation successful? We'll find out by testing it.
The web page we test against is https://www.python123.io/ws/demo.html
Let's look at its source code.
This is front-end HTML: information wrapped in pairs of <> tags.
2.3.2. Practice
Good! When everything is ready, open IDLE.
- Get the source code
There are two ways to get the source code:
- Manually: press Ctrl+U on the page in the Edge browser, and you will see the source shown above.
- The .get() method
We should put it to use. Let's try it:
>>> import requests
>>> r = requests.get("https://www.python123.io/ws/demo.html")
>>> r.status_code
200
>>> r.text
Isn't that easy?
- Using the BeautifulSoup library
Define a demo variable:
>>> demo = r.text
Import the library:
>>> from bs4 import BeautifulSoup
The package is called bs4 for short, so this line imports a class called BeautifulSoup from bs4.
- Make the soup
We need to turn demo into a soup that BeautifulSoup can understand:
>>> soup = BeautifulSoup(demo, "html.parser")
- Print it
>>> print(soup.prettify())
Isn't that much better than the last one?
At this point we also know that the beautifulsoup4 library was installed successfully.
2.3.3. Summary
Using the BeautifulSoup library takes two lines of code:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>data</p>", "html.parser")
The first line imports the BeautifulSoup class from bs4.
When we make the soup, we pass two arguments:
- "<p>data</p>": the markup to be parsed
- "html.parser": the parser used to parse the HTML
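As a minimal sketch of the two arguments described above, here is the same soup made in a script instead of in IDLE. The variable name markup is only illustrative.

```python
from bs4 import BeautifulSoup

# First argument: the markup to parse (here a tiny inline string).
markup = "<p>data</p>"

# Second argument: the name of the parser; "html.parser" ships with Python.
soup = BeautifulSoup(markup, "html.parser")

print(soup.p)         # the parsed <p> tag
print(soup.p.string)  # the text inside it
```

Any HTML string can take the place of markup, including the r.text fetched earlier.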
3. Basic elements of BeautifulSoup library
That was a small example of how BeautifulSoup works: "making soup".
Let’s take a closer look at this library
3.1. Understanding the BeautifulSoup library
- To understand it, let's first understand what an HTML file is.
When you open an HTML file, you will find that it is organized in pairs of <> tags. The nesting (parent and child) relationships between the tags form a tag tree.
BeautifulSoup is a library that parses, traverses, and maintains this "tag tree".
BeautifulSoup can parse any file that is structured as a tag tree.
- The tag format
Now that you understand what an HTML file is, how much do you know about the format of each tag?
<p class="title">...</p>
Take the p tag as an example:
- The first thing you can see is a pair of <> tags.
- <p>...</p>: the tag. p is the name of the tag; it appears at the beginning and at the end to mark the scope of the tag.
- class="title": the attribute field, containing zero or more attributes. Attributes define the characteristics of the tag. Here class is the attribute name and title is the attribute value.
Every attribute has a name and a value, so you can think of attributes as key-value pairs.
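To make the key-value idea concrete, here is a small sketch that parses the example tag and reads its attribute back. The text inside the tag ("hello") is made up for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title">hello</p>', "html.parser")
tag = soup.p

print(tag.name)          # the tag name
print(tag.attrs)         # all attributes as a dictionary
print(tag.get("class"))  # class may hold several values, so bs4 returns a list
print(tag.get("id"))     # a missing attribute gives None instead of an error
```

Using tag.get() rather than tag["id"] is a choice made here so that a missing key does not raise KeyError.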
3.2. Importing the BeautifulSoup library
- The BeautifulSoup library is also known as beautifulsoup4 or bs4.
To use it, import it.
- The common way
from bs4 import BeautifulSoup
We used this one before.
It imports a type called BeautifulSoup from the bs4 library.
- The traditional way
import bs4
This form is used a lot, too.
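A short sketch of the two import styles side by side, assuming nothing beyond the bs4 package itself. Both reach the same class; importing the package as a whole also keeps its other types within reach by qualified name.

```python
import bs4
from bs4 import BeautifulSoup

# Both import styles refer to the same class.
print(bs4.BeautifulSoup is BeautifulSoup)

soup = bs4.BeautifulSoup("<p>data</p>", "html.parser")

# With the package imported, its other types are available by qualified name.
print(isinstance(soup.p, bs4.element.Tag))
print(isinstance(soup.p.string, bs4.element.NavigableString))
```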
3.3. The BeautifulSoup class
First, an HTML document corresponds to a tag tree, which BeautifulSoup processes and converts into a BeautifulSoup object. So a BeautifulSoup object can be understood as representing a tag tree.
To sum up: the HTML document, the tag tree, and the BeautifulSoup object are equivalent.
On this basis, the BeautifulSoup class turns the tag tree into a variable, and processing that variable is processing the tag tree.
A BeautifulSoup object corresponds to the entire content of an HTML/XML document.
3.4. BeautifulSoup library parsers
The purpose of a parser, as mentioned earlier, is to parse HTML/XML documents. The parsers that bs4 supports are:
Parser | Usage | Requirement |
---|---|---|
bs4 HTML parser | BeautifulSoup(mk, "html.parser") | Install the bs4 library |
lxml HTML parser | BeautifulSoup(mk, "lxml") | pip install lxml |
lxml XML parser | BeautifulSoup(mk, "xml") | pip install lxml |
html5lib parser | BeautifulSoup(mk, "html5lib") | pip install html5lib |
Any of these parsers can handle HTML/XML; the differences between them only start to matter at a more advanced level.
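Since lxml and html5lib are optional installs, one way to cope is to try them in order and fall back to the built-in parser. The sketch below is only an illustration: make_soup is a hypothetical helper of my own, not part of bs4, and the order of preference is an assumption.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            continue
    raise RuntimeError("none of the requested parsers is available")

soup, used = make_soup("<p>data</p>")
print(used)           # depends on what is installed on your machine
print(soup.p.string)
```

Keep in mind that different parsers may build slightly different trees from broken HTML, so a script that must behave the same everywhere is better off naming one parser explicitly.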
3.5. Basic elements of the BeautifulSoup class
Basic element | Description |
---|---|
Tag | A tag, the most basic unit of information organization; <> and </> mark its beginning and end. |
Name | The name of the tag; the name of <p>...</p> is "p". Format: <tag>.name |
Attributes | The attributes of the tag, organized as a dictionary. Format: <tag>.attrs |
NavigableString | The non-attribute string inside the tag, i.e. the text between <> and </>. Format: <tag>.string |
Comment | The comment part of the string inside the tag, a special Comment type |
Is that dazzling? Okay, let's try each one.
Working with the soup made from the source code earlier, let's type:
- The Tag
>>> soup.title
<title>This is a python demo page</title>
Viewing a tag is as simple as that.
- Getting the content of the link tag
Define a variable tag that holds the <a>...</a> tag of the HTML, that is, the link tag.
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
You can see that the content of the <a>...</a> tag is output. Note that soup.a returns only the first <a>...</a> tag in the document.
How did I know there was an <a>...</a> tag?
- Tag name
>>> soup.a.name
'a'
We know the tag is a, so soup.a.name is 'a'. But what if we don't know a tag's name?
Here we introduce a few terms: child tags, parent tags, and grandparent tags.
Let's look at the parent of the <a>...</a> tag, the tag one level up:
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
<p>...</p> is the parent tag of <a>...</a>, and <body>...</body> is the parent tag of <p>...</p>. The names are printed as strings.
- Attribute information of the tag
Take <a>...</a> as an example:
>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
The attributes come back as a dictionary.
Of course, we can also extract individual values from it:
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> tag.attrs['id']
'link1'
- The NavigableString element of the tag
This is a string type that represents the string between the angle-bracket tags.
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
- The Comment type
As the table says, this type represents a comment. What happens when a comment appears in an HTML page?
Let's make a new soup:
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> newsoup.b.string
'This is a comment'
Note: <!--...--> is a comment, so the string inside <b>...</b> is a Comment, while the string inside <p>...</p> is not. Printing .string alone does not show the difference, because the comment markers are stripped.
Good! That is the end of this part.
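Because .string looks the same for a comment and for ordinary text, a type check is one way to tell them apart. This is a minimal sketch; describe is a hypothetical helper written for the illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

def describe(tag):
    """Label the string inside a tag as a comment or as ordinary text."""
    kind = "comment" if isinstance(tag.string, Comment) else "text"
    return f"{tag.name}: {kind}: {tag.string}"

print(describe(newsoup.b))
print(describe(newsoup.p))
```

The check is done with isinstance against Comment because Comment is itself a kind of NavigableString, so checking for NavigableString would match both.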
4. HTML content traversal with the bs4 library
We learned how to organize HTML text into a tree structure. You can see that there are many tags, and the relationships between them are marked. But it is still a bit inconvenient to read, so let's put it in a different form.
Note: in the diagram, the <h>… shown under <p>…</p> should be …; the occurrences further down are not marked.
This is the tree after another tidy-up. Isn't it clearer?
We call the tags nodes. So we can traverse the tree in three ways:
- Downward traversal: from the root node toward the leaf nodes
- Upward traversal: the opposite of downward traversal, from a leaf node toward the root node
- Parallel traversal: horizontal traversal between nodes on the same level
So it's easy to understand.
Each direction has its own traversal attributes, so let's look at them separately.
4.1. Downward traversal of the tag tree
Attribute | Description |
---|---|
.contents | A list of the child nodes; all child nodes of <tag> are stored in the list. |
.children | An iterator over the child nodes; similar to .contents, used to loop over the child nodes |
.descendants | An iterator over the descendant nodes; contains all descendants, used for loop traversal |
Let's take an example of each of the three:
- .contents
We read the child nodes of the <head>...</head> tag:
>>> soup.head.contents
[<title>This is a python demo page</title>]
It is the <title>...</title> tag.
Let me make this clear, because I was slow to get it at first: .contents returns the direct children only, not all descendants.
Since a list is returned, we can index into it.
Let's take a look at the .contents of <body>...</body>:
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
You can see there is a lot of information in there, corresponding to the different elements of the body.
Note that the child nodes of a tag include not only tag nodes but also string nodes, so '\n' shows up too.
We can check the number of child nodes:
>>> len(soup.body.contents)
5
Five.
We can also index into the list:
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
It is a <p>...</p> tag.
- .children
Of course, we could also use a loop:
>>> for child in soup.body.children:
...     print(child)
...
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
This traverses the child nodes of the <body>...</body> tag.
I always wondered why an a tag shows up here. Now I understand: the p tag contains the a tags.
- .descendants
>>> for child2 in soup.body.descendants:
...     print(child2)
...
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
At first it was hard to understand why the same tag shows up here more than once. Comparing slowly against the tidied structure, I saw that some tags are nested inside others, so each one is printed once as part of its parent and once on its own. Very simple.
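The three downward attributes can be compared side by side in a short script. This is only a sketch: the HTML string is a cut-down stand-in that I made up to resemble the demo page, written without line breaks so that no '\n' string nodes appear.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A cut-down stand-in for the demo page (assumed, not the real source).
html = (
    '<html><body><p class="title"><b>Title</b></p>'
    '<p class="course">Text <a id="link1">Basic</a> and '
    '<a id="link2">Advanced</a>.</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# .contents is a list of direct children only.
print(len(soup.body.contents))

# .children iterates the same direct children; keep only the tags.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
print(child_tags)

# .descendants walks every nested node, tags and strings alike.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]
print(descendant_tags)
```

Filtering with isinstance(node, Tag) is a handy habit, since on real pages the string nodes between tags are easy to forget.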
4.2. Upward traversal of the tag tree
There are only two attributes:
Attribute | Description |
---|---|
.parent | The parent tag of the node |
.parents | An iterator over the ancestor tags of the node, used to loop over the ancestors |
Do you see what parent means? Let's look at each.
- .parent
Let's look at the parent of the <title>...</title> tag:
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
Let's take a look at the parent of the <html>...</html> tag:
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head><body><p class="title"><b>The demo python introduces several python courses.</b></p><p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p></body></html>
The <html>...</html> tag is the highest-level tag in the HTML text, so its parent is the soup object itself, which prints as the whole document. The soup is a special tag, so we can also print its parent:
>>> soup.parent
>>>
You'll find it is empty (None), so the soup has no parent.
- .parents
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]
What does this code do?
It prints the names of all ancestors of the <a>...</a> tag. But why the if-else statement? Traversing the ancestors of a tag eventually reaches the soup itself, which has no parent of its own, so the code checks for None as a guard before printing a name. In the output above, the soup shows up as [document].
So these two attributes are easy to understand.
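The same upward walk can be checked in a script. This is a sketch on a made-up one-line page, not on the real demo source.

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Text <a id='link1'>Basic</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect the names of every ancestor of the first <a> tag.
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)

# The soup object sits above <html> and has no parent of its own.
print(soup.html.parent is soup)
print(soup.parent is None)
```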
4.3. Parallel traversal of the tag tree
There are four attributes:
Attribute | Description |
---|---|
.next_sibling | Returns the next sibling node in HTML text order |
.previous_sibling | Returns the previous sibling node in HTML text order |
.next_siblings | An iterator that returns all following sibling nodes in HTML text order |
.previous_siblings | An iterator that returns all preceding sibling nodes in HTML text order |
They come in pairs: next and previous, singular and plural.
Parallel traversal has a condition: it only happens between nodes that share the same parent.
- .next_sibling
>>> soup.a.next_sibling
' and '
The next sibling of the a tag is a string.
In the tag tree, although the structure is organized by tags, the NavigableString objects between tags are also nodes of the tree. So a node's siblings and children may be NavigableString objects; don't assume that the next node returned by parallel traversal is a tag.
Seen this way, it is quite intuitive.
Next, let's look at the sibling after that one:
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
- .previous_sibling
Let's look at the sibling before the a tag:
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
It is a piece of text.
Similarly, let's look at the sibling before that one:
>>> soup.a.previous_sibling.previous_sibling
>>>
It returns nothing (None).
It is not hard to see why: nothing comes before that text inside the same <p> parent, and the earlier <p> is not a sibling because it has a different parent.
- .next_siblings
>>> for sibling in soup.a.next_siblings:
...     print(sibling)
...
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
Looks familiar.
- .previous_siblings
>>> for sibling in soup.a.previous_siblings:
...     print(sibling)
...
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
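Since siblings can be strings as well as tags, it often helps to filter them. The sketch below uses a made-up one-line paragraph shaped like the course paragraph of the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Text <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate next sibling is a string, not a tag.
print(repr(first.next_sibling))

# Filter the iterator to keep tag siblings only.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in tag_siblings])

# find_next_sibling() skips over strings for us.
print(first.find_next_sibling("a")["id"])
```

find_next_sibling() is the shorter route when only tags are wanted; the isinstance filter is shown as well because it works with any of the four sibling attributes.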
5. HTML formatted output with the bs4 library
When I see the word "formatting", I think of string formatting, but that is not what is meant here. The question is: how can we make HTML text more "friendly"?
"Friendly" means not only easier for people to read, but also easier for programs to read and analyze.
5.1. The prettify() method of the bs4 library
Let's look at a piece of code:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.prettify()
We find a lot of newline characters (\n) in the result. Let's print it instead:
>>> print(soup.prettify())
And that makes it much clearer. Of course, this is a familiar picture.
So: prettify() adds line breaks to HTML text so that it becomes easier to read, and it can also be applied to an individual tag.
For example, we print the <a>...</a> tag in the soup:
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
One of the most important issues in the bs4 library is encoding: bs4 converts any HTML file or string it reads to Unicode, and writes output as UTF-8 by default. So encoding is no obstacle when it comes to parsing.
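Here is a small sketch of both points, formatting and encoding, on a made-up snippet with Chinese text. Passing from_encoding is optional; it is given here only so that the example does not depend on bs4's encoding detection.

```python
from bs4 import BeautifulSoup

# Bytes in UTF-8, as they might arrive from a web server (made-up content).
raw = "<p>中文</p>".encode("utf-8")

soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# The text comes back as an ordinary Python string.
print(soup.p.string)

# prettify() returns a string with line breaks and indentation added.
pretty = soup.prettify()
print(pretty)

# encode() turns the document back into bytes in the encoding you ask for.
print(soup.encode("utf-8"))
```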
6. Summary
Remember the requests library from the previous chapter: how to fetch source code, and how to add, delete, change, and query URLs. We focused on fetching source code, and you will find that the raw source is not easy to read. So I think this chapter is mostly about how to read the source better with parsers, along with some traversal operations.
So that's it for my notes.
Thank you. If there are mistakes in the article, corrections are welcome; it is my pleasure if this has been of any help to you.