Study notes on Python’s BeautifulSoup library

1. Preface

Earlier, we used r.text to print the source of a web page, and the output came back correctly: the request succeeded.

The text was, of course, decoded as UTF-8.

With UTF-8 decoding the text is at least readable, but the raw output is still hard on the eyes. Is there a better way to improve the reading experience?

The answer is yes, let’s take a look.

2. Installation of BeautifulSoup library

2.1. Introduction to the official website

BeautifulSoup: “We called him Tortoise because he taught us.” (crummy.com)

The official site is all in English, which I can’t read well.

The name is often rendered as “delicious soup”. It is a third-party library that parses HTML and extracts the relevant information. The idea is that it turns the text you give it into a pot of soup that you can then cook with, so it is no wonder it is called delicious soup.

2.2. Installation

This part is easy. Straight to the command:

pip install beautifulsoup4

That installs it.

2.3. Testing

2.3.1. Analyzing the web page

Was the installation successful? We’ll find out if we test it.

The web page we test against is the demo page requested in 2.3.2: https://www.python123.io/ws/demo.html

Let’s look at its source code.

This is front-end HTML.

It is information wrapped in pairs of <> tags.

2.3.2. Practice

Good! With everything ready, open IDLE.

Get the source code

There are two ways to get the source code:
- Manually: press Ctrl+U on the page in the Edge browser, and you’ll see the source shown above.
- With the requests.get() method.
  
  We should put it to use. Let’s try it:
```
>>> import requests
>>> r = requests.get("https://www.python123.io/ws/demo.html")
>>> r.status_code
200
>>> r.text
```
  Isn’t that easy?

Using the BeautifulSoup library

Define a demo variable:
```
>>> demo = r.text
```
Import the library:
```
>>> from bs4 import BeautifulSoup
```
Because the library’s name is long, its package is abbreviated to bs4; here we import a class called BeautifulSoup from bs4.

Making the soup

We need to turn demo into a soup that BeautifulSoup can work with:
```
>>> soup = BeautifulSoup(demo, "html.parser")
```
Print it out:
```
>>> print(soup.prettify())
```
Isn’t that much easier to read than the raw r.text output?

This also confirms that the BeautifulSoup4 library was installed successfully.

2.3.3. Summary

How to use the BeautifulSoup library in two lines of code:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>data</p>", "html.parser")

The first line imports the BeautifulSoup class from bs4.

The second line makes the soup and stores it in one variable. The call takes two arguments:

`"<p>data</p>"`: the information to be parsed
`"html.parser"`: the parser used to parse the HTML
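
As a quick check of the two-line pattern above, here is a minimal sketch that parses the same `<p>data</p>` string and reads the result back. The variable name `first_p` is only for illustration.

```python
from bs4 import BeautifulSoup

# First argument: the markup to parse. Second argument: the parser to use.
soup = BeautifulSoup("<p>data</p>", "html.parser")

first_p = soup.p        # the first <p> tag in the parsed document
print(first_p.name)     # the name of the tag
print(first_p.string)   # the text between <p> and </p>
```
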

3. Basic elements of BeautifulSoup library

That was actually a small example of how BeautifulSoup works: “making soup”.

Let’s take a closer look at this library

3.1. Understanding of BeautifulSoup Library

To understand the library, let’s first understand what an HTML file is.

When you open an HTML file, you will find that it is organized as groups of paired <> tags. The parent-child relationships between the tags form a tag tree.

BeautifulSoup is a library that parses, traverses, and maintains this “tag tree”.

As long as a file has this tag-tree structure, the BeautifulSoup library can parse it.

The tag format

Now that you understand what an HTML file is, let’s look at the format of an individual tag:
```
<p class="title">...</p>
```
Take the p tag as an example:
- The first thing you can see is a pair of <> brackets.
- `<p>...</p>`: the tag. p is the name of the tag; it appears at both the beginning and the end to mark the scope of the tag.
- `class="title"`: the attribute field, which contains zero or more attributes. Attributes define the characteristics of the tag.
 - class: the attribute name
 - title: the attribute value

Every attribute has a name and a value, so attributes are made up of key-value pairs.
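
The key-value view of attributes can be seen directly in code. This is a small sketch rather than the page from the notes: the markup string is made up for illustration, and the `id="intro"` attribute is an assumption added so that there is more than one pair to print.

```python
from bs4 import BeautifulSoup

# Illustrative markup; id="intro" is an assumed extra attribute.
markup = '<p class="title" id="intro">The demo page</p>'
tag = BeautifulSoup(markup, "html.parser").p

print(tag.name)  # the tag name that opens and closes the tag

# Attributes behave like a dictionary of name/value pairs.
for key, value in tag.attrs.items():
    print(key, "=", value)
```

Note that `class` comes back as a list, because an HTML class attribute may hold several values.
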

3.2. Import of BeautifulSoup library

The BeautifulSoup library is also known as BeautifulSoup4 or bs4.

To use it, import it in one of two ways:
1. The common way
```
from bs4 import BeautifulSoup
```
  We used this one before.
  
  It imports a type called BeautifulSoup from the bs4 library.
2. The traditional way
```
import bs4
```
  This form is also used a lot.
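
One situation where the second import style is convenient is type checking, because the node types live under `bs4.element`. The sketch below assumes a small made-up snippet of markup and simply labels each child node as a tag or a string.

```python
import bs4

# Illustrative markup, not the demo page from the notes.
soup = bs4.BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

for node in soup.p.contents:
    # bs4.element.Tag is the type of tag nodes; other children are strings.
    kind = "tag" if isinstance(node, bs4.element.Tag) else "string"
    print(kind, repr(str(node)))

tag_names = [n.name for n in soup.p.contents if isinstance(n, bs4.element.Tag)]
```
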

3.3. BeautifulSoup class

An HTML document corresponds to a tag tree; after BeautifulSoup processes it, the tag tree is converted into a BeautifulSoup object. In other words, the BeautifulSoup type represents a tag tree.

To sum up: the HTML document, the tag tree, and the BeautifulSoup object are equivalent.

On this basis, the BeautifulSoup class turns the tag tree into a variable, and processing that variable means processing the tag tree.

A BeautifulSoup object corresponds to the entire content of an HTML/XML document.

3.4. BeautifulSoup library parsers

As mentioned earlier, the job of a parser is to parse HTML/XML documents. There are several to choose from:

| Parser | Usage | Requirement |
| --- | --- | --- |
| bs4's HTML parser | `BeautifulSoup(mk, "html.parser")` | Install the bs4 library |
| lxml's HTML parser | `BeautifulSoup(mk, "lxml")` | `pip install lxml` |
| lxml's XML parser | `BeautifulSoup(mk, "xml")` | `pip install lxml` |
| html5lib's parser | `BeautifulSoup(mk, "html5lib")` | `pip install html5lib` |

Any of these parsers can handle HTML/XML; the differences between them only start to matter in more advanced use.
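
Since the optional parsers need a separate install, one possible approach is to try a preferred parser and fall back to the built-in one. The helper below is a hypothetical sketch (the name `make_soup` and the default order are assumptions, not part of bs4); it relies on bs4 raising `FeatureNotFound` when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, parsers=("lxml", "html.parser")):
    """Build a soup with the first parser in `parsers` that is installed.

    Hypothetical helper for illustration only.
    """
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")

soup = make_soup("<p>data</p>")
print(soup.p.string)
```

Because "html.parser" ships with Python, the fallback always succeeds when it is listed last.
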

3.5. Basic elements of the BeautifulSoup class

| Basic element | Description |
| --- | --- |
| Tag | The tag, the most basic unit of information organization; `<>` and `</>` mark its beginning and end. |
| Name | The name of the tag; for `<p>...</p>` the name is "p". Format: `<tag>.name` |
| Attributes | The attributes of the tag, organized as a dictionary. Format: `<tag>.attrs` |
| NavigableString | The non-attribute string in the tag, that is, the string between `<>...</>`. Format: `<tag>.string` |
| Comment | The comment part of a string inside the tag, a special Comment type |

Is that a lot to take in? Okay, let’s try it out.

With the source code parsed into soup, let’s type:

The Tag element

>>> soup.title
<title>This is a python demo page</title>

Viewing a tag is as simple as that.

Getting the link tag

Define a variable tag that holds the `<a>...</a>` tag from the HTML, that is, the link tag.
```
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
```
You can see that the content of the `<a>...</a>` tag is output.

How do we find out more about this `<a>...</a>` tag?

Tag name
```
>>> soup.a.name
'a'
```
We know the tag is a, so soup.a.name returns 'a'. But what if we don’t know the tag name?

Here we introduce two terms: child tags and parent tags.

Look at the parent tag of `<a>...</a>`, the tag one level up:
```
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
```
`<p>...</p>` is the parent tag of `<a>...</a>`, and `<body>...</body>` is the parent tag of `<p>...</p>`. Each name is returned as a string.

Tag attributes

Take `<a>...</a>` as an example:

>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

The attributes come back as a dictionary.

Of course, we can also extract the individual values:

>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> tag.attrs['id']
'link1'
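
Indexing `tag.attrs` with a missing key raises a `KeyError`, as with any dictionary. A hedged sketch of a more forgiving lookup follows; it uses the link tag shown above as a standalone string, and the `title` and `target` attributes are deliberately ones the tag does not have.

```python
from bs4 import BeautifulSoup

html = ('<a class="py1" href="http://www.icourse163.org/course/BIT-268001" '
        'id="link1">Basic Python</a>')
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]                   # shorthand for tag.attrs["href"]
title = tag.get("title")             # attribute is absent, so this is None
target = tag.get("target", "_self")  # absent again, so the default is returned

print(href, title, target)
```
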

The NavigableString element of a tag

This is a string type that represents the string between a tag’s angle-bracket pairs.

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
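
One detail worth knowing: `.string` only returns a value when a tag has a single child; for a tag that mixes text and nested tags it returns `None`, and `get_text()` can be used to collect all the text instead. The sketch below uses a shortened, made-up version of the course paragraph to show the difference.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the course paragraph of the demo page.
html = ('<p class="course">Learn <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
p = BeautifulSoup(html, "html.parser").p

print(p.string)      # several children, so there is no single string
print(p.a.string)    # the first <a> has exactly one string child
print(p.get_text())  # all text inside <p>, nested tags included
```
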

The Comment type

As the table says, this represents a comment. What happens when a comment appears in an HTML page?

Let’s make a new soup:
```
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> newsoup.b.string
'This is a comment'
```
Note: `<!--...-->` is a comment, and the text outside it is not a comment. Good! That ends this part.
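
Because a comment and ordinary text both print as plain strings, telling them apart takes a type check. The following is a minimal sketch; the `<b>` and `<p>` wrappers in the markup are an assumption made so that both kinds of string can be reached by tag name, and `is_comment` is a hypothetical helper.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumed markup: a comment inside <b>, ordinary text inside <p>.
newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

def is_comment(node):
    """Return True when the node is an HTML comment (hypothetical helper)."""
    return isinstance(node, Comment)

b_text = newsoup.b.string
p_text = newsoup.p.string
print(type(b_text).__name__, type(p_text).__name__)
```
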

4. HTML content traversal with the bs4 library

We learned how HTML text is organized into a tree structure, with many tags and the relationships between them. That is still a little inconvenient to look at, so let’s present it in a different form.

This is the structure after tidying it up once more.

We call the tags nodes, so we can traverse the tree in three ways:

Downward traversal: from the root node toward the leaf nodes
Upward traversal: the opposite of downward traversal, from a leaf node toward the root node
Parallel traversal: sideways, between nodes at the same level

That makes it easy to understand.

Each direction has its own traversal attributes, so let’s look at them separately.

4.1. Downward traversal of the tag tree

| Attribute | Description |
| --- | --- |
| .contents | A list of child nodes; all child nodes of `<tag>` are stored in the list |
| .children | An iterator over the child nodes; similar to `.contents`, used to loop through the child nodes |
| .descendants | An iterator over the descendant nodes; contains all descendant nodes, used for loop traversal |

Let’s take an example of each of the three:

.contents

Let’s read the child nodes of the `<head>...</head>` tag:

>>> soup.head.contents
[<title>This is a python demo page</title>]

It is the `<title>...</title>` tag.

To be clear, .contents returns only the direct children of a tag, not all of its descendants.

Since a list is returned, we can index into it.

Let’s take a look at the .contents of `<body>...</body>`:

>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']

You can see there is a lot of information in there, corresponding to the different elements of the body.

Note that a tag’s child nodes include not only tag nodes but also string nodes, such as '\n'.

We can check the number of child nodes:

>>> len(soup.body.contents)
5

Five.

We can also index into the list:

>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

That is the first `<p>...</p>` tag.

.children

Of course, we can also loop over the children:

>>> for child in soup.body.children:
    print(child)

<p class="title"><b>The demo python introduces several python courses.</b></p>

<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

This traverses the child nodes of the `<body>...</body>` tag.

I always wondered why an a tag shows up here; now I understand: the p tag contains the a tags.

.descendants

>>> for child2 in soup.body.descendants:
    print(child2)

<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.

At first it was a little hard to understand why these tags appear here. Comparing the output against the tidied structure makes it clear: some tags are nested inside others, and .descendants visits every one of them. Very simple.
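
To make the difference between children and descendants concrete without a network request, the sketch below parses a local, shortened approximation of the demo page (the `html` string is an assumption, not the real page) and keeps only the tag nodes from each traversal.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened local approximation of the demo page (assumed content).
html = (
    "<html><head><title>This is a python demo page</title></head>\n"
    "<body>\n"
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Python courses: '
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p>\n'
    "</body></html>"
)
body = BeautifulSoup(html, "html.parser").body

# Direct children only: the two <p> tags plus the newline strings around them.
child_tags = [c.name for c in body.children if isinstance(c, Tag)]

# Every level below <body>: nested <b> and <a> tags are included too.
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]

print(len(body.contents), child_tags, descendant_tags)
```

Filtering with `isinstance(node, Tag)` is one way to skip the '\n' string nodes when only tags are of interest.
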

4.2. Upward traversal of the tag tree

There are only two attributes:

| Attribute | Description |
| --- | --- |
| .parent | The parent tag of the node |
| .parents | An iterator over the node’s ancestor tags, used to loop through the ancestor nodes |

The meaning of parent is clear enough. Let’s look at each.

.parent

Let’s look at the parent of the `<title>...</title>` tag:

>>> soup.title.parent
<head><title>This is a python demo page</title></head>

Let’s take a look at the parent of the `<html>...</html>` tag:

>>> soup.html.parent
<html><head><title>This is a python demo page</title></head><body><p class="title"><b>The demo python introduces several python courses.</b></p><p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p></body></html>

Because `<html>...</html>` is the highest-level tag in the HTML text, its parent is the soup itself, the object that stands for the whole document, so the output shows the entire page.

The soup is a special kind of tag, so we can also look at its parent:

>>> soup.parent
>>> 

You’ll find it is empty: the soup has no parent, so None is returned.

.parents
```
>>> for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

p
body
html
[document]
```
What does this code do?

It prints the names of all the ancestors of the `<a>...</a>` tag. But why the if-else statement? When traversing all the ancestors of a tag, the loop also reaches the soup itself, which is not an ordinary tag (its name prints as [document]), so the code guards against an ancestor whose name cannot be printed.

So, these two attributes are easy to understand.
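
The ancestor chain can also be collected into a list, which makes the result easy to inspect. This is a small sketch on a made-up one-link document rather than the demo page.

```python
from bs4 import BeautifulSoup

# Minimal illustrative document with a single link.
html = '<html><body><p><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Walk upward from <a>: each ancestor in turn, ending at the soup itself.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The soup is the top of the tree, so it has no parent.
print(soup.parent)
```
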

4.3. Parallel traversal of the tag tree

There are four attributes:

| Attribute | Description |
| --- | --- |
| .next_sibling | Returns the next sibling node in HTML text order |
| .previous_sibling | Returns the previous sibling node in HTML text order |
| .next_siblings | An iterator that returns all following sibling nodes in HTML text order |
| .previous_siblings | An iterator that returns all preceding sibling nodes in HTML text order |

They come in pairs: next and previous, singular and plural.

Parallel traversal has a condition: it only happens between nodes that share the same parent node.

.next_sibling
```
>>> soup.a.next_sibling
' and '
```
The next sibling of the a tag is a string.

In the tag tree, although the structure is organized by tags, the NavigableString values between tags are also nodes of the tree. So a node’s siblings and children may be of NavigableString type; don’t assume that the next node returned by parallel traversal is a tag.

That makes it quite concrete.

Next, let’s look at the sibling after that:
```
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
```
.previous_sibling

Let’s look at the sibling node that precedes the a tag:
```
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
```
It is a piece of text.

Similarly, let’s go one sibling further back from that text node:
```
>>> soup.a.previous_sibling.previous_sibling
>>> 
```
It returns nothing (None).

It’s not hard to see why: within the same parent, nothing comes before that text.

.next_siblings

>>> for sibling in soup.a.next_siblings:
    print(sibling)

 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.

Looks familiar.

.previous_siblings

>>> for sibling in soup.a.previous_siblings:
    print(sibling)

Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
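
Since siblings can be strings as well as tags, it often helps to filter them. The sketch below works on a shortened, made-up version of the course paragraph; it filters by type and also uses `find_next_sibling`, which skips string nodes and returns the next matching tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the course paragraph of the demo page.
html = ('<p>Courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
first = BeautifulSoup(html, "html.parser").a

following = list(first.next_siblings)                   # strings and tags alike
tag_siblings = [s for s in following if isinstance(s, Tag)]
next_link = first.find_next_sibling("a")                # next <a> sibling only

print(following)
print(next_link["id"])
```
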

5. Formatted HTML output with the bs4 library

When I see “formatting”, I think of string formatting. How can we make HTML text more “friendly”?

“Friendly” here does not only mean easier for people to read; it also means easier for programs to read and analyze.

5.1. The bs4 library’s `prettify()` method

Let’s look at a piece of code:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.prettify()

You’ll find many \n newline characters in the returned string. Now let’s print it:

>>> print(soup.prettify())

That makes it much clearer. Of course, this is a familiar picture.

prettify(): this method adds line breaks to the HTML text so that it becomes easier to read. It can also be applied to an individual tag.

For example, let’s print the `<a>...</a>` tag in soup:

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
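
The effect of `prettify()` is easiest to see next to the compact form of the same soup. A minimal sketch, using a one-line made-up snippet instead of the demo page:

```python
from bs4 import BeautifulSoup

# Illustrative one-line markup.
soup = BeautifulSoup('<p><a id="link1">Basic Python</a></p>', "html.parser")

compact = str(soup)       # the markup as a single line
pretty = soup.prettify()  # the same markup, one node per line, indented

print(compact)
print(pretty)
```

Both values are ordinary Python strings, so the prettified form can be written to a file or logged like any other text.
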

One of the most important issues in the bs4 library is encoding: any HTML file or string it reads is converted to Unicode, and its output is UTF-8 by default. So encoding is not an obstacle when parsing.
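
The encoding behaviour can be tried out with bytes in a non-UTF-8 encoding. The sketch below is an illustration under assumptions: the GBK-encoded input is made up, and the encoding is passed explicitly through `from_encoding` rather than left to detection.

```python
from bs4 import BeautifulSoup

# Assumed example input: bytes encoded as GBK rather than UTF-8.
raw = "<p>中文 text</p>".encode("gbk")

# Tell bs4 which encoding the bytes use; the text is decoded on parsing.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # a normal Python (Unicode) string
utf8_bytes = soup.encode("utf-8")  # the document re-encoded as UTF-8 bytes

print(text)
print(utf8_bytes)
```
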

6. Summary

Remember the requests library from the previous chapter: how to get source code, and how to add, delete, change, and query URLs. We focused on getting source code, and found that the raw source is not easy to read. This chapter is therefore mostly about how to read that source more comfortably, using parsers, plus a few operations for traversing it.

So that’s it for my notes.

Thank you. If there are mistakes in the article, corrections are welcome; it is my pleasure if this has been of any help to you.
