
1. Preface

Earlier, we printed a web page with r.text, and the output looked good: the request succeeded.

The text was, of course, decoded as UTF-8.

UTF-8 makes the text legible, which helps the reading experience, but I wondered whether there is a better way to improve it further.

The answer is yes. Let’s take a look.

2. Installation of BeautifulSoup library

2.1. Introduction to the official website

Beautiful Soup: We called him Tortoise because he taught us. (crummy.com)

The site is entirely in English, which I find hard to read.

BeautifulSoup, nicknamed “Delicious Soup”, is a third-party library. It parses HTML and extracts the relevant information from it. The idea is that it turns the text you give it into a pot of soup ready for cooking, so no wonder it is called delicious soup.

2.2. Installation

This part is easy. Straight to the command:

pip install beautifulsoup4

That is all it takes to install it.

2.3. Testing

2.3.1. Analyzing the web page

Was the installation successful? We’ll find out by testing it.

The web page we test against is https://www.python123.io/ws/demo.html

Let’s look at its source code.

This is ordinary front-end HTML.

The information is wrapped in pairs of <> tags.

2.3.2. Practice

Good! With everything ready, open IDLE.

  1. Get the source code

    There are two ways to get the source code

    • Manually: press Ctrl+U on the page in the Edge browser to view the page source shown above

    • The requests.get() method

      We have learned it already, so let’s put it to use:

      >>> import requests
      >>> r =  requests.get("https://www.python123.io/ws/demo.html")
      >>> r.status_code
      200
      >>> r.text

      Easy, isn’t it?

  2. Import the BeautifulSoup library

    Define a demo variable

    >>> demo = r.text

    Import the library

    >>> from bs4 import BeautifulSoup

    The full library name is long, so the package is called bs4 for short. The line above imports a class called BeautifulSoup from bs4.

  3. Make the soup

    We need to turn demo into a soup that BeautifulSoup can understand

    >>> soup = BeautifulSoup(demo, "html.parser")

  4. Print the result

    >>> print(soup.prettify())

    Isn’t that much easier to read than the raw r.text output?

    This also confirms that the BeautifulSoup4 library was installed successfully.

2.3.3. Summary

Using the BeautifulSoup library takes just two lines of code:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>data</p>", "html.parser")

The first line imports the BeautifulSoup class from bs4.

The second line makes the soup and stores it in one variable. The call takes two arguments:

  • "<p>data</p>": the information to be parsed

  • "html.parser": the parser used to parse the HTML
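As a minimal sketch of the two-argument call described above, the snippet below parses a one-line string and reads back the first tag. The variable names markup and first_p are illustrative choices, not part of the original notes.

```python
from bs4 import BeautifulSoup

# Illustrative one-line document (same string as in the summary).
markup = "<p>data</p>"

# First argument: the markup to parse. Second argument: the parser name.
soup = BeautifulSoup(markup, "html.parser")

first_p = soup.p           # the first <p> tag in the document
print(first_p.name)        # name of the tag
print(first_p.string)      # text between <p> and </p>
```

No network access is needed here, so this is a quick way to check that the installation works.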

3. Basic elements of BeautifulSoup library

The example above is a small demonstration of what BeautifulSoup does: it “makes soup.”

Let’s take a closer look at this library.

3.1. Understanding the BeautifulSoup library

  1. To understand it, let’s first look at what an HTML file is

    When you open an HTML file, you will find that it is organized as pairs of <> tags. The parent-child relationships between the tags form a tag tree.

    BeautifulSoup is a library that parses, traverses, and maintains this “tag tree”.

    As long as a file has this tag-tree structure, the BeautifulSoup library can parse it.

  2. The tag format

    Now that you know what an HTML file is, how well do you know the format of each tag?

    <p class="title">...</p>

    Take the p tag as an example

    • The first thing you can see is a pair of <>.

    • <p>...</p>: the tag. p is the name of the tag; it appears at the beginning and at the end to mark the scope of the tag.

    • class="title": the attribute field, containing zero or more attributes. Attributes define the characteristics of the tag.

      • class: the attribute name
      • title: the attribute value

      Every attribute has a name and a value, so you can think of attributes as key-value pairs.
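The key-value view of attributes can be checked directly. The sketch below parses a single tag and inspects its name and attribute dictionary; the sample text "hello" and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative snippet: one tag with one attribute.
snippet = '<p class="title">hello</p>'
tag = BeautifulSoup(snippet, "html.parser").p

tag_name = tag.name        # the tag name
attributes = tag.attrs     # attributes as a dict of key-value pairs
missing = tag.get("id")    # .get() returns None for an absent attribute

print(tag_name, attributes, missing)
```

Note that class is stored as a list, because an HTML element may carry several class names at once.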

3.2. Importing the BeautifulSoup library

  • The BeautifulSoup library is also known as BeautifulSoup4 or bs4

    It has to be imported before use

    1. The common way

      from bs4 import BeautifulSoup

      We used this one before

      It imports a type called BeautifulSoup from the bs4 library

    2. The traditional way

      import bs4

      This form is also widely used; names are then reached through the module, for example bs4.BeautifulSoup.
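A short sketch showing that both import styles reach the same class; the variable names soup_a and soup_b are illustrative.

```python
import bs4
from bs4 import BeautifulSoup

# Both names refer to the same class object.
same_class = bs4.BeautifulSoup is BeautifulSoup

# Either spelling can be used to make the soup.
soup_a = BeautifulSoup("<p>data</p>", "html.parser")
soup_b = bs4.BeautifulSoup("<p>data</p>", "html.parser")

print(same_class, soup_a.p.string, soup_b.p.string)
```

The module-style import is handy when other names from the package are needed as well, since they can all be reached through the bs4 prefix.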

3.3. The BeautifulSoup class

An HTML document corresponds to a tag tree, and BeautifulSoup processes that tree into a BeautifulSoup object. A BeautifulSoup object can therefore be understood as representing a tag tree.

To sum up: the HTML document, the tag tree, and the BeautifulSoup object are equivalent.

On this basis, the BeautifulSoup class turns the tag tree into a variable, and processing that variable is processing the tag tree.

A BeautifulSoup object corresponds to the entire content of an HTML/XML document.

3.4. BeautifulSoup library parsers

As mentioned earlier, the purpose of a parser is to parse HTML/XML documents. The available parsers are:

Parser               Usage                               Requirement
bs4 HTML parser      BeautifulSoup(mk, "html.parser")    Install the bs4 library
lxml HTML parser     BeautifulSoup(mk, "lxml")           pip install lxml
lxml XML parser      BeautifulSoup(mk, "xml")            pip install lxml
html5lib parser      BeautifulSoup(mk, "html5lib")       pip install html5lib

For everyday HTML/XML parsing the choice of parser hardly matters; the differences only show up in more advanced use, so html.parser is enough for these notes.
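Since lxml and html5lib are optional installs, one possible pattern is to try them in order and fall back to the built-in html.parser. This is only a sketch: make_soup is a hypothetical helper written for these notes, not part of bs4, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is available."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("no usable parser found")

soup, used = make_soup("<p>data</p>")
print(used, soup.p.string)
```

Different parsers can build slightly different trees for the same markup (for example, some add html and body tags around a fragment), so it is worth fixing one parser per project.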

3.5. Basic elements of the BeautifulSoup class

Basic element      Description
Tag                A tag, the most basic unit of information organization; <> and </> mark its beginning and end.
Name               The name of the tag; the name of <p>...</p> is "p". Format: <tag>.name
Attributes         The attributes of the tag, organized as a dictionary. Format: <tag>.attrs
NavigableString    The non-attribute string in the tag, that is, the string between <> and </>. Format: <tag>.string
Comment            The comment part of a string inside the tag, a special Comment type

Is that a lot to take in? Let’s try each one.

With the page source already loaded into soup, let’s type:

  1. The Tag element

    >>> soup.title
    <title>This is a python demo page</title>

    That is all it takes to look at a tag.

  2. Get the content of the link tag

    Define a variable tag that holds the <a>...</a> tag, that is, the link tag.

    >>> tag = soup.a
    >>> tag
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

    As you can see, the content of the tag is printed

    How did I know there was an <a>...</a> tag? From the page source we looked at earlier.

  3. Tag name

    >>> soup.a.name
    'a'

    Here we already know the tag is a, so soup.a.name tells us nothing new. But what if we don’t know a tag’s name?

    Here we introduce a few terms: child tags and parent tags.

    Look at the parent of the <a>...</a> tag, that is, the tag one level up

    >>> soup.a.parent.name
    'p'
    >>> soup.a.parent.parent.name
    'body'

    <p>...</p> is the parent of <a>...</a>, and <body>...</body> is the parent of <p>...</p>. The names are printed as strings.

  4. Attribute information of the tag

    Take <a>...</a> as an example

    >>> tag = soup.a
    >>> tag.attrs
    {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

    The attributes come back as a dictionary

    Of course, we can also extract individual values from it

    >>> tag.attrs['class']
    ['py1']
    >>> tag.attrs['href']
    'http://www.icourse163.org/course/BIT-268001'
    >>> tag.attrs['id']
    'link1'
  5. NavigableString element of the tag

    This is a string type that represents the text between the opening and closing angle-bracket tags

    >>> soup.a
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    >>> soup.a.string
    'Basic Python'
    >>> soup.p
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    >>> soup.p.string
    'The demo python introduces several python courses.'
  6. The Comment type

    As the table says, this represents a comment. What happens when a comment appears in an HTML page?

    Let’s make a new soup

    >>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
    >>> newsoup.b.string
    'This is a comment'

    Note: <!--...--> is a comment, whereas <p>...</p> is not. That wraps up this part. Let’s do a quick review.
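Because .string returns the text of a comment without its <!-- --> markers, a comment and ordinary text can look the same when printed. A minimal sketch of telling them apart by type follows; describe is a hypothetical helper name used only here.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

def describe(node):
    """Classify a value returned by <tag>.string."""
    # Comment is a subclass of NavigableString, so it must be checked first.
    if isinstance(node, Comment):
        return "comment"
    if isinstance(node, NavigableString):
        return "text"
    return "other"

kinds = (describe(newsoup.b.string), describe(newsoup.p.string))
print(kinds)
```

Checking the type this way avoids treating hidden comment text as visible page content.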

4. Traversing HTML content with the bs4 library

We have learned that HTML text is organized as a tree: there are many tags, and the relationships between them are explicit. That is still a little inconvenient to read, so let’s draw it in a different form.

[Figure: the tag tree of the demo page, redrawn after tidying. Author’s note: the <h>... node drawn under <p>...</p> is mislabeled in the figure.]

This is the structure after another round of tidying. Clearer, isn’t it?

We call the tags nodes. We can then traverse the tree in three ways

  • Downward traversal: from the root node towards the leaf nodes
  • Upward traversal: the opposite of downward traversal, from a leaf node towards the root node
  • Parallel traversal: horizontally, between nodes at the same level

So the idea is easy to understand.

Each direction has its own traversal attributes, so let’s look at them separately.

4.1. Downward traversal of the tag tree

Attribute       Description
.contents       List of child nodes; all child nodes of <tag> are stored in a list
.children       Iterator over the child nodes; similar to .contents, used to loop over the children
.descendants    Iterator over the descendant nodes; contains all descendants, used for loop traversal

Let’s try each of the three in turn:

  1. .contents

    Let’s read the child nodes of the <head>...</head> tag

    >>> soup.head.contents
    [<title>This is a python demo page</title>]

    It is the <title>...</title> tag.

    To be clear: .contents returns only the direct children, not all descendants.

    Since a list is returned, we can index into it

    Let’s take a look at the .contents of <body>...</body>

    >>> soup.body.contents
    ['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']

    You can see there is a lot of information in there, corresponding to the different elements of the body.

    Note that the child nodes of a tag include not only tag nodes but also string nodes, so '\n' shows up as well.

    We can check the number of child nodes

    >>> len(soup.body.contents)
    5

    Five.

    We can also look one up by index

    >>> soup.body.contents[1]
    <p class="title"><b>The demo python introduces several python courses.</b></p>

    It is the first <p>...</p> tag.

  2. .children

    Of course, we can also use this one

    >>> for child in soup.body.children:
    ...     print(child)
    ...
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

    This loops over the child nodes under the <body>...</body> tag

    I kept wondering why an a tag shows up here. Now I understand: the p tag contains the a tags.

  3. .descendants

    >>> for child2 in soup.body.descendants:
    ...     print(child2)
    ...
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <b>The demo python introduces several python courses.</b>
    The demo python introduces several python courses.
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    Basic Python
     and 
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
    Advanced Python
    .

    At first it was hard to see why the same tag appears more than once. Comparing the output with the tidied structure makes it clear: tags are nested inside other tags, so each nested tag is printed once as part of its parent and once on its own.
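The three downward attributes can be compared side by side. The sketch below uses a shortened stand-in for the demo page (an assumption, so that no network access is needed) and filters out the '\n' string nodes by keeping only Tag instances.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the demo page body.
html = (
    "<body>\n"
    '<p class="title"><b>Title</b></p>\n'
    '<p class="course">See <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>\n'
    "</body>"
)
body = BeautifulSoup(html, "html.parser").body

# .contents is a list: direct children only, including the '\n' strings.
child_count = len(body.contents)

# .children iterates the same direct children; keep only the tags.
child_tags = [node.name for node in body.children if isinstance(node, Tag)]

# .descendants walks every nested level in document order.
descendant_tags = [node.name for node in body.descendants if isinstance(node, Tag)]

print(child_count, child_tags, descendant_tags)
```

Filtering with isinstance(node, Tag) is a common way to skip whitespace-only string nodes when only the tags matter.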

4.2. Upward traversal of the tag tree

There are only two attributes

Attribute    Description
.parent      The parent tag of the node
.parents     Iterator over the ancestor tags of the node, used to loop over the ancestors

The names already tell you what they do. Let’s look at each.

  1. .parent

    Let’s look at the parent of the <title>...</title> tag

    >>> soup.title.parent
    <head><title>This is a python demo page</title></head>

    Now let’s take a look at the parent of the <html>...</html> tag

    >>> soup.html.parent
    <html><head><title>This is a python demo page</title></head><body><p class="title"><b>The demo python introduces several python courses.</b></p><p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p></body></html>

    Because <html>...</html> is the highest-level tag in the HTML text, its parent is the soup object itself, which prints as the whole document.

    The soup object is a special tag, so we can ask for its parent too

    >>> soup.parent
    >>> 

    Nothing is printed: soup has no parent, so the result is empty (None).

  2. .parents

    >>> for parent in soup.a.parents:
    ...     if parent is None:
    ...         print(parent)
    ...     else:
    ...         print(parent.name)
    ...
    p
    body
    html
    [document]

    What does this code do?

    It prints the names of all ancestors of the <a>...</a> tag. Why the if-else statement? Walking up through all ancestors eventually reaches the soup object itself, whose name prints as [document]. Above the soup there is nothing, and an empty (None) value has no .name, so the code guards against that case.

    So these two attributes are easy to understand.
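The ancestor chain can also be collected into a list. The sketch below uses a tiny stand-in document (an assumption, not the real demo page) and confirms that the chain ends at the soup object, whose own parent is None.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document with one link nested in a paragraph.
html = (
    "<html><head><title>T</title></head>"
    "<body><p>See <a>link</a></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Names of every ancestor of the first <a>, nearest first.
parent_names = [ancestor.name for ancestor in soup.a.parents]

# The soup object is the top of the tree: it has a name but no parent.
top_name = soup.name
top_parent = soup.parent

print(parent_names, top_name, top_parent)
```

The last entry in the list is the soup object itself, which is why its name appears as [document].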

4.3. Parallel traversal of the tag tree

There are four attributes

Attribute             Description
.next_sibling         Returns the next parallel node in HTML text order
.previous_sibling     Returns the previous parallel node in HTML text order
.next_siblings        Iterator that returns all following parallel nodes in HTML text order
.previous_siblings    Iterator that returns all preceding parallel nodes in HTML text order

They come in pairs: next and previous, single and iterator.

Parallel traversal has one condition: it only happens between nodes that share the same parent node.

  1. .next_sibling

    >>> soup.a.next_sibling
    ' and '

    The next parallel node of the a tag is a string.

    Although the tag tree is organized around tags, the NavigableString objects between tags are also nodes of the tree. So the sibling, parent, or child of any node may be a NavigableString. Don’t assume that the next node returned by parallel traversal is a tag.

    That makes the point quite concretely.

    Next, let’s look at the parallel node after that one

    >>> soup.a.next_sibling.next_sibling
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
  2. .previous_sibling

    Let’s look at the parallel node that precedes the a tag

    >>> soup.a.previous_sibling
    'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'

    It is a piece of text.

    Similarly, let’s ask for the parallel node before that one

    >>> soup.a.previous_sibling.previous_sibling
    >>> 

    It returns nothing.

    Inside the same parent there is no node before that text. Parallel traversal never leaves the parent, which is why it does not continue to nodes outside the <p>...</p> tag.

  3. .next_siblings

    >>> for sibling in soup.a.next_siblings:
    ...     print(sibling)
    ...
     and 
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

    Looks familiar.

  4. .previous_siblings

    >>> for sibling in soup.a.previous_siblings:
    ...     print(sibling)
    ...
    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
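Since siblings may be strings as well as tags, a common follow-up is to keep only the tag siblings. The sketch below uses a one-paragraph stand-in for the demo page (an assumption); find_next_sibling is the bs4 method that skips straight to the next matching tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# One-paragraph stand-in with two links separated by text.
html = '<p>Courses: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
first = BeautifulSoup(html, "html.parser").a

# Immediate neighbours are plain strings, not tags.
before = first.previous_sibling
after = first.next_sibling

# Keep only the siblings that are tags.
next_tag_ids = [node["id"] for node in first.next_siblings if isinstance(node, Tag)]

# Shortcut: jump to the next sibling that is an <a> tag.
next_link = first.find_next_sibling("a")

print(repr(before), repr(after), next_tag_ids, next_link["id"])
```

Here the text before the first link has no previous sibling of its own, because parallel traversal stays inside the shared parent.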

5. Formatted HTML output with the bs4 library

The word “formatting” makes me think of string formatting, but that is not what is meant here. The question is how to make HTML text more “friendly”.

“Friendly” means not only easier for people to read, but also easier for programs to read and analyze.

5.1. The bs4 library’s prettify() method

Let’s look at a piece of code:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.prettify()

The returned string contains a lot of \n newline characters. Now let’s print it:

>>> print(soup.prettify())

That makes the structure much clearer. It is, of course, the same output we saw in the installation test.

prettify(): this method adds line breaks to the HTML text so that it becomes easier to read. It can also be applied to a single tag.

For example, let’s print the <a>...</a> tag in soup

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

One of the most important issues in the bs4 library is encoding: any HTML file or string it reads is converted to Unicode text, and output is encoded as UTF-8 by default. So encoding is no obstacle when parsing.
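A small sketch of the encoding behaviour, under the assumption that the input arrives as GBK-encoded bytes (the sample text and variable names are illustrative). The from_encoding argument tells BeautifulSoup how to decode the bytes; after that, everything is ordinary Python text.

```python
from bs4 import BeautifulSoup

# Illustrative input: bytes in a non-UTF-8 encoding.
raw = "<p>中文</p>".encode("gbk")

# from_encoding names the encoding of the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                 # decoded text, a str subclass
utf8_bytes = soup.p.encode("utf-8")  # re-encode the tag as UTF-8 bytes
pretty = soup.prettify()             # str with line breaks added

print(text)
print(utf8_bytes)
print(pretty)
```

When from_encoding is omitted, BeautifulSoup tries to detect the encoding of byte input on its own, which usually works but can guess wrong on short documents.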

6. Summary

Recall the requests library from the previous chapter: it covered fetching page source as well as adding, deleting, changing, and looking up resources by URL, and we focused on fetching the source. That raw source is not easy to read, so this chapter is mostly about reading it more comfortably: parsers, plus a few operations on the tag tree.

So that’s it for my notes.

Thank you for reading. If there are mistakes in the article, corrections are welcome; I’ll be glad if it is of any help to you.