This is the 26th day of my participation in Gwen Challenge
Everyone's life is a journey toward oneself, an attempt at a road, the quiet hint of a path. No one has ever been entirely himself, yet everyone strives to become so, some dully, some more perceptively, each in his own way. Each carries with him the remnants of his birth, the slime and eggshells of a primal world, to the end of his life. – Hermann Hesse, Demian
Why the library is called Beautiful Soup is a bit of a mystery to me, although I know the name comes from a beautiful fairy tale. Visit the official website to find out more (link below):
www.crummy.com/software/Be…
1. Features of BeautifulSoup4 library
BeautifulSoup4 has a brief description on its website:
Beautiful Soup provides some simple methods for navigating, searching, and modifying parse trees and Pythonic idioms: a toolkit for dissecting documents and extracting what you need. You don’t need a lot of code to write an application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to worry about encoding unless the document doesn’t specify it and Beautiful Soup can’t detect it. Then, you just need to specify the raw encoding.
Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, allowing you to try different parsing strategies or trade speed for flexibility.
BeautifulSoup3 is no longer being developed. BeautifulSoup4 is distributed as the bs4 package, so import the library with: from bs4 import BeautifulSoup
2. Installation of Beautiful Soup4 library
Open the CMD command-line interface and enter: pip install beautifulsoup4
Write a small project to check that the BeautifulSoup library is installed successfully:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://python123.io/ws/demo.html")
print(r.text)
demo=r.text
soup=BeautifulSoup(demo,"html.parser")
print(soup.prettify())
The script prints the page's HTML twice: first the raw text, then the version indented by prettify().
If both are printed without errors, the installation is successful.
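The installation can also be checked without a network connection. The following is a minimal sketch, assuming only that the bs4 package is importable; the one-line HTML string is made up for illustration and is not part of the demo page.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only to exercise the parser.
html = "<p>hello</p>"

soup = BeautifulSoup(html, "html.parser")
print(bs4.__version__)   # version string of the installed package
print(soup.p.string)     # text inside the first <p> tag
```

If both lines print without an ImportError, the library is installed and the built-in parser works.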
3. Simple use of the BeautifulSoup library
If you are interested, you can try running the following code (it is recommended to run the print statements one at a time):
import requests
from bs4 import BeautifulSoup
r=requests.get("https://python123.io/ws/demo.html")
#print(r.text)
demo=r.text
soup=BeautifulSoup(demo,"html.parser")
print(soup.prettify())
print(soup.title)                   # the <title> tag
tag=soup.a                          # the first <a> tag in the document
print(tag)
print(soup.a.name)                  # tag name
print(soup.a.parent.name)           # name of the parent tag
print(soup.a.parent.parent.name)    # name of the grandparent tag
print(tag.attrs)                    # attributes as a dictionary
print(tag.attrs['class'])
print(tag.attrs['href'])
print(type(tag.attrs))
print(type(tag))
print(soup.a.string)                # string inside the <a> tag
print(soup.p.string)                # string inside the first <p> tag
print(type(soup.p.string))
You can figure out what it does based on the output, which is actually pretty easy.
The BeautifulSoup library is also called beautifulsoup4 or bs4. Import it with from bs4 import BeautifulSoup (remember that the B and the S are capitalized). You can also use import bs4.
Consider the following statement: soup=BeautifulSoup(demo,"html.parser")
html.parser is the HTML parser that ships with Python, loosely called an "interpreter" here; in the statement above it parses the demo text fetched earlier.
So what is an interpreter?
Baidu: An interpreter is a computer program that translates and runs a high-level programming language line by line. Instead of translating the entire program at once, the interpreter acts like a middleman: it translates one statement, runs it, then translates the next statement and runs that, and so on. Because this translation happens every time the program runs, interpreted programs run slowly.
A compiler plays a related role but translates the whole program before it runs; for details, please look both up on Baidu. Strictly speaking, html.parser is a parser rather than an interpreter: it does not execute anything, it reads HTML text and builds a tree of tags from it.
The parsers available to the BeautifulSoup library (parser: usage; requirement):
bs4's HTML parser: BeautifulSoup(mk,'html.parser'); requires only the bs4 library
lxml's HTML parser: BeautifulSoup(mk,'lxml'); requires pip install lxml
lxml's XML parser: BeautifulSoup(mk,'xml'); requires pip install lxml
html5lib's parser: BeautifulSoup(mk,'html5lib'); requires pip install html5lib
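Because lxml and html5lib are optional installs, a script can try them and fall back to the built-in parser. The sketch below is one possible way to do that, not something the library prescribes: make_soup and its preferred order are hypothetical names chosen for illustration, while FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")

# Made-up document for illustration.
mk = "<html><head><title>Demo</title></head><body><p>text</p></body></html>"
soup, used = make_soup(mk)
print(used)                # name of the parser that was actually used
print(soup.title.string)   # text of the <title> tag
```

Since html.parser comes with Python, the final fallback should always succeed.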
To understand the other statements, you need to know the basic elements of the BeautifulSoup class:
Tag: a tag, the basic unit of information organization, opened and closed with <> and </> respectively.
Name: the tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes: the tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString: the non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string
Comment: the comment part of the string inside a tag, a special Comment type.
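The five elements above can be seen on a tiny document. This is a minimal sketch with a made-up HTML string (the demo page is not needed); it assumes the html.parser backend, under which the class attribute is returned as a list.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up markup: one ordinary tag and one tag that holds only a comment.
html = '<p class="title" id="top">Hello</p><b><!--a comment--></b>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(isinstance(p, Tag))     # the <p> element is a Tag
print(p.name)                 # Name: the tag's name
print(p.attrs)                # Attributes: a dictionary
print(type(p.string))         # NavigableString
print(type(soup.b.string))    # Comment, a special kind of NavigableString
```

Note that .string returns a Comment for the <b> tag, so code that only wants visible text may need to check the type first.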
4. Tag tree traversal:
Attributes for downward traversal of the tag tree and their descriptions (the same layout is used below):
.contents: list of children; stores all child nodes of <tag> in a list.
.children: iteration type over the children, similar to .contents; used to loop over the child nodes.
.descendants: iteration type over all descendants; used to loop over every descendant node.
Example:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://python123.io/ws/demo.html")
#print(r.text)
demo=r.text
soup=BeautifulSoup(demo,"html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
print(len(soup.body.contents))
print(soup.body.contents[1])
Tag tree traversal down:
for child in soup.body.children:
    print(child)
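The difference between children and descendants is easier to see on a small document. The sketch below uses a made-up HTML string written without whitespace between tags, so that no extra text nodes appear; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup without whitespace between tags.
html = "<body><p>One <b>bold</b></p><p>Two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

direct = body.contents                         # list of direct children
child_names = [c.name for c in body.children]  # same nodes, via an iterator
all_nodes = list(body.descendants)             # every node below <body>
tag_names = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(direct), child_names)
print(len(all_nodes), tag_names)
```

Here .contents and .children see only the two <p> tags, while .descendants also reaches the nested <b> tag and the text strings.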
Tag tree traversal up:
.parent: the parent tag of the node.
.parents: iteration type over the node's ancestor tags; used to loop over the ancestor nodes.
Example:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://python123.io/ws/demo.html")
demo=r.text
soup=BeautifulSoup(demo,"html.parser")
# Walk from the first <a> tag up through all of its ancestors
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
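To see what the upward loop visits, the same traversal can be run on a small document. This is a sketch on a made-up HTML string, assuming the html.parser backend; names is an illustrative variable.

```python
from bs4 import BeautifulSoup

# Made-up markup with a single link nested three levels deep.
html = '<html><body><p><a href="http://example.com/">link</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Collect the name of every ancestor of the first <a> tag, innermost first.
names = [parent.name for parent in soup.a.parents]
print(names)   # ['p', 'body', 'html', '[document]']
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'. The iteration stops there, so the loop never actually yields None.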
Parallel (sibling) traversal of the tag tree:
.next_sibling: the next sibling node in HTML text order.
.previous_sibling: the previous sibling node in HTML text order.
.next_siblings: iteration type over all following sibling nodes in HTML text order.
.previous_siblings: iteration type over all preceding sibling nodes in HTML text order.
Example:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://python123.io/ws/demo.html")
#print(r.text)
demo=r.text
soup=BeautifulSoup(demo,"html.parser")
print(soup.a.next_sibling)  # next sibling node
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)  # previous sibling node
print(soup.a.previous_sibling.previous_sibling)
print(soup.a.parent)
for sibling in soup.a.next_siblings:  # loop over the following siblings
    print(sibling)
for sibling in soup.a.previous_siblings:  # loop over the preceding siblings
    print(sibling)
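Siblings are not always tags: the text between two tags is a sibling node too. The sketch below shows this on a made-up HTML string and filters the siblings down to tags with isinstance; the ids and variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup: two links separated by plain text.
html = '<p><a id="first">A</a> and <a id="second">B</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.next_sibling))          # the text ' and ', not a tag
print(first.next_sibling.next_sibling)   # the second <a> tag
print(first.previous_sibling)            # None: nothing comes before it

following = list(first.next_siblings)
# Keep only real tags, skipping the text nodes in between.
tags_only = [s for s in following if isinstance(s, Tag)]
print(len(following), [t["id"] for t in tags_only])
```

This is why the examples above call next_sibling twice to get from one <a> tag to the next.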
Note: bs4 converts any HTML file or string that it reads to Unicode, and encodes its output as UTF-8 by default.
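The encoding handling can be checked with a few bytes of non-UTF-8 input. This is a minimal sketch on made-up data: the source encoding is passed explicitly through the from_encoding argument rather than relying on detection, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up input: a short document encoded as Latin-1 bytes.
raw = "<p>café</p>".encode("latin-1")

# Name the source encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string         # a Unicode string inside the parse tree
out = soup.encode("utf-8")   # the document serialized as UTF-8 bytes

print(text)
print(out)
```

Inside the tree everything is a Unicode string; bytes only appear again when the document is encoded for output.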
Python crawler series, to be continued…