Today begins a new stretch of the 120 Crawlers column series: this article and the next two will focus on BeautifulSoup4.
BeautifulSoup4 basic knowledge supplement
BeautifulSoup4 is a Python parsing library for HTML and XML; in crawlers it is most often used to parse HTML.
```
pip install beautifulsoup4
```
BeautifulSoup relies on a parser to do the actual parsing. The common parsers and their strengths are as follows (a quick comparison sketch follows the list):

- `html.parser`: Python's built-in standard library, with strong fault tolerance;
- `lxml`: fast, with strong fault tolerance;
- `html5lib`: the strongest fault tolerance, parsing pages the same way a browser does.
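To make the comparison concrete, here is a minimal sketch of switching parsers; the lxml and html5lib lines assume those packages have been pip-installed separately:

```python
from bs4 import BeautifulSoup

html = "<p>hello"  # deliberately broken HTML: the p tag is never closed

# Built-in parser, no extra installation needed
print(BeautifulSoup(html, "html.parser"))  # <p>hello</p>

# These require `pip install lxml` / `pip install html5lib` first
# print(BeautifulSoup(html, "lxml"))
# print(BeautifulSoup(html, "html5lib"))  # adds <html><head>... like a browser would
```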
The basic use of the BeautifulSoup4 library is demonstrated with a custom HTML snippet that looks like this:
```html
<html>
<head>
<title>Test the BS4 module script</title>
</head>
<body>
<h1>Eraser crawler class</h1>
<p>Demonstrate this with a custom HTML code</p>
</body>
</html>
```
Simple operations with BeautifulSoup include instantiating a BS object, printing page tags, and so on.
```python
from bs4 import BeautifulSoup

text_str = """<html>
<head>
<title>Test the BS4 module script</title>
</head>
<body>
<h1>Eraser crawler class</h1>
<p>Demonstrate with the 1st custom HTML paragraph</p>
<p>Demonstrate with the 2nd custom HTML paragraph</p>
</body>
</html>"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The above builds the object from a string; you can also build it from a file
# soup = BeautifulSoup(open('test.html'))
print(soup)
# Print the title tag of the web page
print(soup.title)
# Print the head tag of the web page
print(soup.head)
# Print the paragraph tag p
print(soup.p)  # gets the first one by default
```
We can access a web page tag directly from the BeautifulSoup object, but there is a catch: accessing a tag this way only returns the first match, which is why the code above yields a single p tag. To get more, you first need to know the four built-in objects in BeautifulSoup:
- `BeautifulSoup`: the basic object, representing the whole HTML document; it can generally be treated like a Tag object;
- `Tag`: a tag object. Tags are the nodes of a web page, such as title, head, p;
- `NavigableString`: the string inside a tag;
- `Comment`: a comment object; it appears in few crawler scenarios.
The following code shows you how these objects appear. Note the comments in the code.
```python
from bs4 import BeautifulSoup

text_str = """<html>
<head>
<title>Test the BS4 module script</title>
</head>
<body>
<h1>Eraser crawler class</h1>
<p>Demonstrate with the 1st custom HTML paragraph</p>
<p>Demonstrate with the 2nd custom HTML paragraph</p>
</body>
</html>"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# You can also build the object from a file
# soup = BeautifulSoup(open('test.html'))
print(soup)
print(type(soup))  # <class 'bs4.BeautifulSoup'>
# Print the title tag of the web page
print(soup.title)
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
# Print the head tag of the web page
print(soup.head)
```
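The demo above covers the first three objects but not Comment; here is a minimal sketch of where a Comment object shows up:

```python
from bs4 import BeautifulSoup

# A tag whose only content is an HTML comment yields a Comment object
soup = BeautifulSoup("<p><!-- an HTML comment --></p>", "html.parser")
comment = soup.p.string
print(type(comment))  # <class 'bs4.element.Comment'>
print(comment)        # an HTML comment
```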
A Tag object has two important attributes: `name` and `attrs`.
```python
from bs4 import BeautifulSoup

text_str = """<html>
<head>
<title>Test the BS4 module script</title>
</head>
<body>
<h1>Eraser crawler class</h1>
<p>Demonstrate with the 1st custom HTML paragraph</p>
<p>Demonstrate with the 2nd custom HTML paragraph</p>
<a href="http://www.csdn.net">CSDN website</a>
</body>
</html>"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
print(soup.name)  # [document]
print(soup.title.name)  # gets the tag name: title
print(soup.html.body.a)  # lower-level tags can be reached through the tag hierarchy
print(soup.body.a)  # html, being the special root tag, can be omitted
print(soup.p.a)  # None: there is no a tag inside the first p tag
print(soup.a.attrs)  # gets the tag's attributes as a dictionary
```
The above code demonstrates the name and attrs attributes; attrs yields a dictionary whose values can be retrieved by key.
BeautifulSoup also supports the following ways to get a tag's attribute value:
```python
print(soup.a["href"])
print(soup.a.get("href"))
```
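Worth a quick supplementary sketch: for multi-valued attributes such as class, BeautifulSoup returns a list, and get() returns None for a missing attribute instead of raising:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<a class="nav item" href="#">link</a>', "html.parser")
print(demo.a["class"])   # ['nav', 'item']: class is multi-valued, so a list comes back
print(demo.a.get("id"))  # None: get() avoids the KeyError that demo.a["id"] would raise
```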
Once you have the web page tag, the next step is often to get the text inside it, using the following code.
```python
print(soup.a.string)
```
In addition, you can use the text property and the get_text() method to get the tag content.
```python
print(soup.a.string)
print(soup.a.text)
print(soup.a.get_text())
```
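The three calls above look interchangeable here, but they behave differently once a tag has several children; a quick sketch of the difference:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
print(demo.div.string)  # None: string only works when a tag has a single text child
print(demo.div.text)    # 'onetwo': text concatenates all descendant strings
```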
You can also get all the text inside a tag and its descendants, using the strings and stripped_strings attributes.
```python
print(list(soup.body.strings))           # includes whitespace and newlines
print(list(soup.body.stripped_strings))  # whitespace and newlines stripped out
```
Extension: traversing the document tree with tag/node selectors
Direct child nodes
The direct children of a Tag object can be obtained through the contents and children attributes.
```python
from bs4 import BeautifulSoup

# Sample page for this section: a div wrapping an h1 (with a nested span),
# followed by a ul navigation list
text_str = """<html>
<head><title>Test the BS4 module script</title></head>
<body>
<div id="content"><h1>Eraser crawler class<span>span in h1</span></h1></div>
<ul class="nav li">
<li><a href="#">Home</a></li>
<li><a href="#">Column</a></li>
</ul>
</body></html>"""
soup = BeautifulSoup(text_str, "html.parser")
# The contents attribute gets the direct children of a node, returned as a list
print(soup.div.contents)
# The children attribute also gets the direct children, returned as a generator
print(soup.div.children)
```
Note that both attributes above fetch only direct children; the span nested inside the h1 tag, for example, is not fetched on its own. If you want every descendant tag, use the descendants attribute, which returns a generator; all nodes, including the text inside tags, are yielded individually.
```python
print(list(soup.div.descendants))
```
Getting other nodes (just get familiar with these; look them up when needed)
- `parent` and `parents`: the direct parent node and all ancestor nodes;
- `next_sibling`, `next_siblings`, `previous_sibling`, `previous_siblings`: the next sibling node, all following siblings, the previous sibling node, and all preceding siblings. Since a newline character also counts as a node, watch out for newlines when using these attributes;
- `next_element`, `next_elements`, `previous_element`, `previous_elements`: the next or previous node(s) in document order. Note that these are not restricted to one level of the hierarchy but run over all nodes: in the code above, the next element of the div node is h1, while the next sibling of the div node is the ul. A short sketch follows this list.
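Here is that sketch, run against the soup object built above; note how the newline between tags is itself a node:

```python
# Sibling vs. element navigation on the sample page above
print(repr(soup.div.next_sibling))         # '\n': the newline between div and ul is a node
print(soup.div.next_sibling.next_sibling)  # <ul class="nav li">...</ul>
print(soup.div.next_element)               # <h1>...</h1>: next in document order, any depth
```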
Functions for searching the document tree
The first function to learn is the find_all() function, which looks like this:
```python
find_all(name, attrs, recursive, text, limit=None, **kwargs)
```
- `name`: the tag name, for example `find_all('p')` finds all `p` tags; it accepts tag name strings, regular expressions, and lists;
- `attrs`: attributes to match, passed in as a dictionary, for example `attrs={'class': 'nav'}`; the result is a list of tags.
The following is an example of the two parameters:
```python
import re

print(soup.find_all('li'))  # get all li tags
print(soup.find_all(attrs={'class': 'nav'}))  # pass the attrs parameter
print(soup.find_all(re.compile("p")))  # pass a regex; in testing the effect was not ideal
print(soup.find_all(['a', 'p']))  # pass a list
```
`recursive`: when `find_all()` is called, BeautifulSoup searches all descendants of the current tag. To search only the direct children, pass `recursive=False`. The test code is as follows:
```python
print(soup.body.div.find_all(['a', 'p'], recursive=False))  # [] here: div's only direct child is h1
```
`text`: matches the text content of the document; like the `name` parameter, it accepts tag name strings, regular expressions, and lists.
```python
print(soup.find_all(text='Home'))  # ['Home']
print(soup.find_all(text=re.compile("^Ho")))  # ['Home']
print(soup.find_all(text=["Home", re.compile('class')]))  # ['Eraser crawler class', 'Home']
```
`limit`: limits the number of results returned; `**kwargs`: if a named argument is not one of the built-in search parameter names, it is searched as a tag attribute. A search by the `class` attribute must be written `class_`, because `class` is a Python reserved word. `class_` accepts a single CSS class name; to match multiple class names, write them in the same order as they appear in the tag:
```python
print(soup.find_all(class_='nav'))
print(soup.find_all(class_='nav li'))
```
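The limit parameter from the list above, in a one-line sketch:

```python
print(soup.find_all('li', limit=1))  # caps the results: only the first li tag comes back
```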
Note also that some attributes of web page nodes cannot be used as keyword arguments in a search, such as the HTML5 `data-*` attributes, which must be matched through the `attrs` parameter.
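A minimal sketch of that restriction, using a made-up data-id attribute:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<div data-id="42">box</div>', "html.parser")
# demo.find_all(data-id="42") would be a SyntaxError, so fall back to attrs
print(demo.find_all(attrs={"data-id": "42"}))  # [<div data-id="42">box</div>]
```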
Other methods whose usage largely mirrors find_all() are listed below:
- `find()`: prototype `find(name, attrs, recursive, text, **kwargs)`; returns a single matching element;
- `find_parents()`, `find_parent()`: prototype `find_parent(self, name=None, attrs={}, **kwargs)`; return the ancestor nodes of the current node;
- `find_next_siblings()`, `find_next_sibling()`: prototype `find_next_sibling(self, name=None, attrs={}, text=None, **kwargs)`; return the sibling(s) after the current node;
- `find_previous_siblings()`, `find_previous_sibling()`: same as above, but return the sibling(s) before the current node;
- `find_all_next()`, `find_next()`, `find_all_previous()`, `find_previous()`: prototype `find_all_next(self, name=None, attrs={}, text=None, limit=None, **kwargs)`; search the nodes that come after or before the current node in the document. A small contrast between find() and find_all() follows this list.
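Here is that sketch, run against the sample soup above; find() returns one element or None, never a list:

```python
print(soup.find('li'))     # the first li tag only
print(soup.find('table'))  # None when nothing matches, instead of an empty list
```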
CSS selectors: this part is close to PyQuery. The core method is select(), and the data returned is a list.
- Find by tag name: `soup.select("title")`;
- By class name: `soup.select(".nav")`;
- By id: `soup.select("#content")`;
- By combination: `soup.select("div#content")`;
- By attribute: `soup.select("div[id='content']")`, `soup.select("a[href]")`.
There are a few other tricks you can use when looking up by attribute, such as:
`^=`: matches nodes whose attribute value starts with the given characters:

```python
print(soup.select('ul[class^="na"]'))
```
`*=`: matches nodes whose attribute value contains the given characters:

```python
print(soup.select('ul[class*="li"]'))
```
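Also handy, though not shown above: select() always returns a list, while select_one() returns only the first match. A quick sketch against the sample soup:

```python
print(soup.select('ul.nav > li'))          # direct li children of the ul with class nav
print(soup.select_one('ul[class^="na"]'))  # just the first match, or None
```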
Crawler case: 9thws.com
With the basics of BeautifulSoup mastered, writing a crawler case is straightforward. The collection target is www.9thws.com/#p2, which hosts a large number of artistic QR codes for designers to reference.
The case applies the tag and attribute retrieval covered above; the complete code is as follows:
```python
from bs4 import BeautifulSoup
import requests
import logging

logging.basicConfig(level=logging.NOTSET)


def get_html(url, headers) -> None:
    res = None
    try:
        res = requests.get(url=url, headers=headers, timeout=3)
    except Exception as e:
        logging.debug("Request exception: %s", e)
    if res is not None:
        html_str = res.text
        soup = BeautifulSoup(html_str, "html.parser")
        imgs = soup.find_all(attrs={'class': 'lazy'})
        print("Amount of data obtained:", len(imgs))
        datas = []
        for item in imgs:
            name = item.get('alt')
            src = item["src"]
            logging.info(f"{name},{src}")
            # Collect the (name, src) pairs for downloading
            datas.append((name, src))
        save(datas, headers)


def save(datas, headers) -> None:
    if datas is not None:
        for item in datas:
            res = None
            try:
                # Fetch the picture
                res = requests.get(url=item[1], headers=headers, timeout=5)
            except Exception as e:
                logging.debug(e)
            if res is not None:
                img_data = res.content
                # The ./imgs directory must exist beforehand
                with open("./imgs/{}.jpg".format(item[0]), "wb+") as f:
                    f.write(img_data)
    else:
        return None


if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
    }
    url_format = "http://www.9thws.com/#p{}"
    urls = [url_format.format(i) for i in range(1, 2)]
    get_html(urls[0], headers)
```
The test output of this code is produced with the logging module, and the effect is shown in the figure below. Only one page of data was collected in the test; to widen the collection range, simply modify the page-number rule in the `__main__` block. ==While writing the code I noticed that the site's data request is actually a POST returning JSON, so this case merely serves as a handy BeautifulSoup exercise.==
Code repository: codechina.csdn.net/hihell/pyth… Go give it a follow or a star.
Afterword
The road to learning the bs4 module has officially begun; let's keep at it together.
Today is day 238/365 of continuous writing. I look forward to your follows, likes, comments, and favorites.