
Today brings a new installment of the 120 crawlers column series; the next three articles will all focus on BeautifulSoup4.

BeautifulSoup4 basics

BeautifulSoup4 is a Python library for parsing HTML and XML; in crawlers it is most often used to parse HTML.

pip install beautifulsoup4

BeautifulSoup delegates the actual parsing to a pluggable parser. The strengths of the common parsers are listed below, followed by a short selection sketch:

  • The Python standard library html.parser: Python built-in standard library, strong fault tolerance;
  • LXML parser: Fast speed, strong fault tolerance;
  • html5lib: has the strongest fault tolerance and the same parsing mode as the browser.
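
Which parser runs is chosen by the second argument to BeautifulSoup. A minimal sketch, assuming lxml and html5lib have been installed separately via pip:

from bs4 import BeautifulSoup

broken_html = "<p>unclosed paragraph"
# The second argument selects the parser; each one repairs broken markup differently
print(BeautifulSoup(broken_html, "html.parser").p)  # built in, nothing extra to install
print(BeautifulSoup(broken_html, "lxml").p)         # requires: pip install lxml
print(BeautifulSoup(broken_html, "html5lib").p)     # requires: pip install html5lib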

The basic use of the BeautifulSoup4 library is demonstrated with a custom HTML snippet that looks like this:

<html>
  <head>
    <title>Test the BS4 module script</title>
  </head>
  <body>
    <h1>Eraser crawler class</h1>
    <p>Demonstrate this with a custom HTML code</p>
  </body>
</html>

Basic operations with BeautifulSoup include instantiating a BS object, printing page tags, and so on.

from bs4 import BeautifulSoup

text_str = """<html>
    <head>
        <title>Test the BS4 module script</title>
    </head>
    <body>
        <h1>Eraser crawler lesson</h1>
        <p>Demonstrate with 1 section of custom HTML code</p>
        <p>Demonstrate with 2 sections of custom HTML code</p>
    </body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The line above formats a string as a BeautifulSoup object;
# you can also build one from a file
# soup = BeautifulSoup(open('test.html'))
print(soup)
# Print the title tag of the page
print(soup.title)
# Print the head tag of the page
print(soup.head)
# Test printing the paragraph tag p
print(soup.p)  # gets the first one by default

We can access a web tag directly from the BeautifulSoup object, but there is a catch: accessing a tag this way returns only the first match. In the code above, we get only the first p tag. If you want to get more, read on.
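
A two-line preview of the difference, using find_all(), which is covered in detail later in this article:

print(soup.p)              # dot access: only the first p tag
print(soup.find_all('p'))  # find_all(): a list with both p tags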

Before going further, you need to know the four built-in object types in BeautifulSoup4.

  • BeautifulSoup: the basic object, representing the entire HTML document; it can generally be treated like a Tag object;
  • Tag: tag object; tags are the nodes of a web page, such as title, head, p;
  • NavigableString: the string inside a tag;
  • Comment: comment object; rarely needed in crawlers.

The following code shows you how these objects appear. Note the comments in the code.

from bs4 import BeautifulSoup

text_str = """<html>
    <head>
        <title>Test the BS4 module script</title>
    </head>
    <body>
        <h1>Eraser crawler lesson</h1>
        <p>Demonstrate with 1 section of custom HTML code</p>
        <p>Demonstrate with 2 sections of custom HTML code</p>
    </body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
print(soup)
print(type(soup))  # <class 'bs4.BeautifulSoup'>
# Print the title tag of the page
print(soup.title)
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
# Print the head tag of the page
print(soup.head)

A Tag object has two important properties: name and attrs.

from bs4 import BeautifulSoup

text_str = """<html>
    <head>
        <title>Test the BS4 module script</title>
    </head>
    <body>
        <h1>Eraser crawler lesson</h1>
        <p>Demonstrate with 1 section of custom HTML code</p>
        <p>Demonstrate with 2 sections of custom HTML code</p>
        <a href="http://www.csdn.net">CSDN website</a>
    </body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
print(soup.name)  # [document]
print(soup.title.name)  # get the tag name: title
print(soup.html.body.a)  # lower-level tags can be reached through the tag hierarchy
print(soup.body.a)  # html, as the special root tag, can be omitted
print(soup.p.a)  # None: the a tag is not inside the p tag
print(soup.a.attrs)  # get the tag's attributes

The above code demonstrates the name and attrs properties; attrs yields a dictionary, so a specific attribute can be retrieved by its key.

BeautifulSoup also offers the following ways to get a tag's attribute value:

print(soup.a["href"])
print(soup.a.get("href"))

Once you have retrieved a tag, you will often want the text inside it; use the following code.

print(soup.a.string)

In addition, you can use the text property and the get_text() method to get the tag content.

print(soup.a.string)
print(soup.a.text)
print(soup.a.get_text())

You can also get all the text inside a tag and its descendants, using strings and stripped_strings.

print(list(soup.body.strings))           # includes whitespace and newlines
print(list(soup.body.stripped_strings))  # whitespace and newlines removed

Extension: traversing the document tree with tag/node selectors

Direct child nodes

The immediate children of a Tag object can be obtained using the contents and children attributes.

from bs4 import BeautifulSoup

text_str = """<html>
    <head>
        <title>Test the BS4 module script</title>
    </head>
    <body>
        <div id="content">
            <h1>Eraser crawler lesson<span>best</span></h1>
            <p>Demonstrate with 1 section of custom HTML code</p>
            <a href="http://www.csdn.net">CSDN website</a>
        </div>
        <ul class="nav li">
            <li>Home page</li>
            <li>Column</li>
        </ul>
    </body>
</html>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The contents attribute gets the direct children of the node, returned as a list
print(soup.div.contents)  # returns a list
# The children attribute also gets the direct children, returned as a generator
print(soup.div.children)

Note that both of the preceding attributes return direct children only; for example, the span tag nested inside the h1 is not fetched as a separate item.

If you want every descendant, use the descendants attribute, which returns a generator; all descendants, including the text inside tags, are yielded individually.

print(list(soup.div.descendants))

Other node accessors (good to know; look them up as needed)

  • parent / parents: the direct parent node and all ancestor nodes;
  • next_sibling / next_siblings / previous_sibling / previous_siblings: the next sibling node, all following siblings, the previous sibling node, and all preceding siblings. Since newline characters are also nodes, watch out for them when using these attributes;
  • next_element / next_elements / previous_element / previous_elements: the next or previous parsed node(s). These are not restricted to one level but walk all nodes in parse order: in the code above, the next element of the div node is h1, while the sibling of the div node is ul. A short sketch follows this list.
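
A minimal sketch of these accessors, run against the sample HTML above (exact output depends on the whitespace nodes html.parser keeps):

# parent climbs one level; parents yields every ancestor up to the document
print(soup.div.parent.name)                # body
print([p.name for p in soup.div.parents])  # ['body', 'html', '[document]']

# sibling traversal also sees the newline text nodes between tags
print(repr(soup.div.next_sibling))         # likely a '\n' string, not the ul tag

# next_element walks in parse order, descending into children first
print(soup.div.next_element)               # the first thing inside the div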

Functions for searching the document tree

The first function to learn is find_all(), whose prototype is as follows:

find_all(name, attrs, recursive, text, limit=None, **kwargs)
  • name: the tag name, e.g. find_all('p') finds all p tags; it accepts a tag-name string, a regular expression, or a list;
  • attrs: the attributes to match, passed as a dictionary, e.g. attrs={'class': 'nav'}; the call returns a list of tags;

The following examples exercise these two parameters:

import re

print(soup.find_all('li'))                    # get all li tags
print(soup.find_all(attrs={'class': 'nav'}))  # pass the attrs parameter
print(soup.find_all(re.compile("p")))         # pass a regex; the measured effect is not ideal
print(soup.find_all(['a', 'p']))              # pass a list
  • recursive: when find_all() is called, BeautifulSoup retrieves all descendants of the current tag; if you want to search only the direct children, pass recursive=False. The test code is as follows:
print(soup.body.div.find_all(['a', 'p'], recursive=False))  # pass a list
  • text: matches the text content of the document; like the name parameter, it accepts a string, a regular expression, or a list.
print(soup.find_all(text='Home page'))                     # ['Home page']
print(soup.find_all(text=re.compile("^Home")))             # ['Home page']
print(soup.find_all(text=["Home page", re.compile("C")]))  # ['CSDN website', 'Home page', 'Column']
  • limit: limits the number of results returned (a short sketch follows the class_ example below);
  • kwargs: a named argument that is not one of the built-in search parameter names is treated as a tag-attribute filter. To filter on the class attribute you must write class_, because class is a Python reserved word. class_ takes a single CSS class name; if you need to match multiple class names, write them in the same order as they appear in the tag:
print(soup.find_all(class_='nav'))
print(soup.find_all(class_='nav li'))
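
A minimal sketch of limit against the same sample HTML:

print(soup.find_all('li'))           # all li tags
print(soup.find_all('li', limit=1))  # stops after the first match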

Note also that some attributes in web nodes cannot be used as keyword arguments in a search, such as the HTML5 data-* attributes; these must be matched with the attrs parameter.
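
A small sketch of that workaround; the data-id attribute and the one-line markup here are invented for illustration:

from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-id="1">foo</div>', "html.parser")
# find_all(data-id="1") is not even valid Python syntax,
# so data-* attributes have to go through the attrs dictionary
print(data_soup.find_all(attrs={"data-id": "1"}))  # [<div data-id="1">foo</div>]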

Other methods whose usage roughly mirrors find_all() are listed below, with a short sketch after the list:

  • find(): prototype find(name, attrs, recursive, text, **kwargs); returns a single matching element;
  • find_parents(), find_parent(): prototype find_parent(self, name=None, attrs={}, **kwargs); return the ancestor(s) of the current node;
  • find_next_siblings(), find_next_sibling(): prototype find_next_sibling(self, name=None, attrs={}, text=None, **kwargs); return the following sibling(s) of the current node;
  • find_previous_siblings(), find_previous_sibling(): same as above, but return the preceding sibling(s) of the current node;
  • find_all_next(), find_next(), find_all_previous(), find_previous(): prototype find_all_next(self, name=None, attrs={}, text=None, limit=None, **kwargs); search the nodes that come after or before the current node.
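
A minimal sketch contrasting a few of these against the sample HTML above:

first_li = soup.find('li')                      # a single Tag, or None if nothing matches
print(first_li)                                 # <li>Home page</li>
print(first_li.find_next_sibling('li'))         # <li>Column</li>
print(first_li.find_parent('ul').get('class'))  # ['nav', 'li']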

CSS selectors: this part works much like PyQuery. The core method is select(), which returns a list of matching elements.

  • Look up by tag name: soup.select("title");
  • Look up by class name: soup.select(".nav");
  • Look up by id: soup.select("#content");
  • Combined lookup: soup.select("div#content");
  • Look up by attribute: soup.select("div[id='content']") and soup.select("a[href]").

There are a few other tricks you can use when looking up by attribute, such as:

  • ^=: matches nodes whose attribute value starts with the given characters:
print(soup.select('ul[class^="na"]'))
  • *=: matches nodes whose attribute value contains the given characters:
print(soup.select('ul[class*="li"]'))
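
Putting the selectors together against the sample HTML above; select_one() is a convenience that returns the first match instead of a list:

# select() always returns a list of Tag objects, even for a single match
for li in soup.select("ul.nav > li"):
    print(li.get_text())                         # Home page, then Column
print(soup.select_one("div#content a")["href"])  # http://www.csdn.net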

Crawling the Ninth Workshop

Once the basics of BeautifulSoup are mastered, writing a crawler case is very easy. The collection target is www.9thws.com/#p2, a site with a large number of artistic QR codes for designers' reference.

The following code applies the tag and attribute retrieval techniques of the BeautifulSoup module. The complete code is as follows:

from bs4 import BeautifulSoup
import requests
import logging

logging.basicConfig(level=logging.NOTSET)


def get_html(url, headers) -> None:
    res = None
    try:
        res = requests.get(url=url, headers=headers, timeout=3)
    except Exception as e:
        logging.debug("Request exception: %s", e)

    if res is not None:
        html_str = res.text
        soup = BeautifulSoup(html_str, "html.parser")
        imgs = soup.find_all(attrs={'class': 'lazy'})
        print("Amount of data obtained:", len(imgs))
        datas = []
        for item in imgs:
            name = item.get('alt')
            src = item["src"]
            logging.info(f"{name},{src}")
            # collect the scraped (name, src) pairs
            datas.append((name, src))
        save(datas, headers)


def save(datas, headers) -> None:
    if datas is not None:
        for item in datas:
            res = None
            try:
                # download the picture
                res = requests.get(url=item[1], headers=headers, timeout=5)
            except Exception as e:
                logging.debug(e)

            if res is not None:
                img_data = res.content
                # the ./imgs directory must exist beforehand
                with open("./imgs/{}.jpg".format(item[0]), "wb+") as f:
                    f.write(img_data)
    else:
        return None


if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
    }
    url_format = "http://www.9thws.com/#p{}"
    urls = [url_format.format(i) for i in range(1, 2)]
    get_html(urls[0], headers)

The test output of this code is implemented with the logging module; the effect is shown in the figure below. Only one page of data was collected in the test. If you need to expand the collection range, just modify the page-number rule in the main block. == While writing the code, I discovered that the site's data request is actually a POST returning JSON, so this case simply serves as a handy BeautifulSoup exercise. ==

Code repository address: codechina.csdn.net/hihell/pyth… Go give it a follow or a Star.

Afterword

The journey of learning the bs4 module has officially begun; let's keep working at it together.

Today is day 238/365 of continuous writing. Your follows, likes, comments, and favorites are appreciated.