
BS4, short for BeautifulSoup, provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It makes the HTML-parsing part of a crawler very convenient.

This article briefly introduces some common functions in BS4 that can handle most situations.

1. Locate tags

Before crawling, first locate the tag that holds the data. Press F12 to open the browser's developer tools, click the element-picker button, and then click the content on the web page; the corresponding tag is located in the page source right away. The details are not covered here. It is simple and easy to use, so explore it yourself.

Now let's see how to use code to get the tag we just located.

Here are two functions from BeautifulSoup, the find() and find_all() functions.

First, look at the tag you want: what kind of tag it is, and whether it has a class or id attribute (if not, look for a parent tag that has one). Class and id values are rarely repeated across a page, so with luck they pinpoint the tag directly.

For example, the arrow in the image above points to a div tag with the id ozoom.

import bs4

# html is the content of the web page obtained by the previous request
bsobj = bs4.BeautifulSoup(html, 'html.parser')

# Get the div tag with id ozoom
# (find tags by id)
div = bsobj.find('div', attrs={'id': 'ozoom'})

# Get the div tag with class list_t inside that div
# (find tags by class)
title = div.find('div', attrs={'class': 'list_t'})
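To try the two lookups without making a request first, here is a minimal self-contained sketch. The HTML string is a hypothetical stand-in for the fetched page; only the id ozoom and the class list_t come from the example above.

```python
import bs4

# Hypothetical HTML standing in for the page fetched earlier.
html = """
<html><body>
  <div id="ozoom">
    <div class="list_t">Headline text</div>
    <p>Body paragraph.</p>
  </div>
  <div id="footer"><div class="list_t">Unrelated block</div></div>
</body></html>
"""

bsobj = bs4.BeautifulSoup(html, "html.parser")

# Look up the outer container by its id.
div = bsobj.find("div", attrs={"id": "ozoom"})

# Search only inside that container for the div with class list_t.
title = div.find("div", attrs={"class": "list_t"})

print(title.get_text(strip=True))  # Headline text
```

Because the second find is called on div rather than on bsobj, the list_t block in the footer is never considered.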

Note: If the tag has an id attribute, prefer looking it up by id, because an id should be unique within a page. If you look it up by class, it is best to first open the page source in the browser and press Ctrl + F to check how many tags share that class. If there are several, find the tag's parent first to narrow the scope, and then search inside it.
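The same check can also be done in code. The sketch below assumes a hypothetical page in which the class item is reused in two places; the ids news and ads are made up for illustration.

```python
import bs4

# Hypothetical page in which the class "item" is reused in two places.
html = """
<div id="news"><p class="item">A</p><p class="item">B</p></div>
<div id="ads"><p class="item">C</p></div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# How many tags on the whole page share the class?
page_wide = soup.find_all("p", attrs={"class": "item"})
print(len(page_wide))  # 3

# Narrow the search: find the parent by its id, then search inside it.
news = soup.find("div", attrs={"id": "news"})
scoped = news.find_all("p", attrs={"class": "item"})
print([p.get_text() for p in scoped])  # ['A', 'B']
```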

Then we’ll look at the find_all function, which is suitable for finding many tags of a type at once, as shown in the following figure.

Each li tag in the list is one piece of data, and we need to retrieve all of them. The find function returns only the first matching li tag, so we use find_all instead: it collects every matching tag in one call and returns them as a list.
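The difference in return values can be seen in a small sketch. The three-item list here is a hypothetical example, not the page from the figure.

```python
import bs4

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = bs4.BeautifulSoup(html, "html.parser")

first = soup.find("li")        # a single Tag: the first match only
every = soup.find_all("li")    # a list-like ResultSet of all matches

print(first.get_text())                  # one
print([li.get_text() for li in every])   # ['one', 'two', 'three']

# When nothing matches, find returns None and find_all returns an empty list.
missing_one = soup.find("table")
missing_all = soup.find_all("table")
print(missing_one is None, len(missing_all))  # True 0
```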

The li tags have no id and no class, and the page contains many unrelated li tags that would get in the way, so we start from their parent tag to narrow the search scope. The parent is the div tag with the id titleList, and every li tag inside it is one we need, so we call find_all on that div to get them all.

# html is the content of the target web page
html = fetchUrl(pageUrl)
bsobj = bs4.BeautifulSoup(html, 'html.parser')

# Find the parent div by id, then get all li tags inside it
pDiv = bsobj.find('div', attrs={'id': 'titleList'})
titleList = pDiv.find_all('li')
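Since fetchUrl and pageUrl are not defined in this section, the sketch below runs the same parent-then-children lookup on an inline HTML string. The function name list_items, the sample markup, and the None check are assumptions added for illustration.

```python
import bs4

def list_items(html, container_id="titleList"):
    """Return the text of every li inside the div with the given id."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"id": container_id})
    if container is None:  # find returns None when nothing matches
        return []
    return [li.get_text(strip=True) for li in container.find_all("li")]

# Hypothetical page: the navigation li tags are noise we want to skip.
sample = """
<ul id="nav"><li>Home</li><li>About</li></ul>
<div id="titleList">
  <ul><li>First article</li><li>Second article</li></ul>
</div>
"""

print(list_items(sample))           # ['First article', 'Second article']
print(list_items("<p>empty</p>"))   # []
```

Checking for None before calling find_all avoids an AttributeError when the page layout changes and the container is missing.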

In most cases, combining the find and find_all functions is enough to locate the tags you need in an HTML page.

2. Extract data

Once we have found the tag, how do we get the data out of it?

Data generally sits in one of two places in a tag.

<!-- The first type, located in the tag content -->
    <p>This is the data. This is the data</p>

<!-- The second type, located in the tag attribute -->
    <a href="/xxx.xxx_xx_xx.html"></a>

In the first case, simply use pTip.text (pTip is the p tag obtained earlier).

In the second case, use link = aTip["href"] (aTip is the a tag obtained earlier).
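Both cases can be tried together in a short sketch. The HTML string and the link text are hypothetical; pTip and aTip follow the naming used above, and the .get call is an extra shown for attributes that may be absent.

```python
import bs4

html = '<p>This is the data.</p><a href="/xxx.xxx_xx_xx.html">Read more</a>'
soup = bs4.BeautifulSoup(html, "html.parser")

pTip = soup.find("p")
aTip = soup.find("a")

# Case 1: the data is the tag's text content.
print(pTip.text)  # This is the data.

# Case 2: the data is stored in an attribute.
link = aTip["href"]
print(link)  # /xxx.xxx_xx_xx.html

# .get returns None instead of raising KeyError when the attribute is absent.
print(aTip.get("title"))  # None
```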