Python crawler: find() and findAll() in BeautifulSoup

BeautifulSoup's find() and findAll() functions both filter an HTML page by tag name and attributes. findAll() returns the whole group of matching tags, while find() returns a single tag, the first match.

Their signatures are nearly identical:

findAll(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)

The tag argument: you can pass a single tag name, or a Python list of tag names, as the tag argument.

For example: bsObj.findAll(["tag1", "tag2", "tag3", "tag4"])
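As a minimal sketch of the tag argument, the snippet below runs findAll() with a single name and with a list of names. The HTML string is a made-up sample, not a real page, and the variable names are only illustrative. Note that bs4 also offers the same method under the name find_all(); findAll() is the older spelling used in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this illustration.
html = """
<html><body>
<h1>Title</h1>
<h2>Subtitle</h2>
<p>First paragraph</p>
<h2>Another subtitle</h2>
</body></html>
"""

bsObj = BeautifulSoup(html, "html.parser")

# A single tag name returns only the tags with that name.
h2_tags = bsObj.findAll("h2")

# A list of tag names returns tags matching any of the names, in page order.
headings = bsObj.findAll(["h1", "h2"])

print(len(h2_tags))
print([tag.name for tag in headings])
```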

The attributes argument is a Python dictionary that maps attribute names to the values to match.

For example, the following call returns the tags whose class is either attribute1 or attribute2:

bsObj.findAll("tag", {"class": ["attribute1", "attribute2"]})
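The attributes dictionary can be tried out on a small document. The following is only a sketch: the span tags, class names, and text are assumptions chosen for the demonstration, and the list of values means "match any of these".

```python
from bs4 import BeautifulSoup

# Hypothetical markup: four spans with three different classes.
html = """
<p>
<span class="green">Anna</span>
<span class="red">said hello</span>
<span class="blue">quietly</span>
<span class="green">Boris</span>
</p>
"""

bsObj = BeautifulSoup(html, "html.parser")

# Match span tags whose class is either "green" or "red".
green_or_red = bsObj.findAll("span", {"class": ["green", "red"]})

texts = [span.get_text() for span in green_or_red]
print(texts)
```

The blue span is skipped because its class is not in the list of accepted values.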

The recursive argument is a Boolean that controls how many levels of the tag structure are searched. If recursive is True, findAll() searches the children of the starting tag, their children, and so on down the tree for tags that match. If recursive is False, findAll() looks only at the top-level tags, that is, the direct children of the object it is called on. recursive defaults to True, and you generally do not need to change it. Set it to False only when you know exactly where the information you need sits and the speed of the search matters.
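To make the difference concrete, here is a small sketch that runs the same search with and without recursion. The nested div structure and the id values are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> directly inside the outer div,
# and one <p> nested a level deeper.
html = """
<div id="outer">
<p>Direct child</p>
<div id="inner"><p>Nested grandchild</p></div>
</div>
"""

bsObj = BeautifulSoup(html, "html.parser")
outer = bsObj.find("div", {"id": "outer"})

# Default (recursive=True): searches every level below the outer div.
all_paragraphs = outer.findAll("p")

# recursive=False: only direct children of the outer div are checked.
direct_only = outer.findAll("p", recursive=False)

# Called on the whole document, recursive=False sees only top-level tags,
# so the <p> tags inside the div are not found.
top_level = bsObj.findAll("p", recursive=False)

print(len(all_paragraphs), len(direct_only), len(top_level))
```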

The text argument matches on the text content of a tag rather than on its name or attributes.

For example, to count how many times the exact string "the text" appears as tag content on a page, we can use the following statements:

namelist = bsObj.findAll(text="the text")

print(len(namelist))
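One detail worth seeing in action: a plain string passed as text must equal the whole text of a node, while a compiled regular expression can match part of it. The sketch below assumes a made-up list of items; newer bs4 releases also accept the same filter under the name string.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: two items contain exactly "the text",
# a third contains it as part of a longer string.
html = """
<ul>
<li>the text</li>
<li>the text</li>
<li>the text and more</li>
</ul>
"""

bsObj = BeautifulSoup(html, "html.parser")

# A plain string matches only text nodes that are exactly equal to it.
exact = bsObj.findAll(text="the text")

# A compiled regular expression matches any text node that contains the pattern.
partial = bsObj.findAll(text=re.compile("the text"))

print(len(exact), len(partial))
```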

The limit argument can only be used with findAll(); find() is equivalent to findAll() with limit set to 1, except that it returns the tag itself rather than a one-element list. Set limit if you are only interested in the first X items retrieved from the page. Note, however, that the items returned are simply the first X in page order, which are not necessarily the ones you want.
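A short sketch of limit, and of how find() relates to it, using an invented four-item list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with four list items.
html = "<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>"

bsObj = BeautifulSoup(html, "html.parser")

# Only the first two matches, in the order they appear on the page.
first_two = bsObj.findAll("li", limit=2)

# find() returns a single tag (or None if nothing matches) ...
first_via_find = bsObj.find("li")

# ... while findAll(limit=1) returns a list holding that same tag.
first_via_limit = bsObj.findAll("li", limit=1)

print([li.get_text() for li in first_two])
print(first_via_find is first_via_limit[0])
```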

The keyword arguments let you select tags that have a specified attribute value, passed as name=value.

For example:

allText = bsObj.findAll(id="text")

print(allText[0].get_text())
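The keyword form is a shortcut for the attributes dictionary, which the sketch below illustrates on a hypothetical two-div page. Because class is a reserved word in Python, bs4 accepts class_ as the keyword for the class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div with an id, one with a class.
html = """
<div id="text">Main body</div>
<div class="note">Side note</div>
"""

bsObj = BeautifulSoup(html, "html.parser")

# Keyword form: attribute name and value passed as name=value.
by_keyword = bsObj.findAll(id="text")

# Equivalent dictionary form using the attrs argument.
by_dict = bsObj.findAll(attrs={"id": "text"})

# "class" is a Python keyword, so the keyword form is spelled class_.
notes = bsObj.findAll(class_="note")

print(by_keyword[0].get_text())
print(notes[0].get_text())
```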

.get_text() strips all the tags from the document (or tag) you are working with and returns a string containing only the text. If you are dealing with a large block of source code with many hyperlinks, paragraphs, and other tags, get_text() removes them all and leaves plain, tag-free text.

Call .get_text() last, right before you print, store, or manipulate the final data. In general, preserve the tag structure of an HTML document for as long as possible.
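The sketch below shows why the tag structure is worth keeping until the end: while the paragraph is still a tag you can pull out its links, but once get_text() has run, only the plain string is left. The paragraph and the example.com URL are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph containing a link and some bold text.
html = '<p>Read the <a href="https://example.com/docs">docs</a> and the <b>FAQ</b>.</p>'

bsObj = BeautifulSoup(html, "html.parser")
paragraph = bsObj.find("p")

# While the structure is intact, nested tags and attributes are still reachable.
links = paragraph.findAll("a")
first_href = links[0]["href"]

# get_text() drops every tag and returns only the text as a string.
plain = paragraph.get_text()

print(first_href)
print(plain)
```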
