This is the 28th day of my participation in the August Challenge
Life is short, so let's learn Python together
The robots protocol

The robots.txt file at a site's root states which parts of the site crawlers are allowed to fetch
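A polite crawler checks these rules before fetching a page. A minimal sketch using the standard library's urllib.robotparser; the rules shown here are illustrative, not from any real site (a real crawler would load them from the site's /robots.txt instead):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; in practice you would call
# parser.set_url('https://example.com/robots.txt') and parser.read().
rules = """
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch('*', 'https://example.com/news/'))          # True
print(parser.can_fetch('*', 'https://example.com/admin/secret'))   # False
```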
The bs4 module

- Installation

pip3 install beautifulsoup4

- Introduction

BeautifulSoup is a parsing library that extracts data from HTML
Crawling Autohome news with bs4 – basic usage
import requests
from bs4 import BeautifulSoup

# Fetch the page; response.text is the HTML as a string
response = requests.get('https://www.autohome.com.cn/news/1/#liststart')
# print(response.text)

# Parse the HTML with bs4.
# First argument: the HTML text.
# Second argument: which parser to use -- either the built-in 'html.parser',
# which needs no third-party module, or 'lxml' (pip install lxml).
soup = BeautifulSoup(response.text, 'lxml')

# Find the div whose class is article-wrapper
div1 = soup.find(class_='article-wrapper')
# print(div1)

# Find the div whose id is auto-channel-lazyload-article
div2 = soup.find(id='auto-channel-lazyload-article')
# print(div2)

# Find the ul tag whose class is article
ul = soup.find(class_='article')
# Then find all li tags under that ul
li_list = ul.find_all(name='li')
for li in li_list:
    # Find the pieces of each news item under the li
    title = li.find(name='h3')
    # Skip advertisement items, which have no h3 title
    if title:
        title = title.text
        url = 'https:' + li.find('a').attrs.get('href')
        desc = li.find('p').text
        img = 'https:' + li.find(name='img').get('src')
        print('News title: %s News URL: %s News summary: %s News image: %s' % (title, url, desc, img))
Using bs4

Traversing the document tree

Selecting directly by tag name is fast, but if there are multiple identical tags, only the first one is returned
- Prepare an HTML document

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my_p" class="title">hello<b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
- Fault tolerance: HTML that is not standards-compliant can still be parsed

The body tag of the HTML document above is never closed, yet parsing succeeds:

soup = BeautifulSoup(html_doc, 'lxml')
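A quick sketch of what this fault tolerance looks like in practice, using the built-in html.parser so no extra install is needed; the broken fragment is illustrative:

```python
from bs4 import BeautifulSoup

# Broken fragment: neither <body> nor <p> is ever closed
broken = "<html><body><p>hello"

soup = BeautifulSoup(broken, 'html.parser')
# The parser still builds a usable tree and closes the open tags on output
print(soup.p.text)  # hello
print(str(soup))    # the serialized tree now contains the closing tags
```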
- Traversal usage

# Get a tag by name
head = soup.head
print(head)
# Get the tag's name
print(head.name)
# Get the tag's attributes; class may hold multiple values, so it comes back as a list
p = soup.body.p              # the first p tag under the body tag
print(p.attrs)               # all attributes of the p tag (id, class, ...)
print(p.attrs.get('class'))  # the result is a list
print(p.get('class'))        # shortcut, same list
print(p['class'])            # another shortcut, same list
Nested selection
a=soup.body.a
print(a.get('id'))
# Child nodes and descendant nodes
print(soup.p.contents)        # all children of the p tag, as a list
print(soup.p.children)        # an iterator over the same children
print(list(soup.p.children))  # materialize the iterator into a list
# Parent and ancestor nodes
print(soup.a.parent)              # the direct parent of the a tag (only one)
print(soup.p.parent)              # the direct parent of the p tag
print(soup.a.parents)             # a generator over all ancestors: parent, parent's parent, ...
print(list(soup.a.parents))       # the same ancestors as a list
print(len(list(soup.a.parents)))  # how many ancestors there are
# Sibling nodes
print(soup.a.next_sibling)             # the next sibling
print(soup.a.previous_sibling)         # the previous sibling
print(list(soup.a.next_siblings))      # all following siblings (a generator)
print(list(soup.a.previous_siblings))  # all preceding siblings (a generator)
Searching the document tree

find and find_all

find(): returns only the first match
find_all(): returns every match
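The difference shows up in the return types; a small self-contained sketch (the fragment is illustrative, and html.parser is used so nothing extra needs installing):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one</p><p>two</p>', 'html.parser')

first = soup.find('p')      # a single Tag, or None if nothing matches
every = soup.find_all('p')  # a list-like ResultSet, possibly empty

print(first.text)                # one
print([p.text for p in every])   # ['one', 'two']
print(soup.find('a'))            # None -- no a tag in this fragment
```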
The five kinds of filters

- Strings: the filter value is a plain string
a = soup.find(name='a')
print(a) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
res = soup.find(id='my_p')
res = soup.find(class_='story')
res = soup.find(href='http://example.com/elsie')
res = soup.find(attrs={'id': 'my_p'})
res = soup.find(attrs={'class': 'story'})
- Regular expressions

import re
re_b = re.compile('^b')
res = soup.find(name=re_b)      # the first tag whose name starts with b
res = soup.find_all(name=re_b)  # every tag whose name starts with b
res = soup.find_all(id=re.compile('^l'))
- Lists

# Get every tag whose name is body or b
res = soup.find_all(name=['body', 'b'])
# Get every tag whose class is sister or title
res = soup.find_all(class_=['sister', 'title'])
- True and False

# Get every tag that has a name (i.e., every tag)
res = soup.find_all(name=True)
# Get every tag that has an id attribute
res = soup.find_all(id=True)
# Get every tag that has no id attribute
res = soup.find_all(id=False)
# Get every tag that has an href attribute
res = soup.find_all(href=True)
- Methods (good to know)

# A filter function takes a tag and returns True if the tag should match
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
- Other arguments: limit (cap the number of results) and recursive

res = soup.find_all(name=True, limit=1)
# recursive=False searches only direct children, not all descendants
res = soup.body.find_all(name='b', recursive=False)
res = soup.body.find_all(name='p', recursive=False)
res = soup.body.find_all(name='b', recursive=True)  # the default
print(res)
CSS selectors

For more selectors, see www.w3school.com.cn/cssref/css_…
res = soup.select('#my_p')                # select by id
ret = soup.select('body p')               # descendants (children and grandchildren)
ret = soup.select('body>p')               # direct children only
ret = soup.select('body>p')[0].text       # text of the first direct child
res = soup.select('#my_p')[0].attrs       # get the attributes
res = soup.select('#my_p')[0].get_text()  # get the content
res = soup.select('#my_p')[0].text        # get the content
res = soup.select('#my_p')[0].strings     # all descendant text pieces, as a generator
res = soup.select('#my_p')[0].string      # the text only if the tag directly contains text, otherwise None
Conclusion

The lxml parsing library is recommended.
There are three kinds of selection: tag-name selection, find/find_all, and CSS selectors.
Tag-name selection has weak filtering ability.
find and find_all handle simple queries, matching a single result or many results.
select is recommended if you are familiar with CSS selectors.
Remember attrs for getting attributes and get_text()/text for getting content.
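To make the summary concrete, a minimal sketch reaching the same element three ways; the HTML fragment and its id/class values are illustrative:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p id="intro" class="lead">hi</p></div>', 'html.parser')

# 1. Tag-name selection: fast, but only the first match
print(soup.p.text)                    # hi
# 2. find / find_all: filter by name, id, class_, etc.
print(soup.find(id='intro').text)     # hi
# 3. CSS selectors via select()
print(soup.select('#intro')[0].text)  # hi
# Attributes and content, as listed in the summary above
print(soup.p.attrs)                   # {'id': 'intro', 'class': ['lead']}
```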
Afterword
This article was first published on the WeChat official account Program Yuan Xiaozhuang, and simultaneously on Juejin.
Writing is not easy; please credit the source when reprinting. Passing friends, please give a little like before you go (╹▽╹)