This is the 28th day of my participation in the August Challenge
Life is short, so let's learn Python together
The robots protocol

The robots.txt file at a site's root states which parts of the site crawlers are allowed to fetch
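A polite crawler checks these rules before fetching a page. A minimal sketch using the standard library's urllib.robotparser; the rules shown here are illustrative, not from any real site (a real crawler would load them from the site's /robots.txt instead):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; in practice you would call
# parser.set_url('https://example.com/robots.txt') and parser.read().
rules = """
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch('*', 'https://example.com/news/'))          # True
print(parser.can_fetch('*', 'https://example.com/admin/secret'))   # False
```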
The bs4 module

- Installation

pip3 install beautifulsoup4

- Introduction

BeautifulSoup is a parsing library that extracts data from HTML
Crawling Autohome news with bs4 – basic usage
import requests
from bs4 import BeautifulSoup

# Fetch the page; response.text is the HTML as a string
response = requests.get('https://www.autohome.com.cn/news/1/#liststart')
# print(response.text)

# Parse the HTML with bs4.
# First argument: the HTML text.
# Second argument: which parser to use -- either the built-in 'html.parser',
# which needs no third-party module, or 'lxml' (pip install lxml).
soup = BeautifulSoup(response.text, 'lxml')

# Find the div whose class is article-wrapper
div1 = soup.find(class_='article-wrapper')
# print(div1)

# Find the div whose id is auto-channel-lazyload-article
div2 = soup.find(id='auto-channel-lazyload-article')
# print(div2)

# Find the ul tag whose class is article
ul = soup.find(class_='article')
# Then find all li tags under that ul
li_list = ul.find_all(name='li')
for li in li_list:
    # Find the pieces of each news item under the li
    title = li.find(name='h3')
    # Skip advertisement items, which have no h3 title
    if title:
        title = title.text
        url = 'https:' + li.find('a').attrs.get('href')
        desc = li.find('p').text
        img = 'https:' + li.find(name='img').get('src')
        print('News title: %s News URL: %s News summary: %s News image: %s' % (title, url, desc, img))
Using bs4

Traversing the document tree

Selecting directly by tag name is fast, but if there are multiple identical tags, only the first one is returned
- Prepare an HTML document

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my_p" class="title">hello<b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
- Fault tolerance: HTML that is not standards-compliant can still be parsed

The body tag of the HTML document above is never closed, yet parsing succeeds:

soup = BeautifulSoup(html_doc, 'lxml')
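A quick sketch of what this fault tolerance looks like in practice, using the built-in html.parser so no extra install is needed; the broken fragment is illustrative:

```python
from bs4 import BeautifulSoup

# Broken fragment: neither <body> nor <p> is ever closed
broken = "<html><body><p>hello"

soup = BeautifulSoup(broken, 'html.parser')
# The parser still builds a usable tree and closes the open tags on output
print(soup.p.text)  # hello
print(str(soup))    # the serialized tree now contains the closing tags
```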
- Traversal usage

# Get a tag by name
head = soup.head
print(head)
# Get the tag's name
print(head.name)
# Get the tag's attributes; class may hold multiple values, so it comes back as a list
p = soup.body.p              # the first p tag under the body tag
print(p.attrs)               # all attributes of the p tag (id, class, ...)
print(p.attrs.get('class'))  # the result is a list
print(p.get('class'))        # shortcut, same list
print(p['class'])            # another shortcut, same list
Nested selection
a=soup.body.a
print(a.get('id'))
# Child nodes and descendant nodes
print(soup.p.contents)        # all children of the p tag, as a list
print(soup.p.children)        # an iterator over the same children
print(list(soup.p.children))  # materialize the iterator into a list
# Parent and ancestor nodes
print(soup.a.parent)              # the direct parent of the a tag (only one)
print(soup.p.parent)              # the direct parent of the p tag
print(soup.a.parents)             # a generator over all ancestors: parent, parent's parent, ...
print(list(soup.a.parents))       # the same ancestors as a list
print(len(list(soup.a.parents)))  # how many ancestors there are
# Sibling nodes
print(soup.a.next_sibling)             # the next sibling
print(soup.a.previous_sibling)         # the previous sibling
print(list(soup.a.next_siblings))      # all following siblings (a generator)
print(list(soup.a.previous_siblings))  # all preceding siblings (a generator)
Searching the document tree

find and find_all

find(): returns only the first match
find_all(): returns every match
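The difference shows up in the return types; a small self-contained sketch (the fragment is illustrative, and html.parser is used so nothing extra needs installing):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one</p><p>two</p>', 'html.parser')

first = soup.find('p')      # a single Tag, or None if nothing matches
every = soup.find_all('p')  # a list-like ResultSet, possibly empty

print(first.text)                # one
print([p.text for p in every])   # ['one', 'two']
print(soup.find('a'))            # None -- no a tag in this fragment
```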
The five kinds of filters

- Strings: the filter value is a plain string
a = soup.find(name='a')
print(a) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
res = soup.find(id='my_p')
res = soup.find(class_='story')
res = soup.find(href='http://example.com/elsie')
res = soup.find(attrs={'id': 'my_p'})
res = soup.find(attrs={'class': 'story'})
- Regular expressions

import re
re_b = re.compile('^b')
res = soup.find(name=re_b)      # the first tag whose name starts with b
res = soup.find_all(name=re_b)  # every tag whose name starts with b
res = soup.find_all(id=re.compile('^l'))
- Lists

# Get every tag whose name is body or b
res = soup.find_all(name=['body', 'b'])
# Get every tag whose class is sister or title
res = soup.find_all(class_=['sister', 'title'])
- True and False

# Get every tag that has a name (i.e., every tag)
res = soup.find_all(name=True)
# Get every tag that has an id attribute
res = soup.find_all(id=True)
# Get every tag that has no id attribute
res = soup.find_all(id=False)
# Get every tag that has an href attribute
res = soup.find_all(href=True)
- Methods (good to know)

# A filter function takes a tag and returns True if the tag should match
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
- Other arguments: limit (cap the number of results) and recursive

res = soup.find_all(name=True, limit=1)
# recursive=False searches only direct children, not all descendants
res = soup.body.find_all(name='b', recursive=False)
res = soup.body.find_all(name='p', recursive=False)
res = soup.body.find_all(name='b', recursive=True)  # the default
print(res)
CSS selectors

For more selectors, see www.w3school.com.cn/cssref/css_…
res = soup.select('#my_p')                # select by id
ret = soup.select('body p')               # descendants (children and grandchildren)
ret = soup.select('body>p')               # direct children only
ret = soup.select('body>p')[0].text       # text of the first direct child
res = soup.select('#my_p')[0].attrs       # get the attributes
res = soup.select('#my_p')[0].get_text()  # get the content
res = soup.select('#my_p')[0].text        # get the content
res = soup.select('#my_p')[0].strings     # all descendant text pieces, as a generator
res = soup.select('#my_p')[0].string      # the text only if the tag directly contains text, otherwise None
Conclusion

The lxml parsing library is recommended.
There are three kinds of selection: tag-name selection, find/find_all, and CSS selectors.
Tag-name selection has weak filtering ability.
find and find_all handle simple queries, matching a single result or many results.
select is recommended if you are familiar with CSS selectors.
Remember attrs for getting attributes and get_text()/text for getting content.
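To make the summary concrete, a minimal sketch reaching the same element three ways; the HTML fragment and its id/class values are illustrative:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p id="intro" class="lead">hi</p></div>', 'html.parser')

# 1. Tag-name selection: fast, but only the first match
print(soup.p.text)                    # hi
# 2. find / find_all: filter by name, id, class_, etc.
print(soup.find(id='intro').text)     # hi
# 3. CSS selectors via select()
print(soup.select('#intro')[0].text)  # hi
# Attributes and content, as listed in the summary above
print(soup.p.attrs)                   # {'id': 'intro', 'class': ['lead']}
```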
Afterword
This article was first published on the WeChat official account Program Yuan Xiaozhuang, and simultaneously on Juejin.
Writing is not easy; please credit the source when reprinting. Passing friends, please give a little like before you go (╹▽╹)