Installing dependencies
requests
pip install requests
BeautifulSoup
pip install bs4
Requirements analysis
Crawl the posts on the first 50 pages of the main board of Hupu's Buxingjie (Pedestrian Street) forum. First, the response for each page is retrieved with Requests; then the response body, response.text, is parsed with BeautifulSoup.
Code
import requests
from bs4 import BeautifulSoup as bs
import time
url = "https://bbs.hupu.com/bxj-postdate"
useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
header = {
    'user-agent': useragent,
    # Hupu requires a login to view page 11 onward.
    # Copy the cookie from the browser after logging in. It expires, so it has to be copied again later.
    'cookie': 'your cookie'
}
for page in range(50):
    # Pages are numbered from 1: bxj-postdate-1, bxj-postdate-2, ...
    page_url = url + '-' + str(page + 1)
    print(f'------------------ Page {page + 1}: {page_url} -------------------')
    response = requests.get(page_url, headers=header)
    bs_info = bs(response.text, 'html.parser')
    # The post list is a <ul class="for-list"> with one <li> per post
    ul = bs_info.find('ul', attrs={'class': 'for-list'})
    for li in ul.find_all('li'):
        # Title
        title_div = li.find('div', attrs={'class': 'titlelink box'})
        a_tag = title_div.find('a', attrs={'class': 'truetit'})
        # Author
        author_div = li.find('div', attrs={'class': 'author box'})
        author_link = author_div.find('a', attrs={'class': 'aulink'})
        # Post time (the second link inside the author box)
        pub_date = author_div.find_all('a')[1].text
        print('https://bbs.hupu.com/' + a_tag.get('href'), a_tag.text.strip(), author_link.text, pub_date)
    # Pause between pages to avoid hammering the server
    time.sleep(1)
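The script above assumes every page contains the post list and every list item has the expected tags. If the layout changes or a different page comes back (for example when the cookie has expired), find() returns None and the next call raises AttributeError. Below is a minimal sketch of the same parsing step with guards added. The extract_posts function and the SAMPLE_HTML markup are made up for illustration; the markup is only modeled on the class names the script looks for, not copied from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the class names used in the script.
SAMPLE_HTML = """
<ul class="for-list">
  <li>
    <div class="titlelink box"><a class="truetit" href="/1.html"> First post </a></div>
    <div class="author box">
      <a class="aulink" href="/u/1">alice</a>
      <a>2020-07-07</a>
    </div>
  </li>
  <li><div class="ad">not a post</div></li>
</ul>
"""


def extract_posts(html):
    """Return a list of post dicts, skipping items that lack the expected tags."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", attrs={"class": "for-list"})
    if ul is None:
        # No post list at all, e.g. a login page or a changed layout
        return []
    posts = []
    for li in ul.find_all("li"):
        title_div = li.find("div", attrs={"class": "titlelink box"})
        author_div = li.find("div", attrs={"class": "author box"})
        if title_div is None or author_div is None:
            continue
        a_tag = title_div.find("a", attrs={"class": "truetit"})
        author_link = author_div.find("a", attrs={"class": "aulink"})
        links = author_div.find_all("a")
        if a_tag is None or author_link is None or len(links) < 2:
            continue
        posts.append({
            "href": a_tag.get("href"),
            "title": a_tag.get_text(strip=True),
            "author": author_link.get_text(strip=True),
            "date": links[1].get_text(strip=True),
        })
    return posts


if __name__ == "__main__":
    for post in extract_posts(SAMPLE_HTML):
        print(post)
```

In the crawler, response.text would be passed to extract_posts in place of SAMPLE_HTML. An empty result for a page is then a hint to check the cookie rather than a crash.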
Results
------------------ Page 1: https://bbs.hupu.com/bxj-postdate-1 -------------------
https://bbs.hupu.com//36407012.html First-tier cities basic salary is very low Shout poof shout poof I come the 2020-07-07
https://bbs.hupu.com//36407009.html abstract people to attend the university entrance exam so Hupu JR0132279583 2020-07-07
...
Time taken: 0.19072270393371582
...
------------------ Page 50: https://bbs.hupu.com/bxj-postdate-50 -------------------
https://bbs.hupu.com//36390518.html Tablet see bettas are live very card See me lie the pavilion in the 2020-07-06
https://bbs.hupu.com//36390517.html boy less than ten minutes summer bathing is not very normal? Hu Feifei 1013 2020-07-06
...
Time taken: 0.25310277938842773
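The post links in the output contain a double slash (https://bbs.hupu.com//36407012.html), which suggests that each href already starts with "/". The links most likely still work, but one way to avoid the doubled slash is urljoin from the standard library. This is a minimal sketch, assuming the hrefs are site-relative paths; BASE_URL and absolute_url are names made up for illustration.

```python
from urllib.parse import urljoin

# Assumed site root; hrefs such as "/36407012.html" are resolved against it.
BASE_URL = "https://bbs.hupu.com/"


def absolute_url(href):
    """Resolve a relative href against BASE_URL without doubling the slash."""
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    print(absolute_url("/36407012.html"))
```

In the crawler, absolute_url(a_tag.get('href')) would replace the string concatenation in the print call.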
References
- Requests official documentation
- BeautifulSoup official documentation
Next: Getting started with a Python crawler (part 2): Using Requests and XPath to crawl the forum post list