Installing dependencies

requests

pip install requests

BeautifulSoup

pip install bs4

Requirements analysis

Crawl the first 50 pages of posts on the Main Road board of Hupu's Pedestrian Street (bxj) forum. For each page, fetch the response with Requests, then parse the response body, response.text, with BeautifulSoup.

Code
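As a minimal sketch of the parsing step on its own, the snippet below runs BeautifulSoup over a small hand-written HTML fragment instead of a live response. The fragment and the helper name extract_titles are assumptions made for illustration; the markup only imitates the class names used in this article (for-list, titlelink box, truetit) and is not copied from the Hupu page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text (not real Hupu markup).
SAMPLE_HTML = """
<ul class="for-list">
  <li><div class="titlelink box"><a class="truetit" href="/1.html"> First post </a></div></li>
  <li><div class="titlelink box"><a class="truetit" href="/2.html">Second post</a></div></li>
</ul>
"""


def extract_titles(html):
    """Return (href, title) pairs for every link with class 'truetit'."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", attrs={"class": "truetit"})
    return [(a.get("href"), a.text.strip()) for a in links]


titles = extract_titles(SAMPLE_HTML)
for href, title in titles:
    print(href, title)
```

Working against a fixed string like this makes it easier to check the selectors before sending any requests to the site.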

import requests
from bs4 import BeautifulSoup as bs
import time

url = "https://bbs.hupu.com/bxj-postdate"
useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
header = {
    'user-agent': useragent,
    # Hupu requires login to view page 11 onwards.
    # Copy the cookie from the browser after logging in. It expires eventually and must then be copied again.
    'cookie': 'your cookie'
}

for page in range(50):
    page_url = url + '-' + str(page+1)
    print(f'------------------ Page {page+1}: {page_url} -------------------')
    response = requests.get(page_url, headers=header)
    bs_info = bs(response.text, 'html.parser')
    ul = bs_info.find('ul', attrs={'class': 'for-list'})
    for li in ul.find_all('li'):
        # Title
        title_div = li.find('div', attrs={'class': 'titlelink box'})
        a_tag = title_div.find('a', attrs={'class': 'truetit'})
        # Author
        author_div = li.find('div', attrs={'class': 'author box'})
        author_link = author_div.find('a', attrs={'class': 'aulink'})
        # Post date
        pub_date = author_div.find_all('a')[1].text
        print('https://bbs.hupu.com/'+a_tag.get('href'), a_tag.text.strip(), author_link.text, pub_date)
    time.sleep(1)
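The loop above assumes that every page contains the for-list element and that every list item has both a title and an author block. When find() does not match, it returns None, and the next attribute access raises AttributeError; this could plausibly happen if the cookie has expired or the page layout changes. The following is a more defensive sketch, not the author's code: the function name parse_posts and the sample HTML are hypothetical, and urljoin is used so that an href with a leading slash does not produce a double slash in the final URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://bbs.hupu.com/"


def parse_posts(html):
    """Return (url, title, author, date) tuples, skipping malformed rows."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", attrs={"class": "for-list"})
    if ul is None:
        # Expected list not present (e.g. a login page was returned instead).
        return []
    posts = []
    for li in ul.find_all("li"):
        a_tag = li.find("a", attrs={"class": "truetit"})
        author_div = li.find("div", attrs={"class": "author box"})
        if a_tag is None or author_div is None:
            continue
        author_link = author_div.find("a", attrs={"class": "aulink"})
        links = author_div.find_all("a")
        if author_link is None or len(links) < 2:
            continue
        posts.append((
            urljoin(BASE_URL, a_tag.get("href")),
            a_tag.text.strip(),
            author_link.text.strip(),
            links[1].text.strip(),  # second link holds the post date
        ))
    return posts


# Hypothetical sample: one complete row and one row without an author block.
SAMPLE_HTML = """
<ul class="for-list">
  <li>
    <div class="titlelink box"><a class="truetit" href="/36407012.html">Hello</a></div>
    <div class="author box"><a class="aulink" href="#">alice</a><a href="#">2020-07-07</a></div>
  </li>
  <li><div class="titlelink box"><a class="truetit" href="/2.html">No author</a></div></li>
</ul>
"""

posts = parse_posts(SAMPLE_HTML)
print(posts)
```

In the crawl loop, parse_posts(response.text) could replace the body of the inner loop, and an empty result would be a hint to check the cookie.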

Results

------------------ Page 1: https://bbs.hupu.com/bxj-postdate-1 -------------------
https://bbs.hupu.com//36407012.html First-tier cities basic salary is very low Shout poof shout poof I come the 2020-07-07
https://bbs.hupu.com//36407009.html abstract people to attend the university entrance exam so tiger flapping JR0132279583 2020-07-07
...
Time taken: 0.19072270393371582
...
------------------ Page 50: https://bbs.hupu.com/bxj-postdate-50 -------------------
https://bbs.hupu.com//36390518.html Tablet see bettas are live very card See me lie the pavilion in the 2020-07-06
https://bbs.hupu.com//36390517.html boy less than ten minutes summer bathing is not very normal? Hu Feifei 1013 2020-07-06
...
Time taken: 0.25310277938842773
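The output reports an elapsed time per page, but the code listing does not show how it was measured. One possible way to obtain such a figure is sketched below, assuming a small wrapper around the per-page work. The names timed and fake_fetch are hypothetical, and fake_fetch only sleeps briefly in place of a real request.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_fetch(page):
    """Hypothetical stand-in for fetching and parsing one page."""
    time.sleep(0.01)
    return f"page {page}"


result, elapsed = timed(fake_fetch, 1)
print(f"Time taken: {elapsed}")
```

time.perf_counter() is used here because it is a monotonic clock intended for measuring short durations; the original measurement may well have used time.time() instead.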

Reference links

  • Requests official documentation
  • BeautifulSoup official documentation


Next: Getting started with a Python crawler (part 2): Using Requests and XPath to crawl the forum post list