The previous article used Beautifulsoup to parse the structure of a web page. It does the job, but it is not very efficient. A better option is XPath parsing, which expresses the hierarchy more clearly and performs better, though it requires installing the additional lxml library. Let's update the previous article's code to use lxml.

Install lxml

pip install lxml
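A quick way to confirm the install worked is to import the package and print its version. This is only a minimal check, assuming lxml was installed into the same Python environment that runs the crawler.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
print("lxml version:", ".".join(str(n) for n in etree.LXML_VERSION))
```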

Code

import time

import requests
from lxml import etree


# Site root, prepended to the relative links in the post list.
# Not defined in the original listing; inferred from the links in the output.
hupu_domain = "https://bbs.hupu.com"
url = "https://bbs.hupu.com/bxj-postdate"
useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
header = {
    'user-agent': useragent,
    # Hupu requires a login to view page 11 onward.
    # Copy the cookie from the browser after logging in. It expires, so it will have to be copied again later.
    'cookie': 'your cookie'
}

for page in range(0, 50):
    page_url = url + '-' + str(page + 1)
    print(f'------------------ Page {page + 1}: {page_url} -------------------')
    response = requests.get(page_url, headers=header)
    # Start the timer after the download so that only parsing is measured
    last = time.time()
    selector = etree.HTML(response.text)
    ul_li = selector.xpath('//*[@id="ajaxtable"]/div[1]/ul/li')
    for li in ul_li:
        item = {}
        # Title and link
        title_box = li.xpath('./div[@class="titlelink box"]')[0]
        item['link'] = hupu_domain + title_box.xpath('./a[@class="truetit"]/@href')[0]
        # Testing showed that some titles are wrapped in <b> tags and require special handling
        if title_box.xpath('./a[@class="truetit"]/text()'):
            item['text'] = title_box.xpath('./a[@class="truetit"]/text()')[0]
        else:
            item['text'] = '[bold]: ' + title_box.xpath('./a[@class="truetit"]/b/text()')[0]
        # Author
        author_box = li.xpath('./div[@class="author box"]')[0]
        item['author'] = author_box.xpath('./a[@class="aulink"]/text()')[0]
        # Post date
        item['date'] = author_box.xpath('./a[2]/text()')[0]
        print(item)

    now = time.time()
    print(f'time {now - last}')
    # Pause between pages to avoid hammering the server
    time.sleep(1)
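To try the XPath expressions without hitting the site or needing a cookie, the same extraction can be run against a small hand-written fragment. The sketch below is illustrative only: `SAMPLE_HTML` and the `parse_posts` helper are hypothetical names, and the fragment merely mimics the structure the selectors expect, so the real Hupu markup may differ.

```python
from lxml import etree

# Hypothetical fragment shaped like the list the XPath expressions expect;
# it is not copied from the real site.
SAMPLE_HTML = """
<div id="ajaxtable">
  <div>
    <ul>
      <li>
        <div class="titlelink box"><a class="truetit" href="/1.html">Plain title</a></div>
        <div class="author box"><a class="aulink">alice</a><a>2020-07-07</a></div>
      </li>
      <li>
        <div class="titlelink box"><a class="truetit" href="/2.html"><b>Bold title</b></a></div>
        <div class="author box"><a class="aulink">bob</a><a>2020-07-06</a></div>
      </li>
    </ul>
  </div>
</div>
"""


def parse_posts(html, domain="https://bbs.hupu.com"):
    """Extract link, title, author and date from each <li> in the post list."""
    selector = etree.HTML(html)
    posts = []
    for li in selector.xpath('//*[@id="ajaxtable"]/div[1]/ul/li'):
        title_link = li.xpath('./div[@class="titlelink box"]/a[@class="truetit"]')[0]
        author_box = li.xpath('./div[@class="author box"]')[0]
        # text() only returns direct text children, so it is empty for <b>-wrapped titles
        plain = title_link.xpath('./text()')
        text = plain[0] if plain else '[bold]: ' + title_link.xpath('./b/text()')[0]
        posts.append({
            'link': domain + title_link.get('href'),
            'text': text,
            'author': author_box.xpath('./a[@class="aulink"]/text()')[0],
            'date': author_box.xpath('./a[2]/text()')[0],
        })
    return posts


for post in parse_posts(SAMPLE_HTML):
    print(post)
```

Each relative XPath starts with `./`, so it is evaluated against the current `li` rather than the whole document, which is what keeps the fields of one post together.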

Results

------------------ Page 1: https://bbs.hupu.com/bxj-postdate-1 -------------------
{'link': 'https://bbs.hupu.com/36406828.html', 'text': 'SATS math is ruined by hardware', 'author': 'Oda Nobunaga, the Lord of heaven', 'date': '2020-07-07'}
{'link': 'https://bbs.hupu.com/36406825.html', 'text': 'Rational discussion, which group of Zhou Wang Tao Lin and Luo Dayou, Li Zongsheng, Pu Shu and Wang Feng has stronger creative ability', 'author': 'Go to Single Country kai', 'date': '2020-07-07'}
{'link': 'https://bbs.hupu.com/36406824.html', 'text': "Math number two, I'm stupid.", 'author': 'I only like Wansy.', 'date': '2020-07-07'}
{'link': 'https://bbs.hupu.com/36406822.html', 'text': 'We are going to interview Gu Ming for her part-time job tonight. Do you want to remind me of anything? ', 'author': 'Heavy Tech Anthony', 'date': '2020-07-07'}
...
{'link': 'https://bbs.hupu.com/36404667.html', 'text': '2020 Top 10 Hottest majors in College Entrance Examination zT ', 'author': 'Little bear', 'date': '2020-07-07'}
{'link': 'https://bbs.hupu.com/36404666.html', 'text': 'What do guys think about girls with tattoos? ', 'author': 'Sanmao and Echo', 'date': '2020-07-07'}
time 0.09512805938720703
...
------------------ Page 50: https://bbs.hupu.com/bxj-postdate-50 -------------------
{'link': 'https://bbs.hupu.com/36390301.html', 'text': 'Tomorrow is the college entrance examination, JRS leave a blessing to the senior three students! ', 'author': 'Chunxi of hypercarry', 'date': '2020-07-06'}
{'link': 'https://bbs.hupu.com/36390297.html', 'text': "A shares are up 6% a day and nobody's talking about it? ", 'author': 'mcskyward', 'date': '2020-07-06'}
{'link': 'https://bbs.hupu.com/36390296.html', 'text': 'Shao Bing should be guo Degang chasing after the meal', 'author': 'Millennium Base', 'date': '2020-07-06'}
{'link': 'https://bbs.hupu.com/36390290.html', 'text': "The basic salary offered by Citic Bank is average. Isn't Citic one of the banks with the highest income?", 'author': 'Nene my wife', 'date': '2020-07-06'}
{'link': 'https://bbs.hupu.com/36390287.html', 'text': 'Two men on a high-speed train doing this? ', 'author': 'Super Brawler', 'date': '2020-07-06'}
...
{'link': 'https://bbs.hupu.com/36389948.html', 'text': 'Brother Meng asks for a GIF of murong Fu actor in 1997', 'author': '圗 bi', 'date': '2020-07-06'}
{'link': 'https://bbs.hupu.com/36389943.html', 'text': 'Brothers, tomorrow is the college entrance examination, today I bought test supplies to spend 66.6 what level?', 'author': 'Retire after high school', 'date': '2020-07-06'}
time 0.0585782527923584

Bottom line: with XPath, parsing is roughly twice as fast as with Beautifulsoup in the previous article.
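The timings in the results come from one `time.time()` measurement per page, which can be noisy. A steadier comparison might repeat the parse several times and keep the fastest run. The sketch below is one possible way to do that, not the method used in this article: `best_time` and `count_list_items` are hypothetical names, and `count_list_items` is only a placeholder workload standing in for a real parser.

```python
import time


def best_time(func, *args, repeats=5):
    """Call func(*args) `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def count_list_items(html):
    # Placeholder workload standing in for a real parser call
    return html.count("<li")


# Synthetic page with 1000 list items, used only to exercise the helper
sample = "<ul>" + "<li>post</li>" * 1000 + "</ul>"
elapsed = best_time(count_list_items, sample)
print(f"best of 5: {elapsed:.6f}s")
```

With lxml installed, the parser under test could be passed in directly, for example `best_time(etree.HTML, response.text)`. Running the same call with the Beautifulsoup constructor and its parser argument would give the figure to compare against.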

References

  • xpath_syntax
  • lxml


Getting started with a Python crawler: Crawl forum post lists using Requests and Beautifulsoup