Preface

The text and images in this article are from the internet and are for learning and exchange only, not for any commercial purpose. Copyright belongs to the original author; if you have any questions, please contact us to handle them.

Crossin's Programming Classroom

PS: If you need Python learning materials, click the link below to get them:

Note.youdao.com/noteshare?i…

The popularity score of items on Zhihu's hot list is calculated from page views, interactions, professional weight, creation time, and time on the list over the past 24 hours. Zhihu's hot list is thus a ranking of content by popularity.

1. Crawl web pages

Python crawlers typically use the Requests library to handle network requests. Methods and parameters for Requests are not expanded here.

Zhihu hot list

Weibo hot search list

1. The URLs we selected are accessible without logging in, so leaving the parameters of the Requests call empty doesn't matter here. In practice, however, crawlers often need a logged-in state, which requires simulating a login by setting the appropriate parameters (cookies, headers, and so on), as sketched below.
2. The page content retrieved with Requests corresponds to what you see when you right-click the page and choose "View page source". It differs from what you actually see in the browser, or in the Elements panel when you press F12 to open developer tools: the former is the raw result of the web request, while the latter is the page after the browser has rendered it.
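For example, here is a minimal sketch of carrying a logged-in state with Requests; the cookie value is a placeholder, not a real credential, and the choice of zhihu.com/hot as a login-required page comes from the summary later in this article:

import requests

# A sketch of simulating a logged-in state: copy the Cookie header
# from a browser session where you are already logged in.
# "sessionid=YOUR_COOKIE_HERE" is a placeholder, not a real credential.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",
    "Cookie": "sessionid=YOUR_COOKIE_HERE",
})
response = session.get("https://www.zhihu.com/hot")  # this page requires login
print(response.status_code)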

2. Parse the crawled content

The first step crawled the entire content of the page. The next step is to locate the target within that content, then read and save it.

I'm using BeautifulSoup here because it was the first parsing library I learned for crawlers, and it works well enough. BeautifulSoup provides methods and parameters that make it easy to locate a target.
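For instance, a minimal sketch of how BeautifulSoup locates elements; the HTML snippet here is made up purely for illustration:

from bs4 import BeautifulSoup

html = '<div id="rank"><a class="item">first</a><a class="item">second</a></div>'
soup = BeautifulSoup(html, "html.parser")
print(soup.find("div", id="rank"))        # locate a single element by tag and id
print(soup.find_all("a", class_="item"))  # locate all matches by tag and class
print(soup.select("div#rank a.item"))     # or use a CSS selector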

In the page source of the Zhihu hot list, near the bottom you can see a script tag with id js-initialData, which embeds the hot-list data as JSON:

import requests
import re
from bs4 import BeautifulSoup

# Fill in your own User-Agent and Cookie values here
headers = {"User-Agent": "", "Cookie": ""}
zh_url = "https://www.zhihu.com/billboard"
zh_response = requests.get(zh_url, headers=headers)

webcontent = zh_response.text
soup = BeautifulSoup(webcontent, "html.parser")
# The hot-list data is embedded as JSON in the js-initialData script tag
script_text = soup.find("script", id="js-initialData").get_text()
rule = r'"hotList":(.*?),"guestFeeds"'
result = re.findall(rule, script_text)

# Convert JS-style true/false to Python's True/False, then eval the string
temp = result[0].replace("false", "False").replace("true", "True")
hot_list = eval(temp)
print(hot_list)

Here I exploit the fact that the hot-list data is stored as a list inside the script tag. After locating and extracting the relevant string, I first convert JavaScript's true and false into Python's True and False, and finally turn the string directly into a usable Python list with eval().
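As a side note, a safer alternative to eval() is the standard json module, which already understands JavaScript-style true, false, and null; this substitution is my suggestion, not part of the original code:

import json

# json.loads parses JS-style literals directly, so no string
# replacement or eval() is needed
hot_list = json.loads(result[0])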

The result of running the code is shown in the figure below.

The Weibo hot search list works similarly: its data sits directly in the page's HTML table, so BeautifulSoup can pick out each column by tag and class:

import requests
from bs4 import BeautifulSoup


url = "https://s.weibo.com/top/summary"
# Fill in your own User-Agent and Cookie values here
headers = {"User-Agent": "", "Cookie": ""}
wb_response = requests.get(url, headers=headers)
webcontent = wb_response.text
soup = BeautifulSoup(webcontent, "html.parser")
# Each column of the hot-search table lives in its own <td> class
index_list = soup.find_all("td", class_="td-01")   # rank
title_list = soup.find_all("td", class_="td-02")   # topic title and link
level_list = soup.find_all("td", class_="td-03")   # popularity level

topic_list = []
for i in range(len(index_list)):
    item_index = index_list[i].get_text(strip=True)
    if item_index == "":   # the pinned "Top" item has no rank number
        item_index = "0"
    item_title = title_list[i].a.get_text(strip=True)
    if title_list[i].span:
        item_mark = title_list[i].span.get_text(strip=True)
    else:
        item_mark = "Top"
    item_level = level_list[i].get_text(strip=True)
    topic_list.append({"index": item_index, "title": item_title, "mark": item_mark,
                       "level": item_level,
                       "link": f"https://s.weibo.com/weibo?q=%23{item_title}%23&Refer=top"})
print(topic_list)

Through this analysis, each trending Weibo topic ends up in topic_list as its own dictionary.
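Since the stated goal was to locate, read, and save the target, here is a minimal sketch of saving topic_list to a CSV file; the filename weibo_top.csv is my choice for illustration:

import csv

# Write the list of topic dictionaries to CSV, one row per trending topic
with open("weibo_top.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["index", "title", "mark", "level", "link"])
    writer.writeheader()
    writer.writerows(topic_list)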

1. First, when choosing which site to crawl, pick one that lowers the difficulty. For example, for the same Zhihu hot list, zhihu.com/hot requires a login, while zhihu.com/billboard can be accessed without logging in.
2. When parsing the crawled content, choose the most convenient approach for the specific page. When you need to crawl many similar pages in batches, try to work out a common parsing strategy, as sketched below.
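A minimal sketch of such a common strategy; the helper names fetch_soup and crawl_many and the parser-callback design are my own illustration, not from the original article:

import requests
from bs4 import BeautifulSoup

def fetch_soup(url, headers=None):
    # Fetch a page and return a BeautifulSoup object for parsing
    response = requests.get(url, headers=headers or {"User-Agent": ""})
    return BeautifulSoup(response.text, "html.parser")

def crawl_many(urls, parse_one, headers=None):
    # Apply the same parsing function to a batch of similar pages
    return [parse_one(fetch_soup(url, headers)) for url in urls]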

The complete code

weibo_top.py

import requests
from bs4 import BeautifulSoup

url = "https://s.weibo.com/top/summary"
headers = {"User-Agent": ""."Cookie": ""}
wb_response = requests.get(url, headers=headers)
webcontent = wb_response.text
soup = BeautifulSoup(webcontent, "html.parser")
index_list = soup.find_all("td", class_="td-01")
title_list = soup.find_all("td", class_="td-02")
level_list = soup.find_all("td", class_="td-03")

topic_list = []
for i in range(len(index_list)):
    item_index = index_list[i].get_text(strip=True)
    if item_index == "":
        item_index = "0"
    item_title = title_list[i].a.get_text(strip=True)
    if title_list[i].span:
        item_mark = title_list[i].span.get_text(strip=True)
    else:
        item_mark = "Top"
    item_level = level_list[i].get_text(strip=True)
    topic_list.append({"index": item_index, "title": item_title, "mark": item_mark, "level": item_level,
                       "link": f"https://s.weibo.com/weibo?q=%23{item_title}%23&Refer=top"})
print(topic_list)

zhihu_billboard.py

import requests
import re
from bs4 import BeautifulSoup

headers={"User-Agent":""."Cookie":""}
zh_url = "https://www.zhihu.com/billboard"
zh_response = requests.get(zh_url,headers=headers)

webcontent = zh_response.text
soup = BeautifulSoup(webcontent,"html.parser")
script_text = soup.find("script",id="js-initialData").get_text()
rule = r'"hotList":(.*?) ,"guestFeeds"'
result = re.findall(rule,script_text)

temp = result[0].replace("false", "False").replace("true", "True")
hot_list = eval(temp)
print(hot_list)