Preface
This article shows how to use Python to scrape the comments under a trending Weibo topic. Enough talk; let's have a good time.
Development tools
**Python version:** 3.6.4
Related modules:
requests module;
re module;
pandas module;
lxml module;
random module;
and some modules that come with Python.
Environment setup
Install Python and add it to your environment variables, then install the required third-party modules with pip (e.g. `pip install requests pandas lxml`).
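As a quick sanity check (a convenience snippet of my own, not part of the original setup), you can confirm that everything imports cleanly:

```python
# If any of these imports raises ImportError, install the module with pip first.
import re
import time
import random

import requests
import lxml
import pandas as pd

print('requests', requests.__version__)
print('pandas', pd.__version__)
```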
Approach
This article takes the trending topic "Huo Zun's handwritten apology letter" as an example to show how to crawl Weibo comments.
Grab comments
Target URL
```
https://m.weibo.cn/detail/4669040301182509
```
Page analysis
The comments are loaded dynamically: open the browser's developer tools, switch to the Network panel, and scroll down the page, and the request (packet) you need will show up.
Get the real URL
```
https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0
https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id=3698934781006193&max_id_type=0
```
The difference between the two URLs is obvious: the first request carries no max_id parameter, while every request from the second onward does, and its value is simply the max_id returned in the previous packet. One thing to note is that max_id_type also changes over time, so it too has to be read from the previous packet.
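To make the pagination concrete, here is a minimal sketch of the idea (the helper name next_page is mine, and it assumes the endpoint still answers without a login, returning JSON whose data field holds the comments plus the max_id and max_id_type for the next request); the full crawler below adds cookie refreshing and throttling:

```python
import requests

BASE = 'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509'
HEADERS = {'User-Agent': 'Mozilla/5.0'}

def next_page(max_id=None, max_id_type=0):
    """Fetch one page of comments, threading max_id/max_id_type through."""
    url = f'{BASE}&max_id_type={max_id_type}'
    if max_id is not None:  # the very first request carries no max_id
        url += f'&max_id={max_id}'
    data = requests.get(url, headers=HEADERS).json()['data']
    # The parameters for the next request come back inside this packet.
    return data['data'], data['max_id'], data['max_id_type']

comments, max_id, max_id_type = next_page()                     # first page
comments, max_id, max_id_type = next_page(max_id, max_id_type)  # second page
```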
Code implementation
```python
import re
import requests
import pandas as pd
import time
import random

df = pd.DataFrame()
try:
    a = 1
    while True:
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
        }
        response = requests.get('https://m.weibo.cn/detail/4669040301182509', headers=header)
        # Weibo blocks the account after crawling a few dozen pages; refreshing the
        # cookies on every iteration keeps the crawler alive longer...
        cookie = [cookie.value for cookie in response.cookies]  # collect the cookie values with a list comprehension
        headers = {
            # Rebuild the cookie header from the freshly returned values
            # (SUB is left empty here, since it normally requires a login).
            'cookie': f'WEIBOCN_FROM={cookie[3]}; SUB=; _T_WM={cookie[4]}; MLOGIN={cookie[1]}; '
                      f'M_WEIBOCN_PARAMS={cookie[2]}; XSRF-TOKEN={cookie[0]}',
            'referer': 'https://m.weibo.cn/detail/4669040301182509',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
        }
        if a == 1:
            url = 'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0'
        else:
            url = f'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id={max_id}&max_id_type={max_id_type}'
        html = requests.get(url=url, headers=headers).json()
        data = html['data']
        max_id = data['max_id']  # max_id and max_id_type are carried over into the next URL
        max_id_type = data['max_id_type']
        for i in data['data']:
            screen_name = i['user']['screen_name']
            i_d = i['user']['id']
            like_count = i['like_count']  # number of likes
            created_at = i['created_at']  # timestamp
            text = re.sub(r'<[^>]*>', ' ', i['text'])  # strip HTML tags from the comment
            print(text)
            data_json = pd.DataFrame({'screen_name': [screen_name], 'i_d': [i_d], 'like_count': [like_count],
                                      'created_at': [created_at], 'text': [text]})
            df = pd.concat([df, data_json])
        time.sleep(random.uniform(2, 7))  # random delay between pages
        a += 1
except Exception as e:
    print(e)

df.to_csv('weibo.csv', encoding='utf-8', mode='a+', index=False)
print(df.shape)
```
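Once the loop stops (or Weibo cuts the crawler off and the exception is caught), a quick way to sanity-check the output is to load the CSV back with pandas:

```python
import pandas as pd

df = pd.read_csv('weibo.csv')
print(df.shape)   # rows = comments scraped, columns = the five fields above
print(df.head())  # screen_name, i_d, like_count, created_at, text
```

Note that because the script writes with mode='a+', rerunning it appends to the existing file (including a duplicate header row), so delete or deduplicate weibo.csv between runs.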
Results