Preface
In this article we use Python to scrape the answers under a trending Zhihu topic. Without further ado, let's get started.
Development tools
Python version: 3.6.4
Related modules:
requests module;
re module;
pandas module;
lxml module;
random module;
and some modules that come with Python.
Environment setup
Install Python and add it to your environment variables, then use pip to install the required modules (for example: pip install requests pandas lxml).
Approach
This article uses the trending Zhihu question "How do you view the Tencent intern's suggestion that Tencent senior management issue a ban on compulsory drinking?" as an example.
Target site
https://www.zhihu.com/question/478781972
Web page analysis
The answers are not in the page source, so open the browser's developer tools to capture the requests: go to Network → XHR and scroll the page down to capture the data packets we need.
This gives us the exact URLs:
https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=0&platform=desktop&sort_by=default
https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=5&platform=desktop&sort_by=default
The URLs contain many unnecessary parameters that you can delete in the browser. The only difference between the two URLs is the offset parameter: the first URL has offset=0, the second has offset=5, and offset keeps increasing in steps of 5. The response data is in JSON format.
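As a quick check before writing the full scraper, you can request a single page of the endpoint with a trimmed parameter set and page through it via offset. The sketch below is my own addition: the reduced include fields are an assumption, and the full captured URL above also works if any field comes back missing.

import requests

# Minimal sketch: trimmed parameters (assumed sufficient for the fields we need;
# if anything is missing, fall back to the full captured 'include' string above).
api = 'https://www.zhihu.com/api/v4/questions/478781972/answers'
params = {
    'include': 'data[*].content,voteup_count,comment_count,created_time',
    'limit': 5,
    'offset': 0,  # increase in steps of 5 to page through the answers
    'platform': 'desktop',
    'sort_by': 'default',
}
headers = {'user-agent': 'Mozilla/5.0'}
resp = requests.get(api, params=params, headers=headers).json()
print(len(resp['data']))       # 5 answers per page
print(resp['data'][0].keys())  # inspect the fields returned for one answer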
Code implementation
import requests
import pandas as pd
import re
import time
import random

df = pd.DataFrame()
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
for page in range(0, 1360, 5):
    url = f'https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={page}&platform=desktop&sort_by=default'
    response = requests.get(url=url, headers=headers).json()
    data = response['data']
    for list_ in data:
        name = list_['author']['name']  # Zhihu author
        id_ = list_['author']['id']  # author id
        created_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(list_['created_time']))  # answer time
        voteup_count = list_['voteup_count']  # number of upvotes
        comment_count = list_['comment_count']  # number of comments
        content = list_['content']  # answer content (HTML)
        content = ' '.join(re.findall("[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]", content))  # keep only Chinese characters and punctuation
        print(name, id_, created_time, comment_count, content, sep='|')
        dataFrame = pd.DataFrame(
            {'Zhihu author': [name], 'Author ID': [id_], 'Answer time': [created_time], 'Upvotes': [voteup_count],
             'Comments': [comment_count], 'Answer content': [content]})
        df = pd.concat([df, dataFrame])
    time.sleep(random.uniform(2, 3))  # pause between pages to avoid being blocked
df.to_csv('Zhihu answer.csv', encoding='utf-8', index=False)
print(df.shape)
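After the scrape finishes, you can sanity-check the saved file by reading it back with pandas. A minimal sketch of my own (the file name and column names match the ones used in the code above):

import pandas as pd

check = pd.read_csv('Zhihu answer.csv')
print(check.shape)                  # (number of answers scraped, 6)
print(check['Upvotes'].describe())  # quick look at the upvote distribution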
Results