This is the 7th day of my participation in the November Gwen Challenge. Check out the event details: The Last Gwen Challenge of 2021.

Preface

In this post, we use Python to scrape the answers under a trending Zhihu topic. Without further ado, let's get started.

Let's have a good time!

Development tools

Python version: 3.6.4

Related modules:

requests module;

re module;

pandas module;

lxml module;

random module;

And some modules that come with Python.

Environment set up

Install Python and add it to your environment variables, then use pip to install the required modules.

Approach

This article takes the trending Zhihu question "How do you view the Tencent intern's suggestion that Tencent senior management issue a ban on compulsory drinking?" as an example.

The target site

https://www.zhihu.com/question/478781972

Web page analysis

After viewing the page source, we need to open the browser's developer tools to capture packets. Go to Network→XHR and scroll the page down with the mouse to capture the data packets we need.

Get the exact URL

https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=0&platform=desktop&sort_by=default
https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=5&platform=desktop&sort_by=default

The URL contains many unnecessary parameters, which you can trim directly in the browser. The only difference between the two URLs is the offset parameter: it is 0 in the first URL and 5 in the second, increasing in steps of 5. The response data is in JSON format.
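Before writing the full scraper, you can verify the pagination with a short loop that walks offset in steps of 5 and stops when the API returns no more answers. This is a minimal sketch: the trimmed parameter set, the empty-data stop condition, and whether extra headers or cookies are required are assumptions that may vary.

import requests

BASE = 'https://www.zhihu.com/api/v4/questions/478781972/answers'
headers = {
    # a browser-like User-Agent; Zhihu may reject requests without one
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

offset = 0
while True:
    # offset starts at 0 and grows by 5, matching the two captured URLs
    params = {'limit': 5, 'offset': offset, 'platform': 'desktop', 'sort_by': 'default'}
    payload = requests.get(BASE, headers=headers, params=params).json()
    answers = payload.get('data', [])
    if not answers:  # assumed stop condition: an empty page means no more answers
        break
    print(f'offset={offset}: fetched {len(answers)} answers')
    offset += 5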

Code implementation

import requests
import pandas as pd
import re
import time
import random

df = pd.DataFrame()
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
for page in range(0, 1360, 5):  # offset goes 0, 5, 10, ... matching the API's paging
    url = f'https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={page}&platform=desktop&sort_by=default'
    response = requests.get(url=url, headers=headers).json()
    data = response['data']
    for list_ in data:
        name = list_['author']['name']  # Zhihu author
        id_ = list_['author']['id']  # author id
        created_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(list_['created_time']))  # answer time
        voteup_count = list_['voteup_count']  # agreement (upvote) count
        comment_count = list_['comment_count']  # comment count
        content = list_['content']  # answer content (returned as HTML)
        # keep only Chinese characters and punctuation, dropping the HTML markup
        content = ' '.join(re.findall("[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]", content))
        print(name, id_, created_time, comment_count, content, sep='|')
        dataFrame = pd.DataFrame(
            {'Zhihu author': [name], 'Author id': [id_], 'Answer time': [created_time],
             'Agreement number': [voteup_count], 'Comment count': [comment_count],
             'Answer content': [content]})
        df = pd.concat([df, dataFrame])
    time.sleep(random.uniform(2, 3))  # random pause between pages to avoid hammering the server
df.to_csv('Zhihu answer.csv', encoding='utf-8', index=False)
print(df.shape)
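The regular expression in the loop matches one character at a time, so the joined result is a space-separated string containing only Chinese characters and common Chinese punctuation, with all HTML markup stripped. A quick illustration with a made-up snippet:

import re

# the character class keeps CJK characters (\u4e00-\u9fa5) and common Chinese punctuation;
# everything else, including HTML tags and Latin text, is dropped
pattern = "[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]"

sample = '<p>这是一个示例回答，谢谢支持！</p>'  # hypothetical answer HTML for illustration
print(' '.join(re.findall(pattern, sample)))
# output: 这 是 一 个 示 例 回 答 ， 谢 谢 支 持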

Results
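To take a quick look at what was collected, you can read the CSV back with pandas. A minimal sketch, assuming the file and column names written by the code above:

import pandas as pd

# load the file written by the scraper and peek at the most-upvoted answers
df = pd.read_csv('Zhihu answer.csv', encoding='utf-8')
print(df.shape)
top = df.sort_values('Agreement number', ascending=False).head(10)
print(top[['Zhihu author', 'Agreement number', 'Comment count']])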