This is the 8th day of my participation in the November Gengwen Challenge. For details, see: The Last Gengwen Challenge of 2021.

Preface

This post shows how to use Python to scrape the comments of a trending Weibo topic. Without further ado, let's get started.

Development tools

**Python version:** 3.6.4

Related modules:

requests module;

re module;

pandas module;

lxml module;

random module;

and some modules that come with Python.

Environment setup

Install Python and add it to your PATH environment variable, then use pip to install the required third-party modules.
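For example, the third-party modules can be installed in one command (re, time, and random ship with Python and need no installation):

pip install requests pandas lxml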

Approach

This article takes the post “Huo Zun handwritten apology letter” as an example to explain how to crawl Weibo comments!

Grab comments

The web address

https://m.weibo.cn/detail/4669040301182509

Page analysis

The comments are loaded dynamically. Open the browser's developer tools (Network tab) and scroll down the page, and the request that fetches each batch of comments shows up in the capture list.

Get the real URL

https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0
https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id=3698934781006193&max_id_type=0

The difference between the two URLs is easy to spot: the first request carries no max_id parameter, while every later one does, and its value is exactly the max_id field returned in the previous response.

One thing to note, though: the max_id_type parameter also changes over time, so it too has to be read from each response.
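A minimal sketch of that pagination logic (the headers dict here is a placeholder; the real crawl needs the cookie built in the full code below):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; see the full headers below
params = {'id': '4669040301182509', 'mid': '4669040301182509', 'max_id_type': 0}

# The first request carries no max_id; each response supplies the values for the next one
data = requests.get('https://m.weibo.cn/comments/hotflow', params=params, headers=headers).json()['data']
params['max_id'] = data['max_id']            # chain into the next request
params['max_id_type'] = data['max_id_type']  # may change, so refresh it every time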

Code implementation

import re
import requests
import pandas as pd
import time
import random

df = pd.DataFrame()  # accumulates every page of comments
try:
    a = 1  # page counter: the first request carries no max_id
    while True:
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
        }
        response = requests.get('https://m.weibo.cn/detail/4669040301182509', headers=header)
        # Weibo bans the account after crawling a few dozen pages; refreshing the cookies on every loop keeps the crawler alive longer...
        cookie = [cookie.value for cookie in response.cookies]  # collect the cookie values with a list comprehension
        headers = {
            # Build the cookie string from the freshly fetched values (SUB is left empty here)
            'cookie': f'WEIBOCN_FROM={cookie[3]}; SUB=; _T_WM={cookie[4]}; MLOGIN={cookie[1]}; M_WEIBOCN_PARAMS={cookie[2]}; XSRF-TOKEN={cookie[0]}',
            'referer': 'https://m.weibo.cn/detail/4669040301182509',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
        }
        }
        if a == 1:
            url = 'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0'
        else:
            url = f'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id={max_id}&max_id_type={max_id_type}'

        html = requests.get(url=url, headers=headers).json()
        data = html['data']
        max_id = data['max_id']  # max_id and max_id_type feed into the next URL
        max_id_type = data['max_id_type']
        for i in data['data']:
            screen_name = i['user']['screen_name']
            i_d = i['user']['id']
            like_count = i['like_count']  # number of likes
            created_at = i['created_at']  # comment time
            text = re.sub(r'<[^>]*>', '', i['text'])  # strip HTML tags from the comment
            print(text)
            data_json = pd.DataFrame({'screen_name': [screen_name], 'i_d': [i_d], 'like_count': [like_count], 'created_at': [created_at], 'text': [text]})
            df = pd.concat([df, data_json])
        time.sleep(random.uniform(2, 7))  # random pause between pages to avoid getting blocked
        a += 1
except Exception as e:
    print(e)

df.to_csv('weibo.csv', encoding='utf-8', mode='a+', index=False)
print(df.shape)
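To sanity-check the output, the saved file can be read back with pandas (a quick check, assuming the crawl produced weibo.csv in the working directory):

import pandas as pd

df = pd.read_csv('weibo.csv')
print(df.shape)   # (number of comments, number of columns)
print(df.head())  # first rows: screen_name, i_d, like_count, created_at, text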

Results