Hello, I’m white and white I. This article is participating in Python Theme Month. See the link for details

I have written about scraping Weibo hot comments before, but recently the Wu mou scandal has been huge and the whole internet has been following it, until the official announcement from the [public security department] brought the matter to a provisional close. So today let's talk about scraping Weibo hot comments again and, along the way, get a sense of how the incident unfolded.

Crawl target

Website: Weibo

 

Results display

Tools used

Development environment: Win10, Python 3.7

Development tools: PyCharm, Chrome

Toolkits: requests, re, csv

Project idea analysis

Find the post you want to eat melon on (the post about the incident)

The request headers need to carry some basic configuration data:

headers = {
    "referer": "",     # page the request appears to come from
    "cookie": "",      # copied from the browser after logging in to m.weibo.cn
    "user-agent": ""   # the browser's user-agent string
}
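The cookie and user-agent can be copied from Chrome's developer tools (Network tab) after logging in to m.weibo.cn. As a minimal check, assuming the values have been filled in, you can request the post's detail page (the same URL used as the referer later on) and confirm the response status:

import requests

headers = {
    "referer": "https://m.weibo.cn/detail/4638585665621278",
    "cookie": "",      # paste the cookie copied from the browser
    "user-agent": ""   # paste the browser's user-agent string
}

# A 200 status code suggests the headers were accepted
resp = requests.get("https://m.weibo.cn/detail/4638585665621278", headers=headers)
print(resp.status_code)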

Find the comment data that the page loads dynamically

Use a packet capture tool (for example Chrome's developer tools) to find the request that returns the comment data

https://m.weibo.cn/comments/hotflow?id=4661850409272066&mid=4661850409272066&max_id=5640809315785878&max_id_type=0

The Weibo post URL contains the article id; mid is the same article id; max_id comes from the max_id field in each JSON response, and its values follow no obvious pattern.

Taking the max_id from the current response gives the request URL for the next page of comments.
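To make the pagination concrete, here is a minimal sketch (assuming the headers above have been filled in and using the same post id for both id and mid) that requests one page, reads max_id from the JSON, and formats it into the URL of the next page:

import requests

headers = {"referer": "", "cookie": "", "user-agent": ""}  # filled in as above

api = "https://m.weibo.cn/comments/hotflow?id=4661850409272066&mid=4661850409272066&max_id_type=0"
next_tpl = "https://m.weibo.cn/comments/hotflow?id=4661850409272066&mid=4661850409272066&max_id={}&max_id_type=0"

data = requests.get(api, headers=headers).json()["data"]
print("comments on this page:", len(data["data"]))

# The max_id of the current response points to the next page
next_page = next_tpl.format(data["max_id"])
print(next_page)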

Simple source code analysis


import csv
import re
import requests
import time

# First page of the hot-comment API and the template for the following pages
start_url = "https://m.weibo.cn/comments/hotflow?id=4661850409272066&mid=4661850409272066&max_id_type=0"
next_url = "https://m.weibo.cn/comments/hotflow?id=4638585665621278&mid=4661850409272066&max_id={}&max_id_type=0"
continue_url = start_url
headers = {
    "referer": "https://m.weibo.cn/detail/4638585665621278",
    "cookie": "SUB=_2A25Nq-BcDeRhGeBG7VUW-SnEyjyIHXVvV4AUrDV6PUJbkdAKLULFkW1NRhXYfC2JIAilAAFJ_-2diWZ1ZEACRZ5K; SCF=AgGUxHxg_ZjvVbYikCOVICTc-a4gDcEtR02fexDZstBq_XKr3s1Rp9CxdS4y4k4IvDQ2eIgTTyJg73pcUmvYRKc.; _T_WM=58609113785; WEIBOCN_FROM=1110006030; MLOGIN=1; M_WEIBOCN_PARAMS=oid%3D4638585665621278%26luicode%3D20000061%26lfid%3D4638585665621278%26uicode%3D20000061%26fid%3D4638585665621278; XSRF-TOKEN=06ed3f",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
count = 0


def csv_data(fileheader):
    # Append one row to the CSV file
    with open("wb1234.csv", "a", newline="", encoding="utf-8") as f:
        write = csv.writer(f)
        write.writerow(fileheader)


def get_data(start_url):
    global count
    print(start_url)
    try:
        response = requests.get(start_url, headers=headers).json()
        max_id = response['data']['max_id']
    except Exception:
        # If the response has no max_id, retry the same page with max_id_type=1
        get_data(start_url.split("type")[0] + "type=1")
    else:
        content_list = response.get("data").get('data')
        for item in content_list:
            count += 1
            create_time = item['created_at']
            # Keep only the Chinese characters of the comment text
            text = "".join(re.findall('[\u4e00-\u9fa5]', item["text"]))
            user_id = item.get("user")["id"]
            user_name = item.get("user")["screen_name"]
            csv_data([count, create_time, user_id, user_name, text])

        # Build the next page's URL from this page's max_id
        continue_url = next_url.format(max_id)
        time.sleep(2)
        get_data(continue_url)


if __name__ == "__main__":
    fileheader = ["id", "Comment time", "User id", "user_name", "Comment content"]
    csv_data(fileheader)
    get_data(start_url)
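One note on the script: it pages through the comments by calling get_data recursively, so a very long thread can eventually hit Python's default recursion limit (about 1000 frames). A loop-based sketch of the same idea, reusing the headers, next_url and csv_data defined above, would look roughly like this:

def crawl(url, max_pages=100):
    # Same pagination as get_data, but with a loop instead of recursion
    count = 0
    for _ in range(max_pages):
        data = requests.get(url, headers=headers).json().get("data") or {}
        for item in data.get("data", []):
            count += 1
            text = "".join(re.findall('[\u4e00-\u9fa5]', item["text"]))
            csv_data([count, item["created_at"], item["user"]["id"],
                      item["user"]["screen_name"], text])
        max_id = data.get("max_id")
        if not max_id:      # no further pages
            break
        url = next_url.format(max_id)
        time.sleep(2)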


I am **white and white I**, a programmer who likes to share knowledge ❤️

Thank you very much for your likes, favorites, follows, and comments; one-click quadruple support is much appreciated.