
Introduction

If you browse Weibo often, you probably follow a few interesting bloggers and read the text, pictures and videos they post, along with the comments underneath. But over time, for one reason or another, you may go back to look for a particular post and find that it has been deleted, or worse, that the blogger's whole account is gone. If there is a blogger you care about, save their posts regularly and you will not have to worry about finding them again if the Weibo servers all go down tomorrow. (The same applies to your own Weibo.)

Most of the Weibo crawlers you can find online target a version of Weibo from long ago, and what they crawl is incomplete: long posts cannot be fetched in full, and pictures are either not downloaded or not sorted. They are no longer suitable for crawling the full content of the current version of Weibo.

This example is based on Python 3.6.2 and crawls, numbers and stores locally the complete Weibo content of a single blogger.

Environment

Python 3.6.2 / Windows 7 64-bit / Weibo mobile site

Goal

Fetch the posts of a blogger you are interested in (all of them, or filtered, for example originals only), including the post text, pictures and comments. The text and comments are numbered and written into a TXT file, and the pictures are saved, numbered, into a folder at a specified path. This makes it easy to regularly save and look up the Weibo content you care about. Once you have the data, you can also run further analysis on the blogger's posts and comments.

In this example the data is stored on the local file system. If the amount of data to be crawled is large, a database such as MongoDB makes storage and retrieval easier.
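As a minimal sketch of that approach with pymongo (the database and collection names and the sample document are illustrative assumptions, not part of this article's code):

```python
# Minimal sketch of storing crawled posts in MongoDB instead of TXT files.
# Assumes a local MongoDB instance and the pymongo package; the database and
# collection names ("weibo", "posts") and the sample document are illustrative.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
posts = client['weibo']['posts']

def save_post(post):
    # Upsert by post id so re-crawling the same pages does not create duplicates.
    posts.update_one({'id': post['id']}, {'$set': post}, upsert=True)

save_post({'id': '4212345678901234', 'text': 'example post text', 'pics': []})
```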

Preparation

Generally speaking, the PC version of a site carries more complete information but is harder to crawl, while the mobile version is relatively simple. So this example uses the mobile site m.weibo.cn as the entry point.

Open the home page of the blogger you want to crawl. Taking "Natural History Magazine", which I follow, as an example, its home page URL is: m.weibo.cn/u/119505453…

The number 1195054531 is the uid we need. Type the URL m.weibo.cn/u/1195054531 into your browser and you land on the same home page. Now press F12 to open the Chrome Developer Tools and click "Network". Because the mobile site loads its content asynchronously, we are mainly interested in the requests under XHR, so click "XHR" and press F5 to refresh and send the requests again. The browser sends two requests: the first mainly fetches some profile information about the blogger, and the second fetches all the information for the first page of posts. We focus on the second request.

Click "Headers" to see the request details, such as the Request URL, Cookie and Referer. (The Cookie has to be copied manually; it is valid for a few hours and must be copied again after it expires.) The Request URL is m.weibo.cn/api/contain…

Observation shows that appending &page=<page number> to the end of this URL controls which page of posts is crawled. Switch to "Preview" to inspect the response:
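A minimal sketch of that request, assuming you have copied a fresh Cookie from your own browser session (the Cookie and User-Agent values below are placeholders):

```python
# Minimal sketch of the getIndex request for one page of cards.
# The Cookie and User-Agent below are placeholders; copy a fresh Cookie from
# your own browser session, since it expires after a few hours.
import requests

uid = '1195054531'          # the blogger's uid found above
page = 1                    # the &page= parameter controls which page is returned
url = ('https://m.weibo.cn/api/container/getIndex?type=uid&value=' + uid +
       '&containerid=107603' + uid + '&page=' + str(page))
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': '<paste your cookie here>',
}
cards = requests.get(url, headers=headers).json()['data']['cards']
print('page %d returned %d cards' % (page, len(cards)))
```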

In the returned JSON data, cards is the list of information cards, one per post. Expand mblog to see the detailed post content:

We mainly need the following fields from mblog:

'id': the post number
'text': the post text
'isLongText': whether the post is a long post
'bmiddle_pic': whether the post carries pictures
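As a rough sketch, pulling those fields out of a single card could look like this (card is one element of the cards list returned by the request above; the field names are the ones observed in the Developer Tools preview):

```python
# Sketch: pulling the fields listed above out of one card.
# `card` is a single element of the cards list returned by the getIndex request.
def parse_card(card):
    mblog = card.get('mblog')
    if mblog is None:                    # some cards are not posts; skip them
        return None
    return {
        'id': mblog['id'],                              # post number
        'text': mblog['text'],                          # post text (an HTML fragment)
        'isLongText': mblog.get('isLongText', False),   # True if the shown text is truncated
        'bmiddle_pic': mblog.get('bmiddle_pic'),        # set only when the post has pictures
    }
```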

Click on a specific post to open the page with its full content and comments. Again, by inspecting the requests under "Network", you can find that the URL of this page is m.weibo.cn/api/comment… The number after id is the post number obtained earlier, and the page parameter selects the comment page. The JSON data returned looks like this:

'data' and 'hotdata' hold the ordinary comments and the hot comments respectively.
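A hedged sketch of that comment request (the full endpoint path api/comments/show is my assumption for the URL truncated above; the id and page parameters and the 'data' / 'hotdata' fields follow the description in this article):

```python
# Sketch: fetching one page of comments for a single post.
# The endpoint path "api/comments/show" is an assumption for the truncated URL
# above; id is the post number and page selects the comment page.
import requests

def get_comments(post_id, page, headers):
    url = ('https://m.weibo.cn/api/comments/show?id=' + str(post_id) +
           '&page=' + str(page))
    ob_json = requests.get(url, headers=headers).json()
    comments = ob_json.get('data', [])      # ordinary comments
    hot = ob_json.get('hotdata', [])        # hot comments
    return comments, hot
```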

Implementation logic

  1. Fetch the cards data of every post page by varying the page parameter; the cards contain the detailed information of each post.
  2. Traverse every page and every post on each page, performing the following steps for each post (see the sketch after this list):
  3. Check whether it is a long post. If not, take the text directly; otherwise send the detailed-content request to get the full text. Write the text into the TXT file.
  4. Check whether the post has pictures. If it does, get the picture addresses through the request, walk through them, write the links into the TXT file and save the pictures locally; if not, move on.
  5. Send the comment request for the post, get the list of comments, walk through the list and append every comment to the corresponding post in the TXT file. Repeat until every post has been processed.
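A condensed sketch of this flow is below. getCards comes from the CrawlWeibo class in the code section further down; parse_card and get_comments are the helper sketches above; the file naming and the single-picture handling are simplified for illustration:

```python
# Condensed sketch of the traversal described above. getCards is the method
# from the CrawlWeibo class below; parse_card and get_comments are the helper
# sketches above. File naming and picture handling are simplified.
import os
import time
import urllib.request

def crawl(crawler, uid, pages, headers, out_dir='weibo_out'):
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, uid + '.txt'), 'w', encoding='utf-8') as txt:
        number = 0
        for cards in crawler.getCards(uid, pages):          # step 1: page by page
            for card in cards:                              # step 2: post by post
                post = parse_card(card)
                if post is None:
                    continue
                number += 1
                # Step 3: a long post would need the detailed-content request
                # for its full text; only the card text is written here.
                txt.write('%d. %s\n' % (number, post['text']))
                # Step 4: save the picture (only bmiddle_pic handled here).
                if post['bmiddle_pic']:
                    urllib.request.urlretrieve(
                        post['bmiddle_pic'],
                        os.path.join(out_dir, '%d.jpg' % number))
                # Step 5: append the comments under the same number.
                comments, hot = get_comments(post['id'], 1, headers)
                for c in hot + comments:
                    txt.write('    comment: %s\n' % c.get('text', ''))
                time.sleep(2)                               # be polite to the server
```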

The crawl process


Crawl results

The folder contains the post pictures, and the TXT file contains the post text and comments.

Taking the third post by "Natural History Magazine" as an example, the original post is as follows:

The post text and comments in the TXT file are as follows:

The corresponding pictures in the folder are as follows:

This makes it convenient to search and look things up later.

Code implementation

The core of the crawler is the CrawlWeibo class and its getCards method, which fetches the cards of every post page (the headers dict with your own Cookie is assumed, as described in the Preparation section):

```python
# -*- coding: UTF-8 -*-
'''
Created on March 9
@author: ora_jason
'''
from lxml import html
import requests
import json
import re
import os
import time
import urllib.request

# Request headers; the Cookie is a placeholder and must be copied manually
# from your browser, since it expires after a few hours.
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': '<paste your cookie here>',
}


class CrawlWeibo:
    # Fetch the cards of every post page of one blogger.
    # id: the blogger's uid; page: how many pages to crawl.
    def getCards(self, id, page):
        ii = 0
        list_cards = []
        while ii < page:
            ii = ii + 1
            print('Crawling the cards of page %d' % ii)
            url = ('https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id +
                   '&containerid=107603' + id + '&page=' + str(ii))
            response = requests.get(url, headers=headers)
            ob_json = json.loads(response.text)
            list_cards.append(ob_json['data']['cards'])
            time.sleep(2)
            print('Pausing for 2 seconds')
        return list_cards  # the cards of every page
```
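A hedged usage example for the excerpt above (the uid is the one found earlier, and only the first five pages are fetched here):

```python
# Example usage of the excerpt above: fetch the first 5 pages of posts for
# the uid found earlier and count the returned cards.
if __name__ == '__main__':
    crawler = CrawlWeibo()
    pages_of_cards = crawler.getCards('1195054531', 5)
    total = sum(len(cards) for cards in pages_of_cards)
    print('fetched %d cards across %d pages' % (total, len(pages_of_cards)))
```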


Due to limited space, follow the official account and reply "0310" to obtain the full source code of this article's crawler.
