In the last article, we analyzed how to fetch an individual blogger's Weibo posts. In this article, we will turn that analysis into code and build our no-login Weibo crawler.
The goal is that you can enter a blogger's Weibo screen name and get all of their posts, so we will follow that path step by step.
1. First of all, we must obtain the user ID, so that we can splice it into the interface we analyzed in the last article and request the data we want.
Here the method is the same as in the previous article: search for the screen name (Xie Na in this example) and capture the packets to find Weibo's search interface.
import json

import requests

key_word = '谢娜'  # the screen name we are searching for (Xie Na)
headers = {
    # a mobile User-Agent so that m.weibo.cn serves the JSON interface
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) '
                  'AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
}

req_url = ('https://m.weibo.cn/api/container/getIndex'
           '?containerid=100103type%3D3%26q%3D{keyword}').format(keyword=key_word)
search_web_content = requests.get(url=req_url, headers=headers).text
json_text = json.loads(search_web_content)
if json_text['ok'] == 1:
    card_list = json_text['cards']
    for cards in card_list:
        if cards['card_type'] == 11:  # the card that holds the matched users
            card_group_list = cards['card_group']
            for card_group in card_group_list:
                if card_group['user']['screen_name'] == key_word:
                    uid = card_group['user']['id']  # the user id we need
                    content_url = ('https://m.weibo.cn/api/container/getIndex'
                                   '?containerid=107603{uid}&page={page}').format(uid=uid, page=1)
                    print(content_url)
else:
    print('search failed')
In this way, we can obtain the user ID from the Weibo screen name.
2. Here we use Scrapy + Redis + MongoDB to build our crawler: Scrapy does the crawling, Redis (through scrapy_redis) holds and schedules the start URLs, and MongoDB stores the results.
scrapy startproject weibo_spider
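Before writing the spider, point Scrapy at Redis (and, for storage, MongoDB). Below is a minimal settings.py sketch, assuming a local Redis and MongoDB; the MONGO_URI / MONGO_DATABASE keys are names of our own choosing, read by the pipeline sketched further down.

# settings.py -- a minimal sketch, assuming scrapy_redis with a local Redis/MongoDB

# let scrapy_redis schedule requests and deduplicate them through Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER_PERSIST = True  # keep the queue in Redis between runs

REDIS_URL = 'redis://127.0.0.1:6379'

# a mobile User-Agent so m.weibo.cn answers with JSON, as in step 1
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) '
                  'AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
}

# our own setting names, read by the MongoDB pipeline sketched below
MONGO_URI = 'mongodb://127.0.0.1:27017'
MONGO_DATABASE = 'weibo'

ITEM_PIPELINES = {
    'weibo_spider.pipelines.MongoPipeline': 300,
}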
Then create the spider:
import json
import re

from scrapy.http import Request
from scrapy_redis.spiders import RedisSpider


class WeiboSpider(RedisSpider):
    name = 'weibo_spider'
    # the spider pulls its start URLs from this Redis list
    redis_key = 'spider:starts_url'

    def parse(self, response):
        json_text = json.loads(response.text)
        if json_text['ok'] == 1:
            cards_list = json_text['cards']
            for cards in cards_list:
                if cards['card_type'] == 9:  # the card type of an ordinary post
                    mblog = cards['mblog']
                    scheme = cards['scheme']  # link to the post itself
                    text = mblog['text']  # post content (HTML)
                    avatar_hd = mblog['user']['avatar_hd']  # HD avatar URL
                    # hand the post to the item pipeline (MongoDB, see below)
                    yield {'scheme': scheme, 'text': text, 'avatar_hd': avatar_hd}

            # build the next page's URL from the current one and keep crawling
            next_page = re.findall(r'page=([\d]*)', str(response.url), re.S)[0]
            containerid = re.findall(r'containerid=(.*?)&', str(response.url), re.S)[0]
            next_page = int(next_page) + 1
            next_url = ('https://m.weibo.cn/api/container/getIndex'
                        '?containerid={containerid}&page={page}').format(
                            containerid=containerid, page=next_page)
            yield Request(url=next_url, callback=self.parse)
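The spider only yields the fields; in Scrapy, writing them to MongoDB is a pipeline's job. Here is a minimal pipeline sketch (pipelines.py), assuming pymongo is installed and the MONGO_URI / MONGO_DATABASE settings above:

# pipelines.py -- a minimal MongoDB storage sketch, assuming pymongo
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection info from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'weibo'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # one MongoDB document per weibo post
        self.db['posts'].insert_one(dict(item))
        return item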
That's all there is to it. Now we just need to start the crawler and lpush a start URL, built from the user ID we obtained in step 1, into the Redis list the spider listens on (the redis_key above), as shown below.
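For example, seeding the queue from Python with the redis package (the uid below is a placeholder; use the one printed in step 1, and 107603 is the container prefix from the last article):

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
uid = 1234567890  # placeholder: the user id obtained in step 1
start_url = ('https://m.weibo.cn/api/container/getIndex'
             '?containerid=107603{uid}&page=1').format(uid=uid)
r.lpush('spider:starts_url', start_url)  # the redis_key the spider listens on

Then start the crawler with scrapy crawl weibo_spider.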
Following the same idea, you can also capture the packets for each post's comments and work out the comment API interface.
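As a starting point only: the comment interface we captured this way looked roughly like the sketch below. The exact path and parameters are an assumption here, so verify them by capturing the packets yourself; the weibo id can be taken from the mblog data of each post.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) '
                         'AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'}

# the endpoint is an assumption -- capture the real one yourself as described above
weibo_id = '4312345678901234'  # placeholder: the id field of a post's mblog data
comment_url = ('https://m.weibo.cn/api/comments/show'
               '?id={weibo_id}&page={page}').format(weibo_id=weibo_id, page=1)
print(requests.get(comment_url, headers=headers).json())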