Preface
It has been a long time since I last updated the crawler project, and I'm a bit embarrassed about that. This time, to share and learn with fellow Python enthusiasts, I'd like to introduce distributed crawling.
The crawler logic
The goal of this crawler is to collect Zhihu user information: starting from one user, we grab that user's followers, then the followers of those followers, and so on. It sounds a bit circular, but it works like recursion in an algorithm, spreading outward from one user to their followers and onward. The initial link to crawl is shown below.
The segment after "people" in the URL is the username, so you can change it to whichever user you want to crawl; "followers" means that user's followers, and if you want the people the user follows instead, change it to "following". Opening the developer tools, we find that the page itself does not contain the information we want to extract. That is because the page is loaded via Ajax, so we need to switch to the XHR tab to find the request we need, as shown in the figure below.
Eventually we find the Ajax request and its URL, and the response turns out to be JSON, which makes data collection much more convenient. If we want to crawl more followers, we just change the limit value to a multiple of 20. Now the crawler logic is clear.
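Before wiring this into a spider, here is a minimal sketch (using the requests library, outside of Scrapy) of what the pagination looks like. The username bu-xin-ming-71 comes from the start URL in the code below; the headers or cookies Zhihu actually requires may vary, so treat this as an illustration rather than a guaranteed-working call:

import requests

# followers API pattern: the username and the offset are the only parts that change
API = ('https://www.zhihu.com/api/v4/members/{user}/followers'
       '?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count'
       '&offset={offset}&limit=20')

headers = {'User-Agent': 'Mozilla/5.0'}  # Zhihu may reject requests without a browser-like UA

offset = 0
while True:
    resp = requests.get(API.format(user='bu-xin-ming-71', offset=offset), headers=headers)
    data = resp.json()['data']           # follower records live under the "data" key
    for user in data:
        print(user['url_token'], user['follower_count'])
    if len(data) < 20:                   # fewer than 20 records means we reached the last page
        break
    offset += 20                         # step to the next page of 20 followers

The stopping condition (fewer than 20 records in "data") is the same one the Scrapy spiders below rely on.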
The source code section
1. Without distribution
import json
import re

from scrapy import Spider, Request

from ..items import ZhihuItem  # adjust this import to your project's items module


class ZhihuinfoSpider(Spider):
    name = 'zhihuinfo'
    # redis_key = 'ZhihuinfoSpider:start_urls'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/api/v4/members/bu-xin-ming-71/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20']

    def parse(self, response):
        responses = json.loads(response.body.decode('utf-8'))["data"]
        count = len(responses)
        if count < 20:
            # fewer than 20 records means this is the last page for this user
            pass
        else:
            # build the next page URL by bumping the offset by 20
            page_offset = int(re.findall('&offset=(.*?)&', response.url)[0])
            new_page_offset = page_offset + 20
            new_page_url = response.url.replace(
                '&offset=' + str(page_offset) + '&',
                '&offset=' + str(new_page_offset) + '&'
            )
            yield Request(url=new_page_url, callback=self.parse)
        for user in responses:
            item = ZhihuItem()
            item['name'] = user['name']
            item['id'] = user['id']
            item['headline'] = user['headline']
            item['url_token'] = user['url_token']
            item['user_type'] = user['user_type']
            item['gender'] = user['gender']
            item['articles_count'] = user['articles_count']
            item['answer_count'] = user['answer_count']
            item['follower_count'] = user['follower_count']
            # a plain text file is used to de-duplicate users already collected
            # (userinfo.txt must exist beforehand; create an empty file on the first run)
            with open('userinfo.txt') as f:
                user_list = f.read()
            if user['url_token'] not in user_list:
                with open('userinfo.txt', 'a') as f:
                    f.write(user['url_token'] + '----------')
                yield item
                # queue this follower's own follower list for crawling
                new_url = 'https://www.zhihu.com/api/v4/members/' + user['url_token'] + '/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'
                yield Request(url=new_url, callback=self.parse)
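The spider above fills a ZhihuItem with nine fields. The item class itself is not shown in the original post, but a minimal items.py matching the fields used would look roughly like this:

import scrapy

class ZhihuItem(scrapy.Item):
    # one field per attribute pulled from the followers API response
    name = scrapy.Field()
    id = scrapy.Field()
    headline = scrapy.Field()
    url_token = scrapy.Field()
    user_type = scrapy.Field()
    gender = scrapy.Field()
    articles_count = scrapy.Field()
    answer_count = scrapy.Field()
    follower_count = scrapy.Field()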
2. With distribution
import json
import re

from scrapy import Request
from scrapy_redis.spiders import RedisCrawlSpider

from ..items import ZhihuItem  # adjust this import to your project's items module


class ZhihuinfoSpider(RedisCrawlSpider):
    name = 'zhihuinfo'
    redis_key = 'ZhihuinfoSpider:start_urls'
    allowed_domains = ['www.zhihu.com']
    # start_urls is no longer needed; seed URLs are read from the Redis key above
    # start_urls = ['https://www.zhihu.com/api/v4/members/bu-xin-ming-71/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20']

    def parse(self, response):
        responses = json.loads(response.body.decode('utf-8'))["data"]
        count = len(responses)
        if count < 20:
            # fewer than 20 records means this is the last page for this user
            pass
        else:
            # build the next page URL by bumping the offset by 20
            page_offset = int(re.findall('&offset=(.*?)&', response.url)[0])
            new_page_offset = page_offset + 20
            new_page_url = response.url.replace(
                '&offset=' + str(page_offset) + '&',
                '&offset=' + str(new_page_offset) + '&'
            )
            yield Request(url=new_page_url, callback=self.parse)
        for user in responses:
            item = ZhihuItem()
            item['name'] = user['name']
            item['id'] = user['id']
            item['headline'] = user['headline']
            item['url_token'] = user['url_token']
            item['user_type'] = user['user_type']
            item['gender'] = user['gender']
            item['articles_count'] = user['articles_count']
            item['answer_count'] = user['answer_count']
            item['follower_count'] = user['follower_count']
            # a plain text file is used to de-duplicate users already collected
            with open('userinfo.txt') as f:
                user_list = f.read()
            if user['url_token'] not in user_list:
                with open('userinfo.txt', 'a') as f:
                    f.write(user['url_token'] + '----------')
                yield item
                # queue this follower's own follower list for crawling
                new_url = 'https://www.zhihu.com/api/v4/members/' + user['url_token'] + '/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'
                yield Request(url=new_url, callback=self.parse)
As you can see, switching the Scrapy spider to distributed mode takes only two changes, plus a few settings in the configuration file. In distributed mode you then launch the crawler in multiple terminals; without distribution, you simply run it in a single terminal.
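The two changes are inheriting from RedisCrawlSpider (from scrapy-redis) and replacing start_urls with redis_key. The configuration mentioned above is the usual scrapy-redis setup; a sketch of the relevant settings.py entries, assuming a Redis instance on localhost (adjust REDIS_URL to your environment):

# settings.py -- hand scheduling and de-duplication over to scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the request queue in Redis between runs
REDIS_URL = 'redis://127.0.0.1:6379'  # point every crawler node at the same Redis

# optional: also push scraped items into Redis
# ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 300}

Then run scrapy crawl zhihuinfo in each terminal (or on each machine); the spiders will sit idle until you push the seed URL onto the ZhihuinfoSpider:start_urls list in Redis (for example with redis-cli lpush), at which point they all start pulling requests from the shared queue.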
The running screen
The results
The distributed crawler is very fast: it collected more than 20,000 records in less than half a minute. For those not using the Scrapy framework, the crawler logic is the same; you only need to copy the relevant part of the crawler code to get it running.
Recommended reading:
Crawler advanced: Qunar hotels (domestic and international)
Scraping Taobao food with Scrapy
A large-scale crawler case study: crawling Qunar
If you are interested in crawlers, data analysis, or algorithms, follow the WeChat official account TWcoding and let's play with Python together.
If this works for you, please give it a star.
God helps those who help themselves