Preface
It has been a long time since I last updated the crawler project, and I'm a bit embarrassed about that. This time, to share and learn with fellow Python enthusiasts, I'd like to introduce distributed crawling.
The crawler logic
The goal of this crawler is to collect Zhihu user information: starting from one user, we grab that user's followers, then the followers of those followers, and so on. It sounds a bit circular, but it works like recursion in an algorithm, spreading outward from one user to their followers and onward. The initial link to crawl is shown below.
The segment after "people" in the URL is the username, so you can change it to whichever user you want to crawl; "followers" means that user's followers, and if you want the people the user follows instead, change it to "following". Opening the developer tools, we find that the page itself does not contain the information we want to extract. That is because the page is loaded via Ajax, so we need to switch to the XHR tab to find the request we need, as shown in the figure below.
Eventually we find the Ajax request and its URL, and the response turns out to be JSON, which makes data collection much more convenient. If we want to crawl more followers, we just change the limit value to a multiple of 20. Now the crawler logic is clear.
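Before wiring this into a spider, here is a minimal sketch (using the requests library, outside of Scrapy) of what the pagination looks like. The username bu-xin-ming-71 comes from the start URL in the code below; the headers or cookies Zhihu actually requires may vary, so treat this as an illustration rather than a guaranteed-working call:

import requests

# followers API pattern: the username and the offset are the only parts that change
API = ('https://www.zhihu.com/api/v4/members/{user}/followers'
       '?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count'
       '&offset={offset}&limit=20')

headers = {'User-Agent': 'Mozilla/5.0'}  # Zhihu may reject requests without a browser-like UA

offset = 0
while True:
    resp = requests.get(API.format(user='bu-xin-ming-71', offset=offset), headers=headers)
    data = resp.json()['data']           # follower records live under the "data" key
    for user in data:
        print(user['url_token'], user['follower_count'])
    if len(data) < 20:                   # fewer than 20 records means we reached the last page
        break
    offset += 20                         # step to the next page of 20 followers

The stopping condition (fewer than 20 records in "data") is the same one the Scrapy spiders below rely on.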
The source code section
1. Without distribution
import json
import re

from scrapy import Spider, Request

from ..items import ZhihuItem  # adjust this import to your project's items module


class ZhihuinfoSpider(Spider):
    name = 'zhihuinfo'
    # redis_key = 'ZhihuinfoSpider:start_urls'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/api/v4/members/bu-xin-ming-71/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20']

    def parse(self, response):
        responses = json.loads(response.body.decode('utf-8'))["data"]
        count = len(responses)
        if count < 20:
            # fewer than 20 records means this is the last page for this user
            pass
        else:
            # build the next page URL by bumping the offset by 20
            page_offset = int(re.findall('&offset=(.*?)&', response.url)[0])
            new_page_offset = page_offset + 20
            new_page_url = response.url.replace(
                '&offset=' + str(page_offset) + '&',
                '&offset=' + str(new_page_offset) + '&'
            )
            yield Request(url=new_page_url, callback=self.parse)
        for user in responses:
            item = ZhihuItem()
            item['name'] = user['name']
            item['id'] = user['id']
            item['headline'] = user['headline']
            item['url_token'] = user['url_token']
            item['user_type'] = user['user_type']
            item['gender'] = user['gender']
            item['articles_count'] = user['articles_count']
            item['answer_count'] = user['answer_count']
            item['follower_count'] = user['follower_count']
            # a plain text file is used to de-duplicate users already collected
            # (userinfo.txt must exist beforehand; create an empty file on the first run)
            with open('userinfo.txt') as f:
                user_list = f.read()
            if user['url_token'] not in user_list:
                with open('userinfo.txt', 'a') as f:
                    f.write(user['url_token'] + '----------')
                yield item
                # queue this follower's own follower list for crawling
                new_url = 'https://www.zhihu.com/api/v4/members/' + user['url_token'] + '/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'
                yield Request(url=new_url, callback=self.parse)
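The spider above fills a ZhihuItem with nine fields. The item class itself is not shown in the original post, but a minimal items.py matching the fields used would look roughly like this:

import scrapy

class ZhihuItem(scrapy.Item):
    # one field per attribute pulled from the followers API response
    name = scrapy.Field()
    id = scrapy.Field()
    headline = scrapy.Field()
    url_token = scrapy.Field()
    user_type = scrapy.Field()
    gender = scrapy.Field()
    articles_count = scrapy.Field()
    answer_count = scrapy.Field()
    follower_count = scrapy.Field()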
2. With distribution
import json
import re

from scrapy import Request
from scrapy_redis.spiders import RedisCrawlSpider

from ..items import ZhihuItem  # adjust this import to your project's items module


class ZhihuinfoSpider(RedisCrawlSpider):
    name = 'zhihuinfo'
    redis_key = 'ZhihuinfoSpider:start_urls'
    allowed_domains = ['www.zhihu.com']
    # start_urls is no longer needed; seed URLs are read from the Redis key above
    # start_urls = ['https://www.zhihu.com/api/v4/members/bu-xin-ming-71/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20']

    def parse(self, response):
        responses = json.loads(response.body.decode('utf-8'))["data"]
        count = len(responses)
        if count < 20:
            # fewer than 20 records means this is the last page for this user
            pass
        else:
            # build the next page URL by bumping the offset by 20
            page_offset = int(re.findall('&offset=(.*?)&', response.url)[0])
            new_page_offset = page_offset + 20
            new_page_url = response.url.replace(
                '&offset=' + str(page_offset) + '&',
                '&offset=' + str(new_page_offset) + '&'
            )
            yield Request(url=new_page_url, callback=self.parse)
        for user in responses:
            item = ZhihuItem()
            item['name'] = user['name']
            item['id'] = user['id']
            item['headline'] = user['headline']
            item['url_token'] = user['url_token']
            item['user_type'] = user['user_type']
            item['gender'] = user['gender']
            item['articles_count'] = user['articles_count']
            item['answer_count'] = user['answer_count']
            item['follower_count'] = user['follower_count']
            # a plain text file is used to de-duplicate users already collected
            with open('userinfo.txt') as f:
                user_list = f.read()
            if user['url_token'] not in user_list:
                with open('userinfo.txt', 'a') as f:
                    f.write(user['url_token'] + '----------')
                yield item
                # queue this follower's own follower list for crawling
                new_url = 'https://www.zhihu.com/api/v4/members/' + user['url_token'] + '/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'
                yield Request(url=new_url, callback=self.parse)
As you can see, switching the Scrapy spider to distributed mode takes only two changes, plus a few settings in the configuration file. In distributed mode you then launch the crawler in multiple terminals; without distribution, you simply run it in a single terminal.
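The two changes are inheriting from RedisCrawlSpider (from scrapy-redis) and replacing start_urls with redis_key. The configuration mentioned above is the usual scrapy-redis setup; a sketch of the relevant settings.py entries, assuming a Redis instance on localhost (adjust REDIS_URL to your environment):

# settings.py -- hand scheduling and de-duplication over to scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the request queue in Redis between runs
REDIS_URL = 'redis://127.0.0.1:6379'  # point every crawler node at the same Redis

# optional: also push scraped items into Redis
# ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 300}

Then run scrapy crawl zhihuinfo in each terminal (or on each machine); the spiders will sit idle until you push the seed URL onto the ZhihuinfoSpider:start_urls list in Redis (for example with redis-cli lpush), at which point they all start pulling requests from the shared queue.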
The running screen
The results
The distributed crawler is very fast: it collected more than 20,000 records in less than half a minute. For those not using the Scrapy framework, the crawler logic is the same; you only need to copy the relevant part of the crawler code to get it running.
Recommended reading:
Crawler advanced: Qunar hotels (domestic and international)
Scraping Taobao food with Scrapy
A large-scale crawler case study: crawling Qunar
If you are interested in crawlers, data analysis, or algorithms, follow the WeChat official account TWcoding and let's play with Python together.
If this works for you, please give it a star.
God helps those who help themselves