Preface

Today we’ll use Scrapy to crawl emoticons (memes) from Zhihu. Let’s have a good time.

Development tools

Python version: 3.6.4

Related modules:

scrapy module;

requests module;

fake_useragent module;

And some modules that come with Python.

Environment setup

Install Python and add it to your environment variables, then use pip to install the required modules.
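For example, the three third-party modules can be installed in one go:

pip install scrapy requests fake_useragent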

Introduction to the principle

The principle is actually quite simple: we know that Zhihu has an API that has always been available:

https://www.zhihu.com/node/QuestionAnswerListV2

Send a POST request to this link, carrying data in the following format (a quick manual test of this API is sketched right after the parameter list):

data = {
    'method': 'next',
    'params': '{"url_token":%s,"page_size":%s,"offset":%s}'
}

1. url_token: the question id, i.e. the number at the end of https://www.zhihu.com/question/; for example, for the question at https://www.zhihu.com/question/302378021, the id is 302378021;
2. page_size: the number of answers per page (usually 10);
3. offset: the offset of the currently displayed answers.
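As a quick sanity check, here is a minimal sketch (separate from the spider we will write below) that posts to this API directly with requests. The question id 302378021 is the one used throughout this post; the user-agent is just a placeholder, so adjust it if the endpoint rejects the request:

import requests

url = 'https://www.zhihu.com/node/QuestionAnswerListV2'
data = {
    'method': 'next',
    'params': '{"url_token":%s,"page_size":%s,"offset":%s}' % (302378021, 10, 0)
}
headers = {'user-agent': 'Mozilla/5.0'}
response = requests.post(url, data=data, headers=headers)
# 'msg' holds one HTML fragment per answer
print(len(response.json()['msg']))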

This lets you get all the answers to the question; then a regular expression can extract all the image links in each answer, as in the sketch below.
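The image links live in data-original attributes, with their slashes still JSON-escaped (which is why the spider later strips backslashes). A minimal sketch of the extraction, using a made-up answer fragment (the fragment and URL below are hypothetical):

import re

# Hypothetical answer fragment as the API returns it (note the \/ escapes)
answer = '<img data-original="https:\\/\\/pic1.zhimg.com\\/v2-abc123_r.jpg">'
imgregular = re.compile('data-original="(.*?)"', re.S)
for link in re.findall(imgregular, answer):
    print(link.replace('\\', ''))  # strip the escaping backslashes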

Create a new Scrapy project:

scrapy startproject zhihuEmoji
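This generates the standard Scrapy project skeleton, which should look roughly like this (exact files vary a little between Scrapy versions):

zhihuEmoji/
    scrapy.cfg
    zhihuEmoji/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py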

Then create a new zhihuEmoji.py file under the spiders folder to implement our main crawler:

"Zhihu Emoticons crawling"
class zhihuEmoji(scrapy.Spider) :
    name = 'zhihuEmoji'
    allowed_domains = ['www.zhihu.com']
    question_id = '302378021'
    answer_url = 'https://www.zhihu.com/node/QuestionAnswerListV2'
    headers = {
                'user-agent': 'the Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/79.0.3945.130 Safari/537.36'.'Accept-Encoding': 'gzip, deflate'
            }
    ua = UserAgent()
    Request function
    def start_requests(self) :
        offset = -10
        size = 10
        while True:
            offset += size
            data = {
                        'method': 'next'.'params': '{"url_token":%s,"page_size":%s,"offset":%s}' % (self.question_id, size, offset)
                    }
            self.headers['user-agent'] = self.ua.random
            yield scrapy.FormRequest(url=self.answer_url, formdata=data, callback=self.parse, headers=self.headers)
    Parsing function
    def parse(self, response) :
        # To save pictures
        if not os.path.exists(self.question_id):
            os.mkdir(self.question_id)
        # Parse the response to get the data in the answer to the question, then get the image link in each answer and download it
        item = ZhihuemojiItem()
        answers = eval(response.text)['msg']
        imgregular = re.compile('data-original="(.*?) "', re.S)
        answerregular = re.compile('data-entry-url="\\\\/question\\\\/{question_id}\\\\/answer\\\\/(.*?) "'.format(question_id=self.question_id), re.S)
        for answer in answers:
            item['answer_id'] = re.findall(answerregular, answer)[0]
            image_url = []
            for each in re.findall(imgregular, answer):
                each = each.replace('\ \'.' ')
                if each.endswith('r.jpg'):
                    image_url.append(each)
            image_url = list(set(image_url))
            for each in image_url:
                item['image_url'] = each
                self.headers['user-agent'] = self.ua.random
                self.download(requests.get(each, headers=self.headers, stream=True))
                yield item
    Download picture
    def download(self, response) :
        if response.status_code == 200:
            image = response.content
            filepath = os.path.join(self.question_id, str(len(os.listdir(self.question_id)))+'.jpg')
            with open(filepath, 'wb') as f:
                f.write(image)
Copy the code

ZhihuemojiItem() is used to store the image links we crawled and the corresponding answer ids. It is defined in the project's items.py as follows:

import scrapy


class ZhihuemojiItem(scrapy.Item):
    image_url = scrapy.Field()
    answer_id = scrapy.Field()
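With the spider and the item in place, start the crawl from the project root using the spider name defined above (depending on your settings, you may also need to set ROBOTSTXT_OBEY = False in settings.py so Scrapy doesn't skip the requests):

scrapy crawl zhihuEmoji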

OK, all done. For the complete source code, see the related files in my profile.