Article | so-and-so rice
Source: Python Technology (ID: pythonall)

There are plenty of Zhihu topics about looks and figure, some with hundreds or thousands of answers and tens of thousands of followers and views. Since we tend to burn quite a bit of time browsing these topics while slacking off at work, we might as well write a small Python script that downloads the images from the answers and saves them locally.
Request URL Analysis
First, open the F12 developer tools and note that the image URLs follow the format https://pic4.zhimg.com/80/xxxx.jpg?source=xxx.
Scroll down the page until a request with the limit and offset parameters shows up.
Check the Response panel to confirm it contains the image URLs; each one sits in the data-original attribute.
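As a quick sanity check of that analysis, a trimmed-down request like the sketch below can fetch one page from the answers API and confirm that each answer's content HTML carries data-original attributes. The question ID 12345678 is just a placeholder, and the include parameter is shortened here for readability; the full value appears in the script later on.

import requests

# Placeholder question ID; the include parameter is trimmed to just the content field.
url = ('https://www.zhihu.com/api/v4/questions/12345678/answers'
       '?include=data%5B*%5D.content&offset=0&limit=10&sort_by=default&platform=desktop')
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(url, headers=headers)
for answer in resp.json().get('data', []):
    # Each answer's HTML keeps the original image URL in the data-original attribute
    print('data-original' in answer['content'])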
Extract the URL of the image
As the figure above shows, the image addresses are stored in the data-original attribute inside each answer's content field.
The following code collects the image addresses and writes them to a file.
import re
import requests
import os
import urllib.request
import ssl
from urllib.parse import urlsplit
from os.path import basename
import json
ssl._create_default_https_context = ssl._create_unverified_context
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
    'Accept-Encoding': 'gzip, deflate'
}
def get_image_url(qid, title):
    answers_url = 'https://www.zhihu.com/api/v4/questions/' + str(qid) + '/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset={}&limit=10&sort_by=default&platform=desktop'
    offset = 0
    session = requests.Session()
    while True:
        page = session.get(answers_url.format(offset), headers=headers)
        json_text = json.loads(page.text)
        answers = json_text['data']
        offset += 10
        if not answers:
            print('Get picture address done')
            return
        pic_re = re.compile('data-original="(.*?)"', re.S)
        for answer in answers:
            tmp_list = []
            pic_urls = re.findall(pic_re, answer['content'])
            for item in pic_urls:
                # remove the escape character \
                pic_url = item.replace("\\", "")
                pic_url = pic_url.split('?')[0]
                # deduplicate
                if pic_url not in tmp_list:
                    tmp_list.append(pic_url)
            for pic_url in tmp_list:
                # keep only the full-size images (URLs ending in r.jpg)
                if pic_url.endswith('r.jpg'):
                    print(pic_url)
                    write_file(title, pic_url)

def write_file(title, pic_url):
    file_name = title + '.txt'
    f = open(file_name, 'a')
    f.write(pic_url + '\n')
    f.close()
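For reference, a minimal call might look like this; the question ID and title below are made-up placeholders, not values from the original post.

# Hypothetical usage: 12345678 is a placeholder question ID,
# 'example-question' becomes the name of the .txt file holding the URLs.
get_image_url(12345678, 'example-question')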
Example results:
Download the pictures
The following code reads the image addresses from the file and downloads them.
def read_file(title):
    file_name = title + '.txt'
    pic_urls = []
    # check whether the file exists
    if not os.path.exists(file_name):
        return pic_urls
    with open(file_name, 'r') as f:
        for line in f:
            url = line.replace("\n", "")
            if url not in pic_urls:
                pic_urls.append(url)
    print("There are {} unique urls in the file".format(len(pic_urls)))
    return pic_urls

def download_pic(pic_urls, title):
    # create the download folder
    if not os.path.exists(title):
        os.makedirs(title)
    error_pic_urls = []
    success_pic_num = 0
    repeat_pic_num = 0
    index = 1
    for url in pic_urls:
        file_name = os.sep.join((title, basename(urlsplit(url)[2])))
        if os.path.exists(file_name):
            print("Picture {} already exists".format(file_name))
            index += 1
            repeat_pic_num += 1
            continue
        try:
            urllib.request.urlretrieve(url, file_name)
            success_pic_num += 1
            index += 1
            print("Download {} complete! ({}/{})".format(file_name, index, len(pic_urls)))
        except:
            print("Download {} failed! ({}/{})".format(file_name, index, len(pic_urls)))
            error_pic_urls.append(url)
            index += 1
            continue
    print("All pictures downloaded! (success: {} / repeat: {} / failure: {})".format(success_pic_num, repeat_pic_num, len(error_pic_urls)))
    if len(error_pic_urls) > 0:
        print('The addresses of the failed pictures are printed below:')
        for error_url in error_pic_urls:
            print(error_url)
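The article defines the helpers but never shows them wired together, so here is one way the whole flow could run end to end; the question ID and title are placeholders.

if __name__ == '__main__':
    qid = 12345678              # placeholder Zhihu question ID
    title = 'example-question'  # names the .txt file and the download folder

    get_image_url(qid, title)      # step 1: collect image URLs into <title>.txt
    pic_urls = read_file(title)    # step 2: read the deduplicated URLs back
    download_pic(pic_urls, title)  # step 3: download everything into ./<title>/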
Conclusion
This article walks through building a small Zhihu image crawler script with Python. If you found it interesting or useful, give it a like to show your support.
Code access: scan the QR code at the end of the article and reply "210325" to get the full source code.