Article | so-and-so rice
Source: Python Technology (ID: pythonall)

There are plenty of Zhihu topics about looks and figure, some with hundreds or thousands of answers and tens of thousands of followers and views. Since we tend to burn quite a bit of time browsing these topics while slacking off at work, we might as well write a small Python script that downloads the images from the answers and saves them locally.
Request URL Analysis
First, open the F12 developer tools and note that the image URLs follow the format https://pic4.zhimg.com/80/xxxx.jpg?source=xxx.
Scroll down the page until a request with the limit and offset parameters shows up.
Check the Response panel to confirm it contains the image URLs; each one sits in the data-original attribute.
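As a quick sanity check of that analysis, a trimmed-down request like the sketch below can fetch one page from the answers API and confirm that each answer's content HTML carries data-original attributes. The question ID 12345678 is just a placeholder, and the include parameter is shortened here for readability; the full value appears in the script later on.

import requests

# Placeholder question ID; the include parameter is trimmed to just the content field.
url = ('https://www.zhihu.com/api/v4/questions/12345678/answers'
       '?include=data%5B*%5D.content&offset=0&limit=10&sort_by=default&platform=desktop')
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(url, headers=headers)
for answer in resp.json().get('data', []):
    # Each answer's HTML keeps the original image URL in the data-original attribute
    print('data-original' in answer['content'])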
Extract the URL of the image
As the figure above shows, the image addresses are stored in the data-original attribute inside each answer's content field.
The following code collects the image addresses and writes them to a file.
import re
import requests
import os
import urllib.request
import ssl
from urllib.parse import urlsplit
from os.path import basename
import json
ssl._create_default_https_context = ssl._create_unverified_context
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
    'Accept-Encoding': 'gzip, deflate'
}
def get_image_url(qid, title):
    answers_url = 'https://www.zhihu.com/api/v4/questions/' + str(qid) + '/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset={}&limit=10&sort_by=default&platform=desktop'
    offset = 0
    session = requests.Session()
    while True:
        page = session.get(answers_url.format(offset), headers=headers)
        json_text = json.loads(page.text)
        answers = json_text['data']
        offset += 10
        if not answers:
            print('Get picture address done')
            return
        pic_re = re.compile('data-original="(.*?)"', re.S)
        for answer in answers:
            tmp_list = []
            pic_urls = re.findall(pic_re, answer['content'])
            for item in pic_urls:
                # remove the escape character \
                pic_url = item.replace("\\", "")
                pic_url = pic_url.split('?')[0]
                # deduplicate
                if pic_url not in tmp_list:
                    tmp_list.append(pic_url)
            for pic_url in tmp_list:
                # keep only the full-size images (URLs ending in r.jpg)
                if pic_url.endswith('r.jpg'):
                    print(pic_url)
                    write_file(title, pic_url)

def write_file(title, pic_url):
    file_name = title + '.txt'
    f = open(file_name, 'a')
    f.write(pic_url + '\n')
    f.close()
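For reference, a minimal call might look like this; the question ID and title below are made-up placeholders, not values from the original post.

# Hypothetical usage: 12345678 is a placeholder question ID,
# 'example-question' becomes the name of the .txt file holding the URLs.
get_image_url(12345678, 'example-question')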
Example results:
Download the pictures
The following code reads the image addresses from the file and downloads them.
def read_file(title):
    file_name = title + '.txt'
    pic_urls = []
    # check whether the file exists
    if not os.path.exists(file_name):
        return pic_urls
    with open(file_name, 'r') as f:
        for line in f:
            url = line.replace("\n", "")
            if url not in pic_urls:
                pic_urls.append(url)
    print("There are {} unique urls in the file".format(len(pic_urls)))
    return pic_urls

def download_pic(pic_urls, title):
    # create the download folder
    if not os.path.exists(title):
        os.makedirs(title)
    error_pic_urls = []
    success_pic_num = 0
    repeat_pic_num = 0
    index = 1
    for url in pic_urls:
        file_name = os.sep.join((title, basename(urlsplit(url)[2])))
        if os.path.exists(file_name):
            print("Picture {} already exists".format(file_name))
            index += 1
            repeat_pic_num += 1
            continue
        try:
            urllib.request.urlretrieve(url, file_name)
            success_pic_num += 1
            index += 1
            print("Download {} complete! ({}/{})".format(file_name, index, len(pic_urls)))
        except:
            print("Download {} failed! ({}/{})".format(file_name, index, len(pic_urls)))
            error_pic_urls.append(url)
            index += 1
            continue
    print("All pictures downloaded! (success: {} / repeat: {} / failure: {})".format(success_pic_num, repeat_pic_num, len(error_pic_urls)))
    if len(error_pic_urls) > 0:
        print('The addresses of the failed pictures are printed below:')
        for error_url in error_pic_urls:
            print(error_url)
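The article defines the helpers but never shows them wired together, so here is one way the whole flow could run end to end; the question ID and title are placeholders.

if __name__ == '__main__':
    qid = 12345678              # placeholder Zhihu question ID
    title = 'example-question'  # names the .txt file and the download folder

    get_image_url(qid, title)      # step 1: collect image URLs into <title>.txt
    pic_urls = read_file(title)    # step 2: read the deduplicated URLs back
    download_pic(pic_urls, title)  # step 3: download everything into ./<title>/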
Conclusion
This article walks through building a small Zhihu image crawler script with Python. If you found it interesting or useful, give it a like to show your support.
Code access: scan the QR code at the end of the article and reply "210325" to get the full source code.