Try this Node crawler to download 10,000 photos of a certain kind in one click
January 29, 2024
by 賴俊賢
Preface
On some day of some month of some year, probably during a season when my enthusiasm ran as hot as the weather, I wrote a small tool: a one-click downloader for pretty-girl pictures on Zhihu. As for how I felt at the time, I can only reconstruct it now, as I write this article.
Solemn declaration:
This little tool was written purely for learning purposes; there is no malicious intent.
This little tool was written purely for learning purposes; there is no malicious intent.
This little tool was written purely for learning purposes; there is no malicious intent.
If you really just want to use it, you can also click here.
Background
August. The sky was blue, the sun hung like a ball of fire, and the clouds seemed to have melted away. There was not a breath of wind, and the earth was like a giant steamer.
Hot, irritable, bored. I accidentally opened Zhihu again, landed on the question "What is it like to photograph good-looking girls?", and scrolled through a big pile of pretty girls, enough to make anyone swoon. As a former tech loser, I'm sure you're thinking exactly what I was thinking right then: if only you could stash them all into that "Learning Lessons" folder.
How, though? Right-click and save each picture? No, no, no, far too inefficient, and your right hand would ache from all the clicking. Wait, I almost forgot, I'm a fucking programmer! A programmer! A programmer! Shouldn't this kind of thing be left to a program?
“Just do it.”
Requirement
The requirement is very simple: automatically download the pictures from all the answers under a Zhihu post to local disk.
Analysis
So as long as we sort out the following two things, we're basically done.
1. Image links

Get the links of the pictures uploaded by each answerer; "all the pictures" simply means collecting those links from every answer under the post.

2. Downloading the pictures

My tentative guess is that a mature library exists for this, where I only need to pass in the image link and a target directory to complete the download. If no such library turns up, I'll figure out how native Node.js does it.
Get image link
When we open the Chrome console, we see a bunch of requests fire as the page loads, but one containing "answers" looks suspicious. Does it return the answers?
To verify this idea, before looking at the specific response of that request, let's first click the "view all 948 answers" button on the page. If we guessed right, the "answers" request should fire again and bring back more answer data.
When you click the button, "answers" is indeed sent again, and the response looks something like this:
{
  data: [
    {
      // Information about the answerer
      author: {
        // ...
      },
      // The answer content
      content: '...',
      // Post description
      question: {},
      // ...and so on
    },
    {
      // ...
    }
  ],
  paging: {
    // Whether this is the last page
    is_end: false,
    // Whether this is the first page
    is_start: false,
    // API address of the next page
    next: "https://www.zhihu.com/api/v4/questions/49364343/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=5&offset=8&sort_by=default",
    // API address of the previous page
    previous: "https://www.zhihu.com/api/v4/questions/49364343/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=5&offset=0&sort_by=default",
    // Total number of answers
    totals: 948
  }
}
From the response we get the total number of answers, plus the content field of each answerer returned by the current request. The image address we want is in the data-original attribute of the img tags under the noscript tag. That takes care of requirement 1.
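For example (a schematic sketch; the HTML snippet and URLs below are made up for illustration), pulling those data-original values out of an answer's content string can look like this:

// Schematic example: each answer's `content` field is an HTML string with figures like this
const content = `
  <figure>
    <noscript>
      <img src="https://pic1.zhimg.com/xxx_b.jpg" data-original="https://pic1.zhimg.com/xxx_r.jpg">
    </noscript>
  </figure>`

// Collect every data-original value
const imgs = []
content.replace(/<img[^>]+?data-original="([^"]+?)"/g, (m, url) => imgs.push(url))
console.log(imgs) // [ 'https://pic1.zhimg.com/xxx_r.jpg' ]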
That gives us about half of the information we need. The other half is: how do we get all of the answers? Recall that the response also contains paging, which tells us how to fetch the next batch of content:
// Whether this is the last page
is_end: false,
// API address of the next page
next: "https://www.zhihu.com/api/v4/questions/49364343/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=5&offset=8&sort_by=default",
// Total number of answers
totals: 948
The query string of the request carries the following parameters:
{
  // This parameter may be some Zhihu backend validation
  include: 'xxxx',
  // Paging offset
  offset: 3,
  // Number of answers per page
  limit: 5,
  // Sort order
  sort_by: 'default'
}
So it seems that if we set offset to 0 and limit to the total, we'd get all the data in one request? After trying it, it turns out we can get data for at most 20 answers per request, so instead let's fetch everything through a series of requests driven by is_end and next.
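Here is a minimal sketch of that loop, assuming request-promise (which the implementation below also uses) and the response shape we saw above:

// Follow paging.next until paging.is_end is true, collecting the answers
let rp = require('request-promise')

async function fetchAllAnswers (firstUri) {
  let uri = firstUri
  let answers = []
  let isEnd = false
  while (!isEnd) {
    const { paging, data } = await rp({ uri, json: true })
    answers = answers.concat(data)
    isEnd = paging.is_end
    uri = paging.next
  }
  return answers
}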
Downloading online pictures
For requirement 2, a quick Google search found that there is indeed such a library: request. For example, downloading an online image to local disk takes just the following code:

const fs = require('fs')
const request = require('request')

request('http://google.com/doodle.png')
  .pipe(fs.createWriteStream('doodle.png'))
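Incidentally, the request package has since been deprecated. If you would rather avoid the dependency, a rough equivalent using Node's built-in http module looks like this (a sketch, reusing the same illustrative URL as above):

const http = require('http')
const fs = require('fs')

// Fetch the image and stream the response body straight to a file
http.get('http://google.com/doodle.png', (res) => {
  res.pipe(fs.createWriteStream('doodle.png'))
})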
So with both 1 and 2 sorted out, all that's left is to put them together and write the code.
Preview
Before walking through the implementation, let's look at the actual download effect and the basic usage!
Usage
require('./crawler')({
  // Where the images are saved
  dir: './imgs',
  // The Zhihu post id; e.g. for https://www.zhihu.com/question/49364343/answer/157907464, enter 49364343
  questionId: '34078228',
  // Once requests to Zhihu pass a certain threshold, Zhihu decides you are a crawler
  // (i.e. blocks your IP). If you have a proxy server that forwards the requests,
  // you can keep downloading.
  proxyUrl: 'https://www.zhihu.com'
})
Don't worry about the proxyUrl field for now; what it does is explained in detail later.
Implementation
Let's look at crawler.js:
let path = require('path')
let fs = require('fs')
let rp = require('request-promise')

let originUrl = 'https://www.zhihu.com'

class Crawler {
  constructor (options) {
    // The constructor mainly initializes a few properties
    const { dir = './imgs', proxyUrl = originUrl, questionId = '49364343', offset = 0, limit = 100, timeout = 10000 } = options
    // The default origin for requests to Zhihu in non-proxy mode: https://www.zhihu.com
    this.originUrl = originUrl
    // The actual request origin in proxy mode, also https://www.zhihu.com by default.
    // When this machine's IP is blocked, requests go to Zhihu through the proxy server,
    // and we get the data back from the proxy server.
    this.proxyUrl = proxyUrl
    // The final request URL
    this.uri = `${proxyUrl}/api/v4/questions/${questionId}/answers?limit=${limit}&offset=${offset}&include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&sort_by=default`
    // Whether we've reached the last page of data
    this.isEnd = false
    // The Zhihu post id
    this.questionId = questionId
    // Request timeout (currently the same for fetching answers and downloading images)
    this.timeout = timeout
    // Image links parsed out of the answers
    this.imgs = []
    // Root directory for image downloads
    this.dir = dir
    // The final download directory, derived from dir and questionId
    this.folderPath = ''
    // Number of images downloaded so far
    this.downloaded = 0
    // Kick everything off
    this.init()
  }

  async init () {
    if (this.isEnd) {
      console.log('All downloads have been completed, please enjoy')
      return
    }
    // Fetch a batch of answers
    let { isEnd, uri, imgs = [], question } = await this.getAnswers()
    this.isEnd = isEnd
    this.uri = uri
    this.imgs = imgs
    this.downloaded = 0
    this.question = question
    console.log(imgs, imgs.length)
    // Create the image download directory
    this.createFolder()
    // Download every image in this batch
    this.downloadAllImg(() => {
      // Once all images from the current request are downloaded, request the next batch
      if (this.downloaded >= this.imgs.length) {
        setTimeout(() => {
          console.log('Rest for three seconds and continue with the next batch')
          this.init()
        }, 3000)
      }
    })
  }

  // Fetch the answers
  async getAnswers () {
    let { uri, timeout } = this
    let response = {}
    try {
      const { paging, data } = await rp({ uri, json: true, timeout })
      const { is_end: isEnd, next } = paging
      const { question } = Object(data[0])
      // Concatenate this batch of answers into one content string
      const content = data.reduce((content, it) => content + it.content, '')
      // Parse the image URLs out of the content
      const imgs = this.matchImg(content)
      response = { isEnd, uri: next.replace(originUrl, this.proxyUrl), imgs, question }
    } catch (error) {
      console.log('Error calling the Zhihu API, please try again')
      console.log(error)
    }
    return response
  }

  // Find all image links in a string
  matchImg (content) {
    let imgs = []
    // Match the value of the data-original attribute in img tags
    let matchImgOriginRe = /<img[^>]+?data-original="([^"]+?)"/g
    content.replace(matchImgOriginRe, ($0, $1) => imgs.push($1))
    // Deduplicate
    return [...new Set(imgs)]
  }

  // Create the directories
  createFolder () {
    let { dir, questionId } = this
    let folderPath = `${dir}/${questionId}`
    let dirs = [dir, folderPath]
    dirs.forEach((dir) => !fs.existsSync(dir) && fs.mkdirSync(dir))
    this.folderPath = folderPath
  }

  // Download every image
  downloadAllImg (cb) {
    let { folderPath, timeout } = this
    this.imgs.forEach((imgUrl) => {
      let fileName = path.basename(imgUrl)
      let filePath = `${folderPath}/${fileName}`
      rp({ uri: imgUrl, timeout })
        .on('error', () => {
          console.log(`${imgUrl} download error`)
          this.downloaded += 1
          cb()
        })
        .pipe(fs.createWriteStream(filePath))
        .on('close', () => {
          this.downloaded += 1
          console.log(`${imgUrl} download complete`)
          cb()
        })
    })
  }
}

module.exports = (payload = {}) => {
  return new Crawler(payload)
}
The implementation is fairly simple, and with the comments you should be able to follow it quickly.
IP banned
I was busy downloading images from several posts with the crawler.js above, until the response became: "The system has detected abnormal traffic from your account or IP address. Please verify that these requests are not automated."
Done for. Zhihu won't let me request anymore??
Done for. Zhihu won't let me request anymore??
Done for. Zhihu won't let me request anymore??
After running for quite a while, I was finally banned as a crawler. I looked online for solutions, such as "how do crawlers deal with IP bans?"
There are basically two ideas:
1. Slow down the crawl rate to reduce the load on the target site; but this reduces how much you can fetch per unit of time.
2. Break through the anti-crawler mechanism and keep crawling at high frequency by using proxy IPs and similar means; but this requires multiple stable proxy IP addresses.
Since this machine's IP is blocked and won't change, requesting Zhihu directly is no longer possible, but we can try option 2: use a proxy server.
Then it occurred to me that I bought a server last year, which so far was only hosting a résumé program anyway.
It can't yet do the fancy thing where, whenever one proxy gets blocked, you switch to another proxy server in time. But at least it lets me download pictures again through a proxy server...
Carrying on through a proxy
The proxy program proxy.js runs on that server. It listens for requests matching the path /proxy*, and when one arrives, the http-proxy middleware forwards it to Zhihu to pull the data, then sends the response back to our local machine. The flow is: local machine → proxy server (/proxy*) → Zhihu → proxy server → local machine.
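The post doesn't list proxy.js itself; a minimal sketch of such a forwarder, assuming Express with the http-proxy-middleware package (the port and hostnames are illustrative), could look like this:

// proxy.js - an assumed minimal implementation, not the original source
const express = require('express')
const { createProxyMiddleware } = require('http-proxy-middleware')

const app = express()

// Forward anything under /proxy to Zhihu, stripping the /proxy prefix, so that
// GET /proxy/api/v4/questions/:id/answers hits www.zhihu.com/api/v4/questions/:id/answers
app.use('/proxy', createProxyMiddleware({
  target: 'https://www.zhihu.com',
  changeOrigin: true,            // rewrite the Host header to the target
  pathRewrite: { '^/proxy': '' } // drop the /proxy prefix before forwarding
}))

app.listen(8080, () => console.log('proxy listening on 8080'))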
So the original request path (the long include query parameter is omitted here to keep things short) www.zhihu.com/api/v4/ques… now goes through the proxy as ZZZ/proxy/api/v4/ques…
Here ZZZ can be the domain name, or the IP address plus port, of your own proxy server.
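Putting the two together, pointing the crawler at such a proxy would look like this (hostname and port are illustrative):

require('./crawler')({
  dir: './imgs',
  questionId: '49364343',
  // The crawler builds its URLs as `${proxyUrl}/api/v4/...`, so include the /proxy prefix here
  proxyUrl: 'http://your-server.example.com:8080/proxy'
})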
In this way we indirectly sidestep the embarrassment of Zhihu's IP ban, but it is only a stopgap; after all, the proxy server's IP can be blocked too.
The end
I wrote this tool purely for learning; there is no other malicious intent. If you really just want to use it, you can also click here.