Try this Node crawler to download 10,000 photos of a certain kind in one click
January 29, 2024
by 賴俊賢
Preface
On some day of some month of some year, probably during a season when my enthusiasm ran as hot as the weather, I wrote a small tool: a one-click downloader for pretty-girl pictures on Zhihu. As for how I felt at the time, I can only reconstruct it now, as I write this article.
Solemn declaration:
This little tool was written purely for learning purposes; there is no malicious intent.
This little tool was written purely for learning purposes; there is no malicious intent.
This little tool was written purely for learning purposes; there is no malicious intent.
If you really just want to use it, you can also click here.
Background
August. The sky was blue, the sun hung like a ball of fire, and the clouds seemed to have melted away. There was not a breath of wind, and the earth was like a giant steamer.
Hot, irritable, bored. I accidentally opened Zhihu again, landed on the question "What is it like to photograph good-looking girls?", and scrolled through a big pile of pretty girls, enough to make anyone swoon. As a former tech loser, I'm sure you're thinking exactly what I was thinking right then: if only you could stash them all into that "Learning Lessons" folder.
How, though? Right-click and save each picture? No, no, no, far too inefficient, and your right hand would ache from all the clicking. Wait, I almost forgot, I'm a fucking programmer! A programmer! A programmer! Shouldn't this kind of thing be left to a program?
“Just do it.”
Requirement
The requirement is very simple: automatically download the pictures from all the answers under a Zhihu post to local disk.
Analysis
So as long as we sort out the following two things, we're basically done.
1. Image links

Get the links of the pictures uploaded by each answerer; "all the pictures" simply means collecting those links from every answer under the post.

2. Downloading the pictures

My tentative guess is that a mature library exists for this, where I only need to pass in the image link and a target directory to complete the download. If no such library turns up, I'll figure out how native Node.js does it.
Get image link
When we open the Chrome console, we see a bunch of requests fire as the page loads, but one containing "answers" looks suspicious. Does it return the answers?
To verify this idea, before looking at the specific response of that request, let's first click the "view all 948 answers" button on the page. If we guessed right, the "answers" request should fire again and bring back more answer data.
When you click the button, "answers" is indeed sent again, and the response looks something like this:
{
  data: [
    {
      // Information about the answerer
      author: {
        // ...
      },
      // The answer content
      content: '...',
      // Post description
      question: {},
      // ...and so on
    },
    {
      // ...
    }
  ],
  paging: {
    // Whether this is the last page
    is_end: false,
    // Whether this is the first page
    is_start: false,
    // API address of the next page
    next: "https://www.zhihu.com/api/v4/questions/49364343/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=5&offset=8&sort_by=default",
    // API address of the previous page
    previous: "https://www.zhihu.com/api/v4/questions/49364343/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=5&offset=0&sort_by=default",
    // Total number of answers
    totals: 948
  }
}
From the response we get the total number of answers, plus the content field of each answerer returned by the current request. The image address we want is in the data-original attribute of the img tags under the noscript tag. That takes care of requirement 1.
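For example (a schematic sketch; the HTML snippet and URLs below are made up for illustration), pulling those data-original values out of an answer's content string can look like this:

// Schematic example: each answer's `content` field is an HTML string with figures like this
const content = `
  <figure>
    <noscript>
      <img src="https://pic1.zhimg.com/xxx_b.jpg" data-original="https://pic1.zhimg.com/xxx_r.jpg">
    </noscript>
  </figure>`

// Collect every data-original value
const imgs = []
content.replace(/<img[^>]+?data-original="([^"]+?)"/g, (m, url) => imgs.push(url))
console.log(imgs) // [ 'https://pic1.zhimg.com/xxx_r.jpg' ]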
That gives us about half of the information we need. The other half is: how do we get all of the answers? Recall that the response also contains paging, which tells us how to fetch the next batch of content:
// Whether this is the last page
is_end: false,
// API address of the next page
next: "https://www.zhihu.com/api/v4/questions/49364343/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=5&offset=8&sort_by=default",
// Total number of answers
totals: 948
The query string of the request carries the following parameters:
{
  // This parameter may be some Zhihu backend validation
  include: 'xxxx',
  // Paging offset
  offset: 3,
  // Number of answers per page
  limit: 5,
  // Sort order
  sort_by: 'default'
}
So it seems that if we set offset to 0 and limit to the total, we'd get all the data in one request? After trying it, it turns out we can get data for at most 20 answers per request, so instead let's fetch everything through a series of requests driven by is_end and next.
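Here is a minimal sketch of that loop, assuming request-promise (which the implementation below also uses) and the response shape we saw above:

// Follow paging.next until paging.is_end is true, collecting the answers
let rp = require('request-promise')

async function fetchAllAnswers (firstUri) {
  let uri = firstUri
  let answers = []
  let isEnd = false
  while (!isEnd) {
    const { paging, data } = await rp({ uri, json: true })
    answers = answers.concat(data)
    isEnd = paging.is_end
    uri = paging.next
  }
  return answers
}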
Downloading online pictures
For requirement 2, a quick Google search found that there is indeed such a library: request. For example, downloading an online image to local disk takes just the following code:

const fs = require('fs')
const request = require('request')

request('http://google.com/doodle.png')
  .pipe(fs.createWriteStream('doodle.png'))
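Incidentally, the request package has since been deprecated. If you would rather avoid the dependency, a rough equivalent using Node's built-in http module looks like this (a sketch, reusing the same illustrative URL as above):

const http = require('http')
const fs = require('fs')

// Fetch the image and stream the response body straight to a file
http.get('http://google.com/doodle.png', (res) => {
  res.pipe(fs.createWriteStream('doodle.png'))
})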
So with both 1 and 2 sorted out, all that's left is to put them together and write the code.
Preview
Before walking through the implementation, let's look at the actual download effect and the basic usage!
Usage
require('./crawler')({
  // Where the images are saved
  dir: './imgs',
  // The Zhihu post id; e.g. for https://www.zhihu.com/question/49364343/answer/157907464, enter 49364343
  questionId: '34078228',
  // Once requests to Zhihu pass a certain threshold, Zhihu decides you are a crawler
  // (i.e. blocks your IP). If you have a proxy server that forwards the requests,
  // you can keep downloading.
  proxyUrl: 'https://www.zhihu.com'
})
Don't worry about the proxyUrl field for now; what it does is explained in detail later.
Implementation
Let's look at crawler.js:
let path = require('path')
let fs = require('fs')
let rp = require('request-promise')

let originUrl = 'https://www.zhihu.com'

class Crawler {
  constructor (options) {
    // The constructor mainly initializes a few properties
    const { dir = './imgs', proxyUrl = originUrl, questionId = '49364343', offset = 0, limit = 100, timeout = 10000 } = options
    // The default origin for requests to Zhihu in non-proxy mode: https://www.zhihu.com
    this.originUrl = originUrl
    // The actual request origin in proxy mode, also https://www.zhihu.com by default.
    // When this machine's IP is blocked, requests go to Zhihu through the proxy server,
    // and we get the data back from the proxy server.
    this.proxyUrl = proxyUrl
    // The final request URL
    this.uri = `${proxyUrl}/api/v4/questions/${questionId}/answers?limit=${limit}&offset=${offset}&include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&sort_by=default`
    // Whether we've reached the last page of data
    this.isEnd = false
    // The Zhihu post id
    this.questionId = questionId
    // Request timeout (currently the same for fetching answers and downloading images)
    this.timeout = timeout
    // Image links parsed out of the answers
    this.imgs = []
    // Root directory for image downloads
    this.dir = dir
    // The final download directory, derived from dir and questionId
    this.folderPath = ''
    // Number of images downloaded so far
    this.downloaded = 0
    // Kick everything off
    this.init()
  }

  async init () {
    if (this.isEnd) {
      console.log('All downloads have been completed, please enjoy')
      return
    }
    // Fetch a batch of answers
    let { isEnd, uri, imgs = [], question } = await this.getAnswers()
    this.isEnd = isEnd
    this.uri = uri
    this.imgs = imgs
    this.downloaded = 0
    this.question = question
    console.log(imgs, imgs.length)
    // Create the image download directory
    this.createFolder()
    // Download every image in this batch
    this.downloadAllImg(() => {
      // Once all images from the current request are downloaded, request the next batch
      if (this.downloaded >= this.imgs.length) {
        setTimeout(() => {
          console.log('Rest for three seconds and continue with the next batch')
          this.init()
        }, 3000)
      }
    })
  }

  // Fetch the answers
  async getAnswers () {
    let { uri, timeout } = this
    let response = {}
    try {
      const { paging, data } = await rp({ uri, json: true, timeout })
      const { is_end: isEnd, next } = paging
      const { question } = Object(data[0])
      // Concatenate this batch of answers into one content string
      const content = data.reduce((content, it) => content + it.content, '')
      // Parse the image URLs out of the content
      const imgs = this.matchImg(content)
      response = { isEnd, uri: next.replace(originUrl, this.proxyUrl), imgs, question }
    } catch (error) {
      console.log('Error calling the Zhihu API, please try again')
      console.log(error)
    }
    return response
  }

  // Find all image links in a string
  matchImg (content) {
    let imgs = []
    // Match the value of the data-original attribute in img tags
    let matchImgOriginRe = /<img[^>]+?data-original="([^"]+?)"/g
    content.replace(matchImgOriginRe, ($0, $1) => imgs.push($1))
    // Deduplicate
    return [...new Set(imgs)]
  }

  // Create the directories
  createFolder () {
    let { dir, questionId } = this
    let folderPath = `${dir}/${questionId}`
    let dirs = [dir, folderPath]
    dirs.forEach((dir) => !fs.existsSync(dir) && fs.mkdirSync(dir))
    this.folderPath = folderPath
  }

  // Download every image
  downloadAllImg (cb) {
    let { folderPath, timeout } = this
    this.imgs.forEach((imgUrl) => {
      let fileName = path.basename(imgUrl)
      let filePath = `${folderPath}/${fileName}`
      rp({ uri: imgUrl, timeout })
        .on('error', () => {
          console.log(`${imgUrl} download error`)
          this.downloaded += 1
          cb()
        })
        .pipe(fs.createWriteStream(filePath))
        .on('close', () => {
          this.downloaded += 1
          console.log(`${imgUrl} download complete`)
          cb()
        })
    })
  }
}

module.exports = (payload = {}) => {
  return new Crawler(payload)
}
The implementation is fairly simple, and with the comments you should be able to follow it quickly.
IP banned
I was busy downloading images from several posts with the crawler.js above, until the response became: "The system has detected abnormal traffic from your account or IP address. Please verify that these requests are not automated."
Done for. Zhihu won't let me request anymore??
Done for. Zhihu won't let me request anymore??
Done for. Zhihu won't let me request anymore??
After running for quite a while, I was finally banned as a crawler. I looked online for solutions, such as "how do crawlers deal with IP bans?"
There are basically two ideas:
1. Slow down the crawl rate to reduce the load on the target site; but this reduces how much you can fetch per unit of time.
2. Break through the anti-crawler mechanism and keep crawling at high frequency by using proxy IPs and similar means; but this requires multiple stable proxy IP addresses.
Since this machine's IP is blocked and won't change, requesting Zhihu directly is no longer possible, but we can try option 2: use a proxy server.
Then it occurred to me that I bought a server last year, which so far was only hosting a résumé program anyway.
It can't yet do the fancy thing where, whenever one proxy gets blocked, you switch to another proxy server in time. But at least it lets me download pictures again through a proxy server...
Carrying on through a proxy
The proxy program proxy.js runs on that server. It listens for requests matching the path /proxy*, and when one arrives, the http-proxy middleware forwards it to Zhihu to pull the data, then sends the response back to our local machine. The flow is: local machine → proxy server (/proxy*) → Zhihu → proxy server → local machine.
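The post doesn't list proxy.js itself; a minimal sketch of such a forwarder, assuming Express with the http-proxy-middleware package (the port and hostnames are illustrative), could look like this:

// proxy.js - an assumed minimal implementation, not the original source
const express = require('express')
const { createProxyMiddleware } = require('http-proxy-middleware')

const app = express()

// Forward anything under /proxy to Zhihu, stripping the /proxy prefix, so that
// GET /proxy/api/v4/questions/:id/answers hits www.zhihu.com/api/v4/questions/:id/answers
app.use('/proxy', createProxyMiddleware({
  target: 'https://www.zhihu.com',
  changeOrigin: true,            // rewrite the Host header to the target
  pathRewrite: { '^/proxy': '' } // drop the /proxy prefix before forwarding
}))

app.listen(8080, () => console.log('proxy listening on 8080'))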
So the original request path (the long include query parameter is omitted here to keep things short) www.zhihu.com/api/v4/ques… now goes through the proxy as ZZZ/proxy/api/v4/ques…
Here ZZZ can be the domain name, or the IP address plus port, of your own proxy server.
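Putting the two together, pointing the crawler at such a proxy would look like this (hostname and port are illustrative):

require('./crawler')({
  dir: './imgs',
  questionId: '49364343',
  // The crawler builds its URLs as `${proxyUrl}/api/v4/...`, so include the /proxy prefix here
  proxyUrl: 'http://your-server.example.com:8080/proxy'
})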
In this way we indirectly sidestep the embarrassment of Zhihu's IP ban, but it is only a stopgap; after all, the proxy server's IP can be blocked too.
The end
I wrote this tool purely for learning; there is no other malicious intent. If you really just want to use it, you can also click here.