Anyone who has worked in Beijing (the "imperial capital") knows how hard it is to rent a decent apartment. Agencies charge a full month's rent as a broker's fee, and plenty of shady brokers use rental listings as cover for all kinds of scams. Finding a real landlord in a sea of posts is like looking for a needle in a haystack, and it means matching wits with those brokers at every turn. Here is the story of my hard-fought battle.

So, where to start? Let's pick a battlefield first. On the big classifieds rental sites, agents occupy most of the terrain; it is easy to defend and hard to attack, so I gave up on them right away. Xianyu (Idle Fish) has too few listings to be worth the effort, so I gave up on that too. I set my sights on Douban. Most people in Beijing know that Douban hosts many rental groups, populated mostly by young people, many of them subletting; most deals are signed directly with landlords, which saves the broker's fee. A quick look showed up to 90 pages of updates a day, 25 entries per page, some of them old posts being bumped back up. That is a lot of data, and there are still plenty of agents in there, but it is far better than anywhere else.

A solemn warning: when crawling, keep the request frequency under control and do not interfere with normal access to the site! If you hit it too fast, Douban will block you, so crawl with care! Also, please read the code comments carefully!

Let’s first analyze the structure of the page we want to grab. Take the famous Beijing rental group for example.

The group's discussion list is paginated; clicking through to more group discussions gives URLs like these, where the start parameter advances by 25 per page:
https://www.douban.com/group/beijingzufang/discussion?start=0
https://www.douban.com/group/beijingzufang/discussion?start=25

So we just need to fetch each page separately and then run some filtering over the data, which greatly cuts down the time spent sifting by hand. I chose to crawl only the first 20 pages, which keeps the data reasonably fresh while putting little load on the site.

OK, down to business. Being a front-end developer, I use Node for the crawling. First, pull in the necessary dependencies.

import fs from 'fs'     // Node's file system module, used to write the filtered results out as an HTML file
import path from 'path' // Node's path module, used to resolve file paths

// The following modules are not built into Node.js and need to be installed with npm

// HTTP client used to make the requests
import superagent from "superagent"
// DOM manipulation on the Node side; think of it as a Node version of jQuery, with almost identical syntax
import cheerio from "cheerio"
// A small utility that coordinates execution order through events
import eventproxy from 'eventproxy'
// async is a third-party Node module; its mapLimit is used to control request concurrency
import mapLimit from "async/mapLimit"

Then we can organize the pages we want to grab into an array

let ep = new eventproxy()  // instantiate eventProxy

let baseUrl = 'https://www.douban.com/group/beijingzufang/discussion?start=';  
let pageUrls = []  // An array of pages to fetch

let page = 20  // Number of pages to fetch
let perPageQuantity = 25   // Number of entries per page

for (let i = 0; i < page; i++) {
  pageUrls.push({
    url: baseUrl + i * perPageQuantity
  });
}

Let's briefly analyze the DOM structure of the page. All of the useful data lives in a table: the first tr is the header row, and each following tr corresponds to one post. Each tr contains four td cells holding the title, the author, the number of replies, and the last-modified time.
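To make the selectors below concrete, here is a minimal sketch of that structure loaded into cheerio. The markup is an illustrative assumption based on the selectors used later, not a copy of Douban's real page:

import cheerio from 'cheerio'

// Illustrative markup only: an .olt table with a header row and one post row of four cells
const sampleHtml = `
  <table class="olt">
    <tbody>
      <tr><td>Title</td><td>Author</td><td>Replies</td><td>Last modified</td></tr>
      <tr>
        <td><a href="https://www.douban.com/group/topic/000000/">Sunny room for rent</a></td>
        <td><a href="https://www.douban.com/people/some-user/">some-user</a></td>
        <td>3</td>
        <td>04-01 12:00</td>
      </tr>
    </tbody>
  </table>`

const $ = cheerio.load(sampleHtml)
const rows = $('.olt tbody').children().slice(1) // drop the header row
console.log(rows.eq(0).children().eq(0).children('a').text()) // "Sunny room for rent"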

Let’s start by writing an entry function that accesses all the pages to crawl and saves the data we need. You know, it’s been a while since I wrote jQuery.

function start() {
  // Fetch one listing page and pull the data out of each row
  const getPageInfo = (pageItem, callback) => {
    // Set a random access interval
    let delay = parseInt((Math.random() * 30000000) % 1000, 10)
    superagent.get(pageItem.url)
      // Pretend to be a browser
      .set('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36')
      // If you crawl too aggressively you will probably be blocked by Douban; then you need to simulate a logged-in state
      // .set('Cookie', '')
      .end((err, pres) => {
        let $ = cheerio.load(pres.text) // Create a jQuery-like object with cheerio

        let itemList = $('.olt tbody').children().slice(1, 26) // Take every row of the table and drop the header row

        // Iterate over each post on the page
        for (let i = 0; i < itemList.length; i++) {
          let item = itemList.eq(i).children()

          let title = item.eq(0).children('a').text() || ''      // Title
          let url = item.eq(0).children('a').attr('href') || ''  // Link to the detail page
          // let author = item.eq(1).children('a').attr('href').replace('https://www.douban.com/people', '').replace(/\//g, '') || ''  // Author id
          // The author's nickname is used instead of the id because some agents register many accounts;
          // with this little data, the chance of two genuine posters sharing a nickname is negligible
          let author = item.eq(1).children('a').text() || ''
          let markSum = item.eq(2).text()    // Number of replies
          let lastModify = item.eq(3).text() // Last modified time

          let data = {
            title,
            url,
            author,
            markSum,
            lastModify
          }
          // ep.emit('event name', data): every processed row is sent through the preparePage event, mainly for counting
          ep.emit('preparePage', data)
        }
        setTimeout(() => {
          callback(null, pageItem.url)
        }, delay)
      })
  }

We use mapLimit to control the request concurrency; see the official async documentation for the details of mapLimit.

  mapLimit(pageUrls, 2, function (item, callback) {
    getPageInfo(item, callback)
  }, function (err) {
    if (err) {
      console.log(err)
    }
    console.log('Captured')
  })
}
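If mapLimit is unfamiliar, here is a tiny standalone sketch (not part of the crawler) of what the concurrency limit of 2 means: at most two of the asynchronous tasks run at any moment, and as one finishes the next one starts.

import mapLimit from 'async/mapLimit'

// Double each number, running at most 2 of the asynchronous tasks at a time
mapLimit([1, 2, 3, 4, 5], 2, (n, callback) => {
  setTimeout(() => callback(null, n * 2), 100)
}, (err, results) => {
  if (err) return console.error(err)
  console.log(results) // [2, 4, 6, 8, 10]
})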

Now a word about the filtering strategy. First, filter the titles: drop posts that mention locations you don't want and the phrases agents use most often; you can also add your own keywords for targeted screening. Next, count the number of posts per author: if someone shows up more than 5 times in the crawled pages, treat them as an agent. Finally, if a post has a huge number of replies, it is either an old post being bumped back up or someone constantly pushing it to the top; I set the threshold at 100. A normal landlord would not bump a post that obsessively, since a good room is never hard to rent out; it is far more likely an agent bumping old posts every day. And even if it really is a great room that everyone is watching, your odds of actually getting it are slim, so just filter it out.

// Three global variables to hold the data
let result = []    // The final filtered results
let authorMap = {} // Post count per author, stored as object properties
let intermediary = [] // The agent blacklist; you could also persist this and filter against it directly next time

// Remember the ep.emit() above, which fires once per captured row.
// ep.after('event name', count, callback) runs the callback once that event has fired the given number of times,
// i.e. after all 20 * 25 (pages * rows per page) rows have been captured.
ep.after('preparePage', pageUrls.length * perPageQuantity, function (data) {
    // Keywords we do NOT want in a title, separated by '|': locations to exclude, phrases agents commonly use, etc.
    // (reconstructed examples; substitute your own)
    let filterWords = /押一付一|月付|短租|贝壳|有房|6号线|六号线/
    // The keyword we DO want, e.g. an area name; if you don't need one, it can be set to a space
    let keyWords = /西二旗/
    
    // We first count the number of posts per person and save it as an object property. The count is implemented using the property that the object property name cannot be repeated.
    data.forEach(item => {
      authorMap[item.author] = authorMap[item.author] ? ++authorMap[item.author] : 1
      if (authorMap[item.author] > 4) {
        intermediary.push(item.author) // Anyone with more than 5 posts goes on the agent blacklist
      }
    })
    // Deduplicate the blacklist with a Set
    intermediary = [...new Set(intermediary)]
    // Iterate over the captured data again
    data.forEach(item => {
      // Discard posts with too many replies
      if (item.markSum > 100) {
        console.log('Too many comments, discard')
        return
      }
      if (filterWords.test(item.title)) {
        console.log('Headline with unwanted words')
        return
      }
      if(intermediary.includes(item.author)){
        console.log('Too many posts, discard')
        return
      }
      // Only items that survive the filters above reach this final step; titles matching the desired keyword are added to the results
      if (keyWords.test(item.title)) {
        result.push(item)
      }
    })
    
    // ...
});

At this point we have the result list we want, but printing it to the console is not very pleasant to read, so let's generate an HTML page instead. All it takes is some simple string assembly.

// Set the HTML template
let top = '<html lang="en">' +
      '<head>' +
      '<meta charset="UTF-8">' +
      '<style>' +
      '.listItem{ display:block; margin-top:10px; text-decoration:none; } ' +
      '.markSum{ color:red; } ' +
      '.lastModify{ color:#aaaaaa; }' +
      '</style>' +
      '<title>Filter result</title>' +
      '</head>' +
      '<body>' +
      '<div>'
let bottom = '</div> </body> </html>'

// Assemble valid data HTML
let content = ' '

result.forEach(function (item) {
  content += `<a class="listItem" href="${item.url}" target="_blank">${item.title}_____<span class="markSum">${item.markSum}</span>____<span class="lastModify">${item.lastModify}</span></a>`
})

let final = top + content + bottom
  
// Finally, output the generated HTML to the specified file directory
fs.writeFile(path.join(__dirname, '../tmp/result.html'), final, function (err) {
  if (err) {
    return console.error(err)
  }
  console.log('success')
})

Finally, we just need to expose the entry function

export default {
  start
}


Since the code uses ES6 module syntax, we need babel-node to run it. First install babel-cli, either globally or locally: npm i babel-cli -g. Also don't forget the dependencies installed at the beginning of this article.
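Note that babel-cli by itself does not transform import statements; a preset is needed as well. Here is a minimal setup sketch, assuming Babel 6 with babel-preset-env installed as a dev dependency (npm i babel-preset-env --save-dev):

// .babelrc in the project root
{
  "presets": ["env"]
}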

Finally, import the script above in the index.js file and run babel-node index.js. Exciting success at last.

// index.js
import douban from './src/douban.js'
douban.start()


Finally, open the generated HTML and take a look. The number in red is the reply count. Clicking a title jumps straight to the corresponding Douban page, and since visited links change color, it is easy to tell at a glance which posts you have already looked at.

With just a few simple filter conditions, the number of entries dropped from 500 to 138, which shortens the manual screening time considerably. Add a few more specific keywords and the results get even more accurate!

Well, it's getting late, so that's all for today. If house hunting really proves too hard, you can still go to Lianjia; big agencies like that are more reliable and save you a lot of worry. Finally, I wish you all a warm nest of your own!