Foreword

Sometimes I wonder: what do we actually code for? The job? The money? To live a good life, or to wreck our backs trying? After thinking it over, the honest verdict is fifty strokes for each side: it's all of the above. A thousand readers have a thousand Hamlets, and maybe the nicest answer is to let things take their course and have fun without chasing any practical purpose. Most of us developers are surely not content to write nothing but business requirements; even as "new era code farmers" we ought to be farmers with ideals and ambition, right? So, in the downtime the business schedule allowed, and after admiring all the fancy showpieces on Juejin (the Nuggets community), my hands told me they wanted to wave around a little too. And so, in the spirit of never sitting idle, this Duang~ of a crawler experience with Koa + Superagent + Cheerio was born.

A brief description of the project

Honestly, at the beginning this was pure curiosity about crawlers plus a bit of an itch to tinker. At first I just wanted to stand up a simple back-end service and write a bare-bones crawler demo to see whether I could pull a page down at all. But the DOM structure that came back looked so cluttered that the sense of accomplishment dropped sharply, so I decided to parse the DOM and filter out only the data I needed. After a few rounds of fiddling the filtering worked, and I wrote the data into a TXT file through Node's fs module. The first crawler experience should have ended there, but maybe too much dopamine was flowing that day: I suddenly wanted to build an API as well and get a feel for hand-rolling endpoints with Koa. So I wrote a simple front-end page, roughly debugged the endpoint with Postman, and got the data I wanted.

The technology stack

I am definitely not going to tell you that the only framework I had used before was Express, and that this was my excuse to finally try Koa.

  • superagent

    Superagent, in short, is a relatively lightweight Ajax-style request API that runs on Node.js; we use it here to send requests and crawl web pages. See the official documentation for detailed usage, I will not repeat it here. (A short combined usage sketch follows at the end of this section.)
  • cheerio

    After superagent grabs the page, I wondered whether there was an easy way to analyze the DOM on the server side. I found Cheerio, whose Gitee page describes it like this:

Cheerio implements a subset of core jQuery. It removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API, and it works with a very simple, consistent DOM model. As a result, parsing, manipulation, and rendering are all very efficient.

Put plainly, Cheerio is a lean implementation of jQuery's core functionality, intended mainly for DOM manipulation and parsing on the server side.

A pillow arriving just as I got sleepy: the crawler's basic technical needs were covered, so it was time to get started.
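
Before the project itself, here is a minimal sketch of how the two libraries cooperate. This is my own illustration, not project code; the URL and selector are placeholders.

const superagent = require('superagent');
const cheerio = require('cheerio');

// Fetch a page, hand the raw HTML to cheerio, then query it like jQuery.
superagent.get('https://example.com').end((err, res) => {
  if (err || !res.ok) {
    console.error('request failed', err);
    return;
  }
  const $ = cheerio.load(res.text); // res.text holds the raw HTML
  console.log($('title').text());  // read the page title
});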

The main part

Project directory

The article does not cover how to set Koa up in the project; if you are unfamiliar with it, a quick Baidu (or Google) search turns up plenty of material. For completeness, a minimal sketch follows.
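
A bare-bones Koa service, just to show the shape of the framework (this sketch is mine and not part of the project code):

const Koa = require('koa');
const app = new Koa();

// Every request gets the same body in this minimal example.
app.use(async (ctx) => {
  ctx.body = 'hello koa';
});

app.listen(3000); // then visit http://localhost:3000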


Selected code (explanations are included as comments inside the code blocks)

package.json. nodemon is used here so that the service restarts automatically after every change; otherwise you would have to stop and restart it by hand. Running npm start launches the service through nodemon.

{
  "name": "pachong",
  "version": "1.0.0",
  "description": "No description",
  "main": "./app.js",
  "scripts": {
    "start": "nodemon app.js"
  },
  "author": "yln",
  "license": "ISC",
  "dependencies": {
    "cheerio": "^1.0.0-rc.10",
    "koa": "^2.13.4",
    "koa-cors": "0.0.16",
    "koa-router": "^10.1.1",
    "superagent": "^6.1.0"
  },
  "devDependencies": {}
}

app.js

const Koa = require('koa');
const Router = require('koa-router');
const Cors = require('koa-cors');
const SuperAgentRequest = require('superagent');
const Index = require('./index');

const app = new Koa();
const route = new Router();

// Before starting the service, add an entry to options and change `name`;
// this makes later extension to other sites easy.
const name = 'db';
const options = {
  'db': {
    'source': 'db', // a custom name that corresponds to `name` above
    'url': 'https://xxxxxxxxxxxxxxxx' // the detailed URL is omitted to avoid unnecessary disputes; fill in your own
  }
};

// Use koa-cors to solve the cross-origin problem.
// The front-end page runs on live-server at http://127.0.0.1:5500;
// adjust this to your own live-server configuration.
app.use(Cors({
  origin: function() {
    return 'http://127.0.0.1:5500';
  },
  maxAge: 5, // validity period of the preflight request, in seconds
  credentials: true, // whether cookies may be sent
  allowMethods: ['GET', 'POST', 'PUT', 'DELETE', 'OPTIONS'], // allowed HTTP request methods
  allowHeaders: ['Content-Type', 'Authorization', 'Accept'], // header fields the server supports
  exposeHeaders: ['WWW-Authenticate', 'Server-Authorization'] // extra custom fields exposed to the client
}));

let webData = {};

// Wrap the crawl in a separate promise-returning function.
function getData() {
  return new Promise((resolve, reject) => {
    // superagent's .end() takes a single err-first callback, which
    // covers both network errors and HTTP failures.
    SuperAgentRequest.get(options[name].url).end(async (err, res) => {
      let sourceData = {};
      if (!err && res.ok) {
        // switch is used mainly for later extension - after all, this won't
        // stay a single-site crawler forever. You could also split handlers
        // into separate files; there are many ways.
        switch (options[name].source) {
          case 'db':
            sourceData = await Index.dbHandle(res.text);
            break;
          default:
            break;
        }
        webData = {
          status: 200,
          data: sourceData
        };
        resolve();
      } else {
        console.error('Data acquisition failed', err);
        webData = {
          status: 404,
          data: {}
        };
        reject();
      }
    });
  });
}

// Create the interface
route.get('/hello', async (ctx) => {
  await getData();
  ctx.body = webData;
  // Alternatively, return the promise explicitly:
  // return getData().then(() => {
  //   ctx.body = webData;
  // });
});

app.use(route.routes()).use(route.allowedMethods()); // register the routes
app.listen(3000);
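
A quick way to smoke-test the endpoint without opening Postman is to reuse superagent itself. This snippet is my own, not the project's code, and assumes app.js is already listening on port 3000:

const superagent = require('superagent');

// Request /hello and print the JSON payload that Koa returns.
superagent.get('http://localhost:3000/hello')
  .then(res => console.log(res.body))
  .catch(err => console.error('request failed', err.message));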

Index.js. I call the data-processing method here and do the TXT file write.

const FS = require('fs');
const Handle = require('./handle');
// Note: naming the project folder with Chinese characters would not
// conform to convention, so the folder is named in English.
const dbWriteUrl = '/Users/apple/Desktop/node-crawler/textStore.txt';

async function dbHandle(params) {
  const data = await Handle.dbNewBookHandle(params); // data filtering happens in handle.js
  FS.writeFileSync(dbWriteUrl, data); // write the result to a file
  return data;
}

const Index = {
  dbHandle
};
module.exports = Index;
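
FS.writeFileSync blocks the event loop while writing. That is harmless in a toy project, but if you prefer the non-blocking variant, here is a sketch using fs/promises (my suggestion, not the article's code; it assumes the same Handle and dbWriteUrl as above):

const { writeFile } = require('fs/promises');

// Same flow as dbHandle, but with an awaited, non-blocking write.
async function dbHandleAsync(params) {
  const data = await Handle.dbNewBookHandle(params);
  await writeFile(dbWriteUrl, data);
  return data;
}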

Handle.js took a while, and the main headache was line breaks. As you can see below, I build the final string with plain string concatenation. Originally I intended to turn each object in the array into a string with JSON.stringify, but I found that after JSON.stringify the \n line breaks stopped working; at first I did not realize this was JSON.stringify's escaping at work, and took quite a few detours. There are also plenty of characters I do not need, such as spaces and \n, which have to be stripped out.
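
A minimal illustration of that pitfall: JSON.stringify escapes a real newline into the two characters backslash plus n, so the line break no longer renders once the string is written out.

// The newline inside the value survives only as escaped text.
const book = { title: 'line one\nline two' };
console.log(JSON.stringify(book)); // {"title":"line one\nline two"} - one single line
console.log(book.title);           // prints on two lines, as intended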

The filtering itself is just a matter of inspecting the DOM of the original page and then digging out the data we want, layer by layer.

const cheerio = require('cheerio');

function dbNewBookHandle(data) {
  // cheerio.load parses the fetched HTML and returns a jQuery-like $.
  const $ = cheerio.load(data);
  const newBookData = $('.slide-list ul li');
  let disposeBook = '';
  Array.prototype.forEach.call(newBookData, function(element) {
    const node = $(element);
    disposeBook += `{
  Title: ${$(node.find('.title a')).html().replace(/(\n)/g, '').trim()}
  Author: ${$(node.find('.author')).html().replace(/(\n)/g, '').trim()}
  Publication date: ${$(node.find('.more-meta .year')).html().replace(/(\n)/g, '').trim()}
  Publisher: ${$(node.find('.more-meta .publisher')).html().replace(/(\n)/g, '').trim()}
  About the book: ${$(node.find('.more-meta .abstract')).html().replace(/(\n)/g, '').trim()}
  Book cover: ${$(node.find('.cover img')).attr('src').replace(/(\n)/g, '').trim()}
}
`;
  });
  return disposeBook;
}

const Handle = {
  dbNewBookHandle,
};
module.exports = Handle;
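
The repeated find/html/replace/trim chain could be factored into a small helper. This is a hypothetical refactor of mine, not the project's code; it also returns an empty string when a node is missing, where the version above would throw:

// Read a child node's text, strip newlines, and trim; '' if the node is absent.
function cleanText(node, selector) {
  const target = node.find(selector);
  return target.length ? target.text().replace(/\n/g, '').trim() : '';
}

// Usage inside the loop above, for example:
// disposeBook += `{\n  Title: ${cleanText(node, '.title a')}\n}\n`;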

View.js, the Ajax request part of the front-end code.

   getCrawlData() {
      return new Promise((resolve, reject) => {
        $.ajax({
          type: 'GET',
          url: 'http://localhost:3000/hello',
          data: {},
          success: function(result) {
            resolve(result)
          },
          error: function(err) {
            reject(err)
          }
        })
      })
   }
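
If you would rather not depend on jQuery just for this request, the same call with the browser's built-in fetch API looks like this (my alternative, not the article's code):

// fetch only rejects on network errors, so non-2xx statuses are checked by hand.
function getCrawlData() {
  return fetch('http://localhost:3000/hello')
    .then(res => {
      if (!res.ok) throw new Error('HTTP ' + res.status);
      return res.json();
    });
}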

The final result

Front-end page display

TXT file

End of the main part.