It has been nearly two weeks since the last tutorial; I can't help it, I've been busy with my studies (¬_¬)

The three tools used in this article are:

  • Cheerio: jQuery-like syntax to help you parse web pages in non-browser environments
    • I didn't use it in the last tutorial, but it does the heavy lifting this time
  • Segment: a Chinese word segmentation module based on the Pangu segmentation component, from the CNode community, by @leizongmin
  • request: the HTTP client used below to fetch the pages and download the images

Cheerio usage

```js
const cheerio = require('cheerio'),
  $ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

$.html();
//=> <h2 class="title welcome">Hello there!</h2>
```

For more usage, see the cheerio documentation.
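As a quick taste of what else it can do, here are a few more commonly used cheerio calls (the markup is adapted from cheerio's own README example):

```js
const cheerio = require('cheerio');
const $ = cheerio.load('<ul id="fruits"><li class="apple">Apple</li><li class="pear">Pear</li></ul>');

$('.apple').text();             // 'Apple' - read the text of a match
$('li').eq(1).text();           // 'Pear'  - index into a selection
$('#fruits').find('li').length; // 2       - traverse downwards
$('.pear').attr('class');       // 'pear'  - read an attribute
$('li').each((i, el) => {
  console.log(i, $(el).text()); // iterate over all matches
});
```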

Segment usage

```js
const Segment = require('segment');
// Create an instance
const segment = new Segment();
// Use the default recognition modules and dictionaries. Loading the dictionary files takes about a second
segment.useDefault();

// Start segmenting
console.log(segment.doSegment('This is a Chinese word segmentation module based on Node.js.'));
// [ { w: 'this is', p: 0 },
//   { w: 'a', p: 2097152 },
//   { w: 'based on', p: 262144 },
//   { w: 'Node.js', p: 8 },
//   { w: 'the', p: 8192 },
//   { w: 'Chinese', p: 1048576 },
//   { w: 'word segmentation', p: 4096 },
//   { w: 'module', p: 1048576 },
//   { w: '.', p: 2048 } ]
```

(The string and the tokens were Chinese in the original post, which is why the token order differs from the English sentence.)

But we generally don't need the part-of-speech output, and we don't need the extra punctuation marks either, so:

```js
const result = segment.doSegment(text, {
  simple: true,           // Do not output part-of-speech data
  stripPunctuation: true  // Remove punctuation
});
// => [ 'this is', 'a', 'based on', 'Node.js', 'the', 'Chinese', 'word segmentation', 'module' ]
```

For more advanced usage, see the segment documentation.

See GitHub for the full code.

That's it for basic usage ╰(' ◡ ')╯

Crawling the images

Both the src attribute and the data-src attribute of the img element carry the image address. So why does the code below read data-src instead of src, via img.eq(i).prop('data-src')? Because the page lazy-loads its images (hence the .lazyload class): src holds a placeholder until the image scrolls into view, while data-src holds the real address.

One thing to keep in mind is that custom data-* attributes have limited browser compatibility:

Internet Explorer 11+ Chrome 8+ Firefox 6.0+ Opera 11.10+ Safari 6+
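A tiny sketch of the difference, using the same prop() call as the code below (the markup here is invented for illustration):

```js
const cheerio = require('cheerio');
// A typical lazy-loaded image: src holds a placeholder, data-src the real URL
const $ = cheerio.load(
  '<img class="lazyload" src="placeholder.gif" data-src="https://user-gold-cdn.example/real.png">'
);

const img = $('.lazyload');
console.log(img.eq(0).prop('src'));      // 'placeholder.gif' - not what we want
console.log(img.eq(0).prop('data-src')); // the real image address
```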

The regex used to pick the image URLs out of the page is /(https:\/\/user-gold-cdn).+?\/ignore-error\/1/g. It is important to note that the / characters are escaped, and that .+? is a lazy match. I am not going to go deep into lazy matching here (briefly (//▽//): it matches the shortest string that meets the requirements); a full treatment could fill a post of its own.

Those of you who want to know more about it can look at this explanation
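A quick illustration of greedy versus lazy matching (the URLs here are shortened stand-ins):

```js
const s = 'https://user-gold-cdn.xitu.io/a/ignore-error/1 ... https://user-gold-cdn.xitu.io/b/ignore-error/1';

// Greedy: .+ grabs as much as possible, so a single match spans both URLs
s.match(/(https:\/\/user-gold-cdn).+\/ignore-error\/1/g);
// => [ 'https://user-gold-cdn.xitu.io/a/ignore-error/1 ... https://user-gold-cdn.xitu.io/b/ignore-error/1' ]

// Lazy: .+? stops at the first possible '/ignore-error/1', giving one match per URL
s.match(/(https:\/\/user-gold-cdn).+?\/ignore-error\/1/g);
// => [ 'https://user-gold-cdn.xitu.io/a/ignore-error/1',
//      'https://user-gold-cdn.xitu.io/b/ignore-error/1' ]
```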

```js
const fs = require('fs'); // needed for createWriteStream (required at the top of the full code)

/**
 * @param {any} $ cheerio instance
 * @param {any} request the wrapped request function
 */
function saveImg($, request) {
  const img = $('.lazyload');
  const origin = request.default(); // request here is a thin wrapper; .default() returns the raw request module
  for (let i = 0; i < img.length; ++i) {
    // data.body.match(/(https:\/\/user-gold-cdn).+?\/ignore-error\/1/g)
    let src = img.eq(i).prop('data-src');
    let name = src.match(/\/.{16}\?/g) && src.match(/\/.{16}\?/g)[0].slice(1, -1); // Match the image name
    if (name) {
      origin.get(src).pipe(fs.createWriteStream(`./images/${name}.png`)); // Download the image happily
    }
  }
}
```

Data processing

We use the Map data structure to store word frequencies (word → number of occurrences).

Like objects, it is also a collection of key-value pairs, but the range of “keys” is not limited to strings. Values of all types (including objects) can be used as keys. In other words, the Object structure provides string-value mapping, and the Map structure provides value-value mapping, which is a more complete Hash structure implementation. If you need key-value data structures, Map is better than Object.
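A small sketch of that last point, using an object as a key (something a plain object cannot do):

```js
const el = { id: 1 };      // any value, including an object, can serve as a Map key
const m = new Map();
m.set(el, 'metadata for el');
m.get(el);                 // 'metadata for el'

const o = {};
o[el] = 'metadata for el'; // an object key is coerced to a string
Object.keys(o);            // [ '[object Object]' ]
```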

In fact, for this article the keys are all strings, so there would be no problem using a plain object at all. Map is used purely as a plug for it (●'◡'●), and as an excuse to explain copying a Map, which works differently from objects.

Copying a Map

A plain object cannot be copied by passing it to a constructor, but a Map can: passing an existing Map to new Map() produces a (shallow) copy.
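A minimal sketch of the difference:

```js
const original = new Map([['word', 2]]);

const alias = original;         // plain assignment only copies the reference
const copy = new Map(original); // the Map constructor accepts any iterable of entries -> real copy

alias.delete('word');
original.has('word');           // false - alias and original are the same Map
copy.has('word');               // true  - the copy is unaffected

// By contrast, there is no constructor copy for objects: Object() just returns the same object
const obj = { word: 2 };
new Object(obj) === obj;        // true
```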

```js
let map = new Map(); // module-level word-frequency store (declared once in the full code)

async function getPage(request, url) {
  const data = await request.get({ url });
  const $ = cheerio.load(data.body);
  saveImg($, request);
  // Get the content
  let length = $('p').length;
  for (let i = 0; i < length; ++i) {
    let result = segment.doSegment(
      $('p') // Most of the content is wrapped in p tags; no complicated processing here
        .eq(i)
        .text(),
      {
        simple: true,          // No part-of-speech output
        stripPunctuation: true // Remove punctuation
      }
    );
    result.forEach(item => {
      map.set(item, map.get(item) + 1 || 1); // undefined + 1 => NaN, NaN || 1 => 1
    });
  }
  map = sortToken(map);
}
```
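The counting line works because map.get returns undefined for an unseen word, undefined + 1 is NaN, and NaN || 1 falls back to 1. A more explicit equivalent of the same idiom:

```js
map.set(item, (map.get(item) || 0) + 1); // 0 for unseen words, then increment
```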

```js
function sortToken(map) {
  const words = {}; // store the words worth keeping
  let mapCopy = new Map(map); // Get a copy; direct assignment of a Map only copies the reference, see above
  map.forEach((value, key) => {
    // Keep words whose frequency is greater than 1 and that are longer than a single character
    if (value !== 1 && key.length > 1) {
      words[key] = value;
    }
    if (value === 1) {
      // The word frequency is too low
      mapCopy.delete(key);
    }
  });
  const keys = Object.keys(words);
  // Sort by frequency
  keys.sort((a, b) => {
    return words[b] - words[a];
  });
  // If you want the 20 highest-frequency words of each article, look at the top-K algorithm
  // (we take the first k after a full sort; top-K finds the k-th element, though it still keeps the first k around for comparison).
  // This is only a rough way to get the 20 highest-frequency words, and it can be biased: if the 11th word
  // scores 23 in one article's sort while the 10th word of the next sort scores 12, a genuinely frequent word gets pushed out.
  // The upside is saving memory (not really); a real solution could use a max heap plus database storage,
  // so memory is no issue: pull the data back from the database and apply the top-K algorithm to get the result.
  keys.slice(0, 20).forEach(item => {
    console.log(item, words[item]);
  });
  // Return the map with the frequency-1 words removed
  return mapCopy;
}
```
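The comments above mention top-K and a heap; here is my own minimal sketch of that idea, not part of the original code. Note the standard trick uses a min-heap of size k (the smallest of the current top k sits at the root), so at most k entries are held while scanning the word map:

```js
function topK(map, k) {
  const heap = []; // array-based binary min-heap of [word, count] pairs

  const siftUp = i => {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent][1] <= heap[i][1]) break;
      [heap[parent], heap[i]] = [heap[i], heap[parent]];
      i = parent;
    }
  };

  const siftDown = i => {
    for (;;) {
      let min = i;
      const l = 2 * i + 1;
      const r = 2 * i + 2;
      if (l < heap.length && heap[l][1] < heap[min][1]) min = l;
      if (r < heap.length && heap[r][1] < heap[min][1]) min = r;
      if (min === i) break;
      [heap[min], heap[i]] = [heap[i], heap[min]];
      i = min;
    }
  };

  map.forEach((count, word) => {
    if (heap.length < k) {
      heap.push([word, count]);
      siftUp(heap.length - 1);
    } else if (count > heap[0][1]) {
      heap[0] = [word, count]; // evict the smallest of the current top k
      siftDown(0);
    }
  });

  return heap.sort((a, b) => b[1] - a[1]); // highest frequency first
}

// e.g. topK(map, 20) would replace the full sort + slice(0, 20) above
```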

Sample output from one run (top words): code, methods, function, object, perform, call, component, a

You could also analyse the titles, refine the sorting algorithm, analyse the entire article content (the .article-content class) directly rather than just the p tags as I did, and finally display the data with a visualization tool such as ECharts.
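For instance, a minimal ECharts bar chart of the word frequencies might look like this (browser-side; the container id and the data are my own made-up examples):

```js
// Assumes echarts is loaded and the page has <div id="main"></div>
const chart = echarts.init(document.getElementById('main')); // hypothetical container
chart.setOption({
  title: { text: 'Word frequency' },
  xAxis: { type: 'category', data: ['code', 'methods', 'function'] }, // top words
  yAxis: { type: 'value' },
  series: [{ type: 'bar', data: [42, 31, 27] }] // made-up counts for illustration
});
```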

If you liked this, feel free to star it on GitHub.

That's all; if there are any mistakes, corrections are welcome.