It has been nearly two weeks since the last tutorial. I couldn't help it, I've been busy with my studies (¬_¬)
The three tools used in this article are:
- Cheerio: jQuery-style syntax for parsing web pages outside the browser
- request (the HTTP client that the code below receives as `request`): I didn't use it last time, but this time I definitely did
- Segment: a Chinese word segmentation module based on the Pangu dictionary, written by CNode's @leizongmin
Cheerio usage
const cheerio = require('cheerio'),
  $ = cheerio.load('<h2 class="title">Hello world</h2>');
$('h2.title').text('Hello there!');
$('h2').addClass('welcome');
$.html();
//=> <h2 class="title welcome">Hello there!</h2>
// (newer Cheerio versions wrap this in <html><head></head><body>)
For more usage, see the Cheerio documentation.
Segment usage
const Segment = require('segment');
// Create an instance
const segment = new Segment();
// Use the default recognition module and dictionary. It takes 1 second to load the dictionary file
segment.useDefault();
// start participle
console.log(segment.doSegment('This is a Chinese word segmentation module based on Node.js. '));
// [{w: 'this is ', p: 0},
// {w: 'a ', p: 2097152},
// {w: 'based on ', p: 262144},
// { w: 'Node.js', p: 8 },
// {w: ' ', p: 8192},
// {w: 'Chinese ', p: 1048576},
// {w: 'participle ', p: 4096},
// {w: 'module ', p: 1048576},
// {w: '. ', p: 2048 } ]
But we usually don't need the part-of-speech tags, and we don't need the punctuation either, so:
const result = segment.doSegment(text, {
  simple: true, // Do not output part-of-speech tags
  stripPunctuation: true // Remove punctuation
});
// [ '这是', '一个', '基于', 'Node.js', '的', '中文', '分词', '模块' ]
For more advanced usage, see the segment documentation.
The full code is on GitHub.
That covers the basic usage ╰(‘ ◡ ‘)
Crawling the images
Both the src attribute and the data-src attribute of an img element can carry the image address. Why doesn't the code below read the src value? The images are lazy-loaded (note the .lazyload selector), so the address is read with img.eq(i).prop('data-src') instead.
Custom data attributes (the dataset API) also have limited browser compatibility:
Internet Explorer 11+, Chrome 8+, Firefox 6.0+, Opera 11.10+, Safari 6+
The image addresses can also be matched with a regular expression: /(https:\/\/user-gold-cdn).+?\/ignore-error\/1/g. Note that every / is escaped and that the match is lazy (.+?). I am not going to go into lazy matching here (in short (//▽//), it matches the shortest string that meets the requirements); explaining it properly would take a lot of writing.
If you want to know more, have a look at this explanation.
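To make the lazy-matching remark concrete, here is a minimal sketch comparing the lazy pattern with a greedy variant. The sample string and the `user-gold-cdn.example` host are made up for illustration; only the regular expression itself comes from the article.

```javascript
// Two image tags in one string; host and paths are made-up sample data.
const body =
  '<img data-src="https://user-gold-cdn.example/a/1111?x/ignore-error/1">' +
  '<img data-src="https://user-gold-cdn.example/b/2222?x/ignore-error/1">';

// Lazy: .+? stops at the first "/ignore-error/1" it can reach.
const lazy = body.match(/(https:\/\/user-gold-cdn).+?\/ignore-error\/1/g);

// Greedy: .+ runs on to the last "/ignore-error/1" in the string.
const greedy = body.match(/(https:\/\/user-gold-cdn).+\/ignore-error\/1/g);

console.log(lazy.length);   // 2 (one match per image)
console.log(greedy.length); // 1 (a single match swallowing both tags)
```

With the greedy version the two addresses are glued together, which is why the lazy quantifier matters when a page contains more than one image.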
const fs = require('fs');

/**
 * @param {any} $ cheerio
 * @param {any} request request function
 */
function saveImg($, request) {
  const img = $('.lazyload');
  const origin = request.default(); // `request` here is a simple wrapper; default() returns the unwrapped request
  for (let i = 0; i < img.length; ++i) {
    // Alternative: data.body.match(/(https:\/\/user-gold-cdn).+?\/ignore-error\/1/g)
    const src = img.eq(i).prop('data-src');
    const matched = src && src.match(/\/.{16}\?/g); // "/" + 16-character image name + "?"
    const name = matched && matched[0].slice(1, -1); // Strip the leading "/" and the trailing "?"
    if (name) {
      origin.get(src).pipe(fs.createWriteStream(`./images/${name}.png`)); // Download the image happily
    }
  }
}
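The name-matching step can be tried on its own. This is a small sketch, not the code from the repository: `imageNameFromUrl` is a hypothetical helper, the sample URL is invented, and it uses a capture group instead of `slice` to drop the surrounding `/` and `?`.

```javascript
// Hypothetical helper: pull the 16-character file name out of an image URL.
function imageNameFromUrl(src) {
  if (typeof src !== 'string') return null; // e.g. an <img> without data-src
  const match = src.match(/\/(.{16})\?/);   // "/" + 16 characters + "?"
  return match ? match[1] : null;
}

// Made-up URL in the same shape as the addresses matched above.
const sampleUrl =
  'https://user-gold-cdn.example/2018/1/1/0123456789abcdef?imageView2/0/ignore-error/1';

console.log(imageNameFromUrl(sampleUrl));         // 0123456789abcdef
console.log(imageNameFromUrl('https://x/y.png')); // null
```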
Data processing
First, a word about the Map data structure, which is used here to store the word frequencies (word -> number of occurrences).
Like objects, it is also a collection of key-value pairs, but the range of “keys” is not limited to strings. Values of all types (including objects) can be used as keys. In other words, the Object structure provides string-value mapping, and the Map structure provides value-value mapping, which is a more complete Hash structure implementation. If you need key-value data structures, Map is better than Object.
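A short sketch of the difference described above, using only built-in JavaScript; the variable names are illustrative.

```javascript
const objKey = { id: 1 };

// Plain object: a non-string key is converted to a string.
const obj = {};
obj[objKey] = 'value';
console.log(Object.keys(obj)); // [ '[object Object]' ]

// Map: the object itself is the key.
const m = new Map();
m.set(objKey, 'value');
console.log(m.get(objKey));    // value
console.log(m.get({ id: 1 })); // undefined (a different object)
```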
In fact, for this article the keys are all strings, so a plain object would work just as well. I used Map purely to try it out (● ‘◡’ ●). A note on copying a Map, which works differently from copying an object:
Map copy: passing a Map to the Map constructor creates a new Map with the same entries.
Object copy: passing an object to the Object constructor does not copy it; the same object comes back.
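The difference between copying a Map and copying an object can be checked with a few lines of plain JavaScript. This is only a sketch with made-up sample data; the spread syntax at the end is one common way to get a shallow object copy, not something the article itself uses.

```javascript
const original = new Map([['code', 3]]);
const alias = original;          // same Map, two names
const copy = new Map(original);  // new Map with the same entries

copy.set('code', 99);
console.log(original.get('code')); // 3 (the copy is independent)
alias.set('code', 4);
console.log(original.get('code')); // 4 (the alias is the same Map)

const source = { code: 3 };
const notACopy = new Object(source); // returns the very same object
console.log(notACopy === source);    // true
const shallow = { ...source };       // a real (shallow) copy
console.log(shallow === source);     // false
```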
let map = new Map(); // Module-level word-frequency store (word -> count), reassigned in getPage

async function getPage(request, url) {
  const data = await request.get({ url });
  const $ = cheerio.load(data.body);
  saveImg($, request);
  // Get the content
  let length = $('p').length;
  for (let i = 0; i < length; ++i) {
    let result = segment.doSegment(
      $('p') // Most of the content is wrapped in p tags, so no complicated processing is done here
        .eq(i)
        .text(),
      {
        simple: true, // Do not output part-of-speech tags
        stripPunctuation: true // Remove punctuation
      }
    );
    result.forEach(item => {
      map.set(item, map.get(item) + 1 || 1); // undefined + 1 is NaN, and NaN || 1 => 1
    });
  }
  map = sortToken(map);
}
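The counting line relies on a small trick, so here it is in isolation. This is a minimal sketch: `countWords` is a hypothetical helper name and the token list is sample data.

```javascript
// Hypothetical helper: add each token to a word -> count Map.
function countWords(tokens, counts = new Map()) {
  for (const token of tokens) {
    // On first sight counts.get(token) is undefined; undefined + 1 is NaN,
    // and NaN || 1 falls back to 1. Afterwards it is simply count + 1.
    counts.set(token, counts.get(token) + 1 || 1);
  }
  return counts;
}

const counts = countWords(['code', 'function', 'code']);
console.log(counts.get('code'));     // 2
console.log(counts.get('function')); // 1
```

A more explicit spelling of the same update is `counts.set(token, (counts.get(token) || 0) + 1)`.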
function sortToken(map) {
  const words = {}; // Stores the words to rank
  const mapCopy = new Map(map); // Work on a copy; assigning a Map directly only copies the reference, see above
  map.forEach((value, key) => {
    // Keep tokens whose frequency is greater than 1 and that are longer than one character
    if (value !== 1 && key.length > 1) {
      words[key] = value;
    }
    if (value === 1) {
      // The word frequency is too low
      mapCopy.delete(key);
    }
  });
  const keys = Object.keys(words);
  // Sort by frequency, highest first
  keys.sort((a, b) => {
    return words[b] - words[a];
  });
  // Print the 20 most frequent words of each article. If you are interested, look up the top-K algorithm
  // (we want the first K; it finds the K-th, but it has to keep the first K to compare against).
  // This is only a rough way to get the 20 most frequent words, and it is somewhat biased: if the 11th word
  // has a count of 23 in the first sort and the 10th word has a count of 12 in the second sort, the earlier,
  // more frequent word gets overwritten.
  // The upside is that it saves memory (not really, to be honest). A proper solution could use a max-heap
  // together with database storage, so that nothing has to be kept in memory:
  // read the data back from the database, then apply the top-K algorithm to get the result.
  keys.slice(0, 20).forEach(item => {
    console.log(item, words[item]);
  });
  // Return the map with the frequency-1 words removed
  return mapCopy;
}
code
methods
function
object
perform
call
component
a
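The comments in sortToken mention the top-K approach as an alternative to sorting every word. Below is a minimal sketch of one common variant, a bounded min-heap that never holds more than K entries. It is not the code from the repository, it swaps the max-heap mentioned in the comments for a size-K min-heap, and `topK` plus the sample counts are made up for illustration.

```javascript
// Return the k entries with the largest counts, highest first.
// `counts` is any iterable of [word, count] pairs, such as a Map.
function topK(counts, k) {
  const heap = []; // min-heap of [word, count]; smallest count at index 0

  const swap = (i, j) => { [heap[i], heap[j]] = [heap[j], heap[i]]; };

  const siftUp = (i) => {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent][1] <= heap[i][1]) break;
      swap(parent, i);
      i = parent;
    }
  };

  const siftDown = (i) => {
    for (;;) {
      const left = 2 * i + 1;
      const right = left + 1;
      let smallest = i;
      if (left < heap.length && heap[left][1] < heap[smallest][1]) smallest = left;
      if (right < heap.length && heap[right][1] < heap[smallest][1]) smallest = right;
      if (smallest === i) break;
      swap(smallest, i);
      i = smallest;
    }
  };

  for (const entry of counts) {
    if (heap.length < k) {
      heap.push(entry);
      siftUp(heap.length - 1);
    } else if (k > 0 && entry[1] > heap[0][1]) {
      heap[0] = entry; // evict the current minimum
      siftDown(0);
    }
  }
  return heap.sort((a, b) => b[1] - a[1]);
}

const sample = new Map([
  ['code', 9], ['a', 2], ['function', 7], ['call', 4], ['object', 5],
]);
console.log(topK(sample, 3));
// [ [ 'code', 9 ], [ 'function', 7 ], [ 'object', 5 ] ]
```

Only K entries are kept at any time, so the extra memory stays constant no matter how many distinct words are streamed through.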
You could also analyse the titles, refine the sorting algorithm, and analyse the whole article-content (class) text directly rather than only the p tags as I did, then display the data with a visualization tool (such as ECharts).
If you like it, feel free to star the project on GitHub.
That's all. If you find any mistakes, corrections are welcome.