Using Node.js to develop an information crawler (2) - HTML content extraction optimization

I previously wrote an article about using Node.js to develop an information crawler, in which page elements were extracted from the HTML by hard-coding Cheerio selectors. That works fine when only one website is crawled, but crawling multiple websites produces a lot of code with nearly identical logic. So we optimized it.

Effect achieved

Given a data structure that describes each field and the element it is extracted from, the corresponding data is extracted automatically.

The data structure

let obj = {
  title: { dom: '.title-link', target: 'text' },
  link: { dom: '.title-link', target: 'attr', attrName: 'href' },
  content: { dom: '.content-text', target: 'text' }
}
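
Because every field is described by the same few keys (`dom`, `target`, and `attrName` for attributes), a small check can catch a malformed entry before a crawl starts. The following is only a minimal sketch: `validateFieldConfig` is a hypothetical helper that is not part of the original code, and it assumes `'text'` and `'attr'` are the only supported targets, since those are the only two shown above.

```javascript
// Hypothetical helper: returns a list of problems found in a field config.
// Assumption: 'text' and 'attr' are the only supported targets.
function validateFieldConfig(config) {
  const problems = [];
  for (const [field, rule] of Object.entries(config)) {
    if (typeof rule.dom !== 'string' || rule.dom.length === 0) {
      problems.push(`${field}: "dom" must be a non-empty selector string`);
    }
    if (rule.target !== 'text' && rule.target !== 'attr') {
      problems.push(`${field}: "target" must be 'text' or 'attr'`);
    }
    if (rule.target === 'attr' && !rule.attrName) {
      problems.push(`${field}: "attrName" is required when target is 'attr'`);
    }
  }
  return problems;
}
```

An empty array means the config is usable; otherwise each string names the field and what is wrong with it.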

Data results

[ { title: 'I am the title',
    link: 'https://juejin.cn',
    content: 'I am content' } ]

The implementation code

  extract() {
    // List elements
    let nodelist = $(this.zonedom).find(this.listdom)
    nodelist.each((i, e) => {
      Object.keys(this.datadoms).forEach(objEle => {})
    })
  }
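
The body of the inner loop is not shown in the excerpt above. One way it might work, as a sketch under stated assumptions rather than the author's actual code: for each key in the config, look up the selector inside the current list item and read either its text or the named attribute. `extractItem` is a hypothetical name, and `$item` is assumed to be a Cheerio selection for one list element (for example `$(e)` inside `each`); only Cheerio's `find()`, `text()`, and `attr()` are used.

```javascript
// Hypothetical helper: builds one result object from one list item.
// Assumption: `$item` is a Cheerio-style selection, and `dataDoms` is a
// field config shaped like the one under "The data structure".
function extractItem($item, dataDoms) {
  const result = {};
  Object.keys(dataDoms).forEach((key) => {
    const rule = dataDoms[key];
    const $node = $item.find(rule.dom);
    if (rule.target === 'attr') {
      // Read the named attribute, e.g. href for links.
      result[key] = $node.attr(rule.attrName);
    } else {
      // Otherwise fall back to the trimmed text content.
      result[key] = $node.text().trim();
    }
  });
  return result;
}
```

Called once per list item inside the `each` callback, this would yield one object per item, which matches the shape shown under "Data results".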

Full code: extract.js

That’s it 🙂
