A Node.js scraper for humans.

Installation

$ npm i --save scrape-it

Example

const scrapeIt = require("scrape-it");

// Promise interface
scrapeIt("http://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(page => {
    console.log(page);
});

// Callback interface
scrapeIt("http://ionicabizau.net", {
    // Fetch the articles
    articles: {
        listItem: ".article"
      , data: {

            // Get the article date and convert it into a Date object
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }

            // Get the title
          , title: "a.article-title"

            // Nested list
          , tags: {
                listItem: ".tags > span"
            }

            // Get the content
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }

    // Fetch the blog pages
  , pages: {
        listItem: "li.page"
      , name: "pages"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
            }
        }
    }

    // Fetch some other data from the page
  , title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}, (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: 'Everyone knows (or should know)... a" alt="">\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: 'Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: 'I love and ... ' } ],
//   pages:
//    [ { title: 'Blog', url: '/' },
//      { title: 'About', url: '/about' },
//      { title: 'FAQ', url: '/faq' },
//      { title: 'Training', url: '/training' },
//      { title: 'Contact', url: '/contact' } ],
//   title: 'Ionică Bizău',
//   desc: 'Web Developer, Linux Geek and Musician',
//   avatar: '/images/logo.png' }

Documentation

scrapeIt(url, opts, cb)

A scraping module for humans.

Params

  • String|Object url: The page url or request options.
  • Object opts: The options passed to the scrapeHTML method.
  • Function cb: The callback function.

Return

  • Promise A promise object.
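The url parameter may be an object instead of a string. The sketch below shows one way that might look. It assumes the request options object accepts url and headers fields, which this document does not spell out; requestOptions, headerSchema, scrapeHeader and the user-agent value are hypothetical names chosen for illustration.

```javascript
// Assumed shape of the request options object (not confirmed by this README):
// a `url` field plus optional `headers`.
const requestOptions = {
    url: "http://ionicabizau.net"
  , headers: { "user-agent": "my-scraper/1.0" } // illustrative value
};

// The same simple schema used in the Example section
const headerSchema = {
    title: ".header h1"
  , desc: ".header h2"
};

// Hypothetical wrapper: pass a callback, use the returned promise, or both.
// scrape-it is loaded lazily so the objects above can be inspected without it.
function scrapeHeader(cb) {
    const scrapeIt = require("scrape-it");
    return scrapeIt(requestOptions, headerSchema, cb);
}
```

Calling scrapeHeader((err, page) => console.log(err || page)) would then behave like the callback example above, provided the assumed option names are accepted.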

scrapeIt.scrapeHTML($, opts)

Scrapes the data in the provided element.

Params

  • Cheerio $: The input element.
  • Object opts: An object containing the scraping information.

    If you want to scrape a list, you have to use the listItem selector:

    • listItem (String): The list item selector.
    • data (Object): The fields to include in the list objects:

      • (Object|String): The selector or an object containing:

        • selector (String): The selector.
        • convert (Function): An optional function to change the value.
        • how (Function|String): A function or function name to access the value.
        • attr (String): If provided, the value will be taken based on the attribute name.
        • trim (Boolean): If false, the value will not be trimmed (default: true).
        • eq (Number): If provided, it will select the nth element.
        • listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.

    Example:

    {
     articles: {
         listItem: ".article"
       , data: {
             createdAt: {
                 selector: ".date"
               , convert: x => new Date(x)
             }
           , title: "a.article-title"
           , tags: {
                 listItem: ".tags > span"
             }
           , content: {
                 selector: ".article-content"
               , how: "html"
             }
         }
     }
    }

    If you want to collect specific data from the page, just use the same schema used for the data field.

    Example:

    {
       title: ".header h1"
     , desc: ".header h2"
     , avatar: {
           selector: ".header img"
         , attr: "src"
       }
    }

Return

  • Object The scraped data.
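The examples above do not exercise eq or a custom convert helper, and they never call scrapeHTML directly. The following is a minimal sketch of both, assuming that the $ argument can be produced with cheerio.load and that eq is zero-based (as in Cheerio). The markup, the selectors, and the names toPrice, schema, html and scrapeSample are made up for illustration.

```javascript
// Hypothetical convert helper: strip currency symbols and separators
const toPrice = text => Number(text.replace(/[^0-9.]/g, ""));

const schema = {
    products: {
        listItem: ".product"
      , data: {
            name: ".name"

            // Convert the scraped text into a number
          , price: {
                selector: ".price"
              , convert: toPrice
            }

            // Take only the first matching element (assumed zero-based)
          , firstTag: {
                selector: ".tag"
              , eq: 0
            }

            // Read an attribute instead of the text
          , link: {
                selector: "a"
              , attr: "href"
            }
        }
    }
};

// Sample markup, invented for this sketch
const html = `
<div class="product">
  <a href="/p/1"><span class="name">Lamp</span></a>
  <span class="price">$1,299.50</span>
  <span class="tag">home</span><span class="tag">light</span>
</div>`;

// Dependencies are loaded lazily so the schema can be inspected without them
function scrapeSample() {
    const cheerio = require("cheerio");
    const scrapeIt = require("scrape-it");
    return scrapeIt.scrapeHTML(cheerio.load(html), schema);
}
```

With both packages installed, scrapeSample() should return an object with a products array; scraping a local HTML string this way is handy for trying out a schema without making a request.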

How to contribute

Have an idea? Found a bug? See how to contribute.

Where is this library used?

If you are using this library in one of your projects, add it to this list. :sparkles:

  • ui-studentsearch (by Rakha Kanz Kautsar) – API for majapahit.cs.ui.ac.id/studentsearch

License

MIT © Ionică Bizău