Welcome to nodecrawler 👋


Node crawler Notes

🏠 Homepage

What is a crawler?

A crawler simply extracts information from web pages. The web is structured like a spider's web, and a crawler is like a spider moving across it, picking up the information we are interested in.

Two things need to be determined before you start writing a crawler:

  1. Where to crawl? (Which site holds the information we want?)
  2. What to crawl? (What information do we want to extract from it?)

A sharp tool makes for good work

At the beginning I tried several Node crawler libraries, but none of them worked well. The search paid off eventually, though, when I found one that did: Apify.

Starting my crawler journey with Apify

1. First create a new project and install the Apify dependency.

npm i apify -S

The next step is to decide which site's information to crawl (we will take the douban Top 250 movies as an example).

2. Now that we have identified the URL to crawl (movie.douban.com/top250), we can start coding.

// Introduce Apify
const Apify = require('apify');

3. Apify provides a requestQueue that dynamically manages the URLs we want to crawl; we can use it to keep track of every URL we plan to visit.

const Apify = require('apify');

Apify.main(async () => {
    // First create a requestQueue
    const requestQueue = await Apify.openRequestQueue();
    // Add the URL we want to crawl to the queue
    await requestQueue.addRequest({ url: 'https://movie.douban.com/top250' });
})

4. Now that we have the request queue, the next question is what to crawl. We need a function that parses the content of each requested page.

Define a function to parse the page content; it will later be passed to the Apify crawler instance. Here `request` holds the URL being processed and `$` is a Cheerio handle to the downloaded HTML.

async function handlePageFunction({ request, $ }) {
    const title = $('title').text();
    console.log(`Title of the page: ${title}`);
}

5. Finally, create a CheerioCrawler instance, pass requestQueue and handlePageFunction in as options, and start the crawler.

const crawler = new Apify.CheerioCrawler({ requestQueue, handlePageFunction })
// Start the crawler
await crawler.run();

Let's put the pieces of code together and start the crawler.

const Apify = require('apify');

Apify.main(async () => {
    // First create a requestQueue
    const requestQueue = await Apify.openRequestQueue();
    // Add the URL we want to crawl to the queue
    await requestQueue.addRequest({ url: 'https://movie.douban.com/top250' });
    // Parse the page: grab its title and log it
    const handlePageFunction = async ({ request, $ }) => {
        const title = $('title').text();
        console.log(`Title of the page: ${title}`);
    }
    // Create a CheerioCrawler with the requestQueue and handlePageFunction
    const crawler = new Apify.CheerioCrawler({ requestQueue, handlePageFunction })
    // Start the crawler
    await crawler.run();
})

Run the code and the title of the page is crawled successfully.

At this point we have a simple crawler, but not yet what we need (the full Top 250). We have to add URLs dynamically in order to reach all 250 movies.

6. Get all the pages to crawl

The initial URL is only the first page; we also need the URLs of all the numbered pages. By parsing the page we can find the link to the next page and add it to the request queue with Apify's enqueueLinks utility, which dynamically adds URLs to the queue.

const {
  utils: { enqueueLinks },
} = Apify;
await enqueueLinks({
    $,
    requestQueue,
    selector: '.next > a',
    baseUrl: request.loadedUrl,
});

7. Next, modify handlePageFunction to parse the movie information we need.

/** Get the movie info from the page */
function parseMovie($) {
    const movieDoms = $('.grid_view .item');
    const movies = [];
    movieDoms.each((index, item) => {
        const movie = {
            rank: $(item).find('.pic em').text(),      // rank
            name: $(item).find('.title').text(),       // name
            score: $(item).find('.rating_num').text(), // score
            sketch: $(item).find('.inq').text()        // one-line description
        };
        movies.push(movie);
    });
    return movies;
}

8. Put the code back together, run it, and see what happens.

const Apify = require('apify');
const { utils: { enqueueLinks } } = Apify;

Apify.main(async () => {
    // First create a requestQueue and add the start URL
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://movie.douban.com/top250' });
    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction
    })
    async function handlePageFunction({ request, $ }) {
        await enqueueLinks({
            $,
            requestQueue,
            selector: '.next > a',
            baseUrl: request.loadedUrl,
        });
        const movies = parseMovie($);
        movies.forEach((item, i) => {
            console.log(`${item.rank}|${item.name}|${item.score}|${item.sketch}`)
        })
    }
    // Start the crawler
    await crawler.run();
})

/** Parse the page and get the movie information */
function parseMovie($) {
    const movieDoms = $('.grid_view .item');
    const movies = [];
    movieDoms.each((index, item) => {
        const movie = {
            rank: $(item).find('.pic em').text(),      // rank
            name: $(item).find('.title').text(),       // name
            score: $(item).find('.rating_num').text(), // score
            sketch: $(item).find('.inq').text()        // one-line description
        };
        movies.push(movie);
    });
    return movies;
}

⭐️ run it to see the results

Now the results look the way we want. But doesn't it feel a bit tedious to find the link, resolve it, and add it to the request queue ourselves? Could we just give a URL rule and have the program add matching links to the queue automatically?
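Apify can do this with pseudo-URLs: enqueueLinks also accepts a pseudoUrls option, and only links matching the pattern are added to the queue. The snippet below is a minimal sketch, assuming Apify SDK v1 where square brackets in a pseudo-URL enclose a regular expression; the pattern shown is only an illustration and is not part of the original code.

// Inside handlePageFunction: enqueue every link on the page that matches
// the pseudo-URL pattern, instead of picking out the "next" button by hand.
await enqueueLinks({
    $,
    requestQueue,
    baseUrl: request.loadedUrl,
    // [...] encloses a regular expression: match any top250 page, whatever the query string
    pseudoUrls: ['https://movie.douban.com/top250[.*]'],
});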

Local data persistence

Here we use SQLite for local data persistence because it is a lightweight database and does not require a server.

  1. Install the sqlite3 dependency
npm i sqlite3 -S
  2. Initialize the database and create the movies table
const sqlite3 = require('sqlite3').verbose();
function initDB() {
    let db = new sqlite3.Database('./db/crawler.db', (err) => {
        if (err) {
            return console.error(err.message);
        }
        console.log('Connected to the crawler database.');
    });
    createTable(db);
    return db;
}

function createTable(db) {
    const sql = `CREATE TABLE IF NOT EXISTS movies(
        rank TEXT,
        name TEXT,
        score TEXT,
        sketch TEXT
    );`;
    db.run(sql, [], (err) => {
        if (err) {
            return console.log(err);
        }
        console.log('Table created successfully');
    });
}

3. Insert data

function insertData(db, movie = {}) {
    db.serialize(function () {
        db.run(`INSERT INTO movies(rank,name,score,sketch) VALUES(?,?,?,?)`, [movie.rank, movie.name, movie.score, movie.sketch], function (err) {
            if (err) {
                return console.log(err.message);
            }
            // get the last insert id
            console.log(`A row has been inserted with rowid ${this.lastID}`);
        });
    });
}
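To actually persist the crawled movies, the insert can be hooked into the crawler's handlePageFunction. The wiring below is only a sketch based on the functions defined above (initDB, insertData, parseMovie); it is not part of the original snippets.

// Open the database once, then write every parsed movie from each page.
const db = initDB();

async function handlePageFunction({ request, $ }) {
    const movies = parseMovie($);
    movies.forEach((movie) => insertData(db, movie));
}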

At the end

At this point a simple crawler has been written, but it still lacks IP proxies and request-source masquerading (custom headers such as User-Agent). I plan to add those next.
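As a preview, here is a rough sketch of what that could look like, assuming Apify SDK v1, where Apify.createProxyConfiguration and per-request headers are available; the proxy URL and User-Agent string are placeholders, not working values.

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    // Masquerade the request source with a browser-like User-Agent (placeholder value)
    await requestQueue.addRequest({
        url: 'https://movie.douban.com/top250',
        headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' },
    });
    // Route requests through a proxy (placeholder URL)
    const proxyConfiguration = await Apify.createProxyConfiguration({
        proxyUrls: ['http://my-proxy.example.com:8000'],
    });
    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        // Minimal page handler, just to keep the sketch self-contained
        handlePageFunction: async ({ $ }) => console.log($('title').text()),
        proxyConfiguration,
    });
    await crawler.run();
});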

Code address: github.com/hp0844182/n…