Preface

A web crawler (also known as a web spider or web bot, and in the FOAF community more often called a web chaser) is a program or script that automatically gathers information from the World Wide Web according to certain rules. Other, less commonly used names include ant, automatic indexer, simulator, and worm.

Web crawlers can be used to collect data automatically. For example, search engines use them to crawl and index sites, data analysis and mining projects use them to gather data, and financial analysis uses them to collect financial data. They can also be applied to public opinion monitoring and analysis, target customer data collection, and other fields.

1. Classification of web crawlers

According to their system structure and implementation technology, web crawlers can be roughly divided into the following types: general purpose web crawlers, focused web crawlers, incremental web crawlers, and deep Web crawlers. A real-world crawler system is usually built by combining several of these techniques. The following is a brief introduction to each type.

1.1. General web crawler

General purpose web crawlers (also known as scalable web crawlers) expand their crawl across the entire Web starting from a set of seed URLs, gathering data for portal-site search engines and large Web service providers.
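
As a rough illustration (not part of the demo later in this article), the core of a general purpose crawler can be sketched as a queue of URLs seeded with a few start pages; fetchPage here is a hypothetical placeholder for the download-and-extract-links step.

```javascript
// Minimal sketch of a general purpose crawler loop (illustrative only)
async function fetchPage(url) {
    // Hypothetical placeholder: download the page and return the URLs found on it
    return [];
}

async function crawl(seedUrls, maxPages) {
    const frontier = [...seedUrls]; // URLs waiting to be visited
    const visited = new Set();      // URLs already crawled

    while (frontier.length > 0 && visited.size < maxPages) {
        const url = frontier.shift();
        if (visited.has(url)) continue;
        visited.add(url);

        const links = await fetchPage(url); // download + extract links
        for (const link of links) {
            if (!visited.has(link)) frontier.push(link); // expand outward from the seeds
        }
    }
    return visited;
}
```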

1.2. Focused web crawler

Focused web crawlers, also known as topical crawlers, selectively crawl only the pages related to predefined topics. Compared with a general purpose crawler, a focused crawler only needs to fetch topic-related pages, which greatly saves hardware and network resources. Because far fewer pages are saved, they can also be refreshed quickly, and such a crawler can serve the needs of specific groups of people for information in specific fields very well.
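
A focused crawler differs from the general loop above mainly in that candidate links are filtered for topical relevance before they enter the queue. A minimal sketch, with an entirely hypothetical keyword check:

```javascript
// Only enqueue links that look relevant to the predefined topic (illustrative only)
const topicKeywords = ["crawler", "spider", "scraping"]; // hypothetical topic

function isRelevant(text) {
    const lower = (text || "").toLowerCase();
    return topicKeywords.some(keyword => lower.includes(keyword));
}

// Inside the crawl loop, before pushing a link onto the frontier:
// if (isRelevant(anchorText) || isRelevant(link)) frontier.push(link);
```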

1.3. Incremental Web crawler

An incremental web crawler incrementally updates pages that have already been downloaded and only crawls pages that are new or have changed. To a certain extent, this ensures that the crawled pages are as fresh as possible.
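
One simple way to implement the "only new or changed pages" idea is to remember a hash of each page's content and skip pages whose hash has not changed. A minimal sketch (the storage here is just an in-memory Map; a real crawler would persist it):

```javascript
const crypto = require("crypto");

const lastHashes = new Map(); // url -> hash of the content saved last time

function shouldUpdate(url, body) {
    const hash = crypto.createHash("md5").update(body).digest("hex");
    if (lastHashes.get(url) === hash) {
        return false;          // unchanged page, skip re-processing
    }
    lastHashes.set(url, hash); // new or changed page
    return true;
}
```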

1.4. Deep Web crawler

By the way they exist, Web pages can be divided into surface Web pages and deep Web pages (also known as invisible Web pages or hidden Web pages). Surface Web pages are those that traditional search engines can index: mainly static pages reachable through hyperlinks. Deep Web pages are those whose content is largely unreachable through static links; it is hidden behind search forms and only becomes available after the user submits some keywords.
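
In practice, reaching deep Web content means the crawler has to submit the search form itself. A minimal sketch using superagent (introduced below); the endpoint and field name are hypothetical:

```javascript
const superagent = require("superagent");

function searchHiddenPages(keyword) {
    return superagent
        .post("https://example.com/search") // hypothetical search form action
        .type("form")                       // submit as a normal HTML form would
        .send({ q: keyword })               // hypothetical form field carrying the keyword
        .then(response => response.text);   // the result page, ready to be parsed
}
```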

2. Create a simple crawler application

With a basic understanding of the crawler types above, let's implement a simple crawler application.

2.1. Goal

When crawlers come up, people tend to think of big data, and from there of Python. But since I mainly do front-end development, JavaScript is the more familiar and convenient choice for me. So, as a small goal, we will use NodeJS to crawl the list of articles on the homepage of cnblogs (my favorite developer site) and write it to a local JSON file.

2.2. Environment setup

  • NodeJS: download the installer from the official website and install it on your computer.
  • NPM: the NodeJS package management tool, installed together with NodeJS.

After NodeJS is installed, open a command line. Run node -v to check whether NodeJS was installed successfully, and npm -v to check whether NPM was installed successfully; each command should print a version number.
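
For example (the version numbers below are only illustrative; yours will depend on what you installed):

```
node -v    # prints the NodeJS version, e.g. v16.14.0
npm -v     # prints the NPM version, e.g. 8.3.1
```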

2.3. Concrete implementation

2.3.1. Install dependency packages

Run the command below to install the superagent and cheerio dependency packages, then create a crawler.js file.
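
```
npm install superagent cheerio --save-dev
```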

  • superagent: a lightweight, flexible, easy-to-read client-side HTTP request library with a low learning curve, used in the NodeJS environment.

  • cheerio: a fast, flexible and lean implementation of core jQuery designed for the server. It lets you manipulate the parsed document with jQuery-like syntax (a tiny example follows this list).
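
As a quick illustration of that jQuery-like API (separate from the crawler code below):

```javascript
const cheerio = require("cheerio");

// Load an HTML string, then query it with familiar jQuery selectors
const $ = cheerio.load("<ul><li class='item'>Hello</li></ul>");
console.log($("li.item").text()); // -> Hello
```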

```javascript
// Import dependencies
const http       = require("http");
const path       = require("path");
const url        = require("url");
const fs         = require("fs");
const superagent = require("superagent");
const cheerio    = require("cheerio");
```

2.3.2. Crawl data

Next, send a GET request for the page. Once the page content is returned, parse the DOM and pick out the data we want, then convert the processed result to a JSON string and save it locally.

```javascript
// Page address to fetch
const pageUrl = "https://www.cnblogs.com/";
// Decode an HTML-escaped string
function unescapeString(str) {
    if (!str) { return ''; }
    return unescape(str.replace(/&#x/g, '%u').replace(/;/g, ''));
}

// Fetch data
function fetchData() {
    console.log('Crawl data time node:', new Date());
    superagent.get(pageUrl).end((error, response) => {
        // Page document data
        let content = response.text;
        if (content) {
            console.log('Data obtained successfully');
        }
        // Define an empty array to receive the data
        let result = [];
        let $ = cheerio.load(content);
        let postList = $("#main #post_list .post_item");
        postList.each((index, value) => {
            let titleLnk = $(value).find('a.titlelnk');
            let itemFoot = $(value).find('.post_item_foot');

            let title = titleLnk.html();                                          // title
            let href = titleLnk.attr('href');                                     // link
            let author = itemFoot.find('a.lightblue').html();                     // author
            let headLogo = $(value).find('.post_item_summary a img').attr('src'); // avatar
            let summary = $(value).find('.post_item_summary').text();             // summary
            let postedTime = itemFoot.text().split('published on')[1].substr(0, 16); // publish time
            let readNum = itemFoot.text().split('read')[1];                       // read count
            readNum = readNum.substr(1, readNum.length - 1);

            title = unescapeString(title);
            href = unescapeString(href);
            author = unescapeString(author);
            headLogo = unescapeString(headLogo);
            summary = unescapeString(summary);
            postedTime = unescapeString(postedTime);
            readNum = unescapeString(readNum);
            result.push({
                index,
                title,
                href,
                author,
                headLogo,
                summary,
                postedTime,
                readNum
            });
        });
        // Convert the array to a string
        result = JSON.stringify(result);
        // Write to a local file
        fs.writeFile("cnblogs.json", result, "utf-8", (err) => {
            // If the write succeeds, err is null
            if (!err) { console.log('Write data successfully'); }
        });
    });
}
fetchData();
```
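
One thing the demo above glosses over is the error argument of the superagent callback: it is never checked, so a failed request would throw when reading response.text. As one possible refinement (not in the original code), the callback could bail out early:

```javascript
superagent.get(pageUrl).end((error, response) => {
    if (error || !response || !response.text) {
        console.log('Request failed:', error);
        return; // skip this round instead of crashing
    }
    // ... parse response.text as before ...
});
```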

3. Running and optimization

3.1. Generate results

Open a command line in the project directory and run node crawler.js. A cnblogs.json file will be created in the directory; open it to view the crawled data.

Compare it with the cnblogs homepage: the list of articles on the homepage has been crawled successfully.

3.2. Timed crawling

So far the data is fetched only once each time the script runs, so let's add a timer to make it crawl automatically every five minutes. The code is as follows:

```javascript
// ... existing crawler code ...

// Request every 5 minutes
setInterval(() => {
    fetchData();
}, 5 * 60 * 1000);
```

4. Summary

Web crawlers have far more applications than the ones above. This has been just a brief introduction to web crawlers together with a small demo; if anything is lacking, corrections are welcome.
