Preface
Crawlers should follow the robots protocol.
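As a quick illustration: before crawling a path, you can fetch the site's robots.txt and check the path against its Disallow rules. The helper below is a hypothetical, minimal sketch that only reads the `User-agent: *` block:

```javascript
// Minimal sketch: is urlPath allowed by the "User-agent: *" rules?
function isAllowed(robotsTxt, urlPath) {
  const lines = robotsTxt.split('\n').map((l) => l.trim());
  let applies = false;
  const disallowed = [];
  for (const line of lines) {
    if (/^user-agent:/i.test(line)) {
      // Only collect rules from the block that applies to all crawlers.
      applies = /^\s*\*\s*$/.test(line.split(':')[1]);
    } else if (applies && /^disallow:/i.test(line)) {
      const rule = line.split(':')[1].trim();
      if (rule) disallowed.push(rule);
    }
  }
  return !disallowed.some((rule) => urlPath.startsWith(rule));
}

const robots = ['User-agent: *', 'Disallow: /admin/', 'Disallow: /private/'].join('\n');
console.log(isAllowed(robots, '/images/1.jpg')); // true
console.log(isAllowed(robots, '/admin/login'));  // false
```

A real crawler would use a full parser (wildcards, Allow rules, per-agent blocks), but this captures the idea.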
What is a crawler
Quoted from Baidu Baike:
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, autoindexer, emulator, and worm.
When you visit a website and see a lot of nice pictures, you right-click and save them locally. After saving a few by hand, you start to wonder: why not write a script to download these pictures automatically? And so the crawler was born...
Common crawler types
- Server-side rendered pages (SSR): the server returns fully rendered HTML, so the content is already in the response
- Client-side rendered pages (CSR): a typical single-page application, where the page is filled in by JavaScript in the browser
The second kind requires analyzing the site's API requests. This article covers the first kind: using Node.js to crawl remote images and download them locally.
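A quick way to tell the two apart is to request the page and check whether the data you want is already in the raw HTML. The snippet below uses made-up markup purely for illustration:

```javascript
// SSR: the data is already present in the HTML the server returns.
const ssr_html = '<ul class="news"><li>picture-1.jpg</li></ul>';
// CSR: the server returns an empty shell that client-side JS fills in later.
const csr_html = '<div id="app"></div><script src="bundle.js"></script>';

// Crude check: does the raw HTML already contain the content we want?
function looksServerRendered(html, needle) {
  return html.includes(needle);
}

console.log(looksServerRendered(ssr_html, 'picture-1.jpg')); // true
console.log(looksServerRendered(csr_html, 'picture-1.jpg')); // false
```

If the raw HTML does not contain the data, the site is client-rendered and you should crawl its API instead.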
End result:
Preparation
1 Directory structure

```
├── app.js
├── download.json
```
2 Install dependencies
- axios, the HTTP request library

```
npm i axios --save
```

- cheerio, 'jQuery' for the server

```
npm i cheerio --save
```

- fs, the file-system module; it ships with Node.js, so there is nothing to install
Start the crawler
We will crawl an outdoor sports website: grab the recommended images on the home page and download them locally.
1. Plan the steps:
- Analyze the page structure to determine what to crawl
- Have Node fetch the page content over HTTP
- Use cheerio to extract an array of image data
- Iterate over the array and download each image locally
2. Write the code. axios fetches the HTML; inspecting it shows the pictures live in the `.newscon` nodes. cheerio's usage is essentially the same as jQuery's, so we grab each picture's title and download link:
```
const res = await axios.get(target_url);
const html = res.data;
const $ = cheerio.load(html);
const result_list = [];
// cheerio's each() callback receives (index, element), like jQuery's
$('.newscon').each((index, element) => {
  result_list.push({
    title: $(element).find('.newsintroduction').text(),
    download_url: $(element).find('img').attr('src').split('!')[0]
  });
});
this.result_list.push(...result_list);
```
Now that you have an array of download links, all that is left is to walk the array, request each link, and save the response locally with fs:
```
const target_path = path.resolve(__dirname, `./cache/image/${href.split('/').pop()}`);
const response = await axios.get(href, { responseType: 'stream' });
// pipe() returns a stream, not a promise, so wrap it to await completion
await new Promise((resolve, reject) => {
  response.data
    .pipe(fs.createWriteStream(target_path))
    .on('finish', resolve)
    .on('error', reject);
});
```
3 Request optimization. Requests that come too fast can get your IP blocked; a few simple countermeasures:
- Space out requests to avoid bursts in a short period
- Use an axios request interceptor to set the User-Agent header, rotating it on every request
- Use an IP pool, sending each request from a different IP
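As a sketch of the second point: an axios request interceptor can pick a User-Agent at random before every request. The `USER_AGENTS` list and `withRandomUA` helper below are illustrative, not part of the original code:

```javascript
// Hypothetical pool of User-Agent strings to rotate through.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
];

// Shaped like an axios request interceptor: it receives the request
// config, sets a random User-Agent header, and returns the config.
function withRandomUA(config) {
  const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  config.headers = { ...config.headers, 'User-Agent': ua };
  return config;
}

// Registering it with axios would look like:
// axios.interceptors.request.use(withRandomUA);
```

Each request then goes out with a different, randomly chosen User-Agent, which makes the traffic look less like a single script.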
The complete code
```
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');
const path = require('path');

class stealData {
  constructor(base_url = '') {
    this.base_url = base_url;
    this.current_page = 1;
    this.result_list = [];
  }

  async init() {
    try {
      await this.getPageData();
      await this.downLoadPictures();
    } catch (e) {
      console.log(e);
    }
  }

  sleep(time) {
    return new Promise((resolve) => {
      console.log(`Sleeping, resending the request in ${time / 1000} seconds...`);
      setTimeout(() => {
        resolve();
      }, time);
    });
  }

  async getPageData() {
    const target_url = this.base_url;
    try {
      const res = await axios.get(target_url);
      const html = res.data;
      const $ = cheerio.load(html);
      const result_list = [];
      $('.newscon').each((index, element) => {
        result_list.push({
          title: $(element).find('.newsintroduction').text(),
          download_url: $(element).find('img').attr('src').split('!')[0]
        });
      });
      this.result_list.push(...result_list);
      return Promise.resolve(result_list);
    } catch (e) {
      console.log('Failed to fetch page data');
      return Promise.reject(e);
    }
  }

  async downLoadPictures() {
    const result_list = this.result_list;
    try {
      for (let i = 0, len = result_list.length; i < len; i++) {
        console.log(`Starting download of file ${i + 1}!`);
        await this.downLoadPicture(result_list[i].download_url);
        // Random pause between downloads to avoid hammering the server
        await this.sleep(3000 * Math.random());
        console.log(`File ${i + 1} downloaded successfully!`);
      }
      return Promise.resolve();
    } catch (e) {
      console.log('Failed to write data');
      return Promise.reject(e);
    }
  }

  async downLoadPicture(href) {
    try {
      const target_path = path.resolve(__dirname, `./cache/image/${href.split('/').pop()}`);
      const response = await axios.get(href, { responseType: 'stream' });
      // Wait for the write stream to finish before moving on
      await new Promise((resolve, reject) => {
        response.data
          .pipe(fs.createWriteStream(target_path))
          .on('finish', resolve)
          .on('error', reject);
      });
      console.log('Write succeeded');
      return Promise.resolve();
    } catch (e) {
      console.log('Failed to write data');
      return Promise.reject(e);
    }
  }
}

const thief = new stealData('xxx_url');
thief.init();
```