Preface
Crawlers should follow the robots protocol.
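As a quick illustration: before crawling a path, you can fetch the site's robots.txt and check the path against its Disallow rules. The helper below is a hypothetical, minimal sketch that only reads the `User-agent: *` block:

```javascript
// Minimal sketch: is urlPath allowed by the "User-agent: *" rules?
function isAllowed(robotsTxt, urlPath) {
  const lines = robotsTxt.split('\n').map((l) => l.trim());
  let applies = false;
  const disallowed = [];
  for (const line of lines) {
    if (/^user-agent:/i.test(line)) {
      // Only collect rules from the block that applies to all crawlers.
      applies = /^\s*\*\s*$/.test(line.split(':')[1]);
    } else if (applies && /^disallow:/i.test(line)) {
      const rule = line.split(':')[1].trim();
      if (rule) disallowed.push(rule);
    }
  }
  return !disallowed.some((rule) => urlPath.startsWith(rule));
}

const robots = ['User-agent: *', 'Disallow: /admin/', 'Disallow: /private/'].join('\n');
console.log(isAllowed(robots, '/images/1.jpg')); // true
console.log(isAllowed(robots, '/admin/login'));  // false
```

A real crawler would use a full parser (wildcards, Allow rules, per-agent blocks), but this captures the idea.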
What is a crawler
Quoted from Baidu Baike:
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, autoindexer, emulator, and worm.
When you visit a website and see a lot of nice pictures, you right-click and save them locally. After saving a few by hand, you start to wonder: why not write a script to download these pictures automatically? And so the crawler was born...
Common crawler types
- Server-side rendered pages (SSR): the server returns fully rendered HTML, so the content is already in the response
- Client-side rendered pages (CSR): a typical single-page application, where the page is filled in by JavaScript in the browser
The second kind requires analyzing the site's API requests. This article covers the first kind: using Node.js to crawl remote images and download them locally.
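A quick way to tell the two apart is to request the page and check whether the data you want is already in the raw HTML. The snippet below uses made-up markup purely for illustration:

```javascript
// SSR: the data is already present in the HTML the server returns.
const ssr_html = '<ul class="news"><li>picture-1.jpg</li></ul>';
// CSR: the server returns an empty shell that client-side JS fills in later.
const csr_html = '<div id="app"></div><script src="bundle.js"></script>';

// Crude check: does the raw HTML already contain the content we want?
function looksServerRendered(html, needle) {
  return html.includes(needle);
}

console.log(looksServerRendered(ssr_html, 'picture-1.jpg')); // true
console.log(looksServerRendered(csr_html, 'picture-1.jpg')); // false
```

If the raw HTML does not contain the data, the site is client-rendered and you should crawl its API instead.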
End result:
Preparation
1 Directory structure

```
├── app.js
├── download.json
```
2 Install dependencies
- axios, the HTTP request library

```
npm i axios --save
```

- cheerio, 'jQuery' for the server

```
npm i cheerio --save
```

- fs, the file-system module; it ships with Node.js, so there is nothing to install
Start the crawler
We will crawl an outdoor sports website: grab the recommended images on the home page and download them locally.
1. Plan the steps:
- Analyze the page structure to determine what to crawl
- Have Node fetch the page content over HTTP
- Use cheerio to extract an array of image data
- Iterate over the array and download each image locally
2. Write the code. axios fetches the HTML; inspecting it shows the pictures live in the `.newscon` nodes. cheerio's usage is essentially the same as jQuery's, so we grab each picture's title and download link:
```
const res = await axios.get(target_url);
const html = res.data;
const $ = cheerio.load(html);
const result_list = [];
// cheerio's each() callback receives (index, element), like jQuery's
$('.newscon').each((index, element) => {
  result_list.push({
    title: $(element).find('.newsintroduction').text(),
    download_url: $(element).find('img').attr('src').split('!')[0]
  });
});
this.result_list.push(...result_list);
```
Now that you have an array of download links, all that is left is to walk the array, request each link, and save the response locally with fs:
```
const target_path = path.resolve(__dirname, `./cache/image/${href.split('/').pop()}`);
const response = await axios.get(href, { responseType: 'stream' });
// pipe() returns a stream, not a promise, so wrap it to await completion
await new Promise((resolve, reject) => {
  response.data
    .pipe(fs.createWriteStream(target_path))
    .on('finish', resolve)
    .on('error', reject);
});
```
3 Request optimization. Requests that come too fast can get your IP blocked; a few simple countermeasures:
- Space out requests to avoid bursts in a short period
- Use an axios request interceptor to set the User-Agent header, rotating it on every request
- Use an IP pool, sending each request from a different IP
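As a sketch of the second point: an axios request interceptor can pick a User-Agent at random before every request. The `USER_AGENTS` list and `withRandomUA` helper below are illustrative, not part of the original code:

```javascript
// Hypothetical pool of User-Agent strings to rotate through.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
];

// Shaped like an axios request interceptor: it receives the request
// config, sets a random User-Agent header, and returns the config.
function withRandomUA(config) {
  const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  config.headers = { ...config.headers, 'User-Agent': ua };
  return config;
}

// Registering it with axios would look like:
// axios.interceptors.request.use(withRandomUA);
```

Each request then goes out with a different, randomly chosen User-Agent, which makes the traffic look less like a single script.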
The complete code
```
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');
const path = require('path');

class stealData {
  constructor(base_url = '') {
    this.base_url = base_url;
    this.current_page = 1;
    this.result_list = [];
  }

  async init() {
    try {
      await this.getPageData();
      await this.downLoadPictures();
    } catch (e) {
      console.log(e);
    }
  }

  sleep(time) {
    return new Promise((resolve) => {
      console.log(`Sleeping, resending the request in ${time / 1000} seconds...`);
      setTimeout(() => {
        resolve();
      }, time);
    });
  }

  async getPageData() {
    const target_url = this.base_url;
    try {
      const res = await axios.get(target_url);
      const html = res.data;
      const $ = cheerio.load(html);
      const result_list = [];
      $('.newscon').each((index, element) => {
        result_list.push({
          title: $(element).find('.newsintroduction').text(),
          download_url: $(element).find('img').attr('src').split('!')[0]
        });
      });
      this.result_list.push(...result_list);
      return Promise.resolve(result_list);
    } catch (e) {
      console.log('Failed to fetch page data');
      return Promise.reject(e);
    }
  }

  async downLoadPictures() {
    const result_list = this.result_list;
    try {
      for (let i = 0, len = result_list.length; i < len; i++) {
        console.log(`Starting download of file ${i + 1}!`);
        await this.downLoadPicture(result_list[i].download_url);
        // Random pause between downloads to avoid hammering the server
        await this.sleep(3000 * Math.random());
        console.log(`File ${i + 1} downloaded successfully!`);
      }
      return Promise.resolve();
    } catch (e) {
      console.log('Failed to write data');
      return Promise.reject(e);
    }
  }

  async downLoadPicture(href) {
    try {
      const target_path = path.resolve(__dirname, `./cache/image/${href.split('/').pop()}`);
      const response = await axios.get(href, { responseType: 'stream' });
      // Wait for the write stream to finish before moving on
      await new Promise((resolve, reject) => {
        response.data
          .pipe(fs.createWriteStream(target_path))
          .on('finish', resolve)
          .on('error', reject);
      });
      console.log('Write succeeded');
      return Promise.resolve();
    } catch (e) {
      console.log('Failed to write data');
      return Promise.reject(e);
    }
  }
}

const thief = new stealData('xxx_url');
thief.init();
```