Background
When people think of crawlers, most think of Python, but Node.js is also a great way to write them. For simple, efficient crawlers, JavaScript has a real advantage: because it is asynchronous, it can fetch multiple pages concurrently more easily than typical Python code. On the other hand, the Python ecosystem is more complete (mainly because I don't know Python yet).
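For example, a handful of pages can be fetched concurrently with nothing more than `Promise.all` — a toy sketch (the URLs are placeholders, and axios is assumed to be installed):

```js
// A toy sketch of the async advantage: both requests are in flight at the
// same time instead of back to back.
const axios = require('axios');

const urls = ['https://example.com/a', 'https://example.com/b'];

(async () => {
  const pages = await Promise.all(urls.map(url => axios.get(url)));
  pages.forEach(res => console.log(res.status, res.config.url));
})();
```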
How to crawl?
- Determine the crawl target (pick a seed URL)
The idea is to walk through the first 20 posts on the recommended page, then go to each article's detail page to grab the data.
```js
// 1. Recommended-list API
// https://apinew.juejin.im/recommend_api/v1/article/recommend_all_feed
// parameters:
// { id_type: 2, client_type: 2608, sort_type: 200, cursor: '0', limit: 20 }

// 2. Article detail page
// https://juejin.im/post/{article_id}
```
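Before writing the full crawler, a quick probe of the list endpoint helps confirm the response shape. The sketch below assumes the parameters and the `item_info.article_id` field from my own analysis of the API; the endpoint may well have changed since:

```js
// Quick probe: hit the recommend API and log the article ids it returns
const axios = require('axios');

axios
  .post('https://apinew.juejin.im/recommend_api/v1/article/recommend_all_feed', {
    id_type: 2,
    client_type: 2608,
    sort_type: 200,
    cursor: '0',
    limit: 20
  })
  .then(({ data: { data } }) => {
    // Each entry's item_info carries the id used to build the detail URL
    data.forEach(item => console.log(item.item_info.article_id));
  });
```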
- Set up the environment
I'm working in a Koa environment here, with Babel installed to support ES6 syntax.
- Write the code: crawl and parse the page data
The idea is simple: treat the page as a GET request, use cheerio to pull the data out of the loaded HTML, and then write it to a .txt file with the fs module. Cheerio is a fast, flexible implementation of core jQuery designed specifically for the server; its documentation (there is a Chinese translation) covers the specific APIs.
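For a two-line taste of cheerio before the full script, here is a minimal sketch with a hardcoded HTML fragment:

```js
// Minimal cheerio sketch: load an HTML fragment, then query it like jQuery
const cheerio = require('cheerio');

const $ = cheerio.load('<ul><li class="a">one</li><li class="b">two</li></ul>');
console.log($('.b').text()); // => "two"
```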
```js
(function() {
  const fs = require('fs');
  const axios = require('axios');
  const cheerio = require('cheerio');

  // Fetch the recommended list with axios
  axios
    .post('https://apinew.juejin.im/recommend_api/v1/article/recommend_all_feed', {
      id_type: 2,
      client_type: 2608,
      sort_type: 200,
      cursor: '0',
      limit: 20
    })
    .then(async ({ data: { data } }) => {
      if (data.length === 0) {
        console.log('No data: ' + data);
        return false;
      }
      for (let i = 0; i < data.length; i++) {
        // Request each article's details one by one,
        // mainly because I want to keep the order of the list data
        const res = await axios.get(
          // The article id comes from my own analysis of the API response
          data[i].item_info.url ||
            'https://juejin.im/post/' +
              (data[i].item_info.article_id || data[i].item_info.advert_id)
        );
        // Load the page HTML with cheerio
        const $ = cheerio.load(res.data, {
          normalizeWhitespace: false,
          xmlMode: false,
          decodeEntities: false
        });
        fs.appendFile(
          './spider/result.txt',
          // Grab the HTML inside the <article> tag
          // (inspect the page first to find the right selector)
          $('article').html(),
          'utf8',
          err => {
            if (err) throw err;
            console.log('Data appended to file');
          }
        );
      }
    })
    .catch(error => {
      console.log(error);
    });
})();
```
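One small caveat on the loop above: `fs.appendFile` is fire-and-forget, so while the network requests are awaited in order, the file writes themselves are not strictly guaranteed to land in order. If ordering really matters, the promise-based fs API can be awaited as well — a minimal sketch, not the code above:

```js
// Awaiting each append guarantees the writes hit the file in sequence
const fsp = require('fs').promises;

(async () => {
  await fsp.appendFile('./spider/result.txt', 'article 1 html\n', 'utf8');
  await fsp.appendFile('./spider/result.txt', 'article 2 html\n', 'utf8');
})();
```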
Crawl and crawl again
With Double 12 coming up, I wanted to crawl Taobao data, but the process turned out to be a bit more complicated. First I tried the same old approach:
```js
(function() {
  const fs = require('fs');
  const axios = require('axios');
  const cheerio = require('cheerio');
  const iconv = require('iconv-lite');
  const util = require('util');

  axios({
    method: 'get',
    url:
      'https://list.tmall.com/search_product.htm?spm=a221t.1476805.cat.19.52006769d615Pr&cat=54290107&q=%E4%BC%91%E9%97%B2%E5%A5%97%E8%A3%85',
    headers: { 'content-type': 'text/html; charset=GBK' }
  })
    .then(res => {
      if (!res.data) {
        console.log('Data is empty, nothing to append');
        return false;
      }
      // Try to decode the GBK response with iconv-lite
      const contentBinary = iconv.decode(
        Buffer.from(util.inspect(res.data), 'binary'),
        'GBK'
      );
      const $ = cheerio.load(contentBinary, {
        normalizeWhitespace: true,
        decodeEntities: false
      });
      console.log($('.productTitle a').attr('title'));
      console.log($('em').attr('title'));
      fs.appendFile(
        './spider/taobao.txt',
        $('.productTitle a').attr('title'),
        'utf8',
        err => {
          if (err) throw err;
          console.log('Data appended to file');
        }
      );
    })
    .catch(err => {
      if (err) throw err;
    });
})();
```
The Chinese in the printed results was all gibberish. After a lot of searching on Baidu, most answers recommended iconv-lite, but it didn't work here and I couldn't figure out why. If anyone understands this, please point me in the right direction.
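In hindsight, one likely culprit (my guess, not verified against this exact page): by default axios decodes the response body into a UTF-8 string before iconv-lite ever sees it, so the GBK bytes are already mangled, and `util.inspect` mangles them further. Asking axios for the raw bytes sidesteps both problems — a minimal sketch:

```js
const axios = require('axios');
const iconv = require('iconv-lite');

axios
  .get('https://list.tmall.com/search_product.htm?cat=54290107', {
    responseType: 'arraybuffer' // hand back raw bytes, no string decoding
  })
  .then(res => {
    // Decode the untouched GBK bytes ourselves
    const html = iconv.decode(Buffer.from(res.data), 'GBK');
    console.log(html.slice(0, 300));
  });
```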
Give up? That’s impossible
With the approach above, the content crawled from the Taobao site came back as garbled Chinese. Later I found a wonderful tool: Puppeteer (see its GitHub repo).
Puppeteer is a Node.js library that provides a high-level API to control headless Chrome over the DevTools Protocol. Most things you can do manually in a browser can be done with Puppeteer.
What is headless Chrome? It is Chrome running without a graphical interface. It can be driven from the command line or from code, with no human intervention, and it runs very stably.
```js
;(async function() {
  const cheerio = require('cheerio');
  const puppeteer = require('puppeteer');
  const fs = require('fs');

  const browser = await puppeteer.launch({
    headless: false, // launch with a visible browser window
    slowMo: 100, // slow down browser actions so they are easier to observe
    args: ['--no-sandbox', '--window-size=1280,960'] // Chrome launch flags
  });
  const page = await browser.newPage(); // open a new page
  await page.goto(
    'https://list.tmall.com/search_product.htm?spm=a221t.1476805.cat.19.52006769d615Pr&cat=54290107&q=%E4%BC%91%E9%97%B2%E5%A5%97%E8%A3%85'
  );
  const html = await page.content(); // full HTML of the rendered page
  const $ = cheerio.load(html, {
    normalizeWhitespace: true,
    decodeEntities: false
  });
  const goods = [];
  const price = [];
  const saleNum = [];
  $('.productTitle a').each((i, elem) => {
    goods.push($(elem).attr('title'));
  });
  $('.productPrice em').each((i, elem) => {
    price.push($(elem).attr('title'));
  });
  $('.productStatus em').each((i, elem) => {
    saleNum.push($(elem).text());
  });
  // Join product name, price, and monthly sales into CSV rows
  const result = goods.reduce((current, item, i) => {
    return `${current}\n${item},¥${price[i]},${saleNum[i]}`;
  }, '');
  fs.appendFile(
    './spider/tmall.csv',
    'product,price,monthly sales' + result,
    'utf8',
    err => {
      if (err) throw err;
      console.log('Data appended to file');
    }
  );
  await page.close();
  await browser.close();
})();
```
Since this uses the Chrome browser itself to fetch the page data, there is no garbled-Chinese problem.
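As an aside, Puppeteer can also do the extraction inside the page itself, which makes cheerio optional. A sketch of a drop-in alternative for the cheerio section of the script above (same selectors assumed, not re-verified against today's Tmall markup; this belongs inside the async function where `page` is in scope):

```js
// $$eval runs the callback in the browser context and returns plain data
const goods = await page.$$eval('.productTitle a', links =>
  links.map(a => a.getAttribute('title'))
);
const price = await page.$$eval('.productPrice em', ems =>
  ems.map(em => em.getAttribute('title'))
);
```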
Finally
At this point, the basic crawler is done. But on the Taobao site, search only returns the first page of results; pagination requires logging in. So the next step is to script the login in JS as well, and then crawl the data.
Type ("#username",config.username); await page.type("#password",config.password); await page.click('#login'); console.log("Login Secuessed..." );Copy the code
Please give me a thumbs up 👍👍👍
Please give me a thumbs up 👍👍👍
Please give me a thumbs up 👍👍👍