Preface
Goal of this article: crawl keyword search results from the Baidu search engine and deploy the crawler to Alibaba Cloud Function Compute.
Before you begin, take a quick look at the Function Compute and Puppeteer concepts below; the later steps build on them.
Function Compute
What is Function Compute
Function Compute is an event-driven, fully managed computing service that lets you focus on writing code instead of managing server infrastructure. When an event is triggered, Function Compute runs the task elastically and reliably in the cloud, and supports log queries, performance monitoring, and alarms.
Advantages of Function Compute
- You do not need to purchase or manage servers and other infrastructure, so operation and maintenance costs are low.
- You only need to focus on business logic: design, optimize, test, review, and upload your application code in any language supported by Function Compute.
- Applications are triggered in an event-driven manner to respond to user requests. Function Compute integrates seamlessly with Alibaba Cloud services such as Object Storage Service (OSS), API Gateway, Log Service, and Table Store, helping you build applications quickly.
- Log queries, performance monitoring, and alarms help you locate and fix faults quickly.
- Millisecond-level elastic scaling quickly expands the underlying capacity to handle peak load.
- Billing is pay-as-you-go with 100-millisecond granularity. You pay only for the computing resources actually used, which suits workloads with obvious peaks and troughs.
Puppeteer
What is Puppeteer
Puppeteer is a Node library that provides a high-level API for controlling headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
Through the APIs Puppeteer provides, you can control Chrome directly and simulate most user actions, either to run UI tests or to visit pages as a crawler and collect data.
What can Puppeteer do
- Generate screenshots and PDFs of pages
- Crawl SPA or SSR websites
- Automate tests that simulate form submission, keyboard input, mouse events, and more
- Capture a timeline trace of your site to help diagnose performance problems
- Create an up-to-date automated test environment: run tests directly in the latest version of Chrome, using the latest JavaScript and browser features
See the official Puppeteer documentation for details.
1. Preparation
Before starting, make sure the following tools are installed, updated to the latest version, and configured correctly.
- Funcraft
- Docker
Funcraft
Funcraft is a command line tool provided by Function Compute. With it you can easily manage resources such as Function Compute, API Gateway, and Log Service. Funcraft helps you develop, build, and deploy using a resource configuration file, template.yml. The Funcraft installation guide describes three ways to install it; the npm method is used here.
- Installation
  - Run the following command to install Funcraft:
    npm install @alicloud/fun -g
  - After the installation completes, run fun --version in a terminal to check the version information:
    fun --version
- Configure Funcraft
  - Run the following command:
    fun config
  - Configure Account ID, AccessKeyId, AccessKeySecret, and Default Region Name as prompted.
Docker
Funcraft relies on Docker to simulate the runtime environment locally for dependency compilation, installation, and local running and debugging.
Windows
- Installation
  Download and install Docker Desktop for your system.
- Configure a registry mirror in China
  {
    "registry-mirrors": [
      "https://docker.mirrors.ustc.edu.cn",
      "https://registry.docker-cn.com",
      "http://hub-mirror.c.163.com"
    ]
  }
  Right-click the Docker icon in the system tray at the bottom right corner of the desktop, edit the JSON on the Docker Daemon tab, add the mirror addresses above to the "registry-mirrors" array, and save.
Tips: The Alibaba Cloud Docker registry mirror is recommended.
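If you prefer to script the daemon configuration change instead of editing it by hand, the merge can be sketched as below. This is only an illustration: addRegistryMirrors is a hypothetical helper (not part of Docker or Funcraft), and the sample configuration object is an assumption.

```javascript
// Merge registry mirrors into a Docker daemon configuration object.
// `addRegistryMirrors` is a hypothetical helper, not part of Docker or Funcraft.
function addRegistryMirrors(daemonConfig, mirrors) {
  const existing = daemonConfig['registry-mirrors'] || [];
  // A Set drops duplicates while keeping insertion order.
  const merged = [...new Set([...existing, ...mirrors])];
  // Return a new object so the input configuration is left untouched.
  return { ...daemonConfig, 'registry-mirrors': merged };
}

// Assumed sample configuration that already contains one mirror.
const current = {
  debug: true,
  'registry-mirrors': ['https://docker.mirrors.ustc.edu.cn'],
};

const updated = addRegistryMirrors(current, [
  'https://docker.mirrors.ustc.edu.cn',
  'https://registry.docker-cn.com',
  'http://hub-mirror.c.163.com',
]);

console.log(JSON.stringify(updated, null, 2));
```

The printed JSON can then be pasted into the Docker Daemon tab.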
2. Practice
Initialize the project
- Run the following command and select the http-trigger-nodejs10 template:
  fun init -n xxx
  The -n, --name option sets the name of the project folder to generate. The default value is fun-app. The rest of this article assumes the project is named fun-puppetter.
- Create a folder to hold the Baidu crawler function:
  cd fun-puppetter && mkdir baiduKeywordResult
- Generate the package.json file:
  npm init -y
- Generate the Funcraft configuration file:
  fun config
  Configure Account ID, AccessKeyId, AccessKeySecret, and Default Region Name as prompted.
- Replace the contents of the template.yml file with:
ROSTemplateFormatVersion: '2015-09-01'
Transform: 'Aliyun::Serverless-2018-04-03'
Resources:
  FunPuppetter:
    Type: 'Aliyun::Serverless::Service'
    Properties:
      Description: 'Puppetter Service, a Service can create multiple functions'
    baiduKeywordResult:
      Type: 'Aliyun::Serverless::Function'
      Properties:
        Handler: index.handler
        Runtime: nodejs10
        CodeUri: './baiduKeywordResult'
        Timeout: 600
        MemorySize: 1024
        InstanceConcurrency: 3
      Events:
        httpTrigger:
          Type: HTTP
          Properties:
            AuthType: ANONYMOUS
            Methods: ['POST', 'GET']
template.yml declares a service named FunPuppetter. Inside this service it declares a function named baiduKeywordResult, configures an HTTP trigger named httpTrigger, sets the entry point to index.handler, and sets the runtime to nodejs10. Timeout specifies that the function may run for at most 600 seconds, and MemorySize allocates 1024 MB of memory for execution. InstanceConcurrency sets the concurrency of a function instance, that is, how many requests a single instance can handle at the same time. CodeUri points to the ./baiduKeywordResult directory; at deployment time, Funcraft packages and uploads the directory specified by CodeUri. For more configuration rules, see the Funcraft documentation.
- Move the /index.js file to /baiduKeywordResult
index.js
var getRawBody = require('raw-body');
var getFormBody = require('body/form');
var body = require('body');

module.exports.handler = function(req, resp, context) {
  console.log('hello world');
  var params = {
    path: req.path,
    queries: req.queries,
    headers: req.headers,
    method: req.method,
    requestURI: req.url,
    clientIP: req.clientIP,
  };
  getRawBody(req, function(err, body) {
    resp.setHeader('content-type', 'text/plain');
    // Echo every query parameter back as a response header
    for (var key in req.queries) {
      var value = req.queries[key];
      resp.setHeader(key, value);
    }
    params.body = body.toString();
    resp.send(JSON.stringify(params, null, ' '));
  });
};
The contents of the directory should look like this:
Run fun local start to debug. Funcraft responds as follows:
Open the generated URL. If the response looks like this, you can start coding.
Coding
Here is the interface we want to implement:
Baidu keyword search results
// request
{
  url: 'http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult',
  params: {
    keyword, // The keyword to search for
    page, // How many pages to crawl
  }
}
// response
{
  msg: 'success',
  code: 2000,
  data: [
    {
      title,
      abstract,
      redirectUrl,
      url,
      domain,
      keyword,
      pageNum,
    }
  ]
}
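To make the request side of this interface concrete, here is a minimal sketch that builds the request URL from the two parameters. buildRequestUrl is a hypothetical helper introduced only for illustration; the base URL is the local address used in this article.

```javascript
// Local HTTP trigger address used in this article.
const BASE_URL =
  'http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult';

// `buildRequestUrl` is a hypothetical helper used only for illustration.
function buildRequestUrl(baseUrl, { keyword, page }) {
  const url = new URL(baseUrl);
  // URLSearchParams percent-encodes values, so non-ASCII keywords are safe.
  url.searchParams.set('keyword', keyword);
  url.searchParams.set('page', String(page));
  return url.toString();
}

const requestUrl = buildRequestUrl(BASE_URL, { keyword: 'vue', page: 3 });
console.log(requestUrl);
```

Building the query string through URLSearchParams rather than string concatenation avoids broken URLs when the keyword contains spaces or Chinese characters.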
/baiduKeywordResult/index.js
The contents of /baiduKeywordResult/index.js are as follows:
Dependencies
const puppeteer = require('puppeteer');
const _ = require('lodash');
const async = require('async');
const axios = require('axios');
const cheerio = require('cheerio');
const nodeUrl = require('url');
- lodash: A high-performance JavaScript utility library.
- async: An excellent asynchronous control-flow library. Besides flow control functions, it also provides many other utility functions, and it was especially valuable before async/await existed.
- axios: A Promise-based HTTP library that can be used in browsers and Node.js.
- cheerio: A fast, flexible, and lean implementation of core jQuery functionality, mainly used for DOM manipulation on the server side.
- url: Used to process and parse URLs.
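As a small illustration of what the url module is used for in this crawler, the sketch below extracts the host of a result URL, which is what ends up in the domain field. It uses the WHATWG URL class exported by the built-in url module rather than the legacy parse function used in the article's code; domainOf is a hypothetical helper.

```javascript
const nodeUrl = require('url');

// `domainOf` is a hypothetical helper: it returns the host part of a link,
// which is what the crawler stores in the `domain` field.
function domainOf(link) {
  // The WHATWG URL class is exported by the built-in url module.
  return new nodeUrl.URL(link).host;
}

console.log(domainOf('https://www.runoob.com/vue2/vue-tutorial.html'));
console.log(domainOf('https://cn.vuejs.org/v2/guide/'));
```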
Handler function
module.exports.handler = async function(req, resp, context) {
  // Receive parameters
  let { keyword, page } = req.queries;
  if (_.isEmpty(keyword) || _.isEmpty(page)) {
    resp.send(JSON.stringify({
      msg: 'Incorrect parameters!',
      code: 4005,
      data: null
    }));
    // Stop here so the crawl does not run with missing parameters
    return;
  }
  try {
    // Baidu returns at most 76 pages of search results
    page = Math.min(page, 76);
    const task = new Task({ keyword, page });
    const result = await task.start();
    console.log('response result', result);
    resp.send(JSON.stringify(result));
  } catch (e) {
    console.log(e);
  }
};
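The parameter check in the handler can also be pulled out into a small pure function, which makes it easy to test without starting the function runtime. The following is a sketch under that assumption: parseParams is a hypothetical helper, and unlike the handler it additionally rejects non-numeric or non-positive page values.

```javascript
const MAX_PAGES = 76; // upper bound used by the handler

// `parseParams` is a hypothetical helper; it is not part of the original code.
function parseParams(queries) {
  const keyword =
    typeof queries.keyword === 'string' ? queries.keyword.trim() : '';
  // Query values arrive as strings, so convert the page count explicitly.
  const page = Number.parseInt(queries.page, 10);
  if (keyword === '' || !Number.isInteger(page) || page < 1) {
    return {
      ok: false,
      error: { msg: 'Incorrect parameters!', code: 4005, data: null },
    };
  }
  // Clamp the page count to the maximum number of pages.
  return { ok: true, keyword, page: Math.min(page, MAX_PAGES) };
}

console.log(parseParams({ keyword: 'vue', page: '3' }));
console.log(parseParams({ keyword: 'vue', page: '500' }));
console.log(parseParams({ keyword: 'vue' }));
```

In the handler, a failed result would be sent back with resp.send(JSON.stringify(result.error)) before returning.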
Task Class
class Task {
  // Constructor, called when an instance is created
  constructor(task) {
    this._result = {
      msg: 'success',
      data: {},
      code: 5000
    };
    this._browser = null;
    this._task = task;
  }

  async start() {
    try {
      await this.initialize();
      await this.execute();
      this._result.code = 2000;
    } catch (e) {
      console.log(e.stack);
      this._result.msg = e.stack;
      this._result.code = 5000;
    } finally {
      await this.destroy();
    }
    return this._result;
  }

  async initialize() {
    // Open a browser instance
    this._browser = await puppeteer.launch({
      headless: true,
      ignoreDefaultArgs: ['--disable-extensions'],
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox'
      ]
    });
  }

  async execute() {
    const { keyword, page } = this._task;
    const pageRange = _.range(0, page * 10, 10);
    let results = [];
    // Retrieve search results for each page concurrently
    results = await async.mapLimit(
      pageRange,
      50,
      async (offset) => {
        // Failure retry mechanism
        let retry = 0;
        let success = false;
        do {
          // Open a tab
          let entryPage = await this._browser.newPage();
          try {
            // Encode the keyword so spaces and non-ASCII characters survive in the URL
            const url = `https://baidu.com/s?wd=${encodeURIComponent(keyword)}&pn=${offset}`;
            console.log('Crawler url:', url);
            await entryPage.goto(url, {
              waitUntil: 'load',
              timeout: 1000 * 30
            });
            let pageData = [];
            // isLastPage is async, so it must be awaited; a bare Promise is always truthy
            if (await this.isLastPage(entryPage)) {
              pageData = await this.structureData(offset, entryPage);
            }
            success = true;
            return pageData;
          } catch (e) {
            console.log('error', e);
            retry++;
            // After 6 failed attempts, rethrow; the error is caught by the catch in start()
            if (retry >= 6) {
              throw e;
            }
          } finally {
            await entryPage.close();
          }
        } while (!success && retry < 6);
      }
    );
    // Flatten the per-page arrays and assign a 1-based rank across all pages
    results = _.flatMapDepth(results).map((item, index) => {
      item.rank = index + 1;
      return item;
    });
    console.log(results);
    this._result.data = results;
  }

  async structureData(offset = 0, entryPage) {
    const htmlContent = await entryPage.content();
    let htmlData = await this.htmlParse(htmlContent);
    // Iterate over parsed data, adding page and keyword fields
    htmlData = _.map(htmlData, (data) => {
      data.keyword = this._task.keyword;
      // pn offsets are 0, 10, 20, ..., so the 1-based page number is offset / 10 + 1
      data.pageNum = offset / 10 + 1;
      return data;
    });
    return htmlData;
  }

  async htmlParse(html) {
    // Parse HTML to get data
    const $ = cheerio.load(html);
    let pageItems = [];
    $(".result.c-container").each(function (i, el) {
      const that = $(el);
      const item = {
        title: _.trim(that.find("h3 > a").text()),
        abstract: _.trim(that.find(".c-abstract").text()),
        redirectUrl: _.trim(that.find("h3 > a").attr("href")),
        url: ""
      };
      pageItems.push(item);
    });
    // Request each redirect URL concurrently to obtain the real URL behind Baidu's redirect
    pageItems = await new Promise((resolve, reject) => {
      async.mapLimit(
        pageItems,
        50,
        async (item) => {
          const redirectResponse = await axios.head(item.redirectUrl, {
            timeout: 1000 * 10, // 10 seconds
            maxRedirects: 0,
            validateStatus: function (status) {
              return status >= 200 && status < 400;
            }
          });
          item.url = redirectResponse.headers.location || item.redirectUrl;
          item.domain = nodeUrl.parse(item.url).host;
          return item;
        },
        (err, results) => {
          if (err) {
            reject(err);
          } else {
            resolve(results);
          }
        }
      );
    });
    return pageItems;
  }

  // Despite its name, this returns a truthy value when the pager (#page) and a
  // .n link inside it are present; results are parsed only in that case
  async isLastPage(entryPage) {
    const htmlContent = await entryPage.content();
    // Parse HTML to get data
    const $ = cheerio.load(htmlContent);
    return $("#page").length && $("#page .n").length;
  }

  async destroy() {
    // The browser is null if launch failed, so guard before closing
    if (this._browser) {
      await this._browser.close();
    }
  }
}
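The failure retry mechanism inside execute() can be read in isolation as the pattern below. This is a simplified sketch, not the article's code: withRetry is a hypothetical helper, and the operation passed to it stands in for opening a tab and loading a page.

```javascript
// `withRetry` is a hypothetical helper illustrating the retry loop.
async function withRetry(operation, maxAttempts = 6) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Return as soon as one attempt succeeds.
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      console.log(`attempt ${attempt} failed: ${err.message}`);
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}

// Usage: a fake operation that fails twice, then succeeds on the third attempt.
const calls = [];
const demo = withRetry(async (attempt) => {
  calls.push(attempt);
  if (attempt < 3) {
    throw new Error('temporary failure');
  }
  return 'page data';
});

demo.then((value) => console.log(value, calls));
```

Keeping the retry logic in one helper means the per-page code only has to describe a single attempt, including closing the tab in its own finally block.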
/baiduKeywordResult/package.json
"dependencies": {
  "async": "^3.2.0",
  "axios": "^0.19.2",
  "cheerio": "^1.0.0-rc.3",
  "lodash": "^4.17.15",
  "puppeteer": "^2.0.0",
  "url": "^0.11.0"
},
Install dependencies
$ fun install -d
If you install the dependencies directly with npm install, Puppeteer will fail at run time. Puppeteer depends on Chromium, which in turn depends on a number of system libraries, and npm install also triggers the Chromium download. Users often run into two problems here:
- Chromium is large, so the download often fails because of network problems.
- npm only downloads Chromium; the system libraries Chromium depends on are not installed automatically, so users have to find and install the missing dependencies themselves.
Fortunately, Funcraft, the Function Compute command line tool, already integrates a Puppeteer solution. As long as package.json lists puppeteer as a dependency, fun install -d installs all system dependencies in one step.
3. Debug the function locally
To debug the code locally, use the following command:
$ fun local start
using template: template.yml
HttpTrigger httpTrigger of FunPuppetter/baiduKeywordResult was registered
url: http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult
methods: [ 'POST', 'GET' ]
authType: ANONYMOUS
Opening http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult in a browser automatically downloads the response.
Response:
{"msg":"Incorrect parameters!","code":4005,"data":null}
Response with keyword and page parameters:
http://localhost:8000/2016-08-15/proxy/FunPuppetter/baiduKeywordResult?keyword=vue&page=3
{
  "msg": "success",
  "data": [
    {
      "title": "Vue.js official website",
      "abstract": "Vue.js - The Progressive JavaScript Framework... Subscribe to our weekly you can browse past issues and listen to podcasts at news.vuejs.org.",
      "redirectUrl": "http://www.baidu.com/link?url=Men7IMCzaXf2qP148hYmJKK54l5fL03Wbya_S4L25_i",
      "url": "https://cn.vuejs.org/",
      "domain": "cn.vuejs.org",
      "keyword": "vue",
      "pageNum": 1,
      "rank": 1
    },
    {
      "title": "Vue.js tutorial | novice tutorial",
      "abstract": "Vue.js tutorial vue.js is a set of incremental frameworks for building user interfaces. Vue focuses only on the view layer and adopts a bottom-up incremental design. Vue aims to pass...",
      "redirectUrl": "http://www.baidu.com/link?url=WXIdaqC4EhUmm3Vdis5p0BCM3vUo139WwLQCB28LV8p5epqoiZMceQ1AWV_HpjKAb2jaqVpsXyWytUzPrnDqt_",
      "url": "https://www.runoob.com/vue2/vue-tutorial.html",
      "domain": "www.runoob.com",
      "keyword": "vue",
      "pageNum": 1,
      "rank": 2
    },
    {
      "title": "Introduction - vue.js",
      "abstract": "Vue.js - The Progressive JavaScript Framework... Vue (pronounced /vjuː/, similar to view) is a set of progressive frameworks for building user interfaces. Unlike other large frameworks, Vue is set to...",
      "redirectUrl": "http://www.baidu.com/link?url=RjryFjnGxvreIzhFX1iicF8hHcRbNhkoTTTrFLjsLk4EmqM5ydhCbTR2vye8NBUv",
      "url": "https://cn.vuejs.org/v2/guide/",
      "domain": "cn.vuejs.org",
      "keyword": "vue",
      "pageNum": 1,
      "rank": 3
    },
    ...
  ],
  "code": 2000
}
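The rank field in this response is a position across all crawled pages, not within a single page. A dependency-free sketch of how the page offsets and the rank can be computed is shown below; the article's code does the same with lodash (_.range and _.flatMapDepth), and pageOffsets and rankResults are hypothetical names.

```javascript
// Offsets for Baidu's `pn` parameter: 0, 10, 20, ... (10 results per page).
function pageOffsets(pageCount) {
  return Array.from({ length: pageCount }, (_, i) => i * 10);
}

// Flatten per-page result arrays and assign a 1-based rank across all pages.
function rankResults(pages) {
  // concat with spread flattens one level and also works on Node.js 10.
  const flat = [].concat(...pages);
  return flat.map((item, index) => ({ ...item, rank: index + 1 }));
}

const offsets = pageOffsets(3);
const ranked = rankResults([
  [{ title: 'a' }, { title: 'b' }], // results of the first page
  [{ title: 'c' }],                 // results of the second page
]);

console.log(offsets);
console.log(ranked);
```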
4. One-click service deployment
To deploy the service, use the following command:
- Check the configuration from the YML file and select Y to confirm. Adding -y skips the confirmation during deployment:
  fun deploy -y
- Use the NAS service to manage dependencies
  The FunPuppetter/baiduKeywordResult function is larger than 50 MB, so the NAS service is needed to manage its dependencies.
  - ? Do you want to let fun to help you automate the configuration?
    Funcraft asks whether it should automate the configuration of NAS-managed dependencies. Select Yes.
  - ? We recommend using the 'NasConfig: Auto' configuration to manage your function dependencies.
    Funcraft asks whether to use the NasConfig: Auto configuration to manage function dependencies. Select Yes.
  Tips: You can also configure NAS manually; see the Function Compute documentation on mounting NAS. If you have already configured it manually, Funcraft prompts you to select the configured NAS for storing the function dependencies.
If you see this, the deployment is successful.
Why is the response forced to download
The server forcibly adds the Content-Disposition: attachment field to the response header, which makes the browser open the returned result as an attachment. This field cannot be overridden. Requests made through a custom domain name are not affected.
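To check whether a given response would be downloaded rather than displayed, you can inspect its headers. The sketch below assumes the headers are available as a plain object (for example from an HTTP client); isForcedDownload is a hypothetical helper, not part of any Function Compute SDK.

```javascript
// `isForcedDownload` is a hypothetical helper, not part of any Function Compute SDK.
function isForcedDownload(headers) {
  // Header names are case-insensitive, so normalise them before the lookup.
  const entry = Object.entries(headers).find(
    ([name]) => name.toLowerCase() === 'content-disposition'
  );
  if (!entry) {
    return false;
  }
  // The disposition type is the token before the first ';'.
  const type = String(entry[1]).split(';')[0].trim().toLowerCase();
  return type === 'attachment';
}

console.log(isForcedDownload({ 'Content-Disposition': 'attachment' })); // true
console.log(isForcedDownload({ 'content-type': 'application/json' })); // false
```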
Configure a custom domain name
Next, we configure a custom domain name for the function service so that HTTP trigger responses are no longer forced to download.
- Log in to the Alibaba Cloud Function Compute console.
- Open the custom domain name page and create a domain name.
  Replace fun.root2.cn with your own domain name.
- Resolve the domain name to the Function Compute Endpoint.
  The Endpoint is shown in the upper right corner of the Overview page in the Function Compute console.
  Open the Alibaba Cloud DNS console, select the domain name, and add a record.
  Set the record type to CNAME and the record value to the Function Compute Endpoint.
- Test whether the resolution has taken effect.
  The following figure shows a successful resolution.
How do I update dependencies?
If you add new dependencies, simply run fun nas sync again to synchronize them.
If you change the code, simply run fun deploy again to redeploy.
Project code
Github.com/ITHcc/fun-p…