The principle

Intercept the request in nginx and check whether its HTTP User-Agent belongs to a crawler. If not, return the page normally; if so, have PhantomJS render the full HTML before returning it.

PS: As long as the content returned by the two routes is essentially the same, search engines will not treat this as cheating and penalize it.

Install PhantomJS

On a Mac (locally):

brew update && brew cask install phantomjs

Check the version:

phantomjs -v

2.1.1

Write the script file spider.js

// spider.js
'use strict';

console.log('=====start=====');

// Wait time after each resource loads, in case it triggers further resource requests
var resourceWait = 500;
var resourceWaitTimer;

// Maximum waiting time
var maxWait = 5000;
var maxWaitTimer;

// Resource count
var resourceCount = 0;

// PhantomJS WebPage module
var page = require('webpage').create();

// PhantomJS system module
var system = require('system');

// Get the second parameter from the CLI as the destination URL
var url = system.args[1];

// Set the PhantomJS window size
page.viewportSize = {
	width: 1280,
	height: 1014
};

// Capture the snapshot
var capture = function (errCode) {
	// External access to page content via stdout
	console.log(page.content);

	// Clear the timer
	clearTimeout(maxWaitTimer);

	// Exit as normal
	phantom.exit(errCode);
};

// Resource requests and counts
page.onResourceRequested = function (req) {
	resourceCount++;
	clearTimeout(resourceWaitTimer);
};

// The resource is loaded
page.onResourceReceived = function (res) {
	// A chunked HTTP response triggers resourceReceived multiple times; check whether the resource has finished loading
	if (res.stage !== 'end') {
		return;
	}

	resourceCount--;

	if (resourceCount === 0) {
		// When all the resources on the page have been loaded, the current rendered HTML is intercepted
		// Since onResourceReceived is called immediately after the resource is loaded, we need to give JS some time to run the parsing task
		// By default, 500 milliseconds are reserved
		resourceWaitTimer = setTimeout(capture, resourceWait);
	}
};

// Resource loading timed out
page.onResourceTimeout = function (req) {
	resourceCount--;
};

// Failed to load the resource
page.onResourceError = function (err) {
	resourceCount--;
};

// Open the page
page.open(url, function (status) {
	if (status !== 'success') {
		phantom.exit(1);
	} else {
		// When the initial HTML of the page is returned successfully, start the timer
		// When the maximum time is reached (5 seconds by default), intercept the HTML rendered at that point
		maxWaitTimer = setTimeout(function () {
			capture(2);
		}, maxWait);
	}
});

Test locally against Baidu

phantomjs spider.js 'www.baidu.com/'

The HTML structure comes back successfully, but the access speed is noticeably slower.

Turning the command into a service

To respond to search engine crawler requests, we wrap this command in a service by building a simple web server with Node.

// server.js
// The ExpressJS call method
var express = require('express');
var app = express();

// Import NodeJS child process module
var child_process = require('child_process');

app.get('*', function (req, res) {

    // The full URL
    var url = req.protocol + '://' + req.hostname + req.originalUrl;

    // The prerendered page string container
    var content = '';

    // Start a child phantomjs process
    var phantom = child_process.spawn('phantomjs', ['spider.js', url]);

    // Set the stdout character encoding
    phantom.stdout.setEncoding('utf8');

    // Listen to Phantomjs' stdout and concatenate it
    phantom.stdout.on('data', function (data) {
        content += data.toString();
    });

    // Listen for child process exit events
    phantom.on('exit', function (code) {
        switch (code){
            case 1:
                console.log('Load failed');
                res.send('Load failed');
                break;
            case 2:
                console.log('Load timeout: ' + url);
                res.send(content);
                break;
            default:
                res.send(content);
                break;
        }
    });
});

app.listen(3000, function () {
    console.log('Spider app listening on port 3000!');
});

Now that we have a prerendered web service running via node server.js, all that remains is to forward search engine crawlers' requests to this service and return the rendered results to the crawler. To keep the Node process from hanging up when the shell closes, start it with nohup: nohup node server.js &.

The forwarding is easy to set up with an Nginx configuration.

upstream spider_server {
    server localhost:3000;
}

server {
    listen 80;
    server_name example.com;

    location / {
        proxy_set_header Host $host:$proxy_port;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        if ($http_user_agent ~* "Baiduspider|twitterbot|facebookexternalhit|rogerbot|linkedinbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|vkShare|W3C_Validator|bingbot|Sosospider|Sogou Pic Spider|Googlebot|360Spider") {
            proxy_pass http://spider_server;
        }
    }
}

Pros and cons of the PhantomJS scheme

Advantages:

  • The impact on the Vue code is minimal, often no changes at all; the development cost is the lowest among the SEO options
  • The SEO layer is isolated from the project and can be reused for other projects

Disadvantages:

  • Access speed slows down, since each page is rendered on demand
  • PhantomJS's engine does not fully support ES6; features such as Set and Promise may break at runtime
  • High-concurrency crawler traffic puts heavy load on the server
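The ES6 gap can at least be detected defensively. A minimal sketch (an assumption of this edit, not code from the original post): feature-check the globals PhantomJS 2.1 tends to miss, and fall back to a polyfill when a check fails:

```javascript
// Feature-detect the ES6 globals that PhantomJS 2.1's engine may lack.
// In a modern engine (like the Node running this) both checks pass; under
// PhantomJS they can fail, which is when a polyfill such as core-js is needed.
var missing = [];
if (typeof Promise === 'undefined') missing.push('Promise');
if (typeof Set === 'undefined') missing.push('Set');

if (missing.length > 0) {
    // A real build would load a polyfill here instead of just reporting.
    console.log('Missing ES6 globals: ' + missing.join(', '));
} else {
    console.log('All required ES6 globals are available');
}
```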
