Absorb what I have, share what I have

These days the "big front end" mindset has taken root, and the amount of knowledge involved keeps growing. So the modern front-end developer turns nothing away: absorb as much knowledge as possible, and in the end put every bit of it to good use.

Recently I have also been learning about crawlers. In an earlier project, the subway information data we needed was copied over by hand instead of being crawled.

While it’s true that this data won’t change much in the near future, copying it by hand still felt a bit low-tech. So, having picked up some crawler knowledge, I’d like to share and discuss it with you. Straight into the topic:

First of all, what are crawlers and the Robots protocol?

Then I will introduce the basic workflow of a crawler

Finally, a hands-on example: crawling the movies recently released on Douban, as a small test

Crawlers and the Robots protocol

Let’s start with the definition: a crawler is a program that automatically fetches web content. Crawlers are an important component of search engines, so search engine optimization is, to a large extent, optimization for crawlers.

Now a quick introduction to the Robots protocol: robots.txt is a plain text file, and it represents an agreed-upon convention, not a command

robots.txt is the first file a crawler should check. It tells the crawler which files on the server may be viewed, and the crawler robot determines its access scope according to the file’s contents

The figure below shows the access scope listed in the robots.txt of the Douban movie page
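To make this concrete, here is a minimal illustrative robots.txt (my own example, not Douban’s actual file); it tells every crawler to stay out of a few private folders while leaving the rest of the site open:

```
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
```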

So the Robots protocol is recognized by everyone in the industry: don’t crawl the pages you’re asked not to crawl, and the Internet stays a peaceful place

That was a slight digression. Let’s look at another picture to sort out what we’ve covered so far

What does a crawler actually fetch?

That’s a fair question. To be clear, what the crawler gets is a chunk of HTML code, which is nothing unfamiliar to us: we just need to convert it into a DOM tree

Now look at the right half of the figure; it’s a comparison

The folder tree on the left has no Robots protocol defined. In principle the admin/, private/ and tmp/ folders should not be grabbed, but with no Robots protocol in place, crawlers can roam them without restraint

The one on the right defines a Robots protocol. By contrast, a search engine like Google consults the robots.txt file to learn what must not be captured, and then skips admin/ and private/

Well, that’s it for the introduction; theory without something real to show for it stays on paper

The basic workflow of a crawler

In fact, using a crawler boils down to no more than these four steps:

  1. Fetch the data
  2. Store the data
  3. Start the service
  4. Render the data

Fetching the data

Now for the exciting part. No dawdling: follow along and we’ll knock out a crawler for the Douban movie page together, for our own appreciation

Let’s take a look at the overall directory structure
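Roughly speaking — and hedging a bit, since the exact nesting of server/ and views/ is my inference from the require and path calls in the code below — the project looks like this:

```
.
├── src
│   ├── index.js      // entry point: read, then write
│   ├── read.js       // fetch and parse the target page
│   ├── write.js      // write results into MySQL
│   ├── db.js         // database connection
│   └── my_movie.sql
└── server
    ├── server.js     // web service that renders the data
    └── views
        └── index.html
```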

The request artifact

So how does request work? Enough talk; look at the code:

```js
// Easy to use
let request = require('request');

request('http://www.baidu.com', function (error, response, body) {
    console.log('error:', error); // Print the error if one occurred
    console.log('statusCode:', response && response.statusCode); // Print the response status code
    console.log('body:', body); // Print the HTML of the target page
});
```

After reading the code above, isn’t it obvious? Friends, the HTML code is in sight, so don’t be shy: just convert it into the familiar DOM and do whatever you want with it

Enter cheerio, which everyone calls the Node version of jQuery. With it you can manipulate the DOM exactly the way you would with jQuery
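Before we point it at a real page, here is a tiny warm-up sketch (the HTML string is made up for illustration):

```js
const cheerio = require('cheerio');

// Load an HTML snippet into a queryable DOM, then use jQuery-style selectors on it
const $ = cheerio.load('<ul><li class="movie" data-title="Coco">Coco</li></ul>');

console.log($('.movie').text());        // => 'Coco'
console.log($('.movie').data('title')); // => 'Coco'
```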

No more beating around the bush below; let’s hurry up and write the crawler together!

Read the content

First we should analyze the Douban movie homepage to find which movies are currently showing; let’s take a look at its DOM structure

```js
// read.js file

// request-promise: request with Promise support
const rp = require('request-promise');
// cheerio converts HTML code into an operable DOM; think of it as a Node version of jQuery
const cheerio = require('cheerio');
// debug prints namespaced log messages
const debug = require('debug')('movie:read');

// The method used to read the page
const read = async (url) => {
    debug('Start reading recently released movies');

    const opts = {
        url, // target page
        transform: body => {
            // body is the HTML code fetched from the target page
            // The cheerio.load method converts the HTML code into a DOM structure we can manipulate
            return cheerio.load(body);
        }
    };

    return rp(opts).then($ => {
        let result = []; // the result array

        // Each movie entry is an li element
        $('#screening li.ui-slide-item').each((index, item) => {
            let ele = $(item);
            let name = ele.data('title');
            let score = ele.data('rate') || 'No score yet';
            let href = ele.find('.poster a').attr('href');
            let image = ele.find('img').attr('src');
            // The movie id can be extracted from the movie's href
            let id = href && href.match(/(\d+)/)[1];
            // Swap the jpg poster for the smaller webp version
            image = image && image.replace(/jpg$/, 'webp');

            if (!name || !image || !href) {
                return;
            }

            result.push({ name, score, href, image, id });
            debug(`Reading movie: ${name}`);
        });

        // Return the result array
        return result;
    });
};

// Export the read method
module.exports = read;
```

Once the code is written, why not pause and think through what you’ve done:

  • Fetched the HTML code via request
  • Turned the HTML into a DOM with cheerio
  • Pushed the needed contents (name | score | address | image | id) into the result array
  • Returned the result array and exported the read method (a quick standalone test follows below)
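If you want to sanity-check read.js on its own before the database is wired up (assuming Douban’s markup still matches our selectors), a quick test could look like this:

```js
// A hypothetical smoke test, not one of the project files
const read = require('./read');

read('https://movie.douban.com')
    .then(movies => console.log(movies))   // array of { name, score, href, image, id }
    .catch(err => console.error(err));
```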

Storing the data

Here we use MySQL to set up a database to store the data. It doesn’t matter if you’re not very familiar with it; just follow me step by step. First install XAMPP and the Navicat visual database management tool, then follow the steps below

Start MySQL in XAMPP

Connect to the database and build the table in Navicat (an SQL equivalent is sketched below)

A few words are no match for the real thing in pictures; let’s look at the screenshots first
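If you’d rather type than click, here is a hedged SQL sketch of the same setup; the database and table names come from this article, the column names mirror what write.js stores later, and the column types are my own guesses:

```sql
CREATE DATABASE IF NOT EXISTS my_movie;
USE my_movie;

-- Columns mirror what write.js saves; the types are assumptions
CREATE TABLE movies (
    id    VARCHAR(16)  NOT NULL PRIMARY KEY,
    name  VARCHAR(255),
    href  VARCHAR(255),
    image VARCHAR(255),
    score VARCHAR(32)
);
```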

Connecting to a Database

First, we need to create an SQL file in the src directory with the same name as the database we just created; we’ll call it my_movie.sql (the directory structure shown earlier already includes it)

Then, go back to the db.js file and write the code to connect to the database

```js
// db.js file

const mysql = require('mysql');
const bluebird = require('bluebird');

// Create a connection
const connection = mysql.createConnection({
    host: 'localhost',    // host
    port: 3306,           // default MySQL port
    database: 'my_movie', // the database we just created
    user: 'root',
    password: ''
});

connection.connect();

// Promisify connection.query with bluebird and export it
module.exports = bluebird.promisify(connection.query).bind(connection);
```

The code above establishes the connection to the MySQL database. Now, without slowing down, let’s write the content into the database

Writing to the database

Now let’s look at the write.js file which, as the name implies, is used to write to the database. Straight into the code:

```js
// write.js file

// Import the query method from db.js
const query = require('./db');
const debug = require('debug')('movie:write');

// movies is the result array read by read.js
const write = async (movies) => {
    debug('Start writing movies');

    for (let movie of movies) {
        // Use the query method to check whether this movie has already been saved
        let oldMovie = await query('SELECT * FROM movies WHERE id=? LIMIT 1', [movie.id]);

        // The SQL query returns an array; a non-empty array means the data already exists
        if (Array.isArray(oldMovie) && oldMovie.length) {
            // Update the existing row in the movies table
            let old = oldMovie[0];
            await query('UPDATE movies SET name=?,href=?,image=?,score=? WHERE id=?',
                [movie.name, movie.href, movie.image, movie.score, old.id]);
        } else {
            // Insert a new row into the movies table
            await query('INSERT INTO movies(id,name,href,image,score) VALUES(?,?,?,?,?)',
                [movie.id, movie.name, movie.href, movie.image, movie.score]);
        }

        debug(`Writing movie: ${movie.name}`);
    }
};

module.exports = write;
```

It might look a bit confusing, because front-end developers rarely write SQL statements. Don’t worry: after we walk through the code above, I’ll briefly introduce the SQL statement parts

What exactly is written in write.js?

  • Imported the query method so we can execute SQL statements
  • Iterated over the result array produced by read
  • Queried whether each movie has already been saved (a check-then-write pattern; see the note after this list)
    • Yes: update the row
    • No: insert the row
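A side note on that check-then-write pattern: assuming id is the table’s primary key (as in the sketch earlier), MySQL can collapse the whole thing into a single upsert statement. A hedged sketch:

```sql
INSERT INTO movies(id, name, href, image, score)
VALUES(?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
    name  = VALUES(name),
    href  = VALUES(href),
    image = VALUES(image),
    score = VALUES(score);
```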

OK, now that writing to the database is implemented, let’s strike while the iron is hot and talk a little about SQL statements

SQL statement basics

Here, by the way, is a brief description of the SQL syntax used above; the four basic operations (insert, update, delete, query) come up everywhere

  1. Insert data

```sql
-- Syntax: INSERT INTO table name(column names) VALUES(values)
-- Insert a name, id, and url into the tags table
INSERT INTO tags(name, id, url) VALUES('crawler', 10, 'https://news.so.com/hotnews');
```

  2. Update data

```sql
-- Syntax: UPDATE table name SET column=value WHERE condition
UPDATE articles SET title='Hello world', content='The world is not as bad as you think!' WHERE id=1;
```

  3. Delete data

```sql
-- Syntax: DELETE FROM table name WHERE condition
DELETE FROM tags WHERE id=11;
```

  4. Query data

```sql
-- Syntax: SELECT column names FROM table name WHERE condition ORDER BY sort column
SELECT name, title, content FROM tags WHERE id=8;
```

That covers all the reading and writing methods; you must be getting a little tired. Time to test the results, otherwise it’s all just talk

Perform read and write operations

Now go to index.js and check it out

```js
// index.js file

const read = require('./read');
const write = require('./write');
const url = 'https://movie.douban.com'; // target page

(async () => {
    // Fetch the target page asynchronously
    const movies = await read(url);
    // Write the data into the database
    await write(movies);
    // Exit the process when everything is done
    process.exit();
})();
```
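One small tip before running it: the debug module only prints its messages when the DEBUG environment variable matches the namespace, so start the script as `DEBUG=movie:* node index.js` if you want to watch the progress logs.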

Ok, so let’s see what it looks like

We still need a page to display the data, since fetching and writing are Node-only operations. So we’re also going to create a web service to display the page. Hang in there; you’re almost done. Go for it!

Start the service

Since it’s time to create a web service, let’s start writing the content of server.js

Server service

```js
// server.js file

const express = require('express');
const path = require('path');
const query = require('../src/db');

const app = express();

// Set up the template engine
app.set('view engine', 'html');
app.set('views', path.join(__dirname, 'views'));
app.engine('html', require('ejs').__express);

// Home route: query all movies from the database
app.get('/', async (req, res) => {
    const movies = await query('SELECT * FROM movies');
    // Render the home page template and pass it the movies data
    res.render('index', { movies });
});

// Listen on localhost:9000
app.listen(9000);
```

When you’re done with the server, move on to the index.html template. This is the last step, and then you’re done

Render the data

```html
<!-- index.html file -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Recently released movies</title>
</head>
<body>
    <div class="container">
        <h2 class="caption">Recently released movies</h2>
        <ul class="list">
            <% for (let i = 0; i < movies.length; i++) {
                let movie = movies[i]; %>
                <li>
                    <a href="<%= movie.href %>" target="_blank">
                        <img src="<%= movie.image %>" />
                        <p class="title"><%= movie.name %></p>
                        <p class="score">Score: <%= movie.score %></p>
                    </a>
                </li>
            <% } %>
        </ul>
    </div>
</body>
</html>
```

We just iterate over the movies array with the template engine and render each item

Now, let’s look at the final result

Finally, here is the complete source code as well, to make it easier for you to follow along and type it out yourselves

Thank you for reading, 886