Absorb what I have, share what I have
These days the idea of the "big front end" has taken deep root, and front-end work touches more and more kinds of knowledge. So the front end today turns nothing away: like the star-absorbing technique from martial-arts novels, it soaks up as much knowledge as it can and finally puts all of it to good use.
Recently I have also been learning about crawlers. The subway information data needed in an earlier project was copied by hand instead of being crawled.
While it's true that those numbers won't change much in the near future, copying them by hand still felt a bit low-tech. So now that I've learned something about crawlers, I'd like to discuss and share it with you. Let's get straight into the topic:
First: what are crawlers and the Robots protocol?
Then: the basic workflow of a crawler.
Finally: a hands-on example, crawling the recently released movies from Douban as a small test.
Crawlers and the Robots protocol
Let's start with the definition: a crawler is a program that automatically fetches web content. Crawlers are an important part of search engines, so search engine optimization is, to a large extent, optimization for crawlers.
Now a quick introduction to the Robots protocol. robots.txt is a plain text file, and it represents a convention, not a command.
robots.txt is the first file a crawler should check when visiting a site. It tells the crawler which files on the server may be viewed, and the crawler determines its access scope according to the contents of that file.
The figure below shows the access scope listed in the robots.txt of the Douban movie page.
The Robots protocol is recognized across the whole industry: don't crawl the pages you're told not to crawl, and everyone gets a peaceful Internet.
That drifted a little off topic. Let's look at another picture to sort out what we've said so far.
What is it actually crawling?
This is a very good question. To be clear, what the crawler gets back is a piece of HTML code, which is nothing unfamiliar to us: we just need to convert it into a DOM tree.
Now look at the right half of the figure, which is a comparison.
On the left, no Robots protocol is defined. By rights, the admin, private, and tmp folders should not be grabbed, but without a Robots protocol crawlers can roam without restraint.
On the right, a Robots protocol is defined. Search engines like Google read the robots.txt file to see what must not be captured, and then skip admin and private.
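To make that concrete, here is a minimal robots.txt sketch matching the admin/private/tmp example above; it is illustrative, not Douban's actual file:

```
# A minimal robots.txt sketch (illustrative, not Douban's real file)
User-agent: *       # these rules apply to all crawlers
Disallow: /admin/   # don't crawl the admin folder
Disallow: /private/ # don't crawl the private folder
Disallow: /tmp/     # don't crawl the tmp folder
```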
Well, that wraps up the introduction. Without something hands-on, it's all just armchair strategy.
The basic flow of crawlers
In fact, using a crawler boils down to no more than these four steps:
- Fetching the data
- Data warehousing
- Start the service
- Render data
Fetching the data
Now we enter the exciting part. Don't stop here; follow along with me and knock out a crawler for the Douban movie page to enjoy for yourself.
Let’s take a look at the overall directory structure
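The original screenshot of the directory tree isn't reproduced here, so below is a rough sketch pieced together from the require paths in the code that follows; the exact file names and nesting are my inference, not the article's authoritative layout:

```
├── src
│   ├── index.js     // entry point: read, then write
│   ├── read.js      // fetch the hot movies from the Douban page
│   ├── write.js     // write the results into MySQL
│   ├── db.js        // MySQL connection and promisified query
│   └── my_movie.sql
├── server
│   ├── server.js    // express service listening on port 9000
│   └── views
│       └── index.html
└── package.json
```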
The secret weapon: request
So how does request work? Enough talk, look at the code:
```js
// Easy to use
let request = require('request');

request('http://www.baidu.com', function (error, response, body) {
  console.log('error:', error); // print the error, if one occurred
  console.log('statusCode:', response && response.statusCode); // print the response status code
  console.log('body:', body); // print the HTML of the page
});
```
After reading the code above, isn't it obvious? Friends, the HTML code is right there in body, so don't be shy: just switch it into the familiar DOM and do whatever you want with it.
This is where cheerio, which everyone calls the Node version of jQuery, enters the scene. With it you can manipulate the DOM exactly as you would with jQuery.
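As a tiny warm-up sketch of my own (not part of the project code), this is all it takes to go from an HTML string to jQuery-style queries:

```js
// A minimal cheerio sketch: load an HTML string, then query it like jQuery
const cheerio = require('cheerio');

const $ = cheerio.load('<ul><li class="movie">Hello</li></ul>');
console.log($('li.movie').text()); // => Hello
```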
No more beating around the bush; let's write the crawler together!
Read the content
First we need to analyze the Douban movie home page to find where the movies in theaters live. Let's take a look at the DOM structure:
```js
// read.js file
// request-promise: the promisified version of request
const rp = require('request-promise');
// cheerio converts HTML code into a DOM; think of it as the Node version of jQuery
const cheerio = require('cheerio');
const debug = require('debug')('movie:read');

// The method used to read the page
const read = async (url) => {
  debug('Start reading recently released movies');

  const opts = {
    url, // target page
    transform: body => {
      // body is the HTML code fetched from the target page
      // cheerio.load converts the HTML into a DOM structure we can manipulate
      return cheerio.load(body);
    }
  };

  return rp(opts).then($ => {
    let result = []; // the result array

    // Each hot movie is an li under #screening
    $('#screening li.ui-slide-item').each((index, item) => {
      let ele = $(item);
      let name = ele.data('title');
      let score = ele.data('rate') || 'No score yet';
      let href = ele.find('.poster a').attr('href');
      let image = ele.find('img').attr('src');
      // The movie id can be extracted from the movie href
      let id = href && href.match(/(\d+)/)[1];
      // Swap the jpg suffix for the smaller webp version
      image = image && image.replace(/jpg$/, 'webp');

      if (!name || !image || !href) {
        return;
      }

      result.push({ name, score, href, image, id });
      debug(`Reading movie: ${name}`);
    });

    // Return the result array
    return result;
  });
};

// Export the read method
module.exports = read;
```
Once the code is written, take a moment to think about what it has done:
- Fetched the HTML code via request
- Turned the HTML into a DOM with cheerio
- Pushed the needed fields (name | score | href | image | id) into a result array
- Returned the result array and exported the read method
Data warehousing
Here we use MySQL to set up a database to store the data. It doesn't matter if you don't know it well; just follow me step by step. First install XAMPP and Navicat (a visual database management tool), then follow the steps below:
- Start MySQL through XAMPP
- Connect to the database with Navicat, create the my_movie database, and build the movies table
Words may not be as good as pictures and the real thing, so let's look at the screenshots first.
Connecting to a Database
First, create an SQL file in the src directory with the same name as the database we just created; we'll call it my_movie.sql (the directory structure, of course, has already been set up).
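The article doesn't show the table definition itself, but based on the fields the crawler collects and the queries used below, a plausible sketch of the movies table could look like this; the column types are my assumptions, not the article's actual schema:

```sql
-- A plausible sketch of the movies table; column names come from the
-- code below, but the types are assumptions
CREATE TABLE IF NOT EXISTS movies (
  id    VARCHAR(16)  NOT NULL PRIMARY KEY, -- Douban movie id parsed from the href
  name  VARCHAR(255) NOT NULL,
  href  VARCHAR(255) NOT NULL,
  image VARCHAR(255) NOT NULL,
  score VARCHAR(32)  NOT NULL              -- a string: the rate, or 'No score yet'
);
```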
Then, go back to the db.js file and write the code to connect to the database
```js
// db.js
const mysql = require('mysql');
const bluebird = require('bluebird');

// Create a connection
const connection = mysql.createConnection({
  host: 'localhost',     // host
  port: 3306,            // default port
  database: 'my_movie',  // the database we just created
  user: 'root',
  password: ''
});

connection.connect();

// Promisify connection.query with bluebird and export it
module.exports = bluebird.promisify(connection.query).bind(connection);
```
The code above sets up the connection to the MySQL database. Now, without slowing down, let's write the content into the database.
Write to database
At this point, let's look at the write.js file which, as the name implies, is used to write to the database. Straight into the code:
```js
// write.js file
// Import the query method from db.js
const query = require('./db');
const debug = require('debug')('movie:write');

// movies is the result array read by read.js
const write = async (movies) => {
  debug('Start writing movies');

  for (let movie of movies) {
    // Use the query method to check whether this movie is already saved
    let oldMovie = await query('SELECT * FROM movies WHERE id=? LIMIT 1', [movie.id]);

    // The SQL query returns an array; if it is not empty, the data already exists
    if (Array.isArray(oldMovie) && oldMovie.length) {
      // Update the existing row in movies
      let old = oldMovie[0];
      await query('UPDATE movies SET name=?,href=?,image=?,score=? WHERE id=?',
        [movie.name, movie.href, movie.image, movie.score, old.id]);
    } else {
      // Insert a new row into the movies table
      await query('INSERT INTO movies(id,name,href,image,score) VALUES(?,?,?,?,?)',
        [movie.id, movie.name, movie.href, movie.image, movie.score]);
    }

    debug(`Writing movie: ${movie.name}`);
  }
};

module.exports = write;
```
It might look a bit confusing, since the front end rarely writes SQL statements. No matter: after we walk through the code above, I'll briefly introduce the SQL statement parts.
What exactly is written in write.js?
- Imported the query method in order to execute SQL statements
- Iterate over the read result array
- Query whether any data has been saved
- Yes: Update data
- None: Insert data
Ok, so now that we’ve implemented writing to the database, let’s strike while the iron is hot and talk a little bit about SQL statements
SQL Statement learning
Here, by the way, is a brief rundown of the syntax used in these SQL statements; the classic insert, delete, update, and select are everywhere.
- Insert data
```sql
-- Insert a name, id, and url into the tags table with VALUES
INSERT INTO tags(name, id, url) VALUES('crawlers', 10, 'https://news.so.com/hotnews')
```
- Update the data
```sql
-- Update the title and content of the row whose id is 1
UPDATE articles SET title='Hello world', content='The world is not as bad as you think!' WHERE id=1
```
- Delete the data
```sql
-- Delete the row whose id is 11 from the tags table
DELETE FROM tags WHERE id=11
```
- The query
```sql
-- Syntax: SELECT column names FROM table name WHERE condition ORDER BY column name
SELECT name, title, content FROM tags WHERE id=8
```
That covers all the read and write methods; you must be a little tired by now. Time to test the results, otherwise it's all just empty talk.
Perform read and write operations
Now go to index.js and check it out
```js
// index.js file
const read = require('./read');
const write = require('./write');
const url = 'https://movie.douban.com'; // target page

(async () => {
  // Asynchronously fetch the target page
  const movies = await read(url);
  // Write the data into the database
  await write(movies);
  process.exit();
})();
```
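One small usage note: the debug library only prints when the DEBUG environment variable matches its namespaces, so to actually see the 'movie:read' and 'movie:write' logs you would run the entry file with something like `DEBUG=movie:* node index.js` (the exact path depends on where your index.js lives).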
Ok, so let’s see what it looks like
We still need to write a page to display the data, since fetching and writing happen only inside Node. So we'll also create a web service to render the page. Hang in there; once this is done, you're done. Go for it!
Start the service
Since it’s time to create a Web service, start writing the content of server.js
The server
```js
// server.js file
const express = require('express');
const path = require('path');
const query = require('../src/db');

const app = express();

// Set up the template engine
app.set('view engine', 'html');
app.set('views', path.join(__dirname, 'views'));
app.engine('html', require('ejs').__express);

// Home route
app.get('/', async (req, res) => {
  // Query all movies from the database
  const movies = await query('SELECT * FROM movies');
  // Render the home page template and pass in the movies data
  res.render('index', { movies });
});

// Listen on localhost:9000
app.listen(9000);
```
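With this in place, starting the service should be as simple as `node server.js` and then opening http://localhost:9000 in the browser; the port comes from the app.listen call above.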
When the server is done, move on to the index.html template. This is the last piece, and then you're finished.
Render data
```html
<!-- index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="ie=edge">
  <title>Recently released movies</title>
</head>
<body>
  <div class="container">
    <h2 class="caption">Recently released movies</h2>
    <ul class="list">
      <% for (let i = 0; i < movies.length; i++) {
           let movie = movies[i]; %>
        <li>
          <a href="<%=movie.href%>" target="_blank">
            <img src="<%=movie.image%>" />
            <p class="title"><%=movie.name%></p>
            <p class="score">Score: <%=movie.score%></p>
          </a>
        </li>
      <% } %>
    </ul>
  </div>
</body>
</html>
```
All it does is iterate over the movies array with the template engine and render each item.
Now, let’s look at the final result
Incidentally, I'm also sharing the source code here, to make it easier for you to follow along and type it out yourself.
Thank you for reading, 886!