Preface
I had previously studied crawling and written a few data crawlers in bits and pieces, but they were written haphazardly, and much of that code no longer makes sense to me now. I had been meaning to refactor an earlier project, and this weekend I rewrote one from scratch: this project, Guwen-Spider. At the moment it is a fairly simple crawler: it fetches pages directly, extracts data from them, and saves the data to a database. Compared with what I wrote before, I think the difficulty lies in the robustness of the whole program and the corresponding fault-tolerance mechanism. Writing the code yesterday confirmed this: the core of the code was finished very quickly, and most of the time went into debugging for stability and looking for a more reasonable way to handle the relationship between the data and the process control.
background
The target site has a three-level structure: the first-level page is the list of books; clicking a book opens its chapter/section list; clicking a chapter or section opens the actual content page. The project's job is to capture all three levels.
Overview
The project's GitHub address: Guwen-spider (PS: there are some easter eggs at the end ~~)
Project technical details
The project makes heavy use of ES7 async functions to express the program flow more intuitively. For convenience, the well-known async library is used for data traversal, so callbacks and Promises are unavoidable; and because the data processing happens inside callback functions, some data-passing problems are inevitable. The same functionality could also be written purely with ES7 async/await. Another nice touch is wrapping the database operations in class static methods; like methods defined on the prototype, static methods take up no extra space per instance (a sketch of such a helper follows the list below). The project mainly uses:
- ES7 async/await coroutines for the asynchronous logic.
- The npm async library for loop traversal and concurrent requests.
- Log4js for logging.
- Cheerio for DOM manipulation.
- Mongoose to connect to MongoDB for saving and manipulating data.
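As mentioned above, the database operations are wrapped in class static methods. A minimal sketch of what such a helper could look like, assuming mongoose models are passed in (the method names mirror the calls used later in this post, but the real DbHelper code is not shown here):

```js
// A sketch, not the project's actual DbHelper: wrapping mongoose operations in static methods.
class BookHelper {
    // Query all book records from the given model
    static getBookList(model) {
        return model.find({}).exec();
    }

    // Query the chapter list matching a filter, e.g. { key: 'xxx' }
    static querySectionList(model, query) {
        return model.find(query).exec();
    }

    // Count documents already saved in a collection
    // (countDocuments exists in mongoose 5+; older versions use count)
    static getCollectionLength(model, query) {
        return model.countDocuments(query).exec();
    }
}

// required elsewhere as bookHelper
module.exports = BookHelper;
```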
The directory structure
```
├── bin                // entry
│   ├── booklist.js    // book list fetching logic
│   ├── chapterlist.js // chapter list fetching logic
│   ├── content.js     // content fetching logic
│   └── index.js       // program entry
├── config             // configuration files
├── DbHelper           // database operation methods
├── logs               // project logs
├── model              // mongoDB collection operation instances
├── node_modules
├── Utils              // utility functions
├── package.json
```
Project implementation plan analysis
The project is a typical multi-level crawling case with exactly three levels: the book list, the chapter list belonging to each book, and the content behind each chapter link. There are two ways to crawl such a structure. One is to go straight from the outer level to the inner one and, after finishing the inner level, move on to the next outer item. The other is to finish crawling the outer level first and save it to the database, then use the saved outer links to fetch all the inner chapters and save them, and finally query the chapter links back out of the database to fetch the content. Both schemes have their pros and cons, and I actually tried both. The latter has one advantage: because the three levels are captured separately, it is easier to save as much data as possible alongside each chapter. Imagine using the former with the normal logic: while traversing level one you fetch the corresponding second-level chapter list, then traverse that list to fetch the content, and when saving a third-level content item you may need a lot of directory information, so data has to be passed down between levels. Thinking it through, that becomes complicated. Keeping the levels separate therefore avoids a good deal of unnecessary data passing.
For now, the number of ancient books to capture is not large, only about 180 books covering the various classics and histories, so the book metadata itself is a very small amount of data, i.e. 180 document records in one collection. Across the 180 books there are about 16,000 chapters in total, which means visiting 16,000 pages to crawl the corresponding content. So choosing the second option makes sense (a rough sketch contrasting the two flows is below).
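To make the difference between the two schemes concrete, here is a rough sketch; every helper function in it is hypothetical and only illustrates the shape of the two flows:

```js
// Strategy 1: nested, depth-first — fetch the inner levels while still inside the outer loop
async function crawlNested() {
    const books = await fetchBookList();
    for (const book of books) {
        const chapters = await fetchChapterList(book);
        for (const chapter of chapters) {
            const content = await fetchContent(chapter);
            // book/chapter info has to be passed all the way down to the save step
            await saveContent(book, chapter, content);
        }
    }
}

// Strategy 2: staged — finish and persist each level, then read it back for the next level
async function crawlStaged() {
    await saveBookList(await fetchBookList());
    for (const book of await loadBookListFromDB()) {
        await saveChapterList(await fetchChapterList(book));
    }
    for (const chapter of await loadChapterListFromDB()) {
        await saveContent(chapter, await fetchContent(chapter));
    }
}
```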
Project implementation
The main program has three methods, bookListInit, chapterListInit and contentListInit, which are the publicly exposed initialization methods for fetching the book list, the chapter lists and the book contents respectively. The flow of these three methods is controlled with async/await. After the book list is captured, the data is saved to the database and the result is returned to the main program. If it succeeds, the main program goes on to crawl the chapter lists based on the book list, and then does the same for the book contents.
Main entry
```js
/**
 * Crawler main entry
 */
const start = async() => {
    let booklistRes = await bookListInit();
    if (!booklistRes) {
        logger.warn('Book list fetching error, program terminates...');
        return;
    }
    logger.info('Book list fetching succeeded, now proceeding to chapter list fetching...');

    let chapterlistRes = await chapterListInit();
    if (!chapterlistRes) {
        logger.warn('Book chapter list fetching error, program terminates...');
        return;
    }
    logger.info('Book chapter lists fetched successfully, now fetching book contents...');

    let contentListRes = await contentListInit();
    if (!contentListRes) {
        logger.warn('Book content fetching error, program terminates...');
        return;
    }
    logger.info('Book contents fetched successfully');
}

// Start the entry
if (typeof bookListInit === 'function' && typeof chapterListInit === 'function') {
    // start fetching
    start();
}
```
The three methods bookListInit, chapterListInit and contentListInit are introduced below.
booklist.js
```js
/**
 * Initialization method. Returns the fetch result: true on success, false on failure.
 */
const bookListInit = async() => {
    logger.info('Book list fetching started...');
    const pageUrlList = getPageUrlList(totalListPage, baseUrl);
    let res = await getBookList(pageUrlList);
    return res;
}
```
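getPageUrlList is called above but not shown in the post. A possible sketch, assuming the book list is paginated and that the page URLs can be built from baseUrl plus a numeric suffix (the URL pattern here is an assumption, not the real site's scheme):

```js
// Build the URL of every page of the paginated book list.
const getPageUrlList = (totalListPage, baseUrl) => {
    const urlList = [];
    for (let page = 1; page <= totalListPage; page++) {
        // hypothetical URL pattern; adjust to the target site's actual pagination scheme
        urlList.push(`${baseUrl}/list/${page}`);
    }
    return urlList;
};
```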
chapterlist.js
```js
/**
 * Initialization entry
 */
const chapterListInit = async() => {
    const list = await bookHelper.getBookList(bookListModel);
    if (!list) {
        logger.error('Failed to initialize book catalog query');
        return;
    }
    logger.info('Start fetching book chapter lists, total books in catalog: ' + list.length);
    let res = await asyncGetChapter(list);
    return res;
};
```
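asyncGetChapter is called above but its body is not listed in the post. A rough sketch of what it might look like, using async.mapLimit and cheerio from the tech stack; axios, the chapterListUrl field and the CSS selector are assumptions for illustration, and saving the parsed chapters to the database is omitted:

```js
const async = require('async');
const axios = require('axios');   // HTTP client assumed for this sketch
const cheerio = require('cheerio');

const asyncGetChapter = (list) => {
    return new Promise((resolve) => {
        // fetch up to 5 books' chapter pages concurrently
        async.mapLimit(list, 5, (book, callback) => {
            axios.get(book.chapterListUrl)   // hypothetical field on the book record
                .then(({ data: html }) => {
                    const $ = cheerio.load(html);
                    // selector is illustrative; the real page structure will differ
                    const chapters = $('.chapter-list a').map((i, el) => ({
                        key: book.key,
                        bookName: book.bookName,
                        title: $(el).text().trim(),
                        url: $(el).attr('href'),
                    })).get();
                    callback(null, chapters);
                })
                .catch(err => callback(err));
        }, (err, results) => {
            if (err) {
                logger.error(err);
                resolve(false);
                return;
            }
            resolve(results);
        });
    });
};
```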
content.js
```js
/**
 * Initialization entry
 */
const contentListInit = async() => {
    // Get the list of books
    const list = await bookHelper.getBookList(bookListModel);
    if (!list) {
        logger.error('Failed to initialize book catalog query');
        return;
    }
    const res = await mapBookList(list);
    if (!res) {
        logger.error('Fetching chapter info: getCurBookSectionList() serial traversal callback failed, the error has been logged, please check the log!');
        return;
    }
    return res;
}
```
Thoughts on content fetching
The logic for crawling the book catalog is actually very simple: a single async.mapLimit traversal is enough to fetch and save the data. The simplified logic for saving content is likewise to traverse the chapter list and crawl the content behind each link. But in reality there are tens of thousands of links, and from a memory-footprint perspective we cannot keep them all in one array and walk through it, so content fetching has to be split into units. A common way is to fetch a fixed number of records per query; the drawback is that the split is purely by count, the records in a batch have no relationship to each other, and batch inserts on that basis make fault tolerance awkward if something goes wrong; it also conflicts with the goal of keeping each book in its own collection. So the second approach is used here: content is fetched and saved with one book as the unit. Traversal is done with async.mapLimit(list, 1, (series, callback) => {}), which unavoidably brings callbacks back in and feels a bit ugly. The second parameter of async.mapLimit() sets the number of simultaneous requests.
```js
/*
 * Content fetching steps:
 * Step 1: get the list of books; for one book record, look up all of its corresponding chapters.
 * Step 2: traverse that chapter list, fetch the content and save it to the database.
 * Step 3: once saved, go back to step 1 for the next book.
 */

/**
 * Initialization entry
 */
const contentListInit = async() => {
    // Get the list of books
    const list = await bookHelper.getBookList(bookListModel);
    if (!list) {
        logger.error('Failed to initialize book catalog query');
        return;
    }
    const res = await mapBookList(list);
    if (!res) {
        logger.error('Fetching chapter info: getCurBookSectionList() serial traversal callback failed, the error has been logged, please check the log!');
        return;
    }
    return res;
}

/**
 * Traverse the book catalog, one book at a time
 * @param {*} list
 */
const mapBookList = (list) => {
    return new Promise((resolve, reject) => {
        async.mapLimit(list, 1, (series, callback) => {
            let doc = series._doc;
            getCurBookSectionList(doc, callback);
        }, (err, result) => {
            if (err) {
                logger.error('Book catalog crawl async execution error!');
                logger.error(err);
                reject(false);
                return;
            }
            resolve(true);
        })
    })
}

/**
 * Get the chapter list of the current book, then traverse it to fetch the content
 * @param {*} series
 * @param {*} callback
 */
const getCurBookSectionList = async(series, callback) => {
    // random pause between books to avoid hammering the target site
    let num = Math.random() * 1000 + 1000;
    await sleep(num);
    let key = series.key;
    const res = await bookHelper.querySectionList(chapterListModel, {
        key: key
    });
    if (!res) {
        logger.error('Getting current book: ' + series.bookName + ' chapter content failed, moving on to the next book!');
        callback(null, null);
        return;
    }
    // Check whether this book has already been fully fetched
    const bookItemModel = getModel(key);
    const contentLength = await bookHelper.getCollectionLength(bookItemModel, {});
    if (contentLength === res.length) {
        logger.info('Current book: ' + series.bookName + ' is already fully fetched in the database, moving on to the next task');
        callback(null, null);
        return;
    }
    await mapSectionList(res);
    callback(null, null);
}
```
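The sleep() call above pauses between books. Its implementation is not shown in the post, but it is presumably the usual Promise-wrapped setTimeout, something like:

```js
// pause for ms milliseconds; used to add a random delay between requests
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
```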
How to save the data is a problem
Here the data is grouped by key: each traversal pass fetches the links belonging to one key, so the saved data forms a coherent whole. Now consider how to save it; there are two options.
- Option 1: insert a whole book at once.
  - Advantage: a single database operation, no wasted time.
  - Disadvantage: some books have hundreds of chapters, which means holding hundreds of pages in memory before inserting; this consumes memory and may make the application unstable.
- Option 2: insert each chapter as its own database operation.
  - Advantage: fetch-and-save per page means the data is persisted promptly; an error later does not require re-saving the earlier chapters.
  - Disadvantage: obviously slow. If you need to crawl tens of thousands of pages, that means tens of thousands of database operations. A cache that accumulates a certain number of records and saves them once the threshold is reached is also a good compromise (a small sketch of such a buffer follows this list).
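For the caching compromise mentioned in the second option, a small sketch (a hypothetical helper, not part of the project) could look like this: accumulate parsed chapters and flush them with mongoose's insertMany once a threshold is reached.

```js
// Buffered saver: push() collects documents, flush() writes any remainder at the end of a book.
const createBufferedSaver = (model, threshold = 50) => {
    let buffer = [];
    return {
        async push(doc) {
            buffer.push(doc);
            if (buffer.length >= threshold) {
                await model.insertMany(buffer);
                buffer = [];
            }
        },
        async flush() {
            if (buffer.length) {
                await model.insertMany(buffer);
                buffer = [];
            }
        }
    };
};
```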
```js
/**
 * Traverse the chapter list of a single book, fetching the content of each chapter
 * @param {*} list
 */
const mapSectionList = (list) => {
    return new Promise((resolve, reject) => {
        async.mapLimit(list, 1, (series, callback) => {
            let doc = series._doc;
            getContent(doc, callback)
        }, (err, result) => {
            if (err) {
                logger.error('Book catalog crawl async execution error!');
                logger.error(err);
                reject(false);
                return;
            }
            const bookName = list[0].bookName;
            const key = list[0].key;

            // Save the whole book as one unit
            saveAllContentToDB(result, bookName, key, resolve);

            // Save each chapter as a unit
            // logger.info(bookName + ' data fetching complete, moving on to the next book...');
            // resolve(true);
        })
    })
}
```
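saveAllContentToDB is called above but not listed in the post. A minimal sketch of the "whole book as one unit" option, assuming mongoose's insertMany and the error collections described below:

```js
const saveAllContentToDB = async (result, bookName, key, resolve) => {
    // result is the array produced by async.mapLimit: one parsed chapter per entry
    const contentList = result.filter(Boolean);
    const bookItemModel = getModel(key);   // one collection per book, named after its key
    try {
        await bookItemModel.insertMany(contentList);
        logger.info(bookName + ' saved, ' + contentList.length + ' chapters written');
        resolve(true);
    } catch (err) {
        logger.error(bookName + ' batch insert failed');
        logger.error(err);
        // record the failed book so it can be re-crawled later
        await errorCollectionModel.create({ key: key, bookName: bookName });
        resolve(false);
    }
};
```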
Both approaches have their pros and cons, and I tried both here. Two error-saving collections are prepared, errContentModel and errorCollectionModel; when an insert fails, the relevant information is saved to the corresponding collection. Either one can be used. The reason for adding collections to hold the errors is to make them easy to review in one place and to act on later, without digging through the logs.
(PS: in fact the errorCollectionModel collection alone would be enough, while the errContentModel collection can store the full chapter information.)
```js
// Schema for chapters that failed to save
const errorSpider = mongoose.Schema({
    chapter: String,
    section: String,
    url: String,
    key: String,
    bookName: String,
    author: String,
})

// Schema for books that failed to save (key information only)
const errorCollection = mongoose.Schema({
    key: String,
    bookName: String,
})
```
The chapter contents of each book are saved into a new collection named after the book's key.
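The getModel(key) used in the snippets above is not listed in the post either. A sketch of how a per-book model could be created is below; the content schema fields are assumptions for illustration.

```js
const mongoose = require('mongoose');

// fields are illustrative; the real chapter schema is not shown in the post
const contentSchema = new mongoose.Schema({
    title: String,
    content: String,
    url: String,
    key: String,
    bookName: String,
});

// return the model registered under this key, or register it with the key as the collection name
const getModel = (key) =>
    mongoose.models[key] || mongoose.model(key, contentSchema, key);
```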
Conclusion
In fact, the main difficulty of this project lay in controlling the program's stability, setting up the fault-tolerance mechanism, and recording errors. At present the project can basically run through the whole process in one go. There are certainly still many problems with the design, though; corrections and discussion are welcome.
Easter eggs
After finishing this project, I built a React-based front-end site for browsing the pages and a server based on Koa 2.x. The overall stack amounts to React + Redux + Koa2. The front end and back end are deployed separately, which decouples the two services better; for example, the same server-side code can serve not only the web site but also a mobile site or an app. The whole setup is still very simple, but it covers basic querying and browsing. I hope to find time to flesh it out later.
- Crawler project: Guwen-spider
- React + Redux + semantic-ui front end: guwen-react
- Node Koa2.2 + Mongoose server: guwen-node
The project is fairly simple, but it provides an environment for learning and practicing development all the way from the front end to the server side.
That's all.