I’m gonna die if I don’t get a job
Front-end practice projects usually need data. You can't land a job without projects, and you can't build a project without data. So how do you get the data? With a crawler, of course! But when I started writing one, I found that things were not so simple. At first I used plain Node and produced a pile of bugs; all sorts of problems kept me from getting the data out, so I gave up.

Then I went looking for an existing wheel. I searched npm and found that Node does not have many crawler frameworks. One of the most downloaded is Crawler, which is also fairly simple to use, so I went with it. For a while, every crawl I needed was done with this framework. After several rounds I found that, although it works well, each crawl still involved a lot of repetitive work that code should be doing for me. So I wrapped the crawler to make it simpler and easier to use, and improved some of its features along the way.
Installation:
I have published the framework to npm under the name crawler-lian. You can install it directly if you want to use it, but it still has many bugs, so use it with caution:
npm i -D crawler-lian
Basic usage:
Just set up a few selectors and handler functions and you can crawl data. Come and try it!
const { fetchBySelector, utils } = require('crawler-lian');

// `uri`, `detailOptions`, and `parseTime` are placeholders that you define yourself.

// Get one attribute of the elements matched by a selector
fetchBySelector(uri, { selector: 'a', attr: 'href' }).then(({ data }) => console.log(data));

// Use several selectors in one request
fetchBySelector(uri, {
  selectors: [
    { selector: '.position', attr: 'text' },
    { selector: '.phone-num', attr: 'text' }
  ]
}).then(({ data }) => console.log(data));

// Get list data from a page, plus data from each item's detail page
fetchBySelector(uri, {
  groups: [
    {
      groupName: 'list', // A name you choose for the group
      // When crawling a list, el is usually the li element
      el: '.s_position_list > .item_con_list > .con_list_item',
      selectors: [
        { selector: '.position_link', attr: 'href', name: 'detail_url' },
        {
          selector: '.format-time', attr: 'text', name: 'time',
          handler({ value }) {
            return parseTime(value);
          }
        }
      ],
      // Handle each individual value
      handler({ value }) {
        return utils.removeSpace(value);
      },
      // Merge/process the matched data items
      process({ matchs }) {
        // A selector yields an array of matches; we only need the first one.
        if (matchs && matchs.length > 0) {
          return matchs[0];
        }
      },
      // Fetch the detail page of a list item and merge it into the current item.
      // Asynchronous operations are supported.
      itemProcess({ data }) {
        const detail_url = data.detail_url;
        if (detail_url) {
          return fetchBySelector(detail_url, detailOptions)
            .then(({ data: detailData }) => ({ ...data, ...detailData }))
            .catch(console.log);
        }
        return data;
      }
    }
  ]
}).then(({ data }) => console.log(data));

// Other options
const option = {
  deDuplication: false, // Whether to remove duplicates
  selector: 'a',        // Default selector
  attr: 'text',         // Default attribute to select
  trim: true,           // Whether to trim leading and trailing whitespace
  handler: null,        // Handles each individual selected value
  process: null,        // Processes the array matched by a selector; returns new data or a Promise
  test: null,           // Match condition, given as a regular expression
  filter: null,         // Filter; same purpose as test, but given as a function
  groups: null,         // Group crawl; takes precedence over selector when both are present
  itemProcess: null     // Processes one group item
};
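To make the option list above more concrete, here is a minimal sketch of how options such as trim, test, filter, handler, deDuplication, and process could be applied to an array of matched values. The applyOptions function is hypothetical and is not part of crawler-lian; the order of the steps is my own assumption, and the real library may behave differently.

```javascript
// Hypothetical helper, NOT crawler-lian's implementation. It only mirrors the
// option semantics described in the comments of the option list.
function applyOptions(values, options = {}) {
  const {
    trim = true,
    test = null,
    filter = null,
    handler = null,
    deDuplication = false,
    process: processFn = null, // renamed to avoid shadowing Node's global `process`
  } = options;

  // 1. Trim leading and trailing whitespace.
  let result = values.map((value) => (trim ? value.trim() : value));

  // 2. Keep only values matching the regular expression.
  if (test) {
    result = result.filter((value) => {
      test.lastIndex = 0; // guard against stateful /g or /y regexes
      return test.test(value);
    });
  }

  // 3. Keep only values accepted by the filter function.
  if (filter) result = result.filter((value) => filter({ value }));

  // 4. Transform each remaining value.
  if (handler) result = result.map((value) => handler({ value }));

  // 5. Remove duplicates while preserving order.
  if (deDuplication) result = [...new Set(result)];

  // 6. Let `process` reduce the whole array to whatever shape is needed.
  return processFn ? processFn({ matchs: result }) : result;
}

// Usage with made-up href values:
const hrefs = [' /job/1 ', '/job/2', '/about', '/job/1'];
const out = applyOptions(hrefs, {
  test: /^\/job\//,
  handler: ({ value }) => `https://example.com${value}`,
  deDuplication: true,
});
console.log(out);
```

Running each step on the whole array keeps the sketch easy to follow; a real crawler might instead apply these steps per element while walking the DOM.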
Give it a try:
const { fetchBySelector } = require('crawler-lian');

// Crawl Douban data (just a little bit ^v^)
fetchBySelector('https://movie.douban.com/chart', {
  attr: 'text',
  groups: [
    {
      groupName: 'list',
      el: 'tr.item',
      process({ matchs }) {
        return matchs.length > 0 ? matchs[0] : null;
      },
      itemProcess({ data }) {
        const { href } = data;
        return fetchBySelector(href, {
          selectors: [
            {
              selector: '#link-report',
              name: 'desc',
              process({ matchs }) {
                return matchs[0];
              }
            }
          ]
        })
          // One complete item: if you want to save it to a database, you can do that right here
          .then(({ data: detailData }) => ({ ...data, ...detailData }));
      },
      selectors: [
        { selector: 'a.nbg', attr: 'href', name: 'href' },
        { selector: 'a.nbg>img', attr: 'src', name: 'img_src' },
        {
          selector: 'tr > td:nth-child(2) > div > a', attr: 'text', name: 'name',
          handler({ value }) {
            // Collapse runs of whitespace into a single space
            return value.replace(/\s+/g, ' ');
          }
        }
      ]
    }
  ]
})
  // Output all data after parsing
  .then(({ data }) => console.log('result=', data));
The results are as follows:
result = {
  list: [
    {
      href: 'https://movie.douban.com/subject/34805219/',
      img_src: 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2566870171.jpg',
      name: 'Hungry Platform/Hungry Room (port)/ Desperate Platform (platform)',
      desc: 'In a dystopian future, prisoners are held in vertically stacked cells, hungrily watching food fall from the top, while those near the top are fed and those at the bottom are radicalised by hunger. \n' +
        ' \n' +
        ' Directed by Gard Gasterlu-Urushia, Platform Hunger is a twisted social fable about humanity at its darkest and thirstiest. '
    },
    {
      href: 'https://movie.douban.com/subject/2364086/',
      img_src: 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2582428806.jpg',
      name: 'Invisible Man/Invisible Man/Invisible Visitor (Port)',
      desc: 'Cecilia (Elisabeth Moss) never imagined that Adrian (Oliver Jessen-Cohen, Oliver Jackson-Cohen), the handsome man she once fell head over heels in love with, would become the cause of her nightmare. Not long after he fell in love with Himself, Adrian began to control Cecilia both mentally and physically. Finally, the unbearable Cecilia dazed Adrian with valium late one night and successfully escaped from the evil den. \n' +
        ' \n' +
        " Later, Cecilia was shocked to receive the news of Adrian's suicide. She could hardly believe that the demon had completely disappeared from her life. Cecilia's doubts were not without reason. In every corner of her life, an invisible shadow seemed to be watching her and trying to touch her. \n" +
        ' \n' +
        '© Douban'
    },
    // ... many more items omitted
  ]
}
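The itemProcess step used in the examples above boils down to "fetch the detail page for each list item, then merge it into the item". Below is a small self-contained sketch of that pattern. fetchDetail and fakeDetails are hypothetical stand-ins for a real fetchBySelector(detailUrl, detailOptions) call, so the snippet runs without any network access. One deliberate difference: with .catch(console.log), the promise resolves to undefined when the detail request fails, whereas this sketch falls back to the original item.

```javascript
// Hypothetical in-memory "detail pages"; a real crawl would call
// fetchBySelector(detailUrl, detailOptions) instead.
const fakeDetails = {
  '/movie/1': { desc: 'First description' },
};

// Hypothetical stand-in that resolves with a `{ data }` object,
// the same shape the examples above destructure.
function fetchDetail(url) {
  return url in fakeDetails
    ? Promise.resolve({ data: fakeDetails[url] })
    : Promise.reject(new Error(`no detail page for ${url}`));
}

// Merge one list item with its detail data.
// If there is no link, or the detail fetch fails, keep the item unchanged.
function mergeDetail(item) {
  if (!item.href) return Promise.resolve(item);
  return fetchDetail(item.href)
    .then(({ data: detailData }) => ({ ...item, ...detailData }))
    .catch(() => item);
}

// Merge every item in the list; result order matches input order.
function mergeAll(items) {
  return Promise.all(items.map(mergeDetail));
}

mergeAll([
  { name: 'A', href: '/movie/1' },
  { name: 'B', href: '/movie/404' },
  { name: 'C' },
]).then((items) => console.log(items));
```

Because each item handles its own failure, one bad detail page does not reject the whole Promise.all batch.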
Conclusion:
The framework also has a few small, scattered features not covered here, and of course there is plenty of room for improvement. If anyone reads this, I will write an article about the source code (you can also read it yourself; it is written very simply, just a bit messy…).
Finally, if anyone uses it, please let me know about any bugs you find (there are bound to be a lot…).
And finally, finally: I am a 2020 graduate looking for a job this year. If some kind soul is willing to take me in: [email protected]. This is a Vue project I made (using some Bilibili data and styles). If you are interested, take a look: lys.buctsnc.cn/.