I started a small side project at the end of last year and have been optimizing it on and off since. Here is a brief record.

Origin

The origin is that I hadn't had much chance to work on a real Node project before; at work I only wrote scripts and small tools with Node, which is far from running a Node service on a server. So I wanted to write a Node project that actually runs on a server, as practice.

I have always liked subscribing to information via RSS, which is simple and efficient. Instead of receiving pushes throughout the day and opening apps and websites one by one, it is better to take the initiative and read everything in one concentrated block of time. That avoids the anxiety of information trickling in at random times every day. The trouble is that I often forget to open the reader 😅, and when I open Reeder a week later the accumulated unread count has exploded again.

Therefore, I decided to build a news push service to meet my core need: push the front-end RSS updates to WeChat at 10 am every weekday, so that I can browse them comfortably when I arrive at my desk and pick out the useful ones to save and study slowly.

Project repository: github.com/Colafornia/…

A push might look like this:

Scan the QR code to subscribe to the push service:

Right now the push sources are mainly the Zhihu columns of the big companies' front-end teams, the personal blogs of well-known developers, and popular front-end articles from Juejin, all chosen according to my own taste.

Here's a look at the development process (and the requirements I kept imposing on myself along the way).

Start

At the beginning, I felt this requirement was very simple; the work broke down into:

  1. Write a configuration file containing the RSS feed addresses I want to fetch
  2. Find an npm package that parses RSS, traverse the sources in the configuration file, and process the parsed data
  3. Filter out just the articles updated in the last 24 hours, process the data, assemble it into one string, and push it via WeChat
  4. Run the script above as a scheduled task, done!

In the end I picked rss-parser as the parsing tool and PushBear as the push service, and wrote a first version using node-schedule for task scheduling.
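For illustration, a minimal sketch (not the project's actual code) of parsing a single feed with rss-parser and keeping only items from the last 24 hours, which mirrors step 3 above:

// A sketch, not the project's actual code
const Parser = require('rss-parser');
const parser = new Parser();

async function fetchRecentItems(feedUrl) {
    const feed = await parser.parseURL(feedUrl);
    const oneDayAgo = Date.now() - 24 * 60 * 60 * 1000;
    return feed.items
        .filter((item) => new Date(item.isoDate || item.pubDate).getTime() >= oneDayAgo)
        .map((item) => `${item.title} ${item.link}`);
}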

Then I discovered a gap in my knowledge: I hadn't considered keeping the process alive as a daemon when deploying the script to the server, so I studied PM2 and that took care of it.
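For reference, a PM2 ecosystem file for a service like this might look roughly as follows (the process name and entry file are assumptions, not the project's real values); it can then be started with pm2 start ecosystem.config.js:

// ecosystem.config.js - a sketch; name and script path are assumed values
module.exports = {
    apps: [{
        name: 'rss-push',        // hypothetical process name
        script: './index.js',    // hypothetical entry file
        instances: 1,
        autorestart: true,       // restart automatically if the process crashes
    }],
};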

The transition

At this point the project was usable, but it was crude and uncomfortable. The main problems were:

  1. There were about 40 or 50 RSS feeds at the time, and parsing all of them in one go would often result in timeouts or errors
  2. The RSS feeds are written in a configuration file, so every time I want to add or modify a source I have to change the code, which is clumsy
  3. PushBear only keeps three days' worth of pushes, so anything pushed three days or a week ago can no longer be viewed
  4. Juejin's RSS feed doesn't carry much content and isn't sorted by popularity (or maybe I'm using it wrong 😅), so it doesn't quite fit the bill

The first problem is a bit more complicated, and my solution is still fairly primitive. It boils down to two things: the number of concurrent requests needs to be controlled, and the RSS feeds themselves are unstable. The current solution is:

  1. Separate the fetch task from the push task, reserve time for up to three rounds of fetching, and in the later rounds only re-fetch the sources that failed previously
  2. Use async's mapLimit and timeout methods to cap the number of concurrent requests and set a timeout

The general code is as follows (some details omitted):

// Fetch timer ID
let fetchInterval = null;
// Fetch attempt counter
let fetchTimes = 0;

function setPushSchedule () {
    // Fetch task at 09:30 every day
    schedule.scheduleJob('00 30 09 * * *', () => {
        log.info('rss schedule fetching fire at ' + new Date());
        activateFetchTask();
    });

    // Delivery task at 10:00 every day
    schedule.scheduleJob('00 00 10 * * *', () => {
        log.info('rss schedule delivery fire at ' + new Date());
        let message = makeUpMessage();
        log.info(message);
        sendToWeChat(message);
    });
}

function activateFetchTask() {
    // Retry every two minutes until the work is done or the retry limit is hit
    fetchInterval = setInterval(fetchRSSUpdate, 120000);
    fetchRSSUpdate();
}

function fetchRSSUpdate() {
    fetchTimes++;
    if (toFetchList.length && fetchTimes < 4) {
        // Fewer than three rounds so far, and some sources still haven't been fetched successfully
        log.info(`Fetch round ${fetchTimes}, ${toFetchList.length} feeds left to fetch`);
        // Maximum of 15 concurrent requests, 8000ms timeout per request
        return mapLimit(toFetchList, 15, (source, callback) => {
            timeout(parseRSS(source, callback), 8000);
        });
    }
    log.info('fetching is done');
    clearInterval(fetchInterval);
    return fetchDataCb();
}

This solves more than 90% of the fetching problems and keeps the script stable.

As for the problem that the code has to change every time I want to add or modify an RSS source, the solution is simple: write the source configuration into MongoDB. GUI tools can then be used to add and modify the data directly.
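A minimal sketch of what the source collection could look like with Mongoose (field names are my own illustration, not the project's actual schema):

// model/source.js - a sketch assuming Mongoose; field names are illustrative
const mongoose = require('mongoose');

const sourceSchema = new mongoose.Schema({
    title: String,                               // display name of the feed
    url: String,                                 // RSS feed address
    enabled: { type: Boolean, default: true },   // disable a source without deleting it
});

module.exports = mongoose.model('Source', sourceSchema);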

To work around the push service only keeping three days of pushes, I added a weekly task that runs every Friday, collects the new articles of the week, and posts the content to the repository as an issue. It's a solution, at least.
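A rough sketch of how such a weekly digest could be posted to the repository as an issue through the GitHub REST API (the owner/repo path and token handling are placeholders, not the project's actual values):

// A sketch using the GitHub REST API; <owner>/<repo> and the token are placeholders
const axios = require('axios');

async function createWeeklyIssue(digestBody) {
    await axios.post(
        'https://api.github.com/repos/<owner>/<repo>/issues',
        {
            title: `Weekly digest ${new Date().toISOString().slice(0, 10)}`,
            body: digestBody,
        },
        { headers: { Authorization: `token ${process.env.GITHUB_TOKEN}` } }
    );
}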

As for the Juejin RSS feed problem, I eventually decided to call Juejin's API directly to get the data, which lets me filter on my own terms: each day only articles with more than 70 likes ❤️ are fetched.

By the way, I added an offset to the article time window, to avoid filtering out good articles that have few likes only because they were just published. I felt pretty good about this one.

function filterArticlesByDateAndCollection () {
    const threshold = 70;
    // articles is the article list, already sorted in ascending order of ❤️ count
    let results = articles.filter((article) => {
        // Offset of 5 hours, to avoid filtering out good articles with few likes just because they were only recently published
        return moment(article.createdAt).isAfter(moment(startTime).subtract(5, 'hours'))
            && moment(article.createdAt).isBefore(moment(endTime).subtract(5, 'hours'))
            && article.collectionCount > threshold;
    });
    // Include at most 8 articles to avoid information overload
    return results.slice(0, 8);
}

During this period I also came to fully appreciate the importance of logging, and added a new collection to the database to store each day's push content.

In addition, I added a separate channel on PushBear to push logs to myself: every day after the fetch task finishes, the fetched content is sent to me first, and if I spot a problem I can log in to the server and fix it in time (still pretty crude 😅).

Upgrade

After making the above changes, the script ran stably for almost half a year, during which time I was too busy with day-to-day work to make any further changes.

I never promoted it, but one day I suddenly noticed that more than 30 users had subscribed to the service. So, out of a sense of responsibility to those users (and also wanting to practice some new technology 👻), I made another round of changes.

The problems with the project at that point were:

  1. There is no article deduplication: if an article appears in a Zhihu column, on Juejin, and on the author's personal blog, it gets pushed several times
  2. The push time window is imprecise; articles are always filtered by "the past 24 hours from the current time"
  3. Having the script connect to the database directly isn't great; it feels more reasonable to turn this into a server that exposes an API (which an RSS reader, if I ever write one, could also use)
  4. Every time the code or its dependencies are updated, SSHing into the server and running npm install doesn't feel very professional; there is room for improvement (what I actually wanted was Docker)

Problems 1 and 2 are easy to solve: before each fetch, check the log for the exact time of the last push; and compare every newly fetched article against the articles logged in the past 7 days, dropping duplicates from the results.
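A sketch of the dedup idea, comparing newly fetched articles against what the logs say was pushed in the last 7 days (names and data shapes are illustrative):

// A sketch of the dedup check; recentLogs stands for the push logs of the last 7 days
function dedupeArticles(fetchedArticles, recentLogs) {
    // Collect links that have already been pushed recently
    const pushedLinks = new Set();
    recentLogs.forEach((log) => {
        (log.articles || []).forEach((article) => pushedLinks.add(article.link));
    });
    // Keep only the articles that have not been pushed before
    return fetchedArticles.filter((article) => !pushedLinks.has(article.link));
}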

For problem 3, I decided to set up a Koa server, starting by turning "read push sources from MongoDB" and "access the push log" into APIs.

The directory structure is as follows: add model and controller layers, and put the RSS fetch scripts and crawlers under a task directory.

There is nothing difficult about calling the API to get the RSS feeds:
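A minimal sketch of what such a route can look like with koa-router (the route path and model name are assumptions, not the project's actual code):

// A sketch of the source API with koa-router; paths and model are assumptions
const Router = require('koa-router');
const Source = require('../model/source');   // hypothetical Mongoose model

const router = new Router({ prefix: '/api' });

// GET /api/source - return all enabled RSS sources
router.get('/source', async (ctx) => {
    ctx.body = await Source.find({ enabled: true });
});

module.exports = router;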

An important issue comes up here: authentication. You can't just expose all the APIs to the outside world; that would be like exposing the database itself.

In the end, I decided on JSON Web Token (JWT) as the authentication scheme, mainly because JWT is well suited to one-off, short-lived authentication. At the moment my service is only called server-to-server and isn't used for long stretches each day, so there is no need to issue long-lived tokens.

Koa has a ready-made JWT middleware, koa-jwt:

// index.js
const jwtKoa = require('koa-jwt');   // Koa JWT middleware

app.use(jwtKoa({ secret: config.secretKey }).unless({
    path: [/^\/api\/source/, /^\/api\/login/]
}))

With this middleware in place, every interface except /api/source and /api/login requires JWT authentication.

The /api/login interface is used to issue the token. After receiving the token, the caller sets it in the request header to pass authentication:

// api/base.js
// Encapsulate axios
// HTTP request interceptor
import axios from 'axios';
const config = require('../config');

const Instance = axios.create({
    baseURL: `http://localhost:${config.port}/api`,
    timeout: 3000,
    headers: {
        post: {
            'Content-Type': 'application/json',
        },
    },
});

Instance.interceptors.request.use((config) => {
        // JWT validation
        const token = config.token;
        if (token) {
            config.headers['Authorization'] = `Bearer ${token}`;
        }
        return config;
    },
    error => {
        return Promise.reject(error);
    });

If the request header does not contain the correct token, an Authentication Error is returned.
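For completeness, the token-issuing side of /api/login might look roughly like this, assuming the jsonwebtoken package (the credential check is omitted; this is a sketch, not the project's actual handler):

// A sketch of the /api/login handler, assuming the jsonwebtoken package
const jwt = require('jsonwebtoken');
const config = require('../config');

async function login(ctx) {
    // Credential check omitted; issue a short-lived token on success
    const token = jwt.sign({ name: 'rss-push-api' }, config.secretKey, { expiresIn: '1h' });
    ctx.body = { token };
}

module.exports = login;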

As for problem 4, the service is still fairly simple and only deployed on one machine, so manually logging in and running npm install isn't a real problem; but with many machines and complex dependencies it could easily go wrong.

So I decided to build and deploy based on Docker.

FROM daocloud.io/node:8.4.0-onbuild
COPY package*.json ./
RUN npm install -g cnpm --registry=https://registry.npm.taobao.org
RUN cnpm install
RUN echo "Asia/Shanghai" > /etc/timezone
RUN dpkg-reconfigure -f noninteractive tzdata
COPY . .
EXPOSE 3001
CMD [ "npm", "start", "$value1", "$value2", "$value3" ]

The Dockerfile is relatively simple; it mainly installs the dependencies and starts the service. There are two points worth noting:

  1. For the base image, I recommend using a domestic registry such as DaoCloud or Alibaba Cloud
  2. The push service is time sensitive, and the base image's time zone is not the domestic one, so it has to be set manually

Then go to a public cloud platform such as DaoCloud, authorize it to access the GitHub repository, and connect your own host to get continuous integration, so that images are built and deployed automatically. For the detailed steps, see "Building a Front-end Continuous Integration Development Environment Based on Docker".

That's about it for this round of optimization. The next step might be a page for viewing push history; it isn't a high priority, so I'll do it later (and practice Nginx along the way).

The current implementation may still have unreasonable parts; suggestions are welcome.