Continuing to answer reader questions. Problem abstraction:
(1) A user membership system;
(2) Users generate a stream of score (points) records; every month the scores are aggregated, and different business actions are taken for members of different grades.
Data assumptions:
(1) Assume roughly 1 million (100W) users;
(2) Assume each user generates about one flow record per day, i.e. roughly 1 million new flow rows per day, about 30 million (3kW) per month, and on the order of 100 million rows in total;
Common solution:
Use a scheduled task that runs on the first day of each month.
// (1) Query all users
uids[] = select uid from t_user;
// (2) Iterate over each user
foreach $uid in uids[] {
    // (3) Query the user's score flow for the last 3 months
    scores[] = select score from t_flow
               where uid = $uid and time in [last 3 months];
    // (4) Iterate over the score flow
    sum = 0;
    foreach $score in scores[] {
        // (5) Accumulate the total score
        sum += $score;
    }
    // (6) Take business action based on the total score
    switch (sum) {
        // upgrade/downgrade membership grade, send coupons, send rewards
    }
}
What are the problems with a scheduled task executed once a month?
The amount of computation and the volume of data to process are both enormous, so it takes a long time; according to the reader who asked, 1-2 days.
Voice-over: the outer loop covers ~1 million users; the inner loop covers ~90 million (9kW) flow rows; and the business processing requires a dozen or more database interactions.
Can it be parallelized with multiple threads?
Yes. The flow processing of different users is independent, so the work can be split.
What problems remain if we switch to multi-threaded parallel processing, e.g. splitting the work by user?
Every thread still has to hit the database for its business processing, and the database may not be able to handle the load.
The optimization directions for this kind of problem are:
(1) Reduce repeated computation over the same data;
(2) Spread the CPU computation over time, processing in a dispersed rather than centralized way;
(3) Reduce the amount of data per computation;
How do we reduce repeated computation over the same data?
As shown in the figure above, assume each square is one month's score flow data (about 30 million rows).
At the end of March, the ~90 million rows of January, February, and March must be queried and computed.
At the end of April, the ~90 million rows of February, March, and April must be queried and computed.
…
Notice that the data for February and March (in pink) is queried and computed multiple times. Voice-over: with this scheme, each month's data is computed three times.
Add a monthly score summary table, and compute only the monthly increment each time:
flow_month_sum(month, uid, flow_sum)
(1) At the end of each month, only that month's scores are computed, so the data volume drops to 1/3 and so does the time;
(2) The total score for the last 3 months is then just the sum of the three most recent monthly rows per user (this takes almost no time);
Voice-over: this table is the same order of magnitude as the user table.
This way, each score flow row is computed only once.
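The pre-aggregation idea can be sketched as follows. This is a minimal illustration, not the author's actual implementation: the dict stands in for the flow_month_sum(month, uid, flow_sum) table, and the function name and data are made up.

```python
# Sketch: with a flow_month_sum(month, uid, flow_sum) table, the 3-month
# total per user is just the sum of three pre-aggregated rows, instead of
# re-scanning ~90 million raw flow rows.
flow_month_sum = {
    ("2019-01", 123): 40,
    ("2019-02", 123): 55,
    ("2019-03", 123): 30,
}

def quarter_score(uid, months):
    """Total score for `uid` over the given months, from the summary table."""
    return sum(flow_month_sum.get((m, uid), 0) for m in months)

print(quarter_score(123, ["2019-01", "2019-02", "2019-03"]))  # 125
```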
How do we spread the CPU computation over time and reduce the amount of data per computation?
The business only requires the score to be recomputed once a month, but the data volume is too large and a single monthly computation takes too long. The computation can therefore be spread out and run daily.
As shown in the figure above, the monthly score summary table is upgraded to a daily score summary table.
Splitting one centralized monthly computation into 30 dispersed daily computations reduces each computation's data volume to 1/30, so each run takes only a few minutes.
Going further, running the computation hourly reduces each run to 1/24 of a day's data, which takes even less time.
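The daily split can be sketched in the same spirit. Again a hypothetical illustration, with made-up function names and data: each day only that day's ~1 million raw flow rows are aggregated, and the monthly figure becomes a cheap sum over daily summary rows.

```python
from collections import defaultdict

def rollup_day(flow_rows):
    """Aggregate one day's raw flow rows (uid, score) into uid -> day_sum."""
    day_sum = defaultdict(int)
    for uid, score in flow_rows:
        day_sum[uid] += score
    return day_sum

# Each daily run touches only that day's rows.
day1 = rollup_day([(123, 5), (123, 2), (456, 7)])
day2 = rollup_day([(123, 4)])

# The monthly total per uid is then a sum over ~30 small daily rows.
month_total_123 = day1[123] + day2[123]
print(month_total_123)  # 11
```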
The runs are shorter, but this is still a scheduled task. Can the score flow be computed in real time instead?
Only about 1 million score flow rows are added per day, so the "daily score summary" can be maintained in real time.
Use DTS (or Canal) to subscribe to changes on the score flow table: whenever a user's score changes, the daily score summary is incremented in real time. The hourly scheduled computation is thus spread evenly across "every moment" of the day, over the ~1 million new flow rows per day.
Voice-over: if you can't use DTS/Canal, you can use an MQ instead.
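The real-time accumulation can be sketched as an event consumer. This is an assumption-laden illustration: the event shape and names are invented, and the in-memory dict stands in for the daily summary table that a DTS/Canal/MQ subscriber would update.

```python
from collections import defaultdict

# (day, uid) -> accumulated score; stands in for the daily summary table.
daily_sum = defaultdict(int)

def on_score_event(event):
    """Handle one score-flow insert event pushed by the binlog/MQ subscriber."""
    key = (event["day"], event["uid"])
    daily_sum[key] += event["score"]

# Simulated change events arriving one by one, instead of a batch job.
for e in [{"day": "2019-08-15", "uid": 123, "score": 5},
          {"day": "2019-08-15", "uid": 123, "score": 3}]:
    on_score_event(e)

print(daily_sum[("2019-08-15", 123)])  # 8
```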
To sum up, for a scheduled task that processes a large amount of data centrally in one go, the optimization ideas are:
(1) Reduce repeated computation over the same data;
(2) Spread the CPU computation over time, processing in a dispersed (even real-time) way rather than a centralized one;
(3) Reduce the amount of data per computation;
I hope this gives you some inspiration; the thinking matters more than the conclusion.
Homework:
Suppose a system login log looks like this (logs are harder to work with than a database, which can be indexed):
2019-08-15 23:11:15 uid=123 action=login
2019-08-15 23:11:18 uid=234 action=logout
…
Plot the curve of concurrently online users for that day, accurate to the second.
Description:
(1) The action can only be login/logout;
(2) An online user is one who has logged in but not yet logged out, i.e. is currently using the system;
(3) Users who logged in before 8-15 and had not logged out by 8-15 also count as online that day (the implication: scanning only that day's log is not enough);