background

When relational database performs counting task, its execution efficiency will decrease with the increase of data magnitude. When the volume of data reaches the billions, the efficiency of counting tasks becomes unbearably low. In the relationship system of the Idle Fish team, we adopt such a way to realize the millisecond counting of the data of hundreds of millions.

challenge

In the existing business scenario of Xianyu, the amount of data that users collect and pay attention to others has reached hundreds of millions. Traditional relational databases such as mysql are inefficient in executing the conditional count command. In a scenario with a large amount of data, they cannot effectively support online services.

As shown in the figure above, when the count operation is performed according to the sub-table key in the relational database storage with the magnitude of hundreds of millions of data, the response time alone cannot meet the requirements of online services, let alone the impact of frequent execution of such low performance query statements on database performance. For the counting scenario of massive data in the industry, two solutions are usually adopted: counter and timed off-line calculation. Both approaches have their advantages and disadvantages:

This paper proposes a method based onOffline batch processing + Online incremental statisticsThe complex and time-consuming database counting operation is replaced by multiple KV storage reading and docking operation, which is realized in the online business scenarioQPS peak response remained at the millisecond level (less than 10ms), and the success rate was always close to 100%, which strongly supports the development of business.

plan

The counting scheme proposed in this paper, in short, is to synchronize the full amount of data to the offline database for batch processing at regular intervals, keep statistics of incremental data online in real time, and finally combine the results of the two to obtain accurate calculated values.

Offline batch

Idle fish currently stores relational data as follows (omit irrelevant fields) :

If you want to count the number of A certain relationship of A user, such as the number of treasures collected by user A, you only need to obtain the total number of data whose source is A and status is 0. With the help of data offline processing capability, the first solution we came up with was to cut the data according to time points, that is, the data before a certain time point (such as 0 o ‘clock every morning) was synchronized to offline storage for calculation, and then merged with today’s incremental data to obtain the final result, as shown in the figure below:

The specified cut time is called expected Snapshot read time. When performing the offline data synchronization task shown in the figure above, the Xianyu team adopts aliyun ODPS offline synchronization task to complete the data offline by scanning the table. In the single-table scenario, the snapshot at a certain point in time can be accurately obtained with the help of the mysql snapshot read mechanism, thus achieving accurate offline and online data cutting.

The dilemma of offline batch processing

In actual scenarios, massive data must be stored in separate databases and tables. To reduce the impact on online services, offline data synchronization tasks are executed in batches after database and tables are divided. Therefore, the snapshot read time of all tables cannot be consistent.

If the snapshot read time of multiple sub-tables is inconsistent, the actual snapshot read time must be different from the expected snapshot read time, as shown in the following figure:

Snapshot, we assume that the expected read time and snapshots of the actual execution time interval produced a data change (in the diagram above, this change is 30 points in the morning, the specific state of the relationship from 0 to 1), the actual execution snapshots of the read will not be able to read to the expected snapshot time snapshot, but the latest state of the data. In other words,If the expected snapshot time ≠ the actual data read time, data changes generated between the expected snapshot read time and the actual snapshot read time will pollute the statistical results.However, because the magnitude of data is too large, it is necessary to use the storage of different databases and tables, and execute offline tasks in batches, so the idea of data cutting according to the time point is not feasible.

The solution

To avoid this problem, you must set the start time of a synchronization task as the data snapshot read time and cut data based on the fixed point in time instead of the fixed point in time. Using such synchronization, the offline statistics obtained will look like the figure below:

After calculating different source values, the counting result can also be obtained, and the latest modification time of the relationship can be obtained, and the latest modification time of different relationships is different. This paper will be off-line statistics obtainedContains the last calculated value of the relationship at a certain point (the last modified time of all relationships + the total value after modification), denoted as offlineTotal. Different from the output results of general offline tasks, offlineTotal contains the time of the latest modification of this relationship. The online incremental data statistics scheme in the following will be based on this time to complete data docking and merging.

Online real-time incremental statistics

Combined with the count result of offline data, the online incremental data must contain the following information:

  1. The time that can be matched with the offline calculation results;

  2. The relationship measures the change in value after that time.

In the realization of idle fish, we use the way of KV storage to record the changes of the relationship. The recorded values are: dailyIncrTotal for each source for each day, and modifiedTimeIncr for each time the relationship is updated, as shown in the following table:

When calculating the calculated value, first use the offline calculated value offlineTotal plus the total increment of the day dailyIncrTotal to get a total value, as shown below:

It can be seen that since the statistical data of offline tasks are not cut strictly according to the time point (usually, offline tasks are executed between 0 a.m. and 1 a.m. every day), there is some data overlap between the offline count value and the daily increment. At this point, take the increment at this moment according to the latest modified time in the offline calculation value and remove this part of data from the total value:

After sorting out the calculation logic, the following formula can be obtained:

Note: σ dailyIncrTotal represents the date of the last modification from the offline record up to the date of the count.

At this point, a complete count request is replaced with 2+N KV stored queries (N is the date difference between the offline calculation result and the current time, usually no more than a day). The scheme not only has the same response speed as the real-time counter, but also can run the offline task repeatedly and roll the correction data with the help of the offline batch processing capability in case of abnormal situation.

conclusion

This paper introduces a $data scale scenarios to achieve rapid accurate count, offline batch is used to reduce the pressure line, improve the efficiency of calculation, at the same time use KV storage real-time record incremental data snapshot, implements the counting result millisecond response, offline data correction and to depend on, hoping to give readers some use different perspective to think of inspiration. Sometimes a scenario that seems time-consuming and difficult to optimize can have unexpected benefits if you think about it from another perspective.

Join the idle fish for some “cool” fun

Idle fish technology team is a dapper engineering technology team. We not only pay attention to the effective solution of business problems, but also promote the cutting edge practice of computer vision technology in mobile terminals by breaking the division of division of technology stack (the unification of android/iOS/Html5/Server programming model and language). As a software engineer in the Idle Fish technology team, you have the opportunity to demonstrate all your talents and courage in the evolution of the entire product and user problem solving to prove that technology development is a life-changing force.

Resume: [email protected]

Identify two-dimensional code, forward-looking technology is in your grasp