Author: Idle fish technology – thinking

1. Business Background

With the development of idle fish business, the scale of users has reached hundreds of millions, and the data indicators of user dimension have reached hundreds. How to quickly screen out the user groups that meet the expectations from the data of hundreds of millions of levels, and carry out fine crowd operation is a problem that the technology needs to solve. Many solutions in the industry often take minutes or even hours to generate query results. This paper provides an efficient data screening, statistics and analysis method to solve the big data scenario. From the data of hundreds of millions of levels, any combination of query conditions can be used to screen the data needed, and the data can be returned in milliseconds.

2. Technical selection analysis

From a technical perspective, our business scenario has the following characteristics:

  1. Need to support the combination of arbitrary dimensions (and/ OR) nested query, and require low latency;
  2. Large data scale, at least 100 million levels, and need to support continuous expansion;
  3. A single data index has multiple dimensions, at least hundreds, and it needs to be continuously increased. Taken together, this is a typical OLAP scenario.

2.1 OLTP and OLAP

Here is a simple comparison between OLTP and OLAP:

OLTP OLAP
define Online transaction processing On line analytical processing
Application scenarios Daily Business Operations Analysis and decision-making, report statistics
The transaction request Need to support transactions No transaction support required
Common data operations Read/write high concurrency Query-oriented, low concurrency requirements
Real-time requirement high Not strict
The DB size MB-GB GB-TB
Data rows Tens of thousands to millions Billions or even billions
Data is the number of columns Dozens to hundreds of columns Hundreds to thousands

The most common databases, such as MySql and Oracle, use row storage, which is suitable for OLTP. When implementing OLAP with a row database such as MySql, you typically encounter two bottlenecks:

  1. Data volume bottleneck: mysql is suitable for millions of data levels, more than that, query and write performance will significantly degrade. Therefore, the method of database and table is generally adopted to control the data scale in millions.
  2. Query efficiency bottleneck: mysql needs to create an index or a combination of indexes for common conditional queries. Non-indexed field queries require scanning the entire table, resulting in significant performance degradation.

In summary, our application scenario is not suitable for the use of row storage database, so we focus on the column storage database.

2.2 Row storage and column storage

The following is a simple comparison of the characteristics of downlink storage and column storage:

Line storage The column type storage
Storage characteristics Store data in the same row Data in the same column is stored together
Read the advantages Quick to read the entire row at once Fast reading of single column data
Read the shortcomings Single column reading is also required to read the entire row of data Whole row query, need to reorganize data
Data update features INSERT/UPDATE is convenient INSERT/UPDATE is a hassle
The index You need to index query columns separately Each column can be used as an index
The compression characteristics The data in the same row varies greatly and the compression ratio is low A column of data has a high compression ratio due to similarity

Row storage is suitable for near-line data analysis, such as a scenario in which all fields in a table are queried for several qualifying records. Column storage is suitable for statistical analysis of data. Consider the following scenario: a table used to store users has 20 fields, and we want to count the average age of the users. If it is row storage, we need to scan the entire table, traversing all rows. But in the case of column storage, the database only needs to locate the age column, and then scan the data in this column to get all the ages, calculate the average, and the performance is theoretically 20 times faster than row storage. HBase is common in column storage databases. The core design of HBase applications focuses on rowkey design. Common filtering criteria are combined into rowkey design and rowkey query is performed using the ROWkey GET (single record) or SCAN (range). Therefore, HBase is suitable for unstructured data storage with limited query conditions. In our scenario, all fields need to be used as filtering criteria, so structured storage is required in nature and low query latency is required. Therefore, HBase cannot be used. We consider group many column type stored in the DB products (ADS/PostgreSQL/HBase/HybridDB), performance, stability, comprehensive evaluation, speaking, reading and writing, grammar, complete degree and the development and deployment costs, we chose the HybridDB for MySQL computation select the engine specifications to build the crowd.

2.3 HybridDB for MySQL Calculation Specifications

HybridDB for MySQL computing specifications for our scenario, the core competencies are:

  1. Smart composite index of any dimension (users do not need to build their own indexes)
  2. 10 billion large table query millisecond response
  3. MySql BI ecological compatibility, complete SQL support
  4. Spatial search, full-text search, complex data types (multi-valued columns, JSON) support

So how does HybridDB for MySQL compute specifications achieve millisecond response of any dimensional combination query in big data scenarios?

  • First, HybridDB’s high-performance column storage engine, built into the storage of predicate computing ability, can use a variety of statistics to quickly skip data blocks to achieve fast filtering;
  • The second one is HybridDB’s intelligent index technology, which can automatically complete index with one key on the large and wide table and intelligently combine various predicate conditions according to the column index for filtering;
  • The third is the high performance MPP+DAG fusion computing engine, taking into account the high concurrency and high throughput two modes to achieve high performance vector calculation based on pipeline, and the computing engine and storage closely cooperate, so that the calculation is faster;
  • Fourthly, HybridDB supports a variety of data modeling techniques such as star model, snowflake model, aggregation sorting, etc. Business-appropriate data modeling can achieve better performance indicators.

Generally speaking, HybridDB for MySQL calculation specification is a SQL-centered multi-functional online real-time warehouse system, which is very suitable for our business scenarios, so we build our crowd selection bottom engine on this basis.

3. Business implementation

After building the crowd selection engine, we focused on the transformation of our message push system as an important drop-off point for crowd refinement operation.

3.1 Introduction to Idle Fish Message Push

Message PUSH is the fastest way for information to reach users. The commonly used PUSH method of idle fish is to calculate the PUSH crowd offline and prepare the corresponding PUSH document, and then PUSH it at the specified time on the next day. These are usually periodic PUSH tasks. However, temporary, urgent PUSH tasks that need to be sent immediately need the intervention of BI students. Each PUSH task takes up about half a day of BI students’ development time on average, and the operation is quite troublesome. This time, we integrated the crowd selection system with the original PUSH system, greatly improving the data preparation and transmission efficiency of such PUSH, and freeing up development resources.

3.2 System Architecture



Offline data layer: User-dimension data is scattered in the offline tables of each service system. We import the data summary into the user wide and large tables in the real-time computing layer through the offline T+1 scheduled task.

Real-time computing layer: according to the screening conditions of the crowd, query the number of users and the list of user IDS that meet the requirements from the user large and wide table, and provide services for the application system.

Crowd selection front desk system: provides visual operation interface. Operation students select screening conditions and save them as crowds for analysis or send PUSH. For each population, there is an SQL store. Similar to: Select count(*) from user_big_table where column1> 1 and column2 in (‘a’,’b’) and (column31=1 or column32=2) SQL can support multiple and/or nested combinations of arbitrary fields. Use SQL to save the crowd, when the data in the user table changes, you can execute SQL at any time to obtain the latest crowd users, to update the crowd.

Idle fish PUSH system: obtains the where conditions corresponding to the crowd from the front stage system of crowd circle selection, obtains the user list in pages from the real-time computing layer, and sends PUSH to the user. In the implementation process, we focus on solving the performance problem of paging query.

Paging query performance optimization scheme: when paging, when the scale of the crowd is very large (tens of millions of levels), the further the page number, the query performance will have a significant decline. Therefore, we increase the line number of the crowd data and export it to MySql to improve performance. The table structure is as follows:

Batch no. The crowd ID The line Numbers The user ID
1001 1 1 123
1001 1 2 234
1001 1 3 345
1001 1 4 456
  • Batch number: each time the population is exported, a new batch number will be added. The batch number is time-stamped and increasing.
  • Line number: incrementing from 1, each lot number corresponds to a line number from 1 to N.

We build a combined index for “population ID”+” batch number “+” line number”, and replace the paging method with index query when paging query, so as to ensure the query efficiency when large page number.

In addition, the extra cost of exporting data is due to the powerful data exporting ability of HybridDB. The data volume ranges from ten thousand to one million levels and the time consumption ranges from seconds to tens of seconds. After comprehensive weighing, this scheme is adopted.

4. Revenue from PUSH system transformation



The crowd selection system provides a strong bottom capacity support for the operation of xianyu fine users. At the same time, crowd selection can also be applied to other business scenarios, such as home page focus map selection and other scenarios requiring hierarchical user operation, which provides a large space for optimization of idle fish business.

This paper realizes the second-level return result of combined query in massive multi-dimensional data, which is a general technical implementation scheme in OLAP scenario. At the same time, an application case of reforming the original business system with this technical scheme is introduced, and a good business result is obtained, which can be used as a reference for similar requirements or scenarios.

5, in the future

The user data in the crowd selection engine is currently imported by T+1. This is in consideration of population-related indicators, which change frequency is not very fast, and many indicators (such as user tags) are calculated by offline T+1, so the data update frequency of T+1 is acceptable. Then we build a more powerful commodity selection engine based on HybridDB. Free fish commodity data changes faster than user data. On the one hand, users can update their products at any time. On the other hand, due to the characteristics of idle fish product list inventory (sold immediately removed), and other reasons, the status of products can change at any time. Therefore, our selection engine should be aware of these data changes as soon as possible, and make real-time adjustments at the delivery level. Based on HybridDB(storage) and real-time computing engine, we build a more powerful “Mach” real-time selection system. A series of articles on “Mach” will be published soon. Interested students can pay attention. In addition, if you have any questions about the specific technical implementation (details) in this article, please contact us. thank you

Resources: HybridDB for MySQL is introduced: www.aliyun.com/product/pet…