By Chen Kai

Data development engineer at GrowingIO, mainly responsible for the design and development of the data platform for the SaaS and on-premises products, specializing in microservices and data warehouse construction.

GrowingIO processes nearly 100 billion user-behavior events every day, and the platform’s “event analysis” module is a frequently used feature that is simple yet powerful. In event analysis, customers can flexibly combine dimensions to view a metric, and the query speed is impressive.



This article distills the general data model behind GrowingIO’s event analysis to reveal the storage model and implementation principles behind this feature.



In user-behavior analytics, whether events are collected through codeless tracking or instrumented tracking events, a piece of behavior data usually takes the form: “someone” performed “some action” “some number of times” under “certain dimensions” at “a certain time”.



In data statistics, this expression can be broken down into “metrics” and “dimensions”. A metric can be the number of users, page views, the trigger count of a tracking event, and so on; a dimension can be time, city, browser, a user attribute, and so on.



In the context of massive data, how to efficiently compute metrics over dimensions has always been a hot topic in big data analytics. The following describes how we solve this problem at GrowingIO.


1. Start with a data requirement 📈



Suppose given the following set of raw data on user behavior:

Data description: each row is one access record for a user. (Only the Region and Device dimensions are listed here; there are of course others such as browser, platform, and version, which are omitted.)


1.1 Using SQL to analyze statistics


🤔 Now the business wants to know the number of users with Device: Mac, broken down by Region, over the last 7 days. So easy, one SQL statement.
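The original SQL snippet is not reproduced here, but a sketch of what such a query might look like, run against an assumed raw table named access_log with illustrative data (the 7-day date filter is omitted for brevity):

```python
import sqlite3

# Hypothetical raw table: one row per access record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (user_id INT, region TEXT, device TEXT, visit_date TEXT)")
rows = [
    (1, "Beijing", "Mac", "2021-11-10"),
    (2, "Beijing", "Mac", "2021-11-10"),
    (2, "Beijing", "Mac", "2021-11-11"),   # same user visiting again: counted once
    (3, "Shanghai", "Mac", "2021-11-11"),
    (4, "Beijing", "Windows", "2021-11-11"),
]
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?)", rows)

# Distinct user count per region for Device: Mac.
result = conn.execute("""
    SELECT region, COUNT(DISTINCT user_id) AS users
    FROM access_log
    WHERE device = 'Mac'
    GROUP BY region
    ORDER BY region
""").fetchall()
print(result)  # [('Beijing', 2), ('Shanghai', 1)]
```

A production query would add a `visit_date` range predicate for the last 7 days.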


Using GrowingIO’s analysis tool, the same query can be expressed as follows:


However, with SQL, as the data volume grows to billions or tens of billions of rows, the resources and response time required for the computation grow linearly. At that point, what customers experience most directly in the platform is a loading spinner that keeps turning while the chart never finishes loading.


1.2 How Can We Make the Query More Efficient?


1.2.1 Throw more machines and resources at it


The most straightforward way is to add more computing resources, or to cache and pre-warm query results. However, for a SaaS product, once query concurrency is high, queries start queuing up, and no amount of computing resources can overcome that bottleneck.


1.2.2 Data Layering


😼 In a layered data warehouse architecture, for frequently used query results we can generate a result table “Last 7 Days – Region – Device – Metric” by offline computation, as in the following example:



In this way, the feature query only needs to read the result table, which greatly reduces the amount of computation, and the response time is much shorter.
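A minimal sketch of this layering idea, with an offline pass that pre-aggregates raw events into a “(region, device) → distinct users” result table and an online lookup (all names and data are illustrative):

```python
from collections import defaultdict

# Illustrative raw events: (user_id, region, device).
raw_events = [
    (1, "Beijing", "Mac"),
    (2, "Beijing", "Mac"),
    (3, "Shanghai", "Mac"),
    (4, "Beijing", "Windows"),
]

# Offline pass: group distinct users under each (region, device) key.
result_table = defaultdict(set)
for user, region, device in raw_events:
    result_table[(region, device)].add(user)

# Online query: a single key lookup instead of a full scan of raw data.
print(len(result_table[("Beijing", "Mac")]))  # 2
```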


😱 But the demands of the business side keep changing, and soon a pile of statistical requirements lands on your head:


  • What are the “total users” and “Total visits” for the “last 30 days”?

  • What was the “number of users” for “Device: Windows” “yesterday”?

  • What is the “number of users” in “30 days of acquisition” by “platform”?

  • Which users visited on November 11? Please export the list of users

  • The endless demands for funnels, retention, user profiles, and so on are dizzying.



😫 You look at the generated result tables and realize that these questions are not fully covered. You feel you need to generate more result tables to meet more requirements. Eventually you find that some tables are used only once, and the warehouse fills up with junk.


1.2.3 Data preaggregation


Following the same idea as data layering, the data is pre-computed and pre-aggregated, trading space for time to speed up queries. This is also the approach of several mainstream open source frameworks, such as SparkSQL materialized views, Kylin Cubes, Druid Segments, and CarbonData MVs.


Here’s a diagram to show the main differences:


Based on the approach we are pursuing, we first need to find an efficient and flexible storage model.



1.3 Optimize the storage model


Based on the idea of data pre-aggregation in the previous section, it is not hard to see from the pre-aggregated results that several limitations remain:

  • Multiple dimensions are still stored horizontally, side by side

  • Dimensions and metrics are bound together in a single record

  • These two points are exactly what limits the flexibility of freely combining and extending “dimension + metric”


🤔 How can dimensions and metrics be combined more freely? We made some improvements based on the pre-aggregated results:

  • Store each dimension vertically

  • Store dimensions and metrics separately



2. Storage model based on BitMap 💻



2.1 Vertical Storage (User Count)


Taking the data set at the beginning as an example, each dimension is stored vertically:


Query the number of users with Region: Beijing AND Device: Mac:



OK! This solves the problem of combining arbitrary dimensions very flexibly, and the matching user group can even be obtained directly.
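A minimal sketch of this vertical storage, using plain Python ints as stand-in bitmaps (bit i represents user i; a production system would use a compressed structure such as RoaringBitmap) and illustrative data:

```python
def bitmap(*users):
    # Build an int-backed bitmap with one bit set per user id.
    bm = 0
    for u in users:
        bm |= 1 << u
    return bm

# One bitmap per dimension value (illustrative user sets).
region = {"Beijing": bitmap(1, 2, 3), "Shanghai": bitmap(4)}
device = {"Mac": bitmap(1, 2, 4), "Windows": bitmap(3)}

# Users with Region: Beijing AND Device: Mac -> bitwise AND of the bitmaps.
hits = region["Beijing"] & device["Mac"]
users = [u for u in range(8) if hits >> u & 1]
print(users)  # [1, 2]
```

Note that the intersection yields the user group itself, not just its size, which is why segmentation and export come for free with this model.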


😇 However, the table shows that we have not yet settled on a data structure for the “user set”. You could store the integer IDs in an array-like structure and intersect them, but we need better data compression and faster computation.


At this point you should be thinking of a data structure like a bitmap.


Why the BitMap data structure was chosen, and BitMap’s features and basic usage, are not covered here; you can refer to Java’s BitSet implementation and the optimized RoaringBitmap library.



2.2 Storing Counters (Count)


Map<Int, BitMap> (where the key is the count and the value is the set of users who match that count)


The visit-count table records who visited “once”, who visited “twice”, and so on.


Calculate the “Page Views” for “Region: Beijing” and “Device: Mac”.
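This computation can be sketched as follows, with Python ints standing in for bitmaps (bit i represents user i) and illustrative data; the page-view total is the sum of each count times the number of filtered users with that count:

```python
def bitmap(*users):
    bm = 0
    for u in users:
        bm |= 1 << u
    return bm

popcount = lambda bm: bin(bm).count("1")  # number of users in a bitmap

# Map<Int, BitMap>: who visited once, twice, three times (illustrative).
visit_counts = {1: bitmap(3), 2: bitmap(1), 3: bitmap(2)}

# Dimension filter: users in Region: Beijing AND Device: Mac (assumed bitmaps).
flt = bitmap(1, 2, 3) & bitmap(1, 2, 4)   # -> users 1 and 2

# Page views = sum over (count * filtered users having that count).
pv = sum(count * popcount(bm & flt) for count, bm in visit_counts.items())
print(pv)  # user 1 visited twice + user 2 visited three times = 5
```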



2.3 A More Elegant Way to Store Counts


In the Map<Int, BitMap> structure, the key stores a decimal count. This can cause the map to have too many keys, so the structure needs to be optimized.


The trick is to key the map by the positions of the 1-bits in the count’s binary representation:


For example, the binary form of 2 is “10”: reading from right to left (with subscript i starting at 0), bit 0 is 0 and bit 1 is 1. So this user is stored in the bitmap with key 1.


For example, the binary form of 5 is “101”: from right to left, bit 0 is 1, bit 1 is 0, and bit 2 is 1. So this user is stored in the bitmaps with keys 0 and 2.


The result from section 2.2 above can then be represented as follows:


Calculate the “Page Views” for “Region: Beijing” and “Device: Mac”.
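A sketch of the binary-decomposed counter, again with Python ints as stand-in bitmaps and illustrative counts: a user with count c appears in bitmap k iff bit k of c is 1, and the page-view total becomes a sum of powers of two times bitmap cardinalities:

```python
from collections import defaultdict

def add_count(store, user, count):
    # Place the user into one bitmap per set bit of their count.
    k = 0
    while count:
        if count & 1:
            store[k] |= 1 << user     # bit k of this user's count is 1
        count >>= 1
        k += 1

store = defaultdict(int)              # bit position -> bitmap
add_count(store, 3, 1)                # user 3 visited once  -> bitmap 0
add_count(store, 1, 2)                # user 1 visited twice -> bitmap 1
add_count(store, 2, 5)                # user 2: 5 = 101b     -> bitmaps 0 and 2

popcount = lambda bm: bin(bm).count("1")
flt = (1 << 1) | (1 << 2)             # dimension filter: users 1 and 2

# Total visits = sum over bits of 2^k * |bitmap_k AND filter|.
pv = sum((1 << k) * popcount(bm & flt) for k, bm in store.items())
print(pv)  # user 1 (2 visits) + user 2 (5 visits) = 7
```

The number of bitmaps now grows with log2 of the maximum count instead of with the number of distinct counts.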




3. The problem of multi-dimensional intersection ⚔️



The ideal is beautiful, but reality is cruel. In the example in section 2.1, each user has only one dimension combination, but in reality one user’s behavior may involve multiple dimension combinations.


What is a dimension combination? The tuple of values of all dimensions in one record is called a combination.


PS: If in your system each ID has only one dimension combination — for example, once an order is generated, its price, goods, logistics and other attributes are essentially fixed — then the previous model already covers most scenarios.



3.1 Facing Problems


🤔 So what is the problem? Back to the starting point, suppose another batch of user behavior data arrives:


This time, “User 1” additionally used “Windows” in “Hangzhou”. Stored with the previous model:

Computing the users for “Region: Beijing” directly returns [1, 2, 3] — no problem.


Now compute the users for “Region: Beijing” and “Device: Windows”:

❌ You will find that the result is wrong: only “User 3” should match.
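The false positive is easy to reproduce with the naive model (illustrative data matching the narrative: user 1 appears under both Beijing/Mac and Hangzhou/Windows, user 3 under Beijing/Windows):

```python
def bitmap(*users):
    bm = 0
    for u in users:
        bm |= 1 << u
    return bm

# Per-dimension bitmaps after the new batch arrives (illustrative).
region = {"Beijing": bitmap(1, 2, 3), "Hangzhou": bitmap(1)}
device = {"Mac": bitmap(1, 2), "Windows": bitmap(1, 3)}

# Naive AND across dimensions loses which values occurred together.
naive = region["Beijing"] & device["Windows"]
wrong = [u for u in range(8) if naive >> u & 1]
print(wrong)  # [1, 3] -- but only user 3 satisfies both conditions at once
```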



3.2 Use dimension combination Number


In fact, the problem is that when dimensions are stored separately, an important piece of information is lost: which dimension values occurred together. “User 1” does not satisfy both conditions at once, even though he has been in “Beijing” and has used “Windows”.


So we need a way to store this important “dimension composition relationship”.


Each user’s dimension combinations are numbered sequentially, which gives the following result:

Note: the numbers are per user; the same combination of dimensions gets the same number.


The storage structure changes accordingly to Map<Short, BitMap> (the key is the combination number, the value is the set of users that have that combination).


Now we can compute the users for the “Region: Beijing” and “Device: Windows” dimensions:


The result is User 3 — the data is correct.
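The combination-numbered model can be sketched as follows (Python ints as stand-in bitmaps, illustrative data: user 1’s combo 0 is Beijing/Mac and combo 1 is Hangzhou/Windows; users 2 and 3 each have one combo numbered 0):

```python
def bitmap(*users):
    bm = 0
    for u in users:
        bm |= 1 << u
    return bm

# dimension value -> {combination number -> bitmap}
region = {"Beijing": {0: bitmap(1, 2, 3)}, "Hangzhou": {1: bitmap(1)}}
device = {"Mac": {0: bitmap(1, 2)}, "Windows": {0: bitmap(3), 1: bitmap(1)}}

def intersect(a, b):
    # AND the bitmaps combination number by combination number, then union:
    # a user matches only if both dimension values came from the SAME combo.
    result = 0
    for combo, bm in a.items():
        result |= bm & b.get(combo, 0)
    return result

hits = intersect(region["Beijing"], device["Windows"])
matched_users = [u for u in range(8) if hits >> u & 1]
print(matched_users)  # [3]
```

User 1’s “Beijing” sits under combo 0 while his “Windows” sits under combo 1, so the combo-wise AND correctly excludes him.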



3.3 Computing Counts in the Multi-dimensional Case


Map<Int, Map<Short, BitMap>> (the outer key is the count, the inner key is the combination number)


For example, “User 1” occurred twice under “number 0” and once under “number 1”.


Now we compute the visit count for the “Region: Beijing” and “Device: Mac” dimensions:
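This combines the two extensions: counts keyed by combination number. A sketch with Python ints as stand-in bitmaps, using the illustrative data above (user 1 visited twice under combo 0 = Beijing/Mac and once under combo 1 = Hangzhou/Windows; user 2 visited once under combo 0):

```python
def bitmap(*users):
    bm = 0
    for u in users:
        bm |= 1 << u
    return bm

popcount = lambda bm: bin(bm).count("1")

# Map<Int, Map<Short, BitMap>>: count -> {combination number -> bitmap}.
counts = {
    1: {0: bitmap(2), 1: bitmap(1)},   # visited once
    2: {0: bitmap(1)},                 # visited twice
}

# Assumed result of the combo-wise AND for Region: Beijing AND Device: Mac:
# users 1 and 2 match, both under combination number 0.
matched = {0: bitmap(1, 2)}

# Visits = sum over (count * matched users with that count in that combo).
pv = sum(
    count * popcount(bm & matched.get(combo, 0))
    for count, per_combo in counts.items()
    for combo, bm in per_combo.items()
)
print(pv)  # user 2 once (1) + user 1 twice (2) = 3
```

User 1’s single Hangzhou/Windows visit (combo 1) is correctly excluded from the total.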




4. Simple performance comparison



Environment: SparkSQL (local[16], 4 GB memory) versus single-threaded BitMap computation (4 GB memory)


Scenario: simple combinations of two or three dimensions, computing user counts and visit counts, sorted in descending order with the top 1000 selected


X-axis: data volume / number of users

Y-axis: computation time, in milliseconds


As you can see, SparkSQL’s computation time grows with the data volume, while BitMap’s stays relatively stable.



5. To summarize



BitMap is a data structure that combines computational and storage advantages, even when storing tens of millions or even hundreds of millions of IDs.


And once you store data as BitMaps, many business scenarios — segmentation, tagging, funnel analysis, retention analysis, user reach, and so on — are supported naturally, because the user sets never need to be recomputed.


This article mainly reveals how we use BitMap as the underlying data model. Of course, there are still many challenges in practice; for reasons of space, they are not elaborated here.


Here are some of the problems we have tackled and solved:

  • BitMaps store ints, but in production your IDs may be strings such as UUIDs, so each string must first be mapped to a unique int

  • How to make BitMap computation perform well in a distributed environment

  • How to handle the challenge of high-cardinality dimensions

  • How to apply the model to real-time analytics, chart analysis, segment profiling, and more business scenarios

  • How to design an SQL-like language compatible with this data model
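The first challenge above is usually solved by dictionary encoding. A minimal in-memory sketch (a production system would persist the mapping; the class and IDs here are illustrative):

```python
class IdEncoder:
    """Assigns each string ID a dense, stable int usable as a bitmap position."""

    def __init__(self):
        self._ids = {}

    def encode(self, raw_id: str) -> int:
        # Assign the next dense int the first time an ID is seen.
        if raw_id not in self._ids:
            self._ids[raw_id] = len(self._ids)
        return self._ids[raw_id]

enc = IdEncoder()
print(enc.encode("550e8400-e29b-41d4-a716-446655440000"))  # 0
print(enc.encode("deadbeef-0000-0000-0000-000000000000"))  # 1
print(enc.encode("550e8400-e29b-41d4-a716-446655440000"))  # 0 (stable)
```

Dense ints matter because bitmap memory usage depends on how tightly packed the bit positions are.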



About GrowingIO

GrowingIO is a growth platform based on user behavior data and a leading data operations solution provider in China. It provides products and consulting services such as a customer data platform, acquisition analytics, product analytics, and intelligent operations for product, operations, marketing, and data teams as well as managers, helping enterprises improve their data-driven capabilities and achieve better growth on the path of data upgrading.
