The author is a big data engineer at GrowingIO, mainly responsible for the technical design and development of the SaaS analytics and advertising modules, and currently focused on building GrowingIO's materialized view engine.

This article introduces a Percentile implementation built on Bitmap, GrowingIO's internal data storage structure, and briefly compares Hive Percentile, Spark Percentile, and Bitmap Percentile.

Before explaining the algorithm itself, let us first clarify two questions:

  • What is Percentile used for in practical application scenarios?

  • Do you still remember what a Bitmap is?

1. What is a quantile 🔢

1.1 Quantile

😱 What is a quantile?

Given a set sorted in ascending order, a quantile is a number that splits the set into two subsets according to the given proportion.

Note: the quantile is not necessarily an element of the set; it is simply a number found to split the set.

🤔 Feeling a little dizzy? How do we find the 90th percentile?

It is the number that splits the sample into a left set holding 90% of the elements and a right set holding 10%.

① How do we find the 90th percentile of a set with an odd number of elements?

Original set: [35, 40, 41, 44, 45, 46, 49, 50, 53, 55, 58]

Disassembly set: [35, 40, 41, 44, 45, 46, 49, 50, 53], 55, [58]

Number of elements in left set: 9

Number of elements in the right set: 1

Here the number already exists in the original set and splits the sample into two sets in a 90/10 proportion.

So the 90th percentile of the sample above is 55.

② What if I had an even number of elements?

Original set: [40, 41, 44, 45, 46, 49, 50, 53, 55, 58]

Disassembly set: [40, 41, 44, 45, 46, 49, 50, 53, 55], X, [58]

This number lies between 55 and 58.

X = 58 - (58 - 55) * 0.9 = 55.3

Or X = 55 + (58 - 55) * (1 - 0.9) = 55.3

Meaning: 90% of the data is less than or equal to 55.3 and 10% of the data is greater than or equal to 55.3
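The interpolation rule above can be sketched in a few lines of Java. This is a minimal illustration; the class and method names are ours, not from any library:

```java
import java.util.Arrays;

public class PercentileDemo {
    // p is the quantile in (0, 1); the input need not be pre-sorted.
    static double percentile(double[] data, double p) {
        double[] x = data.clone();
        Arrays.sort(x);                          // step 1: sort ascending
        double pos = (x.length - 1) * p;         // split into integer and decimal parts
        int i = (int) Math.floor(pos);
        double frac = pos - i;
        if (frac == 0) return x[i];              // the quantile is an element of the set
        return x[i] + frac * (x[i + 1] - x[i]);  // otherwise interpolate linearly
    }
}
```

For the odd example above, `percentile(..., 0.9)` returns 55; for the even example it returns 55.3 (up to floating-point rounding).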

1.2 Application in Service Scenarios

👉 Given an [Order Payment Success] event, compute the 90th percentile of the total number of times each user performed the event in the past seven days, as shown in the figure below:

👉 For the same [Order Payment Success] event, take the sum of each user's actual purchase amount under this event yesterday and compute the 75th percentile, as shown in the figure below:

2. Percentile: Which Is Stronger?

2.1 Performance Comparison

Environment: 16 cores, 2 GB of memory

Comparison tests: [Percentile based on Bitmap] VS [Percentile_approx built into SparkSQL]

Scenario: randomly generate a number of users with corresponding counts, compute percentiles at randomly generated quantile points, and average the elapsed time over hundreds of runs.

X axis: data volume

Y axis: computation time, in milliseconds

1. Percentile in SparkSQL supports only Int and Long data types, so for generality Percentile_approx is used for the comparison.

2. The data stored in the Bitmap above is consistent with the data processed by Percentile_approx.

2.2 Percentile in Hive

Hive SQL's Percentile operates on a column, i.e., a field of a table. If tens of millions of rows are routinely processed and every value is loaded into memory just to produce a single result, the job simply gets stuck.

Therefore Hive compresses or preprocesses the data during the UDAF computation: each Mapper continuously updates an aggregate as rows are generated, and the internal implementation is based on a histogram.

Hive’s percentile_approx implementation is inspired by “A Streaming Parallel Decision Tree Algorithm”, which proposes the On-Line Histogram Building Algorithm.

What is histogram? The definition is as follows:

A histogram is a set of B pairs (called bins) of real numbers {(p1,m1),… ,(pB,mB)}, where B is a preset constant integer.

The definition includes a constant B that fixes the number of bins. Why introduce this constant? Suppose we have a simple sample dataset (elements may repeat):

[1, 1, 1, 2, 2, 2, 3, 4, 4, 5, 6, 7, 8, 9, 9, 10, 10]

The histogram is: [(1, 3), (2, 3), (3, 1), (4, 2), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 2)].

As you can see, the length of the bins array (pairs of data point and frequency) in this histogram is essentially the cardinality (the number of distinct elements) of the sample dataset.

The storage overhead of the histogram therefore grows linearly with the cardinality of the sample dataset, which means that without additional optimization a histogram cannot meet the statistical requirements of very large datasets. The constant B was introduced against this background, in order to cap the length of the histogram's bins array (its memory overhead).
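The paper's on-line histogram can be sketched as follows: insert each value as a unit bin, and once the bin count exceeds B, merge the two adjacent bins whose centers are closest. This is a simplified illustration; the class and field names are ours, not Hive's:

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingHistogram {
    static class Bin { double p; long m; Bin(double p, long m) { this.p = p; this.m = m; } }

    final int b;                                  // maximum number of bins B
    final List<Bin> bins = new ArrayList<>();     // kept sorted by p

    StreamingHistogram(int b) { this.b = b; }

    void add(double x) {
        // insert a new unit bin at its sorted position (or bump an exact match)
        int i = 0;
        while (i < bins.size() && bins.get(i).p < x) i++;
        if (i < bins.size() && bins.get(i).p == x) { bins.get(i).m++; return; }
        bins.add(i, new Bin(x, 1));
        if (bins.size() <= b) return;
        // over budget: merge the two adjacent bins whose centers are closest
        int k = 0;
        for (int j = 1; j < bins.size() - 1; j++)
            if (bins.get(j + 1).p - bins.get(j).p < bins.get(k + 1).p - bins.get(k).p) k = j;
        Bin l = bins.get(k), r = bins.remove(k + 1);
        l.p = (l.p * l.m + r.p * r.m) / (l.m + r.m);  // weighted center of the merged bin
        l.m += r.m;
    }
}
```

The total count is preserved across merges; only the positions become approximate, which is exactly the trade-off the quantile query later exploits.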

Hive's Percentile_approx is implemented by GenericUDAFPercentileApprox. Its core trick is to prepend a sequence identifying the requested quantiles to the histogram's bins array. The merge operation preserves only the histogram sequence, and the final result is computed from the histogram as follows:

/**
 * Gets an approximate quantile value from the current histogram. Some popular
 * quantiles are 0.5 (median), 0.95, and 0.98.
 *
 * @param q The requested quantile, must be strictly within the range (0,1).
 * @return The quantile value.
 */
public double quantile(double q) {
  assert(bins != null && nusedbins > 0 && nbins > 0);
  double sum = 0, csum = 0;
  int b;
  for(b = 0; b < nusedbins; b++) {
    sum += bins.get(b).y;
  }
  for(b = 0; b < nusedbins; b++) {
    csum += bins.get(b).y;
    if(csum / sum >= q) {
      if(b == 0) {
        return bins.get(b).x;
      }
      csum -= bins.get(b).y;
      double r = bins.get(b-1).x +
          (q*sum - csum) * (bins.get(b).x - bins.get(b-1).x)/(bins.get(b).y);
      return r;
    }
  }
  return -1; // for Xlint, code will never reach here
}

2.3 Percentile in Spark

Percentile

  • Accepts only Int and Long, computes the exact result, uses OpenHashMap under the hood, and then sorts the keys.

To speed things up, OpenHashMap adds an assumption:

  • For all data, only the Key is inserted or updated, but not deleted.

  • This assumption is mostly true in big data processing/statistics scenarios.

  • Under this assumption, the chained linked-list structure can be dropped and the hash table implemented with open addressing (linear probing).

The underlying data structure of OpenHashMap is OpenHashSet, so it essentially comes down to why OpenHashSet is fast.

OpenHashSet uses a bitset (bitmap) to record whether each slot holds data (bit operations) and a separate array to store the actual data, structured as follows:

protected var _bitset = new BitSet(_capacity)
protected var _data: Array[T] = _
  _data = new Array[T](_capacity)
  • The two members always have the same length. If bit x of _bitset is 1, then slot x of _data holds actual data.

  • When inserting data, hash(key) produces a pos; check whether that pos in _bitset is occupied, and if so, ++pos (linear probing).
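Assuming no deletions, the layout described above can be sketched like this. This is a toy version; Spark's real OpenHashSet also grows and rehashes, and specializes the data array by element type:

```java
import java.util.BitSet;

// A BitSet marks occupied slots; a parallel array holds the values.
// Collisions are resolved by linear probing. Assumes the set never fills up.
public class MiniOpenHashSet {
    final BitSet bitset;
    final long[] data;     // long elements for the demo
    final int capacity;    // power of two, so we can mask instead of mod

    MiniOpenHashSet(int capacity) {
        this.capacity = capacity;
        this.bitset = new BitSet(capacity);
        this.data = new long[capacity];
    }

    void add(long key) {
        int pos = (int) (key ^ (key >>> 32)) & (capacity - 1);  // cheap hash
        while (bitset.get(pos)) {
            if (data[pos] == key) return;                       // already present
            pos = (pos + 1) & (capacity - 1);                   // linear probe: ++pos
        }
        bitset.set(pos);
        data[pos] = key;
    }

    boolean contains(long key) {
        int pos = (int) (key ^ (key >>> 32)) & (capacity - 1);
        while (bitset.get(pos)) {
            if (data[pos] == key) return true;
            pos = (pos + 1) & (capacity - 1);
        }
        return false;
    }
}
```

Because nothing is ever deleted, a probe chain can stop at the first empty slot, which is what makes open addressing safe here.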

Why OpenHashSet is fast:

  1. High memory utilization: no 8-byte pointer structures, so a larger hash table fits in the same memory, reducing collisions.

  2. Compact memory: bitmap operations are fast.

Percentile_approx

  • Accepts Int, Long, and Double; computes an approximate result using the GK algorithm.

For the paper, see Space-Efficient Online Computation of Quantile Summaries.

The underlying implementation is QuantileSummaries, with two main member variables:

sampled: Array[Stats] stores the buckets; when more than 1000 accumulate, they are compressed (new triples are generated).

headSampled: ArrayBuffer[Double] is a buffer; each time it reaches 5000 elements it is sorted and merged into sampled.

The main idea is to minimize the space footprint. When merging two summaries, Spark does not even merge the already-sorted sample lists in order; it simply concatenates them and sorts:

// TODO: could replace full sort by ordered merge, the two lists are known to be sorted already.
 val res = (sampled ++ other.sampled).sortBy(_.value)
  val comp = compressImmut(res, mergeThreshold = 2 * relativeError * count)
  new QuantileSummaries(other.compressThreshold, other.relativeError, comp, other.count + count)

Stat definition:

/**
 * Statistics from the Greenwald-Khanna paper.
 * @param value the sampled value
 * @param g the minimum rank jump from the previous value's minimum rank
 * @param delta the maximum span of the rank
 */
case class Stats(value: Double, g: Int, delta: Int)

Insert function: for every N numbers inserted, at least one sort (and one merge) happens, so the time complexity is O(N log N):

def insert(x: Double): QuantileSummaries = {
  headSampled += x
  if (headSampled.size >= defaultHeadSize) {
    val result = this.withHeadBufferInserted
    if (result.sampled.length >= compressThreshold) {
      result.compress()
    } else {
      result
    }
  } else {
    this
  }
}

The query function has time complexity O(n):

// Target rank
    val rank = math.ceil(quantile * count).toInt
    val targetError = math.ceil(relativeError * count)
    // Minimum rank at current sample
    var minRank = 0
    var i = 1
    while (i < sampled.length - 1) {
      val curSample = sampled(i)
      minRank += curSample.g
      val maxRank = minRank + curSample.delta
      if (maxRank - targetError <= rank && rank <= minRank + targetError) {
        return Some(curSample.value)
      }
      i += 1
    }

2.4 A BitMap Interlude

GrowingIO: Massive Data Analysis based on BitMap

This section only briefly reviews CBitmap and adds a supplement introducing weights.

How are non-numeric metrics stored in bitmaps?

Here we analyze only a single event with a single dimension. For multi-dimensional analysis, please refer to the previously shared article; we will stop here.

The statistics are generated as a CBitmap:

CBitmap: Map(short → Map(dimsSortId → RoaringBitmap(uid)))

How are numeric metrics stored in bitmaps?

The discussion above covers metrics whose values are small integers, so two questions arise:

① What if the statistic is not an integer? In other words, what if the metric is the order amount?

② What if the values are integers but arrive in units of, say, "K"? Think it over: storing results from 0 to 1024 requires bits 0 through 10, which wastes a lot of space.

To keep calculation precision, we extract a common divisor from all the values so as to reduce storage. The concept of Weight is introduced here:

For example, for 100, 200, 300 we can extract 100 as a common divisor and store only 1, 2, 3; for 0.01, 0.02, 0.03 we can extract 0.01. Only one weight is estimated per metric. The estimation process is as follows:

1. For each value, compute a high and a low. high can be thought of as log10 + 1, i.e., the order of magnitude relative to 1; low can be thought of as the number of decimal digits kept within 1e-4 (the precision can be modified).

2. From all the highs and lows, derive a reasonably representative high and low: a high is accepted as long as the proportion of data at that high is above 1/10 of the average proportion.

3. high represents the maximum order of magnitude by which data are scaled. weight = 10 ^ max(high - n, -1 * low), where n affects integer precision (initially 6, later changed to 7).

For example, given the numbers 10,000 100 10 0.1 0.01: their highs are 5 3 2 -1 -2 and their lows are 0 0 0 1 2, so the combined high = 5 and low = 2, and finally weight = 0.1 (n = 6) or 0.01 (n = 7).

An extreme example: for 1 followed by 20 zeros, high = 21 and low = 0, so weight = 1 followed by 14 zeros (21 - 7).
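A hedged sketch of the per-value high/low and the weight formula described above. The helper names are ours, and the real estimation also aggregates highs and lows across all values by proportion, which is omitted here:

```java
public class WeightDemo {
    // Order of magnitude relative to 1: 10,000 -> 5, 100 -> 3, 10 -> 2, 0.1 -> -1, 0.01 -> -2.
    // v must be non-zero.
    static int high(double v) {
        double a = Math.abs(v);
        int e = 0;
        if (a >= 1) { while (a >= 1) { a /= 10; e++; } }
        else        { while (a < 1)  { a *= 10; e--; } }
        return e;
    }

    // Decimal digits to keep, capped at 4 (1e-4 precision): 0.1 -> 1, 0.01 -> 2, 100 -> 0.
    static int low(double v) {
        double x = Math.abs(v);
        int d = 0;
        while (d < 4 && Math.abs(x - Math.round(x)) > 1e-9) { x *= 10; d++; }
        return d;
    }

    // weight = 10 ^ max(high - n, -1 * low)
    static double weight(int high, int low, int n) {
        return Math.pow(10, Math.max(high - n, -low));
    }
}
```

Plugging in the article's example (high = 5, low = 2) gives weight = 0.1 for n = 6 and 0.01 for n = 7, matching the text.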

2.5 Implementing Percentile on Bitmap

Calculation logic

1. Sort the n values in ascending order into array x; let p be the percentile and x[i] the ith element of x (1-indexed).

2. Let (n - 1) * p% = integerPart + decimalPart (0 <= decimalPart < 1); then i = integerPart + 1.

3. When decimalPart = 0, quantile = x[i].

4. When decimalPart != 0, quantile = x[i] + decimalPart * (x[i+1] - x[i]).

If the presence of dimsSortId is ignored, a new CBitmap structure is obtained:

CBitmap:

Map(short → Map(dimsSortId → RoaringBitmap(uid)))

Map(short → RoaringBitmap(uid))

The CBitmap used to generate the event metric is as follows:

{

1 → {0 → (1), 1 → (1)},

0 → {0 → (2, 3, 4)}

}

The transformed CBitmap is as follows:

{

1 → (1),

0 → (2, 3, 4)

}
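Dropping dimsSortId is just a union over the inner maps per c-bit. Sketched with plain Java collections standing in for the real short → RoaringBitmap structures (illustrative only):

```java
import java.util.*;

public class FlattenCBitmap {
    // Union each c-bit's per-dimension bitmaps into a single uid set
    // (the Set.addAll here plays the role of RoaringBitmap.or).
    static Map<Integer, Set<Integer>> flatten(Map<Integer, Map<Integer, Set<Integer>>> cbitmap) {
        Map<Integer, Set<Integer>> out = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Set<Integer>>> e : cbitmap.entrySet()) {
            Set<Integer> union = new TreeSet<>();
            for (Set<Integer> uids : e.getValue().values()) union.addAll(uids);
            out.put(e.getKey(), union);
        }
        return out;
    }
}
```

Applying it to the example above turns {1 → {0 → (1), 1 → (1)}, 0 → {0 → (2, 3, 4)}} into {1 → (1), 0 → (2, 3, 4)}, matching the transformed CBitmap shown.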

How to calculate the Bitmap quantile

🤔 Let's start with some data and a simple requirement. The data is as follows:

CBitmap = {
  3 -> (1001, 1006)
  2 -> (1003, 1005, 1006)
  1 -> (1004)
  0 -> (1001, 1002, 1003)
}

🤔 Shall we just extract each user's cnt in order and then compute the quantile by the formula?

uid -> cnt: [(1002 -> 1), (1004 -> 2), (1005 -> 4), (1003 -> 5), (1001 -> 9), (1006 -> 12)]
x = [1, 2, 4, 5, 9, 12]
(6 - 1) * 0.75 = 3.75
quantile = x[i] + decimalPart * (x[i+1] - x[i]) = 5 + 0.75 * (9 - 5) = 8

Looks fine, but in fact it is slow…

Each user's cnt is obtained by traversing every c-bit from top to bottom, producing the sorted array of cnts, and the result then follows from the formula.
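This slow baseline can be sketched directly: rebuild every user's cnt by summing 1 << c-bit over the bitmaps containing that user, sort the cnts, and interpolate. Plain sets stand in for RoaringBitmap:

```java
import java.util.*;

public class NaiveBitmapPercentile {
    static double percentile(Map<Integer, Set<Integer>> cbitmap, double p) {
        // Reconstruct every user's cnt: each c-bit idx contributes 1 << idx.
        Map<Integer, Long> cnt = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : cbitmap.entrySet())
            for (int uid : e.getValue())
                cnt.merge(uid, 1L << e.getKey(), Long::sum);
        // Sort the cnts and interpolate as in section 2.5.
        long[] x = cnt.values().stream().mapToLong(Long::longValue).sorted().toArray();
        double pos = (x.length - 1) * p;
        int i = (int) Math.floor(pos);
        double frac = pos - i;
        return frac == 0 ? x[i] : x[i] + frac * (x[i + 1] - x[i]);
    }
}
```

On the example CBitmap above, the 75th percentile comes out as 8, matching the hand calculation; the cost is a full traversal of every c-bit plus a sort over all users.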

🤔 Can we simplify the problem and find a smarter solution?

Since CBitmap itself is layered and ordered, why not make full use of that?

To compute the quantile over a CBitmap, the key is to obtain the cnt of the ith and the (i+1)th person after sorting.

① First of all, we need to know what the integerPart is:

(6 - 1) * 0.75 = 3.75

So i = integerPart + 1 = 4, and we need x[i] and x[i+1].

② Since data at a higher c-bit in CBitmap is always larger than data at a lower c-bit, we can find the ith person's c-bits by traversing from the high bits down and excluding users along the way.

  • Start by recording a few necessary variables:

totalRbm = the set of users in the cBitmap not yet excluded

currIdx = the c-bit currently being traversed

currRbm = the RoaringBitmap at the current pointer position

persistRbm = totalRbm andNot currRbm

preDiscardRbm = totalRbm and currRbm

cBitmap = 
{
 3 -> (1001, 1006)
 2 -> (1003, 1005, 1006)
 1 -> (1004)
 0 -> (1001, 1002, 1003)
}

totalRbm = (1001, 1002, 1003, 1004, 1005, 1006)
currIdx -> currRbm = 3 -> (1001, 1006)
persistRbm = (1002, 1003, 1004, 1005)
preDiscardRbm = (1001, 1006)
  • The exclusion algorithm: find the c-bits that make up the ith user's cnt

1. If persistRbm's cardinality >= i, the ith user is still inside persistRbm. Set totalRbm = persistRbm and currIdx -= 1, recompute the variables above, and re-enter the exclusion algorithm.

2. If persistRbm's cardinality < i, the ith user is inside the current preDiscardRbm, so the running resCnt must accumulate the current c-bit's contribution, 1 << currIdx (added to any value accumulated before).

  1. The next step is to locate the target user inside preDiscardRbm: the new rank i = the previous i - persistRbm's cardinality.

  2. Set totalRbm = preDiscardRbm and currIdx -= 1, recompute the variables, and re-enter the exclusion algorithm.

After the round at currIdx = 0 completes, the accumulated resCnt is x[i].
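The steps above can be sketched for a single rank i (1-indexed, ascending). Variable names follow the text, and plain sets stand in for RoaringBitmap:

```java
import java.util.*;

public class ExclusionPercentile {
    // Returns x[i]: the cnt of the ith-smallest user, accumulated bit by bit.
    static long cntOfRank(TreeMap<Integer, Set<Integer>> cbitmap, int i) {
        Set<Integer> totalRbm = new HashSet<>();
        for (Set<Integer> uids : cbitmap.values()) totalRbm.addAll(uids);
        long resCnt = 0;
        for (int currIdx = cbitmap.lastKey(); currIdx >= 0; currIdx--) {
            Set<Integer> currRbm = new HashSet<>(cbitmap.getOrDefault(currIdx, Set.of()));
            currRbm.retainAll(totalRbm);                // preDiscardRbm = totalRbm AND currRbm
            Set<Integer> persistRbm = new HashSet<>(totalRbm);
            persistRbm.removeAll(currRbm);              // persistRbm = totalRbm ANDNOT currRbm
            if (persistRbm.size() >= i) {
                totalRbm = persistRbm;                  // ith user has a smaller cnt: keep excluding
            } else {
                resCnt += 1L << currIdx;                // ith user has this c-bit set
                i -= persistRbm.size();                 // re-rank within preDiscardRbm
                totalRbm = currRbm;                     // continue inside preDiscardRbm
            }
        }
        return resCnt;
    }
}
```

On the example CBitmap, cntOfRank(..., 4) returns 5 and cntOfRank(..., 5) returns 9, i.e., x[i] and x[i+1] from the hand calculation. Running it twice gives both values; the text below explains how to fold the two lookups into a single pass.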

  • Let's walk through it:

  • Do we need a second full traversal to get the user at position i + 1?

Change the receiving variable into an array, so that the ith and (i+1)th users are both obtained in a single pass of the exclusion algorithm; only two cases need to be considered:

1) When the ith and (i+1)th users sit at the same cnt bit, the subsequent exclusion logic is unchanged.

2) When the (i+1)th user is at currIdx while the ith user is at currIdx - 1, their totalRbm sets differ and must be computed separately.

  • Finally, combine x[i] and x[i+1] to obtain the quantile:

quantile = (x[i] + decimalPart * (x[i+1] - x[i])) * weight

Conclusion

This article has demonstrated the advantages of a Percentile algorithm built on BitMap as the underlying data model. BitMap's high compression brings great storage advantages, and its data structure allows statistics to be computed flexibly, quickly answering many Percentile-like requirements.

BitMap has plenty more extensibility and highlights. Here are a few; stay tuned:

  • BitMap tricks: operate directly on bytes to avoid deserializing BitMaps

  • BitMap transpose algorithm: a different way to solve count problems

References:


  1. www.jianshu.com/p/e271caf2d…

  2. Xiaoyue26. Making. IO / 2019/05/26 /…

  3. www.jmlr.org/papers/volu…

  4. Dx.doi.org/10.1145/375…

About GrowingIO

GrowingIO is a leading one-stop digital growth solution provider in China. It offers customer data platform, advertising analysis, product analysis, and intelligent operations products, plus consulting services, to product, operations, marketing, and data teams and their managers, helping enterprises improve their data-driven capabilities and achieve better growth through digital transformation.

Click here to get a free 15-day GrowingIO trial!