This article is based on a talk given at the 11th Meitu Internet Technology Salon.

Big data technologies and applications play a major role across industries, and the variety of open source tools available has made life much easier for practitioners. Bitmap, a computing technique that grew out of big data requirements, offers many advantages: fast computation, high information density, and support for massive data volumes.

Meitu has a huge amount of user data and runs a large number of data computing tasks every day. Bitmap technology greatly reduces computing overhead and saves data storage cost. Although many companies have experimented with Bitmap, there is still no relatively mature open source distributed Bitmap implementation, so Meitu developed its own distributed Bitmap system, which is now applied to computing tasks across many scenarios at Meitu.


Bitmap profile

Bitmap is a technology widely used by various frameworks, and its principle is simple.

A bit is a binary digit, and a Bitmap marks the value corresponding to an element by that element's bit position (each bit holding a 0 or 1 state). Simply put, a Bitmap is an array of bits.

For a simple example, suppose there are 10 users (IDs 1 to 10), and on a certain day users 1, 3, 5, 7, 8, and 9 log in to the system. How can we represent their login status simply? Just find each user's bit and set it to 1.

To check whether a user logged in that day, simply test whether the bit at the user's ID is 1; and by counting the number of 1s in the Bitmap, you get the total number of users who logged in. Bitmaps also support operations such as AND, OR, and ANDNOT, which make computations such as dimension crossing straightforward.
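As a minimal illustration of these operations (using Java's built-in java.util.BitSet rather than Meitu's own implementation), the login example above and a retention-style intersection look like this:

```java
import java.util.BitSet;

public class BitmapBasics {
    public static void main(String[] args) {
        // Users 1, 3, 5, 7, 8, 9 logged in today: set the bit at each user's ID.
        BitSet loginToday = new BitSet();
        for (int userId : new int[] {1, 3, 5, 7, 8, 9}) {
            loginToday.set(userId);
        }

        // Was user 4 logged in? Just test the bit at position 4.
        System.out.println("user 4 logged in: " + loginToday.get(4));    // false

        // Total logins = number of 1 bits.
        System.out.println("total logins: " + loginToday.cardinality()); // 6

        // Dimension crossing, e.g. next-day retention = newUsers AND activeToday.
        BitSet newUsers = new BitSet();
        for (int userId : new int[] {3, 5, 9, 10}) {
            newUsers.set(userId);
        }
        BitSet retained = (BitSet) newUsers.clone();
        retained.and(loginToday);                                        // intersection
        System.out.println("retained: " + retained.cardinality());      // 3
    }
}
```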


Bitmap has two important features

High performance

Bitmap's computing performance on its home turf is impressive. At Meitu, early statistics were mostly computed with Hive. For a simple retention calculation on one Meitu app (the number of new users who are still active the next day), Hive (using a LEFT OUTER JOIN) took 139 seconds, while Bitmap (an intersection calculation) took 226 milliseconds: the Hive time is 617 times that of Bitmap. Moreover, as shown in the following figure, Hive ran on a four-node Hadoop cluster, while Bitmap used a single process on a single node.


Small storage space

Because a Bitmap marks state with individual bits, the data is highly compact and occupies very little storage. For one billion active device IDs (numeric), an ordinary Int array needs about 3.72 GB of storage, while a Bitmap needs only about 110 MB. The storage savings also translate into significant performance benefits (such as lower memory consumption) for operations like deduplication and sorting.


Bitmap applications at Meitu

Meitu has a number of apps, such as Meitu Xiuxiu, Beauty Camera, Meipai, Fashion Selfie, and others. The best known, Meitu Xiuxiu and Beauty Camera, each have tens of millions of daily active users and have accumulated billions of historical users.

Most of Meitu's major daily statistics are based on Bitmap, such as new users, active users, retention, upgrades, and return visits. We also support multiple time granularities (day, week, month, even year) and multiple dimensions such as app, channel, version, and region, as well as cross-counting across all of these dimensions.

The principle of Bitmap is simple, but supporting massive data and requirements with a Bitmap service is more complex than you might think. From 2015 to now, we have gone from a single-node version to a distributed version, from a single app to access by many apps, and from a "small" amount of data for a handful of metrics to today's massive data and demand. Along the way we have encountered many challenges in our Bitmap practice; the typical ones are:

  • Bitmap indexes at the 100 TB scale. This is hard to maintain on a single node; you typically need external storage or a self-developed distributed data store to handle it;

  • Serialization and deserialization problems. Although Bitmaps are small and fast to compute on, when external storage is used, a single large Bitmap file can still reach hundreds of megabytes or more after compression, which leaves a lot of room for optimization. In addition, serializing data for storage and deserializing it for queries is time-consuming;

  • How to perform multi-dimensional cross calculations efficiently on top of distributed Bitmap storage, and how to respond quickly under highly concurrent query loads.


Distributed Bitmap — Naix

Naix, the final form of Meitu's Bitmap service, is a general-purpose distributed Bitmap service developed in-house at Meitu. To make Naix applicable to a variety of scenarios, we designed its components and structures to be as generic as possible.

Naix is named after a Dota hero; the Meitu data technology team has a whole "Dota series" of projects, such as Kunkka, Puck, and Arachnia. The reason for calling the distributed Bitmap service Naix is simple: it sounds like "Next", standing for the next-generation Bitmap.


Naix system design

As shown in the figure below, the Naix system is mainly divided into three layers: external call layer, system core node layer, and dependent external storage layer.



External call layer

The external call layer consists of the generator and the TCP client. The generator is the tool responsible for producing Bitmaps: raw and regular data is stored in HDFS or other storage media, and generator nodes convert the corresponding text (or other) data into Bitmap data and synchronize it into the system. The TCP client handles the interaction between client applications and the distributed system.


Core node layer

The core node layer consists of three types of nodes:

  • The Master node, the core of Naix, is responsible for cluster management and maintenance, such as adding Bitmaps and managing nodes;

  • Transport nodes, the intermediate nodes for query operations: after receiving a query request, a Transport node distributes it to the relevant data nodes;

  • Data nodes, the core data storage nodes in Naix. We use PalDB as the underlying Bitmap data store.


Dependent external storage layer

Naix depends on external storage only lightly: MySQL is mainly used for metadata management and for maintaining intermediate scheduling and storage state, while Redis is mostly used as a cache during computation.


Naix data structures

index group

As shown in the figure above, the index group is the most basic Naix data structure, similar to a database in a conventional database system, and is mainly used to isolate different businesses. For example, at Meitu, Meitu Xiuxiu and Meipai are two different index groups. Each index group can contain multiple indexes; an index is similar to a table in a conventional database. For example, the new-user and active-user Bitmaps belong to two different indexes.



index

Each index carries a fixed time attribute. Since Bitmap data can cover different time periods, data for different periods is normalized into the same index under its time attribute. The index for a given period then involves multiple dimensions, such as version, channel, and region, and each dimension has its own dimension values (such as v1.0.0, v1.2.0, etc.). The Bitmap files we refer to correspond to specific dimension values.
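To make the hierarchy concrete, a hypothetical identifier for one Bitmap file might look like the sketch below; the naming here is purely illustrative and is not Naix's actual layout:

```java
public class BitmapKeyExample {
    public static void main(String[] args) {
        // Hypothetical identifier for one Bitmap file in the hierarchy:
        // index group -> index -> time -> dimension -> dimension value.
        String bitmapKey = String.join("/",
                "meitu_xiuxiu",  // index group: one per business/app
                "active",        // index: e.g. "new" or "active"
                "2018-06-01",    // the index's time attribute (daily granularity)
                "version",       // dimension
                "v1.2.0");       // dimension value, identifying a concrete Bitmap
        System.out.println(bitmapKey);
    }
}
```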


Data information dictionary management

A Bitmap marks the state of a user or element by a numeric ID, but real-world business data is often not numeric. To run statistics on identifiers such as IMEI and IDFA, the device IDs first have to be converted into numeric IDs through data dictionary mappings before the Bitmaps are generated. For easier maintenance and use of the data, we also apply dictionary mapping to dimensions and dimension values.
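A minimal sketch of such a dictionary mapping is shown below. In production the mapping has to be persistent and shared (for example, backed by MySQL); the in-memory map here only illustrates the idea of assigning each device identifier a stable numeric ID and then setting its bit:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class DeviceDictionary {
    private final Map<String, Integer> dict = new HashMap<>();
    private int nextId = 0;

    // Map a raw device identifier (IMEI/IDFA string) to a numeric ID,
    // assigning a new ID the first time the device is seen.
    public int idFor(String deviceId) {
        return dict.computeIfAbsent(deviceId, k -> nextId++);
    }

    public static void main(String[] args) {
        DeviceDictionary dictionary = new DeviceDictionary();
        BitSet activeToday = new BitSet();
        for (String device : new String[] {"867530021234567", "A1B2-C3D4-IDFA"}) {
            activeToday.set(dictionary.idFor(device));
        }
        System.out.println("active devices: " + activeToday.cardinality());
    }
}
```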


Naix generator

The raw data for Bitmaps usually takes the form of MySQL records, HDFS text files, and so on. The Naix generator converts this raw data into Bitmap data and synchronizes it into the Naix system. The generator supports Bitmap generation for different scenarios through plug-ins, and business teams can develop their own business logic on top of these plug-ins.

The simple plugin is the simplest and was the first one we used. At Meitu, most raw data lives in HDFS; the Hive client filters the data onto a processing server, and the plugin then converts it into Bitmap data.

Because of Meitu's data volume and business complexity, daily data generation at one stage took nearly three hours, and if anything went wrong mid-run and had to be rerun, it would inevitably delay other statistics jobs with serious consequences. So we developed the MapReduce plugin, which uses distribution to speed up data generation.

With the MapReduce plugin, the generation process shrank from nearly three hours to about eight minutes (on a four-node test cluster). Thanks to the nature of MapReduce, as business and data volume keep growing we can keep generation fast simply by adding nodes or adjusting the number of map and reduce tasks.
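A rough sketch of how such a job can be organized is shown below. This is not Meitu's actual plugin code; the input format ("userId<TAB>channel" lines) and the idea of emitting one serialized Bitmap per dimension value are assumptions for illustration:

```java
import java.io.IOException;
import java.util.BitSet;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BitmapGenerateJob {

    // Mapper: each input line is "userId<TAB>channel"; emit (channel, userId)
    // so that one Bitmap can be built per dimension value.
    public static class DimensionMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            int userId = Integer.parseInt(fields[0]);
            String channel = fields[1];
            context.write(new Text(channel), new IntWritable(userId));
        }
    }

    // Reducer: collect all user IDs for one dimension value into a Bitmap
    // and emit its serialized bytes for loading into the Bitmap store.
    public static class BitmapReducer
            extends Reducer<Text, IntWritable, Text, BytesWritable> {
        @Override
        protected void reduce(Text channel, Iterable<IntWritable> userIds,
                              Context context) throws IOException, InterruptedException {
            BitSet bitmap = new BitSet();
            for (IntWritable userId : userIds) {
                bitmap.set(userId.get());
            }
            context.write(channel, new BytesWritable(bitmap.toByteArray()));
        }
    }
}
```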

The third plugin is the bitmap-to-bitmap plugin. For Bitmap data covering different time periods, users can configure this plugin to periodically derive new Bitmaps inside the system from existing ones: weekly, monthly, or yearly Bitmaps can be generated from the native Bitmaps (for example, weeks from days, months from weeks). Users only need to submit a generation plan, and the system automatically produces the Bitmap results on schedule.
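At its core, such a periodic rollup is just a union over the source Bitmaps. A minimal sketch, again using java.util.BitSet purely for illustration:

```java
import java.util.BitSet;
import java.util.List;

public class PeriodRollup {
    // Derive a weekly Bitmap from seven daily Bitmaps by taking their union:
    // a user counts as "weekly active" if the bit is set in any daily Bitmap.
    public static BitSet weekFromDays(List<BitSet> dailyBitmaps) {
        BitSet week = new BitSet();
        for (BitSet day : dailyBitmaps) {
            week.or(day);
        }
        return week;
    }
}
```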


Naix storage

shard

How do you store massive data in a distributed system? Conventional distributed Bitmaps generally rely on external storage such as HBase, or on distributed storage partitioned by business, and data lookup and data copying during computation then become serious bottlenecks. After various attempts, we settled on sharding: every Bitmap is split into shards of fixed width. Shards with the same shard number (and the same replica number) are stored on the same node, while different shards may live on the same or different nodes.

The sharding design brings a number of benefits:

  • Distributed storage of data at the 100 TB scale becomes straightforward;

  • Parallel computation: the Bitmap structure is special in that essentially all Bitmap operations can be computed shard by shard in parallel and then merged, so even operations on huge Bitmaps benefit;

  • Avoiding data copies: without sharding, data is typically partitioned by business, and once the volume is large enough a single business's data no longer fits on one node, so cross-business computation requires copying data around. With sharding, these computations are naturally split across nodes and performed independently per shard, avoiding data copies;

  • Serialization and deserialization: these problems usually arise with very large Bitmaps, but after sharding every Bitmap shard is of a controllable size, so serialization and deserialization are no longer an issue;

  • Crossing the int barrier: Bitmap implementations normally only support the int range, but as Meitu's business grows, its user IDs will quickly exceed that range. With sharding, shards can simply be stacked horizontally, and the ID offset within each shard lets us support IDs up to the long range (see the sketch below).
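A minimal sketch of the ID-to-shard arithmetic described in the last point; the shard width here is an assumed value, not Naix's actual configuration:

```java
public class ShardMath {
    // Hypothetical fixed shard width; Naix's real value may differ.
    static final long SHARD_WIDTH = 1L << 25;

    // A long ID maps to (shard index, offset within shard). Each shard only
    // stores offsets in [0, SHARD_WIDTH), which fits in an int, so an
    // int-range Bitmap per shard is enough even though the global ID is a long.
    static int shardOf(long id) {
        return (int) (id / SHARD_WIDTH);
    }

    static int offsetOf(long id) {
        return (int) (id % SHARD_WIDTH);
    }

    public static void main(String[] args) {
        long id = 5_000_000_000L;                   // beyond the int range
        System.out.println(shardOf(id) + " / " + offsetOf(id));
    }
}
```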


replication

Replication is an essential feature of any distributed system, protecting against data loss from machine downtime, disk damage, and so on. In Naix, replication is configured at the index group level.

As shown in the figure, the primary shard is shown in dark blue, and the replica shards in light blue and turquoise. For index groups configured with different replica counts and their corresponding indexes, the figure shows that shards with the same replica index are stored on the same data node, while different replicas of the same shard are always stored on different nodes.
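One simple placement rule consistent with those two constraints is sketched below. This is illustrative only and is not Naix's actual assignment logic:

```java
public class ReplicaPlacement {
    // Node assignment as a function of (shard, replica) only: all Bitmaps'
    // shard s / replica r land on the same node, and, as long as the replica
    // count does not exceed the number of data nodes, different replicas of
    // the same shard always land on different nodes.
    static int nodeFor(int shard, int replica, int numNodes) {
        return (shard + replica) % numNodes;
    }
}
```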


Space and file fragmentation related optimizations

Optimizing space and file fragmentation is where we spent much of our Bitmap effort. Bitmaps are implemented on top of long arrays or other numeric arrays, and the data tends to be either very dense or very sparse, leaving a lot of room for optimization. Most Bitmap compression algorithms use some form of word-aligned compression, which both saves space and reduces computation. In Meitu's Bitmap practice we used EWAH (a run-length-style scheme) early on and later switched to RoaringBitmap, which is also the Bitmap compression commonly used by Spark, Hive, Kylin, Druid, and other frameworks. Comparing EWAH and RoaringBitmap in our real business scenarios, RoaringBitmap saved 67.3% of space and 58% of generation time; on the query side, real-scenario tests showed a slight improvement, though not as large as for space and generation time.
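For reference, a minimal sketch using the open source RoaringBitmap library (org.roaringbitmap), not Naix's internal wrapper, showing that set operations run directly on the compressed form:

```java
import org.roaringbitmap.RoaringBitmap;

public class RoaringExample {
    public static void main(String[] args) {
        // Build compressed Bitmaps; both dense and sparse runs compress well.
        RoaringBitmap newUsers = RoaringBitmap.bitmapOf(1, 3, 5, 7, 8, 9);
        RoaringBitmap activeNextDay = RoaringBitmap.bitmapOf(3, 5, 9, 200_000_000);

        // runOptimize() converts suitable containers to run-length encoding,
        // often shrinking the serialized size further.
        newUsers.runOptimize();
        activeNextDay.runOptimize();

        // Retention = intersection, computed directly on the compressed form.
        RoaringBitmap retained = RoaringBitmap.and(newUsers, activeNextDay);
        System.out.println("retained: " + retained.getCardinality());
        System.out.println("bytes: " + retained.serializedSizeInBytes());
    }
}
```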

Early on, Bitmaps were stored as files, and reads were optimized with an mmap-like approach. But in everyday business there are many Bitmaps split along fine-grained dimensions (for example, a channel with very few new users), and sharding breaks these small Bitmaps into even smaller pieces. Small files make very poor use of the operating system's blocks and inodes, so we looked for a storage scheme that would solve the small-file fragmentation problem. We investigated the following options:

Redis

Redis itself supports bitset operations, but this option is not as effective as expected. Even storing the Bitmap data as simple key-value pairs, taking 200 TB of data as an example, with 256 GB per machine and one replica for backup, about 160 servers would be needed, and the cost would keep rising as business volume grows;

HBase

HBase is used in many scenarios in Meitu's big data platform, but if we stored Bitmap data in HBase, there would be limited room to optimize read and write performance, and the data would be heavily dependent on HBase;

RocksDB

RocksDB is widely used in the industry. In our tests, CPU and disk usage were unstable with compression enabled; with compression disabled, performance deteriorated badly as the data volume grew, and splitting the data across multiple DB instances did not help;

PalDB

PalDB is a read-only KV store developed by LinkedIn. In official benchmarks, PalDB performs roughly 8 times better than RocksDB and LevelDB once the data volume reaches a certain size, and it even outperforms Java's built-in HashSet and HashMap. PalDB's design is a trade-off: it limits the applicable scenarios, but it covers Naix's needs well. The PalDB code base is small, and we made some simple adjustments for our specific usage. In our tests, PalDB ultimately saved 13% of storage space, and query time improved by more than 60% in real concurrent scenarios. Generation takes slightly longer, but with the MapReduce generator mode in place this is negligible.
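A minimal usage sketch, storing serialized Bitmap bytes under a hypothetical key; the calls follow the PalDB README, though exact signatures may differ across versions:

```java
import java.io.File;

import com.linkedin.paldb.api.PalDB;
import com.linkedin.paldb.api.StoreReader;
import com.linkedin.paldb.api.StoreWriter;

public class PalDbShardStore {
    public static void main(String[] args) {
        // Write once: PalDB stores are read-only after being built, which fits
        // Bitmap data that is generated in batches and then only queried.
        StoreWriter writer = PalDB.createWriter(new File("shard-0.paldb"));
        writer.put("meipai/active/2018-06-01/channel=vivo",   // hypothetical key
                   new byte[] {1, 2, 3});                     // serialized Bitmap bytes
        writer.close();

        // Read path: look up the serialized Bitmap for one dimension value.
        StoreReader reader = PalDB.createReader(new File("shard-0.paldb"));
        byte[] bitmapBytes = reader.get("meipai/active/2018-06-01/channel=vivo");
        reader.close();
        System.out.println("bitmap bytes: " + (bitmapBytes == null ? 0 : bitmapBytes.length));
    }
}
```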


Naix query

In the Naix system, all interaction is built on our self-developed RPC service (based on Netty, over TCP). Protobuf is used both in the underlying RPC layer and in the Naix business-level protocol.

Distributed computing involves nodes, shards, data storage, and so on. For a query, how do we find the relevant data, and how do we compute on it?

In Naix, when a client sends a query to a Transport node, the Transport node uses the query conditions to select the optimal nodes and distributes the request to them. Each node then selects the relevant shards according to the query conditions, computes on them, and the per-node results are aggregated and returned to the client. The overall computation resembles one fork-join layered on top of another fork-join.

Naix supports queries in a generic way: it accepts compound expressions built from the operators ∩, ∪, − and parentheses. Users pick the query tuples and operators they need and assemble the query expression for their scenario; we also provide a tool that assembles and converts these expressions. Because Naix queries are not tied to any particular business, users can express all kinds of query operations this way.
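Conceptually, once the query tuples have been resolved to Bitmaps (or Bitmap shards), evaluating such an expression is plain set algebra. A minimal sketch with hypothetical inputs, not Naix's expression engine:

```java
import java.util.BitSet;

public class QueryExpression {
    public static void main(String[] args) {
        // Hypothetical query tuples already resolved to Bitmaps, e.g.
        // A = new users, B = vivo channel, C = baidu channel.
        BitSet a = bits(1, 2, 3, 4);
        BitSet b = bits(2, 4, 6);
        BitSet c = bits(4, 5);

        // The expression (A ∩ B) ∪ C evaluated with Bitmap operations.
        BitSet result = (BitSet) a.clone();
        result.and(b);     // A ∩ B
        result.or(c);      // ... ∪ C
        System.out.println("matching users: " + result.cardinality());
    }

    static BitSet bits(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }
}
```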

A few simple examples:

  • The simplest device or user lookup, such as querying whether a given user was new or active on a particular day;

  • Multi-dimensional combination analysis, such as checking the retention of one day's new users from the vivo channel;

  • Multi-dimensional partial cross analysis (a common data analysis scenario), such as counting the active users of Meipai v6.0 and v8.0 for the Baidu and vivo channels on a given day, which involves the four combinations of two channels crossed with two versions. Operations like these, including the first two kinds above, respond in milliseconds on average;

  • Multi-dimensional full cross calculation, such as crossing all channel and version information for one of the Meitu apps on a given day and outputting results at that scale. The performance of such operations depends on the number of dimensions in the query and the amount of data involved, and is usually at the second-to-minute level.


future

To bring Naix to more businesses within the company, and even to external scenarios, we are still improving and optimizing it. Currently we are working on the following:

  • In the early days we focused mainly on building the system and making sure it met our needs; now we are steadily enriching the operations and maintenance tooling to make Naix easier to maintain and manage;

  • Trying various computation optimizations to push query performance to the next level;

  • SQL query support is also part of our plan: SQL is a much more widely accepted interface, and a common query language like this should lower the learning cost for different users.