Background — Company profile
Eggplant Technology (known overseas as SHAREit Group) is a global Internet technology company mainly engaged in mobile Internet software development, global mobile advertising monetization solutions, cross-border payment solutions, and other Internet services. SHAREit, the flagship product of Eggplant Technology, is a one-stop digital entertainment content and cross-platform resource sharing platform with an installed user base of nearly 2.4 billion. As a company expanding overseas, Eggplant Technology has created a number of tool and content applications in Southeast Asia, South Asia, the Middle East, Africa, and other regions, and ranks among the top downloads on Google Play all year round.
Background – Business characteristics and database selection
Eggplant Technology has a broad product matrix with relatively complex product forms, spanning tools, content, games, advertising, payments, and more. For these complex business scenarios, we selected different databases according to the different business forms. At present, the six main databases used by Eggplant Technology are:
- Self-developed persistent KV: feature platform, user profiles, behavior records, etc.
- Redis Cluster: service cache and session information
- Cassandra: content library
- MySQL: Hue, metadata, and the operation platform
- ClickHouse: data analysis and real-time reporting
- TiDB: user growth, APM systems, cloud billing, etc.
Application practice of TiDB – Business pain points and advantages of TiDB
Based on pain points at the business level, we introduced TiDB in multiple business scenarios. First, as a technology company going global, Eggplant Technology has adopted several public clouds as its infrastructure, so at the database level it must consider how the business runs under a multi-cloud architecture: data migration, database compatibility, service adaptation, and data synchronization. Second, Eggplant has several high-traffic APP products whose business is growing rapidly; traditional relational databases such as MySQL hinder that rapid growth because they require manual sharding into separate databases and tables. Third, NoSQL databases such as Cassandra and HBase cannot handle complex scenarios such as distributed transactions and multi-table joins.
Some of Eggplant Technology's APM systems are HTAP scenarios: the same business data carries both OLTP and OLAP requirements, and we wanted a single database to handle both.
After the introduction of TiDB, TiDB has brought its unique advantages to bear in many areas, helping Eggplant Technology build a sustainable database ecosystem:
- TiDB's cross-cluster migration and data synchronization capabilities give the business room to expand under the multi-cloud architecture and support the architecture designs that multi-cloud requires.
- TiDB scales horizontally and elastically on its own, transparent to the business, eliminating the need for manual sharding into separate databases and tables.
- TiDB is highly compatible with MySQL, keeping learning and migration costs low in large-capacity, high-concurrency scenarios (a minimal connection sketch follows this list).
- TiDB's HTAP capability meets the dual OLTP and OLAP requirements on the same data.
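As a hedged illustration of the MySQL-compatibility point: because TiDB speaks the MySQL wire protocol, existing JDBC code can usually be pointed at a TiDB cluster by changing only the connection string. The host, credentials, schema, and table below are hypothetical placeholders, not Eggplant Technology's actual setup.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TiDBCompatDemo {
    public static void main(String[] args) throws Exception {
        // TiDB is reached through the stock MySQL JDBC driver; the only
        // TiDB-specific detail here is the default port 4000.
        // Host, credentials, and table names are placeholders.
        String url = "jdbc:mysql://tidb.example.internal:4000/app";
        try (Connection conn = DriverManager.getConnection(url, "app_user", "app_pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT COUNT(*) FROM users WHERE created_at >= ?")) {
            ps.setString(1, "2022-01-01 00:00:00");
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println("new users: " + rs.getLong(1));
                }
            }
        }
    }
}
```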
Application practice of TiDB – Application in the APM scenario
Eggplant Technology’s APM (Application Performance Management) system provides integrated capabilities for monitoring, analyzing, visualizing, and repairing APP crashes and performance problems, and supports a number of high-growth APP products. The system has four characteristics. First, data volume is large: billions of records are generated every day and must be retained for 30 days. Second, timeliness requirements are high: for severe situations such as crashes and serious performance problems, failing to meet the timeliness requirement directly affects user experience and even product revenue. Third, the system must connect to the work-order system to provide integrated problem tracking and repair. Fourth, OLTP transactional scenarios must be combined with OLAP analytical scenarios.
Start by analyzing the data flow of the early APM: from APP data reporting, to log collection, and finally into ClickHouse, the entire flow is batch-like and takes about two hours. Overall timeliness is weak, problems are not surfaced promptly, and user experience suffers. In addition, the system runs two databases, MySQL and ClickHouse. Why? ClickHouse handles the analytical aggregation of the data, while MySQL mainly backs the repair work-order process; keeping two databases in service at the same time is relatively costly. Looking at the new APM data flow after the introduction of TiDB, the path from APP reporting, to dashboard display, to alerting, and then to the work-order process achieves minute-level, quasi-real-time version display and alerting. This relies mainly on TiDB's HTAP capability: aggregation analysis drives the version dashboards and pushes timely alerts to the alert center, while TiDB's OLTP capability handles the row updates behind the dashboards. As a result, a single TiDB database covers the whole chain of version viewing, monitoring, problem tracking, and repair.
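To make the HTAP flow above concrete, here is a minimal sketch of the two access patterns sharing one TiDB database. The `crash_events` and `work_orders` tables, their columns, and the endpoint are hypothetical stand-ins; the source does not describe the actual APM schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ApmHtapSketch {
    private static final String URL =
            "jdbc:mysql://tidb.example.internal:4000/apm"; // placeholder endpoint

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL, "apm", "secret")) {
            // OLAP side: aggregate recent crash events per app version to feed
            // the version dashboard and the alert center. With a TiFlash
            // replica configured, TiDB can serve this from columnar storage.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT app_version, COUNT(*) AS crashes " +
                    "FROM crash_events " +
                    "WHERE reported_at >= NOW() - INTERVAL 10 MINUTE " +
                    "GROUP BY app_version ORDER BY crashes DESC");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s -> %d crashes%n",
                            rs.getString("app_version"), rs.getLong("crashes"));
                }
            }

            // OLTP side: update the work-order row that tracks a crash group
            // through triage and repair, in the same database, on the same data.
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE work_orders SET status = ? WHERE crash_group_id = ?")) {
                ps.setString(1, "FIXING");
                ps.setLong(2, 42L);
                ps.executeUpdate();
            }
        }
    }
}
```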
Evolution based on TiKV — Self-developed distributed KV
As we all know, TiKV is the storage layer of TiDB as well as a key-value database in its own right. Next, let's walk through how Eggplant Technology built a distributed KV system based on TiKV. Eggplant Technology mainly offers tool and content products, which generate very large volumes of data. KV storage needs to support two scenarios: one is data generated and written in real time; the other serves the user-profile and feature engines, quickly loading large amounts of offline-generated data into online KV storage for fast access by online business, i.e., the Bulk Load capability. In actual business, TB-level throughput per hour is required.
The following figure shows the distributed KV that Eggplant Technology developed on top of RocksDB, which meets the two kinds of KV requirements above. The architecture on the left mainly implements the real-time write capability: data flows from the SDK to the network protocol layer, then to the topology layer, then to the data-structure mapping layer, and finally into RocksDB. The right side is the batch Bulk Load import. You may wonder why the real-time write path on the left cannot handle hour-level TB data imports. There are two main reasons: first, RocksDB write amplification, which is especially severe in large-key scenarios; second, the network bandwidth and disk capacity of a single node limit its load and storage capacity. How, then, is the batch import capability on the right implemented? It uses Spark to parse Parquet data, pre-split the key ranges, and generate SST files; the SSTs are uploaded to the RocksDB storage nodes and finally loaded into the KV layer through ingest & compact for online business access. Single-node throughput can reach 100 megabits per second.
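The ingest-based path can be sketched with RocksDB's SST-writer API (RocksJava). This is a deliberately simplified, single-node model: the production pipeline described above sorts, pre-splits, and generates SSTs in Spark across many nodes, whereas here one sorted SST is built locally and ingested, just to show why ingest & compact sidesteps the normal write path. Paths and keys are illustrative.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

import org.rocksdb.EnvOptions;
import org.rocksdb.IngestExternalFileOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.SstFileWriter;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();

        // Offline phase: build one SST file of sorted key-value pairs.
        // SstFileWriter requires keys in ascending order, hence the TreeMap;
        // in the real pipeline, Spark does the sorting and pre-splitting.
        TreeMap<String, String> sorted = new TreeMap<>();
        sorted.put("user:1001", "profile-a");
        sorted.put("user:1002", "profile-b");

        try (Options options = new Options().setCreateIfMissing(true);
             EnvOptions envOptions = new EnvOptions();
             SstFileWriter writer = new SstFileWriter(envOptions, options)) {
            writer.open("/tmp/bulk.sst"); // illustrative path
            for (Map.Entry<String, String> e : sorted.entrySet()) {
                writer.put(e.getKey().getBytes(), e.getValue().getBytes());
            }
            writer.finish();
        }

        // Online phase: ingest the finished SST directly into the store.
        // This bypasses the memtable/WAL write path, and with it the write
        // amplification that makes hour-level TB imports impractical.
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/kvdb");
             IngestExternalFileOptions ingest = new IngestExternalFileOptions()) {
            db.ingestExternalFile(Collections.singletonList("/tmp/bulk.sst"), ingest);
            System.out.println(new String(db.get("user:1001".getBytes())));
        }
    }
}
```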
Evolution based on TiKV – Distributed KV based on TiKV
Since Eggplant Technology already had a distributed KV built on RocksDB, why adopt TiKV? First, on the technical side: although the self-developed distributed KV had been running in production for more than two years and supported hundreds of TB of data, capabilities such as automatic elastic scaling, strong consistency, transactions, and large-key support still required further R&D investment. Second, on the talent side, we lacked a deep reserve of high-caliber database engineers. After many investigations and discussions with the TiKV R&D team, we found that our needs and pain points coincided with TiKV's product roadmap, which prompted us to embrace TiKV actively. With TiKV's help, we can build KV products with compute-storage separation. Third, TiKV has an active open-source community that we can work with to polish the product together.
The architecture in the figure below is the distributed KV Eggplant Technology built on TiKV. The left part mainly addresses real-time data writing: from the SDK, through the network protocol layer and data computation, and finally into the TiKV storage engine. Our focus is the overall Bulk Load capability on the right. Unlike the self-developed distributed KV, the entire SST generation process is done inside TiKV, which minimizes Spark code development and maintenance costs and improves ease of use.
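For a concrete view of the real-time write path on the left, here is a minimal raw-KV write using the official TiKV Java client (`org.tikv:tikv-client-java`). The PD address is a placeholder, exact imports vary by client version (e.g., whether `ByteString` is the shaded protobuf class), and this sketch stands in for Eggplant's SDK and protocol layers rather than reproducing them.

```java
import org.tikv.common.TiConfiguration;
import org.tikv.common.TiSession;
import org.tikv.raw.RawKVClient;
import org.tikv.shade.com.google.protobuf.ByteString;

public class RawKvWriteSketch {
    public static void main(String[] args) throws Exception {
        // The PD (Placement Driver) endpoint is a placeholder.
        TiConfiguration conf =
                TiConfiguration.createRawDefault("pd.example.internal:2379");
        try (TiSession session = TiSession.create(conf);
             RawKVClient client = session.createRawClient()) {
            // Real-time write path: each put goes through TiKV's normal
            // Raft-replicated write pipeline, unlike the batch ingest path.
            client.put(ByteString.copyFromUtf8("event:20220101:0001"),
                       ByteString.copyFromUtf8("{\"action\":\"click\"}"));
        }
    }
}
```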
Evolution based on TiKV – Test results
The two tables below show the results of actual tests of TiKV's Bulk Load capability. The upper table shows the maximum single throughput achieved, 256 million, on machines with E5 CPUs, 40 vCores, and NVMe disks. In the lower table, we ran Bulk Load and online reads at the same time; the jitter in response latency is very small, and both P99 and P99.99 remain relatively stable. This test is a demo-level verification, and we believe that with subsequent optimization there will be a qualitative improvement in both storage throughput and response latency.
The Bulk Load capability was developed and evolved together with the TiKV R&D team. We believe in the power of openness, and we will publish the entire architecture, including the test data, on GitHub later; if you have similar requirements, stay tuned.