Construction and Practice of Kuaishou's Trillion-Scale Real-Time OLAP Platform

Author | Li Yuance

Editor | Xue Liang

At the ArchSummit Global Architect Summit held in Beijing on December 7-8, Li Yuance, architect of the Kuaishou big data platform, shared Kuaishou's construction and practice of its OLAP platform. The following is an abridged version of the talk.

At present, the Kuaishou app has 150 million daily active users, generating trillions of user behavior records every day. Exploring this data efficiently is challenging and valuable work. Today we will focus on the design of the Kuaishou OLAP platform at trillion-record scale and the main improvements made along the way.

1

An Overview of the Kuaishou OLAP Platform

The Kuaishou OLAP platform is still young. Before April 2018, multi-dimensional analysis requirements were still served by pre-defined indicators computed offline, an approach with obvious disadvantages.

The Druid OLAP analytics engine and the Superset data visualization platform were launched in April of this year, solving a number of business pain points. In May, the Druid platform was upgraded to 0.12, the latest community release at the time, addressing issues such as time zones and file loading performance. By July, the Druid platform was ingesting more than 100 billion messages a day, and users had configured more than 1,000 visualization charts. Since July, the platform has entered a phase of rapid development: Druid exposed many problems in query performance and stability, and we made a series of improvements. In September, we launched the Druid probe system, temporal and dimensional materialized views, fine-grained resource allocation for the Indexing Service, and many optimizations on the resource scheduling side. As of November this year, the platform's daily data intake has peaked at more than 500 billion messages, and the number of user-configured visualization charts has exceeded 10,000.

The OLAP platform has developed rapidly over the past six months. Thanks to the high-availability architecture built on Druid and the efforts of the team, the platform has run without any medium or major failures since launch, and its services are stable.

The Kuaishou OLAP platform runs on 150 physical servers, serves more than 2,000 data sources, ingests about 500 billion messages every day, and stores about 400 TB of index data. Queries peak at 10 million per day; this number looks very large, but most calls are API requests triggered by programs, and only a small proportion are triggered by humans. Overall, the average query latency is 50 ms, with P90 around 100 ms and P99 between 500 ms and 1 second. On the visualization side, users have accumulated more than 800 dashboards and more than 10,000 charts.

2

Kuaishou Business Scenarios Using OLAP

The first is the multimedia quality analysis service. Kuaishou uses many CDN vendors in China, involving hundreds of domain names, and reports tens of billions of CDN quality monitoring records every day. CDN service quality is directly related to the user experience of the app. The company's CDN quality team needs to analyze the monitoring data in real time, perform intelligent scheduling, and monitor the scheduling effect in real time. In addition, CDN quality problems must be analyzed and located quickly, which is itself a multidimensional analysis process that OLAP technology fits well.

Another business scenario is A/B testing. Kuaishou has about 1,000 A/B experiments running, with thousands of A/B metrics to compare, and tens of billions of records flow into the A/B testing platform every day. Analyzing A/B metrics is also a typical multidimensional analysis process; the OLAP platform needs to handle hundreds of thousands of query calls per day while keeping query latency at the 100-millisecond level.

During platform selection, we surveyed the needs of many business teams in the company. In summary, they care most about the following points: support for large data scale, since a single data source may need to ingest tens of billions of records per day; query latency guaranteed at the millisecond-to-second level; data timeliness, as many business lines explicitly require real-time data analysis; plus high-concurrency queries and platform stability. There are also some lower-weight requirements, such as flexible changes to the data schema, exact deduplication (precise distinct counts), and SQL interface support.

Based on this user survey, we compared the commonly used OLAP technologies.

First, Hive/SparkSQL is widely used in data warehousing, but its query latency cannot meet the millisecond-to-second requirement, and because it computes offline, its data timeliness is poor.

Second, ES is a powerful system that serves medium-scale data well, but at trillion-record or larger scale, both its write performance and its query performance hit serious bottlenecks.

Kylin and Druid offer similar functionality. Druid's OLAP architecture provides better data timeliness than Kylin's and allows more flexible schema changes, so we chose Druid as the OLAP query engine.

3

Druid System Overview

The Coordinator and Overlord are Druid's master nodes. The Middle Manager is mainly responsible for indexing data and generating index files, while the Historical node loads index files and serves queries over historical data. The Broker is the access node for queries. Druid also needs a metadata store, such as MySQL, and when a Middle Manager generates an index file it must first publish it to a shared storage system; we chose HDFS, which is already widely used.

Druid owes its excellent query performance to five techniques: data pre-aggregation, columnar storage, Bitmap indexing, mmap, and intermediate caching of query results. Two of these are worth describing in detail.

Let's start with data pre-aggregation. Druid splits a data record into three parts: a timestamp column, dimension columns, and metric columns. Pre-aggregation means that as data enters the Druid system, the raw records are aggregated over a certain time window, combining the metrics to be calculated across the full set of dimensions; this is the content that gets indexed. All subsequent queries are secondary aggregations over these pre-aggregated intermediate results.
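To make roll-up concrete, here is a minimal plain-Python sketch (not Druid internals) of how raw rows that share the same truncated timestamp and dimension values collapse into one pre-aggregated row; the columns and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw events: (timestamp, gender, city, count metric).
raw_rows = [
    ("2018-11-01T10:00:03", "Male",   "Beijing", 1),
    ("2018-11-01T10:00:41", "Male",   "Beijing", 1),
    ("2018-11-01T10:00:59", "Female", "Taiyuan", 1),
]

def truncate_to_minute(ts: str) -> str:
    """Truncate an ISO timestamp to the roll-up granularity (minute)."""
    return ts[:16]  # "YYYY-MM-DDTHH:MM"

# Roll-up: group by (truncated timestamp, full dimension tuple) and
# aggregate the metric. Queries later re-aggregate these partial sums.
aggregated = defaultdict(int)
for ts, gender, city, count in raw_rows:
    aggregated[(truncate_to_minute(ts), gender, city)] += count

for key, total in sorted(aggregated.items()):
    print(key, "->", total)
# The two Male/Beijing rows in the same minute collapse into one row.
```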

Let's move on to Bitmap indexes. Bitmap indexes mainly speed up queries with filter conditions. When Druid builds an index file, it generates a Bitmap for each value of each column. As shown in the figure, Gender = 'Male' corresponds to the Bitmap "1001", meaning rows 1 and 4 have Gender "Male". Take a query as an example: suppose we filter on Gender = 'Female' AND City = 'Taiyuan'. We only need to AND the Bitmap "0110" for Gender = 'Female' with the Bitmap "0101" for City = 'Taiyuan', yielding "0100", which indicates that only the second row meets the filter condition. Bitmaps quickly locate the data to be read and thus speed up queries.
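The example above can be reproduced with plain Python integers standing in for bitmaps; this only illustrates the principle, not Druid's actual compressed bitmap implementation.

```python
# Bitmaps from the example; the leftmost bit is row 1 (4 rows total).
bitmaps = {
    ("Gender", "Male"):    0b1001,
    ("Gender", "Female"):  0b0110,
    ("City",   "Taiyuan"): 0b0101,
}

# Gender = 'Female' AND City = 'Taiyuan' is a bitwise AND of two bitmaps.
result = bitmaps[("Gender", "Female")] & bitmaps[("City", "Taiyuan")]
print(format(result, "04b"))  # "0100": only row 2 passes the filter
```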

Druid supports real-time data ingestion from Kafka as well as offline imports from HDFS or Hive. Druid provides rich query APIs: in addition to the default Restful interface, there are third-party API implementations for Python, Java, Go, and other languages. Druid also supports a SQL interface, and Hive now offers Druid support through a StorageHandler that lets users query Druid data with Hive SQL; there are some details to watch out for here, such as time zone handling, case sensitivity of data source dimensions and metrics, and the default limit and default time range options.
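As an illustration of the default Restful interface, the sketch below posts a native JSON timeseries query to a Broker. The host, data source name, and column names are placeholders I invented, not values from this article.

```python
import requests  # third-party; pip install requests

# A minimal native timeseries query posted to the Broker's Restful API.
query = {
    "queryType": "timeseries",
    "dataSource": "cdn_quality",  # hypothetical data source name
    "granularity": "hour",
    "intervals": ["2018-11-01/2018-11-02"],
    "aggregations": [
        {"type": "longSum", "name": "events", "fieldName": "count"}
    ],
}

# 8082 is the Broker's default port; the host is a placeholder.
resp = requests.post("http://broker.example.com:8082/druid/v2/", json=query)
print(resp.json())
```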

4

Druid in Practice at Kuaishou and Some Major Improvements

This is the architecture diagram of our Druid OLAP platform. In the middle are Druid's own components. Data is ingested from Kafka in real time and imported offline from the Hive warehouse in batches. Around Druid we have built a complete metric system, probe system, Druid data source management system, and so on.

The OLAP platform faces many challenges at the scale of hundreds of billions or even trillions of records: how to make queries faster, how to use resources more efficiently, how to make data management and data onboarding more convenient, and how to keep the cluster platform stable. We have made targeted improvements and optimizations for each of these problems.

First, in terms of stability, we adopted a variety of resource isolation deployment schemes, achieving high availability and load balancing of Brokers at the access layer.

The Historical data storage layer is divided into two tiers. The first is hot/cold data separation: hot data is stored on SSD machines, and when hot data turns cold it automatically migrates to HDD machines. Since most queries target recent data, the speedup from SSDs is significant. Given the high cost of SSDs, you can configure two replicas of hot data, one on SSD and one on HDD, and give the SSD replica higher weight, so most requests still land on SSDs; when an SSD machine fails, requests fall back to HDDs. This saves a lot of money.
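One way to express this layout is through Druid's coordinator load rules, sketched below: one replica of recent data on a "hot" (SSD) tier and one on the default (HDD) tier, with older data kept only on HDD. The tier names, the retention period, and the host are illustrative assumptions, not Kuaishou's actual configuration.

```python
import requests  # pip install requests

# Load rules for a hypothetical data source: keep the last 7 days on both
# the SSD ("hot") tier and the HDD default tier, older data on HDD only.
rules = [
    {"type": "loadByPeriod", "period": "P7D",
     "tieredReplicants": {"hot": 1, "_default_tier": 1}},
    {"type": "loadForever",
     "tieredReplicants": {"_default_tier": 2}},
]

# Rules are posted per data source to the Coordinator (8081 is its
# default port; host and data source name are placeholders).
requests.post(
    "http://coordinator.example.com:8081/druid/coordinator/v1/rules/cdn_quality",
    json=rules)
```

Biasing queries toward the SSD replicas can then be achieved by giving the SSD historicals a higher server priority in their configuration.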

Besides hot/cold data separation, Kuaishou also isolates special business lines through Tier configuration, because some of them require higher query stability: the index data of such data sources is stored on dedicated Historical machines. This way, even if a large query causes Historical memory GC or drives system I/O load high, the query performance of these businesses is not affected.

We have also made many optimizations to speed up queries at large data scale. The first is the materialized view. We materialize at two levels: the dimension level and the temporal level.

What is a materialized view? Suppose a data source has ten dimension columns. By analyzing query requests, we find that the three dimensions in group1 and the three dimensions in group2 often appear together, while the remaining four dimensions are queried rarely. Worse, one of the rarely queried dimension columns is a high-cardinality dimension, one with a very large distinct count, such as user ID. This causes a significant query performance problem, because high-cardinality dimensions hurt Druid's data pre-aggregation: poor aggregation means larger index files, larger read I/O, and worse overall query performance. To optimize this case, we build separate pre-aggregated indexes for the group1 and group2 dimension sets. When a new query arrives, the system first analyzes the set of dimensions it references; if that set is a subset of one of the dedicated indexes' dimension sets, the query goes directly to the new index without touching the original full index, and query performance improves noticeably. This is the design idea of the materialized view, a typical space-for-time trade-off.
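The routing decision can be sketched as a simple subset test, as below; the index names and dimension sets are made up for illustration, and this is not Kuaishou's actual implementation.

```python
# Route a query to the smallest materialized view whose dimension set
# covers the query's dimensions, else fall back to the base index.
BASE_INDEX = "events_base"
MATERIALIZED_VIEWS = {
    "events_mv_group1": {"country", "os", "app_version"},
    "events_mv_group2": {"cdn_vendor", "domain", "province"},
}

def route(query_dims: set) -> str:
    candidates = [name for name, dims in MATERIALIZED_VIEWS.items()
                  if query_dims <= dims]
    if not candidates:
        return BASE_INDEX  # no view covers the query; use the full index
    # Prefer the view with the fewest dimensions (best aggregation).
    return min(candidates, key=lambda n: len(MATERIALIZED_VIEWS[n]))

print(route({"country", "os"}))       # -> events_mv_group1
print(route({"country", "user_id"}))  # -> events_base
```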

Temporal materialized views: besides the query scenarios above, there is another case Druid does not handle well, namely queries over large time ranges. If a data source is aggregated at minute granularity, querying the last three months of data is painful, because all minute-level index files for those three months must be scanned and re-aggregated.

To solve this problem, we create hour-level or even day-level materialized indexes on top of a data source's minute-level index. At these granularities the aggregation is much better and the overall index size is smaller. When a new query arrives, if its granularity is day level or coarser, it is automatically routed to the day-level materialized index, which again yields a noticeable improvement in query performance.
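A corresponding sketch of granularity-based routing, again with illustrative index names and a simplified granularity ordering:

```python
# Queries at day granularity or coarser go to the day-level materialized
# index; finer queries hit the minute-level base index.
GRANULARITY_ORDER = {"minute": 0, "hour": 1, "day": 2, "week": 3, "month": 4}

def route_by_granularity(query_granularity: str) -> str:
    if GRANULARITY_ORDER[query_granularity] >= GRANULARITY_ORDER["day"]:
        return "events_daily_mv"  # far fewer rows to scan and merge
    return "events_minute_base"

print(route_by_granularity("month"))   # -> events_daily_mv
print(route_by_granularity("minute"))  # -> events_minute_base
```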

Next, let's look at several optimizations we made to the interactions between Druid components and the MySQL metadata store.

The first is the interaction between the Overlord and MySQL. The Overlord queries the Segments table multiple times when publishing new Segments; the fix was simply to add an index to the Segments table. After the optimization, this MySQL query drops from more than 10 seconds to about 1 second, an improvement of more than 10x.

The other is the interaction between the Coordinator and MySQL. The Coordinator periodically scans the Segments table in full, and each scan takes a long time. First, the full scan is completely unnecessary, so we changed it to an incremental scan, which cut the whole scan from 1.7 minutes to about 40 seconds. We then created a MySQL index for the incremental-scan SQL, cutting the scan time to 30 milliseconds, an overall improvement of several thousand times.
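As a rough sketch of the incremental-scan idea, the snippet below polls the segments table by its created_date column (a column Druid's metadata schema does include), keeping a watermark of the last row seen. Connection details are placeholders, and this is an illustration, not the actual Coordinator patch.

```python
import pymysql  # third-party MySQL client; pip install pymysql

# Placeholder connection parameters for Druid's metadata store.
conn = pymysql.connect(host="mysql.example.com", user="druid",
                       password="changeme", database="druid")
last_seen = "1970-01-01T00:00:00.000Z"  # watermark of the last scan

def scan_new_segments():
    """Fetch only segments created since the previous scan."""
    global last_seen
    with conn.cursor() as cur:
        # An index on (used, created_date) keeps this in milliseconds.
        cur.execute(
            "SELECT id, dataSource, created_date FROM druid_segments "
            "WHERE used = TRUE AND created_date > %s ORDER BY created_date",
            (last_seen,))
        rows = cur.fetchall()
    if rows:
        last_seen = rows[-1][2]  # advance the watermark
    return rows
```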

The next step was to optimize the Segment file loading process. The Coordinator's scan that matches Segments against Rules is sequential by default; we parallelized it and improved some details, cutting the coordination time for millions of Segment files from 3 minutes to 30 seconds. With these changes, Druid's metadata system is largely free of performance bottlenecks.

5

Druid Cluster Resource Utilization Improvements

First, each Kafka indexing task corresponds to a Supervisor service, and the Supervisor's task count is set to a fixed value. If the count is too small, Kafka consumption lags and data is delayed; if it is too large, resources are wasted. Moreover, when creating an indexing task it is hard for users to estimate an appropriate task count. Our optimization lets the Supervisor automatically adjust the task count according to the current Kafka consumption lag, so there is no data delay during business peaks and resources are returned to the cluster during off-peak periods, significantly improving overall cluster utilization. A sketch of the idea follows.
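The scaling decision can be modeled as a simple feedback rule on consumer lag; all thresholds and step sizes below are invented for illustration, not Kuaishou's actual parameters.

```python
# Lag-driven task-count adjustment for a Kafka supervisor: scale out when
# consumer lag grows, scale in when it stays near zero.
MAX_TASKS, MIN_TASKS = 32, 2
SCALE_OUT_LAG = 5_000_000  # total messages behind, illustrative
SCALE_IN_LAG = 100_000

def next_task_count(current_tasks: int, total_lag: int) -> int:
    if total_lag > SCALE_OUT_LAG and current_tasks < MAX_TASKS:
        return min(current_tasks * 2, MAX_TASKS)  # catch up quickly
    if total_lag < SCALE_IN_LAG and current_tasks > MIN_TASKS:
        return max(current_tasks - 1, MIN_TASKS)  # release idle capacity
    return current_tasks

print(next_task_count(4, 8_000_000))  # -> 8 (scale out)
print(next_task_count(4, 50_000))     # -> 3 (scale in)
```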

Another issue is indexing task resource allocation in the Middle Manager. Druid assigns a fixed number of slots to each Middle Manager, but a Hadoop indexing task, unlike a Kafka indexing task, is only a Hadoop client that submits a job and occupies few local resources, so slot-based accounting wastes resources. Our optimization changes the Middle Manager's task scheduling from slot counts to memory-based allocation. Different task types are treated differently: Kafka tasks and Hadoop tasks get different default memory sizes, and users can also specify a memory size when submitting a task, subject to a maximum limit to prevent abusive requests.
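A minimal sketch of memory-based placement, assuming each task type carries a default memory requirement and each Middle Manager reports its free memory; all numbers and names are illustrative.

```python
from typing import Dict, Optional

# Illustrative per-task-type defaults: Hadoop tasks are thin clients.
DEFAULT_TASK_MEMORY_MB = {"index_kafka": 2048, "index_hadoop": 256}
MAX_TASK_MEMORY_MB = 8192  # hard cap to guard against abusive requests

def pick_worker(free_mem: Dict[str, int], task_type: str,
                requested_mb: Optional[int] = None) -> str:
    """Place a task on the Middle Manager with the most free memory."""
    need = min(requested_mb or DEFAULT_TASK_MEMORY_MB[task_type],
               MAX_TASK_MEMORY_MB)
    eligible = {w: mem for w, mem in free_mem.items() if mem >= need}
    if not eligible:
        raise RuntimeError("no Middle Manager has enough free memory")
    return max(eligible, key=eligible.get)

print(pick_worker({"mm-1": 4096, "mm-2": 1024}, "index_hadoop"))  # mm-1
```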

In addition, compacting Segment files in a timely manner speeds up queries and saves storage. Druid's native approach submits a special Compaction task that scans and merges Segment files sequentially, which performs poorly. We parallelized it: we submit a Hadoop job that scans Segment data in parallel on the Hadoop cluster and then performs the compaction.

We've also done a lot of work on platform usability. Operating the platform poses a problem: many new data sources need onboarding every day. In the platform's early days administrators could handle this themselves, but as the business grew rapidly the workload became very heavy. After a data source is onboarded, there are also many requests to modify its dimension and metric definitions. These needs call for a systematic solution.

In addition, users often do not deeply understand the Druid platform or even their own data, and may not clearly understand their business analysis needs. When onboarding data sources they tend to import large numbers of dimensions and metrics, which creates a hidden hazard: the more dimensions, the worse the aggregation, and some high-cardinality dimensions can severely hurt data aggregation and query performance.

To address these issues, we designed two tools: the Druid data source management system and the Druid probe system.

The data source management system is a Web system through which users onboard, view, and manage data sources. The information available includes dimension and metric definitions, Kafka consumption rate, Kafka consumption lag, and so on. The image above shows the indexing task list in the data source management system. The system has permission management, so only the owner of a data source can modify its configuration, such as dimensions and metrics.

Above is the indexing task detail page. Besides basic information, it shows things like the Kafka consumption rate, giving users the ability to troubleshoot online problems with their own data sources.

This is the page for creating and editing data sources. Creating a new Kafka data source is very convenient: Kafka information is pulled directly from the Kafka management system, so users do not need to fill it in manually. For the timestamp column and its format, the system samples the user's Kafka data and fills them in automatically, and it can even auto-correct a wrongly specified timestamp format. For dimensions and metrics, the system pre-parses the data and offers suggestions, so users only need to click to choose.

This figure shows the data source list. In the list you can clearly see each data source's data volume, average Segment file size, and dimension and metric information. In addition, if a data source is imported by an offline task, the offline task's name is automatically associated with the data source, making it easy to locate the scheduled import task quickly.

The Druid probe system addresses the following issues:

First, analysis of data source query heat. The probe system ranks all Druid data sources by overall query popularity, so administrators can see which data sources are queried most and give them extra care. It can also find cold data sources, or zombie data sources with no query requests at all, and notify their owners to take them offline so they stop occupying cluster resources.

For a single data source, the probe system can also analyze the heat of its internal dimensions and metrics: which dimensions are frequently queried, and which dimensions and metrics are cold ones that are rarely queried. In particular, it can find dimensions that are both cold and high-cardinality, a case that severely hurts query performance, and inform users promptly so they can optimize.
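A toy version of this dimension-heat analysis might look like the following, where the query log, the cardinality figures, and the threshold are fabricated inputs for illustration.

```python
from collections import Counter

# Hypothetical inputs: dimension sets seen in the query log of one data
# source, and the distinct count (cardinality) of each dimension.
query_log = [{"country", "os"}, {"country"}, {"os", "app_version"}]
cardinality = {"country": 200, "os": 10, "app_version": 50,
               "user_id": 50_000_000}

# Count how often each dimension appears across queries.
dim_heat = Counter(d for dims in query_log for d in dims)

for dim, card in cardinality.items():
    if dim_heat[dim] == 0 and card > 1_000_000:
        # Cold, high-cardinality dimensions hurt roll-up the most;
        # suggest that the owner drop them from the data source.
        print(f"suggest removing cold high-cardinality dimension: {dim}")
```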

Now let's talk about the data visualization side of the OLAP platform. A powerful visualization tool is a necessary component of an OLAP platform, and we use the open-source Superset solution. Superset, open-sourced by Airbnb, is an interactive and efficient data analysis and visualization platform that supports a wide variety of chart types.

As of now, our Superset deployment has accumulated tens of thousands of charts. Users hit many problems along the way, and we made many changes to solve them, covering data synchronization, permission management, alarm functionality, and various interaction improvements in the product design.

Let me introduce several key improvements, such as support for multiple time shifts. A time shift draws, in a single chart, a comparison between a metric's current value and its day-over-day or week-over-week values. What is shown here compares the current day with the previous day and the same day last week; users can add as many other comparison dates to the same chart as they want. Beyond these time-series charts, we have added time shift support to many other chart types as well.

What is shown here is the linked refresh of multiple charts on the same dashboard in Superset: dragging to select a time range in one chart refreshes the other charts accordingly, which is very practical in multi-chart correlation analysis.

Here is the design of the Superset alarm function. Much of the company's monitoring data relies on Druid and Superset, so there is a strong need for alarms. We referred to Grafana's alarm design and implemented a similar function in Superset: users can customize alarm dimensions, metrics, check periods, alarm levels, and so on.

6

Summary: Kuaishou's Improvements to Druid

In terms of performance, we built materialized views at both the temporal and dimensional levels and optimized metadata store interactions. At the resource management level, we implemented automatic scaling of Supervisor indexing tasks, fine-grained Middle Manager resource allocation, and parallel Compaction. At the stability level, we designed the isolated deployment of Brokers and Historicals. At the platform usability level, we developed our own data source management system and data probe system, and made deep customizations to the Superset data visualization platform.

Finally, some future plans for the Kuaishou OLAP platform. First, we will introduce new OLAP technologies such as ClickHouse. Second, we are considering unifying OLAP with ad-hoc queries and routine reporting, hoping OLAP technology can also play a bigger role in offline data analysis. Third, we want to provide one-stop service from data ingestion to data visualization, lowering the barrier for both technical and non-technical users. Fourth, we hope the platform can evolve from technology output toward productization and service orientation.

About the Author

Li Yuance is a big data platform architect at Kuaishou and leader of the data query engine team, responsible for the development and application of the company's SQL engine, OLAP engine, and multidimensional visualization platform. He previously worked at Qihoo 360 and is the author of the open-source project XLearning. His research interests include distributed computing, OLAP engines, SQL on Hadoop, and AI on Hadoop.