This article is based on the talk “Big Data Platform Architecture from 0 to 1”, given by Zhao Guoxian of Lianjia (now Shell) in the DataFun Talk data architecture series “Selection and Application of Data Engines under Massive Data”. It has been lightly edited without changing the original intent.

The methods for building a big data platform are broadly similar, yet the build-out still runs into many challenges. How to overcome them so that the platform better meets users' needs is the focus of this talk. The content is organized as follows. First, architectures 1.0 and 2.0: what each looks like and what problems were encountered going from 1.0 to 2.0. Second, the data platforms: which platforms we built and what problems each solves. Third, the current key project, OLAP engine selection and its results, along with some problems encountered. Fourth, our work on transparent compression.

In architecture 1.0, Hadoop sits at the bottom and is used to store and analyze data. Log data and transactional data have to be moved onto the Hadoop platform; we used Kafka and Sqoop for that transfer. On top of Hadoop we used the open-source Hive, with Oozie for scheduling; developers wrote HQL to meet business requirements, the results were loaded into MySQL or Redis clusters, and a reporting system sat on top. This setup ran for about a year and did solve some problems, but it had the following issues: (1) the architecture is simple and tightly coupled, so problems have to be traced from the bottom layer all the way up; (2) the platform is purely demand-driven: each requirement takes about two weeks to deliver, and sometimes it is no longer needed by the time it is developed and deployed; (3) it turns big data engineers into data-acquisition engineers who spend most of their time figuring out how to get data; (4) failures are frequent, such as failed HQL jobs or network delays; Oozie tasks are configured through XML, and recovery means rerunning the warehouse from the bottom layer to the top and refreshing the MySQL results again, which takes time.
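As a rough illustration of the 1.0 ingestion path, here is a hedged sketch of pulling a transactional table into Hive with Sqoop. The host, credentials, and table names are placeholders, not Lianjia's actual configuration.

```python
import subprocess

# Hypothetical example: import an orders table from MySQL into the Hive warehouse.
# Connection string, credentials, and table names are made-up placeholders.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host:3306/trade",
    "--username", "etl", "--password-file", "/user/etl/.mysql_pwd",
    "--table", "orders",
    "--hive-import", "--hive-table", "ods.orders",
    "--num-mappers", "4",
], check=True)
```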

Faced with these problems, we adjusted the architecture. The data platform is divided into three layers. The first is the cluster layer, mostly open-source components: Hadoop for distributed storage, YARN for resource scheduling, and compute engines such as MapReduce, Spark, and Presto, with the Hive data warehouse built on this foundation. HBase (a distributed real-time database), Oozie, and Sqoop handle data storage, computation, and scheduling, along with data security. The second layer is the toolchain. It includes a self-developed scheduling platform that replaces the Oozie of architecture 1.0; its basic capabilities are task dispatch and distribution, monitoring and alerting, and intelligent scheduling with dependency triggering, which will be described in detail later. It also provides dependency visualization, so when a problem occurs the affected data can be quickly located and repaired. Next is Meta, the metadata management platform: the warehouse currently has more than 30,000 tables, and the metadata platform makes the warehouse visible. There is also AdHoc, which exposes the warehouse tables so that requesters can find the data they need on their own; on our side we only have to optimize the query engine and take care of record keeping, permission control, rate limiting, and throttling. The top layer abstracts all of the platform's data into APIs, in three groups: internal big data APIs, company business APIs, and general APIs. The internal APIs serve the data platform's own needs, such as the visualization platform and the data management platform, with dedicated APIs to manage these APIs. The business APIs serve the business directly: we use our technology to help the business produce more output, and users obtain the data they need through the API. The general APIs are generated from internal warehouse reports, and business requesters assemble them according to their own needs. Architecture 2.0 basically solves the problems we ran into with architecture 1.0.
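Purely as an illustration of the "general API" idea in the top layer (the endpoint, fields, and data are invented, not Lianjia's API), a warehouse report exposed for self-service assembly might look roughly like this:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Invented example data standing in for a report produced inside the warehouse.
REPORT = [
    {"city": "Beijing", "deals": 120},
    {"city": "Shanghai", "deals": 95},
]

@app.route("/api/v1/reports/deals")
def deals_report():
    # A general API: business users filter and assemble the published report as needed.
    city = request.args.get("city")
    rows = [r for r in REPORT if city is None or r["city"] == city]
    return jsonify(rows)

if __name__ == "__main__":
    app.run(port=8080)
```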

The second part is a brief introduction to the platforms. The first is the storage layer, the cluster layer, which takes care of the operations and maintenance work. We built our tooling on top of the open-source stack (Presto among others), so that after one or two weeks an intern can take over the work, which relieves the operations pressure. The data volume is now 18PB, with more than 90,000 daily tasks, averaging 3-4 tasks per minute. The second is the metadata management platform. The tables are abstracted into layers, such as analytical data and basic detail data, and a Baidu-style search box lets users find the data they need by searching. In this way business staff can use our data very conveniently. It provides a data map (showing where data comes from and how tables relate to each other), data warehouse visualization, and management and operation of the data, so data assets are well managed and maintained and data development becomes simple and convenient.

The third platform is the scheduling system: data needs to flow between the warehouse layers, and when data problems occur the data must be recoverable. Its main work includes: (1) data-flow scheduling, which can be configured very easily; (2) dependency triggering, which makes full use of resources and keeps the scheduled tasks tightly packed so that data is produced as early as possible (a minimal sketch of this idea follows this list); (3) integrating multiple data sources into the warehouse: how do we import SQL Server data, Oracle data, and so on? The system can connect to multiple data sources, so finance, operations, and business staff can bring their data into the warehouse themselves and then analyze and schedule it; (4) dependency visualization: for example, with 100 related tasks, say 50 in the bottom STD layer and 20 in the middle layer, a problem in the middle ODS layer affects every task that depends on it, and the visualization makes it easy to locate.
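Here is a minimal sketch of the dependency-trigger idea in point (2), assuming a task runs as soon as all of its upstream tasks have finished. The task names and the DAG are invented for illustration.

```python
from collections import defaultdict

# Invented DAG: task -> upstream dependencies.
DEPS = {
    "ods_orders": [],
    "ods_users": [],
    "dwd_order_detail": ["ods_orders", "ods_users"],
    "ads_daily_report": ["dwd_order_detail"],
}

def run(task):
    print(f"running {task}")

def schedule(deps):
    """Trigger each task as soon as every upstream task has completed."""
    remaining = {t: set(d) for t, d in deps.items()}
    downstream = defaultdict(list)
    for t, ds in deps.items():
        for d in ds:
            downstream[d].append(t)
    ready = [t for t, d in remaining.items() if not d]
    while ready:
        task = ready.pop()
        run(task)
        for child in downstream[task]:
            remaining[child].discard(task)
            if not remaining[child]:
                ready.append(child)

schedule(DEPS)
```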

Besides the first three platforms, we need a platform that presents our data so that its value is visible to users. Our indicator (metrics) platform supports roll-up and drill-down, multidimensional analysis, and self-service report configuration, and it unifies the company's metrics. Take the Lianjia scenario of a performance metric, for example a commission earned for selling ten houses in a week: in 2016 we found the same metric was being calculated in several different ways, so we unified the definitions through the indicator system. All metrics now come from there, and users can build their own visualizations. Finance staff, district managers, or store managers can configure their own data from the indicator platform and build their own dashboards. The back end of the indicator system is supported by a multidimensional analysis engine, Kylin, which is introduced below.

On the indicator platform's architecture: a visualization application of this kind needs support from underlying capabilities, and the theme of this talk is exactly the data engine. Lianjia uses the open-source engine Kylin, which, driven by cluster scheduling, precomputes data from the warehouse and writes it into HBase. This lets the indicator system answer sub-second queries over billions of rows; detail queries are not supported, precisely because the results are precomputed. We also introduced Baidu's open-source Palo and, after optimization, this framework supports the upper-layer dashboards, the indicator platform, and the permission system. The metrics platform is used by operations, marketing, and management; it provides multidimensional analysis, a SQL query interface, support for very large data sets, the ability to publish data, and data visualization.

We are demand-driven and face a stream of requirements every day, and much of a data developer's time goes into simply pulling out the data that is needed. We use the AdHoc platform to extract data from the warehouse, and on top of it we built intelligent engine selection. AdHoc can sit on many engines, such as Presto, Hive, and Spark, but users do not know which engine to pick; their only goal is to get the data they need. So we developed intelligent engine selection, permission control, support for multiple interfaces, and self-service querying, which basically covers routine data-extraction work. We developed a QueryEngine over Presto, Spark SQL, Hive, and so on, which plays to each engine's strengths: Presto queries are fast but its SQL support is weaker, Spark SQL is similar, and Hive is slow but stable, so it handles some special SQL. QueryEngine chooses the engine intelligently: the user submits SQL and QueryEngine decides which engine fits. Briefly, it parses the SQL into the functions used, the tables involved, and the fields to be returned, and matches them against each engine's capabilities. Billing is still under development because resources are limited. QueryEngine also speaks the MySQL protocol: some users need BI capabilities and want to aggregate the returned data, and since we cannot build every kind of BI ourselves, exposing the data over the MySQL protocol lets users plug in whatever BI tool they prefer.
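The routing logic described here can be sketched roughly as follows. The capability hints and the rule of thumb (Presto for simple interactive SQL, Hive for constructs Presto lacks) are simplified assumptions for illustration, not the actual Lianjia implementation, which parses the SQL properly.

```python
import re

# Simplified, assumed capability hints per engine; the real QueryEngine
# parses the SQL and checks functions, tables, and result fields.
HIVE_ONLY_HINTS = ("transform(", "lateral view", "distribute by")

def choose_engine(sql: str) -> str:
    s = sql.lower()
    if any(h in s for h in HIVE_ONLY_HINTS):
        return "hive"          # stable, broadest SQL support, but slow
    if re.search(r"\bjoin\b", s) and "full outer join" in s:
        return "sparksql"      # placeholder rule for heavier analytical SQL
    return "presto"            # fast interactive queries by default

print(choose_engine("SELECT city, count(*) FROM dwd.orders GROUP BY city"))
print(choose_engine("SELECT TRANSFORM(line) USING 'python parse.py' FROM logs"))
```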

Going from architecture 1.0 to 2.0, many platforms were derived. The overall architecture is in place, but some problems still have to be solved. Here we share two cases: one is OLAP engine selection and its results, the other is why and how we did transparent compression. A ROLAP engine is basically built on a relational database and the relational model, aggregating in real time at query time, typically via a traditional database or Spark SQL and Presto, both of which compute over the data in real time. MOLAP is based on a predefined model: aggregations are computed in advance and the summarized results are stored. Druid is mainly for real-time ingestion (which Kylin does not support), consuming Kafka data in real time, computing over it with Spark SQL, and then loading the results, which supports second-level queries. Another popular approach is HOLAP, a hybrid of multiple engines that routes different scenarios to different engines.

When ROLAP answers a query, the data is scanned first and then aggregated; the partial results from multiple nodes are merged on a single node and returned. The advantage is that any SQL query is supported, because everything is computed on the fly from the detail data, there is no data redundancy, and consistency is very good. The disadvantage is that large or complex queries return slowly: since everything is based on detail data, the rows have to be computed one by one no matter how much you optimize, so there is a bottleneck, and concurrency is very poor.
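To make the scan-then-aggregate path concrete, here is a toy illustration of how partial aggregates from several nodes are merged on one node. The data and node layout are invented.

```python
# Toy illustration of ROLAP-style scatter/gather aggregation.
# Each "node" holds a shard of detail rows; every query re-scans them.
node_shards = [
    [("beijing", 1), ("beijing", 1), ("shanghai", 1)],   # node 1
    [("shanghai", 1), ("beijing", 1)],                    # node 2
]

def partial_count(rows):
    out = {}
    for city, n in rows:
        out[city] = out.get(city, 0) + n
    return out

def merge(partials):
    total = {}
    for p in partials:
        for city, n in p.items():
            total[city] = total.get(city, 0) + n
    return total

print(merge(partial_count(shard) for shard in node_shards))
# {'beijing': 3, 'shanghai': 2}
```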

MOLAP has a cube in the middle: data in the warehouse is precomputed and stored into the cube. With pre-aggregated storage, only a small amount of computation is needed at query time, because the results have already been precomputed. The advantages are support for large data sets, fast responses, and high concurrency; the disadvantages are no detail queries and the need to define dimensions and metrics in advance. The suitable scenarios are predictable query patterns, high-concurrency requirements, and fixed, well-defined use cases.
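By contrast, a MOLAP cube pre-aggregates every combination of the declared dimensions once, so a query on a known dimension combination becomes a lookup rather than a scan. A toy sketch with invented dimensions and rows:

```python
from itertools import combinations

# Toy detail rows: (city, house_type, deals)
rows = [("beijing", "2br", 1), ("beijing", "3br", 1), ("shanghai", "2br", 1)]
dimensions = ("city", "house_type")

# Precompute an aggregate for every subset of dimensions (the cuboids of a cube).
cube = {}
for k in range(len(dimensions) + 1):
    for dims in combinations(range(len(dimensions)), k):
        for row in rows:
            key = (dims, tuple(row[i] for i in dims))
            cube[key] = cube.get(key, 0) + row[-1]

# A query on a predefined dimension combination is a single lookup, not a scan.
print(cube[((0,), ("beijing",))])   # deals grouped by city = beijing -> 2
print(cube[((), ())])               # grand total -> 3
```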

For the technology selection there were many open-source options. We chose Kylin because our requirements were high concurrency, sub-second queries over tens of billions of rows, mainly offline workloads, a certain amount of flexibility, and preferably a SQL interface, and Kylin meets these requirements. Apache Kylin™ is an open-source distributed analysis engine that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop for extremely large datasets; it was originally developed at eBay and contributed to the open-source community, and it can query huge Hive tables in sub-seconds. The approach is to define dimensions and metrics in advance, compute the cube, store it in HBase, and at query time parse the SQL and route it to HBase to fetch the results.
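As a hedged example of what querying Kylin looks like in practice (host, credentials, project, and table names are placeholders), Kylin's standard REST query endpoint accepts SQL and answers it from the precomputed cube:

```python
import requests

# Placeholders: host, credentials, project, and table names are not Lianjia's.
KYLIN = "http://kylin-host:7070/kylin/api/query"

payload = {
    "sql": "SELECT city, COUNT(*) AS deals FROM dwd_orders GROUP BY city",
    "project": "demo_project",
}

# Kylin's REST API uses HTTP basic auth; the SQL is answered from the cube in HBase.
resp = requests.post(KYLIN, json=payload, auth=("ADMIN", "KYLIN"), timeout=30)
resp.raise_for_status()
print(resp.json().get("results"))
```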

Now for Lianjia's OLAP architecture. The data warehouse does computation and preprocessing on one side, and there is an HBase cluster dedicated to Kylin. Since Kylin needs precomputation, there is a build cluster that writes data into the Kylin HBase cluster; Nginx provides load balancing in front of a query cluster that serves online queries; and a Kylin middleware handles queries, cube build execution, data management, and statistics. Most indicator-platform queries go to Kylin, but Kylin cannot serve detail queries, so those are matched by QueryEngine and routed to the Spark or Presto cluster, backed by Alluxio with compression, and the detail results are returned to the indicator platform and finally to the other business products. Horizontally there is also permission management, monitoring and alerting, metadata management, and the scheduling system, which support the platform as a whole.

Next, the capability extensions Lianjia made to Kylin; our Kylin is basically the same as the open-source one, with small differences. The main work was as follows. Distributed build: cubes grow quickly and a single build cluster could not keep up, so the build was made distributed, which lets 500 cubes finish within the required time window. Optimized dictionary downloading at build time: a Kylin build downloads all the metadata dictionaries, which takes several minutes from Hadoop on every build, so this repeated download was optimized. Optimized the global dictionary lock: a build used to lock the entire build cluster and release the lock only on completion, but reading the source showed that a global lock is unnecessary and only the required fields need locking, so the lock was reduced to field level (a small sketch of the idea follows). Kylin's query machines use the G1 garbage collector. We also developed a middleware that can hold an effectively unbounded queue of tasks, pre-schedule specific cubes, manage permissions, and control task concurrency. The architecture has an external scheduling system, and all queries and builds go through the Kylin middleware. On top of that we added task queues, statistics, priority scheduling, monitoring and alerting, cube splitting, and visual configuration and display.
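The global-dictionary-lock change can be pictured with a small sketch: instead of one lock over the whole build cluster, keep one lock per dictionary field so that unrelated builds do not block each other. This is an illustration of the idea only, not Kylin's actual code.

```python
import threading
from collections import defaultdict

# One lock per dictionary column instead of a single global build lock.
_field_locks = defaultdict(threading.Lock)
_registry_lock = threading.Lock()

def lock_for(field: str) -> threading.Lock:
    # Guard the registry so concurrent builds get the same lock object per field.
    with _registry_lock:
        return _field_locks[field]

def build_dictionary(field: str):
    with lock_for(field):
        # Builds touching different fields proceed in parallel;
        # only builds touching the same field serialize here.
        print(f"building dictionary for {field}")

threads = [threading.Thread(target=build_dictionary, args=(f,))
           for f in ("city", "house_type", "city")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```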

From 0 to 1.0 we ran into another problem: the cluster stores all of Lianjia's data, the volume is large and grows quickly (0 to 1PB in two years, 1PB to 16PB in less than a year), so cost became an issue, and a large share of the data was expected to be cold. For these problems we proposed the transparent compression project. Hierarchical storage is an HDFS feature that stores data on different tiers, for example some data on SSDs and some on disks: the Hot policy keeps all replicas on disk, while the Warm policy keeps some replicas on disk and some on Archive storage (cheap, low-RPM drives). The second piece is the ZFS file system, which offers storage pools, self-healing, compression with variable block sizes, copy-on-write, checksums and snapshots, ARC (the in-memory adaptive replacement cache), and L2ARC (an SSD used as a second-level cache).
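For reference, the HDFS heterogeneous-storage policies mentioned above are applied per directory. A hedged sketch (the path is made up) of marking a warehouse directory as Warm with the standard `hdfs storagepolicies` command:

```python
import subprocess

# Made-up path; WARM keeps one replica on DISK and the rest on ARCHIVE storage.
subprocess.run([
    "hdfs", "storagepolicies",
    "-setStoragePolicy", "-path", "/user/hive/warehouse/ods.db/old_logs",
    "-policy", "WARM",
], check=True)

# Verify which policy is now attached to the directory.
subprocess.run([
    "hdfs", "storagepolicies",
    "-getStoragePolicy", "-path", "/user/hive/warehouse/ods.db/old_logs",
], check=True)
```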

The design of transparent compression is as follows: (1) define what counts as cold data, the main content of cold-data isolation, since part of the data is to be stored on the ZFS file system with transparent compression to reduce cost, so the cold data has to be defined first; (2) generate a concrete cold-data list from the collected statistics and mark its cold-data rate; then periodically take entries from the cold-data table and migrate them: defined cold data is moved to the compressed ZFS volumes by HDFS directory, and data that no longer qualifies is moved back to ext4. As a result, some of the data lives on ZFS and some on ext4 (a sketch of the migration step follows).
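Here is a minimal sketch of the migration step under stated assumptions: the cold paths come from an already-generated cold-data list, the ZFS-backed DataNode volumes are registered as ARCHIVE storage and the ext4 volumes as DISK, and migration is done by switching the HDFS storage policy and letting the HDFS mover relocate the blocks. Paths are invented.

```python
import subprocess

# Invented cold-data list; in the real system this comes from the cold-data table.
COLD_PATHS = [
    "/user/hive/warehouse/ods.db/logs_2016",
    "/user/hive/warehouse/ods.db/clickstream_2015",
]

def migrate_to_cold_tier(path: str) -> None:
    # Assumption: the ZFS-compressed volumes are exposed as ARCHIVE storage,
    # so data under the COLD policy ends up on the transparently compressed tier.
    subprocess.run(["hdfs", "storagepolicies", "-setStoragePolicy",
                    "-path", path, "-policy", "COLD"], check=True)
    # The HDFS mover relocates existing blocks to match the new policy.
    subprocess.run(["hdfs", "mover", "-p", path], check=True)

for p in COLD_PATHS:
    migrate_to_cold_tier(p)
```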

The optimization work for transparent compression has two parts. The first is hot/cold data separation in Hadoop, which involves choosing the heterogeneous storage policy and optimizing how HDFS moves hot and cold data. The second is ZFS file system optimization. ZFS supports many compression algorithms, and testing showed that gzip compresses most efficiently; comparing the algorithms also shows that the better the compression, the higher the CPU usage. A massive data cluster is not only storage but also computation, and the time DataNodes take to load compressed data directly affects how efficiently that data can be accessed. According to the test results, ZFS with gzip has an advantage over LZ4 in DataNode data loading and is close to ext4. Considering compression ratio, read and write speed, and DataNode load speed, gzip was chosen as the compression algorithm for the ZFS file system.
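For completeness, a hedged sketch of enabling gzip compression on a ZFS dataset that backs a DataNode directory and checking the achieved ratio; the pool and dataset names are made up.

```python
import subprocess

DATASET = "tank/hdfs-archive"   # made-up pool/dataset name

# Enable gzip compression on the dataset ("gzip" defaults to gzip level 6).
subprocess.run(["zfs", "set", "compression=gzip", DATASET], check=True)

# Inspect the setting and the compression ratio ZFS reports for the dataset.
subprocess.run(["zfs", "get", "compression,compressratio", DATASET], check=True)
```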

Before transparent compression, data was growing very fast, at close to 30%; logical data was about 3PB, which with three replicas comes to 9.3PB of nominal space, while the actual space used was 7PB, for an estimated cost saving of roughly 3 million. After compression, although the logical volume keeps growing, the space actually used grows only slowly.

Looking ahead: transparent compression is CPU-intensive, so we hope to offload the compression work, using QAT cards to compress all data transparently and save cost. Second, we want to combine erasure coding with transparent compression; with EC the effective replication can drop to about 2x or even 1.5x. Third, intelligent data recall: accessing compressed data still hurts performance, so hot data should be placed on hot storage devices and SSDs for acceleration. Fourth, integrate high-density storage devices for cold-data storage.

Finally, a few conclusions:

(1) Do solid requirements analysis and technology selection in the early stage; do not blindly follow articles found online;

(2) Think about how to keep the technology stable while still iterating in the face of changing business requirements;

(3) Put monitoring first: collect operational data across the whole pipeline and monitor it;

(4) Anything running online requires continuous optimization.

– END

This article is published by the DataFun community (official account ID: DataFunTalk).