Decoding the new features of Huawei Cloud FusionInsight MRS: one architecture, three lakes


Abstract: Chen Xiang, product manager of Huawei Cloud FusionInsight MRS, delivered the keynote speech “Huawei Cloud FusionInsight MRS: One Architecture to Achieve Three Data Lakes” at Huawei Cloud TechWave Cloud Native 2.0 Special Day. He shared the development trends of data lakes in the era of data intelligence, the technical innovations that let the MRS cloud-native data lake build offline, real-time, and logical data lakes on a single architecture, and successful cases from business practice.

This article is shared from the Huawei Cloud community post “FusionInsight MRS Cloud-Native Data Lake: One Architecture, Three Lakes, and New Features of Huawei Cloud FusionInsight MRS Components”. The original article was written by IT Lao Mo.

On May 20, Chen Xiang, product manager of Huawei Cloud FusionInsight MRS, delivered the keynote speech “Huawei Cloud FusionInsight MRS: One Architecture to Achieve Three Data Lakes” at the “Huawei Cloud TechWave Cloud Native 2.0 Special Day”. The talk covered the development trends of data lakes in the era of data intelligence, the technical innovations that let the MRS cloud-native data lake build offline, real-time, and logical data lakes on a single architecture, and successful cases from business practice.

Entering the era of data intelligence: ten industry consensus points on data lake construction

After decades of rapid development, big data processing technology has become increasingly mature, and the technologies derived from data warehouses and data lakes are now as numerous as the stars. Through years of exploration, the industry has reached ten important points of consensus on the future form of the data lake, and lake-warehouse integration (the lakehouse) has become the preferred architecture for intelligent data lakes. To meet the new challenges that the era of data intelligence poses to big data technology, the Huawei FusionInsight MRS cloud-native data lake has been comprehensively upgraded: it introduces the popular components Hudi and ClickHouse and strengthens the self-developed HetuEngine data virtualization engine. It also adds IoTDB time-series processing capability, expanding the boundary of data-enabled applications.

Huawei Cloud FusionInsight MRS cloud-native data lake

The Huawei FusionInsight MRS cloud-native data lake provides government and enterprise customers with a cloud-native data lake solution that integrates lake and warehouse, and builds offline, real-time, and logical data lakes on an architecture that can evolve sustainably. It supports big data application scenarios such as real-time analysis of all enterprise data, offline analysis, real-time interactive query, retrieval, multimodal analysis, data warehousing, and data access and governance. The goal is to let enterprise customers use data efficiently and simply, achieve “one lake per enterprise, one lake per city”, gain more accurate business insight, and realize value faster.

**Offline data lake:** Provides multiple computing engines for interactive query, BI, and AI workloads, and uses OBS to separate storage from compute, making the cloud-native data lake architecture more flexible. A single cluster scales to more than 20,000 nodes, and cluster federation scales to more than 100,000 nodes. Rolling upgrades are supported, so key services are not interrupted during an upgrade.
**Real-time data lake:** Moves update processing from T+1 to T+0 by using Hudi for real-time, incremental, ACID-compliant data ingestion into the lake and ClickHouse for millisecond-level OLAP analysis.
**Logical data lake:** HetuEngine provides collaborative analysis across lakes, warehouses, and clouds. This integration of lake and warehouse reduces data migration by 80% and improves collaborative analysis efficiency by 50 times.

New features of the three-lake architecture, covering the whole data analysis process

Hudi: incremental, real-time ingestion into the lake, bringing better timeliness, easier development, higher performance, and higher resource utilization

Traditional data lakes do not support data updates, so data is processed in T+1 offline mode, which cannot meet the demands of fast-changing business. To address data timeliness, the Huawei FusionInsight MRS cloud-native data lake introduces Hudi.

Hudi supports data updates, data deletion, and ACID guarantees, which makes real-time update operations reliable. It also provides several views (a read-optimized view, an incremental view, and a real-time view) so that different analysis applications can each use the view that suits them. On top of these capabilities, storage models such as incremental tables, zipper tables, and mirror tables are easy to implement. Introducing Hudi has produced four significant effects:

1. Better data timeliness: Data from business systems can enter the lake through a CDC system with minute-level latency, moving data timeliness from T+1 to T+0.

2. Higher processing performance: Traditionally, Hive is used to delete or update data, and even when only one row changes, the entire table, or at least the entire partition, has to be rewritten. With Hudi, updates and deletes are applied at the record level, so only the files containing the affected records need to be processed.

3. Easier development: Traditional ingestion into the lake does not support updates or deletes, so developers have to create a temporary table, process the data, and then overwrite the original table, which can take a lot of code for a simple task. With Hudi, updating data is as simple as it is in a database: a single statement.

4. Higher resource utilization: In the traditional T+1 model, jobs do not run around the clock. Batches run at night so that reports are ready in the morning, which means the compute peak occurs only during the nightly batch window. Because resources are provisioned for that peak, they are underused during the day. After Hudi is introduced, data is collected into the lake in real time and the ingestion work is spread across the whole day, which flattens the peaks and valleys of overall resource consumption.
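The update, delete, and view behaviour described above can be illustrated with a minimal Python sketch. It is a toy in-memory model, not Hudi's API: the `KeyedTable` class, its method names, and the sample account records are all hypothetical and only show the idea of keyed upserts plus a snapshot view and an incremental view.

```python
from dataclasses import dataclass, field


@dataclass
class KeyedTable:
    """Toy keyed table with commit ids (illustrative only, not Hudi's API)."""
    rows: dict = field(default_factory=dict)  # key -> (commit id, row or None)
    commit_id: int = 0

    def upsert(self, records):
        """Insert or update records by key as one commit; return the commit id."""
        self.commit_id += 1
        for key, row in records.items():
            self.rows[key] = (self.commit_id, row)
        return self.commit_id

    def delete(self, keys):
        """Record a tombstone (None) for each key as one commit."""
        self.commit_id += 1
        for key in keys:
            self.rows[key] = (self.commit_id, None)
        return self.commit_id

    def snapshot(self):
        """Latest live row per key, similar in spirit to a real-time view."""
        return {k: row for k, (_, row) in self.rows.items() if row is not None}

    def incremental(self, since_commit):
        """Keys changed after `since_commit`; deleted keys map to None."""
        return {k: row for k, (c, row) in self.rows.items() if c > since_commit}


table = KeyedTable()
c1 = table.upsert({"acct-1": {"balance": 100}, "acct-2": {"balance": 50}})
c2 = table.upsert({"acct-1": {"balance": 80}})  # single-record update
c3 = table.delete(["acct-2"])                   # single-record delete

snapshot = table.snapshot()
changes = table.incremental(c1)
print(snapshot)
print(changes)
```

Each change touches only the affected keys, and a downstream job can pull just what changed since its last commit instead of rescanning the table.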

A financial customer built a data lake based on Hudi. The latency of data entering the lake dropped to the minute level, daytime resource utilization improved by more than 2 times, and data processing efficiency increased by 50%. Developers can complete an update with a single statement, which lowers the difficulty of development.
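The resource-utilization argument can be made concrete with a small calculation. The numbers below are hypothetical and chosen only to show the mechanism (they are not the customer's figures): the same daily workload is either squeezed into a six-hour night window or spread over 24 hours, and capacity is provisioned for the peak hour in both cases.

```python
def average_utilization(hourly_load, capacity):
    """Mean fraction of provisioned capacity in use over the given hours."""
    return sum(hourly_load) / (capacity * len(hourly_load))


DAILY_WORK = 240  # hypothetical units of ingestion work per day

# T+1 batch: all work runs in a 6-hour night window, nothing during the day.
batch_load = [0] * 18 + [DAILY_WORK // 6] * 6
# Continuous ingestion: the same work spread evenly over 24 hours.
stream_load = [DAILY_WORK // 24] * 24

# Capacity is provisioned for the peak hour in both cases.
batch_capacity = max(batch_load)
stream_capacity = max(stream_load)

batch_util = average_utilization(batch_load, batch_capacity)
stream_util = average_utilization(stream_load, stream_capacity)
print(batch_capacity, batch_util)
print(stream_capacity, stream_util)
```

In this idealized case the peak capacity needed falls to a quarter and average utilization rises accordingly. A real cluster also serves queries and other jobs, so the effect would be less extreme.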

ClickHouse: a real-time OLAP engine for fully self-service, cost-effective real-time report analysis

Because traditional OLAP engines have limited processing capacity, data is generally organized into subject-specific data marts before it is connected to BI tools. This creates a gap between BI users and the data engineers who provide the data. For example, when a BI user has a new requirement and the data it needs is not in a data mart, the requirement has to be sent to a data engineer, who develops the corresponding ETL task. This process usually requires cross-department coordination, takes a long time, and makes collaboration inefficient.

Now, the Huawei FusionInsight MRS cloud-native data lake can load all detailed data into ClickHouse in the form of wide tables. BI users can perform self-service analysis on these wide tables with little input from data engineers. Even for most new requirements, the data does not have to be provisioned again, so development efficiency and the rate at which BI reports go live improve greatly. ClickHouse can also analyze data in a single table at millisecond-level latency.
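The wide-table idea can be sketched in a few lines of Python. This is only an illustration of the principle, not ClickHouse itself: the `aggregate` helper, the column names, and the sample rows are hypothetical. Because every dimension is already present in each detail row, a new BI question becomes a new grouping over the same table rather than a new ETL job.

```python
from collections import defaultdict


def aggregate(rows, group_by, measure):
    """Sum `measure` over rows, grouped by the given dimension columns."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical denormalized detail rows: all dimensions live in one table.
wide_table = [
    {"order_id": 1, "region": "east", "product": "router", "channel": "online", "amount": 120.0},
    {"order_id": 2, "region": "east", "product": "switch", "channel": "retail", "amount": 80.0},
    {"order_id": 3, "region": "west", "product": "router", "channel": "online", "amount": 200.0},
    {"order_id": 4, "region": "west", "product": "router", "channel": "retail", "amount": 50.0},
]

# Two different self-service questions answered from the same table.
by_region = aggregate(wide_table, ["region"], "amount")
by_product_channel = aggregate(wide_table, ["product", "channel"], "amount")
print(by_region)
print(by_product_channel)
```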

Self-service BI based on ClickHouse has also achieved good results in Huawei's internal practice. The HIS data lake of Huawei Group was originally modeled on a traditional OLAP engine. Limited by development efficiency, only dozens of reports went live over several years. After ClickHouse was introduced, more than 400 reports were developed and launched in three months, a 50-fold increase in launch efficiency. At present, ClickHouse usage inside Huawei has reached more than 2,000 nodes and more than 10 PB of data, with data volume growing by 100 TB per day.

HetuEngine: a data virtualization engine that breaks through geographic constraints and breaks down data “walls”

As enterprises grow and undergo digital transformation, their business becomes more complex and the demand for innovation keeps rising. A single system working alone can hardly meet changing business needs, so an enterprise may have multiple lakes, multiple warehouses, and multiple systems. Under the traditional siloed style of construction, however, lakes, warehouses, and engines cannot connect to each other directly, and data has to be moved back and forth through ETL. This leads to long data flow links, multiple redundant copies of data, and data islands. With so many redundant copies, data consistency and reliability are difficult to guarantee.

To make data easier to use, facilitate cross-lake collaboration, and solve the problem of data fragmentation across lakes and warehouses, Huawei launched HetuEngine, a data virtualization engine. It provides collaborative analysis across lakes, across warehouses, on and off the cloud, and across multiple clouds, breaking through geographic limits and breaking down data “walls”. Cross-lake collaborative analysis becomes 50 times more efficient. Cross-warehouse collaborative analysis reduces data movement and synchronization between systems by 80%, and analysis performance improves from the minute level to the second level.
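The idea of querying data where it lives, rather than copying it through ETL first, can be shown with a small analogy in Python's standard `sqlite3` module. This is not HetuEngine and does not use its interfaces: two separate in-memory SQLite databases stand in for a lake and a warehouse, and the table names and rows are hypothetical.

```python
import sqlite3

# The main database stands in for the data lake.
conn = sqlite3.connect(":memory:")
# A second, separate database stands in for the data warehouse.
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 30.0), (2, 45.0), (1, 25.0)],
)

conn.execute("CREATE TABLE warehouse.customers (id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "east"), (2, "west")],
)

# One query joins both sources in place; no table is copied between them.
rows = conn.execute(
    """
    SELECT c.region, SUM(o.amount)
    FROM main.orders AS o
    JOIN warehouse.customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
conn.close()
print(rows)
```

A data virtualization engine applies the same principle across heterogeneous and geographically separate systems, which is where the savings in data movement come from.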

By introducing the HetuEngine data virtualization engine, one financial institution improved the concurrency of query analysis on its data lake. With only one-fifth of the resources, it supports 45 concurrent queries, peaks at 200 QPS, and has reduced average latency to 8 seconds. For collaborative analysis across lake and warehouse, HetuEngine breaks through the barrier between the data lake and the data warehouse, improves performance from the minute level to the second level, and reduces data movement and synchronization between systems by 80%, which greatly improves the efficiency of data management.

IoTDB: a time-series database for easily building time-series data marts across cloud and edge

Time-series data has two characteristics: it is processed on the device, at the edge, and in the cloud, and it does not need to be updated after it is collected. Traditional time-series solutions use different technology stacks on the device, at the edge, and in the cloud, and these heterogeneous stacks inevitably make data processing more complex. The time-series database IoTDB (also referred to as a time-series engine), developed by Tsinghua University, uses TsFile, a unified time-series file format, so that one data format is compatible with all scenarios. One engine connects cloud, edge, and device, and one framework integrates them. Huawei and Tsinghua University cooperate closely, and the most recently released IoTDB cluster version was developed under the joint lead of Huawei and Tsinghua.
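Both characteristics, append-only data and one format shared by every tier, can be illustrated with a minimal Python sketch. It does not reproduce TsFile or any IoTDB interface: the fixed-width record layout, the function names, and the sample readings are assumptions made for illustration.

```python
import struct
from bisect import bisect_left, bisect_right

# Assumed toy layout: int64 timestamp in ms plus float64 value, little-endian.
POINT = struct.Struct("<qd")


def encode(points):
    """Serialize (timestamp, value) pairs into one byte string."""
    return b"".join(POINT.pack(ts, value) for ts, value in points)


def decode(blob):
    """Read back every (timestamp, value) pair from a byte string."""
    return [POINT.unpack_from(blob, offset) for offset in range(0, len(blob), POINT.size)]


def query_range(points, start_ms, end_ms):
    """Points with start_ms <= timestamp <= end_ms; input must be time-ordered."""
    timestamps = [ts for ts, _ in points]
    return points[bisect_left(timestamps, start_ms):bisect_right(timestamps, end_ms)]


# The device writes points; the edge only appends; the cloud reads the same bytes.
device_blob = encode([(1000, 21.5), (1200, 21.7)])
edge_blob = device_blob + encode([(1400, 22.0)])
cloud_points = decode(edge_blob)
recent = query_range(cloud_points, 1200, 1400)
print(recent)
```

Because points are never updated after collection, new data can simply be appended, and because every tier reads the same layout, no conversion step is needed between device, edge, and cloud.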

In Shanghai, Chengdu, Chongqing, and other cities, IoTDB is used to manage subway monitoring data. Monitoring 144 trains originally required 9 servers, but now a single IoTDB instance meets the requirement. The sampling delay of measurement points has been reduced from 500 ms to 200 ms, with an increase of 414 billion managed data points, which greatly improves resource utilization.

Conclusion

At present, the Huawei FusionInsight MRS cloud-native data lake, together with more than 800 ecosystem partners, serves more than 3,000 government and enterprise customers and is widely used in utilities, finance, telecom carriers, energy, healthcare, manufacturing, transportation, and other industries.
