The AIOps community was initiated by Cloud Intelligence. For operations and maintenance (O&M) business scenarios, it provides an overall service system of algorithms, computing power, and datasets, together with a community for exchanging intelligent-O&M solutions. The community is committed to spreading AIOps technology; it aims to solve the technical problems of the intelligent-O&M industry together with customers, users, researchers, and developers across industries, to promote the adoption of AIOps technology in enterprises, and to build a healthy, win-win AIOps developer ecosystem.

Evolution of big data architecture

For most people, the concepts and terminology involved in big data architecture are numerous and complex. Turning these scattered terms into an organized picture that can be viewed in both cross-section and longitudinal section is a problem we must think through. This chapter walks through the evolution of big data architecture in detail by sorting out the core big data architecture types and explaining the basic selection tools at each stage.

Introduction to Basic Knowledge

MPP architecture & Distributed architecture

  • MPP architecture

Massively Parallel Processing (MPP) refers to distributing tasks in parallel across multiple SMP nodes. After each node completes its computation, the partial results are aggregated together to obtain the final result.
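
To make this scatter/gather idea concrete, here is a minimal, self-contained Scala sketch in which plain futures stand in for SMP nodes; the four-way split and all names are illustrative, not any particular MPP product’s API:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object MppScatterGather {
  def main(args: Array[String]): Unit = {
    val data = (1 to 1000000).toVector
    // Scatter: split the input across four hypothetical "nodes".
    val partitions = data.grouped(data.size / 4 + 1).toSeq
    // Each node computes its partial sum in parallel.
    val partials = partitions.map(p => Future(p.map(_.toLong).sum))
    // Gather: merge the partial results into the final answer.
    val total = Await.result(Future.sequence(partials), 1.minute).sum
    println(s"total = $total") // 500000500000
  }
}
```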

Because MPP is widely used in the database field, it places high demands on transactional consistency. In general, MPP prioritizes consistency > reliability > fault tolerance: even in extreme cases, consistency must be guaranteed whenever possible, otherwise the system loses its essential positioning as a database.

  • Distributed architecture

Distributed architecture (Hadoop architecture/batch architecture) means that each node in the cluster is autonomous, i.e., it runs local applications independently. The MPP architecture cannot achieve node autonomy; it can only provide services as a whole.

Distributed architecture pays more attention to “divide and conquer” and keeps the load balanced across nodes. In terms of overall priority, therefore, distributed architecture favors fault tolerance > reliability > consistency.

In general, the “distributed architecture” we are used to referring to means the Hadoop family, while “MPP” usually refers to clustered databases.

OLAP for data warehouse and OLTP for transaction database

Online Analytical Processing (OLAP) is applied in the data warehouse field; it supports complex analytical queries and focuses on providing decision support (DSS) for the business. Online Transaction Processing (OLTP) is used in online transactional business systems to support frequent online operations (insert, delete, update) and transactional semantics.

Generally speaking, OLAP focuses on analytical computation, BI analysis, and intelligent decision-making, whereas OLTP cares more about transactional consistency, such as the insert, delete, and update operations of online interactive systems.

How distributed architecture “layers down”

Distributed architecture is divided into a distributed message queue layer, a distributed computing engine layer, a distributed storage layer, a distributed SQL engine layer, and a distributed configuration management layer. The data-architecture discussion in this chapter focuses on distributed computing, distributed storage, and the distributed SQL engine. Next, we start with distributed storage.

Hadoop Ecosystem

Since both the distributed architecture and the batch processing described above are based on the Hadoop ecosystem, the layered logic above maps directly onto it: HDFS corresponds to distributed storage, MapReduce to distributed computing, and Hive to distributed SQL.
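
The map/shuffle/reduce model at the heart of MapReduce can be seen in miniature. The following toy Scala sketch runs the three phases over in-memory data; it is a conceptual illustration, not the Hadoop MapReduce API:

```scala
object WordCountModel {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data on hadoop", "hadoop stores big data")
    // Map phase: each line is split into (word, 1) pairs.
    val mapped = lines.flatMap(_.split("\\s+")).map(word => (word, 1))
    // Shuffle + reduce phase: pairs are grouped by word and the counts summed.
    val counts = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    counts.foreach(println) // e.g. (hadoop,2), (big,2), ...
  }
}
```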

As a pioneering technology, Hadoop is widely used across the industry. Its first pieces, HDFS and MapReduce, date back to 2004, and Hadoop 1.0.0 was released in 2011. During this time, HDFS and MapReduce delivered capabilities previously unavailable to MPP-based databases; for example, processing capacity that scales with the number of nodes: more nodes can process more data. From 2011 to 2016, the version span ran from 1.0.1 to 2.7.0. In just five years, Hadoop moved through nearly three major version lines, which can fairly be called Hadoop’s explosive period. From 2017 to 2021, Hadoop spanned only minor releases within the 3.x line, from 3.0.0 to 3.3.1. This release cadence shows a Hadoop ecosystem that is steadily maturing.

Comparison of MPPDB features with Hadoop and traditional data warehouse

Here’s a horizontal comparison of MPPDB with Hadoop and traditional data warehouses. MPP corresponds to a database, while Hadoop corresponds to a distributed cluster, and the two have much in common. On the storage side, MPP sits roughly in the middle on attributes such as O&M complexity, scalability, and O&M cost, while Hadoop holds a comparative advantage in overall capability. However, Hadoop demands deep technical skill, which is a real challenge for start-ups or enterprises without sufficient accumulated expertise.

MPP mainly realizes two things: eliminating shared resources and supporting parallel computing. Combining MPP-style parallelism with HDFS distributed storage is, in principle, what Hive’s distributed SQL does in the Hadoop ecosystem: it delivers the distributed-architecture capability of parallel processing on top of HDFS. A minimal sketch of this idea follows.
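
As a hedged illustration of SQL over distributed storage, here is a minimal Spark SQL sketch, with Spark playing the distributed-SQL role the text ascribes to Hive; the HDFS path, table, and columns are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object SqlOverHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sql-over-hdfs").getOrCreate()
    // Hypothetical HDFS path; each file split is scanned by a parallel task.
    val events = spark.read.parquet("hdfs:///warehouse/events")
    events.createOrReplaceTempView("events")
    // The SQL engine plans the GROUP BY as a distributed aggregation over the splits.
    spark.sql("SELECT host, COUNT(*) AS cnt FROM events GROUP BY host").show()
    spark.stop()
  }
}
```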

In general, MPP and Hadoop are both related and distinct. The key difference is that Hadoop has its own complete ecosystem.

Comparison of real-time computing selection characteristics

ClickHouse is an efficient columnar database management system built for OLAP scenarios. It implements a rich set of features, including ordered data storage, primary-key indexes, sparse indexes, data sharding, data partitioning, TTL, and replication. It is an analytical database.

Elasticsearch is a distributed, scalable, real-time search and data analytics engine. It makes it easy to search, analyze, and explore massive amounts of data. Thanks to Elasticsearch’s horizontal scalability, massive volumes of low-value-density data can be made far more valuable in production environments.

HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large-scale structured storage clusters can be built on inexpensive PC servers.

Druid is an open source big data system designed for OLAP query requirements. Druid provides low-latency data insertion and real-time data query capabilities.

Distributed SQL engine feature comparison

Spark (SQL on Hadoop) is a MapReduce-like general-purpose parallel framework written in Scala. It retains the advantages of MapReduce and is a fast, general-purpose computing engine designed for large-scale data processing.

Note: MapReduce is disk-oriented, while Spark is memory-oriented.
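
A small sketch of that difference, assuming a local Spark installation; the dataset and filter are illustrative. cache() pins the intermediate result in executor memory, so the second action reuses it instead of recomputing from storage:

```scala
import org.apache.spark.sql.SparkSession

object MemoryVsDisk {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cache-demo").master("local[*]").getOrCreate()
    // cache() keeps the intermediate dataset in executor memory.
    val nums = spark.range(0, 10000000L).cache()
    println(nums.count())                      // first action materializes the cache
    println(nums.filter("id % 2 = 0").count()) // second action reuses memory, not disk
    spark.stop()
  }
}
```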

History of big data architecture

  • Batch type architecture

Batch big data architecture, also known as offline big data architecture, is capable of processing big data, but its data timeliness is poor.

  • Streaming architecture

Compared with the batch big data architecture, the streaming big data architecture removes the ETL process, obtains the data stream through a data channel, and pushes processing results to data consumers via message queues. It abandons offline batch processing, but the data retention period is short; if calculations involve historical data or complex scenarios, implementation becomes very difficult.

  • Lambda architecture

The Lambda data architecture adds a real-time computation path on top of the batch big data architecture, and the data serving layer combines the offline and real-time results. When a metric is computed by stream processing, batch processing still recomputes it, and the batch result is authoritative: each batch run overwrites the corresponding stream-processing result. A toy sketch of this merge rule follows.
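
Here is a toy Scala sketch of that merge rule; the metric names and numbers are invented, and a real serving layer merges whole precomputed views rather than two small maps:

```scala
object LambdaServingLayer {
  // The serving layer combines per-metric views from the two layers. The batch
  // value wins where both exist; speed-layer values fill the not-yet-batched gap.
  def merge(batchView: Map[String, Long], speedView: Map[String, Long]): Map[String, Long] =
    speedView ++ batchView

  def main(args: Array[String]): Unit = {
    val batch = Map("pv" -> 1000L)              // recomputed by each batch run
    val speed = Map("pv" -> 1042L, "uv" -> 87L) // incremental, possibly approximate
    println(merge(batch, speed))                // Map(pv -> 1000, uv -> 87)
  }
}
```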

  • Kappa architecture

The Kappa data architecture optimizes the Lambda architecture by deleting the batch layer and replacing the data channel with a message queue. Reprocessing historical data as a stream has lower throughput than batch processing and requires additional computing resources to compensate.
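
One common way to realize Kappa-style reprocessing is to replay the message queue from the earliest offset under a fresh consumer group, so historical and new records flow through the same streaming code. A hedged sketch with the Kafka Java client from Scala follows; the broker address, topic name, and group id are assumptions:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object KappaReplay {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("group.id", "reprocess-v2")            // fresh group => full replay
    props.put("auto.offset.reset", "earliest")       // start from the head of the log
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events")) // assumed topic name
    // Historical and fresh records go through the same streaming code path.
    while (true) {
      consumer.poll(Duration.ofMillis(500)).forEach(r => println(s"${r.key} -> ${r.value}"))
    }
  }
}
```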

  • Real-time OLAP architecture

The real-time OLAP variant architecture is a further evolution of the Kappa data architecture: it offloads aggregation work from real-time computation onto the OLAP engine.

  • Advantages: a high degree of freedom; it meets data analysts’ real-time self-service analysis needs and reduces the processing pressure on the computing engine.
  • Disadvantages: the message queue bounds how much data can be retained, and the computational burden shifts to the query layer.

Data architecture feature comparison

The figure below shows how the Lambda, Kappa, and real-time OLAP variant architectures evolved from one another.

How to understand lake-warehouse integration

Evolution history of the lake-warehouse integrated architecture

In lake-warehouse integration, the data warehouse came first. A warehouse layers its data and requires cleaning before loading, by which point the data has lost its original raw form. A data lake, by contrast, stores data first and loads and transforms it later according to business needs. Its advantage is that when requirements change, every piece of data is still preserved at the bottom layer, and this is an important part of the data lake’s value.

Lake-warehouse integration is the combination of the data lake and the data warehouse; the data lake improves the storage layer’s ability to hold diverse data.

  • ETL (Extract-Transform-Load): clean first (at the cost of memory), then load into storage.
  • ELT (Extract-Load-Transform): load into storage (staging tables) first, then clean; a Spark sketch contrasting the two patterns follows.
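
The sketch below is hedged: all paths and column names are hypothetical, and it assumes a Spark environment with HDFS access:

```scala
import org.apache.spark.sql.SparkSession

object EtlVsElt {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("etl-vs-elt").getOrCreate()
    val raw = spark.read.json("hdfs:///landing/orders") // hypothetical source path

    // ETL: clean first (in memory), then load only the conformed result.
    raw.filter("amount IS NOT NULL")
      .dropDuplicates("order_id")
      .write.mode("overwrite").parquet("hdfs:///dw/orders")

    // ELT: load the raw data untouched; transform later, on demand.
    raw.write.mode("append").parquet("hdfs:///lake/orders_raw")
    spark.stop()
  }
}
```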

Complementary relationship between data lake and data warehouse

The data lake itself supports multiple computing engines and the separation of storage and compute, which keeps the data complete at rest; computation is decoupled from storage, so data can be loaded according to computing needs. The data warehouse remains subject-oriented, integrated, relatively stable, and time-variant.

Technical tool selection strategy

The following factors need to be considered in a lake-warehouse integrated design:

  • Whether business requirements are met: technology selection should not chase “big and complete”; match candidates against business requirements and choose the most appropriate functional coverage;

  • Maturity/popularity: judge by open-source community activity, for example the number of GitHub stars;

  • Technology-stack adoption cost: weigh architectural complexity against existing development experience to control the cost of use;

  • Technology-stack consistency: consider the consistency and relatedness of the company’s technology stack, i.e., the maintainability of the code;

  • Industry use cases: reuse other vendors’ experience and the lessons they have already learned.

Below is a list of maturity/popularity data of each technical tool:

Comparison of data lake tool selection characteristics

Hudi (Hadoop Upserts anD Incrementals) is based on Spark 2.x and manages large analytical datasets stored on HDFS. It supports update, insert, and delete operations on Hadoop, and offers read-optimized tables and near-real-time tables.
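
A hedged sketch of a Hudi upsert through the Spark datasource; it assumes the hudi-spark bundle is on the classpath, and the paths and field names are invented:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()
    val updates = spark.read.json("hdfs:///landing/users") // hypothetical input
    updates.write.format("hudi")
      .option("hoodie.table.name", "users")
      .option("hoodie.datasource.write.recordkey.field", "user_id") // dedup key
      .option("hoodie.datasource.write.precombine.field", "ts")     // latest record wins
      .option("hoodie.datasource.write.operation", "upsert")
      .mode(SaveMode.Append)
      .save("hdfs:///lake/users")
    spark.stop()
  }
}
```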

Iceberg is an open table format for massive data analysis. A table format, true to its name, is a way of organizing metadata and data files: it sits below the computing frameworks (Flink, Spark…) and above the data files.

Delta Lake is a storage layer that brings ACID transaction capabilities to Apache Spark and big data workloads. Using optimistic concurrency control between writers and snapshot isolation for readers, it provides consistent reads during data writes, bringing reliability to data lakes built on HDFS and cloud storage.
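
A minimal Delta Lake sketch, assuming the delta-spark package is available; the paths are hypothetical. Each write is an atomic, versioned commit in the table’s transaction log:

```scala
import org.apache.spark.sql.SparkSession

object DeltaAcidSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("delta-sketch").getOrCreate()
    val df = spark.read.parquet("hdfs:///staging/events") // hypothetical input
    // The append below is an atomic, versioned commit to the Delta log.
    df.write.format("delta").mode("append").save("hdfs:///lake/events")
    // Readers see a consistent snapshot even while writers are appending.
    println(spark.read.format("delta").load("hdfs:///lake/events").count())
    spark.stop()
  }
}
```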

Lake warehouse integrated blueprint scheme

The Hadoop Distributed File System (HDFS) stores both structured and unstructured data. Hive supports full retention of historical and real-time data, as well as data tracing.

Reasons for pre-research on the lake-warehouse integration scheme

  • Hudi integration capability: Hudi has strong Upsert support, and its framework supports integration with Flink for incremental processing.

  • Accumulated Flink expertise: Cloud Intelligence maintains part of its own Flink engine development, which already supports real-time computation for its data products.

  • Responding to unexpected requirements: unplanned business queries can be served with Presto ad hoc queries.

  • Near-term goals: support landing real-time data into Hive at minute-level latency, with Upsert support.

To learn more

Cloud Intelligence’s Operation Management Platform (OMP) is an open-source, lightweight, converged, and intelligent operations (O&M) management platform. It provides functions such as management, deployment, monitoring, inspection, self-healing, backup, and recovery, giving users convenient O&M capabilities and unified service management. Besides improving the efficiency of O&M staff, it greatly improves business continuity and security.

Click the links below to give OMP a Star and learn more:

GitHub address: Github.com/CloudWise-O…

Gitee address: Gitee.com/CloudWise/O…