Introduction: With the development of Internet technology, the growth of data volumes, and the emergence of increasingly complex business scenarios, the traditional big data architecture can no longer keep up.

Author | Luoyu    Source | Alibaba Technology official account

I. Preface

Traditional big data technology originated from Google's "troika" of papers (GFS, MapReduce, and Bigtable) and their open-source derivatives: the distributed file system HDFS, the distributed computing engine MapReduce, and the distributed database HBase. Early big data technologies and requirements focused on large-scale data storage, data processing, and online query. At that stage, many companies chose to build their own server rooms and deploy Hadoop; big data work centered on offline computing and large-scale storage, typically reflected in T+1 reports and large-scale online data queries.

With the development of Internet technology, the growth of data volumes, and the emergence of complex business scenarios, the traditional big data architecture can no longer keep up. Its evolution in recent years is mainly reflected in the following aspects:

  1. Scale: Scale here refers both to how widely big data technology is used and to how fast data volumes grow. Broader adoption of big data technology brings increasingly complex requirements, and growing data volumes mean that traditional quasi-big-data technologies (such as MySQL) can no longer solve every problem. As a result, storage components are often split into different data layers, each optimized for a different dimension such as scale, cost, or query and analysis performance, to meet diverse needs.
  2. Real-time: The traditional T+1 offline big data technology cannot meet the near-real-time requirements of recommendation and monitoring, so the whole big data ecosystem and technical architecture have been greatly upgraded over the past decade. In terms of storage, traditional HDFS file storage and the Hive data warehouse cannot meet the need for low-cost, updatable storage, which is why Hudi and other data lake schemes have emerged. In terms of computing, traditional MapReduce batch processing cannot process data at the second level; Storm's early stream processing and Spark Streaming's micro-batch processing appeared in response. Today Flink, a real-time computing framework based on the Dataflow model, holds a dominant position in the field of real-time computing.
  3. Cloud native: Traditional companies often choose to build their own server rooms or purchase machine instances on the cloud, such as cloud hosts. However, this architecture has various problems, such as low utilization during off-peak periods, poor storage and computing elasticity caused by the coupling of storage and computing, and low flexibility for upgrades. The cloud native big data architecture is the so-called data lake. Its essence is to make full use of elastic resources on the cloud to realize a big data architecture with unified management, unified storage, and elastic computing, replacing the traditional big data architecture based on physical clusters and local disks. Its main technical characteristics are storage-compute separation and Serverless. In the cloud native big data architecture, every layer is evolving toward being offered as a service: storage as a service, computing as a service, metadata management as a service. Each component is expected to be broken down into independent units that can scale on their own and be more open, flexible, and elastic.

Based on the cloud native big data architecture, this article discusses in detail the architecture selection for dimension tables and result tables in real-time computing.

II. Real-time computing under the big data architecture

1 Real-time computing scenario

Big data has been developing rapidly for more than ten years, and it is evolving from an emphasis on computing scale toward real-time processing. The most common real-time computing scenarios are as follows:

  1. Real-time data warehouse: The real-time data warehouse is mainly used for PV/UV statistics, transaction statistics, commodity sales statistics, and other transactional data scenarios on websites. In this scenario, the real-time computing task subscribes to the real-time business data source, analyzes the information at the second level, and finally presents it on a business dashboard for decision makers, making it easy to judge the operating status of the enterprise and the progress of activities and promotions.
  2. Real-time recommendation: Real-time recommendation is mainly based on AI technology and personalizes content according to user preferences. It is common in short video, content feed, and e-commerce shopping scenarios. In these scenarios, a user's preferences can be judged in real time based on their historical clicks, so that targeted recommendations can be made to increase user stickiness.
  3. Data ETL: Real-time ETL scenarios are common in data synchronization tasks, for example synchronizing and transforming different tables within a database, synchronizing different databases, or pre-aggregating data, with the results finally written into a data warehouse or data lake for archiving. This scenario prepares data for subsequent in-depth business analysis.
  4. Real-time diagnostics: This is common in financial or transactional business scenarios. Given the particularities of these industries, anti-cheating supervision is needed to determine in real time, based on a user's behavior over a short period, whether the user is cheating, so that losses can be stopped promptly. This scenario has high timeliness requirements: real-time computing tasks are used to detect abnormal data and discover anomalies as they happen.

2 Flink SQL real-time computing

Real-time computing requires extremely powerful big data computing capability behind the scenes, which is where Apache Flink, an open-source real-time computing technology for big data, comes in. Traditional computing engines such as Hadoop and Spark are batch engines by nature: they process bounded data sets and cannot guarantee processing timeliness. Apache Flink was designed from the start as a streaming engine that can subscribe to streaming data in real time, analyze it and produce results continuously, and extract value from data as soon as it arrives.

Flink chose SQL, a declarative language, as its top-level API. This is convenient for users and also fits the trend of the cloud native big data architecture:

  1. Accessible big data, large-scale production: Flink SQL automatically optimizes the query statement to generate an optimal physical execution plan, hiding the complexity of big data computation and greatly lowering the barrier for users, so that big data becomes broadly accessible.
  2. Unified stream and batch processing: Flink SQL provides the same semantics and a unified development experience for both streaming and batch tasks, making it easy to migrate offline tasks to real-time tasks.
  3. Shielding differences in underlying storage: By providing SQL as a unified query language, Flink shields the differences between underlying data stores, which facilitates flexible switching among diverse big data stores and makes the big data architecture on the cloud more open and flexible.

The figure above shows some basic operations of Flink SQL. As you can see, the syntax is very similar to standard SQL: it includes basic SELECT and FILTER operations, built-in functions (such as date formatting), and user-defined functions once they have been registered.
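Since the figure may not be visible here, the following minimal sketch illustrates the kind of statement it shows; the table name (orders) and the user-defined function (my_udf) are hypothetical.

-- Hypothetical illustration of basic Flink SQL operations:
-- projection, a WHERE filter, a built-in date-formatting function,
-- and a user-defined function that has already been registered.
SELECT
  user_id,
  DATE_FORMAT(order_time, 'yyyy-MM-dd') AS order_date,  -- built-in function
  my_udf(item_id) AS item_tag                            -- registered UDF (hypothetical)
FROM orders
WHERE price > 100;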

Flink SQL divides the tables involved in real-time computing into source tables, result tables, and dimension tables. DDL statements (such as CREATE TABLE) register the various input and output data sources for these three kinds of tables, and DML statements (such as INSERT INTO) express the topology of the real-time computing task, so that an entire real-time computing job can be developed in SQL.

  1. Source table: mainly represents input from messaging systems such as Kafka or MQ (Message Queue), or CDC (Change Data Capture) input, such as converting the MySQL binlog into a live stream.
  2. Result table: mainly represents the target storage into which Flink writes each piece of processed data in real time, such as MySQL, HBase, and other databases.
  3. Dimension table: mainly represents a data source that stores dimensional information about the data. In real-time computing, the data collected at the source is often limited, so the required dimension information must be completed before analysis; the dimension table is the data source that stores this dimension information. Commonly used dimension table stores include MySQL, Redis, and so on.

The following figure shows a complete real-time computing example, in which the Flink SQL task computes the Gross Merchandise Volume (GMV) per minute for different commodity categories. In this task, Flink consumes user order data from the Kafka source table in real time, joins the Redis dimension table on the commodity ID to obtain the commodity category, computes the total transaction amount per category over 1-minute tumbling windows, and writes the final result to the RDS (Relational Database Service, such as MySQL) result table.

-- Source table: user order data (Kafka)
CREATE TEMPORARY TABLE user_action_source (
  `timestamp` BIGINT,
  `user_id`   BIGINT,
  `item_id`   BIGINT,
  `price`     DOUBLE,
  -- event-time attribute derived from the epoch-millisecond field,
  -- plus a processing-time attribute for the dimension table (lookup) join
  `ts` AS TO_TIMESTAMP_LTZ(`timestamp`, 3),
  `proctime` AS PROCTIME(),
  WATERMARK FOR `ts` AS `ts` - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = '<your_topic>',
  'properties.bootstrap.servers' = '<your_kafka_server>:9092',
  'properties.group.id' = '<your_consumer_group>',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

-- Dimension table: commodity details (Redis)
CREATE TEMPORARY TABLE item_detail_dim (
  id STRING,
  category STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'redis',
  'host' = '<your_redis_host>',
  'port' = '<your_redis_port>',
  'password' = '<your_redis_password>',
  'dbNum' = '<your_db_num>'
);

-- Result table: per-minute GMV per category (RDS/MySQL, legacy-style options)
CREATE TEMPORARY TABLE gmv_output (
  time_minute STRING,
  category STRING,
  gmv DOUBLE,
  PRIMARY KEY (time_minute, category) NOT ENFORCED
) WITH (
  'type' = 'rds',
  'url' = '<your_jdbc_mysql_url_with_database>',
  'tableName' = '<your_table>',
  'userName' = '<your_mysql_database_username>',
  'password' = '<your_mysql_database_password>'
);

-- Join the order stream with the commodity dimension table and aggregate
-- the transaction amount per category over 1-minute tumbling windows
INSERT INTO gmv_output
SELECT
  CAST(TUMBLE_START(s.ts, INTERVAL '1' MINUTE) AS STRING) AS time_minute,
  d.category,
  SUM(s.price) AS gmv
FROM user_action_source AS s
JOIN item_detail_dim FOR SYSTEM_TIME AS OF s.proctime AS d
  ON CAST(s.item_id AS STRING) = d.id
GROUP BY TUMBLE(s.ts, INTERVAL '1' MINUTE), d.category;

This is a very common real-time computing pipeline. In the following chapters, we analyze the key capabilities required of dimension tables and result tables in real-time computing, and discuss architecture selection for each.

III. Real-time computing dimension tables

1 Key requirements

In data warehouse construction, table relationships and structures are usually designed around the star schema or snowflake schema. Real-time computing is no exception, and a common requirement is to complete fields for the data stream. Because the data collected at the source is usually limited, the required dimension information must be completed before analysis. For example, the collected transaction log records only the commodity ID, but the business needs to aggregate by store dimension or industry dimension, so the transaction log must first be joined with the commodity dimension table to complete the missing information. The dimension table here is similar to the concept in a data warehouse: a collection of dimension attributes, such as the commodity dimension, user dimension, or location dimension.

As the data store that holds dimension information, a dimension table needs to cope with massive, low-latency access in real-time computing scenarios. Based on this positioning, we summarize several key requirements it places on structured big data storage:

1. High-throughput, low-latency read capability

First of all, setting aside the optimizations that the open-source Flink engine already applies to dimension tables, the dimension table itself must be able to handle massive access in real-time computing scenarios (tens of thousands of QPS) and return query results with extremely low latency (milliseconds).

2. Deep integration with the computing engine

Beyond the dimension table itself, computing engines often provide capabilities to offload traffic for the sake of performance, stability, and cost, so that not every request has to reach the downstream dimension table. For example, Flink supports optimizations such as Async IO and cache policies in dimension table scenarios. A good dimension table should integrate closely with the open-source computing engine: on the one hand this improves the performance of the computing layer, and on the other hand it effectively offloads part of the traffic, protects the dimension table from being overwhelmed by excessive access, and reduces its computing cost.
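To make this concrete, the sketch below shows how a task-side lookup cache is typically enabled on a dimension table. The option names follow the open-source Flink JDBC connector; other connectors expose similar but differently named options, so treat the exact keys as illustrative.

-- Sketch: a MySQL (JDBC) dimension table with a task-side lookup cache.
-- Option names are those of the open-source Flink JDBC connector and may
-- differ in other connectors or versions.
CREATE TEMPORARY TABLE item_detail_dim (
  id STRING,
  category STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://<your_host>:3306/<your_db>',
  'table-name' = 'item_detail',
  'username' = '<your_user>',
  'password' = '<your_password>',
  'lookup.cache.max-rows' = '10000',  -- cache at most 10,000 rows per task
  'lookup.cache.ttl' = '10min',       -- expire cached rows after 10 minutes
  'lookup.max-retries' = '3'          -- retry failed lookups
);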

3. Elastic computing capability under light storage

A dimension table is usually a shared table that stores metadata such as dimension attributes. The access volume is usually large, but the storage volume is not particularly large. How heavily a dimension table is accessed depends largely on the volume of the real-time data stream: if the data volume of the real-time stream grows tens of times, the number of dimension table lookups grows accordingly; likewise, if more real-time computing tasks are added that access the same dimension table, its query pressure surges. In these scenarios, the storage size tends not to increase significantly.

Therefore, it is best for computing to be on-demand and elastic: adding or retiring real-time computing tasks, or increasing access traffic, should not affect access performance. At the same time, computing and storage should be separated, so that storage cost does not increase simply because computing access surges.

2 Architecture Selection

MySQL

In the early days of big data and real-time computing, the LAMP (Linux + Apache + MySQL + PHP) architecture was popular for building sites quickly. Since historical business data already lived in MySQL, MySQL was widely chosen as the dimension table in early real-time computing architectures.

As the big data architecture has evolved, MySQL's cloud architecture has also improved, but the following problems remain in dimension table scenarios:

  1. Poor elasticity and high cost of storage expansion: Expanding MySQL storage requires data replication and migration, so the expansion cycle is long and elasticity is poor. In addition, each expansion of MySQL sharding (splitting databases and tables) requires doubling resources, so the expansion cost is high.
  2. High storage cost: Among structured data stores, relational databases have the highest unit storage cost, so in big data scenarios the storage cost of a relational database is high.

These limitations lead to performance bottlenecks and high costs for MySQL as a big data dimension table. That said, MySQL is an excellent database product and remains a good choice when the data scale is not very large.

Redis

In cloud application architectures, because MySQL struggles to bear the growing business load, Redis is often used as a cache for MySQL query result sets, helping MySQL absorb most of the query traffic.

In this architecture, MySQL serves as the primary storage and Redis as the secondary store. Synchronization between MySQL and Redis can be implemented through real-time binlog synchronization or through MySQL UDFs plus triggers. Redis acts as a cache to improve query performance while reducing the risk of MySQL being overwhelmed.

Because the business can usually tolerate weak consistency for data cached in Redis, Redis is also often used directly as a dimension table for real-time computing. Compared with MySQL as a dimension table, Redis has distinct advantages:

  1. High query performance: Data is cached in memory and results can be queried as high-speed key-value lookups, meeting the high-performance query requirement of dimension tables.
  2. High elasticity of storage-layer expansion: Redis can easily be expanded into a sharded cluster for horizontal scaling, and supports persistence with multiple data replicas.

Redis has outstanding advantages, but also a defect that cannot be ignored: although it scales well, its data lives in memory, which is expensive. If the business has large dimension tables (such as user or commodity dimensions), using Redis as the dimension table store is extremely costly.

Tablestore

Tablestore is a structured big data storage product developed by Alibaba Cloud; for a detailed introduction, see the official website and the authoritative guide. In big data dimension table scenarios, Tablestore has unique advantages:

  1. High-throughput access: Tablestore uses a storage-compute separation architecture, which allows computing resources to be scaled elastically and supports high-throughput data queries.
  2. Low-latency queries: Tablestore is built on an LSM storage engine, supports Block Cache to accelerate queries, and lets users configure rich indexes to optimize business queries.
  3. Low storage cost and elastic computing cost: For storage, Tablestore is a structured NoSQL store whose storage cost is much lower than that of a relational database or a cache; for computing, the storage-compute separation architecture allows computing resources to be scaled on demand.
  4. Deep support for Flink dimension table optimizations: Tablestore supports all of Flink's dimension table optimization strategies, including Async IO and the various cache strategies (a configuration sketch follows this list).
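As an illustration only, a Tablestore dimension table on Alibaba Cloud's Realtime Compute typically looks like the sketch below. The connector name ('ots') and the cache/async option keys are recalled from the commercial connector and may vary between versions, so verify them against the current documentation before use.

-- Sketch of a Tablestore dimension table; option names are illustrative.
CREATE TEMPORARY TABLE item_detail_dim (
  id STRING,
  category STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'ots',
  'endPoint' = '<your_instance_endpoint>',
  'instanceName' = '<your_instance_name>',
  'tableName' = 'item_detail',
  'accessId' = '<your_access_key_id>',
  'accessKey' = '<your_access_key_secret>',
  'cache' = 'LRU',        -- task-side cache strategy
  'cacheSize' = '10000',  -- maximum number of cached rows
  'async' = 'true'        -- enable Async IO lookups
);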

Scheme comparison

The table above compares the dimension table options discussed so far along each of these dimensions. Next, several concrete scenarios are used for a detailed cost comparison:

1. High storage and high computation: The dimension table needs to store 10 billion rows of order-dimension data, about 1 TB in total. Although the business has configured a cache strategy on the Flink task side, a large number of KV queries still reach the dimension table: the QPS peaks at 100,000 with an average of 25,000. The configuration requirements and purchase costs of the different dimension table options are as follows:

2. Low storage and low computation: The dimension table needs to store 1 million rows of region-dimension data, about 10 MB in total. An LRU cache strategy is configured for the dimension table in the Flink task to absorb most of the traffic; the dimension table QPS peaks at 1,000 with an average of 250. The configuration requirements and purchase costs of the different dimension table options are as follows:

3. High storage and low computation: The dimension table needs to store 10 billion rows of order-dimension data, about 1 TB in total. An LRU cache strategy is configured for the dimension table in the Flink task to absorb most of the traffic; the dimension table QPS peaks at 1,000 with an average of 250. The configuration requirements and purchase costs of the different dimension table options are as follows:

4. Low storage and high computation: As an in-memory database, Redis excels at very high-frequency KV queries. A Redis cluster with only 4 cores and 8 GB of memory can sustain 160,000 QPS of concurrent access at an estimated cost of about 1,600 CNY/month, which gives it a distinct cost advantage in low-storage, high-computation scenarios.

As can be seen from the cost comparison report above:

1) MySQL, due to its lack of storage and computing elasticity and the inherent limitations of relational databases, has a high cost at every combination of storage and computing scale.

2) As an in-memory database, Redis has a distinct cost advantage in low-storage (under 128 GB), high-computation scenarios. However, because memory storage is expensive and lacks elasticity, its cost rises steeply as the data scale grows.

3) Based on a cloud native architecture, Tablestore scales both storage and computing elastically and is billed by usage, so its cost is low when the data storage and access scale is modest.

4) As a NoSQL database, Tablestore has a low storage cost and a distinct cost advantage in the scenario of high storage (above 128G).

IV. Real-time computing result tables

1 Demand Analysis

As the storage system into which data is written after real-time computation, result table stores can be divided into relational databases, search engines, structured big data stores for offline use, and structured big data stores for online use. The specific differences are summarized in the following table.

Each of these data products has its own strengths in its own scenarios and a different origin. To focus the discussion, we narrow the problem to what role a good result table store should play in real-time computing scenarios.

Among the main real-time computing scenarios mentioned above, result table selection matters most in three: real-time data warehouse, real-time recommendation, and real-time monitoring. Let's analyze each.

  1. Real-time data warehouse: The real-time data warehouse is mainly used for real-time PV/UV statistics, transaction statistics, and other real-time analysis scenarios on websites. Real-time analysis (OLAP) scenarios fall into three models: pre-aggregation, search engine, and Massively Parallel Processing (MPP). In the pre-aggregation model, data can either be aggregated in the Flink computing layer and then written to the result table, or written in full and aggregated by the result table's own pre-aggregation capability; the latter depends heavily on the query and analysis capability of the result table. In the search engine model, data is written in full and analyzed using the search engine's inverted index and columnar storage characteristics; this requires the result table to offer high-throughput writes and large-scale storage. The MPP model is a computing engine that can better exploit analytical queries when it sits on columnar storage. Real-time OLAP storage and computing engines are numerous, and a complete data system usually needs several storage components to coexist; depending on the query and analysis requirements, data must sometimes be derived into other types of storage. In addition, as the business scale of a real-time data warehouse grows, storage grows substantially while query and other computing scales usually change much less, so the result table needs to separate storage and computing costs to keep resource cost under control.
  2. Real-time recommendation: Real-time recommendation is mainly personalized recommendation based on user preferences. In the common scenario of personalized product recommendation, a typical practice is to extract the user's features from clicks and other consumption behavior and write them into a structured data store (such as HBase); that store is then used as a dimension table to join another stream of user behavior, so that a user's features can be combined with their behavior as input to the recommendation algorithm. The store here must provide high-throughput writes as a result table and high-throughput, low-latency online queries as a dimension table.
  3. Real-time monitoring: Real-time monitoring is common in financial or transactional business scenarios with strict timeliness requirements, where abnormal data must be detected in real time so that losses can be stopped promptly. Whether the detection is based on threshold rules or anomaly detection algorithms, this scenario requires real-time, low-latency data aggregation and query capability.

2 Key capabilities

From the above requirement analysis, we can summarize several key capabilities of a result table for real-time big data:

1. Large-scale data storage

The result table store is positioned as centralized mass storage: it must support PB-scale data storage, whether as a summary of online databases or as the input and output of real-time (or offline) computing.

2. Rich data query and aggregation analysis ability

A result table needs rich query and aggregation analysis capabilities, optimized to serve queries efficiently online. Common query patterns include high-concurrency, low-latency random queries, complex queries with arbitrary combinations of field conditions, and data retrieval. The main technical means of query optimization are caching and indexes, and index support should be diversified, providing different index types for different query scenarios: for example, B+tree-based secondary indexes for fixed combination queries, R-tree or BKD-tree spatial indexes for geographic queries, and inverted indexes for multi-condition combination queries and full-text retrieval.

3. High throughput write ability

A result table for real-time computing must be able to absorb the massive result data sets exported by the big data computing engine. It must support high-throughput writes, and its storage engine is usually optimized for writing.

4. Data derivation ability

A complete data system architecture requires multiple storage components to coexist, and depending on the query and analysis requirements, data must be dynamically extended into other stores under a data derivation system. Therefore, a big data store also needs derivation capability, that is, the ability to extend its data into other storage systems to expand data processing capability. The clearest sign that a storage component can support data derivation is mature CDC (Change Data Capture) technology.
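As a concrete illustration of what CDC enables, the sketch below uses the open-source Flink CDC connector for MySQL to turn a table's changelog into a streaming source that can then be derived into other stores. The connector is a stand-in to illustrate the concept, not a recommendation from the original text, and the connection parameters are placeholders.

-- Sketch: deriving data from an upstream store via CDC. The 'mysql-cdc'
-- connector (from the flink-cdc-connectors project) exposes the table's
-- full data plus its binlog changes as a single streaming source, which
-- can then be written into a search engine, a data lake, or another store.
CREATE TEMPORARY TABLE orders_cdc (
  order_id BIGINT,
  item_id  BIGINT,
  price    DOUBLE,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = '<your_mysql_host>',
  'port' = '3306',
  'username' = '<your_user>',
  'password' = '<your_password>',
  'database-name' = '<your_db>',
  'table-name' = 'orders'
);

-- The change stream can then be derived into another (hypothetical) result table:
-- INSERT INTO orders_in_search_engine SELECT * FROM orders_cdc;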

5. Cloud native architecture: separation of storage and computing costs

In the cloud native big data architecture, every layer is evolving toward being offered as a service: storage as a service, computing as a service, metadata management as a service. Each component should be broken into independent units, and the result table is no exception: it needs to scale independently and be more open, flexible, and elastic.

Considering the result table alone, only components that conform to the cloud native architecture, that is, products built on a storage-compute separation architecture, can separate storage and computing costs and scale them independently. The advantages of storage-compute separation are even more obvious in big data systems. A simple example: the storage volume of structured big data grows continuously as data accumulates, while the write volume is relatively stable. So storage must keep growing, but the computing resources needed to support data writes or ad-hoc analysis are relatively fixed and should be provisioned on demand.

3 Architecture Selection

MySQL

As with dimension tables, in the early days of big data and real-time computing, MySQL was a universal store: almost every requirement could be met with MySQL, so it was used very widely, and result tables were no exception. As data scale keeps expanding and requirement scenarios become more complex, MySQL struggles to keep up. The main problems in result table scenarios are as follows:

  1. High storage cost for big data: As mentioned in the dimension table discussion, the unit storage cost of relational databases is very high.
  2. Limited query capability from a single storage system: As data scale grows, the shortcomings of MySQL read and write performance gradually appear. In addition, with the emergence of analytical (AP) requirements, MySQL, which is better suited to TP scenarios, offers limited query capability.
  3. Weak high-throughput writes: As a TP-class relational database, MySQL is not particularly good at high-throughput data writes.
  4. Poor scalability and high expansion cost: As discussed for dimension tables, MySQL storage expansion requires data replication and migration and a doubling of resources, so expansion elasticity is poor and cost is high.

These limitations lead to performance bottlenecks and high costs for MySQL in big data result table scenarios. Overall, as a relational database, MySQL is not particularly well suited to serving as a big data result table.

HBase

Because of the natural bottlenecks of relational databases, distributed NoSQL structured databases modeled on Bigtable emerged. The best-known structured big data stores in the open-source world are Cassandra and HBase. Cassandra is the top-ranked product in the wide-column NoSQL category and is widely used abroad; in this article we focus on HBase, which is more widely deployed in China. HBase is a wide-column database built on HDFS with a storage-compute separation architecture. It has good scalability and can support large-scale data storage, with the following advantages:

  1. Large-scale storage and high-throughput writes: The LSM-based storage engine supports large-scale data storage and is optimized for writes, providing high-throughput data writes.
  2. Storage-compute separation architecture: Built on HDFS, the separated architecture allows storage and computing to be scaled flexibly as required.
  3. Mature developer ecosystem, well integrated with other open-source systems: As an open-source product with many years of development, HBase has many deployments in China, a mature developer community, and good integration with other open-source ecosystems such as Hadoop and Spark.

HBase has significant advantages, but it also has several significant disadvantages:

  1. Weak query capability, with almost no support for data analysis: HBase provides efficient single-row random queries and range scans, but complex combination-condition queries must be done with Scan + Filter, which, if one is not careful, amounts to a full table scan and is extremely inefficient. Phoenix on HBase provides secondary indexes to optimize queries, but like MySQL secondary indexes, they only help when the query matches the leftmost prefix of the index, so the queries that can be optimized are limited.
  2. Weak data derivation capability: CDC technology is the core technology supporting a data derivation system, and HBase does not have it.
  3. Not a cloud native Serverless service, so cost is high: The cost of HBase consists of the CPU cores needed for computing and the disk cost for storage. In a deployment based on a fixed ratio of physical resources, CPU and storage have a minimum ratio, so CPU core cost grows with storage rather than with the computing resources actually used. Only a cloud native Serverless service model can completely separate storage and computing costs.
  4. Complex operations and maintenance: HBase is a standard Hadoop component whose core dependencies are ZooKeeper and HDFS; without a professional O&M team it is almost impossible to operate.

Advanced users in China mostly do secondary development on top of HBase, essentially building various schemes to compensate for its weak query capability and developing their own index schemes according to their business query patterns: for example, self-developed secondary indexes, Solr full-text indexes, or bitmap indexes for low-cardinality data sets. Overall, HBase is an excellent open-source product with many good design ideas.

HBase + Elasticsearch

To address HBase's weak query capability, many companies in China use Elasticsearch to accelerate data retrieval, building their architecture on HBase + Elasticsearch: HBase stores the bulk of the data and serves queries over historical cold data, while Elasticsearch serves data retrieval. Because HBase lacks CDC technology, the business side must either dual-write HBase and Elasticsearch at the application layer or run a data synchronization task that copies HBase data into Elasticsearch.

Elasticsearch can largely compensate for HBase's weak query capability, but the HBase + Elasticsearch architecture has the following problems:

  1. High development cost and complex operations and maintenance (O&M): The customer has to maintain at least two clusters and handle data synchronization from HBase to Elasticsearch; to guarantee consistency between the two, data may have to be dual-written at the application layer as described above. This architecture is not decoupled, so it is complicated to scale, and the overall system involves many modules and technologies, so O&M cost is also high.
  2. High cost: The customer has to purchase two clusters and maintain data synchronization between HBase and Elasticsearch, so resource cost is high.
  3. No data derivation capability: In this architecture, data is written separately to HBase and Elasticsearch; since neither has CDC technology, data cannot be flexibly derived into other systems.

Tablestore

Tablestore is a structured big data storage product developed by Alibaba Cloud; for a detailed introduction, see the official website and the authoritative guide. Its design takes into account the needs of structured big data storage within a data system, and several features were designed specifically around the concept of a derived data system. Its technical characteristics can be summarized briefly as follows:

  1. Large-scale data storage with high-throughput writes: LSM and B+ tree are the two mainstream storage engine designs; Tablestore is based on LSM, supports large-scale data storage, and is specially optimized for high-throughput writes.
  2. Diversified indexes providing rich query capability: The characteristics of the LSM engine make query capability its weak point, so indexes are needed to optimize queries, and different query scenarios need different index types. Tablestore therefore provides diversified indexes to meet the query needs of different scenarios.
  3. CDC support providing data derivation capability: Tablestore's CDC technology, called Tunnel Service, supports full and incremental real-time data subscription and connects seamlessly with the Flink streaming engine, enabling real-time streaming computation over table data.
  4. Storage-compute separation architecture: Tablestore is built on the Apsara Pangu distributed file system, which is the basis for separating storage and computing costs.
  5. Cloud native architecture, Serverless product form, O&M-free: The key ingredients of a cloud native architecture are storage-compute separation and Serverless services; only with both can one achieve unified management, unified storage, and elastic computing. Because Tablestore is a Serverless product, the business side does not need to deploy or maintain it, which greatly reduces users' O&M cost. (A sketch of writing results into Tablestore from Flink SQL follows this list.)
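To make this concrete, the sketch below writes the per-minute GMV result from the earlier example into a Tablestore result table. As with the dimension table sketch above, the connector name and option keys are illustrative, recalled from the commercial connector, and should be verified against the current documentation.

-- Sketch: a Tablestore result table for the per-minute GMV job.
-- Option names are illustrative; check the connector documentation.
CREATE TEMPORARY TABLE gmv_output_ots (
  time_minute STRING,
  category STRING,
  gmv DOUBLE,
  PRIMARY KEY (time_minute, category) NOT ENFORCED
) WITH (
  'connector' = 'ots',
  'endPoint' = '<your_instance_endpoint>',
  'instanceName' = '<your_instance_name>',
  'tableName' = 'gmv_output',
  'accessId' = '<your_access_key_id>',
  'accessKey' = '<your_access_key_secret>'
);

-- The same aggregation shown in the earlier example can then be directed here:
-- INSERT INTO gmv_output_ots SELECT ... ;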

Scheme comparison

Take a concrete scenario: the result table needs to store hundreds of billions of e-commerce order transaction records, about 1 TB in total. Users need to query and flexibly analyze this data; orders are queried and retrieved about 1,000 times per second, and data analysis runs about 10 times per minute.

The configurations required for the different architectures to meet these requirements, and their purchase costs on Alibaba Cloud, are as follows:

V. Summary

This article has discussed the architecture design and selection for the dimension table and result table scenarios of real-time computing under the cloud native big data architecture. Alibaba Cloud Tablestore has distinctive capabilities in these scenarios, and we hope this article has given you a deeper understanding of it. We will follow up with a series of articles on building Flink on Tablestore from scratch, as well as best practices for the dimension table and result table scenarios.


This article is the original content of Aliyun and shall not be reproduced without permission.