Status quo and problems of big data analysis

The 21st century is a century of information explosion. With the rapid development of IT, more and more applications continuously produce enormous volumes of data. Over the past decades, scientists and engineers have invented a wide range of data management systems to store and manage all kinds of data: relational databases, NoSQL databases, document databases, key-value stores, object storage systems, and so on. This diversity makes it convenient for organizations to manage data, but it also makes it hard to manage and fully exploit the data scattered across these systems. Whether it is PostgreSQL or MySQL among relational databases, or Hive and HBase in the Hadoop ecosystem, each of these common data management systems has its own SQL dialect. A data analyst who wants to analyze data in a particular system must first become familiar with its dialect, and a joint query over different data sources requires connecting to each source with a different client inside the application logic. The resulting analysis pipeline has a complex architecture, multiple programming entry points, and difficult system integration, which is very painful for analysts working with massive data.

Data warehouses are widely used in the industry to solve federated queries across the data islands formed by multiple data sources, and they have developed rapidly in the past few years. Through the ETL process of extracting, transforming, and loading data from various sources, a warehouse centrally stores the processed data in subject-oriented stores for data analysts and other users. However, as data volumes grow further, the industry has gradually realized that moving data into a warehouse is expensive. Beyond the hardware and software costs of the warehouse itself, the labor cost of maintaining and updating the entire ETL pipeline has become one of its major costs. The ETL process is also cumbersome and time-consuming: to obtain the data they want, analysts have to accept the warehouse's T+1 analysis mode, and fast, exploratory business analysis remains a persistent problem.

Subject-oriented data warehouses were invented precisely to break the data islands created by diverse data management systems. However, as business applications multiply, more and more of these warehouses have themselves become data islands. Does the heroic "dragon slayer" inevitably turn into a "dragon" over time? Is there a solution with a simple architecture, a unified programming entry point, and good system integration? Perhaps it is time to go back to square one and look at big data analytics from another angle.

Data virtualization engine openLooKeng: We don’t move data, we are “connectors” of data

Looking back at the data warehouse's various problems, it is easy to see that the warehouse "warrior" gradually became the "dragon" because it keeps moving data: data movement is the heavy, time-consuming, and expensive "culprit" behind both warehouse construction and the analysis process. Since moving data causes these problems, let's return to the starting point of big data analytics and consider the "road less traveled" that openLooKeng takes: turning moving data into connecting data.

Put simply, the openLooKeng data virtualization engine analyzes data by connecting to various source systems through a variety of data source connectors. When a user initiates a query, the data is accessed in real time through each connector and processed with high-performance computation to deliver results in seconds or minutes. This is quite different from the traditional warehouse approach, in which data first passes through a T+1 ETL pipeline before users can touch it.

Instead of learning a variety of SQL dialects, data analysts now only need to be familiar with ANSI SQL2003 syntax; as the middle layer, openLooKeng shields the differences between the SQL standards of the various data management systems and performs the tedious dialect conversion itself. "Freed" from SQL dialects, users can focus on building high-value business query and analysis logic, the intangible asset that is often at the heart of enterprise business intelligence. openLooKeng's entire technical architecture is built around helping users construct such logic quickly: without moving data, an analyst's query idea can be verified immediately, yielding results far faster than the previous T+1 warehouse process.
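As a rough analogy for this "one standard SQL over many sources" idea (not openLooKeng's actual API), the sketch below attaches two independent SQLite databases to stand in for two separate data sources and joins them with a single standard SQL statement; the table and column names are hypothetical:

```python
import sqlite3

# Two separate "data sources" (stand-ins for, say, an RDBMS table and a
# Hive table), modeled here as independent in-memory SQLite databases.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 99.0), (2, 11, 25.5), (3, 10, 42.0)])

conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(10, "Alice"), (11, "Bob")])

# One standard SQL statement spans both "sources"; the engine, not the
# analyst, resolves where each table lives.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN crm.customers c ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Alice', 141.0), ('Bob', 25.5)]
```

In openLooKeng the two sides of such a join would live in different catalogs served by different connectors, but the analyst's SQL stays the same.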

To take things a step further: since openLooKeng can connect to relational databases, NoSQL databases, and other data management systems through connectors, could openLooKeng itself be a connector? It can. When one openLooKeng cluster is exposed as a data source to another, we gain the following benefit: previously, because of cross-region and cross-DC network bandwidth and latency limits, real-time federated queries across multiple data centers were essentially unworkable. Now openLooKeng cluster 1 computes over its local data and transmits only the results to openLooKeng cluster 2 for further analysis. This avoids shipping large amounts of raw data and thus sidesteps the network problems of cross-domain, cross-DC queries.
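The bandwidth saving can be illustrated with a minimal sketch (all names hypothetical): the "remote" cluster reduces its rows to a tiny partial aggregate, and only that aggregate crosses the WAN to be combined with local data:

```python
# Sketch: ship partial results, not raw rows, between data centers.

def local_aggregate(rows):
    """Runs inside the remote DC: reduces many rows to one small tuple."""
    return (len(rows), sum(rows))

remote_rows = list(range(1_000_000))   # raw data that never leaves DC 1
local_rows = [10, 20, 30]              # data already present in DC 2

count_r, sum_r = local_aggregate(remote_rows)   # only two numbers cross the WAN
count_l, sum_l = len(local_rows), sum(local_rows)

# DC 2 combines the partial aggregates into the global answer.
global_avg = (sum_r + sum_l) / (count_r + count_l)
print(global_avg)
```

The same decomposition (partial aggregation at the source, final combination at the coordinator) is what makes cross-DC federation feasible when the raw data is too large to move.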

openLooKeng's unified SQL entry point and rich southbound data source ecosystem go a long way toward solving the complex architecture, excessive programming entry points, and poor integration of cross-source queries. They convert data from being "moved" to being "connected," helping users realize the value of massive data quickly.

Key features of openLooKeng

After reading the above, you may be eager to see which pain points of current business applications openLooKeng can address. Before turning to those business scenarios, let's look at some of openLooKeng's key features, both to understand why openLooKeng fits them and to let you explore further scenarios based on these capabilities.

An in-memory computing framework designed for mass data

openLooKeng was designed from its inception for query and analysis tasks over TB- or even PB-scale data, and it has a natural affinity with the Hadoop file system. Its distributed SQL-on-Hadoop architecture follows the design principle of separating storage from compute, so compute and storage nodes can be scaled out independently. Meanwhile, the openLooKeng kernel uses a memory-based computing framework: all data processing is completed in memory as parallel, pipelined operations, providing second- to minute-level query latency.
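The pipelined, in-memory style of execution can be sketched with plain generators (a conceptual illustration, not the engine's operator framework): each operator consumes upstream rows lazily, so data flows through the plan without being materialized between stages.

```python
# Conceptual sketch of pipelined operator execution with generators.

def scan():
    for i in range(10):            # source operator producing rows
        yield {"id": i, "v": i * i}

def filter_op(rows):
    for r in rows:                 # passes through only matching rows
        if r["v"] % 2 == 0:
            yield r

def project(rows):
    for r in rows:                 # keeps only the needed column
        yield r["v"]

# Rows stream through scan -> filter -> project one at a time.
result = list(project(filter_op(scan())))
print(result)  # [0, 4, 16, 36, 64]
```

A real engine runs many such pipelines in parallel across workers, but the principle is the same: no intermediate stage writes its full output to disk.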

ANSI SQL2003 syntax support

openLooKeng supports ANSI SQL2003 syntax. Whether the underlying data source is an RDBMS, a NoSQL store, or another data management system, users can query it through openLooKeng's Connector framework while the data stays in its original source, achieving "zero-relocation" queries.

Through openLooKeng's unified SQL entry point, the SQL dialect of the underlying data source is shielded: users can obtain the data without caring which dialect the source speaks, which makes the data far easier to consume.

A variety of data source connectors

As diverse as data management systems are, openLooKeng has developed a matching variety of data source connectors: RDBMS (Oracle Connector, HANA Connector, etc.), NoSQL (Hive Connector, HBase Connector), full-text search databases (ElasticSearch Connector), and more. Through these connectors openLooKeng can easily access data sources and perform memory-based, high-performance federated computation on them.

Cross-domain, cross-DC DataCenter Connector

openLooKeng not only federates queries across multiple data sources; it also extends cross-source querying by providing a DataCenter Connector for cross-domain, cross-DC queries. With this new connector you can connect to additional openLooKeng clusters at the remote end, enabling collaborative computation across data centers. The key technologies are as follows:

Parallel data access: workers access data sources concurrently to improve access efficiency, and clients obtain data from servers concurrently to speed up data acquisition.

Data compression: Before serialization during data transmission, data is compressed using the GZIP compression algorithm to reduce the amount of data transmitted over the network.

Cross-DC dynamic filtering: filters data to reduce the amount fetched from the remote end, ensuring network stability and improving query efficiency.
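The "compress before transmit" point above can be made concrete with a small sketch using Python's standard gzip module (the page layout and field names are made up for illustration); repetitive tabular data compresses well, so far fewer bytes cross the network:

```python
import gzip
import json

# A "page" of rows about to be sent between clusters.
page = [{"city": "Shenzhen", "metric": "qps", "value": i % 7}
        for i in range(1000)]

raw = json.dumps(page).encode("utf-8")   # serialized payload
compressed = gzip.compress(raw)          # what actually crosses the WAN

print(len(raw), len(compressed))         # compressed is much smaller

# The receiving cluster restores the page losslessly.
restored = json.loads(gzip.decompress(compressed))
assert restored == page
```

The trade-off is CPU time for bandwidth, which is usually a win on constrained cross-DC links.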

High-performance query optimization techniques

OpenLooKeng builds on an in-memory computing framework and utilizes a number of query optimization techniques to meet the needs of high-performance interactive queries.

– Index

openLooKeng provides indexes such as Bitmap Index, Bloom Filter, and Min-Max Index. By creating indexes over existing data and storing the results outside the data source, the query plan can use the index information to skip files that cannot match and reduce the volume of data read, thereby speeding up queries.
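A min-max index can be sketched in a few lines (an illustration of the pruning idea, not openLooKeng's on-disk index format; file names and values are hypothetical): keep (min, max) per data file and read only files whose range can satisfy the predicate.

```python
# Each "file" holds a column of values; the index stores its (min, max).
files = {
    "part-0": [3, 8, 15],
    "part-1": [120, 180, 240],
    "part-2": [55, 60, 90],
}
index = {name: (min(vals), max(vals)) for name, vals in files.items()}

def scan_greater_than(threshold):
    """Read only files whose max exceeds the threshold; skip the rest."""
    candidates = [n for n, (lo, hi) in index.items() if hi > threshold]
    rows = [v for n in candidates for v in files[n] if v > threshold]
    return candidates, rows

# For `value > 100`, two of three files are pruned without being read.
candidates, rows = scan_greater_than(100)
print(candidates, rows)  # ['part-1'] [120, 180, 240]
```

Bloom filters play the analogous role for equality predicates: they answer "definitely absent" cheaply so non-matching files can be skipped.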

– Cache

openLooKeng provides several caches, including a metadata cache, an execution plan cache, and an ORC row data cache. Together they reduce the response latency when a user repeatedly issues the same SQL query or the same type of SQL query.
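The plan-cache idea reduces to memoizing an expensive step by the query text. A minimal sketch (hypothetical names; real engines also bound cache size and invalidate on metadata changes):

```python
# Memoize "planning" by SQL text so repeated queries skip the expensive step.
plan_cache = {}
plan_calls = 0

def plan(sql):
    """Stands in for expensive parsing, analysis, and optimization."""
    global plan_calls
    plan_calls += 1
    return ("PLAN", sql.strip().lower())

def get_plan(sql):
    if sql not in plan_cache:        # miss: plan and remember
        plan_cache[sql] = plan(sql)
    return plan_cache[sql]           # hit: reuse the cached plan

get_plan("SELECT * FROM t")
get_plan("SELECT * FROM t")          # second call is served from cache
print(plan_calls)  # 1
```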

– Dynamic filtering

Dynamic filtering is the optimization that, at run time, applies the filtering information derived from one side of a join to the scan of the other side. openLooKeng not only provides dynamic filtering for a variety of data sources but also applies it in the DataCenter Connector, speeding up join queries in different scenarios.
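The mechanism can be sketched as follows (a conceptual illustration with made-up data, not the engine's implementation): collect the join keys actually present on the small build side, then push that set down as a filter on the large probe-side scan, so non-matching rows never reach the join.

```python
# Small build side and large probe side of a join.
build_side = [("DE", "Germany"), ("FR", "France")]
probe_side = [("DE", 10), ("US", 99), ("FR", 7), ("CN", 5)]

# Run-time filter: the set of keys the build side can actually match.
dynamic_filter = {code for code, _ in build_side}   # {'DE', 'FR'}

# Applied at scan time: only matching probe rows are read/shipped.
scanned = [row for row in probe_side if row[0] in dynamic_filter]

# The join now touches half the probe rows.
joined = [(name, qty) for code, qty in scanned
          for bcode, name in build_side if bcode == code]
print(scanned, joined)
```

Across a DataCenter Connector the same filter is shipped to the remote cluster, so the pruning happens before rows ever cross the WAN.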

– Operator push down

When openLooKeng connects to a data source such as an RDBMS through the Connector framework, better performance can generally be achieved by pushing operators down to the source, taking advantage of the RDBMS's own computing power. openLooKeng currently supports operator push-down for a variety of data sources, including Oracle and HANA, and notably also implements push-down for the DataCenter Connector for faster query response.
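The difference push-down makes can be shown with SQLite standing in for the RDBMS (table and column names are hypothetical): without push-down the engine fetches every row and computes itself; with push-down the predicate and aggregation are embedded in the SQL sent to the source, which returns a single row.

```python
import sqlite3

# SQLite stands in for a remote RDBMS data source.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("east", 10.0), ("west", 20.0), ("east", 5.0)])

# Without push-down: fetch all rows, filter and aggregate in the engine.
all_rows = db.execute("SELECT region, amount FROM sales").fetchall()
engine_total = sum(a for r, a in all_rows if r == "east")

# With push-down: the source evaluates the filter and the SUM itself,
# so only one small row travels back.
(pushed_total,) = db.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'east'").fetchone()

assert engine_total == pushed_total
print(pushed_total)  # 15.0
```

The results are identical; what changes is how many rows cross the wire and which system burns the CPU.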

High availability

Active-active (AA) high availability

openLooKeng introduces a highly available, active-active (AA) feature: the coordinator AA mechanism keeps load balanced between coordinators and ensures openLooKeng's availability in high-concurrency scenarios.

Auto-scaling

openLooKeng's elastic scaling feature allows a service node that is still executing tasks to be drained smoothly, or an isolated node to be brought back to accept new tasks. openLooKeng provides node isolation interfaces for external resource managers (such as Yarn and Kubernetes) to flexibly scale coordinator and worker nodes out and in.

Common application scenarios of openLooKeng

With the key features above in mind, you have probably already thought of a few openLooKeng application scenarios. Let's look at how openLooKeng can be used in real business.

High-performance interactive query scenarios

openLooKeng's memory-based computing framework makes full use of in-memory parallel processing, indexes, caches, distributed pipelining, and other techniques to query and analyze data quickly, handling TB- or even PB-scale data. Interactive analysis applications currently built on Hive, Spark, or Impala can swap in the openLooKeng query engine for faster query performance.

Cross-source heterogeneous query scenarios

As mentioned above, RDBMS, NoSQL, and other data management systems are widely used across customer application systems, and more and more specialized data warehouses, such as Hive or MPPDB, have been built to handle this data. These databases and warehouses are often isolated from one another, forming independent data islands, and data analysts suffer from:

  • Querying the various data sources requires different connection modes or clients and different SQL dialects, which adds learning costs and complicates application development logic.
  • Federated queries over data in different systems are impossible without once again gathering the data from the various sources into one place.

openLooKeng can implement joint queries over RDBMS, NoSQL, Hive, MPPDB, and other data warehouses. With its cross-source heterogeneous query capability, data analysts can achieve minute-level or even second-level analysis of massive data.

Cross-domain and cross-DC query scenarios

In two-level or multi-level data center scenarios, such as province/city or headquarters/division, users often need to query data held in the municipal (division) data centers from the provincial (headquarters) data center. The main bottleneck of such queries is the cross-domain network between the data centers (bandwidth, latency, packet loss, etc.), which can cause long query latency and unstable performance.

For cross-domain queries, openLooKeng designed the DataCenter Connector, a cross-domain, cross-DC solution. By transferring computed results rather than large amounts of raw data between openLooKeng clusters, it sidesteps the network problems caused by insufficient bandwidth and packet loss. This largely solves the cross-domain, cross-DC query problem and offers high practical value in these scenarios.

Computing and storage separation scenario

openLooKeng itself has no storage engine; its data comes mainly from various heterogeneous data management systems. It is therefore a typical storage/compute-separated system, in which compute and storage resources can be scaled out independently. This architecture supports dynamic expansion of cluster nodes and elastic resource scaling without service interruption, suiting business scenarios that require the separation of compute and storage.

Scenarios for rapid data exploration

As mentioned above, to query data from multiple sources, customers usually build a dedicated data warehouse through an ETL process, at the price of labor costs, ETL time, and other overhead. For customers who need rapid data exploration rather than a dedicated warehouse, copying and loading data into a warehouse is time-consuming and laborious, and may still not yield the desired analysis results.

openLooKeng defines a virtual data mart: through its standard syntax and cross-source heterogeneous query capability, it connects to the various data sources, and the analysis tasks users want to explore are defined at this virtual semantic layer. With openLooKeng's data virtualization capabilities, customers can quickly build exploratory analysis services over many data sources without constructing complex, dedicated data warehouses, saving both labor and time. For teams that want to explore data rapidly to develop new business, openLooKeng is one of the best choices.
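The "virtual data mart" idea, views over live data rather than copies in a warehouse, can be sketched with SQLite standing in for a single source (table and view names are hypothetical; in openLooKeng such views would span many catalogs):

```python
import sqlite3

# SQLite stands in for a live data source; nothing is copied out of it.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE employees (id INTEGER, dept TEXT)")
source.executemany("INSERT INTO employees VALUES (?, ?)",
                   [(1, "eng"), (2, "eng"), (3, "sales")])

# The "mart" is just a view: a semantic layer analysts query directly.
source.execute("""CREATE VIEW headcount AS
                  SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept""")

rows = source.execute("SELECT * FROM headcount ORDER BY dept").fetchall()
print(rows)  # [('eng', 2), ('sales', 1)]

# Source updates are visible immediately, with no ETL refresh cycle.
source.execute("INSERT INTO employees VALUES (4, 'sales')")
rows2 = source.execute("SELECT * FROM headcount ORDER BY dept").fetchall()
print(rows2)  # [('eng', 2), ('sales', 2)]
```

This immediacy is exactly what the T+1 warehouse model gives up: there, the second query would not see the new row until the next ETL run.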

Looking to the future

As a data virtualization engine, openLooKeng has made some progress in cross-domain, cross-DC interactive query scenarios. Looking ahead, a number of questions remain to be verified and solved: how should openLooKeng handle streaming and batch processing beyond interactive analysis? Which other data source connectors do users need? We sincerely invite users and developers to join the openLooKeng open source community, develop the project together, solve more user problems, and make big data easier.

• • •

openLooKeng is an open-source, high-performance data virtualization engine that provides a unified SQL interface and cross-source, cross-data-center analysis capabilities, offering big data users a minimalist data analysis experience. Join the openLooKeng community and do something fun to make big data easier!

openLooKeng community official website: openlookeng.io/zh-cn/

openLooKeng code repository: gitee.com/openlookeng