openLooKeng: Make Big Data Simplified
openLooKeng is an efficient open-source data virtualization analysis engine. In this issue, our partner from China Everbright Bank shares a blog post with us: [Analysis of the openLooKeng Connector reuse mechanism based on the Hive Connector]. Thank you very much for the contribution.
Creating a new openLooKeng Connector by reusing the Hive Connector
Author: Cheng Yijian, Information Technology Department, Head Office of China Everbright Bank
Preface
openLooKeng, as a federated computing engine across multiple data sources, naturally supports the development of third-party data source connectors. How to develop a simple Connector is not covered in detail here; you can refer to the official examples. Recently, business needs required modifying the Hive Connector to support our scenario. To avoid intruding on the community's native Hive Connector, I created a new Mpp Connector by reusing and customizing it. This article uses that work as an example to explain the reuse mechanism.
1. Execution logic
The figure above shows the overall execution logic of the Hive Connector. A query statement is parsed by the openLooKeng Coordinator, which then calls the Hive Metastore over its Thrift interface to obtain metadata: which databases exist, what tables each database contains, and the storage locations and fields of those tables. With this information, the Coordinator schedules tasks that pull data from the corresponding locations in HDFS. Finally, the data is returned to the Coordinator for aggregation. The two most important parts of this whole process are acquiring metadata and pulling data. So how do we create a new Connector that reuses this Hive Connector functionality?
2. Plugin loading mechanism
The image above is my hand-drawn sketch of the general loading mechanism. It may not be perfectly accurate, but it is good enough for our purposes. After compiling the openLooKeng engine, you can see the structure shown below.
All connectors live in the plugin directory. When the server starts, each Connector is loaded by the PluginManager, as shown in the loading diagram. When a specific Connector is loaded, the PluginManager accesses that Connector's Plugin implementation, which defines the connector name, such as hive-hadoop2, mysql, oracle, or clickhouse. The Connector instance is then created by calling the corresponding ConnectorFactory.
In openLooKeng, the shape of a Connector implementation is prescribed: a set of SPI interfaces must be implemented.
(For details, please refer to: Source code learning (2) — Presto Connector mechanism.)
For example, concrete implementations of ConnectorMetadata, ConnectorSplitManager, and other interfaces need to be included in the Connector, so the ConnectorFactory must wire all of these together when it creates a Connector instance. That is why the loading diagram shows Modules in different colors: each colored module groups the implementations of one set of interfaces by topic.
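To make this concrete, here is a minimal sketch of what a ConnectorFactory ultimately assembles. The class name MppConnector is my own; the SPI interfaces come from io.prestosql.spi.connector, which openLooKeng inherits from Presto, and the exact set of required methods depends on your engine version:

import io.prestosql.spi.connector.Connector;
import io.prestosql.spi.connector.ConnectorMetadata;
import io.prestosql.spi.connector.ConnectorSplitManager;
import io.prestosql.spi.connector.ConnectorTransactionHandle;
import io.prestosql.spi.transaction.IsolationLevel;

// Sketch: a Connector is essentially a bundle of service implementations.
public class MppConnector implements Connector
{
    private final ConnectorMetadata metadata;
    private final ConnectorSplitManager splitManager;

    public MppConnector(ConnectorMetadata metadata, ConnectorSplitManager splitManager)
    {
        this.metadata = metadata;
        this.splitManager = splitManager;
    }

    @Override
    public ConnectorTransactionHandle beginTransaction(IsolationLevel isolationLevel, boolean readOnly)
    {
        // A real connector returns a handle that tracks the transaction; elided in this sketch.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public ConnectorMetadata getMetadata(ConnectorTransactionHandle transaction)
    {
        return metadata; // metadata service: schemas, tables, columns
    }

    @Override
    public ConnectorSplitManager getSplitManager()
    {
        return splitManager; // split service: how table data is divided for parallel reads
    }
}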
A Module is a Guice concept. Inside a Module you bind an interface to its implementation class: for example, if your implementation of ConnectorMetadata is HiveConnectorMetadata, the binding is written as binder.bind(ConnectorMetadata.class).to(HiveConnectorMetadata.class).in(Scopes.SINGLETON).
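Spelled out as a complete Module class, using the class names from the text (a sketch; HiveConnectorMetadata stands for whatever implementation class your connector provides), that binding looks like this:

import com.google.inject.Binder;
import com.google.inject.Module;
import com.google.inject.Scopes;
import io.prestosql.spi.connector.ConnectorMetadata;

// Sketch of a Guice Module: binds an SPI interface to a concrete implementation.
public class ExampleMetadataModule implements Module
{
    @Override
    public void configure(Binder binder)
    {
        // Whenever ConnectorMetadata is requested, Guice supplies the single
        // HiveConnectorMetadata instance (SINGLETON scope).
        binder.bind(ConnectorMetadata.class)
                .to(HiveConnectorMetadata.class)
                .in(Scopes.SINGLETON);
    }
}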
3. Customizing a Connector
From the description above, we have now seen the key interfaces (or classes): Plugin, ConnectorFactory, and Module. Now let's see how to create a new Connector that reuses an existing one.
(1) Create a new module
First, we create a new module named presto-mpp and specify presto-root as its parent. Then we add the following content to hetu-server/src/main/provisio/hetu.xml to declare the packaging path information.
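The entry follows the same pattern as the existing connector entries in that file; treat the snippet below as a sketch and mirror a neighbouring entry exactly (plugin/mpp and the presto-mpp artifact id correspond to the new module):

<artifactSet to="plugin/mpp">
    <artifact id="${project.groupId}:presto-mpp:zip:${project.version}">
        <unpack />
    </artifact>
</artifactSet>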
Then make sure the module is added to the pom.xml in the parent directory, with an entry like the one below.
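That is, the <modules> section of the root pom.xml gains one more entry alongside the existing connector modules:

<modules>
    <!-- existing modules ... -->
    <module>presto-mpp</module>
</modules>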
With this, a new Connector module has been created, but compilation will tell you that none of the required interfaces are implemented yet, so next we implement, or reuse, the necessary interfaces.
(2) Implement or reuse interfaces
Let's start from the Hive Connector configuration file:
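For reference, a typical Hive catalog file, etc/catalog/hive.properties, contains something like the following (the metastore host and port are placeholders):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://example-metastore-host:9083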
The connector named here is provided by the presto-hive-hadoop2 module shown below.
super("hive-hadoop2") is the only meaningful line in its code. That's right: this class does exactly one thing, defining the name of the connector. Who does all the other work? As you can see, it extends HivePlugin, so everything else is done by the presto-hive module.
Following the same pattern, I can create my own plugin class and easily obtain an MPP Connector that can read Hive tables; a sketch follows.
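A minimal sketch of such a plugin class (MppPlugin and the connector name "mpp" are my own choices; HivePlugin comes from the presto-hive module):

// Sketch: reuse everything from the Hive plugin, changing only the connector name.
public class MppPlugin extends HivePlugin
{
    public MppPlugin()
    {
        super("mpp"); // the name used as connector.name in the catalog file
    }
}

With this in place, a catalog file containing connector.name=mpp behaves exactly like a Hive catalog.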
If it ended there, this would just be a renamed Hive connector. What if you want to customize some of the logic inside, for example, redefine the data-sharding (split) logic? Then we have to override that logic.
Let’s take rewriting the data sharding logic as an example of how to add custom functionality.
(3) Custom sharding logic
Where is the sharding logic of the Hive Connector implemented? In the getSplits method of the HiveSplitManager class, which defines how the data files, partitions, and splits of a Hive table are obtained. If we want to customize the sharding logic, what should we do?
We can create MppSplitManager.java. One way to do this would be to implement the ConnectorSplitManager interface from scratch. But since we want to reuse the existing Hive functionality and only add custom logic, we instead inherit from HiveSplitManager and override its getSplits method.
Here I simply add a custom output, but any other logic can be added in the same place. For example, you can obtain the schemaName and tableName from the table handle; if you detect that the tableName is on a whitelist, you can apply special handling to those tables, and so on. A sketch of the subclass follows.
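A minimal sketch, assuming the getSplits overload that takes a ConnectorTableHandle; openLooKeng has extended this signature across releases, so copy the exact parameter list from HiveSplitManager in your hetu-core version:

import io.prestosql.spi.connector.ConnectorSession;
import io.prestosql.spi.connector.ConnectorSplitSource;
import io.prestosql.spi.connector.ConnectorTableHandle;
import io.prestosql.spi.connector.ConnectorTransactionHandle;
import io.prestosql.spi.connector.SchemaTableName;

// Sketch: inherit the Hive sharding logic and hook in custom behaviour.
// A real subclass must also declare a constructor that forwards
// HiveSplitManager's injected dependencies via super(...); omitted here.
public class MppSplitManager extends HiveSplitManager
{
    @Override
    public ConnectorSplitSource getSplits(
            ConnectorTransactionHandle transaction,
            ConnectorSession session,
            ConnectorTableHandle tableHandle,
            SplitSchedulingStrategy splitSchedulingStrategy)
    {
        HiveTableHandle handle = (HiveTableHandle) tableHandle;
        SchemaTableName tableName = handle.getSchemaTableName();

        // Custom logic goes here, e.g. special treatment for whitelisted tables.
        System.out.println("MppSplitManager.getSplits called for " + tableName);

        // Fall back to the native Hive sharding for everything else.
        return super.getSplits(transaction, session, tableHandle, splitSchedulingStrategy);
    }
}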
Is that it? Not quite. The engine still won't use the class you wrote, because nothing instantiates it or binds it as MppConnector's ConnectorSplitManager implementation. You add that binding in MppModule; the engine will then use your implementation when the plugin is loaded, as shown below.
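Concretely, inside MppModule's configure(Binder binder) method, the binding is one line (a sketch, mirroring the Hive module's own binding style):

// Replace the Hive binding so the engine instantiates our subclass.
binder.bind(ConnectorSplitManager.class).to(MppSplitManager.class).in(Scopes.SINGLETON);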
4. Summary
This completes creating a new Connector that reuses an existing Connector's functionality. Of course, if you want to customize the Hive Connector's metadata handling, you can re-implement HiveMetadata in the same way. The JDBC Connector family (MySQL, Oracle, etc.) is implemented in a similar way. I have put the implementation example on Gitee, in the MPPdev branch of my fork of openLooKeng's hetu-core repository.
Gitee.com/doubledue/h…
We'll have a chance to talk more about the internal mechanisms later.
– END –
Welcome to the openLooKeng website:
openlookeng.io
openLooKeng code repository:
gitee.com/openlookeng