The introduction

With the rapid development of computer and network technology and the infiltration to the industries, data source of produce mode and producing now than ever before have a lot of rich, such as: data from the sensors, the website user activity data, data from the mobile terminal and the smart devices, real-time transaction data in financial markets, various monitoring program of data and so on. Much of this data is in the form of continuous streams of data that are continuously generated from multiple external sources, and in most cases we have no control over the order in which these streams arrive or the rate at which they are generated.

For a long time, EMQ has been seeking for an optimal flow processing system and architecture in order to better meet the various requirements of real-time streaming data storage and processing in the actual business of various industries.

In When Databases Meet Streaming Computing: The Birth of streaming Databases! In this paper, EMQ proposes a new database category called “streaming database”. Compared to current unsystematic streaming data solutions, it is reasonable to believe that streaming databases will be the best choice in the era of real-time data processing and will become the core infrastructure of future enterprise software systems.

In today’s article, we will introduce you to HStreamDB, a stream database product that EMQ is developing.

Overview of HStreamDB project

HStreamDB is a streaming database designed for the full life cycle management of large-scale real-time data streams, including access, storage, processing, and distribution. It uses standard SQL (and its streaming extension) as the main interface language and takes real-time as the main feature, aiming to simplify the operation and maintenance management of data flow and the development of real-time applications.

The overall architecture of HStreamDB is shown in the figure below. A single HStreamDB node is mainly composed of two core components, HStream Server (HSQL) and HStream Storage (HStore). An HStream cluster consists of several peer HStreamDB nodes. Clients can connect to any one of the HStreamDB nodes in the cluster and perform various flow processing and analysis tasks, ranging from simple to complex, using the familiar SQL language.

As the core computing component of HStreamDB, HStream Server (HSQL) is itself designed to be stateless. It is mainly responsible for client connection management, security authentication, SQL parsing, SQL optimization, as well as the creation, scheduling, execution and management of stream computing tasks.

HStream Server (HSQL) can be divided into the following layers from top to bottom:

  1. Access layer. Mainly responsible for client request protocol processing, connection management, as well as security authentication and access control.
  2. The SQL layer. The client interacts with HStreamDB primarily through SQL statements to perform most of the streaming and real-time analysis tasks. This layer is responsible for compiling SQL statements submitted by users into logical data flow diagrams. As with a classic database system, there are two core sub-components: the SQL parser and the SQL optimizer. SQL parser is responsible for completing lexical analysis, syntax analysis, compiling SQL statements into the corresponding relational algebraic expressions; The SQL optimizer is responsible for optimizing the generated execution plan based on various rules and Context information.
  3. The Stream layer. This layer contains implementations of various common flow handlers, as well as data structures and DSLS that express data flow diagrams, and support for user-defined functions as handlers. Mainly responsible for selecting the corresponding operator implementation and optimization for the logical data flow graph passed down from the SQL layer, and generating executable data flow graph.
  4. The Runtime layer. This layer is responsible for actually performing the computation tasks of the data flow graph and returning the results. It mainly consists of task scheduler, state manager and execution optimizer. The scheduler is responsible for the scheduling of computing tasks among available computing resources, which may be between multiple threads of a single processing, or between multiple processors in a single machine, or between multiple machines or containers in a distributed cluster. The state manager is responsible for coordinating state maintenance and fault tolerance of the flow handler. The execution optimizer can speed up the execution of data flow diagrams by automating parallelism.

As the core Storage component of HStreamDB, HStream Storage (HStore) is a low-latency Storage component specially designed for streaming data. It can not only store large-scale real-time data in distributed persistence, but also use auto-tiering mechanism. Seamlessly interconnects with large-capacity secondary storage such as S3 to store historical data and real-time data in a unified manner.

The core Storage model of HStream Storage (HStore) is a log model that fits streaming data. The data stream itself can be regarded as an infinitely growing log. The typical operations supported by HStream include appending and interval read.

HStream Storage (HStore) can be divided into the following layers:

  1. Streaming Data API layer. This layer provides core data flow management and read and write operations, including the creation, deletion, writing and consumption of data flows. In HStore, there is no limit on the number of data streams to be created, and concurrent writing of large numbers of data streams is supported. The HStore storage design does not store data streams, so creating data streams is a very lightweight operation. In view of the characteristics of data stream, HStore provides append operation to support fast data writing. Meanwhile, in terms of reading stream data, HStore provides read operation based on subscription semantics. The newly written data in data stream will be pushed to data consumers in real time.
  2. Duplicate the layer. This layer is mainly based on the optimized Flexible Paxos consensus engine to achieve strong and consistent replication of stream data to ensure fault tolerance and high availability of data. At the same time, the availability of cluster data is maximized by non-deterministic data distribution strategy. In addition, replication groups can be reconfigured online to achieve seamless cluster data balancing and horizontal expansion.
  3. Local storage layer. This layer is mainly responsible for the local persistent storage of data, and the RocksDB storage engine based on optimization on the implementation encapsulates the access interface of streaming data, which can support the write and read of large amounts of data with low latency.
  4. Secondary storage layer. This layer provides unified interface encapsulation for multiple long-term storage systems, such as HDFS and AWS S3. Historical Data can be automatically unloaded to these secondary storage systems and accessed through unified Streaming Data interface.

HStreamDB features

Note: The following features are all planned up to HStreamDB 1.0, some features are under continuous development, the current version is not implemented, stay tuned.

SQL – based data flow processing

HStreamDB design the complete stateful processing scheme based on event time, not only support the basic filtering, conversion operations, also support the press key to do aggregation computation, based on the calculation of multiple time Windows, and the data flow between the ability to join, but also support the out-of-order and late news of the special processing, guarantee the accuracy of the calculation results. The user can do all of the above processing functions through SQL statements without learning any third-party APIS. At the same time, HStream stream processing has rich expansion capabilities, users can expand according to their own business.

Materialized queries of data streams

HStreamDB provides materialized view capabilities to support complex query and analysis operations over continuously updated data streams. The incremental computing engine inside HStreamDB updates the materialized view in real time as the data flow changes, and users can query the materialized view through SQL statements to gain real-time data insight.

Data Flow management

HStreamDB supports the creation and management of large numbers of data streams. Data stream creation is a very lightweight operation in HStreamDB, while maintaining stable read and write latency despite large numbers of concurrent reads and writes based on an optimized storage design.

Persistent storage of data flows

HStreamDB provides reliable data stream storage with low latency, ensuring that written data messages are not lost and can be reused. HStreamDB copies written data messages to multiple storage nodes, providing high availability and fault tolerance. It also supports the dumping of cold data to lower-cost storage services, such as object storage and distributed file storage. The storage capacity can be unlimited and data can be permanently stored.

Schema management of data flows

HStreamDB emphasizes flexible Schema support. Data streams can be schema-free, Schema can be developed using Json, Avro, Protobuf and other formats, and Schema evolution is also supported. Automatically manages compatibility between multiple Schema versions.

Access and distribution of data streams

HStreamDB data access and distribution is done by Connector, which connects to a variety of data systems including MQTT Broker, MySQL, ElasticSearch, Redis, etc., to facilitate integration between users and external data systems.

Security mechanism

HStreamDB security will be guaranteed by TLS encrypted transport, OAuth and JWT based authentication and authorization mechanisms, with security plug-in interfaces reserved for users to extend the default security mechanism as needed.

Monitoring and o&M tools

HStreamDB features a Web-based console with numerous system dashboards and visualizations, enabling detailed monitoring of cluster machine status, system key indicators, and more to facilitate cluster management by o&M personnel.

HStreamDB application scenarios

Real-time data analysis

Traditional data analysis is usually based on batch processing technology. Batch processing is generally run on a limited set of pre-collected data, so the analysis results often do not contain the latest data and have a high delay. HStreamDB’s ability to analyze real-time data streams and update the results as the data flows change makes it possible to better support applications such as real-time prediction of web user activity and real-time analysis of iot sensor data. Not only does it provide more real-time data insight than batch processing, but it also avoids the error-prone and complexity of scheduling batch tasks periodically.

Event-driven application

Event-driven applications usually trigger corresponding actions or behaviors in real time according to incoming events, which can be stateless or stateless, such as real-time fraud detection in financial transactions, business process monitoring and warning Internet of Things rule engine, etc. Based on HStreamDB, implementing these complex event-driven applications may require only a few SQL statements, greatly reducing the cost of developing and maintaining these applications.

Real-time data pipeline

Enterprise internal often require data synchronization between multiple data systems and migration, such as online transaction data in the database offline copy to the data warehouse were analyzed, and the process is usually performed by a set of ETL system, this kind of ETL system development and maintenance costs are relatively high, often the data synchronization and it is not real-time, Scalability is also poor. HStreamDB integrates a variety of external system connectors, making it easy to build real-time data pipelines, real-time build indexes, real-time build caches and other data synchronization tasks.

Online machine learning

Nowadays, machine learning system plays an increasingly important role in business system, including search, recommendation, risk control and other events are widely dependent on machine learning system. However, with the explosive development of online business and related application scenarios, conventional offline systems and offline machine learning platforms can no longer meet the requirements of business development. HStreamDB’s real-time computing engine enables real-time machine learning systems, online feature extraction, real-time recommendation and other applications.

HStreamDB is quick to get started

Let’s get started using HStreamDB quickly based on Docker.

Pull the Docker image

docker pull hstreamdb/logdevice
docker pull hstreamdb/hstream
Copy the code

Start a local HStream Server in docker

Create a directory for storing data

mkdir ./dbdata
Copy the code

Start the HStream Storage

docker run -td --rm --name some-hstream-store -v dbdata:/data/store --network host hstreamdb/logdevice ld-dev-cluster --root /data/store --use-tcp
Copy the code

Start the HStreamDB Server

docker run -it --rm --name some-hstream-server -v dbdata:/data/store --network host hstreamdb/hstream hstream-server --port 6570 -l /data/store/logdevice.conf
Copy the code

Start the HStreamDB CLI

docker run -it --rm --name some-hstream-cli -v dbdata:/data/store --network host hstreamdb/hstream hstream-client --port  6570Copy the code

If all is well, when you enter the CLI, you should see something like the following:

Start HStream-Cli!
Command
  :h                        help command
  :q                        quit cli
  show queries              list all queries
  terminate query <taskid>  terminate query by id
  terminate query all       terminate all queries
  <sql>                     run sql

>
Copy the code

Creating a Data flow

Now we will CREATE a new data STREAM using the CREATE STREAM statement,

CREATE STREAM demo WITH (FORMAT = "JSON");
Copy the code

After executing the preceding statement on the CLI, you will see information similar to the following, indicating that the execution succeeded.

Right ( CreateTopic { taskid = 0 , tasksql = "CREATE STREAM demo WITH (FORMAT = "JSON");" , taskStream = "demo", taskState = Finished, createTime = 2021-02-04 09:07:25.639197201 UTC})Copy the code

Perform a persistent query

We use SELECT statements for real-time processing and analysis of the data stream. On the CLI, execute the following statement,

SELECT * FROM demo WHERE humidity > 70 EMIT CHANGES;
Copy the code

When the execution is complete, it does not produce any results, which is normal because there is no data in the data stream yet, so we will write some data to the data stream and observe the results. Also, note that this SELECT statement is different from a normal database SELECT statement that returns after a single execution. Instead, it continues until you explicitly terminate it.

Start a new CLI session

docker exec -it some-hstream-cli hstream-client --port 6570
Copy the code

Insert data into the data stream

Execute the following INSERT statement to write data to the data stream,

INSERT INTO demo (temperature, humidity) VALUES (22, 80);
INSERT INTO demo (temperature, humidity) VALUES (15, 20);
INSERT INTO demo (temperature, humidity) VALUES (31, 76);
INSERT INTO demo (temperature, humidity) VALUES ( 5, 45);
INSERT INTO demo (temperature, humidity) VALUES (27, 82);
INSERT INTO demo (temperature, humidity) VALUES (28, 86);
Copy the code

If all goes well, you should see the following real-time output in the PREVIOUS CLI window:

{"temperature":22,"humidity":80}
{"temperature":31,"humidity":76}
{"temperature":27,"humidity":82}
{"temperature":28,"humidity":86}
Copy the code

HStreamDB open source community

As an open source infrastructure provider, EMQ has always believed in the value and power of open source, so HStreamDB has been developed entirely on GitHub in an open source way since the beginning of the project.

The HStreamDB project is moving forward as a team effort, and this is the perfect time for all of you in the open source community to get involved.

We invite you to join us in building an open source community for HStreamDB by visiting the HStreamDB website (hstream.io/) or the GitHub project address (github.com/hstreamdb/h…). Learn more about the project and join us on Slack Channel (slack-invitation.hstream.io /) for a discussion. We will also hold Open Day activities regularly to share the progress of the project and exchange technical knowledge.

In future plans, HStreamDB will continue to support and improve distributed processing support, Schema management, SQL optimization, monitoring and operation.

We believe that with the support of every partner who loves open source, we will take HStreamDB as the benchmark to create and witness the future of streaming database together!

Company introduction

EMQ is an open source Internet of Things infrastructure software provider, serving the Internet of Things, edge computing and cloud computing markets of the 5G industry cycle, delivering the world’s leading open source MQTT messaging server and flow processing database, providing a one-stop solution for real-time Internet of Things data mobile distribution, flow processing and analysis.

EMQ was founded in 2017 and has open source project teams around the world. Headquartered in Hangzhou, the company has branches in Beijing, Shanghai, Shenzhen, Nanjing, Kunming and Chongqing. The overseas R&D headquarters is located in Stockholm, with branches or service teams in Sweden, Germany, North America and Japan.

We believe In all-in-one commercial open source software and pursue the corporate mission of “serving the future industry and society of mankind through world-class open source software products”.

In the future, EMQ will continue to focus on HStreamDB, a streaming database product that integrates streaming data storage, real-time streaming processing and low-latency streaming data analysis. It will combine with the existing EMQ X Broker to form the next generation cloud-edge Model for Streaming and serve as a competitive open source base software stack to reshape the global database and Streaming market for the next decade.

Copyright: EMQ

Original link: www.emqx.cn/blog/hstrea…