RocketMQ-Streams focuses on "large data volume -> high filtering -> light window computing" scenarios. Its core strengths are light resource usage and high performance, which give it a clear advantage in resource-sensitive scenarios: it can be deployed with as little as 1 core and 1 GB of memory. Thanks to extensive filtering optimizations, its performance is 2-5 times that of other big data engines. It is widely used in security, risk control, edge computing, and message queue stream computing. RocketMQ-Streams is compatible with Flink SQL and Flink UDF/UDTF/UDAF. In the future it will be deeply integrated with the Flink ecosystem: it can run standalone or be published as a Flink task running in a Flink cluster, so scenarios that already have a Flink cluster keep the light-resource advantage while achieving unified deployment, operation, and maintenance.
RocketMQ-Streams features and application scenarios
RocketMQ-Streams application scenarios
• Computing scenarios: suitable for large data volume, high filtering, and light window computing. Unlike mainstream computing engines, which require deploying a cluster, writing a task, publishing, tuning, and other complex steps, RocketMQ-Streams is a lib package: write a stream task against the SDK and run it directly. It supports the computing features big data development needs: Exactly-Once, flexible windows (tumbling, sliding, session), dual-stream join, high throughput, low latency, and high performance. It can run with as little as 1 core and 1 GB.
• SQL engine: RocketMQ-Streams can be used as an SQL engine. It is compatible with Flink SQL syntax and supports extension via Flink UDF/UDTF/UDAF. It also supports SQL hot upgrades: after writing SQL, submit it through the SDK to complete a hot release.
• ETL engine: RocketMQ-Streams can also serve as an ETL engine. In many big data scenarios, data needs to be collected from a source through ETL into unified storage. Built-in functions such as Grok and regular-expression parsing can be combined with SQL to complete ETL.
• Development SDK: it is also a data development SDK package in which most components can be used independently, such as source and sink. These shield the details of the data source and data storage and provide a unified programming interface: one set of code can switch input and output without code changes.
RocketMQ-Streams design ideas
Design goals
• Few dependencies and simple deployment: a single instance runs on 1 core and 1 GB and can be scaled out at will.
• Implement the required big data features: Exactly-Once, flexible windows (tumbling, sliding, session), dual-stream join, high throughput, low latency, high performance.
• Controllable cost: low resource usage, high performance.
• Compatibility with Flink SQL and UDF/UDTF/UDAF, making it easier for non-technical staff to use.
• A shared-nothing distributed architecture is adopted, relying on the message queue for load balancing and fault tolerance: a single instance can start, and adding instances achieves capacity expansion. Concurrency is bounded by the number of shards.
• Shuffle is implemented through the message queue's shards, and fault tolerance through the message queue's load balancing.
• Storage is used to back up state and realize the Exactly-Once semantics. Fast startup is achieved with structured remote storage, without waiting for local storage to recover.
RocketMQ-Streams features and innovations
RocketMQ-Streams SDK explained
Hello World
As usual, we'll start with a RocketMQ-Streams example.
• Namespace: tasks with the same namespace can run in one process and share configurations.
• pipelineName: the job name.
• DataStreamSource: creates the source node.
• map: a user function, which can be extended by implementing MapFunction.
• toPrint: prints the result.
• start: starts the task.
• Running the code above starts one instance. For multi-instance concurrency, start multiple instances; each consumes a portion of the RocketMQ data.
• Run result: the original message is concatenated with "-" and printed.
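The original article shows the Hello World code as a screenshot; below is a minimal sketch of what such a job looks like, built from the operators listed above. The `fromRocketmq` parameters, package names, and exact method signatures are assumptions drawn from the RocketMQ-Streams SDK, not from this article, so check them against the version in use.

```java
import org.apache.rocketmq.streams.client.StreamBuilder;
import org.apache.rocketmq.streams.client.source.DataStreamSource;

public class HelloWorld {
    public static void main(String[] args) {
        // namespace + pipelineName identify the job; jobs with the same
        // namespace can run in one process and share configurations
        DataStreamSource source = StreamBuilder.dataStream("test_namespace", "pipeline");
        source.fromRocketmq("TOPIC_TEST", "CONSUMER_GROUP", "127.0.0.1:9876") // topic, group, namesrv
              .map(message -> message + "-") // concatenate "-" to the original message
              .toPrint(1)                    // print the result
              .start();                      // start the task (one instance)
    }
}
```

Starting a second copy of this process would give a second instance, each consuming a share of the RocketMQ shards.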
RocketMQ-Streams SDK
• StreamBuilder is the starting point: create a DataStreamSource by setting the namespace and jobName.
• DataStreamSource uses the from method to set the source and create a DataStream object.
• DataStream provides multiple operations that produce different streams:
• The to operation generates a DataStreamAction.
• The window operation generates a WindowStream, which configures window parameters.
• The join operation generates a JoinStream, which configures join conditions.
• The split operation generates a SplitStream, which configures split conditions.
• Other operations generate a DataStream.
• DataStreamAction starts the whole task and can also configure various policy parameters for the task. Both asynchronous and synchronous startup are supported.
RocketMQ-Streams operators
SQL can be deployed in two modes. The first is to run the client directly to start the SQL, as shown in the first red box. The second is to set up a server cluster and use the client to submit SQL for hot deployment, as shown in the second red box.
RocketMQ-Streams SQL supports a variety of extensions:
• Extend SQL capabilities via Flink UDF/UDTF/UDAF and create function in SQL, with the restriction that the UDF must not use Flink's FunctionContext in its open method.
• Extend SQL with built-in functions. The syntax is the same as Flink's: the function name is the built-in function's name, and the class name is fixed. In the figure below, we introduce a now function that outputs the current time. More than 200 functions are built into the system and can be introduced on demand.
• Extend by implementing a function directly. This is easy: mark the class with Function, mark the method to be published with FunctionMethod, and set the name under which the function is published. If system information is required, the first two parameters can be IMessage and Abstract; if not, write the business parameters directly, with no format requirements. As shown in the figure below, we create a function named now, which can be written either way. It can then be called as currentTime=now(), which adds a key=currentTime, value=current time variable to the message.
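The extension above can be sketched as follows. The annotation names and package paths are taken from the RocketMQ-Streams SDK and are assumptions relative to this article; verify them against the version in use.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.rocketmq.streams.script.annotation.Function;
import org.apache.rocketmq.streams.script.annotation.FunctionMethod;

@Function // marks the class as a function provider
public class DateFunction {
    // Published under the name "now"; callable from SQL/scripts as currentTime=now(),
    // which adds key=currentTime, value=current time to the message.
    @FunctionMethod("now")
    public String now() {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
    }
}
```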
• Publish existing Java code as a function: configure the Java class name, method name, and the desired function name through a policy configuration, and copy the Java JAR into the JAR directory. Below are examples of several extensions in use.
RocketMQ-Streams architecture and implementation principles
The overall architecture
Source implementation
• Source needs to realize at-least-once consumption semantics, which is done through the checkpoint system message: before the offset is committed, a checkpoint message is sent to notify all operators to flush their memory.
• Source supports automatic load balancing and fault tolerance of shards.
• When a shard is removed, the data source sends a shard-removal system message so operators can clean up the shard.
• When there is a new shard, a new-shard message is sent so operators can initialize it.
• The data source starts the consumer via the start method to fetch messages.
• The original message is encoded, wrapped with additional header information as a Message, and delivered to subsequent operators.
Sink implementation
• Sink balances real-time performance and throughput.
• To implement a sink, simply inherit the AbstractSink class and implement the batchInsert method. batchInsert writes a batch of data to storage; subclasses implement it by calling the storage interface, using the storage's batch interface where possible to improve throughput.
• The write flow is Message -> cache -> flush -> storage. The system ensures that each batch writes no more than batchSize records; larger writes are split into multiple batches.
• Sink has a cache; data is written to the cache by default and stored in batches to improve throughput (one cache per shard).
• Automatic flushing can be enabled: each shard has a thread that periodically flushes cached data to storage, improving real-time performance. The implementation class is DataSourceAutoFlushTask.
• The cache can also be flushed to storage by calling flush.
• The sink cache has memory protection: when the number of cached messages exceeds batchSize, the cache is forcibly flushed to release memory.
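The cache-then-flush behavior described above can be illustrated with a self-contained sketch. The class and method names below (other than batchInsert) are hypothetical; in RocketMQ-Streams the real base class is AbstractSink.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of a sink that caches writes and stores them in batches.
abstract class BatchingSink<T> {
    private final int batchSize;
    private final List<T> cache = new ArrayList<>();

    BatchingSink(int batchSize) {
        this.batchSize = batchSize;
    }

    // Subclasses call the storage's batch interface here.
    protected abstract void batchInsert(List<T> batch);

    public synchronized void write(T message) {
        cache.add(message);
        if (cache.size() >= batchSize) {
            flush(); // memory protection: force a flush once the cache is full
        }
    }

    public synchronized void flush() {
        // Never hand more than batchSize records to storage in a single call;
        // larger caches are split into multiple batches.
        for (int i = 0; i < cache.size(); i += batchSize) {
            int end = Math.min(i + batchSize, cache.size());
            batchInsert(new ArrayList<>(cache.subList(i, end)));
        }
        cache.clear();
    }
}
```

A periodic auto-flush (the role DataSourceAutoFlushTask plays) would simply call flush() from a per-shard timer thread.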
RocketMQ-Streams Exactly-Once
• Source ensures that a checkpoint message is sent when the offset is committed, and components that receive it finish saving their state; messages are consumed at least once.
• Each message carries a header that encapsulates the QueueId and offset.
• When persisting data, a component also stores the maximum offset it has processed per QueueId. If duplicate messages arrive, they are discarded based on this max offset.
• Memory protection: there may be multiple flushes within one checkpoint cycle to keep memory usage under control.
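The max-offset deduplication rule can be sketched as follows; this is an illustration of the idea, not the engine's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Per queue, remember the largest offset already processed and drop
// any replayed message at or below it.
class OffsetDeduper {
    private final Map<String, Long> maxOffsets = new HashMap<>();

    /** Returns true if the message is new and should be processed. */
    public boolean accept(String queueId, long offset) {
        Long max = maxOffsets.get(queueId);
        if (max != null && offset <= max) {
            return false; // duplicate replayed after a failure; discard
        }
        maxOffsets.put(queueId, offset);
        return true;
    }
}
```

After a crash, at-least-once delivery replays messages from the last committed offset; this check turns that into exactly-once output.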
RocketMQ-Streams Window
• Supports tumbling, sliding, and session windows, with both event time and natural time (the time a message enters the operator).
• Supports a high-performance mode and a high-reliability mode. The high-performance mode does not rely on remote storage, but window data is lost on shard switchover.
• Fast startup with no wait for local storage recovery: on error or shard switchover, data is recovered asynchronously from remote storage, which is accessed directly for computing in the meantime.
• Uses message queue load balancing to achieve scale-out and scale-in. Each queue is a group, and a group is consumed by only one machine at a time.
• Normal computing relies on local storage and has computing performance similar to Flink's.
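As a concrete illustration of the window model (the concept only, not the engine's code): tumbling-window assignment by event time maps each timestamp to exactly one fixed-size [start, end) window.

```java
// Tumbling (scrolling) windows: non-overlapping, fixed-size intervals.
// A sliding window would instead map one timestamp to size/slide windows.
class TumblingWindows {
    static long windowStart(long eventTimeMs, long sizeMs) {
        return eventTimeMs - (eventTimeMs % sizeMs);
    }

    static long windowEnd(long eventTimeMs, long sizeMs) {
        return windowStart(eventTimeMs, sizeMs) + sizeMs;
    }
}
```

For example, with a 1-minute window, an event at t=61s falls into the window [60s, 120s).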
Three trigger modes are supported, to balance watermark delay against real-time requirements.
RocketMQ-Streams for cloud security
In the context of security applications
• When the public cloud product moves to the proprietary (private) cloud, intrusion detection computing hits a resource problem: a big data cluster is not part of the default delivery, and adding one requires at least 6 high-spec machines, which is hard for users to accept just because they bought Cloud Shield.
• Upgrades for proprietary cloud users are difficult to operate and maintain, making it impossible to quickly ship new capabilities and fix bugs.
Stream computing in security applications
• Build a lightweight computing engine based on the characteristics of security workloads (big data -> high filtering -> light window computing): analysis shows that all rules pre-filter first, and only then perform the heavy statistics, window, and join operations, with a relatively high filtering rate. Based on this characteristic, statistics and join can be implemented with a lighter scheme.
• 100% rule coverage on the proprietary cloud (regex, join, statistics) via RocketMQ-Streams.
• Light resources: memory usage is 1/70 of the public cloud engine's and CPU usage 1/6. With fingerprint-filtering optimization, performance improves more than 5 times. Resources do not grow linearly with rules, and new rules add no resource pressure; reusing the previous regex engine's resources can cover more than 95% of sites without adding physical resources.
• Supports tens of millions of dimension-table rows through high compression: 10 million rows need only 330 MB of memory.
• In C/S deployment mode, SQL and the engine can be hot-released, so rules can be published quickly, which matters especially in network-protection scenarios.
RocketMQ-Streams future plans
Download the new version at github.com/apache/rock…