This article is from the NetEase Cloud community.
Author: Wang Jianwei
-
Preface
Recently I took part in designing the streaming monitoring feature of Sentry and investigated two stream computing frameworks, Storm and Spark Streaming; I was responsible for the Storm research. Over about a week, on and off, I read the documentation on the official website and some other material online, and I have summarized what I learned into this document as a guide for colleagues who are interested in Storm.
-
Storm background
With the further development of the Internet, from portals (information browsing) to search (information seeking) to social networks (relationship-driven interaction), and on to e-commerce, online travel, and other lifestyle products, more and more of the transactions of daily life have moved online. The demand for efficiency raises the demand for real-time processing, and the exchange of information is evolving from point-to-point into information chains and even information networks, which inevitably cross-correlates data across many dimensions and makes a data explosion unavoidable. Stream processing and NoSQL products have therefore emerged to solve, respectively, the problems of real-time frameworks and of large-scale data storage and computation.
Twitter open sourced Storm in 2011. Before that, a developer building a real-time application had to deal not only with the application logic itself but also with the real-time flow, exchange, and distribution of data, which was a major burden. Now developers can quickly stand up a real-time stream processing framework, pair it with robust, easy-to-use SQL or NoSQL products or a graph computing platform, and build at low cost many real-time products that were previously hard to imagine, such as multi-dimensional real-time analytics products built on top of a real-time stream processing platform.
-
Storm languages
Storm's core logic is written mainly in Clojure, with additional parts implemented in Python and Java.
-
The characteristics of Storm
1. Simple programming model
Storm, like Hadoop, provides some simple and elegant primitives for real-time computing of big data, which greatly reduces the complexity of developing parallel real-time processing tasks and helps you develop applications quickly and efficiently.
2. Scalable
Three kinds of entities actually run a topology in a Storm cluster: worker processes, threads, and tasks. Worker processes can run on any machine in the Storm cluster, each worker process can create multiple threads, and each thread can execute multiple tasks; tasks are the entities that actually process the data. Computation is therefore carried out in parallel across multiple threads, processes, and servers, which allows flexible horizontal scaling.
3. High reliability
Storm guarantees that every message emitted by a spout is "fully processed". A message emitted by a spout may trigger thousands of downstream messages, which can be visualized as a message tree with the spout's message as its root. Storm tracks the processing of this message tree: only when every message in the tree has been processed does Storm consider the spout's message "fully processed". If the processing of any message in the tree fails, or the whole tree is not "fully processed" within a time limit, the spout's message is re-emitted. To keep memory consumption as low as possible, Storm does not track every message in the tree individually; instead it uses a special strategy to track the tree as a whole, XORing the unique IDs of all messages in the tree and checking whether the result is zero to decide whether the spout's message has been fully processed. This saves a great deal of memory and simplifies the decision logic, as described in more detail below.
In this mode, an ack/fail is sent for every message, which consumes a certain amount of network bandwidth. If high reliability is not required, this mode can be disabled by using a different emit interface (for example, emitting from the spout without a message ID).
As mentioned above, Storm guarantees that each message is processed at least once, but some computing scenarios strictly require that each message be processed exactly once. Storm 0.7.0 addresses this by introducing transactional topologies.
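The XOR bookkeeping described above can be illustrated with a tiny sketch. This is a simplified model of the idea only, not Storm's actual acker code; the tuple IDs below are made up:

```java
// Simplified illustration of Storm's XOR tracking idea (not the real acker implementation).
public class XorTrackingSketch {
    public static void main(String[] args) {
        long ackVal = 0L;                 // one 64-bit value tracks the whole tuple tree
        long t1 = 0x5a5aL, t2 = 0x1234L;  // hypothetical random IDs of two tuples in the tree

        ackVal ^= t1;   // t1 emitted
        ackVal ^= t2;   // t2 emitted
        ackVal ^= t1;   // t1 acked
        ackVal ^= t2;   // t2 acked

        // Every ID has been XORed in exactly twice, so the value returns to zero:
        System.out.println("fully processed? " + (ackVal == 0));  // prints true
    }
}
```

Because each ID is XORed in once when the tuple is emitted and once when it is acked, a single 64-bit value per spout tuple is enough to track an arbitrarily large message tree.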
4. High fault tolerance
If something goes wrong during message processing, Storm reschedules the failing processing unit. Storm guarantees that a processing unit runs forever (unless you explicitly kill it). Of course, if the processing unit held intermediate state, the application itself must recover that state after Storm restarts the unit.
5. Fast
Fast here mainly refers to low latency. Storm transfers data directly over the network and computes in memory, so its latency is far lower than Hadoop's, which moves data through HDFS. When the computation model is well suited to streaming, Storm's stream processing also avoids the time spent accumulating data that batch processing requires, and because Storm runs as a long-lived service it avoids job scheduling delays as well. So in terms of latency, Storm is faster than Hadoop.
In a typical scenario, several thousand log producers generate log files on which some ETL operations must be performed before the results are stored in a database.
With Hadoop, the data must first be written to HDFS and computed at the granularity of one file per minute (which is already extremely fine; anything smaller would leave HDFS with piles of tiny files). By the time Hadoop starts computing, a minute has already passed; scheduling the job takes roughly another minute; then the job runs, which even with plenty of machines takes a few minutes; and writing to the database takes a little more time. So at least a couple of minutes pass between the moment the data is generated and the moment it can finally be used.
With stream computing, by contrast, a program watches the logs as they are produced, the data is continuously pushed through a transmission system to the stream computing system, the stream computing system processes it directly, and the results are written straight to the database. Given sufficient resources, each record can go from being produced to being written to the database within milliseconds.
6. Supports multiple programming languages
In addition to implementing spouts and bolts in Java, you can use any programming language you are familiar with, thanks to what Storm calls its multi-language protocol. The multi-language protocol is a special protocol inside Storm that lets a spout or bolt exchange messages over standard input and standard output as single lines of text or as multi-line JSON-encoded text.
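On the Java side this typically looks like the following sketch, which follows the pattern used in Storm's standard multilang examples. The script name splitsentence.py is a hypothetical external component, and package names may differ slightly between Storm versions:

```java
import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

import java.util.Map;

// Java-side wrapper: the actual processing happens in an external Python script
// that speaks Storm's multi-language (JSON over stdin/stdout) protocol.
public class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        super("python", "splitsentence.py");  // hypothetical script shipped with the topology's resources
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```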
7. Supports local mode
Storm has a "local mode" that simulates all of the functionality of a Storm cluster within a single process. Running a topology in local mode is similar to running it on a cluster, which is very useful for development and testing.
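A minimal local-mode sketch might look like this. MySpout and MyBolt are hypothetical components, and the LocalCluster usage shown here follows the classic pattern; details vary slightly between Storm versions:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.utils.Utils;

public class LocalModeExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout(), 1);                        // hypothetical spout
        builder.setBolt("bolt", new MyBolt(), 2).shuffleGrouping("spout");  // hypothetical bolt

        Config conf = new Config();
        conf.setDebug(true);

        // Simulate a whole Storm cluster inside this JVM.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("test-topology", conf, builder.createTopology());

        Utils.sleep(10_000);                  // let the topology run for a while
        cluster.killTopology("test-topology");
        cluster.shutdown();
    }
}
```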
-
The composition of Storm
There are two kinds of nodes in a Storm cluster: master nodes and worker nodes. The master node runs a daemon called Nimbus, which plays a role similar to the JobTracker in Hadoop. Nimbus is responsible for distributing code across the cluster, assigning computing tasks to machines, and monitoring their status.
Each worker node runs a daemon called the Supervisor. The Supervisor listens for the work assigned to its machine and starts or stops worker processes as needed. Each worker process executes a subset of a topology; a running topology consists of many worker processes running on many machines.
All coordination between Nimbus and the Supervisors is done through a ZooKeeper cluster. In addition, both the Nimbus and Supervisor processes are fail-fast and stateless; all state is kept either in ZooKeeper or on local disk. This means you can kill the Nimbus and Supervisor processes with kill -9 and then restart them as if nothing had happened. This design makes Storm extremely stable.
Let’s look at these concepts in more detail.
Nimbus: Allocates resources and schedules tasks.
Supervisor: Receives the tasks assigned by Nimbus and starts or stops the worker processes it manages.
Worker: A process that runs the processing logic of specific components (spouts and bolts).
Task: Each spout/bolt thread within a worker is called a task. Since Storm 0.8, a task no longer corresponds to a physical thread; tasks of the same spout/bolt may share one physical thread, which is called an executor.
(Diagram: the relationships between the roles described above.)
-
Topology Fundamentals
Storm clusters and Hadoop clusters look superficially similar, but running MapReduce jobs on Hadoop and running topologies on Storm are very different. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (unless you kill it manually).
1 Topologies
A topology is a graph of spouts and bolts in which the spouts and bolts are connected by stream groupings.
A topology runs until you kill it manually. Storm automatically reassigns failed tasks and guarantees that no data is lost (when high reliability is enabled). If a machine shuts down unexpectedly, all the tasks on it are moved to other machines.
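To make the topology concept concrete, here is a hedged sketch of how a topology is wired together and submitted to a cluster. SentenceSpout, SplitBolt, and CountBolt are hypothetical components; the TopologyBuilder and StormSubmitter calls follow standard Storm usage:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // One spout and two bolts form the graph; stream groupings connect them.
        builder.setSpout("sentences", new SentenceSpout(), 2);    // hypothetical spout
        builder.setBolt("split", new SplitBolt(), 4)              // hypothetical bolt
               .shuffleGrouping("sentences");
        builder.setBolt("count", new CountBolt(), 4)              // hypothetical bolt
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2);

        // The topology keeps running on the cluster until it is killed manually.
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}
```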
2 Streams
Streams are the core abstraction of Storm. A stream is an unbounded sequence of tuples that is created and processed in parallel in a distributed environment. A stream is defined by a schema that names the fields of the tuples in the stream. By default, tuple fields can hold primitive types such as Integer, Long, Short, Byte, Double, Float, Boolean, and byte arrays; you can also use custom types by defining your own serializable objects.
3 Spouts
A spout is a source of streams in a topology. Typically a spout reads tuples from an external data source and emits them into the topology. Depending on the requirements, a spout can be defined as either reliable or unreliable: a reliable spout can re-emit a tuple if the tuple fails to be processed, ensuring that every tuple is processed correctly, whereas an unreliable spout does nothing further with a tuple once it has been emitted.
A spout can emit more than one stream. To do this, declare the different streams with the declareStream method of OutputFieldsDeclarer, and then pass the stream ID as a parameter to the emit method of SpoutOutputCollector when emitting data.
The key method of a spout is nextTuple. As the name implies, nextTuple either emits a new tuple into the topology or simply returns when there is nothing to emit. Note that because Storm calls all of a spout's methods on the same thread, nextTuple must not block; blocking here would also block the spout's other methods and directly interrupt the flow of data.
The other two key methods of a spout are ack and fail, which Storm calls after it detects that an emitted tuple has been processed successfully or has failed, respectively. Note that ack and fail are only invoked for the "reliable" spouts described above.
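A minimal reliable-spout sketch might look like the following, assuming tuples are replayed from an in-memory map keyed by message ID (the queue used as a source is hypothetical, and method signatures vary slightly between Storm versions; older releases use a raw Map in open):

```java
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // Hypothetical in-memory source plus a map of in-flight tuples kept for replay.
    private final ConcurrentLinkedQueue<String> source = new ConcurrentLinkedQueue<>();
    private final Map<String, String> pending = new ConcurrentHashMap<>();

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String sentence = source.poll();
        if (sentence == null) {
            return;                      // nothing to emit; must not block here
        }
        String msgId = UUID.randomUUID().toString();
        pending.put(msgId, sentence);
        // Emitting with a message ID makes this a "reliable" emit that Storm tracks.
        collector.emit(new Values(sentence), msgId);
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);           // fully processed; forget it
    }

    @Override
    public void fail(Object msgId) {
        String sentence = pending.get(msgId);
        if (sentence != null) {
            collector.emit(new Values(sentence), msgId);  // replay the failed tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
```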
4 Bolts
All of the processing in a topology is done in bolts. Bolts can satisfy almost any data processing requirement: filtering, functions, aggregations, joins, database interactions, and more.
A single bolt can perform a simple stream transformation; more complex transformations usually require multiple bolts and multiple steps. For example, converting a microblog stream into a stream of trending images takes at least two steps: one bolt keeps a rolling count of the retweets of each image, and one or more further bolts emit the "most retweeted images" as the output stream (using three bolts instead of two makes this transformation considerably more scalable).
Like spouts, bolts can emit more than one stream. To do this, declare the different streams with the declareStream method of OutputFieldsDeclarer, and then pass the stream ID as a parameter to the emit method of OutputCollector when emitting data.
When defining a bolt's input streams, you subscribe to specific streams of other components. If you need to subscribe to the streams of several other components, you must subscribe to each of them individually when defining the bolt. InputDeclarer provides syntactic sugar for subscribing to streams declared with the default stream ID: to subscribe to the default stream of component "1", declarer.shuffleGrouping("1") is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).
The key method of a bolt is execute. The execute method receives a tuple as input and emits new tuples via an OutputCollector object. If message reliability is required, a bolt must call the OutputCollector's ack method for every tuple it processes so that Storm knows when the tuple has been processed (and, ultimately, whether the tuple tree rooted at the original spout tuple has completed). In general, for each input tuple a bolt may, after processing, emit zero or more new tuples and then ack the input tuple; the IBasicBolt interface can perform this acking automatically.
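A minimal bolt sketch with anchored emits and explicit acks is shown below. This is a hypothetical word-splitting example; method signatures vary slightly between Storm versions (older releases use a raw Map in prepare):

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.Map;

public class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getStringByField("sentence").split("\\s+")) {
            // Anchoring the new tuple to the input tuple adds it to the tuple tree.
            collector.emit(input, new Values(word));
        }
        // Ack the input so Storm knows this step of the tree is complete.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```

If a bolt extends BaseBasicBolt (which implements IBasicBolt) instead, the anchoring and acking shown above are handled automatically.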
5 Stream Groupings
Deciding the input streams of each bolt in a topology is an important step in defining the topology. A stream grouping defines how a stream is partitioned among a bolt's tasks.
Storm has eight built-in stream groupings, and you can also implement a custom grouping with the CustomStreamGrouping interface. The eight groupings are listed below (a sketch after the list shows how they are declared):
1. Shuffle grouping: Tuples are distributed as randomly as possible among the bolt's tasks, so that each task handles roughly the same number of tuples; this ensures load balancing.
2. Fields grouping: The stream is partitioned by the specified fields. For example, if a stream is grouped by a field named "user-id", all tuples containing the same "user-id" are sent to the same task, which guarantees consistent processing of related messages.
3. Partial Key grouping: Similar to fields grouping, the stream is partitioned by the specified fields, but this grouping balances load better across the bolt's tasks when the input data is skewed. Interested readers can refer to the paper, which explains in detail how this grouping works and what its advantages are.
4. All grouping: The stream is sent to all of the bolt's tasks at once (that is, the same tuple is replicated and processed by every task). Use this grouping with care.
5. Global grouping: The entire stream is sent to the bolt's task with the smallest ID.
6. None grouping: Using this grouping means you do not care how the stream is partitioned. It is currently equivalent to shuffle grouping, but in the future the Storm community may use it to let a bolt execute on the same thread as the spout or bolt it subscribes to.
7. Direct grouping: A special grouping in which the producer of a tuple decides which task of the consumer receives it. Direct grouping can only be used on streams that have been declared as direct streams, and tuples on a direct stream must be emitted with one of the emitDirect methods of OutputCollector. A bolt can obtain the task IDs of its downstream consumers through the TopologyContext, or from the return value of OutputCollector's emit method, which is the list of task IDs the emitted tuple was sent to.
8. Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process as the source component, tuples are shuffled only among those in-process tasks; otherwise this behaves like an ordinary shuffle grouping.
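As a hedged sketch of how several of these groupings are declared when wiring bolts (the spout and bolt classes are hypothetical placeholders):

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingExamples {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SentenceSpout(), 2);   // hypothetical spout

        // 1. Shuffle grouping: tuples are distributed randomly and evenly.
        builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("spout");

        // 2. Fields grouping: tuples with the same "word" always go to the same task.
        builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("split", new Fields("word"));

        // 4. All grouping: every task of this bolt receives a copy of every tuple.
        builder.setBolt("metrics", new MetricsBolt(), 2).allGrouping("split");

        // 5. Global grouping: the whole stream goes to the task with the smallest ID.
        builder.setBolt("total", new TotalBolt(), 1).globalGrouping("count");

        // 8. Local-or-shuffle grouping: prefer tasks in the same worker process.
        builder.setBolt("sink", new SinkBolt(), 4).localOrShuffleGrouping("count");
    }
}
```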
6 Tasks
Each spout and bolt in a Storm topology is executed by a number of tasks, and each task corresponds to one thread of execution. Stream groupings determine how tuples are sent from one set of tasks to another. The parallelism of each spout/bolt is set with the setSpout and setBolt methods of TopologyBuilder.
7 Workers
Topologies run in one or more worker processes. Each worker process is an actual JVM process and executes a subset of the topology. For example, if the total parallelism of a topology is 300 and 50 worker processes are allocated, then each worker process runs 6 tasks (as threads within that process). Storm spreads the tasks evenly across all the workers to balance load over the cluster.
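A hedged sketch of how workers and parallelism are configured follows (the components are hypothetical; with 50 workers and 300 threads of parallelism requested below, each worker ends up running roughly 6 threads, matching the example above):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // The parallelism hints below add up to 300 threads of execution.
        builder.setSpout("spout", new SentenceSpout(), 100);                       // hypothetical spout
        builder.setBolt("split", new SplitBolt(), 100).shuffleGrouping("spout");   // hypothetical bolt
        builder.setBolt("count", new CountBolt(), 100).shuffleGrouping("split");   // hypothetical bolt

        Config conf = new Config();
        conf.setNumWorkers(50);   // 300 threads / 50 JVM worker processes = ~6 threads per worker

        StormSubmitter.submitTopology("parallelism-example", conf, builder.createTopology());
    }
}
```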