Here are 5 must-know big data processing framework technologies.
Big data is an umbrella term for the unconventional strategies and techniques needed to collect, organize, and process large data sets in order to gain insight from them. While working with data that exceeds the computing power or storage of a single computer is nothing new, the pervasiveness, scale, and value of this type of computing have expanded dramatically in recent years.
This article introduces one of the most fundamental components of a big data system: the processing framework. Processing frameworks are responsible for computing over data in the system, whether that data has been read from non-volatile storage or has just been ingested. Computing over data is the process of extracting information and insight from a large number of individual data points.
These frameworks are described below:
· Batch-only frameworks:
Apache Hadoop
· Stream-only frameworks:
Apache Storm
Apache Samza
· Hybrid frameworks:
Apache Spark
Apache Flink
What is a big data processing framework?
Processing frameworks and processing engines are responsible for computing over data in a data system. While there is no authoritative definition distinguishing an “engine” from a “framework,” the former is usually defined as the component actually responsible for operating on the data, while the latter is defined as a set of components designed to do the same.
For example, Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine. Engines and frameworks can often be swapped out or used together. For instance, Apache Spark, another framework, can hook into Hadoop and replace MapReduce. This interoperability between components is one reason big data systems are so flexible.
While the systems responsible for processing data at this stage of the life cycle are often complex, at a broad level their goals are very much the same: to improve understanding by performing operations on the data, to uncover patterns underlying the data, and to gain insights into complex interactions.
To simplify the discussion of these components, we will classify processing frameworks by the state of the data they are designed to handle. Some systems process data in batches, some process data as it streams continuously into the system, and others can handle data in both ways.
Before diving into the metrics and conclusions of different implementations, a brief introduction to the concept of different processing types is needed.
Batch processing system
Batch processing has a long history in the big data world. Batch processing operates on large static data sets and returns results after the computation process is complete.
Data sets used in batch processing generally share the following characteristics:
· Bounded: A batch data set represents a finite collection of data
· Persistent: The data is almost always backed by some type of permanent storage
· Bulk: Batch operations are often the only way to process extremely large data sets
Batch processing is ideal for computations that require access to the complete set of records. For example, when calculating totals and averages, the data set must be treated as a whole rather than as a collection of individual records. These operations require that the data set be maintained in its entirety while the computation is in progress.
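As a small illustration (the Order record and its values below are invented for this sketch), a total and an average can only be finalized once every record in the bounded data set has been seen:

```java
import java.util.List;

// Minimal sketch: aggregates such as totals and averages can only be
// finalized after the entire (bounded) data set has been seen, which is
// why they fit the batch model. The record type and values are illustrative.
public class BatchAggregateSketch {
    record Order(double amount) {}

    public static void main(String[] args) {
        List<Order> dataset = List.of(new Order(10.0), new Order(25.5), new Order(4.5));

        double total = 0.0;
        for (Order o : dataset) {
            total += o.amount();               // partial sums are meaningless on their own
        }
        double average = total / dataset.size(); // requires knowing the full record count

        System.out.printf("total=%.2f average=%.2f%n", total, average);
    }
}
```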
Tasks that require processing large volumes of data are usually best handled by batch operations. Whether the data set is processed directly from persistent storage or first loaded into memory, batch processing systems are designed with large data volumes in mind and are provisioned with the resources to handle them. Because it performs so well on large volumes of persistent data, batch processing is frequently used to analyze historical data.
Processing large volumes of data takes considerable time, however, so batch processing is not a good fit for scenarios where processing latency matters.
Apache Hadoop
Apache Hadoop is a processing framework dedicated to batch processing. Hadoop was the first big data framework to gain significant traction in the open source community. Hadoop reimplements algorithms and component stacks based on Google’s published papers and experience in handling massive data, making large-scale batch processing easier to use.
Newer versions of Hadoop consist of multiple components, or layers, that work together to process batch data:
· HDFS: HDFS is the distributed file system layer that coordinates storage and replication across the cluster nodes. HDFS ensures that data remains available despite inevitable node failures. It is used as the source of data, to store intermediate processing results, and to persist the final results of a computation (see the short usage sketch after this list).
· YARN: YARN, which stands for Yet Another Resource Negotiator, is the cluster coordination component of the Hadoop stack. It coordinates and manages the underlying resources and schedules jobs to be run. By acting as an interface to the cluster’s resources, YARN makes it possible to run many more types of workloads on a Hadoop cluster than in earlier iterations.
· MapReduce: MapReduce is Hadoop’s native batch processing engine.
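To make the HDFS layer a little more concrete, here is a minimal sketch of writing and reading a file through the HDFS Java client API. The NameNode address and file path are placeholder values, not part of any particular cluster:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of interacting with HDFS from Java.
public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.txt");     // illustrative path

            // Write: HDFS splits the stream into blocks and replicates
            // each block across several DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client asks the NameNode for block locations and
            // reads the bytes back from available replicas.
            try (var in = fs.open(file)) {
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }
}
```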
Batch mode
Hadoop’s processing capability comes from the MapReduce engine. MapReduce follows a map, shuffle, reduce workflow that operates on key-value pairs. The basic process includes the following steps (a word-count sketch follows the list):
· Read the data set from the HDFS file system
· Split the data set into small chunks and distribute them across the available nodes
· Apply the computation to the data subset on each node (intermediate results are written back to HDFS)
· Redistribute the intermediate results and group them by key
· “Reduce” the values of each key by summarizing and combining the per-node results
· Write the final computed results back to HDFS
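As a rough illustration of these steps, here is the classic word-count pattern written against the Hadoop MapReduce Java API. The class names and the input/output paths passed on the command line are illustrative choices, not part of any specific application:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: the map step emits (word, 1) for each token, the framework
// shuffles and groups the pairs by key, and the reduce step sums the counts.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit an intermediate key-value pair
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get(); // combine all counts grouped under this key
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```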
Strengths and Limitations
Because this approach relies heavily on persistent storage and requires multiple reads and writes per task, it is relatively slow. On the other hand, since disk space is often the most abundant resource on a server, this means that MapReduce can handle very large data sets. It also means that Hadoop’s MapReduce can often run on cheaper hardware than other similar technologies because it doesn’t require everything to be stored in memory. MapReduce has very high scaling potential and has been used in production environments with tens of thousands of nodes.
MapReduce has a steep learning curve, and while other technologies in the Hadoop ecosystem can significantly reduce the impact of this problem, it can still be an obstacle when trying to implement certain applications quickly on a Hadoop cluster.
A vast ecosystem has developed around Hadoop, and Hadoop clusters themselves are often used as components of other software. Many other processing frameworks and engines can also use HDFS and the YARN resource manager by integrating with Hadoop.
Conclusion
Apache Hadoop and its MapReduce processing engine provide a proven batch processing model that is best suited to very large data sets where processing time is not a critical factor. Fully functional Hadoop clusters can be built from very inexpensive components, making this cheap yet effective processing technology applicable to many business cases. Compatibility and integration with other frameworks and engines also make Hadoop a common foundation for processing platforms that handle a variety of workloads with different technologies.
Stream processing system
A stream processing system computes data as it enters the system. This is a very different approach from batch processing: instead of operating on an entire data set, a stream processor operates on each individual item as it moves through the system.
Data sets in stream processing are considered “unbounded,” which has several important implications:
· The complete data set is only the amount of data that has entered the system so far.
· The working data set may be more relevant, and is limited to a single item at a given time.
Processing is event-based and has no “end” unless it is explicitly stopped. Results are available immediately and are continuously updated as new data arrives.
Stream processing systems can handle an almost unlimited amount of data, but they process only one item at a time (true stream processing) or a very small number of items at a time (micro-batch processing), and they maintain only minimal state between records. While most systems provide ways to maintain some state, stream processing is heavily optimized for functional processing with few side effects.
Functional operations focus on discrete steps with limited state or side effects: performing the same operation on the same data produces the same output, independent of other factors. This kind of processing fits streams well, because maintaining state across items is usually some combination of difficult, limited, and in some cases undesirable. So while some form of state management is usually possible, these frameworks are simpler and more efficient without it.
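As a minimal sketch of this stateless, functional style, the following uses standard input to stand in for an unbounded source (such as a log stream) and applies a pure per-record transformation; the tagging function is purely illustrative:

```java
import java.util.Scanner;
import java.util.function.Function;

// Minimal sketch of the stateless, per-record style that stream processing
// favors: each record is handled on arrival by a pure function, no state is
// carried between records, and the loop has no natural "end".
public class StatelessStreamSketch {
    public static void main(String[] args) {
        // A pure transformation: the same input record always yields the same output.
        Function<String, String> parseAndTag = line -> "[length=" + line.length() + "] " + line;

        Scanner source = new Scanner(System.in);
        while (source.hasNextLine()) {                         // runs as long as records keep arriving
            String record = source.nextLine();
            System.out.println(parseAndTag.apply(record));     // result is available immediately
        }
    }
}
```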
This type of processing is well suited to certain kinds of workloads. Tasks with near-real-time requirements are good candidates for stream processing. Analytics, server or application error logging, and other time-based metrics are a natural fit, because reacting to changes in these areas can be critical to business functions. Stream processing suits data where you must respond to changes or spikes and where the interest lies in trends over time.