This article discusses the big data processing ecosystem and its architectural stack, surveying the features of various frameworks for different tasks. It also examines the stack at multiple levels, including storage, resource management, data processing, query, and machine learning.

Lower barriers to access drove the initial growth of data on the Internet, a trend reinforced by a wave of new devices such as smartphones and tablets. On top of that first generation of growth, social media platforms have driven exponential increases in data volume; this second wave of growth was unleashed by social media and the collaborative, viral nature of information-sharing platforms. A third wave will come largely from the rapid spread of smart connected devices, which will bring unprecedented data expansion. In addition, advances in science and cheaper computing power have enabled new applications in many fields, including medical science, physics, astronomy, and genetics, where collected data is used to test hypotheses and drive new discoveries.

The rapid expansion of data acquisition and storage has brought us to a new stage: turning data into meaningful interpretations. The practical need to process large amounts of data calls for scalable, parallel systems capable of processing data at higher speed and larger scale, and open source technology is a natural fit for this kind of high-performance, large-scale data processing. This article provides an overview of the open source frameworks and components available at different levels of the big data processing stack.

As more and more big data (characterized by volume, velocity, and variety) is generated and collected, a variety of systems has been developed to exploit its vast and diverse potential. While many of these systems exhibit similar characteristics, they often follow different design philosophies, which leads to greater choice. One strategic guiding principle for an enterprise defining its data strategy is the adoption of a common data storage layer, which lets different frameworks use the data and share it with one another. Figure 1 shows a typical data processing architecture stack.

This architecture stack can also be viewed as a multi-stage data processing pipeline, as shown in Figure 2. Unstructured data comes in many formats, such as text, images, video, and audio, and from many sources, including transaction records, web logs, public websites, databases, devices and instruments, and other associated data. After cleaning and error checking, the data lands in the data storage tier. The next task is to process it, iteratively and interactively, using the frameworks described in the following sections. Processing may itself have several sub-stages, each of which may interact with the storage tier. It may also involve statistical exploration and modeling to derive and test hypotheses: algorithms are trained on the data, used for predictive modeling, and retrained periodically as new data sets enter the system. The data sets are further used for exploratory analysis to uncover latent knowledge and insights. Throughout processing and exploration, visualization tools help analysts understand the data and communicate it to stakeholders.

Data in the storage tier can be reused by different stakeholders within the organization. Big data is often noisy and uncertain, and most processing frameworks are designed to tolerate this; in fact, this tolerance is a key factor in a framework's success. Next, let's discuss the frameworks and libraries at each of these levels.

Storage and data layer

Let's start with the storage and data layer, the most important component and the foundation of the big data stack. As the name suggests, big data is characterized by sheer volume, which demands huge and, in theory, unlimited storage capacity. Technological advances that have made storage and computing cheaper have led to clustered storage and computing platforms that are essentially free of storage limits and allow virtually unlimited data storage. These platforms are also not constrained by traditional data modeling and schema design paradigms: they are typically schemaless, allowing all forms of data (structured, semi-structured, and unstructured) to be stored. This makes it possible to build more dynamic systems and lets analysts explore the data without being constrained by existing models.

HDFS (https://hadoop.apache.org/): This is the scalable, fault-tolerant distributed file system of the Hadoop ecosystem. HDFS is expanded by adding commodity servers to the cluster; the largest known clusters contain about 4,500 nodes and up to 128 petabytes of data. HDFS supports parallel reads and writes, and aggregate bandwidth scales linearly with the number of nodes. Built-in redundancy comes from storing multiple copies of the data: files are broken into blocks that are distributed across the cluster, and each block is replicated several times for reliability. HDFS has a master/slave architecture, with a component called the NameNode acting as the master server. The NameNode manages the file system namespace (files, directories, and blocks, and the relationships among them); the namespace is kept in memory and changes are periodically persisted to disk. Slave components called DataNodes, one per node in the cluster, manage the storage attached to the node they run on.
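For readers who want to see what this looks like from an application's point of view, here is a minimal sketch using the HDFS Java client API (org.apache.hadoop.fs.FileSystem) to write and then read back a small file. The path is an example value, and the sketch assumes a core-site.xml on the classpath that points fs.defaultFS at a running NameNode.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt"); // example path

        // Write: the client asks the NameNode where to place blocks, then streams bytes to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back; block replicas make the read resilient to DataNode failures.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```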

NoSQL databases (http://nosql-database.org/): As the Web has evolved and become more interactive, it has become clear that existing relational database technologies cannot keep up with the data volume and concurrency demands of Web 2.0. To meet this need, "Not only SQL" databases have emerged as a new class of storage and management systems. Whereas HDFS and Hadoop serve as engines for real-time or batch data analysis, NoSQL databases essentially act as the data storage layer for front-end and back-end web applications, especially those that need to handle a large amount of concurrency.

Characteristically, these databases are schema-free (or require only minimal schema design), scale horizontally, and rely on eventual consistency rather than immediate consistency.

There are four basic NoSQL data models; a small illustrative sketch of the document model follows the list.

  • Key-value stores, based on hash or associative-array data structures. These databases build on Amazon's Dynamo paper (www.allthingsdistributed.com/files/amazo…).
  • Column-family databases, based on Google's BigTable paper (research.google.com/archive/big…). This data model allows each row to have its own schema; examples include HBase and Cassandra.
  • Document databases, which store data as documents, each a collection of key-value pairs. Documents are typically stored as JSON (for example, MongoDB and CouchDB).
  • Graph databases, which model data as nodes and the relationships between them, with both nodes and relationships carrying properties as key-value pairs (for example, Neo4J).
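To make the document model concrete, here is a purely illustrative Java sketch (not tied to any particular database driver) that represents a document as nested key-value pairs, the structure a store such as MongoDB or CouchDB would serialize as JSON. All field names here are hypothetical.

```java
import java.util.List;
import java.util.Map;

public class DocumentModelSketch {
    public static void main(String[] args) {
        // A "document" is just a nested collection of key-value pairs,
        // typically serialized as JSON by document stores.
        Map<String, Object> order = Map.of(
            "orderId", "A-1001",                                   // hypothetical fields for illustration
            "customer", Map.of("name", "Alice", "city", "Pune"),
            "items", List.of(
                Map.of("sku", "X-1", "qty", 2),
                Map.of("sku", "Y-7", "qty", 1)
            )
        );

        // Unlike a relational row, each document carries its own structure;
        // another order could add or omit fields without any schema change.
        System.out.println(order);
    }
}
```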

Tachyon (http://tachyon-project.org/): This is a platform that provides reliable in-memory data sharing across cluster frameworks and jobs. Tachyon essentially sits on top of a storage platform such as HDFS and provides memory-centric data processing across frameworks and jobs. Although some cluster computing frameworks, such as Spark, already process data in memory, that approach has three key shortcomings, which motivated Tachyon's development:

  • Although a job processes data in memory, that data cannot be shared between jobs and frameworks, because it lives only inside the job's JVM.
  • Because the execution engine and the storage share a JVM, any execution engine crash loses the data and forces recomputation.
  • In some cases, in-memory data is duplicated across jobs, enlarging the memory footprint and triggering heavier garbage collection.

Tachyon was developed to address these issues. It keeps only one copy of the data in memory and makes it available to all frameworks (such as Spark and MapReduce), and it achieves fault tolerance by recording lineage with respect to the storage layer and recomputing lost data from that lineage rather than replicating it.

Once data has been saved to the storage tier, the next step is to process it and derive insights. We compare several data processing frameworks here.

The Apache Hadoop stack (https://hadoop.apache.org/) is the granddaddy of big data processing frameworks and has, in effect, become the platform around which these technologies gather. The platform's cost-effectiveness and scalability match the industry's needs for large-scale data processing, and its reliability, community support, and the ecosystem that has grown around it have led to even wider adoption.

The Hadoop ecosystem has three main goals:

Scalability – Capacity can be increased simply by adding nodes to the cluster. The framework's data-local computing model, which moves computation to where the data lives, keeps this scaling model simple.

Flexibility – Data can be stored in different structures and formats. This is achieved through a "schema on read" approach: the system can store anything and parses the data only when it is read, that is, when it needs to be understood.

Efficiency – Cluster resources are used optimally, so that work gets done with less hardware and time.

Hadoop MapReduce (https://hadoop.apache.org/) is the implementation of the MapReduce programming paradigm (popularized by a Google paper). This paradigm is intended to process very large data sets in parallel on large clusters while ensuring reliability and fault tolerance, and it is built on a distributed file system that provides both. A MapReduce program consists of two procedures, Map() and Reduce(). The Map() procedure processes the input data set in parallel and emits intermediate output; the Reduce() stage then aggregates the sorted output of the many Map nodes. Because the Map() stage runs over a very large data set spread across a huge cluster of nodes, the framework together with the underlying HDFS can handle petabyte-scale data sets distributed over thousands of nodes.
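The canonical illustration of this paradigm is word count. The sketch below follows the widely published Hadoop example: the Map() emits (word, 1) pairs and the Reduce() sums them. The input and output paths are supplied as arguments and are assumed to be HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): runs in parallel over input splits and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): receives all counts for one word (shuffled and sorted by the framework) and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```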

Apache Flink (https://flink.apache.org/) is a data processing system that combines the scalability and power of the Hadoop HDFS layer with the declarative style and performance optimizations that are the cornerstones of relational databases. Flink provides a runtime system that serves as an alternative to the Hadoop MapReduce framework.
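As a rough point of comparison with the MapReduce example above, here is a minimal batch word count written against Flink's Java DataSet API (the batch API of Flink versions contemporary with this article). The inline string literal is a stand-in for a real data source such as a file in HDFS.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; env.readTextFile("hdfs://...") would read a real data set.
        DataSet<String> text = env.fromElements("to be or not to be");

        DataSet<Tuple2<String, Integer>> counts = text
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\s+")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                }
            })
            .groupBy(0)   // group by the word field
            .sum(1);      // sum the counts

        counts.print();
    }
}
```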

Apache Tez (https://tez.apache.org/) is a distributed data processing engine built on top of YARN (the Hadoop 2.0 ecosystem). Tez models a data processing workflow as a directed acyclic graph (DAG). This lets developers model complex data processing jobs intuitively as a series of tasks, while leveraging the resource management capabilities of the Hadoop 2.0 ecosystem.

Apache Spark (https://spark.apache.org/) is a distributed execution engine for large-scale data processing that provides efficient abstractions for processing large data sets in memory. Although MapReduce on YARN provides an abstraction for using cluster computing resources, it is inefficient for iterative algorithms and interactive data mining, which need to reuse data across steps. Spark implements fault-tolerant, in-memory data abstraction in the form of Resilient Distributed Datasets (RDDs), parallel data structures held in memory. RDDs achieve fault tolerance by tracking the transformations that produced them (their lineage) rather than replicating the data itself: if part of a data set is lost, only the transformations needed to rebuild that part are re-executed, which is far cheaper than copying data sets across nodes for fault tolerance. Spark is claimed to be up to 100 times faster than Hadoop MapReduce.
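The following sketch, assuming the Spark 2.x Java API and a local master, shows how a chain of transformations defines an RDD's lineage and how cache() keeps the result in memory. The sample strings are placeholders for a real data set.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddLineageExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-lineage").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each transformation below defines a new RDD; nothing is computed yet.
        JavaRDD<String> lines = sc.parallelize(
            Arrays.asList("spark keeps lineage", "lineage enables recovery"));
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaRDD<String> longWords = words.filter(w -> w.length() > 5);

        // Cache in memory so later actions reuse the data instead of recomputing it.
        longWords.cache();

        // Actions trigger execution. If a cached partition is lost, Spark rebuilds it
        // by replaying the recorded transformations (the lineage), not by copying data.
        System.out.println(longWords.count());
        System.out.println(longWords.collect());

        sc.stop();
    }
}
```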

Spark also provides a unified framework for batch processing, stream processing, and interactive data mining, with APIs in Java, Scala, and Python, and an interactive command-line shell for fast queries. On top of the core engine it offers libraries for machine learning (MLlib), graph processing (GraphX), declarative queries (Spark SQL), and stream processing (Spark Streaming). Spark Streaming processes event streams in near real time by treating stream processing as a series of micro-batches: the input stream is divided into batches of a preset duration, which are fed into the underlying Spark engine and processed with the same programming model as Spark batch jobs. This yields the low latency required for near-real-time processing while integrating naturally with batch workloads.
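A minimal Spark Streaming sketch of this micro-batch model is shown below, again assuming the Spark 2.x Java API. It reads lines from a hypothetical local TCP socket (for example, one fed by `nc -lk 9999`) and counts words in five-second batches; host, port, and batch duration are example values.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]");

        // Each 5-second slice of input becomes one micro-batch RDD.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Hypothetical source: lines of text arriving on a local TCP socket.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
            words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);

        counts.print();        // runs once per micro-batch

        jssc.start();
        jssc.awaitTermination();
    }
}
```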

Apache Storm (https://storm.apache.org/) is a system for real-time processing of continuous data streams. It is highly scalable and fault tolerant, and it implements reliable processing semantics so that no events are lost. Whereas Hadoop provides a framework for batch processing, Storm provides the same for streaming event data. It uses a directed acyclic graph (DAG) of spouts (input data sources) and bolts (processing steps) to define a data processing pipeline, or topology; streams are the sequences of tuples that flow through these pipelines (a minimal topology sketch follows the list below). A Storm cluster consists of three parts:

  • Nimbus runs on the master node and is responsible for distributing work among the worker processes.
  • The Supervisor daemon runs on each worker node, listens for assigned work, and starts or stops worker processes as needed to complete it.
  • ZooKeeper handles coordination between Nimbus and the Supervisors and helps maintain fault tolerance.
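Below is a minimal, self-contained topology sketch assuming Storm 1.x package names (org.apache.storm). SentenceSpout and SplitBolt are illustrative components invented here: the spout stands in for a real event source, and the bolt performs one processing step.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordSplitTopology {

    // Spout: the input data source; here it simply replays a fixed sentence once per second.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("storm processes streams of tuples"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: one processing step; splits incoming sentences into words.
    public static class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

        // Run in-process for demonstration; a real deployment would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-split", new Config(), builder.createTopology());
    }
}
```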

High-level languages for analysis and query evolved as cluster programming frameworks became the primary means of solving big data processing problems, because larger practical deployments exposed another problem: programming directly against these computing frameworks is increasingly complex and hard to maintain. The availability of skills is a further concern, since far more people have domain expertise in SQL and scripting than in low-level cluster programming. As a result, higher-level programming abstractions over cluster computing frameworks began to emerge, hiding the lower-level programming APIs. This section discusses some of these frameworks.

Hive (https://hive.apache.org/) and Pig (https://pig.apache.org/) are high-level language layers over MapReduce. Queries written in their high-level languages are compiled into MapReduce programs, abstracting away the underlying details of MapReduce and HDFS. Pig implements Pig Latin, a procedural-style language, while Hive provides the Hive Query Language (HQL), a declarative, SQL-like language.

Pig is well suited to writing data processing pipelines for iterative processing scenarios. Hive, with its declarative SQL-like language, is used for ad hoc queries, exploratory analysis, and business intelligence (BI).
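As a sketch of how an application might run HQL, the example below submits a query to a HiveServer2 instance over JDBC and prints the result. The connection URL, credentials, and the web_logs table are assumptions for illustration only.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, database, user, and table are example values.
        String url = "jdbc:hive2://localhost:10000/default";

        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {

            // A declarative HQL query; Hive compiles it into jobs on the underlying engine.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```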

BlinkDB (http://blinkdb.org/) is a newcomer to the big data processing ecosystem. It provides an interactive query processing platform that supports approximate queries over big data. As data volumes grow exponentially, a growing body of research focuses on computational models with low latency; Apache Spark moves in this direction by relying on in-memory data structures to reduce latency.

BlinkDB pushes latency down further by introducing approximate queries. In some industrial settings, a small amount of error is acceptable if it buys a large speedup. BlinkDB runs queries on samples of the original data set rather than the entire data set; a query can specify an acceptable error bound or a time limit, and the system processes it within those constraints and returns results within the stated range. BlinkDB exploits the fact that statistical sampling error depends on the sample size, not on the population size: even as the amount of data grows, a sample of the same size still reflects the population to the same degree. This idea yields remarkable performance improvements. Since most query time is spent in I/O, setting the sample size to 10 per cent of the original data can make processing about 10 times faster, reportedly with errors below 0.02 per cent. BlinkDB is built on the Hive query engine and supports both Hadoop MapReduce and Apache Shark as execution engines. It hides this approximation machinery behind a SQL-like command structure that supports standard aggregation, filtering, grouping, joins, and nested queries, as well as user-defined functions.
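The sketch below is not BlinkDB code; it is a small, self-contained simulation of the statistical principle described above, showing that the relative error of a sample-based estimate shrinks with the sample size regardless of how large the underlying population is. The population size, distribution, and sample sizes are arbitrary choices for illustration.

```java
import java.util.Random;

public class SamplingErrorSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);

        // A synthetic "population" standing in for a large table column.
        int n = 10_000_000;
        double[] population = new double[n];
        for (int i = 0; i < n; i++) {
            population[i] = rnd.nextGaussian() * 50 + 200;
        }
        double trueMean = mean(population, n);

        // Estimate the mean from progressively smaller uniform samples.
        for (int sampleSize : new int[] {1_000_000, 100_000, 10_000}) {
            double[] sample = new double[sampleSize];
            for (int i = 0; i < sampleSize; i++) {
                sample[i] = population[rnd.nextInt(n)];
            }
            double estimate = mean(sample, sampleSize);
            System.out.printf("sample=%d  estimate=%.3f  relative error=%.5f%%%n",
                sampleSize, estimate, 100 * Math.abs(estimate - trueMean) / trueMean);
        }
        // The error shrinks roughly as 1/sqrt(sampleSize), independent of the population size,
        // which is the property approximate query engines such as BlinkDB exploit.
    }

    private static double mean(double[] values, int count) {
        double sum = 0;
        for (int i = 0; i < count; i++) sum += values[i];
        return sum / count;
    }
}
```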

Figure 1: Stack of big data processing components

Cluster resource management frameworks

Cluster resource management is one of the key components of the big data processing stack. A resource management framework must combine generality, so that multiple upper-layer frameworks with radically different processing requirements can share a cluster, with important features such as robust resource control and seamless recovery. A common framework also avoids replicating large amounts of data between the different frameworks in the cluster, and an easy-to-use management interface matters as well. I'll look at several resource management frameworks that meet these goals.

Apache Hadoop YARN (https://hadoop.apache.org/): Hadoop 1.0 was built entirely around the MapReduce paradigm as its engine. As Hadoop gained wide acceptance as a platform for distributed batch processing, demand grew for other computing patterns, such as message passing interfaces, graph processing, real-time stream processing, and ad hoc and iterative processing, which the MapReduce programming paradigm does not support. As a result, new (and other existing) frameworks began to evolve. At the same time, HDFS was widely accepted for big data storage, and it made little sense for every other framework to build its own storage layer. So the Hadoop community worked to generalize the platform beyond MapReduce. The result was Hadoop 2.0, which separates resource management from application management; the resource management system is called YARN. YARN also follows a master-slave architecture. The ResourceManager is the master service and manages resource allocation for the different applications in the cluster. The slave component, called the NodeManager, runs on every node in the cluster and is responsible for launching the compute containers that applications need. The ApplicationMaster is a framework-specific entity; it negotiates resources from the ResourceManager and works with the NodeManagers to submit and monitor the application's tasks.

This decoupling improves cluster utilization by allowing other frameworks to run alongside MapReduce, accessing and sharing data on the same cluster.
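As a small illustration of this architecture, the sketch below uses the YarnClient API to ask the ResourceManager about the cluster's NodeManagers and the applications (each with its own ApplicationMaster) currently known to it. It assumes a yarn-site.xml on the classpath that points at a running ResourceManager.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The ResourceManager tracks every NodeManager and the capacity it offers.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + "  capacity=" + node.getCapability());
        }

        // Applications currently registered with the cluster.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                + "  state=" + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```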

Apache Mesos (http://mesos.apache.org/) is a general-purpose cluster resource management framework that manages all the resources in a data center. Mesos schedules work differently from YARN: it implements a two-level scheduling mechanism in which the master offers resources to the framework schedulers, and each framework decides whether to accept or reject an offer. This model makes Mesos extensible and versatile and lets frameworks pursue specific goals, such as data locality (see the sketch below). Mesos has a master/slave architecture: the Mesos master runs on one node and is paired with standby masters that take over in the event of a failure. The master manages the slave processes on the cluster nodes and the frameworks that run tasks on those nodes. A framework running on Mesos has two components: a framework scheduler that registers with the master, and a framework executor that is launched on the Mesos slaves. Slaves report their available resources to the master; the master consults its allocation policy and offers resources to frameworks accordingly. A framework can accept an offer fully or partially, or reject it outright, depending on its goals and the tasks it needs to run; if it accepts, it replies with the tasks to launch, and the master forwards those tasks to the corresponding slaves, which assign the offered resources to the framework's executor, which in turn runs the tasks.
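The sketch below is a conceptual illustration of this two-level model rather than actual Mesos API code: a hypothetical framework scheduler receives offers and accepts only those that are both large enough and local to its input data. The Offer and FrameworkScheduler types, and all thresholds and host names, are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoLevelSchedulingSketch {

    // Hypothetical stand-in for a resource offer: what one slave node has available.
    static final class Offer {
        final String slaveHost;
        final double cpus;
        final double memGb;

        Offer(String slaveHost, double cpus, double memGb) {
            this.slaveHost = slaveHost;
            this.cpus = cpus;
            this.memGb = memGb;
        }
    }

    // Hypothetical framework scheduler: the framework, not the master, decides which
    // offers to accept, which is what lets it pursue goals such as data locality.
    static final class FrameworkScheduler {
        private final Set<String> hostsWithInputData;

        FrameworkScheduler(Set<String> hostsWithInputData) {
            this.hostsWithInputData = hostsWithInputData;
        }

        void resourceOffers(List<Offer> offers) {
            for (Offer offer : offers) {
                boolean bigEnough = offer.cpus >= 2 && offer.memGb >= 4;
                boolean local = hostsWithInputData.contains(offer.slaveHost);
                if (bigEnough && local) {
                    System.out.println("accept offer on " + offer.slaveHost + " -> launch task there");
                } else {
                    System.out.println("decline offer on " + offer.slaveHost);
                }
            }
        }
    }

    public static void main(String[] args) {
        FrameworkScheduler scheduler =
            new FrameworkScheduler(new HashSet<>(Arrays.asList("node-1", "node-3")));
        scheduler.resourceOffers(Arrays.asList(
            new Offer("node-1", 4, 16),   // big enough and holds our data -> accept
            new Offer("node-2", 4, 16),   // big enough but remote -> decline
            new Offer("node-3", 1, 2)));  // local but too small -> decline
    }
}
```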

Figure 2: Big data processing pipeline

There is little point in putting effort into big data if it does not ultimately deliver business value, and machine learning libraries are a major way to extract that value. Machine learning lets systems learn from large amounts of data and apply what they learn to predict outcomes for unseen inputs. Real-life examples of machine learning systems abound: targeted advertising, recommendation engines, "next best offer/action" systems, self-learning autonomous systems, and so on. Here are some frameworks in this area.

Apache Mahout (http://mahout.apache.org/) aims to provide a scalable machine learning platform. It implements a variety of algorithms "out of the box" and provides a framework for implementing custom ones. Although Mahout is one of the oldest machine learning libraries, it was originally written against the MapReduce programming paradigm, which is poorly suited to the iterative nature of machine learning algorithms, so it never achieved great success. As Spark took off, attention shifted to Spark's own MLlib library, and Mahout itself moved away from Hadoop MapReduce toward newer execution engines such as Spark.

Spark MLlib (https://spark.apache.org/mllib/) is a scalable machine learning library built on top of Spark. Because it is a native extension of the Spark Core execution engine, it inherits Spark's advantages. MLlib ships with algorithms for several classes of ML problems, such as classification, regression, collaborative filtering, clustering, and matrix decomposition.
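As a small example of the library in use, the sketch below trains a k-means clustering model on a handful of toy points with the MLlib Java API (KMeans.train); in a real job the feature vectors would come from an RDD loaded from the storage layer, and k and the iteration count would be tuned for the data.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("mllib-kmeans").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Toy 2-D points; a real job would load a feature RDD from HDFS.
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
            Vectors.dense(0.0, 0.0), Vectors.dense(0.5, 0.4),
            Vectors.dense(9.0, 9.2), Vectors.dense(8.7, 9.5)));

        // Train a clustering model with k = 2 clusters and up to 20 iterations.
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);

        for (Vector center : model.clusterCenters()) {
            System.out.println("cluster center: " + center);
        }
        System.out.println("cluster of (9,9): " + model.predict(Vectors.dense(9.0, 9.0)));

        sc.stop();
    }
}
```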

PredictionIO (http://prediction.io/) is a scalable machine learning server. It provides a framework for faster prototyping and productionizing of machine learning applications. It is built on Apache Spark and uses the machine learning algorithm implementations provided by Spark MLlib. It exposes a trained prediction model as a service through an event-server-based architecture, and it supports continuous retraining of models in a distributed environment: events are collected in real time and can be used to retrain the model as a batch job. Client applications query the service through a REST API and receive the predicted results as JSON responses.
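A client query might look like the sketch below, which POSTs a JSON query to an engine server and prints the JSON response. The localhost:8000 endpoint, the /queries.json path, and the query fields are assumptions modeled on typical PredictionIO recommendation deployments and should be adjusted to the engine actually in use.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class PredictionQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed deployment: an engine server listening locally on port 8000
        // and accepting queries at /queries.json; adjust both for your setup.
        URL url = new URL("http://localhost:8000/queries.json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Hypothetical query for a recommendation engine: top 5 items for one user.
        String query = "{\"user\": \"u-42\", \"num\": 5}";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(query.getBytes(StandardCharsets.UTF_8));
        }

        // The engine responds with its predictions as JSON.
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            System.out.println(scanner.hasNext() ? scanner.next() : "");
        }
    }
}
```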