Let’s take a look at the composition of the Hadoop ecosystem: which subprojects it consists of, what features each project has, and what types of problems each project can solve. The goal is to understand what the Hadoop ecosystem is, what it does, and where it is heading.
HDFS:
HDFS (Hadoop Distributed File System) is the foundation of data storage management in the Hadoop system. It is a highly fault-tolerant system that detects and responds to hardware failures, and it is designed to run on low-cost commodity hardware. HDFS simplifies the file consistency model and provides high-throughput access to application data through streaming data access, which makes it suitable for applications with large data sets.
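As a quick illustration, here is a minimal sketch of writing and then reading a file through the HDFS Java API; the file path is made up for the example, and the cluster address is assumed to come from the usual Configuration files (core-site.xml):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the target cluster (e.g. via core-site.xml).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; real workloads stream much larger data sets.
        Path path = new Path("/tmp/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back as a stream.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```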
MapReduce:
MapReduce is a computing model for processing large amounts of data. Hadoop’s MapReduce implementation, together with Common and HDFS, made up the three components of Hadoop in its early development. MapReduce divides an application into Map and Reduce steps: Map performs a specific operation on individual elements of a data set and produces intermediate results in the form of key-value pairs, and Reduce merges all the values that share the same key in the intermediate results to obtain the final result. Functional partitioning of this kind makes MapReduce very well suited to data processing in a distributed parallel environment made up of a large number of computers.
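The classic word-count job illustrates the model. The sketch below follows the standard Hadoop MapReduce Java API; the input and output paths are supplied on the command line and are assumptions of the example:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the counts that share the same word (key).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```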
HBase
HBase is a scalable, highly reliable, high-performance, distributed, column-oriented dynamic-schema database for structured data. Unlike traditional relational databases, HBase adopts the BigTable data model: an enhanced sparse, sorted mapping table (key/value), in which keys are composed of a row key, a column key, and a timestamp. HBase provides random, real-time read/write access to large-scale data. In addition, data stored in HBase can be processed with MapReduce, which neatly combines data storage and parallel computing.
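As a sketch of the random read/write access described above, the following uses the HBase Java client API; the "user" table, its "info" column family, and the row key are hypothetical and assumed to already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) { // hypothetical table

            // Write one cell: (row key, column family, qualifier, value).
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random, real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```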
Hive
Hive is an important subproject of Hadoop. First designed at Facebook, it is a data warehouse architecture built on Hadoop. Hive provides many data warehouse management functions, including data ETL (extract, transform, and load) tools, data storage management, and the ability to query and analyze large data sets. Hive provides a structured data mechanism and defines a SQL-like language, HiveQL, similar to that of traditional relational databases. With this query language, data analysts can easily run analysis tasks: Hive converts the SQL-like statements into MapReduce jobs and executes them on Hadoop.
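A minimal sketch of running HiveQL from Java over JDBC is shown below; the HiveServer2 URL, the credentials, and the page_views table are assumptions made for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver must be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint and database.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; Hive compiles it into MapReduce jobs behind the scenes.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views (url STRING, user_id BIGINT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```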
Pig
Pig runs on Hadoop and is a platform for analyzing and evaluating large datasets. It simplifies data analysis with Hadoop by providing a high-level, domain-oriented abstraction language, Pig Latin. With Pig Latin, data engineers can express complex, interrelated data analysis tasks as data flow scripts built from Pig operations, which are converted into chains of MapReduce tasks and executed on Hadoop. Like Hive, Pig lowers the threshold for analyzing and evaluating large data sets.
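Here is a small sketch of that idea using Pig's embedded Java API (PigServer); the Pig Latin statements and the HDFS paths are hypothetical:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode sends the compiled job chain to the cluster; LOCAL runs in-process.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // A tiny Pig Latin flow: load, group, count. Paths are hypothetical.
        pig.registerQuery("logs = LOAD '/data/access_log' AS (user:chararray, url:chararray);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(logs);");

        // store() triggers execution of the generated MapReduce chain.
        pig.store("counts", "/data/access_counts");
        pig.shutdown();
    }
}
```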
Zookeeper
How to agree on a value (reach consensus) in a distributed system is an important, fundamental problem. As a distributed service framework, ZooKeeper solves this consistency problem for distributed computing.
On this basis, ZooKeeper can be used to solve common data management problems in distributed applications, such as unified naming service, status synchronization service, cluster management, and configuration item management of distributed applications.
ZooKeeper is playing an increasingly important role as a major component of other Hadoop-related projects.
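As a sketch of the configuration-management use case, the following uses the ZooKeeper Java client to publish a configuration item as a znode and watch it for changes; the ensemble address, the znode path, and the stored value are assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Hypothetical ensemble address; 5-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a configuration item as a znode (unified configuration management).
        String path = "/app-config"; // hypothetical znode
        if (zk.exists(path, false) == null) {
            zk.create(path, "jdbc:mysql://db:3306/app".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back and set a watch so this client is notified of changes.
        byte[] data = zk.getData(path,
                event -> System.out.println("config changed: " + event), null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```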
Mahout
Mahout originated in 2008 as a subproject of Apache Lucene and, in a very short period of time, has become a top-level Apache project. Mahout’s main goal is to create scalable implementations of classic machine learning algorithms to help developers build intelligent applications more easily and quickly. Mahout now includes widely used data mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining. Beyond the algorithms, Mahout includes data input/output tools and data mining support infrastructure, such as integration with storage systems like relational databases, MongoDB, or Cassandra.
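As an illustration of the recommendation (collaborative filtering) side, here is a minimal sketch using the Taste API found in older Mahout releases; the ratings.csv file (userId,itemId,preference) and the user ID are hypothetical:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userId,itemId,preference (hypothetical input file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 1, based on similar users' preferences.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```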
Flume
Flume is a distributed, reliable, and highly available log collection system developed and maintained by Cloudera. It abstracts the path data takes, from generation through transmission and processing to finally being written to its target, as a data flow. Within a data flow, Flume lets the data sender at the source be customized, so data can be collected over a variety of protocols. At the same time, a Flume data flow provides simple processing of log data, such as filtering and format conversion. In addition, Flume can write logs to a variety of (customizable) data targets.
In general, Flume is an extensible log collection system suitable for complex environments and massive volumes of logs.
Sqoop
Sqoop, short for SQL-to-Hadoop, is a peripheral tool of Hadoop. Its main function is to exchange data between structured data stores and Hadoop. Sqoop can import data from a relational database (such as MySQL, Oracle, or PostgreSQL) into HDFS or Hive, or export data from HDFS or Hive back into a relational database. Sqoop makes full use of Hadoop’s strengths: the whole import and export process is parallelized with MapReduce, and most of the steps are executed automatically, which is very convenient.
Accumulo
Accumulo is a reliable, scalable, high-performance, sorted, distributed key-value store featuring cell-level access control and customizable server-side processing. It follows the Google BigTable design and is built on Apache Hadoop, ZooKeeper, and Thrift.
Spark
Spark is a fast, general-purpose computing engine designed for large-scale data processing. It is a Hadoop MapReduce-like general parallel framework developed by the AMP Lab at the University of California, Berkeley. Spark has the advantages of Hadoop MapReduce; unlike MapReduce, however, the intermediate output of a job can be kept in memory, so there is no need to read and write HDFS between steps. Spark is therefore better suited to algorithms that require iteration, such as data mining and machine learning.
Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it better for certain workloads. In particular, Spark enables in-memory distributed datasets and, in addition to providing interactive queries, can also optimize iterative workloads.
Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which can manipulate distributed data sets as easily as local collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can run in parallel on the Hadoop file system; this can be arranged through a third-party cluster framework called Mesos. Spark was developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and can be used to build large-scale, low-latency data analysis applications.
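A minimal word-count sketch with the Spark 2.x Java RDD API illustrates the in-memory point; the HDFS paths are hypothetical, and the master URL is assumed to be supplied by spark-submit:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical input path; can be any HDFS URI.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

            // cache() keeps the dataset in memory, so repeated (iterative) passes
            // over it avoid re-reading HDFS -- the key difference from plain MapReduce.
            JavaRDD<String> words = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .cache();

            JavaPairRDD<String, Integer> counts = words
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs:///data/output"); // hypothetical output path
        }
    }
}
```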
Avro
Avro is a data serialization system designed for applications that exchange large volumes of data. Its main features are support for binary serialization, so large amounts of data can be processed conveniently and quickly, and friendliness to dynamic languages: Avro provides mechanisms that make it easy for dynamic languages to handle Avro data. There are many similar serialization systems on the market today, such as Google’s Protocol Buffers and Facebook’s Thrift, and they are responsive enough to meet the needs of common applications. Doug Cutting, however, observed that Hadoop’s existing RPC system ran into several problems: performance bottlenecks (it used the built-in IPC system, based on Java’s DataOutputStream and DataInputStream), the requirement that server and client run the same version of Hadoop, and the fact that it could only be used from Java.
The existing serialization systems have problems of their own as well. Take Protocol Buffers as an example: it requires the user to define a data structure, generate code from that structure, and then assemble the data. If you need to work with data sets from multiple sources, you must define multiple data structures and repeat the process for each one, so there is no uniform way to handle an arbitrary data set. Second, code generation makes little sense for scripting systems in Hadoop such as Hive and Pig. Moreover, Protocol Buffers must allow for data definitions that may not exactly match the data by adding annotations to it, which makes the data bulky and slows processing. Other serialization systems share similar problems. So, for the future of Hadoop, Doug Cutting led the development of a new serialization system, Avro, which was added to the Hadoop project family in 2009.
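A minimal sketch of Avro’s generic Java API is shown below; it writes and reads records against an inline schema without any code generation. The schema and the output file name are made up for the example:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    // The schema travels with the data file, so no generated classes are needed.
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        File file = new File("users.avro"); // hypothetical output file
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(schema))) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```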
Crunch
Apache Crunch is modeled on FlumeJava and is a MapReduce-based data pipeline library. It is a Java class library that simplifies writing and running MapReduce jobs and can be used to simplify the APIs for join and data aggregation tasks. Like Pig and Hive, Crunch is designed to lower the entry cost of MapReduce. The difference is:
Pig is a pipeline-based framework, while Crunch is a Java library that offers a higher level of flexibility than Pig.
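The canonical Crunch word count sketches how the library works: a pipeline is planned lazily and compiled into a chain of MapReduce jobs when done() is called. The input and output paths are hypothetical:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
    public static void main(String[] args) {
        // The pipeline is planned lazily and compiled into MapReduce jobs.
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
        PCollection<String> lines = pipeline.readTextFile("/data/input"); // hypothetical path

        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings());

        // count() is a built-in aggregation; no hand-written reducer is needed.
        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, "/data/output"); // hypothetical path
        pipeline.done();
    }
}
```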
Hue
HUE = Hadoop User Experience
Hue is an open-source Apache Hadoop UI system that grew out of Cloudera Desktop; Cloudera contributed it to the Apache Foundation’s Hadoop community. It is implemented on top of the Python web framework Django.
Using Hue, you can interact with a Hadoop cluster from a web console in the browser to analyze and process data, for example operating on HDFS data, running MapReduce jobs, executing Hive SQL statements, and browsing HBase databases. In short, it provides web graphical interfaces to the various components.
Impala
Impala is Cloudera’s newer query system; it provides SQL semantics for querying petabytes of big data stored in HDFS and HBase. Hive also provides SQL semantics, but it runs batch jobs on the MapReduce engine, which makes it hard to deliver interactive queries. By contrast, Impala’s biggest feature and selling point is its speed.
Kafka
Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of a consumer-scale web site. These actions (web browsing, searches, and other user actions) are a key factor in many social functions on the modern web, and because of the throughput requirements this data is usually handled through log processing and log aggregation.
For log data and offline analysis systems like Hadoop, which are constrained when real-time processing is also required, Kafka is a feasible solution. Kafka is designed to unify online and offline message processing through Hadoop’s parallel loading mechanism and to provide real-time messaging across clusters.
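A minimal sketch of publishing one user-action event with the Kafka Java producer API; the broker address, topic name, and message contents are assumptions of the example:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserActionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address and serializers for String keys/values.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one user-action event; real-time consumers and offline Hadoop
            // loaders can subscribe to the same topic independently.
            producer.send(new ProducerRecord<>("user-actions", "user-42", "page_view:/index.html"));
        }
    }
}
```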
Kudu
Kudu is Cloudera’s open-source columnar storage system that runs on the Hadoop platform. It shares the common technical characteristics of Hadoop ecosystem applications: it runs on commodity hardware and supports horizontal scaling and high availability.
Oozie
Oozie is an open-source workflow engine, contributed by Cloudera to Apache, for managing Hadoop jobs. Oozie consists of the Oozie Client and the Oozie Server; the Oozie Server is a web application that runs in a Java servlet container (Tomcat).
Sentry
Sentry is an open-source real-time error tracking system that helps developers monitor and fix exceptions in real time. It focuses on continuous integration, improving efficiency, and improving the user experience. Sentry is split into a server and client SDKs: the server can be used through the hosted online service or deployed locally, while the client SDKs support many major languages and frameworks, including React, Angular, Node, Django, RoR, PHP, Laravel, Android, .NET, Java, and more. It also offers integrations with other popular services, such as GitHub, GitLab, Bitbucket, Heroku, Slack, Trello, and more.
Note: Apache Parquet is a columnar storage format that efficiently stores nested data.