Hadoop service roles in the big data framework (DKHadoop): I have already covered the DKHadoop distribution download, installation, and runtime environment deployment. Some parts may not be very detailed; my personal understanding is limited, so please bear with me. When I wrote up the DKHadoop runtime environment deployment, I left out the Hadoop service roles, so I am filling that content in here; otherwise it would keep bothering me.
- Zookeeper role: The ZooKeeper service is a cluster service framework consisting of one or more nodes and used for cluster management. For a cluster, the ZooKeeper service provides configuration maintenance, naming, and distributed synchronization for HyperBase. It is advisable to deploy at least three nodes in a ZooKeeper cluster.
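A minimal sketch of how a client uses ZooKeeper for the configuration-maintenance part, written against the official Java client; the ensemble addresses, znode path, and payload are assumptions made up for illustration:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // Connect to an assumed three-node ensemble (hostnames are hypothetical)
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });
        // Publish a piece of shared configuration as a persistent znode
        zk.create("/demo-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any node in the cluster can now read the same value back
        System.out.println(new String(zk.getData("/demo-config", false, null)));
        zk.close();
    }
}
```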
- JDK role: The JDK is the software development kit for the Java language and the core of all Java development; it contains the Java runtime environment, the Java tools, and the Java base class libraries.
- Apache-flume role: Flume, originally developed by Cloudera, is a highly available, reliable, distributed system for collecting, aggregating, and moving massive amounts of log data. Flume supports custom data senders in the logging system for collecting data, and it also provides simple facilities for processing the data and writing it to a variety of (customizable) data receivers.
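As a sketch of the "custom data sender" idea, an application can push events to a running Flume agent through Flume's Java RPC client; the agent host, port, and message body are assumptions:

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeSender {
    public static void main(String[] args) throws Exception {
        // Connects to an Avro source on an assumed agent address
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            Event event = EventBuilder.withBody("app started", StandardCharsets.UTF_8);
            client.append(event); // delivered to whatever sink the agent configures
        } finally {
            client.close();
        }
    }
}
```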
- Apache-hive role: Hive is a data warehouse tool built on Hadoop. Hive maps structured data files onto database tables, provides a simple SQL query capability, and converts SQL statements into MapReduce jobs for execution.
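For example, a client can submit SQL to HiveServer2 over JDBC and let Hive turn it into MapReduce jobs; the connection URL, credentials, and the `logs` table are assumptions for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint; host, user, and database are hypothetical
        try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL into MapReduce jobs behind the scenes
             ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM logs GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```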
- Apache-storm role: Storm performs memory-level computation; data is streamed directly into memory over the network, and reading and writing memory is orders of magnitude faster than reading and writing disk. When the computation model is better suited to streaming, Storm's stream processing also saves the time that batch processing spends collecting data.
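A rough sketch of what such a streaming topology looks like in Storm's Java API; `TestWordSpout` is a demo source shipped with Storm, and the topology name and parallelism are arbitrary choices:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class StreamingDemo {
    // A bolt processes each tuple the moment it arrives, entirely in memory
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getString(0));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout());  // continuous source
        builder.setBolt("print", new PrintBolt(), 2).shuffleGrouping("words");
        // In-process cluster for local testing; a real deployment would submit to a cluster
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```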
- Elasticsearch role: Elasticsearch is developed in Java, released as open source under the Apache License, and is currently a popular enterprise-level search engine. It is designed for the cloud and delivers real-time search while being stable, reliable, fast, and easy to install and use.
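Since Elasticsearch is driven entirely over a REST interface, a short sketch with the JDK's built-in HTTP client is enough to index and then search a document; the host, index name, and document are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsDemo {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // Index one document (host and index "articles" are hypothetical);
        // refresh=true makes it visible to the immediate search below
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("http://es-host:9200/articles/_doc/1?refresh=true"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"title\":\"hello hadoop\"}"))
                .build();
        http.send(put, HttpResponse.BodyHandlers.ofString());
        // Full-text search over the indexed documents
        HttpRequest search = HttpRequest.newBuilder()
                .uri(URI.create("http://es-host:9200/articles/_search?q=title:hadoop"))
                .build();
        System.out.println(http.send(search, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```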
- NameNode role: The NameNode is the HDFS node that maintains the directory structure of all files in the file system and tracks which DataNodes each file's data is stored on. When a client needs to fetch a file from HDFS, it first asks the NameNode which DataNodes hold the blocks it requires (see the client sketch after the DataNode role below). Only one active NameNode exists in a Hadoop cluster, and the machine hosting the NameNode should not be assigned other roles.
- DataNode role: In HDFS, a DataNode is a node that stores data blocks.
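To make the NameNode/DataNode division of labor concrete, here is a minimal HDFS read through Hadoop's `FileSystem` API: the client asks the NameNode where the blocks live, then streams the bytes from DataNodes. The NameNode address and file path are assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; the hostname is hypothetical
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the block data itself comes from DataNodes
            }
        }
    }
}
```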
- Secondary NameNode role: a node that creates periodic checkpoints of the NameNode's metadata. It periodically downloads the current NameNode image and edit log files, merges the log into a new image file, and uploads the result back to the NameNode. A machine assigned the NameNode role should not also be assigned the Secondary NameNode role.
- Standby NameNode role: a NameNode in standby mode whose metadata (namespace information and block locations) is kept synchronized with the metadata of the Active NameNode. Once it is switched to active mode, its NameNode services are immediately available.
- JournalNode role: the Standby NameNode and the Active NameNode communicate through JournalNodes to keep their information synchronized.
- HBase role: HBase is a distributed, column-oriented, open-source database. HBase provides BigTable-like capabilities on top of Hadoop and is a subproject of the Apache Hadoop project. Unlike common relational databases, HBase is a database suited to unstructured data storage; another difference is that HBase is column-based rather than row-based.
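A minimal write-then-read against HBase through its Java client, which locates the cluster via ZooKeeper; the quorum, the `users` table, and the `info` column family are assumptions (the table would need to exist with that family):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // hypothetical quorum
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Columns live inside a column family; rows are keyed by byte arrays
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);
            Result r = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```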
- Kafka role: Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity-stream data of a consumer-scale website. Such activity data (page views, searches, and other user actions) is a key ingredient of many social features on the modern web. Because of the throughput requirements, this data is usually handled through log processing and log aggregation, which is a workable solution for log data and offline analysis systems like Hadoop but is too limited for real-time processing. Kafka is designed to unify online and offline message processing: it can load data into Hadoop in parallel and also serve real-time consumption across a cluster.
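As a sketch of the publish side, a producer can push activity events onto a Kafka topic; the broker address, the `page-views` topic, and the key/value are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each user action becomes one message on the "page-views" topic
            producer.send(new ProducerRecord<>("page-views", "user42", "/index.html"));
        }
    }
}
```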
- Redis role: Redis is an open-source key-value database written in C, with network support, in-memory operation, and optional persistence through logging, and it provides APIs for a variety of languages.
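A small sketch using Jedis, one of the Java APIs for Redis; the host and keys are assumptions:

```java
import redis.clients.jedis.Jedis;

public class RedisDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("redis-host", 6379)) { // hypothetical host
            jedis.set("session:42", "alice");  // the value is held in memory
            jedis.expire("session:42", 3600);  // with an optional TTL in seconds
            System.out.println(jedis.get("session:42"));
        }
    }
}
```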
- Scala role: Scala is a multi-paradigm, Java-like programming language designed to be a scalable language that integrates the features of object-oriented and functional programming.
- Sqoop role: Sqoop is a tool for moving data between Hadoop and relational databases. It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and it can also export data from HDFS back into a relational database.
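In practice Sqoop is driven from the command line; a hedged example of importing a MySQL table into HDFS, where the host, database, table, and paths are all assumptions:

```
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username report \
  --password-file /user/hadoop/.db-pass \
  --table orders \
  --target-dir /warehouse/orders \
  -m 4
```

The `-m 4` flag splits the import across four parallel map tasks; `sqoop export` works the same way in the opposite direction.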
- Impala role: Impala is a query system developed by Cloudera that provides SQL semantics for querying PB-scale big data stored in HDFS and HBase. The existing Hive system also provides SQL semantics, but Hive executes queries with the batch-oriented MapReduce engine, which makes interactive querying difficult. By contrast, Impala's biggest feature and selling point is its speed.
- Crawler role: Crawler is a proprietary DKHadoop component: a large-scale, fast crawler system that crawls both dynamic and static data.
- Spark role: Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it better suited for certain workloads: Spark keeps distributed datasets in memory, and in addition to providing interactive queries it can also optimize iterative workloads. Spark is implemented in Scala and uses Scala as its application framework; unlike in Hadoop, Spark and Scala are tightly integrated, and Scala can manipulate distributed datasets as easily as local collection objects.
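A sketch of the in-memory dataset idea through Spark's Java API (a word count); the input path is hypothetical, and `cache()` is what keeps the dataset in memory for reuse across iterations:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An RDD is a distributed dataset that can be cached in cluster memory
            JavaRDD<String> lines =
                    sc.textFile("hdfs://namenode-host:8020/data/sample.txt").cache();
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(l -> Arrays.asList(l.split("\\s+")).iterator())
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey(Integer::sum);
            counts.take(10).forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```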
- HUE role: HUE is a set of web applications for interacting with a Hadoop cluster. With HUE you can browse HDFS and jobs, manage the Hive metastore, run Hive queries, browse HBase, export data with Sqoop, submit MapReduce programs, build a customized search engine with Solr, and schedule repetitive workflows.