In a nutshell, Hadoop is a tool for storing and analyzing massive data: an open source framework for storing huge datasets and running distributed analysis applications on clusters of servers. It processes massive data in a reliable, efficient, and scalable manner.

  • HDFS is a distributed file system. It introduces a NameNode server, which stores file metadata, and DataNode servers, which store the data itself, so that data can be written and read in a distributed manner.
  • MapReduce is a computing framework: its core idea is to assign computing tasks to the servers in a cluster. A job is split into Map and Reduce tasks, and the JobTracker distributes them across the cluster for execution (see the word-count sketch after this list).
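
To make the Map/Reduce split concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API. The class names are illustrative; this is just one way to write it, not a canonical implementation.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: split each input line into words and emit a (word, 1) pair per word.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: gather all the counts emitted for the same word and sum them.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The framework groups all values with the same key between the two phases, so the reducer sees every count for a given word at once.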

Roles and functions of Hadoop

  • Hadoop uses distributed storage to improve read/write speed and expand storage capacity
  • MapReduce integrates the data on the distributed file system, so that data can be analyzed and processed efficiently
  • Hadoop also ensures data reliability by storing redundant copies of the data, so nothing is lost when a node fails (see the sketch after this list)
  • The high fault tolerance of HDFS, together with the fact that Hadoop is developed in Java, makes it possible to deploy Hadoop on inexpensive clusters of commodity computers
  • The data management capability of HDFS, the efficiency of MapReduce in processing tasks, and Hadoop's open source nature make it stand out among similar distributed systems, and it is widely adopted in many industries
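
From the client's point of view, the redundancy above is just a per-file replication factor. A minimal sketch using the org.apache.hadoop.fs.FileSystem API; the cluster URI and file path are assumptions for illustration:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed cluster address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt");  // illustrative path
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
            }
            // Ask HDFS to keep 3 redundant copies of this file's blocks.
            fs.setReplication(file, (short) 3);
        }
    }
}
```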

The advantages of Hadoop

  • Hadoop is reliable: it assumes that compute and storage elements will fail, so it maintains multiple copies of the working data and redistributes processing away from failed nodes
  • Hadoop is efficient: it parallelizes processing across nodes, which speeds it up, and it scales to handle petabytes of data
  • Hadoop is low cost: it runs on cheap commodity servers, so it is affordable for anyone
  • Runs on Linux: Hadoop's framework is written in Java, and it is well suited to running on a Linux production platform
  • Multiple programming languages are supported: applications on Hadoop can also be written in other languages, such as C++

Illustration of Hadoop architecture

  • At the bottom is the Hadoop Distributed File System (HDFS), which stores files across all storage nodes in the Hadoop cluster. The MapReduce engine sits on top of HDFS.
  • HBase is a distributed column-oriented database (an abstraction built on HDFS) at the structured storage layer (see the sketch after this list).
  • ZooKeeper is a distributed, highly available coordination service that provides basic primitives such as distributed locking.
  • Hive is a data warehouse built on Hadoop, used to manage structured or semi-structured data stored in HDFS or HBase.
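
For a taste of the structured storage layer, here is a hedged sketch using the HBase client API. The table name, column family, and values are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] name = result.getValue(Bytes.toBytes("info"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```

Under the hood, HBase persists these cells in files on HDFS, which is what "an abstraction built on HDFS" means above.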

HDFS and MapReduce form the core of the Hadoop distributed system architecture. They interact with each other to complete the main tasks of a Hadoop distributed cluster.

It’s just a brief mention for now, and we’ll talk about it later!

== The next two sections expand on the components introduced above ==

Diagram of the HDFS architecture

HDFS has a master-slave structure, with a master node and slave nodes. It consists of a NameNode, a Secondary NameNode, and DataNodes.

  • NameNode: runs on the master node and manages the HDFS namespace and block mapping information, i.e., the metadata describing files and how they map to data blocks.

Namespace: for example, which folders exist in the file system and which files are in those folders

  • Secondary NameNode: often described as a backup of the NameNode; strictly speaking, it periodically merges the NameNode's edit log into the namespace image rather than acting as a hot standby.
  • DataNode: a slave node that stores data, serves block reads and writes, and reports its storage information to the NameNode.
  • HDFS lets users store data in the form of files. Each file is split into data blocks, and those blocks are stored on a group of DataNodes.
  • The NameNode coordinates the operation, and the DataNodes return data directly to the client, as the read sketch below illustrates.
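
A minimal read-side sketch, again with the FileSystem API (the URI and paths are assumptions, matching the earlier write example). The NameNode lookup and the direct DataNode transfers all happen behind fs.open():

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed address

        try (FileSystem fs = FileSystem.get(conf)) {
            // Namespace: ask the NameNode which files live under /demo.
            for (FileStatus status : fs.listStatus(new Path("/demo"))) {
                System.out.println(status.getPath());
            }

            // Open a file: the NameNode returns the block locations, and the
            // bytes then stream directly from the DataNodes that hold them.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/demo/hello.txt")),
                                          StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}
```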

Diagram of the MapReduce architecture

The MapReduce framework consists of a single JobTracker running on the master node and a TaskTracker running on each slave node in the cluster.

The master node learns about each slave node's availability through periodic heartbeats.

The master node is responsible for scheduling all the tasks that make up a job; these tasks are distributed across the slave nodes. The master node monitors their execution and re-executes tasks that fail. When a job is submitted, the JobTracker receives the job and its configuration information and distributes the configuration to the slave nodes. At the same time, the JobTracker schedules the job and monitors the TaskTrackers' execution (a minimal driver sketch follows).
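
To tie the submission flow to code, here is a minimal driver that packages the mapper and reducer from the earlier word-count sketch into a Job and hands it to the cluster for scheduling. The input and output paths are placeholders, and the classes are assumed to live in the same package:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/demo/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/demo/output")); // placeholder

        // Submitting the job hands it to the master (the JobTracker in classic
        // MapReduce), which splits it into tasks and monitors the slaves.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```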

A TaskTracker typically runs one-to-one alongside a DataNode on the same machine, while the JobTracker and the NameNode need not be co-located.

As for the MapReduce flow itself, it is divided into Map and Reduce. Map takes one large task and splits it into smaller tasks, which run on the slaves; Reduce then gathers the partial results from each slave and combines them into the final result.