Background

I collected some scattered materials while learning Hadoop. I'm organizing them into this series as a record for my future review.

Hadoop Document Summary

The two core designs

The Hadoop framework has two core designs:

  • HDFS: stores massive amounts of data
  • MapReduce: provides distributed computation over massive amounts of data
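To make the division of labor concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API: the input and output live on HDFS, while MapReduce performs the computation. The class name and paths are illustrative, not taken from the original documentation.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output both live on HDFS; the paths come from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```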

Fault tolerance of HDFS

  1. HDFS (Hadoop Distributed File System) is the distributed file system implemented by Hadoop.
  2. HDFS is highly fault tolerant and is designed to be deployed on inexpensive hardware; Google, for example, runs its infrastructure on large numbers of commodity Linux PCs rather than high-end servers.
  3. Hadoop is reliable because it assumes that compute and storage elements will fail; it therefore maintains multiple copies of working data so that processing can be redistributed away from failed nodes.
  4. Hadoop automatically keeps multiple copies of data and automatically reassigns failed tasks.
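Replication is also something a client can influence directly. Below is a minimal sketch, assuming the standard org.apache.hadoop.fs.FileSystem API; the file path and replica counts are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication controls how many copies of each block HDFS keeps;
        // the stock default is 3
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of an existing (illustrative) file;
        // the NameNode schedules the extra copies in the background
        fs.setReplication(new Path("/user/demo/important.log"), (short) 5);

        fs.close();
    }
}
```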

HDFS file system

Hadoop Distributed File System (HDFS) stores files on all storage nodes in the Hadoop cluster.

  1. To external clients, HDFS looks like a traditional hierarchical file system: files can be created, deleted, moved, renamed, and so on. The architecture, however, is built from a specific set of node types: a single NameNode, which provides metadata services within HDFS, and DataNodes, which provide the storage blocks.
  2. Disadvantage: because there is only one NameNode, it is a single point of failure for HDFS.
  3. Files stored in HDFS are divided into blocks, and these blocks are replicated to multiple machines (DataNodes). The block size (typically 64 MB) and the number of replicas are chosen by the client when the file is created, as sketched below. The NameNode coordinates all file operations.
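The per-file block size and replica count really are parameters the client passes at creation time. A hedged sketch using the FileSystem.create overload that takes both; the path, buffer size, and sizes are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create(path, overwrite, bufferSize, replication, blockSize):
        // block size and replica count are chosen per file by the client
        FSDataOutputStream out = fs.create(
                new Path("/user/demo/data.bin"), // illustrative path
                true,                            // overwrite if it exists
                4096,                            // I/O buffer size in bytes
                (short) 3,                       // three replicas per block
                64L * 1024 * 1024);              // 64 MB blocks, as in the text

        out.writeUTF("hello HDFS");
        out.close();
        fs.close();
    }
}
```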

NameNode

  1. The NameNode is a piece of software that typically runs on a dedicated machine in an HDFS instance. It manages the file system namespace and controls access by external clients. The NameNode decides how files map to replicated blocks on DataNodes; the most common replication factor is three.
  2. Actual I/O traffic never passes through the NameNode; only the metadata mapping files to blocks on DataNodes does (see the sketch after this list). When an external client asks to create a file, the NameNode responds with a block ID and the IP address of the DataNode that will hold the first copy of the block. The NameNode also notifies the other DataNodes that will receive copies of that block.
  3. The NameNode stores the file system namespace and transaction log files. This information is kept on the NameNode's local file system, so these files must be protected against corruption and against loss of the NameNode machine itself.
  4. The NameNode is therefore an unavoidable Single Point Of Failure (SPOF) risk. A plain active/standby setup does not fully solve this; only a non-stop (highly available) NameNode can deliver 100% uptime.
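Point 2 can be observed from client code: asking where a file's blocks live is a pure metadata call answered by the NameNode, while the bytes themselves only ever flow between the client and DataNodes. A minimal sketch (the file path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/data.bin"));

        // The NameNode answers this from its namespace metadata;
        // no file data moves during the call
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```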

DataNode

  1. Each DataNode also typically runs on its own machine in an HDFS instance.
  2. A Hadoop cluster contains one NameNode and a large number of DataNodes.
  3. DataNodes respond to read/write requests from HDFS clients, and to commands from the NameNode to create, delete, and replicate blocks.
  4. The NameNode relies on regular heartbeat messages from each DataNode. Each message contains a block report, against which the NameNode validates its block mappings and other file system metadata. If a DataNode stops sending heartbeats, the NameNode takes corrective action, re-replicating the blocks that node held onto other DataNodes (the sketch below shows how this heartbeat-derived state can be inspected).
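One way to see the accumulated result of those heartbeats is DistributedFileSystem.getDataNodeStats(), which reports the DataNodes the NameNode currently knows about. A minimal sketch, assuming fs.defaultFS points at an HDFS cluster (otherwise the cast below fails):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DataNodeReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // getDataNodeStats() is HDFS-specific, hence the cast
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // Each entry reflects what the NameNode has learned from heartbeats
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.printf("%s capacity=%d remaining=%d lastContact=%d%n",
                    node.getHostName(), node.getCapacity(),
                    node.getRemaining(), node.getLastUpdate());
        }
        fs.close();
    }
}
```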

File operation process

  1. When a client wants to write a file to HDFS, it first caches the data in a local temporary file. Once the cached data exceeds one HDFS block size, the file-creation request is sent to the NameNode, which responds with the DataNode identity and the target block.
  2. The NameNode also notifies the DataNodes that will hold replicas of the file block. When the client starts sending the temporary file to the first DataNode, the block contents are immediately pipelined to the replica DataNodes. The client is also responsible for creating checksum files, which are stored in the same HDFS namespace.
  3. After the last file block is sent, the NameNode commits the file creation into its persistent metadata store (the EditLog and FsImage files).
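From the client's point of view, the whole pipeline above collapses into an ordinary stream: create, write, close. A minimal sketch (the path is illustrative); close() is where the final data is flushed and the NameNode commits the file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() asks the NameNode for a target block and DataNode;
        // the returned stream hides the buffering and the replica pipeline
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/notes.txt"))) {
            out.writeBytes("first line\n");
            out.writeBytes("second line\n");
        } // close() flushes the last block; the NameNode then commits the file

        fs.close();
    }
}
```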