Preface
This article introduces the HDFS architecture and its execution flow, and provides a programming example of read and write operations, in the hope of giving the reader a preliminary understanding of HDFS.
Introduction
HDFS (Hadoop Distributed File System) is a distributed file system that runs on commodity hardware. Its design derives from The Google File System, a paper published by Google in 2003, and it aims to solve large-scale data storage and management problems.
Architecture
As the figure above shows, HDFS follows a standard master/slave architecture and consists of three components:
- NameNode (Master node)
- Manages metadata, which consists of file path names, data block IDs, and storage locations
- Manages the HDFS namespace
- SecondaryNameNode
- Periodically merges the NameNode's edit log (the sequence of changes to the file system) into the fsimage (a snapshot of the entire file system) and copies the merged fsimage back to the NameNode
- Provides checkpoints of the NameNode's state (it should not be regarded as a backup of the NameNode) that can be used to recover the NameNode
- DataNode (Slave node)
- Provides file storage and data block operations.
- Periodically reports block information to the NameNode.
Two more concepts that appear in the figure are worth noting (both can be inspected in code, as sketched after this list):

- Replication: to ensure high data availability, HDFS stores three replicas of written data by default.
- Blocks: a block is the basic unit of storage and operation (128 MB by default). A block here refers to a file system block rather than a physical block, and its size is usually an integer multiple of the physical block size.
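Both defaults can be queried through the FileSystem API. Below is a minimal sketch, assuming the pseudo-distributed setup at hdfs://localhost:9000 used later in this article (the class name is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Defaults {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        FileSystem fs = FileSystem.get(conf);
        // Block size and replication factor that new files will use by default
        System.out.println("Block size: " + fs.getDefaultBlockSize(new Path("/")));
        System.out.println("Replication: " + fs.getDefaultReplication(new Path("/")));
        fs.close();
    }
}
```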
Execution Process
Reading a File
The process of reading a file can be summarized as:
- The Client sends a request to the NameNode to obtain the locations of the file's data blocks
- The Client connects to the DataNodes holding each block, preferring the closest ones, and reads the data (a sketch of the location lookup follows)
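The block-location lookup in the first step can be observed directly through the FileSystem API. A minimal sketch, assuming the address and file path from the programming example later in this article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Locations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/tmp/demo.txt")); // path from the example below
        // Ask the NameNode which DataNodes hold each block of the file
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + " on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```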
Writing a File
The process of writing a file can be summarized as:
- The Client sends a write request to the NameNode and obtains information such as the list of DataNodes that can be written to
- The Client splits the file into blocks according to the block size configured in HDFS
- The Client and the DataNodes assigned by the NameNode form a pipeline, along which the data is written
- After the write completes, the NameNode receives confirmation from the DataNodes and updates the metadata (a sketch of setting per-file block parameters follows)
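The block size and replication factor can also be overridden per file when the write request is issued, which makes the first two steps visible in code. A minimal sketch, with illustrative path and values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        FileSystem fs = FileSystem.get(conf);
        // Request a two-DataNode pipeline (replication = 2) and 64 MB blocks for this file only
        FSDataOutputStream out = fs.create(new Path("/tmp/pipeline.txt"), // illustrative path
                true,              // overwrite if the file already exists
                4096,              // client-side buffer size in bytes
                (short) 2,         // replication factor
                64L * 1024 * 1024  // block size in bytes
        );
        out.writeUTF("Hello HDFS");
        out.close(); // closing flushes the stream and completes the pipeline write
        fs.close();
    }
}
```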
Common Commands
File Operations
- List files: `hdfs dfs -ls <path>`
- Create a directory: `hdfs dfs -mkdir <path>`
- Upload a file: `hdfs dfs -put <localsrc> <dst>`
- Print file contents: `hdfs dfs -cat <src>`
- Copy a file to the local file system: `hdfs dfs -get <src> <localdst>`
- Delete files and directories: `hdfs dfs -rm <src>` and `hdfs dfs -rmdir <dir>` (Java equivalents of several of these are sketched below)
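Most of these shell commands have direct counterparts in the FileSystem API. The sketch below shows a few of them, with illustrative paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShellEquivalents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(new Path("/tmp/dir"));                                  // hdfs dfs -mkdir
        fs.copyFromLocalFile(new Path("demo.txt"), new Path("/tmp/dir")); // hdfs dfs -put
        for (FileStatus status : fs.listStatus(new Path("/tmp/dir"))) {   // hdfs dfs -ls
            System.out.println(status.getPath());
        }
        fs.delete(new Path("/tmp/dir"), true);                            // recursive delete
        fs.close();
    }
}
```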
Administration
- View usage statistics: `hdfs dfsadmin -report`
- Enter and leave safe mode (no file system changes are allowed in this mode): `hdfs dfsadmin -safemode enter` and `hdfs dfsadmin -safemode leave` (see the sketch below)
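Safe mode can also be queried or toggled from Java through DistributedFileSystem. A minimal sketch, assuming the cast succeeds because fs.defaultFS points at an HDFS cluster (API as of Hadoop 2.x):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants;

public class SafeMode {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        // FileSystem.get returns a DistributedFileSystem when the scheme is hdfs://
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        // SAFEMODE_GET only queries the current state; SAFEMODE_ENTER / SAFEMODE_LEAVE change it
        System.out.println("In safe mode: "
                + dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_GET));
        dfs.close();
    }
}
```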
Programming Example
- Create a Maven project in IDEA
  After checking the relevant options, click Next and fill in the project information.
- Add dependencies to pom.xml

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <!-- Choose the version that matches your Hadoop installation -->
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.9.2</version>
    </dependency>
</dependencies>
```
- Read and write files

  Create a Sample class and write the corresponding read and write functions.

- Sample class

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.*;

/**
 * @author ikroal
 */
public class Sample {
    // The default HDFS address
    private static final String DEFAULT_FS = "hdfs://localhost:9000";
    private static final String PATH = DEFAULT_FS + "/tmp/demo.txt";
    private static final String DEFAULT_FILE = "demo.txt";

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        FileSystem fs = null;
        conf.set("fs.defaultFS", DEFAULT_FS); // Set the HDFS address
        try {
            fs = FileSystem.get(conf);
            write(fs, DEFAULT_FILE, PATH);
            read(fs, PATH);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (fs != null) {
                    fs.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
```
- Write function

```java
/**
 * Write a local file to HDFS.
 *
 * @param inputPath local file path
 * @param outPath   HDFS write path
 */
public static void write(FileSystem fileSystem, String inputPath, String outPath) {
    FSDataOutputStream outputStream = null;
    FileInputStream inputStream = null;
    try {
        outputStream = fileSystem.create(new Path(outPath)); // Get the HDFS output stream
        inputStream = new FileInputStream(inputPath);        // Open the local file
        int data;
        while ((data = inputStream.read()) != -1) { // Copy byte by byte
            outputStream.write(data);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (outputStream != null) {
                outputStream.close();
            }
            if (inputStream != null) {
                inputStream.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
- Read function

```java
/**
 * Read a file from HDFS and print it to the console.
 *
 * @param path path of the file to read in HDFS
 */
public static void read(FileSystem fileSystem, String path) {
    FSDataInputStream inputStream = null;
    BufferedReader reader = null;
    try {
        inputStream = fileSystem.open(new Path(path)); // Get the HDFS input stream
        reader = new BufferedReader(new InputStreamReader(inputStream));
        String content;
        while ((content = reader.readLine()) != null) { // Read line by line and print to the console
            System.out.println(content);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (reader != null) { // Closing the reader also closes the wrapped stream
                reader.close();
            } else if (inputStream != null) {
                inputStream.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
- Create the file you plan to upload (demo.txt in this example) in the root directory of the project and write Hello World! into it.
- Start Hadoop and run the program to see the results.
The result of the write can be viewed at http://localhost:50070/explorer.html#/, and the console prints the contents of the uploaded file.
References
- Have a basic knowledge of HDFS architecture and principles
- In-depth understanding of HDFS: Hadoop distributed file system
- HDFS read and write process (most refined and detailed ever)
- Hadoop Learning Path (11) HDFS read and write details