Preface
This article introduces the HDFS architecture and its execution flow, and provides a programming example of read and write operations, in the hope of giving the reader a preliminary understanding of HDFS.
Introduction
HDFS (Hadoop Distributed File System) is a distributed file system that runs on commodity hardware. Its design derives from The Google File System, a paper published by Google in 2003, and it aims to solve large-scale data storage and management problems.
Architecture
As the figure above shows, HDFS follows a standard master/slave architecture and consists of three components:
- NameNode (Master node)
- Manages metadata, which consists of file path names, data block IDs, and storage locations
- Manages the HDFS namespace
- SecondaryNameNode
- Periodically merges the NameNode's edit log (the sequence of changes to the file system) into the fsimage (a snapshot of the entire file system) and copies the merged fsimage back to the NameNode
- Provides checkpoints of the NameNode's state (it should not be regarded as a backup of the NameNode) that can be used to recover the NameNode
- DataNode (Slave node)
- Provides file storage and data block operations.
- Periodically reports block information to the NameNode.
Two more concepts that appear in the figure are worth noting (both can be inspected in code, as sketched after this list):

- Replication: to ensure high data availability, HDFS stores three replicas of written data by default.
- Blocks: a block is the basic unit of storage and operation (128 MB by default). A block here refers to a file system block rather than a physical block, and its size is usually an integer multiple of the physical block size.
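Both defaults can be queried through the FileSystem API. Below is a minimal sketch, assuming the pseudo-distributed setup at hdfs://localhost:9000 used later in this article (the class name is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Defaults {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        FileSystem fs = FileSystem.get(conf);
        // Block size and replication factor that new files will use by default
        System.out.println("Block size: " + fs.getDefaultBlockSize(new Path("/")));
        System.out.println("Replication: " + fs.getDefaultReplication(new Path("/")));
        fs.close();
    }
}
```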
Execution Process
Reading a File
The process of reading a file can be summarized as:
- The Client sends a request to the NameNode to obtain the locations of the file's data blocks
- The Client connects to the DataNodes holding each block, preferring the closest ones, and reads the data (a sketch of the location lookup follows)
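The block-location lookup in the first step can be observed directly through the FileSystem API. A minimal sketch, assuming the address and file path from the programming example later in this article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Locations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/tmp/demo.txt")); // path from the example below
        // Ask the NameNode which DataNodes hold each block of the file
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + " on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```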
Writing a File
The process of writing a file can be summarized as:
- The Client sends a write request to the NameNode and obtains information such as the list of DataNodes that can be written to
- The Client splits the file into blocks according to the block size configured in HDFS
- The Client and the DataNodes assigned by the NameNode form a pipeline, along which the data is written
- After the write completes, the NameNode receives confirmation from the DataNodes and updates the metadata (a sketch of setting per-file block parameters follows)
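The block size and replication factor can also be overridden per file when the write request is issued, which makes the first two steps visible in code. A minimal sketch, with illustrative path and values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        FileSystem fs = FileSystem.get(conf);
        // Request a two-DataNode pipeline (replication = 2) and 64 MB blocks for this file only
        FSDataOutputStream out = fs.create(new Path("/tmp/pipeline.txt"), // illustrative path
                true,              // overwrite if the file already exists
                4096,              // client-side buffer size in bytes
                (short) 2,         // replication factor
                64L * 1024 * 1024  // block size in bytes
        );
        out.writeUTF("Hello HDFS");
        out.close(); // closing flushes the stream and completes the pipeline write
        fs.close();
    }
}
```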
Common Commands
File Operations
- List files: `hdfs dfs -ls <path>`
- Create a directory: `hdfs dfs -mkdir <path>`
- Upload a file: `hdfs dfs -put <localsrc> <dst>`
- Print file contents: `hdfs dfs -cat <src>`
- Copy a file to the local file system: `hdfs dfs -get <src> <localdst>`
- Delete files and directories: `hdfs dfs -rm <src>` and `hdfs dfs -rmdir <dir>` (Java equivalents of several of these are sketched below)
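Most of these shell commands have direct counterparts in the FileSystem API. The sketch below shows a few of them, with illustrative paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShellEquivalents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(new Path("/tmp/dir"));                                  // hdfs dfs -mkdir
        fs.copyFromLocalFile(new Path("demo.txt"), new Path("/tmp/dir")); // hdfs dfs -put
        for (FileStatus status : fs.listStatus(new Path("/tmp/dir"))) {   // hdfs dfs -ls
            System.out.println(status.getPath());
        }
        fs.delete(new Path("/tmp/dir"), true);                            // recursive delete
        fs.close();
    }
}
```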
Administration
- View usage statistics: `hdfs dfsadmin -report`
- Enter and leave safe mode (no file system changes are allowed in this mode): `hdfs dfsadmin -safemode enter` and `hdfs dfsadmin -safemode leave` (see the sketch below)
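Safe mode can also be queried or toggled from Java through DistributedFileSystem. A minimal sketch, assuming the cast succeeds because fs.defaultFS points at an HDFS cluster (API as of Hadoop 2.x):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants;

public class SafeMode {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed HDFS address
        // FileSystem.get returns a DistributedFileSystem when the scheme is hdfs://
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        // SAFEMODE_GET only queries the current state; SAFEMODE_ENTER / SAFEMODE_LEAVE change it
        System.out.println("In safe mode: "
                + dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_GET));
        dfs.close();
    }
}
```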
Programming Example
- Create a Maven project in IDEA
  After checking the relevant options, click Next and fill in the project information.
- Add dependencies to pom.xml

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <!-- Choose the version that matches your Hadoop installation -->
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.9.2</version>
    </dependency>
</dependencies>
```
- Read and write files

  Create a Sample class and write the corresponding read and write functions.

- Sample class

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.*;

/**
 * @author ikroal
 */
public class Sample {
    // The default HDFS address
    private static final String DEFAULT_FS = "hdfs://localhost:9000";
    private static final String PATH = DEFAULT_FS + "/tmp/demo.txt";
    private static final String DEFAULT_FILE = "demo.txt";

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        FileSystem fs = null;
        conf.set("fs.defaultFS", DEFAULT_FS); // Set the HDFS address
        try {
            fs = FileSystem.get(conf);
            write(fs, DEFAULT_FILE, PATH);
            read(fs, PATH);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (fs != null) {
                    fs.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
```
- Write function

```java
/**
 * Write a local file to HDFS.
 *
 * @param inputPath local file path
 * @param outPath   HDFS write path
 */
public static void write(FileSystem fileSystem, String inputPath, String outPath) {
    FSDataOutputStream outputStream = null;
    FileInputStream inputStream = null;
    try {
        outputStream = fileSystem.create(new Path(outPath)); // Get the HDFS output stream
        inputStream = new FileInputStream(inputPath);        // Open the local file
        int data;
        while ((data = inputStream.read()) != -1) { // Copy byte by byte
            outputStream.write(data);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (outputStream != null) {
                outputStream.close();
            }
            if (inputStream != null) {
                inputStream.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
- Read function

```java
/**
 * Read a file from HDFS and print it to the console.
 *
 * @param path path of the file to read in HDFS
 */
public static void read(FileSystem fileSystem, String path) {
    FSDataInputStream inputStream = null;
    BufferedReader reader = null;
    try {
        inputStream = fileSystem.open(new Path(path)); // Get the HDFS input stream
        reader = new BufferedReader(new InputStreamReader(inputStream));
        String content;
        while ((content = reader.readLine()) != null) { // Read line by line and print to the console
            System.out.println(content);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (reader != null) { // Closing the reader also closes the wrapped stream
                reader.close();
            } else if (inputStream != null) {
                inputStream.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
- Create the file you plan to upload (demo.txt in this example) in the root directory of the project and write Hello World! into it.
- Start Hadoop and run the program to see the results.
The result of the write can be viewed at http://localhost:50070/explorer.html#/, and the console prints the contents of the uploaded file.
References
- Have a basic knowledge of HDFS architecture and principles
- In-depth understanding of HDFS: Hadoop distributed file system
- HDFS read and write process (most refined and detailed ever)
- Hadoop Learning Path (11) HDFS read and write details