1. File system
- File system definition: a file system is a method of storing and organizing computer data that makes the data easier to access and find.
- File name: in a file system, the file name is used to locate where the file is stored.
- Metadata: data that describes file attributes, such as the file name, file length, owning user and group, and file storage location.
- Data block: the smallest unit of file storage. The storage medium is divided into fixed-size areas, and these areas are allocated as needed when files are stored.
2. HDFS features
Scenarios HDFS is not suited to:
- Low-latency data access, for example applications that need responses in the tens-of-milliseconds range
- Reason: HDFS is optimized for high data throughput, at the cost of higher latency.
- A large number of small files
- Reason: when the NameNode starts, the entire file system metadata is loaded into memory, so the total number of files the file system can hold is limited by the NameNode's memory capacity. As a rule of thumb, each file, directory, and data block takes roughly 150 bytes. With one million files, each occupying a single data block, you need at least about 300 MB of memory; storing a billion files at that rate would require far more memory than a single NameNode can provide (a back-of-the-envelope sketch of this estimate follows the list).
- Multiple writers or arbitrary file modification
- Reason: an HDFS file has only one writer at a time, and writes always append to the end of the file.
- Streaming vs. random data access
- HDFS is designed for streaming access: a data set is generated once and then analyzed many times over a long period, with each analysis touching most or all of the data, so the latency of reading the whole data set matters more than the latency of reading the first record. Random data access is the opposite: it needs low latency when locating, querying, or modifying data and suits data that is read and written repeatedly after creation; traditional relational databases fit that pattern well, while HDFS does not.
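To make the small-file limit concrete, here is a minimal back-of-the-envelope sketch in Java, assuming the ~150 bytes per file/directory/block rule of thumb quoted above; the figures are rough estimates of NameNode heap usage, not exact measurements.

```java
public class NameNodeMemoryEstimate {
    // Rule of thumb from the text: each file, directory, and block
    // costs roughly 150 bytes of NameNode heap (an approximation).
    static final long BYTES_PER_OBJECT = 150;

    // Estimate heap usage for `files` files that each occupy `blocksPerFile` blocks.
    static long estimateHeapBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // file objects + block objects
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 1,000,000 files x 1 block each -> 2,000,000 objects x 150 B ~= 300 MB
        System.out.printf("1M files:  %.0f MB%n",
            estimateHeapBytes(1_000_000, 1) / (1024.0 * 1024));
        // 1,000,000,000 files x 1 block each -> ~280 GB, far beyond a single NameNode
        System.out.printf("1B files:  %.0f GB%n",
            estimateHeapBytes(1_000_000_000, 1) / (1024.0 * 1024 * 1024));
    }
}
```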
3. Hadoop storage model
- Files are handled as sequences of bytes
- A file is linearly split into blocks, each identified by its byte offset
- Blocks are distributed among cluster nodes
- Within a single file all blocks use the same block size; different files may use different block sizes
- The number of block replicas (copies) can be set; replicas are scattered across different nodes with no fixed placement order
- The replication factor must not exceed the number of DataNodes
- Block size and replication factor can be set when a file is uploaded (see the API sketch after this list)
- For a file that has already been uploaded, the replication factor can still be adjusted, but the block size cannot
- Write once, read many: a file has only one writer at a time
- Data can be appended to an existing file
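As noted above, block size and replication are chosen at upload time, and replication can be changed afterwards. The following sketch illustrates this with the Hadoop `FileSystem` Java API; the NameNode address `hdfs://namenode:8020` and the path `/demo/data.log` are placeholders, and the code is an illustrative sketch rather than production-ready.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder cluster address; replace with your NameNode URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/demo/data.log"); // hypothetical example path

        // Block size and replication factor are chosen when the file is created.
        long blockSize = 128L * 1024 * 1024; // 128 MB blocks for this file
        short replication = 3;               // must not exceed the number of DataNodes
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.writeBytes("hello hdfs\n");
        }

        // After upload, the replication factor can still be changed...
        fs.setReplication(file, (short) 2);
        // ...but the block size of an existing file cannot.

        // Appending is supported (still only one writer at a time).
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("more data\n");
        }

        fs.close();
    }
}
```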
4. Architecture model of Hadoop
- A file consists of two parts: the metadata (MetaData) and the data itself
- The (primary) NameNode stores the file metadata: a single node (POSIX-style namespace)
- DataNodes store the files' block data: multiple nodes
- DataNodes maintain a heartbeat with the NameNode and report their block lists
- The HDFS client exchanges metadata information with the NameNode (a block-location sketch follows this list)
- The HDFS client exchanges file block data with DataNodes (client/server)
- DataNodes store the data blocks on the local file system of the server they run on
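The split between metadata (held by the NameNode) and block data (held by DataNodes) is visible from the client side: asking for a file's block locations is purely a metadata call. A minimal sketch, reusing the placeholder NameNode address and file path from the previous example:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/demo/data.log"); // hypothetical example path
        FileStatus status = fs.getFileStatus(file);

        // The block list and replica locations come from the NameNode's metadata;
        // the block contents themselves live on the DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```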
5. Component introduction
- NameNode
- Main functions
- Receives read and write requests from clients
- Collects the block list information reported by DataNodes
- NameNode metadata includes
- File ownership and permissions
- File size and timestamps
- The block list (each block's offset); block location information is not persisted
- The location of each block replica (reported by DataNodes)
- Memory-based storage: metadata is never swapped back and forth with disk (no two-way exchange)
- It lives only in memory
- Persistence is one-way (memory to disk)
- NameNode persistence
- The NameNode's metadata is loaded into memory after startup
- Metadata is written to a disk file named "fsimage" (a point-in-time snapshot)
- Block location information is not saved to fsimage
- The edits log records every operation on the metadata (comparable to how Redis pairs snapshots with an append-only log)
- SecondaryNameNode (SNN, exists only in Hadoop 1.x)
- Main function: it is not a hot backup of the NameNode (although it can serve as one); its main job is to help the NameNode merge the edits log into fsimage, reducing NameNode startup time.
- When the SNN performs a merge (checkpoint)
- By time: the interval fs.checkpoint.period, 3600 seconds (one hour) by default
- By size: when the edits log reaches fs.checkpoint.size, 64 MB by default
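The two checkpoint triggers correspond to the configuration keys named above. Below is a minimal sketch of setting them programmatically, assuming the Hadoop 1.x property names from the text; in practice these values are usually placed in the cluster's configuration files rather than set in code, and newer Hadoop releases use the dfs.namenode.checkpoint.* names instead.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Hadoop 1.x property names as used in the text above.
        conf.setLong("fs.checkpoint.period", 3600);            // checkpoint every hour
        conf.setLong("fs.checkpoint.size", 64L * 1024 * 1024); // or when edits reach 64 MB

        System.out.println("period = " + conf.getLong("fs.checkpoint.period", -1) + " s");
        System.out.println("size   = " + conf.getLong("fs.checkpoint.size", -1) + " bytes");
    }
}
```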
6. HDFS read and write process
- Writing process
- A service application invokes the API provided by the HDFS Client to request file writing.
- The HDFS Client contacts NameNode, which creates the file node in the metadata.
- Business applications call the WRITE API to write files.
- After receiving the business data, the HDFS Client obtains the block IDs and location information from the NameNode, contacts the DataNodes, and asks them to form a write pipeline. The client then writes the data to DataNode1 using HDFS's data-transfer protocol, and the data is replicated along the pipeline from DataNode1 to DataNode2 and DataNode3.
- After data is written, confirmation information is returned to the HDFS Client.
- After confirming all data, the service invokes the HDFS Client to close the file.
- When the business application calls close/flush, the HDFS Client contacts the NameNode to confirm that the write is complete, and the NameNode persists the metadata.
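From the application's point of view, the whole write pipeline above is hidden behind a few client calls. Here is a minimal sketch using the Hadoop `FileSystem` Java API, with `hdfs://namenode:8020` and `/demo/output.txt` as placeholder values; it is an illustrative sketch of the steps, not the internal implementation.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // create(): the client asks the NameNode to add the file to the namespace.
        Path file = new Path("/demo/output.txt"); // hypothetical example path
        try (FSDataOutputStream out = fs.create(file, true)) {
            // write(): data is streamed to the first DataNode of the pipeline,
            // which forwards it to the remaining replicas; the caller never sees this.
            out.writeBytes("written through the HDFS client\n");
            // hflush() pushes buffered data down the pipeline before close.
            out.hflush();
        } // close(): the client tells the NameNode the write is complete.

        fs.close();
    }
}
```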
- Reading process
- A service application invokes the API provided by the HDFS Client to open a file.
- The HDFS Client contacts the NameNode and obtains the file information, including its data blocks and the DataNodes that hold them.
- The business application calls the READ API to read the file.
- Based on the information obtained from the NameNode, the HDFS Client contacts the DataNodes to fetch the corresponding data blocks. (The client reads from the nearest replica.)
- The HDFS Client communicates with multiple Datanodes to obtain data blocks.
- After the data is read, the service calls close to close the connection.
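The read path is similarly simple on the client side: open() fetches the block locations from the NameNode, and the returned stream then pulls the blocks from the DataNodes. A minimal sketch, reusing the placeholder address and path from the write example:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // open(): block locations come from the NameNode; block contents are
        // then read from the nearest DataNode holding a replica.
        Path file = new Path("/demo/output.txt"); // hypothetical example path
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } // close(): releases the connections to the DataNodes.

        fs.close();
    }
}
```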