1. The internal execution flow of uploading a file
1.1. Let’s talk about uploading a file to HDFS
- 1: Start with a file called 1.txt; let’s say this file is 300 MB.
- 2: The client wants to upload 1.txt, so it sends an upload request to the NameNode, the boss of HDFS.
- 3: The NameNode checks whether the request has valid permissions, whether the parent directory exists, and whether a file with the same name already exists. If any check fails, the request fails; if all checks pass, continue with the following steps.
- 4: The NameNode responds to the client, granting it permission to upload the file.
- 5: The client splits the file into blocks (with the default 128 MB block size, a 300 MB file becomes three blocks) and starts to upload the first block.
- 6: The NameNode selects three suitable DataNodes according to the replica placement policy.
- 7: The NameNode returns the list of DataNode hosts to the client.
- 8: A pipeline is established from the client through DataNode1, DataNode2, and DataNode3.
- 9: The client sends data in units of packets; each packet is 64 KB.
- 10: Each DataNode (1, 2, 3) that receives a packet caches it and then passes the packet on to the next DataNode in the pipeline.
- 11: When a DataNode receives a packet, it sends an ACK response back; the packet is held in the ack (reply) queue until that ACK arrives.
- 12: After a block has been fully received, the DataNode stores the data on its hard drive.
- 13: The remaining blocks are uploaded in the same way; see the sketches after this list.
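From the client’s point of view, this whole sequence is triggered by a single API call. Below is a minimal sketch using the Hadoop FileSystem Java API; the NameNode address and file paths are placeholder assumptions, and the config keys in the comments are the standard knobs behind the steps above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; replace with your cluster's.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // Standard client-side knobs behind the steps above (usual defaults):
        //   dfs.blocksize                - block size used to split the file (128 MB)
        //   dfs.replication              - replicas per block (3)
        //   dfs.client-write-packet-size - packet size on the write pipeline (64 KB)

        try (FileSystem fs = FileSystem.get(conf)) {
            // copyFromLocalFile drives the whole flow: the client asks the
            // NameNode for permission (steps 2-4), splits the file into blocks
            // (step 5), and streams 64 KB packets down the DataNode pipeline
            // (steps 8-12).
            fs.copyFromLocalFile(new Path("/local/data/1.txt"),
                                 new Path("/user/demo/1.txt"));
        }
    }
}
```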
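Steps 9–11 amount to a sliding ack-queue protocol. Hadoop’s actual implementation lives in its private DFSOutputStream internals, so the following is only an illustrative model of the bookkeeping, with every name hypothetical: packets go down the pipeline, are held in an ack queue, and are released when their acknowledgement comes back.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical model of the packet/ack bookkeeping in steps 9-11.
// This is NOT Hadoop's real DFSOutputStream code.
public class AckQueueSketch {
    static final int PACKET_SIZE = 64 * 1024; // one packet = 64 KB (step 9)

    record Packet(long seqNo, byte[] data) {}

    public static void main(String[] args) throws InterruptedException {
        // Packets sent down the pipeline but not yet acknowledged.
        BlockingQueue<Packet> ackQueue = new ArrayBlockingQueue<>(80);

        for (long seq = 0; seq < 3; seq++) {
            Packet p = new Packet(seq, new byte[PACKET_SIZE]);
            sendDownPipeline(p); // DataNode1 caches and forwards it (step 10)
            ackQueue.put(p);     // hold it until the ACK returns (step 11)
        }

        // Pretend the ACKs now arrive in order; each one releases a packet.
        while (!ackQueue.isEmpty()) {
            Packet acked = ackQueue.take();
            System.out.println("ACK received for packet " + acked.seqNo());
        }
    }

    static void sendDownPipeline(Packet p) {
        // Placeholder: a real client writes the packet to a socket toward the
        // first DataNode, which forwards it along the pipeline.
    }
}
```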
2. The internal execution flow of reading a file
2.1. Now let’s talk about how HDFS reads a file
- 1: The client sends a read request to the NameNode.
- 2: The NameNode checks whether the request has the required operation permission and whether the file exists.
- 3: The NameNode looks up the block list of the requested file.
- 4: It returns the block list to the client, with the hosts holding each replica sorted (typically by network distance to the client).
- 5: The client selects the optimal host from which to read each block according to the actual situation.
- 6: The client establishes a pipeline with each DataNode host that stores the needed blocks.
- 7: The client reads data from multiple DataNodes simultaneously (parallel reading).
- 8: The client combines the blocks into the complete file; a minimal read sketch follows below.
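As with the upload, the read side is one API call for the client. Here is a minimal sketch using the Hadoop FileSystem Java API; again, the NameNode address and paths are placeholder assumptions.

```java
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/user/demo/1.txt"));
             FileOutputStream out = new FileOutputStream("1.txt")) {
            // fs.open() fetches the block list from the NameNode (steps 1-4);
            // as the stream is consumed, the client reads each block from the
            // best available DataNode, and the blocks arrive as one continuous
            // byte stream (steps 5-8).
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```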