This article does not go into the specific details of HDFS; instead, it walks through the design thinking behind HDFS to help you better understand its overall architecture.
Design idea
Background
With the rapid development of the Internet, the amount of data that needs to be processed has ballooned. At a moderately large Internet company, the data to be computed and processed is often measured in petabytes. The network bandwidth (typically hundreds of megabits per second), memory capacity (typically tens of gigabytes), disk size (typically a few terabytes), and CPU speed that a single computer can bring to bear cannot possibly meet this demand.
This gives us our requirement:
We need a file system that can store massive amounts of data, support computation over that data, and expose a unified read/write API.
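To make that requirement concrete, here is a minimal sketch of reading and writing a file through Hadoop's unified FileSystem API (the cluster address and file path are placeholders, and error handling is omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points the client at the cluster; the address is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/hello.txt");

            // Write: the client sees a single file, no matter how many machines store it.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello hdfs");
            }

            // Read it back through the same unified API.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }
}
```

The point is that the client sees one file system and one API, however many servers sit behind it.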
Problems faced
To support the storage and computation of massive amounts of data, we need to solve several core problems:
1. The data storage capacity problem.
Big data means computing over petabytes of data, yet an ordinary server disk usually holds only 1-2 TB. How do we store such a large amount of data?
2. The data read/write speed problem.
A typical disk sustains sequential read/write speeds of only a few dozen MB/s; at that rate, reading and writing dozens of petabytes would take practically forever.
3. The data reliability problem.
Disks are among the most fragile pieces of hardware in a computer; a disk's service life is typically around a year. If a disk is damaged, what happens to the data on it?
Sparks of inspiration
Vertical scaling vs. horizontal scaling
In computing, there are two ways to get more computing power and more storage:
- Scaling up (vertical scaling): making a single computer more powerful by upgrading its CPU, memory, disks, and so on.
- Scaling out (horizontal scaling): adding more computers to the system to gain more computing power.
Obviously, vertical scaling costs more, and it hits a hard ceiling that horizontal scaling does not (or at least hits one far later). So, to solve our problem, we have to start from horizontal scaling.
Can horizontal scaling meet our needs?
So horizontal scaling is the only road open to us. But can it actually meet the demand?
Handling user requests
Large websites handle individual user requests in real time, and while the number of concurrent requests is high, the requests from different users are independent of one another. As long as the site's distributed system routes different users' requests to different servers, with a process on each server handling them, and the coupling between those servers stays low enough, the system remains scalable: adding more servers lets it handle more user requests and the user data those requests generate.
Moving computation is cheaper than moving data
Faced with petabytes of data, the program itself might be only kilobytes or megabytes. Since the data is vastly larger than the program, shipping the data to the program is not cost-effective. The alternative is to ship the program over the network to where the data lives, which is consistent with the idea of horizontal scaling.
RAID
If I have seen further than others, it is by standing on the shoulders of giants. (Newton)
Almost everyone has heard this quote, and it applies to technology as well. Every popular new technology is born not only to solve a technical problem, but also as a continuation and refinement of the technologies that came before it.
Redundant Array of Independent Disks (RAID) technology combines multiple ordinary disks into a disk array that is presented as a single device. It improves storage capacity, read/write speed, availability, and fault tolerance.
We will not discuss the specific implementation details of RAID here, only how RAID addresses the three problems described above.
After all, implementation details vary, but design thinking is shared and durable. That is also the purpose of this article: to help us understand the details of HDFS from the perspective of its design.
1. The data storage capacity problem: RAID combines N disks into a storage array. With RAID 5, data is effectively stored on N-1 of those disks, so the usable storage space is N-1 times that of a single disk.
2. The data read/write speed problem: RAID splits the data to be written across the available disks and writes the pieces concurrently, which greatly improves write speed; read speed improves in the same way.
3. The data reliability problem: RAID 10, RAID 5, and RAID 6 keep redundant copies or parity data, so when a disk fails, the lost data can be reconstructed from the data and parity on the remaining disks.
That is a lot of words, but it boils down to three key ideas:
N disks form one storage array
Concurrent writes
Redundant backups
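To make the redundancy keyword concrete, here is a toy illustration of the XOR-parity idea behind RAID 5 (a deliberate simplification: real RAID also deals with stripe layout, parity rotation, and rebuild scheduling). The parity stripe is the XOR of the data stripes, so any single lost stripe can be rebuilt from the survivors:

```java
public class XorParityDemo {
    // Parity is the byte-wise XOR of all data stripes.
    static byte[] parity(byte[][] stripes) {
        byte[] p = new byte[stripes[0].length];
        for (byte[] stripe : stripes) {
            for (int i = 0; i < p.length; i++) {
                p[i] ^= stripe[i];
            }
        }
        return p;
    }

    // Rebuild a lost stripe by XOR-ing the parity with the surviving stripes.
    static byte[] rebuild(byte[][] survivors, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] stripe : survivors) {
            for (int i = 0; i < lost.length; i++) {
                lost[i] ^= stripe[i];
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        byte[] d0 = "disk0".getBytes();
        byte[] d1 = "disk1".getBytes();
        byte[] d2 = "disk2".getBytes();
        byte[] p = parity(new byte[][]{d0, d1, d2});

        // Pretend d1 is lost: it can be recovered from d0, d2 and the parity.
        byte[] recovered = rebuild(new byte[][]{d0, d2}, p);
        System.out.println(new String(recovered)); // prints "disk1"
    }
}
```

The same trade is at the heart of what follows: spend a little extra storage on redundancy so that losing one piece of hardware never means losing data.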
Refining the idea
After the discussion above, we now have a fairly complete picture of how to solve the three problems raised at the beginning.
1. The data storage capacity problem: multiple servers are presented as one unified file system that pools their CPUs, disks, memory, and network bandwidth and exposes a unified read/write API.
2. The data read/write speed problem: we split the data into multiple blocks and write them into the system concurrently.
3. The data reliability problem: we keep redundant backups of every block, so each data block has multiple replicas.
All of that has been distilled into the following mind map.
Implementation approach
In learning any new technology, I recommend starting with the problem and working through it step by step.
Back to the first three questions:
1. The data storage capacity problem (distribution)
1. How do you know which servers store which data in a distributed cluster?
It is not hard to arrive at the most common pattern in distributed clusters: the master-slave architecture.
The client talks to the primary (master) server to find out where the data is stored, then requests the data itself from the appropriate secondary (slave) servers.
In other words, the primary server is responsible for interacting with clients, while the secondary servers are responsible for storing the actual data.
2. How does the primary server know what data is stored on the secondary servers?
When a secondary server starts, it reads the primary server's address from its local configuration file, connects to the primary server, and reports what it stores locally. The primary server keeps the metadata: the file system tree, all files and directories in that tree, the list of blocks that make up each file, and the nodes on which each block resides.
This metadata is kept both in memory and on disk, for availability and fault tolerance.
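As a rough sketch of what that metadata could look like (the class and field names below are illustrative, not HDFS's actual internals): the primary server keeps a map from file path to block IDs, plus a map from block ID to the secondary servers that reported holding a replica.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative in-memory metadata held by the primary server.
public class NamespaceMetadata {
    // File path -> ordered list of block IDs that make up the file.
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // Block ID -> addresses of the secondary servers holding a replica.
    private final Map<Long, Set<String>> blockLocations = new HashMap<>();

    // Called when a client creates a file and blocks are allocated for it.
    public void addFile(String path, List<Long> blockIds) {
        fileToBlocks.put(path, new ArrayList<>(blockIds));
    }

    // Called when a secondary server reports the blocks it stores.
    public void recordBlockReport(String serverAddress, List<Long> reportedBlocks) {
        for (long blockId : reportedBlocks) {
            blockLocations.computeIfAbsent(blockId, id -> new HashSet<>()).add(serverAddress);
        }
    }

    // A client asks: which servers hold the blocks of this file?
    public List<Set<String>> locateFile(String path) {
        List<Set<String>> locations = new ArrayList<>();
        for (long blockId : fileToBlocks.getOrDefault(path, List.of())) {
            locations.add(blockLocations.getOrDefault(blockId, Set.of()));
        }
        return locations;
    }
}
```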
3. What if a secondary server goes down? What if one of its disks is damaged? How does the primary server find out?
Here we meet another classic idea: the heartbeat mechanism. Each secondary server communicates with the primary server at a fixed interval; once the connection times out, that secondary server is considered down. Redundant backups are again indispensable. When the primary server decides a secondary server is down, it looks up in its in-memory metadata which blocks that server held and creates new replicas of them elsewhere. When a secondary server detects a damaged disk, it reports that disk's contents to the primary server, which finds out which other servers hold the same blocks and simply creates another replica. Requests for those blocks are then directed to the servers holding the backups.
This reflects a common technique for preserving system availability: failover.
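A minimal sketch of the heartbeat bookkeeping on the primary server (the timeout value and method names are invented for illustration): each heartbeat refreshes a timestamp, and any secondary server silent for longer than the timeout is declared dead, after which its blocks would be re-replicated elsewhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative heartbeat tracking on the primary server.
public class HeartbeatMonitor {
    private static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed timeout: 10 minutes

    // Secondary-server address -> timestamp of its last heartbeat.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Each heartbeat simply refreshes the timestamp.
    public void onHeartbeat(String serverAddress) {
        lastHeartbeat.put(serverAddress, System.currentTimeMillis());
    }

    // Called periodically: any server silent past the timeout is considered dead,
    // and the caller would schedule new replicas for the blocks it held (failover).
    public List<String> findDeadServers() {
        long now = System.currentTimeMillis();
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```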
4. What should I do if the primary server is down?
Run it in hot-standby mode: an active primary server and a standby primary server share the same metadata, and when the active one fails, the standby takes over immediately, keeping the system available.
5. How to quickly recover from a fault?
1. The first idea is to dump the entire file system metadata into an image file; after a restart, we simply load the image and we are done.
2. But this raises a problem: writing the image takes a long time, so is the system unavailable during that whole period?
Here’s a classic solution:
We can append every write operation to a log file and have a separate thread dedicated to merging the log into the image.
3. That leaves yet another problem: if the primary server suddenly goes down, the writes recorded in the log file have not yet been merged into the image, so won't the system still be unavailable for a long time after a restart while it replays them?
A common solution to this is checkpointing: at regular intervals, the writes accumulated in the log file are merged into the image file. To make the system more fault tolerant, the checkpointing work is placed on a separate server whose sole job is to merge the log into the image; it periodically sends the newly generated image file back to the primary server, overwriting the primary server's old image.
Of course, another common approach is to take a snapshot every so often, the way we do after installing an operating system. Note that such a snapshot is different from the image above.
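Here is a toy sketch of the edit-log-plus-checkpoint idea (the file formats and method names are invented; this is not HDFS's actual FsImage/EditLog code): every change is appended to a log before it is applied, a checkpoint folds the log into the image, and recovery loads the image and replays whatever is left in the log.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Toy illustration of edit-log + checkpoint recovery.
public class EditLogCheckpointDemo {
    private final Path imageFile;   // snapshot of the namespace at the last checkpoint
    private final Path editLog;     // every change since the last checkpoint
    private final List<String> namespace = new ArrayList<>(); // in-memory state

    EditLogCheckpointDemo(Path imageFile, Path editLog) {
        this.imageFile = imageFile;
        this.editLog = editLog;
    }

    // Every write is logged before it is applied, so nothing is lost on a crash.
    void createFile(String path) throws IOException {
        Files.writeString(editLog, "CREATE " + path + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        namespace.add(path);
    }

    // Checkpoint: fold the current state into the image, then truncate the log.
    void checkpoint() throws IOException {
        Files.write(imageFile, namespace);
        Files.write(editLog, new byte[0]); // start a fresh, empty log
    }

    // Recovery: load the last image, then replay whatever is left in the log.
    void recover() throws IOException {
        namespace.clear();
        if (Files.exists(imageFile)) {
            namespace.addAll(Files.readAllLines(imageFile));
        }
        if (Files.exists(editLog)) {
            for (String entry : Files.readAllLines(editLog)) {
                if (entry.startsWith("CREATE ")) {
                    namespace.add(entry.substring("CREATE ".length()));
                }
            }
        }
    }
}
```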
6. Faced with massive amounts of data, what do we do when a single primary server's memory is not enough?
If one server's memory is not enough, add another; if two are not enough, add more! This is, again, horizontal scaling.
7. What if the metadata disk is faulty?
The same old problem, and you can probably blurt out the answer: redundant backups! Yes, we could keep several copies of the metadata on one server, but there are better options: share the metadata storage between primary servers, or use a distributed edit log.
2. The data read/write speed problem (data blocks)
1. How is data partitioned?
Data is split into blocks of a fixed size. What about the piece that falls short of the block size? It still counts as a block; a uniform block size keeps management simple.
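A minimal sketch of fixed-size splitting (the 128 MB block size mirrors a common HDFS default, but treat the numbers as assumptions): note how the last block may be shorter than the block size yet still counts as one block.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative fixed-size block splitting.
public class BlockSplitter {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // assumed 128 MB block size

    // Returns the (offset, length) of each block for a file of the given size.
    static List<long[]> split(long fileSize) {
        List<long[]> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += BLOCK_SIZE) {
            long length = Math.min(BLOCK_SIZE, fileSize - offset);
            blocks.add(new long[]{offset, length}); // last block may be shorter
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A 300 MB file becomes 128 MB + 128 MB + 44 MB: three blocks in total.
        for (long[] b : split(300L * 1024 * 1024)) {
            System.out.println("offset=" + b[0] + " length=" + b[1]);
        }
    }
}
```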
2. What block size is appropriate?
Blocks should not be too small. First, every block carries the same amount of metadata, so too many small blocks strain the primary server's memory. Second, on a mechanical disk each block occupies a contiguous region, so small blocks mean frequent switching between blocks and the seek time dominates.
Blocks also can't be too big, or there is little left to write concurrently. Remember that the whole point of splitting into blocks is concurrent writing.
3. What if we face massive numbers of small files?
The common answer here is to merge and compress them.
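One way to picture the merge idea (this packer is purely illustrative and not HDFS's HAR or SequenceFile formats): concatenate many small files into one large container file and keep an index of each file's offset and length.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Toy illustration of packing many small files into one large container file.
public class SmallFilePacker {
    private final Path container;
    // File name -> {offset, length} inside the container.
    private final Map<String, long[]> index = new HashMap<>();

    SmallFilePacker(Path container) {
        this.container = container;
    }

    // Append a small file's bytes to the container and remember where they went.
    void pack(String name, byte[] content) throws IOException {
        long offset = Files.exists(container) ? Files.size(container) : 0;
        Files.write(container, content,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        index.put(name, new long[]{offset, content.length});
    }

    // Read a packed file back using its recorded offset and length.
    byte[] unpack(String name) throws IOException {
        long[] entry = index.get(name);
        byte[] all = Files.readAllBytes(container); // fine for a toy example
        byte[] out = new byte[(int) entry[1]];
        System.arraycopy(all, (int) entry[0], out, 0, out.length);
        return out;
    }
}
```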
3. The data reliability problem (redundant backups)
1. Where should the replicas be placed?
The nodes should not be too close together: if they share a rack or a power supply, one failure might take them all down.
But the nodes can't be too far apart either: by Shannon's theorem, the farther the data has to travel, the longer the transmission takes.
2. Which copy should be selected when reading data?
Again with bandwidth in mind, the closer the better; as noted earlier, the client downloads the data directly from the server holding it, so what matters is the distance between the client and that server.
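A hedged sketch of picking the closest replica (the distance model here, the shared prefix length of a datacenter/rack/host location string, is a simplification of real rack awareness):

```java
import java.util.Comparator;
import java.util.List;

// Illustrative "closest replica" choice based on a datacenter/rack/host location string.
public class ReplicaSelector {
    // Count how many leading path components two locations share.
    static int sharedPrefix(String[] a, String[] b) {
        int n = 0;
        while (n < a.length && n < b.length && a[n].equals(b[n])) {
            n++;
        }
        return n;
    }

    // More shared components means the replica is "closer" to the client.
    static String pickClosest(String clientLocation, List<String> replicaLocations) {
        String[] client = clientLocation.split("/");
        return replicaLocations.stream()
                .max(Comparator.comparingInt((String r) -> sharedPrefix(client, r.split("/"))))
                .orElseThrow();
    }

    public static void main(String[] args) {
        String client = "dc1/rack2/host7";
        List<String> replicas = List.of("dc1/rack2/host9", "dc1/rack5/host3", "dc2/rack1/host4");
        // The replica on the same rack wins: prints dc1/rack2/host9
        System.out.println(pickClosest(client, replicas));
    }
}
```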
3. How to ensure data integrity?
There is another classic idea here: the larger a transfer, the less reliable it is over the network, so we slice the data into smaller pieces.
We then use checksums to guarantee integrity; because a checksum is far smaller than the data it covers, the checksum itself is far less likely to be corrupted.
Reading data: after reading a block, the client computes the data's checksum and compares it with the stored checksum, so it knows whether the read succeeded.
Writing data: a data block is written in small chunks, each carrying its data plus a checksum. Each chunk is written to every replica node in turn, and the checksum is verified immediately after each chunk is written.
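A small sketch of the per-chunk checksum idea using CRC32 (the chunk size and method names are illustrative): the writer stores one checksum per chunk, and the reader recomputes and compares before accepting the data.

```java
import java.util.zip.CRC32;

// Illustrative per-chunk checksum verification.
public class ChunkChecksum {
    static final int CHUNK_SIZE = 512; // assumed small chunk size, in bytes

    static long checksumOf(byte[] data, int offset, int length) {
        CRC32 crc = new CRC32();
        crc.update(data, offset, length);
        return crc.getValue();
    }

    // Writer side: compute one checksum per chunk and store them next to the data.
    static long[] checksums(byte[] data) {
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int offset = i * CHUNK_SIZE;
            int length = Math.min(CHUNK_SIZE, data.length - offset);
            sums[i] = checksumOf(data, offset, length);
        }
        return sums;
    }

    // Reader side: recompute and compare; any mismatch means the chunk is corrupt.
    static boolean verify(byte[] data, long[] storedSums) {
        long[] recomputed = checksums(data);
        if (recomputed.length != storedSums.length) return false;
        for (int i = 0; i < recomputed.length; i++) {
            if (recomputed[i] != storedSums[i]) return false;
        }
        return true;
    }
}
```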
Back to HDFS
HDFS is in fact designed along exactly these lines. Of course, the discussion above only gives you a rough picture of HDFS's design; I will cover the details in a future article.
Each of the above questions corresponds to one or more knowledge points in HDFS. For details, see the mind map below.