Preface

This post is excerpted from my Knowledge Planet group and shared here; it also supplements the earlier HDFS articles, so you can use it as a review along the way.

The first two HDFS articles are here:

Take you into the pit of big data (1) – HDFS basic concepts

Big Data (2) – HDFS read and write process and some important policies

Content

Distributed storage, redundant storage

First, the data in HDFS can be divided into real data and metadata: metadata is stored on the NameNode, while the real data is stored on the DataNodes.

For example, if we want to store a large file, we split the file into data blocks and store them across the DataNodes. So the question is: how do we know which blocks make up which file, and which DataNodes those blocks live on? That is exactly what the metadata records.
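
If you want to see this mapping for yourself, fsck will print it. A small sketch (the path /data/movie.mp4 is just a placeholder, use any file that actually exists in your cluster):

$ hdfs fsck /data/movie.mp4 -files -blocks -locations
# lists every block of the file and the DataNodes holding each replica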

In fact, metadata is stored both in memory and on disk: keeping it in memory improves response time, while keeping it on disk keeps the metadata safe. That raises a question: how do you keep the in-memory copy and the on-disk copy consistent? Here HDFS can borrow from Redis's playbook, getting both efficiency and safety.

This is also why HDFS is not suited to storing small files, and why there are mechanisms for merging many small files: metadata is recorded per data block, not per file size, and each block's metadata takes roughly 150 bytes. If we store a 100 MB video in HDFS, it costs one ~150-byte piece of metadata; if we store 100 pictures of 1 MB each, it costs 100 of them. On the NameNode that waste adds up terribly fast.

Redundant storage means that our blocks have replicas, and those replicas are placed according to the rack storage policy described next.

Rack Storage Policy

In a real machine room there are racks, and each rack holds several servers. Generally we store three replicas of a block, as follows:

  1. The first replica is stored on a server on rack A, and the second replica is stored on a server on a different rack (say, rack B)

  2. The second replica is deliberately placed on a different rack to guard against a whole rack failing, for example losing power. If the second replica sat on another server in the same rack, the data could still be lost.

  3. The third replica is stored on another server in rack B (note that replicas 2 and 3 are both on rack B)

Why choose this layout? If we placed replica 3 on yet another rack C, traffic between replicas 2 and 3 would have to travel from replica 2 up through its rack switch to the core switch, and then back down through rack C's switch. That is a long path, and bandwidth inside the machine room is very precious: under high concurrency it is easy to use it all up, the response time of the whole cluster drops sharply, and the service runs into trouble.
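
How does the NameNode know which rack each DataNode sits on in the first place? Rack awareness comes from an admin-supplied topology script, pointed to by net.topology.script.file.name in core-site.xml. A minimal sketch, with made-up IP ranges and rack names:

#!/bin/bash
# rack-topology.sh: the NameNode passes in DataNode IPs/hostnames and expects
# one rack path per argument on stdout (IPs and rack names below are examples only)
for node in "$@"; do
  case "$node" in
    192.168.1.*) echo "/rack-A" ;;
    192.168.2.*) echo "/rack-B" ;;
    *)           echo "/default-rack" ;;
  esac
done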

Of course, the replication factor can also be raised manually with a command; when a file is read by many clients, this helps spread the load:

$ hadoop fs -setrep -R 4 /path/to/file

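To confirm the change took effect, check the file's replication factor (the path is again a placeholder; the %r format of -stat assumes a reasonably recent Hadoop release):

$ hadoop fs -stat "replication=%r" /path/to/file
$ hadoop fs -ls /path/to/file    # the second column of a file listing is also its replication factor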

All client interactions go through the NameNode; much like Kafka, clients always talk to the leader, never a follower. Be aware, though, that the actual uploading and downloading of data is done against the DataNodes.

HDFS architecture

HDFS uses a primary/secondary (master/slave) architecture with three roles:

  1. As the leader of the cluster, the NameNode manages the metadata of the HDFS file system, handles client read and write requests, and is responsible for data safety and load balancing within the cluster
  2. DataNodes store all the data blocks in the cluster and handle the actual reading and writing of data
  3. The SecondaryNamenode is not, strictly speaking, a backup of the NameNode; its main job is to share the NameNode's load by merging the metadata edit log (the edits log) into the fsimage

Heartbeat

The heartbeat mechanism handles communication inside an HDFS cluster, and it is also how the NameNode tells DataNodes what to do:

  1. After the master Namenode starts, an IPC server will be started
  2. The DataNode starts up, connects to the NameNode, and sends a heartbeat message to the NameNode every 3 seconds with status information
  3. The NameNode sends task instructions to the DataNode using the return value of this heartbeat

The role of the heartbeat mechanism:

1. The NameNode has full control over data block replication: it periodically receives heartbeat signals and block status reports from each DataNode in the cluster

2. A DataNode registers with the NameNode when it starts and then reports its blocks periodically. Every 3 seconds the DataNode sends a heartbeat to the NameNode, and the NameNode's reply carries instructions, such as copying a data block to another machine. If a DataNode sends no heartbeat for more than about 10 minutes, the NameNode marks it as unavailable, and client reads and writes are no longer routed to that DataNode
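
You can observe this from the outside: the NameNode's view of every DataNode, including when it last heard from each one, is printed by:

$ hdfs dfsadmin -report
# each DataNode entry includes a "Last contact" line, i.e. the time of its last heartbeat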

Safe mode

Safe mode during cluster startup (the 99.99% threshold) is also built on the heartbeat mechanism. When the cluster starts, every DataNode sends its block report to the NameNode, and the NameNode tallies the total number of blocks reported. While reported blocks / total blocks is below 99.99%, the cluster stays in safe mode; in safe mode the client can read data from HDFS but cannot write to it.

In addition, the formula for how long it takes the NameNode to decide a DataNode is dead is:

timeout = 2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval

The default timeout works out to 630 seconds: heartbeat.recheck.interval defaults to 5 minutes and dfs.heartbeat.interval defaults to 3 seconds. The two parameters are easy to understand: one is how often the NameNode re-checks liveness, the other is how often a DataNode sends a heartbeat. Roughly speaking, miss ten heartbeats plus two recheck intervals and the node is written off: 2 × 300 s + 10 × 3 s = 630 s.
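
A quick sketch of checking these values on your own cluster (in current Hadoop versions the recheck property is spelled dfs.namenode.heartbeat.recheck-interval and is given in milliseconds):

$ hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval   # default 300000 ms = 5 min
$ hdfs getconf -confKey dfs.heartbeat.interval                    # default 3 (seconds)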

A supplement to the safe mode

Safe mode is not only entered at cluster startup while the DataNodes are still reporting in; HDFS also enters it automatically when the proportion of lost data blocks reaches a certain threshold, and you can enter it manually as well. By default that threshold is 0.1%: losing 1 block in 1,000 is already a serious event.
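
Safe mode can be checked and toggled from the command line:

$ hdfs dfsadmin -safemode get      # is safe mode ON or OFF?
$ hdfs dfsadmin -safemode enter    # enter safe mode manually
$ hdfs dfsadmin -safemode leave    # leave safe mode manually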

You can use the start-balancer.sh command to make HDFS rebalance its load. This command has a quirk, though, much like System.gc() in Java's garbage collection: you are telling the JVM it is time to collect garbage, but the JVM will not necessarily listen; it collects when it has spare capacity. start-balancer.sh behaves the same way, doing its work with whatever bandwidth is left over.

The operation also has a trigger criterion, defined by a threshold n: each node's disk usage is computed, and (maximum usage − minimum usage) is compared against n. The default is 10%, which is reasonable. In addition, load balancing must not disturb client reads and writes, so by default HDFS does not let the balancer take much bandwidth. We can adjust that manually:

hdfs dfsadmin -setBalancerBandwidth <newBandwidth>

newBandwidth is in bytes per second, so adjust it as needed; the default is 1 MB per second.
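
Putting the pieces together, a hedged example run with a 10 MB/s cap and the default 10% threshold:

$ hdfs dfsadmin -setBalancerBandwidth 10485760        # 10 MB/s, in bytes per second
$ $HADOOP_HOME/sbin/start-balancer.sh -threshold 10   # aim for every node within 10% of the cluster average usage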

HDFS defects and evolution

Hadoop1 was created to solve two problems: how to store massive amounts of data, and how to compute over massive amounts of data

What Namenode does:

  1. Managing the cluster metadata (the file directory tree): which blocks a file is split into and where each block is stored.
  2. The Namenode loads the metadata into memory so it can respond quickly to user requests

What the Datanode does:

  1. Storing data: uploaded data is split into fixed-size file blocks, 64 MB by default in hadoop1 and 128 MB afterwards (see the quick check after this list)

  2. To keep data safe, each file block has three copies by default. Three copies means three in total, counting the original, not the original plus three more
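
A quick way to check what your cluster actually uses for these defaults:

$ hdfs getconf -confKey dfs.blocksize      # 134217728 bytes = 128 MB on recent versions
$ hdfs getconf -confKey dfs.replication    # 3 by default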

HDFS1 has architectural flaws

  1. Single point of failure: If the Namenode fails, the cluster will fail
  2. Memory limitation: The entire cluster crashes when Namenode metadata overflows the memory

The QJM solution solves the single point of failure

There is also a scheme that lets multiple Namenodes use a shared directory, but then the shared directory itself can fail, so it does not meet our requirements.

The QJM scheme works like this: we set up a separate cluster of JournalNodes and let it eliminate the single point of failure. The JournalNodes keep their data consistent with one another; whenever the active Namenode changes the metadata, the change is written to the JournalNodes, and the standby Namenode syncs it back from them so that its data matches the primary Namenode's.

There is still one problem, though: we would have to switch the standby to active by hand, so Zookeeper is introduced to handle that. In fact, some companies use Zookeeper in place of the JournalNodes altogether; since the JournalNodes' job is simply to keep data consistent across those nodes, Zookeeper's atomic broadcast protocol can do the same job.

A lock directory is created in Zookeeper, and each NameNode tries to grab the lock as it starts; the first one to grab it becomes active. Each NameNode also runs a ZKFC process that keeps monitoring its NameNode's health. If the active NameNode runs into trouble, ZKFC reports it to Zookeeper, Zookeeper hands the lock to the standby NameNode, and the standby automatically switches to the active state. This is the HDFS2 architecture.
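
Once HA is set up, you can check which NameNode currently holds the active role; a sketch assuming the two NameNodes were registered in hdfs-site.xml under the (hypothetical) IDs nn1 and nn2:

$ hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
$ hdfs haadmin -getServiceState nn2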

Recommended number of JournalNodes

Because the JournalNodes' work is light, you do not need many of them:

Fewer than 200 cluster nodes: at least 3 JournalNodes

200 to 2,000 cluster nodes: at least 5 JournalNodes

Federation addresses memory constraints

Federation is essentially horizontal scaling: since a single Namenode's memory will eventually burst, multiple Namenodes each take responsibility for a different slice of the metadata. It is just like the C, D, E and F drives on a PC: C holds the system, D holds software, E holds movies. With the federation mechanism, HDFS routes requests automatically, so the user never has to care which Namenode stores what.

To briefly summarize: an active + standby pair of Namenodes forms a group to provide high availability and prevent the single point of failure, and multiple active + standby groups together form a federation to solve the memory limit of any single group.

How does HDFS manage metadata

Remember how, when we set up the cluster, we had to "format the Namenode" before starting it? The purpose of that step is to generate the fsimage, our metadata directory information, on disk. That metadata is then loaded into memory.

If the client wants to upload a file, it works against the fsimage in memory, adding a piece of metadata, and the operation is also appended to an edit log (which, as the name implies, records edits). The fsimage on disk stays as it was, while the fsimage in memory keeps changing. At this point:

in-memory fsimage = on-disk fsimage + edit log

If the service stops and the fsimage in memory is lost, we only need to replay the records in the edit log against the on-disk fsimage to produce a new fsimage. When the cluster starts again, that fsimage is loaded into memory and we are back to the state from before the service stopped.
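
You can inspect both halves of that equation on disk. A sketch, assuming files under the NameNode's metadata directory (the transaction IDs in the file names below are placeholders, yours will differ):

$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml            # Offline Image Viewer: dump an fsimage to XML
$ hdfs oev -i edits_0000000000000000043-0000000000000000099 -o edits.xml   # Offline Edits Viewer: dump an edits segment to XML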

That merging work is exactly what the SecondaryNamenode takes off the NameNode's hands:

  1. The SecondaryNameNode pulls the edits log and fsimage from the NameNode over HTTP GET
  2. The SecondaryNameNode merges the edits log and fsimage into a new file called fsimage.ckpt
  3. Once the merge is finished, the SecondaryNameNode sends fsimage.ckpt back to the NameNode
  4. While all this happens, clients are very likely still reading from and writing to the NameNode, so new edit logs keep being produced, and the cycle simply repeats. In an HA setup this checkpoint work is handed over to the standby NameNode instead of a SecondaryNameNode.
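
How often this checkpoint happens is governed by two settings; checking them is straightforward (the defaults in the comments are the usual ones, verify on your own version):

$ hdfs getconf -confKey dfs.namenode.checkpoint.period   # seconds between checkpoints, typically 3600
$ hdfs getconf -confKey dfs.namenode.checkpoint.txns     # or checkpoint after this many transactions, typically 1000000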

Double buffering mechanism

Namenode metadata is kept in two places:

The first copy lives in memory: the directory tree mentioned earlier. Updating metadata in memory is very fast, but if the metadata lived only in memory it would not be safe.

So a copy of the metadata is also kept on disk, and that is where the trouble starts: writing to disk does not perform well. Yet as the leader of the cluster, the NameNode has Hive, HBase, Spark, Flink and the other frameworks running on Hadoop constantly pushing metadata writes at it, possibly 100 million records a day, so the NameNode has to be designed for very high concurrency. Writing each record straight to disk tops out at tens, or at best hundreds, of operations per second. So what now?

First, the metadata generated by the client (that is, by Hive, HBase, Spark and the like) goes through two steps. The first step writes it to memory, which is very fast; nothing surprising there.

The second step clearly cannot write straight to disk; as we said, disk is very slow, and if we waited for every record to land on disk we could take our hands off the mouse and keyboard and head home. If the NameNode cannot keep up, the whole cluster cannot keep up, so how do we solve this?

Double buffering means we allocate two identical in-memory buffers. One is called bufCurrent, and incoming writes go straight into it; the other is called bufReady. Once data has been written into bufCurrent (there may be more than one record by then, more on that later), the two buffers are swapped: the former bufCurrent is now responsible for flushing its data to disk, while the former bufReady carries on receiving writes from clients. In effect, the job of writing to disk is pushed into the background. We do the same sort of thing with JUC (java.util.concurrent).

On top of that, Hadoop assigns a transaction ID to every metadata modification so the operations stay ordered, again for the sake of data safety. The whole scheme needs quite a lot of memory, which is a Hadoop tuning topic to be covered later.

Finally

Recently I have been running my own Knowledge Planet group, which aims to help friends who want to learn about big data get started, and to grow together with developers already working in the field. It is free, but that does not mean you will gain nothing from it. If you are interested, feel free to join; the basics there are updated more frequently.