Preface

These two NameNode questions are true classics: they are rich in detail and come up frequently in interviews, so I am dedicating a separate article to them. Metadata management will be explained alongside the source code; for double buffering we do not even need to read the source, because a simple implementation of our own is enough to explain it, and later we will also look at modifying this part of the source to make it more efficient. Without further ado, let's get started.

Since jumping straight into the source code may be hard to digest, let's start with double buffering.

1. NameNode double buffering

1.1 Problem Scenario

Metadata in the NameNode is kept in two places:

The first copy lives in memory, in the directory tree mentioned earlier. Updating metadata in memory is extremely fast, but if memory were the only copy, the data would not be safe.

So we also persist a copy of the metadata to disk, and that is where the problem starts: writing to disk is slow. Yet the NameNode, as the leader of the whole cluster, serves Hive, HBase, Spark, Flink and every other computation running on Hadoop. All of them keep pushing metadata writes at the NameNode, and a hundred million metadata operations in a single day is entirely possible, so the NameNode must be designed for very high concurrency. But a disk write is painfully slow, topping out at perhaps a few dozen to a few hundred operations per second. So what do we do?

1.2 Double buffering mechanism for metadata

The double buffering mechanism means we create two identical memory regions. One is bufCurrent: newly generated data is written directly into it. The other is bufReady. Once bufCurrent has accumulated data (in fact usually more than one record, as we will see later), the two buffers are swapped: the former bufCurrent is now responsible for writing its data to disk, while the former bufReady continues receiving data written by clients. In effect, the slow task of writing to disk is handed off to the background. A similar idea appears in JUC.

On top of that, Hadoop assigns a transaction ID to every metadata modification to keep the operations ordered, again for data safety. Tracking these IDs costs the system a fair amount of memory, which is a Hadoop optimization topic we will cover later.

1.3 Code implementation of the double buffering mechanism

Working through this is genuinely helpful for understanding, and I hope you can follow the train of thought.

1.3.1 Simulate a metadata information class

Let’s start by designing a metadata message

```java
public class EditLog {
    // transaction ID
    public long taxid;
    public String log;

    public EditLog(long taxid, String log) {
        this.taxid = taxid;
        this.log = log;
    }

    @Override
    public String toString() {
        return "EditLog [taxid=" + taxid + ", log=" + log + "]";
    }
}
```

1.3.2 Double buffer

The code is not difficult; it breaks down into five pieces:

① Two buffers, currentBuffer and syncBuffer, are defined

② A write method is responsible for writing metadata

③ A flush method writes metadata to disk. Here I use only a print statement to simulate the disk write; when the write completes, syncBuffer is emptied

④ An exchange method swaps currentBuffer and syncBuffer

⑤ A getMaxTaxid method returns the largest transaction ID in the buffer currently being synchronized. Its purpose is explained later

Apart from that last ID lookup, I think you can all see what these five pieces are for. As for the rest, we'll find out later.

```java
public class DoubleBuffer {
    LinkedList<EditLog> currentBuffer = new LinkedList<EditLog>();
    LinkedList<EditLog> syncBuffer = new LinkedList<EditLog>();

    // write metadata into the current buffer
    public void write(EditLog editLog) {
        currentBuffer.add(editLog);
    }

    // simulate writing to disk with a print, then clear the buffer
    public void flush() {
        for (EditLog editLog : syncBuffer) {
            System.out.println(editLog);
        }
        syncBuffer.clear();
    }

    // swap the two buffers
    public void exchange() {
        LinkedList<EditLog> tmp = currentBuffer;
        currentBuffer = syncBuffer;
        syncBuffer = tmp;
    }

    // largest transaction ID in the buffer being synchronized
    public long getMaxTaxid() {
        return syncBuffer.getLast().taxid;
    }
}
```

1.3.3 Core method for writing metadata logs

I need to guarantee the order of the writes here (the client writing into bufCurrent), so we use synchronized, and taxid++ generates the sequence numbers. Then we new a metadata object, write it into memory, and finally call the method that persists the data to disk.

```java
long taxid = 0L;
DoubleBuffer doubleBuffer = new DoubleBuffer();
// each thread remembers the transaction ID of its own edit
ThreadLocal<Long> threadLocal = new ThreadLocal<Long>();

private void logEdit(String log) {
    synchronized (this) {
        taxid++;
        threadLocal.set(taxid);
        EditLog editLog = new EditLog(taxid, log);
        // write into memory (bufCurrent)
        doubleBuffer.write(editLog);
    } // the lock is released here
    // persist data to disk
    logFlush();
}
```

Some of you may wonder: with a lock added, can performance still be good? Note that everything inside the synchronized block, up to and including write(editLog), is a memory write. So this is fine and perfectly supports high concurrency.

For example, with three threads: thread 1 enters logEdit, performs its write, and the lock is released immediately; thread 2 immediately follows with its write, and then thread 3.

Since memory writes are extremely fast, all three writes to bufCurrent may already be complete before the logFlush() method (which persists data to the hard disk) even starts executing.

Friendly reminder: thread 1 will then enter logFlush, but by this point bufCurrent may already hold the data from threads 1, 2 and 3. Now let's make an assumption: threads 1, 2 and 3 wrote metadata 1, 2 and 3 respectively. This sentence is very important!! This sentence is very important!! This sentence is very important!! Important things are said three times. Keep it in mind, then read the explanation of logFlush.

1.3.4 logFlush – Method of persisting data to hard disk

```java
// is a thread currently syncing data to disk?
public boolean isSyncRunning = false;
// largest transaction ID in the buffer currently being synced to disk
long maxtaxid = 0L;
boolean isWait = false;

private void logFlush() {
    synchronized (this) {
        if (isSyncRunning) {
            long localTaxid = threadLocal.get();
            // my edit is already covered by the sync in progress
            if (localTaxid <= maxtaxid) {
                return;
            }
            // another thread is already waiting to trigger the next sync
            if (isWait) {
                return;
            }
            isWait = true;
            while (isSyncRunning) {
                try {
                    // wait up to 1s, then re-check
                    this.wait(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            isWait = false;
        }
        // swap the buffers: syncBuffer now holds (1, 2, 3)
        doubleBuffer.exchange();
        // maxtaxid = 3
        if (doubleBuffer.syncBuffer.size() > 0) {
            maxtaxid = doubleBuffer.getMaxTaxid();
        }
        isSyncRunning = true;
    } // release the lock

    // persist the data to disk (slow, done outside the lock)
    doubleBuffer.flush();

    synchronized (this) {
        // the sync has finished
        isSyncRunning = false;
        // wake up waiting threads
        this.notifyAll();
    }
}
```

It’s a very clear idea, so let’s sort it out now.

The logFlush method works like this. A boolean, isSyncRunning, tells me whether a background thread is currently syncing data to disk; at first it must be false. Once there is data in bufCurrent to persist, I swap bufCurrent with syncBuffer and set isSyncRunning to true. Note that maxtaxid records the largest transaction ID in the buffer being synced to disk (we will need it shortly). Then the old bufCurrent contents are written to disk, and when the write completes, isSyncRunning is set back to false.

Now suppose a second thread enters logFlush carrying data 2 while the disk write is still in progress. It fetches its own transaction ID, localTaxid. If localTaxid is no greater than the largest transaction ID currently being synced (2 ≤ 3), that means the data thread 2 is carrying is already part of the batch being written, so it can simply return and ignore it. (If you don't see why, read the next paragraph.)

Here is where the point I said was important three times comes in: when thread 1 came in, bufCurrent already held data 1, 2 and 3, so at this point my maxtaxid = 3, and the 2 carried by thread 2 is already being processed.

But if localTaxid is greater than the largest transaction ID currently being synced while another thread is still syncing metadata, the thread has to wait, and while it waits, clients can continue writing metadata into bufCurrent. The logic of this code is to wait 1 second, then check again whether a sync is still in progress, looping until the sync finishes. The isWait flag ensures only one thread queues up to trigger the next sync; any further threads that arrive while one is already waiting simply return.

1.3.5 Write a main method to run

```java
public static void main(String[] args) {
    FSEdit fs = new FSEdit();
    for (int i = 0; i < 1000; i++) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                for (int j = 0; j < 100; j++) {
                    fs.logEdit("log");
                }
            }
        }).start();
    }
}
```

Running 100,000 edits this way finishes in three or four seconds, and the edit logs all come out in order.

In fact, this is a rough imitation of the Hadoop source code; later we will also modify this part of the source. But it is already very close.

I can also point out a weakness here: if we operate like this and swap buffers frequently, performance will certainly suffer, so in practice we let the buffer grow to a reasonable size before swapping.
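As a minimal sketch of that fix (the threshold value and the names FLUSH_THRESHOLD and shouldFlush are my own for illustration, not from the Hadoop source), the swap can be gated on a minimum buffer size so swaps stay infrequent:

```java
import java.util.LinkedList;

// Sketch: only trigger a swap/flush once the active buffer is "full enough".
public class ThresholdBuffer {
    static final int FLUSH_THRESHOLD = 512; // hypothetical batch size

    LinkedList<String> currentBuffer = new LinkedList<String>();
    LinkedList<String> syncBuffer = new LinkedList<String>();

    public void write(String log) {
        currentBuffer.add(log);
    }

    // swap only when the batch is large enough
    public boolean shouldFlush() {
        return currentBuffer.size() >= FLUSH_THRESHOLD;
    }

    public void exchange() {
        LinkedList<String> tmp = currentBuffer;
        currentBuffer = syncBuffer;
        syncBuffer = tmp;
    }

    public static void main(String[] args) {
        ThresholdBuffer buf = new ThresholdBuffer();
        for (int i = 0; i < 600; i++) {
            buf.write("edit-" + i);
            if (buf.shouldFlush()) {
                buf.exchange();
                buf.syncBuffer.clear(); // simulate the background flush
            }
        }
        // only one swap happened (at the 512th edit), leaving 88 edits buffered
        System.out.println(buf.currentBuffer.size()); // prints 88
    }
}
```

With a threshold of 512, writing 600 edits triggers exactly one swap instead of hundreds, which is the point of batching.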

2. How does NameNode manage metadata

Let's have the NameNode create a directory, then see how the HDFS metadata changes accordingly:

```shell
hadoop fs -mkdir /user/soft
```

With that in mind, let’s open hadoop-src

2.1 Use Java code to simulate directory creation

Now create the directory with a piece of Java code

```java
public static void main(String[] args) throws IOException {
    Configuration configuration = new Configuration();
    FileSystem fileSystem = FileSystem.newInstance(configuration);
    // TODO does some important initialization
    fileSystem.mkdirs(new Path(""));
    FSDataOutputStream fsous = fileSystem.create(new Path("/user.txt"));
    // TODO completes the file upload process
    fsous.write("showMeYourDream".getBytes());
}
```

2.2 The mkdirs operation

Now that we know that the mkdirs method is implemented by the DistributedFileSystem class, let’s go there and find it

2.2.1 mkdirs in DistributedFileSystem

2.2.2 mkdirs in DFSClient

So now we can go straight to NameNode and take a look at its mkdirs method

2.2.3 mkdirs in NameNode

Go ahead and click into FSDirMkdirOp.mkdirs

2.3 Relationship between FSDirectory and FSNameSystem

The first keyword is FSDirectory, which manages the metadata directory tree (the fsimage) held in the NameNode's memory

Also note that our metadata exists in two copies: the fsimage + edit log on disk, managed by FSNamesystem, and the fsimage in memory, managed by FSDirectory

How did I know that? The Hadoop developers told me: click into FSDirectory and read its class comment, which explains how our metadata records end up persisted to disk

So how does this FSDirectory object manage a directory? Let's find out from the FSDirectory code

2.4 Structure of FSDirectory and directory tree generation process

An INode has two main subclasses: INodeDirectory, which represents a folder, and INodeFile, which represents a file

Go back to mkdirs in FSDirMkdirOp

The last existing INode on the path, the lastINode, is here actually the warehouse directory

At this point, the in-memory directory tree, the fsimage, has been updated.

To summarize the process:

DistributedFileSystem implements the mkdirs method and hands off to DFSClient, and DFSClient's mkdirs in turn calls the NameNode's mkdirs.

The fsimage is the directory tree, and our metadata exists in two copies: the fsimage + edit log on disk, managed by FSNamesystem, and the fsimage in memory, managed by FSDirectory. INodeFile represents a file and INodeDirectory represents a folder. To create a directory, we first find the last existing INode on the target path in the cluster, called the lastINode. A for loop is then used to attach the remaining directories, one by one, to the end of that lastINode.
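To make that tree structure concrete, here is a minimal, hypothetical sketch of the INode idea (the class shapes are simplified for illustration and are not Hadoop's actual definitions): an INodeDirectory holds children, and mkdirs walks down to the last existing node and appends the missing directories to it.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the INode hierarchy; not Hadoop's real classes.
abstract class INode {
    String name;
    INode(String name) { this.name = name; }
}

class INodeFile extends INode {          // represents a file
    INodeFile(String name) { super(name); }
}

class INodeDirectory extends INode {     // represents a folder
    List<INode> children = new ArrayList<INode>();
    INodeDirectory(String name) { super(name); }

    INodeDirectory getChildDir(String name) {
        for (INode c : children) {
            if (c instanceof INodeDirectory && c.name.equals(name)) {
                return (INodeDirectory) c;
            }
        }
        return null;
    }
}

public class MkdirsSketch {
    // walk to the deepest existing directory (the "lastINode"),
    // then attach the remaining path components one by one
    static void mkdirs(INodeDirectory root, String path) {
        INodeDirectory last = root;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;
            INodeDirectory child = last.getChildDir(part);
            if (child == null) {
                child = new INodeDirectory(part);
                last.children.add(child);   // update the in-memory tree
            }
            last = child;
        }
    }

    public static void main(String[] args) {
        INodeDirectory root = new INodeDirectory("");
        mkdirs(root, "/user/soft");
        System.out.println(root.getChildDir("user").getChildDir("soft").name); // prints soft
    }
}
```

The real implementation adds permissions, quotas, locking and edit-log records, but the core shape, a tree of INodes extended from the last existing node, is the same.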

Finally

This has been a long read, but on the whole none of it is hard to understand. The important points have been emphasized, and I hope they help. The next time you are asked these two questions, as long as you understand the approach, just briefly walk through the process we summarized. I hope it serves you well.