
Many Java developers often have a question: does Java development really require understanding algorithms? In this article, let's talk about that question.

In fact, if you develop a complex, challenging large-scale system, using algorithms in the system is inevitable. Likewise, if you optimize those algorithms sensibly, you can improve system performance by dozens of times!

Talk is cheap, so let's illustrate with a real case: the performance optimization of the file lease monitoring algorithm in Hadoop, in the scenario of a large-scale cluster with huge numbers of clients writing data concurrently.

Hadoop is one of the most complex Java-based distributed systems in the world, which is why we chose it as the example. Seeing how it improves system performance through algorithm optimization shows how important algorithms are for Java programmers building real systems.

To understand this article, you need some basic knowledge of Hadoop and its architecture principles. If you are not yet familiar with them, read the previous article first: "Brother, Let me tell you the Hadoop architecture principles in plain English."

To set a little background: if multiple clients want to write the same file on HDFS at the same time, do you think that can be allowed?

Obviously not, guys. HDFS files do not permit concurrent writes, for example several clients appending data to the same file simultaneously.

So HDFS has a mechanism called the file lease mechanism.

In other words, at any moment only one client can obtain the lease for a given file from the NameNode and then write data to it. While that client holds the lease, other clients that try to obtain it are refused and have to wait. This ensures that only one client is writing a file at a time.

After obtaining the file lease, the client starts a thread during the writing process that keeps asking the NameNode to renew the lease, in effect telling the NameNode: "I'm still writing this file, please keep the lease for me."

The NameNode has a dedicated background thread that monitors the renewal time of each lease. If a lease has not been renewed for too long, it expires automatically, and at that point another client is free to acquire the lease and write the file.
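To make that baseline concrete, here is a minimal sketch of the naive design in Java. Everything here (the class name, the methods, the 15-minute expiry) is hypothetical illustration, not Hadoop's actual code: a plain map records each lease's last renewal time, and the monitor thread scans every entry on each pass.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the naive design, NOT Hadoop's real code.
public class NaiveLeaseManager {

    private static final long EXPIRY_MS = 15 * 60 * 1000; // assume a 15-minute expiry

    // file path -> timestamp (millis) of the last renewal
    private final Map<String, Long> lastRenewed = new ConcurrentHashMap<>();

    // Called when a client acquires or renews the lease on a file.
    public void renew(String path) {
        lastRenewed.put(path, System.currentTimeMillis());
    }

    // Called every few seconds by the monitor thread.
    // Note that it must scan EVERY lease on EVERY pass: with tens of
    // thousands of clients, this full traversal becomes the bottleneck.
    public void checkLeases() {
        long now = System.currentTimeMillis();
        lastRenewed.entrySet().removeIf(e -> now - e.getValue() > EXPIRY_MS);
    }
}
```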


Well, here's the problem: in a large-scale Hadoop cluster with tens of thousands of concurrent clients, the list of file leases that the NameNode maintains internally can become very, very large.

Yet the lease-monitoring background thread needs to check all leases for expiry at regular intervals, for example traversing an enormous number of leases every few seconds, and that is bound to perform poorly. Such a monitoring mechanism is obviously unsuitable for a large-scale Hadoop deployment.

So how does Hadoop optimize the file lease monitoring algorithm? Let's take a step-by-step look at the implementation logic.

The trick is very simple: every time a client sends a renewal request, the lease's latest renewal time is updated, and all leases are kept in a TreeSet sorted by that renewal time, so the lease renewed longest ago always sits at the head of the set. This sorted data structure is the crux of the whole optimization.

TreeSet is a sorted collection built on TreeMap, which is in turn implemented as a red-black tree. It guarantees that there are no duplicate elements, and it lets you supply a custom ordering that is applied every time an element is inserted.

So our sorting rule here is to order the leases by their most recent renewal time.
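As a small illustration (with hypothetical names, not Hadoop's actual classes), here is what such an ordering can look like in Java. One subtlety worth noting: TreeSet treats elements that compare as equal as duplicates, so the comparator should break ties, here by file path, or two distinct leases renewed in the same millisecond would silently collapse into one.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical lease record, for illustration only.
record Lease(String path, long lastRenewal) {}

public class LeaseOrderingDemo {
    public static void main(String[] args) {
        // Sort by renewal time, oldest first; tie-break on path so that
        // TreeSet never mistakes two distinct leases for duplicates.
        TreeSet<Lease> sortedLeases = new TreeSet<>(
                Comparator.comparingLong(Lease::lastRenewal)
                          .thenComparing(Lease::path));

        sortedLeases.add(new Lease("/logs/a.log", 1_000L));
        sortedLeases.add(new Lease("/logs/b.log", 3_000L));
        sortedLeases.add(new Lease("/logs/c.log", 2_000L));

        // first() always returns the lease renewed longest ago.
        System.out.println(sortedLeases.first());
        // -> Lease[path=/logs/a.log, lastRenewal=1000]
    }
}
```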

In fact, the whole optimization boils down to maintaining this one sorted data structure. With that in place, we can look at how the lease monitoring logic is implemented in Hadoop:
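The real implementation lives in the LeaseManager class inside the HDFS NameNode source; rather than reproduce it, here is a simplified, hypothetical sketch of the same idea (all names below are illustrative, not Hadoop's). The monitor only ever examines the head of the sorted set, and a renewal re-inserts the lease with a fresh timestamp.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Simplified, hypothetical sketch of the optimized monitor;
// the real logic is in Hadoop's LeaseManager.
public class SortedLeaseMonitor {

    record Lease(String path, long lastRenewal) {}

    private static final long EXPIRY_MS = 15 * 60 * 1000; // assume a 15-minute expiry

    private final TreeSet<Lease> sortedLeases = new TreeSet<>(
            Comparator.comparingLong(Lease::lastRenewal)
                      .thenComparing(Lease::path));

    // A renewal removes the old entry and re-inserts it with a fresh
    // timestamp: two O(log n) operations instead of touching all leases.
    public void renew(Lease lease) {
        sortedLeases.remove(lease);
        sortedLeases.add(new Lease(lease.path(), System.currentTimeMillis()));
    }

    // The monitor thread calls this every few seconds.
    public void checkLeases() {
        long now = System.currentTimeMillis();
        while (!sortedLeases.isEmpty()) {
            Lease oldest = sortedLeases.first();
            if (now - oldest.lastRenewal() <= EXPIRY_MS) {
                // The oldest lease is still valid, so every lease renewed
                // more recently must be valid too: stop immediately.
                break;
            }
            sortedLeases.pollFirst(); // expired: drop it, then check the next oldest
        }
    }
}
```

In the normal case the oldest lease has not expired, so checkLeases() returns after a single comparison, which is exactly why the check stays cheap no matter how many leases the NameNode holds.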

How's that? Don't you have to admire the skill of the people behind excellent open source projects like Hadoop and Spring Cloud? Reading the source code of complex, well-crafted open source projects is a genuinely fast way to improve your architectural ability, technical skills and technical vision, and it is also something I spend a lot of my own time doing.

Instead of traversing thousands of leases on every check to see whether each one has expired, which is inefficient, you simply take the oldest lease from the TreeSet.

If even the lease with the oldest renewal time has not expired, there is no need to check any further! A lease that was renewed more recently certainly cannot have expired.

For example, suppose the lease with the oldest renewal time was last renewed 10 minutes ago, and a lease is only judged to have expired after 15 minutes without renewal.

In that case, if even the lease renewed 10 minutes ago has not expired, then the leases renewed 8 minutes ago or 5 minutes ago certainly have not expired either. That's the whole point!

This mechanism is very helpful for performance, because under normal conditions expired leases are always a small minority, so you never need to traverse all the leases to check for expiry; you only look at the ones with the oldest renewal times.

If a lease has expired, delete it and move on to the second-oldest lease, and so on.

Through this mechanism of TreeSet sorting plus checking the oldest lease first, the performance of lease monitoring on a large-scale cluster can be improved by at least 10x. This kind of thinking is well worth learning from and borrowing when we design our own systems.

As an aside, Eureka, the registry in a Spring Cloud microservices architecture, also has a renewal-checking mechanism similar to Hadoop's.

However, Eureka does not implement a similar optimization; on each round it brute-force iterates over the renewal times of all service instances.

Now, what if your microservice system is deployed at large scale? For a system with hundreds of thousands of machines, would you really want to traverse the renewal information of hundreds of thousands of service instances held in Eureka's memory every few seconds?

I believe that, having read this far, you now have your own answer to the question this article opened with: do Java engineers need to know some algorithms?

END
