Rambling on distributed storage planning and design
Some books keep me awake in the middle of the night: exciting viewpoints, novel ideas, lines of thought that had never occurred to me… Read a book like that and you can forget about sleep…
Other books make me want to give up on life after reading them; they just don’t click… Why did I ever pick this book…
I have read all kinds of books: some talk about ideas, some talk about concrete operations, and some even paste in official documentation to pad the page count… That kind of book… what was going on in the author’s head…
Plenty of books walk through concrete operating steps: how to do step one, how to do step two, done, finished…
Books on distributed storage never cover how to plan and design distributed storage… Why on earth?
Books on Docker cover containers, volumes, networking, storage… and never mention how to plan… Why on earth?
Books on KVM are all about services, creating VMs, using images… and never mention how to plan… Why on earth?
Since nobody talks about planning, I’m going to talk about planning!
Distributed storage planning and design
Scenario: assume a distributed storage system with a central node, such as HDFS. The central node is called the master, and the chunkserver is responsible for storing data (HDFS itself calls these roles NameNode and DataNode; the master/chunkserver names come from GFS). The master runs the processes that handle scheduling, load balancing, and data replication. Each chunkserver runs a process that stores data, communicates with the master, and reports the chunkserver’s status.
Now there are two roles: master and chunkserver. In a cloud environment there are many kinds of services, and they can run on physical machines, VMs, or Docker containers. So how should this service be planned and designed right from the start?
Consider the following: the master does scheduling and stores metadata. It is CPU-intensive, so as long as it gets enough CPU, it can run on a VM. The chunkserver, on the other hand, mainly stores data. It is an IO-intensive application that needs a lot of IO and is used for persistent storage, so it must run on a physical machine.
So in this distributed cluster, there are three VMs and N physical machines. Why three VMs? Because the masters need to elect a leader among themselves, so you need an odd number of machines. Five would be better, since that tolerates two VMs going down… No… I’ll take five… I need five replicas across three data centers… Two regions, one in Beijing and one in Nanjing, with two data centers built locally in Nanjing. Each of those two data centers gets two masters, and so on… it doesn’t matter which one…
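The "three or five" arithmetic above can be sketched in a few lines. This assumes the masters use a majority-based election (a strict majority must stay alive), which the text implies but does not spell out; the function name `tolerated_failures` is made up for illustration.

```python
def tolerated_failures(masters: int) -> int:
    """How many masters may fail while a strict majority is still alive.

    Assumption: leader election needs a strict majority of the masters.
    """
    if masters < 1:
        raise ValueError("need at least one master")
    majority = masters // 2 + 1
    return masters - majority


for n in (3, 4, 5):
    print(n, "masters tolerate", tolerated_failures(n), "failure(s)")
```

Under that assumption, four masters tolerate no more failures than three do, which is why an odd count is the usual choice.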
That’s drifting a bit too far, so back to a single data center with a private cloud deployed in it, and distributed storage for that private cloud…
Assuming the chunkservers use 6 physical machines, do the master VMs also run on those 6 physical machines? No!!!!!!
To ensure reliability and availability, you must not put all your eggs in one basket. So we have to find 3 other physical machines, create one VM on each of them, and run a master in each VM.
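The eggs-and-baskets rule can be written down as a tiny placement check. This is only a sketch: the host names (`pm01`…`pm09`), the VM names, and the function `place_masters` are all hypothetical, and a real scheduler would look at far more than host names.

```python
def place_masters(all_hosts, chunkserver_hosts, master_count=3):
    """Pick one distinct physical host per master VM, avoiding chunkserver hosts."""
    busy = set(chunkserver_hosts)
    candidates = [h for h in all_hosts if h not in busy]
    if len(candidates) < master_count:
        raise ValueError("not enough spare physical machines for the masters")
    # One master VM per physical machine, so losing one machine loses one master.
    return {f"master-vm-{i + 1}": host
            for i, host in enumerate(candidates[:master_count])}


hosts = [f"pm{i:02d}" for i in range(1, 10)]  # pm01..pm09, hypothetical names
chunk_hosts = hosts[:6]                        # the 6 chunkserver machines
placement = place_masters(hosts, chunk_hosts)
print(placement)
```

If there are fewer than three spare machines, the sketch refuses to place anything rather than quietly stacking masters on a chunkserver host.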
Think of a scenario where the data is safe: sharded, distributed. But!! What if the master or chunkserver process dies?
There are many tools available for this. The most common one is Supervisor, which automatically restarts a process when it fails. Oh? You are dead? Rise again, my champion!!
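The "rise again" idea boils down to a restart loop. The sketch below is not how Supervisor is implemented; it is a minimal stand-in, with a made-up function name `keep_alive`, that restarts a command whenever it exits with a non-zero code and gives up after a fixed number of restarts.

```python
import subprocess
import sys
import time


def keep_alive(cmd, max_restarts=3, delay=0.1):
    """Run cmd and restart it on failure; return how many times it was launched.

    Stops when the command exits with code 0 or after max_restarts restarts.
    """
    launches = 0
    while True:
        launches += 1
        exit_code = subprocess.call(cmd)
        if exit_code == 0 or launches > max_restarts:
            return launches
        time.sleep(delay)  # short pause before pulling the process back up


# A stand-in "daemon" that always crashes, so every allowed restart is used.
crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
print("launched", keep_alive(crashing, max_restarts=2, delay=0), "times")
```

A real supervisor also has to survive its own death, which is the whole point of the next paragraph.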
So there are two more VMs. You ask why I need two? These two VMs mainly do load balancing, or run as master/slave, mainly so that the process doing the liveness checks stays alive itself. After all, nothing is reliable, and having two is the safest bet… Replicas are the only guarantee of reliability in a distributed system!!
Is that enough? Not enough… I also need a VM for tools. Why? In distributed storage there are sometimes batch operations, such as inspecting chunkserver logs, that need special tools or an SDK. For example, counting the remaining capacity of a cluster takes a specialized tool. Checking the state of each process takes a tool as well. So the tools VM is not strictly necessary, but… it would be nice to have one, and to give it passwordless login to the other machines, because then it has the access it needs for batch operations.
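As a taste of what such a tool does, here is a sketch of the remaining-capacity count. It assumes each chunkserver has already reported its total and used bytes; how those reports are collected (SSH, SDK, an API) is left out, and the host names and numbers are invented for the example.

```python
def remaining_capacity(reports):
    """Return (total free bytes, free bytes per host).

    reports maps host -> (total_bytes, used_bytes).
    """
    free_by_host = {host: total - used
                    for host, (total, used) in reports.items()}
    return sum(free_by_host.values()), free_by_host


TB = 10 ** 12
# Hypothetical reports from three chunkservers.
reports = {
    "chunk01": (8 * TB, 5 * TB),
    "chunk02": (8 * TB, 7 * TB),
    "chunk03": (8 * TB, 2 * TB),
}
free_total, free_by_host = remaining_capacity(reports)
print("free:", free_total // TB, "TB")
```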
Is that enough? All the processes are there, all the monitoring is there, all the logs are kept, all the alarms are there. Hold on, where is my alarm monitoring…
Alarm monitoring has to run on every server. Physical machine or VM, each one needs to run the monitoring program so that alarm notifications can be managed… But, you say, I am distributed, so why should I need alarms?
In fact, wouldn’t it be better if alarms healed themselves? If an alarm recovers automatically within a short time, that is a good thing.
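One way to read "recovers within a short time" is a grace period: only notify a human when a failure outlasts it. The sketch below assumes health checks arrive as time-ordered `(timestamp, healthy)` samples; the function name `alarms_to_send` and the 20-second grace period are invented for illustration.

```python
def alarms_to_send(samples, grace_seconds):
    """Return timestamps at which an alarm should actually be sent.

    samples is a time-ordered list of (timestamp, healthy) pairs.
    An outage that recovers before grace_seconds has passed stays silent.
    """
    sent = []
    down_since = None
    notified = False
    for ts, healthy in samples:
        if healthy:
            down_since = None   # recovered: reset the outage
            notified = False
        else:
            if down_since is None:
                down_since = ts
            if not notified and ts - down_since >= grace_seconds:
                sent.append(ts)
                notified = True  # one alarm per outage
    return sent


samples = [(0, True), (10, False), (20, True),      # short blip, self-healed
           (30, False), (40, False), (50, False),   # outage that persists
           (60, True)]
print(alarms_to_send(samples, grace_seconds=20))
```

The first blip heals before the grace period runs out and stays quiet; only the second outage produces an alarm.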
Consider a scenario: my production system needs distributed storage, my test environment needs distributed storage, and my development environment needs distributed storage… Deploying this service by hand every day is exhausting… So what do we do?
Templates… Templates and automated installation, and it must be automated. At planning time you are handed a few IPs and a few physical machines, and the installation runs by itself and writes its own configuration files. A different template produces a different distributed storage cluster, so… it all depends on how you design your template.
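Here is a minimal sketch of that idea using Python's standard `string.Template`. The config keys (`role`, `listen`, `peers`, `masters`, `data_dir`), the port, and the IP addresses are all made up for illustration; a real storage system has its own config format.

```python
from string import Template

# Hypothetical config formats, one template per role.
MASTER_TEMPLATE = Template("role=master\nlisten=$ip:$port\npeers=$peers\n")
CHUNK_TEMPLATE = Template(
    "role=chunkserver\nlisten=$ip:$port\nmasters=$masters\ndata_dir=$data_dir\n"
)


def render_cluster(master_ips, chunk_ips, port=9000, data_dir="/data/chunks"):
    """Return a dict mapping each IP to the config text generated for it."""
    masters = ",".join(f"{ip}:{port}" for ip in master_ips)
    configs = {}
    for ip in master_ips:
        configs[ip] = MASTER_TEMPLATE.substitute(ip=ip, port=port, peers=masters)
    for ip in chunk_ips:
        configs[ip] = CHUNK_TEMPLATE.substitute(
            ip=ip, port=port, masters=masters, data_dir=data_dir
        )
    return configs


configs = render_cluster(
    master_ips=["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    chunk_ips=["10.0.1.1", "10.0.1.2"],
)
print(configs["10.0.0.1"])
```

Production, test, and development then differ only in the IP lists and parameters handed to the same templates.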
Everyone has 24 hours a day, no more, no less. How do you plan them? How do you design them? Your day is a template… How much can you personalize it? How much can you accumulate every day… How much data can you persist every day… In the era of big data, your daily experience is your own user-generated content, so how much value can you create????
Just thinking…
When the wind comes…
Microservices? No such thing… Look at this microservice: feels good, what a surprise… Exciting or not?… (borrowed picture)
Will using K8s improve things… or will it not…
Containerized applications, using K8s can improve… A riot of blossoms slowly dazzles the eye… Guess which app I’m the container for… Which pod do you think I am? Which service do you think I am?!!!!!!!!!
Stop guessing, you tell me!!!!