Introduction: this article is about Galileo, the multi-replica architecture behind our code hosting platform.

The evolution of the code platform

Engineers who have worked on back-end services will find terms such as single server, read/write separation, and sharding familiar. Indeed, a code service faces roughly the same problems as any other web service in the early stages of its development, so the solutions it adopts are roughly the same.

Single server

As you know, Git is distributed version control software: everyone has a complete copy of the repository locally. However, to support collaborative development and process control (reviews, test gates, and so on), a centralized remote repository is needed, and that is where the single-server service comes from.

Read/write separation

As the number of collaborators and commits grew, upgrading the single machine's configuration could no longer keep up with the load. Our statistics showed that the read/write ratio of the Git service was about 20:1. To keep commits on the main path flowing smoothly, we separated reads from writes to spread the pressure: data is copied through primary/secondary synchronization, and read capacity scales out through a one-writer, many-readers architecture.
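For Git, telling reads from writes falls out of the wire protocol itself: fetch and clone traffic arrives as git-upload-pack, while push traffic arrives as git-receive-pack. A minimal sketch of such a classifier (the OpKind type and classify helper are illustrative, not our production code):

```go
package main

import "fmt"

// OpKind labels a Git request as a read or a write.
type OpKind string

const (
	Read  OpKind = "read"
	Write OpKind = "write"
)

// classify maps a Git wire-protocol service to a read or a write.
// Fetches and clones use git-upload-pack (and git-upload-archive);
// pushes use git-receive-pack.
func classify(service string) (OpKind, error) {
	switch service {
	case "git-upload-pack", "git-upload-archive":
		return Read, nil
	case "git-receive-pack":
		return Write, nil
	default:
		return "", fmt.Errorf("unknown service %q", service)
	}
}

func main() {
	for _, s := range []string{"git-upload-pack", "git-receive-pack"} {
		kind, _ := classify(s)
		fmt.Printf("%s -> %s\n", s, kind)
	}
}
```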

Sharding

But no matter how read/write separation is arranged, every machine carries the full data set and therefore has the same specification. As the number of repositories and the number of users grow, both storage and computation eventually hit the limits of a single machine. At that point we adopted sharding: different repositories are assigned to different shards, and each shard is itself a complete read/write-separated cluster. When a request arrives, the service looks up which shard the repository belongs to, then forwards the request to a specific machine according to whether the interface reads or writes, a little like splitting a database into multiple databases and tables. With sharding plus read/write separation, we can in theory scale out horizontally without limit.
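A minimal sketch of the first half of that routing decision, mapping a repository to a shard. The article describes a lookup against a database; the hash below is used only to keep the sketch self-contained, and a real lookup table has the added benefit that a repository can be migrated between shards for rebalancing:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardCount stands in for the shard table that a real deployment
// would keep in a metadata database.
const shardCount = 4

// shardFor maps a repository path to a shard index. Every request
// for the same repository lands on the same shard.
func shardFor(repo string) int {
	h := fnv.New32a()
	h.Write([]byte(repo))
	return int(h.Sum32()) % shardCount
}

func main() {
	for _, repo := range []string{"group/app", "infra/tools"} {
		fmt.Printf("%s -> shard %d\n", repo, shardFor(repo))
	}
}
```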

Problems and Thoughts

So is sharding plus read/write separation the silver bullet for a large, highly concurrent code service? In our view, the answer is no. As the code service has grown, the core problems we have had to deal with have always been these two:

  • A centralized Git storage service is both I/O-intensive and compute-intensive (Git’s compression algorithms);
  • The number of files is huge: a single repository may contain hundreds of thousands or even millions of files, which makes guaranteeing data consistency and keeping operations reliable very challenging.

In fact, the evolution of the code platform’s architecture has been a balancing act between these two issues, keeping the whole platform stable at a given scale. Neither problem has ever had a fundamental solution, and as the scale grows, the drawbacks they cause become ever more apparent.

Code service primary/standby architecture: the problem of stateful services

For readers familiar with high-availability systems, the heading already hints at the issue: statefulness is what the primary/standby, read/write-separated design inevitably introduces.

Requests flow through the system in two stages:

  • A unified proxy layer forwards the various kinds of client requests to the corresponding systems: SSH and HTTP from the Git command line, page visits, and API requests.
  • The interfacing modules then convert these requests, whatever their protocol, into internal RPC calls, and the unified RPC Proxy module, consulting the Shard Config, forwards each request to the right service according to its shard and whether it reads or writes.

How read and write operations are handled

For a write operation (a push, or adding, deleting, or modifying files and branches from the web page), the request lands on the RW server of the repository’s shard. Once the RW server completes the write, it synchronizes the change via the Git protocol to the other machines in the same shard, and the write is done. For a read operation, the proxy simply picks a random RO machine in the repository’s shard and forwards the request there.
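A minimal sketch of that forwarding rule, with the Git-protocol synchronization stubbed out (the Shard type and the node names are illustrative):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Shard is one read/write-separated group: one writer, several readers.
type Shard struct {
	RW  string
	ROs []string
}

// syncToReplicas stands in for the Git-protocol synchronization the
// RW server performs after a write completes.
func syncToReplicas(s Shard, repo string) {
	for _, ro := range s.ROs {
		fmt.Printf("sync %s: %s -> %s\n", repo, s.RW, ro)
	}
}

// handleWrite forwards a write to the shard's RW node, then fans the
// result out to the read replicas.
func handleWrite(s Shard, repo string) {
	fmt.Printf("write %s on %s\n", repo, s.RW)
	syncToReplicas(s, repo)
}

// handleRead forwards a read to a randomly chosen RO node.
func handleRead(s Shard, repo string) {
	fmt.Printf("read %s on %s\n", repo, s.ROs[rand.Intn(len(s.ROs))])
}

func main() {
	s := Shard{RW: "rw-1", ROs: []string{"ro-1a", "ro-1b"}}
	handleWrite(s, "group/app")
	handleRead(s, "group/app")
}
```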

Problem analysis

When the primary node of a shard fails (the service crashes or the server goes down), the state of the machines in the shard changes: the original RW node becomes unavailable, and the backup machine is promoted in its place to accept write requests.
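Failover, in other words, is a role change inside the shard, and until it finishes the shard has no writable node. A minimal sketch of the transition (field names illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// Shard tracks which node currently accepts writes for the shard.
type Shard struct {
	RW     string
	Backup string
}

// failover promotes the backup after the RW node is detected as down.
// The window between detection and promotion is the availability gap
// described below.
func (s *Shard) failover() error {
	if s.Backup == "" {
		return errors.New("no backup to promote")
	}
	s.RW, s.Backup = s.Backup, "" // the old RW is out of service
	return nil
}

func main() {
	s := Shard{RW: "rw-1", Backup: "backup-1"}
	fmt.Println("writes go to", s.RW)
	if err := s.failover(); err == nil {
		fmt.Println("after failover, writes go to", s.RW)
	}
}
```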

From the system’s perspective, the primary/standby architecture has four problems:

1. Availability:

  • Because reads and writes are separated, the service cannot accept writes while the write server is failing over.
  • Within a single shard, writes are a single point: if that one service fluctuates, the whole shard fluctuates.

2. Performance:

  • Synchronization between the primary and standby machines adds extra latency. For a Git repository, made up of many loose and compressed objects, this takes longer than a plain file copy.

3. Security:

  • A burst of operations from the user in a short window can overlap with node synchronization, and the ordering of transactions during synchronization cannot be guaranteed.

4. Cost:

  • The primary and standby machines must have exactly the same specification as the write node, yet because they receive different traffic, their resource consumption is badly uneven;
  • Because synchronization involves many small files and is latency-sensitive, cross-machine-room synchronization is asynchronous, with machine specifications replicated one-to-one.

The problems caused by these four defects are tolerable at a certain scale of usage and level of required stability. But as commercialization deepened and the user base grew, solving them became urgent. Below I will share how our team has thought about and addressed these architectural issues over the past year.

Code service multi-replica architecture: eliminating stateful storage services

The previous section made it clear that the four architectural problems are all rooted in stateful services. So the new architecture had a single design goal: eliminate stateful services. With the goal set, how do we reach it? We first studied several popular distributed systems in the industry, such as etcd and the Paxos protocol. We also read the handful of articles published by GitHub, the big brother of code services; GitHub treats its distributed architecture as a core competitive advantage, so little has been made public, but those articles still gave us inspiration. Above all, any architecture upgrade must be able to “change the engine while the plane is flying”: a soft landing is the minimum requirement. So nothing above the gRPC proxy layer changes; the new architecture is implemented below it.

Unlike the previous design, the new underlying layer works as follows:

  • First, we designed a gRPC D-proxy module that can replicate a gRPC request on write, the first step toward multi-write.
  • Second, a Proxy Config module stores the metadata about repositories, replicas, machines, and so on;
  • Third, a distributed lock (D-lock) provides repository-level lock control;
  • Finally, we wanted a fast algorithm to compute a repository checksum and quickly determine whether the replicas of a repository are consistent.

By combining these modules to support concurrent writes to, and random reads from, multiple replicas of a repository, we can make the underlying storage nodes stateless, which in turn lets the replicas of a repository be spread across machine rooms. Moreover, because replicas are decoupled from machines, each server can be configured independently, laying the groundwork for heterogeneous storage later on.
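The article does not publish the D-proxy internals, so the following is only a sketch of the idea: fan each write out to every replica concurrently and succeed only when a majority acknowledges, which is what later allows a lagging replica to be sidelined without blocking writes. All names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// writeReplica stands in for forwarding the gRPC write to one
// replica; here it always succeeds.
func writeReplica(replica, repo string) error {
	fmt.Printf("write %s -> %s\n", repo, replica)
	return nil
}

// replicatedWrite fans one write out to every replica concurrently
// and succeeds only if a majority of replicas acknowledge it.
func replicatedWrite(replicas []string, repo string) error {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		oks int
	)
	for _, r := range replicas {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			if writeReplica(r, repo) == nil {
				mu.Lock()
				oks++
				mu.Unlock()
			}
		}(r)
	}
	wg.Wait()
	if oks <= len(replicas)/2 {
		return fmt.Errorf("only %d/%d replicas acknowledged", oks, len(replicas))
	}
	return nil
}

func main() {
	replicas := []string{"node-a", "node-b", "node-c"} // minimum of three
	if err := replicatedWrite(replicas, "group/app"); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("write committed")
}
```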

Implementing the multi-replica architecture for the code service

With the basic design in place, we validated the architecture with an MVP implementation, and after more than a year of development and stress testing we brought the multi-replica architecture online and gradually began serving traffic. Along the way we gave the system a meaningful name, Galileo: in our multi-replica architecture the minimum number of replicas is three, and when Galileo first pointed his self-made telescope at Jupiter, he saw exactly three of its moons. We hope to live up to that spirit of constant exploration, hence the name. In the concrete design, we rewrote the gRPC Proxy module so that a single write from the user is replicated to all copies, and we modified Git to support submitting a write in stages. Based on the characteristics of Git data, we wrote a checksum module that can quickly perform full and incremental consistency computations for a Git repository.
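The checksum algorithm itself is not described in the article. One natural scheme, given how Git works, is to digest the references (each ref names a commit hash, and the refs determine the externally visible state of the repository) and fold the digests together with XOR, which makes incremental updates trivial. A sketch under that assumption:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// refDigest hashes one reference (its name plus the commit it points at).
func refDigest(name, target string) [sha256.Size]byte {
	return sha256.Sum256([]byte(name + " " + target))
}

// xorInto folds one ref digest into the repository checksum. XOR is
// self-inverse and order-independent, so the same function both adds
// and removes a ref's contribution.
func xorInto(sum *[sha256.Size]byte, d [sha256.Size]byte) {
	for i := range sum {
		sum[i] ^= d[i]
	}
}

// fullChecksum computes the checksum over all refs from scratch.
func fullChecksum(refs map[string]string) (sum [sha256.Size]byte) {
	for name, target := range refs {
		xorInto(&sum, refDigest(name, target))
	}
	return
}

func main() {
	refs := map[string]string{
		"refs/heads/master": "commit2",
		"refs/heads/dev":    "commit5",
	}
	sum := fullChecksum(refs)

	// Incremental update when master moves from commit2 to commit3:
	xorInto(&sum, refDigest("refs/heads/master", "commit2")) // remove old
	xorInto(&sum, refDigest("refs/heads/master", "commit3")) // add new
	refs["refs/heads/master"] = "commit3"

	fmt.Printf("matches recompute: %v\n", sum == fullChecksum(refs))
}
```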

At the system architecture level, Galileo solves the problems created by the earlier primary/standby architecture:

1. Improved availability: multi-write and random read rid the underlying Git storage service of its single write point and its failover gap;

2. Improved write performance: writing all replicas concurrently removes the time cost of primary/standby replication between copies, and performance is comparable to writing a single disk.

3. Improved security: staged submission plus lock control lets us guarantee the transactionality of a user’s write on top of the write safety of the distributed system;

4. Significant cost reduction:

  • Every replica serves both reads and writes, so load is evened out across machines
  • Replicas are decoupled from machines, lifting the one-to-one specification constraint, so machine types and storage media can be chosen according to how hot a repository is

Having said all that, what actually happens inside Galileo when a user pushes? Let’s walk through it step by step:

  1. User 3 and user 4 are two colleagues in a hurry to get off work. Their local master branches point at commit 3 and commit 4 respectively.
  2. On the server side there are three replicas of the repository. Two of them have master pointing at commit 2, while the third, perhaps because of a network problem, still has master pointing at commit 1.
  3. User 3 and user 4 push one after the other; user 3’s request reaches the service first, so user 3’s write process starts first.
  4. At the start of a write, the replicas of the repository are checked for consistency. The system easily spots that the lagging replica 1 is inconsistent with the other two and marks it Unhealthy. An Unhealthy replica takes no part in any of Galileo’s reads or writes;
  5. When user 4’s request is accepted, it triggers the consistency check as well, but since replica 1 is already marked Unhealthy, that replica is invisible to user 4’s process.
  6. Once user 3’s process confirms that a majority of replicas are consistent, it decides the data can be written and transfers the non-reference data to the replicas. User 4’s process does the same; because non-reference data changes no branch information, no lock is needed and the two transfers can run at the same time.
  7. When user 3’s process finishes transferring the non-reference data, it starts updating the references. The first step is to grab the repository’s lock. Luckily the lock is free: user 3’s process acquires it and begins rewriting the references on the replicas.
  8. When user 4’s process finishes its non-reference transfer, it too moves on to updating references. It must likewise grab the lock first, but the lock is taken, so user 4’s process waits (if the wait times out, user 4 is simply told that the push failed).
  9. Once user 3’s process has rewritten the references, it releases the repository lock. The master branch of the repository on the server now points at commit 3, and user 3 can happily head home;
  10. As soon as the lock is released, user 4’s process grabs it and tries to update the reference, only to find that the target reference master has changed (it was commit 2; user 3 has moved it to commit 3). User 4’s process fails and releases the lock, and user 4 sees the familiar local message: “the remote branch has been updated, please pull the latest commits and push again.”
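Steps 7 through 10 amount to a compare-and-swap on the branch reference, protected by the repository lock. A minimal sketch, with an in-process mutex standing in for D-lock and a map standing in for the replicas’ refs:

```go
package main

import (
	"fmt"
	"sync"
)

// Repo holds the server-side refs, guarded by a lock that stands in
// for the distributed repository lock (D-lock).
type Repo struct {
	mu   sync.Mutex
	refs map[string]string
}

// updateRef rewrites a ref only if it still points where the client
// expects (compare-and-swap); otherwise the push is rejected.
func (r *Repo) updateRef(name, expectOld, newTarget string) error {
	r.mu.Lock()         // steps 7/8: grab the repository lock
	defer r.mu.Unlock() // steps 9/10: release it on the way out

	if cur := r.refs[name]; cur != expectOld {
		return fmt.Errorf("remote %s has been updated (now %s), pull and push again", name, cur)
	}
	r.refs[name] = newTarget // rewrite the ref on the replicas
	return nil
}

func main() {
	repo := &Repo{refs: map[string]string{"refs/heads/master": "commit2"}}

	// User 3 wins the lock first and moves master from commit2 to commit3.
	fmt.Println(repo.updateRef("refs/heads/master", "commit2", "commit3")) // <nil>

	// User 4 still expects commit2, so the compare-and-swap fails.
	fmt.Println(repo.updateRef("refs/heads/master", "commit2", "commit4")) // error
}
```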

That is what happens inside Galileo when several users push at the same time. Sharp-eyed readers may ask: what becomes of the replica the system judged Unhealthy? That is the transaction compensation performed by the operations system mentioned in the implementation above: upon receiving an Unhealthy report, or during a periodic scan, the operations system triggers a repair of the Unhealthy replica. Once the repaired replica is confirmed consistent, it is set back to Healthy and resumes serving. Naturally, confirming that a repair has produced a consistent replica is itself done under the lock.
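A minimal sketch of that repair loop, reusing the checksum idea from earlier (the resync step is a stub, and whether it copies from one healthy replica or reconciles several is an implementation detail the article does not specify):

```go
package main

import "fmt"

// Replica tracks one copy of a repository and its health flag.
type Replica struct {
	Name     string
	Healthy  bool
	Checksum string
}

// resync stands in for copying repository contents from a healthy
// replica; here it just adopts the source's checksum.
func resync(dst *Replica, src Replica) {
	dst.Checksum = src.Checksum
}

// repair brings an Unhealthy replica back: resync, then, under the
// repository lock (elided here), confirm the checksums match before
// flipping the replica back to Healthy.
func repair(dst *Replica, src Replica) {
	resync(dst, src)
	if dst.Checksum == src.Checksum { // consistency confirmed
		dst.Healthy = true
	}
}

func main() {
	good := Replica{Name: "node-a", Healthy: true, Checksum: "abc123"}
	bad := Replica{Name: "node-c", Healthy: false, Checksum: "stale"}
	repair(&bad, good)
	fmt.Printf("%s healthy=%v\n", bad.Name, bad.Healthy)
}
```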

The operations management platform for code hosting

Beyond the replica repair described in the previous section, keeping the system running requires a great deal of operational work, all of which used to be done by hand by the operations team. Because a Git service involves data and synchronization, it is more complex and riskier to operate than an ordinary business system. Over the past year, drawing on our earlier operational experience and on the needs of the new architecture, and aiming for tooling, automation, and visualization, we built a dedicated operations management platform for the Galileo system.

Through this platform, we have greatly improved the quality of operations and freed up much of the operations team’s energy.


This article is original content from Alibaba Cloud and may not be reproduced without permission.