Original article: ruanyifeng.com/blog/2016/07/google-monolithic-source-repository.html. If this repost infringes any rights, it can be removed on request.

00 Preface

Communications of the ACM published a paper titled “Why Google Stores Billions of Lines of Code in a Single Repository.” Its authors are engineers on Google’s infrastructure team, and the paper explains in detail why all of Google’s code lives in one repository.

01 Overview

Google first used CVS for code management and switched to Perforce in 1999. The setup was a single Perforce host backed by various caching machines.

From that time on, the entire company’s code was kept in one repository, a practice that has continued ever since. As the repository outgrew Perforce, Google began using Piper, its own version control system.

Piper is built on top of Google’s own distributed database infrastructure, originally Bigtable and now Spanner. It is distributed across 10 data centers around the world, giving Googlers everywhere fast access to the data.

Currently, the repository contains 1 billion files and 35 million commits, takes up 86 terabytes, and serves tens of thousands of users. It handles 500,000 requests per second on a typical workday and 800,000 at peak, mostly from automated build and test systems.

More than 90% of Google’s code is in Piper. Projects that are open source and require external collaboration, mainly Android and Chrome, keep their code in Git. Because Git copies the full history to each user’s local machine, it is not well suited to very large projects, which must be split into smaller repositories. Android, for example, consists of more than 800 separate repositories.

02 Piper’s design

2.1 Structure

The whole repository is organized as a tree. Each team has its own directory, and the directory path serves as the code’s namespace. Each directory has an owner, who is responsible for approving changes to files in that directory.
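
As a rough illustration of directory ownership, here is a minimal Python sketch. The `OWNERS` table, the user names, and the rule that a file inherits owners from its nearest enclosing directory are all assumptions made for the example, not details taken from the paper.

```python
from pathlib import PurePosixPath

# Hypothetical ownership table: directory path -> owners of that directory.
OWNERS = {
    "search": {"alice"},
    "search/indexing": {"bob"},
    "maps": {"carol"},
}


def find_owners(file_path: str) -> set[str]:
    """Return the owners of the nearest enclosing directory that lists any."""
    # .parents yields the enclosing directories from innermost to outermost.
    for parent in PurePosixPath(file_path).parents:
        owners = OWNERS.get(str(parent))
        if owners:
            return owners
    return set()


def can_approve(user: str, file_path: str) -> bool:
    """A user may approve a change only if they own the file's directory."""
    return user in find_owners(file_path)
```

With this table, a change to `search/indexing/crawler.py` would need approval from `bob`, while a file under `search/ranking/` falls back to the owner of `search`.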

2.2 Permission Control

Piper supports file-level permission control. 99% of the code is visible to all users; access is restricted only for a few important configuration files and confidential, business-critical code.

If confidential information is accidentally committed to Piper, the files can be quickly purged. In addition, all reads and writes are logged, so administrators can find out who has read a given file.
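
The combination of file-level permissions, read logging, and purging can be sketched in a few lines of Python. The class `AuditedStore` and all of its method names are hypothetical; this only illustrates the idea and says nothing about how Piper implements it.

```python
from dataclasses import dataclass, field


@dataclass
class AuditedStore:
    """Toy file store with per-file read restrictions and an access log."""

    files: dict[str, str] = field(default_factory=dict)
    # path -> users allowed to read it; unlisted paths are visible to everyone.
    restricted: dict[str, set[str]] = field(default_factory=dict)
    # (user, action, path) entries, appended on every read attempt.
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def read(self, user: str, path: str) -> str:
        allowed = self.restricted.get(path)
        if allowed is not None and user not in allowed:
            self.audit_log.append((user, "denied", path))
            raise PermissionError(f"{user} may not read {path}")
        self.audit_log.append((user, "read", path))
        return self.files[path]

    def readers_of(self, path: str) -> set[str]:
        """Everyone who successfully read the file, according to the log."""
        return {u for u, action, p in self.audit_log
                if action == "read" and p == path}

    def purge(self, path: str) -> None:
        """Remove the file's content; the audit log is kept."""
        self.files.pop(path, None)
        self.restricted.pop(path, None)
```

Because the log survives a purge, an administrator can still list who read a leaked file after it has been removed.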

2.3 Workflow

Piper’s workflow is as follows.

The developer starts by creating a local copy of the files, called a workspace. After development is complete, a snapshot of the workspace is shared with other developers for code review. Only after passing review can the code be merged into the central repository.

2.4 Client

Most developers access Piper through a client called CitC. The developer uses CitC to browse and synchronize files on Piper, but edits are made in the developer’s own workspace, where only the changed files are stored (typically no more than 10 files per workspace). CitC is backed by cloud storage, and each workspace is a directory in the cloud. After code review, the changed files are merged from CitC into Piper.
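
The idea that a workspace stores only the changed files can be sketched as an overlay on top of a read-only snapshot. The `Workspace` class below is a hypothetical illustration in Python, with a plain dictionary standing in for the repository; it is not CitC’s actual design.

```python
class Workspace:
    """Overlay that stores only files changed relative to a base snapshot."""

    def __init__(self, base: dict[str, str]):
        self._base = base  # repository snapshot, never modified here
        self._changes: dict[str, str | None] = {}  # None marks a deletion

    def read(self, path: str) -> str:
        # Changed files shadow the snapshot; everything else falls through.
        if path in self._changes:
            content = self._changes[path]
            if content is None:
                raise FileNotFoundError(path)
            return content
        if path not in self._base:
            raise FileNotFoundError(path)
        return self._base[path]

    def write(self, path: str, content: str) -> None:
        self._changes[path] = content

    def delete(self, path: str) -> None:
        self._changes[path] = None

    def changed_files(self) -> list[str]:
        return sorted(self._changes)

    def merged(self) -> dict[str, str]:
        """Return what the repository would look like after the merge."""
        result = dict(self._base)
        for path, content in self._changes.items():
            if content is None:
                result.pop(path, None)
            else:
                result[path] = content
        return result
```

However large the snapshot is, the workspace itself only holds the handful of entries in `_changes`, which is why it stays cheap to create and to share for review.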

2.5 Trunk Development

Google adopts “trunk-based development”. Code is generally submitted to the head of the trunk. This ensures that all users see the latest version of the same code.

“Trunk development” avoids the hassle of merging branches. Google doesn’t usually create branches; they are used only for releases. Most of the time, a release branch is a snapshot of the trunk at one point in time. Subsequent bug fixes and enhancements are committed to the trunk and, if necessary, cherry-picked into the release branch. Long-lived parallel development branches are rare at Google.

Instead of branching, Google typically introduces new features behind switches (flags) in the code. This avoids creating another branch and makes it easy to toggle functionality through configuration: if the new feature fails, it is easy to switch back to the old one. Once the new feature is stable, the old code is removed completely. Google also has a routing mechanism similar to A/B testing for evaluating how code performs, which is easy to implement because of these configuration switches.
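
A configuration switch of this kind might look like the following Python sketch. The flag name `new_ranking`, the `rollout_percent` setting, and the hash-based bucketing are all assumptions chosen for illustration; the paper does not describe Google’s flag system at this level of detail.

```python
import hashlib

# Hypothetical flag configuration; in practice this would live in a config
# file or service rather than in the code itself.
FLAGS = {
    "new_ranking": {"enabled": True, "rollout_percent": 20},
}


def is_enabled(flag: str, user_id: str) -> bool:
    """Decide whether a user gets the new code path for a given flag."""
    config = FLAGS.get(flag)
    if not config or not config["enabled"]:
        return False
    # Hash flag and user together so each user lands in a stable bucket 0-99.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < config["rollout_percent"]


def rank(results: list[str], user_id: str) -> list[str]:
    if is_enabled("new_ranking", user_id):
        return sorted(results, key=len)  # new behavior, behind the flag
    return sorted(results)               # old behavior, kept until removal
```

Setting `enabled` to `False` sends every user back to the old path without touching the code, and both paths live on the trunk the whole time.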

2.6 Code Review

All code must be reviewed before being merged into the repository. Most reviews are open to everyone, and any Google employee can comment on the code or submit changes.

Code reviews follow Google’s code style guides. Google has a tool called Critique, which can show the history of each line of code.

2.7 Automated Testing

After the review is complete, the tests run automatically. Once the tests pass, the code is merged into the Piper repository without human intervention.

03 Advantages of a single code repository

(1) Unified version

Code across the company has a single version and a single path, so there is never any doubt about which copy of a file is the latest.

(2) Extensive code sharing and reuse

Anyone can browse and use company-wide code, which greatly facilitates code sharing and reuse.

(3) Simplified dependency management

If you’re the author of a library or API, it’s easy to find all the downstream code that depends on you, because everyone’s code is in one repository.

Every time the code changes, all the code that depends on it is automatically rebuilt. If a large number of build failures occur, the commit is automatically rolled back. This also ensures that all code relies on the latest version and avoids conflicts caused by relying on different versions.

In addition, because the boundaries of the code are clear, circular dependencies do not occur. It is also easy for API authors to find out how others are using their API.
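
Finding everything downstream of a change amounts to walking the dependency graph in reverse. The Python sketch below assumes a small hypothetical `DEPS` table and an arbitrary rollback threshold; the target names and the 50% ratio are illustrative only.

```python
from collections import deque

# Hypothetical build graph: target -> the targets it depends on directly.
DEPS = {
    "base/strings": [],
    "net/http": ["base/strings"],
    "search/frontend": ["net/http"],
    "maps/tiles": ["net/http", "base/strings"],
    "ads/report": [],
}


def reverse_graph(deps: dict[str, list[str]]) -> dict[str, set[str]]:
    """Invert the graph: target -> the targets that depend on it directly."""
    reverse: dict[str, set[str]] = {target: set() for target in deps}
    for target, direct in deps.items():
        for dep in direct:
            reverse.setdefault(dep, set()).add(target)
    return reverse


def affected_targets(changed: str, deps: dict[str, list[str]]) -> set[str]:
    """All targets that depend on `changed`, directly or transitively."""
    reverse = reverse_graph(deps)
    seen: set[str] = set()
    queue = deque([changed])
    while queue:  # breadth-first walk over reverse edges
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def should_roll_back(failed: set[str], affected: set[str],
                     max_failure_ratio: float = 0.5) -> bool:
    """Assumed policy: roll back when too many dependents fail to build."""
    if not affected:
        return False
    return len(failed & affected) / len(affected) > max_failure_ratio
```

A change to `base/strings` would trigger rebuilds of `net/http`, `search/frontend`, and `maps/tiles`, but not `ads/report`. This only works because the whole graph is visible in one place.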

(4) Atomic changes

Because every code change lands in a single repository, it is atomic. It is therefore easy to undo a change or to test its impact in advance.

To prevent bad commits, Google introduced presubmit checks, which analyze before the commit whether code that depends on the change will fail to build.
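
A presubmit gate combined with an all-or-nothing commit can be sketched as follows. The function `submit`, the dictionary representation of the repository, and the example check are hypothetical; real presubmit checks would build and test the affected code.

```python
from typing import Callable

Files = dict[str, str]
Check = Callable[[Files], bool]


def submit(repo: Files, change: Files,
           checks: dict[str, Check]) -> tuple[Files, list[str]]:
    """Apply a multi-file change only if every presubmit check passes.

    Returns the resulting repository state and the names of failed checks.
    The input `repo` is never modified, so a rejected change leaves no trace.
    """
    candidate = dict(repo)     # stage the whole change on a copy
    candidate.update(change)
    failures = [name for name, check in checks.items()
                if not check(candidate)]
    if failures:
        return repo, failures  # reject: nothing from the change is applied
    return candidate, []       # accept: all files land together


# Example check (an assumption for illustration): no file may be empty.
CHECKS: dict[str, Check] = {
    "no_empty_files": lambda files: all(files.values()),
}
```

Either every file in the change is applied or none is, which is what makes undoing a change or testing it in advance straightforward.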

(5) Large-scale refactoring

A single code repository provides great convenience for finding and analyzing code.

Tricorder, Google’s static analysis engine, runs periodically to analyze the code. For example, after the C++11 standard was released, it was easy to find all the variable declarations that could be rewritten for better performance. The engine also provides “one-click fixes” for many errors and produces a wealth of statistics.

In addition, the compiler team analyzes all the code in different languages to find problematic code and deprecated APIs.
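
A repository-wide cleanup of a deprecated API can be illustrated with a very small codemod. The Python sketch below uses a plain regular expression over an in-memory dictionary of files; the function `rename_call` is hypothetical, and real tools such as Tricorder work on parsed code rather than raw text.

```python
import re


def rename_call(files: dict[str, str], old: str,
                new: str) -> tuple[dict[str, str], list[str]]:
    """Rename calls to `old(...)` as `new(...)` in every file.

    Returns the updated files and the sorted list of paths that changed.
    """
    # \b keeps longer names such as my_old_api from matching.
    pattern = re.compile(rf"\b{re.escape(old)}\s*\(")
    updated: dict[str, str] = {}
    touched: list[str] = []
    for path, source in files.items():
        new_source, count = pattern.subn(lambda match: f"{new}(", source)
        updated[path] = new_source
        if count:
            touched.append(path)
    return updated, sorted(touched)
```

Because all callers live in one repository, a single pass like this sees every use of the old name, and the resulting edit can go in as one change.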

04 Disadvantages of a single code repository

The main disadvantage of a single code repository is that the company has to build all the tooling itself, because no software on the market can manage a codebase of this size.

05 Summary

A single code repository suits large software companies that advocate transparency and openness. It does not suit small companies or companies with a lot of private code.