Apart from Linux, Git is one of the most popular open source projects in the world

“My poem is finished. Neither the wrath of the great god nor the collapse of the earth can erase it!” (Ovid, Metamorphoses) Earlier I wrote a few articles about quirky tricks, such as using GitHub as a network drive, purely because I enjoy tinkering, and then an article on why big companies like to ask about source code in interviews.

That should have been harmless, but today a remarkable reader suddenly contacted me. Have a look at the conversation.

How this reader managed to connect those two things is beyond me. He is surely destined for great things.

But back to the point: what is your answer to the question I asked last time? If your answer is the same as my original one, Git = SVN + distributed, then I suggest you read on.

If you think Git is just a place to store code, don't blame me when your interviews go badly. I have also compiled some of the most common Git commands for interviews and everyday use.

Back to basics, let’s take a closer look at Git, the code commit that changed the world

As the largest and most successful open source project, Linux has attracted the contributions of programmers from all over the world. So far, more than 20,000 developers have submitted code to the Linux Kernel.

Surprisingly, for the first ten years of the project (1991-2002), Linus, as the project maintainer, did not use any SCM tool and instead merged submitted code manually from patches. It is not that Linus likes doing things by hand. He is simply picky about software configuration management (SCM) tools, and neither the commercial ClearCase nor the open source CVS, SVN, and the like satisfied him.

In his opinion, a version control system suitable for Linux kernel development had to meet several requirements: 1) be fast; 2) support heavily branched development (thousands of branches in parallel); 3) be distributed; 4) handle large projects. It was not until 2002 that Linus finally found a tool that basically met these requirements: BitKeeper, a commercial tool whose vendor was willing to let the Linux community use it for free as long as the community complied with conditions such as not decompiling it. When it became clear that BitKeeper's default interface did not meet all of the community's needs, a community developer reverse-engineered BitKeeper and used its undisclosed interface, which led the company behind BitKeeper to withdraw the free license. As a last resort, Linus spent ten days of vacation implementing a distributed version control system (DVCS), Git, and released it to the community's developers.

Git's design has become a standard for software developers all over the world, and there is no need to say more about what Git is or how to use it. Today I want to talk about its internal implementation. But before you read on, let me ask you a question: if you were designing Git (or redesigning it), how would you do it? Which features would you implement in the first release? After reading this article, compare it with your own ideas. Comments are welcome.

The best way to learn about Git's internal implementation is to look at Linus's initial commit. Check out the first commit of the Git project (see the blog post “reading open source tips”) and you will see that the code base contains only a few files: a README, a Makefile build script, and a few C source files. The commit message is: Initial revision of “git”, the information manager from hell.

commit e83c5163316f89bfbde7d9ab23ca2e25604af290
Author: Linus Torvalds [email protected]
Date: Thu Apr 7 15:13:13 2005 -0700

    Initial revision of "git", the information manager from hell

In the README, Linus describes Git's design in detail. In his design there are only two abstractions: 1) the object database; 2) the current directory cache.

Git is essentially a collection of file objects: code files are objects, directory trees are objects, and commits are objects. Each object is named by the SHA1 value of its content, which is 160 bits long and written as 40 hexadecimal characters. Linus uses the first two characters as a folder name and the remaining 38 characters as the file name. All of these objects live under the objects folder of the .git directory, where you can see folders with two-character names containing files with 38-character names.
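The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name object_path and the default root .git/objects are assumptions made for the example, not names taken from Git's source.

```python
import hashlib
import string
from pathlib import PurePosixPath


def object_path(sha1_hex: str, root: str = ".git/objects") -> PurePosixPath:
    """Map a 40-character hex SHA-1 to <root>/<first 2 chars>/<last 38 chars>."""
    if len(sha1_hex) != 40 or any(c not in string.hexdigits for c in sha1_hex):
        raise ValueError("expected 40 hexadecimal characters")
    return PurePosixPath(root) / sha1_hex[:2] / sha1_hex[2:]


# Any byte string has a 160-bit SHA-1, printed as 40 hex characters.
digest = hashlib.sha1(b"any content").hexdigest()
print(len(digest))          # 40
print(object_path(digest))  # .git/objects/<2 chars>/<38 chars>
```

Splitting on the first two characters keeps any single folder from holding every object, at the cost of one extra directory level.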

Linus defines the object data structure as <ASCII type tag> (blob/tree/commit) + <space> + <ASCII length> + <\0> + <binary content>. After decompressing an object file from the objects directory with zlib, you can inspect it with the xxd command. A tree object file, for example, contains the following:

00000000: 7472 6565 2033 3700 3130 3036 3434 2068  tree 37.100644 h
00000010: 656c 6c6f 2e74 7874 0027 0c61 1ee7 2c56  ello.txt.'.a..,V
00000020: 7bc1 b2ab ec4c bc34 5bab 9f15 ba         {....L.4[....

There are three types of objects: BLOB, TREE, and CHANGESET.

BLOB: a binary object. These are the files that Git stores. Unlike some VCSs (such as SVN), Git does not store change deltas. If you commit hello.c to Git, a BLOB file is generated that records the entire contents of hello.c. After you change hello.c and commit again, a new BLOB file is generated that again records the entire contents of hello.c. In Linus's design, a BLOB records only the contents of a file, not metadata such as the file name or file attributes. Those are recorded in the second type of object, TREE.
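As a minimal sketch of the blob format, the following Python wraps file contents in the header described earlier, names the result by its SHA1, and compresses it with zlib. It follows the convention of present-day Git, where the hash is taken over the uncompressed header plus content. The 2005 code may differ in such details, and the helper names encode_object and blob_id are made up for this example.

```python
import hashlib
import zlib


def encode_object(kind: str, content: bytes) -> bytes:
    """Return '<kind> <length>' + NUL + content as raw bytes."""
    header = f"{kind} {len(content)}".encode("ascii") + b"\x00"
    return header + content


def blob_id(content: bytes) -> str:
    """Object name: SHA-1 over the uncompressed header plus content."""
    return hashlib.sha1(encode_object("blob", content)).hexdigest()


v1 = b"hello world\n"
v2 = b"hello world!\n"          # a one-character edit of v1

# Each version is stored whole, so each version gets its own object name.
print(blob_id(v1))
print(blob_id(v2))

# What lands on disk is the zlib-compressed header + content.
stored = zlib.compress(encode_object("blob", v1))
print(zlib.decompress(stored) == encode_object("blob", v1))
```

Because the name depends only on the bytes, two files with identical contents share one blob no matter what they are called or where they sit in the tree.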

TREE: a directory tree object. In Linus's design, a TREE object is an abstraction of the directory tree at one point in time. It contains the file names, file attributes, and SHA1 values of the BLOB objects, but no history. The advantage of this design is that two TREE objects from different points in history can be compared quickly without reading any file contents: identical and differing files are identified from their SHA1 values alone.

In addition, because file names and attributes are recorded in the TREE, BLOB objects can be reused, saving storage, when file attributes or names change or directories are moved without any change to file contents. As Git evolved, the design of TREE was refined into an abstraction of a single folder at a certain point in time: a TREE contains the SHA1 values of the TREE objects for its subdirectories. In this way, Git repositories with complex or deeply nested directory structures also save storage. History is recorded in the third type of object, CHANGESET.

Image from Pro Git 1
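The tree object from the hex dump shown earlier can be decoded with a short Python sketch. The bytes are copied from that dump. The function names parse_object, parse_tree, and changed_names are illustrative assumptions, and changed_names shows how two trees can be compared from their SHA1 values alone, without opening any blob.

```python
# Bytes of the example tree object, copied from the xxd dump in the text.
RAW = bytes.fromhex(
    "74726565203337003130303634342068"
    "656c6c6f2e74787400270c611ee72c56"
    "7bc1b2abec4cbc345bab9f15ba"
)


def parse_object(raw: bytes):
    """Split '<kind> <length>' + NUL + body and check the recorded length."""
    header, _, body = raw.partition(b"\x00")
    kind, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("length in header does not match body")
    return kind, body


def parse_tree(body: bytes):
    """Decode entries laid out as '<mode> <name>' + NUL + 20 raw SHA-1 bytes."""
    entries, pos = [], 0
    while pos < len(body):
        end = body.index(b"\x00", pos)
        mode, name = body[pos:end].decode("utf-8").split(" ", 1)
        entries.append((mode, name, body[end + 1:end + 21].hex()))
        pos = end + 21
    return entries


def changed_names(old, new):
    """Names whose blob SHA-1 differs between two trees (added, removed, edited)."""
    a = {name: sha for _, name, sha in old}
    b = {name: sha for _, name, sha in new}
    return sorted(n for n in a.keys() | b.keys() if a.get(n) != b.get(n))


kind, body = parse_object(RAW)
entries = parse_tree(body)
print(kind, entries)

edited = [("100644", "hello.txt", "0" * 40)]   # hypothetical new blob name
print(changed_names(entries, edited))
```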

CHANGESET: the commit object. A CHANGESET object records the SHA1 of the TREE object for the commit, the committer, the commit message, and other information about the commit. Unlike other SCM (software configuration management) tools, a Git CHANGESET does not record file renames, attribute changes, or deltas of file changes. Instead, a CHANGESET records the SHA1 value of its parent CHANGESET, and differences are obtained by comparing the TREE of the current node with that of its parent.

Linus designed CHANGESET so that a node can have up to 16 parents. Merging more than two parents may seem a strange thing to do, but Git does in fact support merges of more than two branches.
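A commit object can be sketched the same way. The text layout below (tree, parent, author, committer, blank line, message) follows the format used by present-day Git and may not match the 2005 code byte for byte. The function name build_commit, the MAX_PARENTS constant, and the identity string are assumptions for illustration.

```python
import hashlib

MAX_PARENTS = 16  # the limit described in the text above


def build_commit(tree, parents, author, committer, message):
    """Return the raw bytes of a commit object: 'commit <len>' + NUL + body."""
    if len(parents) > MAX_PARENTS:
        raise ValueError(f"at most {MAX_PARENTS} parents are allowed")
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]      # zero parents for a root commit
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode("utf-8") + b"\n"
    return f"commit {len(body)}".encode("ascii") + b"\x00" + body


who = "Hypothetical Dev <dev@example.com> 0 +0000"   # made-up identity
root = build_commit("a" * 40, [], who, who, "first commit")
root_id = hashlib.sha1(root).hexdigest()

# A merge simply lists more than one parent; no delta or rename data is stored.
merge = build_commit("b" * 40, [root_id, "c" * 40], who, who, "merge two lines")
print(merge.decode("utf-8", errors="replace"))
```

Since a commit stores only a tree name and parent names, a diff between a commit and its parent is computed on demand by comparing the two trees.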

After explaining the design of the three objects, Linus turns to TRUST: although trust is not a design goal of Git itself, Git can be trusted as a configuration management tool. The reason is that all objects are named by their SHA1 values (Google's demonstration of a SHA1 collision attack came much later, and the Git community is planning to move to SHA256), and the process of checking in objects can be secured with signing tools such as GPG.

Now that you understand Git's three basic objects, it is easy to understand the two abstractions Linus originally designed for Git: the “object database” and the “current directory cache.” Together with the working directory, Git has three levels of abstraction, as shown below: the Working Directory is where you view and write your code; the Git Repository is the object database, stored in the .git folder (named .dircache in Linus's first design); and between the two sits the Staging Area, which is the current directory cache.

Linus then explains the design of the “current directory cache”: a binary file whose structure is similar to that of a TREE, except that the index contains no nested index objects, meaning that the entire current directory tree is described in a single index file. This design has two advantages: 1. the full contents recorded in the cache can be restored quickly, so even if files in the current workspace are accidentally deleted, they can all be recovered from the cache; 2. files whose contents differ between the cache and the current workspace can be found quickly.
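The second advantage can be illustrated with a rough Python sketch: a flat cache keyed by full path, holding stat data and a content hash per file, which is enough to spot modified files without reading them. The CacheEntry fields and the function names snapshot and modified are assumptions for the example. The real index is a binary file with more fields.

```python
import hashlib
import os
import tempfile
from typing import NamedTuple


class CacheEntry(NamedTuple):
    mtime_ns: int
    size: int
    sha1: str      # content hash, standing in for the blob name


def snapshot(root, paths):
    """Build a flat cache: one entry per file path, no nested directory objects."""
    cache = {}
    for rel in paths:
        full = os.path.join(root, rel)
        st = os.stat(full)
        with open(full, "rb") as fh:
            sha1 = hashlib.sha1(fh.read()).hexdigest()
        cache[rel] = CacheEntry(st.st_mtime_ns, st.st_size, sha1)
    return cache


def modified(root, cache):
    """Paths whose stat data no longer matches the cache, or that have vanished."""
    out = []
    for rel, entry in sorted(cache.items()):
        try:
            st = os.stat(os.path.join(root, rel))
        except FileNotFoundError:
            out.append(rel)
            continue
        if (st.st_mtime_ns, st.st_size) != (entry.mtime_ns, entry.size):
            out.append(rel)
    return out


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src"))
    files = {"README": "hello\n", "src/main.c": "int main(void) { return 0; }\n"}
    for rel, text in files.items():
        with open(os.path.join(root, rel), "w") as fh:
            fh.write(text)

    cache = snapshot(root, list(files))
    unchanged = modified(root, cache)            # nothing touched yet

    with open(os.path.join(root, "src/main.c"), "a") as fh:
        fh.write("/* edited */\n")
    changed = modified(root, cache)              # size differs now

print(unchanged, changed)
```

Note that the cache is a single flat mapping: src/main.c is one key, not an entry inside a nested src object.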

Picture from Things About Git and Github You Need to Know as Developer 2

Linus's first code commit implements the most basic Git functionality and can be compiled and used. The code is extremely concise, totaling 848 lines including the Makefile. You can check out this earliest commit and try it on Linux.

Because library versions have changed, the original Makefile needs minor changes. The first version of Git depends on two libraries, OpenSSL and zlib, which have to be installed manually. Run sudo apt install libssl-dev libz-dev, then edit the Makefile so that LIBS= -lssl becomes LIBS= -lcrypto -lz. Finally, run make and ignore the compiler warnings. Seven executables are generated: init-db, update-cache, write-tree, commit-tree, cat-file, show-diff, and read-tree.

The following is a brief introduction to the implementation of these executables:

(1) init-db: initializes a Git repository. It is the ancestor of today's git init command. The difference is that Linus originally named the folder holding the repository and the cache .dircache, not the .git we know today.

(2) update-cache: takes one or more file paths and adds the files to the cache. In detail, it verifies that the path is valid, computes the SHA1 value of the file, prepends the blob header, compresses the data with zlib, and writes it to the object database (.dircache/objects). Finally, it records the file path, file attributes, and blob SHA1 value in the .dircache/index cache file.

(3) write-tree: turns the cached directory tree into a tree object and writes it to the object database. The data structure of the tree object is 'tree' + length + \0 + list of file entries, where each entry is stored as file attributes + file name + \0 + SHA1 value. After the object is written successfully, the SHA1 value of the tree object is returned.

(4) commit-tree: wraps a tree object in a commit node and adds it to the version history. Its input is the SHA1 value of the tree object to commit and the parent commit nodes (up to 16). A commit object contains the tree, the parents, and the name, email, and date of the committer and the author. Finally, the new commit object file is written and the SHA1 value of the commit node is returned.

(5) cat-file: all object files are compressed with zlib, so to view their contents you use this tool, which decompresses an object into a temporary file that you can then inspect.

(6) show-diff: quickly shows the differences between the current cache and the current workspace. Because file attributes (modification time, length, and so on) are also stored in the cache, it can quickly tell whether a file has been modified and then show the differing parts.

(7) read-tree: prints the contents of the tree identified by the given SHA1 value.
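To see how these programs fit together, here is a toy end-to-end sketch in Python that mirrors init-db, update-cache, write-tree, commit-tree, and cat-file on a temporary directory. It is not a port of the C code: the function names simply echo the program names, the index is an in-memory dict instead of a binary file, the identity string is made up, and object names follow present-day Git (SHA1 over the uncompressed data).

```python
import hashlib
import os
import tempfile
import zlib


def write_object(db, kind, content):
    """Store '<kind> <len>' + NUL + content, zlib-compressed, under db/xx/yyyy."""
    raw = f"{kind} {len(content)}".encode("ascii") + b"\x00" + content
    name = hashlib.sha1(raw).hexdigest()
    folder = os.path.join(db, name[:2])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, name[2:]), "wb") as fh:
        fh.write(zlib.compress(raw))
    return name


def cat_file(db, name):
    """Decompress an object and return (kind, content)."""
    with open(os.path.join(db, name[:2], name[2:]), "rb") as fh:
        raw = zlib.decompress(fh.read())
    header, _, content = raw.partition(b"\x00")
    return header.split(b" ")[0].decode("ascii"), content


def update_cache(db, index, path, data):
    """Write a blob and record path -> (mode, blob name) in the flat index."""
    index[path] = ("100644", write_object(db, "blob", data))


def write_tree(db, index):
    """Serialise the index as '<mode> <path>' + NUL + 20 raw SHA-1 bytes per entry."""
    body = b"".join(
        f"{mode} {path}".encode("utf-8") + b"\x00" + bytes.fromhex(sha)
        for path, (mode, sha) in sorted(index.items())
    )
    return write_object(db, "tree", body)


def commit_tree(db, tree, parents, who, message):
    """Write a commit object pointing at a tree and its parent commits."""
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message, ""]
    return write_object(db, "commit", "\n".join(lines).encode("utf-8"))


with tempfile.TemporaryDirectory() as top:
    db = os.path.join(top, ".dircache", "objects")    # init-db
    os.makedirs(db)
    index = {}
    who = "Hypothetical Dev <dev@example.com> 0 +0000"

    update_cache(db, index, "hello.txt", b"hello world\n")
    tree = write_tree(db, index)
    first = commit_tree(db, tree, [], who, "first")

    update_cache(db, index, "hello.txt", b"hello again\n")
    second = commit_tree(db, write_tree(db, index), [first], who, "second")

    kind, content = cat_file(db, second)
    old_blob = cat_file(db, cat_file(db, tree)[1][-20:].hex())[1]

print(kind)
print(content.decode("utf-8"))
```

The last lookup walks from the first tree to its blob, showing that the old version of hello.txt is still in the object database after the second commit.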

These are the seven programs of the first usable version of Git. Those of you who have used Git might ask: how do these relate to the Git commands I use every day, such as git add and git commit? Indeed, the original Git design contained none of the commands we normally use.

In Git's design there are two kinds of commands: low-level (plumbing) commands and high-level (porcelain) commands. At the beginning, Linus designed commands in the Unix KISS spirit for the hackers of the open source community. They are called plumbing commands because hackers are hands-on people who roll up their sleeves and fix broken pipes themselves. Junio Hamano, who took over Git, decided that these commands were not friendly enough for ordinary users, so he built higher-level, easier-to-use commands with nicer interfaces on top of them, such as the git add and git commit we use every day.

git add wraps update-cache, and git commit wraps write-tree and commit-tree. For a more detailed introduction to the plumbing commands, see the Git Internals chapter of Pro Git.

I will not go through the code in detail here. Linus's coding style is extremely concise: he never writes two lines where one will do. And, naturally, few people know the Linux API as well as he does. What impressed me most is that mmap is used in many places to map files into memory, which avoids explicit memory allocation and file read/write calls and improves performance. As one colleague put it, Linus's code has no real problems other than not following coding conventions. By the way, Linus indents with tabs (see “Tab or Space, that's the question” for reference).
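The mmap idea carries over to other languages. As a small illustration (using Python's standard mmap module rather than the C mmap call in Git's source), the sketch below hashes a file through a read-only memory mapping instead of copying it into a buffer with read(). The function name sha1_of_file is an assumption for the example.

```python
import hashlib
import mmap
import os
import tempfile


def sha1_of_file(path):
    """Hash a file through a read-only memory mapping instead of read() calls."""
    with open(path, "rb") as fh:
        if os.fstat(fh.fileno()).st_size == 0:
            return hashlib.sha1(b"").hexdigest()    # empty files cannot be mapped
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # The mapping exposes the file's bytes directly to the hash function.
            return hashlib.sha1(view).hexdigest()


with tempfile.TemporaryDirectory() as top:
    path = os.path.join(top, "hello.txt")
    with open(path, "wb") as fh:
        fh.write(b"hello world\n")
    mapped_digest = sha1_of_file(path)

plain_digest = hashlib.sha1(b"hello world\n").hexdigest()
print(mapped_digest == plain_digest)
```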

After making his first Git commit, Linus released the tool to the community. One community developer, Junio Hamano, found it interesting and downloaded the code, and was surprised and intrigued to find only 1,244 lines. Junio talked with Linus on the mailing list and helped add features such as merge, then kept polishing Git until he took over its maintenance completely and Linus went back to the Linux kernel project.

If I had to pick the greatest commit of all time, it would be the first commit of the Git project itself. That code submission was groundbreaking. If Linux led open source software to success and changed the landscape of the software industry, Git changed the way developers around the world work and collaborate. Two years after Git was born, three young programmers sat down in a San Francisco bistro and decided to build something with it. A few months later, GitHub went live.

Going back to the question from the beginning of the article: if I were to design Git, I would probably extend my design from my experience with existing tools (such as SVN). When I first came into contact with Git, I even thought, superficially, that Git was just SVN + distributed. Only after understanding Git's internals, and even reading its initial code, did I come to marvel at its exquisite design. The initial design and implementation of Git offer the following lessons for (open source) software products:

1. Addressing pain points: Git grew out of the needs of Linus and the Linux community, and those needs are shared by collaborative development projects in general (especially geographically distributed ones). Linus solved his own pain point and, in doing so, achieved something great.

2. Minimalist design: when designing Git, Linus was not bound by the concerns of traditional SCM tools, such as file differences and version comparison. Instead, he abstracted a few basic objects and made the design idea of Git clear.

3. MVP (Minimum Viable Product): the concept is widely accepted but not easy to put into practice. What features does an MVP of a configuration management tool need? Most people think of code commits, history tracing, version comparison, branch merging, and so on. Linus broke the problem down and quickly implemented only the basic functionality, so bare that only hackers in the open source community could use it. But that was enough for those hackers to see its value and keep adding to it.

4. Rapid release and rapid iteration: this also comes from Linux kernel development experience. After Linus implemented the Git MVP, he posted it on the Linux community mailing list, solicited comments, and iterated.

5. Find a suitable successor: The Cathedral and the Bazaar makes a similar point: “When you lose interest in a program, your last duty to it is to hand it off to a competent successor.” However, Linus did not hand Git to Junio because he had lost interest, but because he realized that, once the Git infrastructure was in place, Junio was better than he was at implementing richer, more user-friendly features. It takes both courage and wisdom to find a more suitable successor for an open source project.

I don't know whether this article has given you a new view of Git. In my world, I never set out to “learn” a technology. I play with it. I treat technology as a hobby, like watching anime or reading comics, and before I notice it I am enjoying myself. It is a very happy way to work: in everyone else's eyes you are fooling around every day, yet your skills keep growing (I won't say it is for showing off), and people envy you.

Well, that is basically it for the origin of Git and its internal implementation. It is not a lot of material, but I still envy the imagination and skill of these masters. It is remarkable how many programmers have benefited since, especially now that multi-person team development is the norm.

