TL;DR

This article introduces Git internals: how Git stores our code and change history, what happens internally when you change a file, and the benefits of implementing Git this way. It walks through the GIF above with an example to give you an idea of how Git works. If you can already read that picture, what follows may seem basic to you.


This article was presented at the FCC front-end sharing session (freeCodeConf 2019, Shenzhen station), held on 2019/24 at the Multi-function Hall, 2F, Tencent Tower, Shenzhen.

Video: www.bilibili.com/video/av772…

PPT: www.lzane.com/slide/git-u…

Preface

In recent years, the rapid pace of technological change has led some developers to learn only superficially, stopping at knowing which commands to call. We often have the illusion that “I know this technology, this framework” until we hit a problem and realize it isn’t that simple.

Git is one of those things, too: most people know how to use it and which commands exist, but only a few know how it works. Knowing some of the fundamentals helps you understand what you are really doing, so you don’t get lost in Git’s vast array of commands and options.

How does Git store information

Here is a simple example to give you a sense of how Git stores information.

Let’s start by creating two files:

$ git init
$ echo '111' > a.txt
$ echo '222' > b.txt
$ git add *.txt

Git stores its entire database in the .git/ directory. If you look in the .git/objects directory, you will find that the repository now contains two objects.

$ tree .git/objects
.git/objects
├── 58
│   └── c9bdf9d017fcd178dc8c073cbfcbb7ff240d6c
├── c2
│   └── 00906efd24ec5e783bee7f23b5d7c941b0c12c
├── info
└── pack

Just out of curiosity, what’s inside one of them?

$ cat .git/objects/58/c9bdf9d017fcd178dc8c073cbfcbb7ff240d6c
xKOR0a044K%

Why is it a string of gibberish? Because Git compresses the information into a binary file. Fortunately, Git also provides a command for inspecting objects, git cat-file [-t] [-p]: -t shows an object’s type, and -p prints the content it stores.

$ git cat-file -t 58c9
blob
$ git cat-file -p 58c9
111

This object is a blob whose content is 111; in other words, it holds the content of the a.txt file.

Here we meet the first Git object type, the blob, which stores only the content of a file and nothing else, not even the file name. Git then runs the SHA-1 hash algorithm over this content (prefixed with a short header) to obtain the hash value 58c9bdf9d017fcd178dc8c073cbfcbb7ff240d6c, which serves as the object’s unique ID in the Git repository.
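To make the hashing step concrete, here is a minimal Python sketch of how a blob ID can be computed and why the raw file looks like gibberish. The function names git_object_id and loose_object_bytes are made up for this illustration and are not Git APIs. The sketch assumes a.txt holds exactly the bytes written by echo '111', that is, the three digits plus a newline.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 ID of an object (illustrative helper, not a Git API)."""
    # Git hashes a header "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_bytes(obj_type: str, content: bytes) -> bytes:
    """Return zlib-compressed bytes in the same layout as a file under .git/objects/."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return zlib.compress(header + content)


# Assumption: `echo '111' > a.txt` wrote the digits plus a trailing newline.
content = b"111\n"
blob_id = git_object_id("blob", content)
print(blob_id)  # compare with the ID shown for a.txt above

# Decompressing the stored bytes recovers the header and the content,
# which is why `cat` on the raw file prints unreadable characters.
stored = loose_object_bytes("blob", content)
header, _, body = zlib.decompress(stored).partition(b"\x00")
print(header, body)
```

Note that the ID depends only on the object type and the bytes of the content, so two files with identical content share one blob.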

In other words, our Git repository looks like this:

Let’s continue exploring and create a commit.

$ git commit -am '[+] init'
$ tree .git/objects
.git/objects
├── 0c
│   └── 96bfc59d0f02317d002ebbf8318f46c7e47ab2
├── 4c
│   └── aaa1a9ae0b274fba9e3675f9ef071616e5b209
...

After the commit, we find two more objects in the Git repository. Using the cat-file command again, let’s see what types they are and what they contain.

$ git cat-file -t 4caaa1
tree
$ git cat-file -p 4caaa1
100644 blob 58c9bdf9d017fcd178dc8c0... 	a.txt
100644 blob c200906efd24ec5e783bee7...	b.txt

Here we meet the second Git object type, the tree, which takes a snapshot of the current directory structure. From its content you can see that it stores a directory structure (much like a folder), together with the permissions, type, ID (SHA-1 value), and file name of each file (or subfolder).

Now the Git repository looks like this:
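As a rough sketch of what a tree object holds, the Python snippet below serializes the two entries listed above and hashes the result. The helper names are invented for this example, and the sorting is simplified (plain files only, ordered by name). If the layout sketched here matches what Git does, the printed ID should equal the tree ID starting with 4caaa1 shown above.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 ID of an object (illustrative helper, not a Git API)."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries: "<mode> <name>\\0" + 20 raw ID bytes."""
    body = b""
    # Simplification: sort by name only, which is enough for plain files.
    for mode, name, hex_id in sorted(entries, key=lambda entry: entry[1]):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


# Modes, names, and blob IDs are taken from the output above.
entries = [
    ("100644", "a.txt", "58c9bdf9d017fcd178dc8c073cbfcbb7ff240d6c"),
    ("100644", "b.txt", "c200906efd24ec5e783bee7f23b5d7c941b0c12c"),
]
body = tree_body(entries)
tree_id = git_object_id("tree", body)
print(tree_id)  # compare with the tree ID shown above
```

The tree only points at blobs by ID; the file content itself never appears inside the tree.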

$ git cat-file -t 0c96bf
commit
$ git cat-file -p 0c96bf
tree 4caaa1a9ae0b274fba9e3675f9ef071616e5b209
author lzane Li Zefan 1573302343 +0800
committer lzane 1573302343 +0800

[+] init

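Here we meet the third Git object type, the commit, which stores the information about one commit: the tree object for the snapshot taken at commit time, the previous commit (the parent; this is the first commit, so there is none, and a merge commit may have multiple parents), the author of the commit and when it was committed, and finally the commit message.

The Git repository now looks like this:

We now know how Git stores a commit. So where does Git store the branch information we work with every day?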
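The text layout of a commit object can be sketched in a few lines of Python. This is only an illustration: the helper names and the identity Example Dev <dev@example.com> are invented, so the resulting IDs will not match the 0c96bf commit above. The second commit reuses the same tree purely to keep the example short.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 ID of an object (illustrative helper, not a Git API)."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def commit_body(tree_id, parents, author, committer, message):
    """Build the text of a commit: tree, parents, author, committer, message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {parent}" for parent in parents]  # none for the first commit
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode()


# Hypothetical identity; the name and e-mail address are invented.
who = "Example Dev <dev@example.com> 1573302343 +0800"
tree_id = "4caaa1a9ae0b274fba9e3675f9ef071616e5b209"

root = commit_body(tree_id, [], who, who, "[+] init")
root_id = git_object_id("commit", root)

# A later commit names the earlier one as its parent, forming the chain.
child = commit_body(tree_id, [root_id], who, who, "update")
child_id = git_object_id("commit", child)

print(root.decode())
print(child.decode())
```

Because the parent ID is part of the hashed text, a commit ID depends on the whole history before it.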

$ cat .git/HEAD
ref: refs/heads/master

$ cat .git/refs/heads/master
0c96bfc59d0f02317d002ebbf8318f46c7e47ab2

In a Git repository, HEAD, branches, and lightweight tags can each be understood simply as a pointer to the SHA-1 value of a commit (HEAD usually gets there through a branch, as the ref: line above shows).

There is in fact a fourth Git object type, the tag, which is created when you add an annotated tag (git tag -a). It is not covered in detail here; if you are interested, you can explore it with the method used above.

Now you know how Git stores a file’s content, the directory structure, commit information, and branches. In essence, Git is a key-value database plus a directed acyclic graph (DAG) built as a Merkle tree. To borrow a little of the blockchain hype: blockchains also use Merkle trees in their data structures.
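The pointer idea can be sketched as a tiny resolver that follows ref: lines until it reaches a commit ID. The function resolve_ref is a name made up for this example, and the sketch runs against a throwaway directory that mimics the two files shown above rather than a real repository. It also ignores other places where Git may keep refs, such as a packed-refs file.

```python
import tempfile
from pathlib import Path


def resolve_ref(git_dir: Path, name: str = "HEAD") -> str:
    """Follow symbolic refs ("ref: <path>") until a commit ID is reached."""
    value = (git_dir / name).read_text().strip()
    while value.startswith("ref: "):
        value = (git_dir / value[len("ref: "):]).read_text().strip()
    return value


# Build a temporary directory containing the same two files as above.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    (git_dir / "refs" / "heads" / "master").write_text(
        "0c96bfc59d0f02317d002ebbf8318f46c7e47ab2\n"
    )
    head_commit = resolve_ref(git_dir)

print(head_commit)
```

Creating a branch is cheap for the same reason: it only means writing one small file that contains a commit ID.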

Git’s three partitions

Next, let’s look at Git’s three partitions (working directory, index, and Git repository) and how Git’s change history is formed. Understanding how these three partitions and Git’s history chain work internally gives you a “visual” mental model of Git’s many commands, so you are less likely to mix them up.

Following the example above, the current state of the repository is as follows:

There are three areas where information is stored:

  • Working directory: the files on the operating system, where all code development and editing happens.
  • Index or staging area: holds the content that will be committed to the Git repository at the next commit.
  • Git repository: made up of Git objects, which record the snapshot of each commit and the chained history of changes.

Let’s take a look at what happens when you update the contents of a file.

Run echo "333" > a.txt to change the content of a.txt from 111 to 333.

Run git add a.txt to add a.txt to the index. As shown in the image above, Git creates a new blob object in the repository to store the new file content, and updates the index so that a.txt points to the new blob object.

Run git commit -m 'update' to commit this change. As shown above:

  1. Git first creates a tree object from the current index, which serves as the snapshot for the new commit.
  2. Git creates a new commit object that stores the information about this commit, with parent pointing to the previous commit, forming a chain that records the change history.
  3. Git moves the master branch pointer to the new commit.
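The steps above can be sketched as a toy in-memory model. MiniRepo is a class invented for this illustration. Its tree and commit contents use a simplified text format and leave out author information, so only the blob IDs correspond to what real Git would produce.

```python
import hashlib


class MiniRepo:
    """Toy model: an object store, an index, and a master branch pointer."""

    def __init__(self):
        self.objects = {}   # object ID -> (type, content bytes)
        self.index = {}     # file name -> blob ID
        self.master = None  # commit ID that master points to

    def _store(self, obj_type, content):
        header = f"{obj_type} {len(content)}".encode() + b"\x00"
        oid = hashlib.sha1(header + content).hexdigest()
        self.objects[oid] = (obj_type, content)
        return oid

    def add(self, name, content):
        # Like `git add`: write a blob and point the index entry at it.
        self.index[name] = self._store("blob", content)

    def commit(self, message):
        # 1. Build a tree (the snapshot) from the current index.
        listing = "".join(
            f"100644 blob {oid}\t{name}\n"
            for name, oid in sorted(self.index.items())
        )
        tree_id = self._store("tree", listing.encode())
        # 2. Create a commit whose parent is the previous commit, if any.
        parent = f"parent {self.master}\n" if self.master else ""
        text = f"tree {tree_id}\n{parent}\n{message}\n"
        # 3. Move the master pointer to the new commit.
        self.master = self._store("commit", text.encode())
        return self.master


repo = MiniRepo()
repo.add("a.txt", b"111\n")
repo.add("b.txt", b"222\n")
first = repo.commit("[+] init")

repo.add("a.txt", b"333\n")  # new blob; the blob for b.txt is reused
second = repo.commit("update")

print(len(repo.objects), repo.master)
```

After both commits the store holds three blobs, two trees, and two commits. The unchanged b.txt is stored only once, because both trees point to the same blob ID.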

Now we know what the three Git partitions are, what they do, and how the history chain is built. Basically, most Git commands operate on these three partitions and this chain. It is worth thinking through the Git commands you use and checking whether you can “visualize” each of them in the image above.

If you can’t visualize the commands you use every day, Git is recommended




Some interesting questions

Interested readers can keep going; this part is not the main content of the article.

Question 1: Why store permissions and file names in a Tree object instead of a blob object?

Imagine changing the name of a file.

If the file name were stored in the blob, Git would have to copy the original content into a new blob object. With Git’s actual design, it only needs to create a new tree object that maps the new file name to the same blob. The original blob object is reused, which saves space.

Question 2: At each commit, does Git store a fresh snapshot of the file or only the changed part?
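A small Python sketch shows the effect of a rename. The helper names and the file name renamed.txt are invented for this example, and the tree layout follows the simplified one-entry form of a tree object.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 ID of an object (illustrative helper, not a Git API)."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def single_file_tree(name: str, blob_id: str) -> bytes:
    """Serialize a tree with one regular file entry."""
    return f"100644 {name}".encode() + b"\x00" + bytes.fromhex(blob_id)


# The blob ID is computed from the content alone, so it survives the rename.
blob_id = git_object_id("blob", b"111\n")

tree_before = git_object_id("tree", single_file_tree("a.txt", blob_id))
tree_after = git_object_id("tree", single_file_tree("renamed.txt", blob_id))

print(blob_id)
print(tree_before)
print(tree_after)
```

Only the small tree object is new after the rename; both trees reference the same blob ID.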

As you can see from the example above, Git stores a fresh snapshot of a file, not a change record of a file. That is, even if you just add a line to a file, Git will create a new blob object. Is that a waste of space?

This is a space-time trade-off on Git’s part. Think about checking out a commit, or comparing two commits. If Git stored only the changes to a file, it would have to start from the first commit and replay every change up to the target commit, which could take a long time. By storing fresh snapshots instead, Git can simply read the content straight from the snapshot, which is much faster.

Of course, storing every snapshot in full becomes costly when network transfer is involved or when a Git repository grows really large. For that, Git has a garbage-collection mechanism, gc, which not only removes useless objects but also packs existing similar objects together.

Question 3: How does Git ensure that history cannot be tampered with?

This is guaranteed by the SHA-1 hash algorithm and the hash tree. Suppose you secretly modify the content of a file somewhere in the history. The SHA-1 hash of that file’s blob object changes, so the SHA-1 of the tree object that references it changes, so the SHA-1 of the commit changes, and so do the SHA-1 values of all commits after it. And because Git is a distributed system in which everyone has a repository with the full history, anyone can easily spot the problem.
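The ripple effect can be demonstrated with a short Python sketch. It builds a linear history with a single file per commit, using simplified commit text without author information. The function names and the file contents 444 and 999 are invented for this example.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 ID of an object (illustrative helper, not a Git API)."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def build_history(file_versions):
    """Return the commit IDs of a linear history, one version of a.txt per commit."""
    commit_ids = []
    parent = None
    for content in file_versions:
        blob_id = git_object_id("blob", content)
        tree = b"100644 a.txt\x00" + bytes.fromhex(blob_id)
        tree_id = git_object_id("tree", tree)
        lines = [f"tree {tree_id}"]
        if parent:
            lines.append(f"parent {parent}")
        lines += ["", "change"]
        parent = git_object_id("commit", ("\n".join(lines) + "\n").encode())
        commit_ids.append(parent)
    return commit_ids


original = build_history([b"111\n", b"333\n", b"444\n"])
# Tamper with the file content recorded in the second commit only.
tampered = build_history([b"111\n", b"999\n", b"444\n"])

for before, after in zip(original, tampered):
    print(before == after, before, after)
```

The first commit keeps its ID, while the tampered commit and every commit after it get new IDs, even though the last file version is unchanged.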


In the next article, I will write up some Git tips I find useful in daily work, the questions I am asked most often, and how to recover from some common accidents.

References

  • Scott Chacon, Ben Straub – Pro Git – Apress (2014)
  • Jon Loeliger, Matthew McCullough – Version Control with Git, 2nd Edition – O’Reilly Media (2012), as a supplement to the book above