Background

As development iterates, the code repository keeps growing and Git operations keep getting slower, which drags down the overall development pace. To solve this problem, we first need answers to the following questions.

  • How do Git operations work?
  • How does Git store data?
  • Why does Git get slower as development iterates, and where does the time go?

How does Git work?

Git is fundamentally a content-addressable file system. This means that the core of Git is a simple key-value data store: you can insert any type of content into a Git repository, and Git returns a unique key that you can use to retrieve that content at any time. Git commands are divided into low-level (“plumbing”) commands and high-level (“porcelain”) commands. The high-level commands are the familiar ones, such as git add and git commit. The low-level commands do the actual work; they were designed to be chained together Unix-style or called from scripts, and the high-level commands are built on top of them.

Both kinds of commands operate on the .git directory, which git init creates. A freshly initialized .git directory looks like this:

$ ls -F1
config
description
HEAD
hooks/
info/
objects/
refs/

  • description: used only by the GitWeb program
  • config: project-specific configuration options (what git config writes)
  • info: holds a global exclude file for ignore patterns that you do not want to put in a .gitignore file
  • hooks: client-side and server-side hook scripts
  • refs: pointers to commit objects (branches, tags, and remotes)
  • objects: stores all of the content in the repository

Objects

Blob object

A blob (binary large object) stores the content of a file, such as an image or a source file. It holds only the content, not the file name.

$ find .git/objects -type f
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4

Tree object

A tree object solves the problem of storing file names and works much like a directory on a Unix system. It stores pointers to blob objects and to other tree objects, together with the mode, type, and name of each entry.

$ git cat-file -p master^{tree}
100644 blob a906cb2a4a904a152e80877d4088654daad0c859      README
100644 blob 8f94139338f9404f26296befa88755fc2598c289      Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      lib

Commit object

Creating a commit object requires a reference to a tree object and, for every commit except the first, a reference to a parent commit object.

$ echo 'second commit' | git commit-tree 0155eb -p fdf4fc3
cac0cab538b970a37ea1e769cbbde608743bc96d
$ echo 'third commit'  | git commit-tree 3c4e9c -p cac0cab
1a410efbd13591db07496601ebc7a059dd55cfe9

How Git stores an object:

  • Read the contents of the file and prepend a header (the object type, the content size in bytes, and a null byte); call the result content
  • Calculate the SHA-1 value of content
  • Compress content with zlib
  • Write the compressed data to the objects directory: the first two characters of the SHA-1 value name the subdirectory and the remaining 38 characters name the file
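The header and hashing steps above can be sketched in a few lines of shell. This is a minimal illustration rather than Git's own implementation: the function name git_blob_id is hypothetical, it assumes the sha1sum tool from GNU coreutils is installed, and it skips the zlib compression step because compression does not affect the key.

```shell
# Compute the object ID Git would assign to a blob, without calling git.
# Assumption: GNU coreutils `sha1sum` is available.
git_blob_id() {
    file=$1
    size=$(wc -c < "$file" | tr -d ' ')
    # Header is "blob <size>" followed by a NUL byte, then the raw content.
    { printf 'blob %s\000' "$size"; cat "$file"; } | sha1sum | cut -d' ' -f1
}

tmpfile=$(mktemp)
printf 'test content\n' > "$tmpfile"
id=$(git_blob_id "$tmpfile")
rm -f "$tmpfile"

# The first two characters name the directory, the other 38 name the file.
dir=$(printf '%s' "$id" | cut -c1-2)
name=$(printf '%s' "$id" | cut -c3-)
echo "object id: $id"
echo "stored at: .git/objects/$dir/$name"
```

For the 13-byte input used here, the resulting path is the same .git/objects/d6/... entry shown in the find listing earlier. Inside a real repository, git hash-object <file> reports the same ID.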

All Git objects are stored this way; only the type identifier differs. The headers of the other two object types begin with the string “commit” or “tree” instead of “blob”. Also, while the content of a blob can be almost anything, commit objects and tree objects each have their own fixed format.

Putting the three object types together: a commit refers to a tree object, and a tree object refers to other tree objects and to blob objects, so the complete state of every change is recorded.

Refs

The refs directory stores Git references. A reference is a file that holds a SHA-1 value under a simple name, and that is basically what a Git branch is: a pointer to the head (the latest commit) of a line of work.

HEAD

How does Git know the SHA-1 value of the latest commit when you run git branch <branch>? The answer is the HEAD file. The HEAD file usually holds a symbolic reference to the current branch. A symbolic reference is a pointer to another reference rather than to an object. In some rare cases, the HEAD file contains the SHA-1 value of a Git object instead. This happens when you check out a tag, a commit, or a remote branch, which puts your repository into a “detached HEAD” state.

$ cat .git/HEAD
ref: refs/heads/master
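The two forms a HEAD file can take are easy to tell apart in a script. The sketch below is only an illustration: describe_head is a hypothetical helper name, and it assumes the symbolic form always points under refs/heads/.

```shell
# Classify the content of a HEAD file: symbolic reference or detached commit ID.
describe_head() {
    head_content=$1
    case $head_content in
        "ref: "*)
            # Symbolic reference: strip the prefix to get the branch name.
            echo "on branch ${head_content#ref: refs/heads/}"
            ;;
        *)
            # Anything else is treated as a raw SHA-1 value (detached HEAD).
            echo "detached at $head_content"
            ;;
    esac
}

describe_head "ref: refs/heads/master"
describe_head "1a410efbd13591db07496601ebc7a059dd55cfe9"
```

Inside a repository, the function could be called as describe_head "$(cat .git/HEAD)".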

Tags

A tag object is very similar to a commit object: it contains a tagger, a date, a message, and a pointer. The main difference is that a tag object usually points to a commit rather than to a tree. It is like a branch reference that never moves: it always points to the same commit but gives it a friendlier name.

Transfer protocol

Smart transfer protocol: the process on the remote end can read local data, work out what the client already has and what it needs, and generate a suitable packfile for it. Two pairs of processes transfer data, one pair for uploading and one pair for downloading.

Upload data

To upload data to the remote end, Git uses the send-pack and receive-pack processes. The send-pack process runs on the client and connects to the receive-pack process running on the remote end. Once the two sides have negotiated what needs to be transferred, the client uploads the data.

Download data

The fetch-pack and upload-pack processes come into play when you download data. The client starts the fetch-pack process, which connects to the upload-pack process on the remote end. Once the two sides have negotiated what needs to be transferred, the client downloads the data.

Git packfiles

Without optimization, if we commit a 10 MB file, Git adds a zlib-compressed blob object to the objects directory. If we then modify the file and add it again, Git generates another blob object with a different hash value, so the objects directory now holds roughly 20 MB for this one file.

git gc

Running git gc does the following:

  1. Collects all loose objects and places them in a packfile
  2. Merges multiple packfiles into one large packfile
  3. Removes stale objects that are not reachable from any commit
  4. Packs references into a single file (packed-refs)

If you update a reference after it has been packed, Git does not edit the packed-refs file; it writes a new file under refs/heads instead. To get the correct SHA-1 value for a given reference, Git first looks for it in the refs directory and then falls back to the packed-refs file. So if you cannot find a reference in the refs directory, it is probably in the packed-refs file.
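The lookup order described above can be sketched as follows. This is a simplified illustration, not Git's actual resolution logic: resolve_ref is a hypothetical helper, it ignores symbolic references, and the directory layout and SHA-1 values in the demo are made up for the example.

```shell
# Resolve a reference: loose file under the Git directory first, then packed-refs.
resolve_ref() {
    git_dir=$1
    ref=$2
    if [ -f "$git_dir/$ref" ]; then
        cat "$git_dir/$ref"
    elif [ -f "$git_dir/packed-refs" ]; then
        # Data lines look like "<sha-1> <refname>"; header and peeled lines never match.
        awk -v ref="$ref" '$2 == ref { print $1; exit }' "$git_dir/packed-refs"
    fi
}

# Demo layout in a temporary directory standing in for .git
demo=$(mktemp -d)
mkdir -p "$demo/refs/heads"
printf '%s\n' \
    '# pack-refs with: peeled fully-peeled sorted' \
    'cac0cab538b970a37ea1e769cbbde608743bc96d refs/heads/master' \
    '1a410efbd13591db07496601ebc7a059dd55cfe9 refs/heads/test' > "$demo/packed-refs"
# Simulate updating master after packing: a new loose file shadows the packed entry.
echo 'fdf4fc3344e67ab068f836878b6c4951e3b15f3d' > "$demo/refs/heads/master"

resolve_ref "$demo" refs/heads/master   # answered by the loose file
resolve_ref "$demo" refs/heads/test     # answered by packed-refs
```

In a real repository, git rev-parse <ref> performs this resolution and should be preferred over reading the files directly.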

Git initially stores objects on disk in the “loose” object format. From time to time, however, Git packs several of these objects into a single binary file called a packfile to save space and improve efficiency. Git does this when there are too many loose objects in the repository, when you run git gc manually, or when you push to a remote server. To see the packing process, run git gc manually. If you then look in the objects directory, you will find that most of the objects are gone and that a new pair of files has been created:

$ find .git/objects -type f
.git/objects/bd/9dbf5aae1a3862dd1526723246b20206e5fc37
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
.git/objects/info/packs
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.idx
.git/objects/pack/pack-978e03944f5c581011e6998cd0e9e30000905586.pack

When Git packs objects, it looks for files with similar names and sizes and stores only the differences between one version of a file and the next. Typically, the latest version is stored in full, while older versions are stored as deltas, because the latest version is the one we need most often.

Solving the problem

Root causes

Too many branches

During development, each feature may get its own branch, so the number of branches grows with every iteration cycle. The main effects are:

  • During upload and download, Git has to work out which branches were updated, which involves traversing the heads directory
  • If an obsolete, unmaintained branch was never merged into the main branch, the files, commits, and trees added on that branch stay in the remote repository, so that data is evaluated again on every push and pull

Growing objects data

After several development iterations without any gc runs, duplicate data accumulates and the objects directory keeps growing, so these calculations take longer and longer.

Solution steps

  1. Delete useless branches

Rule: delete branches whose latest commit is more than 60 days old. Steps: (1) obtain all current remote branches; (2) if a branch's latest commit is more than 60 days older than the current date, tag the branch and then delete it.

  2. Run gc on the Git repository

Execute: log in to the remote Git repository and run git gc to delete useless commits and unreferenced objects.
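The 60-day rule from step 1 can also be expressed with epoch timestamps instead of parsed date strings. The following is a minimal sketch under stated assumptions, not the script used in this article: list_stale_branches is a hypothetical helper, and the sample branch names and timestamps are made up.

```shell
# Print the branches whose latest commit is more than max_days old.
# Input on stdin: one "<commit-epoch-seconds> <branch-name>" pair per line.
# The current time is passed in explicitly so the function is easy to test.
list_stale_branches() {
    now=$1
    max_days=$2
    while read -r commit_time branch; do
        age_days=$(( (now - commit_time) / 86400 ))
        if [ "$age_days" -gt "$max_days" ]; then
            echo "$branch"
        fi
    done
}

# Made-up data: "now" is day 100; the commits were made on day 10 and day 90.
printf '%s\n' "864000 feature_old" "7776000 feature_new" |
    list_stale_branches 8640000 60

# Against a real repository the input could come from, for example:
#   git for-each-ref --format='%(committerdate:unix) %(refname:short)' refs/remotes/origin |
#       list_stale_branches "$(date +%s)" 60
```

With the sample data, only feature_old is printed, because its last commit is 90 days old. Working in epoch seconds avoids month-name parsing and the 31-day-month approximation.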

Detailed script code

#!/bin/bash
# Archive (tag, then delete) remote branches whose latest commit is more than 60 days old.
# Environment variables:
#   WHITE_LIST     branch names, separated by whitespace, that must never be deleted
#   ENABLE_DELETE  "true" to really delete; otherwise candidates are only logged

git fetch origin
git fetch -p

rm -rf parseBranchdir
mkdir parseBranchdir
rm -rf deleteBranchFile
touch deleteBranchFile
rm -rf whiteListBranch
touch whiteListBranch
# Write one branch name per line so isInWhiteList can compare line by line
printf '%s\n' ${WHITE_LIST} > whiteListBranch

lastBranch=""
whiteList="false"

# Sets whiteList to "true" if branch $1 appears in whiteListBranch
function isInWhiteList() {
    while read -r line
    do
        if [ "$line" = "$1" ]; then
            whiteList="true"
            echo "$1 is in the white list"
            return 1
        fi
    done < whiteListBranch
    whiteList="false"
}

# Tags branch $1 as archive/$1 and deletes it from origin, unless it is protected
function tryArchiveBranch() {
    if [ "$1" = "$lastBranch" ]; then
        echo "same branch, just return"
        return
    fi
    lastBranch=$1
    isInWhiteList "$1"
    if [ "$whiteList" = "true" ]; then
        echo "in white list, just return"
        return
    fi
    # Long-lived branches are never archived
    if [ "$1" = "develop" ] || [ "$1" = "master" ] || [ "$1" = "release_temp" ]; then
        echo "return for $1"
        return
    fi
    if [ "${ENABLE_DELETE}" = "true" ]; then
        echo "the branch $1 should be deleted, trying to delete it"
        git reset --hard "remotes/origin/$1"
        git checkout .
        git tag "archive/$1"
        # Push the archive tag before deleting the branch so the commits stay reachable
        git push origin "archive/$1"
        git push --delete origin "$1"
    else
        echo "the branch $1 should be deleted"
        echo "the branch $1 should be deleted" >> deleteBranchFile
    fi
}

# Returns the month number (1-12) for a three-letter month name
function mapMonthToInt() {
    case $1 in
        "Jan") return 1 ;;
        "Feb") return 2 ;;
        "Mar") return 3 ;;
        "Apr") return 4 ;;
        "May") return 5 ;;
        "Jun") return 6 ;;
        "Jul") return 7 ;;
        "Aug") return 8 ;;
        "Sep") return 9 ;;
        "Oct") return 10 ;;
        "Nov") return 11 ;;
        "Dec") return 12 ;;
    esac
}

# $1 is a "Date:" line from git log, $2 is the branch name
function calculateTime() {
    #echo "calculateTime $1, $2"
    # Fields of the Date: line: Date: <weekday> <month> <day> <time> <year> <zone>
    month=$(echo "$1" | awk '{print $3}')
    mapMonthToInt "$month"
    month=$?
    day=$(echo "$1" | awk '{print $4}')
    year=$(echo "$1" | awk '{print $6}')
    currentTime=$(date '+%Y-%m-%d')
    currentYear=$(echo "$currentTime" | awk -F'-' '{print $1}')
    currentMonth=$(echo "$currentTime" | awk -F'-' '{print $2}')
    currentDay=$(echo "$currentTime" | awk -F'-' '{print $3}')
    echo "current time is -> $currentYear:$currentMonth:$currentDay"
    echo "$2 last commit time is -> $year:$month:$day"
    # Rough day counts (365-day years, 31-day months), precise enough for a 60-day cutoff.
    # currentTimeToDays was missing from the original listing; it is reconstructed here
    # by mirroring the mergeTimeToDays formula.
    currentTimeToDays=$(((currentYear-2016) * 365 + (${currentMonth#0} * 31) + ${currentDay#0}))
    mergeTimeToDays=$(((year-2016) * 365 + (${month#0} * 31) + ${day#0}))
    dividerDays=$((currentTimeToDays - mergeTimeToDays))
    echo "dividerDays is $dividerDays"
    if [[ $dividerDays -gt 60 ]]; then
        tryArchiveBranch "$2"
    fi
}

# Passes the line on to calculateTime only if it is the "Date:" line
function parseTime() {
    #echo "parseTime $1, $2"
    dataFilter="Date:"
    result=$(echo "$1" | grep "$dataFilter")
    if [ "$result" != "" ]; then
        calculateTime "$1" "$2"
    fi
}

# $1 is a line such as remotes/origin/<branch>
function parseBranch() {
    #echo "parse branch $1"
    currentBranch=$(echo "$1" | awk -F/ '{print $3}')
    #echo "parseBranch short -> $currentBranch"
    # Dump the latest commit of the branch and scan it for the Date: line
    git log -1 $1 > parseBranchdir/parseBranch.txt
    while read -r line
    do
        parseTime "$line" "$currentBranch"
    done < parseBranchdir/parseBranch.txt
}

#git branch -a | grep remotes/origin/feature_ > featureBranchName.txt
git branch -a | grep remotes/origin > featureBranchName.txt