PyGit: Just enough Git to create a repo, commit, and push itself to GitHub


Summary: PyGit is about 500 lines of Python that implements just enough Git to initialize a repository, add files to the index, commit, and push itself to GitHub. This article covers some of the process of writing it and walks through the code in detail.

Git is known for its very simple object model. When I was learning Git, I discovered that the local object database is just a bunch of plain files in the .git directory. With the exception of the index (.git/index) and pack files (which are optional), the layout and format of these files are fairly simple.

Inspired by Mary Rose Cook's Gitlet, I wanted to see if I could write a Git client that could create a repository, perform a commit, and push itself to a real server like GitHub.

There's a lot to learn from Mary's Gitlet, but because my program has to push itself to GitHub, its goal is a bit different. In some areas Gitlet implements more of Git (including basic merging), but in others it implements less: for example, she uses a simplified text-based index format rather than the binary format Git uses, and while her Gitlet supports push, it only pushes to an existing local repository, not to a remote server.

For the exercise covered in this article, I wanted to write a version that performs all the steps, including pushing to a real Git server. I also use the same binary index format as Git, so I can cross-check my program with git commands at each step.

My program is called PyGit, is written in Python (3.5+), and uses only standard library modules. It's about 500 lines of code, including blank lines and comments. At a minimum I needed to implement init, add, commit, and push, but PyGit also implements status, diff, cat-file, ls-files, and hash-object. The latter commands are useful in their own right, and they also helped when debugging PyGit.

Now, let's take a look at the code! You can view all of pygit.py on GitHub, or follow along as I walk through the pieces below.

Initializing the repository

To initialize a local Git repository, we just create the .git directory plus a few files and subdirectories under it. With read_file and write_file helper functions defined, init() looks like this:

def init(repo):
    """Create directory for repo and initialize .git directory."""
    os.mkdir(repo)
    os.mkdir(os.path.join(repo, '.git'))
    for name in ['objects', 'refs', 'refs/heads']:
        os.mkdir(os.path.join(repo, '.git', name))
    write_file(os.path.join(repo, '.git', 'HEAD'),
               b'ref: refs/heads/master')
    print('initialized empty repository: {}'.format(repo))

You may notice there's no graceful error handling in this code — after all, the entire program is only 500 lines. If the repo directory already exists, the program terminates with a traceback.

Hashing objects

The hash_object function hashes a single object and writes it to the database under the .git/objects directory. There are three types of objects in Git's model: blobs, commits, and trees.

Each object starts with a small header giving the object type and size in bytes as text, followed by a NUL byte, followed by the object's data. The whole thing is zlib-compressed and written to .git/objects/ab/cd..., where ab is the first two characters of the 40-character SHA-1 hex hash and cd... is the remaining 38.

Note that only Python standard library modules are used (os, hashlib, and zlib here).

def hash_object(data, obj_type, write=True):
    """Compute hash of object data of given type and write to object
    store if "write" is True. Return SHA-1 object hash as hex string.
    """
    header = '{} {}'.format(obj_type, len(data)).encode()
    full_data = header + b'\x00' + data
    sha1 = hashlib.sha1(full_data).hexdigest()
    if write:
        path = os.path.join('.git', 'objects', sha1[:2], sha1[2:])
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            write_file(path, zlib.compress(full_data))
    return sha1

There's also a find_object() function that locates an object by hash (or hash prefix), and a read_object() function that reads an object and its type — essentially the inverse of hash_object(). Finally, cat_file is PyGit's equivalent of git cat-file: it pretty-prints an object's contents (or its size or type) to standard output.

The Git index

The next step is adding files to the index, or staging area. The index is a list of file entries ordered by path, each containing the path name, modification time, SHA-1 hash, and so on. Note that the index lists all files in the current tree, not just the files staged for the next commit.

The index is stored as a custom binary format in .git/index. It's not terribly complicated, but it does involve some use of the struct module, plus a bit of a dance to get to the next index entry after the variable-length path field, which is padded to a regular byte offset.

The first 12 bytes of the file are the header, the last 20 bytes are a SHA-1 hash of the rest of the index, and the bytes in between are the index entries, each of which is 62 bytes plus the length of the path plus padding. Here's the IndexEntry namedtuple and the read_index function:

# Data for one entry in the git index (.git/index)
IndexEntry = collections.namedtuple('IndexEntry', [
    'ctime_s', 'ctime_n', 'mtime_s', 'mtime_n', 'dev', 'ino', 'mode',
    'uid', 'gid', 'size', 'sha1', 'flags', 'path',
])

def read_index():
    """Read git index file and return list of IndexEntry objects."""
    try:
        data = read_file(os.path.join('.git', 'index'))
    except FileNotFoundError:
        return []
    digest = hashlib.sha1(data[:-20]).digest()
    assert digest == data[-20:], 'invalid index checksum'
    signature, version, num_entries = struct.unpack('!4sLL', data[:12])
    assert signature == b'DIRC', \
        'invalid index signature {}'.format(signature)
    assert version == 2, 'unknown index version {}'.format(version)
    entry_data = data[12:-20]
    entries = []
    i = 0
    while i + 62 < len(entry_data):
        fields_end = i + 62
        fields = struct.unpack('!LLLLLLLLLL20sH',
                               entry_data[i:fields_end])
        path_end = entry_data.index(b'\x00', fields_end)
        path = entry_data[fields_end:path_end]
        entry = IndexEntry(*(fields + (path.decode(),)))
        entries.append(entry)
        entry_len = ((62 + len(path) + 8) // 8) * 8
        i += entry_len
    assert len(entries) == num_entries
    return entries
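The entry_len calculation at the end of that loop is worth a second look: every entry occupies its 62 fixed bytes plus the path, NUL-terminated and zero-padded up to the next multiple of 8 bytes. A tiny standalone check of the arithmetic:

```python
def index_entry_len(path_len):
    """On-disk length of one index entry: 62 fixed bytes, the path,
    at least one NUL terminator, padded to a multiple of 8 bytes
    (the same formula used in read_index)."""
    return ((62 + path_len + 8) // 8) * 8

print(index_entry_len(len('pygit.py')))  # 62 + 8 + NUL + padding -> 72
```

So an entry for the 8-character path 'pygit.py' takes 72 bytes: 70 bytes of fields, path, and terminator, rounded up to the next 8-byte boundary.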

This function is followed by the ls_files, status, and diff functions — three different ways of printing the state of the index:

  • The ls_files function simply prints all the files in the index (along with their mode and hash if -s is specified)
  • The status function uses get_status() to compare the files in the index to the files in the current directory tree, and prints out which files are modified, new, and deleted
  • The diff function prints a diff of each modified file, showing what's in the index versus what's in the current working copy (using Python's difflib module to do the heavy lifting)

Git is much more efficient at these index operations than my program. I simply use os.walk() to list the complete paths of all files in the directory, do some set operations, and then compare hashes. For example, here's the set comprehension I use to determine the list of changed paths:

changed = {p for p in (paths & entry_paths)
           if hash_object(read_file(p), 'blob', write=False) !=
              entries_by_path[p].sha1.hex()}

Finally, there's a write_index function for writing the index back out. It's called by add(), which adds one or more paths to the index: add() reads the whole index, adds the paths, re-sorts the entries, and writes the index back.
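write_index itself isn't reproduced in this article, but it's essentially read_index in reverse. Here's a hypothetical standalone sketch of the serialization that builds the version-2 index bytes (rather than writing .git/index, so the result can be inspected directly):

```python
import collections
import hashlib
import struct

IndexEntry = collections.namedtuple('IndexEntry', [
    'ctime_s', 'ctime_n', 'mtime_s', 'mtime_n', 'dev', 'ino', 'mode',
    'uid', 'gid', 'size', 'sha1', 'flags', 'path',
])

def build_index(entries):
    """Serialize IndexEntry tuples into version-2 index bytes:
    12-byte header, packed entries padded to 8-byte multiples, and a
    trailing SHA-1 over everything before it (a sketch; the real
    function would write this to .git/index)."""
    packed = []
    for e in entries:
        head = struct.pack('!LLLLLLLLLL20sH',
                           e.ctime_s, e.ctime_n, e.mtime_s, e.mtime_n,
                           e.dev, e.ino, e.mode, e.uid, e.gid,
                           e.size, e.sha1, e.flags)
        path = e.path.encode()
        length = ((62 + len(path) + 8) // 8) * 8
        packed.append(head + path + b'\x00' * (length - 62 - len(path)))
    header = struct.pack('!4sLL', b'DIRC', 2, len(entries))
    data = header + b''.join(packed)
    return data + hashlib.sha1(data).digest()

entry = IndexEntry(0, 0, 0, 0, 0, 0, 0o100644, 0, 0, 6,
                   b'\x00' * 20, len('hello.txt'), 'hello.txt')
index_data = build_index([entry])
```

The result parses back cleanly with the read_index logic above: DIRC signature, version 2, one entry, and a valid trailing checksum.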

Now that we can add files to the index, we're ready to commit.

Committing

Performing a commit consists of writing two objects:

First, a tree object, which is a snapshot of the current directory (or rather, the index) at the time of the commit. A tree recursively lists the hashes of the files and subdirectories in a directory.

So each commit is a snapshot of the entire directory tree. The neat thing about storing things by hash is that if any file in the tree changes, the hash of the entire tree changes too. Conversely, if a file or subdirectory hasn't changed, its hash won't change either, so changes to a directory tree can be stored efficiently.
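A quick way to convince yourself of this property (a standalone sketch; real tree objects also carry the "tree <size>" header that hash_object adds):

```python
import hashlib

def tree_hash(entries):
    """Hash (mode, path, sha1_bytes) entries in the tree-object layout:
    octal mode, space, path, NUL, then 20 raw hash bytes per entry."""
    data = b''.join('{:o} {}'.format(mode, path).encode() + b'\x00' + sha1
                    for mode, path, sha1 in entries)
    return hashlib.sha1(data).hexdigest()

unchanged = [(0o100644, 'README.md', b'\x11' * 20)]
edited = [(0o100644, 'README.md', b'\x22' * 20)]  # blob hash changed

# Identical entries hash identically; one changed blob hash changes
# the whole tree's hash.
print(tree_hash(unchanged) == tree_hash([(0o100644, 'README.md', b'\x11' * 20)]))
print(tree_hash(unchanged) != tree_hash(edited))
```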

Here's an example of a tree object as printed by cat-file pretty 2226 (each line shows file mode, object type, hash, and filename):

100644 blob 4aab5f560862b45d7a9f1370b1c163b74484a24d    LICENSE.txt
100644 blob 43ab992ed09fa756c56ff162d5fe303003b5ae0f    README.md
100644 blob c10cb8bc2c114aba5a1cb20dea4c1597e5a3c193    pygit.py

The write_tree function writes tree objects. One of the odd things about some of the Git file formats is that they're a mix of binary and text — for example, each "line" in a tree object is "mode, space, path" as text, then a NUL byte, then the binary SHA-1 hash. Here's our write_tree():

def write_tree():
    """Write a tree object from the current index entries."""
    tree_entries = []
    for entry in read_index():
        assert '/' not in entry.path, \
            'currently only supports a single, top-level directory'
        mode_path = '{:o} {}'.format(entry.mode, entry.path).encode()
        tree_entry = mode_path + b'\x00' + entry.sha1
        tree_entries.append(tree_entry)
    return hash_object(b''.join(tree_entries), 'tree')
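Later on, push will need read_tree, the inverse of write_tree, which isn't reproduced in this article. A standalone sketch of parsing that mixed text-and-binary layout (operating on raw tree data rather than looking up an object by hash) could look like this:

```python
def read_tree_data(data):
    """Parse raw tree object data (a sketch of read_tree's core): each
    entry is '<octal mode> <path>' as text, a NUL byte, then 20 raw
    SHA-1 bytes. Return a list of (mode, path, sha1_hex) tuples."""
    entries = []
    i = 0
    while True:
        end = data.find(b'\x00', i)
        if end == -1:
            break
        mode_str, path = data[i:end].decode().split(' ', 1)
        entries.append((int(mode_str, 8), path, data[end + 1:end + 21].hex()))
        i = end + 21
    return entries

# Round-trip one entry in exactly the format write_tree produces
entry = '{:o} {}'.format(0o100644, 'pygit.py').encode() + b'\x00' + b'\xab' * 20
print(read_tree_data(entry))
```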

Second, the commit object. It records the tree hash, the parent commit, the author and timestamp, and the commit message. Merging is one of the great things about Git, but PyGit only supports a single linear branch, so there's only ever one parent (or no parent at all for the first commit!).

Here's an example of a commit object, again printed using cat-file pretty aa8d:

tree 22264ec0ce9da29d0c420e46627fa0cf057e709a
parent 03f882ade69ad898aba73664740641d909883cdc
author Ben Hoyt <[email protected]> 1493170892 -0500
committer Ben Hoyt <[email protected]> 1493170892 -0500

Fix cat-file size/type/pretty handling

And here's our commit function — once again, thanks to Git's simple object model, it's fairly straightforward:

def commit(message, author):
    """Commit the current state of the index to master with given
    message. Return hash of commit object.
    """
    tree = write_tree()
    parent = get_local_master_hash()
    timestamp = int(time.mktime(time.localtime()))
    utc_offset = -time.timezone
    author_time = '{} {}{:02}{:02}'.format(
        timestamp,
        '+' if utc_offset > 0 else '-',
        abs(utc_offset) // 3600,
        (abs(utc_offset) // 60) % 60)
    lines = ['tree ' + tree]
    if parent:
        lines.append('parent ' + parent)
    lines.append('author {} {}'.format(author, author_time))
    lines.append('committer {} {}'.format(author, author_time))
    lines.append('')
    lines.append(message)
    lines.append('')
    data = '\n'.join(lines).encode()
    sha1 = hash_object(data, 'commit')
    master_path = os.path.join('.git', 'refs', 'heads', 'master')
    write_file(master_path, (sha1 + '\n').encode())
    print('committed to master: {:7}'.format(sha1))
    return sha1
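The author_time arithmetic packs the local UTC offset into Git's ±HHMM suffix (the -0500 seen in the cat-file output above). Pulled out on its own:

```python
def format_offset(utc_offset):
    """Format a UTC offset in seconds as Git's +HHMM/-HHMM suffix
    (the same arithmetic used inside commit())."""
    return '{}{:02}{:02}'.format(
        '+' if utc_offset > 0 else '-',
        abs(utc_offset) // 3600,
        (abs(utc_offset) // 60) % 60)

print(format_offset(-18000))  # US Eastern, UTC-5    -> '-0500'
print(format_offset(19800))   # India, UTC+5:30      -> '+0530'
```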

Interacting with the server

Now comes the slightly harder part, where we get PyGit to talk to a real, live Git server (I pushed PyGit itself to GitHub, but it also works against Bitbucket and other servers).

The basic idea is to first ask the server what commit its master branch is on, then determine the set of local objects the server is missing, and finally update the remote's commit hash and send a "pack file" containing all the missing objects.

This is called the "smart protocol". GitHub stopped supporting the "dumb" transfer protocol back in 2011; the dumb protocol transfers files straight out of the .git directory, so it would have been easier to implement, but we have to use the smart protocol and pack our objects into a pack file.

For the final stage of the work, I used Python's http.server module to implement a small HTTP server that I could point the real Git client at, so I could see the actual requests and data being exchanged.

pkt-line format

One of the key parts of the transfer protocol is the "pkt-line" format, a length-prefixed packet format used to send metadata such as commit hashes. Each "line" begins with a 4-digit hexadecimal length value that includes the 4 bytes of the length field itself, followed by that many bytes minus 4 of data. Lines usually end with an LF byte. The special length 0000 is used as a section marker and at the end of the data.
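A minimal encoder makes the length rule concrete: the payload b'hello\n' is 6 bytes, so with its own 4-byte prefix the line is 10 = 0x000a bytes long.

```python
def pkt_line(payload):
    """Encode one pkt-line: four hex digits giving the total length
    (payload plus the 4 bytes of the length field itself), then the
    payload."""
    return '{:04x}'.format(len(payload) + 4).encode() + payload

print(pkt_line(b'hello\n'))                       # b'000ahello\n'
print(pkt_line(b'# service=git-receive-pack\n'))
```

Applied to the 27-byte payload b'# service=git-receive-pack\n', this gives a 31 = 0x1f byte line, which matches the 001f prefix in the server response shown next.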

For example, here's GitHub's response to a git-receive-pack GET request. Note that the extra line breaks and indentation are not part of the actual data:

001f# service=git-receive-pack\n
0000
00b20000000000000000000000000000000000000000 capabilities^{}\x00
    report-status delete-refs side-band-64k quiet atomic ofs-delta
    agent=git/2.9.3~peff-merge-upstream-2-9-1788-gef730f7\n
0000

So we need two conversion functions: one to convert pkt-line data to a list of lines, and one to convert a list of lines to pkt-line format:

def extract_lines(data):
    """Extract list of lines from given server data."""
    lines = []
    i = 0
    for _ in range(1000):
        line_length = int(data[i:i + 4], 16)
        line = data[i + 4:i + line_length]
        lines.append(line)
        if line_length == 0:
            i += 4
        else:
            i += line_length
        if i >= len(data):
            break
    return lines

def build_lines_data(lines):
    """Build byte string from given lines to send to server."""
    result = []
    for line in lines:
        result.append('{:04x}'.format(len(line) + 5).encode())
        result.append(line)
        result.append(b'\n')
    result.append(b'0000')
    return b''.join(result)

Implementing HTTPS requests

Because I only wanted to use standard library modules, the following code implements authenticated HTTPS requests without the requests library:

def http_request(url, username, password, data=None):
    """Make an authenticated HTTP request to given URL (GET by default,
    POST if "data" is not None).
    """
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_manager.add_password(None, url, username, password)
    auth_handler = urllib.request.HTTPBasicAuthHandler(password_manager)
    opener = urllib.request.build_opener(auth_handler)
    f = opener.open(url, data=data)
    return f.read()

The above is a good demonstration of why the requests library exists. You can do it with the standard library's urllib.request module, but it's sometimes painful. Most of the Python standard library is great; some corners, not so much — though there aren't many. With requests you wouldn't even need a helper function:

def http_request(url, username, password):
    response = requests.get(url, auth=(username, password))
    response.raise_for_status()
    return response.content

We can use the above to ask the server what commit its master branch is on, like so (this function is fragile, but could easily be made more general):

def get_remote_master_hash(git_url, username, password):
    """Get commit hash of remote master branch, return SHA-1 hex string
    or None if no remote commits.
    """
    url = git_url + '/info/refs?service=git-receive-pack'
    response = http_request(url, username, password)
    lines = extract_lines(response)
    assert lines[0] == b'# service=git-receive-pack\n'
    assert lines[1] == b''
    if lines[2][:40] == b'0' * 40:
        return None
    master_sha1, master_ref = lines[2].split(b'\x00')[0].split()
    assert master_ref == b'refs/heads/master'
    assert len(master_sha1) == 40
    return master_sha1.decode()

Determining the missing objects

Next, we need to determine which objects the server needs that it doesn't already have. PyGit assumes it has everything locally (it doesn't support "pulling"), so I wrote a read_tree function (the opposite of write_tree) and then these two functions to recursively find the set of object hashes in a given tree and a given commit:

def find_tree_objects(tree_sha1):
    """Return set of SHA-1 hashes of all objects in this tree
    (recursively), including the hash of the tree itself.
    """
    objects = {tree_sha1}
    for mode, path, sha1 in read_tree(sha1=tree_sha1):
        if stat.S_ISDIR(mode):
            objects.update(find_tree_objects(sha1))
        else:
            objects.add(sha1)
    return objects

def find_commit_objects(commit_sha1):
    """Return set of SHA-1 hashes of all objects in this commit
    (recursively), its tree, its parents, and the hash of the commit
    itself.
    """
    objects = {commit_sha1}
    obj_type, commit = read_object(commit_sha1)
    assert obj_type == 'commit'
    lines = commit.decode().splitlines()
    tree = next(l[5:45] for l in lines if l.startswith('tree '))
    objects.update(find_tree_objects(tree))
    parents = (l[7:47] for l in lines if l.startswith('parent '))
    for parent in parents:
        objects.update(find_commit_objects(parent))
    return objects

Then all we need is the set of objects referenced by the local commit, minus the set referenced by the remote commit. The difference is the objects missing at the remote. There are certainly more efficient ways to compute this set, but this logic is plenty good enough for PyGit:

def find_missing_objects(local_sha1, remote_sha1):
    """Return set of SHA-1 hashes of objects in local commit that are
    missing at the remote (based on the given remote commit hash).
    """
    local_objects = find_commit_objects(local_sha1)
    if remote_sha1 is None:
        return local_objects
    remote_objects = find_commit_objects(remote_sha1)
    return local_objects - remote_objects

Pushing itself

To do the push, we need to send a pkt-line request saying "update the master branch to this commit hash", followed by a pack file containing all of the missing objects found above.

The pack file has a 12-byte header (starting with PACK), then each object encoded with a variable-length size header and its zlib-compressed data, and finally a 20-byte SHA-1 hash of the entire pack file. There are smarter ways to make the pack smaller using delta encoding between objects, but that's over-engineering for our purposes:

def encode_pack_object(obj):
    """Encode a single object for a pack file and return bytes
    (variable-length header followed by compressed data bytes).
    """
    obj_type, data = read_object(obj)
    type_num = ObjectType[obj_type].value
    size = len(data)
    byte = (type_num << 4) | (size & 0x0f)
    size >>= 4
    header = []
    while size:
        header.append(byte | 0x80)
        byte = size & 0x7f
        size >>= 7
    header.append(byte)
    return bytes(header) + zlib.compress(data)

def create_pack(objects):
    """Create pack file containing all objects in given set of SHA-1
    hashes, return data bytes of full pack file.
    """
    header = struct.pack('!4sLL', b'PACK', 2, len(objects))
    body = b''.join(encode_pack_object(o) for o in sorted(objects))
    contents = header + body
    sha1 = hashlib.sha1(contents).digest()
    data = contents + sha1
    return data
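The variable-length size header is easiest to understand with its inverse next to it. This standalone sketch re-creates the encoder's header loop and decodes it back (type_num here is Git's numeric object type, e.g. 1 for commit, 2 for tree, 3 for blob):

```python
def encode_size_header(type_num, size):
    """Same loop as in encode_pack_object: low 4 bits of size go in the
    first byte alongside the type, then 7 bits per following byte."""
    byte = (type_num << 4) | (size & 0x0f)
    size >>= 4
    header = []
    while size:
        header.append(byte | 0x80)  # high bit set: more bytes follow
        byte = size & 0x7f
        size >>= 7
    header.append(byte)
    return bytes(header)

def decode_size_header(data):
    """Inverse: read the type and size back out of the header bytes,
    least-significant bit group first."""
    byte = data[0]
    obj_type = (byte >> 4) & 0x07
    size = byte & 0x0f
    shift = 4
    i = 1
    while byte & 0x80:
        byte = data[i]
        size |= (byte & 0x7f) << shift
        shift += 7
        i += 1
    return obj_type, size

print(decode_size_header(encode_size_header(3, 123456)))  # (3, 123456)
```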

And then the final step, push() itself (I've removed a few lines of status output for brevity):

def push(git_url, username, password):
    """Push master branch to given git repo URL."""
    remote_sha1 = get_remote_master_hash(git_url, username, password)
    local_sha1 = get_local_master_hash()
    missing = find_missing_objects(local_sha1, remote_sha1)
    lines = ['{} {} refs/heads/master\x00 report-status'.format(
        remote_sha1 or ('0' * 40), local_sha1).encode()]
    data = build_lines_data(lines) + create_pack(missing)
    url = git_url + '/git-receive-pack'
    response = http_request(url, username, password, data=data)
    lines = extract_lines(response)
    assert lines[0] == b'unpack ok\n', \
        "expected line 1 b'unpack ok', got: {}".format(lines[0])

Command line parsing

PyGit, with its subcommands (pygit init, pygit commit, etc.), is a simple example of using the standard library's argparse module. I won't copy the code here, but you can check out the argparse section of the source on GitHub.

Using PyGit

In most places I've tried to keep PyGit's command-line syntax identical or nearly identical to git's. Here are the commands used to commit PyGit to GitHub:

$ python3 misc/pygit.py init pygit
initialized empty repository: pygit

$ cd pygit

# ... write and test pygit.py using a test repo ...

$ python3 pygit.py status
new files:
    pygit.py

$ python3 pygit.py add pygit.py

$ python3 pygit.py commit -m "First working version of pygit"
committed to master: 00d56c2a774147c35eeb7b205c0595cf436bf2fe

$ python3 pygit.py cat-file commit 00d5
tree 7758205fe7dfc6638bd5b098f6b653b2edd0657b
author Ben Hoyt <[email protected]> 1493169321 -0500
committer Ben Hoyt <[email protected]> 1493169321 -0500

First working version of pygit

# ... make some changes ...

$ python3 pygit.py status
changed files:
    pygit.py

$ python3 pygit.py diff
--- pygit.py (index)
+++ pygit.py (working copy)
@@ -100,8 +100,9 @@
     """
     obj_type, data = read_object(sha1_prefix)
     if mode in ['commit', 'tree', 'blob']:
-        assert obj_type == mode, 'expected object type {}, got {}'.format(
-            mode, obj_type)
+        if obj_type != mode:
+            raise ValueError('expected object type {}, got {}'.format(
+                mode, obj_type))
         sys.stdout.buffer.write(data)
     elif mode == '-s':
         print(len(data))

$ python3 pygit.py add pygit.py

$ python3 pygit.py commit -m "Graceful error exit for cat-file with bad object type"
committed to master: 4117234220d4e9927e1a626b85e33041989252b5

$ python3 pygit.py push https://github.com/benhoyt/pygit.git
updating remote master from no commits to 4117234220d4e9927e1a626b85e33041989252b5 (6 objects)

Conclusion

That's all there is to it! If you've read from start to finish, you've just skimmed through 500 lines of Python of no real value. Oh wait — except for its value as an exercise in education and craftsmanship. Hopefully you've learned something about Git's internals along the way.