The preface

As a tool you use every day, you must be familiar with the basic uses of Git, such as:

  • withgit-blameFind out which colleague introduced a particular line of bugs, and let him take the blame;
  • withgit-mergeMerging someone else’s code into your flawless branch and then discovering that unit tests don’t work;
  • withgit-push -fOverwrite all submissions from the rest of the team.

Git is also a key-value database with versioning capabilities:

  • All submissions are stored in the directory.git/objects/Under;
  • The contents of files are storedblobObject, store file metadatatreeObject, and store the commit recordcommitObjects, etc.;
  • Git provides key-value style read and write commandsgit-cat-fileandgit-hash-object.

As those of you who read my previous post on git Merge know, if a merge is not fast-forward, a new commit object with two parent commit objects is created. With well-known Go language Web framework gin warehouse as an example, the hash value for e38955615a14e567811e390c87afe705df957f3a submission is a merger, there are two lines of the parent in the content of the submitted

➜ gin git: (master) git cat - file - p 'e38955615a14e567811e390c87afe705df957f3a' tree 93e5046e502847a6355ed26223a902b4de2de7c7 parent ad087650e9881c93a19fd8db75a86968aa998cac parent ce26751a5a3ed13e9a6aa010d9a7fa767de91b8c author Javier Provecho Fernandez <[email protected]> 1499534953 +0200 committer Javier Provecho Fernandez <[email protected]> 1499535020 +0200 Merge pull request #520 from 178inaba/travis-import_pathCopy the code

All submitted objects form a directed acyclic graph with a submitted parent attribute. But you’re smart enough to realize that the git-log output is linear, so Git uses some sort of graph traversal.

Look at man Git-log, which can be seen in the Commit Ordering section

By default, the commits are shown in reverse chronological order.

You are smart enough to already know how to implement the traversal algorithm for this graph.

Write one yourselfgit-log

parsingcommitobject

To print the commit object’s information in the correct order, you need to parse it first. Instead of opening the file, reading the byte stream, and unpacking the file itself from scratch, we simply call git-cat-file as we did above. Git-cat-file prints some of the contents that need to be extracted for later use:

  • In order toparentThe opening line. The hash of this row is used to locate a node in a directed acyclic graph;
  • In order tocommitterThe opening line. This line of UNIX timestamps will be used to determine who is “next.”

You can write a Python class to parse a COMMIT object

class CommitObject:
    The result of parsing an object of type commit in Git. ""
    def __init__(self, *, commit_id: str) - >None:
        self.commit_id = commit_id

        file_content = self._cat_file(commit_id)
        self.parents = self._parse_parents(file_content)
        self.timestamp = self._parse_commit_timestamp(file_content)

    def _cat_file(self, commit_id: str) - >str:
        cmd = ['git'.'cat-file'.'-p', commit_id]
        return subprocess.check_output(cmd).decode('utf-8')

    def _parse_commit_timestamp(self, file_content: str) - >int:
        """ Resolve the submitted UNIX timestamp." ""
        lines = file_content.split('\n')
        for line in lines:
            if line.startswith('committer '):
                m = re.search('committer .+ <[^ ]+> ([0-9]+)', line.strip())
                return int(m.group(1))

    def _parse_parents(self, file_content: str) - >List[str] :
        lines = file_content.split('\n')
        parents: List[str] = []
        for line in lines:
            if line.startswith('parent '):
                m = re.search('parent (.*)', line.strip())
                parent_id = m.group(1)
                parents.append(parent_id)
        return parents
Copy the code

traversecommitDirected acyclic graph — large root heap

Congratulations, that data structure you’ve learned can come in handy.

Assumption in the above class CommitObject parsing the hash values are submitted e38955615a14e567811e390c87afe705df957f3a gin, then its parents attributes, there are two strings:

  • ad087650e9881c93a19fd8db75a86968aa998cac;
  • ce26751a5a3ed13e9a6aa010d9a7fa767de91b8c.

Among them:

  • The hash value isad087650e9881c93a19fd8db75a86968aa998cacThe submission time ofSat Jul 8 12:31:44;
  • The hash value isce26751a5a3ed13e9a6aa010d9a7fa767de91b8cThe submission time ofJan 28 02:32:44.

Obviously, according to the inversion of time sequence (reverse chronological) print log, is the next print node should be ad087650e9881c93a19fd8db75a86968aa998cac – use git – log command can confirm this.

After printing ad087650e9881c93a19fd8db75a86968aa998cac, and from its parent submission and ce26751a5a3ed13e9a6aa010d9a7fa767de91b8c, pick out the next to print submission object. Obviously, this is a recurring process:

  1. From the one to printcommitObject, find the one with the largest commit timestamp;
  2. Prints its messages;
  3. willcommitAdd all parent commits to the object pool to be printed, go back to step 1;

This process continues until there are no More Commit objects to print, and all commit objects to print form a priority queue — which can be implemented with a large root heap.

However, I’m not going to actually implement a heap data structure in this short demo — I’m going to use insert sort instead.

class MyGitLogPrinter() :
    def __init__(self, *, commit_id: str, n: int) - >None:
        self.commits: List[CommitObject] = []
        self.times = n

        commit = CommitObject(commit_id=commit_id)
        self._enqueue(commit)

    def run(self) :
        i = 0
        while len(self.commits) > 0 and i < self.times:
            commit = self.commits.pop(0)

            for parent_id in commit.parents:
                parent = CommitObject(commit_id=parent_id)
                self._enqueue(parent)

            print('{} {}'.format(commit.commit_id, commit.timestamp))
            i += 1

    def _enqueue(self, commit: CommitObject) :
        for comm in self.commits:
            if commit.commit_id == comm.commit_id:
                return
        Insert a new node at the index I by moving it from I to the last element.
        i = 0
        while i < len(self.commits):
            if commit.timestamp > self.commits[i].timestamp:
                break
            i += 1
        self.commits = self.commits[0:i] + [commit] + self.commits[i:]
Copy the code

Finally provide a start function to experience

@click.command()
@click.option('--commit-id', required=True)
@click.option('-n', default=20)
def cli(commit_id: str, n: int) :
    MyGitLogPrinter(commit_id=commit_id, n=n).run()


if __name__ == '__main__':
    cli()
Copy the code

Real Monkey King vs. fake Monkey King

To see if the above code prints the commit objects in the correct order, I first redirect its output to a file

➜  gin git:(master) python3 ~/SourceCode/python/my_git_log/my_git_log.py --commit-id 'e38955615a14e567811e390c87afe705df957f3a' -n 20 > /tmp/my_git_log.txt
Copy the code

Print it out in the same format with git-log

➜ gin git (master) the git log - pretty = 'format: % H % ct' 'e38955615a14e567811e390c87afe705df957f3a' - n > 20 / TMP/git_log. TXTCopy the code

Finally, let the diff command tell us if the two files are different

➜ gin git diff: (master)/TMP/git_log. TXT/TMP/my_git_log. TXT 20 c20 < 2521 d8246d9813d65700650b29e278a08823e3ae 1499266911  \ No newline at end of file ---> 2521d8246d9813d65700650b29e278a08823e3ae 1499266911
Copy the code

It’s almost the same.

Read the original