How are Git's objects linked together?
Table of Contents
- Git commits connect to root trees
- Git root trees connect to blobs and subtrees
- Git subtree nesting
- Git stores a whole new blob for every version of every file
- Git trees point to existing blobs for files that haven't changed
- Visualizing Git objects as a hierarchy under the commit chain
- Next Steps
Most Git users are familiar with the way each commit refers to one or more parent commits, aside from the initial commit which has no parents.
This results in commit chains that we call branches. It also results in a structure known as a Directed Acyclic Graph (DAG): a graph whose nodes refer to other nodes via directed edges, with no cycles possible.
However, most Git users don't realize that Git's other core objects - trees and blobs - are connected to each other and to commits via similar types of references.
In this article, we'll discuss how Git commits connect to Git trees, and furthermore how those trees connect to blobs, forming a connected web of objects. This is what allows Git to perform all of the amazing tasks that we developers use it for every day.
Git commits connect to root trees
As a casual Git user, you might not know that every commit you make references a root tree. We have a whole article about trees if you need a refresher.
A root tree is just the top-level tree object that a commit points to. Every commit points to one and only one root tree, which specifies the collection of objects that the commit is a "snapshot" of. The root tree is created directly from the file changes in the staging area at the time of the commit.
Like all Git objects, each root tree is stored in Git's object database and is identified by a unique SHA-1 hash value of its content. This SHA-1 value is also known as the object ID. Each commit references its root tree by literally recording the object ID of the tree along with the commit's content.
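To make the idea of an object ID concrete, here is a minimal Python sketch of how such an ID can be computed. It assumes the behavior of current Git versions using SHA-1, where the hash covers a short "<type> <size>\0" header followed by the object's content. The function name object_id is our own, chosen for illustration.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """Return the hex SHA-1 that identifies an object of this type and body."""
    # Assumption: the hashed data is "<type> <size-in-bytes>\0" plus the raw body.
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # Same content always yields the same ID; any change yields a different one.
    print(object_id("blob", b"hello world\n"))
    print(object_id("blob", b"hello world!\n"))
```

If you have Git installed, you can compare the first printed value with what git hash-object reports for a file containing the same bytes.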
A commit object has the format seen below, which includes the object ID of the root tree, along with the rest of the commit data:
commit <size-of-commit-data-in-bytes>'\0'
<root-tree-SHA1-hash>
<parent-1-commit-id>
<parent-2-commit-id>
...
<parent-N-commit-id>
author ID email date
committer ID email date
user comment
By storing the root tree's object ID as a part of the commit, Git can understand which file content is associated with that commit, as we'll see next.
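As a rough illustration, the sketch below pulls the root tree ID and parent IDs out of a commit's content. It assumes the text layout current Git versions use for commit content (one "tree" line, zero or more "parent" lines, then author, committer, a blank line, and the message). The function parse_commit, the author details, and the all-"a" parent ID are made up for the example.

```python
def parse_commit(body: bytes) -> dict:
    """Split commit content into its root tree ID, parent IDs, and message."""
    # Header lines and the message are separated by one blank line.
    header, _, message = body.partition(b"\n\n")
    tree = None
    parents = []
    for line in header.split(b"\n"):
        key, _, value = line.partition(b" ")
        if key == b"tree":
            tree = value.decode("ascii")
        elif key == b"parent":
            parents.append(value.decode("ascii"))
    return {"tree": tree, "parents": parents, "message": message.decode("utf-8")}


# Hypothetical commit content; the IDs and identity are placeholders.
FAKE_PARENT = "a" * 40
example = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    f"parent {FAKE_PARENT}\n"
    "author A U Thor <author@example.com> 1700000000 +0000\n"
    "committer A U Thor <author@example.com> 1700000000 +0000\n"
    "\n"
    "Initial snapshot\n"
).encode("utf-8")

if __name__ == "__main__":
    print(parse_commit(example))
```

A merge commit would simply carry more than one "parent" line, which is why the parents come back as a list.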
Git root trees connect to blobs and subtrees
Since root trees in Git are just regular trees, they reference the set of blobs and subtrees that are included in a commit. Now would be a good time for a refresher on blobs, if you need one.
Here is the format of a tree as stored by Git:
tree <size-of-tree-in-bytes>\0
<file-1-permission> <file-1-name>\0<file-1-blob-object-id>
<file-2-permission> <file-2-name>\0<file-2-blob-object-id>
...
<file-n-permission> <file-n-name>\0<file-n-blob-object-id>
As you can see, a tree is just a list of file permissions, file names, and their corresponding blob object IDs. The tree helps Git understand which blobs are included in each commit.
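Here is a small sketch of reading and writing that entry list in Python. It assumes the layout used by current Git versions with SHA-1, where each entry is the mode as ASCII text, a space, the name, a NUL byte, and then the object ID as 20 raw bytes rather than 40 hex characters. The helper names serialize_tree and parse_tree and the sample IDs are illustrative only, and the sketch does not reproduce Git's entry ordering rules.

```python
def serialize_tree(entries):
    """Build tree content from (mode, name, hex_object_id) tuples."""
    body = b""
    for mode, name, hex_id in entries:
        body += mode.encode("ascii") + b" " + name.encode("utf-8")
        body += b"\x00" + bytes.fromhex(hex_id)  # 20 raw bytes for SHA-1
    return body


def parse_tree(body):
    """Recover (mode, name, hex_object_id) tuples from tree content."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)      # end of the mode
        nul = body.index(b"\x00", space)   # end of the name
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        hex_id = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, hex_id))
        pos = nul + 21                     # jump past the 20-byte ID
    return entries


if __name__ == "__main__":
    sample = [
        ("100644", "README.md", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"),
        ("40000", "src", "4b825dc642cb6eb9a060e54bf8d69288fbee4904"),
    ]
    print(parse_tree(serialize_tree(sample)))
```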
Git subtree nesting
Note that trees can also reference other trees (known as subtrees), which in turn reference other blobs and even deeper subtrees. You can think of this as a sort of hierarchy or family tree in which all files and folders in a commit are strung together.
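The nesting can be sketched as a recursive walk. In the Python example below, a plain dictionary stands in for Git's object database, the IDs ("root", "tree-src", "blob-1", and so on) are readable placeholders rather than real hashes, and we assume mode "40000" marks an entry that points to a subtree, as it does in current Git versions.

```python
SUBTREE_MODE = "40000"  # assumption: the mode Git uses for directory entries


def list_files(store, tree_id, prefix=""):
    """Return (path, blob_id) pairs for every file reachable from a tree."""
    paths = []
    for mode, name, obj_id in store[tree_id]:
        path = prefix + name
        if mode == SUBTREE_MODE:
            # Descend into the subtree, extending the path as we go.
            paths.extend(list_files(store, obj_id, path + "/"))
        else:
            paths.append((path, obj_id))
    return paths


# Hypothetical object store: tree ID -> list of (mode, name, object ID).
store = {
    "root": [("100644", "README.md", "blob-1"), ("40000", "src", "tree-src")],
    "tree-src": [("100644", "main.py", "blob-2"), ("40000", "util", "tree-util")],
    "tree-util": [("100644", "io.py", "blob-3")],
}

if __name__ == "__main__":
    for path, blob_id in list_files(store, "root"):
        print(path, "->", blob_id)
```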
Git stores a whole new blob for every version of every file
Many Git users think that Git stores a series of diffs or file changes, which are cumulatively applied or subtracted in order to reconstruct any version of a file. However, this is not true at all. Although many legacy version control systems like SCCS and RCS did things this way, it turns out to be very slow to apply all those diffs when the history gets very large.
For this reason, Git stores a totally new blob for every version of every file that you commit. This makes reconstructing file content as simple and fast as accessing the corresponding blob's content.
This may sound like a waste of storage, especially if some files only have very small changes. But Git uses multiple layers of compression to make up for this. The zlib library is used to compress each individual Git object. Furthermore, individual objects are compressed into pack files. This enables similar objects (like blobs derived from files with only minor changes) to be efficiently stored and transferred across networks.
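The first compression layer is easy to sketch with Python's zlib module. The example assumes an individual object is stored as the zlib-compressed bytes of its "<type> <size>\0" header plus content, which matches current Git versions. The function names are ours, and pack files are not modeled here.

```python
import zlib


def encode_object(obj_type: str, body: bytes) -> bytes:
    """Compress header plus content the way a single stored object is kept."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return zlib.compress(header + body)


def decode_object(data: bytes):
    """Decompress a stored object and split it back into type and content."""
    raw = zlib.decompress(data)
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type, body


if __name__ == "__main__":
    content = b"the same line over and over\n" * 200
    stored = encode_object("blob", content)
    print(len(content), "bytes of content stored in", len(stored), "bytes")
    print(decode_object(stored)[0])
```

Repetitive content like the sample above shrinks considerably, while already-compressed data such as images gains little from this layer.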
Git trees point to existing blobs for files that haven't changed
However, for files that are sitting in your working directory with no changes when you make a commit, Git can just re-use the existing blob that's already sitting in the object database. This is done by referencing the existing blob object ID in the root tree when a commit is made. All changed files will have a new blob generated and included in the tree, but unchanged files will simply use their existing blob.
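Blob reuse falls out naturally from naming objects by their content. The toy Python model below is only a sketch: ObjectStore and snapshot are hypothetical names, and the model re-hashes every file on each snapshot, whereas real Git relies on its staging area to avoid that work. It does show that two snapshots sharing an unchanged file end up pointing at the same blob.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: object ID -> content."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        header = f"blob {len(content)}".encode("ascii") + b"\x00"
        oid = hashlib.sha1(header + content).hexdigest()
        # Identical content maps to the same ID, so nothing new is stored.
        self.objects.setdefault(oid, content)
        return oid


def snapshot(store, files):
    """Return a flat 'tree' mapping each file name to its blob ID."""
    return {name: store.put_blob(content) for name, content in files.items()}


store = ObjectStore()
first = snapshot(store, {"a.txt": b"one\n", "b.txt": b"two\n"})
second = snapshot(store, {"a.txt": b"one\n", "b.txt": b"two, edited\n"})

if __name__ == "__main__":
    print("a.txt reused:", first["a.txt"] == second["a.txt"])
    print("b.txt reused:", first["b.txt"] == second["b.txt"])
    print("blobs stored:", len(store.objects))
```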
Visualizing Git objects as a hierarchy under the commit chain
Putting it all together, I'd like you to try to visualize what Git's root trees, subtrees, and blobs look like.
- Start by picturing a simple chain of 3 commits (i.e. a small branch). Each commit is a circle with an arrow pointing back to its parent.
- Next, imagine a root tree as a triangle sitting underneath each commit circle. An arrow points from each commit to its root tree.
- Lastly, visualize a set of 2-3 blobs as boxes sitting underneath each root tree, with arrows pointing from the tree to each blob.
I find that mentally visualizing this "marionette" structure helps me get a feel for how Git's objects all fit together under the hood.
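If you prefer to see the arrows written down, the picture above can be sketched as a list of edges in Python. The labels (C1, T1, B1, and so on) are placeholders rather than real object IDs, and the choice of which blobs each tree shares is made up for illustration.

```python
# Three commits in a chain; each records its parent and its root tree.
commits = {
    "C1": {"parent": None, "tree": "T1"},
    "C2": {"parent": "C1", "tree": "T2"},
    "C3": {"parent": "C2", "tree": "T3"},
}

# Each root tree lists its blobs; B1 and B3 are shared between snapshots.
trees = {
    "T1": ["B1", "B2"],
    "T2": ["B1", "B3"],
    "T3": ["B4", "B3"],
}


def edges(commits, trees):
    """Return every arrow in the picture as a (from, to) pair."""
    result = []
    for commit_id, commit in commits.items():
        if commit["parent"] is not None:
            result.append((commit_id, commit["parent"]))  # commit -> parent
        result.append((commit_id, commit["tree"]))        # commit -> root tree
        for blob_id in trees[commit["tree"]]:
            result.append((commit["tree"], blob_id))      # tree -> blob
    return result


if __name__ == "__main__":
    for source, target in edges(commits, trees):
        print(source, "->", target)
```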
In this article, we discussed how Git's core objects - blobs, trees, and commits - all link together to form a connected web.
We saw how each commit references a root tree, and each root tree references a set of blobs and possibly subtrees. A new blob is created for each and every changed version of a file, and trees will make use of existing blobs where possible for storage efficiency.
This is all best visualized as a web of objects flowing out from under a chain of commits, with arrows representing the references between Git's various objects.
If you're interested in learning more about how Git works under the hood, check out our Baby Git Guidebook for Developers, which dives into Git's code in an accessible way. We wrote it for curious developers to learn how Git works at the code level. To do this, we documented the first version of Git's code and discussed it in detail.
We hope you enjoyed this post! Feel free to shoot me an email at firstname.lastname@example.org with any questions or comments.
- Git SCM Packfiles - https://git-scm.com/book/en/v2/Git-Internals-Packfiles