What is a tree in Git?

Introduction

This is Part 2 of our mini-series on Git's core objects - blobs, trees, commits, and annotated tags. Part 1 answered the question what is a blob in Git.

In this article, we'll discuss everything you need to know about trees in Git.

What is a Git tree object?

A tree in Git associates blobs with their actual file path/name and permissions.

Without trees, Git would have no way of identifying which tracked files correspond to the content stored in their blobs. Remember that a blob is purely a file's binary content (which is compressed using Zlib before storage in the object database). Blobs contain no information that directly links them to a file path and name on your filesystem - this is the job of a Git tree.

A tree can also be thought of as representing a directory, because it links together a list of blobs with their corresponding names and permissions as they existed at a particular time. In current versions of Git, each directory in a commit gets its own tree object, and a tree entry can point to another tree (a subdirectory) as well as to a blob.

Where are Git trees stored?

Trees are stored in Git's repository, also known as the object store. This is located at the path .git/objects/ in your project root directory.

What is the Git tree format?

A tree in Git is built up by Git's code in a memory buffer in the following format:

tree <size-of-tree-in-bytes>\0
<file-1-mode> <file-1-path>\0<file-1-blob-hash>
<file-2-mode> <file-2-path>\0<file-2-blob-hash>
...
<file-n-mode> <file-n-path>\0<file-n-blob-hash>

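The tree format starts with the object's type, which is just the string 'tree', followed by a space, the size of the tree's content in bytes (written as a decimal number), and a null byte. The line breaks shown above are only for readability; in the stored object, the entries follow one another directly with no separator.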

Next comes a series of "cache entries" for the tree, each made up of the following 3 pieces of information:

  1. A file's mode (permissions)
  2. A file's path/name
  3. SHA-1 hash of the file's blob (stored as 20 raw bytes, not as 40 hex characters)

This allows Git to link up specific blobs in the database with filenames on your system. A tree can hold as many of these blob mappings as you need, depending on the size of your working directory.

When does Git create trees?

When you use the git add command to tell Git to track a new file or add file modifications to the staging area, Git first creates a blob for the changed file and puts it in the local repository.

Next, Git will create a "cache entry" for that blob in the staging area, which in reality is just a plain old nothing-special file called .git/index. The cache entry stores the filename/path, permissions, and hash of the blob.

When you run git commit, a proper tree (in the format above) is built up in memory from the cache entries in the index file, and referenced as the "root tree" in the new commit object. Every commit points to a "root tree" object that represents a snapshot of the tracked content as it was staged in the index at that time.
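The step from index to tree can be sketched in a few lines of Python. This is a simplified model, not Git's actual code: the index is represented as a plain dict mapping a slash-separated path to a (mode, blob ID) pair, and write_tree and hash_object are hypothetical names. It assumes the behavior of current Git versions, where each directory becomes its own tree object.

```python
import hashlib

def hash_object(kind, body):
    """SHA-1 over "<kind> <size>\\0" followed by the body."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def write_tree(index, store):
    """Turn flat index entries into nested trees; return the root tree ID.

    index: dict of path -> (mode, blob_id), paths use "/" separators.
    store: dict that collects tree_id -> tree body (stands in for the
           object database).
    """
    entries = []   # (mode, name, object_id) for this directory
    subdirs = {}   # directory name -> sub-index for that directory
    for path, (mode, blob_id) in index.items():
        head, sep, rest = path.partition("/")
        if sep:
            subdirs.setdefault(head, {})[rest] = (mode, blob_id)
        else:
            entries.append((mode, head, blob_id))
    for name, sub_index in subdirs.items():
        # Each subdirectory becomes its own tree, referenced by mode 40000.
        entries.append(("40000", name, write_tree(sub_index, store)))

    # Directories sort as if their name had a trailing slash.
    entries.sort(key=lambda e: (e[1] + "/" if e[0] == "40000" else e[1]).encode())
    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
        for mode, name, oid in entries
    )
    tree_id = hash_object("tree", body)
    store[tree_id] = body
    return tree_id

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
index = {
    "README": ("100644", EMPTY_BLOB),
    "src/main.c": ("100644", EMPTY_BLOB),
}
store = {}
root = write_tree(index, store)
print(root, len(store))
```

Here the two index entries produce two tree objects: one for src and one root tree that points to it.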

How are Git trees stored?

Once the proper tree is built up in memory, Git calculates the SHA-1 hash of the tree. The first version of Git used the OpenSSL SHA library for this.

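Then Git compresses the tree using Zlib, and stores the tree in Git's object store as a loose object. The tree is named using its SHA-1 hash, which is also used to organize it into the correct folder in Git's content addressable database: the first two hex characters of the hash name a subfolder of .git/objects/, and the remaining 38 characters name the file inside it.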
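As a rough illustration of the hashing, compression, and naming steps, the following Python sketch computes where a loose object would live and what bytes would be written there. The helper name loose_object is hypothetical, and the sketch only returns the values instead of touching a real repository.

```python
import hashlib
import zlib

def loose_object(kind, body):
    """Return (object_id, relative_path, compressed_bytes) for an object.

    Illustrative only: hashes the uncompressed header plus body,
    then compresses that same data with zlib.
    """
    data = kind.encode() + b" " + str(len(body)).encode() + b"\0" + body
    oid = hashlib.sha1(data).hexdigest()
    # First 2 hex characters pick the folder, the other 38 name the file.
    path = ".git/objects/" + oid[:2] + "/" + oid[2:]
    return oid, path, zlib.compress(data)

oid, path, compressed = loose_object("tree", b"")
print(path)
```

Note that the hash is taken over the uncompressed data, so the object ID doesn't depend on the compression level used.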

Does Git delete trees?

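No, not while anything still refers to them. A tree only becomes eligible for deletion when no reachable commit or parent tree points to it, for example when the commit tied to that tree becomes orphaned. If this does happen, Git's garbage collector can delete the tree once a grace period has elapsed.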

Can Git reuse trees across commits?

Yes! Git will never create the same object twice in your object database! So if multiple commits represent snapshots of the exact same set of files - with unchanged content - Git will just reuse the existing tree that represents that set.
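This reuse falls out of content addressing: an object's name is the hash of its content, so identical content always maps to the same name. Here is a small Python sketch of that idea, using a hypothetical in-memory ObjectStore class in place of the real .git/objects/ folder.

```python
import hashlib

class ObjectStore:
    """Toy content-addressable store keyed by SHA-1 (illustration only)."""

    def __init__(self):
        self.objects = {}

    def put(self, kind, body):
        data = kind.encode() + b" " + str(len(body)).encode() + b"\0" + body
        oid = hashlib.sha1(data).hexdigest()
        # Identical content hashes to the same ID, so nothing new is stored.
        self.objects.setdefault(oid, data)
        return oid

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
tree_body = b"100644 README\0" + bytes.fromhex(EMPTY_BLOB)

store = ObjectStore()
first_commit_tree = store.put("tree", tree_body)
second_commit_tree = store.put("tree", tree_body)  # same files, unchanged
print(first_commit_tree == second_commit_tree, len(store.objects))
```

Both calls return the same ID and the store ends up holding a single object, which is the same effect you get when two commits share a tree.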

Is working tree the same as working directory?

Yes, in Git the working tree is synonymous with the working directory, but this isn't directly tied to the concept of "git trees" that we're discussing in this post.

The working directory (or working tree) in Git is just the active checked-out folder that you see in your project root folder. Developers often use the term tree synonymously with directory or folder, since it identifies a set of files and their names/paths/permissions.

However, don't confuse a tree object in Git with the working tree. A tree object is a full Git object that is stored in the repository. Git's working tree is just the set of files and folders you currently have checked out.

What is the difference between a tree and commit?

Trees and commits are different types of Git objects. While a tree links together a set of blobs with their names and permissions, a commit connects a "root tree" with an author, committer, datetime information, commit message, and commit parents - which effectively turns it into a snapshot of your project at a certain point in time.
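To see the difference in structure, here is a minimal Python sketch of the text content of a commit object. Unlike a tree, it is plain text and refers to its root tree by hex ID. The function name commit_body and the author details are made up for illustration; the tree ID is the one used in the examples in this article.

```python
def commit_body(tree_id, parents, author, committer, message):
    """Assemble the text content of a commit object (illustrative sketch).

    author and committer are strings of the form
    "Name <email> <unix-timestamp> <timezone>".
    """
    lines = ["tree " + tree_id]
    lines += ["parent " + parent for parent in parents]  # none for a first commit
    lines.append("author " + author)
    lines.append("committer " + committer)
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message).encode()

# Hypothetical identity and timestamp, for illustration only.
who = "Jane Doe <jane@example.com> 1700000000 +0000"
body = commit_body(
    "012bbd88e5d2cc8cdd405df2443de403bc0eb4e5", [], who, who, "Initial commit\n"
)
print(body.decode())
```

The commit itself holds no file names or content. Everything about the files is reached through the single tree line.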

What are the three trees in Git?

This terminology of Git's three trees is a reference to the following three concepts in Git:

  1. Git HEAD
  2. Git's index (staging area)
  3. Working tree (working directory)

Git tree example | How do you use trees in Git?

As a Git user, you don't really need to explicitly worry about trees most of the time. Git will automatically create trees as you go through the normal process of adding file modifications to the staging area and committing them to your repository.

However, there are some commands you can use to directly interact with trees.

git rev-parse HEAD^{tree}

You can easily find the root tree pointed to by a commit by using the command git rev-parse:

$ git rev-parse HEAD^{tree}
012bbd88e5d2cc8cdd405df2443de403bc0eb4e5

This outputs the object ID of the HEAD commit's root tree. Instead of HEAD, you can supply a specific commit ID, branch name, or tag.

git ls-tree

You can use the git ls-tree command to display the contents of a tree-ish argument. Tree-ish just means a tree, or an object/syntax that leads to one if you follow the linked objects.

The git ls-tree command requires a tree-ish argument. Passing HEAD lists the entries of the root tree pointed to by the HEAD commit:

$ git ls-tree HEAD
100644 blob 5790758c324eda7a91057bcff1e0ba0a8c138057	.gitignore
100644 blob d5429c2d93c8bb2850b9964e98dd2510ee7de261	package-lock.json
100644 blob 319e6e5295741d6cd5b366337bbb75a3d35a9629	package.json
100644 blob 388adfa5c0a996e6f6585c98c697838507ce54b4	pom.xml
040000 tree 50cd20e424aac109d74fd2adc99162bdde74d64f	src

You can supply specific commit IDs or tree IDs as arguments, or you can supply refs such as branch names or tags since those just resolve to an individual commit.
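If you want to consume this output in a script, each line has the shape mode, type, object ID, then a tab and the path. Here is a small Python sketch that parses such lines; the names TreeEntry and parse_ls_tree_line are our own. It assumes simple paths, since Git quotes paths containing unusual characters unless you pass the -z option.

```python
from typing import NamedTuple

class TreeEntry(NamedTuple):
    mode: str
    type: str
    object_id: str
    path: str

def parse_ls_tree_line(line):
    """Parse one line of default `git ls-tree` output."""
    # The path is separated from the rest by a tab, so split on that first.
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, obj_type, object_id = meta.split(" ")
    return TreeEntry(mode, obj_type, object_id, path)

sample = [
    "100644 blob 5790758c324eda7a91057bcff1e0ba0a8c138057\t.gitignore",
    "040000 tree 50cd20e424aac109d74fd2adc99162bdde74d64f\tsrc",
]
for entry in map(parse_ls_tree_line, sample):
    print(entry.type, entry.path)
```

Splitting on the tab first means a path that contains spaces still comes through intact.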

git show

You can also use the git show command to display the formatted contents of a specific tree if you know its object ID:

$ git show 012bbd88
tree 012bbd88

.gitignore
package-lock.json
package.json
pom.xml
src/
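Under the hood, a listing like this comes from walking the entries in the tree's binary content. The following Python sketch shows one way that decoding could work, assuming the entry layout described earlier (mode, space, name, null byte, 20 raw hash bytes). The function names parse_tree_body and show_names are hypothetical, and this is not Git's actual implementation.

```python
def parse_tree_body(body):
    """Decode a tree's content (without the "tree <size>\\0" header)."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)      # end of the mode
        nul = body.index(b"\0", space)     # end of the name
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        object_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes -> 40 hex chars
        entries.append((mode, name, object_id))
        pos = nul + 21
    return entries

def show_names(entries):
    """List entry names, marking sub-trees with a trailing slash."""
    return [name + "/" if mode == "40000" else name for mode, name, _ in entries]

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
body = (
    b"100644 pom.xml\0" + bytes.fromhex(EMPTY_BLOB)
    + b"40000 src\0" + bytes.fromhex(EMPTY_TREE)
)
print(show_names(parse_tree_body(body)))
```

The hash is read by length rather than by searching for a delimiter, because its raw bytes can legitimately contain spaces or null bytes.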

Summary

In this article, we provided an overview of trees in Git. We identified what a tree is, how a tree is formatted, when and how they're created, and where they're stored in Git's repository.

Furthermore, we discussed how Git doesn't usually delete trees, and saw that Git is able to reuse trees for multiple commits where possible.

We also mentioned some related information regarding the working tree, Git's three trees, and explained the difference between trees and commits.

Lastly, we saw a few Git tree examples which showed how you can use Git's commands to work with trees in your project.

Next Steps

If you're interested in learning more about how Git works under the hood, check out our Baby Git Guidebook for Developers, which dives into Git's code in an accessible way. We wrote it for curious developers to learn how Git works at the code level. To do this, we documented the first version of Git's code and discussed it in detail.

We hope you enjoyed this post! Feel free to shoot me an email at jacob@initialcommit.io with any questions or comments.

References

  1. Memory buffers - https://en.wikipedia.org/wiki/Data_buffer
  2. OpenSSL SHA Library - https://www.openssl.org/docs/man3.0/man3/SHA1.html
  3. Zlib library - https://zlib.net/
  4. Git's three trees - https://git-scm.com/book/en/v2/Git-Tools-Reset-Demystified

Final Notes