What is a blob in Git?

Introduction

This is Part 1 of our series on Git objects - blobs, trees, commits, and annotated tags.

As Git users, we often only worry about the objects that are presented to us in daily Git usage. For the most part, these are commits and tags. We also see and interact with Git refs, such as branch names, which are essentially just labels for commits.

However, behind the scenes Git uses two lower-level object types called blobs and trees.

In this article, we'll discuss everything you need to know about blobs in Git.

What are the Git object types?

Git has three main types of objects: blobs, trees, and commits. Annotated tags are technically objects as well, since they are stored in Git's object database.

What is a Git blob object?

A blob in Git is just a file's binary data, stored along with the size of that data and a label indicating the object's type, which in this case is a 'blob'.

Because a blob is just a stream of binary data, it is also referred to as an octet stream or a byte stream. An octet is just another word for a byte of data, which is a series of 8 bits (0's or 1's).

Git uses object chaining to link objects together. Commits point to trees, which in turn point to blobs.

Where are Git blobs stored?

Blobs are stored in Git's object database, which is located in the path .git/objects/ in the root directory of your project.

What is the Git blob format?

Git blobs are built up in a memory buffer (a location in your computer's memory that Git's code can access) by Git's code in the following format:

blob <size-of-blob-in-bytes>\0<file-binary-data>

This format starts with the Git object's type, which in this case is blob, followed by a single space which acts as a delimiter (or separator) between values.

Second comes the size of the blob's data in bytes, which I labeled as <size-of-blob-in-bytes>. This size is written as a decimal number and counts only the bytes of the file's content. It does not include the object's type, the size field itself, or the delimiters.

Next comes a NUL byte, represented by the \0 character. This is just an empty byte 00000000 and also acts as a delimiter.

Lastly comes the file's binary content.
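
As a rough illustration of this layout, the Python sketch below builds the same byte sequence by hand. The function name build_blob is made up for this example and is not part of Git.

```python
def build_blob(content: bytes) -> bytes:
    """Return the uncompressed blob: type, space, size, NUL byte, then content."""
    # The size is the length of the content in bytes, written as decimal text.
    header = b"blob " + str(len(content)).encode("ascii")
    # A single NUL byte separates the header from the file's content.
    return header + b"\0" + content


blob = build_blob(b"hello world\n")
print(blob)  # b'blob 12\x00hello world\n'
```

The content here is 12 bytes long (11 visible characters plus the trailing newline), so the header reads blob 12.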

When does Git create blobs?

Git creates new blobs in the following circumstances:

  1. Adding an untracked file to the staging area with git add
  2. Adding a modified, tracked file to the staging area with git add
  3. Git merges that result in changing an existing file's content (not all merges do this)

It's possible that some other Git commands such as git rebase might create new blobs as well, but only if the result of that command changes the content of an existing tracked file.

How are Git blobs stored?

When you tell Git to track a file using the git add command, it may seem like it's working some magic behind the scenes to add your file into Git's staging area. However, once you know how the magic trick works, it doesn't seem that magical at all!

The first thing Git does is turn your new untracked file into a blob. Git's code does this by reading the file's content into a memory location that the code can access. It then determines the size of the data and creates the format shown above.

The code then calculates the SHA-1 hash of the uncompressed blob, header included, using a SHA library such as OpenSSL's. This hash is called the blob hash or, more generally, the object ID.
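
To make the hashing step concrete, here is a minimal Python sketch. It uses Python's standard hashlib module rather than calling OpenSSL directly, and the function name blob_id is an assumption made for this example. For the same bytes, the result should match what git hash-object prints in a repository that uses SHA-1.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID (40 hex characters) for a blob with this content."""
    # The hash covers the header and the content, not the content alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the header is part of the hashed data, a blob's ID differs from the plain SHA-1 of the file's content.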

Next, the code uses the Zlib library to compress (or deflate) the blob.

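Finally, the compressed blob is written to a new file in Git's object store and named using the calculated SHA-1 hash. The first two hexadecimal characters of the hash become a subdirectory name under .git/objects/, and the remaining 38 characters become the file name.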
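
The whole sequence (format, hash, compress, write) can be sketched in a few lines of Python. This is a simplified illustration, not Git's actual code: the function names are made up, and it writes into a temporary directory rather than a real .git/objects/ folder.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_blob(objects_dir: Path, content: bytes) -> Path:
    """Store content as a loose blob under objects_dir and return the file path."""
    blob = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(blob).hexdigest()
    # The first two hex characters name the directory, the other 38 name the file.
    path = objects_dir / object_id[:2] / object_id[2:]
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(blob))
    return path


def read_loose_blob(path: Path) -> bytes:
    """Inflate a loose object file and return the content after the NUL byte."""
    blob = zlib.decompress(path.read_bytes())
    _header, _, content = blob.partition(b"\0")
    return content


with tempfile.TemporaryDirectory() as tmp:
    path = write_loose_blob(Path(tmp) / "objects", b"hello world\n")
    print(path.relative_to(tmp).as_posix())  # objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
    print(read_loose_blob(path))             # b'hello world\n'
```

Reading the file back with zlib and splitting at the first NUL byte recovers the original content, which is a handy way to inspect a loose object by hand.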

When Git objects are stored individually like this, they are called loose objects. Git can group loose objects into pack files, which reduces their size on disk and when transferring between Git repositories using git push or git pull.

Does Git delete blobs?

Not as long as the blob is reachable. Git only deletes a blob once it becomes unreachable from every other object in your repository.

This could happen if the commit that points to the tree that points to the blob becomes orphaned, meaning that no ref or other commit leads to it anymore. In this case, Git's garbage collection might delete the blob once a grace period has passed.
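
The idea of reachability can be illustrated with a toy graph walk in Python. This is only a sketch of the concept, not how Git's garbage collector is implemented. The object names and the links dictionary are hypothetical, and real Git also treats other starting points, such as the reflog, as keeping objects alive.

```python
def reachable(start_ids, links):
    """Return every object ID that can be reached from start_ids by following links."""
    seen = set()
    stack = list(start_ids)
    while stack:
        object_id = stack.pop()
        if object_id in seen:
            continue
        seen.add(object_id)
        stack.extend(links.get(object_id, []))
    return seen


# Hypothetical object graph: each commit points to a tree, each tree to blobs.
links = {
    "commit-a": ["tree-a"],
    "tree-a": ["blob-readme"],
    "commit-b": ["tree-b"],  # assume no ref points at commit-b anymore
    "tree-b": ["blob-readme", "blob-draft"],
}
all_objects = {"commit-a", "tree-a", "blob-readme", "commit-b", "tree-b", "blob-draft"}

live = reachable(["commit-a"], links)
print(sorted(all_objects - live))  # ['blob-draft', 'commit-b', 'tree-b']
```

Note that blob-readme survives even though the orphaned tree-b points to it, because tree-a still reaches it.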

Can Git reuse blobs across commits?

Yes! Since Git uses a content-addressable database, it is able to detect whether a file's content has changed based on the calculated SHA-1 of the file's blob. If a file is not changed between commits, or is changed but to a known previous state, the existing blob for that file will be reused in future commits.

This saves a lot of hard-drive space since Git doesn't need to store the same object multiple times across commits.
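
A small Python sketch shows why identical content is only stored once. The BlobStore class below is a hypothetical in-memory stand-in for Git's object database, keyed by the blob's SHA-1.

```python
import hashlib


class BlobStore:
    """Toy content-addressable store: the key is derived from the content itself."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        blob = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
        object_id = hashlib.sha1(blob).hexdigest()
        # If this ID already exists, the stored blob is reused as-is.
        self.objects.setdefault(object_id, blob)
        return object_id


store = BlobStore()
first = store.add(b"version 1\n")     # file as first committed
second = store.add(b"version 2\n")    # file after an edit
reverted = store.add(b"version 1\n")  # file changed back to its earlier state

print(first == reverted)   # True
print(len(store.objects))  # 2
```

Three versions of the file were added, but only two blobs exist, because the reverted content hashes to the same object ID as the first version.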

Summary

In this article, we discussed various aspects of Git blobs. We started by explaining what a blob is in Git and how it is formatted in Git's code.

Next we reviewed when Git's commands actually create blobs and how it happens.

Lastly, we covered blob deletion and reusability.

Next Steps

Check out Part 2 of this series, which discusses Git trees.

If you're interested in learning more about how Git works under the hood, check out our Baby Git Guidebook for Developers, which dives into Git's code in an accessible way. We wrote it for curious developers to learn how Git works at the code level. To do this, we documented the first version of Git's code and discussed it in detail.

We hope you enjoyed this post! Feel free to shoot me an email at jacob@initialcommit.io with any questions or comments.

References

  1. OpenSSL SHA Library - https://www.openssl.org/docs/man3.0/man3/SHA1.html
  2. Zlib library - https://zlib.net/
