Git's Original Object Database
Baby Git Object Database
In the original version of Git, Linus Torvalds designed an extremely simple mechanism for Git's repository system. The repository system is also called the object database. It is nothing more than a storage area for Git's internal objects. Note that the term database is a bit of a misnomer.
We usually think of a database as a bunch of tables with rows and columns. However, in Git's case, the object database is just a set of directories, conveniently named to store Git's internal objects in an organized way. This set of directories is stored in a hidden folder created when the repository is initialized by the init-db command. The hidden folder is called .dircache. This folder contains another folder called objects, which is the root of the object database.
The object database lives in the objects folder. It is made up of 256 subfolders, each with a two-character name. Here is the full set of 256 folders:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff
The naming convention of these folders is set up so that each folder name is a two-character combination of the hexadecimal digits 0-9 and a-f. In fact, the full set of folder names above covers all 16 x 16 = 256 possible two-character combinations of hexadecimal digits.
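The folder list above can be reproduced in a few lines. The following is a minimal Python sketch, not Git's own code (Git is written in C); the names folder_names and create_object_folders are hypothetical and exist only for this illustration. It builds the 256 names and creates them under a temporary directory so nothing touches a real repository.

```python
import os
import tempfile

# Every two-character hexadecimal name from "00" to "ff".
folder_names = [format(i, "02x") for i in range(256)]

def create_object_folders(objects_root):
    """Create one subfolder per two-character hex name (hypothetical helper)."""
    for name in folder_names:
        os.makedirs(os.path.join(objects_root, name), exist_ok=True)

# Demonstrate in a throwaway directory rather than a real project.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, ".dircache", "objects")
    create_object_folders(root)
    print(len(os.listdir(root)))  # 256
```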
The reason it is set up this way is that every internal object that Git stores (blobs, trees, etc.) is referenced by its SHA-1 hash value, which is made up of hexadecimal characters. When Git creates an object, it stores it in the subfolder that matches the first two characters of that object's SHA-1 hash. In this way, all of Git's internal objects can be indexed neatly into corresponding folders.
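The mapping from a hash to a subfolder can be sketched as follows. This is an illustration in Python, not Git's implementation. The function name object_path is hypothetical, and using the remaining 38 characters of the hash as the file name inside the subfolder is an assumption that the text above does not state. The example hash is simply the SHA-1 of some arbitrary bytes, used to get a valid 40-character string; it is not a real blob hash.

```python
import hashlib
import os

HEX_DIGITS = set("0123456789abcdef")

def object_path(sha1_hex, objects_root=os.path.join(".dircache", "objects")):
    """Return the path an object with this SHA-1 would live at (hypothetical helper)."""
    if len(sha1_hex) != 40 or not set(sha1_hex) <= HEX_DIGITS:
        raise ValueError("expected 40 lowercase hexadecimal characters")
    subfolder = sha1_hex[:2]   # first two characters select one of the 256 folders
    filename = sha1_hex[2:]    # assumption: the other 38 characters name the file
    return os.path.join(objects_root, subfolder, filename)

# Illustrative input only: any SHA-1 digest has the right shape.
example_hash = hashlib.sha1(b"hello").hexdigest()
print(object_path(example_hash))
```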
For example, let's assume a user starts tracking a new file in their Git repository. That file's content is compressed and then hashed to yield the following blob hash value:
This blob object will be stored in the following path in the root directory of the project:
As more files are added, modified, deleted, and generally tracked through Git, the object database gets populated with more and more objects. These can be efficiently searched by Git's commands and perused by curious developers.
Since each blob, tree, or commit object is indexed based on a unique identifier (a hash) derived from its own content, a database set up in this way is called a content-addressable database.
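To make the idea of content addressing concrete, here is a small Python sketch of a store that names each file after a hash of what it holds. It follows the compress-then-hash order described earlier, but it is a simplification under stated assumptions: put_object and get_object are hypothetical names, no object type or size information is recorded, and the 38-character file name is assumed rather than taken from the text. It should not be read as a reproduction of Git's actual object format.

```python
import hashlib
import os
import tempfile
import zlib

def put_object(objects_root, content):
    """Store bytes under a name derived from the content; return the hash."""
    compressed = zlib.compress(content)
    # Following the description above: compress first, then hash the result.
    sha1_hex = hashlib.sha1(compressed).hexdigest()
    folder = os.path.join(objects_root, sha1_hex[:2])
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, sha1_hex[2:])  # assumed file name: remaining 38 chars
    if not os.path.exists(path):               # identical content is stored only once
        with open(path, "wb") as f:
            f.write(compressed)
    return sha1_hex

def get_object(objects_root, sha1_hex):
    """Look an object up by its hash and return the original bytes."""
    path = os.path.join(objects_root, sha1_hex[:2], sha1_hex[2:])
    with open(path, "rb") as f:
        return zlib.decompress(f.read())

with tempfile.TemporaryDirectory() as root:
    first = put_object(root, b"hello world\n")
    second = put_object(root, b"hello world\n")
    print(first == second)                              # True
    print(get_object(root, first) == b"hello world\n")  # True
```

Because the name is computed from the content, storing the same bytes twice yields the same hash and a single file, while any change to the content yields a different name.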
Note that this is the original structure of Git's object database. During its 10+ years of evolution, Git's object database has been updated and optimized in various ways.
If you're interested in learning more about how Git works under the hood, check out our Baby Git Guidebook for Developers, which dives into Git's code in an accessible way. We wrote it for curious developers who want to learn how Git works at the code level. To do this, we documented the first version of Git's code and discuss it in detail.