Merkle trees in Git and Bitcoin
Table of Contents
- Hashing & content addressability
- Hash functions explained
- Benefits of content addressability
- How Git uses content addressability
- How Bitcoin uses content addressability
This article is Part 2 in our series comparing Git and Bitcoin. Check out Part 1 to start at the beginning. In this article, we'll discuss how Merkle trees are utilized by both Git and Bitcoin.
Hashing & content addressability
A major element that Git and Bitcoin have in common is content addressability. Content addressability means that each piece of data is uniquely identified by the contents of that data, as opposed to a sequential ID number or arbitrary label specified by a user.
For Git, you can prove this to yourself: The commit ID of each commit is, (as you read everywhere in the documentation), the hash of that commit object. This hash is calculated from all of the data that makes up that commit, including (but not limited to):
- File changes included in the commit
- File names of the committed files
- Committer name
- Commit date and time
- Commit message
- Parent commit ID
This means that the commit object itself is identified by its own content as opposed to some arbitrary number (like the nth commit created in a repository), or by the commit message you entered.
These SHA1 hash identifiers stay constant for the most part in Git. For example, commit ID's in Git stay constant unless history is edited using a command such as git rebase.
Hash functions explained
Before we go any further, let's talk about hash functions and hashing.
A hash is the resulting output from a hash function. Hash functions can map arbitrary blocks of data to fixed-size output chunks, for example, SHA-256 always outputs a 256-bit result. A good hash always has uniform outputs: if you feed it randomly generated inputs, the outputs don't tend to "bunch up" in one part of all combinations of the output possibilities.
For our purpose of cryptographic hash functions, they have a few more requirements:
- The input should not be easy to obtain from the output.
- The output should change drastically in response to even a single change in the input.
- No two inputs should ever map to the same output.
As an example of (1), take the Git commit ID,
Nowhere is there any tool or library that can take that commit ID, and tell you that this is the data that was used to generate it:
author Teknikal_Domain <firstname.lastname@example.org> 1596851132 -0400
committer Teknikal_Domain <email@example.com> 1596851132 -0400
gpgsig -----BEGIN PGP SIGNATURE-----
-----END PGP SIGNATURE-----
Site JS change: Persistent tagline storage
* Allows for calling setTagline() more than once, after initial XHR
And (2) becomes useful when you want some authentication, say, checking passwords. Even one bit flipped in a sea of billions will make a completely different hash. This is why downloadable binary files have an associated SHA value to identify that exact version as official. By comparing a relatively short string of characters, you can guarantee what you grabbed is an exact to-the-bit copy.
Benefits of content addressability
Content-addressable systems are inherently resistant to tampering. To change one piece of data means updating every piece that references it, which in turn means every piece that references that, and so on. When using certain Git commands that re-write history, such as
git rebase -i or
git filter-branch, even one change requires Git to recalculate every commit after the one that was changed, since each commit's identity depends on its parent.
In most of these systems, they create what's called a Merkle tree. A Merkle tree is a tree structure where each node of the tree has an identifier that's the hash of its contents. Git's object database is a Merkle tree... or rather, a bunch of Merkle trees that point to other Merkle trees. Because of their referential nature, they're hardened in the way I described above.
How Git uses content addressability
As described, Git's object database is content addressable. Every blob, tree, and commit is a hashed object. If you want to refer to a specific object, you need to know its hash. Git uses this to guarantee unique identifiers since two hashes should never collide. In addition, the structure is hard-to-modify and can be created without a lot of work. A malicious actor can't forge commits easily once they're made since that would require a lot of calculations and updates, which would then disagree with everyone else's copy of the same data. That would be quite the explanation for why such a merge needs to be completed.
How Bitcoin uses content addressability
The internal structure of the Bitcoin blockchain and Git object database are very similar. However, Bitcoin's blockchain is a bit more linear.
Transactions in Bitcoin are grouped into blocks that get linked into a Merkle tree. Every validated block contains a proof-of-work (PoW) that is solved by the miners. Each PoW is a number that, when added to the block's contents, causes the block's hash to satisfy a few criteria. Currently, the main criteria is that the hash, in numerical representation, is smaller than some other number, i.e., there are a certain number of leading zeros in the hash value.
The reason for doing this is that to change network transactions, you would need to not only modify every block after the one with the transaction you're modifying, but you also need to recalculate the PoWs for each of them, a process that is designed to take roughly 10 minutes of calculations to solve (they have variable difficulty levels, meaning as the computing power of miners goes up, so does the proof difficulty, requiring more processing cycles.
To this end, both Git and Bitcoin have settled on slightly different implementations of the same overarching concept. Any changes to existing data are computationally too big of a task. The only way to modify the state of everything is to add new data, and this is always going to come from a known end state, which can be verified easily, by every device that has a copy.
Both Git and Bitcoin strive for a similar goal: To be able to store data in a way that's publicly accessible, hard to forge, hard to modify, but also easy to read the final output from.
Both have settled on similar design patterns. The use of a distributed system where every participant has a complete copy of the data, and the use of a content-addressable system - the Merkle tree - to provide fast validation and computationally difficult modification of existing data.
This data structure is (theoretically) infinitely scaleable (at least until you run out of hard drive space to put the thing), meaning that you don't have to worry about a potential hard limit looming overhead, waiting to be crashed into.
For a comprehensive overview of Bitcoin, I highly recommend the book Mastering Bitcoin, by O'Reilly Media.
If you're interested in learning how Bitcoin's code works, check out our Baby Bitcoin Guidebook for Developers.
Recommended product: Decoding Bitcoin Guidebook for Developers