Version Control Systems | A Technical Guide to VCS Internals
Table of Contents
- Introduction
- What is a Version Control System?
- 13 Historically Significant Version Control Systems
- Third Generation - Distributed Version Control Systems
- VCS Release History Timeline
- SCCS - Source Code Control System - First Generation
- Background
- Sample SCCS History File
- RCS - Revision Control System - First Generation
- Architecture
- CVS - Concurrent Versions System - Second Generation
- SVN - Subversion - Second Generation
- Git - Third Generation
- Mercurial - Third Generation
- Summary
- References
Introduction
In this article, we'll start by briefly describing the basics and purpose of version control systems. Then we'll provide a technical comparison of some of the most historically significant Version Control Systems (VCS).
The original version of this article covered six VCS: SCCS, RCS, CVS, SVN, Git, and Mercurial. Due to popular demand, we added six more - Perforce Helix Core, BitKeeper, Darcs, Monotone, Bazaar, and Fossil. The first six are covered in this article and the new additions are covered in Part 2.
The goal of this post is to help you learn version control from a technical and historical perspective, so that you can understand and apply it in your own software development teams and personal journey as a developer.
What is a Version Control System?
A Version Control System (VCS) is a tool that helps software developers keep track of how their software development projects - desktop applications, websites, mobile apps, etc - change over time.
Each snapshot or state of the files and folders in a codebase at a given time can be called a "version." Version control systems were created to give developers a convenient way to create, manage, and share those versions. They allow developers to control the versions of their code as it evolves over time.
Version control systems also enable collaboration within a team of software developers, without losing or overwriting anyone's work. After a developer makes a set of code changes to one or more files, they tell the version control system to save a representation of those changes.
In this way, a history of code changes - or versions - is built up as the code evolves. This history is preserved so that all developers, even if they are working in totally different locations, have access to a consistent and up-to-date history of the project.
A version control system can also be referred to as a source control system, source code management, version control software, version control tools, or other combinations of these terms.
13 Historically Significant Version Control Systems
Below is a list of 13 VCS that were each important to the progression of the industry. These can be categorized by generation, which represents a combination of chronological order and VCS design principles:
- First Generation
- Second Generation
- Third Generation
First Generation - Local Version Control Software
The first generation VCS were intended to track changes for individual files, and checked-out files could only be edited locally by one user at a time. They were built on the assumption that all users would log into the same shared Unix host with their own accounts.
As you can imagine, the ability to revisit the state of code files at various times in the project's history made these early systems a valuable resource for small, local development teams.
Second Generation - Centralized Version Control Tools
As version control technology continued to evolve, the second generation introduced networking, which led to centralized repositories that contained the 'official' versions of their projects. This was good progress, since centralized version control allowed multiple users to check out and work with the code at the same time, but they would all be committing back to the same central repository. Furthermore, network access was required to make commits.
Third Generation - Distributed Version Control Systems
The third generation comprises the distributed VCS, or DVCS.
In a distributed VCS, all copies of the repository are created equal - there is no central copy of the repository. This design principle encourages commits, branches, and merges to be created locally without network access and pushed to other repositories as needed.
Distributed version control was a natural fit for open source development, since software developers around the globe could easily collaborate and share code without a central authority to enable their interactions.
VCS Release History Timeline
For context, here is a timeline of the creation of these VCS tools:
Figure 1: Timeline of the Creation of Version Control Systems
SCCS - Source Code Control System - First Generation
Background
SCCS is considered to be one of the first successful VCS tools created. It was developed by Marc Rochkind at Bell Labs in 1972. The original implementation was written in SNOBOL4 for IBM mainframes and was later rewritten in C for Unix. It was created to solve the problems of source file revision tracking, and it made it significantly easier to track down the source of bugs introduced into a program. SCCS is worth understanding at a basic level because it is the seed of the set of modern VCS tools that are so important to developers today.
Architecture
Like most modern day VCS, SCCS has a set of commands that allow developers to work with versioning of their files. These commands are used to:
- Check in files to have their history tracked using SCCS
- Check out specific file revisions for review or compilation
- Check out specific file revisions for editing
- Check in new file revisions along with a comment explaining the changes
- Revert changes made in a checked out file
- Basic branching and merging of changes
- Provide a log of a file's revision history
A special type of file called an s-file or a history file is created when a file is added for tracking with SCCS. This file is named using the original file name prefixed with s. and is stored in a subdirectory called SCCS. So a file called test.txt would get a history file created in the ./SCCS/ directory with a name of s.test.txt. On creation, the history file contains the initial content of the original file as well as some metadata to assist with version tracking. Checksums are stored in the history files to verify that the content has not been tampered with. The history file content is not compressed or encoded (as we will see is the case with the later generation VCS).
Since the content of the original file is now stored in the history file, it can be retrieved into the working directory for review, compilation, or editing. Further changes made to the file such as line additions, modifications, and removals can be checked back into the history file, which increments its revision number.
Subsequent SCCS checkins store only the deltas or changes to a file as opposed to the entire file content each time. This decreases the size of the history file. Each time a checkin is made, the delta is recorded in a structure known as a delta table inside the history file. As previously mentioned, the actual file content is more or less copied verbatim, with special control sequences for marking the start and end of sections of added and removed content. Since SCCS history files don't use compression, they are typically larger in size than the actual file being tracked. SCCS uses a delta method known as interleaved deltas. This is beneficial since checkout time does not depend on how old the checked out revision is - i.e. older revisions don't take longer to check out than newer revisions.
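The way interleaved deltas make every revision equally cheap to retrieve can be sketched in a few lines of Python. This is a simplified illustration rather than SCCS's actual implementation: it assumes a linear history in which a delta applies whenever its serial number is less than or equal to the one requested, it writes the control character as the two visible characters ^A, and it only handles the body of a history file. The function name checkout is hypothetical.

```python
def checkout(weave_lines, wanted_serial):
    """Return the text lines visible at delta `wanted_serial`.

    Simplified sketch: assumes a linear history, so a delta "applies"
    whenever its serial number is <= wanted_serial.
    """
    open_inserts = []     # serials of insert blocks we are currently inside
    open_deletes = set()  # serials of delete blocks we are currently inside
    result = []
    for line in weave_lines:
        if line.startswith("^A"):
            kind, serial = line[2], int(line[3:])
            if kind == "I":
                open_inserts.append(serial)
            elif kind == "D":
                open_deletes.add(serial)
            elif kind == "E":
                # An end marker closes the delete block for this serial if
                # one is open, otherwise the matching insert block.
                if serial in open_deletes:
                    open_deletes.remove(serial)
                else:
                    open_inserts.remove(serial)
            continue
        inserted = all(s <= wanted_serial for s in open_inserts)
        deleted = any(s <= wanted_serial for s in open_deletes)
        if inserted and not deleted:
            result.append(line)
    return result


# Body of a small history file holding three deltas (serials 1, 2, 3).
WEAVE = [
    "^AI 1", "Hi there", "^AE 1",
    "^AI 2", "^AD 3", "This is a test of SCCS", "^AE 2", "^AE 3",
    "^AI 3", "A test of SCCS", "^AE 3",
]

for serial in (1, 2, 3):
    print(serial, checkout(WEAVE, serial))
```

Every revision is produced by one pass over the same lines, which is why retrieving an old revision costs no more than retrieving the newest one.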
One important thing to note is that all files are tracked and checked in separately in SCCS. There is no way to checkin changes to multiple files as a part of one atomic unit - like a commit in Git. Each tracked file has a corresponding history file which stores its revision history. In general, this means that the version numbers of different files in a project will not usually match each other. However, matching revision numbers can be achieved by editing every file in the project at once (even if not all of the files have real changes) and checking them all at one time. This will increment the revision number for all the files to keep them consistent, but note that this is NOT the same as including multiple files in a single commit like in Git. In SCCS, this makes an individual checkin in each history file, as opposed to one big commit including all the changes at once.
When a file is checked out for editing in SCCS, a lock is placed on the file so it cannot be edited by anyone else. This prevents changes from being overwritten by other users, but also limits development since only one user can work with a given file at a time.
SCCS has support for branches that can store sequences of changes within a specific file. Branches can be merged back in with the original versions or merged with other branched versions of the same parent.
Basic Commands
Below is a list of the most common SCCS commands.
sccs create <filename.ext>: Check in a new file to SCCS and create a new history file for it (in the ./SCCS/ directory by default).
sccs get <filename.ext>: Check out a file from its corresponding history file and place it in the working directory in read-only mode.
sccs edit <filename.ext>: Check out a file from the corresponding history file for editing. Locks the history file so no other users can modify it.
sccs delta <filename.ext>: Check in the modifications to the specified file. Will prompt for a comment, store the changes in the history file, and remove the lock.
sccs prt <filename.ext>: Display the revision log for a tracked file.
sccs diffs <filename.ext>: Display the differences between the current working copy of a file and the state of the file when it was checked out.
For more information on SCCS internals, see Eric Allman's guide and this Oracle guide on programming utilities.
Sample SCCS History File
^Ah20562
^As 00001/00001/00002
^Ad D 1.3 19/11/26 14:37:08 jack 3 2
^Ac Here is a comment.
^Ae
^As 00002/00000/00001
^Ad D 1.2 19/11/26 14:36:00 jack 2 1
^Ac No.
^Ae
^As 00001/00000/00000
^Ad D 1.1 19/11/26 14:35:27 jack 1 0
^Ac date and time created 19/11/26 14:35:27 by jack
^Ae
^Au
^AU
^Af e 0
^At
^AT
^AI 1
Hi there
^AE 1
^AI 2
^AD 3
This is a test of SCCS
^AE 2
^AE 3
^AI 3
A test of SCCS
^AE 3
RCS - Revision Control System - First Generation
Background
The Revision Control System (RCS) was written in C by Walter Tichy in 1982 as an alternative to SCCS, which wasn't open source at the time.
RCS manages revisions of text documents, in particular source programs, documentation, and test data. It automates the storing, retrieval, logging, and identification of revisions.
— Walter Tichy, (1)
Architecture
RCS shares many traits with its predecessor, including:
- Handling revisions on a file-by-file basis
- Changes across multiple files can't be grouped together into an atomic commit
- Tracked files are intended to be modified by one user at a time
- No network functionality
- Revisions for each tracked file are stored in a corresponding history file
- Basic branching and merging of revisions within individual files.
When a file is checked into RCS for the first time, a corresponding history file is created for it in the local ./RCS/ directory. This file is postfixed with ,v so a file named test.txt would be tracked by a file called test.txt,v.
RCS uses a reverse-delta scheme for storing file changes. When a file is checked in, a full snapshot of the file's content is stored in the history file. When the file is modified and checked in again, a delta is calculated against the existing history file content. The old snapshot is discarded and the new one is saved, along with the delta needed to get back to the older state. This is called reverse-delta since to check out an older revision, RCS starts with the newest version of the file and applies consecutive deltas until the older revision is reached. This method allows for very quick checkouts of current revisions since the full snapshot of the current revision is always available. However, the older the checked-out revision, the longer the checkout takes since an increasing number of deltas need to be applied to the current snapshot.
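The reverse-delta idea can be sketched in Python as follows. This is a minimal illustration, not RCS's own code: it assumes deltas are edit scripts made of "dN K" (delete K lines starting at line N) and "aN K" (add the next K script lines after line N) commands, with line numbers referring to the newer text. The function names apply_edit_script and checkout, and the list-of-scripts layout, are hypothetical.

```python
def apply_edit_script(lines, script):
    """Apply an RCS-style edit script to `lines` and return the result."""
    out = []
    pos = 0  # how many lines of the input have been consumed so far
    i = 0
    while i < len(script):
        command = script[i]
        i += 1
        op = command[0]
        start, count = (int(part) for part in command[1:].split())
        if op == "d":
            out.extend(lines[pos:start - 1])  # keep lines before the deletion
            pos = start - 1 + count           # skip the deleted lines
        elif op == "a":
            out.extend(lines[pos:start])      # keep lines up to the insert point
            pos = max(pos, start)
            out.extend(script[i:i + count])   # the next `count` lines are new text
            i += count
        else:
            raise ValueError(f"unknown command: {command!r}")
    out.extend(lines[pos:])
    return out


def checkout(head_lines, reverse_deltas, steps_back):
    """Start from the newest snapshot and walk `steps_back` deltas into the past."""
    lines = list(head_lines)
    for script in reverse_deltas[:steps_back]:
        lines = apply_edit_script(lines, script)
    return lines


head = ["hi there, you are my bud.", "You are so cool!", "The end."]
reverse_deltas = [["d1 3", "a3 1", "hi there"]]  # newest to oldest

print(checkout(head, reverse_deltas, 0))  # newest revision, no deltas applied
print(checkout(head, reverse_deltas, 1))  # one revision back
```

The loop in checkout makes the cost visible: the newest revision needs no work at all, and each step further back applies one more edit script.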
This is not the case with SCCS which takes the same amount of time to fetch any revision. In addition, no checksum is stored in RCS history files so file integrity cannot be ensured.
Basic Commands
Below is a list of the most common RCS commands:
ci <filename.ext>: Check in a new file to RCS and create a new history file for it (in the ./RCS/ directory by default).
co <filename.ext>: Check out a file from its corresponding history file and place it in the working directory in read-only mode.
co -l <filename.ext>: Check out a file from the corresponding history file for editing. Locks the history file so no other users can modify it.
ci <filename.ext>: Check in file changes and create a new revision for it in its corresponding history file.
merge <file-to-merge-into.ext> <parent.ext> <file-to-merge-from.ext>: Merge changes from two modified children of the same parent file.
rcsdiff <filename.ext>: Display the differences between the current working copy of a file and the state of the file when it was checked out.
rcsclean: Removes working files that don't have locks.
For more information on RCS internals, see the GNU RCS manual.
Sample RCS History File
head 1.2;
access;
symbols;
locks; strict;
comment @# @;
1.2
date 2019.11.25.05.51.55; author jstopak; state Exp;
branches;
next 1.1;
1.1
date 2019.11.25.05.49.02; author jstopak; state Exp;
branches;
next ;
desc
@This is a test.
@
1.2
log
@Edited the file.
@
text
@hi there, you are my bud.
You are so cool!
The end.
@
1.1
log
@Initial revision
@
text
@d1 5
a5 1
hi there
@
CVS - Concurrent Versions System - Second Generation
Background
CVS was created by Dick Grune in 1986 with the goal of adding a networking element to version control. It is also written in C. CVS kicked off the second generation of VCS tools which allowed geographically dispersed development teams to work on projects together.
Architecture
CVS is a frontend for RCS - it provides a set of commands for interacting with files in a project, but uses the RCS history file format and commands behind the scenes.
For the first time in VCS history, CVS allowed multiple developers to check out and work on the same files simultaneously. It did this by using a centralized repository model.
The first step is to set up a centralized repository on a remote server using CVS. Projects can then be imported into the repository. When a project is imported into CVS, each file is converted into a ,v history file and stored in a central directory known as a module. The repository generally lives on a remote server which is accessible over a local network or the Internet.
A developer checks out a copy of the module, which is copied to a working directory on their local machine. No files are locked in this process so there is no limit to the number of developers that can check out the module at one time. Developers can modify their checked out files and commit their changes as needed. If a developer commits a change, other developers will need to update their working copies via a (usually) automated merge process before committing their changes. Occasionally merge conflicts will need to be manually resolved before the commit can be made. CVS also provides the ability to create and merge branches.
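The "update before you commit" rule described above can be modelled with a small Python sketch. It is an illustration of the workflow only, not of how CVS is implemented: the classes CentralFile and WorkingCopy, the OutOfDateError exception, and the toy append_merge function are all hypothetical, and real merging is considerably more involved.

```python
class OutOfDateError(Exception):
    """Raised when a commit is attempted from a stale working copy."""


class CentralFile:
    """One file in a central repository, with a linear revision history."""

    def __init__(self, content):
        self.revisions = [list(content)]  # index 0 holds revision 1

    @property
    def head(self):
        return len(self.revisions)

    def checkout(self):
        # No lock is taken, so any number of working copies can exist.
        return WorkingCopy(self, self.head, list(self.revisions[-1]))

    def commit(self, working_copy):
        if working_copy.base != self.head:
            raise OutOfDateError("someone committed first: update, then retry")
        self.revisions.append(list(working_copy.content))
        working_copy.base = self.head


class WorkingCopy:
    def __init__(self, repo, base, content):
        self.repo = repo
        self.base = base        # revision this copy was checked out from
        self.content = content

    def update(self, merge):
        """Merge the newest central revision into this working copy."""
        base_content = self.repo.revisions[self.base - 1]
        theirs = self.repo.revisions[-1]
        self.content = merge(base_content, self.content, theirs)
        self.base = self.repo.head


def append_merge(base, mine, theirs):
    """Toy merge that only handles lines appended on both sides."""
    return theirs + mine[len(base):]


repo = CentralFile(["line 1"])
alice, bob = repo.checkout(), repo.checkout()
alice.content = alice.content + ["from alice"]
bob.content = bob.content + ["from bob"]

repo.commit(alice)
try:
    repo.commit(bob)
except OutOfDateError as error:
    print("rejected:", error)
bob.update(append_merge)
repo.commit(bob)
print(repo.revisions[-1])
```

The second commit is only accepted once the working copy has been brought up to date, which is the point at which a manual conflict resolution step would happen in practice.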
Basic Commands
export CVSROOT=<path/to/repository>: Sets the CVS repository root directory so it doesn't need to be specified in each command.
cvs import -m 'Import module' <module-name> <vendor-tag> <release-tag>: Import a directory of files into a CVS module. Before running this, change into the root directory of the project you want to import.
cvs checkout <module-name>: Copy a module to the working directory.
cvs commit <filename.ext>: Commit a changed file back to the module in the central repository.
cvs add <filename.txt>: Add a new file to track revisions for.
cvs update: Update the working copy by merging in committed changes that exist in the central repository but not the working copy.
cvs status: Show general information about the checked out working copy of a module.
cvs tag <tag-name> <files>: Add an identifying tag to a single file or set of files.
cvs tag -b <new-branch-name>: Create a new branch in the repository (must be checked out before working on it locally).
cvs checkout -r <branch-name>: Check out an existing branch to the working directory.
cvs update -j <branch-to-merge>: Merge an existing branch into the local working copy.
For more information on CVS internals, see the GNU CVS manual and Dick Grune's article.
Sample CVS History File
head 1.1;
branch 1.1.1;
access ;
symbols start:1.1.1.1 jack:1.1.1;
locks ; strict;
comment @# @;
1.1
date 2019.11.26.18.45.07; author jstopak; state Exp;
branches 1.1.1.1;
next ;
commitid zsEBhVyPc4lonoMB;
1.1.1.1
date 2019.11.26.18.45.07; author jstopak; state Exp;
branches ;
next ;
commitid zsEBhVyPc4lonoMB;
desc
@@
1.1
log
@Initial revision
@
text
@hi there
@
1.1.1.1
log
@Imported sources
@
text
@@
SVN - Subversion - Second Generation
Background
Subversion was created in 2000 by Collabnet Inc and is now maintained by the Apache Software Foundation. It is written in C and was designed to be a more robust centralized solution than CVS.
Architecture
Like CVS, Subversion uses a centralized repository model. Remote users must have a working network connection to commit their changes to the central repository.
Subversion introduced the functionality of atomic commits which ensured that a commit would either fully succeed, or be completely abandoned if an issue occurred.
In CVS, if a commit operation failed midway, for example due to a network outage, the repository could be left in a corrupted and inconsistent state. Furthermore, a commit or revision in Subversion can include multiple files and directories. This is important since it allows users to track sets of related changes together as a grouped unit, instead of the past storage models that track changes separately for each file.
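One common way to get all-or-nothing behaviour for a multi-file commit is to write the whole revision to a temporary file and rename it into place only at the end. The Python sketch below illustrates that general technique; it is not a description of Subversion's internals. The function commit_atomically, the JSON encoding, and the txn- prefix are hypothetical, and the sketch ignores concurrent committers.

```python
import json
import os
import tempfile


def commit_atomically(revs_dir, changes):
    """Store `changes` ({path: new content}) as one new numbered revision.

    The revision is written to a temporary file first and only renamed to
    its final name once everything has been written, so readers never see
    a half-finished revision.
    """
    os.makedirs(revs_dir, exist_ok=True)
    existing = [int(name) for name in os.listdir(revs_dir) if name.isdigit()]
    next_rev = max(existing, default=0) + 1

    fd, tmp_path = tempfile.mkstemp(dir=revs_dir, prefix="txn-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(changes, tmp)  # may fail part-way through
        # The rename is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, os.path.join(revs_dir, str(next_rev)))
    except BaseException:
        # Abandon the commit completely: no numbered revision is created.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return next_rev


with tempfile.TemporaryDirectory() as repo:
    revs = os.path.join(repo, "db", "revs")
    print(commit_atomically(revs, {"trunk/hi.txt": "hi there\n"}))
    print(commit_atomically(revs, {"trunk/hi.txt": "hello\n", "trunk/new.txt": "x\n"}))
    print(sorted(os.listdir(revs)))
```

Because several paths travel in one revision file, related changes either all appear under the new revision number or none of them do.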
The current storage model that Subversion uses for tracked files is called FSFS or File System atop the File System. This name was chosen since it creates its database structure using a file and directory structure that matches the operating system filesystem it is running on. The unique feature of the Subversion filesystem is that it is designed to track not only the files and directories it contains, but also the different versions of these files and directories as they change over time. It is a filesystem with an added time dimension. In addition, folders are first class citizens in Subversion. Empty folders can be committed in Subversion, whereas the other systems covered here (even Git) do not track empty folders.
When a Subversion repository is created, a (nearly) empty database of files and folders is created as a part of it. A directory called db/revs is created in which all revision tracking information for the checked-in (committed) files is stored. Each commit (which can include changes to multiple files) is stored in a new file in the revs directory and is named with a sequential numeric identifier starting with 1. When a file is committed for the first time, its full content is stored. Future commits of the same file will store only the changes - also called the diffs or deltas - in order to conserve space. In addition, the deltas are compressed using lz4 or zlib compression algorithms to further reduce their size.
By default, this is actually only true to a point. Although storing file deltas instead of the whole file each time does save on storage space, it adds time to checkout and commit operations since all the deltas need to be strung together in order to recreate the current state of the file. For this reason, by default Subversion stores up to 1023 deltas per file before storing a new full copy of the file. This achieves a nice balance of both storage and speed.
SVN does not use a conventional branching and tagging system. A normal Subversion repository layout is to have three folders in the root:
trunk/
branches/
tags/
The trunk/ folder is used for the production version of the application. The branches/ folder is used to store subfolders that correspond to individual branches. The tags/ folder is used to store tags which represent specific (usually significant) project revisions.
Basic Commands
svnadmin create <path-to-repository>: Create a new, empty repository shell in the specified directory.
svn import <path-to-project> <svn-url>: Import a directory of files into the specified Subversion repository path.
svn checkout <svn-path> <path-to-checkout>: Copy a stored repository path to the desired working directory.
svn commit -m 'Commit message': Commit a set of changed files and folders along with a descriptive commit message. These can be used as notes for future developers to understand what changes were made. The message of the initial commit is typically set to 'Initial commit'.
svn add <filename.txt>: Add a new file to track revisions for.
svn update: Update the working copy by merging in committed changes that exist in the central repository but not the working copy.
svn status: Show a list of tracked files that have been changed in the working directory (if any).
svn info: Show a list of general details about the checked-out copy.
svn copy <branch-to-copy> <new-branch-path-and-name>: Create a new branch by copying an existing one.
svn switch <existing-branch>: Switch the working directory to an existing branch. This will check out the specified branch.
svn merge <existing-branch>: Merge the specified branch into the current branch checked out in the working directory. Note this needs to be committed afterwards.
svn log: Show the commit history and associated descriptive messages for the active branch (useful for devs to find details of previous changes).
For more information on SVN internals, see the Version Control with Subversion book.
Sample SVN Revision File
DELTA
SVN^B^@^@ ^B
^A<89> hi there
ENDREP
id: 2-1.0.r1/4
type: file
count: 0
text: 1 3 21 9 12f6bb1941df66b8f138a446d4e8670c 279d9035886d4c0427549863c4c2101e4a63e041 0-0/_4
cpath: /trunk/hi.txt
copyroot: 0 /
DELTA
SVN^B^@^@$^B%^$K 6A
hi.txt
V 15
file 2-1.0.r1/4
END
ENDREP
id: 0-1.0.r1/6
type: dir
count: 0
text: 1 5 48 36 d84cb1c29105ee7739f3e834178e6345 - -
cpath: /trunk
copyroot: 0 /
DELTA
SVN^B^@^@'^B#^'K 5A
trunk
V 14
dir 0-1.0.r1/6
END
ENDREP
id: 0.0.r1/2
type: dir
pred: 0.0.r0/2
count: 1
text: 1 7 46 34 1d30e888ec9e633100992b752c2ff4c2 - -
cpath: /
copyroot: 0 /
_0.0.t0-0 add-dir false false false /trunk
_2.0.t0-0 add-file true false false /trunk/hi.txt
L2P-INDEX
^A<80>@^A^A^A^M^H^@^H^A^FD^Bz^AP2L-INDEX
^A<91>^E<80><80>@^A?^@'2^@<8D<90>^^^N=A^X^@C>
^@<8d>^Ft^V^@<92><9a><89>^E;
^@<8Aw|I^@<88><83>><93>^L`^<92>M^E^@?^[^@^@657 6aad60ec758d121d5181ea4b81a9f5f4 688 75f59082c8b5ab687ae87708432ca406I
Git - Third Generation
Background
Git was created in 2005 by Linus Torvalds (also the creator of Linux) and is written primarily in C combined with some shell scripts. It is widely considered the best VCS tool due to its features, flexibility, and speed. Linus Torvalds originally wrote it for the Linux codebase and it has grown to become the most popular VCS in use today.
You can do a lot of things with Git, and many of the rules of what you should do are not so much technical limitations but are about what works well when working together with other people. So Git is a very powerful set of tools, and that can not only be overwhelming at first, it also means that you can often do the same (or similar) things different ways, and they all "work."
— Linus Torvalds, 10 Years of Git: An Interview with Git Creator Linus Torvalds
Git repositories are commonly hosted on local servers as well as cloud services.
Git forms the backbone of a broad set of DevOps tools available from popular service providers including GitHub, BitBucket, GitLab, and many others.
Architecture
Git is a distributed VCS. This means that no copy of the repository needs to be designated as the centralized copy - all copies are created equal. This is in stark contrast to the second generation VCS which rely on a centralized copy for users to checkin and checkout from. What this means is that developers and coding partners can share changes with each other directly before merging their changes into an official branch. This allows team collaboration to take on a flexible distributed workflow, if desired.
Furthermore, developers can commit their changes to their local copy of the repository without any other repositories knowing about it. This means that commits can be made without any network or Internet connection. Developers can work locally offline until they are ready to share their work with others. At that point, the changes can be pushed to other repositories for review, testing, or deployment.
When a file is added for tracking with Git, its content is prefixed with a short header (the object type and size) and hashed using the SHA-1 hash function. This yields a hash value that corresponds specifically to the content in that file. The header and content are then compressed using the zlib compression algorithm, and Git stores the result in an object database which is located in the hidden .git/objects folder.
The object is stored in a file named after the generated hash value (the first two characters become a subdirectory name and the rest become the file name), and the file contains the compressed content. These objects are called Git blobs and are created each time a new file (or changed version of an existing file) is added to the repository.
The object database is literally just a content-addressable collection of objects. All objects are named by their content, which is approximated by the SHA1 hash of the object itself. Objects may refer to other objects (by referencing their SHA1 hash), and so you can build up a hierarchy of objects.
— Linus Torvalds, Original Git readme.md
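The content-addressing scheme can be reproduced with Python's standard library. This is a minimal sketch for blobs only: it hashes a "blob <size>" header plus the raw content with SHA-1 and compresses the same bytes with zlib, which is how loose blob objects are laid out. The function name make_blob is hypothetical, and nothing is written to disk.

```python
import hashlib
import zlib


def make_blob(content: bytes):
    """Return (object id, path under .git, compressed bytes) for a blob."""
    # The hash covers a small header plus the uncompressed content.
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    oid = hashlib.sha1(store).hexdigest()
    # Loose objects live in a directory named after the first two hex digits.
    path = f".git/objects/{oid[:2]}/{oid[2:]}"
    return oid, path, zlib.compress(store)


oid, path, data = make_blob(b"hi there\n")
print(oid)
print(path)
# Decompressing the stored bytes gives back the header plus the content.
print(zlib.decompress(data))
```

One way to check the sketch is to compare the printed id with the output of git hash-object for a file holding the same bytes. Because the name depends only on the content, adding the same file twice produces the same object.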
Git implements a staging index which acts as an intermediate area for changes that are getting ready to be committed. As new changes are staged for commit, their compressed contents are referenced in a special index file. When a commit is made, the state recorded in the index is written out as one or more tree objects. A tree is a Git object that connects blob objects to their real file names, file permissions and links to other trees, and in this way represents the state of a particular set of files and directories. Once all related changes are staged for commit, the index can be committed to the repository, which creates a commit object in the Git object database. A commit references the top-level tree for a particular revision as well as the commit author, email address, date, and a descriptive commit message. Each commit also stores a reference to its parent commit(s) and so over time a history of project development is established.
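The chain from blob to tree to commit can be sketched in Python by building the raw object bodies and hashing them. This is a simplified illustration: the helper names object_id, tree_body, and commit_body are hypothetical, the author string is a placeholder, and the tree builder only handles plain files in a single directory (sorting entries by name, which is sufficient in that case).

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> str:
    """Hash an object the way Git names it: header + body through SHA-1."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """entries: iterable of (mode, file name, object id as hex)."""
    body = b""
    for mode, name, oid in sorted(entries, key=lambda entry: entry[1]):
        # Each entry is "<mode> <name>\0" followed by the 20 raw id bytes.
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return body


def commit_body(tree_id, author, timestamp, message, parents=()):
    lines = [f"tree {tree_id}"]
    lines += [f"parent {parent}" for parent in parents]
    lines.append(f"author {author} {timestamp}")
    lines.append(f"committer {author} {timestamp}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


blob_id = object_id(b"blob", b"hi there\n")
tree_id = object_id(b"tree", tree_body([("100644", "hi.txt", blob_id)]))
who = "A U Thor <author@example.com>"  # placeholder identity
first = object_id(b"commit", commit_body(tree_id, who, "1574915303 -0800", "Initial commit"))
second = object_id(
    b"commit",
    commit_body(tree_id, who, "1574915400 -0800", "Second commit", parents=[first]),
)
print(blob_id, tree_id, first, second, sep="\n")
```

Since each object's name is derived from its body, and a commit's body contains the ids of its tree and parents, changing any file or any earlier commit changes every id downstream of it.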
As mentioned, all Git objects - blobs, trees, and commits - are compressed, hashed, and stored in the object database based on their hash value. These are called loose objects. At this point no diffs have been utilized to save space, which makes Git very fast since the full content of each file revision is accessible as a loose object. However, certain operations such as pushing commits to a remote repository, storing too many objects, or manually running Git's garbage collection command can cause Git to repackage the objects into pack files. In the packing process, reverse diffs are taken and compressed to eliminate redundant content and reduce size. This process results in .pack files containing the object content, each with a corresponding .idx (or index) file containing a reference of the packed objects and their locations in the pack file.
These pack files are transferred over the network when branches are pushed to or pulled from remote repositories. When pulling or fetching branches, the received objects are added to the local object database, either unpacked into loose objects or kept as a pack file.
Basic Commands
git init: Initialize a Git repository in the current directory (creates the hidden .git folder and its contents).
git clone <git-url>: Download a copy of the Git repository at the specified URL.
git add <filename.ext>: Add an untracked file or changed file to the staging area (creates corresponding entries in the object database).
git commit -m 'Commit message': Commit a set of changed files and folders along with a descriptive commit message.
git status: Show information related to the state of the working directory, current branch, untracked files, modified files, etc.
git branch <new-branch>: Create a new branch based on the current checked-out branch.
git checkout <branch>: Check out the specified branch into the working directory.
git merge <branch>: Merge the specified branch into the current branch checked out in the working directory.
git pull: Update the working copy by merging in committed changes that exist in the remote repository but not the working copy.
git push: Pack loose objects for local active branch commits into pack files and transfer to remote repository.
git log: Show the commit history and associated descriptive messages for the active branch.
git stash: Save all uncommitted changes in the working directory to a cache so that they can be retrieved later.
If you're interested in a great beginner book on using Git, I highly recommend Version Control with Git, by O'Reilly Media. I read this book a few years ago and it clarified a lot of Git concepts and commands that I now use almost every day!
If you're interested in learning how Git's code works, check out our Baby Git Guidebook for Developers. For more information on Git internals, see the Pro Git book chapter on Git's internals.
Sample Git Blob, Tree, Commit Files
A blob file with hash value 37d4e6c5c48ba0d245164c4e10d5f41140cab980:
hi there
A tree object with hash value b769f35b07fbe0076dcfc36fd80c121d747ccc04:
100644 blob 37d4e6c5c48ba0d245164c4e10d5f41140cab980    hi.txt
A commit object with hash value dc512627287a61f6111705151f4e53f204fbda9b:
tree b769f35b07fbe0076dcfc36fd80c121d747ccc04
author Jacob Stopak 1574915303 -0800
committer Jacob Stopak 1574915303 -0800

Initial commit
Mercurial - Third Generation
Background
Mercurial was created in 2005 by Matt Mackall and it is written in Python. It was also started with the goal of hosting the codebase for Linux, but Git was chosen instead. It is the second most popular distributed VCS after Git, but is used far less often.
Architecture
Like Git, Mercurial is a distributed version control system that allows any number of developers to work with their own copy of a project independently from others. Mercurial leverages many of the same technologies as Git, such as compression and SHA-1 hashing, but does so in different ways.
When a new file is committed for tracking in Mercurial, a corresponding revlog file is created for it in the hidden directory .hg/store/data/. You can think of a revlog (or revision log) file as a modernized version of the history files used by the older VCS like CVS, RCS, and SCCS.
Unlike Git, which creates a new blob for every version of every staged file, Mercurial simply creates a new entry in the revlog for that file. To conserve space, each new entry only contains the delta (changes) from the previous version. Once a threshold number of deltas is reached, a full snapshot of the file is stored again.
This reduces the lookup time when applying many deltas to reconstruct a particular file revision.
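The trade-off between storing deltas and storing periodic full snapshots can be sketched with a small append-only log in Python. This is an illustration of the idea as described above, not Mercurial's revlog format: the DeltaLog class, its max_chain parameter (a simple count of deltas), and the copy/insert delta encoding built on difflib are all assumptions made for the example.

```python
import difflib


def make_delta(old, new):
    """Describe `new` as copy/insert instructions relative to `old`."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))
        # "delete" needs no instruction: those old lines are just not copied.
    return ops


def apply_delta(old, ops):
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class DeltaLog:
    """Append-only log of one file's revisions (each a list of lines)."""

    def __init__(self, max_chain=3):
        self.max_chain = max_chain  # deltas allowed before a new snapshot
        self.entries = []           # ("full", lines) or ("delta", ops)
        self._chain = 0

    def append(self, lines):
        if not self.entries or self._chain >= self.max_chain:
            self.entries.append(("full", list(lines)))
            self._chain = 0
        else:
            previous = self.read(len(self.entries) - 1)
            self.entries.append(("delta", make_delta(previous, list(lines))))
            self._chain += 1
        return len(self.entries) - 1

    def read(self, rev):
        base = rev
        while self.entries[base][0] != "full":
            base -= 1  # walk back to the nearest full snapshot
        lines = list(self.entries[base][1])
        for r in range(base + 1, rev + 1):
            lines = apply_delta(lines, self.entries[r][1])
        return lines


log = DeltaLog(max_chain=2)
for text in (["a"], ["a", "b"], ["a", "b", "c"], ["b", "c"], ["b", "c", "d"]):
    log.append(text)
print([kind for kind, _ in log.entries])
print(log.read(2), log.read(4))
```

Reading any revision never applies more than max_chain deltas, which bounds reconstruction time at the price of storing an occasional full copy.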
These file revlogs are named to match the files that they track, but are postfixed with .i and .d extensions. The .d files contain the compressed delta content. The .i files are used as indexes to quickly track down different revisions inside the .d files. For small files with low numbers of revisions, both the indexes and content are stored in .i files. Revlog file entries are compressed for performance and hashed for identification. The hash values are referred to as nodeids.
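A nodeid can be sketched in Python as a SHA-1 hash over the revision's parents and its text. The article only states that entries are hashed; the exact recipe used here (the two parent ids in sorted order, followed by the full text, with twenty zero bytes standing for a missing parent) is our understanding of Mercurial's scheme and should be treated as an assumption. The function name node_id is hypothetical.

```python
import hashlib

NULL_ID = b"\0" * 20  # stands for "no parent"


def node_id(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Hash a revision's text together with its (sorted) parent ids."""
    first, second = sorted((p1, p2))
    return hashlib.sha1(first + second + text).digest()


root = node_id(b"hi there\n")
child = node_id(b"hi there, again\n", p1=root)
print(root.hex())
print(child.hex())
```

Including the parents means that two revisions with identical text but different histories get different nodeids, so an identifier pins down a position in history as well as a piece of content.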
Whenever a new commit is made, Mercurial tracks all the file revisions in that commit in something called the manifest. The manifest is also a revlog file - it stores entries that correspond to particular states of the repository. However, instead of storing individual file content like the file revlogs, the manifest stores a list of filenames and nodeids that specify which file revision entries exist in each revision of the project. These manifest entries are also compressed and hashed. The hash values are again referred to as nodeids.
Lastly, Mercurial uses one more type of revlog called a changelog. The changelog contains a list of entries that associate each commit with the following information:
- Manifest nodeid: Identifies the full set of file revisions that exist at a particular time.
- Parent commit nodeid(s): This allows Mercurial to establish a timeline or branch of project history. One or two parent IDs are stored depending on the type of commit (normal vs merge).
- Commit author
- Commit date
- Commit message
Each changelog entry also generates a hash known as its nodeid.
Basic Commands
hg init: Initialize the current directory as a Mercurial repository (creates the hidden .hg folder and its contents).
hg clone <hg-url>: Download a copy of the Mercurial repository at the specified URL.
hg add <filename.ext>: Add a new file for revision tracking.
hg commit -m 'Commit message': Commit a set of changed files and folders along with a descriptive commit message.
hg status: Show information related to the state of the working directory, untracked files, modified files, etc.
hg update <revision>: Check out the specified revision or branch into the working directory.
hg merge <branch>: Merge the specified branch into the current branch checked out in the working directory.
hg pull: Download new revisions from remote repository but don't merge them into the working directory.
hg push: Transfer new revisions to remote repository.
hg log: Show the commit history and associated descriptive messages for the active branch.
Sample Mercurial Files
Manifest revlog entry:
hey.txt 208b6e0998e8099b16ad0e43f036ec745d58ec04
hi.txt 74568dc1a5b9047c8041edd99dd6f566e78d3a42
Changelog revlog entry:
b8ee947ce6f25b84c22fbefecab99ea918fc0969
Jacob Stopak
1575082451 28800
hey.txt
Add hey.txt
For more information on Mercurial internals, check out the following links:
- Bryan O'Sullivan's Hg Book
- Mercurial Wiki (Revlog)
- Mercurial Wiki (ChangeSet)
- Mercurial Wiki (Manifest)
- Mercurial Wiki (Revision)
- Mercurial Wiki (Nodeid)
Summary
In this article, we provided a technical comparison of some historically relevant version control systems. To continue learning about the internals of other important VCS, check out Part 2 of this article.
If you're interested in more resources about how Git's code works, check out our Baby Git Guidebook for Developers.
If you're interested in learning the basics of coding and software development, check out our Coding Essentials Guidebook for Developers.
A special thanks to Reddit user u/Teknikal_Domain, who provided expert details and insight that greatly contributed to the writing of this article.
If you have any questions or comments, feel free to reach out to jacob@initialcommit.io
References
- Walter Tichy, RCS - A System for Version Control https://www.academia.edu/26671521/Rcs_a_system_for_version_control