The Evolution of Version Control System (VCS) Internals
In this article, we'll provide a technical comparison of some of the most historically significant Version Control Systems (VCS). The original version of this article covered six VCS: SCCS, RCS, CVS, SVN, Git, and Mercurial. Due to popular demand, we added six more - Perforce Helix Core, BitKeeper, Darcs, Monotone, Bazaar, and Fossil. The first six are covered in this article and the new additions are covered in Part 2.
Below is a list of these VCS, categorized by generation:
The first generation VCS were intended to track changes for individual files, and checked-out files could only be edited locally by one user at a time. They were built on the assumption that all users would log into the same shared Unix host with their own accounts.
The second generation VCS introduced networking, which led to centralized repositories that contained the 'official' versions of their projects. This was good progress, since it allowed multiple users to check out and work with the code at the same time, but they would all be committing back to the same central repository. Furthermore, network access was required to make commits.
The third generation comprises the distributed VCS. In a distributed VCS, all copies of the repository are created equal - there is no central copy of the repository. This opens the path for commits, branches, and merges to be created locally without network access and pushed to other repositories as needed.
VCS Release History Timeline
For context, here is a timeline of the creation of these VCS tools:
Figure 1: Timeline of the Creation of Version Control Systems
SCCS - Source Code Control System - First Generation
SCCS is considered to be one of the first successful VCS tools created. It was developed by Marc Rochkind at Bell Labs in 1972. The first version was written in SNOBOL4 for IBM mainframes, and it was soon rewritten in C for Unix. It was created to solve the problems of source file revision tracking. Furthermore, it made it significantly easier to track down the source of bugs introduced into a program. SCCS is worth understanding at a basic level because it is the seed of the set of modern VCS tools that are so important to developers today.
Like most modern day VCS, SCCS has a set of commands that allow developers to work with versioning of their files. These commands are used to:
- Check in files to have their history tracked using SCCS
- Check out specific file revisions for review or compilation
- Check out specific file revisions for editing
- Check in new file revisions along with a comment explaining the changes
- Revert changes made in a checked out file
- Basic branching and merging of changes
- Provide a log of a file's revision history
A special type of file called an s-file or a history file is created when a file is added for tracking with SCCS. This file is named using the original file name prefixed with s. and is stored in a subdirectory called SCCS. So a file called test.txt would get a history file created in the ./SCCS/ directory with a name of s.test.txt. On creation, the history file contains the initial content of the original file as well as some metadata to assist with version tracking. A checksum is stored in the history file to verify that the content has not been corrupted or tampered with. The history file content is not compressed or encoded (unlike in the later generation VCS, as we will see).
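The naming rule above can be written down in a few lines. This is only an illustrative sketch in Python; the function name sccs_history_path is hypothetical and is not part of SCCS itself.

```python
from pathlib import PurePosixPath


def sccs_history_path(working_file: str) -> PurePosixPath:
    """Map a working file to the history file (s-file) that tracks it.

    Follows the convention described in the text: an SCCS/ subdirectory
    next to the file, and the file name prefixed with "s.".
    """
    path = PurePosixPath(working_file)
    return path.parent / "SCCS" / ("s." + path.name)


print(sccs_history_path("test.txt"))  # SCCS/s.test.txt
```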
Since the content of the original file is now stored in the history file, it can be retrieved into the working directory for review, compilation, or editing. Further changes made to the file such as line additions, modifications, and removals can be checked back into the history file, which increments its revision number.
Subsequent SCCS checkins store only the deltas, or changes, to a file as opposed to the entire file content each time. This decreases the size of the history file. Each time a checkin is made, the delta is recorded in a structure known as a delta table inside the history file. As previously mentioned, the actual file content is more or less copied verbatim, with special control sequences for marking the start and end of sections of added and removed content. Since SCCS history files don't use compression, they are typically larger in size than the actual file being tracked. SCCS uses a delta method known as interleaved deltas. This is beneficial since any revision can be reconstructed in a single pass over the history file, so checkout time does not depend on how old the checked-out revision is - i.e. older revisions don't take longer to check out than newer revisions.
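To make the interleaved delta idea more concrete, here is a minimal sketch in Python of how a revision could be pulled out of such a file body. It is a simplification and not the real SCCS implementation: it assumes a purely linear history (no branches, no included or excluded deltas), and the function name extract_revision is hypothetical. The input is the body of the sample history file shown later in this section, with ^A written as the byte \x01.

```python
CTRL = "\x01"  # SCCS control lines start with the SOH byte, displayed as ^A


def extract_revision(weave_lines, target_serial):
    """Return the lines that are visible at the given delta serial number.

    Simplifying assumption: linear history, so a delta is "applied" when its
    serial number is <= target_serial.
    """
    active_inserts = set()  # serials of the insert blocks we are inside
    active_deletes = set()  # serials of the delete blocks we are inside
    visible = []
    for line in weave_lines:
        if line.startswith(CTRL):
            kind, serial = line[1], int(line[2:])
            if kind == "I":
                active_inserts.add(serial)
            elif kind == "D":
                active_deletes.add(serial)
            elif kind == "E":
                active_inserts.discard(serial)
                active_deletes.discard(serial)
            continue
        inserted = all(s <= target_serial for s in active_inserts)
        deleted = any(s <= target_serial for s in active_deletes)
        if inserted and not deleted:
            visible.append(line)
    return visible


# Body of the sample history file from this article.
WEAVE = [
    "\x01I 1", "Hi there", "\x01E 1",
    "\x01I 2", "", "\x01D 3", "This is a test of SCCS", "\x01E 2", "\x01E 3",
    "\x01I 3", "A test of SCCS", "\x01E 3",
]

for serial in (1, 2, 3):
    print(serial, extract_revision(WEAVE, serial))
```

Every revision is produced by the same single pass over the body, which is why the age of the requested revision does not affect the amount of work.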
One important thing to note is that all files are tracked and checked in separately in SCCS. There is no way to check in changes to multiple files as a part of one atomic unit - like a commit in Git. Each tracked file has a corresponding history file which stores its revision history. In general, this means that the version numbers of different files in a project will not usually match each other. However, matching revision numbers can be achieved by editing every file in the project at once (even if not all of the files have real changes) and checking them all in at one time. This will increment the revision number for all the files to keep them consistent, but note that this is NOT the same as including multiple files in a single commit like in Git. In SCCS, this makes an individual checkin in each history file, as opposed to one big commit including all the changes at once.
When a file is checked out for editing in SCCS, a lock is placed on the file so it cannot be edited by anyone else. This prevents changes from being overwritten by other users, but also limits development since only one user can work with a given file at a time.
SCCS has support for branches that can store sequences of changes within a specific file. Branches can be merged back in with the original versions or merged with other branched versions of the same parent.
Below is a list of the most common SCCS commands.
sccs create <filename.ext>: Check in a new file to SCCS and create a new history file for it (in the ./SCCS/ directory by default).
sccs get <filename.ext>: Check out a file from its corresponding history file and place it in the working directory in read-only mode.
sccs edit <filename.ext>: Check out a file from the corresponding history file for editing. Locks the history file so no other users can modify it.
sccs delta <filename.ext>: Check in the modifications to the specified file. Will prompt for a comment, store the changes in the history file, and remove the lock.
sccs prt <filename.ext>: Display the revision log for a tracked file.
sccs diffs <filename.ext>: Display the differences between the current working copy of a file and the state of the file when it was checked out.
Sample SCCS History File
^Ah20562
^As 00001/00001/00002
^Ad D 1.3 19/11/26 14:37:08 jack 3 2
^Ac Here is a comment.
^Ae
^As 00002/00000/00001
^Ad D 1.2 19/11/26 14:36:00 jack 2 1
^Ac No.
^Ae
^As 00001/00000/00000
^Ad D 1.1 19/11/26 14:35:27 jack 1 0
^Ac date and time created 19/11/26 14:35:27 by jack
^Ae
^Au
^AU
^Af e 0
^At
^AT
^AI 1
Hi there
^AE 1
^AI 2

^AD 3
This is a test of SCCS
^AE 2
^AE 3
^AI 3
A test of SCCS
^AE 3
RCS - Revision Control System - First Generation
RCS was written in C by Walter Tichy in 1982 as an alternative to SCCS, which wasn't open source at the time.
RCS shares many traits with its predecessor, including:
- Handling revisions on a file-by-file basis
- Changes across multiple files can't be grouped together into an atomic commit
- Tracked files are intended to be modified by one user at a time
- No network functionality
- Revisions for each tracked file are stored in a corresponding history file
- Basic branching and merging of revisions within individual files.
When a file is checked into RCS for the first time, a corresponding history file is created for it in the local ./RCS/ directory. This file is postfixed with ,v so a file named test.txt would be tracked by a file called test.txt,v.
RCS uses a reverse-delta scheme for storing file changes. When a file is checked in, a full snapshot of the file's content is stored in the history file. When the file is modified and checked in again, a delta is calculated against the existing history file content. The old snapshot is discarded and the new one is saved, along with the delta needed to get back to the older state. This is called reverse-delta since to check out an older revision, RCS starts with the newest version of the file and applies consecutive deltas until the older revision is reached. This method allows for very quick checkouts of current revisions since the full snapshot of the current revision is always available. However, the older the checked-out revision, the longer the checkout takes since an increasing number of deltas need to be applied to the current snapshot.
This is not the case with SCCS which takes the same amount of time to fetch any revision. In addition, no checksum is stored in RCS history files so file integrity cannot be ensured.
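As a rough illustration of the reverse-delta idea, the sketch below applies one RCS-style edit script to the newest text in order to step back one revision. It is a sketch in Python under stated assumptions, not RCS's own code: the function name apply_rcs_delta is hypothetical, and it assumes the script uses only the two commands seen in the sample history file in this section, dL N (delete N lines starting at line L) and aL N (add the following N lines after line L), with line numbers referring to the newer text.

```python
def apply_rcs_delta(lines, delta):
    """Apply an RCS-style edit script to `lines` and return the older text.

    `lines` is the newer revision as a list of lines.
    `delta` is the edit script as a list of lines, e.g. ["d1 5", "a5 1", "hi there"].
    """
    out = []
    src = 0  # index of the next line of `lines` not yet copied or skipped
    i = 0
    while i < len(delta):
        command = delta[i]
        i += 1
        op = command[0]
        start, count = (int(field) for field in command[1:].split())
        if op == "d":
            out.extend(lines[src:start - 1])  # keep everything before the deleted run
            src = start - 1 + count           # skip the deleted lines
        elif op == "a":
            out.extend(lines[src:start])      # keep everything up to and including line `start`
            src = max(src, start)
            out.extend(delta[i:i + count])    # the added lines follow the command
            i += count
        else:
            raise ValueError("unknown edit command: " + command)
    out.extend(lines[src:])                   # keep whatever remains
    return out


# Text of revision 1.2 and the delta stored for 1.1 in the sample history file.
newest = ["hi there, you are my bud.", "", "You are so cool!", "", "The end."]
delta_to_1_1 = ["d1 5", "a5 1", "hi there"]
print(apply_rcs_delta(newest, delta_to_1_1))  # ['hi there']
```

Checking out a revision several steps back would repeat this once per stored delta, which is where the extra checkout time for older revisions comes from.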
Below is a list of the most common RCS commands:
ci <filename.ext>: Check in a new file to RCS and create a new history file for it (in the ./RCS/ directory by default).
co <filename.ext>: Check out a file from its corresponding history file and place it in the working directory in read-only mode.
co -l <filename.ext>: Check out a file from the corresponding history file for editing. Locks the history file so no other users can modify it.
ci <filename.ext>: Check in file changes and create a new revision for it in its corresponding history file.
merge <file-to-merge-into.ext> <parent.ext> <file-to-merge-from.ext>: Merge changes from two modified children of the same parent file.
rcsdiff <filename.ext>: Display the differences between the current working copy of a file and the state of the file when it was checked out.
rcsclean: Removes working files that don't have locks.
For more information on RCS internals, see the GNU RCS manual.
Sample RCS History File
head 1.2;
access;
symbols;
locks; strict;
comment @# @;

1.2
date 2019.11.25.05.51.55; author jstopak; state Exp;
branches;
next 1.1;

1.1
date 2019.11.25.05.49.02; author jstopak; state Exp;
branches;
next ;

desc
@This is a test.
@

1.2
log
@Edited the file.
@
text
@hi there, you are my bud.

You are so cool!

The end.
@

1.1
log
@Initial revision
@
text
@d1 5
a5 1
hi there
@
CVS - Concurrent Versions System - Second Generation
CVS was created by Dick Grune in 1986 with the goal of adding a networking element to version control. It is also written in C. CVS kicked off the second generation of VCS tools which allowed geographically dispersed development teams to work on projects together.
CVS is a frontend for RCS - it provides a set of commands for interacting with files in a project, but uses the RCS history file format and commands behind the scenes. For the first time in VCS history, CVS allowed multiple developers to check out and work on the same files simultaneously. It did this by using a centralized repository model. The first step is to set up a centralized repository on a remote server using CVS. Projects can then be imported into the repository. When a project is imported into CVS, each file is converted into a ,v history file and stored in a central directory known as a module. The repository generally lives on a remote server which is accessible over a local network or the Internet.
A developer checks out a copy of the module, which is copied to a working directory on their local machine. No files are locked in this process so there is no limit to the number of developers that can check out the module at one time. Developers can modify their checked out files and commit their changes as needed. If a developer commits a change, other developers will need to update their working copies via a (usually) automated merge process before committing their changes. Occasionally merge conflicts will need to be manually resolved before the commit can be made. CVS also provides the ability to create and merge branches.
export CVSROOT=<path/to/repository>: Sets the CVS repository root directory so it doesn't need to be specified in each command.
cvs import -m 'Import module' <module-name> <vendor-tag> <release-tag>: Import a directory of files into a CVS module. Before running this, change into the root directory of the project you want to import.
cvs checkout <module-name>: Copy a module to the working directory.
cvs commit <filename.ext>: Commit a changed file back to the module in the central repository.
cvs add <filename.txt>: Add a new file to track revisions for.
cvs update: Update the working copy by merging in committed changes that exist in the central repository but not the working copy.
cvs status: Show general information about the checked out working copy of a module.
cvs tag <tag-name> <files>: Add an identifying tag to a single file or set of files.
cvs tag -b <new-branch-name>: Create a new branch in the repository (must be checked out before working on it locally).
cvs checkout -r <branch-name>: Checkout an existing branch to the working directory.
cvs update -j <branch-to-merge>: Merge an existing branch into the local working copy.
Sample CVS History File
head 1.1;
branch 1.1.1;
access ;
symbols start:1.1.1.1 jack:1.1.1;
locks ; strict;
comment @# @;

1.1
date 2019.11.26.18.45.07; author jstopak; state Exp;
branches 1.1.1.1;
next ;
commitid zsEBhVyPc4lonoMB;

1.1.1.1
date 2019.11.26.18.45.07; author jstopak; state Exp;
branches ;
next ;
commitid zsEBhVyPc4lonoMB;

1.1
log
@Initial revision
@
text
@hi there
@

1.1.1.1
log
@Imported sources
@
text
@@
SVN - Subversion - Second Generation
Subversion was created in 2000 by CollabNet, Inc. and is now maintained by the Apache Software Foundation. It is written in C and was designed to be a more robust centralized solution than CVS.
Like CVS, Subversion uses a centralized repository model. Remote users must have a working network connection to commit their changes to the central repository.
Subversion introduced the functionality of atomic commits which ensured that a commit would either fully succeed, or be completely abandoned if an issue occurred. In CVS, if a commit operation failed midway, for example due to a network outage, the repository could be left in a corrupted and inconsistent state. Furthermore, a commit or revision in Subversion can include multiple files and directories. This is important since it allows users to track sets of related changes together as a grouped unit, instead of the past storage models that track changes separately for each file.
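The all-or-nothing behavior of a multi-file commit can be illustrated with a toy in-memory model. This is a sketch in Python only: ToyRepository, OutOfDateError, and the rule used to reject a commit are illustrative assumptions, not Subversion's actual implementation or API.

```python
class OutOfDateError(Exception):
    """Raised when a committed path changed after the committer's base revision."""


class ToyRepository:
    def __init__(self):
        self.revisions = [{}]   # revision 0 is the empty tree: path -> content
        self.last_changed = {}  # path -> revision number that last changed it

    @property
    def head(self):
        return len(self.revisions) - 1

    def commit(self, base_revision, changes):
        """Commit a dict of path -> new content as ONE new revision, or not at all."""
        # Phase 1: validate every path before touching any stored state.
        for path in changes:
            if self.last_changed.get(path, 0) > base_revision:
                raise OutOfDateError(path)
        # Phase 2: build the complete new tree, then publish it in a single step.
        new_tree = dict(self.revisions[-1])
        new_tree.update(changes)
        self.revisions.append(new_tree)
        for path in changes:
            self.last_changed[path] = self.head
        return self.head


repo = ToyRepository()
repo.commit(0, {"a.txt": "1", "b.txt": "1"})  # revision 1 groups two files
repo.commit(1, {"a.txt": "2"})                # revision 2
try:
    # Based on revision 1, so a.txt is stale: the whole commit is refused.
    repo.commit(1, {"b.txt": "2", "a.txt": "3"})
except OutOfDateError as error:
    print("rejected, out of date:", error)
print(repo.head, repo.revisions[repo.head])
```

Because nothing is written until every path has been checked, a failed commit leaves no half-applied change behind, and each successful commit is a single revision covering all of its files.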
The current storage model that Subversion uses for tracked files is called FSFS (File System atop the File System). This name was chosen since it builds its database out of ordinary files and directories on the operating system's filesystem. The unique feature of the Subversion filesystem is that it is designed to track not only the files and directories it contains, but also the different versions of these files and directories as they change over time. It is a filesystem with an added time dimension. In addition, folders are first-class citizens in Subversion. Empty folders can be committed in Subversion, whereas in the other VCS covered here (even Git) empty folders go unnoticed.
When a Subversion repository is created, a (nearly) empty database of files and folders is created as a part of it. A directory called db/revs is created in which all revision tracking information for the checked-in (committed) files is stored. Each commit (which can include changes to multiple files) is stored in a new file in the revs directory and is named with a sequential numeric identifier starting with 1. When a file is committed for the first time, its full content is stored. Future commits of the same file will store only the changes - also called the diffs or deltas - in order to conserve space. In addition, the deltas are compressed using zlib compression to further reduce their size.
By default, this is actually only true to a point. Although storing file deltas instead of the whole file each time does save on storage space, it adds time to checkout and commit operations since all the deltas need to be strung together in order to recreate the current state of the file. For this reason, by default Subversion stores up to 1023 deltas per file before storing a new full copy of the file. This achieves a nice balance of both storage and speed.
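The trade-off described above, storing deltas but starting over with a full copy once the chain gets long, can be sketched as follows. This is an illustration in Python and not Subversion's storage format: DeltaStore, make_delta, and apply_delta are hypothetical names, the delta encoding is a simple copy/insert list built with Python's difflib, and the chain limit is set to 3 here only to keep the example small.

```python
import difflib

MAX_CHAIN = 3  # tiny for demonstration; the text above cites 1023 for Subversion


def make_delta(old, new):
    """Describe `new` as copies of ranges of `old` plus inserted lines."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        else:  # replace, insert, or delete: emit the new lines (possibly none)
            delta.append(("insert", new[j1:j2]))
    return delta


def apply_delta(old, delta):
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class DeltaStore:
    def __init__(self, max_chain=MAX_CHAIN):
        self.max_chain = max_chain
        self.entries = []    # ("full", lines) or ("delta", ops), one per revision
        self._latest = None  # content of the newest revision
        self._chain = 0      # deltas stored since the last full copy

    def add(self, lines):
        lines = list(lines)
        if self._latest is None or self._chain >= self.max_chain:
            self.entries.append(("full", lines))
            self._chain = 0
        else:
            self.entries.append(("delta", make_delta(self._latest, lines)))
            self._chain += 1
        self._latest = lines
        return len(self.entries) - 1

    def get(self, revision):
        base = revision
        while self.entries[base][0] != "full":  # walk back to the nearest full copy
            base -= 1
        lines = list(self.entries[base][1])
        for _kind, ops in self.entries[base + 1:revision + 1]:
            lines = apply_delta(lines, ops)
        return lines


versions = [["a"], ["a", "b"], ["a", "b", "c"], ["b", "c"], ["b", "c", "d"], ["x", "c", "d"]]
store = DeltaStore()
for version in versions:
    store.add(version)
print([kind for kind, _ in store.entries])
```

With a limit of 3, reconstructing any revision never requires applying more than three deltas, at the cost of an occasional full copy.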
SVN does not use a conventional branching and tagging system. A normal Subversion repository layout is to have three folders in the root: trunk/, branches/, and tags/. The trunk/ folder is used for the production version of the application. The branches/ folder is used to store subfolders that correspond to individual branches. The tags/ folder is used to store tags which represent specific (usually significant) project revisions.
svnadmin create <path-to-repository>: Create a new, empty repository shell in the specified directory.
svn import <path-to-project> <svn-url>: Import a directory of files into the specified Subversion repository path.
svn checkout <svn-path> <path-to-checkout>: Copy a stored repository path to the desired working directory.
svn commit -m 'Commit message': Commit a set of changed files and folders along with a descriptive commit message. The message of the initial commit is typically set to 'Initial commit'.
svn add <filename.txt>: Add a new file to track revisions for.
svn update: Update the working copy by merging in committed changes that exist in the central repository but not the working copy.
svn status: Show a list of tracked files that have been changed in the working directory (if any).
svn info: Show a list of general details about the checked-out copy.
svn copy <branch-to-copy> <new-branch-path-and-name>: Create a new branch by copying an existing one.
svn switch <existing-branch>: Switch the working directory to an existing branch. This will checkout the specified branch.
svn merge <existing-branch>: Merge the specified branch into the current branch checked out in the working directory. Note this needs to be committed afterwards.
svn log: Show the commit history and associated descriptive messages for the active branch.
For more information on SVN internals, see the Version Control with Subversion book.
Sample SVN Revision File
DELTA SVN^B^@^@ ^B ^A<89> hi there ENDREP id: 2-1.0.r1/4 type: file count: 0 text: 1 3 21 9 12f6bb1941df66b8f138a446d4e8670c 279d9035886d4c0427549863c4c2101e4a63e041 0-0/_4 cpath: /trunk/hi.txt copyroot: 0 /
DELTA SVN^B^@^@$^B%^$K 6A hi.txt V 15 file 2-1.0.r1/4 END ENDREP id: 0-1.0.r1/6 type: dir count: 0 text: 1 5 48 36 d84cb1c29105ee7739f3e834178e6345 - - cpath: /trunk copyroot: 0 /
DELTA SVN^B^@^@'^B#^'K 5A trunk V 14 dir 0-1.0.r1/6 END ENDREP id: 0.0.r1/2 type: dir pred: 0.0.r0/2 count: 1 text: 1 7 46 34 1d30e888ec9e633100992b752c2ff4c2 - - cpath: / copyroot: 0 /
_0.0.t0-0 add-dir false false false /trunk
_2.0.t0-0 add-file true false false /trunk/hi.txt
L2P-INDEX ^A<80>@^A^A^A^M^H^@^H^A^FD^Bz^AP2L-INDEX ^A<91>^E<80><80>@^A?^@'2^@<8D<90>^^^N=A^X^@C> ^@<8d>^Ft^V^@<92><9a><89>^E; ^@<8Aw|I^@<88><83>><93>^L`^<92>M^E^@?^[^@^@657 6aad60ec758d121d5181ea4b81a9f5f4 688 75f59082c8b5ab687ae87708432ca406I
Git - Third Generation
Git was created in 2005 by Linus Torvalds (also the creator of Linux) and is written primarily in C combined with some shell scripts. It is a great VCS due to its features, flexibility, and speed. Linus Torvalds originally wrote it for the Linux codebase and it has grown to become the most popular VCS in use today.
Git is a distributed VCS. This means that no copy of the repository needs to be designated as the centralized copy - all copies are created equal. This is in stark contrast to the second generation VCS which rely on a centralized copy for users to checkin and checkout from. What this means is that developers can share changes with each other directly before merging their changes into an official branch.
Furthermore, developers can commit their changes to their local copy of the repository without any other repositories knowing about it. This means that commits can be made without any network or Internet connection. Developers can work locally offline until they are ready to share their work with others. At that point, the changes can be pushed to other repositories for review, testing, or deployment.
When a file is added for tracking with Git, its content is prefixed with a short header and hashed using the SHA-1 hash function. This yields a hash value that corresponds specifically to the content in that file. The content is then compressed using the zlib compression algorithm and stored in an object database which is located in the hidden .git/objects folder. The location of the file is derived from the generated hash value (the first two hex characters name a subfolder and the remaining characters name the file), and the file contains the compressed content. These files are called blobs and are created each time a new file (or changed version of an existing file) is added to the repository.
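The way a blob gets its name can be reproduced in a few lines. The sketch below is in Python and the function name git_blob is hypothetical. It follows Git's loose object layout: the SHA-1 is taken over a small header plus the uncompressed content, and the zlib-compressed bytes are what gets written under .git/objects.

```python
import hashlib
import zlib


def git_blob(content: bytes):
    """Return (object id, loose object path, compressed bytes) for a blob."""
    header = b"blob %d\x00" % len(content)  # object type, size, NUL separator
    store = header + content
    object_id = hashlib.sha1(store).hexdigest()
    path = ".git/objects/%s/%s" % (object_id[:2], object_id[2:])
    return object_id, path, zlib.compress(store)


object_id, path, compressed = git_blob(b"")
print(object_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the id of the empty blob
print(path)
```

Two files with identical content always produce the same object id, so Git stores that content only once.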
Git implements a staging index which acts as an intermediate area for changes that are getting ready to be committed. As new changes are staged for commit, their blobs are referenced in a special index file, from which a tree object is written at commit time. A tree is a Git object that connects blob objects to their real file names and file permissions and links to other trees, and in this way represents the state of a particular set of files and directories. Once all related changes are staged for commit, the index can be committed to the repository, which creates a commit object in the Git object database. A commit references the root tree for a particular revision as well as the commit author, email address, date, and a descriptive commit message. Each commit also stores a reference to its parent commit(s) and so over time a history of project development is established.
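To show how trees and commits tie blobs together, here is a small sketch in Python of how their contents are laid out before hashing. The helper names (hash_object, tree_body, commit_body) and the author string are hypothetical. The entry sorting is simplified to plain name order, which is enough for a flat directory but is not the full rule Git uses when subdirectories are involved.

```python
import hashlib


def hash_object(kind: bytes, body: bytes) -> str:
    """SHA-1 over '<kind> <size>' + NUL + body, as used for every Git object."""
    return hashlib.sha1(kind + b" %d\x00" % len(body) + body).hexdigest()


def tree_body(entries):
    """Build a tree body from (mode, name, hex object id) tuples."""
    body = b""
    for mode, name, object_id in sorted(entries, key=lambda entry: entry[1]):
        # Each entry: mode, space, file name, NUL, then the raw 20-byte object id.
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(object_id)
    return body


def commit_body(tree_id, parents, author, timestamp, message):
    """Build a commit body; `author` is reused as committer for brevity."""
    lines = ["tree " + tree_id]
    lines += ["parent " + parent for parent in parents]
    lines.append("author %s %s" % (author, timestamp))
    lines.append("committer %s %s" % (author, timestamp))
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


# One tree entry pointing at the blob id used in this article's sample.
tree = tree_body([("100644", "hi.txt", "37d4e6c5c48ba0d245164c4e10d5f41140cab980")])
tree_id = hash_object(b"tree", tree)
commit = commit_body(tree_id, [], "A U Thor <author@example.com>",
                     "1574915303 -0800", "Initial commit")
print(tree_id)
print(hash_object(b"commit", commit))
```

Since a commit id covers the tree id and the parent ids, changing any file or any earlier commit changes every id that follows it.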
As mentioned, all Git objects - blobs, trees, and commits - are hashed, compressed, and stored in the object database based on their hash value. These are called loose objects. At this point no diffs have been utilized to save space, which makes Git very fast since the full content of each file revision is accessible as a loose object. However, certain operations such as pushing commits to a remote repository, storing too many objects, or manually running Git's garbage collection command can cause Git to repackage the objects into pack files. In the packing process, deltas between similar objects are computed and compressed to eliminate redundant content and reduce size. This process results in .pack files containing the object content, each with a corresponding .idx (or index) file containing a reference of the packed objects and their locations in the pack file.
These pack files are transferred over the network when branches are pushed to or pulled from remote repositories. When pulling or fetching branches, the pack files are unpacked to create the loose objects in the object repository.
git init: Initialize the current directory as a Git repository (creates the hidden .git folder and its contents).
git clone <git-url>: Download a copy of the Git repository at the specified URL.
git add <filename.ext>: Add an untracked file or changed file to the staging area (creates corresponding entries in the object database).
git commit -m 'Commit message': Commit a set of changed files and folders along with a descriptive commit message.
git status: Show information related to the state of the working directory, current branch, untracked files, modified files, etc.
git branch <new-branch>: Create a new branch based on the current checked-out branch.
git checkout <branch>: Checkout the specified branch into the working directory.
git merge <branch>: Merge the specified branch into the current branch checked out in the working directory.
git pull: Update the working copy by merging in committed changes that exist in the remote repository but not the working copy.
git push: Pack loose objects for local active branch commits into pack files and transfer to remote repository.
git log: Show the commit history and associated descriptive messages for the active branch.
git stash: Save all uncommitted changes in the working directory to a cache so that they can be retrieved later.
Sample Git Blob, Tree, Commit Files
A blob file with hash value
A tree object with hash value
100644 blob 37d4e6c5c48ba0d245164c4e10d5f41140cab980 hi.txt
A commit object with hash value
tree b769f35b07fbe0076dcfc36fd80c121d747ccc04
author Jacob Stopak 1574915303 -0800
committer Jacob Stopak 1574915303 -0800
Mercurial - Third Generation
Mercurial was created in 2005 by Matt Mackall and it is written in Python. It was also started with the goal of hosting the codebase for Linux, but Git was chosen instead. It is the second most popular distributed VCS after Git, but is used far less often.
Like Git, Mercurial is a distributed version control system that allows any number of developers to work with their own copy of a project independently from others. Mercurial leverages many of the same technologies as Git, such as compression and SHA-1 hashing, but does so in different ways.
When a new file is committed for tracking in Mercurial, a corresponding revlog file is created for it in the hidden directory .hg/store/data/. You can think of a revlog (or revision log) file as a modernized version of the history files used by the older VCS like CVS, RCS, and SCCS. Unlike Git, which creates a new blob for every version of every staged file, Mercurial simply creates a new entry in the revlog for that file. To conserve space, each new entry only contains the delta (changes) from the previous version. Once a threshold number of deltas is reached, a full snapshot of the file is stored again. This reduces the lookup time when applying many deltas to reconstruct a particular file revision.
These file revlogs are named to match the files that they track, but are postfixed with .i and .d extensions. The .d files contain the compressed delta content. The .i files are used as indexes to quickly track down different revisions inside the .d files. For small files with low numbers of revisions, both the indexes and content are stored in .i files. Revlog file entries are compressed for performance and hashed for identification. The hash values are referred to as nodeids.
Whenever a new commit is made, Mercurial tracks all the file revisions in that commit in something called the manifest. The manifest is also a revlog file - it stores entries that correspond to particular states of the repository. However, instead of storing individual file content like the file revlogs, the manifest stores a list of filenames and nodeids that specify which file revision entries exist in each revision of the project. These manifest entries are also compressed and hashed. The hash values are again referred to as nodeids.
Lastly, Mercurial uses one more type of revlog called a changelog. The changelog contains a list of entries that associate each commit with the following information:
- Manifest nodeid: Identifies the full set of file revisions that exist at a particular time.
- Parent commit nodeid(s): This allows Mercurial to establish a timeline or branch of project history. One or two parent IDs are stored depending on the type of commit (normal vs merge).
- Commit author
- Commit date
- Commit message
Each changelog entry also generates a hash known as its nodeid.
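The hashing used to identify revlog entries can be sketched as follows. This is an illustration in Python based on the scheme described on the Mercurial wiki, where an entry's hash covers its two parent ids (in sorted order) followed by the entry's text. The function name revlog_nodeid and the sample text are hypothetical.

```python
import hashlib

NULL_ID = b"\x00" * 20  # stands in for a missing parent


def revlog_nodeid(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> str:
    """Hash a revlog entry together with its parents' 20-byte ids."""
    first, second = sorted([p1, p2])  # sorting makes the parent order irrelevant
    digest = hashlib.sha1(first)
    digest.update(second)
    digest.update(text)
    return digest.hexdigest()


root = revlog_nodeid(b"hi there\n")
child = revlog_nodeid(b"hi there\n", p1=bytes.fromhex(root))
print(root)
print(child)
```

Because the parents are part of the hash, the same content committed at a different point in history gets a different identifier.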
hg init: Initialize the current directory as a Mercurial repository (creates the hidden .hg folder and its contents).
hg clone <hg-url>: Download a copy of the Mercurial repository at the specified URL.
hg add <filename.ext>: Add a new file for revision tracking.
hg commit -m 'Commit message': Commit a set of changed files and folders along with a descriptive commit message.
hg status: Show information related to the state of the working directory, untracked files, modified files, etc.
hg update <revision>: Check out the specified revision (or branch) into the working directory.
hg merge <branch>: Merge the specified branch into the current branch checked out in the working directory.
hg pull: Download new revisions from remote repository but don't merge them into the working directory.
hg push: Transfer new revisions to remote repository.
hg log: Show the commit history and associated descriptive messages for the active branch.
Sample Mercurial Files
Manifest revlog entry:
Changelog revlog entry:
b8ee947ce6f25b84c22fbefecab99ea918fc0969
Jacob Stopak
1575082451 28800
hey.txt
For more information on Mercurial internals, check out the following links:
- Bryan O'Sullivan's Hg Book
- Mercurial Wiki (Revlog)
- Mercurial Wiki (ChangeSet)
- Mercurial Wiki (Manifest)
- Mercurial Wiki (Revision)
- Mercurial Wiki (Nodeid)
In this article, we provided a technical comparison of some historically relevant version control systems. To continue learning about the internals of other important VCS, check out Part 2 of this article.
A special thanks to Reddit user u/Teknikal_Domain, who provided expert details and insight that greatly contributed to the writing of this article.