An Overview of Git's Original C Header File

Introduction

Git is mostly written in the C programming language. C uses header files to hold declarations that are shared across multiple source files, including variable and function declarations, structure definitions, and macros. In this article, we provide background information on C header files and walk through the code in Git's original header file line by line. If you're already familiar with C header files, skip directly to the section on Git's original header file.

Side-note: If you're interested in a great beginner book on using Git, I highly recommend Version Control with Git, by O'Reilly Media. I read this book a few years ago and it clarified a lot of Git concepts and commands that I now use almost every day!

C Header Files

C is a statically typed language. This means that programmers must specify the data type of each variable used in their code. The data types of function arguments and return values must be specified as well.

For example, a variable can be defined as an integer in C like this:

int myNumber;

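This is called a variable declaration. The C compiler checks that every variable is declared with a type before it is used. It also checks that the values assigned to each variable are compatible with its type, converting between compatible types (such as integers and floating-point numbers) where the language allows. The compiler reports an error during compilation if a variable is used without a declaration or is assigned a value of an incompatible type.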
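
As a small illustration of how the declared type governs what a variable can hold, the sketch below assigns a floating-point value to an int. The function name storeAsInt is hypothetical and exists only for this example.

```c
/*
 * Hypothetical helper: returns whatever ends up stored in an `int`
 * variable when a `double` is assigned to it.
 */
int storeAsInt(double value) {
    /* The declared type is `int`, so the compiler converts the value
     * and the fractional part is discarded (truncation toward zero). */
    int myNumber = value;
    return myNumber;
}

/*
 * By contrast, an assignment between incompatible types is rejected
 * at compile time. Uncommenting the line below should produce an error:
 *
 *     int *pointer = 3.9;
 */
```

Calling storeAsInt(3.9) stores 3 rather than 3.9, because the variable was declared as an int.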

Function definitions must specify type information for the return type and the types of any arguments. Let's consider a simple function that adds two numbers:

int addNumbers(int x, int y) {
    return x + y;
}

Note that we specified the return type of the function with the initial int, and we also specified the types of the two arguments x and y. However, in some cases we must declare the function before using it, as follows:

int addNumbers(int x, int y);

This is called a function prototype or the function signature. This tells the C compiler the function's name, return type, and argument types. This is required when:

  • A function is called from a different file than it is defined in.
  • A function call occurs before its definition in a single file.
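
The second case can be sketched in a single file. Here the prototype for addNumbers appears first, so a caller placed above the definition still compiles. The caller addThree is a hypothetical function added only for illustration.

```c
/* Prototype: gives the compiler the name, return type, and argument types. */
int addNumbers(int x, int y);

/*
 * Hypothetical caller that appears BEFORE the definition of addNumbers.
 * The prototype above is what lets the compiler check these calls.
 */
int addThree(int a, int b, int c) {
    return addNumbers(addNumbers(a, b), c);
}

/* The definition can now come later in the file. */
int addNumbers(int x, int y) {
    return x + y;
}
```

The first case works the same way, except that the prototype and the definition live in different files.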

Programmers often want to reuse the same variables, functions, structures, and macros across multiple .c source code files. To achieve this, the declarations of those variables and functions, along with structure and macro definitions, can be placed in .h header files. The header files can then be included in each .c source file as follows:

#include "headerFile.h"

In this way, the compiler gets the type information it needs for variables and functions before they are used/called. It also has access to any macros and structures defined in the header file.
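
To make the split between a header and a source file concrete, here is a minimal sketch that places both in one listing so it compiles on its own. The file names math_utils.h and math_utils.c, and every identifier in the block, are hypothetical and are not taken from Git.

```c
/* ---- what a hypothetical math_utils.h might contain ---- */
#ifndef MATH_UTILS_H
#define MATH_UTILS_H

/* A macro shared by every file that includes the header. */
#define MAX_OPERANDS 8

/* A structure template shared by every file that includes the header. */
struct operand_list {
    int count;
    int values[MAX_OPERANDS];
};

/* Declaration of a variable that is defined in exactly one .c file. */
extern int call_count;

/* Function prototype; the definition lives in the .c file. */
int sum_operands(const struct operand_list *list);

#endif /* MATH_UTILS_H */

/* ---- what the matching math_utils.c might contain ---- */
/* In a real project this file would start with: #include "math_utils.h" */

int call_count = 0;

int sum_operands(const struct operand_list *list) {
    int total = 0;
    call_count++;
    for (int i = 0; i < list->count && i < MAX_OPERANDS; i++) {
        total += list->values[i];
    }
    return total;
}
```

Any other .c file that includes the header could then call sum_operands or read call_count without repeating their declarations.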

Git's Original Header File

Luckily for us, Git's initial commit only has one header file, called cache.h. Its purpose is to include the required library headers and to declare the function signatures, data structures, and default settings that the Git .c programs need. The cache.h header is included in all 8 of Git's original .c source code files:

  • init-db.c
  • update-cache.c
  • cat-file.c
  • show-diff.c
  • write-tree.c
  • read-tree.c
  • commit-tree.c
  • read-cache.c

The code for the original version of cache.h appears below, fully documented with inline comments describing how it works. This is an extract from our fully documented Baby Git codebase. (All the same licenses apply.)

/*
 * This code is the full contents of the file `cache.h` from the
 * initial commit of Git's codebase. Git's initial commit has a
 * SHA1 of e83c5163316f89bfbde7d9ab23ca2e25604af290. The original
 * version has been tweaked to compile on modern operating systems.
 *
 * This file is a `.h` header file that is included in all the source
 * files via the `include` preprocessing directive.  This file contains 
 * other `include` directives of library header files, token definitions, 
 * declarations of external variables and structure templates, and
 * function prototypes.
 */

/*
 * Only execute the code in this file if the `CACHE_H` macro has NOT
 * been defined yet. This is to prevent compiler errors resulting from
 * multiple direct or indirect inclusions of the header in a single
 * `.c` source file.
 */
#ifndef CACHE_H

#define CACHE_H    /* Define the macro `CACHE_H`. */

#include <stdio.h>      /* Standard C header for input/output tools. */
#include <string.h>     /* Standard C header for string and memory functions. */
#include <unistd.h>     /* POSIX header for access to the POSIX OS API. */
#include <sys/stat.h>   /* POSIX header defining `stat` tools. */
#include <fcntl.h>      /* POSIX header for file control, e.g. `open` flags. */
#include <stddef.h>     /* Standard C header for common types and `offsetof`. */
#include <stdlib.h>     /* Standard C header for general utilities. */
#include <stdarg.h>     /* Standard C header for variable argument lists. */
#include <errno.h>      /* Standard C header for system error numbers. */

/* If we are not on Windows... */
#ifndef BGIT_WINDOWS
    #include <sys/mman.h>  /* POSIX header for memory management
                              declarations. */

/* If we are on Windows... */
#else
    #include <windows.h> /* Header for access to the Windows API. */
    #include <lmcons.h>  /* Header declaring assorted Windows constants. */
    #include <direct.h>  /* Header for working with file system directories. */
    #include <io.h>      /* Header for low-level file I/O on Windows. */
#endif

#include <openssl/sha.h>  /* Include SHA hash tools from openssl library. */
#include <zlib.h>         /* Include compression tools from zlib library. */

/*
 * Linus Torvalds: Basic data structures for the directory cache.
 *
 * Linus Torvalds: NOTE NOTE NOTE! This is all in the native CPU byte format. 
 * It's not even trying to be portable. It's trying to be efficient. It's
 * just a cache, after all.
 */

/*
 * Define some operating system specific macros.
 */
#ifdef BGIT_UNIX
    #define STAT_TIME_SEC( st, st_xtim ) ( (st)->st_xtim ## e )
    #define STAT_TIME_NSEC( st, st_xtim ) ( (st)->st_xtim.tv_nsec )

#elif defined BGIT_DARWIN
    #define STAT_TIME_SEC( st, st_xtim ) ( (st)->st_xtim ## espec.tv_sec )
    #define STAT_TIME_NSEC( st, st_xtim ) ( (st)->st_xtim ## espec.tv_nsec )

#elif defined BGIT_WINDOWS
    #define STAT_TIME_SEC( st, st_xtim ) ( (st)->st_xtim ## e )
    #define STAT_TIME_NSEC( st, st_xtim ) 0

#endif

/* If we are not on Windows... */
#ifndef BGIT_WINDOWS
    #define OPEN_FILE( fname, flags, mode ) open( fname, flags, mode );

/* If we are on Windows... */
#else
    #define OPEN_FILE( fname, flags, mode ) open( fname, flags | O_BINARY, \
                                                      mode );
#endif

/*
 * Below are the templates for the `cache_header` and `cache_entry` structures.
 * These structures are used to store the repository cache.
 *
 * The cache header structure consists of a signature, a version, the number of
 * cache entries, and the SHA-1 hash that identifies the cache file. The
 * signature is a constant whose value is also given in the header file via the
 * `CACHE_SIGNATURE` token.
 */

/* This `CACHE_SIGNATURE` is hardcoded for loading into all cache headers. */
#define CACHE_SIGNATURE 0x44495243   /* Linus Torvalds: "DIRC" */

/* Template of the header structure that identifies a set of cache entries. */
struct cache_header {

    /* Constant across all headers, to validate authenticity. */
    unsigned int signature;

    /* Stores the version of the cache file format. */
    unsigned int version;

    /* The number of cache entries in the cache. */
    unsigned int entries;

    /* The SHA1 hash that identifies the cache. */
    unsigned char sha1[20];

};

/*
 * The `cache_time` structure is declared below. Its members hold the
 * seconds and nanoseconds portions of a timestamp. It is used to store the
 * times of actions taken on the file that corresponds to a cache entry,
 * such as the time the file was last modified. For more info on file
 * times, see:
 * https://www.quora.com/What-is-the-difference-between-mtime-atime-and-ctime
 */
struct cache_time {
    unsigned int sec;
    unsigned int nsec;
};

/*
 * The `cache_entry` structure is used to store information or metadata about
 * the file that corresponds to the cache entry.  Note the `sha1` array, which
 * is the structure member that stores the 20-byte representation of the SHA-1
 * hash of the deflated blob object that corresponds to the cache entry.
 * This hash is used to index the blob object in the object database.
 */
struct cache_entry {

    struct cache_time ctime;   /* Time of file's last status change. */
    struct cache_time mtime;   /* Time of file's last modification. */
    unsigned int st_dev;       /* Device ID of device containing the file. */

    /* 
     * The file serial number, which distinguishes this file from all
     * other files on the same device.
     */
    unsigned int st_ino;

    /*
     * Specifies the mode of the file. This includes information about the 
     * file type and permissions.
     */
    unsigned int st_mode;

    unsigned int st_uid;      /* The user ID of the file’s owner. */
    unsigned int st_gid;      /* The group ID of the file. */
    unsigned int st_size;     /* The size of a regular file in bytes. */
    unsigned char sha1[20];   /* The SHA1 hash of deflated blob object. */
    unsigned short namelen;   /* The filename length or path length. */
    unsigned char name[0];    /* The filename or path. */

};

/*
 * The following are declarations of external variables. They are defined in
 * the source code file read-cache.c.
 */

/* The path to the object database. */
extern const char *sha1_file_directory;

/* An array of pointers to cache entries. */
extern struct cache_entry **active_cache;

/* The number of entries in the `active_cache` array. */
extern unsigned int active_nr;

/* The maximum number of elements the active_cache array can hold. */
extern unsigned int active_alloc;

/*
 * The two tokens below `DB_ENVIRONMENT` and `DEFAULT_DB_ENVIRONMENT`
 * are defined for the program to determine where to store the object
 * database.
 *
 * By default, the object database is stored in the path defined by 
 * the `DEFAULT_DB_ENVIRONMENT` token, which is `.dircache/objects`.
 * The `DB_ENVIRONMENT` token is defined as `SHA1_FILE_DIRECTORY`, which
 * is an environment variable that the user can set to specify an
 * existing valid object database.
 */

#define DB_ENVIRONMENT "SHA1_FILE_DIRECTORY"

#define DEFAULT_DB_ENVIRONMENT ".dircache/objects"

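/*
 * These macros are used to calculate the size to be allocated to a cache
 * entry. The result covers the fixed members, the name, and at least one
 * extra byte, rounded to a multiple of 8 bytes.
 */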
#define cache_entry_size(len) ((offsetof(struct cache_entry, name) \
                                + (len) + 8) & ~7)

#define ce_size(ce) cache_entry_size((ce)->namelen)

/*
 * See this link for details on this macro:
 * https://stackoverflow.com/questions/22090101/
 * why-is-define-alloc-nrx-x163-2-macro-used-in-many-cache-h-files
 */
#define alloc_nr(x) (((x)+16)*3/2)

/*
 * The following are function prototypes. They are defined in the source file
 * read-cache.c.
 */

/*
 * Read the contents of the `.dircache/index` file into the `active_cache` 
 * array. 
 */
extern int read_cache(void);

/*
 * Linus Torvalds: Return a statically allocated filename matching the SHA1 
 * signature 
 */
extern char *sha1_file_name(unsigned char *sha1);

/* Linus Torvalds: Write a memory buffer out to the SHA1 file. */
extern int write_sha1_buffer(unsigned char *sha1, void *buf, 
                             unsigned int size);

/*
 * Linus Torvalds: Read and unpack a SHA1 file into memory, write memory to a 
 * SHA1 file. 
 */
extern void *read_sha1_file(unsigned char *sha1, char *type, 
                            unsigned long *size);
extern int write_sha1_file(char *buf, unsigned len);

/* Linus Torvalds: Convert to/from hex/sha1 representation. */
extern int get_sha1_hex(char *hex, unsigned char *sha1);

/* Linus Torvalds: static buffer! */
extern char *sha1_to_hex(unsigned char *sha1);

/* Print usage message to standard error stream. */
extern void usage(const char *err);

#endif /* Linus Torvalds: CACHE_H */
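
The sizing macros and the signature constant from the listing can be tried out in isolation. The sketch below copies those definitions and wraps them in three small helper functions. The helper names are hypothetical, and the structure uses a C99 flexible array member (name[]) in place of the zero-length name[0] from the listing, which is an assumption made for portability.

```c
#include <stddef.h>   /* offsetof, size_t */
#include <string.h>   /* strlen */

/* Copied from the listing above. */
#define CACHE_SIGNATURE 0x44495243
#define alloc_nr(x) (((x)+16)*3/2)

struct cache_time {
    unsigned int sec;
    unsigned int nsec;
};

/* Same members as the listing, except `name[]` replaces `name[0]`. */
struct cache_entry {
    struct cache_time ctime;
    struct cache_time mtime;
    unsigned int st_dev;
    unsigned int st_ino;
    unsigned int st_mode;
    unsigned int st_uid;
    unsigned int st_gid;
    unsigned int st_size;
    unsigned char sha1[20];
    unsigned short namelen;
    unsigned char name[];
};

#define cache_entry_size(len) ((offsetof(struct cache_entry, name) \
                                + (len) + 8) & ~7)

/* Hypothetical helper: bytes to allocate for an entry that stores `path`. */
size_t entry_size_for_path(const char *path) {
    return cache_entry_size(strlen(path));
}

/* Hypothetical helper: next capacity for an array that has `current` slots. */
unsigned int next_capacity(unsigned int current) {
    return alloc_nr(current);
}

/* Hypothetical helper: spell out a signature, most significant byte first. */
void signature_text(unsigned int signature, char out[5]) {
    out[0] = (char)((signature >> 24) & 0xff);
    out[1] = (char)((signature >> 16) & 0xff);
    out[2] = (char)((signature >> 8) & 0xff);
    out[3] = (char)(signature & 0xff);
    out[4] = '\0';
}
```

Starting from an empty array, repeated calls to next_capacity give 24, then 60, then 114 slots, so the array grows by roughly half each time. On a typical platform with 4-byte unsigned int and 2-byte unsigned short, the name member should sit at offset 62, which would make a one-character path need 64 bytes. Passing CACHE_SIGNATURE to signature_text yields the string "DIRC".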

Summary

In this article, we provided an overview of the C header file found in Git's initial commit. If you have any questions or comments, feel free to email jacob@initialcommit.io.

If you're interested in learning more about how Git's code works, check out our Baby Git Guidebook for Developers.