What % of Git commit messages use the imperative mood?
Describe your changes in imperative mood, e.g. "make xyzzy do frotz" instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy to do frotz", as if you are giving orders to the codebase to change its behavior.
Some examples of commit messages written in the imperative mood are:
- Bump version to 1.0
- Add .gitignore
- Refactor product repository for functional isolation and clarity
- Merge branch 'master'
- Remove unneeded tests
- Fix bug preventing menu from sliding out on mobile
Notice how each commit message starts with a verb in the present tense. This helps describe the purpose of each commit in a clear and concise way. It also helps standardize the format of commit messages in general.
In this article, we'll explore how frequently developers adhere to this rule by estimating the percentage of commit messages that use the imperative mood.
We will do this by combining the forces of two powerful public datasets from Google BigQuery. The first is the GitHub Activity Data dataset that contains data from almost 3 million Git repositories. The second is the GDELT Web Part of Speech dataset, which contains more than 101 billion language tokens extracted, analyzed, and tagged from global web activity using Google's Natural Language API. We will link these two datasets to roughly estimate the percentage of Git commits that use the imperative mood.
For a primer on using Google BigQuery to analyze a simpler problem, check out my previous article What is the most popular initial commit message in Git? before reading this one.
Dataset #1: GitHub Activity Data
In my previous article, I used the GitHub Activity Data dataset to find the most popular initial commit messages in Git. This was quite simple because all of the required data lives in a single table (the commits table) in a single database (the bigquery-public-data.github_repos database).
As a refresher, the public bigquery-public-data.github_repos database contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. In this article, we will again make use of commit message data from the commits table for our analysis. Our goal will be to extract the commit messages from the message field of the commits table, and try to determine what percentage of the commits use the imperative mood.
To get things started, we can easily get the total number of non-empty commit messages in the dataset (between January 1st 2000 and April 2nd 2020, the dates that correspond to the two timestamps in the query) by running the following query:
SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800
  AND author.date.seconds <= 1585800000
  AND LENGTH(TRIM(LOWER(message))) > 0;
This yields a result of 237,447,598 total commits.
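The two integer bounds in the query are Unix timestamps, that is, seconds since 1970-01-01 UTC. As a quick sanity check, the following Python sketch (not part of the original analysis; the constant and function names are made up for illustration) converts them to calendar dates:

```python
from datetime import datetime, timezone

# Bounds copied from the WHERE clause of the counting query.
LOWER_BOUND = 946684800
UPPER_BOUND = 1585800000


def to_utc(seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(to_utc(LOWER_BOUND).isoformat())  # 2000-01-01T00:00:00+00:00
print(to_utc(UPPER_BOUND).isoformat())  # 2020-04-02T04:00:00+00:00
```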
Dataset #2: The GDELT Web Part of Speech Dataset
At this point we need a way to identify whether or not each commit message in the commits table uses the imperative mood. This is where the GDELT Web Part of Speech dataset comes in. This dataset includes a table called web_pos, in which each record represents a language token extracted from an online source between 2016 and 2020. The records come from sources in dozens of languages. For our purposes, a language token is a single word such as a noun, verb, or adjective.
Here are a few of the most useful fields in the web_pos table, many of which we will make use of:
- The date that the source of the token was published
- The token text itself (in our case a single word)
- The language of the token
- A tag representing the token type (posTag, for example VERB)
- The tense of the token (posTense, for example PRESENT)
- The mood of the token (posMood, for example IMPERATIVE)
- The URL of the token's source
Assumptions and Method
We will make the imperfect assumption that for a commit message to be in the imperative mood, its first word must be a present tense, imperative verb. Luckily, Google BigQuery allows data from multiple unrelated datasets to be combined in a single SQL query. This lets us write the following query, which accesses both datasets and returns a count of commit messages that have a present tense, imperative verb as the first word:
SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800
  AND author.date.seconds <= 1585800000
  AND LENGTH(TRIM(LOWER(message))) > 0
  -- Regular expression to match the first word of each commit message
  AND LOWER(REGEXP_EXTRACT(message, r'\w+')) IN (
    SELECT LOWER(token)
    FROM `gdelt-bq.gdeltv2.web_pos`
    WHERE lang = 'en'               -- Only match English tokens
      AND posTag = 'VERB'           -- Only match VERBs
      AND posMood = 'IMPERATIVE'    -- Only match IMPERATIVE mood
      AND posTense = 'PRESENT'      -- Only match PRESENT tense
      -- Filter out tokens ending in "s" (such as "adds"), unless they end in a double S
      AND (LOWER(SUBSTR(token, -1)) != 's' OR LOWER(SUBSTR(token, -2)) = 'ss')
    GROUP BY LOWER(token)
  );
The resulting output of the above query is 104,057,902 commits. Dividing this by 237,447,598 (the total number of commits we calculated above) yields 43.8%. Therefore, we can estimate that approximately 44% of commit messages in the GitHub dataset use the imperative mood.
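To make the matching logic easier to follow, here is a minimal local sketch in Python that mirrors the query on a handful of messages. It is only an illustration, not the code that produced the numbers above: the CANDIDATE_TOKENS list is a made-up stand-in for the verbs returned by the GDELT subquery, and all function names are hypothetical. The last line simply repeats the division reported above.

```python
import re

# Hypothetical stand-in for the tokens returned by the GDELT subquery.
CANDIDATE_TOKENS = ["Add", "fix", "remove", "bump", "adds", "pass", "merge"]


def keep_token(token: str) -> bool:
    """Mirror the suffix filter: drop tokens ending in 's' unless they end in 'ss'."""
    lowered = token.lower()
    return not lowered.endswith("s") or lowered.endswith("ss")


IMPERATIVE_VERBS = {t.lower() for t in CANDIDATE_TOKENS if keep_token(t)}


def first_word(message: str):
    """Return the first run of word characters, lowercased, or None if there is none."""
    # re.ASCII keeps \w limited to [A-Za-z0-9_], as in BigQuery's RE2 engine.
    match = re.search(r"\w+", message, flags=re.ASCII)
    return match.group(0).lower() if match else None


def is_imperative(message: str) -> bool:
    """Apply the article's assumption: the first word must be a listed verb."""
    return first_word(message) in IMPERATIVE_VERBS


def imperative_share(messages) -> float:
    """Fraction of non-empty messages whose first word is a listed verb."""
    non_empty = [m for m in messages if m.strip()]
    if not non_empty:
        return 0.0
    return sum(is_imperative(m) for m in non_empty) / len(non_empty)


sample = [
    "Fix bug preventing menu from sliding out on mobile",
    "Added .gitignore",
    "Bump version to 1.0",
    "Adds logging",
    "",  # empty messages are excluded, as in the query
]
print(imperative_share(sample))  # 0.5

# The reported figure: matching commits divided by total commits.
print(round(104_057_902 / 237_447_598 * 100, 1))  # 43.8
```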
Keep in mind, there are several aspects of this method that introduce error in the calculation. Oftentimes, the beginning of a commit message contains noise such as a ticket number, story ID, build tool stamp, or some other arbitrary tag data. In these cases, the REGEXP_EXTRACT(message, r'\w+') function will pick out the first word it comes across in that tag, even if the intended starting point of an imperative mood verb appears later in the commit message. I suspect this will lead to a noticeable under-counting of the actual number of imperative mood commit messages in the dataset.
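The effect of a leading tag is easy to reproduce locally. The Python sketch below applies the same \w+ pattern to a few invented commit messages (the ticket ID and the bracketed tag are hypothetical examples) and prints the word that would be compared against the verb list:

```python
import re


def first_word(message: str):
    """Return the first run of word characters, or None if there is none."""
    # re.ASCII keeps \w limited to [A-Za-z0-9_], as in BigQuery's RE2 engine.
    match = re.search(r"\w+", message, flags=re.ASCII)
    return match.group(0) if match else None


examples = [
    "JIRA-1234: Fix login redirect",  # hypothetical ticket prefix
    "[ci skip] Update README",        # hypothetical build tool tag
    "Fix login redirect",             # no prefix
]

for message in examples:
    print(repr(first_word(message)))
# 'JIRA'
# 'ci'
# 'Fix'
```

Only the last message would be counted, even though all three use an imperative verb once the prefix is skipped.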
Furthermore, the natural language database has about 4,000 unique present tense verbs labelled as imperative. After doing a quick Google search, I believe there are significantly more verbs that can be used in the imperative mood, so it's possible that with more words in that list, more matches would occur with the commit message data. However, I have a feeling that imperative verbs typically used by programmers in commit messages (like modify, etc.) are relatively common ones that are well represented by the current set of 4,000.
If you have any thoughts to make my query more accurate, feel free to shoot me an email.
In this article, we used Google BigQuery to access two public datasets that enabled us to estimate the percentage of Git commits that use the imperative mood.