What is the most popular initial commit message in Git?
Table of Contents
- Introduction
- Some background on Google BigQuery
- Writing SQL queries against Google BigQuery
- The top 20 most popular initial commit messages
- Summary
- Next Steps
Introduction
In this article, we will explore which initial commit messages are most popular among developers who use the Git version control system. We will do this by analyzing a public GitHub dataset from Google BigQuery that contains data from almost 3 million Git repositories. We will use this data to compile a list of the 20 most commonly used initial commit messages in the dataset.
If you aren't familiar with initial commits, check out our article What is an Initial Commit in Git? before reading this one.
We'll start off by briefly explaining what Google BigQuery is.
Some background on Google BigQuery
Google BigQuery is a data warehouse hosted on the Google Cloud Platform that is accessible over web services. It was designed to host big data in a cloud environment and provide fast, convenient access to that data over the Internet.
As a part of the service, Google provides numerous public datasets for developers to experiment with. Many of these are useful to the scientific and research communities, including data from the Food and Drug Administration, the US Census, the National Oceanic and Atmospheric Administration, and of course, GitHub.
We'll be using the GitHub dataset in this article, which contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. We'll mainly be using the commit messages to try and gain some insight into the most common initial commit messages that developers use.
For more information on the GitHub dataset, see Google's overview of the GitHub Activity Data. If you're interested, that page includes a button to access the data via the BigQuery console. Note that you'll have to provide billing account details even though your access will be free.
Now let's jump in and query some Git data!
Writing SQL queries against Google BigQuery
Here is what Google's BigQuery interface looks like:
The bottom-left panel in the console lists the different datasets that are available. Note that we have selected the `github_repos` dataset and expanded it. Each item in the expanded list represents a table in the database. We'll be making use of the `commits` table, which contains the commit message information we'll need. This information is stored in the `message` column of that table. The `commits` table contains data from the 1990s all the way through the present.
The big white panel in the top-center of the screen is an editor that we can use to write and execute SQL queries against the datasets. Here is the query I used to fetch, group, and count the commit messages:
SELECT TRIM(LOWER(message)), COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800
AND author.date.seconds <= 1585800000
AND ARRAY_LENGTH(parent) = 0
AND LENGTH(TRIM(LOWER(message))) > 0
GROUP BY TRIM(LOWER(message))
ORDER BY COUNT(*) DESC
LIMIT 15999;
This query simply groups together all commits with the same commit message and counts how many commits contained each message. The `commits` table has a field called `parent` which stores the parent commit IDs for each commit, if any. I filtered on this field using the clause `ARRAY_LENGTH(parent) = 0`, since initial commits don't have any parents. Note that a parentless commit is not always a repository's very first commit: the root commit of an orphan branch (for example, one created with `git checkout --orphan`) has no parents either, but these can be manually excluded based on the commit message content.
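To make the filter-and-count step concrete, here is a minimal Python sketch that mimics it on a handful of made-up commits. The sample records and the `count_root_messages` helper are hypothetical and exist only for illustration; they are not part of the BigQuery dataset or any Google API.

```python
from collections import Counter

# Hypothetical sample rows, loosely shaped like the commits table:
# each row has a message and a list of parent commit IDs.
sample_commits = [
    {"message": "initial commit", "parent": []},
    {"message": "fix typo", "parent": ["a1b2c3"]},
    {"message": "initial commit", "parent": []},
    {"message": "first commit", "parent": []},
    {"message": "merge branch", "parent": ["d4e5f6", "0a1b2c"]},
]


def count_root_messages(commits):
    """Count the messages of parentless commits, most frequent first."""
    counts = Counter(
        commit["message"]
        for commit in commits
        if len(commit["parent"]) == 0  # same idea as ARRAY_LENGTH(parent) = 0
    )
    return counts.most_common()


root_message_counts = count_root_messages(sample_commits)
print(root_message_counts)
# [('initial commit', 2), ('first commit', 1)]
```

The two commits that have parents are dropped before counting, which is the role the `parent` filter plays in the query.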
I used the `LOWER()` function to ensure that differences in capitalization didn't prevent the same messages from being grouped together. I also used the `TRIM()` function to remove any leading or trailing whitespace before grouping. The `LENGTH(...) > 0` clause drops commits whose message is empty after trimming. I added a filter on the `author.date.seconds` column to bring back commits made between January 1st 2000 and April 2nd 2020, which are the dates that the two Unix timestamps in the query correspond to. I ordered the resulting commit messages by their frequency of occurrence, from highest to lowest. Finally, I limited the results to 15999 records since that is the maximum number that can be conveniently downloaded from the console interface (and that is way more than we need).
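As a quick sanity check on the values in the query, the following Python sketch decodes the two epoch bounds and approximates the `TRIM(LOWER(message))` normalization. The `normalize` and `to_utc` helpers are illustrative names of my own, and Python's `strip()` is only assumed to behave like BigQuery's `TRIM()` for ordinary whitespace.

```python
from datetime import datetime, timezone

START_SECONDS = 946684800   # lower bound on author.date.seconds in the query
END_SECONDS = 1585800000    # upper bound on author.date.seconds in the query


def normalize(message):
    """Approximate TRIM(LOWER(message)): lowercase, then strip whitespace."""
    return message.lower().strip()


def to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


start = to_utc(START_SECONDS)
end = to_utc(END_SECONDS)

print(start.isoformat())  # 2000-01-01T00:00:00+00:00
print(end.isoformat())    # 2020-04-02T04:00:00+00:00
print(repr(normalize("  Initial Commit\n")))  # 'initial commit'
```

With this normalization, messages such as `Initial commit` and `initial commit ` collapse into a single group, while `initial commit.` (with a trailing period) remains separate.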
Now let's move on to the findings!
The top 20 most popular initial commit messages
After running the above query, I simply picked out the top 20 results in the ranking. Here are the top 20 commit messages ranked by frequency of occurrence in the dataset:
Commit Message | Count | % of Total |
---|---|---|
initial commit | 1957096 | 86 |
first commit | 151042 | 7 |
init | 39357 | 2 |
initial commit. | 36616 | 2 |
initial | 17882 | 1 |
initial import | 14735 | 1 |
create readme.md | 11510 | 1 |
init commit | 9686 | <1 |
update license.md | 6606 | <1 |
first | 6029 | <1 |
first commit. | 5689 | <1 |
initial version | 5326 | <1 |
create license.md | 3968 | <1 |
inital commit | 3852 | <1 |
initial import. | 3460 | <1 |
create gh-pages branch via github | 3372 | <1 |
initial release | 3347 | <1 |
initial checkin | 3185 | <1 |
initial commit to add default .gitignore and .gitattribute files. | 2967 | <1 |
Total | 2285725 | 100 |
The initial commit messages listed above account for a combined total of 2,285,725 commits.
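The total and the percentage column can be reproduced from the counts with a few lines of Python. This is a sketch under one assumption of mine: that each share is rounded to the nearest whole percent and anything below 0.5% is displayed as `<1`. The `share` helper is a hypothetical name, and the counts are copied from the table.

```python
# Counts copied from the table, in ranked order.
counts = {
    "initial commit": 1957096,
    "first commit": 151042,
    "init": 39357,
    "initial commit.": 36616,
    "initial": 17882,
    "initial import": 14735,
    "create readme.md": 11510,
    "init commit": 9686,
    "update license.md": 6606,
    "first": 6029,
    "first commit.": 5689,
    "initial version": 5326,
    "create license.md": 3968,
    "inital commit": 3852,
    "initial import.": 3460,
    "create gh-pages branch via github": 3372,
    "initial release": 3347,
    "initial checkin": 3185,
    "initial commit to add default .gitignore and .gitattribute files.": 2967,
}

total = sum(counts.values())


def share(count, total):
    """Format a count as a whole-number percentage of the total.

    Assumed display rule: shares under 0.5% are shown as "<1".
    """
    pct = 100 * count / total
    return "<1" if pct < 0.5 else str(round(pct))


print(total)  # 2285725
for message, count in list(counts.items())[:3]:
    print(message, share(count, total))
# initial commit 86
# first commit 7
# init 2
```

Note that these percentages are relative to the listed messages only, not to every initial commit in the dataset.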
From the results, we can see that `initial commit` is by far the most popular message used, representing 86% of the sample. The second most popular message is `first commit`, representing 7%. The third ranked message is `init` with 2%. Clearly, the percentages drop off extremely quickly. The remaining results were mostly made up of the word `initial` or `init` mixed with another term like `import`, `version`, `release`, or `checkin`. A few of the remaining messages were related to including the `readme.md`, `license.md`, `.gitignore`, and `.gitattribute` files. Lastly, there was the message `create gh-pages branch via github`, likely indicating that the GitHub Pages feature is gaining traction.
So from what we can tell, the initial commit messages used in this dataset are strongly weighted toward `initial commit`, with a small minority favoring `first commit` and a smattering of other options. As you can probably tell by the name of my website, I run with the herd on this one and favor `initial commit`!
Summary
In this article we analyzed a public GitHub dataset from Google BigQuery to explore the most popular initial commit messages used by developers working with Git. If you found this topic interesting, check out our analysis that uses Google BigQuery to estimate the percentage of commit messages that use the imperative mood.
Next Steps
If you're interested in learning more about how Git works under the hood, check out our Baby Git Guidebook for Developers, which dives into Git's code in an accessible way. We wrote it for curious developers who want to learn how Git works at the code level. To do this, we documented the first version of Git's code and discussed it in detail.
We hope you enjoyed this post! Feel free to shoot me an email at jacob@initialcommit.io with any questions or comments.