Separating a Monolith while Keeping Commit History

June 5, 2017
development git

As small companies grow, and their engineering teams expand, they will often wish to begin splitting a large, monolithic repository into smaller service repositories. There are two naive approaches to this, neither of which we wanted to do:

1. Copy into a new repo

This would simply involve copying the files related to the service into a new repository, and then removing them from the original. This works, but you lose all history in the migration. When we began to split our services, we really wanted to keep our history, mostly so we could still identify changesets and the original author of specific code.

2. Clone the repo, and remove everything else

This is pretty straightforward, and would maintain all the history. The problem is that all of the history would be present, even for files which we no longer care about. While this may not be a problem for you, our packed repo size is currently ~400 MB, much too large to keep unnecessarily.

The ~400 MB figure is taken from git count-objects -vH

A Solution

Disclaimer: If you are planning to use this approach, I would strongly recommend making a local backup of your .git directory first, and validating the results before force pushing the new history.
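
One way to make that backup, as a minimal sketch: copy the .git directory to a timestamped sibling directory before rewriting anything. The helper name backup_git_dir and the naming of the backup directory are my own assumptions for illustration, not part of git.

```shell
#!/bin/sh
# backup_git_dir is a hypothetical helper name, not a git command.
# It copies the .git directory of a repository to a timestamped sibling
# directory and prints the backup location.
backup_git_dir() {
    repo="${1:-.}"
    # Resolve the repository root so the helper works from any subdirectory.
    root="$(git -C "$repo" rev-parse --show-toplevel)" || return 1
    dest="${root}.git-backup-$(date +%Y%m%d%H%M%S)"
    # -R copies recursively; -p preserves modes and timestamps.
    cp -Rp "$root/.git" "$dest" || return 1
    printf '%s\n' "$dest"
}
```

Calling backup_git_dir from inside the repository prints the backup path. If the rewrite goes wrong, replacing the repository's .git directory with that copy should restore the original history.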

It occurred to us that we could get what we wanted, keeping the history for only the files applicable to the new service, by obliterating all of the files that we didn’t want. Unfortunately, doing this one file at a time would have been a prohibitively long process. Our main repository contains nearly 15,000 commits, with ~15k files (that existed at some point). Obliterating a single file from our repo could take ~2 hours, so obliterating ~10k files one by one would take ~20,000 hours, or just over 2 years. Clearly this naive approach was not an option.

Quick sidebar to explain where these numbers came from:

  • Number of commits: git rev-list --count HEAD
  • Number of files ever: git log --pretty=format: --name-only --diff-filter=A | sort -u | wc -l
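
For convenience, these measurements can be gathered in one place. The following is only a sketch: repo_stats is a hypothetical helper name, and it uses git rev-list --count for the commit count and git count-objects -vH for the packed size.

```shell
#!/bin/sh
# repo_stats is a hypothetical helper; run it inside the repository.
repo_stats() {
    # Commits reachable from HEAD.
    printf 'commits: %s\n' "$(git rev-list --count HEAD)"
    # Distinct paths that were ever added; grep drops blank separator lines
    # and tr strips the padding some wc implementations add.
    printf 'files ever: %s\n' "$(git log --pretty=format: --name-only --diff-filter=A \
        | grep -v '^$' | sort -u | wc -l | tr -d ' ')"
    # Packed size as reported by git count-objects.
    printf 'packed size: %s\n' "$(git count-objects -vH | sed -n 's/^size-pack: //p')"
}
```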

Fortunately, we are able to obliterate a list of files all at once, and we have a way of identifying all of the files we need to remove.

  1. Get all files that ever existed: git log --pretty=format: --name-only --diff-filter=A | sort -u > /tmp/git-clean-history-all.txt
  2. Get all files that we want to keep (sorted, because comm expects both inputs in the same order): git ls-tree -r $(git rev-parse --abbrev-ref HEAD) --name-only | sort > /tmp/git-clean-history-keep.txt
  3. Get the files that no longer exist: comm -23 /tmp/git-clean-history-all.txt /tmp/git-clean-history-keep.txt > /tmp/git-clean-history-remove.txt
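
The three steps above can be sketched as a single helper. This is an illustration rather than the exact script we ran: build_remove_list is a hypothetical name, and the sketch assumes HEAD already contains only the files the new service should keep. It also forces a fixed sort order with LC_ALL=C and filters out the blank lines that git log prints between commits.

```shell
#!/bin/sh
# build_remove_list is a hypothetical helper; run it inside the repository.
# The optional first argument is the directory for the three list files.
build_remove_list() {
    out_dir="${1:-/tmp}"
    all="$out_dir/git-clean-history-all.txt"
    keep="$out_dir/git-clean-history-keep.txt"
    remove="$out_dir/git-clean-history-remove.txt"

    # 1. Every path that was ever added; grep drops blank separator lines.
    git log --pretty=format: --name-only --diff-filter=A \
        | grep -v '^$' | LC_ALL=C sort -u > "$all"

    # 2. Every path present in the current HEAD commit.
    git ls-tree -r --name-only HEAD | LC_ALL=C sort -u > "$keep"

    # 3. Paths in the first list but not the second (comm needs sorted input).
    LC_ALL=C comm -23 "$all" "$keep" > "$remove"
}
```

Sorting both lists under the same locale matters because comm silently gives wrong answers when its inputs are ordered differently.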

Now, we can pass this list into git-obliterate, which will rewrite our git history as if the provided files never existed. All history of the removed files will be removed, and commits which only affected removed files will be pruned.

cat /tmp/git-clean-history-remove.txt | tr '\n' ' ' | xargs git obliterate

Note: Be sure to use a version of git-obliterate which supports a file list (such as this one). Some versions support only a single file at a time, and when that happens it can be difficult to figure out why things aren’t working (made even more frustrating by the fact that the turn-around here can be on the scale of hours).

Tip: If you’re looking for your git-obliterate, I would start by checking /usr/local/bin/git-obliterate.

One caveat here is that if the list of files is longer than the Unix (or git) command length limit, it will have to be split. Some quick googling told me that the limit is ~260k characters, so if needed, we can split /tmp/git-clean-history-remove.txt into smaller groups, and then remove those. It will take longer, but it will work.
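
As a rough sketch of that splitting, xargs can cap the number of paths passed per invocation with -n. The helper name run_in_batches and the batch size in the usage line are assumptions for illustration. Like the tr/xargs pipeline above, it assumes one path per line with no whitespace or quotes inside paths. The real limit on your system can be printed with getconf ARG_MAX.

```shell
#!/bin/sh
# run_in_batches is a hypothetical helper: it feeds the paths in LIST_FILE
# to COMMAND in groups of at most BATCH_SIZE paths per invocation.
# Usage: run_in_batches LIST_FILE BATCH_SIZE COMMAND [ARGS...]
run_in_batches() {
    list_file="$1"
    batch_size="$2"
    shift 2
    # xargs -n caps the number of arguments per invocation and runs the
    # command repeatedly until the list is exhausted.
    xargs -n "$batch_size" "$@" < "$list_file"
}
```

For example, run_in_batches /tmp/git-clean-history-remove.txt 2000 git obliterate would run one rewrite per 2000 paths (2000 is an arbitrary figure, not a measured one). Each batch is a full pass over the history, so fewer, larger batches should finish sooner.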

One last Tip: At the end, recalculate the “files that ever existed” and “current files”. They should be the same. If not, some files may have been missed for some unknown reason. You can repeat the above steps to clean them up. Depending on the repo size though, a little bit of extra history may not matter to you.
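
That final check can be sketched as follows. Both function names are hypothetical; the commands are the same ones used to build the lists earlier, with a fixed sort order added so that comm compares them reliably.

```shell
#!/bin/sh
# list_leftover_paths is a hypothetical helper: it prints every path that
# appears as an addition somewhere in history but is absent from HEAD.
list_leftover_paths() {
    work="$(mktemp -d)" || return 1
    git log --pretty=format: --name-only --diff-filter=A \
        | grep -v '^$' | LC_ALL=C sort -u > "$work/all.txt"
    git ls-tree -r --name-only HEAD | LC_ALL=C sort -u > "$work/keep.txt"
    LC_ALL=C comm -23 "$work/all.txt" "$work/keep.txt"
    rm -rf "$work"
}

# history_is_clean (also hypothetical) succeeds only when nothing is left over.
history_is_clean() {
    test -z "$(list_leftover_paths)"
}
```

Running history_is_clean || list_leftover_paths inside the rewritten repository prints whatever still needs another pass.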

Results

After splitting one of our services out of our main repo, we needed to remove ~13k unrelated files, from ~15k commits, with a packed repo size of ~400 MB. After taking the steps above, we were able to reduce the repository to a few hundred files, a few thousand commits, and a packed repo size of ~1.5 MB. Overall, it took ~5 hours to run through and rewrite our repo. The size savings really shine when cloning into build/CI processes, and they were well worth letting a machine rewrite some history for a few hours.

Note that I’ve written this up in a nice little repo, which I’m still improving. If you don’t care to do these steps manually, and are okay with a little magic, feel free to just use this script.


Hopefully this works out for you, or was at least helpful. If you have any questions, don’t hesitate to shoot me an email, or follow me on twitter @nrmitchi.