Author: Shad Humydee
Software Engineer II – AI & ML
AI / ML / Architecture

Background

At some point in development, a developer pushes ML models into the repo and then realizes two problems: one, these models are huge; two, models were never supposed to be pushed to a repo in the first place. So he removes the models locally, adds a new commit, and pushes to the remote again to rectify the situation of accidentally inflating the repo to something gigantic.

He continues work and opens a PR from that branch a couple of days later, and for some reason, the PR (let’s call this PR-X) never gets merged. Development continues. After a couple of months, people who know their stuff realize that nothing in the repo should add up to more than 50 MB in total. However, the repo has grown to a fat 1.4 GB.

Well, what went wrong? Finding myself in exactly this situation, I started an investigation.

First Hunch

Maybe some large files remained in a branch that’s not closed.

Findings

  • Checked and found nothing.
  • Found some stale, unnecessary branches; deleted all of those.
  • No apparent impact on the size.

Second Hunch

Git is a VCS; logically, it lets us revert to any state the project has been in over its lifecycle via commit hashes. What if, despite the large files being removed from the tip, they remained somewhere in the history? The point being: otherwise, how would we revert to the exact commit where that developer pushed the large files?
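
This hunch can be tested from the command line. A sketch (run from inside a clone of the repo; nothing here is specific to our project):

```shell
# List the 20 largest blobs reachable from any ref, with their paths.
# rev-list emits "sha path" pairs; cat-file resolves type and size for each.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" {print $3, $4}' |
  sort -rn |
  head -20
```

If the accidentally committed models are still in history, they show up at the top of this list even though no checked-out branch contains them.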

Finding

  • The hunch was right.
  • As development continued, every single branch that followed PR-X kept a link to those large files.
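
This kind of lingering reachability can be confirmed directly. A sketch, assuming the models lived under a `models/` folder (a placeholder path):

```shell
# Commits that ever touched the folder, across all refs
git log --all --oneline -- models/

# Branches that can still reach the most recent such commit
sha=$(git log --all --format=%H -n 1 -- models/)
git branch -a --contains "$sha"
```

Every branch printed by the last command keeps those blobs alive, even if the files were later deleted in a follow-up commit.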

Digging Deep on a Resolution

After googling and going through a plethora of StackOverflow threads, my neophyte eyes found something very close to what I was looking for: git-filter-branch to the rescue. It walks every branch and unlinks any file/folder you want removed. So, in a sense, the repo’s history gets rewritten from the point where the unwanted files/folders were added, resulting in a repo history without them.
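
For reference, the low-level approach looks roughly like this (a sketch, with `models/` as a placeholder path; the BFG route below is what I actually used):

```shell
# Rewrite every commit on every ref, dropping "models/" from each commit's index.
# --index-filter avoids checking files out, so it is much faster than --tree-filter;
# --ignore-unmatch keeps commits that never contained the path from failing the rewrite.
git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch models/' \
  --prune-empty --tag-name-filter cat -- --all
```

After this, the old refs are kept under `refs/original/` as a backup, which is one more thing you have to clean up before the space is actually reclaimed.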

As a cherry on top, I found BFG Repo-Cleaner, a simpler and faster alternative to git-filter-branch that provides a higher-level interface for people who don’t need granular control.

For the nerds: git-filter-branch works at quite a low level but cannot properly leverage multi-threading. The BFG authors claim it to be 10–720x faster than git-filter-branch.

Getting Hands Dirty

Ran the commands using BFG and found the devils hiding in the weeds!

$ git clone --mirror git://example.com/my-repo.git

$ java -jar bfg-1.14.0.jar --delete-folders "folder_with_large_files" my-repo.git


After performing the operation with BFG, the authors recommend manually running garbage collection (GC)
so that the repo gets a fresh .git folder with only the things you want. Any files/folders that are no longer linked get removed once Git’s GC runs.

$ cd my-repo.git

$ git reflog expire --expire=now --all && git gc --prune=now --aggressive


So, after I ran the GC, Git removed the models from its object store. The local size of the repo was reduced to 228 MB.
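
The new size can be checked without leaving the terminal; a small sketch:

```shell
# Report the repository's object-store footprint in human-readable units;
# "size-pack" is the packed size that dominates after an aggressive GC.
git count-objects -vH
```

If the rewrite worked, `size-pack` should now be in the same ballpark as the content you actually expect to have.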

Cool! Let’s force-push the entire thing to remote, and we have the perfect Repo we have been wanting! Amazing, right?
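
For completeness, the push itself is unremarkable. From inside a mirror clone a plain `git push` updates every ref on the remote; the force-flag variant below is a sketch for a normal (non-mirror) clone:

```shell
# Inside the mirror clone: push all rewritten refs back to the remote.
git push

# Rough equivalent for a non-mirror clone (this rewrites remote history --
# coordinate with everyone who has the repo cloned before running it):
# git push --force --all && git push --force --tags
```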

Not yet! After the push, the size of the remote didn’t get reduced. Why God, WHY?!

Cookie for Risat

After discussing with a resourceful colleague, Risat, I got a hint: the GC on Bitbucket’s end (the remote) usually runs at a certain interval. The assumption was that, after my push of the reduced repo, Bitbucket’s GC simply hadn’t run yet; hence, the storage on the remote remained the same.

I slept and woke up, and nothing changed. I couldn’t afford any more time.

Cookie for Saila

In a separate discussion with another veteran colleague, Saila, I got another hint – there should be a way to trigger the GC from the Remote end. I looked in Bitbucket’s repo settings but found nothing.

Rebel on all fronts!

Cookies for the Good Samaritan at StackOverflow

StackOverflow pointed to a hack (an Easter egg of sorts): if you move your HEAD back by one commit and force-push, Bitbucket auto-triggers the GC on the remote.
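
A sketch of the trick (`main` is a placeholder branch name; `ORIG_HEAD` is set by `git reset`, which is what lets us restore the real tip afterwards):

```shell
# Step the branch back one commit and force-push, to poke the remote GC...
git reset --hard HEAD~1
git push --force origin main

# ...then restore the real tip and force-push it again.
git reset --hard ORIG_HEAD
git push --force origin main
```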

I tried that, and it downsized the repo from 1.4 GB to 1.36 GB. The remote GC ran, yet the repo size wasn’t reduced to what I expected.

How about Plan B?

Create a new repo with a new name, move the most up-to-date content there, and continue developing there. However, this has the following side effects, which might eventually result in more overhead –

  • You lose your precious development history.
  • If there are any clients of that repo, you need to manually go and change every single one.
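
If you do go that route, the mechanics themselves are small (a sketch; the new-repo URL is a placeholder):

```shell
# Point the existing checkout at a freshly created, empty repository...
git remote set-url origin git@example.com:team/new-repo.git

# ...and push the current branches and tags there. Any history you do not
# push to the new remote is exactly the history you lose.
git push -u origin --all
git push origin --tags
```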

Also, at that point, I started praying to God that December wouldn’t be too hard on me for being unable to fix this!

Silver Lining!

It suddenly clicked (some good people on StackOverflow shared that they, too, did this when the remote GC didn’t give the result they expected): we have perks; we pay for the cloud service, right? Why not pay a visit to the Big Brothers of Atlassian Support? Well, I did.

Eureka – Towards the Light!

  • The Atlassian Support first told me to try the same solution I did with BFG again. I complied like a good boy and informed them when it was done.
  • They ran the GC from their end and got back to me with something strange that I had never heard of.
  • They shared a list of closed/merged PRs from my repo that had no reference in Git at that point (local, staging, remote) but remained on Bitbucket’s server, pointing to those large files from PR-X. Hence the repo size wouldn’t get reduced. They call these references dark refs; no trace of them can be found in the reflog.

Why do they keep these? Because using those dark refs they can provide diff functionality between open branches and branches that have been closed/merged.

  • They wanted Admin Permission to remove those. Being an Admin and seeing those PRs as unnecessary anymore, I allowed it.
  • They ran the GC on remote and returned with the same size and another list of PRs with the same issue.
  • I allowed it again.
  • This time they removed those PRs and ran the GC for one last time, and VOILA, 228 MB!

Moral of the Story

  • Git isn’t child’s play. Don’t let anyone and everyone commit and push anything and everything without solid knowledge of Git and its philosophy.
  • Don’t be too hard on the child, though. If he hadn’t put you through this, how would you have learned all this? ;)

Happy Weekend! Thanks for reading the long post.

For the SuperNerds:

Let’s say I accidentally push a model in one branch. Does this mean I must go through all these steps to undo it? Since we are working with ML models, shouldn’t there be safeguards in a project that uses models developed by the devs?

Yes, you do. A revert is actually a new commit, and even a hard reset leaves the old commits reachable (via the reflog, among other things), so practically nothing becomes non-existent in Git history on its own. As long as the repo is small, developers/managers might not care much, but if it grows like ours did, we need to start asking questions to find the reason behind such a large repo.
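
A small, self-contained demonstration of this point (a throwaway repo; `model.bin` is a made-up file):

```shell
# Build a throwaway repo, commit a fake "model", then revert that commit.
tmp=$(mktemp -d) && cd "$tmp" && git init -q
git config user.email you@example.com && git config user.name you
echo base > code.txt && git add code.txt && git commit -qm "base"
head -c 1024 /dev/zero > model.bin && git add model.bin && git commit -qm "add model"
git revert --no-edit HEAD    # the file disappears from the working tree...

# ...but its blob is still reachable through history:
git rev-list --objects HEAD | grep model.bin
```

The revert only adds a commit that undoes the change; the commit that introduced the model, and the blob it points to, stay in the repository until history is rewritten and GC runs.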

Another point from Saila: philosophically, Git is a VCS that keeps track of everything that moves along the development process. So when you want to really remove something from history, you are going against that philosophy; hence the trouble of doing so many things to untangle the unwanted files.

I didn’t quite get the “dark ref” part. Why would a closed/merged PR in a repo, pointing at large files, have no reference in Git, the reflog, or Bitbucket? When does this happen?

Referring to what the Atlassian Support shared with me, verbatim –

For the server-side copy of the repository that we host, it’s sometimes necessary for us to preserve objects that aren’t part of a clone—even a full mirror clone. The reason for this is to be able to continue to perform diffs on pull requests that have been closed either by merging or declining. This property was a part of a recent change to improve the performance of complex diffs.

Pull requests that were created recently but prior to the history rewrite could still be preserving some of those large objects you pushed. We can delete these Pull request IDs so and so. Also, we need the admin’s confirmation to delete these PRs from the backend and finally re-run git GC.

Thank You

You guys are awesome –

  1. Istiak Ahmed Risat, Senior Software Engineer – AI/ML, Infolytx
  2. Saila Nasrin, Senior Software Engineer, Infolytx