A previous post we discussed FLP, a theoretical result that distributed consensus is impossible with one faulty process. This seems an arcane theoretical result that does not impact the decisions a developer faces in normal course of work. However, the fundamental concept behind FLP, that it’s impossible to tell the difference between a process that is really slow and one that has failed, is of real significance. An example of this is in the Git version control system, from the git-gc manpage (https://git-scm.com/docs/git-gc):
On the other hand, when git gc runs concurrently with another process, there is a risk of it deleting an object that the other process is using but hasn’t created a reference to. This may just cause the other process to fail or may corrupt the repository if the other process later adds a reference to the deleted object. Git has two features that significantly mitigate this problem:
Any object with modification time newer than the –prune date is kept, along with everything reachable from it.
Most operations that add an object to the database update the modification time of the object if it is already present so that #1 applies.
However, these features fall short of a complete solution, so users who run commands concurrently have to live with some risk of corruption (which seems to be low in practice) unless they turn off automatic garbage collection with git config gc.auto 0.
This is related to FLP. The first clue that this is related to FLP is the default period for the prune time which is 2 weeks, this is the timeout – essentially Git is assuming that any process that is updating the repository, for example a git push, will complete in 2 weeks. Git is now treating the world as a synchronous system such that all updates to the repository will complete within 2 weeks, if an update takes longer then there is a danger the repository will be corrupted. In other words it assumes that a process that takes longer than 2 weeks to complete the update of a repository has failed.
When you make a push to a Git repo it first checks the head of the branch you are updating to make sure your push is allowed, for example that it is a fastforward change when non-fastforwards are disallowed. When it checks the head it takes a file lock while it reads the SHA1 of the current head of the branch. After it reads the head and releases the file lock it starts to push the objects that make up the commit. Once the client process has completed the pushing the objects it takes the file lock again, it checks that no other process has updated the head while it was pushing up the objects and if not it writes the new head to disk and releases the file lock. Until the process updates the head the objects it has put in the repository are “orphans”, they are not connected into any branch. The process grabs the file lock twice during the push, at the start and at the end of the push, it would not make sense for the process to hold the lock for the whole time of the push as other processes would be blocked – also if the process failed in some way without releasing the lock then no other process could update that branch.
So where does git garbage collection fit in? If the process fails while sending objects then there may be some orphan objects left in the repository or if the repository sends all the objects but finds the head has moved on in the final step then you are left with objects that will remain orphaned. Git garbage collection looks through the repository for orphaned objects and removes them. The problem for git-gc is how does it differentiate between objects that are truly orphans and objects that are part of an ongoing push that will eventually complete? The answer is a simple rule that for automatic garbage collection it will not remove objects that are less than 2 weeks old by default, this gives a process doing a push a limit of 2 weeks to complete the push. So by default git differentiates between a process that has failed and one that is running very slowly using a 2 week timeout, it does not treat the world as an asynchronous system but rather as a synchronous system with very loose timing restrictions.