Continuing on from Top Challenges of the Git Enterprise Architect #2: Access Control I’ll next talk about Git’s ever growing repos and some challenges presented.
Git stores all history locally, which is good because it’s fast. But it’s also bad, because clone and other command response times grow and never shrink.
Linear response time
Certain commands take linear time, O(n), either with the number of files in a repo or the depth of history. For example, Git has no built-in notion of revision number. Here’s a way to get the number of revisions of a file, Main.java:
git shortlog Main.java | grep -E '^[ ]+\w+' | wc -l
Of course, a consecutive file revision number is not a foundational construct in Git, as it is with a system like Subversion, so Git needs to walk its DAG (Directed Acyclic Graph) backward to the origin, counting the revisions of a particular file along the way.
When a repo is young, this is typically very fast. Also, as I noted, the revision numbers play less of a important role with Git so we need them less often. However, think about what will happen if you have an active shared Git repository in service for a long period of time? When I’ve asked how long typical SCM admins expect to keep a project supported in an SCM, numbers range from 4 to up to 10 years. An active file might have accumulated hundreds or even thousands of revisions, and you’d want to think twice about counting them all up with Git.
Facebook’s concern over Git’s ever-growing-repos and very-slowing-performance a few years ago led them to switch to Mercurial and centralize much of the data. Alas, this approach is not a solution for most companies. It relies on a fast, low latency connection, and if you don’t have access to the unique, fast, data safe, active-active replication as found in WANdisco’s MultiSite products for Subversion and Git, users remote to the central site will suffer sharply degraded performance.
The most common workaround I hear about is that when a Git repo gets too big and slow, a new shared master is cloned and deployed, and the old one serves as history. This is clearly not ideal, but many of the types of development that have first started using Git are less impacted by fragmented historical SCM data. As Git gains more widespread adoption into a greater variety of enterprise development projects, solutions may become more needed. Here at WANdisco we are hard at work paving the road ahead so that your Git deployments will scale historically as well as geographically.