Binary artifact management in Git

Paul Hammant has an interesting post on whether to check binary artifacts into source control.  Binary artifact management in Git is an interesting question and worth revisiting from time to time.

First, a bit of background.  Centralized SCM systems like Subversion and ClearCase are a bit more capable than Git when it comes to handling binary files.  One reason is sheer performance: since a Git repository has a full copy of the entire history, you just don’t want your clone (working copy) to be too big.  Another reason is assembling your working views.  ClearCase and to a lesser extent Subversion give you some nice tools to pick and choose pieces of a really big central repository and assemble the right working copy.  For example in a ClearCase config spec you can specify that you want a certain version of a third party library dependency.  Git on the other hand is pretty much all or nothing; it’s not easy to do a partial clone of a really big master repository.

Meanwhile, there had been a trend in development to move to more formal build and artifact management systems.  You could define a dependency graph in a tool like Maven and use Maven or Artifactory or even Jenkins to manage artifacts.  Along with offering benefits like not storing derived objects in source control, this trend covered off Git’s weak spot in handling binaries.

Now I’m not entirely sure about Paul’s reasons for recommending a switch back to managing binaries in Git.  Personally I prefer to properly capture dependencies in a configuration file like Maven’s POM, as I can exercise proper change control over that file.  The odd thing about SCM working view definitions like config specs is that they aren’t strongly versioned like source code files are.

But that being said,  you may prefer to store binaries in source control, or you may have binaries that are actually source artifacts (like graphics or multimedia for game development).  So is it hopeless with Git?

Not quite.  There are a couple of options worth looking at.  First, you could try out one of the Git extensions like git-annex or git-media.  These have been around a long time and work well in some use cases.  However they do require extra configuration and changes to the way you work.

Another interesting option is the use of shared back-end storage for cloned repositories.  Most Git repository management solutions that offer forks use these options for efficient use of back-end storage space.  If you can accept working on shared development infrastructure rather than your own workstation, then you can clone a Git repository using the file protocol with the -s option to share the object folder.  There’s also the -reference option to point a new Git clone at an existing object store.  These options make cloning relatively fast as you don’t have to create copies of large objects.  It doesn’t alleviate the pain of having the checked out files in your clone directory, but if you’re working on a powerful server that may be acceptable.  The bigger drawback to the file protocol is the lack of access control.

Management of large binaries is still an unsolved problem in the Git community.  There are effective alternatives and work-arounds but it’ll be interesting to see if anyone tries to solve the problem more systematically.

0 Responses to “Binary artifact management in Git”

  • No Comments

Leave a Reply