This is post number two in a series of short articles exploring challenges facing anyone deploying Git at scale in their enterprise software development environment. Find the introduction here.
Anyone migrating to Git from a centralized version control system will quickly run into one of Git’s most characteristic features: a codebase is most naturally represented by one complete repository. That is, you don’t define a working copy based on a part of a repository; you get the entire repository as your working copy.
Centralized version control systems tended to become a grab bag of everything: main products, side projects, a file you needed at home when you didn’t have a USB drive handy, and sometimes, lots and lots of large binary files. You’d then define a tiny fraction of that world as your working copy and pull down just that piece.
Once all of that is migrated to a Git repo, however, you are suddenly cloning the world onto your laptop!
Doing the splits
The best-practice answer in the Git world is to split all the unrelated items into separate Git repos. But now there are many repos, and with them a corresponding set of new questions:
- Who in your organization is responsible for managing all the repos?
- What tracks code if it is, for example, refactored to a different repo?
- How do developers find the repos with the code they need?
- What about codebases that share code but for secrecy or scaling reasons can’t all be included in a single Git repo?
- Who provisions new repos? Are they automatically backed up properly? Where do they live?
- What if you have a large and entangled codebase that will be expensive to refactor?
Untracked code movement
Of the various questions raised, one of the most important concerns the fact that moving code between repos generates no metadata. To the receiving Git repo, the files simply appear as a code drop; to the originating repo, it looks as if some local files were donated elsewhere. This cuts against the grain of SCM. Software Configuration Management is in danger of becoming Software Confusion Mess, because we lose track of why and how code is moving within a codebase. That means we might not be able to answer questions like:
- What codelines contain this recently discovered bug?
- What products contain this piece of GPL-licensed code?
- Did the refactored code get into every library that uses it?
- What repos did this particular line of code pass through before getting here?
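As a rough illustration of how one such question might be approached by hand, the sketch below checks which of a set of local clones contain a byte-identical copy of a given file. It relies on the fact that Git identifies file contents by a hash of a short header plus the bytes, so the same file gets the same blob ID in every repo. The function names and the list of repo paths are hypothetical, the sketch assumes the default SHA-1 object format, and it only finds exact copies: a file that was edited after being moved will not match.

```python
import hashlib
import subprocess


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes the header "blob <size>\0" followed by the raw bytes.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def repos_containing(blob_id: str, repo_paths: list[str]) -> list[str]:
    """Return the repos (hypothetical local clone paths) that hold the blob."""
    found = []
    for path in repo_paths:
        # `git cat-file -e` exits with status 0 if the object exists
        # in that repo's object database.
        result = subprocess.run(
            ["git", "-C", path, "cat-file", "-e", blob_id],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            found.append(path)
    return found
```

Calling `repos_containing(git_blob_id(data), paths)` with the bytes of a suspect file and a list of clone directories returns the subset that has that exact content somewhere in its object database. It says nothing about which commits or branches reference the file, or how it got there, which is precisely the metadata that is missing.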
There are ways around all of these problems, of course; my main point is merely that these are issues to keep in mind. Perhaps most of them are not important in your use cases, or you are using one of the many tools that address some of them. And of course, as I implied in my article “Problem-centric Products”, you can expect WANdisco’s Git roadmap to pass through all of these challenges.
So stay tuned for the next installment: “Access Control”.