Verifying Git Data Integrity

As a Git administrator, you’re probably familiar with the git fsck command, which checks the integrity of a Git database. In a large deployment, though, you may have several mirrors in different locations supporting users and build automation (the record I’ve heard so far is over 50 mirrors). You can run git fsck on each one as normal maintenance, but even if all of the mirrors have intact databases, how do you make sure that all of the mirrors are consistent with the master repository? You need a simple way to verify Git data integrity for all repositories at all sites.

That’s quite a difficult question to answer. If you have 20 or 30 mirrors, you want to know if any of them are not in sync with the master. Inconsistencies may arise if the replication is lagging behind, or if there is some other subtle corruption.
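To see why this is awkward to do by hand, here is a minimal sketch of a ref-by-ref comparison between a master and one mirror using the standard git ls-remote command. The function names are hypothetical, chosen only for illustration. Note that this naive check cannot tell a lagging mirror from a corrupted one, because nothing guarantees that both sides are sampled at the same point in time.

```python
import subprocess
from typing import Dict, Optional, Tuple


def parse_ls_remote(output: str) -> Dict[str, str]:
    """Parse `git ls-remote` output (one tab-separated sha/ref pair per line)."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split(None, 1)
        refs[ref.strip()] = sha
    return refs


def list_refs(url: str) -> Dict[str, str]:
    """Ask a remote repository for its current refs (requires git on PATH)."""
    result = subprocess.run(
        ["git", "ls-remote", url],
        check=True, capture_output=True, text=True,
    )
    return parse_ls_remote(result.stdout)


def diff_refs(
    master: Dict[str, str], mirror: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {ref: (master_sha, mirror_sha)} for refs that differ.

    None marks a ref that is missing on that side.
    """
    return {
        ref: (master.get(ref), mirror.get(ref))
        for ref in sorted(set(master) | set(mirror))
        if master.get(ref) != mirror.get(ref)
    }
```

With 20 or 30 mirrors you would call list_refs once per mirror and diff each result against the master, which is exactly the kind of bookkeeping that becomes tedious and error-prone at scale.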

Git MultiSite provides a simple consistency checker to answer this question quickly. (Bear in mind that Git MultiSite nodes are all writable peer nodes; it does not use a master-slave paradigm.  But the ability to make sure that all peer nodes are consistent is equally valuable.)  The consistency checker can be invoked for any repository in the administration console:

Git Consistency Check

The consistency checker computes a SHA1 ID over the values of all the current refs in the repository on each replicated peer node. This SHA1 is tied to the Global Sequence Number (GSN), which uniquely identifies all of the proposals in Git MultiSite’s Distributed Coordination Engine. The result looks like this:

Consistency Check Result

First, I see that the GSN matches across all three nodes. I’m now confident that they’re all reporting results at a consistent point, when the same transactions should be present in all nodes. In other words, I’m able to discount any inconsistencies due to network lag.

More importantly, I see that the SHA1 for the second node doesn’t match the other two. That’s a red flag, and it means that I should immediately investigate what’s wrong on that node.
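To make the idea of a single SHA1 over all current refs concrete, here is a minimal sketch of how such a digest could be computed for a local repository. This is not Git MultiSite’s actual algorithm, which isn’t described here; the function names, the sort order, and the exact bytes fed to the hash are all assumptions made for illustration. The only real Git interface used is git for-each-ref.

```python
import hashlib
import subprocess
from typing import Dict


def read_refs(repo_path: str) -> Dict[str, str]:
    """Read every ref and the object it points to (requires git on PATH)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref",
         "--format=%(objectname) %(refname)"],
        check=True, capture_output=True, text=True,
    )
    refs = {}
    for line in result.stdout.splitlines():
        sha, ref = line.split(" ", 1)
        refs[ref] = sha
    return refs


def refs_digest(refs: Dict[str, str]) -> str:
    """Hash all refs in sorted order so the result is independent of listing order.

    Assumed encoding: one "<sha> <refname>" line per ref, UTF-8.
    """
    h = hashlib.sha1()
    for ref in sorted(refs):
        h.update(f"{refs[ref]} {ref}\n".encode("utf-8"))
    return h.hexdigest()
```

Two repositories with identical refs produce the same digest, and a single branch or tag pointing at a different commit changes it, which is why one mismatched value is enough to single out a node.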

Now consider this example:

Different GSN

Notice that the third node is reporting an earlier GSN (23 versus 29) compared to the other two nodes. That tells me that this node is lagging behind, which may be expected if it’s connected over a WAN and always running 2-3 minutes behind the other nodes.
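The reasoning in these two examples can be sketched as a small decision rule: compare GSNs first, and only compare SHA1 values among nodes that report the same GSN. The NodeReport type, the classify function, and the "majority value wins" rule below are assumptions for illustration, not part of Git MultiSite; the node names and SHA1 strings are made up, and only the GSN values 23 and 29 come from the example above.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class NodeReport(NamedTuple):
    node: str
    gsn: int    # Global Sequence Number the node reported at
    sha1: str   # digest over the node's refs at that GSN


def classify(reports: List[NodeReport]) -> Dict[str, str]:
    """Label each node "consistent", "lagging", or "mismatch"."""
    latest_gsn = max(r.gsn for r in reports)
    # Only nodes at the latest GSN are comparable with each other.
    current = [r for r in reports if r.gsn == latest_gsn]
    # Assumption: the most common digest among those nodes is the good one.
    majority_sha, _ = Counter(r.sha1 for r in current).most_common(1)[0]

    status = {}
    for r in reports:
        if r.gsn < latest_gsn:
            status[r.node] = "lagging"      # behind; SHA1 not comparable yet
        elif r.sha1 != majority_sha:
            status[r.node] = "mismatch"     # same GSN, different refs: investigate
        else:
            status[r.node] = "consistent"
    return status


if __name__ == "__main__":
    reports = [
        NodeReport("node1", 29, "aaa"),
        NodeReport("node2", 29, "aaa"),
        NodeReport("node3", 23, "ccc"),
    ]
    print(classify(reports))
```

A lagging node is deliberately not reported as a mismatch: its SHA1 only becomes meaningful once it has caught up to the same GSN as its peers.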

Running a distributed SCM environment is very difficult, and the consistency check is another way that Git MultiSite makes things easier for you. Check out a free trial and see for yourself!