Monthly Archive for September, 2013

Git Subtrees and Dependency Management

Component-based development has always seemed difficult to manage directly in Git. Legacy systems like ClearCase UCM have the idea of formal baselines to manage the dependencies in a project, and of course Subversion uses externals to capture the same concept. By contrast, Git started life as a single-project repository system, and the submodule and subtree concepts seemed clunky and difficult to manage. A lot of software teams overcame that problem by deferring component manifests to build or CI systems. The latest incarnation of Git subtrees is significantly improved, however, and worth a second look for dependency management.

The latest version of Git subtree is available with Git 1.7.11+. (If you need the most recent version of Git for your platform, WANdisco offers certified binaries.) It offers a much simplified workflow for importing dependencies, updating the version of an imported dependency, and making small fixes to a dependency.

For example, let’s say we have three components in our software library, and we have two teams working on different sets of those components.

Component Architecture


With subtrees, we can easily create new ‘super project’ repositories containing different sets of components. To get started, we add component repos as new remotes in the super project, then define the subtree.


git remote add modA git@repohost:modA
git fetch modA
git subtree add --prefix=modA --squash modA/master
git remote add modB git@repohost:modB
git fetch modB
git subtree add --prefix=modB --squash modB/master

We repeat this process with a different set of components in the second super project, yielding a directory tree that looks like this:

├───modA
└───modB

As the architect, I’ve determined the set of components used in the super projects, and the rest of the team gets the right set of data through regular clones and pulls. Similarly, if I want to update to the latest code, I just run:

git subtree pull --prefix=modB --squash modB master

Or, if I want to peg a component to a specific branch:

git subtree pull --prefix=modB --squash modB r1.1

By using --squash I generate a single merge commit when I add or update a subtree. That’s equivalent to one commit every time I adjust the version of a component, which is usually the right way to track this activity. Keep in mind that it is very easy to create a new branch off of a specific tag or commit at any time.

Similarly, if I want to contribute a bug fix, I just commit into the component and push the change back:

echo "mod b change from super 1" >> .\modB\readme.txt
git commit -am "change to modB from super 1"
git subtree push --prefix=modB modB master

There are a couple of good rules to follow when using subtrees. First, don’t make changes to a subtree unless you really want to contribute a bug fix or patch back upstream. Second, don’t make commits that span multiple subtrees or a subtree and the super project code. Both of these rules can be enforced with hooks if necessary, and you can rebase to fix any mistakes before pushing.
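Both rules lend themselves to a mechanical check. As an illustrative sketch (the subtree prefixes below are hypothetical, and a real hook would feed in the changed paths from `git diff --name-only`), a hook could classify the paths a commit touches and flag any commit that spans areas:

```python
# Sketch of a hook-side check: flag commits that span subtree
# boundaries, or mix subtree and super-project files.
# SUBTREE_PREFIXES is an assumption for this example; a real hook
# would derive it from the repository's subtree configuration.
SUBTREE_PREFIXES = ["modA/", "modB/"]

def commit_scope(changed_paths):
    """Return the set of areas a commit touches: one entry per
    subtree prefix hit, plus 'super' for paths outside all subtrees."""
    areas = set()
    for path in changed_paths:
        for prefix in SUBTREE_PREFIXES:
            if path.startswith(prefix):
                areas.add(prefix)
                break
        else:
            areas.add("super")
    return areas

def commit_is_isolated(changed_paths):
    """A commit follows both rules if it stays within a single area."""
    return len(commit_scope(changed_paths)) <= 1
```

A commit touching only modA/src files passes, while one touching both modA/ and modB/, or modA/ plus a super-project file, is flagged for review (or rejection, if you choose to enforce the rule).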

Git subtrees are now a very effective and convenient tool for component and dependency management. Combined with the power of modern build and CI systems, they can manage a reasonably complex development project.

Questions about how to take advantage of Git subtrees? WANdisco is here to help with Professional Git Support and Services.



Delegated Subversion Access Control

Managing Subversion access control for a small team is fairly simple, but if your team is growing to several hundred developers, you don’t want to get a phone call whenever a new developer joins the team or someone needs a different access level. Delegated Subversion access control is what you need, and WANdisco’s SVN Access Control product uses the concept of team owners and sub-teams to get you there.

As a simple example, let’s say that we want all developers to have read access to the web-ui project. A subset of developers will also have write access to the trunk, and as the Subversion administrators we don’t want to decide which developers belong in each group. Using SVN Access Control, administrators can delegate that responsibility to the team leads while still being able to audit what’s happening.

In SVN Access Control we simply define a group called web-ui-devs and a subgroup called web-ui-committers. Next we set the permissions appropriately on the group and subgroup, and each is assigned an owner who can then manage membership.

Simple enough! SVN Access Control also allows a subgroup to have subgroups of its own, so you can set up a structure as deep as necessary to model your permissions and rules.
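To see why nesting is convenient, consider how effective membership resolves: a member of a subgroup is implicitly a member of every enclosing group. Here’s a toy model of that idea (the group structure and names are invented for illustration; this is not SVN Access Control’s internal representation):

```python
# Toy model of nested access-control groups. Membership in a
# subgroup implies membership in the parent for read purposes,
# while write permissions can be granted on the subgroup alone.
GROUPS = {
    "web-ui-devs": {"members": {"alice"}, "subgroups": {"web-ui-committers"}},
    "web-ui-committers": {"members": {"bob", "carol"}, "subgroups": set()},
}

def effective_members(group):
    """Collect direct members plus members of all nested subgroups."""
    info = GROUPS[group]
    members = set(info["members"])
    for sub in info["subgroups"]:
        members |= effective_members(sub)
    return members
```

Granting read access to web-ui-devs automatically covers the committers, while trunk write access granted only to web-ui-committers stays limited to that subgroup.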

If you’d like more information on how to use SVN Access Control to solve your Subversion management challenges, contact our team of Subversion experts for advice. If you’re not yet enjoying the power of SVN Access Control, you can start a free trial.


WANdisco Announces SVN On-Demand Training for Administrators and Developers

Whether you’re looking to get started with Subversion or build your skills in managing large-scale Subversion deployments, WANdisco’s new SVN On-Demand Training offers courses designed for Subversion administrators and developers of all skill levels.

SVN On-Demand Training offers instruction to boost administrators’ and developers’ knowledge of Subversion; the library includes more than 30 videos and supporting reference materials, and new material is continually added for subscribers.

Some of the current SVN On-Demand Training courses include:

  • Introduction to Subversion
  • Subversion for Beginners
  • Intermediate Subversion
  • Advanced Subversion

SVN On-Demand Training is available now. Visit for more information and to request a quote.

Attendees of Subversion & Git Live 2013 in Boston, San Francisco, and London this October will receive two weeks free with a special code at the conference. Visit to register using promo code REF2013 to save 30%.

Data Auditing in Subversion

I’ve been writing a lot lately about the new features in Subversion 1.8, but there’s a little nugget in Subversion 1.7 that just caught my attention recently. I knew that Subversion stored md5 checksums of files in the repository, but I wasn’t quite sure how to easily access that information. The svnrdump command introduced in Subversion 1.7 provides the answer, and makes data auditing in Subversion much easier.

So why is this important? Well, to put it bluntly, stuff happens to data: it may be corrupted due to hardware failure, lost due to improper backup procedures, or purposely damaged by someone with bad intentions. Subversion MultiSite can protect you against all the vagaries of hardware and network, but if you work in a regulated environment you will someday have to prove that the data you took out of Subversion is the same as the data you put in.

That’s where the checksums come in. Let’s say I check out an important file from Subversion, like a configuration script or a data file with sensitive information. I can easily compare a local checksum against the checksum on the server to see if they match.

> md5sum BigDataFile.csv
3eba79a554754ac31fa0ade31cd0efe5  BigDataFile.csv
> svnrdump dump svn://myrepo/trunk/BigDataFile.csv
Text-content-md5: 3eba79a554754ac31fa0ade31cd0efe5

Simple enough, and very easy to script for automated auditing. If you store any important data in Subversion in a regulated environment, this simple feature is another way to help satisfy any compliance concerns about data integrity.

If you have any regulatory or compliance concerns around Subversion then grab the latest certified binaries, ask us for advice, or try out SVN MultiSite’s 100% data safety capability.


Git Data Mining with Hadoop

Detecting Cross-Component Commits

Sooner or later every Git administrator will start to dabble with simple reporting and data mining.  The questions we need to answer are driven by developers (who’s the most active developer) and the business (show me who’s been modifying the code we’re trying to patent), and range from simple (which files were modified during this sprint) to complex (how many commits led to regressions later on). But here’s a key fact: you probably don’t know in advance all the questions you’ll eventually want to answer. That’s why I decided to explore Git data mining with Hadoop.

We may not normally think of Git data as ‘Big Data’. In terms of sheer volume, Git repositories don’t qualify. In several other respects, however, I think Git data is a perfect candidate for analysis with Big Data tools:

  • Git data is loosely structured. There is interesting data available in commit comments, commit events intercepted by hooks, authentication data from HTTP and SSH daemons, and other ALM tools. I may also want to correlate data from several Git repositories. I’m probably not tracking all of these data sources consistently, and I may not even know right now how these pieces will eventually fit together. I wouldn’t know how to design a schema today that will answer every question I could ever dream up.

  • While any single Git repository is fairly small, the aggregate data from hundreds of repositories with several years of history would be challenging for traditional repository analysis tools to handle. For many SCM systems the ‘reporting replica’ is busier than the master server!

Getting Started

As a first step I decided to use Flume to stream Git commit events (as seen by a post-receive hook) to HDFS. I first set up Flume using a netcat source connected to the HDFS sink via a file channel. The flume.conf looks like:

git.sources = git_netcat
git.channels = file_channel
git.sinks = sink_to_hdfs
# Define / Configure source
git.sources.git_netcat.type = netcat
git.sources.git_netcat.bind =
git.sources.git_netcat.port = 6666
# HDFS sinks
git.sinks.sink_to_hdfs.type = hdfs
git.sinks.sink_to_hdfs.hdfs.fileType = DataStream
git.sinks.sink_to_hdfs.hdfs.path = /flume/git-events
git.sinks.sink_to_hdfs.hdfs.filePrefix = gitlog
git.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
git.sinks.sink_to_hdfs.hdfs.batchSize = 1000
# Use a channel which buffers events on disk
git.channels.file_channel.type = file
git.channels.file_channel.checkpointDir = /var/flume/checkpoint
git.channels.file_channel.dataDirs = /var/flume/data
# Bind the source and sink to the channel
git.sources.git_netcat.channels = file_channel = file_channel

The Git Hook

I used the post-receive-email template as a starting point, since it contains the basic logic to interpret the data the hook receives. I eventually obtain several pieces of information in the hook:

  • timestamp

  • author

  • repo ID

  • action

  • rev type

  • ref type

  • ref name

  • old rev

  • new rev

  • list of blobs

  • list of file paths

Do I really care about all of this information? I don’t really know – and that’s exactly why I’m just stuffing the data into HDFS. I don’t need all of it right now, but I might a couple of years down the road.

Once I marshal all the data I stream it to Flume via nc:

nc_data = \
 "{0}|{1}|{2}|{3}|{4}|{5}|{6}|{7}|{8}|{9}|{10}\n".format( \
 timestamp, author, projectdesc, change_type, rev_type, \
 refname_type, short_refname, oldrev, newrev, ",".join(blobs), \
 ",".join(paths))
p = Popen(['nc', NC_IP, NC_PORT], stdout=PIPE, \
 stdin=PIPE, stderr=STDOUT)
nc_out = p.communicate(input=nc_data)[0]
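On the analysis side, each pipe-delimited record can be split back into named fields before any heavier processing. A minimal parser sketch (field names taken from the list above; this is a downstream consumer, not part of the hook itself):

```python
# Field order matches the pipe-delimited record emitted by the hook.
FIELDS = ["timestamp", "author", "repo_id", "action", "rev_type",
          "ref_type", "ref_name", "old_rev", "new_rev", "blobs", "paths"]

def parse_event(line):
    """Split one hook record into a dict. The two trailing fields
    (blobs and paths) are themselves comma-delimited lists."""
    values = line.rstrip("\n").split("|")
    record = dict(zip(FIELDS, values))
    for key in ("blobs", "paths"):
        record[key] = record[key].split(",") if record[key] else []
    return record
```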

The First Query

Now that I have Git data streaming into HDFS via Flume, I decided to tackle a question I always find interesting: how isolated are Git commits? In other words, does a typical Git commit touch only one part of a repository, or does it touch files in several parts of the code? If you work in a component based architecture then you’ll recognize the value of detecting cross-component activity.

I decided to use Pig to analyze the data, and started by defining a table over it with HCatalog.

hcat -e "CREATE TABLE GIT_LOGS(time STRING, author STRING, \
  repo_id STRING, action STRING, rev_type STRING, ref_type STRING, \
  ref_name STRING, old_rev STRING, new_rev STRING, blobs STRING, paths STRING) \
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';"
Now for the fun part – some Pig Latin! Actually detecting cross-component activity will vary depending on the structure of your code; that’s part of the reason why it’s so difficult to come up with a canned schema in advance. But for a simple example let’s say that I want to detect any commit that touches files in two component directories, modA and modB. The list of file paths contained in the commit is a comma delimited field, so some data manipulation is required if we’re to avoid too much regular expression fiddling.

-- load from hcat
raw = LOAD 'git_logs' using org.apache.hcatalog.pig.HCatLoader();

-- tuple, BAG{tuple,tuple}
-- new_rev, BAG{p1,p2}
bagged = FOREACH raw GENERATE new_rev, TOKENIZE(paths) as value;
DESCRIBE bagged;

-- tuple, tuple
-- tuple, tuple
-- new_rev, p1
-- new_rev, p2
bagflat = FOREACH bagged GENERATE $0, FLATTEN(value);
DESCRIBE bagflat;

-- create list that only has first path of interest
modA = FILTER bagflat by $1 matches '^modA/.*';

-- create list that only has second path of interest
modB = FILTER bagflat by $1 matches '^modB/.*';

-- So now we have lists of commits that hit each of the paths of interest.  Join them...
-- new_rev, p1, new_rev, p2
bothMods = JOIN modA by $0, modB by $0;
DESCRIBE bothMods;

-- join on new_rev
joined = JOIN raw by new_rev, bothMods by $0;
DESCRIBE joined;

-- now that we've joined, we have the rows of interest and can discard the extra fields from bothMods
final = FOREACH joined GENERATE $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10;
DUMP final;

As the Pig script illustrates, I manipulated the data to obtain a new structure that had one row per file per commit. That made it easier to operate on the file path data; I made lists of commits that contained files in each path of interest, then used a couple of joins to isolate the commits that contain files in both paths. There are certainly other ways to get to the same result, but this method was simple and effective.
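For comparison, the same question can be answered in plain Python over a small sample, which makes the Pig flow easier to follow. (The rows here are hypothetical (new_rev, paths) pairs; at scale, the Pig version is what you’d actually run.)

```python
def cross_component_commits(rows, prefix_a="modA/", prefix_b="modB/"):
    """rows: iterable of (new_rev, list-of-paths) pairs.
    Return the revs whose file list touches both component
    directories -- the same result the Pig filters and joins
    produce."""
    hits = []
    for new_rev, paths in rows:
        touches_a = any(p.startswith(prefix_a) for p in paths)
        touches_b = any(p.startswith(prefix_b) for p in paths)
        if touches_a and touches_b:
            hits.append(new_rev)
    return hits
```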

In A Picture

A simplified data flow diagram shows how data makes its way from a Git commit into HDFS and eventually out again in a report.

Data Flow


What Next?

This simple example shows some of the power of putting Git data into Hadoop. Without knowing in advance exactly what I wanted to do, I was able to capture some important Git data and manipulate it after the fact. Hadoop’s analysis tools make it easy to work with data that isn’t well structured in advance, and of course I could take advantage of Hadoop’s scalability to run my query on a data set of any size. In the future I could take advantage of data from other ALM tools or authentication systems to flesh out a more complete report. (The next interesting question on my mind is whether commits that span multiple components have a higher defect rate than normal and require more regression fixes.)

Using Hadoop for Git data mining may seem like overkill at first, but I like to have the flexibility and scalability of Hadoop at my fingertips in advance.

Certified Git 1.8.4 Binaries Available

Git 1.8.4 (released on August 23rd) contains a nice collection of improvements and bug fixes.  Best of all, there’s no need to wait for an updated package or rebuild from source.  WANdisco has just released certified Git 1.8.4 binaries for all major platforms.

Here’s a list of key improvements:

  • Support for Cygwin 1.7
  • An update for git-gui
  • More flexible rebasing options allow you to select rebase strategy and automatically stash local changes before rebase begins
  • A contrib script that mines git blame output to show you who else might be interested in a commit
  • Improvements to submodule support, including the ability to run git submodule update from the submodule directory and the option to run a custom command after an update
  • A performance improvement when fetching between repositories with many refs

Of course, WANdisco offers Git training and consulting to help you get the most out of your Git deployments.  Grab Git 1.8.4 and start taking advantage of those new features!

Don’t forget to sign up for Subversion & Git Live 2013 in Boston, San Francisco or London!


Announcing Speakers for Subversion & Git Live 2013

Subversion & Git Live 2013 returns this October featuring expert-led workshops, presentations from industry leading analysts on the future of Subversion & Git, and unique committer roundtable discussions. In addition to all of the great sessions at this year’s conference, we’re pleased to announce a new keynote by Jeffrey S. Hammond, VP, Principal Analyst with Forrester Research. Hammond is a widely recognized expert in software application development and delivery. He will present “The Future of Subversion and Git,” and discuss the differences between these popular version control systems, while pointing out the issues IT organizations should consider before deciding which one to use.

Jeffrey Hammond joins Apache Software Foundation Vice Chairman and VP Apache Subversion Greg Stein who will present “Why Open Source is Good for your Health,” a look at how open source software works, how communities manage complex projects, and why it’s better for your business to rely on open-source rather than proprietary software.

Sessions include:

  • The Future of Subversion and Git

  • Why Open Source is Good for your Business

  • Subversion: The Road Ahead

  • What Just Happened? Intro to Git in the Real World

  • Practical TortoiseSVN

  • Introduction to Git Administration

  • Progress in Move Tracking

  • Developments in Merging

  • Git Workflows

  • …and more!

Subversion & Git Live is coming to Boston (Oct. 3), San Francisco (Oct. 8), and London (Oct. 16). View the agenda, travel and hotel details, and register at  We’re offering you and anyone you’d like to invite a 30% discount off the normal $199 registration fee if you register using promo code REF2013. Normally conferences featuring speakers of this caliber cost four times as much. Space is going fast, so register now!

Verifying Git Data Integrity

As a Git administrator you’re probably familiar with the git fsck command, which checks the integrity of a Git database. In a large deployment, however, you may have several mirrors in different locations supporting users and build automation (the record I’ve heard so far is over 50 mirrors). You can run git fsck on each one as normal maintenance, but even if all of the mirrors have intact databases, how do you make sure that all of the mirrors are consistent with the master repository? You need a simple way to verify Git data integrity for all repositories at all sites.

That’s quite a difficult question to answer. If you have 20 or 30 mirrors, you want to know if any of them are not in sync with the master. Inconsistencies may arise if the replication is lagging behind, or if there is some other subtle corruption.

Git MultiSite provides a simple consistency checker to answer this question quickly. (Bear in mind that Git MultiSite nodes are all writable peer nodes; it does not use a master-slave paradigm.  But the ability to make sure that all peer nodes are consistent is equally valuable.)  The consistency checker can be invoked for any repository in the administration console:

Git Consistency Check


The consistency checker computes a SHA1 ID over the values of all the current refs in the repository on each replicated peer node. This SHA1 is tied to the Global Sequence Number (GSN), which uniquely identifies all of the proposals in Git MultiSite’s Distributed Coordination Engine. The result looks like this:

Consistency Check Result



First, I see that the GSN matches across all three nodes. I’m now confident that they’re all reporting results at a consistent point, when the same transactions should be present in all nodes. In other words, I’m able to discount any inconsistencies due to network lag.

More importantly, I see that the SHA1 for the second node doesn’t match the other two. That’s a red flag, and it means that I should immediately investigate what’s wrong on that node.
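The underlying idea is easy to illustrate. Here is a sketch of a ref-set fingerprint (an illustration of the concept only, not Git MultiSite’s actual implementation): hash the sorted (ref name, commit ID) pairs, so two nodes produce the same fingerprint exactly when their refs agree.

```python
import hashlib

def refs_fingerprint(refs):
    """refs: dict mapping ref name -> commit SHA1.
    Hash the sorted pairs so the fingerprint is independent of
    iteration order and differs whenever any ref differs."""
    h = hashlib.sha1()
    for name in sorted(refs):
        h.update("{0} {1}\n".format(name, refs[name]).encode("utf-8"))
    return h.hexdigest()
```

Comparing such fingerprints across nodes, at a common GSN, is what lets a single mismatched value point straight at the out-of-sync replica.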

Now consider this example:


Different GSN


Notice that the third node is reporting an earlier GSN (23 versus 29) compared to the other two nodes. That tells me that this node is lagging behind, which may be expected if it’s connected over a WAN and always running 2-3 minutes behind the other nodes.

Running a distributed SCM environment is very difficult, and the consistency check is another way that Git MultiSite makes things easier for you. Check out a free trial and see for yourself!



Why We Don’t Build Software Like We Build Houses

In Why We Should Build Software Like We Build Houses, author Leslie Lamport makes the case for creating detailed specifications for building software, following the model of how an architect’s blueprints are used for building a house.

Lamport goes on to ask: “Can this be why houses seldom collapse and programs often crash?” Houses collapse seldom, perhaps, but with especially catastrophic consequences, such as the 68,000+ people killed in the 2008 Sichuan earthquake. Most of the deaths were due to the collapse of poorly designed or constructed houses.

However, as bad as that is, the 1556 Shaanxi earthquake killed more than 800,000 people, which at the time was at least 0.25% of the world’s total population. The equivalent death toll today: 17.5 million.

It’s hard to imagine an earthquake today that would kill 17.5 million people. And that’s largely because we know more about how to design and build houses for events such as earthquakes than we did in 1556. The point of this is to illustrate that methods for building houses are far more mature than methods for building software. Humans have been building houses at least since the last Ice Age; we have been building software for barely 50 years. That’s a reason why we don’t build software like we build houses: it’s simply so new that we are still in the early days of figuring out what works.

Knowing how to build houses or software to be robust during failure conditions is part of the solution. Another part is that architects and contractors have a mature form of communication: the blueprint. Blueprints are such complete and specific documents that a contractor could reasonably be expected to complete a house even without continued input from the architect. Architects who create blueprints have extensive knowledge about materials, joints, weatherproofing, building codes, fasteners, almost every detail. Most people responsible for software products rarely create software specifications to the same level of detail as an architect’s blueprint. Whatever the equivalent is, it clearly needs to be more developed than what we normally think of as a software “spec”.

Indeed, the software industry has in large part pointedly turned its back on well-developed construction paradigms with trends like Extreme Programming, Agile, Lean, and Minimum Viable Product. Many of these techniques involve iteration, quick results, and fast prototyping. I see these reactions not as embracing a superior method, but as creative searching within a new technology. Eventually, I think we will develop more formal methods for planning and building software, technology for software blueprints will emerge, and highly technical product architects will effectively span the divide between vision and code.

The relative newness of consumer and even enterprise software means we still have workarounds and a measure of tolerance for computer failures. However, even only a few decades into the computer era, we are increasing our expectations around continuous availability of our systems and disaster recovery/data safety for our data.

This will likely drive new baselines for reliability and availability of software, and with it, the need to more fully visualize the system we are trying to build. At WANdisco, we are seeing urgent requirements emerge around software development tools like Git and Subversion, and also around Hadoop as big data analysis transitions from promising curiosity to ubiquitous backbone technology. Non-Stop Data™, indeed.

We don’t build software like we build houses because we don’t know enough yet about building software, and to a lesser extent because we aren’t completely dependent on software yet. As the field matures, Lamport’s recommendation may yet become standard practice.

WANdisco Sponsors UC Berkeley AMPLab, Creators of Spark and Shark

We’re pleased to sponsor the UC Berkeley AMPLab (Algorithms, Machines, and People), a five-year collaborative effort responsible for the development of Spark, Shark and Mesos.

WANdisco previously announced the integration of Spark and Shark into our certified Apache Hadoop binaries, and we look forward to working closely with the talented AMPLab team on continued research into in-memory data storage for Hadoop.

“We are pleased with WANdisco’s strong support of AMPLab as well as Spark and Shark,” said Ion Stoica, co-director of AMPLab of UC Berkeley Electrical Engineering and Computer Sciences. “Their participation helps with market validation, and our continued work will enable businesses to quickly deploy interactive data analytics on Hadoop.”

Interested in learning more about Hadoop? Register for one of our Hadoop Webinars. Or register for 25% off registration fees to Strata RX Boston September 25-27 using promo code WANDISCO25.