Detecting Dependency Trends in Components Using R and Hadoop

As I’ve been experimenting with Flume to ingest ALM data into a Hadoop cluster, I’ve made a couple of interesting observations.

First, the Hadoop ecosystem makes it easy for any team to start using these tools to gather data from disparate ALM sources. You don’t need big enterprise data warehouse (EDW) tools – just Flume and a small Hadoop cluster, or even just a VM from one of the Hadoop vendors to get started. These tools are free and easy to use in a small deployment, and you simply scale everything up as your needs grow.

Second, once the data is in Hadoop, you have access to the growing set of free data analysis tools for Hadoop, ranging from Hive and Pig to scripted MapReduce jobs and more powerful tools like R.

My most recent experiment utilized the RMR package from Revolution Analytics, which provides a bridge between R, MapReduce, and HDFS. In this case, I had already used Flume to ingest Git commit data from a couple of related Git repositories, and I decided to look for any unusual relationships in the commit activity for the components in the system, including:

  • The most active components

  • The number of commits that affected more than one component

  • Which pairs of components tended to see work in the same commit

That last item I often find very interesting, as it may indicate some dependencies between components that aren’t otherwise obvious.

I had all the Git data stored on HDFS, so I used a ‘word count’-style MapReduce task to provide the counts. A partial R script is shown below.

# libraries
library(rmr2)  # assumed: the rmr2 release of the RMR package

dfs.git = mapreduce(
  input = "/user/admin/git",
  map = function(k, v) {
    comps = c()

    for (i in seq_len(nrow(v))) {
      lcomps = c()

      # … some cleanup work to extract components ...
      lcomps = append(lcomps, component)
      lcomps = sort(unique(lcomps))
      numUnique = length(lcomps)

      if (numUnique > 1) {
        # record every unordered pair of components touched by this commit
        multis = c()
        for (j in 1:(numUnique - 1)) {
          for (m in (j + 1):numUnique) {
            multis = append(multis, paste0(lcomps[j], "-", lcomps[m]))
          }
        }
        # flag the commit as spanning more than one component
        lcomps = append(lcomps, c(multis, "MULTI"))
      }

      comps = append(comps, lcomps)
    }
    # emit a count of 1 for each component, pair, and MULTI marker
    keyval(comps, 1)
  },
  reduce = function(k, vv) {
    keyval(k, sum(vv))
  }
)

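The pair-recording step can be tried locally, without a cluster. What follows is a minimal sketch in plain R rather than the script above: the `commits` list and the `emit_keys` helper are hypothetical names with made-up data, `combn` stands in for the nested loops, and `table` stands in for the reduce step.

```r
# Hypothetical sample: the components touched by each of four commits.
commits = list(
  c("app", "spec"),
  c("app"),
  c("doc", "app", "spec"),
  c("spec", "app", "app")
)

# Hypothetical helper: the keys a single commit would contribute.
emit_keys = function(components) {
  lcomps = sort(unique(components))
  if (length(lcomps) < 2) {
    # a single-component commit contributes only that component
    return(lcomps)
  }
  # combn() enumerates every unordered pair exactly once
  pairs = combn(lcomps, 2, FUN = paste, collapse = "-")
  c(lcomps, pairs, "MULTI")
}

# Stand-in for the map and reduce phases: collect all keys, then count them.
keys = unlist(lapply(commits, emit_keys))
counts = sort(table(keys), decreasing = TRUE)
print(counts)
```

Sorting the component names before pairing means "app-spec" and "spec-app" always collapse to the same key, which is what lets a simple count work.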
Now that I’ve got these counts for each component and component pair, I can easily get them back into R for further manipulation.

out = from.dfs(dfs.git)
comps = unlist(out[[1]])
count = unlist(out[[2]])
results = data.frame(comps = comps, count = count)
results = results[order(results$count, decreasing = TRUE), ]
# filter on the sorted data frame's own column so rows stay aligned
r = results[results$count > 250, ]

I’ll just focus on the most active components and pairs, which I can see in this plot.
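A chart of this kind could be drawn with base R's `barplot`. This is only a sketch: the `r` data frame here is filled with made-up names and counts so the snippet runs on its own, and it isn't necessarily how the original plot was produced.

```r
# Made-up counts standing in for the filtered results; not real data.
r = data.frame(
  comps = c("spec", "app", "MULTI", "lib", "app-spec"),
  count = c(500, 1200, 800, 300, 600),
  stringsAsFactors = FALSE
)

# Order from busiest to quietest so the tallest bar comes first.
r = r[order(r$count, decreasing = TRUE), ]

# las = 2 rotates the axis labels so longer pair names stay readable.
barplot(
  r$count,
  names.arg = r$comps,
  las = 2,
  ylab = "Commits",
  main = "Most active components and pairs"
)
```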

Anything interesting there? Maybe. It certainly looks like the ‘app’ component is far and away the busiest component, so perhaps it’s ripe for refactoring. I also notice that ‘app’ and ‘spec’ tend to be updated a lot in the same commit, and there’s a lot of cross-component work (“MULTI”) going on. And what’s missing? Well, the ‘doc’ component isn’t updated very often with other components. Perhaps we’re not being good about documenting test cases right away.

But the main point is that I can now do some interesting data exploration with a minimum amount of work and no investment in an EDW.

So even if your ALM data isn’t ‘Big Data’ yet, you can still take advantage of the flexibility, low barriers to entry, and scalability of the Hadoop ecosystem. You’ll have some fairly interesting realizations before you know it!

