Git Data Mining with Hadoop

Detecting Cross-Component Commits

Sooner or later every Git administrator will start to dabble with simple reporting and data mining.  The questions we need to answer are driven by developers (who’s the most active developer) and the business (show me who’s been modifying the code we’re trying to patent), and range from simple (which files were modified during this sprint) to complex (how many commits led to regressions later on). But here’s a key fact: you probably don’t know in advance all the questions you’ll eventually want to answer. That’s why I decided to explore Git data mining with Hadoop.

We may not normally think of Git data as ‘Big Data’. In terms of sheer volume, Git repositories don’t qualify. In several other respects, however, I think Git data is a perfect candidate for analysis with Big Data tools:

  • Git data is loosely structured. There is interesting data available in commit comments, commit events intercepted by hooks, authentication data from HTTP and SSH daemons, and other ALM tools. I may also want to correlate data from several Git repositories. I’m probably not tracking all of these data sources consistently, and I may not even know right now how these pieces will eventually fit together. I wouldn’t know how to design a schema today that will answer every question I could ever dream up.

  • While any single Git repository is fairly small, the aggregate data from hundreds of repositories with several years of history would be challenging for traditional repository analysis tools to handle. For many SCM systems the ‘reporting replica’ is busier than the master server!

Getting Started

As a first step I decided to use Flume to stream Git commit events (as seen by a post-receive hook) to HDFS. I first set up Flume using a netcat source connected to the HDFS sink via a file channel. The flume.conf looks like:

git.sources = git_netcat
git.channels = file_channel
git.sinks = sink_to_hdfs
# Define / Configure source
git.sources.git_netcat.type = netcat
git.sources.git_netcat.bind =
git.sources.git_netcat.port = 6666
# HDFS sinks
git.sinks.sink_to_hdfs.type = hdfs
git.sinks.sink_to_hdfs.hdfs.fileType = DataStream
git.sinks.sink_to_hdfs.hdfs.path = /flume/git-events
git.sinks.sink_to_hdfs.hdfs.filePrefix = gitlog
git.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
git.sinks.sink_to_hdfs.hdfs.batchSize = 1000
# Use a channel which buffers events in memory
git.channels.file_channel.type = file
git.channels.file_channel.checkpointDir = /var/flume/checkpoint
git.channels.file_channel.dataDirs = /var/flume/data
# Bind the source and sink to the channel
git.sources.git_netcat.channels = file_channel = file_channel

The Git Hook

I used the post-receive-email template as a starting point as it contains the basic logic to interpret the data the hook receives. I eventually obtain several pieces of information in the hook:

  • timestamp

  • author

  • repo ID

  • action

  • rev type

  • ref type

  • ref name

  • old rev

  • new rev

  • list of blobs

  • list of file paths

Do I really care about all of this information? I don’t really know – and that’s the reason I’m just stuffing the data into HDFS right now. I don’t care about all of it right now, but I might need it a couple years down the road.

Once I marshal all the data I stream it to Flume via nc:

nc_data = \
 "{0}|{1}|{2}|{3}|{4}|{5}|{6}|{7}|{8}|{9}|{10}\n".format( \
 timestamp, author, projectdesc, change_type, rev_type, \
 refname_type, short_refname, oldrev, newrev, ",".join(blobs), \
p = Popen(['nc', NC_IP, NC_PORT], stdout=PIPE, \
 stdin=PIPE, stderr=STDOUT)
nc_out = p.communicate(input="{0}".format(nc_data))[0]

The First Query

Now that I have Git data streaming into HDFS via Flume, I decided to tackle a question I always find interesting: how isolated are Git commits? In other words, does a typical Git commit touch only one part of a repository, or does it touch files in several parts of the code? If you work in a component based architecture then you’ll recognize the value of detecting cross-component activity.

I decided to use Pig to analyze the data, and started by ingesting data with HCat.

hcat -e "CREATE TABLE GIT_LOGS(time STRING, author STRING, \
  repo_id STRING, action STRING, rev_type STRING, ref_type STRING, \
  ref_name STRING, old_rev STRING, new_rev STRING, blobs STRING, paths STRING) \

Now for the fun part – some Pig Latin! Actually detecting cross-component activity will vary depending on the structure of your code; that’s part of the reason why it’s so difficult to come up with a canned schema in advance. But for a simple example let’s say that I want to detect any commit that touches files in two component directories, modA and modB. The list of file paths contained in the commit is a comma delimited field, so some data manipulation is required if we’re to avoid too much regular expression fiddling.

-- load from hcat
raw = LOAD 'git_logs' using org.apache.hcatalog.pig.HCatLoader();

-- tuple, BAG{tuple,tuple}
-- new_rev, BAG{p1,p2}
bagged = FOREACH raw GENERATE new_rev, TOKENIZE(paths) as value;
DESCRIBE bagged;

-- tuple, tuple
-- tuple, tuple
-- new_rev, p1
-- new_rev, p2
bagflat = FOREACH bagged GENERATE $0, FLATTEN(value);
DESCRIBE bagflat;

-- create list that only has first path of interest
modA = FILTER bagflat by $1 matches '^modA/.*';

-- create list that only has second path of interest
modB = FILTER bagflat by $1 matches '^modB/.*';

-- So now we have lists of commits that hit each of the paths of interest.  Join them...
-- new_rev, p1, new_rev, p2
bothMods = JOIN modA by $0, modB by $0;
DESCRIBE bothMods;

-- join on new_rev
joined = JOIN raw by new_rev, bothMods by $0;
DESCRIBE joined;

-- now that we've joined, we have the rows of interest and can discard the extra fields from both_mods
final = FOREACH joined GENERATE $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10;
DUMP final;

As the Pig script illustrates, I manipulated the data to obtain a new structure that had one row per file per commit. That made it easier to operate on the file path data; I made lists of commits that contained files in each path of interest, then used a couple of joins to isolate the commits that contain files in both paths. There are certainly other ways to get to the same result, but this method was simple and effective.

In A Picture

A simplified data flow diagram shows how data makes its way from a Git commit into HDFS and eventually out again in a report.

Data Flow

Data Flow

What Next?

This simple example shows some of the power of putting Git data into Hadoop. Without knowing in advance exactly what I wanted to do, I was able to capture some important Git data and manipulate it after the fact. Hadoop’s analysis tools make it easy to work with data that isn’t well structured in advance, and of course I could take advantage of Hadoop’s scalability to run my query on a data set of any size. In the future I could take advantage of data from other ALM tools or authentication systems to flesh out a more complete report. (The next interesting question on my mind is whether commits that span multiple components have a higher defect rate than normal and require more regression fixes.)

Using Hadoop for Git data mining may seem like overkill at first, but I like to have the flexibility and scalability of Hadoop at my fingertips in advance.

0 Responses to “Git Data Mining with Hadoop”

  • No Comments

Leave a Reply