Using Hadoop to Generate a Commit Time Histogram
Knowing when your Git servers are under the most load can help you answer several questions:
When is a good time to schedule routine maintenance or automated activity? Ideally, you want to find a time when there is very little developer activity on the system.
Are there periods of peak usage coinciding with the normal working schedule of a particular office? Perhaps that office needs more Git servers.
Are most of the commits coming at the end of a normal working day? Are you seeing a spike of commits during a certain time frame, say late at night? These might be signs of unhealthy work habits, such as an overburdened team, or capacity challenges, such as bottleneck issues when everyone tries to commit right before going home.
I decided to analyze this issue with Hadoop tools.
Briefly, we need to:
Extract the relevant data from Git and make it available on HDFS. I covered one approach to this problem – using Flume to stream Git data into HDFS – in a previous post.
Load the data into a table in HCatalog. This step is trivial and I described it in a previous post.
Use Pig to analyze the data.
Use a graphing tool to visualize the results.
I want to generate a commit time histogram showing the number of commits during each hour of the day. To do that, I need to group commits by the hour of the commit time and then count the commits in each bucket. Both steps take only a few lines of Pig.
-- load the commit data from the HCatalog table
raw = LOAD 'git_logs' USING org.apache.hcatalog.pig.HCatLoader();
-- extract the hour from the commit timestamp
hours = FOREACH raw GENERATE new_rev, GetHour(ToDate(time)) AS hour;
-- group by hour
groupedbyhour = GROUP hours BY hour;
-- count the commits in each hour
hourcounts = FOREACH groupedbyhour GENERATE group AS hour, COUNT(hours) AS numhour;
-- write the results
STORE hourcounts INTO 'gl.hist' USING PigStorage();
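To sanity-check the Pig job on a small sample, the same group-and-count logic can be sketched in a few lines of Python. This is only an illustration, not part of the pipeline: it assumes the commit times are available as ISO 8601 strings (the actual format of the `time` column isn't shown here), and the function name `commits_per_hour` and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime


def commits_per_hour(timestamps):
    """Count commits per hour of day (0-23).

    Assumes each timestamp is an ISO 8601 string; the hour is taken
    exactly as written in the string, with no time zone conversion.
    """
    hours = (datetime.fromisoformat(ts).hour for ts in timestamps)
    return Counter(hours)


# Made-up sample data, for illustration only.
sample = [
    "2013-05-01T09:15:00",
    "2013-05-01T09:47:12",
    "2013-05-02T17:03:41",
]

counts = commits_per_hour(sample)
for hour in sorted(counts):
    # Same shape as the Pig result: hour, then commit count.
    print(f"{hour}\t{counts[hour]}")
```

Running this over a handful of commits gives a small set of hour and count pairs that should match what the Pig job produces for the same commits.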
The output has one line per hour of the day, 24 lines when every hour saw at least one commit, each holding the hour and the number of commits in that hour, separated by a tab (PigStorage's default delimiter). It's then simple to plot the data using Excel, gnuplot, or another graphing tool.
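If you'd rather stay on the command line, a rough text histogram is enough for a first look. The sketch below assumes the results have already been copied out of HDFS and that each line holds an hour and a count separated by a tab, which is PigStorage's default delimiter. Because Pig's GROUP only emits hours that actually have commits, the parser fills any missing hours with zero. The function names and the example lines are hypothetical.

```python
def parse_hour_counts(lines):
    """Parse 'hour<TAB>count' lines into a 24-element list, zero-filling gaps."""
    counts = [0] * 24
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        hour, count = line.split("\t")
        counts[int(hour)] = int(count)
    return counts


def render_histogram(counts, width=40):
    """Return one text bar per hour, scaled so the busiest hour spans `width`."""
    peak = max(counts) or 1  # avoid dividing by zero when there are no commits
    rows = []
    for hour, count in enumerate(counts):
        bar = "#" * round(count * width / peak)
        rows.append(f"{hour:02d} {bar} {count}")
    return rows


# Made-up lines for illustration only, not real project data.
example_lines = ["9\t20", "10\t40", "17\t10"]
for row in render_histogram(parse_hour_counts(example_lines)):
    print(row)
```

The zero-filling step matters mostly for quiet repositories, where an hour with no commits would otherwise be silently dropped from the chart.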
In this example I’ve graphed the commits from a popular open source project. We can see that there is a nice even distribution of commits over the working day and evening, and a lull overnight.
That’s a Wrap
A commit time histogram is just another example of the interesting data you can extract from your SCM and ALM systems using Hadoop tools. Some of this data can be seen using traditional data analysis tools, but Hadoop lets the same analysis scale as the commit history grows and does not require forcing the data into a fixed structure up front.
In my next post I’ll be looking at another take on visualizing commit data: generating a heat map of commits by user location.