Monthly Archive for January, 2014

Metadata: Big Data’s Secret Superpower

When I heard the President of the United States repeatedly saying the word “metadata” in a speech recently, I realized just how seamlessly the word had made it into our common vernacular. As someone who has long worked with SCM (Software Configuration Management) technology, however, I have had metadata as a primary focus for over a decade. That’s because SCM is about the creation, management, querying and archival of metadata about the changes in a software codebase.

What is metadata? I describe it as “data about data”. In SCM, we would talk of “integration credit” for a merge. This credit is actually data that’s created and stored about the merge of a file from one branch into another, a common software engineering task. I have likely performed thousands of merges during my career in software development, and each of them has created a piece of data recording my actions.

Note that this metadata does not store any of the content of the merged files. That’s what makes it “data about data”, i.e., metadata. Some industries even consider their SCM metadata a trade secret because it reveals the methods used to build something.

Enter Big Data. “Big Data” is “big” not just in terms of size on disk, but also in the massive breadth of data that’s typically accumulated from many different sources and made available in a single database.

It’s precisely this breadth of data that ignites Big Data’s secret superpower of metadata. That’s partly because metadata is often linked to the intent of an action and partly because interpreting the raw data itself often requires specialized knowledge to parse and understand. Linking together metadata from a broad range of sources can reveal connections that would not otherwise be visible, powering the Holy Grail that is Predictive Analytics.

We’re still at the beginning of Big Data’s disruption of all manner of markets and systems. Learning what data sources are valuable, discovering sources of data within a system and exporting them in real time, and finding the types of new questions we can ask and have answered by Big Data are all works in progress.

Here at WANdisco, we are hard at work paving the road ahead for the new paradigms and promise of Big Data. So we’re always interested in challenges faced by our present and future customers. What are your plans for using Big Data in your business?

Why Now is the Time to Move to Subversion or Git

A recent Forrester survey revealed that 17% of enterprise software developers are still using Visual SourceSafe (VSS) or CVS. It’s a bit of a puzzle why these two antiquated systems are still hanging on, but I think I understand a bit of the reason.

SCM is like plumbing from a certain perspective. It’s a vital piece of infrastructure and once you’ve used it you don’t ever want to go without. But as long as it’s working well, you also don’t see a real reason to upgrade it very often. It’s only when things break that you realize how important those pipes are.

Fair enough: if your company made an investment in CVS or VSS 10 years ago, you need a solid reason to upgrade. SCM systems don’t wear out like physical assets, and moving to a new system can be complex.

But I think now is the time to make the move for these three reasons:

  • EOL. VSS has reached end of life.

  • No updates. Neither VSS nor CVS is receiving updates anymore. You’re missing out on features like atomic commits and strong branching, which are considered essential for productive software development.

  • Lack of tooling. Compatible software development tooling – IDEs, build systems, code review tools, deployment pipelines – is increasingly difficult to find for VSS and CVS.

Given that replacing an SCM system may only happen every 5-10 years, it’s worth considering where you want to make your investment. Here are three reasons I think Subversion and Git should be at the top of your future-proof list.

  • Open source. Subversion and Git are actively developed, have a robust user community with widespread adoption, and immunize you from vendor lock-in.

  • Best of breed. Subversion and Git have all the modern features you need. Subversion in particular handles large data sets very well, while Git is known for powerful local development workflows.

  • Stable and proven. Subversion has been widely adopted over the past 10 years and is used by some of the largest companies in the world. It is rock solid stable and has many enterprise features. Git is newer but now enjoys wide community and vendor support.

Whatever your choice, WANdisco is ready to help. We have certified binaries, support and training, and products for enterprise-grade uptime and security.  Let’s start the conversation!

Git Repository Metrics with Nagios

A few weeks ago I wrote about gathering some Git repository metrics and viewing them in the Git MultiSite GUI or in Graphite. Someone pointed out that some of the administrative metrics useful for capacity planning could also be gathered using monitoring tools like Nagios. Repository size on disk is a good example, since system administrators normally monitor disk space to make sure that the server doesn’t run out. You can set up Nagios to monitor any file system that contains Git repositories.


[Figure: Git Repo Metrics]

This level of check_disk information is useful for high-level monitoring, and there are many plugins to connect Nagios to RRD graphing tools. I set up my example with OpsView, which includes a built-in graphing capability.

Drilling down to individual repositories would be an easy modification to the check_disk plugin, or you can get more granular data from Git MultiSite.
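
As a rough illustration of what a per-repository check could look like, here is a minimal Python sketch. It is not the check_disk plugin or a modification of it; the function names, thresholds, and the REPO_SIZE label are all hypothetical. The only conventions taken from Nagios are the plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL) and the performance data that follows the "|" in the status line.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def check_repo_size(path, warn_bytes, crit_bytes):
    """Return (exit_code, status_line) in the style of a Nagios plugin."""
    size = dir_size_bytes(path)
    if size >= crit_bytes:
        code, label = CRITICAL, "CRITICAL"
    elif size >= warn_bytes:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Everything after '|' is performance data that graphing add-ons can chart.
    perfdata = f"size={size}B;{warn_bytes};{crit_bytes}"
    line = f"REPO_SIZE {label} - {path} uses {size} bytes | {perfdata}"
    return code, line
```

A wrapper script would print the status line and exit with the returned code, once per repository directory, so that each repository shows up as its own service check.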


When Are Your Git Servers Busy?

Using Hadoop to Generate a Commit Time Histogram

Knowing when your Git servers are under the most load can help you answer several questions:

  • When is a good time to schedule routine maintenance or automated activity? Ideally, you want to find a time when there is very little developer activity on the system.

  • Are there periods of peak usage coinciding with the normal working schedule of a particular office? Perhaps that office needs more Git servers.

  • Are most of the commits coming at the end of a normal working day? Are you seeing a spike of commits during a certain time frame, say late at night? These might be signs of unhealthy work habits, such as an overburdened team, or capacity challenges, such as bottleneck issues when everyone tries to commit right before going home.

I decided to analyze this issue with Hadoop tools.

The Steps

Briefly, we need to:

  • Extract the relevant data from Git and make it available on HDFS. I covered one approach to this problem – using Flume to stream Git data into HDFS – in a previous post.

  • Load the data into a table in HCatalog. This step is trivial and I described it in a previous post.

  • Use Pig to analyze the data.

  • Use a graphing tool to visualize the results.

Analysis Step

I want to generate a commit time histogram showing the number of commits during each hour of the day. I need to group commits by the hour of the commit time, and then count the commits in each bucket. These steps are very easy in Pig.

-- load data
raw = LOAD 'git_logs' using org.apache.hcatalog.pig.HCatLoader();
describe raw;

-- extract hour from commit timestamp
hours = FOREACH raw GENERATE new_rev, GetHour(ToDate(time)) as hour;
describe hours;

-- group by hour
groupedbyhour = GROUP hours by hour;
describe groupedbyhour;

-- sum up number of commits per hour
hourcounts = FOREACH groupedbyhour GENERATE group AS hour, COUNT(hours) AS numhour;
describe hourcounts;
dump hourcounts;

store hourcounts into 'gl.hist' using PigStorage();

The output looks like this:

0 314
1 190


The output file has 24 lines showing the count of commits for each hour of the day. It’s then simple to plot the data using Excel, gnuplot, or another graphing tool.
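
For readers who want to try the same bucketing on a small data set without a Hadoop cluster, the grouping and counting logic can be sketched in a few lines of Python. This is an illustration only, not a replacement for the Pig script: it assumes the commit times are Unix epoch seconds and buckets them in UTC, neither of which is stated for the git_logs table, and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def commits_per_hour(epoch_seconds):
    """Count commits in each hour-of-day bucket (0-23), using UTC."""
    hours = (
        datetime.fromtimestamp(ts, tz=timezone.utc).hour
        for ts in epoch_seconds
    )
    counts = Counter(hours)
    # Emit all 24 hours, including empty ones, so the rows plot directly.
    return [(hour, counts.get(hour, 0)) for hour in range(24)]


# Made-up timestamps, for illustration only.
sample = [0, 60, 3600, 169200]
for hour, count in commits_per_hour(sample):
    if count:
        print(f"{hour}\t{count}")
```

Commit times recorded in different time zones would need to be normalized first, otherwise the hourly buckets mix local working hours from different offices.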


[Figure: Commit Time Histogram]

In this example I’ve graphed the commits from a popular open source project.  We can see that there is a nice even distribution of commits over the working day and evening, and a lull overnight.

That’s a Wrap

A commit time histogram is just another example of the interesting data you can extract from your SCM and ALM systems using Hadoop tools. Some of this data can be seen using traditional data analysis tools, but using Hadoop takes away any concern about future scalability or data structure problems.

In my next post I’ll be looking at another take on visualizing commit data: generating a heat map of commits by user location.


Git MultiSite Simplifies Complexity

Our newest version of Git MultiSite, version 1.2, provides centralized management and replicated configuration settings for simplified administration, as well as enhanced security across multiple sites. In addition, Git MultiSite 1.2 integrates seamlessly with common ALM toolsets with enhanced support for distributed notification mechanisms. These features alleviate administrative burdens and boost security for global enterprises looking to streamline their source control management systems.

According to Jay Lyman, senior analyst for enterprise software at 451 Research, “Large enterprises are using Git for faster, more agile and collaborative development. However, sometimes tools like these add tremendous complexity, so managers and administrators appreciate the centralized management and seamless integration capabilities provided by solutions like WANdisco Git MultiSite, and this is key for global enterprises.”

The new release also enables easy integration with WANdisco Git Access Control for further security and simplicity. Git Access Control protects valuable intellectual property by providing granular access control with enterprise-grade authorization and audit capabilities, providing a complete audit trail, including user ID, date/time stamp, and command used.

“Git MultiSite provides enterprises with global disaster recovery and 100% uptime for Git,” said David Richards, WANdisco Chairman and CEO. “Git MultiSite 1.2 adds features that further enhance performance, manageability, and security for enterprises that value their data and appreciate the ability to maximize their source code management systems.”

Learn more here.

Advanced Subversion Access Control

Wrapping up a short series on some of the hidden gems of SVN Access Control, let’s take a look at using regular expressions to handle some advanced Subversion access control problems. The example I’ll use today is granting all developers the right to commit into a subdirectory of otherwise restricted branches.

The repository starts with a typical trunk-branches-tags structure, and all of these branches and tags are read-only for most developers, but we’d like to let developers commit their personal configuration and environment settings into a debug folder in each branch.

Managing this problem for one branch is easy: just define a rule that grants read access to the branch and add a second rule that grants write access to the debug folder.

But I don’t want to have to list a write rule for each branch individually; that just doesn’t scale.  Instead I’ll take advantage of SVN Access Control’s regular expressions to handle the job.

[Figure: RegEx-based Access Control]

That’s probably the simplest example of using regular expressions to handle non-trivial access control rules. Another common example is restricting write access to build scripts (e.g. makefiles, build.xml, pom.xml).
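
The matching idea behind a rule like this can be sketched in Python. To be clear, this is not SVN Access Control’s rule syntax or implementation: the repository paths, the rule list, and the first-match-wins ordering are all assumptions made for illustration.

```python
import re

# Hypothetical rules, checked in order; the first matching pattern wins.
# One regular expression covers the debug folder of every branch, so no
# per-branch write rule is needed.
RULES = [
    (re.compile(r"^/branches/[^/]+/debug(/.*)?$"), "write"),
    (re.compile(r"^/(trunk|branches|tags)(/.*)?$"), "read"),
]


def access_for(path):
    """Return the access level granted to developers for a repository path."""
    for pattern, level in RULES:
        if pattern.match(path):
            return level
    return "none"
```

With these rules, a path such as /branches/release-1.0/debug/env.conf is writable while /branches/release-1.0/src/main.c stays read-only, and a newly created branch picks up the same behavior without any rule changes.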

Whatever your challenge, SVN Access Control gives you the tools for the job. Chat with one of our Subversion experts or start a free trial and see for yourself.



Challenges of the Git Enterprise Architect #1: Managing Many Repos

This is post number two in a series of short articles exploring challenges facing anyone deploying Git at scale in their enterprise software development environment. Find the introduction here.

Anyone migrating to Git from a centralized version control system will quickly run into one of Git’s most characteristic features: a codebase is most naturally represented by one complete repository.  That is, you don’t define a working copy based on a part of a repository; you get the entire repository as your working copy.

Centralized version control systems tended to become a grab bag of everything: main products, side projects, a file you needed at home but didn’t have a USB handy, and sometimes, lots and lots of large binary files. You’d then define a tiny fraction of the world as your working copy and move just that piece down.

Once all of that is migrated to a Git repo, however, you are suddenly cloning the world onto your laptop!

Doing the splits

The best practice answer in the Git world is that you need to split all the unrelated items into separate Git repos. But now there are many repos and a related number of new questions:

  • Who in your organization is responsible for managing all the repos?
  • What tracks code if it is, for example, refactored to a different repo?
  • How do developers find the repos with the code they need?
  • What about codebases that share code but for secrecy or scaling reasons can’t all be included in a single Git repo?
  • Who provisions new repos? Are they automatically backed up properly? Where do they live?
  • What if you have a large and entangled code base that will be expensive to refactor?

Untracked code movement

Of the various questions raised, one of the most important concerns code movement: moving code between repos generates no metadata. To each Git repo, incoming files look like a code drop, and outgoing files look like local files donated to another repo. This cuts against the grain of SCM. Software Configuration Management is in danger of becoming Software Confusion Mess, because we lose track of why and how code is moving within a codebase. That means we might not be able to answer questions like:

  • What codelines contain this recently discovered bug?
  • What products contain this piece of GPL-licensed code?
  • Did the refactored code get into every library that uses it?
  • What repos did this particular line of code pass through before getting here?

There are ways around all of these problems, of course, and my main point is merely that these are some issues to keep in mind. Perhaps most of them are not important in your use cases, or you are using one of the many tools that address some of them. And of course, as I implied in my article “Problem-centric Products”, you can expect WANdisco’s Git roadmap to pass through all of these challenges.

So stay tuned for the next installment: “Access Control”.

Git 1.8.5 certified binaries available

Git 1.8.5 was released recently, and WANdisco has just published certified binaries for all major platforms.

What’s new in 1.8.5? As with all minor releases, there are several nice fixes and improvements:

  • You can now specify HTTP configuration settings (like accepting unknown certificates) per site.
  • You can move submodules with git mv.
  • git gc will detect when another instance is running and quit.

Grab a certified binary and enjoy the goodness!

Location-Aware Subversion Access Control

Almost all Subversion access control systems are role- or group-based. Typically a particular group of developers has write access to the repository while another, larger group has read access, but sometimes it’s more useful to control access based on location. IP address-based or location-aware Subversion access control is one of the most powerful features of WANdisco’s SVN Access Control product.

SVN Access Control is a mature product, but it’s worth taking a look at some of the clever features that may not jump out at first glance. The foundations of SVN Access Control are simple management, LDAP integration, granular permissions down to the file level, and strong auditing, but IP address-based rules are one of the hidden gems.

Setting an IP address-based rule is easy; simply specify the range of applicable IP addresses when adding or editing the rule.

Still the question remains: why should you care about the IP address of a user? If that person is part of the team, why does it matter where they’re connecting from? There are many reasons, actually, but they boil down to two categories.

Not every part of the network is trusted as much as the main office LAN

  • We can limit access to sensitive data when developers are connecting over VPN.

  • We can grant different access to the same user if they’re working at a remote partner office versus the main office (and can audit what’s being accessed from remote sites, in the spirit of trust but verify).

More than just source code is stored in Subversion

  • We can make production environment and configuration data read-only on development machines, read-only on app servers, and writable only for authorized Ops workstations.

  • We can lock down data that we need to push to a public cloud for deployment.

  • We can make data read-only when accessed from a build server, just in case.
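
To make the idea of a location-aware rule concrete, here is a minimal Python sketch of mapping a client address to an access level. It does not reflect how SVN Access Control stores or evaluates its rules; the networks, access levels, and first-match ordering are hypothetical values chosen for illustration.

```python
import ipaddress

# Hypothetical networks and access levels, checked in order.
LOCATION_RULES = [
    (ipaddress.ip_network("10.1.0.0/16"), "write"),  # main office LAN
    (ipaddress.ip_network("10.8.0.0/24"), "read"),   # VPN address pool
]


def access_from(client_ip, default="none"):
    """Return the access level for a client address; first match wins."""
    address = ipaddress.ip_address(client_ip)
    for network, level in LOCATION_RULES:
        if address in network:
            return level
    return default
```

Here the same user would get write access from 10.1.4.20 on the office LAN, read access from 10.8.0.77 over VPN, and nothing from an address outside both ranges.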

SVN Access Control is a powerful tool for securing and managing Subversion data. If you haven’t explored IP address-based rules yet, give them a shot. You may find they help solve some tricky problems. You can start with a free trial or talk to one of our Subversion experts first.



SmartSVN 8 Available Now

We’re pleased to announce the release of SmartSVN 8, the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8 is available immediately for download from our website.

While the main feature of this release is support for Subversion 1.8, we’ve provided a few more enhancements and additional bug fixes.

New SmartSVN 8 features include:

  • Eagerly awaited support for Subversion 1.8 working copies, allowing you to use SmartSVN with a Subversion 1.8 server. (For a full list of Subversion 1.8 benefits see the Apache release notes)
  • Ability to specify different merge tools for different file patterns as conflict solvers, allowing you to customize SmartSVN to suit your needs.
  • Prevent showing a notification while a dialog is showing, ensuring you don’t miss anything important.
  • Project menu: “Open or Manage projects” (and others) are now available without the project window, allowing you to work faster and smarter.
  • OS X: dock icon click reopens minimized windows, making SmartSVN consistent with most OS X applications.
  • Upgrade: SmartSVN will convert 1.7 working copies to 1.8 format, making it easier for you to get started with Subversion 1.8.

SmartSVN 8 fixes include:

  • Possible internal errors closing project windows or the repository browser, or when using ‘add’ or Conflict Solver
  • Problems with comparing repository files or directories
  • Compare: upper block line was drawn 1 pixel too high in line number gutter
  • Commit: committing a removal of a directory using svn protocol did not work
  • Linux: notification popup might have been closed quickly after showing
  • Start Up: crash on Ubuntu 13.10

For a full list of all enhancements and bug fixes, see the changelog.

Contribute to further enhancements

Many issues resolved in this release were raised in our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started