Monthly Archive for March, 2014

Top Challenges of the Git Enterprise Architect #3: Ever Growing Repos

Continuing on from Top Challenges of the Git Enterprise Architect #2: Access Control, I’ll next talk about Git’s ever-growing repos and some of the challenges they present.

Git stores all history locally, which is good because it’s fast. But it’s also bad, because the response times of clone and other commands grow over time and never shrink.

Linear response time

Certain commands take linear time, O(n), in either the number of files in a repo or the depth of its history. For example, Git has no built-in notion of a revision number. Here’s a way to get the number of revisions of a file, Main.java:

git shortlog Main.java | grep -E '^[ ]+\w+' | wc -l
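
On a reasonably recent Git (1.7.2 or later), rev-list can do the counting for you; this should give the same count as the pipeline above:

# Count the commits on the current branch that touched Main.java
git rev-list --count HEAD -- Main.java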

Of course, a consecutive file revision number is not a foundational construct in Git, as it is with a system like Subversion, so Git needs to walk its DAG (Directed Acyclic Graph) backward to the origin, counting the revisions of a particular file along the way.

When a repo is young, this is typically very fast. And as I noted, revision numbers play a less important role with Git, so we need them less often. But what happens if you have an active shared Git repository in service for a long period of time? When I’ve asked how long typical SCM admins expect to keep a project supported in an SCM, the answers range from 4 to 10 years. An active file might have accumulated hundreds or even thousands of revisions, and you’d want to think twice about counting them all up with Git.

Facebook’s struggle

Facebook’s concern over Git’s ever-growing repos and ever-slowing performance led the company to switch to Mercurial a few years ago and centralize much of the data. Alas, this approach is not a solution for most companies. It relies on a fast, low-latency connection, and unless you have access to the unique, fast, data-safe, active-active replication found in WANdisco’s MultiSite products for Subversion and Git, users remote from the central site will suffer sharply degraded performance.

Common workaround

The most common workaround I hear about is that when a Git repo gets too big and slow, a new shared master is cloned and deployed, and the old one serves as history. This is clearly not ideal, but many of the development shops that first adopted Git are less affected by fragmented historical SCM data. As Git gains wider adoption across a greater variety of enterprise development projects, better solutions will be needed. Here at WANdisco we are hard at work paving the road ahead so that your Git deployments will scale historically as well as geographically.

SmartSVN 8.5 Available Now

We’re happy to announce the release of SmartSVN 8.5, the graphical Subversion (SVN) client for Mac, Windows and Linux. SmartSVN 8.5 is available for download from our website here.

Along with several bug fixes and enhancements, SmartSVN 8.5 makes the critical move from SVNKit to JavaHL, the same back end used by the Subversion command line client and server.

Major Improvements

Whilst it may not look different, this release signifies a huge change: we’ve moved away from SVNKit, and SmartSVN now uses JavaHL. This is the same library used by command line Subversion, and it has given SmartSVN 8.5 much improved stability and a huge speed boost. Some comparison tables (all times in seconds):

Operation    Text files                          Jpg files
             7.6.3     8/8.0.1    8.5 (JavaHL)   7.6.3     8/8.0.1    8.5 (JavaHL)
Checkout     72.21     78.86      7.34           118.13    120.35     10.92
1st Add      133.60    201.94     37.49          64.19     98.61      15.47
Revert       47.14     75.06      16.89          41.19     77.15      8.85
2nd Add      131.75    186.18     34.64          60.87     101.76     13.81
Commit       314.44    440.23     85.70          167.34    252.85     42.46
Remove       13.86     1146.77    13.76          6.76      553.41     8.70

We’ve also added support for Subversion 1.8.8 and the file:// protocol for local repository access.

For a full list of all improvements, bug fixes and other changes please take a look at the changelog.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download SmartSVN 8.5 from our website here.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

Subversion 1.9 Underway

Subversion 1.9 is already well underway, following up quickly after last year’s impressive 1.8 release. Although the final set of new features may change, there’s one piece of infrastructure work that’s worth highlighting.

A new tunable to control storage compression levels lets you choose a better balance between repository size and server CPU load. Disabling regular compression and deltification will yield a substantial improvement in throughput when adding, committing, and checking out large files.  You can expect to see more numbers at the next Subversion & Git Live conference, but I will mention that commit speed can increase from 30-40 MB/s to 100 MB/s.
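
On the server side these tunables live in each repository’s db/fsfs.conf. Here is a minimal sketch of trading repository size for raw throughput, assuming the draft 1.9 option names (they may change before release):

[deltification]
# 0 disables zlib compression of file representations (the usual default is 5)
compression-level = 0
# A walk length of 0 effectively disables deltification entirely
max-deltification-walk = 0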

Here are a few other tidbits that may interest you:

  • New tool to list and manage cached credentials (see the sketch after this list)

  • The ability to commit to a repository while a pack operation is in progress (Goodbye, long maintenance windows!)

  • Infrastructure work is starting for a new storage layer that will reduce repository size and improve performance.
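
Assuming the credentials tool lands as the ‘svn auth’ subcommand that has been discussed, usage might look like this (names and flags may change before 1.9 ships):

# List all cached credentials (passwords, client certs, server trust)
svn auth

# Remove cached credentials whose realm matches a pattern (realm shown is made up)
svn auth --remove "https://svn.example.com:443"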

While you’re waiting for Subversion 1.9, now’s a great time to upgrade to Subversion 1.8. You can enjoy many of the benefits just by upgrading the Subversion binaries on your server.

Resource Management in HDFS and Parallel Databases

A recent survey from Duke University and Microsoft Research provides a fascinating overview of the evolution of massively parallel data processing systems. It starts with the evolution of traditional row-oriented parallel databases before covering columnar databases, MapReduce-based systems, and finally the latest Dataflow systems like Spark.

Two of the areas of analysis are resource management and system administration. An interesting tradeoff becomes apparent as you trace the progression of these systems.

Traditional row- and column-oriented databases have a rich set of resource management tools available. Experienced database administrators (DBAs) can tune individual nodes or the entire system based on hardware capacity, partitions, and typical workloads and data distribution. Perhaps just as importantly, the DBA can draw on decades of experience and best practices during this tuning. Linear scalability, however, is a bit more challenging. Theoretically, many parallel database systems support adding more nodes to balance workload, but in reality it requires careful management of data partitions to get the best value out of new resources.

Similarly, DBAs have access to many high-quality system administration tools that provide performance monitoring, query diagnostics, and recovery assistance. These tools have evolved over the years to allow very granular tuning of query plans, indexes, partitions, and schemas.

Reading between the lines, you had better have a good team of DBAs on hand. Classic database systems are expensive to purchase and operate, and knowing how to turn all of those dials to get the best performance is a challenge. Query optimization, for example, can be quite complex. Knowing how to best partition the data for efficient joins across a massive data set is not a solved problem in all cases, especially when a columnar data layout is used.

There’s a very big contrast in these areas when you look at systems built on HDFS, from the original MapReduce designs to the latest Dataflow systems like Spark. The very first design of MapReduce opted for simplicity, with a static allocation of resources and the ability to easily add new nodes to the cluster. The later evolutions of Hadoop introduce improvements like YARN, which provides more flexible resource management schemes while still allowing easy cluster expansion, with the HDFS Rebalancer taking care of data transfer to new nodes (a quick sketch follows). The newest Dataflow systems have the potential for much improved resource management, using in-memory techniques to speed processing. Most notably, systems like Spark can use query optimization based on DAG principles.
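
For example, after adding DataNodes an administrator can redistribute existing blocks with the stock balancer tool; a minimal sketch (the 10% threshold is illustrative):

# Move blocks until each DataNode's utilization is within 10
# percentage points of the cluster-wide average
hdfs balancer -threshold 10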

System administration in Hadoop is an evolving field. Some expertise exists in cluster management (or you can delegate that chore to cloud systems), but a Hadoop administrator does not have the same set of tools available to a traditional DBA; indeed a priori plan optimization is not even feasible when many ‘Big Data’ analytics packages only interpret data structure at query time.

To sum this up, I think that the ‘Big Data’ solutions have made (and continue to make) an interesting design choice by sacrificing some of the advanced resource management and system administration tools available to DBAs. (Again, some of these simply aren’t available when you do not know the data schema in advance.) Instead they favor a simplified internal representation of data and jobs, which allows for easier expansion of the cluster.

To put it another way, a finely tuned traditional parallel database will probably outperform a Hadoop cluster given sufficient hardware, expertise, and advanced knowledge of the data. On the other hand, that Hadoop cluster can grow easily with commodity hardware (beyond the breaking point of traditional systems) and not much tuning expertise other than cluster administration, which is a cost that can be spread over a large pool of applications. Plus, you don’t need to make assumptions about your data in advance. Dataflow systems like Spark will go a long way towards closing the performance gap, but in essence Big Data solutions are performing a cost-benefit analysis and coming down on the side of simplicity and ease of expansion.

This may be old hat to Big Data veterans, but I found the paper to be a great refresher on how the Big Data field reached its current position and where it’s going in the future.

SmartSVN 8.5 RC2 Released

We’re happy to say we’ve just released SmartSVN 8.5 Release Candidate 2. SmartSVN is the cross-platform graphical client for Apache Subversion.

Major Improvements

Whilst it may not look different, this release signifies a huge change: we’ve moved away from SVNKit, and SmartSVN now uses JavaHL. This is the same library used by command line Subversion, and it has given SmartSVN 8.5 RC2 much improved stability and a huge speed boost. Some comparison tables (all times in seconds):

Operation    Text files                          Jpg files
             7.6.3     8/8.0.1    8.5 (JavaHL)   7.6.3     8/8.0.1    8.5 (JavaHL)
Checkout     75.27     81.61      26.22          75.64     81.52      27.69
Add          52.77     131.22     67.37          37.38     72.12      22.22
Revert       36.83     60.02      33.74          33.80     69.67      18.08
Commit       195.15    279.49     75.31          116.07    176.88     40.63
Remove       8.75      1176.24    21.38          5.57      595.71     11.58

We’ve also added support for Subversion 1.8.8 and the file:// protocol for local repository access.

For a full list of all improvements, bug fixes and other changes please take a look at the changelog.

Though this is still a release candidate, given the major improvements to performance we strongly recommend that all customers using SmartSVN version 8 or newer upgrade to this latest RC.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download Release Candidate 2 for SmartSVN 8.5 from our early access page.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

SCM is for everyone

A recent Forrester survey revealed some startling information about the adoption of SCM tools in the enterprise: 21% of respondents are not using any SCM at all, and 17% are using tools that are a couple of generations out of date.

That information caught me off guard, but then again I work in the industry and probably tend to focus more on the up-and-coming than the tried-and-true. As I’ve been mulling this over, I’ve started to recall my own very first exposure to SCM.

The year was 1996, and I was an undergraduate research assistant working on an autonomous vehicle project. (Note to my children: yes, I do *really exciting and super cool* things at work.) I was on a team of five electrical engineers that wrote a lot of C code for image processing and vehicle control. Like a lot of ‘software developers’, however, we were domain problem solvers first and coders second.


For a while we did backups of our code on network drives and tapes. Then we heard about something shocking: a tool called Visual SourceSafe (VSS) that would store every version of our code, let us see how it changed, and revert changes easily. We could even make a branch and work on bug fixes and new experiments at the same time! (Of course we were programming half on Linux, so we couldn’t use VSS for everything and therefore had to learn the unpleasantness of CVS.)

Back to the point: normally when someone asks me about the importance of SCM, I start thinking about the mainline model and how SCM is one of the foundations of continuous delivery. To take a step back, anyone who writes software, from assembly motor control code to Hadoop plumbing, needs SCM for the same three reasons I needed it in 1996 (a quick sketch follows the list):

  • Backups. Code is valuable.

  • History. Being human, you may make mistakes and need to discover/roll back those mistakes.

  • Branching. You sometimes need to work on bugs and new stuff at the same time.
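
Here’s a minimal Git sketch of those three basics (the file, commit id, and branch name are made up):

# History: see every change to a file, newest first
git log --oneline -- control.c

# Mistakes: undo a bad change by creating an inverse commit
git revert abc1234

# Branching: fix a bug without disturbing the experimental work
git checkout -b bugfix/motor-timing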

Luckily, you’ve got better choices now than I did in 1996. Subversion and Git are two free, powerful, and mature SCM systems available on every platform. Both are fairly friendly to the newcomer, particularly if you pick a GUI, and most applications that work with source code in any way will integrate with them.

So no more excuses – head over to our website for binaries, community forums, and quick reference cards and tutorials.

Git 1.9 certified binaries available

Git 1.9 is mainly a maintenance release, and includes a number of minor fixes and improvements. As usual, WANdisco has published certified binaries for all major platforms.

Click here to see the release notes. Key changes in 1.9 include:

  • You can now exclude a specific directory from contributing to the ‘git log’ command, which makes it easy to ignore changes from that directory when you’re browsing history.
  • ‘git log’ can also exclude history from branches that match a pattern.
  • Heads-up that the default behavior of ‘git push’ is slowly moving towards the Git 2.0 standard, so be sure to start being more explicit about setting up tracking relationships. (All three are sketched below.)
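
A quick sketch of all three, with made-up paths and branch patterns:

# Browse history while ignoring commits that only touch docs/
git log --oneline -- . ':(exclude)docs'

# Show history from all branches except those matching a glob
git log --oneline --exclude='refs/heads/experimental/*' --all

# Opt in to the Git 2.0 push default now: push only the current
# branch, and only to its configured upstream
git config --global push.default simple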

Visit our website to download certified binaries for Windows, Mac, and Linux.

LDAP Authentication in SVN 1.8.8

Following a thread in our Subversion forums, we’ve found that some people are having problems with LDAP after upgrading to or installing version 1.8.8. So far this has only been reported on the Red Hat and CentOS builds.

These builds use an updated apr-util package that no longer includes LDAP support; that support has moved to a separate package, apr-util-ldap, which wasn’t pulled in as a dependency.

We’ve now updated the dependencies in our Subversion package to include the missing apr-util-ldap package. If you’re hitting this problem, just re-run yum install subversion and the missing files will be installed for you.
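
In other words, either of these commands should restore LDAP support (run as root or via sudo):

# Re-running the install picks up the corrected dependency list
yum install subversion

# Or pull in the missing LDAP package explicitly
yum install apr-util-ldap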

Many thanks to Philip, one of our Subversion committers, for highlighting the issue so that we could sort it 🙂

America Invents Act and the Prior Use Defense

Proving Your Right to Continued Use with Global Repository Management

One of the significant changes in the America Invents Act (AIA) is the expanded scope of the ‘prior use’ defense. Before the AIA, the prior use defense applied only to business method patents, but it is now an effective and increasingly important defense in patent disputes centered on any process, machine, or manufacture – if you can prove a valid prior use at least a year before the filing date or public disclosure of the claim in question.

Why is prior use so important now? For starters, the pace of patent litigation is on a sharp upward trajectory. Between 2008 and 2012 the number of patent cases commenced rose from under 3,000 a year to over 5,000 a year. Any effective defense is worth considering in this environment.

Also bear in mind that the AIA changed from a first-to-invent scheme to first-to-file. If a patent troll files the paperwork first to claim an invention, you could be at risk. Establishing with clear and convincing evidence that you were using the invention over a year before the filing is a very strong point in your favor.

The best evidence of prior use of software inventions is the audit trail provided by your SCM repository. By its nature an SCM repository tracks the birth of a software implementation: the combination of source code, libraries, build scripts, and deployment processes that shows how and when you started using an invention.
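
For example, a Git repository can produce a timestamped, attributable record of when an implementation first appeared. A minimal sketch, with a hypothetical source path:

# The earliest commit that touched the feature, with its author date
git log --reverse --date=iso --format='%h %ad %an %s' -- src/feature/ | head -n 1

# The full patch history for the same path, oldest first
git log --reverse -p -- src/feature/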

But there’s one fly in the ointment – the use of ‘skunk works’ repositories. Some development teams using Git like to stand up informal repositories to work on new ideas or pet projects, only moving the project into the ‘official’ repository when it reaches some stability milestone.

That’s clearly a problem. As we’ve seen from the time frames in the AIA, every day counts when you’re establishing a prior use defense. If you lose the first three months of prototype history because the ‘skunk works’ repository was lost, you may slip past the one year limit for prior use.

Before you try to enforce a policy against these ‘skunk works’ repositories, keep in mind why Git development teams might use them:

  • They’re working at a remote office and the network latency is making their Git operations painfully slow.

  • The official Git repositories are slow due to too much load on the system from a large user base and build automation.

  • The process of requesting an official repository and adding it to the enterprise backup and security schemes is too time consuming.

The solution is a fast and stable enterprise Git service that is easily accessible for any development team. In other words, make it easier for developers to use your secure and highly available Git repositories and they won’t be tempted to set up their own infrastructure.

Git MultiSite provides the enterprise Git solution that fits the bill. With Git MultiSite’s patented replication technology, every development team gets a local Git repository with fast LAN access. Every Git server in the deployment is a fully replicated and writable peer. Slow Git operations are a thing of the past, with most operations being local and commits (pushes) coordinated very efficiently with the other nodes in the deployment. Plus, Git MultiSite’s automated failover and self-healing capabilities mean zero downtime.

Git MultiSite and Git Clustering also provide a very scalable solution. Additional Git servers can be added at any time to handle increased load, giving you more confidence to spin up new Git repositories for every pet project that might turn into the next big thing. These new repositories can be deployed at the click of a button in the administration console.

Finally, Git Access Control makes sure that your security policies and permissions are applied consistently at every site, on every server.

Git MultiSite removes the performance and security concerns that normally make you hesitate about providing ‘self-service’ SCM infrastructure, eliminating the need for ‘skunk works’ repositories.

The AIA makes it more important than ever to keep track of every software recipe in your organization. Let Git MultiSite provide the infrastructure you need to protect all of your intellectual property.


Top Challenges of the Git Enterprise Architect #2: Access Control

Git has no built-in access control features. If that comes as a surprise, one reason is that the Git project specifically considers access control to be outside the scope of a version control tool. Another reason is that best practices with Git typically result in many small repos, in contrast with the gargantuan repos often found with centralized version control systems. Having logically unrelated code resident in separate repositories means access can be controlled through authentication alone, where a user has either zero or complete access to a repo.

Enterprise software development is often subject to demands not found in the open source landscape where Git was born. Code bases can have interdependencies that prove too entangled to refactor into individual Git repositories. Projects may have migrated from a centralized version control system where large files are mixed with small, straining Git’s assumptions about how big a repo can be. Sometimes we see product code assembled from a combination of contributors, mixing code from outside-the-firewall contractors with inside-the-firewall employees. All of these situations can create access control needs that go beyond all-or-nothing repo access.

Far-reaching effects

The Git project’s decision to leave access control as an exercise for the user has another important effect: it invites increased diversity in the choice of access control tooling. Rather than most groups falling in line with, for example, Apache controls for Subversion, we typically see a large organization sprout a variety of open source and commercial solutions. Infrastructure seeks consolidation, so we sometimes see IT/SCM teams scrambling to avoid becoming responsible for supporting a large number of new technologies of questionable pedigree.

This situation is compounded by the fact that Git can easily, and often secretly, be used by developers still tethered to a centralized system. This means Git often obtains a significant beachhead of adoption, and a divergent mess of ad-hoc technology stacks along with it, before your enterprise SCM administrators step in.

Gradual migration

Another common situation is where developers in a company are gradually shifting from other systems to projects using Git. In these cases they may spend long periods of time working across both the older and the newer systems. We believe we will see Subversion and Git co-deployed for an extended period in enterprise development environments. That’s why managing access control across Subversion and Git deployments is part of the core functionality of our new Access Control Plus product.

Looking ahead to solutions

This series of posts is about challenges more than solutions. I’ll be speaking about solutions for Git access control at our Subversion & Git Live conference this May. See you there!