Subversion and Git Live 2013 – Feedback

As we’re now less than two weeks away from Subversion and Git Live 2014 in New York, we thought we’d post a few details about last year’s events – how they went, what people thought was most useful, what they enjoyed and, probably most importantly, what they got out of it.

We’ve collated our feedback and Dan, our lovely graphics guy, has created this infographic from the results:

[WANdisco infographic: feedback results from last year’s events]

So, all around, a very positive experience, with much learned both by attendees and by ourselves. Here are a few select comments from people who were at the London event last year:

“Very reassuring to hear that other companies have experienced similar ‘challenges’.”

“The most beneficial thing from this event? A sanity check that our Git implementation ticked all the boxes!”

“Most beneficial? Learning about current imminent functionality.”

“Time to think about some of this stuff!”

Following the feedback we received, we’ve made some of our talks hands-on this year, so you’ll get the chance to try out some of the things being talked about as well.

We hope to see you there! :)

More Rebasing in Subversion

Continuing on from a previous post about rebasing in Subversion, let’s look at a more general example of using rebasing to port commits to a new base branch.

In Git we’ll start with this picture.

I have three branches: master, team, and topic. I’d like to take the unique commits on the topic branch (to-1, to-2) and get them back to master cleanly, but without the intermediate work on the team branch (commit te-1).

So I use rebasing to get the diffs between topic and team, and use master as the new base for the rebased topic branch.

That gives me the clean picture above. At this point it would be trivial to do a fast-forward merge of topic into master.
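
For reference, here’s a minimal sketch of those two steps as Git commands, using the branch names above:

git rebase --onto master team topic   # replay topic's commits that aren't on team (to-1, to-2) onto master
git checkout master
git merge --ff-only topic             # fast-forward master to the rebased topic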

Using much the same techniques as I discussed last time, it’s possible to emulate this capability in Subversion. Here’s my starting point.

Again, I want to get the local commits from topic and make them more easily accessible to trunk without running a regular merge, which would have to go through the team branch. Here’s the recipe (a command-line sketch follows the list):

1. Make a branch, topic-prime, from the current head of trunk.
2. Run svn mergeinfo to determine the diffs between topic and team (the revisions eligible for a merge from topic to team).
3. Run a cherry-pick merge (ignoring ancestry) of each of those revisions from topic to topic-prime.
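
A minimal sketch of that recipe with the command-line client, assuming a conventional layout with branches under ^/branches, and r42 standing in for each eligible revision:

# 1. Branch topic-prime from the current head of trunk
svn copy ^/trunk ^/branches/topic-prime -m "Create topic-prime from head of trunk"
# 2. List the revisions unique to topic (eligible for a merge from topic to team)
svn mergeinfo --show-revs eligible ^/branches/topic ^/branches/team
# 3. Cherry-pick each eligible revision into a working copy of topic-prime
svn checkout ^/branches/topic-prime topic-prime-wc
svn merge --ignore-ancestry -c 42 ^/branches/topic topic-prime-wc
svn commit -m "Cherry-pick r42 from topic" topic-prime-wc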

Using that recipe gives me this picture:

At this point I could continue working on topic-prime or run a relatively simple merge to trunk. I could have also changed my recipe to run the cherry-pick merges directly onto trunk instead of using a new branch.
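
If I later want those changes on trunk, the follow-up merge might look like this (again assuming the layout above, and Subversion 1.8’s automatic reintegrate detection):

cd trunk-wc                        # an up-to-date working copy of ^/trunk
svn merge ^/branches/topic-prime   # 1.8 detects the reintegrate-style merge automatically
svn commit -m "Merge topic-prime to trunk"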

In any case, the end result is fairly close to what you’d have in Git, although the process of getting there wasn’t as easy (and I still have the original topic branch lying around).

Git has popularized a lot of useful tools and techniques, and although it takes a bit of extra work, you can emulate some of them in Subversion. Questions? Give me a ping on Twitter or svnforum.

Authentication and Authorization – Subversion and Git Live 2014

We’ve switched the format of some of the talks for Subversion and Git Live this year – several will be hands-on, giving you the opportunity to try out the subject being discussed rather than just making notes.

One of these talks this year will be delivered by Ben Reser, one of our Subversion committers, on Authentication and Authorization. Ben has been working on Subversion since 2003 and will be discussing:

  • A brief overview of the access control methods Subversion supports.
  • Hands-on setup of a Subversion server with LDAP authentication over HTTP.
  • A look at the performance costs of access control and what you can do to minimize them.
  • How to put your authz configuration file into the repository.

The hands-on portion will follow a hypothetical company as it grows and shifts from a very basic setup to a much more complex one, showing some of the problems it would hit along the way and discussing the reasons for each configuration change. The company starts off with a single repository with basic authentication (no path-based authorization) and ends up with multiple repositories, LDAP and path-based authorization. Eventually we’ll even use the new in-repository authz feature added in 1.8. The configuration improvements along the way will show how to ease administrative burden and improve performance.
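
As a taste of what path-based authorization looks like, here’s a minimal example authz file; the file location, group, user and path names are invented for illustration:

# Everyone gets read access; only the dev team can write to trunk
cat > /etc/subversion/authz <<'EOF'
[groups]
dev = alice, bob

[/]
* = r

[/trunk]
@dev = rw
EOF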

The goal of this talk is to have you walk away knowing why you configure Subversion the way you do and how you can make things better for your particular setup, rather than just giving you an example authz file and telling you it’s the ‘right’ way to do things.

If that sounds good to you, why not come see us at Subversion and Git Live 2014? There’s more info about the event here: http://www.wandisco.com/subversion-git-live-2014.

Open Source and Analytics

A recent Information Week article predicted that open source analytics tools would continue to gain ground over commercial competitors in the Big Data arena in 2014. That may seem surprising: after all, you’ve made an investment in moving some unwieldy data into Hadoop, so why not start hooking up your traditional data analytics and business intelligence tools?

To see why this prediction makes sense, let’s review some of the advantages of Hadoop Big Data infrastructure:

  • Cost efficiency: Hadoop’s storage costs per terabyte are about one-fifth to one-twentieth the cost of legacy enterprise data warehouse (EDW) solutions. Once you have a Hadoop cluster up and running, scaling it out is economical.

  • Visibility: Hadoop lets you store, manage, and analyze wildly disparate data sets with no penalty. Silos that existed due to storage costs or technical incompatibility start to disappear.

  • Future proofing: Hadoop is an open platform with a vibrant community. There’s no risk of lock-in to obsolete tools and vendors.

These same reasons explain why open analysis platforms will continue to see wide adoption.

First, let’s consider cost efficiency and visibility. You’ll find that both tools and talent are more affordable and easier to find when you use open platforms, which means you’ll have a lot more people looking for the gems in your data.

Recall that one feature of Big Data is that you probably don’t know how you’re going to use all of the data you collect in the future. In other words, you don’t know now what questions you’ll be asking next year. You need to unleash your analysts and data scientists to explore this data, and open analysis platforms have a much lower cost barrier than commercial tools. Any budding data scientist can get started without consuming scarce licenses.

Finally, the next generation of data scientists will be trained on open platforms like R. R is gaining traction rapidly and is the key tool in a new data science MOOC offered by Johns Hopkins. Not only will recruiting be easier, but anyone on your team who needs to start working with data can acquire some basic skills easily. Visibility matters: after all, if data is stored in Hadoop and no one is there to analyze it, why bother?

Source: http://r4stats.com/articles/popularity/

Now, getting back to future proofing: data science is a rapidly evolving field. New tools and methods are springing up almost every day. Much of that research is being done and published on open platforms like R. You’ll be able to take advantage of that cutting-edge knowledge without having to wait for a vendor to support it in a closed framework.

Embracing this wave of open source analytics tools will help you start to see real ROI from your Big Data investment.

WANdisco Announces Availability of Apache Subversion 1.9.0 alpha binaries

The Apache Subversion project has announced the release of Subversion 1.9.0 alpha, which brings a number of significant improvements.

It’s important to note that this is an alpha release and as such is not recommended for production environments. If you’re able to download and test this release in a non-production environment, though, we’d be grateful for any feedback – if you notice anything untoward, or even just want to chat or ask about this latest version, please drop us a post in our forums.

This release introduces improvements to caching and authentication, some filesystem optimisations for FSX and FSFS, a number of additions to svnadmin commands, and improvements to the interactive conflict resolution menus. Other enhancements include the following (usage examples after the list):

  • New options for ‘svnadmin verify’: ‘--check-normalization’ and ‘--keep-going’
  • ‘svnadmin info’: prints information about a repository
  • New options for ‘svn cleanup’: ‘--remove-unversioned’, ‘--remove-ignored’, ‘--include-externals’ and ‘--quiet’
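
To give a flavour of the new options, here are some example invocations (the repository and working copy paths are illustrative):

svnadmin verify --keep-going /var/svn/repo      # report all verification errors rather than stopping at the first
svnadmin info /var/svn/repo                     # print repository details such as UUID and filesystem format
svn cleanup --remove-unversioned --quiet ~/wc   # quietly delete unversioned items from the working copy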

You can see a full list of changes in the release notes here.

To save you the hassle of compiling from source you can download our fully tested, certified binaries free from our website here: http://www.wandisco.com/subversion/download/early-access

WANdisco’s Subversion binaries provide a complete, fully tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Using TortoiseSVN or SmartSVN? As this is an alpha release there’s no compatible version of these Subversion clients yet, but watch this space and we’ll have them ready before the general release of Subversion 1.9.0.

OpenSSL Vulnerability – The Heartbleed Bug

The OpenSSL team recently published a security advisory regarding the TLS heartbeat read overrun. This vulnerability allows a connected client or server to read up to 64k of the peer’s memory per request, and different chunks can be requested on each attack.

The vulnerability affects versions 1.0.1 and 1.0.2-beta of OpenSSL.

The WANdisco SVN binaries for Windows and Solaris available since 2011 have included vulnerable versions of the OpenSSL libraries. We’ve released updated versions with the patch as of today, so if you’re still using one of these older versions please download the latest:

Windows: http://www.wandisco.com/subversion/download#windows

Solaris: http://www.wandisco.com/subversion/download#solaris

Users of our Subversion products (including SVN MultiSite) on other operating systems will still need to ensure they’ve updated their OpenSSL package; however, there’s nothing vulnerable included with our binaries. We recommend all users of these operating systems update their version of OpenSSL to 1.0.1g as soon as possible or, if unable to update, recompile OpenSSL with the -DOPENSSL_NO_HEARTBEATS flag.
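
A quick way to check which OpenSSL release a system is running (output format varies by platform):

openssl version   # 1.0.1 through 1.0.1f and 1.0.2-beta are vulnerable; 1.0.1g is fixed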

For more information on this vulnerability please see http://heartbleed.com/

UPDATE: SmartSVN versions 8.5 and 8.5.1 are also vulnerable due to the included version of OpenSSL. We’ve now released SmartSVN 8.5.2 and would urge all users of SmartSVN 8.5 and 8.5.1 to update to this latest version as soon as possible. SmartSVN 8.5.2 is available for download at http://www.wandisco.com/smartsvn/download-all

Can Big Data Help with Prediction?

A recent article entitled “Limited role for big data seen in developing predictive models” splashes a little cold water on the idea that Big Data will magically help develop better predictive analytics tools. The headline caught my attention, as it’s become a truism that a poor algorithm with lots of data will outperform a great algorithm with not enough data. So let’s go ahead and ask: can Big Data help with prediction?

Now, I understand the author’s point.  If you are performing a well-structured study and you have a deep understanding of the domain, then a smaller and carefully constructed data set will probably serve you better. Later in the article, however, Peter Amstutz, analytics strategist at advertising agency Carmichael Lynch, mentions that in many cases you’re not even sure what you’re looking for and often need to aggregate loosely structured data from disparate sources.  After all, there’s a lot more unstructured data in the world, and it’s growing quickly.

I find myself favoring the dissenting view.  In my job I’m often trying to answer questions like, “Will our next release ship on time given what I now know about the backlog, other projects taking away resources…,” and so on.  It’s not as simple as looking at a burn down chart to track progress.  In my head I’m meshing all types of data points – chatter on the engineering forums, vacation schedules, QA panic boards, et cetera.  I sometimes get a ‘pit of my stomach’ feeling that the schedule is slipping, but when I try to actually quantify what I’m seeing, it’s difficult.  There are so many sources of data to correlate, and none of them report consistently.

Of course, if we had a data warehouse I could run some cool reports on trends I’m seeing, but I wouldn’t try to convince the higher-ups to make that level of investment (ETL tools, data stores, visualization front end) and I’m sure they won’t give me JDBC access to all of our databases.

On the other hand, I’ve got a small Hadoop cluster available – just a set of VMs, but sufficient for the volume of data I need to examine – and I know how to pull data using tools like Flume and Sqoop.  All of a sudden I’m seeing possibilities.
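
For instance, pulling a table of defect records out of a relational database and into HDFS can be a Sqoop one-liner; the connection string, table and target directory below are purely illustrative:

# Hypothetical example: land a bug-tracker table in HDFS for later analysis
sqoop import --connect jdbc:mysql://dbhost/tracker --table defects --target-dir /data/defects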

This is one of the real benefits of ‘Big Data’ for predictive analytics.  It can handle the variety of data I need without ETL tools, at a fairly low cost.

Intro to Gerrit – Subversion and Git Live 2014

You may be aware of Gerrit, the web-based code review system. Our Director of Product Marketing, Randy Defauw, has a number of good reasons for adopting it as part of your development process:

The most interesting thing about Gerrit is that it facilitates what some call ‘continuous review’. Code review is often seen as a bottleneck in continuous delivery, but it’s also widely recognized as a way to improve quality. Gerrit resolves this conundrum with innovative features like dynamic review branch creation and the incorporation of continuous build into the heart of the review process.
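
For the unfamiliar, the client side of that review workflow is a single push to a magic ref, which creates or updates a review for each commit (‘origin’ and ‘master’ here are the usual defaults, not a requirement):

# Push the current branch's commits to Gerrit for review against master
git push origin HEAD:refs/for/master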

Gerrit is also notable as the most enterprise-friendly Git code review system, despite its open source roots. It integrates with all standard authentication frameworks, has delegated permission models, and was designed for large deployments.

Randy is Director of Product Marketing for WANdisco’s ALM products. He focuses on understanding in detail how WANdisco’s products help solve real world problems, and has deep background in development tools and processes. Prior to joining WANdisco he worked in product management, marketing, consulting, and development. He has several years of experience applying Subversion and Git workflows to modern development challenges.

If you’d like to hear more about Gerrit, or Git in general, come see us at Subversion and Git Live 2014.

SmartSVN 8.5 Moves from SVNKit to JavaHL

Following on from the release of SmartSVN 8.5, we wanted to give you a bit more detail about the biggest change in this release, so here’s Branko Čibej, our Director of Subversion, with an explanation:

One of the most significant events during the development of SmartSVN 8.5 was the decision to adopt the JavaHL library in place of SVNKit, which was used by all previous releases of SmartSVN.

JavaHL is a Java wrapper for Subversion, published by the Apache Subversion project. The most important difference compared to SVNKit is that JavaHL uses the same code base as the Subversion command-line client and tools. This has several benefits for SmartSVN: quicker adoption of new Subversion features; more complete compatibility with Subversion servers, repositories and other clients; built-in support for new working copy formats; and, last but not least, speed — as demonstrated by the phenomenal performance improvements in SmartSVN 8.5, compared to both 8.0 and 7.6.

The decision to adopt JavaHL has also benefited the Subversion community at large: several bug fixes and enhancements in Subversion 1.8.8 and the forthcoming 1.8.9 and 1.9.0 releases are a direct result of the SmartSVN porting effort. We will continue to work closely with the Apache Subversion developers to further improve both JavaHL and Subversion.

I hope that helps explain what’s going on and why we opted to make the change. It’s worth bearing in mind that this is largely an ‘under-the-hood’ change, so you won’t notice much difference in the interface; it will, however, make future development of SmartSVN much easier.

If you want to see more about the speed improvements there’s a results table in the release blog here.

Cheers all.

Top Challenges of the Git Enterprise Architect #3: Ever Growing Repos

Continuing on from Top Challenges of the Git Enterprise Architect #2: Access Control, I’ll next talk about Git’s ever-growing repos and some of the challenges they present.

Git stores all history locally, which is good because it’s fast. But it’s also bad, because clone times and other command response times grow with history and never shrink.

Linear response time

Certain commands take linear time, O(n), in either the number of files in a repo or the depth of its history. For example, Git has no built-in notion of a revision number. Here’s a way to get the number of revisions of a file, Main.java:

git rev-list --count HEAD -- Main.java   # walks the commit graph, counting commits that touch the file

Of course, a consecutive file revision number is not a foundational construct in Git, as it is with a system like Subversion, so Git needs to walk its DAG (Directed Acyclic Graph) backward to the origin, counting the revisions of a particular file along the way.

When a repo is young, this is typically very fast. And, as I noted, revision numbers play less of an important role in Git, so we need them less often. However, think about what happens if you have an active shared Git repository in service for a long period of time. When I’ve asked typical SCM admins how long they expect to keep a project supported in an SCM, the answers range from 4 to 10 years. An active file might accumulate hundreds or even thousands of revisions, and you’d want to think twice about counting them all up with Git.

Facebook’s struggle

Facebook’s concern over Git’s ever-growing repos and ever-slowing performance led them, a few years ago, to switch to Mercurial and centralize much of the data. Alas, this approach is not a solution for most companies: it relies on a fast, low-latency connection, and unless you have access to the unique, fast, data-safe, active-active replication found in WANdisco’s MultiSite products for Subversion and Git, users remote to the central site will suffer sharply degraded performance.

Common workaround

The most common workaround I hear about is that when a Git repo gets too big and slow, a new shared master is cloned and deployed, and the old one serves as history. This is clearly not ideal, but many of the types of development that first adopted Git are less affected by fragmented historical SCM data. As Git gains more widespread adoption across a greater variety of enterprise development projects, the need for better solutions will grow. Here at WANdisco we are hard at work paving the road ahead so that your Git deployments will scale historically as well as geographically.