Authentication and Authorization – Subversion and Git Live 2014

We’ve switched the format of some of the talks at Subversion and Git Live this year – several will be hands-on, giving you the opportunity to try out the subject being discussed rather than just taking notes.

One of these talks this year will be delivered by Ben Reser, one of our Subversion committers, on Authentication and Authorization. Ben has been working on Subversion since 2003 and will be discussing:

  • A brief overview of the access control methods Subversion supports.
  • Hands-on setup of a Subversion server with LDAP authentication over HTTP.
  • A look at the performance costs of access control and what you can do to minimize them.
  • How to put your authz configuration file into the repository.

The hands-on portion will follow a hypothetical company as it grows and shifts from a very basic setup to a much more complex one, showing some of the problems it would run into along the way and discussing the reasons for each configuration change. The company starts off with a single repository with basic authentication (no path-based authorization) and ends up with multiple repositories, LDAP and path-based authorization. Eventually we’ll even use the new in-repository authz feature added in 1.8. The configuration improvements along the way will show how to ease the administrative burden and improve performance.
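
To give a flavour of where the exercise ends up, here’s a minimal sketch of the two pieces involved: an Apache location block that authenticates against LDAP, and a path-based authz file. Host names, paths, group and account names are all hypothetical – Ben’s session builds these up step by step.

    # httpd configuration (mod_dav_svn + mod_authnz_ldap) -- values are hypothetical
    <Location /svn>
        DAV svn
        SVNParentPath /var/svn/repos
        AuthType Basic
        AuthName "Example Corp Subversion"
        AuthBasicProvider ldap
        AuthLDAPURL "ldap://ldap.example.com/ou=people,dc=example,dc=com?uid"
        Require valid-user
        AuthzSVNAccessFile /var/svn/authz
    </Location>

    # /var/svn/authz -- path-based authorization
    [groups]
    devs = alice, bob

    # everyone can read everything...
    [/]
    * = r

    # ...but only developers can write to project-a's trunk
    [project-a:/trunk]
    @devs = rw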

The goal with this talk is to have you walking away knowing why you configure Subversion the way you do and how you can make things better for your particular setup, rather than just giving you an example authz file and telling you it’s the ‘right’ way to do things.

If that sounds good to you, why not come see us at Subversion and Git Live 2014? There’s more info about the event here: http://www.wandisco.com/subversion-git-live-2014.

Open Source and Analytics

A recent Information Week article predicted that open source analytics tools would continue to gain ground over commercial competitors in 2014 in the Big Data arena. That may seem surprising. After all, you’ve made an investment in moving some unwieldy data into Hadoop.  Why not start to hook up your traditional data analytics and business intelligence tools?

To see why this prediction makes sense, let’s review some of the advantages of Hadoop Big Data infrastructure:

  • Cost efficiency: Hadoop’s storage costs per terabyte are about one-fifth to one-twentieth the cost of legacy enterprise data warehouse (EDW) solutions. Once you have a Hadoop cluster up and running, scaling it out is economical.

  • Visibility: Hadoop lets you store, manage, and analyze wildly disparate data sets with no penalty. Silos that existed due to storage costs or technical incompatibility start to disappear.

  • Future proofing: Hadoop is an open platform with a vibrant community. There’s no risk of lock-in to obsolete tools and vendors.

These same reasons explain why open analysis platforms will continue to see wide adoption.

First, let’s consider cost efficiency and visibility. You’ll find that both tools and talent are more affordable and easier to find when you use open platforms, which means you’ll have a lot more people looking for the gems in your data.

Recall that one feature of Big Data is that you probably don’t know how you’re going to use all of the data you collect in the future. In other words, you don’t know now what questions you’ll be asking next year. You need to unleash your analysts and data scientists to explore this data, and open analysis platforms have a much lower cost barrier than commercial tools. Any budding data scientists can get started without consuming scarce licenses.

Finally, the next generation of data scientists will be trained on open platforms like R. R is gaining traction rapidly and is the key tool in a new data science MOOC offered by Johns Hopkins. Not only will recruiting be easier, but anyone on your team who needs to start working with data can acquire some basic skills easily. Visibility matters: after all, if data is stored in Hadoop and no one is there to analyze it, why bother?

[Chart: the rising popularity of R – source: http://r4stats.com/articles/popularity/]

Now, getting back to future proofing: data science is a rapidly evolving field, with new tools and methods springing up almost every day. Much of that research is being done and published on open platforms like R. You’ll be able to take advantage of that cutting-edge knowledge without having to wait for a vendor to support it in a closed framework.

Embracing this wave of open source analytics tools will help you start to see real ROI from your Big Data investment.

WANdisco Announces Availability of Apache Subversion 1.9.0 alpha binaries

The Apache Subversion project has announced the release of Subversion 1.9.0 alpha, which brings a number of significant improvements.

It’s important to note that this is an alpha release and, as such, it is not recommended for production environments. If you’re able to download and test this release in a non-production environment, though, we’d be grateful for any feedback – if you notice anything untoward, or even just want to chat or ask about this latest version, please drop us a post in our forums.

This release introduces improvements to caching and authentication, some filesystem optimisations for FSX and FSFS, a number of additions to svnadmin commands and improvements to the interactive conflict resolution menus. Other enhancements include the following (there’s a short command-line sketch after the list):

  • New options for ‘svnadmin verify’:
      – --check-normalization
      – --keep-going
  • New ‘svnadmin info’ subcommand: prints information about a repository.
  • New options for ‘svn cleanup’:
      – --remove-unversioned and --remove-ignored
      – --include-externals
      – --quiet
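
By way of illustration, here’s how the new options look on the command line – repository and working copy paths are hypothetical, and alpha behaviour may still change:

    # Verify a repository, reporting all errors instead of stopping at the first
    svnadmin verify --keep-going --check-normalization /var/svn/repos/project

    # Print information about a repository
    svnadmin info /var/svn/repos/project

    # Clean up a working copy, also deleting unversioned and ignored items
    svn cleanup --remove-unversioned --remove-ignored --quiet /path/to/wc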

You can see a full list of changes in the release notes here.

To save you the hassle of compiling from source you can download our fully tested, certified binaries free from our website here: http://www.wandisco.com/subversion/download/early-access

WANdisco’s Subversion binaries provide a complete, fully tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Using TortoiseSVN or SmartSVN? As this is an alpha release there’s no compatible version of these Subversion clients yet, but watch this space and we’ll have them ready before the general release of Subversion 1.9.0.

OpenSSL Vulnerability – The Heartbleed Bug

The OpenSSL team recently published a security advisory regarding the TLS heartbeat read overrun. This vulnerability allows a connected client or server to read up to 64k of the peer’s process memory per request, and a different chunk can be requested on each attack.

The vulnerability affects OpenSSL versions 1.0.1 through 1.0.1f and the 1.0.2 betas; version 1.0.1g contains the fix.

The WANdisco SVN binaries for Windows and Solaris available since 2011 have included vulnerable OpenSSL libraries. We’ve released updated, patched versions as of today, so if you are still using one of the older versions please download the latest:

Windows: http://www.wandisco.com/subversion/download#windows

Solaris: http://www.wandisco.com/subversion/download#solaris

Users of our Subversion products (including SVN MultiSite) on other operating systems will still need to ensure they’ve updated their operating system’s OpenSSL package; nothing vulnerable is included with our binaries for those platforms. We recommend all users on these operating systems update their version of OpenSSL to 1.0.1g as soon as possible or, if unable to update, recompile OpenSSL with the -DOPENSSL_NO_HEARTBEATS flag.
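
If you’re not sure what you’re running, checking and (if need be) rebuilding looks roughly like this – package names and build steps vary by platform, so treat this as illustrative:

    # Show the installed OpenSSL version (1.0.1 through 1.0.1f are affected)
    openssl version

    # If upgrading isn't an option, rebuild with heartbeats compiled out
    ./config -DOPENSSL_NO_HEARTBEATS
    make && make test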

For more information on this vulnerability please see http://heartbleed.com/

UPDATE: SmartSVN versions 8, 8.5 and 8.5.1 are also vulnerable due to the included version of OpenSSL. We’ve now released SmartSVN 8.5.2 and would urge all users of SmartSVN 8 to update to this latest version as soon as possible. SmartSVN 8.5.2 is available for download at http://www.wandisco.com/smartsvn/download-all

Can Big Data Help with Prediction?

A recent article entitled “Limited role for big data seen in developing predictive models” splashes a little cold water on the idea that Big Data will magically help develop better predictive analytics tools. The headline caught my attention, as it’s become a truism that a poor algorithm with lots of data will outperform a great algorithm with not enough data. So let’s go ahead and ask: can Big Data help with prediction?

Now, I understand the author’s point.  If you are performing a well-structured study and you have a deep understanding of the domain, then a smaller and carefully constructed data set will probably serve you better. Later in the article, however, Peter Amstutz, analytics strategist at advertising agency Carmichael Lynch, mentions that in many cases you’re not even sure what you’re looking for and often need to aggregate loosely structured data from disparate sources.  After all, there’s a lot more unstructured data in the world, and it’s growing quickly.

I find myself favoring the dissenting view.  In my job I’m often trying to answer questions like, “Will our next release ship on time given what I now know about the backlog, other projects taking away resources…,” and so on.  It’s not as simple as looking at a burn down chart to track progress.  In my head I’m meshing all types of data points – chatter on the engineering forums, vacation schedules, QA panic boards, et cetera.  I sometimes get a ‘pit of my stomach’ feeling that the schedule is slipping, but when I try to actually quantify what I’m seeing, it’s difficult.  There are so many sources of data to correlate, and none of them report consistently.

Of course, if we had a data warehouse I could run some cool reports on the trends I’m seeing, but I wouldn’t try to convince the higher-ups to make that level of investment (ETL tools, data stores, a visualization front end), and I’m sure they wouldn’t give me JDBC access to all of our databases.

On the other hand, I’ve got a small Hadoop cluster available – just a set of VMs, but sufficient for the volume of data I need to examine – and I know how to pull data using tools like Flume and Sqoop.  All of a sudden I’m seeing possibilities.
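
For instance, landing a relational table in the cluster is a one-liner with Sqoop – the connection string, table, and target path below are made up for illustration:

    # Pull a hypothetical project-tracking table from MySQL into HDFS
    sqoop import \
        --connect jdbc:mysql://dbhost/projects \
        --table backlog \
        --target-dir /data/backlog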

This is one of the real benefits of ‘Big Data’ for predictive analytics.  It can handle the variety of data I need without ETL tools, at a fairly low cost.

Intro to Gerrit – Subversion and Git Live 2014

You may be aware of Gerrit, the web-based code review system. Our Director of Product Marketing, Randy Defauw, has a number of good reasons for adopting it as part of your development process:

The most interesting thing about Gerrit is that it facilitates what some call ‘continuous review’. Code review is often seen as a bottleneck in continuous delivery, but it’s also widely recognized as a way to improve quality. Gerrit resolves this conundrum with innovative features like dynamic review branch creation and the incorporation of continuous build into the heart of the review process.
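
The dynamic review branches are easy to picture from the client side: you push to a magic ref and Gerrit opens a review instead of updating the branch directly. A minimal sketch of the standard flow (remote and branch names are assumptions):

    # Propose a change for review on master rather than pushing to it directly;
    # Gerrit creates the review (and its ref) on the fly
    git push origin HEAD:refs/for/master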

Gerrit is also notable because it is the most enterprise friendly Git code review system, although it has open source roots. It integrates with all standard authentication frameworks, has delegated permission models, and was designed for large deployments.

Randy is Director of Product Marketing for WANdisco’s ALM products. He focuses on understanding in detail how WANdisco’s products help solve real-world problems, and has a deep background in development tools and processes. Prior to joining WANdisco he worked in product management, marketing, consulting, and development. He has several years of experience applying Subversion and Git workflows to modern development challenges.

If you’d like to hear more about Gerrit, or Git in general, come see us at Subversion and Git Live 2014.

SmartSVN 8.5 Moves from SVNKit to JavaHL

Following on from the release of SmartSVN 8.5, we wanted to give you a bit more detail about the biggest change in this release, so here’s Branko Čibej, our Director of Subversion, with an explanation:

One of the most significant events during the development of SmartSVN 8.5 was the decision to adopt the JavaHL library in place of SVNKit, which was used by all previous releases of SmartSVN.

JavaHL is a Java wrapper for Subversion, published by the Apache Subversion project. The most important difference compared to SVNKit is that JavaHL uses the same code base as the Subversion command-line client and tools. This has several benefits for SmartSVN: quicker adoption of new Subversion features; more complete compatibility with Subversion servers, repositories and other clients; built-in support for new working copy formats; and, last but not least, speed — as demonstrated by the phenomenal performance improvements in SmartSVN 8.5, compared to both 8.0 and 7.6.

The decision to adopt JavaHL has also benefited the Subversion community at large: several bug fixes and enhancements in Subversion 1.8.8 and the forthcoming 1.8.9 and 1.9.0 releases are a direct result of the SmartSVN porting effort. We will continue to work closely with the Apache Subversion developers to further improve both JavaHL and Subversion.

Hope that helps explain what’s going on a bit and why we opted to make the change, though it’s worth bearing in mind this is largely an ‘under-the-hood’ change and you won’t notice much difference in the interface. The change will however make future development of SmartSVN much easier.

If you want to see more about the speed improvements there’s a results table in the release blog here.

Cheers all.

Top Challenges of the Git Enterprise Architect #3: Ever Growing Repos

Continuing on from Top Challenges of the Git Enterprise Architect #2: Access Control, I’ll next talk about Git’s ever-growing repos and some of the challenges they present.

Git stores all history locally, which is good because it’s fast. But it’s also bad, because response times for clone and other commands grow with that history and never shrink.

Linear response time

Certain commands take linear time, O(n), in either the number of files in a repo or the depth of its history. For example, Git has no built-in notion of a revision number. Here’s one way to count the revisions of a file, Main.java:

# Count commits that touched Main.java ('--' separates paths from revisions)
git shortlog HEAD -- Main.java | grep -E '^[ ]+\w+' | wc -l

Of course, a consecutive file revision number is not a foundational construct in Git, as it is with a system like Subversion, so Git needs to walk its DAG (Directed Acyclic Graph) backward to the origin, counting the revisions of a particular file along the way.
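
For what it’s worth, newer Git has a more direct one-liner for the same count – though it still walks the whole history underneath:

    # Count the commits that touched Main.java, without the shortlog/grep plumbing
    git rev-list --count HEAD -- Main.java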

When a repo is young, this is typically very fast. And as I noted, revision numbers play a less important role in Git, so we need them less often. But think about what happens when an active shared Git repository stays in service for a long time. When I’ve asked typical SCM admins how long they expect to keep a project supported in an SCM, answers range from four to ten years. An active file might accumulate hundreds or even thousands of revisions, and you’d want to think twice about counting them all up with Git.

Facebook’s struggle

Facebook’s concern over Git’s ever-growing repos and ever-slowing performance a few years ago led them to switch to Mercurial and centralize much of the data. Alas, this approach is not a solution for most companies. It relies on a fast, low-latency connection, and unless you have access to the kind of fast, data-safe, active-active replication found in WANdisco’s MultiSite products for Subversion and Git, users remote from the central site will suffer sharply degraded performance.

Common workaround

The most common workaround I hear about is this: when a Git repo gets too big and slow, a new shared master is cloned and deployed, and the old one serves as history. This is clearly not ideal, but many of the types of development that first adopted Git are less affected by fragmented historical SCM data. As Git gains adoption across a greater variety of enterprise development projects, the need for better solutions will grow. Here at WANdisco we are hard at work paving the road ahead so that your Git deployments will scale historically as well as geographically.

SmartSVN 8.5 Available Now

We’re happy to announce the release of SmartSVN 8.5, the graphical Subversion (SVN) client for Mac, Windows and Linux. SmartSVN 8.5 is available for download from our website here.

Along with several bug fixes and enhancements, SmartSVN 8.5 makes the critical move from SVNKit to JavaHL, the same back end used by the Subversion command-line client and tools.

Major Improvements

Whilst it may not look different, this release marks a huge change under the hood: we’ve moved away from SVNKit, and SmartSVN now uses JavaHL. This is the same library used by command-line Subversion, and it has given SmartSVN 8.5 much improved stability and a huge speed boost. Some comparison timings:

    Operation     Text files                           Jpg files
                  7.6.3     8/8.0.1    8.5 (JavaHL)    7.6.3     8/8.0.1    8.5 (JavaHL)
    Checkout      72.21     78.86      7.34            118.13    120.35     10.92
    1st Add       133.60    201.94     37.49           64.19     98.61      15.47
    Revert        47.14     75.06      16.89           41.19     77.15      8.85
    2nd Add       131.75    186.18     34.64           60.87     101.76     13.81
    Commit        314.44    440.23     85.70           167.34    252.85     42.46
    Remove        13.86     1146.77    13.76           6.76      553.41     8.70

(All times in seconds.)

We’ve also added support for Subversion 1.8.8 and the file:// protocol for local repository access.

For a full list of all improvements, bug fixes and other changes please take a look at the changelog.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download SmartSVN 8.5 from our website here.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

Subversion 1.9 Underway

Subversion 1.9 is already well underway, following quickly on from last year’s impressive 1.8 release. Although the final set of new features may change, there’s one piece of infrastructure work that’s worth highlighting.

A new tunable to control storage compression levels lets you choose a better balance between repository size and server CPU load. Disabling regular compression and deltification yields a substantial improvement in throughput when adding, committing, and checking out large files. You can expect to see more numbers at the next Subversion & Git Live conference, but I will mention that commit speed can increase from 30-40 MB/s to 100 MB/s.
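
To give a rough idea of what this looks like, here’s a sketch of the relevant knob as I read the in-progress work – the section and option name are my assumptions, so check the fsfs.conf template that ships with your 1.9 build:

    # db/fsfs.conf inside the repository (sketch; names may differ in the final release)
    [deltification]
    # 0 disables compression entirely, trading repository size for raw throughput;
    # higher values (up to 9) compress harder at more CPU cost
    compression-level = 0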

Here are a few other tidbits that may interest you:

  • A new tool to list and manage cached credentials (see the sketch after this list)

  • The ability to commit to a repository while a pack operation is in progress (Goodbye, long maintenance windows!)

  • Infrastructure work is starting for a new storage layer that will reduce repository size and improve performance.
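
On the credentials tool: it’s taking shape as a new ‘svn auth’ subcommand. A quick sketch of how I expect it to be used, with the caveat that the interface may change before release:

    # List the credentials cached in ~/.subversion/auth
    svn auth

    # Remove cached credentials matching a realm pattern (hypothetical host)
    svn auth --remove svn.example.com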

While you’re waiting for Subversion 1.9, now’s a great time to upgrade to Subversion 1.8. You can enjoy many of the benefits just by upgrading the Subversion binaries on your server.