DevOps is eating the world

You know a technology trend has become fully mainstream when you see it written up in the Wall Street Journal.  So it goes with DevOps, as this recent article shows.

DevOps and continuous delivery have been important trends in many firms for several years.  It’s all about building higher quality software products and delivering them more quickly.  For SaaS companies it’s an obvious fit as they sometimes push out minor changes many times a day.  But even companies with more traditional products can benefit.  And internal IT departments can use DevOps principles to start saying “yes” to business users more often.

For example, let’s say that your business analytics team asks for a small Hadoop cluster to try out some of the latest machine learning algorithms on Spark.  Saying “yes” to that request should only take hours, not weeks.  If you have a private cloud and the right level of automation, you can spin up a new Spark cluster in minutes.  Then you can work with the analysts to automate the deployment of their algorithms.  If they’re wildly successful and they need to move their new project to a production cluster it’s just a matter of deploying somewhere with more resources.

Of course, none of this comes easily.  On the operations side you’ll need to invest in the right configuration and private cloud infrastructure.   Tools like Puppet, Ansible, and Docker can capture the configuration of servers and applications as code.
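To make “configuration as code” concrete, here’s a minimal Dockerfile sketch.  The package list is illustrative only, not a recommended Hadoop node setup:

```dockerfile
# A worker node's configuration, captured as a versionable text file.
FROM centos:6
# Prerequisites that would otherwise be installed by hand on each host.
RUN yum install -y java-1.7.0-openjdk ntp openssh-server
# Ship the same tuned configuration to every node.
COPY core-site.xml /etc/hadoop/conf/core-site.xml
```

Because the file lives in version control, rebuilding an identical server is one command instead of an afternoon of manual setup.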

But equally important is the development infrastructure.  Companies like Google practice mainline development: all of their work is done from the trunk or mainline, supported by a massive continuous build and test infrastructure.  And Gerrit, a tool that Google sponsors, is perhaps the best code review tool for continuous delivery.

If you look at potential bottlenecks in a continuous delivery pipeline, you need to consider how code gets to the mainline, and then how it gets deployed.  With Gerrit there are only two steps to the mainline:

  • Commit the code.  Gerrit makes a de facto review branch on the fly and initiates a code review.
  • Approve the merge request.  Gerrit handles the merge automatically unless there’s a conflict.

With this system you don’t even need to ask a developer to open a pull request or create a private branch.  Gerrit just automates all of that.  And Gerrit will also invoke any continuous build and test automation to make sure that code is passing those tests before a human reviewer even looks at it.
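You can see the mechanics of step one without a Gerrit server, since pushing to the refs/for/* namespace is just a normal Git push.  Against a plain bare repository the ref is simply created; a real Gerrit server would intercept it and open a review instead:

```shell
# Simulate Gerrit's step one against a throwaway bare repository.
# On a real Gerrit server, pushing to refs/for/master opens a code
# review rather than updating master directly.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/central.git"
git clone -q "$tmp/central.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
git config user.email dev@example.com
git config user.name Dev
echo 'new feature' > feature.txt
git add feature.txt
git commit -qm "Add feature"
# Step 1: commit the code and push it to the review ref.
git push -q origin HEAD:refs/for/master
# Step 2 happens in Gerrit's UI: a reviewer approves, Gerrit merges.
```

That one push is the developer’s entire interaction with the review system; no branch naming or pull-request ceremony required.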

Once it’s on the mainline the rest of the automation kicks in, and those operational tools become important to help you rapidly spin up more realistic test environments.

As you can imagine, this type of infrastructure can put a heavy load on your development systems.  That’s why WANdisco has put the muscle of Git MultiSite behind Gerrit, giving you a horizontally scalable Gerrit infrastructure.

Latest Git binaries available for download

As part of our participation in the open source SCM community, WANdisco provides up-to-date binary downloads for Git and Subversion for all major platforms.  We now have the latest Git binaries available for download on our Git downloads site.

One interesting new feature is git push --atomic.  When you’re pushing several refs (e.g. branches) at once, this feature makes sure that either all the refs are accepted or none are.  That’s useful if you’re making related changes on several branches at once.  Those who merge patches onto several releases at once are often in this position.
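Here’s a self-contained sketch you can try against a throwaway repository (it needs Git 2.4 or later, where --atomic was introduced):

```shell
# Demonstrate git push --atomic: two branch updates land on the
# remote as one all-or-nothing transaction.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"
git clone -q "$tmp/origin.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
git config user.email dev@example.com
git config user.name Dev
git checkout -qb mainline
echo 'fix' > fix.txt
git add fix.txt
git commit -qm "Fix on mainline"
git branch release-1.0    # the same fix belongs on the release branch
# Without --atomic, one ref could be rejected while the other lands;
# with it, the remote accepts both updates or neither.
git push -q --atomic origin mainline release-1.0
```

If any single ref update would be rejected (for example by a server-side hook), the whole push fails and the remote is left untouched.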

The Git community has done a great job of ensuring a stable upgrade process, so there’s generally little concern about upgrading.  It’s always a good idea to review the release notes of course.

Big Data Tech Infrastructure Market Share

The Data Science Association just published this infographic showing market share for a variety of different tools and technologies that form part of the Big Data ecosystem.  The data would have been more useful had it been grouped into categories, but here are a few observations:

  • Amazon is dominating the field for cloud infrastructure.  It’d be interesting to see how much of that is used for test and development versus serious production deployments.
  • Cloudera has more market share than vanilla Apache Hadoop, Hortonworks, or MapR.  It’ll be interesting to see how this picture evolves over time with the advent of the Open Data Platform.
  • Mesos has a surprising share of 14%.  At a recent Big Data event in Denver an audience survey showed that only one person out of 50 was even experimenting with Mesos.  Perhaps this survey is oriented more towards early adopters.

It’s always interesting to see these types of surveys as a complement to the analyst surveys from 451, Wikibon, and the like.

The 100 Day Progress Report on the ODP

This blog by Cheryle Custer, Director of Strategic Alliance Marketing at Hortonworks, has been republished with the author’s permission.

It was just a little over 100 days ago that 15 industry leaders in the Big Data space announced the formation of the Open Data Platform (ODP) initiative. We’d like to let you know what has been going on in that time, to bring you a preview of what you can expect in the next few months and let you know how you can become involved.

Some Background

What is the Open Data Platform Initiative?
The Open Data Platform Initiative (ODP) is an enterprise-focused shared industry effort aimed at simplifying adoption, promoting the use, and advancing the state of Apache Hadoop® and Big Data technologies for the enterprise. It is a non-profit organization being created by people who helped to create Apache, Eclipse, Linux, OpenStack, OpenDaylight, the Open Networking Foundation, OSGi, WS-I (Web Services Interoperability), UDDI, OASIS, the Cloud Foundry Foundation, and many others.

The organization relies on the governance of the Apache Software Foundation community to innovate and deliver the Apache project technologies included in the ODP core while using a ‘one member one vote’ philosophy where every member decides what’s on the roadmap. Over the next few weeks, we will be posting a number of blogs to describe in more detail how the organization is governed and how everyone can participate.

What is the Core?
The ODP Core provides a common set of open source technologies that currently includes: Apache Hadoop® (inclusive of HDFS, YARN, and MapReduce) and Apache® Ambari. ODP relies on the governance of the Apache Software Foundation community to innovate and deliver the Apache project technologies included in the ODP core. Once the ODP members and processes are well established, the scope of the ODP Core will expand to include other open source projects.

Benefits of the ODP Core
The ODP core is a set of open source Hadoop technologies designed to provide a standardized core that big data solution providers, software developers, and hardware developers can use to deliver compatible solutions rooted in open source that unlock customer choice.

By delivering on a vision of “verify once, run anywhere”, everyone benefits:

  • For Apache Hadoop® technology vendors, reduced R&D costs that come from a shared qualification effort
  • For Big Data application solution providers, reduced R&D costs that come from more predictable and better qualified releases
  • Improved interoperability within the platform and simplified integration with existing systems in support of a broad set of use cases
  • Less friction and confusion for Enterprise customers and vendors
  • Ability to redirect resources towards higher value efforts

100 Day Progress Report

In the 100 days since the announcement, we’ve made some great progress:

Four Platforms Shipping
At Hadoop Summit in Brussels in April, we announced the availability of four Hadoop platforms all based on a vision of a common ODP core: Infosys Information Platform, IBM Open Platform, Hortonworks Data Platform, and Pivotal HD. The commercial delivery of ODP-based distributions from multiple industry-leading vendors immediately after the launch of the initiative demonstrates the momentum behind ODP in accelerating the delivery of compatible Hadoop distributions and the simplification an industry-standard core brings to the ecosystem.

New Members and New Participation Levels
In addition to revealing that Telstra is one of the founding Platinum members of the ODP, we’ve added nine new members, including BMC, DataTorrent, PLDT, Squid Solutions, Syncsort, Unifi, zData, and Zettaset. We welcome these new members and look forward to their participation and their announcements. We also announced a new membership level to provide an easy entrée for any company to participate in the ODP. The Silver level of membership allows companies to have a direct voice in the future of big data and contribute people, tests, and code to accelerate execution on the vision.

Community Collaboration at the Bug Bash
ODP member Altiscale led the efforts on a Hadoop Community Bug Bash. This unique event for the Apache Hadoop community, co-sponsored by Hortonworks, Huawei, Infosys, and Pivotal, drew over 150 participants from eight countries and nine time zones to strengthen Hadoop and honor the work of the community by reviewing and resolving software patches. Read more about the Bug Bash, where 186 issues were resolved, either with closure or with patches committed to code. Nice job everyone!  You can participate in upcoming bug bashes, so stay tuned.

Technical Working Group and the ASF
Senior engineers and architects from the ODP member companies have come together as a Technical Working Group (TWG). The goal of the TWG is to jump-start the work required to produce ODP core deliverables and to seed the technical community overseeing the future evolution of the ODP core. Delivering on the promise of “verify once and run anywhere”, the TWG is building certification guidelines for “compatibility” (for software running on top of ODP) and “compliance” (for ODP platforms). We have scheduled a second TWG face-to-face meeting at Hadoop Summit, where committers, PMC members, and ASF members will meet to continue these discussions.

What’s Next?

Many of the member companies will be at Hadoop Summit in San Jose.

While you’re at Hadoop Summit, you can attend the IBM Meet Up and hear more about the ODP. Stay tuned to this blog as well – we’ll use this as a platform to inform you of new developments and provide you insight on how the ODP works.

Want to know more about the ODP? Here are a few reference documents.

Enterprise Hadoop Adoption: Half Empty or Half Full?

This blog by Shaun Connolly, Hortonworks VP of Corporate Strategy, has been republished with the author’s permission.

As we approach Hadoop Summit in San Jose next week, the debate continues over where Hadoop really is on its adoption curve. George Leopold from Datanami was one of the first to beat the hornet’s nest with his article entitled Gartner: Hadoop Adoption ‘Fairly Anemic’. Matt Asay from TechRepublic and Virginia Backaitis from CMSWire volleyed back with Hadoop Numbers Suggest the Best is Yet to Come and Gartner’s Dismal Predictions for Hadoop Could Be Wrong, respectively.

At the center of the controversy is the report published by Merv Adrian and Nick Heudecker from Gartner: Survey Analysis: Hadoop Adoption Drivers and Challenges. Specifically, the Gartner survey shows that 26% of respondents are deployed, piloting or experimenting; 11% plan to invest within 12 months; and an additional 7% plan to invest within 24 months.

Glass Half Empty or Half Full?

I believe the root of the controversy comes not in the data points stated above, but in the phrasing of one of the key findings statements: “Despite substantial hype and reported successes for early adopters, over half of respondents (54%) report no plans to invest at this time. Additionally, only 18% have plans to invest in Hadoop over the next two years.”

The statement is phrased in the negative sense, from a lack of adoption perspective. While not wrong, it represents a half-empty perspective that is more appropriate for analyzing mature markets such as the RDBMS market, which is hundreds of billions of dollars in size and decades into its adoption curve. Comparing today’s Hadoop market size and adoption to today’s RDBMS market is not particularly useful. However, comparing the RDBMS market at the time it was five years into its adoption cycle might be an interesting exercise.

When talking about adoption for newer markets like Enterprise Hadoop, I prefer to frame my view using the classic technology adoption lifecycle that models adoption across five categories with corresponding market share %s: Innovators (2.5%), Early Adopters (13.5%), Early Majority (34%), Late Majority (34%), and Laggards (16%).

Putting the Gartner data into this context shows Hadoop in the Early Majority of the market at the classic inflection point of its adoption curve.


As a publicly traded enterprise open source company, not only is Hortonworks code open, but our corporate performance and financials are open too. Earlier this month, we released Hortonworks’ first quarter earnings. In Q4-2014 and Q1-2015, we added 99 and 105 new subscription customers respectively, which means we added over 46% of our 437 subscription customers in the past 6 months. If we look at the Fortune 100, 40% are Hortonworks subscribers including: 71% of F100 retailers, 75% of F100 Telcos, and 43% of F100 banks.


We see these statistics as clear indicators of the building momentum of Open Enterprise Hadoop and the powerful Hortonworks model for extending Hadoop adoption across all industries. I won’t hide the fact that I am guilty of having a Half Full approach to life. As a matter of fact, I proudly wear the t-shirt every chance I get. The Half Full mindset serves us well at Hortonworks, because we see the glass filling quickly. The numbers for the last two quarters show that momentum.

Come Feel the Momentum at Hadoop Summit on June 9th in San Jose!

If you’d like to see the Hadoop momentum for yourself, then come join us at Hadoop Summit in San Jose starting June 9th.

Geoffrey Moore, author of Crossing the Chasm, will be a repeat keynote presenter this year. At Hadoop Summit 2012, he laid out a technology adoption roadmap for Big Data from the point of view of technology providers. Join Geoff as he updates that roadmap with a specific focus on business customers and the buying decisions they face in 2015.

Mike Gualtieri, Principal Analyst at Forrester Research, will also be presenting. Join Mike for his keynote entitled Adoption is the Only Option—Five Ways Hadoop is Changing the World and Two Ways It Will Change Yours.

In addition to keynote speakers, Summit will host more than 160 sessions being delivered by end user organizations, such as Aetna, Ernst & Young, Facebook, Google, LinkedIn, Mercy, Microsoft, Noble Energy, Verizon, Walt Disney, and Yahoo!, so you can get the story directly from the elephant’s mouth.

San Jose Summit 2015 promises to be an informational, innovative and entertaining experience for everyone.

Come join us. Experience the momentum for yourself.

Configuring multiple zones in Hadoop

Hortonworks, a WANdisco partner and another member of the Open Data Platform, recently published a list of best practices for Hadoop infrastructure management.  One of the top recommendations is configuring multiple zones in Hadoop.  Having development, test, and production environments gives you a safe way to test upgrades and new applications without disturbing a production system.

One of the challenges with creating multiple similar zones is sharing data between them.  Whether you’re testing backup procedures and application functionality, or prototyping a new data analysis algorithm, you need to see similar data in all the zones.  Otherwise you’re not really testing in a production-like environment.

But in a large cluster, transferring terabytes of data between zones can be time-consuming, and it’s tough to tell how stale the data really is.  That’s where WANdisco Fusion becomes an essential part of your operational toolkit.  WANdisco Fusion provides active-active data replication between Hadoop clusters.  You can use it to share part of your Hadoop data between dev/test/prod zones in real time.  All of the zones can make full use of the data, although you can of course use your normal access control system to prevent updates from certain zones.

DevOps principles are coming to Hadoop, so contact one of our solutions architects today to see how WANdisco Fusion can help you maintain multiple zones in your Hadoop deployment.

Different views on Big Data momentum

I was struck recently by two different perspectives on Big Data momentum.  Computing Research just published their 2015 Big Data Review in which they found continued momentum for Big Data projects.  A significantly higher number of their survey respondents in 2015 are using Big Data projects for operational results.  In a contrasting view, Gartner found that only 26% of the respondents were running or even experimenting with Hadoop.

If you dig a little deeper into the Computing study, you’ll see that it’s speaking about a wider range of Big Data options than just Hadoop.  The study mentions that 29% of the respondents are at least considering using Hadoop specifically, up from 15% last year.  So the two studies are closer than they look at first glance, yet the tone is strikingly different.

One possible explanation is that the Big Data movement is much bigger than Hadoop and it’s easier to be optimistic about a movement than a technology.  But even so, I’d tend towards the optimistic view of Hadoop.  If you look at the other technologies being considered for Big Data, analytics tools and databases (including NoSQL databases) are driving tremendous interest, with over 40% of the Computing Research participants evaluating new options.  And the Hadoop community has done a tremendous amount of work to turn Hadoop into a general purpose Big Data platform.

You don’t have to look very far for examples.  Apache Spark is now bundled in mainstream distributions to provide fast in-memory processing, while Pivotal (a member of the Open Data Platform along with WANdisco) has contributed Greenplum and HAWQ to the open source effort.

To sum up, the need for ‘Big Data’ is not in dispute, but the technology platforms that underpin Big Data are evolving rapidly.  Hadoop’s open nature and evolution from a processing framework to a platform are points in its favor.

Behind the scenes: Rapid Hadoop deployment

If you’ve ever deployed a Hadoop cluster from scratch on internal hardware or EC2, you know there are a lot of details to get right.  Syncing time with ntp, setting up password-less login across all the nodes, and making sure you have all the prerequisite packages installed is just the beginning.  Then you have to actually deploy Hadoop.  Even with a management tool like Ambari there’s a lot of time spent going through the web interface and deploying software.  In this article I’m going to describe why we invested in a framework for rapid Hadoop deployment with Docker and Ansible.

At WANdisco we have teams of engineers and solutions architects testing our latest products on a daily basis, so automation is a necessity.  Last year I spent some time on a Vagrant-Puppet toolkit to set up EC2 images and deploy Hadoop using Ambari blueprints.  As an initial effort it was pretty good but I never invested the time to handle the cross-node dependencies.  For instance, after the images are provisioned with all the prerequisites I manually ran another Puppet script to deploy Ambari, then another one to deploy Hue, rather than having a master process that handled the timing and coordination.

Luckily we have a great automation team in our Sheffield office that set up a push-button solution using Docker and Ansible.  With a single invocation you get:

  • 3 clusters (mix-and-match with the distributions you prefer)
  • Each cluster has 7 containers.  The first runs the management tool (like Ambari), the second runs the NameNode and most of the master services, the third runs Hue, and the others are data nodes.
  • All of the networking and other services are registered correctly.
  • WANdisco Fusion installed.

Starting from a bare metal host, it takes about 20 minutes to do a one-time setup with Puppet that installs Docker and the Ansible framework and builds the Docker images.  Once that first-time setup is done, a simple script starts the Docker containers and runs Ansible to deploy Hadoop.  That takes about 20 minutes for a clean install, or 2-3 minutes to refresh the clusters with the latest build of our products.
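In outline, the flow looks like the sketch below.  The script and playbook names are hypothetical stand-ins; the real automation is internal:

```shell
# One-time host setup (~20 minutes): install Docker and Ansible,
# and build the base images for each Hadoop distribution.
puppet apply bootstrap.pp

# Per-environment spin-up (~20 minutes clean, 2-3 minutes to refresh):
# start the containers, then let Ansible lay down Hadoop and Fusion.
./start-containers.sh --clusters 3 --nodes-per-cluster 7
ansible-playbook -i inventory/docker deploy-hadoop.yml
```

The key design point is the split between a slow, cached image-build phase and a fast, repeatable deployment phase.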

That’s a real time-saver.  Engineers can refresh with a new build in minutes, and solution architects can set up a brand new demo environment in under a half hour.  Docker is ideal for demo purposes as well.  Cutting down the number of nodes lets the whole package run comfortably on a modern laptop, and simply pausing a container is an easy way to simulate node failures.  (When you’re demonstrating the value of active-active replication, simulating failure is an everyday task.)

As always, DevOps is a work-in-progress.  The team is making improvements every week, and I think with improved use of Docker images we can cut the cluster creation time down even more.

That’s a quick peek at how our internal engineering teams are using automation to speed up development and testing of our Hadoop products.  If you’d like to learn more, I encourage you to tweet @wandisco with questions, or ask on our Hadoop forum.

Cos Boudnik on Apache Ignite and Apache Spark

In case you missed it, WANdisco’s own Konstantin (Cos) Boudnik wrote a very interesting blog post about in-memory computing recently.  Apache Spark has attracted a lot of attention for its robust programming model and excellent performance.  Cos’ article points out another Apache project that’s worth keeping an eye on, Apache Ignite.

Ignite is a full in-memory computing platform, whereas Spark uses memory primarily to speed up processing.  Ignite also features full SQL-99 support and a Java-centric programming model, compared to Spark’s preference for Scala.  (I’ll note that I do appreciate Spark’s strong support for Python as well.)

Although I won’t pretend to understand all the technical nuances of Ignite and Spark, it seems that there is some overlap in use cases.  That’s a good sign for data analysts looking for more choices for faster big data processing.

5 questions for your Hadoop architect

I was baffled last week when I was told that a lot of Hadoop deployments don’t even use a backup procedure.  Hadoop does of course provide local data replication that gives you three copies of every file.  But catastrophes can and do happen.  Data centers aren’t immune to natural disasters or malicious acts, and if you try to put some of your data nodes in a remote site the performance will suffer greatly.

WANdisco of course makes products that solve data availability problems among other challenges, so I’m not an impartial observer.  But ask yourself this: is the data in your Hadoop cluster less valuable than the photos on your cell phone that are automatically synced to a remote storage site?

And after that, ask your Hadoop architect these 5 questions:

  • How is our Hadoop data backed up?
  • How much data might we lose if the data center fails?
  • How long will it take us to recover data and be operational again if we have a data center failure?
  • Have you verified the integrity of the data at the backup site?
  • How often do you test our Hadoop applications on the backup site?

The answers might surprise you.
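If the answer to the first question is “we don’t have one,” a common starting point is a scheduled DistCp copy to a second cluster.  A minimal sketch, with hypothetical cluster addresses:

```shell
# Copy /data from the production cluster to a backup cluster,
# transferring only files that changed since the last run and
# removing files that no longer exist at the source.
hadoop distcp -update -delete \
    hdfs://prod-nn:8020/data \
    hdfs://backup-nn:8020/data
```

A batch copy like this still leaves a window of data loss between runs, which is exactly the gap questions two and three are probing.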

Hortonworks and WANdisco make it easy to get started with Spark

Hortonworks, one of our partners in the Open Data Platform Initiative, recently released version 2.2.4 of the Hortonworks Data Platform (HDP).  It bundles Apache Spark 1.2.1.  That’s a clear indicator (if we needed another one) that Spark has entered the Hadoop mainstream.  Are you ready for it?

Spark opens up a new realm of use cases for Hadoop since it offers very fast in-memory data processing.  Spark has blown through several Hadoop benchmarks and offers a unified batch, SQL, and streaming framework.

But Spark presents new challenges for Hadoop infrastructure architects.  It favors memory and CPU with a smaller number of drives than a typical Hadoop data node.  The art of monitoring and tuning Spark is still in early days.
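For example, Spark executors are typically given far more memory and CPU than a classic MapReduce slot.  Illustrative (not prescriptive) settings in spark-defaults.conf might look like:

```properties
# Fewer, fatter executors: Spark trades spindles for RAM and cores.
spark.executor.memory   16g
spark.executor.cores    4
```

The right values depend heavily on your workload, which is why a separate cluster is a safe place to experiment.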

Hortonworks is addressing many of these challenges by including Spark in HDP 2.2.4 and integrating it into Ambari.  And now WANdisco is making it even easier to get started with Spark by giving you the flexibility to deploy Spark into a separate cluster while still using your production data.

WANdisco Fusion uses active-active data replication to make the same Hadoop data available and usable consistently from several Hadoop clusters.  That means you can run Spark against your production data, but isolate it on a separate cluster (perhaps in the cloud) while you get up to speed on hardware sizing and performance monitoring.  You can continue to run Spark this way indefinitely in order to isolate any potential performance impact, or eventually migrate Spark to your main cluster.

Shared data but separate compute resources gives you the extra flexibility you need to rapidly deploy new Hadoop technologies like Spark without impacting critical applications on your main cluster.  Hortonworks and WANdisco make it easy to get started with Spark.  Get in touch with our solution architects today to get started.



WANdisco Fusion Q&A with Jagane Sundar, CTO

Tuesday we unveiled our new product: WANdisco Fusion. Ahead of the launch, we caught up with WANdisco CTO Jagane Sundar, who was one of the driving forces behind Fusion.

Jagane joined WANdisco in November 2012 after the firm’s acquisition of AltoStor and has since played a key role in the company’s product development and rollout. Prior to founding AltoStor along with Konstantin Shvachko, Jagane was part of the original team that developed Apache Hadoop at Yahoo!.

Jagane, put simply, what is WANdisco Fusion?

JS: WANdisco Fusion is a wonderful piece of technology that’s built around a strongly consistent transactional replication engine, allowing for the seamless integration of different types of storage for Hadoop applications.

It was designed to help organizations get more out of their Big Data initiatives, answering a number of very real problems facing the business and IT worlds.

And the best part? All of your data centers are active simultaneously: You can read and write in any data center. The result is you don’t have hardware that’s lying idle in your backup or standby data center.

What sort of business problems does it solve?

JS: It provides two new important capabilities for customers. First, it keeps data consistent across different data centers no matter where they are in the world.

And it gives customers the ability to integrate different storage types into a single Hadoop ecosystem. With WANdisco Fusion, it doesn’t matter if you are using Pivotal in one data center, Hortonworks in another and EMC Isilon in a third – you can bring everything into the same environment.

Why would you need to replicate data across different storage systems?

JS: The answer is very simple. Anyone familiar with storage environments knows how diverse they can be. Different types of storage have different strengths depending on the individual application you are running.

However, keeping data synchronized is very difficult if not done right. Fusion removes this challenge while maintaining data consistency.

How does it help future proof a Hadoop deployment?

JS: We believe Fusion will form a critical component of companies’ workflow update procedures. You can update your Hadoop infrastructure one data center at a time, without impacting application availability or having to copy massive amounts of data once the update is done.

This helps you deal with updates from both Hadoop and application vendors in a carefully orchestrated manner.

Doesn’t storage-level replication work as effectively as Fusion?

JS: The short answer is no. Storage-level replication is subject to latency limitations that are imposed by file systems. The result is you cannot really run storage-level replication over long distances, such as a WAN.

Storage-level replication is nowhere near as functional as Fusion: It has to happen at the LAN level and not over a true Wide Area Network.

With Fusion, you have the ability to integrate diverse systems such as NFS with Hadoop, allowing you to exploit the full strengths and capabilities of each individual storage system – I’ve never worked on a project as exciting and as revolutionary as this one.

How did WANdisco Fusion come about?

JS: By getting inside our customers’ data centers and witnessing the challenges they faced. It didn’t take long to notice the diversity of storage environments.

Our customers found that different storage types worked well for different applications – and they liked it that way. They didn’t want strict uniformity across their data centers, but to be able to leverage the strengths of each individual storage type.

At that point we had the idea for a product that would help keep data consistent across different systems.

The result was WANdisco Fusion: a fully replicated transactional engine that makes the work of keeping data consistent trivial. You only have to set it up once and never have to bother with checking if your data is consistent.

This vision of a fully utilized, strongly consistent, diverse storage environment for Hadoop is what we had in mind when we came up with the Fusion product.

You’ve been working with Hadoop for the last 10 years. Just how disruptive is WANdisco Fusion going to be?

JS: I’ve actually been in the storage industry for more than 15 years now. Over that period I’ve worked with shared storage systems, and I’ve worked with Hadoop storage systems. WANdisco Fusion has the potential to completely revolutionize the way people use their storage infrastructure. Frankly, this is the most exciting project I’ve ever been part of.

As the Hadoop ecosystem evolved I saw the need for this virtual storage system that integrates different types of storage.

Efforts to make Hadoop run across different data centers have been mostly unsuccessful. For the first time, we at WANdisco have a way to keep your data in Hadoop systems consistent across different data centers.

The reason this is so exciting is because it transforms Hadoop into something that runs in multiple data centers across the world.

Suddenly you have capabilities that even the original inventors of Hadoop didn’t really consider when it was conceived. That’s what makes WANdisco Fusion exciting.

Scalable and Secure Git

Now that WANdisco has released an integration between Git MultiSite and GitLab, it’s worth putting the entire Git lineup at WANdisco into perspective.

Git MultiSite is the core product providing active-active replication of Git repository data. This underpins our efforts to make Git more reliable and better performing. Active-active replication means that you have full use of your Git data at several locations, not just in a single ‘master’ Git server. You get full high availability and disaster recovery out of the box, and you can load balance your end user and build demands between several Git servers. Plus, users at every location get fast local read and write access. As one of our customers recently pointed out, trying to make regular Git mirrors work this way requires a few man-years of effort.

On top of Git MultiSite you have three options for user management, security, and collaboration.

  • Use WANdisco’s Access Control Plus for unified, scalable user and permission management. It features granular permissions, delegated team management, and full integration with SVN MultiSite Plus for unified Subversion-Git administration.
  • Use Gerrit to take advantage of powerful continuous review workflows that underpin the Android community.
  • Use GitLab for an enterprise-grade social coding and collaboration platform.

Not sure which direction to take? Our solution architects help you understand how to choose between Subversion, Git, and all the other tools that you have to contend with.

Active-active strategies for data protection

A new report preview from 451 Research highlights some of the challenges facing data center operators. Two of the conclusions stood out in particular.  First, disaster recovery (DR) strategies are top of mind as IT operations become increasingly centralized, raising the cost of an outage: 42% of data center operators are evaluating DR strategies, and a majority (62%) are using active-active strategies for data protection. Second, data center operators are playing in a more complicated world now. The ability to operate applications and data centers in a hybrid cloud environment is called out as a particular area of interest.

These findings echo what we’re hearing from our own customers. For many enterprise IT architects, active-active data replication is a checklist item when deploying a vital service like a Hadoop cluster. Many WANdisco Fusion customers buy our products for precisely that reason. And we’re also seeing strong interest in WANdisco Fusion’s unique ability to provide that replication between Hadoop clusters that use different distributions and storage systems, on-premise or in the cloud.

Visit 451 Research to obtain the full report. In the meantime, our solution architects can help you evaluate your own DR and hybrid deployment strategies.

Improving HBase Scalability for Real-time Applications

When we introduced Non-Stop for Apache HBase, we explained how it would improve HBase reliability for critical applications.  But Non-Stop for Apache HBase also uniquely improves HBase scalability and performance.

By running multiple active-active region servers, Non-Stop for Apache HBase alleviates some common HBase performance woes.  First, clients are load balanced between several region servers for any particular region.  Spreading the load this way reduces the impact of problems like region ‘hot spots’.


So far so good, but you might be thinking that you could get the same benefit by using HBase read-HA.  However, HBase read-HA is limited to read operations in a single data center.  Non-Stop for Apache HBase lets you put region servers in several data centers, and any of them can handle write operations.  That gives you a few nice benefits:

  • Writes can be directed to any region server, reducing the chance that a single region server becomes a bottleneck due to hot spots or garbage collection.
  • Applications at other data centers now have fast access to a ‘local’ region server.

Although the HBase community continues to try to improve HBase performance, there are some bottlenecks that just can’t be eliminated without active-active replication.  No other solution lets you use several active region servers per region, and put those region servers at any location without regard to WAN latency.

If you’ve ever struggled with HBase performance, you should give Non-Stop for Apache HBase a close look.

Reducing Hadoop network vulnerabilities for backup and replication

Managing network connections between Hadoop clusters in different data centers is a significant challenge for Hadoop and network administrators. WANdisco Fusion reduces the number of connections required for any cross-cluster data flow, thereby reducing Hadoop network vulnerabilities for backup and replication.

DistCP is the data transfer tool that underpins almost every Hadoop backup and workflow system.  It requires connectivity from each data node in the source cluster to each data node in the target cluster.

DistCP requires connections from each data node to each data node

Typically each data node-to-data node connection requires configuring two firewall rules, for inbound and outbound traffic, to cross the firewall and navigate any intermediate proxies.  With 16 data nodes in each cluster, that means 16 x 16 x 2 = 512 connections to configure, secure, and monitor.  That’s a nightmare for Hadoop administrators and network operators – just ask the people responsible for planning and running your Hadoop cluster.

WANdisco Fusion solves this problem by routing cross-cluster communication through a handful of servers.  As the diagram below illustrates, in the simplest case you’ll have one server in the source cluster talking to one server in the target cluster, requiring a grand total of 2 connections to configure.

WANdisco Fusion requires only a handful of network connections

In a realistic deployment, you’d require additional connections for the redundant WANdisco Fusion servers – this is an active-active configuration after all.  Still, in a large deployment you’d see a few tens of connections, rather than many hundreds.
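To put rough numbers on the difference, here is a back-of-the-envelope calculation (the node and server counts are illustrative, not from any particular deployment):

```python
# Firewall rules needed for cross-cluster data flow.

# DistCP: every source data node talks to every target data node,
# and each node-to-node path needs an inbound and an outbound rule.
def distcp_rules(src_nodes: int, dst_nodes: int) -> int:
    return src_nodes * dst_nodes * 2

# WANdisco Fusion: only the Fusion servers talk across the WAN.
def fusion_rules(src_servers: int, dst_servers: int) -> int:
    return src_servers * dst_servers * 2

print(distcp_rules(16, 16))   # 512 rules for two 16-node clusters
print(fusion_rules(1, 1))     # 2 rules in the simplest case
print(fusion_rules(3, 3))     # 18 rules with redundant Fusion servers
```

Even with three redundant Fusion servers on each side, you stay in the tens of rules rather than the hundreds.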

The most recent spate of data breaches is driving a 10% annual increase in cybersecurity spending.  Why make yourself more vulnerable by exposing your entire Hadoop cluster to the WAN?  Our solution architects can help you reduce your Hadoop network exposure.

Improving HBase Resilience for Real-time Applications

HBase is the NoSQL database of choice for Hadoop, and now supports critical real-time workloads in financial services and other industries. As HBase has grown more important for these workloads, the Hadoop community has focused on reducing potential down time in the event of region server failure. Rebuilding a region server can take 15 minutes or more, and even the latest improvements only provide timeline-consistent read access using standby region servers. In many critical applications, losing write access for more than a few seconds is simply unacceptable.


Enter Non-Stop for Apache HBase. Built on WANdisco’s patented active-active replication engine, WANdisco provides fully consistent active-active access to a set of replicated region servers. That means that your HBase data is always safe and accessible for read and write activity.


By providing fully consistent active-active replication for region servers, Non-Stop for Apache HBase gives applications always-on read/write access for HBase.

Putting a replica in a remote location also provides geographic redundancy. Unlike native HBase replication, region servers at other data centers are fully writable and guaranteed to be consistent. Non-Stop for Apache HBase includes active-active HBase masters, so full use of HBase can continue even if an entire cluster is lost.

Non-Stop for Apache HBase also simplifies the management of HBase, as you no longer need complicated asynchronous master-slave setups for backup and high availability.

Take a look at what Non-Stop for Apache HBase can do for your low-latency and real-time analysis applications.

Monitoring active-active replication in WANdisco Fusion

WANdisco Fusion provides a unique capability: active-active data replication between Hadoop clusters that may be in different locations and run very different types of Hadoop.

From an operational perspective, that capability raises some new and interesting questions about cross-cluster data flow.  Which cluster does data most often originate at?  How fast is it moving between clusters?  And how much data is flowing back and forth?

WANdisco Fusion captures a lot of detailed information about the replication of data that can help to answer those questions, and it’s exposed through a series of REST endpoints.  The captured information includes:

  • The origin of replicated data (which cluster it came from)
  • The size of the files
  • Transfer rate
  • Transfer start, stop, and elapsed time
  • Transfer status
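As a sketch of how those records might be consumed in a script, here is a small example that summarizes a batch of transfer records (the JSON field names here are assumptions for illustration, not the documented interface):

```python
import json

# A sample payload in the shape described above (field names are illustrative).
payload = json.loads("""
[
  {"origin": "cluster-east", "bytes": 614400, "rate_kbps": 850,  "status": "complete"},
  {"origin": "cluster-east", "bytes": 921600, "rate_kbps": 1150, "status": "complete"},
  {"origin": "cluster-west", "bytes": 716800, "rate_kbps": 990,  "status": "failed"}
]
""")

# Summarize the completed transfers: average rate and total volume moved.
complete = [t for t in payload if t["status"] == "complete"]
avg_rate = sum(t["rate_kbps"] for t in complete) / len(complete)
total_mb = sum(t["bytes"] for t in complete) / (1024 * 1024)
print(f"{len(complete)} transfers, avg {avg_rate:.0f} kb/s, {total_mb:.1f} MB moved")
```

The same aggregation works whether you pull the records into R, a spreadsheet, or a monitoring dashboard.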

A subset of this information is visible in the WANdisco Fusion user interface, but I decided it would be a good chance to dust off my R scripts and do some visualization on my own.

For example, I can see that the replication data transfer rate between my two EC2 clusters is roughly between 800 and 1200 kb/s.

And the size of the data is pretty small, between 600 and 900 kb.

Those are just a couple of quick examples that I captured while running a data ingest process.  But over time it will be very helpful to keep an eye on the flow of data between your Hadoop clusters.  You could see, for instance, if there are any peak ingest times for clusters in different geographic regions.

Beyond all the other benefits of WANdisco Fusion, it provides this wealth of operational data to help you manage your Hadoop deployment. Be sure to contact one of our solutions architects if this information could be of use to you.

Transforming health care with Big Data

There’s a lot of hype around Big Data these days, so it’s refreshing to hear a real success story directly from one of the practitioners.  I was lucky a couple of weeks ago to attend a talk given by Charles Boicey, an Enterprise Analytics Architect, at an event sponsored by WANdisco, Hortonworks, and Slalom Consulting.  Charles helped put a Big Data strategy in place at the University of California – Irvine (UCI) Medical Center, and is now working on a similar project at Stony Brook.

If you’ve ever read any of Atul Gawande‘s publications, you’ll know that the U.S. health care system is challenged by a rising cost curve.  Thoughtful researchers are trying to address costs and improve quality of care by reducing error rates, focusing on root causes of recurring problems, and making sure that health care practitioners have the right data at the right time to make good decisions.

Mr. Boicey is in the middle of these transformational projects.  You can read about this work on his Twitter feed and elsewhere, and WANdisco has a case study available.  One thing that caught my attention in his latest talk is the drive to incorporate data from social media and wearable devices to improve medical care.  Mr. Boicey mentioned that sometimes patients will complain on Facebook while they’re still in the hospital – and that’s probably a good thing for the doctors and nurses to know.

And of course, all of the wearable devices that track daily activity and fitness would be a boon to medical providers if they could get a handle on that data easily.  The Wall Street Journal has a good write-up on the opportunities and challenges in this area.

It’s nice to see Big Data in concrete applications that will truly benefit society.  It’s not just a tool for making the web work better anymore.

Benefits of WANdisco Fusion

In my last post I described WANdisco Fusion’s cluster-spanning file system. Now think of what that offers you:

  • Ingest data to any cluster and share it quickly and reliably with other clusters. That’ll remove fragile data transfer bottlenecks while still letting you process data at multiple places to improve performance and get more utilization out of backup clusters.
  • Support a bimodal or multimodal architecture to enable innovation without jeopardizing SLAs. Perform different stages of the processing pipeline on the best cluster. Need a dedicated high-memory cluster for in-memory analytics? Or want to take advantage of an elastic scale-out on a cheaper cloud environment? Got a legacy application that’s locked to a specific version of Hadoop? WANdisco Fusion has the connections to make it happen. And unlike batch data transfer tools, WANdisco Fusion provides fully consistent data that can be read and written from any site.
  • Put away the emergency pager. If you lose data on one cluster, or even an entire cluster, WANdisco Fusion has made sure that you have consistent copies of the data at other locations.
  • Set up security tiers to isolate sensitive data on secure clusters, or keep data local to its country of origin.
  • Perform risk-free migrations. Stand up a new cluster and seamlessly share data using WANdisco Fusion. Then migrate applications and users at your leisure, and retire the old cluster whenever you’re ready.

Read more

Interested? Check out the WANdisco Fusion page or call us for details.

Enter WANdisco Fusion

In my last post I talked about some of the problems of setting up data lakes in real Hadoop deployments. And now here’s the better way: WANdisco Fusion lets you build an effective, fast, and secure Hadoop deployment by bridging several Hadoop clusters – even if those clusters use different distributions, different versions of Hadoop, or even different file systems.

How does it work?

WANdisco Fusion (WD Fusion for short) lets you share data directories between two or more clusters. The data is replicated using WANdisco’s active-active replication engine – this isn’t just a fancier way to mirror data. Every cluster can write into the shared directories, and changes are coordinated in real-time between the clusters. That’s where the reliability comes from: the Paxos-based replication engine is a proven, patented way to coordinate changes coming from anywhere in the world with 100% reliability. Clusters that are temporarily down or disconnected catch up automatically when they’re back online.
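To see why agreeing on a single order of changes matters, here’s a toy illustration (a gross simplification of the real Paxos-based engine): as long as every cluster applies the same operations in the same sequence, all copies converge, no matter where the writes originated.

```python
class Replica:
    """A trivially simple stand-in for one cluster's view of a shared directory."""
    def __init__(self):
        self.files = {}

    def apply(self, op):
        action, path, data = op
        if action == "write":
            self.files[path] = data
        elif action == "delete":
            self.files.pop(path, None)

# The coordination engine's job: hand every replica the SAME order,
# even when writes originate at different clusters concurrently.
agreed_order = [
    ("write", "/lake/a.csv", "v1"),   # originated at cluster 1
    ("write", "/lake/a.csv", "v2"),   # originated at cluster 2
    ("delete", "/lake/tmp", None),    # originated at cluster 1
]

east, west = Replica(), Replica()
for op in agreed_order:
    east.apply(op)
    west.apply(op)

print(east.files == west.files)  # True: both clusters converge
```

The hard part, of course, is reaching that agreed order reliably over a WAN with failures and partitions – that is what the patented replication engine does.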

The actual data transfer is done as an asynchronous background process and doesn’t consume MapReduce resources.

Selective replication enhances security. You can centrally define whether data is available to every cluster or just one cluster.


What benefits does it bring?

Ok, put aside the technical bits. What can this thing actually do? In the next post I’ll show you how WD Fusion helps you get more value out of your Hadoop clusters.

WANdisco Fusion: A Bridge Between Clusters, Distributions, and Storage Systems

The vision of the data lake is admirable: collect all your valuable business data in one repository. Make it available for analysis and generate actionable data fast enough to improve your strategic and tactical business decisions.

Translated to Hadoop language, that implies putting all the data in a single large Hadoop cluster. That gives you the analysis advantages of the data lake while leveraging Hadoop’s low storage costs. And indeed, a recent survey found that 61% of Big Data analytics projects have shifted some EDW workload to Hadoop.

But in reality, it’s not that simple. 35% of those involved in Big Data projects are worried about maintaining performance as the data volume and work load increase. 44% are concerned about lack of enterprise-grade backup. Those concerns argue against concentrating ever more data into one cluster.

And meanwhile, 70% of the companies in that survey have multiple clusters in use. Small clusters that started as department-level pilots become production clusters. Security or cost concerns may dictate the use of multiple clusters for different groups. Upgrades to new Hadoop distributions to take advantage of new components (or abandon old ones) can be a difficult migration process. Whatever the reason, the reality of Hadoop deployments is more complicated than you’d think.

As for making multiple clusters play well together… well, the fragility of tools like DistCP brings back memories of those complicated ETL processes that we wanted to leave behind us.

So are we doomed to an environment of data silos? Isn’t that what we were trying to avoid?


There is a better way. In the next post I’ll introduce WANdisco Fusion, the only Hadoop-compatible file system that quickly and easily shares data across clusters, distributions, and file systems.

Survey source: Wikibon

SmartSVN has a new home

We’re pleased to announce that from 23/02/2015 SmartSVN will be owned, maintained and managed by SmartSVN GmbH, a wholly owned subsidiary of Syntevo GmbH.

Long term customers will remember that Syntevo were the original creators and suppliers of SmartSVN, before WANdisco’s purchase of the product.

We’ve brought a lot of great features and enhancements to SmartSVN since we purchased it in 2012, particularly with the change from SVNKit to JavaHL, which brought significant performance improvements and means that SmartSVN will be compatible with updates to core Subversion much faster than previously.

During the last two years the founders of Syntevo have continued to work with WANdisco on both engineering and consulting levels, so the transition back into their ownership will be smooth and seamless. We’re confident that having the original creators of SmartSVN take over the reins again will ensure that SmartSVN remains the best cross-platform Subversion product available for a long time to come.

Will this affect my purchased SmartSVN license?

No, SmartSVN GmbH will continue to support current SmartSVN users and you’ll be able to renew through them when the free upgrade period of your SmartSVN license has expired.

Where should I raise issues in the future?

The best place to go is Syntevo’s contact page where you’ll find the right contact depending on the nature of your issue.

A thank you to the SmartSVN community

Your input has been invaluable in guiding the improvements we’ve made to SmartSVN; we couldn’t have done it without you. We’d like to say thank you for your business over the last two years, and we hope you continue to enjoy the product.

Team WANdisco

Join WANdisco at Strata

The Strata conferences are some of the best Big Data shows around.  I’m really looking forward to the show in San Jose on February 17-20 this year.  The presentations look terrific, and there are deep-dive sessions into Spark and R for all of the data scientists.

Plus, WANdisco will have a strong presence.  Our very own Jagane Sundar and Brett Rudenstein will be in the Cube to talk about WANdisco’s work on distributed file systems.  They’ll also show early demos of some exciting new technology, and you can always stop by our booth to see more.

Look forward to seeing everyone out there!

Register for Hadoop Security webinar

Security in Hadoop is a challenging topic.  Hadoop was built without much of a security framework in mind, and over the years the distribution vendors have added new security layers.  Kerberos, Knox, Ranger, Sentry – there are a lot of security components to consider in this fluid landscape.  Meanwhile, the demand for security is increasing thanks to heightened data privacy concerns, exacerbated by the recent string of security breaches at major corporations.

This week Wikibon’s Jeff Kelly will give his perspective on how to secure sensitive data in Hadoop.  It should be a very interesting and useful Hadoop security webinar and I hope you’ll join us – just register online.

Data locality leading to more data centers

In the ‘yet another headache for CIOs’ category, here’s an interesting read from the Wall Street Journal on why US companies are going to start building more data centers in Europe soon.  In the wake of various cybersecurity threats and some recent political events, national governments are more sensitive to their citizens’ data leaving their area of control.  That’s data locality leading to more data centers – and it’ll hit a lot of companies.

Multinational firms are of course affected as they have customer data originating from several areas.  But in my mind the jury is out on how big the impact will be.  If you’re even a consumer of social media information, do you need a local data center in every area where you’re trying to get that data feed?  It’s likely going to take a few years (and probably some legal rulings) before the dust settles.

You can imagine that this new requirement puts a real crimp in Hadoop deployment plans.  Do you now need at least a small cluster in each area you do business in?  If so, how do you easily keep sensitive data local while still sharing downstream analysis?

This is one of the areas where a geographically distributed HDFS with powerful selective replication capabilities can come to the rescue.  For more details, have a listen to the webinar on Hadoop data transfer pipelines that I ran with 451 Research’s Matt Aslett last week.

Hadoop Data Protection

I just came across this nice summary of data protection strategies for Hadoop.  It hits on a key problem: typical backup strategies for Hadoop just won’t handle the volume of data, and there’s not much available from dedicated backup vendors either.  Because it’s a daunting problem, companies just assume that Hadoop is “distributed enough” to not require a bulletproof backup strategy.

But as we’ve heard time and again from our customers, that’s just not the case.  The article shows why – if you read on to the section on DistCP, the tool normally used for cluster backup, you’ll see that it can take hours to back up a few terabytes of data.

As the article mentions, what’s necessary is an efficient block-level backup solution.  Luckily, that’s just what WANdisco provides in Nonstop Hadoop.  The architect of our geographically distributed solution for a unified cross-cluster HDFS data layer described the approach at a Strata conference last year.

The article actually mentions our solution, but I think there was a slight misunderstanding.  WANdisco does not actually make “just a backup” solution, so we don’t provide any more versioning than what you get out of regular HDFS.  In fact that’s the whole point – we provide an HDFS data layer that spans multiple clusters and locations.  It provides a very effective Disaster Recovery strategy as well as other benefits like cluster zones and multiple data-center ingest.


Interested in learning more?  We’re here to help.

Complete control over Hadoop data locality

Non-Stop Hadoop provides a unified data layer across Hadoop clusters in one or many locations. This unified data layer solves a number of problems by providing a very low recovery point objective for critical data, full continuity of data access in the event of failure, and the ability to ingest and process data at any cluster.

Carrying this layer to its logical conclusion, you may ask whether we’ve introduced a new problem in the process of solving these others: what if you don’t want to replicate all HDFS data everywhere?

Perhaps you have to respect data privacy or locality regulations, or maybe it’s just not practical to ship all your raw data across the WAN. Do you have to fall back to workflow management systems like Falcon to do scheduled selective data transfers, and deal with the delays and complexity of building an ETL-style pipeline?

Luckily, no. Non-Stop Hadoop provides a selective replication capability that is more sophisticated than what you could build manually with the stock data transfer tools. As part of a centralized administration function, for each part of the HDFS namespace you can define:

  • Which data centers receive the data
  • The replication factor in each data center
  • Whether data is available for remote (WAN) read even if it is not available locally
  • Whether data can be written in a particular data center

This solves a host of problems. Perhaps most importantly, if you have sensitive data that cannot be transferred outside a certain area, you can make sure it never reaches data centers in other areas. Further, you can ensure that the restricted part of the namespace is never accessed for reads or writes in other areas.

Non-Stop Hadoop’s selective replication also solves some efficiency problems. Simply choose not to replicate temporary ‘working’ data, or only replicate rarely accessed data on demand. Similarly, you don’t need as high a replication factor if data exists in multiple locations, so you can cut down on some local storage costs.
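As a rough mental model of the policy controls above (the structures and names here are illustrative assumptions, not WANdisco’s actual configuration format), selective replication boils down to a per-namespace rule set:

```python
from dataclasses import dataclass

@dataclass
class Policy:
    path: str                  # HDFS namespace prefix this policy covers
    data_centers: list         # which data centers receive the data
    replication_factor: dict   # per-data-center HDFS replication factor
    remote_read: bool = False  # allow WAN reads where data isn't local

policies = [
    Policy("/data/eu_customers", ["eu"], {"eu": 3}),               # stays in the EU
    Policy("/data/shared", ["eu", "us"], {"eu": 3, "us": 2}, True),
    Policy("/tmp", [], {}),                                        # never replicated
]

def policy_for(path: str) -> Policy:
    # Longest matching prefix wins, as in most path-based rule systems.
    matches = [p for p in policies if path.startswith(p.path)]
    return max(matches, key=lambda p: len(p.path))

print(policy_for("/data/eu_customers/2015").data_centers)  # ['eu']
```

The key point is that the rules are declared once, centrally, instead of being baked into a web of scheduled transfer jobs.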

Selective replication across multiple clusters sharing a Nonstop Hadoop HDFS data layer: Replication policies control where subsets of HDFS are replicated, the replication factor in each cluster, and the availability of remote (WAN) reads


Consistent highly available data is really just the starting point for Nonstop Hadoop.  Nonstop Hadoop also gives you powerful tools to control where data resides, how it gets there, and how it’s stored.

By now you’ve probably thought of a problem that selective replication can help you solve.  Give our team of Hadoop experts a call to learn more.

Wildcards in Subversion Authorization

Support for wildcards in Subversion authorization rules has been noticeably lacking for many years.  The use cases for wildcards are numerous and well understood: denying write access to a set of protected file types in all branches, granting access to all sandbox branches in all projects, and so on.

So I’m very pleased to announce that WANdisco is now supporting wildcards for Subversion in our Access Control Plus product.  With this feature you can now easily define path restrictions for Subversion repositories using wildcards.

How does this work given that core Subversion doesn’t support wildcards?  Well, wildcard support is a long-standing feature request in the open source Subversion project, and we picked up word that there was a good design under review.  We asked one of the committers who works for WANdisco to create a patch that we could regression test and ship with our SVN MultiSite Plus and Access Control Plus products until the design lands in the core project.

Besides letting you define rules with wildcards, Access Control Plus does a couple of other clever things.

  • Lets you set a relative priority that impacts the ordering of sections in the AuthZ file.  The order is significant when wildcards are in use, as multiple sections may match the same path.
  • Warns you if two rules may conflict because they affect the same path but have a different priority.
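To see why ordering matters when wildcard rules overlap, here is a toy evaluation of priority-ordered rules (a sketch of the concept, not Subversion’s actual AuthZ semantics):

```python
from fnmatch import fnmatch

# (priority, pattern, access) -- when several patterns match the same
# path, the highest-priority rule wins, which is why the ordering of
# sections in the AuthZ file has to be controlled explicitly.
rules = [
    (20, "/branches/*/secrets/*", "r"),   # protect these paths in all branches
    (10, "/branches/*", "rw"),            # normal branch access
]

def access_for(path: str) -> str:
    matching = [r for r in rules if fnmatch(path, r[1])]
    if not matching:
        return ""                         # no rule matches: no access
    return max(matching, key=lambda r: r[0])[2]

print(access_for("/branches/v2/secrets/key.pem"))  # read-only: 'r'
print(access_for("/branches/v2/src/main.c"))       # full access: 'rw'
```

Without a defined priority, both rules match the first path and the result would depend on arbitrary section ordering – exactly the conflict Access Control Plus warns you about.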

This feature will likely be a lifesaver for Subversion administrators – just contact us and we’ll help you take advantage of it.

Is One Hadoop Cluster Enough?

A new report out from GigaOM analyst Paul Miller provides some insights into the question, is one Hadoop cluster enough for most Big Data needs?  It’s surprising how much attention this topic has garnered recently.  Up until a few months ago I hadn’t really thought that much about why you’d need more than a single cluster.  After all, most of the technical information about Hadoop is geared towards running everything on one cluster, especially since YARN makes it easier to run multiple applications on a single cluster.

But another recent study shows that a majority of Big Data users are running multiple data centers.  The GigaOM report dives into some of the reasons why that might be.  Workload optimization, load balancing, taking advantage of the cloud for affordable burst processing and backups, regulatory concerns – there are a host of reasons that are driving Hadoop adopters toward a logical data lake consisting of several clusters.  And of course there’s also the fact that many Hadoop deployments evolve from a collection of small clusters set up in isolation.

The report also notes that the tools for managing the flow of data between multiple clusters are still rudimentary.  DistCP, which underpins many of the ETL-style tools like Falcon, can be quite slow and error-prone.  If you only need to sync data between clusters once a day it might be ok, but many use cases are demanding near real-time roll-up analysis.

That’s why WANdisco provides active-active replication: Non-stop Hadoop lets your data span clusters and geographies.  In the interest of saving a thousand words:


Interested?  Check out some of the reasons why this architecture is attractive to Hadoop data consumers and operators.

The hunger for low latency big data

Good grief: Spark has barely hit a 1.0 release and already there are several projects vying to improve on it and perhaps be the next big thing.  I think this is another sign that Spark is here to stay – everyone is focusing on how to beat it!  In fact even the Berkeley lab that developed Spark has come up with an alternative that is supposedly a couple orders of magnitude faster than Spark for some types of machine learning.

The bigger lesson here for CIOs and data architects is that your Hadoop infrastructure has to be flexible enough to deploy the latest and greatest tools.  Your ‘customers’ – data scientists, marketers, managers – will keep asking for faster processing time.

Of course here at WANdisco we’ve got some of the best minds in Big Data working on exactly this problem.  Our principal scientists have been working on the innards of Hadoop almost since day one, and they’re evolving our Hadoop products to support very sophisticated deployments.  For instance, Non-stop Hadoop lets you run several Hadoop clusters that share the same HDFS namespace but otherwise operate independently.  That means you can allocate distinct clusters (or carve off part of a cluster) to run dedicated processing pipelines that might require a different hardware or job management profile to support low latency big data operations.

Sound interesting?  It’s a fast-moving field and we’re ready to help!

Hadoop security tiers via cluster zones

The recent cyberattack on Sony’s network was a CIO’s nightmare come true. The Wall Street Journal had a good summary of some of the initial findings and recommendations. One of the important points was that data integration, although a huge win for productivity, increases the exposure from a single security breach.

That started me thinking about the use of isolated Hadoop security tiers in Hadoop clusters. I’m as excited as anyone by the prospect of Hadoop data lakes; in general, the more data you have available, the happier your data scientists will be. When it comes to your most sensitive data, however, it may be worth protecting with greater rigor.

Hadoop security has come a long way in recent releases, with better integration with Kerberos and more powerful role-based controls, but there is no substitute for the protection that comes with isolating sensitive data on a separate cluster.

But how do you do that and still allow privileged users full access to the entire set of data for analysis? Non-stop Hadoop offers the answer: you can share the HDFS namespace across the less secure and more secure clusters, and use selective replication to ensure that the sensitive data never moves into the less secure cluster. The picture below illustrates the concept.


Users on the ‘open’ cluster can only see the generally available data. Users on the ‘secure’ cluster can access all of the data.

Feel free to get in touch if you have questions about how to add this extra layer of defense into your Hadoop infrastructure.

Machine Learning as a Service?

Rick Delgado had an interesting article on how the widespread availability of machine learning will facilitate the rollout of the Internet of Things (IoT).  Intuitively it makes sense; as algorithms become widely understood and field tested, they evolve from black magic to tools in the engineering kit.  You can see this phenomenon in automotive safety technology.  In the mid 1990s I was working on machine vision algorithms for automotive applications.  Everything was new and exciting; there were a few standard theories, but they had barely been tested at any scale and the processing hardware hadn’t caught up to the data demands.  Now as the Wall Street Journal reports, Toyota is making collision-avoidance gadgets standard on almost every new model.  One driver is the reduced price of the cameras and radars, but I think a bigger driver is the trustworthiness of the autonomous vehicle algorithms that can reliably sense a possible collision.

Of course, the IoT is of great interest to us here at WANdisco.  For all of this new streaming data to be useful, it has to be ingested, processed, and used, often at very high speeds.  That’s a challenge for traditional Hadoop architectures – but one that we’re quite prepared to meet.

SmartSVN 8.6.3 General Access Released!

We’re pleased to announce the latest release of SmartSVN, 8.6.3. SmartSVN is the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.6.3 is available immediately for download from our website.

New Features include:

– Show client certificate option in the SSL tab in Preferences

Fixes include:

– Bug reporting now suggests the email address from the license file

For a full list of all improvements and features please see the changelog.


Note for Mac OS X 8.6.2 users: if you installed version 8.6.2 as a new download (rather than autoupdating), you will need to download and reinstall 8.6.3 to stop the master password window from constantly reappearing. You will be required to enter the master password once more after the installation.

Contribute to further enhancements

Many of the issues resolved in SmartSVN were raised on our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

An essential Git plugin for Gerrit

One of the frequent complaints about Gerrit is the esoteric syntax of pushing a change for review:

git push origin HEAD:refs/for/master

Translated, that means to push your current HEAD ref to a remote named origin and to a special review ref (for master).

If you’re a Gerrit user, you need the git-review plugin.

It automates some of the Gerrit syntax so now you can just run:

git review

The only problem is that when you push to a non-Gerrit repository, you start to wonder why your review command doesn’t work anymore.  That’s how deeply ingrained code review is in the Gerrit workflow.
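For reference, the git-review plugin discovers the Gerrit server through a `.gitreview` file committed at the repository root. A minimal sketch – the host, port, and project values here are hypothetical:

```ini
[gerrit]
host=review.example.com
port=29418
project=myproject.git
defaultbranch=master
```

With that file in place, `git review` knows where to push your change for review without any per-developer remote setup.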

Another Top 5 List on Hadoop

Top 5 lists are always fun, and here’s another top 5 list on Hadoop.  It’s fairly familiar to anyone who follows the space, but it does highlight a few important trends.  A few comments and quibbles:

  • The fact that open source is the foundation of Big Data software shouldn’t be surprising even to the government anymore.  After all, even the secretive NSA has publicly acknowledged use of Hadoop.
  • The only controversial claim is that Hadoop is set to replace Enterprise Data Warehouses (EDWs).  I’ve heard a lot of arguments for and against that point over the last year.  It seems that Hadoop will at least complement EDWs and allow them to be used more efficiently, but complete replacement will depend on Hadoop maturing in a couple of key areas.  First, it will have to handle low-latency queries more efficiently.  Second, it will have to be as reliable and flexible as mature EDWs.  Keep an eye on projects like Apache Spark and, of course, Non-stop Hadoop in this area.
  • I agree that the Internet of Things (IoT) will be a new and important source of data for Hadoop in the future.  However, just a point of terminology: no one will “embed Hadoop” into  small devices.  Rather, data from these devices will be streamed into Hadoop.
  • Siri and the other smart assistants like Cortana are making waves, but IBM’s Watson seems to be years ahead in terms of analyzing complex unstructured situations.  Watson does use Hadoop for distributed processing but it has a much different paradigm than traditional MapReduce processing, and it needs to store a good chunk of its data in RAM.  That’s another sign that the brightest future for Hadoop will require new and exciting analytics frameworks.


Binary artifact management in Git

Paul Hammant has an interesting post on whether to check binary artifacts into source control.  Binary artifact management in Git is a question worth revisiting from time to time.

First, a bit of background.  Centralized SCM systems like Subversion and ClearCase are a bit more capable than Git when it comes to handling binary files.  One reason is sheer performance: since a Git repository has a full copy of the entire history, you just don’t want your clone (working copy) to be too big.  Another reason is assembling your working views.  ClearCase and to a lesser extent Subversion give you some nice tools to pick and choose pieces of a really big central repository and assemble the right working copy.  For example in a ClearCase config spec you can specify that you want a certain version of a third party library dependency.  Git on the other hand is pretty much all or nothing; it’s not easy to do a partial clone of a really big master repository.
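One partial mitigation worth knowing about is a shallow clone, which fetches only recent history rather than every version of every file. It doesn’t solve binary management, but it keeps clones smaller. A quick sketch against a throwaway local repository (the paths are illustrative):

```shell
set -e
# Build a tiny throwaway repository with two commits (illustrative only).
workdir=$(mktemp -d)
git init -q "$workdir/big-repo"
git -C "$workdir/big-repo" -c user.email=dev@example.com -c user.name=Dev \
    commit -q --allow-empty -m "first"
git -C "$workdir/big-repo" -c user.email=dev@example.com -c user.name=Dev \
    commit -q --allow-empty -m "second"

# --depth 1 fetches only the tip commit. Note the file:// URL: for plain
# local-path clones git ignores --depth.
git clone -q --depth 1 "file://$workdir/big-repo" "$workdir/shallow"

# The clone records its history cut-off in .git/shallow, and its log
# contains only the tip commit.
git -C "$workdir/shallow" rev-list --count HEAD
```

The trade-off is that a shallow clone can’t see or recreate older history, so it suits build machines better than day-to-day development.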

Meanwhile, there has been a trend in development toward more formal build and artifact management systems.  You can define a dependency graph in a tool like Maven and use Maven, Artifactory, or even Jenkins to manage artifacts.  Along with offering benefits like not storing derived objects in source control, this trend covered Git’s weak spot in handling binaries.

Now I’m not entirely sure about Paul’s reasons for recommending a switch back to managing binaries in Git.  Personally I prefer to properly capture dependencies in a configuration file like Maven’s POM, as I can exercise proper change control over that file.  The odd thing about SCM working view definitions like config specs is that they aren’t strongly versioned like source code files are.

That being said, you may prefer to store binaries in source control, or you may have binaries that are actually source artifacts (like graphics or multimedia for game development).  So is it hopeless with Git?

Not quite.  There are a couple of options worth looking at.  First, you could try out one of the Git extensions like git-annex or git-media.  These have been around a long time and work well in some use cases.  However they do require extra configuration and changes to the way you work.

Another interesting option is shared back-end storage for cloned repositories.  Most Git repository management solutions that offer forks use these techniques to make efficient use of back-end storage.  If you can accept working on shared development infrastructure rather than your own workstation, you can clone a Git repository from a local path with the -s option to share the object store.  There’s also the --reference option, which points a new clone at an existing object store.  These options make cloning relatively fast, as you don’t have to copy the large objects.  They don’t alleviate the pain of having the checked-out files in your clone directory, but if you’re working on a powerful server that may be acceptable.  The bigger drawback of this local access is the lack of access control.
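To make those two options concrete, here’s a sketch against a throwaway local repository (the paths are illustrative; it assumes git is on the PATH):

```shell
set -e
# Create a stand-in for the big central repository (illustrative only).
workdir=$(mktemp -d)
git init -q "$workdir/central"
git -C "$workdir/central" -c user.email=dev@example.com -c user.name=Dev \
    commit -q --allow-empty -m "initial"

# -s (--shared): instead of copying objects, the new clone's
# .git/objects/info/alternates file points back at the source object store.
git clone -q -s "$workdir/central" "$workdir/shared-clone"

# --reference: borrow objects from an existing local repository; objects
# found there are not copied into the new clone.
git clone -q --reference "$workdir/central" "$workdir/central" "$workdir/ref-clone"

# Both clones now list the central object store as an alternate.
cat "$workdir/shared-clone/.git/objects/info/alternates"
cat "$workdir/ref-clone/.git/objects/info/alternates"
```

One caution: because these clones depend on the source object store, deleting or aggressively pruning the central repository can corrupt them – the git clone documentation warns about exactly this for both options.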

Management of large binaries is still an unsolved problem in the Git community.  There are effective alternatives and workarounds, but it’ll be interesting to see if anyone tries to solve the problem more systematically.

SmartSVN 8.6.2 General Access Now Available

We’re pleased to announce the latest release of SmartSVN, 8.6.2. SmartSVN is the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.6.2 is available immediately for download from our website.

New Features include:

– Support for Mac OS X 10.10 Yosemite

Fixes include:

– Issue with log and graphing when no cache is created

For a full list of all improvements and features please see the changelog.

Contribute to further enhancements

Many of the issues resolved in SmartSVN were raised on our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Starting at WANdisco: Gordon Vaughan, SDM

Hello world.

So, I’ve been asked to write a blog about my experience of starting at WANdisco. It was only 5 weeks ago, but it still feels a bit weird to write about it because it simultaneously feels like yesterday and a year ago, in equally positive measure. I’ll try to give an idea of why that is, and why I’m happy that I chose WANdisco as the next step in my career.

With my previous employer, I’d had a brilliant time for around 3-4 years; working my way up, gaining experience, pushing myself to go above and beyond every day. It was fantastic. Then, after a great run, things started to slow. The business got quite staid, opportunities to learn dried up and instead of progressing we were living in a perpetual ‘firefighting’ limbo. At the same time, my employer was owned by a larger organisation that was gradually, but perceptibly, making changes that impacted on the way our business performed. I’m sure many readers will have seen their employer go through similar absorption, and felt the tremors themselves first hand.

After a couple of years of stagnant career progress, albeit in a comfortable and fairly happy setting, an opportunity was pointed out to me at WANdisco.

It’s important at this point that I make something clear: I am not a technical expert. I’m one of those people that complete novices think are magical because I know how to use Google. An initial Googling of WANdisco could seemingly rule you out, because they talk in confident terms about their MultiSite products, enabling active-active replication of development environments across the globe at LAN speed with… Nope, I’m lost again… When I stepped away from Google and thought in isolation about what it was they were saying, it made a lot more sense. A change management system that runs globally as fast as locally, and that’s the same wherever you access it from. We forget sometimes that massive files take ages to download across large geographic distances, and if that’s happening all the time, how much time is lost waiting for updates? Add to that the fact that MultiSite, by its very nature, means having multiple copies, and you also get effective disaster recovery. I suddenly found myself interested.

Have to admit that Big Data was the product that made me really excited. Some of the stats around production of data are mind-blowing. By the time you have read this far down the page, it’s likely the amount of data globally recorded outstrips anything from the early 90s back to the beginning of time. All that data needs to go somewhere and it’s probably all usable, but how? I mean, physically, how? I saw a video by David Richards, the man who started WANdisco, explaining that Big Data had been used in the automotive industry to accurately predict the failure rate of components on cars to make pro-active repair possible. The video went on to mention how that could apply to healthcare, and then that wave of realisation hit. Big Data could well be the biggest thing to happen to this world since the Internet itself. How *amazing* would it be to help our customers build and shape that product to their own specification? Notice the ‘our’ in that sentence – I was already on board in my mind 🙂

After polishing my CV, having a shave and a haircut and all the other prep you would normally do for an interview, that ‘our’ became a reality 6 weeks later.

The role I fulfil is that of Service Delivery Manager. In title, that meant doing exactly the same thing as I did in my old workplace. In reality, it was everything that role should have been, and more besides. We perform quarterly service reviews with our customers, whether they have needed our support team or not, to talk to them about how we’re doing from a global support perspective, how the product is working for them, if there are any challenges or changes coming up, etc. That’s a mandated part of the service and not a nice-to-have – unless of course the customer chooses not to have them! What’s key is that we’re always talking to our customers, always looking for the next hurdle before it hits us, always being open and honest about our performance. It’s that approach that we believe will provide us the valuable intelligence we need to keep evolving, and showing our customers that we’re listening and adapting constantly to their needs.

The thought of having these kinds of conversations with customers without product knowledge was, frankly, terrifying. Thankfully, WANdisco had a full induction plan in place to ensure I had a full day’s worth of training across Subversion, MultiSite and Big Data to get the basics, and since then it’s been topped up by more in-depth sessions, particularly on Big Data. What I think is brilliant about the industry we’re in is that a lot of the software and processes we work with are open source, and there’s a wealth of information available on them. It’s not like the textbook models of old; it’s seminars, product demonstrations, lectures and other learning tools presented in engaging formats across the internet. YouTube has been a fantastic resource for learning; where previously I’d used it solely for watching Nanners and Sips playing various games, now I find myself lost in hours of concepts and theories that are still sinking in. It’s the diversity, yet relevance, of the information available to you that simply boggles the mind, and it’s all so new and rapidly changing that it’s compelling. WANdisco provide a good proportion of that content, either themselves or via expos and conferences, which really makes you feel like you’re part of an important player in the community.

Of course, it’s very early days for me in learning, and there’s a strong chance that I’ll never have the knowledge that some of the people around the business hold. I wouldn’t have it any other way though; I love that we have so many brilliant minds across multiple sites. The culture within WANdisco is very similar to that of the open source community as a whole, in that we share, we collaborate, we discuss, and everyone learns. Everyone is approachable, and you can bet if the first person you speak to doesn’t have the answer, they will be able to walk you over to someone who does. In my role it’s vital that I have access to that knowledge quickly and easily, so it’s fantastic to have that ‘resource’ so accessible.

At this point I need to confess something: it’s now 13 weeks since I started, and it’s taken me 8 weeks to write this because I’ve been so busy. I’ve loved every second of it, and I love the fact that when I see a clock say 4pm I now think ‘where has the day gone?’ instead of ‘oh no, there’s still 2 hours left…’ There aren’t enough hours in the day, genuinely.

I’ll sign off there, but if you’re looking at WANdisco as a potential employer, or even if you think you’re happy where you are but find yourself reading this for some bizarre reason, do take a look at our careers site. It’s a great place to work, a great place to learn, and simply a great place to be.

WANdisco Engineering Offsite 2014

Hello from Belfast!

I’ve been enjoying a quick visit to Belfast this week to participate in WANdisco’s engineering offsite meeting. WANdisco has engineering offices in California, England, Northern Ireland, and India, and it’s really a pleasure to work with great people around the world. Belfast is also a terrific city to visit, with an amazing local food scene and a fun downtown area.

WANdisco is a fast-paced company and it’s always interesting to take a breath and catch up with colleagues that you normally only see on video conferencing. We’ve achieved an amazing amount in the past year, launching two new products (Access Control Plus and Gerrit integration for Git MultiSite) with a few more in various stages of work. Every WANdisco office has people with different viewpoints and skill sets, but keeping the communication channels open requires an investment in keeping in touch. Of course from one perspective it’s really easy: we use our MultiSite products internally, so sharing source code is dead simple…

Anyway, our batteries are recharged, we’ve got a plan for the rest of this year going into 2015, and we’re going to continue to deliver products that solve tough problems and delight our customers. That’s all for now – someone said there were pubs in Ireland, so I’m off to explore!


SmartSVN 7.6.4 With SSL Fix Available Now

We’re pleased to announce the release of SmartSVN 7.6.4, available for Mac, Windows and Linux. SmartSVN is available for immediate download from our website.

This is an update to the older version of SmartSVN, based on SVNKit rather than JavaHL. The update includes a fix for the POODLE vulnerability that affects SSLv3.

For a full list of changes please see the changelog.

If you’ve any requests or feedback please drop a post into our dedicated SVN Forum.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Thoughts on Hadoop architecture

Gartner just released a new research note comparing Hadoop distributions.  Although the note itself is behind a paywall, some of the key findings are posted openly.  And I find it very interesting that when Gartner shares its thoughts on Hadoop architecture and distributions, it tends to focus much more on the big picture of how to design the best Hadoop deployment for your business.

The item that stood out most was the finding that Hadoop is becoming the default cluster management solution.  YARN really changed the focus of Hadoop from a batch processing system to a general purpose platform for large scale data management and computation.  The Hadoop ecosystem is evolving so quickly that it can be frightening, but you do get some ‘future proofing’ as well – whenever the next big thing comes along, chances are it will run on Hadoop, just like Spark does.

On a related note, Gartner also recommends focusing on your ideal architecture rather than on the nuts-and-bolts of any particular distribution.  That’s just good sense; if you know what you want to do with your data, chances are Hadoop is now mature enough to accommodate you.  And of course, WANdisco provides some clever solutions to help all of those Hadoop clusters work better together.

Anyway, the research note is a nice read, particularly if you’re feeling overwhelmed by how complicated Hadoop is getting.

Solving the 3 biggest Hadoop challenges

A colleague recently pointed me to this great article on the 3 biggest Hadoop challenges. The article is written by Sean Suchter, the CEO of Pepperdata, and offers a practical perspective on how these challenges are seen and managed through workarounds.

Ultimately, none of those workarounds are very satisfactory. Fortunately, Non-Stop Hadoop offers a compelling way to solve these challenges, either in whole or in part.

Resource contention due to mixed workloads and multi-tenancy environments

This problem seems to be the biggest driver of Hadoop challenges. Of the many workarounds Suchter discusses, all seem either manually intensive (tweaking Hadoop parameters for better performance) or limiting from a business perspective (gating production jobs or designing workflows to avoid bottlenecks).

As I’ve written before, the concept of a logical data lake with a unified HDFS namespace largely overcomes this challenge. Non-Stop Hadoop lets you set up multiple clusters at one or several locations, all sharing the same data – unless you choose to restrict the sharing through selective replication. Now you can run jobs on the most appropriate cluster (e.g. using high-memory nodes for in-memory processing) and avoid the worst of the resource contention.

Difficult troubleshooting

We all know the feeling of being under the gun while an important production system is offline. While the Hadoop ecosystem will surely mature in the coming years, Non-Stop Hadoop gives you built-in redundancy. Lose a NameNode? You’ve got 8 more. The whole cluster is shot? You’ve got two others that can fill in the gap…immediately.

Inefficient use of hardware

It’s really a tough problem: you need enough hardware to handle peak bursts of activity, but then a lot of it will sit idle during non-peak times. Non-Stop Hadoop gives you a clever solution: put your backup cluster to work. The backup cluster is effectively just an extension of the primary cluster when you use Non-Stop Hadoop. Point some jobs at the second cluster during periods of peak workload and you’ll have easy load balancing.

To borrow an analogy from the electric power industry, do you want to maintain expensive and inefficient peaker units for the two hours when the air-conditioning load is straining the grid? Or do you want to invest in distributed power setups like solar, wind, and neighborhood generation?

A better Hadoop

Non-Stop Hadoop is Hadoop…just better. Let’s solve your problems together.

GitLab and Git MultiSite: Architecture

The architecture of GitLab running with Git MultiSite is worth exploring.  In the interest of saving a thousand words, here’s the picture.


As you can see, the topology is quite a bit more complex when you use a Git repository management system that uses multiple data stores.  Git MultiSite coordinates with GitLab to replicate all repository activity, including wiki repositories.  Git MultiSite also replicates some important files like the GitLab authorization files for access control.

As for the other data stores, we’re relying on GitLab’s ability to run with multiple web apps connected to a single logical relational database and a single logical Redis database.  They can be connected directly or via pass-through mirrors.  Kudos to the GitLab team for a clean architecture that facilitates this multi-master setup; they’ve avoided some of the nasty caching issues that other applications encounter.  This topology is in fact similar to what you can do with GitLab when you use shared storage for the repositories.  Git MultiSite provides the missing link: full repository replication with robust performance in a WAN environment and a shared-nothing architecture.

Short of relying completely on Git as a data store for code reviews and other metadata, this architecture is about as clean as it gets.

Now for some nuts and bolts…

We are making some simplifying assumptions for the first release of GitLab integration.  The biggest assumption is that all nodes run all the software, and that all repositories originate in GitLab and exist on all nodes.  We plan to relax some of these constraints in the future.

And what about performance?  Well, I’m happy to relate that you’ll see very good performance in all cases and much improved performance in some cases.  Balancing repository activity across several nodes gives better throughput when the system is under practical load.


Well, that picture saved a few words, but nothing speaks better than a demo or a proof-of-concept deployment.  Contact us for details!


Scalable Social Coding

I’m very pleased to announce that Git MultiSite now formally supports GitLab, a leading on-premise Git collaboration and management suite.  With this and future integrations, Git MultiSite’s promise of a truly distributed Git solution is coming to fruition.

WANdisco first announced Git MultiSite in 2013.  Git MultiSite provides our patented active-active replication for Git, giving you a deployment of fully writable peer nodes instead of a single ‘master’ Git server.  The next step came with Access Control Plus in 2014, which brought Git repositories under a unified security and management umbrella.

And now we’re tackling the final piece of the puzzle.  Those of you active in the Git ecosystem know that most companies deploy Git as part of an integrated repository management solution that also provides social coding and collaboration tools — code review, wikis, and sometimes lightweight issue tracking.

In one sense, Git MultiSite is still a foundational technology that can replicate Git repositories managed by almost any system.  And indeed we do have customers who deployed Git MultiSite with GitLab long before we did any extra work in this area.

The devil is in the details though.  For one thing, some code review systems actually modify a Git repository using non-standard techniques in response to code review activity like approving a merge request.  So we had to make a few under-the-hood modifications to support that workflow.

Perhaps more importantly, Git MultiSite and Access Control Plus provide consistent (and writable) access to repository and access control data at all sites.  But if the collaboration tool is a key part of the workflow, you really need that portal to be available at every node as well.  And we’ve worked hard with the GitLab crew to make that possible.

So what does that all mean?  You get it all:

  • LAN speed access to repositories at every site
  • A built-in HA/DR strategy for zero down time
  • Easy scalability for build automation or a larger user base
  • Fast access to the GitLab UI for code reviews and more at every site
  • Consistent access control at every site
  • All backed by WANdisco’s premier support options

Interested?  I’ll be publishing more details on the integration in the near future.  In the meantime, give us a call and we’ll give you a full briefing.


Advanced Gerrit Workflows

As a final note on Gerrit workflows, it’s worth looking into Gerrit’s Prolog engine if you need a customized code approval process.  Now, I know what you’re thinking – do you really need to learn Prolog to use Gerrit?  Certainly not!  You can use Gerrit out of the box very effectively.  But if you need a highly tailored workflow, you can either write a Java plugin or write some rules in Prolog.  The Prolog syntax is well suited for logical expressions, and you can check the Prolog rules in to a Gerrit repo as regular text files.  That’s easier than writing, building, and maintaining a Java plugin.

So what can you do with Prolog?  Two very useful things:


  • Submit rules define when a change can be submitted.  The default is to require one vote at the highest level of each label category, with no votes at the lowest level (a veto) in any category.  A common choice is to require a human ‘+2’ and a ‘+1’ from the CI system.  Submit rules can be defined globally or per project.  Submit rules are given a set of facts about a commit (author, message, and so on) and then decide whether the commit can be submitted.
  • Submit types define how a change can be submitted, per project.  You can choose from fast forward only, merge if necessary, merge always, cherry pick, or rebase if necessary.


There’s a great Gerrit Prolog workbook to get you started, and Gerrit provides a Prolog shell and debugging environment.

As a simple example, here’s a submit type that only allows fast-forward updates on release branches, but allows other submit types on other branches.

submit_type(fast_forward_only) :-
    gerrit:change_branch(B),
    regex_matches('refs/heads/release.*', B),
    !.
submit_type(T) :- gerrit:project_default_submit_type(T).

Hacking Prolog is not for the brand-new-to-Gerrit, but don’t be scared of it either.  It gives you a tremendous amount of control over how changes flow into your repositories.  If you store configuration data in Git and are subject to PCI regulations or other compliance measures, then a strong Gerrit workflow explicitly defined in Prolog will help satisfy your compliance concerns.

As always if you have any questions just ask.  We have a team of Git experts waiting to help.


Gerrit Administration

So far I’ve been talking a lot about Gerrit’s strong points. Now it’s time to focus on one of Gerrit’s comparative weak points: administration. Gerrit has all the tools you need to run a stable and secure deployment, but you need to be a master mechanic, not a weekend hobbyist.

Although Gerrit has an easy ‘quick start’ mode that’s great for trying it out, you need to do some research before running it in a production environment. Here are some areas that will need attention.

User Management

Gerrit supports several authentication mechanisms. The default is OpenID, which is suitable for open source projects or for enterprise environments that have an internal OpenID provider. Other sites will want to look at using LDAP, Active Directory, or possibly Apache-based HTTP authentication. Similarly, you can maintain groups internally or via an external directory.


Repository Access

Gerrit can serve Git repositories over SSH or HTTP/S. SSH is a convenient way to start for small teams, as each user can upload a public key. However, maintaining SSH keys for a large user base is cumbersome, and for large deployments we recommend serving over HTTP/S.

Of course you should use HTTPS to secure both the Gerrit UI and the repositories.


Access Control

Gerrit has a robust access control system built in. You set permissions in a hierarchy, with global defaults set in the ‘All-Projects’ project. You can set up other project templates and have new projects inherit from the template of your choice.

You can manage permissions on:

  • Branches and tags
  • Change sets uploaded for review
  • Configuration including access control settings and submit rules
  • Code review workflow steps including approving and verifying changes
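As an illustration, these permissions live in a project.config file on the special refs/meta/config branch of each project. A sketch with hypothetical group names:

```ini
[access "refs/heads/*"]
    read = group Developers
    push = group Developers
    label-Code-Review = -2..+2 group Senior Developers
    label-Verified = -1..+1 group CI Servers
[access "refs/heads/release/*"]
    push = block group Developers
    submit = group Release Managers
```

Because the configuration is itself versioned in Git, changes to access control can go through the same review workflow as code.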


Integrations

You’ll want to hook up your build system to Gerrit to make best use of its workflow. (The build system can vote on whether to accept a change.) Similarly, you might want to integrate an external ticket system or wiki.


Mirroring

I’ll cover this topic in more detail later on. But for now I’ll mention that you should have mirrors available at each location to provide the best performance. If you need Gerrit to enforce access control on the mirrors, then you’ll need to run Gerrit in slave mode against a database mirror.

Sound complicated? It is. That’s why WANdisco provides Git MultiSite for Gerrit. You’ll get active-active fully replicated and writable repositories at each site, with regular Gerrit access control enforced.

Need help?

Call our Git support specialists if you need a hand getting started with Gerrit.

Unlimited Holidays? Old news to us!

Don’t get me wrong. It’s a great idea, though it also looks a bit like an attempt to sell a book – but this is Sir Richard Branson, a very smart and exceedingly canny man, who I believe has pledged to never undertake any task in life if he can’t make any money from it. This may sound mercenary but to my knowledge Sir Richard has never done so at the expense of or by stepping on other people. Which is nice.
Anyway, holidays. To all of us here at WANdisco, this kind of thing is old news. I’m lucky enough to work for a company that adopted the same policy a couple of years ago and I tell you what – it’s liberating, is probably the best word. I realise it may not work for every individual, but to know that you’re trusted to do your job and to know enough about what your colleagues are doing and what projects are on the go and to plan your holidays around that is something special.
Much like Netflix, we’ve found that treating people like grown-ups works. If you’re forced to report weekly, daily, or in some cases even hourly on what you’re doing, and have to put your hand up to ask if you can use the bathroom, do you feel trusted? It’s a weird feeling to have been out of school for several years and then find yourself in an environment that’s not much different. No one wants to feel like just a number, and policies (or the lack of them!) such as these have a big impact on working life.
A common question when people announce this sort of thing is ‘won’t the office just be empty all the time?’ Here at WANdisco we found that not to be the case; in actual fact, the last time we crunched the numbers we had to go out and ensure people took their statutory minimum holiday entitlement… in addition to the 8 bank holidays. All of us appreciate the fact that we’re given the choice to take holiday when we need it, but for the most part we love coming to work.
It may not be the sort of thing that could work at your company, but if you want to engender satisfaction and loyalty in your workforce and if you want them to be proud of the company they work for, it’s certainly worth considering.

Starting at WANdisco (part 2)

Part 1

So, that was the majority of my first few months. I can do forums, blogging, writing, all that – that’s fine, but I needed to learn Subversion and Git because I need to be able to answer forum posts helping people use them and, arguably more importantly, to replicate issues that are raised and report them to the developers.

Just as an aside here, this is a fairly important place to be – in between the devs and the customers, understanding the language of both sides and translating from one to the other. I find it extremely satisfying and quite often have a lot of fun with it.

*ahem* Training, then. Subversion was where I started, which was probably the smart move as it’s a fair bit simpler than Git, though arguably not as good, depending on your point of view. My understanding is that Subversion was written so that non-coders could have some control over what code gets committed, whereas Git was written by coders for coders, with the things that coders want in it – hence it’s significantly more complex.

It was good training, in fact the same training that our support engineers are given when they start (and our support engineers are incredible guys). It taught me a lot about Subversion – especially because it was written for Windows and TortoiseSVN, and I was following it using SmartSVN on a mixture of OSX and Linux. In all honesty I can totally recommend that approach: you learn so much more by applying instructions written for one tool to another that’s similar in a lot of ways but fundamentally different in others.

The svn command line stuff is all the same no matter the operating system – you’re giving commands to the program, so they’re the same whatever platform it’s running on. It’s when you get to the GUI stuff that things are different. TortoiseSVN is not the same as SmartSVN, and when your instructions are to view the repository log or even find the graph version of the log with helpful screenshots of a totally different application there’s a lot of looking up in help files and googling.

And as for setting up a server… well. Windows may have its share of detractors, but tickboxes for ‘start SVN server on startup’ and ‘install as a service’ basically take care of everything you need to worry about for a standard setup, so it’s hard to argue. It took several VMs before I had a working SVN-over-HTTP server running on Linux and (yes, noob, I know) several more before I had one that would still be working after a reboot.

Git… now that took a little longer. The training was a bit more in depth, but also very good in that it’s basically a list of tasks – achievements, if you will – and fairly vague ones at that. It also included a list of resources, although I mostly used the Git book, which is invaluable, seriously. I’m not sure if it works this way for everyone but I certainly remember a lot more when I’ve had to figure something out for myself.

For example, “Rebase to edit a commit message”. That was it, that’s the full instruction. Not the first one in the training though, so at least when I came to it I definitely knew what committing was and why it would have a message. Rebasing I had to read up on. As I said though, fortunately for me and indeed everyone else, the Git book is brilliant.
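For the curious, that exercise boils down to something like the following. This is just a sketch in a throwaway repository – the paths, names, and messages are all made up:

```shell
#!/bin/sh
set -e
# Throwaway practice repo – every name here is made up
dir=$(mktemp -d) && cd "$dir"
git init -q
git config user.email "you@example.com" && git config user.name "You"
echo one > notes.txt && git add notes.txt
git commit -qm "Initial comit"   # oops – typo in the message

# Reword the most recent commit. For an older commit you'd run
# 'git rebase -i HEAD~N' and mark the offending commit 'reword'.
git commit --amend -qm "Initial commit"
git log --oneline -1
```

The amend shortcut only works on the latest commit; the interactive rebase is what the training task was really after, since it lets you reword anything in recent history.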

So I learned a lot about Git, Gitlab and Gitosis, and in the process a fair bit about Ubuntu and CentOS as well (I’d used Ubuntu before – in fact it runs my home server), and came to the conclusion that I like both of them, even though installing and configuring Git over HTTP on Linux is not the easiest thing I’ve ever done. Throw something like Gitolite with its dependency on Ruby into the mix and you may well spend a fair amount of time following installation guides.

So, training done, let’s hit the forums – but not literally, because that would be silly. Forums are the lifeblood of my job and the hub of the WANdisco community, which I am here to help and grow as much as possible – we’ll leave aside the occasional urge to lmgtfy (which I’ve managed to resist so far).

At the moment things are fairly quiet, but the spam is cleaned out daily (I make sure of that), so that’s improving things, and now we have someone in there during (UK) business hours as well. At present things aren’t busy enough to warrant them being looked at outside those times, though I’m sure some of our guys in the US look through them from time to time. But if I have my way (and I fully intend to), it’ll get a lot busier.

So, how?

Well, in the first instance, by cleaning up the spam and being present in the forums. Then getting the word out. Social media is very powerful, but I think our best strategy is to be as knowledgeable and helpful as possible. The more that happens and the more we get out there and help with stuff, the more word will spread. Along with our own forums for Subversion, Git and Hadoop there’s StackOverflow and LinkedIn for the more technical queries; Facebook, and to some degree LinkedIn again, for less techy, more human stories; and Twitter to tie things together, point out new articles and forum threads and, with any luck, engage in some banter as well.

The blogs, then – release blogs usually, for a new version of one of our products, but if something interesting happens then we like to talk about it, so we do. Hence this, and other blogs you’ll see shortly. We want to talk about what we’re doing a bit more in the office, whether it’s related to Big Data, improving our working environment, or just plain having fun.

So that’s it, really. Hopefully this has given you some insight into my journey and an idea of what we’re hoping to accomplish in the near future. Beyond that? I dunno. World domination might be nice.


You can find me on the above forums, on Twitter as @WANdisco_Matt, or there’s always my LinkedIn page – give me a shout if I can help with anything, and cheers for reading this far 🙂

Gerrit Workflow

As I mentioned in an earlier post, Gerrit has a unique workflow.  It has some similarities to pull and merge request models, but is more flexible and more automated.  That goes back to its roots in the Android ecosystem; at the scale of work in that community, bottlenecks need to be few and far between.


Gerrit’s model is unique in a couple of ways:

  • By default all changes are put into temporary pending review branches, which are created automatically on push.
  • The workflow engine enforces rules before changes can be merged to the permanent repository.  Notably, you can require human code review, automated build and test, or both, and use the access control system to specify who’s allowed to perform various steps in the workflow.
  • Review IDs are generated automatically via hooks and group successive sets of patches for the same change.  Additional patches can be provided for rework based on the result of a review.
  • Gerrit’s Prolog engine can be used to create customized review approval conditions.

Gerrit’s workflow engine is well tuned for ‘continuous review’, which means that commits can be reviewed rapidly and merged into trunk (master or mainline) very quickly.  In some cases only certain commits would be manually reviewed, while all commits would be subject to automated build and test.  Gerrit is thus a good choice for large scale deployments that want to move towards continuous delivery practices.
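Concretely, the push-for-review half of that flow looks something like the sketch below, assuming an SSH-based Gerrit remote named origin; the hostname, username, and branch are placeholders:

```shell
# One-time setup: install the commit-msg hook so each commit gets a
# Change-Id line (29418 is Gerrit's conventional SSH port)
scp -p -P 29418 you@gerrit.example.com:hooks/commit-msg .git/hooks/

# Step 1: push the change for review. Gerrit automatically creates a
# pending-review branch when you push to the magic refs/for/ namespace.
git push origin HEAD:refs/for/master

# Rework after review: amend the commit (preserving its Change-Id) and
# push again to add a new patch set to the same review
git commit --amend
git push origin HEAD:refs/for/master

# Step 2: once review and verification pass, Gerrit merges the change
# to master
```

The Change-Id inserted by the hook is what lets Gerrit group successive patch sets under one review.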


Starting at WANdisco

9 years. I hadn’t expected to last at a job that long, but then I’d never had a job that felt like a career before. Unfortunately, it stopped feeling like a career and went back to being a job, so when a new opportunity knocked I answered with ebullience.

I’d been working in various customer forums and social media for the past half decade or so, and the opportunity to become Communications Lead for WANdisco was quite simply far too good to pass up.

So, that was it… off I went. It’s surprisingly easy to change jobs, in spite of how difficult it seems. Bear in mind if you’re thinking similarly, it’s the change that we fear and it’s nothing to be scared of. It’s a good thing. Chances are it’s what you need, especially if you feel bogged down and like you aren’t going anywhere. As an aside, if you like what you’re reading and think we’d be a good fit for you we are recruiting at the moment – why not check out the posts we have on offer at

Having said that though, moving from ISP support (essentially) to supporting version control systems is a fair leap and has involved an awful lot of learning. This also has been a good thing.

So, the runup to the change. Some clandestine emails (from a personal account of course), an after work visit to the new office for a chat, and finally the handing in of notice, which was kind of satisfying but mostly…melancholy, I think is the best word for it, though it wasn’t unpleasant. After sorting out the remainder of my holidays and arranging for a week off in between jobs (heartily recommended and well enjoyed), the first day dawned.

office panorama

Apologies for potatocam. Panorama shots are like that sometimes.

As luck would have it, a few others had trodden the path I was soon to walk so I wasn’t heading into a strange place filled with new people – several of them I’d worked with before which certainly helped tamp down the first day nerves. I even found myself sat next to a friend I’d had since secondary school, which was an interesting experience – we spent more than a few classes sat next to each other and while we hadn’t done the same in, oh, twenty years or so, it felt eerily familiar. Fortunately we were both professional enough to not let things interfere with the work that has to be done.

The other people I didn’t know? Lovely, lovely people. All of them. Especially the content team (but then I’m biased, and also a part of that team. Coinkydink? Decide for yourself). I feel like I fitted in well and nothing has happened to make me think otherwise so I’m going to assume the feeling is mutual or at least not totally opposite.

And the coffee? Oh my, the coffee.


The coffee machine says ‘COFFEE READY’. The sticker says ‘GLADIATOR READY’.

Never underestimate the power of a decent coffee machine. You’ll save so much money, at least you will if you like coffee. It’s what, £4 for a decent sized cost-bucks? Twice a day for some people, especially in the IT industry. There’s also pool and ping pong if you like that sort of thing, which I do. So that’s nice.

pool and wiff

Also bike parking and meeting rooms with panoramic floor to ceiling windows (not pictured).

So that’s the people and the office, summed up in a couple of paragraphs. I could go on, but I don’t think that would be the best thing in the world, so I will move on to training and learning and working which are all things that happen in the world of jobs.

To start, version control. I’ve not written code. I’ve tinkered, and could – with much messing about and no small amount of internet searching – probably hack existing code with copy and pasted bits of other code in order to get it doing what I want it to do. I realise that’s how most hackers get started, and I enjoy it, but I’ve not done enough to actually learn code. I could explain an array or a variable, but I couldn’t write one without googling.

Consequently, I had never used version control. It sounds simple enough, right? Keep a copy of this code; if someone makes a change, remember both how it was and how it is now, and give each change a sequential reference.
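That description maps almost one-to-one onto the basic commands. A toy sketch in a throwaway directory – Git shown here, though it’s Subversion that numbers each change sequentially (Git uses content hashes):

```shell
#!/bin/sh
set -e
# Throwaway repo – every name here is made up
dir=$(mktemp -d) && cd "$dir"
git init -q
git config user.email "you@example.com" && git config user.name "You"

# Keep a copy of this code...
echo 'print("hello")' > hello.py
git add hello.py && git commit -qm "First version"

# ...if someone makes a change, remember how it was and how it is now...
echo 'print("hello, world")' > hello.py
git commit -qam "Second version"

# ...and give each change a reference
git log --oneline     # two commits, newest first
git diff HEAD~1 HEAD  # exactly what changed between them
```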

Now, let’s scale that up.

But… but we can’t. You have a repository server, which clients connect to and commit code. How can you scale that?

Well, you have more than one repository server.


What do you mean, ‘Eh’? More than one. Many. Many servers, for many many many clients.


Well, indeed. How do those servers know about each other? How do they know when a client has connected and added more changes and files, and how do they talk to each other to make sure there aren’t conflicts and that changes aren’t missed?

That’s what we do. We sell software (and support for said software) that guarantees 100% uptime for distributed version control systems. We have a number of large clients with big names, too. (Oooh, get me.) We also do training, which is lucky as (to finally close this rapidly expanding circle of text) I needed some.



Part two to follow in a week or so. You can find me on our forums, on Twitter as @WANdisco_Matt, or there’s always my LinkedIn page – give me a shout if I can help with anything, and cheers for reading this far 🙂


Gerrit Scalability

As a fundamental part of the Android Open Source Project (AOSP), Gerrit has to support a large user base and a big data set.  In this article I’ll review Gerrit scalability from both a performance and operational standpoint.

Operational Scalability

Let’s start with operational tasks:

  • Managing users.  Gerrit provides integration with most common enterprise authentication solutions including LDAP and Active Directory, so the Gerrit administrator should not have to worry much about user management.
  • Managing permissions.  Gerrit has a rich set of permissions that govern operations on code, code reviews, and internal Gerrit data.  The permission model is hierarchical, with any project able to inherit permissions from a parent project.  As long as the Gerrit administrator has set up sensible top level defaults, individual team leads can override the settings as necessary and permission management should be easy on a large scale.  The only potential wrinkle comes when Gerrit mirrors are used.  Unless you run the Gerrit UI in slave mode at every site, the mirrors will not have Gerrit access control applied.
  • Auditing.  Gerrit does not provide auditing, so this area can be a challenge.  You may have to set up your own tools to watch SSH and Apache logs as well as Gerrit logs.
  • Monitoring performance.  As a Gerrit administrator you’ll have to set up your own monitoring system using tools like Nagios and Graphite.  You should keep a particular eye on file system size growth, RAM usage, and CPU usage.
  • Monitoring mirrors.  Like most Git mirrors, a Gerrit mirror (as provided by the Gerrit replication plugin) is somewhat fragile.  There’s no automated way to detect if a Gerrit mirror is out of sync, unless you monitor the logs for replication failures (or your users start to complain that their local mirror is out of date).
  • HA/DR.  Gerrit has no HA/DR solution built-in.  Most deployments make use of mirrors for the repositories and database to support a manual failover strategy.

If you use Git MultiSite with Gerrit, those last two points will be largely addressed.  Git MultiSite nodes are self-healing in the case of temporary failure, and the Git MultiSite console will let you know about nodes that are down or transactions that have failed to replicate due to network issues.  And similarly, as we’ll see in the next section, Git MultiSite gives you a 100% uptime solution with automated failover out of the box.
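On that mirror-monitoring gap: until you have something like MultiSite in place, one low-tech check is to compare branch heads on the master and the mirror. A sketch using two local repositories to stand in for a master and a stale mirror (all names are made up):

```shell
#!/bin/sh
set -e
# Two local repos stand in for a Gerrit master and its mirror
work=$(mktemp -d) && cd "$work"
git init -q --bare master.git
git clone -q --mirror "$work/master.git" mirror.git

# A commit lands on the master that the mirror never replicates
git clone -q "$work/master.git" wc
cd wc
git symbolic-ref HEAD refs/heads/master
git config user.email "you@example.com" && git config user.name "You"
echo x > f && git add f && git commit -qm "new work"
git push -q origin master
cd "$work"

# The check an admin might run from cron: do the heads match?
m1=$(git ls-remote "$work/master.git" refs/heads/master | cut -f1)
m2=$(git ls-remote "$work/mirror.git" refs/heads/master | cut -f1)
[ "$m1" = "$m2" ] || echo "mirror out of sync"
```

In a real deployment the two `ls-remote` calls would point at the master and mirror URLs, and a mismatch that persists past the replication window is your alert condition.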

Performance Scalability

Now on to performance.  Gerrit was designed for large deployments (hundreds of repositories, millions of lines of code, thousands of developers) and the Gerrit community has provided some innovations like bitmap indexes.

Nevertheless, running Gerrit on a single machine will eventually hit scalability limits. Big deployments require big hardware (24-core CPUs, 100+ GB of RAM, fast I/O), and even so they may use several read-only mirrors for load balancing and remote site support.

If you want to run a big Gerrit deployment without worrying about managing expensive hardware and monitoring a farm of mirrors, Git MultiSite provides an elegant solution. Using active-active replication, you’ll have a deployment of fully writable Gerrit nodes. That means no single machine has to be sized as large, as you can deploy more writable nodes for load balancing. You can also put fully writable nodes at remote locations for better performance over the WAN. To put the icing on the cake, there is no single point of failure in Git MultiSite: if you have 5 nodes in your Gerrit deployment you can tolerate the loss of 2 of them without any downtime, giving you HA/DR out of the box.


And here’s Gerrit with Git MultiSite!

With the recent announcement of Gerrit support in Git MultiSite, it’s worth taking a step back and looking at Gerrit itself.  Gerrit, just like its logo, is a bit of an odd bird. It has a huge user base and dynamic community including the likes of Google and Qualcomm, yet is little known outside of that community.


Gerrit is one of two known descendants of Mondrian, a code review tool used internally at Google. Mondrian proved very popular and led to Rietveld, an open source code review tool for Subversion and Git, and to Gerrit, which was developed as the code review and workflow solution for the Android Open Source Project (AOSP).

In order to support AOSP, Gerrit was designed to be:

  • Scalable. It supports large deployments with thousands of users.
  • Powerful. The workflow engine enforces code review and automated build and test for every commit.
  • Flexible. Gerrit offers a delegated permission model with granular permissions as well as a Prolog interpreter for custom workflows.
  • Secure. Gerrit integrates with enterprise authentication mechanisms including LDAP, Active Directory, and OpenID, and can be served over SSH and HTTPS.

Gerrit offers three key features: repository management, access control, and the code review and workflow engine.

In future articles I’ll dive into more detail on Gerrit’s workflow and other features, but for now, I’ll conclude by talking about why we decided to add MultiSite support for Gerrit.

Gerrit is a scalable system, but still has a centralized architecture. Out of the box it has a master set of repositories and a simple master-slave replication system. That can lead to challenges in performance and uptime – exactly the problems that WANdisco solves with our patented active-active replication technology. Under Git MultiSite, Gerrit repositories can be replicated to any location for maximum performance, or you can add additional local repositories for load balancing. Access control is enforced with the normal Gerrit permissions, and code review and workflow still route through the Gerrit UI.

Gerrit with Git MultiSite gives you 100% uptime and the best possible performance for users everywhere. More details coming soon!

A bit of programming language history

When I started programming, I used C and just a bit of Fortran. I took my first degrees in electrical engineering, and at the time those languages were the default choice for scientific and numerical computing on workstations. The Java wave was just building at the time, Perl was for sysadmins, and Python was a toy.

That’s how the landscape appeared from my limited perspective. As I started working more deeply in computer science, I started glimpsing odd languages that I couldn’t quite place (Smalltalk? Tcl?). If you follow data analytics and big data, you’ll see a bewildering array of new and old languages in use. Java is still around, but we also have a lot of functional languages to consider as there’s a concerted effort to expose data analysis languages to big data infrastructure. R, Erlang, Go, Scala, of course Java and Python – how do we keep track?

I was very happy to find a lovely diagram showing how these languages have evolved from common heritage. It’s on slide 2 of this presentation from the Data Science Association.

This may be old hat to those who’ve been in the space for a long time, but I find this sort of programming language history very useful. Now I’ve got to find out what in the world Algol 60 was.

Sample datasets for Big Data experimentation

Another week, another gem from the Data Science Association. If you’re trying to prototype a data analysis algorithm, benchmark performance on a new platform like Spark, or just play around with a new tool, you’re going to need reliable sample data.

As anyone familiar with testing knows, good data can be tough to find. Although there’s plenty of data in the public domain, most of it is not ready to use. A few months ago, I downloaded some data sets from a US government site and it took a few hours of cleaning before I had the data in shape for analysis.

Behold: the Frictionless Data Project has compiled a set of easily accessible and well documented data sets. The specific data may not be of much interest, but these are great for trials and experimentation. For example, if you want to analyze time series financial data, there’s a CSV file with updated S&P 500 data.

Well worth a look!

SmartSVN 8.6 Available Now

We’re pleased to announce the release of SmartSVN 8.6, the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.6 is available immediately for download from our website.

New Features include:

  • Bug reporting now optionally allows uploading bug reports directly to WANdisco from within SmartSVN
  • Improved handling of svn:global-ignores inherited property
  • Windows SASL authentication support added and required DLLs provided

Fixes include:

  • Internal error when selecting a file in the file table
  • Possible internal error in repository browser related to file externals
  • Potentially incorrect rendering of directory tree in Linux

For a full list of all improvements and features please see the changelog.

Contribute to further enhancements

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Experiences with R and Big Data

The next releases of Subversion MultiSite Plus and Git MultiSite will embed Apache Flume for audit event collection and transmission. We’re taking an incremental approach to audit event collection and analysis, as the throughput at a busy site could generate a lot of data.

In the meantime, I’ve been experimenting with some more advanced and customized analysis. I’ve got a test system instrumented with a custom Flume configuration that pipes data into HBase instead of our Access Control Plus product. The question then is how to get useful answers out of HBase to questions like: What’s the distribution of SCM activity between the nodes in the system?

It’s actually not too bad to get that information directly from an HBase scan, but I also wanted to see some pretty charts. Naturally I turned to R, which led me again to the topic of how to use R to analyze Big Data.

A quick survey showed three possible approaches:

  • The RHadoop packages provided by Revolution Analytics, which includes RHBase and Rmr (R MapReduce)
  • The SparkR package
  • The Pivotal package that lets you analyze data in Hawq

I’m not using Pivotal’s distribution and I didn’t want to invest time in a MapReduce-style analysis, so that left me with RHBase and SparkR.

Both packages were reasonably easy to install as these things go, and RHBase let me directly perform a table scan and crunch the output data set. I was a bit worried about what would happen once a table scan started returning millions of rows instead of thousands, so I wanted to try SparkR as well.

SparkR let me define a data source (in this case an export from HBase) and then run a functional reduce on it. In the first step I would produce some metric of interest (AuthZ success/failure for some combination of repository and node location) for each input line, and then reduce by key to get aggregate statistics. Nothing fancy, but Spark can handle a lot more data than R on a single workstation. The Spark programming paradigm fits nicely into R; it didn’t feel nearly as foreign as writing MapReduce or HBase scans. Of course, Spark is also considerably faster than normal MapReduce.

Here’s a small code snippet for illustration:


lines <- textFile(sc, "/home/vagrant/hbase-out/authz_event.csv")
# Map each line to a (key, metric) pair; the derivation below is
# illustrative (e.g. key = repository:node, metric = 1 per AuthZ event)
mlines <- lapply(lines, function(line) {
  fields <- strsplit(line, ",")[[1]]
  list(paste(fields[1], fields[2], sep = ":"), 1L)
})
parts <- reduceByKey(mlines, "+", 2L)  # sum metrics per key, 2 partitions
reduced <- collect(parts)


In reality, I might use SparkR in a lambda architecture as part of my serving layer and RHBase as part of the speed layer.

It already feels like these extra packages are making Big Data very accessible to the tools that data scientists use, and given that data analysis is driving a lot of the business use cases for Hadoop, I’m sure we’ll see more innovation in this area soon.

Subversion Vulnerability in Serf

The Apache Subversion team have recently published details of two vulnerabilities in the Serf RA layer.

Firstly, vulnerable versions of the Serf RA layer will accept certificates that do not match the hostname the client is using to make the request. This is deemed a Medium risk vulnerability.

Additionally, affected versions of the Serf RA layer do not properly handle certificates with embedded NUL bytes in their Common Names or Subject Alternative Names. This is deemed a Low risk vulnerability.

Either of these issues, or a combination of both, could lead to a man-in-the-middle attack and allow viewing of encrypted data and unauthorised repository access.

A further vulnerability has also been identified in the way that Subversion indexes cached authentication credentials. An MD5 hash collision can be engineered such that cached credentials are leaked to a third party. This is deemed a Low risk vulnerability.

For more information on these issues please see the following links:!msg/serf-dev/NvgPoK6sFsc/_TR7Buxtba0J

The ra_serf vulnerability affects Subversion versions 1.4.0-1.7.17 and 1.8.0-1.8.9. The Serf library vulnerability affects Serf versions 0.2.0 through 1.3.6 inclusive. Finally, the credentials vulnerability affects Subversion versions 1.0.0-1.7.17 and 1.8.0-1.8.9.

If you are using any of the vulnerable versions mentioned above we would urge you to upgrade to the latest release, either 1.8.10 or 1.7.18. Both are available on our website at

Spark and Hadoop infrastructure

I just read another article about how Spark stands a good chance of supplanting MapReduce for many use cases. As an in-memory platform, Spark provides answers much faster than MapReduce, which must perform an awful lot of disk I/O to process data.

Yet MapReduce isn’t going away. Besides all of the legacy applications built on it, MapReduce can still handle much larger data sets: Spark’s limit is in terabytes, while MapReduce can handle petabytes.

There’s one interesting question I haven’t seen discussed, however. How do you manage the different hardware profiles for Spark and other execution engines? A general-purpose MapReduce cluster will likely balance I/O, CPU, and RAM, while a cluster tailored for Spark will emphasize RAM and CPU much more heavily than I/O throughput. (As one simple example, switching a Spark MLlib job to a cluster that allowed allocation of 12GB of RAM per executor cut the run time from 370 seconds to 14 seconds.)
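For what it’s worth, the knob behind that MLlib speedup is the per-executor memory allocation at submission time. A sketch – the class and jar names are placeholders, though the flags are standard:

```shell
# Ask YARN for executors with 12 GB apiece, which keeps the MLlib
# working set in memory instead of spilling to disk
spark-submit \
  --master yarn \
  --executor-memory 12G \
  --executor-cores 4 \
  --class com.example.MLlibJob \
  mllib-job.jar
```

On a cluster whose data nodes only have a few GB of RAM free per container, YARN simply can’t grant that request, which is exactly the hybrid-hardware problem described above.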

From what I’ve heard, YARN’s resource manager doesn’t handle hybrid hardware profiles in the same cluster very well yet. It will tend to pick the ‘best’ data node available for a task, but that means it will tend to pick your big-memory, big-CPU nodes for everything, not just Spark jobs.

So what’s the answer? One possibility is to set up multiple clusters, each tailored for a different type of processing. They’ll have to share data, of course, which is where it gets complicated. The usual techniques for moving data between clusters (i.e. the tools built on distcp) are meant for scheduled synchronization – in other words, backups. Unless you’re willing to accept a delay before data is available to both clusters, and unless you’re extremely careful about which parts of the namespace each cluster effectively ‘owns’, you’re out of luck… unless you use Non-Stop Hadoop, that is.
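Those distcp-based tools typically boil down to a scheduled job like the following; the namenode hostnames and paths are placeholders:

```shell
# Copy new and changed files from the MapReduce cluster's namespace
# to the Spark cluster's, removing files deleted at the source
hadoop distcp -update -delete \
  hdfs://nn-mapreduce.example.com:8020/data/events \
  hdfs://nn-spark.example.com:8020/data/events
```

Everything written to the target between runs is at risk of being clobbered, which is why each cluster has to ‘own’ its part of the namespace under this scheme.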

Non-Stop Hadoop lets you treat two or more clusters as a unified HDFS namespace, even when the clusters are separated by a WAN. Each cluster can read and write simultaneously, using WANdisco’s active-active replication technology to keep the HDFS metadata in sync. In addition, Non-Stop Hadoop’s efficient block-level replication between clusters means data transfers much more quickly.

This means you can set up two clusters with different hardware profiles, running Spark jobs on one and traditional MapReduce jobs on the other, without any additional administrative overhead. Same data, different jobs, better results.

Interested? We’ve got a great demo ready and waiting.


SmartSVN 8.6 RC1 Available Now

We’re proud to announce the release of SmartSVN 8.6 RC1. SmartSVN is the cross-platform graphical client for Apache Subversion.

New features include:

  • File protocol authentication implemented, using system login as default
  • “Manage as Project” menu item added for unmanaged working copies
  • Navigation buttons added to notifications to show previous/next notification

Fixes include:

  • Opening a missing project could end up in an endless loop
  • Linux: Refresh might not have picked up changes if inotify limit was reached
  • Merge dialog “Merge From” label was cut off
  • Illegal character error while editing svn:mergeinfo
  • Context menu wasn’t available in commit message text area
  • Progress window for explorer integration context menu actions sometimes appeared behind shell window

For a full list of all improvements and features please see the changelog.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download Release Candidate 1 for SmartSVN 8.6 from our website.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

Health care and Big Data

What’s the general impression of the public sector and health care? Stodgy. Buried in paperwork. Dull. Bureaucrats in cubicles and harried nurses walking around with reams of paper charts.

Behind the scenes, however, the health care industry and their counterparts in the public sector have been quietly launching a wave of technological innovation and gaining valuable insight along the way. Look no further than some of their ‘lessons learned’ articles, including this summary of a recent study in Health Affairs. The summary is well worth a read, as the lessons are broadly applicable no matter what industry you’re in:

  • Focus on acquiring the right data
  • Answer questions of general interest, not questions that show off the technology
  • Understand the data
  • Let the people who understand the data have access to it in any form

Accomplishing these goals, they found, required a broader and more sophisticated Hadoop infrastructure than they had anticipated.

Of course, this realization isn’t too much of a surprise here at WANdisco. One of the early adopters of Non-Stop Hadoop was UC Irvine Health, a leading research and clinical center in California. UC Irvine Health has been recognized as an innovation center, and is currently exploring new ways to use Big Data to improve the quality and indeed the entire philosophy of its care.

You may be thinking you’ve still got time to figure out a Big Data strategy. Real time analytics on Hadoop? Wiring multiple data centers into a logical data lake? Not quite ready for prime time? Before you prognosticate any further, give us a call. Even ‘stodgy’ industries are seeing Big Data’s disruption up close and personal.

Distributed Code Review

As I’ve written about previously, one of the compelling reasons to look at Git as an enterprise SCM system is the great workflow innovation in the Git community. Workflows like Git Flow have pulled in best practices like short lived task branches and made them not only palatable but downright convenient. Likewise, the role of the workflow tools like Gerrit should not be discounted. They’ve turned mandatory code review from an annoyance to a feature that developers can’t live without (although we call it social coding now).

But as any tool skeptic will tell you, you should hesitate before building your development process too heavily on these tools. You’ll risk locking in to the way the tool works – and the extra data that is stored in these tools is not very portable.

The data stored in Git is very portable, of course. A developer can clone a repository, maintain a fork, and still reasonably exchange data with other developers. Git has truly broken the bond between code and a central SCM service.

As fans of social coding will tell you, however, the conversation is often just as important as the code. The code review data holds a rich history of why a change was rejected, accepted, or resubmitted. In addition, these tools often serve as the gatekeeper’s tools: if your pull request is rejected, your code isn’t merged.

Consider what happens if you decide you need to switch from one code review tool to another. All of your code review metadata is likely stored in a custom schema in a relational database. Moving, say, from Gerrit to GitLab would be a significant data migration effort – or you just accept the fact that you’ll lose all of the code review information you’ve stored in Gerrit.

For this reason, I was really happy to hear about the distributed code review system now offered in SmartGit. Essentially SmartGit is using Git to store all of the code review metadata, making it as portable as the code itself. When you clone the repository, you get all of the code review information too. They charge a very modest fee for the GUI tools they’ve layered on top, but you can always take the code review metadata with you, and they’ve published the schema so you can make sense of it. Although I’ve only used it lightly myself, this system breaks the chain between my Git repo and the particular tool that my company uses for repository management and access control.

I know distributed bug trackers fizzled out a couple of years ago, but I’m very happy to see Syntevo keep the social coding conversation in the same place as the code.

Git MultiSite Cluster Performance

A common misconception about Git is that having a distributed version control system automatically immunizes you from performance problems. The reality isn’t quite so rosy. As you’ll hear quite often if you read about tools like Gerrit, busy development sites make a heavy investment to cope with the concurrent demands on a Git server posed by developers and build automation.

Here’s where Git MultiSite comes into the picture. Git MultiSite is known for providing a seamless HA/DR solution and excellent performance at remote sites, but it’s also a great way to increase elastic scalability within a single data center by adding more Git MultiSite nodes to cope with increased load. Since read operations (clones and pulls) are local to a single node and write operations (pushes) are coordinated, with the bulk of the data transfer happening asynchronously, Git MultiSite lets you scale out horizontally. You don’t have to invest in extremely high-end hardware or worry about managing and securing Git mirrors.

So how much does Git MultiSite help? Ultimately that depends on your particular environment and usage patterns, but I ran a little test to illustrate some of the benefits even when running in a fairly undemanding environment.

I set up two test environments in Amazon EC2. Both environments used a single instance to run the Git client operations. The first environment used a regular Git server with a new empty repository accessed over SSH. The second environment instead used three Git MultiSite nodes.  All servers were m1.large instances.

The test ran a series of concurrent clone, pull, and push operations for an hour. The split between read and write operations was roughly 7:1, a pretty typical ratio in an environment where developers are pulling regularly and pushing periodically, and automated processes are cloning and pulling frequently. I used both small (1k) and large (10MB) commits while pushing.
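The load generator itself was nothing exotic. Here is a sketch of how such a mixed workload could be scripted; this is an illustrative reconstruction, not the exact harness, and the repository URL and working directories are placeholders:

```python
import random
import subprocess

def make_schedule(n_ops, read_weight=7, write_weight=1, seed=42):
    """Build a randomized schedule of Git operations with a ~7:1
    read:write split, mirroring the mix of clones/pulls vs. pushes."""
    rng = random.Random(seed)
    ops = []
    for _ in range(n_ops):
        if rng.random() < read_weight / (read_weight + write_weight):
            ops.append(rng.choice(["clone", "pull"]))
        else:
            ops.append("push")
    return ops

def run_op(op, repo_url, workdir):
    """Dispatch a single Git operation against the server under test."""
    if op == "clone":
        return subprocess.run(["git", "clone", repo_url, workdir]).returncode
    # pull or push against an existing working copy
    return subprocess.run(["git", "-C", workdir, op]).returncode
```

In the real test each operation ran on its own thread so that the server saw genuinely concurrent load.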

What did I find?

Git MultiSite gives you more throughput

Git MultiSite processed more operations in an hour. There were no dropped operations, so the servers were not under unusual stress.


Better Performance

Git MultiSite provided significantly better performance, particularly for reads. That makes a big difference for developer productivity.


More Consistent Performance

Git MultiSite provides a more consistent processing rate.


You won’t hit any performance cliffs as the load increases.


Try it yourself

We perform regular performance testing during evaluations of Git MultiSite. How much speed do you need?

Big Data ETL Across Multiple Data Centers

Scientific applications, weather forecasting, click-stream analysis, web crawling, and social networking applications often have several distributed data sources, i.e., big data is collected in separate data center locations or even across the Internet.

In these cases, the most efficient architecture for running extract, transform, load (ETL) jobs over the entire data set becomes nontrivial.

Hadoop provides the Hadoop Distributed File System (HDFS) for storage and YARN (Yet Another Resource Negotiator) for resource management in Hadoop 2.0. ETL jobs use the MapReduce programming model to run on the YARN framework.

Though these are adequate for a single data center, there is a clear need to enhance them for multi-data center environments. In these instances, it is important to provide active-active redundancy for YARN and HDFS across data centers. Here’s why:

1. Bringing compute to data

Hadoop’s architectural advantage lies in bringing compute to data. Providing active-active (global) YARN accomplishes that on top of global HDFS across data centers.

2. Minimizing traffic on a WAN link

There are three types of data analytics schemes:

a) High-throughput analytics where the output data of a MapReduce job is small compared to the input.

Examples include weblogs, word count, etc.

b) Zero-throughput analytics where the output data of a MapReduce job is equal to the input. A sort operation is a good example of a job of this type.

c) Balloon-throughput analytics where the output is much larger than the input.

For high-throughput analytics, local YARN can crunch the data and use global HDFS to redistribute the results. Keep in mind, however, that this might require another MapReduce job to run on the output, which can add traffic to the WAN link. Global YARN reduces that traffic further by distributing the computational load.
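For a concrete sense of the high-throughput case (a), here is word count in MapReduce style, sketched as plain Python functions rather than a job submitted to YARN. The reduced output (one count per distinct word) is far smaller than the input:

```python
from collections import Counter

def mapper(line):
    """Map step: emit a (word, 1) pair for each word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(pairs):
    """Reduce step: sum the counts for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(lines):
    """Chain map and reduce over an iterable of input lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return reducer(pairs)
```

In a real cluster the same mapper and reducer logic would run in parallel across the nodes holding the data blocks.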

Last but not least, fault tolerance is required at the server, rack, and data center levels. Passive redundancy solutions can require days of downtime before operations resume. Active-active redundant YARN and HDFS provide zero-downtime solutions for MapReduce jobs and data.

To summarize, it is imperative for mission-critical applications to have active-active redundancy for HDFS and YARN. Not only does this protect data and prevent downtime, but it also allows big data to be processed at an accelerated rate by taking advantage of the aggregated CPU, network, and storage of all servers across data centers.

– Gurumurthy Yeleswarapu, Director of Engineering, WANdisco

More efficient cluster utilization with Non-Stop Hadoop

Perhaps the most overlooked capability of WANdisco’s Non-Stop Hadoop is its efficient cluster utilization in secondary data centers. These secondary clusters are often used only for backup purposes, which is a waste of valuable computing resources. Non-Stop Hadoop allows you to take full advantage of the CPU and storage resources that you’ve paid for.

Of course anyone who adopts Hadoop needs a backup strategy, and the typical solution is to put a backup cluster in a remote data center. distcp, a part of the core Hadoop distribution, is used to periodically transfer data from the primary cluster to the backup cluster. You can also run some read-only jobs on the backup cluster, as long as you don’t need immediate access to the latest data.
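That periodic distcp transfer is usually just a scheduled job. A rough sketch of the wrapper (the cluster hostnames are made up, and this is an illustration of the typical approach, not a WANdisco tool):

```python
import subprocess

def build_distcp_cmd(src_nn, dst_nn, path="/data"):
    """Construct the periodic primary-to-backup copy command.
    The -update flag skips files already up to date on the destination."""
    return ["hadoop", "distcp", "-update",
            f"hdfs://{src_nn}:8020{path}",
            f"hdfs://{dst_nn}:8020{path}"]

def run_backup(src_nn, dst_nn, path="/data"):
    """Run the copy, e.g. from cron on the primary cluster's edge node."""
    return subprocess.run(build_distcp_cmd(src_nn, dst_nn, path)).returncode
```

Note that distcp runs as a MapReduce job itself, so the backup consumes cluster resources on every pass even when little data has changed.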

Still, that backup cluster is a big investment that isn’t being used fully. What if you could treat that backup cluster as a part of your unified Hadoop environment, and use it fully for any processing?  That would give you a better return on that backup cluster investment, and let you shift some load off of the primary cluster, perhaps reducing the need for additional primary cluster capacity.

That’s exactly what Non-Stop Hadoop provides: you can treat several Hadoop clusters as part of a single unified Hadoop file system. All of the important data is replicated efficiently by Non-Stop Hadoop, including the NameNode metadata and the actual data blocks. You can write data into any of the clusters, knowing that the metadata will be kept in sync by Non-Stop Hadoop and that the actual data will be transferred seamlessly (and much faster compared to using a tool like distcp).

As a simple example, recently I was ingesting two streams of data into a Hadoop cluster. Each ingest job handled roughly the same amount of data. The two jobs combined took up about 28 seconds of cluster CPU time during each run and consumed roughly 500MB of cluster RAM during operation.

Then I decided to run each job separately on two clusters that are part of a single Non-Stop Hadoop deployment. In this case, again running both jobs at the same time, they took up 15 seconds of CPU time on the first cluster and 18 seconds on the second, using about 250MB of RAM on each.

The exact numbers will vary depending on the job and what else is running on the cluster, but in this simple example I’ve accomplished three very useful things:

  • I’ve gotten useful work out of a second cluster that would otherwise be idle.
  • I’ve shifted half of the processing burden off of the first cluster. (It also helps to have six NameNodes instead of two to handle the concurrent writes.)
  • I don’t have to run the distcp job to transfer this data to a backup site – it’s already on both clusters. Not only am I getting more useful work out of my second cluster, I’m avoiding unnecessary overhead work.

So there you have it – Non-Stop Hadoop is the perfect way to get more bang for your backup cluster buck. Want to know more? We’ll be happy to discuss in more detail.

Talking R in an Excel World

I just finished reading a good summary of 10 powerful and free or low-cost analytics tools.  Of the 10 items mentioned, 2 are probably common in corporate environments (Excel and Tableau) while the other 8 are more specialized.  I wonder how successful anyone is at introducing a new specialized analytics tool into an environment where Excel is the lingua franca?

I’ve personally run into this situation a few times.  Recently I ran a Monte Carlo simulation in R to generate a simple P&L forecast for a business proposal.  I followed good practice by putting my code into R Markdown and generating a PDF with the code, assumptions, variables, and output.  However, I then had to share some of the conclusions with a colleague who was building pricing models in a spreadsheet, and knowing that introducing R on short notice wouldn’t go over well, I just copied some of the key results into Excel and sent it off.
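For readers curious what such a simulation looks like, here is a minimal Monte Carlo P&L sketch (in Python rather than R, and with made-up distribution parameters; the original model and numbers are not reproduced here):

```python
import random
import statistics

def simulate_pnl(n_trials=10_000, seed=1):
    """Monte Carlo P&L forecast: draw unit sales, price, and cost from
    assumed distributions and record the resulting profit per trial.
    All distribution parameters are illustrative."""
    rng = random.Random(seed)
    profits = []
    for _ in range(n_trials):
        units = rng.gauss(1000, 150)      # demand uncertainty
        price = rng.uniform(9.0, 11.0)    # pricing uncertainty
        unit_cost = rng.gauss(6.0, 0.5)   # cost uncertainty
        fixed_costs = 2500.0
        profits.append(units * (price - unit_cost) - fixed_costs)
    return profits

profits = simulate_pnl()
print(round(statistics.mean(profits), 2), round(statistics.stdev(profits), 2))
```

The distribution of `profits`, not just its mean, is the real payoff of the exercise: you can read off the probability of a loss directly from the trials.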

Similarly, a few months ago I was working on a group project to analyze some public data.  I used R to understand the data and perform the analysis, but then had to reproduce the final analysis in Excel to share it with the team.  That seems wasteful, but I couldn’t see another way to do it.  It would have taken me quite a long time to do all the exploratory work in Excel (I’ve not yet figured out how to create several charts in a loop in Excel) and Excel just doesn’t have the same type of tools (like principal components for dimension reduction).

R does have some capabilities to talk to Excel, but these don’t seem particularly easy to use and the advice is typically to use CSV as an interchange format, which has obvious limitations, including the loss of formulas and formatting.

So I’m stumped.  As long as Excel remains a standard interchange format I guess I’ll just have to do some manual data translation.  Has anyone solved this problem in a more elegant way?


Running Hadoop trial clusters

How not to blow your EC2 budget

Lately I’ve been spinning up a lot of Hadoop clusters for various demos and just trying new features. I’m pretty sensitive to exploding my company’s EC2 budget, so I dutifully shut down all my EC2 instances when I’m done.

Doing so has proven to be a bit painful, however. Restarting a cluster takes time and there’s usually a service or two that doesn’t start properly and needs to be restarted. I don’t have a lot of dedicated hardware available — just my trusty MBP. I realize I could run a single-node cluster using a sandbox from one of the distribution vendors, but I really need multiple nodes for some of my testing.

I want to run multiple node clusters as VMs on my laptop and be able to leave these up and running over a weekend if necessary, and I’ve found a couple of approaches that look promising:

  • Setting up traditional VMs using Vagrant. I use Vagrant for almost all of my VMs and this would be comfortable for me.
  • Trying to run a cluster using Docker. I have not used Docker but have heard a lot about it, and I’m hoping it will help with memory requirements. I have 16 GB on my laptop but am not sure how many VMs I can run in practice.

Here’s my review after quickly trying both approaches.

VMs using Vagrant

The Vagrantfile referenced in the original article works well. After installing a couple of Vagrant plugins I ran ‘vagrant up’ and had a 4-node cluster ready for CDH installation. Installation using Cloudera Manager started smoothly.

Unfortunately, it looks like running 4 VMs consuming 2 GB of RAM each is just asking too much of this machine. (The issue might be CPU more than RAM — the fan was in overdrive during installation.) I could only get 3 of the VMs to complete installation, and then during cluster configuration only two of them were available to run services.


Cluster using Docker

Running Docker on the Mac isn’t ideal, as you need to fire up a VM to actually manage the containers. (I’ve yet to find the perfect development laptop. Windows isn’t wholly compatible with all of the tools I use, Mac isn’t always just like Linux, and Linux doesn’t play well with the Office tools I have to use.) On the Mac there’s definitely a learning curve. The Docker containers that actually run the cluster are only accessible to the VM that’s hosting the containers, at least in the default configuration I used. That means I had to forward ports from that VM to my host OS.

Briefly, my installation steps were:

  • Launch the boot2docker app
  • Import the docker images referenced in the instructions (took about 10 minutes)
  • Run a one line command to deploy the cluster (this took about 3 minutes)
  • Grab the IP addresses for the containers and set up port forwarding on the VM for the important ports (8080, 8020, 50070, 50075)
  • Log in to Ambari and verify configuration

At this point I was able to execute a few simple HDFS operations using webhdfs to confirm that I had an operational cluster.
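Those webhdfs smoke tests are just REST calls against the forwarded NameNode port. A small helper along these lines does the trick (port 50070 is the NameNode web port from my port-forwarding list above; the `hdfs` user is an assumption of my setup):

```python
import urllib.request

def webhdfs_url(host, path, op, port=50070, user="hdfs", **params):
    """Build a WebHDFS REST URL for a simple smoke test of the cluster."""
    query = "&".join([f"op={op}", f"user.name={user}"] +
                     [f"{k}={v}" for k, v in sorted(params.items())])
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

def list_status(host, path):
    """GET a directory listing from the cluster (requires a live NameNode)."""
    with urllib.request.urlopen(webhdfs_url(host, path, "LISTSTATUS")) as r:
        return r.read()
```

A quick `list_status("localhost", "/tmp")` returning JSON is enough to confirm HDFS is up and reachable through the forwarded ports.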

The verdict

The verdict is simple — I just don’t have enough horsepower to run even a 4-node cluster using traditional VMs. With Docker, I was running the managing VM and three containers in a few minutes, and I didn’t get the sense that my machine was struggling at all. I did take a quick look at resource usage but I hesitate to report the numbers as I had a lot of other stuff running on my machine at the same time.

Docker takes some getting used to, but once I had a sense of how the containers and VM were interacting I figured out how to manage the ports and configuration. I think learning Docker will be like learning Vim after using a full blown IDE — if you invest the time, you’ll be able to do some things very quickly without putting a lot of stress on your laptop.

The rise of real-time Hadoop?

Over the past few weeks I’ve been reviewing a number of case studies of real-world Hadoop use, including stories from name-brand companies in almost every major industry. One thing that impressed me is the number of applications that are providing operational data in near-real time, with Hadoop applications providing analysis that’s no more than an hour out of date. These aren’t just toy applications either – one case study discussed a major retailer that is analyzing pricing for more than 73 million items in response to marketing campaign effectiveness, web site trends, and even in-store customer behavior.

That’s quite a significant achievement. As recently as last year I often heard Hadoop described as an interesting technology for batch processing large volumes of data, but one for which the practical applications weren’t quite clear. It was still seen as a Silicon Valley technology in some circles.

This observation is backed up by two other trends in the Hadoop community right now. Companies like Revolution Analytics are making great strides in making the analytical tools more familiar to data scientists, while Spark is making those tools run faster. Second, vendors (including WANdisco) are focusing on Hadoop operational robustness – high availability, better cluster utilization, security, and so on. A couple of years ago you might have planned on a few hours of cluster downtime if something went wrong, but now the expectation is clearly that Hadoop clusters will get closer to nine-nines of reliability.

If you haven’t figured out your Hadoop strategy yet, or have concerns about operational reliability, be sure to give us a call. We’ve got some serious Hadoop expertise on staff.


SmartSVN 8.5.5 Available Now

We’re pleased to announce the release of SmartSVN 8.5.5, the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.5.5 is available immediately for download from our website.

This release contains an improvement to the conflict solver along with a few bugfixes – for full details please see the changelog.

Contribute to further enhancements

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

Hadoop Summit 2014

I was lucky to be at the Hadoop Summit in San Jose last week, splitting my time between the WANdisco booth and attending some of the meetups and sessions. I didn’t keep a conference diary, but here are a few quick impressions:

1. Hadoop seems to be maturing operationally. Between WANdisco’s solutions for making key parts of Hadoop failure-proof and the core improvements coming in Hadoop 2.4.0 and beyond, the community is focusing a lot of effort on uptime and resiliency.

2. Security is still an open question. Although technologies like Knox and Kerberos integration provide good gateway and authentication support, there is no standard approach for more granular authorization. This was a consistent theme in several presentations including a case study from Booz Allen.

3. Making analytics faster and more accessible will receive a lot of attention this year. Hive 0.13 is showing dramatic performance improvements; Microsoft demonstrated accessing Hadoop data through Excel Power Query; there are several initiatives to make R run better in the Hadoop sphere – the list goes on and on.

4. The power of active-active replication continues to surprise people. Almost everyone I talked to at our booth kept asking the same questions: “So this is like a standby NameNode? You have a hot backup? It’s kind of like distcp?” No, no, and no – WANdisco’s NonStop NameNode lets you run several active (fully writable) NameNodes at once, even in different data centers, as part of a unified HDFS namespace. (If you haven’t read our product briefs, that gives you a full HA/DR solution with better performance and much better utilization of that secondary data center.  Better yet, skip the product briefs and just ask us for a demo.)

5. Beach balls are a fantastic giveaway. Kudos to our marketing team. 🙂

See you at the next one!


HBase Sponsorship and Adoption

A recent infographic from the Data Science Association showed that MongoDB is leading the pack of NoSQL and NewSQL databases in 2014:

NoSQL NewSQL Database Adoption 2014


I’m not sure exactly where this data comes from, but it matches what I’ve heard anecdotally in the community. MongoDB seems to have a head start for many reasons, including the ease of standing up a new cluster.

Will this trend continue? To begin to answer this question, it’s worth considering the commercial interests behind these databases. This article shows a few metrics on current and projected market share. Of the databases with direct vendor sponsorship, MongoDB leads the pack at 7% compared with 3% each for Cassandra and Riak.

So where does that leave HBase? It’s running a respectable fourth place in the infographic, but that doesn’t tell the whole story. HBase is backed by the entire Hadoop community, including contributors from Cloudera, Hortonworks, Facebook, and Intel. Community sponsors for HBase include all of the above plus MapR.

Now go back to the market share report and notice that HBase doesn’t have a primary commercial sponsor, even though a significant portion of the market share (and funding) going to companies like Cloudera, Hortonworks, and MapR backs HBase as well. In the end, HBase may well benefit from being a primarily Apache-backed project that the whole community sponsors and supports, rather than being driven by a single vendor.

This fits into the trend of Apache-backed projects being significantly larger than vendor-backed projects. Apache’s namesake web server is a useful (if inexact) parallel. There are web servers out there that are certainly easier to install and configure, but a huge portion of the world’s websites run on Apache HTTPD. It’s robust, ubiquitous, and has a deep pool of community expertise. The same may be said of HBase in the future – it’s well-supported by every major Hadoop distribution, and it runs on top of Hadoop, yielding some infrastructural savings.

The next couple of years should be very interesting. I’m quite curious if HBase’s Apache heritage will give it the boost it needs to increase adoption in the NoSQL community.

GridGain goes open source

I’ve been catching up on some old RSS feed links recently and came across this gem: GridGain is going open source.  I’m not too familiar with GridGain but I’m happy to see it being open sourced and joining the wave of other solutions like Spark.

I’m a little curious if anyone has hands-on performance data comparing GridGain and Spark.  I dug up a link to a thesis that compared Spark to another technology and found some limitations in Spark when the data set size started to exceed the RAM in the cluster.  But a quick search doesn’t yield anything similar for GridGain and Spark.

Why 100% Availability is Critical for Big Data Applications

Marla Ly

Jeremy Howard, the former president of Kaggle, opined recently that few people outside the machine learning space “have yet grasped how astonishingly quickly it’s progressing.”  This is no understatement. And as a Big Data company, WANdisco is right in the intellectual middle of this coming revolution.

No human could program an algorithm for a car to safely drive itself; there are simply too many edge cases. Only a learning computer can do this, and no human knows the complete algorithm. The computer actually learns how to drive by watching a human do it. Imagine this being repeated across any number of activities that today we take for granted only humans can perform.

There’s a critical component of these machine learning robots that might be less flashy, but essential to their success: Big Data.  In the case of the Google driverless car, first an extremely detailed recreation of the world is built. The car must then only see the difference between what’s actually happening and its internal model. That’s where Big Data comes in: petabytes of data about the world and the ability to merge with a stream of incoming data in real time make this miracle work.

That’s also where these systems take a sharp turn from many computing systems of the past: this Big Data must always be available and working. A system that drives a car clearly must be available more than 99.99% of the time. 99.99% uptime would still mean roughly 8.6 seconds of failure in every 24 hours of driving, not even close to acceptable.
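The arithmetic is simple enough to check:

```python
def downtime_per_day(availability):
    """Seconds of permitted failure in a 24-hour day at a given availability."""
    return (1.0 - availability) * 24 * 60 * 60

# 99.99% ("four nines") still allows about 8.6 seconds of failure per day.
print(downtime_per_day(0.9999))
```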

Of course, computers have been critical components in cars for many years. But there’s a big difference between these computers and the machine learning, Big Data driverless car of today. Unlike an embedded system that is self-contained in a controlled environment, today’s Big Data technology must work in the high-failure environment of distributed systems.

As the inventor of Paxos, Leslie Lamport, defined it:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”

Given this challenging environment, how does one obtain the kind of guaranteed availability that’s required for critically important functions such as driving a car? WANdisco’s core WAN-capable Paxos technology is the answer, removing single points of failure in existing technology and providing seamless redundancy with 100% data safety in high-failure environments.

So while the future promises a Big Data driven revolution of new capabilities, those capabilities rely on systems that must always work. That’s Why WANdisco.

Permission Precedence in Access Control Plus

Access Control Plus gives you a flexible permission system for both Subversion and Git.  Management is delegated through a hierarchical team system, a necessity for large deployments.  You can’t have every onboarding request bubbling up to the SCM administration team, after all.  Daily permission maintenance is, within boundaries, rightfully the place of team leaders.

But what happens if several rules seem to apply at the same time?  Consider a few examples of authorization rules for Git.  In these examples I’ll look at different rule scope in terms of where the rule applies (the resource) and who the rule applies to (the target).

Same resource, same target

Harry wants to push a commit to the task105 branch of the Git repository named acme.

  • Harry belongs to the team called acme-devs, which has write access to the entire acme repo.
  • Harry also belongs to the team called acme-qa, which has read access to the entire repo.

In this case Harry has write access to the task105 branch.  Both rules apply to the same resource and target, so Harry gets the most liberal permission.

Different resource, same target

Now consider this case where again Harry wants to push a commit to the task105 branch.

  • Harry belongs to acme-qa which has read access to the entire repo.
  • Harry belongs to acme-leads, a sub-team of acme-qa, which has write access to task105.

In this case Harry again has access.  The more specific resource of the rule for acme-leads (on the branch as opposed to the entire repo) takes precedence.

Different resource, different target

In another variation:

  • Harry has read access to the entire repo.
  • Harry belongs to acme-leads which has write access to task105.

In this case Harry again has access: even though the two rules apply to different targets (Harry’s own account versus a team), the rule with the more specific resource (the branch as opposed to the entire repo) takes precedence.

Different resource, different target – with a wrinkle

Now consider:

  • Harry has write access to the entire repo.
  • Harry belongs to acme-reviewers-only which has read access to task105.

In this case Harry again has access.  The team rule that grants read access is considered first, but it doesn’t grant or deny the access level Harry needs (write access).  So we keep searching and find the more general rule that grants write access at the repo level.  If we actually wanted to prevent Harry from writing, the team rule would need to deny write access explicitly.

Same resource, different target

And in a final example:

  • Harry belongs to acme-reviewers-only which has read access to task105.
  • Another rule grants Harry himself write access to task105.

In this case Harry gets write access.  The two rules have the same resource (branch level), but the rule that applies to the more specific target (his own account versus a team) is applied first.

Rules of the road

To sum up, the rules of precedence for Git permissions are:

  • Rules on a more specific resource take precedence over more general rules.
  • If two rules apply to the same resource, then rules applying to a specific account take precedence over rules that apply to a team.  (All teams and sub-teams are considered equivalent identities.)
  • When two rules are equivalent in resource and target, the more liberal rule takes precedence.
  • Rules are considered until a rule is found that grants or denies the requested access level.
  • If no rules apply, fall back on the default access level specified for Git MultiSite users.



And in case Harry has any doubts, he can always use the Rule Lookup tool to find out which rule applies.
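For the curious, the rules of the road above can be sketched as code. This is a deliberately simplified model for illustration (repo vs. branch resources, account vs. team targets, read and write levels only), and an assumption of how such a resolver could work, not Access Control Plus’s actual implementation:

```python
from typing import NamedTuple

class Rule(NamedTuple):
    resource: str  # "repo" or "branch"; branch is the more specific resource
    target: str    # "team" or "account"; account is the more specific target
    level: str     # "read" or "write"
    allow: bool    # True grants the level, False explicitly denies it

LEVELS = {"read": 1, "write": 2}
RESOURCE_RANK = {"branch": 0, "repo": 1}  # lower rank = considered first
TARGET_RANK = {"account": 0, "team": 1}

def resolve(rules, requested, default=False):
    """Consider rules in precedence order until one grants or explicitly
    denies the requested level; otherwise fall back on the default."""
    ordered = sorted(rules, key=lambda r: (RESOURCE_RANK[r.resource],
                                           TARGET_RANK[r.target],
                                           0 if r.allow else 1,  # liberal first
                                           -LEVELS[r.level]))
    for r in ordered:
        if r.allow and LEVELS[r.level] >= LEVELS[requested]:
            return True   # rule grants at least the requested level
        if not r.allow and LEVELS[requested] >= LEVELS[r.level]:
            return False  # rule explicitly denies this level
        # otherwise the rule neither grants nor denies: keep looking
    return default
```

Running Harry’s examples through this model reproduces each outcome described above, including the wrinkle: a read-only team rule on the branch does not block a repo-level write grant, but an explicit deny does.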


Migrating to Git: Forensic Considerations

Jack Spades

Git has unleashed an unusual number of migrations from legacy tools among a wide variety of companies. The desire to attract new developer talent is one reason we commonly hear; another is that there is finally an open source version control tool with advantages, perceived as well as real, compelling enough to justify a migration.

Often there’s an initial desire to migrate everything to Git, and shut off the legacy system. If that system is commercial and requires ongoing licensing fees, there’s a stronger incentive. But if it’s an open source tool, there are reasons you might want to pay the maintenance cost of keeping it around.

Easier migrations

One reason is that it’s usually advisable to not attempt a full migration of all history, but instead pick major baselines and only migrate those. By leaving the legacy system running, but in read-only mode, you always have the chance to go back and find something in the full history. That leads us to the next topic.

Proof of invention

Many companies face infrequent but high-stakes litigation around intellectual property disputes. Take the example of an algorithm that’s at issue in a lawsuit. You need to prove you were using the algorithm prior to a certain date. Version control systems are ideal for this, but only if you have the complete historical record. Someone noticed that the CVS repository hadn’t been used in three years and deleted it? Oops.

Forensics in a hybrid system

Thinking ahead a few years, we now have all new development done in Git, with our trusty CVS server patiently waiting for the next lawsuit. Consider that an investigation begun today may start in the Git history, and then need to be traced into the read-only CVS history.

This means that you will need to be able to link history backwards in time across your hybrid Git-CVS deployment. Practically speaking, this requirement should be taken into consideration during the initial migration to Git.

Imperfect history migration

You might think that a full history migration would fix this. In some cases it might be advisable; you should generally migrate enough history that going back into the legacy system is an unusual event. The problem, however, is that perfect fidelity in history migration between SCM systems is rarely possible. There are differences in capabilities and metadata that have no deterministic mapping.

The legacy system remains the definitive system of record for your intellectual property. Further, you may need to trace that history as it spans legacy and new tools. While its use may be infrequent, you’ll likely be glad you planned ahead during the giddy days of your Git adoption.



HIPAA Compliance and Continuous Delivery

The HIPAA law poses a compliance challenge for developers of software that handles electronic protected health information (ePHI). Part of the burden is demonstrating proper control over the electronic system: how you provision for auditing, availability, access control, and so on.

If you’re a practitioner of DevOps and continuous delivery, you’ve got a good head start on meeting those challenges. DevOps and continuous delivery embrace the idea of configuration as code: all of your runtime configuration and environment data is stored in an SCM system like Git or Subversion. As a result, the SCM system is your system of record for how your software was actually deployed, and it helps you demonstrate compliance with the HIPAA provisions.
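As a minimal sketch of the idea (the file path and commit message are hypothetical), keeping deployment configuration in Git gives you an attributable, timestamped record of every change:

```shell
# Store runtime configuration in the same SCM as the code; every change
# is then attributed, timestamped, and recoverable.
git add deploy/production.yml
git commit -m "Raise audit log retention to 6 years per HIPAA policy"

# An auditor can later reconstruct the full change history of that file:
git log --follow -p -- deploy/production.yml
```

The history produced by that log command is exactly the deployment record an auditor would ask for.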

There is, however, a slight wrinkle in the story: the SCM system is now a critical part of your runtime infrastructure, and most SCM systems are not designed to be highly available with no risk of data corruption.

That’s where WANdisco’s family of MultiSite and Clustering products for Git and Subversion come into play. WANdisco provides a 100% uptime solution; every node in the deployment is a replicated peer, so the loss of a single server does not pose a problem. High Availability and Disaster Recovery are built in with automatic failover and recovery capabilities.

Moreover, these are zero data loss solutions. By the time a piece of runtime configuration data is committed, it is guaranteed to exist on more than one node, ensuring data integrity. Every site, including deployment sites, will see the right set of data.

In an environment bound by regulatory and compliance concerns, you need the peace of mind that a 100% uptime solution with guaranteed data integrity provides. Give us a call for more information on how Subversion and Git MultiSite and Clustering can help you meet your compliance demands.

Reflections on Subversion & Git Live 2014

Yesterday marked the conclusion of this year’s Subversion & Git Live conference tour through New York and San Francisco. This was also WANdisco’s second conference with the DVCS Git under our wing.

Growing Git Sophistication

Last fall, for Subversion & Git Live 2013, we pitched the Git material at an introductory level. That turned out to be exactly right, as our attendees were as novice as they were enthusiastic about the disruptive and beneficial effects of Git in their environments. This year, not only were virtually all attendees familiar with Git, they were also markedly more comfortable with it and with its impact on their development organizations.

Where a common question last year was “I’ve heard of Git and I’d like to learn something about it”, this year it was more likely to be “I have Git in my environment; what do you recommend for supporting it successfully?” Follow-up questions this year were more likely to be about specific tool stacks that could be deployed.

Strength of Subversion

All of this hot discussion played out against the backdrop of Subversion, the enterprise workhorse. Significant improvements in speed and scalability are on the roadmap for 2013-2014, but there were also more ambitious discussions about assimilating more functionality from DVCSs like Git, and even ground-up designs for a new merge engine with move as a first-class operation. I’ve rarely seen WANdisco’s Subversion committers more engaged; fortunately, WANdisco’s long resume of large customers means easy access to real-world use cases for complex enterprise software development.

What I see ahead

One change observed at this conference was an acceleration of moves to end-of-life expensive legacy commercial products, with ClearCase the easy target. The relevance of newer commercial SCM systems outside niche industries continues to decline as version control commoditizes around open source. Despite the significant challenges of creating enterprise-class deployments, Git seems an unstoppable force as a developer productivity and talent attraction tool. WANdisco plays a significant role here, leapfrogging the ubiquitous web UI paradigm for self-provisioning of repos and engineering a world-scale, foundational backbone for Git.

It was a great conference, and we hope to see you next year!

Apache Announces Subversion 1.8.9

We’re pleased to announce the release of Subversion 1.8.9 on behalf of the Apache Subversion project. Along with the official Apache Software Foundation source releases, our own fully tested and certified binaries are available from our website.

1.8.9 contains a number of bugfixes. For a complete list please check the Apache changelogs for Subversion 1.8.

You can download our fully tested, certified binaries for Subversion 1.8.9 free here.

WANdisco’s binaries are a complete, fully-tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Git Access Control Levels

It seems that every Git management solution has its own flavor of access control permissions. I thought it’d be useful to have a quick matrix of the capabilities of WANdisco’s Access Control Plus. Questions? We’re here to help!


Feature                                              Available
Repository read/write permissions                    now
Branch write permissions                             now
Branch/tag create/delete permissions                 now
Path write permissions                               2014
Regular expressions (in refs and paths)              2014
HTTP(S) and SSH protocols                            now
Enforced on all Git replicas (via Git MultiSite)     now
Unified interface for Subversion and Git             now

Unified Git and Subversion Management

Over the past several years the movement in ALM tools has been away from heavy, inflexible tools towards lighter and more flexible solutions. Developers want and need the freedom to experiment and work quickly without being bound by heavy processes and restrictions.

But, of course, an enterprise still needs some level of management and governance over software development. Now it looks like the pendulum is swinging back towards a useful middle ground – and WANdisco’s new Access Control Plus product strikes that fine balance between flexibility and guidance.

Access Control Plus is flexible because it lets team leaders manage access to their repositories.  Site administrators can set overall policies and make sure that the truly sensitive data stays safe. Access Control Plus provides for any level of delegated team management, letting the team leaders closest to the source code manage their teams and permissions. And with accounts backed by any number of LDAP or Active Directory authorities, the grunt work of account management is automated.

Yet Access Control Plus is still an authoritative resource for security, auditing and reporting. It covers permissions for all of your Subversion and Git repositories at any location. That’s important for a number of reasons:

  • Sanity! You need some form of consistent permission management over your repositories.
  • An audit trail of your inventions. With the new America Invents Act, a comprehensive record of your intellectual property is more important than ever.
  • Regulatory regimes. Whether it’s Sarbanes-Oxley, HIPAA, or PCI, can you accurately prove who accessed and modified your IP? That’s a key concern for compliance officers.
  • DevOps. If you practice configuration as code, then some of your crown jewels are stored in SCM, and need to be managed appropriately.
  • Industry standards. From CMMI to ISO 9000, standard processes and controls are the cost of doing business in certain industries. Access Control Plus ticks all of the auditing and reporting boxes for you.

Combined with SVN MultiSite Plus and Git MultiSite, Access Control Plus is a complete solution for making your valuable digital data highly available and secure. Be proactive – give us a call and figure out how to manage all of your Subversion and Git repositories.


The AIA Prior Use Defense and DevOps

Configuration as Highly Valuable Code

As I wrote about earlier, the expanded scope of the ‘prior use’ defense in the America Invents Act (AIA) provides you with an improved defense against patent litigation. If you’ve adopted DevOps and Continuous Delivery, you need to make sure that you have a strong record of how you’re deploying your software, not just how it was developed. After all, some of your secret sauce may well be your deployment process – a clever way of scaling your application on Azure or EC2, or perhaps a sophisticated canary deployment technique.

Proving that your clever deployment tricks were in use at some point in time is just another reason to treat configuration as code and store it in your Git repositories. In order to do that, you need to figure out a couple of key problems:

  • How do you secure the production data while still making less sensitive deployment data available to development teams?
  • How do you prove that your production data was actually in use?
  • How do you manage having Git repositories on production app servers that may be outside your firewall?

WANdisco’s Git Access Control and Git MultiSite provide easy answers to those challenges.  Git Access Control lets you control write access down to the file level, so you can easily let developers modify staging data without giving them access to production data in the same repository. These permissions are applied consistently on every repository, on every server.

Similarly, Git Access Control provides comprehensive audit capabilities so you can see when data was cloned or fetched to a particular server. You can also use these auditing capabilities to satisfy regulatory concerns over access to production environment data.

Finally, Git MultiSite’s flexible replication groups let you securely control where and how a DevOps repository is used. For example, you may want to have the DevOps repository available for full use on internal servers, but only available for clones and pulls on a production server.

If DevOps has taught us anything, it’s that configuration and environment data is as important as source code in many cases. Git Access Control and Git MultiSite give you the control you need to confidently store configuration as code and establish your ‘prior use’ history.

Subversion and Git Live 2013 – Feedback

As we’re now less than two weeks away from Subversion and Git Live 2014 in New York we thought we’d post a few details about last year’s events – how they went, what people thought was most useful, what they enjoyed and probably most importantly what they got out of it.

We’ve collated our feedback and Dan, our lovely graphics guy, has created this infographic from the results:

WANdisco infographic - click image for full size (opens in new window)

So, all around, a very positive experience, with much learned both by attendees and by ourselves. Here are a few select comments from people who were at the London event last year:

“Very reassuring to hear that other companies have experienced similar “challenges”.”

“The most beneficial thing from this event? A sanity check that our Git implementation ticked all the boxes!”

“Most beneficial? Learning about current imminent functionality.”

“Time to think about some of this stuff!”

Following feedback received we’ve made some of our talks hands-on this year so you’ll get the chance to try out some of the things being talked about as well.

We hope to see you there! 🙂

More Rebasing in Subversion

Continuing on from a previous post about rebasing in Subversion, let’s look at a more general example of using rebasing to port commits to a new base branch.

In Git we’ll start with this picture.

I have three branches: master, team, and topic. Now I’d like to take the unique commits (to-1, to-2) on the topic branch and bring them back to master cleanly, but I don’t want the intermediate work on the team branch (commit te-1).

So I use rebasing to get the diffs between topic and team, and use master as the new base for the rebased topic branch.
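With the branch names from the picture above, that rebase is a single command:

```shell
# Replay the commits that are on topic but not on team (to-1, to-2),
# using master as the new base:
git rebase --onto master team topic

# topic now descends directly from master, so a fast-forward is trivial:
git checkout master
git merge --ff-only topic
```

After the rebase, te-1 is nowhere in topic’s new history; only the topic-unique commits were replayed.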

That gives me the clean picture above. At this point it would be trivial to do a fast forward merge of topic to master.

Using much the same techniques as I discussed last time, it’s possible to emulate this capability in Subversion. Here’s my starting point.

Again, I want to get the local commits from topic and make them more easily accessible to trunk without running a regular merge, which would have to go through the team branch.  Here’s the recipe.

  • Make a branch topic-prime from the current head of trunk.
  • Run svn mergeinfo to determine the diffs between topic and team (the revisions eligible for a merge from topic to team).
  • Run a cherry-pick merge (ignoring ancestry) of each of those revisions from topic to topic-prime.
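The recipe in concrete commands might look like this (the ^/branches layout and revision number r42 are hypothetical; substitute your own):

```shell
# 1. Branch topic-prime from the current head of trunk.
svn copy ^/trunk ^/branches/topic-prime -m "Create topic-prime from trunk"

# 2. List the revisions on topic that team has not yet merged;
#    these are the local commits we want to port.
svn mergeinfo --show-revs eligible ^/branches/topic ^/branches/team

# 3. In a working copy of topic-prime, cherry-pick each of those revisions,
#    ignoring ancestry so no merge through team is attempted.
svn checkout ^/branches/topic-prime wc && cd wc
svn merge --ignore-ancestry -c 42 ^/branches/topic .   # repeat per revision
svn commit -m "Port r42 from topic"
```

Note that the ^/ repository-relative URL syntax resolves against a working copy; from outside one, use full repository URLs instead.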

Using that recipe gives me this picture:

At this point I could continue working on topic-prime or run a relatively simple merge to trunk. I could have also changed my recipe to run the cherry-pick merges directly onto trunk instead of using a new branch.

In any case, the end result is fairly close to what you’d have in Git, although the process of getting there wasn’t as easy (and I still have the original topic branch lying around).

Git has uncovered a lot of useful tools and techniques, and although it takes a bit of extra work, you can emulate some of these in Subversion. Questions? Give me a ping on Twitter or svnforum.

Authentication and Authorization – Subversion and Git Live 2014

We’ve switched the format of some of the talks for Subversion and Git Live this year – several will be hands-on, giving you the opportunity to try out the subject being discussed rather than just taking notes.

One of these talks this year will be delivered by Ben Reser, one of our Subversion committers, on Authentication and Authorization. Ben has been working on Subversion since 2003 and will be discussing:

  • A brief overview of the access control methods Subversion supports.
  • Hands on setting up of a Subversion server with LDAP authentication over HTTP.
  • A look at the performance costs of access control and what you can do to minimize them.
  • How to put your authz configuration file into the repository.

The hands-on portion will follow a hypothetical company as it grows and shifts from a very basic setup to a much more complex one, showing some of the problems it would hit along the way and discussing the reasons for each configuration change. The company starts off with a single repository and basic authentication (no path-based authorization) and ends up with multiple repositories, LDAP, and path-based authorization. Eventually we’ll even use the new in-repository authz feature added in 1.8. The configuration improvements along the way will show how to ease administrative burden and improve performance.
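As a rough sketch of where such a setup ends up (the hostname, DN, and paths are hypothetical, and your mod_authnz_ldap details will differ):

```apache
<Location /svn>
  DAV svn
  SVNParentPath /var/svn/repos

  # LDAP authentication over HTTP
  AuthType Basic
  AuthName "Subversion repositories"
  AuthBasicProvider ldap
  AuthLDAPURL "ldap://ldap.example.com/ou=people,dc=example,dc=com?uid" NONE
  Require valid-user

  # Path-based authorization; with Subversion 1.8+ the access file can
  # also live inside the repository itself and be referenced by a ^/ URL.
  AuthzSVNAccessFile /var/svn/authz
</Location>
```

Ben’s talk walks through how each of these directives affects performance and administration as the setup grows.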

The goal with this talk is to have you walking away knowing why you configure Subversion the way you do and how you can make things better for your particular setup, rather than just giving you an example authz file and telling you it’s the ‘right’ way to do things.

If that sounds good to you, why not come see us at Subversion and Git Live 2014? There’s more info about the event here:

Open Source and Analytics

A recent Information Week article predicted that open source analytics tools would continue to gain ground over commercial competitors in 2014 in the Big Data arena. That may seem surprising. After all, you’ve made an investment in moving some unwieldy data into Hadoop.  Why not start to hook up your traditional data analytics and business intelligence tools?

To see why this prediction makes sense, let’s review some of the advantages of Hadoop Big Data infrastructure:

  • Cost efficiency: Hadoop’s storage costs per terabyte are about one-fifth to one-twentieth the cost of legacy enterprise data warehouse (EDW) solutions. Once you have a Hadoop cluster up and running, scaling it out is economical.

  • Visibility: Hadoop lets you store, manage, and analyze wildly disparate data sets with no penalty. Silos that existed due to storage costs or technical incompatibility start to disappear.

  • Future proofing: Hadoop is an open platform with a vibrant community. There’s no risk of lock-in to obsolete tools and vendors.

These same reasons explain why open analysis platforms will continue to see wide adoption.

First, let’s consider cost efficiency and visibility. You’ll find that both tools and talent are more affordable and easier to find when you use open platforms, which means you’ll have a lot more people looking for the gems in your data.

Recall that one feature of Big Data is that you probably don’t know how you’re going to use all of the data you collect in the future. In other words, you don’t know now what questions you’ll be asking next year. You need to unleash your analysts and data scientists to explore this data, and open analysis platforms have a much lower cost barrier than commercial tools. Any budding data scientists can get started without consuming scarce licenses.

Finally, the next generation of data scientists will be trained on open platforms like R. R is gaining traction rapidly and is the key tool in a new data science MOOC offered by Johns Hopkins. Not only will recruiting be easier, but anyone on your team who needs to start working with data can acquire some basic skills easily. Visibility matters: after all, if data is stored in Hadoop and no one is there to analyze it, why bother?


Now getting back to future proofing, data science is a rapidly evolving field.  New tools and methods are springing up almost every day.  Much of that research is being done and published in open platforms like R.  You’ll be able to take advantage of that cutting edge knowledge without having to wait for a vendor to support it in a closed framework.

Embracing this wave of open source analytics tools will help you start to see real ROI from your Big Data investment.

WANdisco Announces Availability of Apache Subversion 1.9.0 alpha binaries

The Apache Subversion project has announced the alpha release of Subversion 1.9.0, which brings a number of significant improvements.

It’s important to note that this is an alpha release and as such is not recommended for production environments. If you’re able to download and test this release in a non-production environment, though, we’d be grateful for any feedback – if you notice anything untoward, or even if you just want to chat or ask about this latest version, please drop us a post in our forums.

This release introduces improvements to caching and authentication, some filesystem optimisations for FSX and FSFS, a number of additions to svnadmin commands, and improvements to the interactive conflict resolution menus. Other enhancements include:

  • New options for ‘svnadmin verify’: ‘–check-normalization’ and ‘–keep-going’
  • ‘svnadmin info’: print information about a repository
  • New ‘svn cleanup’ options: ‘–remove-unversioned’, ‘–remove-ignored’, and ‘–include-externals’, plus a ‘–quiet’ option
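For example (the repository path is hypothetical):

```shell
# Verify a repository, reporting every problem instead of stopping
# at the first error:
svnadmin verify --keep-going /var/svn/repos/project

# Scrub a working copy of unversioned and ignored files in one pass:
svn cleanup --remove-unversioned --remove-ignored --quiet
```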

You can see a full list of changes in the release notes here.

To save you the hassle of compiling from source you can download our fully tested, certified binaries free from our website here:

WANdisco’s Subversion binaries provide a complete, fully tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Using TortoiseSVN or SmartSVN? As this is an alpha release there’s no compatible version of these Subversion clients yet, but watch this space and we’ll have them ready before the general release of Subversion 1.9.0.

OpenSSL Vulnerability – The Heartbleed Bug

The OpenSSL team recently published a security advisory regarding the TLS heartbeat read overrun. This vulnerability allows a connected client or server to read up to 64k of memory per heartbeat request, and a different chunk can be requested on each attack.

The vulnerability affects versions 1.0.1 and 1.0.2-beta of OpenSSL.

The WANdisco SVN binaries for Windows and Solaris available since 2011 have included vulnerable OpenSSL libraries. We’ve released updated versions with the patch as of today, so if you are still using one of these older versions please download the latest:



Users of our Subversion products (including SVN MultiSite) on other operating systems will still need to ensure they’ve updated their system’s OpenSSL package; nothing vulnerable is included with our binaries on those platforms. We recommend all users of these operating systems update their version of OpenSSL to 1.0.1g as soon as possible or, if unable to update, recompile OpenSSL with the -DOPENSSL_NO_HEARTBEATS flag.
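A quick way to check what you’re running (assuming the openssl binary is on your PATH):

```shell
# Versions 1.0.1 through 1.0.1f of OpenSSL are vulnerable; 1.0.1g is fixed.
openssl version

# If you cannot upgrade, rebuild from the OpenSSL source tree with
# heartbeats disabled:
#   ./config -DOPENSSL_NO_HEARTBEATS && make
```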

For more information on this vulnerability please see

UPDATE: SmartSVN versions 8.5 and 8.5.1 are also vulnerable due to the included version of OpenSSL. We’ve now released SmartSVN 8.5.2 and would urge all users of SmartSVN 8.5 and 8.5.1 to update to this latest version as soon as possible. SmartSVN 8.5.2 is available for download at

Can Big Data Help with Prediction?

A recent article entitled, “Limited role for big data seen in developing predictive models”, splashes a little cold water on the idea that Big Data will magically help develop better predictive analytics tools.  The headline caught my attention, as it’s become a truism that a poor algorithm with lots of data will outperform a great algorithm with not enough data.  Let’s go ahead and ask, can Big Data help with prediction?

Now, I understand the author’s point.  If you are performing a well-structured study and you have a deep understanding of the domain, then a smaller and carefully constructed data set will probably serve you better. Later in the article, however, Peter Amstutz, analytics strategist at advertising agency Carmichael Lynch, mentions that in many cases you’re not even sure what you’re looking for and often need to aggregate loosely structured data from disparate sources.  After all, there’s a lot more unstructured data in the world, and it’s growing quickly.

I find myself favoring the dissenting view.  In my job I’m often trying to answer questions like, “Will our next release ship on time given what I now know about the backlog, other projects taking away resources…,” and so on.  It’s not as simple as looking at a burn down chart to track progress.  In my head I’m meshing all types of data points – chatter on the engineering forums, vacation schedules, QA panic boards, et cetera.  I sometimes get a ‘pit of my stomach’ feeling that the schedule is slipping, but when I try to actually quantify what I’m seeing, it’s difficult.  There are so many sources of data to correlate, and none of them report consistently.

Of course, if we had a data warehouse I could run some cool reports on trends I’m seeing, but I wouldn’t try to convince the higher-ups to make that level of investment (ETL tools, data stores, visualization front end) and I’m sure they won’t give me JDBC access to all of our databases.

On the other hand, I’ve got a small Hadoop cluster available – just a set of VMs, but sufficient for the volume of data I need to examine – and I know how to pull data using tools like Flume and Sqoop.  All of a sudden I’m seeing possibilities.

This is one of the real benefits of ‘Big Data’ for predictive analytics.  It can handle the variety of data I need without ETL tools, at a fairly low cost.

Intro to Gerrit – Subversion and Git Live 2014

You may be aware of Gerrit, the web based code review system. Our Director of Product Marketing, Randy Defauw, has a number of good reasons for adopting it as part of your development process:

The most interesting thing about Gerrit is that it facilitates what some call ‘continuous review’. Code review is often seen as a bottleneck in continuous delivery, but it’s also widely recognized as a way to improve quality. Gerrit resolves this conundrum with innovative features like dynamic review branch creation and the incorporation of continuous build into the heart of the review process.

Gerrit is also notable because it is the most enterprise friendly Git code review system, although it has open source roots. It integrates with all standard authentication frameworks, has delegated permission models, and was designed for large deployments.

Randy is Director of Product Marketing for WANdisco’s ALM products. He focuses on understanding in detail how WANdisco’s products help solve real world problems, and has deep background in development tools and processes. Prior to joining WANdisco he worked in product management, marketing, consulting, and development. He has several years of experience applying Subversion and Git workflows to modern development challenges.

If you’d like to hear more about Gerrit, or Git in general, come see us at Subversion and Git Live 2014.

SmartSVN 8.5 Moves from SVNKit to JavaHL

Following on from the release of SmartSVN 8.5, we wanted to give you a bit more detail about the main big change in SmartSVN 8.5, so here’s Branko Čibej, our Director of Subversion, with an explanation:

One of the most significant events during the development of SmartSVN 8.5 was the decision to adopt the JavaHL library in place of SVNKit, which was used by all previous releases of SmartSVN.

JavaHL is a Java wrapper for Subversion, published by the Apache Subversion project. The most important difference compared to SVNKit is that JavaHL uses the same code base as the Subversion command-line client and tools. This has several benefits for SmartSVN: quicker adoption of new Subversion features; more complete compatibility with Subversion servers, repositories and other clients; built-in support for new working copy formats; and, last but not least, speed — as demonstrated by the phenomenal performance improvements in SmartSVN 8.5, compared to both 8.0 and 7.6.

The decision to adopt JavaHL has also benefited the Subversion community at large: several bug fixes and enhancements in Subversion 1.8.8 and the forthcoming 1.8.9 and 1.9.0 releases are a direct result of the SmartSVN porting effort. We will continue to work closely with the Apache Subversion developers to further improve both JavaHL and Subversion.

Hope that helps explain what’s going on a bit and why we opted to make the change. It’s worth bearing in mind that this is largely an ‘under-the-hood’ change, and you won’t notice much difference in the interface. The change will, however, make future development of SmartSVN much easier.

If you want to see more about the speed improvements there’s a results table in the release blog here.

Cheers all.

Top Challenges of the Git Enterprise Architect #3: Ever Growing Repos

Continuing on from Top Challenges of the Git Enterprise Architect #2: Access Control, I’ll next talk about Git’s ever-growing repos and some of the challenges they present.

Git stores all history locally, which is good because it’s fast. But it’s also bad, because clone and other command response times grow and never shrink.

Linear response time

Certain commands take linear time, O(n), in either the number of files in a repo or the depth of history. For example, Git has no built-in notion of a revision number. Here’s one way to count the revisions of a particular file (the path is illustrative):

git rev-list --count HEAD -- path/to/file

Of course, a consecutive file revision number is not a foundational construct in Git, as it is in a system like Subversion, so Git needs to walk its DAG (directed acyclic graph) backward to the origin, counting the revisions of a particular file along the way.

When a repo is young, this is typically very fast. And as I noted, revision numbers play a less important role in Git, so we need them less often. But think about what happens when an active shared Git repository stays in service for a long time. When I’ve asked how long typical SCM admins expect to keep a project supported in an SCM, the answers range from 4 to 10 years. An active file might accumulate hundreds or even thousands of revisions, and you’d want to think twice about counting them all up with Git.

Facebook’s struggle

Facebook’s concern over Git’s ever-growing repos and ever-slowing performance a few years ago led them to switch to Mercurial and centralize much of the data. Alas, this approach is not a solution for most companies. It relies on a fast, low-latency connection, and unless you have access to the unique, fast, data-safe, active-active replication found in WANdisco’s MultiSite products for Subversion and Git, users remote from the central site will suffer sharply degraded performance.

Common workaround

The most common workaround I hear about is that when a Git repo gets too big and slow, a new shared master is cloned and deployed, and the old one serves as history. This is clearly not ideal, but many of the types of development that first adopted Git are less affected by fragmented historical SCM data. As Git gains more widespread adoption across a greater variety of enterprise development projects, better solutions will be needed. Here at WANdisco we are hard at work paving the road ahead so that your Git deployments will scale historically as well as geographically.

SmartSVN 8.5 Available Now

We’re happy to announce the release of SmartSVN 8.5, the graphical Subversion (SVN) client for Mac, Windows and Linux. SmartSVN 8.5 is available for download from our website here.

Along with several bug fixes and enhancements, SmartSVN 8.5 makes the critical move from SVNKit to JavaHL, the same back end as used by the Subversion command line client/server.

Major Improvements

Whilst it may not look different, this release signifies a huge change: we’ve moved away from SVNKit, and SmartSVN now uses JavaHL. This is the same library used by command-line Subversion, and it has given SmartSVN 8.5 much improved stability and a huge speed boost. Some comparison numbers:

All times in seconds (lower is better); columns compare SmartSVN 7.6.3, 8.0, and 8.5 (JavaHL) on text files and on JPG files.

Operation    Text 7.6.3   Text 8.0   Text 8.5   JPG 7.6.3   JPG 8.0   JPG 8.5
Checkout          72.21      78.86       7.34      118.13     120.35     10.92
1st Add          133.60     201.94      37.49       64.19      98.61     15.47
Revert            47.14      75.06      16.89       41.19      77.15      8.85
2nd Add          131.75     186.18      34.64       60.87     101.76     13.81
Commit           314.44     440.23      85.70      167.34     252.85     42.46
Remove            13.86    1146.77      13.76        6.76     553.41      8.70

We’ve also added support for Subversion 1.8.8 and the file:// protocol for local repository access.

For a full list of all improvements, bug fixes and other changes please take a look at the changelog.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download SmartSVN 8.5 from our website here.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

Subversion 1.9 Underway

Subversion 1.9 is already well underway, following up quickly after last year’s impressive 1.8 release. Although the final set of new features may change, there’s one piece of infrastructure work that’s worth highlighting.

A new tunable to control storage compression levels lets you choose a better balance between repository size and server CPU load. Disabling regular compression and deltification will yield a substantial improvement in throughput when adding, committing, and checking out large files.  You can expect to see more numbers at the next Subversion & Git Live conference, but I will mention that commit speed can increase from 30-40 MB/s to 100 MB/s.
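
For FSFS repositories, this tuning is expected to live in the repository's db/fsfs.conf file. Here's a minimal sketch based on the in-progress 1.9 work; the section and option names may change before the final release:

```ini
# db/fsfs.conf (Subversion 1.9, FSFS repositories)
[deltification]
# 0 disables zlib compression of file content; higher values (up to 9)
# trade more server CPU for a smaller repository on disk
compression-level = 0
```

Disabling compression like this is what yields the large-file throughput gains mentioned above, at the cost of more disk space.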

Here are a few other tidbits that may interest you:

  • New tool to list and manage cached credentials

  • The ability to commit to a repository while a pack operation is in progress (Goodbye, long maintenance windows!)

  • Infrastructure work is starting for a new storage layer that will reduce repository size and improve performance.

While you’re waiting for Subversion 1.9, now’s a great time to upgrade to Subversion 1.8. You can enjoy many of the benefits just by upgrading the Subversion binaries on your server.

Resource Management in HDFS and Parallel Databases

A recent survey from Duke University and Microsoft Research provides a fascinating overview of the evolution of massively parallel data processing systems. It starts with the evolution of traditional row-oriented parallel databases before covering columnar databases, MapReduce-based systems, and finally the latest Dataflow systems like Spark.

Two of the areas of analysis are resource management and system administration. An interesting tradeoff becomes apparent as you trace the progression of these systems.

Traditional row- and column-oriented databases have a rich set of resource management tools available. Experienced database administrators (DBAs) can tune individual nodes or the entire system based on hardware capacity, partitions, and typical workloads and data distribution. Perhaps just as importantly, the DBA can draw on decades of experience and best practices during this tuning. Linear scalability, however, is a bit more challenging. Theoretically, many parallel database systems support adding more nodes to balance the workload, but in reality it requires careful management of data partitions to get the best value out of new resources.

Similarly, DBAs have access to many high quality system administration tools that provide performance monitoring, query diagnostics, and recovery assistance. These tools have evolved over the years to allow very granular tuning of query plans, indexes, partitions, and schemas.

Reading between the lines, you had better have a good team of DBAs on hand. Classic database systems are expensive to purchase and operate, and knowing how to turn all of those dials to get the best performance is a challenge. Query optimization, for example, can be quite complex. Knowing how to best partition the data for efficient joins across a massive data set is not a solved problem in all cases, especially when a columnar data layout is used.

There’s a very big contrast in these areas when you look at systems built on HDFS, from the original MapReduce designs to the latest Dataflow systems like Spark. The very first design of MapReduce opted for simplicity with a static allocation of resources and the ability to easily add new nodes into the cluster. The later evolutions of Hadoop introduce improvements like YARN, which provide for more flexible resource management schemes, while still allowing for easy cluster expansion with the HDFS Rebalancer taking care of data transfer to new nodes. The newest Dataflow systems have the potential for much improved resource management, using in-memory techniques to aid in processing time. Most notably, systems like Spark can use query optimization based on DAG principles.

System administration in Hadoop is an evolving field. Some expertise exists in cluster management (or you can delegate that chore to cloud systems), but a Hadoop administrator does not have the same set of tools available to a traditional DBA; indeed a priori plan optimization is not even feasible when many ‘Big Data’ analytics packages only interpret data structure at query time.

To sum this up, I think that the ‘Big Data’ solutions have made (and continue to make) an interesting design choice by sacrificing some of the advanced resource management and system administration tools available to DBAs. (Again, some of these simply aren’t available when you do not know the data schema in advance.) Instead they favor a simplified internal representation of data and jobs, which allows for easier expansion of the cluster.

To put it another way, a finely tuned traditional parallel database will probably outperform a Hadoop cluster given sufficient hardware, expertise, and advanced knowledge of the data. On the other hand, that Hadoop cluster can grow easily with commodity hardware (beyond the breaking point of traditional systems) and not much tuning expertise other than cluster administration, which is a cost that can be spread over a large pool of applications. Plus, you don’t need to make assumptions about your data in advance. Dataflow systems like Spark will go a long way towards closing the performance gap, but in essence Big Data solutions are performing a cost-benefit analysis and coming down on the side of simplicity and ease of expansion.

This may be old hat to Big Data veterans, but I found the paper to be a great refresher on how the Big Data field reached its current position and where it’s going in the future.

SmartSVN 8.5 RC2 Released

We’re happy to say we’ve just released SmartSVN 8.5 Release Candidate 2. SmartSVN is the cross-platform graphical client for Apache Subversion.

Major Improvements

Whilst it may not look different, this release signifies a huge change: we've moved away from SVNKit, and SmartSVN now uses JavaHL. This is the same library used by command line Subversion, and it has given SmartSVN 8.5 RC2 much improved stability and a huge speed boost. Some comparison tables:

Text files

Operation   7.6.3 time (s)   time (s)   8.5 (JavaHL) time (s)
Checkout    75.27            81.61      26.22
Add         52.77            131.22     67.37
Revert      36.83            60.02      33.74
Commit      195.15           279.49     75.31
Remove      8.75             1176.24    21.38

Jpg files

Operation   7.6.3 time (s)   time (s)   8.5 (JavaHL) time (s)
Checkout    75.64            81.52      27.69
Add         37.38            72.12      22.22
Revert      33.80            69.67      18.08
Commit      116.07           176.88     40.63
Remove      5.57             595.71     11.58

We’ve also added support for Subversion 1.8.8 and the file:// protocol for local repository access.

For a full list of all improvements, bug fixes and other changes please take a look at the changelog.

Though this is still a release candidate, given the major improvements to performance we strongly recommend that all customers using SmartSVN version 8 or newer upgrade to this latest RC.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download Release Candidate 2 for SmartSVN 8.5 from our early access page.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

SCM is for everyone

A recent Forrester survey revealed some startling information about the adoption of SCM tools in the enterprise: 21% of the respondents are not using any SCM, and 17% are using tools that are a couple of generations out of date.

That information caught me off guard, but then again I work in the industry and probably tend to focus more on the up-and-coming than the tried-and-true. As I’ve been mulling this over, I’ve started to recall my own very first exposure to SCM.

The year was 1996, and I was an undergraduate research assistant working on an autonomous vehicle project. (Note to my children: yes, I do *really exciting and super cool* things at work.) I was on a team of five electrical engineers that wrote a lot of C code for image processing and vehicle control. Like a lot of ‘software developers’, however, we were domain problem solvers first and coders second.


For a while we did backups of our code on network drives and tapes. Then we heard about something shocking: a tool called Visual SourceSafe (VSS) that would easily store every version of our code, let us see how it changed, and revert changes easily. We could even make a branch as well as work on bug fixes and new experiments at the same time! (Of course we were programming half on Linux so we couldn't use VSS for everything, and therefore had to learn the unpleasantness of CVS.)

Back to the point: normally when someone asks me about the importance of SCM, I start thinking about the mainline model and how SCM is one of the foundations of continuous delivery. To take a step back, anyone who writes software, from assembly motor control code to Hadoop plumbing, needs to use SCM for the same three reasons I needed it in 1996:

  • Backups. Code is valuable.

  • History. Being human, you may make mistakes and need to discover/roll back those mistakes.

  • Branching. You sometimes need to work on bugs and new stuff at the same time.
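
A minimal Git session shows all three benefits in action; the repository name, file, and commit messages here are just for illustration:

```shell
# backup: every commit is a recoverable snapshot of the code
git init vehicle-code && cd vehicle-code
git config user.email "dev@example.com" && git config user.name "Dev"
echo 'int speed = 10;' > control.c
git add control.c && git commit -m "initial motor control code"

# history: inspect changes and roll back a mistake
echo 'int speed = 9000;' > control.c   # oops
git log --oneline                      # list every recorded version
git checkout -- control.c              # restore the last committed version

# branching: keep bug fixes and experiments separate
git branch experiment
git checkout -b bugfix
```

Subversion covers the same ground with svn commit, svn log, svn revert, and svn copy.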

Luckily, you’ve got better choices now than I did in 1996. Subversion and Git are two free, powerful, and mature SCM choices available on every platform. Both are fairly friendly to the newcomer particularly if you pick a GUI, and most applications that work in any way with source code will integrate with them.

So no more excuses – head over to our website for binaries, community forums, and quick reference cards and tutorials.

Git 1.9 certified binaries available

Git 1.9 is mainly a maintenance release, and includes a number of minor fixes and improvements. As usual, WANdisco has published certified binaries for all major platforms.

Click here to see the release notes. Key changes in 1.9 include:

  • You can now exclude a specific directory from contributing to the ‘git log’ command. This makes it easy to ignore changes from that directory when you’re browsing history.
  • ‘git log’ can also exclude history from branches that match a pattern.
  • Heads-up that the default behavior of ‘git push’ is slowly moving towards the Git 2.0 standard, so be sure to start being more explicit about setting up tracking relationships.
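
The first two items above can be sketched in a throwaway repository; directory and branch names here are hypothetical, and both the ':(exclude)' pathspec and the --exclude option require Git 1.9 or later:

```shell
# set up a repo with one commit touching src/ and one touching docs/
git init demo && cd demo
git config user.email "dev@example.com" && git config user.name "Dev"
mkdir src docs
echo 'core'  > src/a.c   && git add src  && git commit -m "src change"
echo 'notes' > docs/a.md && git add docs && git commit -m "docs change"

# browse history while ignoring everything under docs/
git log --oneline -- . ':(exclude)docs'

# exclude history from branches matching a pattern
git branch release/1.0
git log --oneline --exclude='release/*' --branches
```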

Visit our website to download certified binaries for Windows, Mac, and Linux.

LDAP Authentication in SVN 1.8.8

Following a thread in our Subversion forums, we've found that some people are having problems with LDAP after upgrading to or installing version 1.8.8. So far this has only been reported on the Red Hat and CentOS builds.

These builds use an updated apr-util package that no longer includes LDAP support; that support has moved to a separate package, apr-util-ldap, which wasn't listed as a dependency.

We've now updated the dependencies in our Subversion package to include apr-util-ldap. If you're getting this error, just re-run yum install subversion and the missing files will be installed for you.

Many thanks to Philip, one of our Subversion committers, for highlighting the issue so that we could sort it 🙂

America Invents Act and the Prior Use Defense

Proving Your Right to Continued Use with Global Repository Management

One of the significant changes in the America Invents Act (AIA) is the expanded scope of the 'prior use' defense. Before the AIA, the prior use defense applied only to business method patents, but it is now an effective and increasingly important defense in patent disputes centered on any process, machine, or manufacture, provided you can prove a valid prior use at least a year before the filing date or public disclosure of the claim in question.

Why is prior use so important now? For starters, the pace of patent litigation is on a sharp upward trajectory. Between 2008 and 2012 the number of patent cases commenced rose from under 3,000 a year to over 5,000 a year. Any effective defense is worth considering in this environment.

Also bear in mind that the AIA changed from a first-to-invent scheme to first-to-file. If a patent troll files the paperwork first to claim an invention, you could be at risk. Establishing with clear and convincing evidence that you were using the invention over a year before the filing is a very strong point in your favor.

The best evidence of prior use of software inventions is the audit trail provided by your SCM repository. By its nature an SCM repository tracks the birth of a software implementation: the combination of source code, libraries, build scripts, and deployment processes that shows how and when you started using an invention.

But there’s one fly in the ointment – the use of ‘skunk works’ repositories. Some development teams using Git like to stand up informal repositories to work on new ideas or pet projects, only moving the project into the ‘official’ repository when it reaches some stability milestone.

That’s clearly a problem. As we’ve seen from the time frames in the AIA, every day counts when you’re establishing a prior use defense. If you lose the first three months of prototype history because the ‘skunk works’ repository was lost, you may slip past the one year limit for prior use.

Before you try to enforce a policy against these ‘skunk works’ repositories, keep in mind why Git development teams might use them:

  • They’re working at a remote office and the network latency is making their Git operations painfully slow.

  • The official Git repositories are slow due to too much load on the system from a large user base and build automation.

  • The process of requesting an official repository and adding it to the enterprise backup and security schemes is too time consuming.

The solution is a fast and stable enterprise Git service that is easily accessible for any development team. In other words, make it easier for developers to use your secure and highly available Git repositories and they won’t be tempted to set up their own infrastructure.

Git MultiSite provides the enterprise Git solution that fits the bill. With Git MultiSite's patented replication technology, every development team gets a local Git repository with fast LAN access. Every Git server in the deployment is a fully replicated and writable peer. Slow Git operations are a thing of the past, with most operations being local and commits (pushes) being coordinated very efficiently with other nodes in the deployment. Plus, Git MultiSite's automated failover and self-healing capabilities mean zero downtime.

Git MultiSite and Git Clustering also provide a very scalable solution. Additional Git servers can be added at any time to handle increased load, giving you more confidence to spin up new Git repositories for every pet project that might turn into the next big thing. These new repositories can be deployed at the click of a button in the administration console.

Finally, Git Access Control makes sure that your security policies and permissions are applied consistently at every site, on every server.

Git MultiSite removes the performance and security concerns that normally make you hesitate about providing ‘self-service’ SCM infrastructure, eliminating the need for ‘skunk works’ repositories.

The AIA makes it more important than ever to keep track of every software recipe in your organization. Let Git MultiSite provide the infrastructure you need to protect all of your intellectual property.


Top Challenges of the Git Enterprise Architect #2: Access Control

Git has no built-in access control features.  If that seems surprising, one reason is that the Git project specifically considers access control to be outside the scope of a version control tool. Another reason is that best practices with Git typically result in many small repos, in contrast with the gargantuan repos often found with centralized version control systems.  Having logically unrelated code resident in separate repositories means access can be controlled through authentication, where a user has either zero or complete access to a repo.

Enterprise software development is often subject to demands not found in the open source landscape where Git was born. Code bases can have interdependencies that may prove too entangled to refactor into individual Git repositories.  Or projects have migrated from a centralized version control system where large files are mixed with small, straining Git's assumptions about how large a repo can be. Sometimes we see product code assembled from a combination of contributors, combining code from outside-the-firewall contractors with inside-the-firewall employees. All of these situations can create access control needs that go beyond all-or-nothing repo access.

Far reaching effects

The Git project's decision to leave access control as an exercise for the user has another important effect: it invites increased diversity in the choice of access control tooling. Rather than most groups falling in line with, for example, Apache controls for Subversion, we typically see a large organization sprout a variety of open source and commercial solutions. Infrastructure seeks consolidation, so we sometimes see IT/SCM teams scrambling to avoid becoming responsible for supporting a large number of new technologies of questionable pedigree.

This situation is compounded by the fact that Git can easily, and often secretly, be used by developers still tethered to a centralized system. This means Git often obtains a significant beachhead of adoption, and a divergent mess of ad-hoc technology stacks along with it, before your enterprise SCM administrators step in.

Gradual migration

Another common situation is where developers in a company are gradually shifting from other systems to projects using Git. In these cases they may spend long periods of time accessing both the older systems and the new. We believe we will see Subversion and Git co-deployed for an extended period in enterprise development environments. That's why managing access control across Subversion and Git deployments is part of the core functionality of our new Access Control Plus product.

Looking ahead to solutions

This series of posts is about challenges more than solutions. I’ll be speaking about solutions for Git access control at our Subversion & Git Live conference this May. See you there!

Rebasing in Subversion

One of Git’s most popular features is the rebase command. Rebasing has several uses, but one of the most interesting is the ability to ‘merge without merging’. In other words, you can take a set of commits from a branch and reapply them to the same branch after adjusting the branch pointer. A couple of pictures will help to illustrate the concept as we try to keep a topic branch up to date with the latest work from master.

In this picture we start with a master branch, then make a topic branch and record a commit there. Next we switch back to master and record another commit. Then we merge master to topic to pick up the latest trunk work, and end up with a merge commit.

Rebasing lets us take a different approach.  

We start out the same way, but instead of merging master to topic, we rebase topic to master. That takes our local commit (t-1) and reapplies an equivalent patch after moving the topic branch to the head of master. The net effect is that it looks like topic was created from the current head of master. This is a really useful way to keep topic up to date with master without recording a bunch of merge commits.
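
The pictures above can be reproduced with a few commands; the commit messages follow the m-1/t-1 labels from the diagrams, and the -b flag to git init (Git 2.28+) just pins the initial branch name:

```shell
# master with m-1, a topic branch with t-1, then m-2 back on master
git init -b master rebase-demo && cd rebase-demo
git config user.email "dev@example.com" && git config user.name "Dev"
echo 'one' > a.txt && git add a.txt && git commit -m "m-1"
git checkout -b topic
echo 'two' > t.txt && git add t.txt && git commit -m "t-1"
git checkout master
echo 'three' > b.txt && git add b.txt && git commit -m "m-2"

# instead of merging master into topic, replay t-1 on top of m-2
git checkout topic
git rebase master
git log --oneline   # linear history: t-1 now sits on top of m-2, no merge commit
```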

It’s not possible to do this directly in Subversion, but there is a recipe that gets you close. First let’s look at the normal situation in Subversion when I work on a topic branch and do a refresh merge from trunk.

The revision graph (from SmartSVN) shows what you’d expect, although note that Subversion revision graphs don’t show the merge arrows.

Now let’s look at how to get to this point instead:

I can’t change my branch pointer in Subversion like I can in Git, but I can create a new branch (I’ll call it feature_prime) that starts from the new tip of master and then apply any patches from the original feature branch. At this point I can throw away the original feature branch and continue working on feature_prime. It’s not an ideal solution, but may still be useful in some cases.

How did I get there?  Here’s the recipe, starting after the m-2 commit on master:

# make feature_prime branch
svn copy ^/master ^/feature_prime -m "mkbranch"
svn up
cd feature_prime

# calculate diffs between feature and master
svn mergeinfo --show-revs eligible ^/feature ^/master

# for each rev found, run a cherry pick merge into feature_prime
svn merge ^/feature -c 8 --ignore-ancestry
svn commit -m "rb-8"

Essentially we find the deltas between feature and master, then run an ignore-ancestry merge to get them into the new branch. That type of merge ignores the normal merge history and just calculates diffs.

In my next post about rebasing in Subversion I’ll talk about the more general case of rebasing in Git, which turns out to be easier to emulate in Subversion.


Apache Announces Subversion 1.7.16

On the heels of Subversion 1.8.8, I’m pleased to announce the release of 1.7.16 on behalf of the Apache Subversion project. Along with the official Apache Software Foundation source releases, our own fully tested and certified binaries are available from our website.

1.7.16 is a bugfix and security fix release and does not include major new features. 1.7.16 includes the following changes:

  • A security fix that prevents Apache httpd from crashing when SVNListParentPath is turned on and certain requests are received. Further details on the issue can be found in the advisory the Apache Subversion project has published.
  • Reduced memory usage in both server implementations during checkout and export operations.
  • Fixed an issue that caused executing a copy of a relocated path to break the working copy.
  • Resolve a number of regressions in diff that we introduced in 1.7.14. Most notably, requesting a diff against a specified revision and a working copy file that had a svn:mime-type property would fail.

For a complete list of new features and fixes, visit the Apache changelogs for Subversion 1.7.

You can download our fully tested, certified binaries for Subversion 1.7.16 free here.

WANdisco’s binaries are a complete, fully-tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Demo of Non-Stop HBase: Brett Rudenstein – Strata SC 2014

Brett Rudenstein, Senior Product Manager for Big Data, sat down with theCUBE’s Dave Vellante and Wikibon’s Jeff Kelly at Strata to examine why customers are asking how they can use their idle clusters for more than just disaster recovery, what Non-Stop HBase means for enterprise architecture, and the implications for real-time mission-critical applications. Rudenstein provides a demonstration of Non-Stop Hadoop’s continuous availability to show how WANdisco’s active-active replication is applied to HBase region servers, enabling real-time data visualization.

Be sure to watch theCUBE’s interviews with CEO David Richards and CTO Jagane Sundar, CMO Jim Campigli, and Dr. Konstantin Boudnik for more from Strata Santa Clara.

More information about Non-Stop Hadoop is available on our website.

Interview: Dr. Konstantin Boudnik – Strata Santa Clara 2014

The always entertaining and educational Dr. Konstantin Boudnik – a.k.a. “Cos” – gave theCUBE an insider’s perspective on the current state of Hadoop adoption and innovation. Cos cuts straight to the point, discussing why enterprise solutions available today aren’t meeting enterprise demands, the need for continuous global availability, the problems with HBase’s built-in failover capability in environments with multiple region servers, and what WANdisco’s technology means for the future of real-time applications.

Don’t miss theCUBE’s interviews with CEO David Richards and CTO Jagane Sundar, and CMO Jim Campigli for more from Strata Santa Clara.

More information about Non-Stop Hadoop is available on our website.

Interview: Jim Campigli & Jagane Sundar – Strata Santa Clara 2014

Drilling deeper into the current state of Hadoop in the enterprise, CMO Jim Campigli and CTO and VP of Engineering for Big Data Jagane Sundar spoke with theCUBE to discuss why the majority of Hadoop clusters aren’t yet in production, how WANdisco is enabling worldwide data availability and disaster recovery, the use cases that enterprises are most interested in, and what’s driving demand for continuous availability and Non-Stop Hadoop in various industries.

Watch theCUBE’s interview with CEO David Richards and CTO Jagane Sundar and watch out for more interviews with WANdisco execs and engineers from Strata Santa Clara coming soon.

More information about Non-Stop Hadoop is available on our website.

Apache Announces Crash Fixes and Performance Improvements for Subversion 1.8.8

Today I’m pleased to announce the release of Subversion 1.8.8 on behalf of the Apache Subversion project. Along with the official Apache Software Foundation source releases, our own fully tested and certified binaries are available from our website.

1.8.8 is a bugfix and security fix release and does not include major new features. 1.8.8 includes the following changes:

  • A security fix that prevents Apache httpd from crashing when SVNListParentPath is turned on and certain requests are received. Further details on the issue can be found in the advisory the Apache Subversion project has published.
  • Reduced memory usage in both server implementations during checkout and export operations.
  • Fixed an issue that caused executing a copy of a relocated path to break the working copy.
  • Support verifying SSL server certificates using the Windows CryptoAPI when the certificate has an intermediary certificate between it and the root certificate. This restores the ability to verify certificates automatically as was the case before intermediate certificates became commonly used.
  • Clients receiving redirects from DAV servers can now automatically relocate the working copy even if the working copy is not rooted at the repository root.
  • Improve performance when built with SQLite 3.8 which has a new query planner.
  • Fix errors that occurred when executing a move between an external and the parent working copy.
  • Resolve a performance regression with log when used against old servers with a single revision range.
  • Decrease the disk I/O needed to calculate the differences between 3 files during a merge.
  • Prevent problems with symlinks being checked out on Windows to a NAS that doesn’t support a flush operation.
  • When committing do not change the permissions on files in the working copy.
  • When committing fix an assertion due to pool lifetime issues. This was usually seen by git-svn as an error about a path not being canonical.
  • Fix an error with status that caused failures on some lesser used platforms such as PPC due to a missing sentinel value.
  • When creating a rep-cache.db file in a FSFS repository, use the proper permissions so that it can be used without an admin fixing the permissions.
  • Fix the mod_dav_svn SVNAllowBulkUpdates directive so that it can be changed in different blocks.
  • Fix mod_dav_svn to include properties in the reports when requested by the client, so that the client doesn’t need to request them separately.
  • Fix the help text of svnserve to correctly document the default size of the memory cache: it defaults to 16 MB in all modes, not 128 MB in threaded mode.
  • Reduce the size of dump files when the '--deltas' option is used by calculating the delta even when we haven't stored a delta in the source repository due to the skip delta algorithm.
  • Fixed several build issues when building bindings. Most notably OS X can build the SWIG bindings out of the tarball without regenerating the interfaces.
  • Developers using the Subversion APIs will find numerous documentation fixes and some API changes and should refer to the CHANGES file for details.

For a complete list of new features and fixes, visit the Apache changelogs for Subversion 1.8.

You can download our fully tested, certified binaries for Subversion 1.8.8 here.

Using Subversion on Windows? Download TortoiseSVN 1.8.5 now.

WANdisco’s binaries are a complete, fully-tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Interview: David Richards & Jagane Sundar – Strata Santa Clara 2014

Strata Santa Clara was exciting for us with our announcement of Non-Stop HBase attracting great interest. During an interview with theCUBE, CEO David Richards and CTO and VP of Engineering for Big Data Jagane Sundar discussed Non-Stop HBase architecture and use cases, trends in enterprise Hadoop adoption, the leaders that will emerge in the Big Data market, and more.

Stay tuned for more interviews from theCUBE and visit our website to learn more about Non-Stop Hadoop.

Detecting Dependency Trends in Components Using R and Hadoop

As I’ve been experimenting with Flume to ingest ALM data into a Hadoop cluster, I’ve made a couple of interesting observations.

First, the Hadoop ecosystem makes it easy for any team to start using these tools to gather data from disparate ALM sources. You don’t need big enterprise data warehouse (EDW) tools – just Flume and a small Hadoop cluster, or even just a VM from one of the Hadoop vendors to get started. These tools are free and easy to use in a small deployment, and you simply scale everything up as your needs grow.

Second, once the data is in Hadoop, you have access to the growing set of free data analysis tools for Hadoop, ranging from Hive and Pig, to scripted MapReduce jobs and more powerful tools like R.

My most recent experiment utilized the RMR package from Revolution Analytics, which provides a bridge between R, MapReduce, and HDFS. In this case, I had already used Flume to ingest Git commit data from a couple of related Git repositories, and I decided to look for any unusual relationships in the commit activity for the components in the system, including:

  • The most active components

  • The number of commits that affected more than one component

  • Which pairs of components tended to see work in the same commit

That last item I often find very interesting, as it may indicate some dependencies between components that aren’t otherwise obvious.

I had all the Git data stored on HDFS, so I used a ‘word count’-style MapReduce task to provide the counts. A partial R script is shown below.

# load the RMR package (rmr2), which bridges R, MapReduce, and HDFS
library(rmr2)

dfs.git = mapreduce(
  input = "/user/admin/git",
  map = function(k, v) {
    comps = c()

    for (i in 1:nrow(v)) {
      lcomps = c()

      # … some cleanup work to extract components ...
      lcomps = append(lcomps, component)
      lcomps = sort(unique(lcomps))
      numUnique = length(lcomps)

      # record every pair of components touched by the same commit
      multis = c()
      if (numUnique > 1) {
        for (j in 1:(numUnique - 1)) {
          for (k in (j + 1):numUnique) {
            multis = append(multis, paste0(lcomps[j], "-", lcomps[k]))
          }
        }
        # flag the commit as spanning multiple components
        lcomps = append(lcomps, "MULTI")
      }
      lcomps = append(lcomps, multis)

      comps = append(comps, lcomps)
    }
    keyval(comps, 1)
  },
  reduce = function(k, vv) {
    keyval(k, sum(vv))
  }
)


Now that I’ve got these counts for each component and component pair, I can easily get it back into R for further manipulation.

out = from.dfs(dfs.git)
comps = unlist(out[[1]])
count = unlist(out[[2]])
results = data.frame(comps=comps, count = count)
results = results[order(results[,2], decreasing=T), ]
r = results[results$count > 250, ]

I’ll just focus on the most active components and pairs, which I can see in this plot.

Anything interesting there? Maybe. It certainly looks like the ‘app’ component is far and away the busiest component, so perhaps it’s ripe for refactoring. I also notice that ‘app’ and ‘spec’ tend to be updated a lot in the same commit, and there’s a lot of cross-component work (“MULTI”) going on. And what’s missing? Well, the ‘doc’ module isn’t updated very often with other components.  Perhaps we’re not being good about documenting test cases right away.

But the main point is that I can now do some interesting data exploration with a minimum amount of work and no investment in an EDW.

So even if your ALM data isn’t ‘Big Data’ yet, you can still take advantage of the flexibility, low barriers to entry, and scalability of the Hadoop ecosystem. You’ll have some fairly interesting realizations before you know it!


SmartSVN 8.5 Preview 2 Released

A couple of days ago (10th Feb) we released SmartSVN 8.5, Preview 2. SmartSVN is the cross-platform graphical client for Apache Subversion.

New SmartSVN 8.5 features include:

  • Native Subversion libraries used for improved performance

SmartSVN 8.5 fixes include:

  • Various authentication fixes
  • Fixed conflicts with previously installed JavaHL version
  • Fixed several native crash causes
  • Fixed support of repositories where the user does not have access to the repository root
  • Fixed error after entering master password
  • Fixed local repository creation
  • Windows: attempt to launch a second SmartSVN instance no longer produces an error

For a full list of all improvements and bug fixes, view the changelog.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download Preview 2 for SmartSVN 8.5 from our early access page.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

Apache Announces Subversion 1.9 Alpha

Today the Apache Software Foundation (ASF) announced the alpha release of Subversion 1.9.0. As usual, you can download the official source release from the ASF, or download our fully tested and certified binaries from our website.

The alpha release of 1.9.0 is intended to provide an opportunity for users to give their feedback early in the release process. During the 1.8.0 release candidate process we received some feedback on the behavior of conflict resolution that we found difficult to change without delaying the release significantly. With 1.9.0, we’d like to get feedback earlier not only to speed up our release process, but to support the continued production of alpha releases.

In particular, we’d like feedback on the improvements to the interactive conflict menus, the new reverse blame support, and the new svn auth command. We believe the interactive conflict menus in the command line client have been made easier to understand. Reverse blame now allows you to see when lines were deleted, not just added or changed by moving through history in the opposite direction. Likewise, changes to the svn auth command allow you to view and manipulate your authentication credential cache.

Given that this is an alpha, we don’t recommend it for production use since there are known issues (see the release announcement) and testing has not been fully completed to ensure it is stable.  Additionally, things will almost certainly change before 1.9.0 is released. But if you have a test environment and you can spare some time to look into it, we’d like to hear your feedback.

One thing you may notice about 1.9.0 is that we’ve focused on smaller features and performance improvements. The biggest change that’s coming in 1.9.0 is the new FSFS format 7 with logical addressing which will improve server performance.

For a complete list of new features and fixes, visit the Apache changelog for Subversion 1.9. We’ll be publishing more blogs about upcoming 1.9 features and improvements in the coming months so stay tuned here for more details.

You can download our fully tested, certified Subversion binaries here. Certified 1.9.0 alpha binaries are available for the following operating systems:

  • Windows
  • Redhat Enterprise Linux 6
  • Redhat Enterprise Linux 5
  • CentOS 6
  • CentOS 5
  • Ubuntu 12.04
  • Ubuntu 10.04

Announcing Non-Stop HBase

Today at Strata Santa Clara we announced Non-Stop HBase, providing continuous availability across multiple data centers any distance apart. HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and used for random, real-time read/write access to Big Data including Facebook’s messaging platform.

In the same way that HDFS has a single point of failure in the NameNode, HBase has a Master Server that manages the cluster and Region Servers that store portions of tables and perform work on the data. HBase is sensitive to the loss of the Region and Master Servers.

Non-Stop Hadoop (including Hortonworks and Cloudera editions) applies WANdisco’s patented replication technology to these two availability and performance bottlenecks in HBase’s architecture – its region and master servers – to eliminate the risk of downtime and data loss.

“HBase is used for real-time interactive applications built on Hadoop,” said David Richards, Chairman and CEO of WANdisco. “Many of these Big Data applications are mission critical and even the cost of one minute of downtime is unacceptable. One hundred percent uptime for HBase is now a reality with our Non-Stop HBase technology. For the first time we’ve eliminated the risk of downtime and data loss by enabling HBase to be continuously available on a global scale.”

Stop by the WANdisco booth at Strata Santa Clara and don’t miss CTO Jagane Sundar’s session, “Non-Stop HBase – Making HBase Continuously Available for Enterprise Deployment” Thursday, February 13th.

Visit our website to learn more about how Non-Stop Hadoop provides LAN-speed performance and access to the same data at every location with automatic failover and recovery both within and across data centers.

Smaller Subversion Repository Size

Subversion 1.8 brought several changes to the FSFS storage layer, and one of the results is smaller Subversion repository size in many situations.

The latest FSFS format will pack revision properties and use deltification for directories. It also eliminates duplicate storage of identical files and properties in a single revision. As usual you have a lot of control over which of these new features are used – just take a look in the fsfs.conf file in the repository’s db directory.
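For example, directory and property deltification are controlled by the [deltification] section of fsfs.conf; a sketch of what that looks like (option names as of Subversion 1.8; check the comments in your own fsfs.conf for the authoritative list):

```ini
[deltification]
# store directory contents as deltas instead of full copies
enable-dir-deltification = true
# deltify node properties as well
enable-props-deltification = true
```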

To get an idea of how much storage savings these changes might yield, I set up two new repositories, one with deltification enabled for directories and properties, and one with deltification turned off. I imported an identical set of source data into each repository, and the baseline repository size was about 18.4 KB. I then ran a script that set 1,000 properties on a directory, and 1,000 revision properties. (It may not seem terribly realistic, but recall that running a merge in Subversion writes the mergeinfo property on a directory.) After running this script on both repositories I measured the growth in disk space consumption:

  • The repository without deltification had grown to 39.6 KB.

  • The repository with deltification was at 26.4 KB.

That’s a substantial relative savings: the deltified repository grew by only 8.0 KB versus 21.2 KB without deltification, a 62% reduction in growth. Over the life of a busy repository, deltification could save you a lot of disk space.

You may also see a performance boost from caching of revision properties. During a checkout, SVN may need to read thousands of revision property files, and now some portion of that data can be cached. If you combine revision property caching with revision property packing, you’ll see a nice speed up to commands like svn log.

Some of these improvements are available simply by running svnadmin upgrade on your repository, but to get the maximum benefit you should dump and reload the repository. If you want to give it a try, download the latest certified Subversion binaries.
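A dump and reload can be sketched as follows (repository paths are hypothetical; take the repository offline or block writes while you do this):

```shell
# dump the full history of the existing repository
svnadmin dump /var/svn/repo > repo.dump

# create a fresh repository using the latest FSFS format
svnadmin create /var/svn/repo-new

# load the history into the new repository, then swap it into place
svnadmin load /var/svn/repo-new < repo.dump
```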

Subversion is a registered trademark of the Apache Software Foundation.


Join us for Hands-On Subversion, Git and Gerrit!

Today, we announced Subversion & Git Live 2014 in New York (May 6) and San Francisco (May 13). We’re excited about the new hands-on workshops we’ve added to the schedule this year, along with presentations by industry analysts and open source developers, committers and experts.

Check out this year’s keynotes and general sessions:
Kurt Bittner, Forrester Research Principal Analyst – Compliance and Continuous Delivery for Subversion and Git: An Oxymoron No Longer
Ross Gardler, President, Apache Software Foundation – The Open Source Enterprise
Branko Cibej, Director of Subversion – The New Subversion: Storage, Scalability, and Speed
Daniel Petersen, Pitney Bowes – Subversion for the Global Enterprise

There will be four tracks with sessions geared to developers, administrators and managers, and at least one hands-on session in each. The ever-popular roundtable sessions — one for Git and one for Subversion — will provide the opportunity for you to meet core developers who design and build Subversion and Git and represent your organization’s interests in shaping the direction of these open source solutions.

Visit the main event web page to learn more about the conference, view the agenda, and register. If you’re ready to register today, use the promo code EARLY2014 to take advantage of a 30% early bird discount available through February 28. If you’re a WANdisco customer, contact your rep for a special 50% off code!

Hope to see you at Subversion & Git Live!

Causality in a Real-World Distributed Consensus Implementation

Sebastien Wiertz

I read with interest Peter Bailis’s excellent article Causality is expensive (and what to do about it), because causality lies at the heart of our global replication solutions for Subversion, Git and Hadoop.

Clock-less causal ordering implements WAN-based, 100% data-safe replication by guaranteeing that the order of writes is the same at each node of a replication group. The result is a set of independent nodes that evolve in exactly the same way, remaining shared-nothing replicas of each other. This true active-active replication is the key technology for eliminating single points of failure in any system, such as the NameNode in Hadoop, while also reducing latency in SCM systems such as Subversion and Git.

The part of the article that most caught my eye, however, was the section on how expensive causality is in terms of storage. O(N) per vector clock seems to be the theoretical limit, where N is the number of processes. Clearly this can get expensive.

So how does WANdisco’s replication engine deal with this limit when processing millions of writes a month on a system with dozens of nodes?

Bailis outlines four sophisticated approaches in his article, none of which we employ in WANdisco’s core DConE replication technology.

Instead of weakening order guarantees or reducing availability, DConE instead uses two main methods to reduce the storage cost of causality:

1. Garbage collection
2. Sidelining

Garbage collection works because each node broadcasts the highest global sequence number it has applied when it finishes processing a transaction, so every node knows which transactions have been processed everywhere and can garbage collect them. There are a few wrinkles that go beyond the scope of this article, but that’s the basic technique.

Sidelining is used when a node has fallen so far behind that the other nodes may accumulate too large a transaction queue. This can happen when a node is partitioned from the group. Since the rest of the nodes can still achieve quorum, they can continue to process write transactions. The queue can’t be garbage collected because the partitioned node can’t signal completion, much less actually process the queued transactions. Sidelining means that we give up on the partitioned node and garbage collect without it. A recovery process with helper nodes assists in bringing the node back to current when it comes back online.
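As an illustration only (a simplified sketch, not WANdisco’s actual implementation), the core rule is: prune the transaction log up to the minimum sequence number that every live node has reported as applied, and sidelining just removes a partitioned node from that calculation:

```python
def prune_point(applied, sidelined=()):
    """Highest global sequence number that every live node has applied.

    applied:   dict of node id -> highest sequence number applied
    sidelined: nodes excluded from the calculation (e.g. partitioned)
    """
    live = {n: s for n, s in applied.items() if n not in sidelined}
    return min(live.values())

def garbage_collect(log, applied, sidelined=()):
    """Drop transactions that every live node has already processed."""
    cutoff = prune_point(applied, sidelined)
    return {seq: txn for seq, txn in log.items() if seq > cutoff}

# hypothetical state: node C is partitioned and falling behind
log = {1: "txn1", 2: "txn2", 3: "txn3", 4: "txn4"}
applied = {"A": 4, "B": 3, "C": 1}

print(garbage_collect(log, applied))                   # C holds GC back
print(garbage_collect(log, applied, sidelined={"C"}))  # after sidelining C
```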

These techniques work because in our real-world implementations the ordering is used to modify the state of something else – in our case Subversion, Git and Hadoop. So once we’ve applied the state, we need to keep around only enough causal history to catch up a temporarily slowed node. A node can be recovered by restoring a known application state out-of-band, and then starting the replication at an appropriate global sequence number.

Because our applications are deployed at hundreds of companies worldwide, we often take a pragmatic approach rather than a research oriented one. That said, the advanced techniques outlined in the article are not unknown to us, and form an inspiration as we stretch the capabilities of our products and technology.

Subversion and Git Disk Monitoring

Here’s a handy tip for Subversion and Git administrators: use one of WANdisco’s MultiSite products and you’ll automatically receive warnings when you’re running short of disk space on key file systems. If things get really bad, SVN MultiSite Plus or Git MultiSite will shut down activity until more disk space frees up.  Subversion and Git disk monitoring saves you time and money by preventing disk problems from turning into system outages.

By default both MultiSite products will watch a key directory used for new transactions.

I can add other directories, for example to monitor the repository storage area, and I can specify different severity levels based on how much available space is left.


I can also configure email notifications for different types of monitoring events.


Disk monitoring is just another way that SVN MultiSite Plus and Git MultiSite make managing large deployments easier. Start a free trial today and try it for yourself!

Metadata: Big Data’s Secret Superpower


When I heard the President of the United States repeatedly saying the word “metadata” in a speech recently, I realized just how seamlessly the term had made it into our common vernacular. As someone who’s long worked with SCM (Software Configuration Management) technology, however, I’ve had metadata as a primary focus for over a decade. That’s because SCM is about the creation, management, querying and archival of metadata about the changes in a software codebase.

What is metadata? I describe it as “data about data”. In SCM, we would talk of “integration credit” for a merge. This credit is actually data that’s created and stored about the merge of a file from one branch into another, a common software engineering task. I have likely performed thousands of merges during my career in software development, and each of them has created a piece of data recording my actions.

Note that this metadata does not store any of the content of the merged files. That’s what makes it “data about data”, i.e., metadata. Some industries even consider their SCM metadata a trade secret because it reveals the methods used to build something.

Enter Big Data. “Big Data” is not just “big” in terms of size-on-disk, but the massive breadth of data that’s typically accumulated from many different sources and made available in a single database.

It’s precisely this breadth of data that ignites Big Data’s secret superpower of metadata. That’s partly because metadata is often linked to the intent of an action and partly because interpreting straight data itself often requires specialized knowledge to parse and understand. Linking together metadata from a broad range of sources can reveal connections not otherwise possible, powering the Holy Grail that is Predictive Analytics.

We’re still at the beginning of Big Data’s disruption of all manner of markets and systems. Learning what data sources are valuable, discovering sources of data within a system and exporting them in real time, finding the types of new questions we can ask and have answered by Big Data, are all works in progress.

Here at WANdisco, we are hard at work paving the road ahead for the new paradigms and promise of Big Data. So we’re always interested in challenges faced by our present and future customers. What are your plans for using Big Data in your business?

Why Now is the Time to Move to Subversion or Git

A recent Forrester survey revealed that 17% of enterprise software developers are still using Visual Source Safe (VSS) or CVS. It’s a bit of a puzzle why these two antiquated systems are still hanging on, but I think I understand a bit of the reason.

SCM is like plumbing from a certain perspective. It’s a vital piece of infrastructure and once you’ve used it you don’t ever want to go without. But as long as it’s working well, you also don’t see a real reason to upgrade it very often. It’s only when things break that you realize how important those pipes are.

Fair enough: if your company made an investment in CVS or Visual Source Safe (VSS) 10 years ago, you need a solid reason to upgrade. SCM systems don’t wear out like physical assets, and moving to a new system can be complex.

But I think now is the time to make the move for these three reasons:

  • EOL. VSS has reached end of life.

  • No updates. Neither VSS nor CVS is receiving updates anymore. You’re missing out on features like atomic commits and strong branching, which are considered essential for productive software development.

  • Lack of tooling. Compatible software development tooling – IDEs, build systems, code review tools, deployment pipelines – is increasingly difficult to find for VSS and CVS.

Given that replacing an SCM system may only happen every 5-10 years, it’s worth considering where you want to make your investment. Here are three reasons I think Subversion and Git should be at the top of your future-proof list.

  • Open source. Subversion and Git are actively developed, have a robust user community with widespread adoption, and immunize you from vendor lock-in.

  • Best of breed. Subversion and Git have all the modern features you need. Subversion in particular handles large data sets very well, while Git is known for powerful local development workflows.

  • Stable and proven. Subversion has been widely adopted over the past 10 years and is used by some of the largest companies in the world. It is rock solid stable and has many enterprise features. Git is newer but now enjoys wide community and vendor support.

Whatever your choice, WANdisco is ready to help. We have certified binaries, support and training, and products for enterprise-grade uptime and security.  Let’s start the conversation!

Git Repository Metrics with Nagios

A few weeks ago I wrote about gathering some Git repository metrics and viewing them in the Git MultiSite GUI or in Graphite, and someone pointed out that some of the administrative metrics useful for capacity planning could be gathered using monitoring tools like Nagios. Repository size on disk is a good example since system administrators normally monitor disk space to make sure that the server doesn’t run out of space. You can set up Nagios to monitor any file system that contains Git repositories.
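A minimal Nagios service definition for this might look like the following sketch (host name, thresholds, and file system path are assumptions for illustration; check_disk is the standard Nagios plugin):

```
define service {
    use                 generic-service
    host_name           git-server-01
    service_description Git repository disk space
    ; warn at 20% free, critical at 10% free, on the repo file system
    check_command       check_disk!20%!10%!/var/git/repositories
}
```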


Git Repo Metrics


This level of check_disk information is useful for high level monitoring, and there are many plugins to connect Nagios to rrd graphing tools. I used OpsView to set up my example which includes a built-in graphing capability.

Drilling down to individual repositories would be an easy modification to the check_disk plugin, or you can get more granular data from Git MultiSite.


When Are Your Git Servers Busy?

Using Hadoop to Generate a Commit Time Histogram

Knowing when your Git servers are under the most load can help you answer several questions:

  • When is a good time to schedule routine maintenance or automated activity? Ideally, you want to find a time when there is very little developer activity on the system.

  • Are there periods of peak usage coinciding with the normal working schedule of a particular office? Perhaps that office needs more Git servers.

  • Are most of the commits coming at the end of a normal working day? Are you seeing a spike of commits during a certain time frame, say late at night? These might be signs of unhealthy work habits, such as an overburdened team, or capacity challenges, such as bottleneck issues when everyone tries to commit right before going home.

I decided to analyze this issue with Hadoop tools.

The Steps

Briefly, we need to:

  • Extract the relevant data from Git and make it available on HDFS. I covered one approach to this problem – using Flume to stream Git data into HDFS – in a previous post.

  • Load the data into a table in HCatalog. This step is trivial and I described it in a previous post.

  • Use Pig to analyze the data.

  • Use a graphing tool to visualize the results.

Analysis Step

I want to generate a commit time histogram showing the number of commits during each hour of the day. I need to group commits by the hour of the commit time, and then count the commits in each bucket. These steps are very easy in Pig.

-- load data
raw = LOAD 'git_logs' using org.apache.hcatalog.pig.HCatLoader();
describe raw;

-- extract hour from commit timestamp
hours = FOREACH raw GENERATE new_rev, GetHour(ToDate(time)) as hour;
describe hours;

-- group by hour
groupedbyhour = GROUP hours by hour;
describe groupedbyhour;

-- sum up number of commits per hour
hourcounts = FOREACH groupedbyhour GENERATE group AS hour, COUNT(hours) AS numhour;
describe hourcounts;
dump hourcounts;

store hourcounts into 'gl.hist' using PigStorage();

The output looks like this:

0 314
1 190


The output file has 24 lines showing the count of commits for each hour of the day. It’s then simple to plot the data using Excel, gnuplot, or another graphing tool.
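Reading the Pig output back for a quick look is trivial too. Here’s a small Python sketch (the part-file path is hypothetical) that parses the tab-delimited hour/count lines and picks the quietest hour, which answers the maintenance-window question directly:

```python
def quietest_hour(lines):
    """Parse 'hour<TAB>count' lines from the Pig output and return
    the hour with the fewest commits."""
    counts = {}
    for line in lines:
        hour, count = line.split()
        counts[int(hour)] = int(count)
    return min(counts, key=counts.get)

# normally: lines = open("gl.hist/part-r-00000")  # hypothetical path
sample = ["0\t314", "1\t190", "2\t95", "3\t120"]
print(quietest_hour(sample))  # prints 2
```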


Commit Time Histogram


In this example I’ve graphed the commits from a popular open source project.  We can see that there is a nice even distribution of commits over the working day and evening, and a lull overnight.

That’s a Wrap

A commit time histogram is just another example of the interesting data you can extract from your SCM and ALM systems using Hadoop tools. Some of this data can be seen using traditional data analysis tools, but using Hadoop takes away any concern about future scalability or data structure problems.

In my next post I’ll be looking at another take on visualizing commit data: generating a heat map of commits by user location.


Git MultiSite Simplifies Complexity

Our newest version of Git MultiSite, version 1.2, provides centralized management and replicated configuration settings for simplified administration, as well as enhanced security across multiple sites. In addition, Git MultiSite 1.2 integrates seamlessly with common ALM toolsets with enhanced support for distributed notification mechanisms. These features alleviate administrative burdens and boost security for global enterprises looking to streamline their source control management systems.

According to Jay Lyman, senior analyst for enterprise software at 451 Research, “Large enterprises are using Git for faster, more agile and collaborative development. However, sometimes tools like these add tremendous complexity, so managers and administrators appreciate the centralized management and seamless integration capabilities provided by solutions like WANdisco Git MultiSite, and this is key for global enterprises.”

The new release also enables easy integration with WANdisco Git Access Control for further security and simplicity. Git Access Control protects valuable intellectual property by providing granular access control with enterprise-grade authorization and audit capabilities, providing a complete audit trail, including user ID, date/time stamp, and command used.

“Git MultiSite provides enterprises with global disaster recovery and 100% uptime for Git,” said David Richards, WANdisco Chairman and CEO. “Git MultiSite 1.2 adds features that further enhance performance, manageability, and security for enterprises that value their data and appreciate the ability to maximize their source code management systems.”

Learn more here.

Advanced Subversion Access Control

Wrapping up a short series on some of the hidden gems of SVN Access Control, let’s take a look at using regular expressions to handle some advanced Subversion access control problems. The example I’ll use today is granting all developers the right to commit into a subdirectory of otherwise restricted branches.

The repository starts with a typical trunk-branches-tags structure, and all of these branches and tags are read-only for most developers, but we’d like to let developers commit their personal configuration and environment settings into a debug folder in each branch.

Managing this problem for one branch is easy: just define a rule that grants read access to the branch and add a second rule that grants write access to the debug folder.
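In plain Subversion authz syntax, the one-branch version of that rule pair looks something like this (branch path and group name are hypothetical):

```ini
[/branches/release-1.0]
@developers = r

[/branches/release-1.0/debug]
@developers = rw
```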

But I don’t want to have to list a write rule for each branch individually; that just doesn’t scale.  Instead I’ll take advantage of SVN Access Control’s regular expressions to handle the job.

RegEx-based Access Control


That’s probably the simplest example of using regular expressions to handle non-trivial access control rules. Another common example is restricting write access to build scripts (e.g. makefiles, build.xml, pom.xml).

Whatever your challenge, SVN Access Control gives you the tools for the job. Chat with one of our Subversion experts or start a free trial and see for yourself.



Challenges of the Git Enterprise Architect #1: Managing Many Repos

This is post number two in a series of short articles exploring challenges facing anyone deploying Git at scale in their enterprise software development environment. Find the introduction here.

Anyone migrating to Git from a centralized version control system will quickly run into one of Git’s most characteristic features: a codebase is most naturally represented by one complete repository.  That is, you don’t define a working copy based on a part of a repository; you get the entire repository as your working copy.

Centralized version control systems tended to become a grab bag of everything: main products, side projects, a file you needed at home but didn’t have a USB drive handy for, and sometimes lots and lots of large binary files. You’d then define a tiny fraction of that world as your working copy and move just that piece down.

When migrated to a Git repo, however, all of a sudden you are cloning the world onto your laptop!

Doing the splits

The best practice answer in the Git world is that you need to split all the unrelated items into separate Git repos. But now there are many repos and a related number of new questions:

  • Who in your organization is responsible for managing all the repos?
  • What tracks code if it is, for example, refactored to a different repo?
  • How do developers find the repos with the code they need?
  • What about codebases that share code but for secrecy or scaling reasons can’t all be included in a single Git repo?
  • Who provisions new repos? Are they automatically backed up properly? Where do they live?
  • What if you have a large and entangled code base that will be expensive to refactor?

Untracked code movement

Of the various questions raised, one of the most important is that movement of code between repos generates no metadata. To the receiving Git repo the files simply appear, like a code drop, and the donating repo records nothing about where its files went. This cuts against the grain of SCM. Software Configuration Management is in danger of becoming Software Confusion Mess, because we lose track of why and how code is moving within a codebase. That means we might not be able to answer questions like:

  • What codelines contain this recently discovered bug?
  • What products contain this piece of GPL-licensed code?
  • Did the refactored code get into every library that uses it?
  • What repos did this particular line of code pass through before getting here?

There are ways around all of these problems, of course, and my main point is merely that these are some issues to keep in mind. Perhaps most of them are not important in your use cases, or you are using one of the many tools that address some of them. And of course, as I implied in my article “Problem-centric Products”, you can expect WANdisco’s Git roadmap to pass through all of these challenges.

So stay tuned for the next installment: “Access Control”.

Git 1.8.5 certified binaries available

Git 1.8.5 was released recently, and WANdisco has just published certified binaries for all major platforms.

What’s new in 1.8.5?  As with all minor releases there are several nice fixes and improvements:

  • You can now specify HTTP configuration settings (like accepting unknown certificates) per site.
  • You can move submodules with git mv.
  • git gc will detect when another instance is running and quit.
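For example, the per-site HTTP settings use Git’s http.&lt;url&gt;.* configuration; a sketch of what that looks like in ~/.gitconfig (the internal host name is hypothetical, and you should only relax verification for servers you trust):

```ini
# per-site HTTP settings, new in Git 1.8.5
[http "https://git.internal.example.com/"]
    # accept this server's self-signed certificate; certificate
    # verification stays enabled for all other hosts
    sslVerify = false
```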

Grab a certified binary and enjoy the goodness!

Location-Aware Subversion Access Control

Almost all Subversion access control systems are role or group-based. Typically a particular group of developers has write access to the repository while another, larger group has read access, but sometimes it’s more useful to control access based on location. IP address-based or location-aware Subversion access control is one of the most powerful features of WANdisco’s SVN Access Control product.

SVN Access Control is a mature product, but it’s worth taking a look at some of the clever features that may not jump out at first glance. The foundations of SVN Access Control are simple management, LDAP integration, granular permissions down to the file level, and strong auditing, but IP address-based rules are one of the hidden gems.

Setting an IP address-based rule is easy; simply specify the range of applicable IP addresses when adding or editing the rule.

Still the question remains: why should you care about the IP address of a user? If that person is part of the team, why does it matter where they’re connecting from? There are many reasons, actually, but they boil down to two categories.

Not every part of the network is trusted as much as the main office LAN

  • We can limit access to sensitive data when developers are connecting over VPN.

  • We can grant different access to the same user if they’re working at a remote partner office versus the main office (and can audit what’s being accessed from remote sites, in the spirit of trust but verify).

More than just source code is stored in Subversion

  • We can make production environment and configuration data read-only on development machines, read-only on app servers, and writable only for authorized Ops workstations.

  • We can lock down data that we need to push to a public cloud for deployment.

  • We can make data read-only when accessed from a build server, just in case.

SVN Access Control is a powerful tool for securing and managing Subversion data. If you haven’t explored IP address-based rules yet, give it a shot. You may find they help solve some tricky problems. You can start with a free trial or talk to one of our Subversion experts first.



SmartSVN 8 Available Now

We’re pleased to announce the release of SmartSVN 8, the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8 is available immediately for download from our website.

While the main feature of this release is support for Subversion 1.8, we’ve provided a few more enhancements and additional bug fixes.

New SmartSVN 8 features include:

  • Eagerly awaited support for Subversion 1.8 working copies, allowing you to use SmartSVN with a Subversion 1.8 server. (For a full list of Subversion 1.8 benefits see the Apache release notes)
  • Ability to specify different merge tools for different file patterns as conflict solvers, allowing you to customize SmartSVN to suit your needs.
  • Notifications are no longer shown while a dialog is open, ensuring you don’t miss anything important.
  • Project menu: “Open or Manage projects” (and others) are now available without the project window, allowing you to work faster and smarter.
  • OS X: dock icon click reopens minimized windows, making SmartSVN consistent with most OS X applications.
  • Upgrade: SmartSVN will convert 1.7 working copies to 1.8 format, making it easier for you to get started with Subversion 1.8.

SmartSVN 8 fixes include:

  • Possible internal errors closing project windows or the repository browser, or when using ‘add’ or Conflict Solver
  • Problems with comparing repository files or directories
  • Compare: upper block line was drawn 1 pixel too high in line number gutter
  • Commit: committing a removal of a directory using svn protocol did not work
  • Linux: notification popup might have been closed quickly after showing
  • Start Up: crash on Ubuntu 13.10

For a full list of all enhancements and bug fixes, see the changelog.

Contribute to further enhancements

Many issues resolved in this release were raised on our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.


Git Document Sharing

If you’re a Git user, you may have heard about SparkleShare, a clever tool that gives you a Dropbox-style interface for document sharing and collaboration backed by Git. Storing your documents in a SCM system like Git gives you strong version management of the documents and the ability to host the Git repository on your own servers, along with easier collaboration between software teams and document writers. The only thing better than Git document sharing is…Git document sharing backed by Git MultiSite!

One of WANdisco’s talented crew in Sheffield ran a proof-of-concept with SparkleShare on two machines, each instance using a different Git MultiSite node as a Git remote. It works as expected – no surprise, since Git MultiSite can replicate any Git repository. Setup is simple:

  • Set up two or more Git MultiSite nodes with a replicated repository.

  • Set up SparkleShare on two machines, each using a different Git MultiSite node as a Git remote.

  • Add documents in one SparkleShare folder and see them appear on the other machine.
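Under the hood each SparkleShare project is an ordinary Git clone, so the proof-of-concept above is equivalent to the following command-line sketch (the node URLs and file names are hypothetical, not taken from the actual test):

```shell
# Machine A clones the replicated repository from the first MultiSite node
git clone git@gitms-node1.example.com:docs.git && cd docs

# Add a document and push; Git MultiSite replicates the commit to all nodes
cp ~/design-spec.odt . && git add design-spec.odt
git commit -m "Add design spec"
git push origin master

# Machine B, cloned from a second node, picks up the change with a pull
# (SparkleShare performs this fetch automatically in the background):
git pull origin master
```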

The process is simple and convenient, and now your SparkleShare folder is backed by a Git repository that enjoys zero downtime, LAN speeds at every location, strong security, and all the other benefits of Git MultiSite. That means you can have SparkleShare folders automatically available at remote sites, and use replication groups to control where SparkleShare data is replicated.

If you’re looking for a Git document sharing solution that works in the enterprise, this proof-of-concept is a great place to start.


The Open Source Wave in SCM

A recent Forrester survey has confirmed what those of us working in the ALM space have seen coming for several years: the open source wave has hit SCM.  The wave isn’t on the horizon, it’s not something you need to prepare for someday – it has well and truly arrived.

The numbers tell the tale.  As you can see in the infographic below, Subversion and Git lead the enterprise SCM market with a share of 28.8%.  Subversion is a stable and mature system proven at scale in challenging environments, and is widely accepted in mainstream enterprise development organizations.  Git is now moving past the early adopter phase.  And of course Subversion and Git are the dominant SCM solutions for open source projects.

SCM Adoption

The gradual adoption of proven open source technologies in the enterprise should come as no surprise.  We’ve seen this trend before with the Apache web server (51% market share), Linux data center servers (23% of revenue), and Android (81% of devices shipped).

Why did the open source wave arrive in SCM over the last couple of years?  (Depending on your perspective, you may be thinking ‘why did it take so long’ or ‘how did it happen so fast’.)  A few trend lines converged at the right time.

First, the face of enterprise development is changing.  The software industry is widely adopting lean development principles like the Scaled Agile Framework and continuous delivery.  Subversion and Git are well suited for the workflows that support lean development.

Second, Git and particularly Subversion have matured both in features and in commercial support options.  Maintaining enough internal expertise to be completely self-sufficient is both expensive and difficult, so the rise of a thriving ecosystem of commercial vendors around an open source project is a hallmark of enterprise adoption.  Vendors like WANdisco both sponsor future development and provide enterprise support, services, and products for Subversion and Git.

The future is indeed bright for Subversion and Git.  The 21% of developers not using SCM will likely adopt an open source SCM solution – why would they look anywhere else – while the 17% using legacy solutions will likewise look to Subversion and Git as the logical upgrade paths.  Economics will eventually drive a good part of the ‘everything else’ category into the Subversion and Git camps as well.  If you’re looking to make the move, our quick reference cards on Subversion and Git are a good place to start.

The open source wave in SCM is here.  Are you ready?

Storing Binary Files Efficiently with Subversion

In the course of running a recent performance test, I remembered another big advantage that Subversion has when used for managing large digital assets. Subversion practices deduplication (also known as “rep sharing”) in its back-end storage system.

That can result in considerable savings in costly storage. Subversion already avoids creating physical server-side copies of data when branching, but you may still find that deduplication saves more than 20% of storage capacity, because users sometimes copy files to stand up new projects – particularly game artists who may not be familiar with SCM.

It’s great that Subversion makes this so easy.  And it’s also surprising that Perforce, a system known for handling large binary data, doesn’t provide any deduplication out of the box.  You must (carefully) script it yourself or rely on more expensive storage solutions to provide deduplication.  The savings quickly add up when you use Subversion instead.


Repository Storage when importing binary files


Subversion deduplication is enabled by default, although you can toggle the setting in your repository’s db/fsfs.conf file.  So relax – you don’t need to do anything to take advantage of this capability.  Of course if you have any questions our team of Subversion experts is here to help!
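For reference, the setting lives in the repository’s db/fsfs.conf file:

```ini
# <repository>/db/fsfs.conf
[rep-sharing]
# Representation sharing (deduplication) is enabled by default;
# set this to false only if you have a specific reason to disable it.
enable-rep-sharing = true
```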


Monitoring Subversion and Git Repository Activity

There’s nothing as frustrating as trying to diagnose a slow Subversion or Git repository. You might spend a lot of time digging through logs and system monitoring tools before finally discovering that someone is submitting a 2GB file that needs to transfer from Singapore to Boston. That’s why SVN MultiSite Plus and Git MultiSite give you built-in tools for monitoring repository activity.

In the administration console I can see how many transactions are pending for a particular repository.


Repository Transactions

I can also see transactions pending for a particular server.


Transactions per Server

In addition to viewing the number of transactions pending for repositories and servers, I can drill down to see more details about the repository events to try and pin down what’s causing a hang-up.

Monitoring a big Subversion or Git deployment is challenging and requires several types of tools, but the quick view of pending transactions gives you a fast sense of whether there are a lot of transactions stacked up waiting to process. To get a sense of typical system load over time, you can always inject these data points into a monitoring tool like Graphite.
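For example, a monitoring script could push the pending-transaction count into Graphite’s plaintext protocol, which accepts one "path value timestamp" line per metric on the Carbon listener (TCP port 2003). The metric name, value, and Graphite host below are illustrative assumptions, not values taken from MultiSite:

```shell
# Format a data point for Graphite's plaintext protocol
metric="scm.svn.repo1.pending_transactions"
value=12
line="${metric} ${value} $(date +%s)"
echo "$line"

# Send it to a Carbon listener (uncomment once a Graphite host exists):
# echo "$line" | nc graphite.example.com 2003
```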

Interested in learning more? Give us a call and see a demo or start a free trial!





Faster Subversion working copy updates

Subversion 1.8 Caches Pristine Data to Reduce Data Transfer

One of the less noticed improvements in Subversion 1.8 is the efficient caching of pristine file data in a workspace.  This improvement can actually result in much faster working copy updates in many cases.

Subversion workspaces that contain multiple branches often hold duplicate copies of the same files. Previously, every time you checked out or updated those files, you also downloaded duplicate copies of the pristine files stored in the .svn directory. Subversion 1.8 now checks whether the pristine data cache already has a file with the same checksum, and avoids downloading duplicate copies.

If you have a large workspace with several branches this improvement can result in much faster checkouts and updates, particularly if you’re working over a slow connection.  To get a sense of the improvement, I set up a Subversion 1.7 server and loaded in the Hadoop 2.0.5 source code.  I made two new branches, then checked out a working copy with all three branches.  On Subversion 1.7 Wireshark showed that I transferred about 49 MB of data during the checkout.  On a Subversion 1.8 server that was down to 21 MB – a reduction of almost 60%.

Branching is cheap and easy in Subversion, so it’s great that Subversion is now smarter about not sending duplicate data. Of course, if you work with media or documentation you can end up with duplicate files in the same branch, so this improvement is a big help in that situation as well.

The efficiency improvement is impressive, and marks another milestone in Subversion’s performance story for larger digital assets.  Want to try it?  Grab a certified Subversion 1.8 release.

Subversion 1.8.5 and 1.7.14 released!

Today the Apache Software Foundation (ASF) announced the release of Subversion 1.8.5 and 1.7.14, and we’re proud to announce our own fully tested and certified binaries are also available from our website.

Subversion 1.8.5 changes include:

  • Client-side bugfixes

    • Fix externals that point at redirected locations (issues #4428, #4429)

    • diff: fix assertion with move inside a copy (issue #4444)

  • Server-side bugfixes

    • mod_dav_svn: Prevent crashes with some 3rd party modules (r1537360 et al)

    • hotcopy: fix hotcopy losing revprop files in packed repos (issue #4448)

Subversion 1.7.14 changes include:

  • Client-side bugfixes

    • Fix externals that point at redirected locations (issues #4428, #4429)

    • diff: fix incorrect calculation of changes in some cases (issue #4283)

    • diff: fix errors with added/deleted targets (issues #4153, #4421)

  • Server-side bugfixes

    • mod_dav_svn: Prevent crashes with some 3rd party modules (r1537360 et al)

    • fsfs: limit commit time of files with deep change histories (r1536790)

Visit the Apache changelogs for Subversion 1.8 and 1.7.

You can download our fully tested, certified binaries for Subversion 1.8.5 and 1.7.14 free here.

WANdisco’s binaries are a complete, fully-tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Subversion Password Security Upgrade

Continuing a series of articles on the latest improvements in Subversion, this article will focus on a small but significant Subversion password security upgrade. Subversion 1.8 now allows passwords to be cached in memory rather than on disk.

Passwords or authentication tickets cached on disk are a security vulnerability if the drive is lost or stolen, so this is a welcome improvement. Note, however, that the password exists in memory in plain text, and if an intruder accesses the machine while the cache is live and knows the cache ID, the password could still be compromised.

In order to use this new feature you’ll need Subversion 1.8 binaries compiled with gpg-agent support, gpg-agent itself, and a pinentry program. You’ll also need to configure a couple of gpg-agent environment variables.
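A minimal sketch of the client-side setup, assuming a typical gpg-agent 2.0-era environment (exact variables and paths vary by system):

```shell
# 1. Enable the gpg-agent password store in ~/.subversion/config:
#      [auth]
#      password-stores = gpg-agent
#
# 2. Start gpg-agent and export its environment variables so the
#    Subversion client can find the agent socket:
eval $(gpg-agent --daemon)
export GPG_TTY=$(tty)

# 3. Subsequent svn operations prompt via pinentry and cache the
#    password in the agent's memory rather than on disk.
```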

If password security is an important concern for you, get your certified Subversion 1.8 binaries and take advantage of this improvement.

Subversion is a registered trademark of the Apache Software Foundation.


SVN MultiSite Plus: The Fastest Solution for Game Developers

Handling large binary files, such as the media assets used during game development or design files used by hardware and firmware engineers, can be challenging for SCM systems.  The Apache Subversion community has accelerated development of enterprise features, including class-leading performance with large binary assets.  Subversion now shows a significant advantage when committing and transferring files over a WAN compared to Perforce, a commercial SCM system with a reputation for being the performance leader in binary asset management.

Test Setup

In order to test the performance of large binary file handling over a WAN, I set up a test configuration as follows.


SVN MultiSite Plus test configuration

Perforce test configuration



During the test, a 1.7 GB ISO file is committed to a local node three times, and each iteration is timed.  Two measurements are taken:

  • Time to run the commit command. This is the user’s experience of system speed.

  • Time for the file to transfer to the remote node over a link with 256 ms simulated latency. This is the time after which the file would be available to users at other sites.

Test Results

As the chart below shows, Subversion is significantly faster to commit and transfer the data.

Test results


Several conclusions can be drawn.

1. Perforce is very slow to transfer a large binary file over a WAN. No compression was used between client and server, and the link was unencrypted. As the chart indicates, the bulk of the commit time was simply waiting for file transfer. Subversion is significantly faster. [1]

2. SVN MultiSite Plus offers the benefit of using a second local node to confirm data availability before accepting the commit. By default, commit data must reach at least one other node to ensure data availability, but this can be a local node, so the user does not need to wait for the file to transfer over the WAN. This is clearly shown in the results for the first iteration, when the commit completed in 5 minutes and the file appeared on the remote node 9 minutes later.

3. Subversion does not retransfer or store duplicate objects in the repository storage area, so almost no time is spent transferring commit content in the second and subsequent runs.

The departments working on game development or hardware/firmware design are often separated by function (e.g. game artists versus game developers or hardware engineers versus application developers), and often each team will be in a separate location. For these teams, SVN MultiSite Plus offers a significant performance advantage when handling large binary files.

[1] The ‘svn’ protocol used by svnserve is relatively unaffected by increased latency, as it does not wait for responses from the server while transmitting data.  Perforce, however, is very sensitive to TCP send and receive buffer sizes over a high-latency network.  Increasing the operating system network tunables does improve Perforce performance: in a simpler test of a regular commit from client to server over a 256 ms latency link, the transfer time dropped from 2 hours to 20 minutes.  However, that still compares poorly with Subversion’s out-of-the-box, non-replicated transfer time of 7 minutes, and again SVN MultiSite Plus completes a replicated commit in even less time.




WANdisco Available from Ingram Micro

We’re pleased to announce our products are now available to channel partners through Ingram Micro. Ingram Micro’s authorized technology resellers in North America can now quickly and easily add WANdisco’s solutions to their existing product offerings via our agreement with IT solutions aggregator DistiNow.

“Ingram Micro is well positioned to help us identify new business opportunities and drive significant growth throughout North America,” said David Richards, WANdisco CEO. “As a single-source distributor offering its channel partners full access to a complete array of solutions and services, Ingram Micro offers extensive reach within vertical markets.”

“Data availability is a high priority that spans across nearly every vertical market and as such presents a growing opportunity for our channel partners,” said Bill Brandel, senior director, Advanced Computing Division, Ingram Micro U.S. “We’re pleased to bring WANdisco’s portfolio of products to our channel partners and provide them with a needed solution that allows them to monetize the open source product market.”

Visit our website to learn more about WANdisco’s solutions providing continuous availability for Hadoop Big Data, Subversion, and Git. While you’re there, register for our upcoming webinar with Hortonworks, “The Modern Data Architecture for a Non-Stop Hadoop” scheduled for December 5th.

Dynamic Subversion Deployments

Among the many improvements in the latest release of SVN MultiSite Plus, the ability to add new nodes on the fly to replication groups really stands out. SVN MultiSite Plus helps a Subversion administrator cope with uncertainty: you don’t know very far in advance how many developers you have, where they’ll be located, and how much build automation they’ll use. When I worked as a consultant I would often ask these basic questions in the process of hardware sizing for a deployment, but in reality you’re just guessing if you try to figure out how many servers you need years in advance. SVN MultiSite Plus gives you a dynamic Subversion deployment that grows in response to your environment.

As a simple example, let’s say that I start out with three nodes in a single office. After six months I need to add two more nodes to handle additional build automation load, then after a year I need to add a node at a remote site to support a new office. These routine capacity events shouldn’t cause a big disruption to your Subversion service.

With SVN MultiSite Plus I can simply add a new server to the deployment on the fly. After the server is provisioned and configured, I simply go to the administration console and add the new server to the replication group:


After I add a node, SVN MultiSite Plus walks me through the process of synchronizing data onto the new node. While the new node is being synchronized it is not usable, of course, but it is automatically activated once the data transfer is complete.

If you’re tired of constantly scrambling to keep your Subversion deployments up to speed, grab a free trial of SVN MultiSite Plus. If you face the same problem for Git, Git MultiSite offers the same support for quickly expanding a Git deployment.


Challenges of the Git Enterprise Architect, Part 1

This is the first of more than 20 articles, each examining a key challenge facing anyone responsible for deploying Git at scale in their enterprise software development environment.

Part of my role in Product Management is to seek out early adopters of emerging technology and study their process, challenges, and techniques deploying new technology. This helps ensure that we build products that people want to buy. As I wrote in Problem-centric Products, “it’s so important to deeply understand the challenges faced by your customers, and speak to the problems first whenever possible.” By studying a variety of early adopters, patterns start to emerge, and that’s where Product Management’s “ear to the ground” starts to turn information into a product vision.

I also like to share what I’ve learned, so my original “Top Challenges for the Git Enterprise Architect” document was circulated first with our customers, then presented as a talk at our Subversion & Git Live 2013 conferences in Boston, San Francisco, and London, and now appears as a series of blog articles.

It was a surprise to me that many found my intermediate-level talk, “Git Enterprise Challenges”, to be sobering or even frightening.  Perhaps the rose-colored glasses of my optimism lead me to see problems as opportunities; or, equally possible, these common challenges mean there is a chance to create a product that will benefit many people.

Note that not every development environment will face every one of these challenges, but together they comprise a checklist of issues to consider when adopting Git in your environment.  I’ll drill down into each of them over the next few months, and the result should paint a reasonably complete, if high-level, picture.

I should also point out that I don’t address solutions in this series. Our products, Git MultiSite and Git Access Control, are just the first step in a roadmap that eventually visits every challenge. Rest assured that, just as your needs around deploying Git in your enterprise grow, WANdisco’s Git products will grow with you.

As a side note, WANdisco is now a general SCM expert, with deep knowledge for deploying and supporting leading tools like Git and Subversion, as well as advice and professional services for migrating from legacy tools like ClearCase, CVS, TFS, Perforce, and others.

And without further ado, here are the topics I’ll be covering:

Managing many repos, Access control, Multi-repo codebases, Ever growing repos, Large binaries, Shared code, Large repos, Long clone times, Supporting add-ons, Splitting repos, Combining repos, IP protection, IP reuse, Contaminating licenses, Code refactoring, Multi-site, Scaling myth of dictator-lieutenant, Untracked rename, Supporting a successor to Git, Untracked rebase, Permanent file removal, Excessive cloning

If there are any additional topics you want to see covered, please leave a comment and I’ll try to address them.

Tune in soon for “Managing Many Repos”.

Efficient Incremental Backups for Subversion

Subversion 1.8 introduces an improved method for incremental backups. In a nutshell, the hotcopy command can now run incrementally for faster backups.

To take a step back, Subversion repositories hold the intellectual property equivalent of your crown jewels, and guarding that data requires a layered strategy.

  • The first step is usually a local mirror that has a full copy of the repository data and can be used as a warm or hot spare. Of course, if you’re using WANdisco SVN MultiSite, every replicated peer has a complete set of data and failover is automated.

  • The second step is a remote mirror that has a full copy of the repository data and can be used for failover in a disaster recovery scenario. Again, SVN MultiSite provides this capability (and more) along with automated failover.

  • Finally, you need to have full offline backups of your repository data. Offline backups often progress through a storage cycle, with the most recent backups kept on faster, short-term storage and the oldest moving to cheaper offline storage. The backups generated by svnadmin hotcopy (possibly from a spare server) fall into this category.

Prior to Subversion 1.8, hotcopy created a full backup of the entire repository. The operation was often slow on large repositories, requiring careful scheduling or alternative strategies. Since hotcopy is only bound by disk I/O speed, it is faster than either native Subversion replication (svnsync) or creating incremental dump files, making it a great choice for a more efficient backup strategy. And of course hotcopy retains all of its usual advantages: it makes a full backup of a running repository in the right sequence to ensure data integrity.

Though hotcopy is only part of a layered backup strategy, it is a powerful tool for Subversion administrators – one not enjoyed by administrators of many commercial SCM systems. hotcopy is simple, reliable, and now quite efficient, and there are no concerns over backing up several parts of a complex system in the right order.

To use the incremental mode, simply pass the new --incremental option. If you want to give it a try, download the latest certified SVN binaries for Subversion 1.8.
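For example, a nightly backup job might look like the following (the repository and backup paths are hypothetical):

```shell
# The first run against an empty target directory creates a full hotcopy;
# subsequent runs copy only revisions added since the last backup.
svnadmin hotcopy --incremental /srv/svn/repos/myrepo /backups/svn/myrepo
```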

Subversion is a registered trademark of the Apache Software Foundation.


SmartSVN 8 RC1 released!

Yesterday we released the first Release Candidate (RC) for SmartSVN 8.

SmartSVN is the cross-platform graphical client for Apache Subversion.

New SmartSVN 8 RC1 features include:

  • The Project menu: “Open or Manage projects” is now available without project window
  • OS X: dock icon click will reopen minimized windows
  • Reintegrate Merge: removed (as it’s no longer relevant with Subversion 1.8)
  • Upgrade: SmartSVN will convert 1.7 working copies to 1.8 format

Fixes include:

  • Refresh: file and property conflicts were not displayed at all
  • Start Up: crash on Ubuntu 13.10
  • Conflict Solver: possible modification of edited file even if modifications were rejected
  • Commit: committing a removal of a directory using svn protocol did not work

For a full list of all improvements and bug fixes, view the changelog.

Have your feedback included in a future version of SmartSVN

Many of the fixes and suggestions included in new versions of SmartSVN are raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download RC1 for SmartSVN 8 from our early access page.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

Thoughts on RICON West 2013

RICON bills itself as a “Distributed Systems Conference for Developers”, but in recent years it has increasingly become a lively intersection of leading academics and distributed systems practitioners. WANdisco straddles these worlds as well: on the one hand a five-year sponsor of Berkeley’s AMPLab, and on the other a company that has deployed advanced distributed systems into enterprise production environments for almost a decade.

A recent and gratifying trend is the maturation of understanding of the Paxos algorithm as a practical solution for implementing distributed consensus, an essential building block for distributed systems. Even though we still heard some speakers repeat the common opinion that Paxos is “too hard to implement”, we also saw others dipping their toes into Paxos.  And as evidence of increased interest in coordination algorithms, one talk presented a “stepping stone” algorithm proven easier to teach to undergraduate computer science students.

Another area of beneficial progress is the growth of understanding around the subtleties of the CAP theorem. As expressed by Michael Bernstein, “now you have CAP, which is an acronym, which is super easy to make s–t up about.” Of course, what we’ve often heard are witty sounding but dead wrong simplifications about CP and AP tradeoffs.  As industry knowledge about distributed systems matures, real life implementations prove increasingly effective and durable.

There was also increased interest in methods for strengthening consistency in eventually consistent databases. The entire subject of eventual consistency leaves us a little squeamish: as I wrote in Why Cassandra Lies to You, the eventual consistency model does not, in practice, provide strong consistency.  In cases where true consistency is required, choosing the weaker BASE guarantee of eventual consistency will likely be a painful mistake.

Perhaps it is inevitable that distributed databases will eventually displace the relational databases powering the vast, churning machines of industry.  RICON is one window into that future.

How much git down time can you tolerate?

Enterprise SCM administrators realize the valuable service they’re providing to development organizations and strive to avoid outages, but exactly how costly is SCM downtime? Put another way, how much is avoiding Git downtime worth to a company that relies on Git for enterprise software development?

A recent study concluded that a data center outage costs an average of $5,600 per minute in general use cases. To get a more concrete number, assume the cost of a single developer is $50-$300/hour depending on location, and that when faced with SCM downtime a few hundred of them are unable to be productive.

A developer with a private Git repository or a local read-only mirror can still get some work done, but they can’t get the most recent work from other developers if the master Git repository is down, and they can’t commit their own work. The productivity loss factor may not be 100%, but it’s not trivial either. You also need to include the cost of any schedule impact – days matter when deadlines are looming.
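As a back-of-envelope sketch of the cost, with assumed headcount, hourly rate, and loss factor (these are illustrative figures, not numbers from the study):

```shell
# Hypothetical inputs: 200 affected developers, $100/hour average cost,
# 50% productivity loss while the master repository is down
developers=200
rate_per_hour=100
loss_percent=50

cost_per_hour=$(( developers * rate_per_hour * loss_percent / 100 ))
echo "Estimated cost per hour of SCM downtime: \$${cost_per_hour}"
# prints: Estimated cost per hour of SCM downtime: $10000
```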

That’s why WANdisco provides non-stop data solutions for Git and Subversion. Zero downtime and continuous availability are guarantees, not perks. Enterprise SCM administrators can count on High Availability (HA) and Disaster Recovery (DR) out of the box, with aggressive Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets. Compared to Git MultiSite and SVN MultiSite, one-off home-grown solutions simply aren’t battle tested.

With WANdisco MultiSite products for Git and Subversion, every node in the deployment is a replicated peer node, and every node is fully writable. You can choose how these nodes behave by setting up different replication groups, but for the purposes of HA/DR, you might set up a deployment like this:


HA/DR configuration

In this simplified view, users at two sites have a set of local nodes to use. If one node fails, failover to another is automated by a load balancer (the HA case). Note that all the nodes are active and thus also serve to improve overall performance. In a DR scenario, users at one site can simply switch their Git remote to the load balancer at the other site, which routes them to any of the fully writable nodes at the second site.

This setup is quite simple to achieve with Git MultiSite and a stock load balancer like HAProxy, giving you an effective zero downtime solution at very low cost.
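As a sketch of the HA piece, an HAProxy configuration along these lines routes Git traffic across the local nodes and fails over automatically when a health check fails (hostnames, ports, and the use of Git-over-HTTP are assumptions for illustration):

```
# haproxy.cfg fragment: balance Git HTTP traffic across local MultiSite nodes
frontend git_http
    bind *:80
    default_backend git_nodes

backend git_nodes
    balance roundrobin
    option httpchk GET /
    server node1 gitms-node1.example.com:8080 check
    server node2 gitms-node2.example.com:8080 check
```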

How much downtime can you tolerate in your enterprise Git deployment? If the answer is close to zero, learn more about Git MultiSite or start a free trial.



Subversion 1.8.4 released!

Today the Apache Software Foundation (ASF) announced the release of Subversion 1.8.4, which features a number of bug fixes.

Apache Subversion 1.8.4 fixes include:

– revert: fix problems reverting moves

– translation updates for Swedish language users

– merge: reduce network connections for automatic merge

– fix crash on windows when piped command is interrupted

– fix assertion when upgrading old working copies

– fsfs: improve error message when unsupported fsfs format found

For a full list of all bug fixes and improvements, see the Apache changelog for Subversion 1.8.

You can download our fully tested, certified binaries for Subversion 1.8.4 free here.

WANdisco’s binaries are a complete, fully-tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Using TortoiseSVN?

To go along with the update to Apache Subversion we are pleased to announce an update to TortoiseSVN. You can download the latest version for free here.

WANdisco Announces SVN On-Demand Training for Administrators and Developers

Whether you’re looking to get started with Subversion or build your skills in managing large-scale Subversion deployments, WANdisco’s new SVN On-Demand Training offers courses designed for Subversion administrators and developers of all skill levels.

SVN On-Demand Training offers instruction to boost administrators’ and developers’ knowledge of Subversion. The library includes more than 30 videos and supporting reference materials, and new material is continually added for subscribers.

Some of the current SVN On-Demand Training courses include:

  • Introduction to Subversion

  • Subversion for Beginners

  • Intermediate Subversion

  • Advanced Subversion

SVN On-Demand Training is available now. Visit for more information and to request a quote.

Non-Stop Hadoop for Hortonworks HDP 2.0

As part of our partnership with Hortonworks, today we announced support for HDP 2.0. With its new YARN-based architecture, HDP 2.0 is the most flexible, complete and integrated Apache Hadoop distribution to date.

By combining WANdisco’s non-stop technology with HDP 2.0, WANdisco’s Non-Stop Hadoop for Hortonworks addresses critical enterprise requirements for global data availability so customers can better leverage Apache Hadoop. The solution delivers 100% uptime for large enterprises using HDP with automatic failover and recovery both within and across data centers. Whether a single server or an entire site goes down, HDP is always available.

“Hortonworks and WANdisco share the vision of delivering an enterprise-ready data platform for our mutual customers,” said David Richards, Chairman and CEO, WANdisco. “Non-Stop Hadoop for Hortonworks combines the YARN-based architecture of HDP 2.0 with WANdisco’s patented Non-Stop technology to deliver a solution that enables global enterprises to deploy Hadoop across multiple data centers with continuous availability.”

Stop by WANdisco’s booth (110) at Strata + Hadoop World in New York October 28-30 for a live demonstration and pick up an invitation to theCUBE Party @ #BigDataNYC Tuesday, October 29, 2013 from 6:00 to 9:00 PM at the Warwick Hotel, co-sponsored by Hortonworks and WANdisco.

Reliable Git Replication with Git MultiSite

Setting up Git replication to help with backups and scalability may seem easy: just use a read-only mirror. In reality, setting up reliable Git replication, particularly in a large global deployment, is much more difficult than simply creating a mirror. That’s where Git MultiSite comes in.

Replication is Hard

Let’s look at a few of the challenges involved in managing a Git deployment. If you’re an enterprise Git administrator, I’ll wager that you’ve run into several of these problems:

  1. Failures will happen – especially in a WAN environment. Network interruptions, hardware failures, user error: all of these factors interrupt the ‘golden path’ of simple master-mirror replication. Since Git doesn’t provide replication out of the box you need to either write your own tools or rely on a free mirror solution. In either case you won’t have a replication solution that stands up to every failure condition.

  2. Replicas get out of sync. When a failure happens, it must be reliably detected and corrective action must be taken. Otherwise, a replica can be out of sync and contain old or incorrect data without your knowledge.

  3. Replication should be the first tier in your High Availability / Disaster Recovery (HA/DR) plan. Your data is the most important thing you own and you want a multi-tiered strategy for keeping it safe. Unreliable replication takes away a vital part of that strategy. Plus, failover is hard. Even if you have a perfect backup, how quickly can you bring it online and redirect all of the connections to it? How do you fail back when the primary server is back online?

  4. Security is essential. Every Git mirror should be subject to the same access control rules, yet there is almost no capability to enforce that in most systems.

  5. The biggest sites need replication more than anyone. Are you running 50 Git mirrors to support a large distributed user base and build automation? How are you monitoring all of those mirrors?

Git MultiSite Solution

So how does Git MultiSite solve these problems?

All failure conditions are accounted for.

Git MultiSite uses a patented algorithm based on the Paxos design, which means all failure conditions are accounted for in the algorithm itself. Dropped data, too few nodes online to agree on a proposal – these cases are all handled by design. It’s hard stuff, and that’s why we wrote a very long paper on it.

Easy monitoring and administration.

Git MultiSite monitors the consistency and availability of each replicated peer node. The administration console tells you at a glance if anything is out of sync.

MultiSite Dashboard

Zero down time.

If a node is out of sync it will go offline automatically while work continues on other peer nodes.  Failover is accomplished instantly with a load balancer or manually by using a secondary Git remote. When a node recovers it will catch up automatically (failback) and start participating in new activity.
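In the manual case, switching to a secondary Git remote is a one-line operation. A small sketch, using a throwaway repository and hypothetical node URLs:

```shell
# Work in a throwaway repo for illustration.
cd "$(mktemp -d)"
git init -q .
# Hypothetical node URLs; substitute your own deployment's addresses.
git remote add origin ssh://git@node1.example.com/project.git
# Failover: repoint 'origin' at a healthy peer node.
git remote set-url origin ssh://git@node2.example.com/project.git
git config --get remote.origin.url
```

Because every node is a writable peer, the repointed remote accepts pushes as well as fetches.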

Consistent security across the deployment.

Access control is consistent across every node. Using flexible replication groups you can control where a repository is replicated and how it is used at a site.


Replication group

Guaranteed data safety.

Every commit must reach at least one other node before it is accepted, guaranteeing the safety of your data. The content delivery system has several parameters you can adjust to suit your deployment.

How Important is Your Data?

If you need a zero down time solution with guaranteed data safety, start a free trial of Git MultiSite’s reliable Git replication. Our team of Git experts can help you get started.

Choosing Between Subversion and Git

Subversion and Git are the two dominant SCM systems in use today. Collectively they represent more than 85% of the open source projects and around 60% of the enterprise (private development) market. The numbers vary a bit depending on which survey you read, but the trends are clear. So how do you choose between Subversion and Git?

Much has been written about this topic already, and those of us who follow the SCM space could spend a very long time debating different features. I’m going to throw in my 2¢ with a focus on the things that matter most in large enterprise deployments.


Git is a distributed SCM system. Subversion is not. Though this may seem like a substantial advantage for Git, it matters much less in the enterprise than it does for personal projects or at smaller shops. In the enterprise, Git is deployed a lot like Subversion, with master repositories that are secure, highly available, and controlled. Local operations in Git are faster; it’s easier for a small team to stand up a new Git repository for a skunk works project, and it’s easier for a road warrior to work from their laptop for a few days, but otherwise the central model of Subversion is not a key limitation.


Subversion is very good at the mainline model but can be used in a lot of other ways. Git supports the mainline model, some workflows based on the pull request concept, and a stream-like workflow known as Git Flow. Tool selection in this area largely boils down to a matter of how you prefer to work.

If you collaborate frequently with teams outside the firewall, then Git is a solid choice. History in Git is not tightly coupled to a particular repository, making it very easy to push and pull changes between repositories on separate networks and even do it via sneakernet.
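The sneakernet case can be sketched with git bundle, which packs a repository’s history into a single file you can carry between networks (all paths and names here are illustrative):

```shell
workdir=$(mktemp -d) && cd "$workdir"
git init -q src && cd src
git config user.email "dev@example.com" && git config user.name "Dev"  # throwaway identity
echo hello > readme.txt
git add readme.txt && git commit -qm "initial commit"
# Pack the history into one portable file (HEAD plus the current branch):
git bundle create ../project.bundle HEAD "$(git symbolic-ref --short HEAD)"
# Carry the file across the gap, then clone (or fetch) from it:
cd .. && git clone -q project.bundle other-side
```

The clone on the far side behaves like any other repository; later bundles can be fetched into it incrementally.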

Maturity and the Cool Factor

Subversion has been around longer than Git and is widely used in large deployments. It’s a proven tool with a solid feature set. It has a few shortcomings but in most situations it just works. Subversion administrators know how to deploy, configure, secure, and maintain their systems.

Git is less of a known quantity in the enterprise, although that’s changing rapidly. Some of the enterprise parts of Git are still evolving (although the introduction of Git MultiSite has helped a lot), but Git is riding a wave of popularity, and that matters too. The next generation of developers is growing up on Git: they know it and prefer it.

Learning Curve

Subversion and Git are both very easy to learn for daily use and have good tool and plugin support, but Git’s learning curve gets very steep once you’re past the basics, and not every Git feature is exposed through a GUI.

Community and Future

Both Subversion and Git benefit from a strong open source community with commercial sponsors. Although the Git community has the momentum, both tools will be strong and viable for many years.

The Choice is Yours

Subversion and Git are powerful SCM tools that have different strengths. Whichever you choose, you can feel confident that the software and its community will be around for many years to come.  And if you’re using CVS or some other legacy SCM system, there’s no better time to move to one of the two powerhouse open source choices.

If you’re interested in training, support, data migration, or advice on how to use both Subversion and Git in tandem, call on our team of Subversion and Git experts.

Reporting on Subversion & Git Live 2013

We’re two-thirds of the way through Subversion & Git Live 2013, and I’d like to share a few observations before we take the show to London next week.

For this year’s conference we offered a Git track for the first time, to go along with WANdisco’s enterprise Git products, services and support. It proved quite popular, with good attendance at all sessions. The response to the talks and the questions asked gave a good indication of how early enterprise software development still is in its adoption of Git. In contrast, the Subversion sessions tended to focus on specific, deeper technical material, reflecting Subversion’s role as the SCM workhorse of the enterprise and Git’s status as the new kid on the block.

That said, there were also some companies with years of experience supporting Git deployments involving thousands of users and tens of thousands of shared repositories.  The majority of attendees were in earlier stages of Git enterprise adoption, either with relatively few users or in initial evaluations.

Many found my intermediate-level talk, “Git Enterprise Challenges,” to be sobering or even frightening. It was certainly not intended that way, but I can understand the reaction: Git poses as-yet-unanswered questions for enterprise-scale deployments. WANdisco is addressing a number of these with our Git MultiSite and Git Access Control products.

Although most of the sessions focused on either Git or Subversion topics, the reality is virtually every SCM administrator we talked to is seeing or thinking about supporting both Subversion and Git, along with a variety of legacy SCM systems throughout their organizations. Clearly, Subversion and Git will be co-deployed or in hybrid configurations for a long time to come.

There were so many interesting discussions; I’ll touch on a few topics in each article over the next few weeks. If there are any follow up questions on your mind, please leave them in the comments below and/or come see us in London on the 16th.

Reliable Git Replication: Low Overhead and Good WAN Performance

Recently I wrote about Git MultiSite’s reliable Git replication – guaranteed data safety, zero down time, and no concern about replicas falling out of sync. These are huge benefits, but do they come at a cost? To the skeptical, ‘reliable replication’ sounds like overhead. I did some simple testing to find out, and I’m happy to report that the overhead is insignificant compared to the value of reliable replication. What’s more, using Git MultiSite makes you less vulnerable to the effects of network latency.

Test Goals

I wanted to measure three things:

  • Pure overhead of replication algorithm. Compared to pushing directly to a bare Git repo on a remote server, how much overhead do you see when pushing to a local Git MultiSite node in a replication group with a remote node?

  • Effect of increasing latency. As the latency between the user and the bare Git master repo or the remote Git MultiSite node increases, what happens to performance?

  • Effect of guaranteed content delivery. What’s the additional overhead imposed by guaranteeing replication of content to at least one other remote or local node before the push succeeds?

Just for fun I decided to include some other name brand Git repository management solutions in the test.

Test Setup

The test configuration is shown below.

Test Configuration

The test program would push 50 times to each remote in round-robin fashion. After each run I would increase the latency, using these values:

  • 0 ms

  • 32 ms (e.g. coast to coast in US)

  • 64 ms (e.g. trans-Atlantic)

  • 128 ms (e.g. trans-Pacific)

  • 256 ms (e.g. US to India)

Since most Git repository management solutions offer only read-only mirrors, pushes for those systems go directly to the master repo. All pushes were done over SSH.

I used three different Git MultiSite configurations and measured the push completion times for each. First I used the two-node configuration shown in the diagram without guaranteed content delivery (i.e. measuring the overhead of the replication algorithm). Next I used the two-node configuration with guaranteed content delivery to the remote node over SSL. In both of these cases the remote node had to participate in the replication proposal. Finally, I used the three-node configuration with guaranteed content delivery to one node over SSL.

To account for any outliers in the data I used the median value of the results for each Git remote.


The graphs below summarize the results.

Median Push Time

This chart shows the median push time for every system tested. Let’s drill down into the most common Git MultiSite configurations compared to bare Git repos.

Median Push Time (Summary)

It’s easy to see that there is a nominal amount of overhead with no latency. The gap quickly closes and Git MultiSite even pulls ahead with higher latency.

Effects of Latency

This chart shows the delta between any Git system’s performance with no latency and its performance at higher latencies. In other words, it’s the penalty you pay with a given system when latency increases. It’s very clear that Git MultiSite (the three bars on the right of each group) has the best relative performance as latency increases.

Reliable Replication: Low Overhead, Better Over a WAN

As the results show, the basic overhead for the replication algorithm, even with content delivery, is minimal. Compared to the benefits of 100% confidence in the replication system, it’s a small price to pay.

Of course, real world performance will vary based on usage. Guaranteed content delivery of very large commits will take a bit of time, but that’s why Git MultiSite gives you the ability to control how and where data is replicated and how many nodes must receive the push before it is accepted. You can choose the balance of performance and redundancy that makes sense for you.

Finally, for those of you supporting remote sites, you’ll appreciate that Git MultiSite’s performance holds up well to increased latency.

If 100% data safety and good WAN performance are important to you, give Git MultiSite a try.  You can start a free trial or start talking to one of our Git experts.

Git Subtrees and Dependency Management

Component-based development has always seemed difficult to manage directly in Git. Legacy systems like ClearCase UCM have the idea of formal baselines to manage the dependencies in a project, and of course Subversion uses externals to capture the same concept. By contrast, Git started life as a single-project repository system, and the submodule and subtree concepts seemed clunky and difficult to manage. A lot of software teams overcame that problem by deferring component manifests to build or CI systems. The latest incarnation of Git subtrees is significantly improved, however, and worth a second look for dependency management.

The latest version of Git subtree is available with Git 1.7.11+. (If you need the most recent version of Git for your platform, WANdisco offers certified binaries.) It offers a much simplified workflow for importing dependencies, updating the version of an imported dependency, and making small fixes to a dependency.

For example, let’s say we have three components in our software library, and we have two teams working on different sets of those components.

Component Architecture

With subtrees, we can easily create new ‘super project’ repositories containing different sets of components. To get started, we add component repos as new remotes in the super project, then define the subtree.


git remote add modA git@repohost:modA
git fetch modA
git subtree add --prefix=modA --squash modA/master
git remote add modB git@repohost:modB
git fetch modB
git subtree add --prefix=modB --squash modB/master

We repeat this process with a different set of components in the second super project, yielding a directory tree that looks like this:

├───modA
└───modB

As the architect I’ve determined the set of components used in the super projects, and the rest of the team gets the right set of data just by regular clones and pulls. Similarly, if I want to update to the latest code, I just run:

git subtree pull --prefix=modB --squash modB master

Or, if I want to peg a component to a specific branch:

git subtree pull --prefix=modB --squash modB r1.1

By using --squash I generate a single merge commit when I add or update a subtree. That’s equivalent to one commit every time I adjust the version of a component, which is usually the right way to track this activity. Keep in mind that it is very easy to create a new branch off of a specific tag or commit at any time.

Similarly, if I want to contribute a bug fix, I just commit into the component and push the change back:

echo "mod b change from super 1" >> modB/readme.txt
git commit -am "change to modB from super 1"
git subtree push --prefix=modB modB master

There are a couple of good rules to follow when using subtrees. First, don’t make changes to a subtree unless you really want to contribute a bug fix or patch back upstream. Second, don’t make commits that span multiple subtrees or a subtree and the super project code. Both of these rules can be enforced with hooks if necessary, and you can rebase to fix any mistakes before pushing.
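As a sketch of how the second rule might be enforced, here is a hypothetical pre-commit hook (plain Git, nothing product-specific) that rejects commits touching both of the example subtrees:

```shell
workdir=$(mktemp -d) && cd "$workdir"
git init -q . && git config user.email "dev@example.com" && git config user.name "Dev"
mkdir modA modB
# Hypothetical pre-commit hook: reject commits that span both subtrees.
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
files=$(git diff --cached --name-only)
if echo "$files" | grep -q '^modA/' && echo "$files" | grep -q '^modB/'; then
    echo "error: commit spans multiple subtrees; split it up" >&2
    exit 1
fi
EOF
chmod +x .git/hooks/pre-commit
# A commit that touches both subtrees is rejected by the hook:
echo a > modA/a.txt && echo b > modB/b.txt
git add modA/a.txt modB/b.txt
if git commit -qm "cross-subtree change"; then status=allowed; else status=rejected; fi
```

The same check works as a server-side update hook if you want to enforce the rule centrally rather than per clone.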

Git subtrees are now a very effective and convenient tool for component and dependency management. Combined with the power of modern build and CI systems, they can manage a reasonably complex development project.

Questions about how to take advantage of Git subtrees? WANdisco is here to help with Professional Git Support and Services.



Delegated Subversion Access Control

Managing Subversion access control for a small team is fairly simple, but if your team is growing to several hundred developers, you don’t want to get a phone call whenever a new developer joins the team or someone needs a different access level. Delegated Subversion access control is what you need, and WANdisco’s SVN Access Control product uses the concept of team owners and sub-teams to get you there.

As a simple example, let’s say that we want all developers to have read access to the web-ui project. A subset of developers will also have write access to the trunk, and as the Subversion administrators we don’t want to decide which developers belong in each group. Using SVN Access Control, administrators can delegate that responsibility to the team leads while still being able to audit what’s happening.

In SVN Access Control we simply define a group called web-ui-devs and a subgroup called web-ui-committers. Next we set the permissions appropriately on the group and subgroup, and each is assigned an owner who can then manage membership.

Simple enough! SVN Access Control also allows a subgroup to have subgroups of its own, so you can set up a structure as deep as necessary to model your permissions and rules.
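For reference, the resulting permissions are roughly equivalent to a plain Subversion authz file along these lines (member names are hypothetical, and this is a sketch of the effect, not what SVN Access Control generates internally):

```
[groups]
web-ui-devs = alice, bob, carol
web-ui-committers = alice, bob

[web-ui:/]
@web-ui-devs = r

[web-ui:/trunk]
@web-ui-committers = rw
```

The difference is who maintains the group lines: with delegation, the team owners edit membership themselves instead of filing a ticket with the administrators.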

If you’d like more information on how to use SVN Access Control to solve your Subversion management challenges, contact our team of Subversion experts for advice. If you’re not yet enjoying the power of SVN Access Control, you can start a free trial.


WANdisco Announces SVN On-Demand Training for Administrators and Developers

Whether you’re looking to get started with Subversion or build your skills in managing large-scale Subversion deployments, WANdisco’s new SVN On-Demand Training offers courses designed for Subversion administrators and developers of all skill levels.

SVN On-Demand Training offers instruction to boost administrators’ and developers’ knowledge of Subversion and the library includes more than 30 videos and supporting reference materials. New material is being continually added for subscribers.

Some of the current SVN On-Demand Training courses include:

  • Introduction to Subversion
  • Subversion for Beginners
  • Intermediate Subversion
  • Advanced Subversion

SVN On-Demand Training is available now. Visit for more information and to request a quote.

Attendees of Subversion & Git Live 2013 in Boston, San Francisco, and London this October will receive two weeks free with a special code at the conference. Visit to register using promo code REF2013 to save 30%.

Data Auditing in Subversion

I’ve been writing a lot lately about the new features in Subversion 1.8, but there’s a little nugget in Subversion 1.7 that just caught my attention recently. I knew that Subversion stored md5 checksums of files in the repository, but I wasn’t quite sure how to easily access that information. The svnrdump command introduced in Subversion 1.7 provides the answer, and makes data auditing in Subversion much easier.

So why is this important? Well, to put it bluntly, stuff happens to data: it may be corrupted due to hardware failure, lost due to improper backup procedures, or purposely damaged by someone with bad intentions. Subversion MultiSite can protect you against all the vagaries of hardware and network, but if you work in a regulated environment you will someday have to prove that the data you took out of Subversion is the same as the data you put in.

That’s where the checksums come in. Let’s say I check out an important file from Subversion, like a configuration script or a data file with sensitive information. I can easily compare a local checksum against the checksum on the server to see if they match.

> md5sum BigDataFile.csv
3eba79a554754ac31fa0ade31cd0efe5  BigDataFile.csv
> svnrdump dump svn://myrepo/trunk/BigDataFile.csv
Text-content-md5: 3eba79a554754ac31fa0ade31cd0efe5

Simple enough, and very easy to script for automated auditing. If you store any important data in Subversion in a regulated environment, this simple feature is another way to help satisfy any compliance concerns about data integrity.
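A minimal sketch of such a script might look like this. In production the second argument would be the captured output of svnrdump dump; here it is just a saved text file, so the comparison logic stands alone:

```shell
# Hypothetical audit helper: compare a working-copy file's local md5
# against the Text-content-md5 recorded in (saved) svnrdump output.
# usage: audit_file <working-copy-file> <dump-output-file>
audit_file() {
    local_sum=$(md5sum "$1" | awk '{print $1}')
    repo_sum=$(grep -m1 '^Text-content-md5: ' "$2" | awk '{print $2}')
    if [ "$local_sum" = "$repo_sum" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (local $local_sum, repo $repo_sum)"
    fi
}
```

Run over a list of sensitive files on a schedule, this gives you a simple, scriptable integrity report.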

If you have any regulatory or compliance concerns around Subversion then grab the latest certified binaries, ask us for advice, or try out SVN MultiSite’s 100% data safety capability.


Git Data Mining with Hadoop

Detecting Cross-Component Commits

Sooner or later every Git administrator will start to dabble with simple reporting and data mining.  The questions we need to answer are driven by developers (who’s the most active developer) and the business (show me who’s been modifying the code we’re trying to patent), and range from simple (which files were modified during this sprint) to complex (how many commits led to regressions later on). But here’s a key fact: you probably don’t know in advance all the questions you’ll eventually want to answer. That’s why I decided to explore Git data mining with Hadoop.

We may not normally think of Git data as ‘Big Data’. In terms of sheer volume, Git repositories don’t qualify. In several other respects, however, I think Git data is a perfect candidate for analysis with Big Data tools:

  • Git data is loosely structured. There is interesting data available in commit comments, commit events intercepted by hooks, authentication data from HTTP and SSH daemons, and other ALM tools. I may also want to correlate data from several Git repositories. I’m probably not tracking all of these data sources consistently, and I may not even know right now how these pieces will eventually fit together. I wouldn’t know how to design a schema today that will answer every question I could ever dream up.

  • While any single Git repository is fairly small, the aggregate data from hundreds of repositories with several years of history would be challenging for traditional repository analysis tools to handle. For many SCM systems the ‘reporting replica’ is busier than the master server!

Getting Started

As a first step I decided to use Flume to stream Git commit events (as seen by a post-receive hook) to HDFS. I first set up Flume using a netcat source connected to the HDFS sink via a file channel. The flume.conf looks like:

git.sources = git_netcat
git.channels = file_channel
git.sinks = sink_to_hdfs
# Define / Configure source
git.sources.git_netcat.type = netcat
# listen on all interfaces (bind address assumed)
git.sources.git_netcat.bind =
git.sources.git_netcat.port = 6666
# HDFS sinks
git.sinks.sink_to_hdfs.type = hdfs
git.sinks.sink_to_hdfs.hdfs.fileType = DataStream
git.sinks.sink_to_hdfs.hdfs.path = /flume/git-events
git.sinks.sink_to_hdfs.hdfs.filePrefix = gitlog
git.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
git.sinks.sink_to_hdfs.hdfs.batchSize = 1000
# Use a file channel, which buffers events on disk for durability
git.channels.file_channel.type = file
git.channels.file_channel.checkpointDir = /var/flume/checkpoint
git.channels.file_channel.dataDirs = /var/flume/data
# Bind the source and sink to the channel
git.sources.git_netcat.channels = file_channel
git.sinks.sink_to_hdfs.channel = file_channel

The Git Hook

I used the post-receive-email template as a starting point as it contains the basic logic to interpret the data the hook receives. I eventually obtain several pieces of information in the hook:

  • timestamp

  • author

  • repo ID

  • action

  • rev type

  • ref type

  • ref name

  • old rev

  • new rev

  • list of blobs

  • list of file paths

Do I really care about all of this information? I don’t know yet – and that’s exactly why I’m simply stuffing the data into HDFS for now. I may not need every field today, but I might a couple of years down the road.

Once I marshal all the data I stream it to Flume via nc:

from subprocess import Popen, PIPE, STDOUT

nc_data = \
 "{0}|{1}|{2}|{3}|{4}|{5}|{6}|{7}|{8}|{9}|{10}\n".format( \
 timestamp, author, projectdesc, change_type, rev_type, \
 refname_type, short_refname, oldrev, newrev, ",".join(blobs), \
 ",".join(paths))
p = Popen(['nc', NC_IP, NC_PORT], stdout=PIPE, \
 stdin=PIPE, stderr=STDOUT)
nc_out = p.communicate(input="{0}".format(nc_data))[0]

The First Query

Now that I have Git data streaming into HDFS via Flume, I decided to tackle a question I always find interesting: how isolated are Git commits? In other words, does a typical Git commit touch only one part of a repository, or does it touch files in several parts of the code? If you work in a component based architecture then you’ll recognize the value of detecting cross-component activity.

I decided to use Pig to analyze the data, and started by ingesting data with HCat.

hcat -e "CREATE TABLE GIT_LOGS(time STRING, author STRING, \
  repo_id STRING, action STRING, rev_type STRING, ref_type STRING, \
  ref_name STRING, old_rev STRING, new_rev STRING, blobs STRING, paths STRING) \
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' \
  STORED AS TEXTFILE LOCATION '/flume/git-events';"

Now for the fun part – some Pig Latin! Actually detecting cross-component activity will vary depending on the structure of your code; that’s part of the reason why it’s so difficult to come up with a canned schema in advance. But for a simple example let’s say that I want to detect any commit that touches files in two component directories, modA and modB. The list of file paths contained in the commit is a comma delimited field, so some data manipulation is required if we’re to avoid too much regular expression fiddling.

-- load from hcat
raw = LOAD 'git_logs' using org.apache.hcatalog.pig.HCatLoader();

-- tuple, BAG{tuple,tuple}
-- new_rev, BAG{p1,p2}
bagged = FOREACH raw GENERATE new_rev, TOKENIZE(paths) as value;
DESCRIBE bagged;

-- tuple, tuple
-- tuple, tuple
-- new_rev, p1
-- new_rev, p2
bagflat = FOREACH bagged GENERATE $0, FLATTEN(value);
DESCRIBE bagflat;

-- create list that only has first path of interest
modA = FILTER bagflat by $1 matches '^modA/.*';

-- create list that only has second path of interest
modB = FILTER bagflat by $1 matches '^modB/.*';

-- So now we have lists of commits that hit each of the paths of interest.  Join them...
-- new_rev, p1, new_rev, p2
bothMods = JOIN modA by $0, modB by $0;
DESCRIBE bothMods;

-- join on new_rev
joined = JOIN raw by new_rev, bothMods by $0;
DESCRIBE joined;

-- now that we've joined, we have the rows of interest and can discard the extra fields from both_mods
final = FOREACH joined GENERATE $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10;
DUMP final;

As the Pig script illustrates, I manipulated the data to obtain a new structure that had one row per file per commit. That made it easier to operate on the file path data; I made lists of commits that contained files in each path of interest, then used a couple of joins to isolate the commits that contain files in both paths. There are certainly other ways to get to the same result, but this method was simple and effective.

In A Picture

A simplified data flow diagram shows how data makes its way from a Git commit into HDFS and eventually out again in a report.

Data Flow

What Next?

This simple example shows some of the power of putting Git data into Hadoop. Without knowing in advance exactly what I wanted to do, I was able to capture some important Git data and manipulate it after the fact. Hadoop’s analysis tools make it easy to work with data that isn’t well structured in advance, and of course I could take advantage of Hadoop’s scalability to run my query on a data set of any size. In the future I could take advantage of data from other ALM tools or authentication systems to flesh out a more complete report. (The next interesting question on my mind is whether commits that span multiple components have a higher defect rate than normal and require more regression fixes.)

Using Hadoop for Git data mining may seem like overkill at first, but I like to have the flexibility and scalability of Hadoop at my fingertips in advance.

Certified Git 1.8.4 Binaries Available

Git 1.8.4 (released on August 23rd) contains a nice collection of improvements and bug fixes.  Best of all, there’s no need to wait for an updated package or rebuild from source.  WANdisco has just released certified Git 1.8.4 binaries for all major platforms.

Here’s a list of key improvements:

  • Support for Cygwin 1.7
  • An update for git-gui
  • More flexible rebasing options allow you to select rebase strategy and automatically stash local changes before rebase begins
  • A contrib script that mines git blame output to show you who else might be interested in a commit
  • Improvements to submodule support, including the ability to run git submodule update from the submodule directory and the option to run a custom command after an update
  • A performance improvement when fetching between repositories with many refs
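For example, the autostash behavior from the rebasing bullet above can be switched on per repository (a small sketch using a throwaway repo):

```shell
# Throwaway repo for illustration.
cd "$(mktemp -d)" && git init -q .
# New in Git 1.8.4: stash local changes automatically before a rebase
# begins and reapply them when it finishes.
git config rebase.autostash true
git config rebase.autostash
```

With this set, a rebase started in a dirty working tree no longer aborts with “cannot rebase: you have unstaged changes.”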

Of course, WANdisco offers Git training and consulting to help you get the most out of your Git deployments.  Grab Git 1.8.4 and start taking advantage of those new features!

Don’t forget to sign up for Subversion & Git Live 2013 in Boston, San Francisco or London!


Announcing Speakers for Subversion & Git Live 2013

Subversion & Git Live 2013 returns this October featuring expert-led workshops, presentations from industry leading analysts on the future of Subversion & Git, and unique committer roundtable discussions. In addition to all of the great sessions at this year’s conference, we’re pleased to announce a new keynote by Jeffrey S. Hammond, VP, Principal Analyst with Forrester Research. Hammond is a widely recognized expert in software application development and delivery. He will present “The Future of Subversion and Git,” and discuss the differences between these popular version control systems, while pointing out the issues IT organizations should consider before deciding which one to use.

Jeffrey Hammond joins Apache Software Foundation Vice Chairman and VP of Apache Subversion Greg Stein, who will present “Why Open Source is Good for your Health,” a look at how open source software works, how communities manage complex projects, and why it’s better for your business to rely on open-source rather than proprietary software.

Sessions include:

  • The Future of Subversion and Git

  • Why Open Source is Good for your Business

  • Subversion: The Road Ahead

  • What Just Happened? Intro to Git in the Real World

  • Practical TortoiseSVN

  • Introduction to Git Administration

  • Progress in Move Tracking

  • Developments in Merging

  • Git Workflows

  • …and more!

Subversion & Git Live is coming to Boston (Oct. 3), San Francisco (Oct. 8), and London (Oct. 16). View the agenda, travel and hotel details, and register on the event site. We’re offering you and anyone you’d like to invite a 30% discount off the normal $199 registration fee if you register using promo code REF2013. Normally conferences featuring speakers of this caliber cost four times as much. Space is going fast, so register now!

Verifying Git Data Integrity

As a Git administrator you’re probably familiar with the git fsck command, which checks the integrity of a Git database. But in a large deployment you may have several mirrors in different locations supporting users and build automation (the record I’ve heard so far is over 50 mirrors). You can run git fsck on each one as part of normal maintenance, but even if every mirror has an intact database, how do you make sure that all of the mirrors are consistent with the master repository? You need a simple way to verify Git data integrity for all repositories at all sites.

That’s quite a difficult question to answer. If you have 20 or 30 mirrors, you want to know if any of them are not in sync with the master. Inconsistencies may arise if the replication is lagging behind, or if there is some other subtle corruption.
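A rough manual check illustrates the idea: hash the complete list of refs on each node and compare the digests. Here's a sketch in a scratch repository (names and identities are illustrative):

```shell
# Build a scratch repository with one commit, then fingerprint its refs.
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.email you@example.com && git config user.name you
echo data > f.txt && git add f.txt && git commit -qm "commit"

# Run the same command on each mirror; matching digests mean the refs agree.
git for-each-ref --format='%(refname) %(objectname)' | sort | sha1sum
```

Matching digests mean the refs agree, but this manual approach says nothing about whether the nodes are reporting at the same transaction point.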

Git MultiSite provides a simple consistency checker to answer this question quickly. (Bear in mind that Git MultiSite nodes are all writable peer nodes; it does not use a master-slave paradigm.  But the ability to make sure that all peer nodes are consistent is equally valuable.)  The consistency checker can be invoked for any repository in the administration console:

Git Consistency Check


The consistency checker computes a SHA1 ID over the values of all the current refs in the repository on each replicated peer node. This SHA1 is tied to the Global Sequence Number (GSN), which uniquely identifies all of the proposals in Git MultiSite’s Distributed Coordination Engine. The result looks like this:

Consistency Check Result



First, I see that the GSN matches across all three nodes. I’m now confident that they’re all reporting results at a consistent point, when the same transactions should be present in all nodes. In other words, I’m able to discount any inconsistencies due to network lag.

More importantly, I see that the SHA1 for the second node doesn’t match the other two. That’s a red flag, and it means that I should immediately investigate what’s wrong on that node.

Now consider this example:


Different GSN


Notice that the third node is reporting an earlier GSN (23 versus 29) compared to the other two nodes. That tells me that this node is lagging behind, which may be expected if it’s connected over a WAN and always running 2-3 minutes behind the other nodes.

Running a distributed SCM environment is very difficult, and the consistency check is another way that Git MultiSite makes things easier for you. Check out a free trial and see for yourself!



Why We Don’t Build Software Like We Build Houses

In Why We Should Build Software Like We Build Houses, Leslie Lamport makes the case for creating detailed specifications for software, following the model of how an architect’s blueprints are used for building a house.

Lamport goes on to ask: “Can this be why houses seldom collapse and programs often crash?” Seldom, perhaps, but when houses do collapse the consequences are catastrophic: more than 68,000 people were killed in the 2008 Sichuan earthquake, most of them in the collapse of poorly designed or constructed houses.

However, as bad as that was, the 1556 Shaanxi earthquake killed more than 800,000 people, at least 0.25% of the world’s total population at the time. The equivalent death toll today: 17.5 million.

It’s hard to imagine an earthquake today that would kill 17.5 million people. And that’s largely because we know more about how to design and build houses for events such as earthquakes than we did in 1556. The point of this is to illustrate that methods for building houses are far more mature than those for building software. Humans have been building houses at least since the last Ice Age; we have been building software for barely 50 years. That’s a reason why we don’t build software like we build houses – it’s simply so new that we are still in the early days of figuring out what works.

Knowing how to build houses or software to be robust during failure conditions is part of the solution. Another part is that architects and contractors have a mature form of communication: the blueprint. Blueprints are such complete and specific documents that a contractor could reasonably be expected to complete a house even without continued input from the architect. Architects who create blueprints have extensive knowledge about materials, joints, weatherproofing, building codes, fasteners – almost every detail. Most people responsible for software products rarely create specifications to the same level of detail as an architect’s blueprint. Whatever the equivalent is, it clearly needs to be more developed than what we normally think of as a software “spec”.

Indeed, the software industry in large part has pointedly turned its back on well-developed construction paradigms with trends like Extreme Programming, Agile, Lean, and Minimally Viable Product. Many of these techniques involve iteration, quick results, and fast prototyping. I see these reactions not as embracing a superior method, but instead reflecting creative searching within a new technology. Eventually, I think we will develop more formal methods for planning and building software, technology for software blueprints will emerge, and highly technical product architects will effectively span the divide between vision and code.

The relative newness of consumer and even enterprise software means we still have workarounds and a measure of tolerance for computer failures. However, even only a few decades into the computer era, we are raising our expectations around continuous availability, disaster recovery, and data safety.

This will likely drive new baselines for reliability and availability of software, and with it, the need to more fully visualize the system we are trying to build. At WANdisco, we are seeing urgent reliability requirements emerge around software development tools like Git and Subversion, and also around Hadoop as big data analysis transitions from promising curiosity to ubiquitous backbone technology. Non-Stop Data™, indeed.

We don’t build software like we build houses because we don’t know enough yet about building software, and to a lesser extent because we aren’t completely dependent on software yet. As the field matures, Lamport’s recommendation may yet become standard practice.

WANdisco Sponsors UC Berkeley AMPLab, Creators of Spark and Shark

We’re pleased to sponsor the UC Berkeley AMPLab (Algorithms, Machines, and People), a five-year collaborative effort responsible for the development of Spark, Shark and Mesos.

WANdisco previously announced the integration of Spark and Shark into our certified Apache Hadoop binaries, and we look forward to working closely with the talented AMPLab team on continued research into in-memory data storage for Hadoop.

“We are pleased with WANdisco’s strong support of AMPLab as well as Spark and Shark,” said Ion Stoica, co-director of the AMPLab in UC Berkeley’s Electrical Engineering and Computer Sciences department. “Their participation helps with market validation, and our continued work will enable businesses to quickly deploy interactive data analytics on Hadoop.”

Interested in learning more about Hadoop? Register for one of our Hadoop Webinars, or get 25% off registration to Strata RX Boston, September 25-27, using promo code WANDISCO25.

Subversion 1.7.13 and 1.8.3 Released

Today the Apache Software Foundation (ASF) announced the release of Subversion 1.7.13 and 1.8.3, bringing a number of fixes to each.

Apache Subversion 1.7.13 includes fixes for the following:

  • merge – fix bogus mergeinfo with conflicting file merges
  • diff – fix duplicated path component in ‘–summarize’ output
  • ra_serf – ignore case when checking certificate common names
  • svnserve – fix creation of pid files
  • mod_dav_svn – better status codes for commit failures
  • mod_dav_svn – do not map requests to filesystem

1.8.3 fixes:

  • ra_serf – fix crash when committing cp with deep deletion
  • diff – issue an error for files that can’t fit in memory
  • update – fix a crash when a temp file doesn’t exist
  • diff – continue on missing or obstructing files
  • ra_serf – include library version in ‘–version’ output
  • svnadmin – fix output encoding in non-UTF8 environments

For a full list of all bug fixes and improvements, see the Apache changelog for 1.7 and 1.8.

You can download our fully tested, certified binaries for Subversion 1.7.13 and 1.8.3 free here.

WANdisco’s binaries are a complete, fully-tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

SmartSVN 8 Preview 1 Released

Yesterday we released SmartSVN 8, Preview 1. SmartSVN is the cross-platform graphical client for Apache Subversion.

New SmartSVN 8 features include:

  • Support for Subversion 1.8 working copy
  • Ability to specify different merge tools for different file patterns as conflict solvers

SmartSVN 8 fixes include:

  • Possible internal error closing a project window
  • Text editors:
    • “Autoindent new lines” did not work correctly when typing, e.g. CJK characters using an IME
    • Internal error related to syntax highlighting when using an IME

For a full list of all improvements and bug fixes, view the changelog.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head over there and let us know.

You can download Preview 1 for SmartSVN 8 from our early access page.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

Git as a Service

Git MultiSite Solves Availability and Management Challenges

As an enterprise SCM administrator you’re a service provider to development organizations.  You may even have a formal Service Level Agreement (SLA) that identifies the allowed mean time between failures (MTBF) and mean time to recovery (MTTR) among other metrics. In layman’s terms, development teams expect their SCM system to be secure, highly available, and high-speed. So how do you provide Git as a service to a global development team?


First, you need to ensure availability:

  • Data integrity. How do you safeguard against potential data corruption?

  • Outages. How do you prevent downtime and meet your SLA requirements for MTBF and MTTR?

  • Performance. How do you scale up the Git service to handle more users, more sites, and more build automation?

Active-active replication provides a unique solution for each of these concerns. Each replicated peer node in an active-active deployment has a full copy of repository data, and each node periodically validates itself to detect any corruption. If one of the nodes falls out of sync due to subtle data corruption on its file system, you’ll see a warning in the Git MultiSite administration console.

Similarly, an active-active deployment provides failover out of the box. Git users can simply start using another node as their Git remote in the event of failure, or a load balancer can transparently redirect them to another node. This built-in failover provides a zero down time solution, yielding an excellent MTBF and MTTR.
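In shell terms, a manual failover is as small as repointing the remote. A sketch in a scratch repository; the node hostnames (git-node1, git-node2) are hypothetical:

```shell
# Set up a local repository whose remote points at a (hypothetical) node.
repo=$(mktemp -d) && cd "$repo" && git init -q .
git remote add origin ssh://git@git-node1.example.com/project.git

# git-node1 goes down: carry on against another replicated peer node.
git remote set-url origin ssh://git@git-node2.example.com/project.git
```

A load balancer in front of the nodes makes even this one-line change unnecessary.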

Simple HA/DR Configuration.  Note that all five nodes in this example are replicated peers; marking them as ‘HA’ or ‘DR’ nodes is simply by convention


Finally, additional read-only nodes can be added to the deployment to service build automation, while end user traffic can be handled by adding more writable nodes. Every commit is done locally, and depending on your deployment configuration, data may never have to transfer over a WAN for the commit to complete. If you are supporting users at multiple locations with servers in several data centers, you’ll appreciate the performance and flexibility that active-active replication provides.

Supporting Users in Multiple Data Centers with Flexible Replication Groups. In this example, replication groups support granular replication to different sites based on repository type and security.




Next, you need to be concerned about the security of your data. This includes:

  • Access control

  • Security of data in transit

  • Security of data at rest

  • Controlling where and how Git repositories are used

Once again, Git MultiSite covers all the bases. Git MultiSite works with WANdisco’s own access control product, and also interfaces with leading third party solutions like Gitolite. It works with secure Git transmission protocols like HTTPS and SSH, and because it is a pure Git solution with no closed back-end storage, it is compatible with disk encryption mechanisms.

Git MultiSite also offers selective replication to control where and how your repositories are accessed. Flexible replication groups can be configured to limit where sensitive repositories are replicated, and can designate read-only nodes. If you need to push a subset of your repositories to a public cloud for deployment purposes, it’s as simple as setting up a new replication group.


Read-only nodes used to serve deployment repositories



Just like any other IT service, Git repositories and servers must be managed properly. You’ll need proper reporting, auditing, and administration tools.

Git MultiSite’s administration console is your entry point to a comprehensive solution. In the console you can see the status of every node and repository, define their replication behavior, and induct new servers and repositories as they become available. More detailed deployment metrics can be obtained by sending information from Git MultiSite into a collation tool like Graphite; the REST API provides a useful integration point.


Git MultiSite Global Management



Auditing can be accomplished in several ways. Git MultiSite’s logs contain a rich history of how the deployment is configured, and normal system logs (e.g. from Apache or SSHD) provide another layer of information.


Enterprise Git

The short answer to all of these concerns is that you need Git to be an enterprise software solution, just like your databases, email servers, and other critical infrastructure. WANdisco provides the total package. Git MultiSite solves the availability, security, and management challenges of Git, and is backed by WANdisco’s Git support and services offerings. Eventually you’ll need to pick up the phone and get help quickly for a problem, and WANdisco has a team of experts ready to help.

Contact us for a free Git MultiSite trial and get started on your way to providing an outstanding Git service to your development organization.

Git Repository Metrics

Managing Git repositories means looking ahead, not just fighting today’s fire. Keeping an eye on key Git repository metrics will keep you a step ahead – and keep your development teams happy. There are several useful predictive metrics you can look at including repository size, growth rate, number of references, number of files exceeding a size threshold, and number of operations per day. These metrics help you with hardware sizing and also help you maintain good performance. You can see if you need more Git replicas to handle clones and pulls from a new development team, or if someone is checking in too many large binary files and slowing down repository performance.

Metrics in the Dashboard

How do you go about collecting this data? Most Git reporting tools focus purely on development metrics like number of commits and developer activity. By contrast, Git MultiSite has some useful metrics built right into its administration console, viewable either graphically or in a list.

Repository Size and Activity Over Time


Collecting and Viewing Metrics with Graphite

The administration console dashboard gives you a quick snapshot of key metrics over time; however, you may have your own reporting and analysis tools that provide a more elaborate monitoring framework. In that case you can pull data out of Git MultiSite’s REST API to feed into an external system, giving you complete control over how you use the repository metrics.

As a simple example, let’s look at how to track repository size over time using Graphite.  Graphite is an open source tool for storing and charting any type of numeric time-series metric.  Internally it uses a round-robin database that allows for flexible data storage management and purging of old data.

Collecting Data

First, I’ll write a script that uses curl to gather the latest repository statistics from Git MultiSite’s REST API, parse out the size, and feed it to Graphite using the plaintext protocol.

use strict;
use warnings;
use XML::Simple;

my $ENDPOINT = 'http://gitms1:8082/dcone/';
my $REPOSITORIES = 'repositories/';
my $PORT = '2003';
my $SERVER = '';
my $FEED_PREFIX = 'gitms.';
my $FEED_SUFFIX = '.size';

# Fetch the latest repository statistics from the REST API.
my $rest_call = 'curl -s ' . $ENDPOINT . $REPOSITORIES;
my $rest_output = `$rest_call`;

my $ref = XMLin($rest_output, ForceArray => 1);
my $date = `date +%s`;
chomp $date;    # strip the newline so the Graphite plaintext line stays intact

# Send one gitms.<repo name>.size metric per repository.
for (my $ii = 0; $ii <= $#{ $ref->{repository} }; $ii++) {
   my $size = $ref->{repository}->[$ii]->{repoSize}->[0];
   my $name = $ref->{repository}->[$ii]->{name}->[0];
   my $feed = $FEED_PREFIX . $name . $FEED_SUFFIX;
   print "$name\n$size\n$date\n";
   system("echo \"$feed $size $date\" | nc $SERVER $PORT");
}

I’ll set up a cron job to run this script every 5 minutes. The script will insert a metric called gitms.<repo name>.size for each repository.
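A hypothetical crontab entry for this (the script name and log path are made up):

```
*/5 * * * * /usr/local/bin/gitms-repo-size.pl >> /var/log/gitms-metrics.log 2>&1
```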

Note that there are more efficient ways to send the data to Graphite, but the plaintext protocol works well for demonstrations.

Viewing Data

Next I’ll configure a simple Graphite chart that shows the repository size over time.

Repository Size


Graphite can also show calculated metrics. Here I’ll look at a chart showing repository growth over time. (Specifically, the chart is showing the 7 day delta in repository size for each time point.)
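That kind of delta can be expressed directly as a Graphite target using the built-in diffSeries and timeShift functions. A sketch, assuming a metric named gitms.myrepo.size as produced by the collection script:

```
diffSeries(gitms.myrepo.size, timeShift(gitms.myrepo.size, "7d"))
```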

Repository Growth


As a Git administrator, I’d keep an eye out for unusual spikes in repository growth. These spikes may indicate an automated build system run amok, or a new project starting up. I may need to take corrective action or start planning for a capacity upgrade.

Tools like Graphite are purpose-built for metric storage and charting, so having an easy way to extract data from Git MultiSite using the REST API makes a great integration point.

Get Going

Git MultiSite provides an open and extensible management framework for your Git repositories, along with all the benefits of true active-active replication. If you’re interested in setting up a comprehensive Git monitoring system, ask for advice or start a free trial of Git MultiSite today.

Simple Subversion Benchmarking

A simple Subversion benchmarking tool included in the Subversion 1.8 release helps you sort out performance complaints from developers more quickly. Whenever someone complains about slow Subversion performance, you know there are at least three possibilities:

  • The Subversion server is actually slow, perhaps due to heavy load.

  • The user’s machine is slow. Recall that the Subversion client does some disk I/O and other processing during some operations, and the user might be running virus scanners and the like.

  • The user suffers from a slow network connection.

The svn-bench command is a lightweight Subversion client that omits most of the local processing. That makes it easier to get a real performance measurement without being affected by the user’s virus scanner or slow file system.

If you run svn-bench on the Subversion server itself you’ll get a baseline performance metric for a few Subversion operations. If that baseline seems slow, you can try to improve the server performance.

If you then run svn-bench on a client workstation, you can get a sense of the effects of network latency. If there’s little latency apparent, then the problem may lie in the user’s workstation.

For instance, I ran svn-bench null-export on the trunk of a Subversion repository. On the server itself, the real time was 4.1 seconds. On a workstation connected over a slow network, the real time was 32.5 seconds. That’s a good indicator that network latency is slowing things down. Just to confirm my suspicion, I ran a normal svn export on that workstation and the time only increased by a second or so over the null-export, which confirms that the bottleneck is the network rather than the client machine.
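The invocation is the same in both places; only where you run it changes (the repository URL here is illustrative):

```shell
svn-bench null-export https://svn.example.com/repos/project/trunk
```

Compare the reported times from the server and the workstation to isolate the network’s contribution.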

svn-bench is a simple but useful tool for Subversion benchmarking. You can try it out by downloading a certified SVN 1.8 binary. If you need help with Subversion performance analysis, our team of Subversion experts can help.

 Subversion is a registered trademark of the Apache Software Foundation.

Non-Stop News

Last week, our Non-Stop NameNode received a lot of attention. First of all, we announced the next version of Non-Stop NameNode WAN Edition, which includes:

  • Dynamic Group Evolution – the ability to add and remove NameNodes within Hadoop clusters on the fly without downtime, eliminating the need for scheduled maintenance.
  • Configurable Quorum Schema – new node configurations enable increased availability and deployment flexibility for more efficient use of IT infrastructure.
  • Rapid recovery for a NameNode that has been down for an extended period of time.

In addition, Non-Stop NameNode received integration and interoperability certification with Dell PowerEdge servers. Customers can now deploy Apache Hadoop in mission-critical environments where processing and access to data requires continuous availability, as opposed to relying on active-passive solutions.

Non-Stop NameNode WAN Edition applies WANdisco’s patented replication technology to deliver 100% uptime by eliminating Hadoop’s most problematic single point of failure — the NameNode — providing the first and only continuous availability solution for globally distributed deployments. With it, all NameNode servers in a Hadoop cluster deployed over a WAN actively support clients at each location and, along with the DataNodes, are continuously synchronized. The result is LAN-speed performance and access to the same data at every location. Failover and recovery are automatic both within and across data centers. Whether a single NameNode or an entire site goes down, Hadoop is always available.

Securing Your Data with Selective Git Replication

Git MultiSite Gives You Control Over Where Your Data Ends Up

If you administer Git for anything other than a personal project, you’ll wind up thinking about replication – and then you’ll wind up thinking about securing your Git data during the process.  Git MultiSite is the first Enterprise Git management system that lets you control both where and how your data is available.

To recap, there are a lot of reasons why you’ll want to replicate Git data:

  • You need a non-stop data solution with zero down time (high availability and disaster recovery).

  • You support development teams at different sites and they all need good performance.

  • You’ve invested in a continuous integration system to support Agile and continuous delivery, and it’s putting a strain on your Git repositories.

  • Your company has grown by organic expansion or acquisition and your SCM infrastructure needs to scale up to support the larger user base.

Whatever the reason, you’ve realized that you need highly available Git data. Git MultiSite is the only active-active replication solution that supports truly distributed development – but putting that aside for now, you also need to think about the security of your data as it moves around the world. [1]

There are three key questions to consider.

  1. Where should each repository be available? You may have a very sensitive repository that should not be available to partner sites in different locations. You may want to limit which repositories are available in a public cloud environment that’s used for deploying production app servers; typically only the repositories that contain your runtime configuration and environment settings belong there. Alternatively, you may not need every repository available at every location and don’t want to waste the bandwidth.

  2. How is each repository being used at each site? Should a repository be writable, or should it only be available as a read-only resource for build farms and downstream consumption?

  3. How easy is it to manage the problem? As your deployment grows from a few Git repositories to a few hundred, how are you going to monitor and audit your replication strategy?

Git MultiSite has selective replication and effective management tools baked in, so it provides an out-of-the-box answer to all three questions.

Where Does This Repository Go: Defining Replication Groups

Git MultiSite lets you define one or more replication groups to manage your deployment. A replication group is a flexible way to define how the replicated peer nodes in your MultiSite deployment share data.

As a simple example, assume that the deployment has five nodes in total, one each in Boston, Seattle, London, Sydney, and Chennai. Boston and London are the primary offices; Seattle and Sydney are data centers used for deploying production app servers; and Chennai is a partner site.

I might set up three replication groups.

  • Default Group replicates to all of the development sites – Boston, London, and Chennai.

  • Proprietary Group contains repositories with sensitive IP, and only replicates to the primary offices in Boston and London.

  • Deployment Group contains repositories with runtime configuration and environment data like Puppet manifests. It replicates to the development sites and the data center sites.

Replication Groups


How Is the Repository Used: Refining Replication Groups

WANdisco’s Distributed Coordination Engine (DConE) distinguishes between several types of replicated peer nodes. The most common type is an active voter, which participates in transaction proposals and can accept write activity. Another type is a passive node, which receives all repository data but will not accept write activity.

In the example in the previous section, the two data center nodes in the Deployment Group are passive nodes. They are necessary to provide runtime data to the production servers in the data centers, but any changes are made at the development sites.

Different Node Types


Management and Auditing: Easy Administration, Central View

Git MultiSite provides easy central management of replication groups. The administration console, available with proper authentication from any site, first provides a single view of all the nodes in the system.

Global View


The console provides a simple graphical tool for setting up replication groups, where you can define which nodes belong to a group and how they are used.

Managing Replication Groups


And finally, there’s a quick list of the repositories belonging to each replication group.

Replicas in Group


The entire configuration is captured in the audit logs.

Audit log snippet…

< X-Jersey-Trace-006: matched resource method: public
INFO: 4984 * Server in-bound request
4984 > GET

Total Control Over the Non-Stop Data

WANdisco provides non-stop data solutions, but we haven’t forgotten about the administration and security side of the picture. Git MultiSite gives you complete control and visibility over where and how your repositories are used.

 [1]  For the purposes of this discussion, consider the problem of secure transmission solved by Git’s use of either SSH or HTTPS as transmission protocols.

Versioned Access Control in Subversion 1.8

Managing and monitoring access control just got a little easier thanks to the introduction of versioned access control files in Subversion 1.8. You can now store the authz file that governs repository access (when Subversion is running over Apache or svnserve) inside the repository itself.

The easiest way to try this is to check in your authz file, then reference it in the server configuration using relative repository syntax. Let’s say I have it in the repository under the path svn://repo-host/protected/authz. I would then refer to it in svnserve.conf:

authz-db = ^/protected/authz

You should, of course, make sure that only authorized users can see and change the authz file. You may worry that you’ll lock yourself out of the repository if you make a mistake that denies all write access to the authz file, but you can always temporarily switch Apache or svnserve back to using a local authz.

If you manage several related repositories, you can store all of their authz files in a central management repository, and refer to the authz files with local file syntax. In this case, all of the repositories must have access to the same file system.
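Here’s a self-contained sketch of checking in an authz file, using a scratch repository over the file:// protocol (all paths and the authz rules are illustrative; substitute your real repository URL):

```shell
# Create a scratch repository and put a minimal authz file under version control.
repo=$(mktemp -d)/authz-demo
svnadmin create "$repo"
printf '[/]\n* = rw\n' > authz                    # minimal example rules
svn import -q -m "Version the authz file" authz "file://$repo/protected/authz"
svn ls "file://$repo/protected/"
```

From here, pointing authz-db at ^/protected/authz in svnserve.conf picks up the versioned copy.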

With this change, Subversion takes one step closer to the ideal of ‘infrastructure as code’, taking a lesson from the DevOps space. In many ways, your SCM configuration is as important as the data in the SCM system itself, so capturing this data in the SCM system is simply good practice.

Grab a certified SVN 1.8 binary today and give it a try.

Subversion is a registered trademark of the Apache Software Foundation

SmartSVN 7.6 – It’s All About Performance

We’re pleased to announce SmartSVN 7.6 is now available to download. SmartSVN is the cross-platform graphical client for Apache Subversion.

The focus with 7.6 has been performance, performance and more performance. Responding to customer feedback, we’ve worked to make 7.6 faster and lighter than its predecessors.

New SmartSVN 7.6 features include:

– Auto-update – no need to install new versions manually

– Repository Browser – defined svn:externals are shown as their own entries

– Proxy auto-detection

– External tools menu

– OS X Retina support

– Project data is saved on project creation rather than when exiting

GUI improvements include:

– File/directory input fields – support for ~ on Unix-like operating systems

– Natural sorting (“foo-9.txt” before “foo-10.txt”)

– More readable colors on the Transactions and other panes

SmartSVN 7.6 fixes include:

– Speed-search – possible internal error when typing Chinese characters

– Revision Graph – errors when deselecting all branches

– Tag Browser – possible internal error

– SVN operations – significant performance improvements

– Check Out – checking out to an already versioned directory appeared to work, then failed later

– Refresh – possible performance problems and a fix for displaying conflicts at drive root

– Issues with migrating settings and auth credentials from pre-7.5 versions

– Foundation edition: changing the project root was not possible

For a full list of all improvements and bug fixes, view the changelog

Contribute to Future Releases

Many features and enhancements in this release were due to comments made by users in our dedicated SmartSVN forum, so if you’ve got an issue, or a request for a new feature, head over there and let us know.

Supporting Git Build Automation with Git MultiSite

The two building blocks of continuous delivery are version control and automation – and boy, do you need a lot of automation. If you’re serious about building every commit and then running a progressively tougher sequence of test and deployment steps in your delivery pipeline, you should also make sure that your version control infrastructure doesn’t start creaking under the load. That’s particularly true for Git, since cloning repositories and fetching the latest data can be intensive operations. So let’s look at supporting Git build automation.

The natural first step is to use a replicated Git repository to serve the build farm. That way the load from the build servers doesn’t impact the human users on the main repository. Git MultiSite provides the perfect solution:

  • New replicated peer nodes can be added at any time to serve a growing build farm.

  • The nodes serving the build farm can be designated as read-only for better performance and security.

  • Like any Git MultiSite node, a node serving a build farm is easily managed from the Git MultiSite administration console.


Git MultiSite Deployment for Build Farm


Git MultiSite Replication Management


Sometimes your build processes can’t just use a read-only replica. They may check in modified configuration data, or simply create a tag. Do you let your main Git repository serve the build farm, or try to make the CI system use a different Git remote for write operations? After all, very few Git management solutions even offer a pass-through mirror (a mirror that forwards write activity to the master repository).

Git MultiSite’s active-active replication solves this performance challenge as well. You can use one or more active replicated peer nodes to serve the build farm.

Git MultiSite Deployment for Build Farm - Commits Accepted



In this configuration, all read requests are purely local, and any writes are coordinated with the other Git nodes in the deployment. It’s really that seamless, since each Git MultiSite node is a fully writable peer. Git MultiSite enables the ideal configuration: build automation doesn’t put load on the Git nodes that service your developers, yet your build processes can still push data if necessary.

Finally, Git MultiSite’s flexible replication groups give you the ability to provide more peer nodes for build farms on busier projects. Less active projects are not replicated to as many nodes, saving valuable bandwidth and hardware resources.

Replication Groups Support Projects with Different Loads



Supporting build automation is another example of the fantastic flexibility and scalability achieved with active-active replication. If you’re ready to set up an enterprise continuous delivery system built on Git, contact WANdisco for expert advice and a free trial of Git MultiSite.


Hadoop Summit 2013

I left Hadoop Summit last month very excited to see the traction the market is gaining. The number of Hadoop vendors, practitioners and customers continues to grow, and knowledge about the technology continues to deepen.

One of the key areas of discussion on the trade show floor was the limitations and design of the NameNode.

In Apache Hadoop 1.x, the NameNode was a single point of failure (SPoF).

This SPoF became such a significant issue that the community accelerated its efforts to mitigate this earlier design choice. While the community has started to develop a solution that addresses the shortcomings of earlier attempts, the overall system is still what we call an active-passive implementation.

Active-passive solutions have been around for many years and were designed to provide recovery where disruption of services was resolved at a different layer of the stack. For example, active-passive security solutions like firewalls have traditionally been deployed with a primary and a standby unit. In the event the primary failed, the standby would take over, and client communications (TCP) would retry and retransmit until a connection could be re-established. With services like HTTP, these active-passive solutions are sufficient and widely deployed.

However, when we start to discuss components of an architecture that are key to availability and access, a new term emerges: continuous availability.

In the past, active-passive solutions could deliver what the industry accepted as “highly available” systems. However, today’s architectures and technologies have evolved, shifting to a new requirement that is being described as “continuously available”.

One area we found ourselves explaining was the difference between our 100% Continuous Availability™ solution for HDFS and the design changes being implemented in the Apache Hadoop 2.x branch. As you can see from the references in the Apache documentation, the new Quorum Journal Manager is an active-passive solution.

“This guide discusses how to configure and use HDFS HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes.”

This is the main difference between open-source HA and the technology developed by WANdisco. WANdisco’s Non-Stop NameNode is an active-active solution built on Paxos, a widely studied family of protocols for solving consensus in a network of unreliable processors.

WANdisco’s implementation, known as DConE, is the core IP used to ensure 100% uptime of the Apache Hadoop 2.0 NameNode processes, and it therefore provides continuous availability of, and access to, HDFS during planned and unplanned outages in critical infrastructures.
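The intuition behind Paxos-style agreement can be sketched in a few lines: a change is accepted only when a majority (quorum) of nodes agrees, and because any two majorities overlap, two conflicting changes can never both win. This toy model is of course nothing like DConE itself; it omits proposal numbering, recovery, and everything else that makes a production implementation hard:

```python
def quorum_accept(yes_votes, cluster_size):
    """Accept a proposal only if a strict majority of the cluster voted yes."""
    return yes_votes > cluster_size // 2

# In a five-node cluster, three yes votes form a quorum; two do not.
print(quorum_accept(3, 5))  # True
print(quorum_accept(2, 5))  # False
```

The overlap property is what lets every node keep serving both reads and writes: any accepted change is guaranteed to be seen by the next quorum that forms.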

After spending two days speaking with many attendees, it became very clear to me that the strength of WANdisco’s DConE technology, and its application to the Apache Hadoop NameNode, is no trivial matter.

We had the opportunity to talk with representatives from some of the largest Web 2.0 companies in Silicon Valley, including Yahoo, LinkedIn, Facebook and eBay. Being able to demonstrate our active-active solution to key industry technologists, architects and Hadoop developers was the highlight for our team, and we are excited about the official release of our WAN solution.

Improved Subversion Property Searching in Subversion 1.8

Subversion 1.8 has a new feature that makes for improved property searching. This feature has several possible applications, including the ability to dictate some configuration centrally, but I’ll focus on a custom workflow example.

First, what’s the new feature? Subversion 1.8 now lets you ask for the value of a property and see the value for a particular element and its ancestors. For example, let’s say that I use a custom property called quality to store a quality metric from an automated review tool. The tool normally stores the property on directories holding major components, but occasionally stores the property on lower level directories or even single files if it identifies a quality hot spot.

Normally propget only shows me the value on a single element:

> svn propget quality libs/audio/codec/
red: 42/100

The new --show-inherited-props option will show me the value for that element, plus all of the parent directories:

> svn propget -v --show-inherited-props quality libs/audio/codec/
Inherited properties on 'libs/audio/codec/',
from 'libs':
Inherited properties on 'libs/audio/codec/',
from 'libs/audio':
Properties on 'libs/audio/codec/':
   red: 42/100

Now it’s easy for me to see quality metrics at whatever granularity I’d like – and in this example I can see the metrics all the way up from a single directory to the component level.

Inherited properties can also be used to centrally manage some configuration variables, like defining MIME types via auto-props. Paul Burba has a good sequence of articles describing these use cases in more detail, as well as some of the nuances of inherited properties.
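As a sketch of the auto-props use case (paths and property values here are illustrative): Subversion 1.8 treats svn:auto-props as an inheritable property, so an administrator can set it once near the root of the tree and let every 1.8 client inherit it.

```shell
# Set an inheritable auto-props mapping on the working copy root,
# then commit so every SVN 1.8 client picks it up automatically.
svn propset svn:auto-props "*.png = svn:mime-type=image/png" .
svn commit -m "Centrally manage MIME types via inherited auto-props"
```

Clients no longer need a hand-maintained [auto-props] section in their local config for these mappings.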

If you’d like to try inherited properties, grab a certified SVN 1.8 binary from WANdisco.

Subversion is a registered trademark of the Apache Software Foundation.

Isn’t Git Already Multi-Site?

We recently released our Git MultiSite product, the newest member of the MultiSite family, built on the unique and proven replication platform powering SVN MultiSite and SVN MultiSite Plus.

I can hear the cries already: Git is multi-site by nature, so why build a multi-site product for Git?

There are definitely common use cases where Git by itself works just fine, just as there are hundreds of thousands of Subversion servers in happy use without MultiSite, mirrors or a replication strategy. In particular, the kinds of open source projects where Subversion and Git were born and bred are generally well served by the stock tools. But recently Git has been joining battle-tested Subversion in the enterprise, where its untamed nature has the potential to become a multi-headed monster instead of a multi-site deployment.

Extreme environments

The nature of enterprise software development, and of some other types of development, creates additional requirements for availability, data safety and global scalability, stretching the nascent tools of open source into challenging new circumstances. This extreme environment is WANdisco’s natural habitat, where we combine strong support of open source projects with enterprise know-how.

Global Software Development

Enterprise software development has largely become global software development, with virtual teams often assembled for a project based on availability and skill set rather than location. With larger teams and larger projects, cloning over a LAN becomes increasingly preferable to cloning over the WAN. Replication brings the data to a LAN near each developer site, and so replication forms the core technology behind a scalable enterprise Git solution such as our Git MultiSite product.


High availability and business continuity are also important to large enterprise software development projects.  Even seemingly small amounts of downtime can add up to millions of dollars a year in costs, as reported in recent Forrester research.  And while developers can continue to commit locally with Git even if their shared repository is down, this runs contrary to the trends of continuous integration and continuous delivery, where it’s important to get changes into the mainline as soon as possible. That’s why Git MultiSite can be configured in a variety of ways to provide seamless failover. When failed nodes come back online, they transparently rejoin the group and “self-heal” by automatically catching up to their replicated peers.

Disaster Recovery

I’ve seen it stated that since cloned Git repositories are all copies of the master repository, disaster recovery is as simple as finding a developer who has pulled recently, and then rsyncing that repo as the new master shared repo. Now I hear cries from the enterprise administration side of the room: how do we restore server-side configurations, scripts and access control? What does “recently pulled” mean, and how do I find that developer at 3AM? Did that developer pull all the branches or just master? What if someone had just accidentally deleted a branch ref? Can I still recover using a reflog from a clone? Wouldn’t it be better to consider a system that solves all of that with no operator interaction?


These are some of the reasons that we built a Git MultiSite product despite Git’s reputation of being multi-site by nature.  Do any of these needs speak to you?

Five Things to Avoid in your Enterprise Git Solution

Are you a Git administrator considering a new enterprise Git solution? To help you along the decision path, here are five things to avoid when choosing your Git solution.

No plan for data and business continuity

The data you keep in Git is critical to the successful operation of your business. It has to be secure and highly available. When evaluating a Git solution, if it doesn’t provide excellent protection against hardware failures and other disasters, look elsewhere.

No growth path

Successful companies grow rapidly. Within a year, the size of your team could double thanks to growth and acquisition, and new challenges such as a partnership with a company overseas demand flexibility from your infrastructure. Your Git solution should help you scale to meet these challenges, not make you design your own replication system.

It’s not really Git

Avoid any Git solution that doesn’t use plain old Git repositories under the hood. Otherwise you may end up tied into a proprietary framework, missing out on the tremendous portability that Git offers. Data translation is a necessary evil during migration – not something you should do on a daily basis.

Narrow field of vision

Keeping track of the tens or hundreds of Git repositories you maintain on several servers in multiple locations is a challenge. You want a solution that allows you to see the whole deployment at a glance, rather than one piece at a time. Can you see whether the repositories in your satellite office are up and performing well without calling someone there?

They don’t know Git

When you purchase a Git solution, you want a vendor that will stand behind it and help you get the most out of Git. Do they offer Git support and services? If you can’t pick up the phone and talk to an expert who knows more about Git than you do, what are you paying for?

Learn more about our Git solutions here and our Git services and support here.

Certified Git Binaries Now Available

WANdisco has supplied certified Subversion binaries for years, and now our certified Git binaries are available for several major platforms. And all I can say is…thank goodness.

If you’re a Linux user and your distribution has an out of date Git package in its repositories, it can be quite an inconvenience. I’ve built and installed Git from source a number of times, and it requires several third-party packages as dependencies, especially if you want to build the docs. Some of these packages can be hard to find on Ubuntu, and downloading and building everything from scratch requires 15 minutes I can’t afford each time. Now I can just grab a certified, up-to-date package from WANdisco.

SCM administrators have an even bigger challenge: making sure every user has a suitable Git package on their workstation (you may not care all the time, but some Git tools and integrations require newer versions of Git). In this case, you definitely don’t want to build Git from source, or have your users do it themselves. You could end up with a release that has bugs, leaving you to rebuild, pick up the patches and push a new build out to all your users – which can quickly become a maintenance headache.

Now you can take advantage of fully tested and certified, continuously up-to-date Git binaries with all the latest bug fixes. WANdisco also offers support and services packages. Just visit our download page and you’ll find all the information you need to get started.

Subversion and the Mainline Model

Subversion: Built for the Best Branching Model

Over the years Subversion has taken a few lumps for its branching and merging tools, but the latest release has fixed a key problem, and it’s worth remembering that the workflow Subversion supports best is also the best workflow: the mainline model. This model is the recommended approach in the continuous integration and continuous delivery communities.  In this article we’ll look at Subversion and the mainline model from a high level.

Most of the workflows that are popular in the Git community can be used for SVN as well; even if the implementation differs, the mainline model is just about the same.

In the mainline model, you have very few long lived branches. Most commits happen directly on trunk, and developers are encouraged to commit frequently. Avoiding long lived feature branches ensures that integration happens sooner rather than later, avoids painful merges, and gives you all the benefit of continuous integration practices.

SVN Mainline Model


The diagram above shows what your branching model looks like if you’re a true believer in the mainline model, meaning you don’t branch until you hit a release point and need to start isolating bug fixes from new development work. Topic branches [1] are used for very short periods of time, primarily to give developers a private area to save work before committing to trunk. Topic branches also are natural points for pre-flight code review and continuous integration.

If this simple diagram looks familiar, it’s because it’s the model that Subversion was built for, matching the familiar trunk/branches/tags layout of a Subversion repository. As of Subversion 1.8, Subversion’s merge engine handles this model well. Merges are primarily done to keep a topic branch up to date, and since a topic branch only lives for a very short time, the merges are easy. Similarly, symmetric merges between trunk and release branches are handled very well.

It’s also worth recalling one of Subversion’s perennial strong points: making branches is cheap and easy, requiring only one command or a few clicks in a Subversion GUI. There’s no overhead with Subversion branches, so developers can make as many private topic branches as they like without incurring any penalties. Making a lot of very small, short-lived topic branches is a much safer practice than working on a few big feature branches. Similarly, you can tag (make a read-only branch) at any point to indicate key milestones. If making a branch is hard in your SCM tool, or you’ve been warned not to make too many branches in order to avoid a performance problem, then you should take a look at Subversion or Git. The open source SCM tools have solved this problem.
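To underline how cheap this is, here is what those one-command operations look like (repository paths and names are illustrative):

```shell
# A branch is a constant-time, copy-on-write operation on the server.
svn copy ^/trunk ^/branches/topic-1234 -m "Topic branch for issue 1234"

# A tag is made with exactly the same mechanism.
svn copy ^/trunk ^/tags/milestone-2.0 -m "Milestone 2.0"
```

Neither command copies any file data; the server just records a cheap pointer to the source revision.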

Subversion’s merge engine may not be perfect, but it works very well as part of a continuous delivery system. If you have an elaborate branching model with multiple levels and a lot of sideways merges, no merge engine will save you from eventual trouble. Not only will your revision graph look like spaghetti, but you’ll eventually run into bigger workflow and process problems.

And of course, you can scale Subversion to support a large distributed software team using WANdisco’s suite of MultiSite products. The mainline model isn’t limited to a small co-located team anymore.

Trying to figure out the best way to deploy Subversion? WANdisco has a team of SVN experts waiting to help!

Subversion is a registered trademark of the Apache Software Foundation.

 [1] The Git community uses the term topic branch.  Other names include task branches, development branches, and feature branches.  But I think topic branch captures the use case perfectly – a branch that holds one small topic of work.


Subversion 1.8.1 Released

Following June’s long-awaited release of Subversion 1.8, the Apache Software Foundation (ASF) has announced the first update, 1.8.1.

Apache Subversion 1.8.1 is largely a bug-fix release, including fixes for the following:

  • upgrade –  fix notification of 1.7.x working copies
  • resolve –  improve the interactive conflict resolution menu
  • translations updated for German and Simplified Chinese
  • improved error messages when encoding conversion fails
  • update –  fix some tree conflicts not triggering resolver
  • merge –  rename ‘automatic merge’ to ‘complete merge’
  • log –  reduce network usage on repository roots
  • commit –  remove stale entries from wc lock table when deleting
  • wc –  fix crash when target is symlink to a working copy root
  • mod_dav_svn –  better status codes for anonymous user errors
  • mod_dav_svn –  better status codes for commit failures

For a full list of all bug fixes and improvements, see the Apache changelog.

You can download our fully tested, certified binaries for Subversion 1.8.1 free here.

WANdisco’s binaries are a complete, fully tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Using TortoiseSVN?

There is an updated version of TortoiseSVN, fully compatible with Subversion 1.8.1, available for free download now.

Git Workflows and Continuous Delivery

Using MultiSite Replication to Facilitate a Global Mainline

Although Git is a distributed version control system (DVCS), it can support almost any style of software configuration management (SCM) workflow. The lines between the four prominent workflows in the Git user community can be blurry in implementation, but there are important conceptual differences between them.  Understanding these differences is important when considering the use of Git workflows and continuous delivery in your organization.

After an introduction to these workflows, we’ll evaluate how they match up against continuous integration and continuous delivery best practices, and then look at their application with global software development teams.

Workflow Overview

Fork and pull

In this model, a developer will fork (clone) a Git repository and work independently on their own server-side copy. When the developer has a change ready to contribute, he/she will ask the upstream maintainers to pull the changes into the original repository.

This model originated in open source projects and is prominent in that community. Contributors to open source projects may not even know one another and rely on a trusted set of upstream maintainers to review any contributions.


Figure 1: Fork and pull workflow

Feature branches

In this model, new branches are made for each feature (also called a task or topic) and are sometimes shared with the master repository. When changes are approved they are merged to the mainline (master) branch.

This model suits many small teams, as they are able to collaborate in a single shared repository yet still isolate new work to an individual or a small group. Functionally it is very similar to fork-and-pull, but a feature branch usually has a shorter lifespan than a forked repository.


Figure 2: Feature branch workflow

Mainline model

In this model, most work is committed directly to the trunk (master branch). There are few, if any, long lived branches less stable than the trunk. Long lived branches are sometimes used for release maintenance. Developers are encouraged to commit to the trunk frequently, perhaps daily. Local branches and stashes can be used for pre-flight review and build, but are not promoted to the shared repository.

The mainline model is strongly recommended in continuous integration and continuous delivery paradigms. It encourages very frequent reconciliation of new work, preventing any buildup of merge debt. Following this model, work is merged and up to date on a regular basis and available for testing and possible deployment.

The mainline model scales to large teams in enterprise settings but requires a high level of development discipline. For example, large new features must be decomposed into small incremental changes that can be committed rapidly. Furthermore, incomplete work may be hidden by configuration or feature toggles.

There is often a fine distinction between practical use of the mainline model and the feature branch workflow. If feature branches are personal, local, and short lived, they are consistent with the mainline model. However, a formal promotion process (merge requests) rather than a plain push can slow the pace of commits: if every developer commits once a day, every one of those commits needs a human review.
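In day-to-day terms, the mainline rhythm with a short-lived local branch might look like the following sketch (branch and remote names are hypothetical):

```shell
# Start a private, short-lived topic branch for one small task.
git checkout -b topic-login-fix master
# ...make one small, incremental change and commit it locally...
git commit -am "Fix login redirect"
# Reconcile with the latest mainline, integrate, and publish.
git checkout master
git pull --rebase origin master
git merge --ff-only topic-login-fix
git push origin master            # the topic branch never leaves this machine
git branch -d topic-login-fix
```

Because the branch stays local and lives for hours rather than weeks, this remains consistent with the mainline model.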


Figure 3: Mainline workflow

Git Flow

“Git Flow” is a popular model developed by Vincent Driessen[1]. It recommends a long lived development branch containing work-in-progress, a stable mainline, and feature, hot fix, and release branches as necessary. It is somewhat similar to a mainline model with long lived integration branches and feature branches.

Unlike the mainline model, however, the Git flow model violates some of the precepts of continuous integration. Notably, work may be left on the development branch or feature branches, not integrated with the latest changes on the mainline, for a long period of time. Nonetheless this model is often a comfortable transition for teams new to Git and continuous integration. It may also feel more natural for products with a clear distinction between stable development and production code, as opposed to SaaS products that deliver new changes daily.


Figure 4: Git flow

Application to Continuous Delivery

Continuous delivery means that each commit is a potential release candidate. Building on continuous integration principles, each commit is merged into the trunk and subjected to a progressively more difficult series of test and verification steps. For example, a commit may run through a pre-flight build, unit testing, component testing, performance testing, staging deployment, and production deployment. The later stages are more expensive and time consuming, and may even involve human review. A commit that passes all the stages is available to deploy (but is only deployed when the business is ready). A failure must be addressed as soon as possible.
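The stage sequence described above can be sketched as a chain of gates: run the cheap checks first and stop at the first failure. The stage names and checks below are hypothetical, just to show the shape of the pipeline:

```python
def run_pipeline(commit, stages):
    """Run a commit through progressively tougher stages; stop at the first failure."""
    for name, check in stages:
        if not check(commit):
            return "failed at " + name
    return "release candidate"

stages = [
    ("pre-flight build", lambda c: c["compiles"]),
    ("unit tests",       lambda c: c["units_pass"]),
    ("staging deploy",   lambda c: c["deploys"]),
]

print(run_pipeline({"compiles": True, "units_pass": False, "deploys": True}, stages))
# failed at unit tests
```

A commit that survives every gate is the release candidate; anything that fails is flagged at the earliest, cheapest possible stage.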

To view it in another light, continuous delivery tries to reduce isolation by vetting and surfacing new work as quickly as possible. Important new features are not hidden in forks or branches for weeks – they are integrated, tested, and made available to the business as soon as possible.


As noted earlier, the mainline model is best suited to continuous delivery and is strongly recommended in the literature. Eliminating long lived development branches ensures that every change is tested and integrated quickly, delivering value to the business frequently. It also enforces good habits like decomposing stories and features into incremental tasks that are less likely to cause breakages.

The fork and pull model can leave changes isolated in other repositories for long periods of time, and often involves a gated promotion process. It is the workflow least suited to rapid development in large enterprise teams.

The feature branch workflow occupies a middle ground. If the feature branches are local and short lived, they effectively serve as private staging areas. The promotion process (merge request) should be automated as much as possible with little human intervention.

Git flow is a workable model but introduces a second long-lived branch, putting distance between development and deployment.


Consider adopting the mainline development model as advocated by the continuous integration paradigm. Committing to the trunk once a day is a sea change for developers used to working on isolated branches (or forks) for long periods of time. Though developers may be skeptical, the risk and discomfort are mitigated by:

  • Running rigorous pre- and post-commit tests, provided you have the latest code and dependencies and can rely on fast continuous integration.

  • Being able to pull updates quickly several times a day.

  • Being able to commit quickly, particularly if a prior commit introduced a breakage and you must fix it or roll back.

Reducing the risk and discomfort of the mainline model therefore imposes several demands on the SCM system. These demands are even more challenging when you are working with several teams in different locations: you have many more contributors, and the product is assembled from multiple components.

These scaling and infrastructure challenges illuminate the isolation that often arises from working in a large distributed environment. Data may be local to, or effectively mastered at, one site, and all the complications of working over a WAN will hinder performance and slow the development tempo.

Global software development on complex projects is common to enterprise software development and complicates the adoption of continuous delivery. In order for a set of large distributed teams to adopt continuous delivery and the mainline model, they must have the tools to overcome data isolation of all kinds:

  • A version control infrastructure that allows a developer at any site full access to the latest source code with the ability to commit frequently.

  • The ability to set up continuous integration (build and test) infrastructure that operates well under heavy load at multiple locations.

  • The support to cope with tens or hundreds of repositories containing the product components, configuration data, environment settings, and other necessary material.

In short, the mainline model reduces isolation introduced by non-optimal codeline models (i.e. new work lingering in long lived branches) to make sure that new work is available quickly. Development teams need the support of a solid SCM infrastructure to adopt the mainline model and avoid the isolation that often comes from working in large distributed teams.

Solving Continuous Delivery Challenges for Global Development with MultiSite Replication

An SCM system that only functions well in a LAN environment under moderate load will not suffice for global development projects. A simple master-slave data replication scheme will not overcome the complexities of operating in a large distributed environment.

Only a true active-active replication system can scale up an SCM system to cope with continuous delivery for a global distributed software organization. With active-active replication as provided by WANdisco’s family of MultiSite products, each node in the system is a peer, usable for any operation at LAN speeds.

  • With an active-active replication system, teams at all sites are first class citizens and can use and access key data with no latency bottlenecks.

  • Likewise, additional peer nodes can handle the load imposed by larger teams of contributors and the associated build and test automation.

  • Since the system is self-healing with automated failover and high availability, there is no risk of down time due to maintenance windows, hardware failures, or network outages.

  • Selective replication means that an administrator can choose which repositories are replicated to which sites. Repositories with production environment data may only be replicated to sites that interact with runtime servers, for example.

  • The MultiSite administration console provides global visibility across all servers and repositories, making it easier to coordinate a product assembled from several components kept in separate repositories.


Git can support many development workflows. The mainline model is considered optimal for continuous delivery.

The code in the SCM system delivers value to the business when it is available to the customer. Continuous delivery is a set of practices designed to reduce the isolation of the data and get it to customers sooner. Active-active replication fully supports the mainline model and other continuous delivery best practices by making the data available when and where it is needed throughout the delivery pipeline.

Learn more about our Git solutions here and our Git services and support here.


Git Is Not Distributed

Everyone knows Git as a “Distributed Version Control System”, or DVCS. There it is: “distributed”, right in the description.

Except that Git is better described as “disconnected.”

The main reason is that true distributed computing systems feature coordinated communication between the distributed nodes. Git, although it can communicate over a WAN to other nodes, has no such coordination of pushes. Pushes are initiated manually or with ad-hoc scripts.

This lack of coordination between related Git repos means that Git is really a disconnected system.

The developer using Git on her laptop is well described as being disconnected. She has no idea what is going on in other repos, particularly the shared repo her teammates push to at the end of a completed task. Note that there’s nothing essentially wrong with the disconnected model here, it’s only the term “distributed” that is at issue.

Enterprise Git

Enterprise Git deployments face different scalability challenges than most other types of projects. The need to support large, geographically distributed development teams with scalability and performance, combined with business continuity through high availability and disaster recovery, raises a number of questions for anyone supporting large, global Git deployments. How do I provide fast clones at remote sites? How do I recover from hardware or connectivity failures? How do I avoid picking winners and losers in my development organization when I choose the master server location?


Sometimes master/slave replication is used to provide local read-only mirrors at sites worldwide as an attempt to answer some of these questions. We’ve seen the same pattern in Subversion deployments using tools like svnsync to support these mirrors. This is not a very satisfying solution in practice, as I wrote in “Why svnsync Is Not Good Enough for Enterprise Global Software Development”.

It’s the coordination, stupid

While this title riffs off a famous snowclone, Git without coordinated replication is similar to Subversion with svnsync in terms of being distributed. Git has the capability of svnsync built in: it can already reconcile repositories over a WAN. What it lacks, just as Subversion with svnsync does, is the coordination of the reconciliation and replication process. So Git is no more distributed than svnsync makes Subversion distributed.

Making Git Distributed

This is where WANdisco’s patented replication technology steps in to provide 100% data safe and optimally coordinated replication of shared Git and Subversion repositories. While the industry has enjoyed the benefits of SVN MultiSite for years, the recently announced Git MultiSite makes Git, at least between the shared enterprise repositories, finally worthy of truly being called “distributed.”

Forrester TEI Shows SVN MultiSite Delivers ROI of 357% and Payback Period of 2 Months

Analyst Study Confirms SVN MultiSite Boosts Productivity and Ensures Uptime, Learn More during Webinar

We are proud to announce the results of Forrester’s Total Economic Impact (TEI) Report for SVN MultiSite. The subject of the study, a WANdisco customer, was a Fortune 500 company with annual revenues of over $5 billion. Forrester concluded that SVN MultiSite generated a return on investment (ROI) of 357% with a payback period of less than 2 months.

Significant benefits and cost savings were found in a broad range of areas that Forrester attributed to SVN MultiSite’s ability to provide remote developers with local real-time access to Subversion repositories and the elimination of downtime.

Learn about the report during a webinar on Wednesday, July 24 at 10:00 AM Pacific / 1:00 PM Eastern, when guest speaker Jean-Pierre Garbani, Vice President and Principal Analyst, Infrastructure and Operations at Forrester Research, Inc., will present the findings.

Learn how WANdisco’s SVN MultiSite enables:

  • Reducing application development costs via faster development, build, and release cycles.
  • Eliminating downtime by implementing a failover strategy and removing any single point of failure.
  • Significantly reducing cost of ownership based on a proven open source strategy.

Register here.

Ready to try SVN MultiSite? Register for a free trial today!

SmartSVN 7.6 Release Candidate 1 Issued

Today we launched SmartSVN 7.6, release candidate 1. SmartSVN is the cross-platform graphical client for Apache Subversion.

SmartSVN 7.6 represents a major step forward from 7.5.5 in both features and performance.

New SmartSVN 7.6 features include:

– Auto-update – there is no need to install new versions manually

– Repository Browser – defined svn:externals are shown as their own entries

– Proxy auto-detection

– External tools menu

– OS X Retina support

GUI improvements include:

– file/directory input fields – support for ~ on unix-like operating systems

– natural sorting (“foo-9.txt” before “foo-10.txt”)

– more readable colors on Transactions and other panes
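For the curious, “natural sorting” orders embedded numbers numerically rather than character by character, which is why “foo-9.txt” comes before “foo-10.txt”. A minimal sketch of the idea in Python (illustrative only, not SmartSVN’s actual implementation):

```python
import re

def natural_key(name):
    # Split into runs of digits and non-digits; compare digit runs as numbers.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]

files = ["foo-10.txt", "foo-9.txt", "foo-1.txt"]
print(sorted(files, key=natural_key))
# → ['foo-1.txt', 'foo-9.txt', 'foo-10.txt']
```

A plain lexical sort would put "foo-10.txt" before "foo-9.txt" because "1" < "9" as characters.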

SmartSVN 7.6 fixes include:

– speed-search – possible internal error when typing Chinese characters

– Revision Graph – errors when deselecting all branches

– Tag Browser – possible internal error

– SVN operations – significant performance improvements

– Check Out – checking out to an already versioned directory appeared to work, then failed later

– Refresh – possible performance problems

For a full list of all improvements and bug fixes, view the changelog.

Have your feedback included in a future version of SmartSVN

Many issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue, or a request for a new feature, head over there and let us know.

You can download Release Candidate 1 for SmartSVN 7.6 from our early access page.

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.

Apache HTTP Server Project Releases Version 2.2.25

WANdisco Subversion Committer Ben Reser Recognized for Contributions

The Apache HTTP Server Project announced the release of version 2.2.25 of the Apache HTTP Server (“httpd”) on July 10th, 2013.

WANdisco Subversion committer Ben Reser was thanked in the official announcement for identifying a Denial of Service vulnerability.  The vulnerability known as CVE-2013-1896 may allow remote users with write access to crash the httpd server hosting Subversion repositories.  Subversion administrators are urged to upgrade their installation of httpd to 2.2.25 or the latest 2.4.x release.

WANdisco’s Subversion binary packages for Solaris and Windows operating systems have been updated to include the updated version of httpd. Download them from our website.

According to Apache, 2.2.25 “is principally a security and bugfix release.”  See the official 2.2.25 changelist for a complete list of improvements in this release.

Download Apache HTTP Server 2.2.25 or the latest 2.4.x release here.

Subversion 1.8 Backwards Compatibility

Upgrading to a new version of your SCM system is a big decision, often requiring careful planning by administrators to balance the benefits of new features and capabilities against any compatibility and upgrade concerns.

Fortunately for administrators, the Subversion project has always been very good about trying to maintain backwards compatibility and documenting which features require newer servers and clients. As you’ll notice, some of the most appealing parts of Subversion 1.8 only require a client upgrade.

In the interest of saving time, I’ve summarized the most important information here.

Upgrade highlights

  • Pre-1.8 servers and clients are compatible with 1.8 servers and clients, but not all the new features are available with older servers and clients.

  • You do not need to dump and reload repositories when upgrading the server. However, doing so may give you better performance and a smaller repository size.

  • You do need to upgrade your working copy with the svn upgrade command when you start to use a 1.8 client. If your working copy was created with a pre-1.6 client, start by upgrading to 1.6 or 1.7.

When do I need a 1.8 client?

When do I need a 1.8 server?

Other concerns

  • Upgrading the server is always a bigger decision than upgrading the client. Subversion 1.8 is a new release and you should confirm that it works well in your environment before upgrading the server.

  • Make sure you test compatibility with all of your third party software including continuous integration servers, IDE plugins, and GUIs before upgrading.

Ready to take the leap? You can find certified Subversion 1.8 binaries on the WANdisco web site.

Subversion is a registered trademark of the Apache Software Foundation.

Enterprise Git… The Way It Should Be

100% Uptime.  LAN-Speed.  Enterprise Ready.

Git has fast become a preferred SCM solution for open source projects and, increasingly, for the world’s largest enterprise software development teams. Developers enjoy Git’s speed, powerful local branching, and versatile toolkit.

Despite its benefits, enterprises have struggled to turn Git into a solution that works as well for the 1000th developer in London as it does for the first developer in California. Git MultiSite solves the problems caused by a single master repository by leveraging WANdisco’s patented replication technology to provide high availability, easy management, and superior performance to distributed teams using Git.

Develop at the speed of Git

Git MultiSite makes it easy to add new nodes to your Git installation both locally and for remote offices. These nodes can distribute the performance load of a large user base or automated build and test plans. Since every node is writable and all pushed commits are transparently replicated to all locations, they also provide LAN-speed performance to users connected over a WAN. Git MultiSite handles coordination of network activity and synchronization after a failure.

Zero down time

With Git MultiSite, each node serves as a Disaster Recovery node, eliminating the single point of failure inherent in most enterprise Git deployments. Downtime, data loss and slow performance become a thing of the past, and merge conflicts and other issues are identified and resolved as soon as they occur, instead of days later.

Global repository management

Git MultiSite makes it easy to deploy new repositories and servers and monitor them at a glance.  Administrators can select which repositories are shared between sites, and because Git MultiSite’s administrative tools are based on the proven architecture of WANdisco’s existing MultiSite products, there is no need to deploy and maintain ad-hoc Git mirrors.

Pure Git

Git MultiSite is built on two foundations: Git and WANdisco’s patented Distributed Coordination Engine (DConE) – no black boxes or closed tool stacks. Developers can continue to use all the monitoring and other tools they know and love.

Enterprise Support and Services

WANdisco is a leading supporter of open source SCM technology. All of our products are backed by our team of open source software experts providing secure, high-quality live assistance to customers anywhere in the world whenever they need it.

Git, the way it should be

Backed by Git MultiSite, your distributed development teams will enjoy fast and reliable Git repositories, your administrators will enjoy peace of mind knowing that your data is safe and easy to manage, and the whole company will benefit from the best enterprise Git solution available.

Learn more

Ready to find out more about Git MultiSite? Register for a free trial today!

Better Move Tracking in Subversion 1.8

Refactoring adherents in the Subversion community will be pleased with the better move tracking in Subversion 1.8. Refactoring is a regular housekeeping operation for many Agile developers, and one that often involves renaming files or moving them to new locations. If you refactor on a regular basis, you’ll appreciate this small improvement in local move tracking.

Prior to Subversion 1.8, a move was simply a copy followed by a delete – and that’s still how a move is permanently recorded in the repository. In Subversion 1.8, moves have special significance in your working copy.

svn status and svn info now show a move as a coherent operation

Let’s do a quick rename and see what svn status reports.

$ svn move BUILDING.txt build.txt
   A         build.txt
   D         BUILDING.txt
$ svn status
   D       BUILDING.txt
           > moved to build.txt
   A  +    build.txt
           > moved from BUILDING.txt

So far so good – a handy little notification in the status report.

A move must be committed atomically

You can’t commit just one part of the operation.

$ svn commit -m "just the new one" build.txt
   svn: E200009: Commit failed (details follow):
   svn: E200009: Cannot commit 'build.txt' because it was moved from 'BUILDING.txt' which is not part of the commit; both sides of the move must be committed together

Some tree conflicts can be resolved automatically

Now let’s assume that someone else has edited BUILDING.txt, the file under the original name, before I can commit the move. When I run the update, I’m notified and can choose to have the other edit applied to the renamed file in my working copy.

$ svn update
   Updating '.':
      C BUILDING.txt
   At revision 35.
   Tree conflict on 'BUILDING.txt'
      > local file moved away, incoming file edit upon update
   Select: (p) postpone, (mc) my side of conflict, (r) resolved,
           (q) quit resolution, (h) help: mc
   U    build.txt
   Updated to revision 35.
   Resolved conflicted state of 'BUILDING.txt'
   Summary of conflicts:
     Tree conflicts: 0 remaining (and 1 already resolved)

Using the mc conflict resolution choice, the edits submitted by the other user are merged into my copy under the new name.

Try it out!

Other notable changes in Subversion 1.8 can be found in the release notes, including the fact that svn move no longer operates on a mixed-revision working copy. Download fully tested, certified Subversion 1.8 and try it out today; you don’t need to be working against a 1.8 server to use this new capability.

Subversion is a registered trademark of the Apache Software Foundation.

Subversion Merge Improvements in SVN 1.8

Subversion 1.8 Solves Symmetric Merge

Subversion 1.8 made some big improvements to Subversion’s merge capabilities. A major enhancement users will notice is that there is no need to bother with the --reintegrate option when promoting changes from a development or task branch to trunk. Likewise, there’s no need to run any special steps if you want to keep using the development or task branch after the promotion.

I was most curious about a merge case that Subversion handled poorly up until now. I’ve illustrated the situation below:

  • Add file in trunk with four lines

  • Branch trunk to dev

  • Edit trunk and modify line one

  • Edit dev and modify line three

  • Merge dev to trunk, leaving trunk with the change from dev on line three

Subversion merge scenario

At the point of the last merge, the only unique change on trunk is the text on line one. This should be a trivial merge with no conflicts: the automatically merged result is simply the file with trunk’s edit on line one, dev’s edit on line three, and lines two and four untouched.
In Subversion 1.7 the final merge generated a merge conflict: it tried to reinsert the original change on line three, even though that was not a unique change from trunk.

Subversion 1.8 handles the merge cleanly with the correct anticipated output. That gives me a lot more confidence in Subversion merging.
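Why this merge ought to be trivial is easy to see with a toy line-based three-way merge, sketched here in Python. This is an illustration of the principle only, not Subversion’s actual merge algorithm, and the line contents are hypothetical:

```python
def trivial_three_way_merge(base, ours, theirs):
    """Toy line-based 3-way merge for equal-length files where the two
    sides edited different lines. Not Subversion's real algorithm."""
    merged = []
    for b, o, t in zip(base, ours, theirs):
        if o != b and t != b and o != t:
            raise ValueError("conflict: both sides changed the same line")
        # Take whichever side changed the line; keep the base line otherwise.
        merged.append(o if o != b else t)
    return merged

base  = ["line one", "line two", "line three", "line four"]
trunk = ["line one (edited on trunk)", "line two", "line three", "line four"]
dev   = ["line one", "line two", "line three (edited on dev)", "line four"]

print(trivial_three_way_merge(base, trunk, dev))
# → ['line one (edited on trunk)', 'line two', 'line three (edited on dev)', 'line four']
```

Since the two sides touched different lines, there is nothing to conflict over; the 1.7 behavior described above was a bookkeeping problem in merge tracking, not a genuine textual conflict.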

There’s a very comprehensive description of the symmetric merging improvements available on the Subversion wiki. The bottom line is that Subversion merging is more robust and reliable now.  Want to try it? Grab a certified, fully tested Subversion 1.8 release from our website.

Subversion is a registered trademark of the Apache Software Foundation.


WANdisco Announcements at Hadoop Summit 2013!

We’ve been busy at WANdisco and Hadoop Summit gave us the chance to share all of the product, partnership, and feature news we’ve been working on.

First and foremost, the release of Non-Stop NameNode WAN Edition marks the first and only Continuous Availability™ solution for Hadoop clusters deployed over a WAN. With it, servers actively support clients at globally distributed locations and continuously synchronize with data nodes. The result is LAN-speed performance and access to the same data at every location, with automatic failover and recovery both within and across data centers, enabling 100% uptime with unparalleled scalability and performance.

Non-Stop NameNode WAN Edition provides 100% Hadoop uptime over the WAN, automatic WAN backup and failover, complete WAN namenode sync, and more. For more information, visit the product page.

In addition to the launch of Non-Stop NameNode WAN Edition, WANdisco Distro (WDD) version 3.6 is now available, bringing a number of new features, fixes, and capabilities to our fully tested, certified and production-ready version of Apache Hadoop 2. WDD 3.6 includes S3-Enabled HDFS, which simplifies migration from Amazon’s public cloud without sacrificing support for third-party applications that use the S3 API.

Looking to the future, we’re proud to announce our collaboration with the AMPLab at UC Berkeley in providing a technology preview of Spark and Shark. Spark offers significant new computation model capabilities and performance enhancements, delivering speeds up to 100 times faster than MapReduce on any Hadoop-supported filesystem. Shark addresses limitations of Apache Hive while natively supporting SQL and HQL as well as the Hive metastore, serialization formats, and user-defined functions. In addition to improving upon Hive, Shark allows users to cache data in memory to vastly increase efficiency and provide maximum performance.

In partnership news, WANdisco announced another Non-Stop Alliance Partner, TCloud, to provide enterprises in Greater China with comprehensive enterprise Hadoop deployment solutions. TCloud, a subsidiary of Trend Micro, is headquartered in Beijing with additional satellite locations throughout Asia.

It is an exciting time for WANdisco and we are looking forward to bringing you more Big Data news.

WANdisco and Zaloni Announce Big Data Partnership

We’re pleased to announce we’ve partnered with Zaloni, a leading provider of agile Big Data and data management solutions, to enable enterprises to accelerate the adoption of Apache Hadoop and support Continuous Availability™ for their mission-critical applications.

Zaloni’s Bedrock Data Management Platform™ provides a unique foundation for Hadoop end-to-end design, build and deployment solutions. WANdisco provides Continuous Availability™ to data processed and stored in Hadoop. Together, we will provide enterprises with a full-lifecycle approach to Hadoop deployment, including use case discovery, development, integration, delivery and ongoing maintenance and support.

“We offer solutions that deliver 100% uptime for Hadoop, and our clients want to know that in addition to these products, they will be getting an effective implementation plan and support,” said David Richards, CEO of WANdisco. “The partnership with Zaloni will provide exactly that.”

WANdisco will be at Hadoop Summit June 26-27. Click here for 20% off registration.

Why Cassandra Lies to You

Apache Cassandra is an open source, replicated store of key-value pairs (also commonly known as a NoSQL database) modeled after Amazon’s Dynamo.  Like Dynamo, Cassandra’s replication can be described as active-active, fault-tolerant, highly available and peer-to-peer.  Unlike WANdisco’s DConE replication technology, however, Cassandra does not guarantee consistency. Instead, it guarantees what it calls “eventual consistency”.

“Eventual consistency” in this context means that Cassandra has to lie to you sometimes.

Let’s look at an example to illuminate this further.

Imagine an online merchant using Cassandra with a dozen worldwide replicas for inventory management. Our merchant’s supply of Widgets is down to the last one, and we start with all Cassandra replicas accurately reflecting this worldwide supply of one Widget.

Two customers, Michael and James, order a Widget at nearly the same time.  Both orders are fulfilled because the ordering software accesses two different replicas of Cassandra, each replica reporting one available Widget.  One replica has a record indicating that the last Widget was sold to Michael. The other has a conflicting record indicating that the last Widget was sold to James. Nobody is aware of this conflict.

Using the “gossip protocol”, the replicas now push records around until they are fully distributed.

Eventually, a replica sees both records, detects a conflict and moves to resolve it.  A simplistic algorithm may resolve the conflict in favor of Michael, perhaps using timestamps, and automatically generate an apology to James.  Over time, James’ order is rescinded in favor of Michael’s at all replicas, and apart from the effect on the relationship with a disappointed James, everything returns to normal.  Let’s just hope that in the midst of this conflict resolution and gossip, the shipping department did not consult a replica with the wrong information and ship the last Widget to James!
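The scenario above can be sketched as a toy model in Python. The Replica class here is hypothetical, standing in for Cassandra replicas that answer every query from local state alone:

```python
# Toy model of eventual consistency: each replica decides from its own
# local state, and conflicting writes only surface later, when records
# are exchanged (gossiped) between replicas.
class Replica:
    def __init__(self, stock):
        self.stock = stock
        self.orders = []

    def sell(self, customer):
        if self.stock > 0:          # decided from local state only
            self.stock -= 1
            self.orders.append(customer)
            return True
        return False

us, eu = Replica(stock=1), Replica(stock=1)

print(us.sell("Michael"))   # True – the US replica sees one Widget
print(eu.sell("James"))     # True – the EU replica also sees one Widget

# Later, gossip exchanges the order records and exposes the conflict:
all_orders = us.orders + eu.orders
print(all_orders)           # ['Michael', 'James'] – two sales of the last Widget
```

Both sales succeed locally; the system as a whole has sold one Widget twice, and must now resolve the conflict after the fact.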

Let’s now look at the same scenario using a true active-active technology with absolute consistency like WANdisco’s DConE replication.

As before, Michael and James try to buy a Widget.

Two replicas generate proposals.  One proposes that a Widget be sold to Michael. The other proposes that a Widget be sold to James. DConE delivers these proposals to the replicas, and they conclude Michael’s order was fulfilled and James’ was not.

Since all replicas have the same information, the replica accessed by Michael informs him that his order is fulfilled. The replica accessed by James informs him that Widgets are sold out. The replica accessed by Shipping indicates that the Widget should ship to Michael.
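By contrast, the coordinated scenario can be modeled as every replica applying the same totally ordered log of proposals. This is a toy sketch of the idea of total ordering, not DConE itself:

```python
# Toy model of coordinated active-active replication: the coordination
# layer agrees on one global order of proposals, and every replica
# applies that same order, so all replicas reach the same answer.
def apply_log(log, stock):
    fulfilled = []
    for customer in log:            # identical order at every replica
        if stock > 0:
            stock -= 1
            fulfilled.append(customer)
    return fulfilled

# The coordination layer delivers proposals in one agreed order:
agreed_log = ["Michael", "James"]

replica_a = apply_log(agreed_log, stock=1)
replica_b = apply_log(agreed_log, stock=1)

print(replica_a)                # ['Michael'] – only the first order succeeds
print(replica_a == replica_b)   # True – every replica agrees
```

Because the ordering is agreed before the state changes, there is never a window in which two replicas have both sold the last Widget.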

So while Cassandra may not want to lie to you, it can’t help it: that is the price of eventual consistency.  DConE never lies, so when your data is really important to you, DConE’s absolute consistency is the ultimate in data-safe replication technology.

WANdisco and Dataguise Form Strategic Alliance

We’re pleased to announce another Non-Stop Alliance Partner: WANdisco and Dataguise have partnered to deliver a certified solution to protect data privacy and deliver risk assessment intelligence for enterprises using WANdisco Distro (WDD). With this announcement, DG for Hadoop™ is certified for WDD customers.

DG for Hadoop is a comprehensive, flexible and modular solution that delivers enterprise-class protection to sensitive data aggregated in Hadoop installations. Together, WANdisco and Dataguise will meet the standards that customers demand for their mission critical Hadoop environments.

“Data security is a critical requirement in today’s enterprise environment. Big Data brings new challenges to security and Dataguise meets those requirements and accelerates the adoption of Hadoop in these environments,” said David Richards, CEO of WANdisco. “Our mutual customers can look forward to meeting data protection challenges as well as overcoming global availability challenges with this partnership.”

Both companies will be at Hadoop Summit in San Jose, California next week. Click here for 20% off registration.

WANdisco Announces Partnership with Data Tactics

We’re proud to announce we’ve partnered with Data Tactics to deliver planning, implementation, and support services for Continuous Availability™ of Big Data deployments.

Data Tactics is focused on solving big data problems and has been building petascale enterprise systems leveraging Hadoop since 2008. Together, we will provide a comprehensive approach to Hadoop deployment, including business use case investigation, development, integration, delivery and ongoing services and support.

“Our clients need expert and effective implementation planning and support for our 100% uptime Hadoop solutions,” said David Richards, CEO of WANdisco. “Data Tactics has the insight, experience, and knowledge to deliver that.”

WANdisco will be at Hadoop Summit June 26-27. Click here for 20% off registration.


TortoiseSVN 1.8 is now available!

We’re very pleased to announce that TortoiseSVN 1.8 is now available to download. TortoiseSVN 1.8 is fully compatible with the newly-released Subversion 1.8.

This release of the popular Windows client for Apache Subversion contains a whole raft of changes designed to make your day to day development easier.

The changes include:

  • Coloring for TortoiseBlame

  • The ability to commit only parts of a file

  • An improvement to the Repository Browser to enable you to see all repositories by pointing at the root

  • Improvements to custom properties and client hook scripts

You can see a complete list of the changes here.

TortoiseSVN users are recommended to upgrade to this release as soon as possible.


WANdisco Announces Availability of Apache Subversion 1.8 binaries

Today sees the long-awaited release of Subversion 1.8, featuring the significantly improved merge capabilities promised at SVN Live last year to address user needs.

With Subversion 1.8, users will benefit from merge functionality enhancements as well as major improvements in storage efficiency. The enhancements include:

  • Symmetric, or automatic, merge capability for simplifying the merge process and eliminating conflicts caused by users selecting the wrong type of merge.

  • Simultaneous change handling between branches rather than differentiating between sync and reintegration merge forms.

  • Rejection of merge attempts between unrelated branches to decrease the likelihood of user errors that often result in conflicts.

  • Decrease in server set-up costs.

  • Revision property packing and directory deltification to reduce backup and restore times as well as the number of files stored, with storage savings of up to 90%.

You can see a full list of the changes in the release notes here.

“While Subversion simplifies development and helps teams seamlessly collaborate, it has had limited merge capabilities, leading some developers to choose other SCM systems,” said David Richards, WANdisco CEO. “Subversion 1.8, under the development leadership of Julian Foad, has addressed those issues and users will greatly appreciate the powerful new features.”

To save you the hassle of compiling from source, you can download our fully tested, certified binaries free from our website.

WANdisco’s Subversion binaries provide a complete, fully tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Using TortoiseSVN?

Following last week’s acquisition news, we’re very pleased to announce that TortoiseSVN 1.8 is available for free download now, and is fully compatible with Subversion 1.8.

You can get your hands on it here.

TortoiseSVN is now part of WANdisco!

You may have seen the news release earlier today, advising that WANdisco has acquired the home of the world’s most popular Subversion client for Windows, TortoiseSVN.

We’re excited about this for a number of reasons:

– TortoiseSVN is an award-winning open source Subversion client software for Windows with millions of users.

– We (WANdisco) have been a major contributor to the TortoiseSVN project since 2010.

– We already support customers who are quite happy with TortoiseSVN, and want to continue using it.

– We’ve had a dedicated TortoiseSVN section on our popular Subversion forum for a while now so we’re familiar with a lot of the issues users face and the new features they’re looking for.

– This acquisition, along with our cross-platform Subversion client, SmartSVN, puts us in a great position to help drive development of features users will love, regardless of their environment.

As part of the acquisition Stefan Küng, the lead developer on the TortoiseSVN project since the start, has joined the ‘Disco to continue development on TortoiseSVN.

Welcome aboard Stefan, and the TortoiseSVN community!

Subversion Live 2013 Registration Open!

We’re back! WANdisco is proud to announce the return of our popular conference series, Subversion Live for 2013 along with a roster of new and returning community experts.

This year’s conference will start with a keynote speech from Apache Software Foundation Director Greg Stein and sessions will cover new merge features in Subversion 1.8, insight into future releases, and the future of the open source software in general. Expert-led presentations, live demos, and an in-depth committer roundtable will give attendees a unique opportunity to learn from and interact with Subversion core committers.

Sessions include:

  • What’s new in Subversion 1.8

  • The Flow of Change: How Software Evolves

  • Subversion: The Road Ahead

  • Practical TortoiseSVN

  • Move Tracking

  • Benchmarking Subversion

  • Apache Bloodhound

  • …and more!

Stefan Fuhrmann presenting at Subversion Live last year


Registration is open for Subversion Live Boston, San Francisco, and London, held October 3rd, 8th, and 16th respectively. Follow @WANdisco and @uberSVN for up-to-date news on Subversion Live 2013.


Subversion Live roundtable


Two New Apache Subversion Releases

The Apache Subversion team has announced two new releases: Subversion 1.7.10 and 1.6.23.

Subversion 1.7.10 includes a number of fixes, such as improving the error messages for fatal errors. Others include:

– a fix for the “no such table: revert_list” error in ‘svn revert’

– multiple fixes for ‘svn diff’ showing incorrect data

– a fix for repository corruption issues on disk failure in Windows

– svnserve exit and memory issues

More information on Apache Subversion 1.7.10 can be found in the Changes file.

Meanwhile, Subversion 1.6.23 includes a fix for svnserve exit issues and other minor bug fixes, all of which can be found in the Changes file.

Both versions can be downloaded free via the WANdisco website.


If you have any feedback on the WANdisco binaries, or on Subversion itself, head on over to our SVNForum.

SmartSVN 7.5.5 Released

SmartSVN, the cross-platform graphical client for Apache Subversion, has been updated to version 7.5.5. SmartSVN 7.5.5 focuses on enhancements and fixes requested by you, the users, in preparation for introducing new features in the next major release.

SmartSVN 7.5.5 includes fixes for:

– copying multiple files

– changes being incorrectly reported in first line of some file types

– automation of ignore patterns on Project Settings

– refresh issues on Windows

– incorrect version reported in SpiceWorks

– problems with EOLs in UTF-16 files


Many of the issues resolved in this release were raised via our dedicated SmartSVN forum, so if you’ve got an issue, or a request for a new feature, head over there and let us know. More information on what’s new and noteworthy in this release is available at the Changelog.


If you’re already using SmartSVN, you can get the latest version within the client by checking for updates (Help > Check for new version).

Haven’t yet started with SmartSVN? Claim your free trial of SmartSVN Professional here.


Why WANdisco?

The seventh most watched video on, as of mid-2013, is a talk by Simon Sinek called “How Great Leaders Inspire Action.” In the video he describes the concept of The Golden Circle, though I remember it better as the “Why, How, What” method for creating enduring innovation and competitive advantage.  If you aren’t familiar with this, I’ll wait here while you watch this must-see video.

Done? And suitably inspired?  I hope so, because it’s a powerful concept.

It also raises the question: why does WANdisco exist? This was the subject of discussion at a recent offsite, and the answer emerged clear and bright.

Why WANdisco? Because your data is important to you.

If you recall in a previous post, Why DConE is Ideal, I related how DConE, our true active-active replication engine, “is ideal in the sense that we are always safe, and live to the extent allowed by physics.”  In this context, “safe” is a property of distributed computing that means “will never do anything wrong.”

Think about that. That’s a remarkable statement. How many systems do you know of that will never do anything wrong?  How many have no single point of failure? How many globally distributed systems have no rare-but-catastrophic edge conditions based on hardware or software failures?

In a world increasingly dependent on computers, downtime and data corruption are less and less acceptable. If Netflix can’t stream videos for a few hours, or GitHub suffers one of its periodic outages, it’s currently considered an unavoidable, non-critical business cost. But what if the computer runs the safety system on a nuclear reactor? Failures in some systems are costly at best and catastrophic at worst.

When your data is mission critical and high availability is a requirement, WANdisco Non-Stop technology creates a safety net for your data in the high failure environment of Wide Area Network distributed computing. That’s our mission. That’s why WANdisco.

Why does the company you work for exist?

Webinar Roundup: Introduction to Git

On May 30th, we hosted our first webinar on the Git version control system, “Introduction to Git,” led by 33-year industry veteran and WANdisco Director of Training, Michael Lester. Over 600 people attended the webinar, which covered key Git functionality ranging from initializing repositories to merging branches and resolving conflicts.


Topics included using Git via command line and GUI; repository initialization, testing, and population; checkouts, working folders, and commits; staging and un-staging; branching and merging; and more.


The presentation was followed by a short Q&A that focused on ignoring files to prevent accidental addition, how to keep binaries out of your Git repository, how to configure repository policies, and why rebasing local repositories is useful for simplifying the remote log.


Be sure to check our webinar replays for the “Introduction to Git” VOD coming soon.


Interested in our Webinars? Registration is currently open for “Selecting the Right SCM Tool for Global Software Development” on June 12th at 9:00 am Pacific / 12:00 pm Eastern as well as our upcoming Subversion Hook Scripts and Advanced Hook Scripts webinars here.

On Achieving Consensus


“Our life is what our thoughts make it.” M. Aurelius, 121 AD – 180 AD

Since WANdisco is a distributed computing company, you might be thinking this will be another article on the subtleties of obtaining consensus between multiple processors using the Paxos algorithm.

Almost, except without the computers.

Recently we had a kickoff summit for a new product. The core team met here in San Ramon, gathered from Belfast, Sheffield and Massachusetts.  We all have significantly different backgrounds, live in four different places, and came together for a few days to try to sort out a beefy list of major questions about integration points, requirements and architecture for the new product.

The technical chops of the group were without question, the motivation high, and the time short. It seemed clear that this was a challenge primarily of consensus.  We had to resolve our list of open questions efficiently and decisively.

And despite coming into the meeting with many questions still unresolved after long email exchanges, we did! The process was the quickest and most effective I’ve experienced in 25 years building enterprise software, and I started to wonder if WANdisco’s core technology of achieving consensus between computers was bleeding over and helping humans reach consensus as well.

It occurred to me that maybe it has something to do with the language that I overhear from the development teams. They speak constantly of “proposals”, “agreements”, “consensus”.  In my programming days, I recall using words like “cast”, “derive”, “protected”. Could the simple use of highly cooperative language have a beneficial side effect on group decision making?

Absolutely, and it’s variously described as Linguistic Relativity, the Sapir–Whorf hypothesis, or Whorfianism. In short, it’s the theory that the language we employ affects our thoughts and subsequent actions.

Looks like we caught an example of this in action. How might the language you use in your software development process affect your ability to reach consensus on big decisions?






WANdisco CEO, David Richards, presents at Tech London Advocates Launch as Founding Member

Last week saw the launch of Tech London Advocates, a new advocacy group founded by angel and venture investor Russ Shaw to support London’s technology start-ups into high growth.


With a founding membership of 150 comprising international CEOs, CTOs, fund managers and private investors, Tech London Advocates launched with an event featuring presentations by high profile executives, including our own founder and CEO, David Richards.


High profile supporters of Tech London Advocates include Saul Klein, partner at Index Ventures; David Richards, founder and CEO of WANdisco; Julie Meyer, founder of Ariadne Capital; Sherry Coutu, co-chair of Silicon Valley Comes to the UK; Simon Devonshire, director at Wayra Europe; Dan Crow, CTO of Songkick; and Rajeeb Dey, CEO and founder of Enternships.


Tech London Advocates will work in partnership with existing groups and initiatives to support ongoing efforts to establish London as a world-class hub for digital and technology businesses. WANdisco is honored to be part of the advocacy group.


Ignoring Files with SmartSVN

It’s common for Apache Subversion projects to contain files you don’t wish to place under version control; for example, your own notes or a list of tasks you need to complete.

Users of SmartSVN, the popular cross-platform SVN client from WANdisco, will be reminded of these unversioned files whenever they perform an ‘svn commit.’ In most instances you’ll want to add these files to SmartSVN’s ignore list to prevent them from cluttering up your commit dialog, and to safeguard against accidentally committing them to the repository.

To add a file to SmartSVN’s ignore list:

1) Select the unversioned file you wish to ignore.

2) Open the ‘Modify’ menu and click ‘Ignore…’ If the ‘Ignore’ option is greyed out, double-check that the file in question hasn’t already been committed!

3) Choose either ‘Ignore Explicitly,’ which adds the selected file/directory to the ignore list, or ‘Ignore As Pattern.’

If ‘Ignore As Pattern’ is selected, SmartSVN ignores all files with the specified naming convention. Enter the names of the files you wish to ignore, or use the * wildcard to ignore all files that:

  • End with the specified file extension (*.png, *.txt, *.class)
  • Contain certain keywords (test_*, draft*)

The above two options are useful if you wish to ignore a group of related files, for example all image files. You can also opt to ignore all files, by entering the * wildcard and no other information.

4) Select ‘OK’ to add the file(s) to SmartSVN’s ignore list.
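Under the hood, an ignore list like this maps to Subversion’s standard svn:ignore directory property, so the same result can be reached from the command line. A minimal sketch, assuming the svn command-line client is installed (all paths and filenames are illustrative):

```shell
set -e
# Throwaway repository and working copy, purely for demonstration
rm -rf /tmp/ig-repo /tmp/ig-wc
svnadmin create /tmp/ig-repo
svn checkout -q file:///tmp/ig-repo /tmp/ig-wc

# Ignore a specific file plus every .class file in this directory
svn propset -q svn:ignore 'notes.txt
*.class' /tmp/ig-wc

# Matching unversioned files now appear with an 'I' (ignored) marker
touch /tmp/ig-wc/notes.txt /tmp/ig-wc/Demo.class
svn status --no-ignore /tmp/ig-wc
```

Note that svn:ignore only affects unversioned files; anything already committed must first be removed from version control, which mirrors SmartSVN greying out the ‘Ignore’ option for committed files.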

Ignore Patterns Property

You may also wish to apply the ‘Ignore Patterns’ property to your project. This has the same effect as selecting ‘Ignore Patterns’ in SmartSVN’s ignore list (described above) but it doesn’t require you to select a file first. This means you can configure SmartSVN to ignore groups of files before you even add them to your project.

To apply the ‘Ignore Patterns’ property:

1) Open the ‘Properties’ menu and select ‘Ignore Patterns…’

edit ignore patterns

2) Enter the names of the files you wish to ignore. Again, you can use the * wildcard where necessary.

Visit to try SmartSVN Professional free before you buy.

Understanding SmartSVN’s Revision Graph

SmartSVN, the popular cross-platform client for Apache Subversion, provides all the tools you need to manage your SVN projects out of the box, including a comprehensive Revision Graph.

SmartSVN’s Revision Graph offers an insight into the hierarchical history of your files and directories, by displaying information on:

  • Merged revisions

  • Revisions yet to be merged

  • Whether a merge occurred in a specific revision

  • Which changes happened in which branch

  • When a file was moved, renamed or copied, along with its history

The Revision Graph is useful in several tasks, including identifying changes made in each revision before rolling back to a previous revision, and gathering more information on the state of a project before a merge.

Accessing the Revision Graph

To access the Revision Graph, open the ‘Query’ menu and select ‘Revision Graph.’

revision graph

Understanding the Revision Graph

In the Revision Graph, projects are mainly represented by:

  • Nodes – represent a specific entry (file/directory) at a specific revision.

  • Branches – a collection of linked nodes at the same URL.
The main section of the Revision Graph is the ‘Revisions’ pane, which displays the parent-child relationships between revisions. Revisions are arranged by date, with the newest at the top. In addition to the main ‘Revisions’ pane, the SmartSVN Revision Graph includes several additional views:

  • Revision Info – displays information on the selected revision (such as revision number, date, author who created the revision etc.)

revision info

  • Directories and files – displays modified files in the selected revision. This is useful for pinpointing the revision at which a particular file changed or disappeared from the project.

From this screen, you can access several additional options:

  • Export – export the Revision Graph as an HTML file by selecting ‘Export as HTML…’ from the ‘Graph’ menu. This file can then be easily shared with other team members.

  • Merge Arrows – select the ‘Show Merge Arrows’ option from the ‘Query’ menu to view the merge arrows. These point from the merge source to the merge target revisions. If the merge source is a range of revisions, the corresponding revisions will be surrounded by a bracket. This allows you to get an overview of merges that have occurred within your project, at a glance.

  • Merge Sources – select the ‘Show Merge Sources’ option from the ‘Query’ menu to see which revisions have been merged into the currently selected target revision.

  • Merge Targets – select ‘Show Merge Targets’ from the ‘Query’ menu to see the revisions where the currently selected target revisions have been merged.

  • Search – if you’re looking for a particular revision, you can save time by using ‘Edit’ and ‘Search.’ Enter the ‘Search For’ term and specify a ‘Search In’ location.

  • Branch Filter – clicking the ‘Branch Filter’ option in the ‘View’ menu allows you to filter the display for certain branches. This is particularly useful if you’re examining a large project consisting of many different branches.

WANdisco Announces SVN MultiSite Plus

We are proud to announce SVN MultiSite Plus, the newest product in our enterprise Subversion product line. WANdisco completely re-architected SVN MultiSite and the result is SVN MultiSite Plus, a replication software solution delivering dramatically improved performance, flexibility and scalability for large, global organizations.

SVN MultiSite Plus enables non-stop performance, scalability and backup, alongside 24/7 availability for globally distributed Apache Subversion deployments. This new product takes full advantage of recent enhancements to our patented active-active replication technology to improve flexibility, scalability, performance and ultimately developer and administrator productivity.

“SVN MultiSite has been improving performance and productivity for global enterprises since 2006 and SVN MultiSite Plus builds on those features for even greater benefits,” said David Richards, WANdisco CEO. “We’re committed to providing organizations with the most robust and flexible solutions possible and we’re confident SVN MultiSite Plus will meet and exceed the requirements of the largest globally distributed software development organizations.”

To find out more, visit our SVN MultiSite Plus product page, download the datasheet, or see how it compares to SVN MultiSite. You can try SVN MultiSite Plus firsthand by signing up for a free trial, or attend the free, online SVN MultiSite Plus demo we’ll be holding on May 1st. This webinar will demonstrate how SVN MultiSite Plus:

  • Eliminates up to 90% of communication overhead at each location

  • Eliminates downtime completely by providing administrators with the ability to add/remove servers on-the-fly

  • Delivers additional savings over SVN MultiSite through tools consolidation and greater deployment flexibility

  • Provides increased efficiency and flexibility with selective repository replication

  • And more.

This webinar is free but register now to secure a spot.

Subversion Tip of the Week

An Apache Subversion working copy can be created quite simply by running the ‘svn checkout’ command. However, sometimes you’ll want to have more control over the contents of your working copy; for example, when you’re working on a large project and only need to checkout a single directory.

In this post, we share two ways to get greater control over your checkout commands.

1. Checkout a particular revision

By default, Subversion performs a checkout of the HEAD revision, but in some instances you may wish to checkout a previous revision, for example when you’re recovering a file or directory that has been deleted in the HEAD revision.

To specify a revision other than HEAD, add the -r switch when performing your checkout:

svn checkout (URL) -r(revision number) (Location)

In this example, we are performing a checkout of the project as it existed at revision 10.

customizing working copy
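On the command line, the same checkout looks like the sketch below. It builds a throwaway repository with two revisions purely so the example is self-contained; all paths are illustrative:

```shell
set -e
rm -rf /tmp/rv-repo /tmp/rv-wc /tmp/rv-old
svnadmin create /tmp/rv-repo
svn checkout -q file:///tmp/rv-repo /tmp/rv-wc

echo 'first' > /tmp/rv-wc/readme.txt
svn add -q /tmp/rv-wc/readme.txt
svn commit -q -m 'r1: add readme' /tmp/rv-wc
echo 'second' >> /tmp/rv-wc/readme.txt
svn commit -q -m 'r2: update readme' /tmp/rv-wc

# Check out the project as it existed at revision 1
svn checkout -q -r 1 file:///tmp/rv-repo /tmp/rv-old
cat /tmp/rv-old/readme.txt   # contains only the r1 content
```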

2. Performing Shallow Checkouts

A standard Subversion checkout copies the entire directory, including every folder and file. This can be too time-consuming if you’re working on a large project, or too complicated if your project contains many different branches, tags and directories. If you don’t require a copy of your entire project, a ‘shallow checkout’ restricts the depth of the checkout by preventing Subversion from descending recursively through the repository.

To perform a shallow checkout, run the ‘svn checkout’ command with one of the following switches:

  • --depth immediates: checkout the target and its immediate file and directory children (the child directories themselves are checked out empty). This is useful if you don’t require any of the children’s contents.

  • --depth files: checkout the target and any of its immediate file children.

  • --depth empty: checkout the target only, without any of its files or children. This is useful when you’re working with a large project, but only require the contents of a single directory.

In this example we are performing a shallow checkout of a bug fix branch located within the branches folder, specifying that only the immediate file children should be included (--depth files):

customizing working copy 2
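As a self-contained sketch (throwaway repository, illustrative paths), checking out only the immediate file children of a branch looks like this:

```shell
set -e
rm -rf /tmp/dp-repo /tmp/dp-wc /tmp/dp-shallow
svnadmin create /tmp/dp-repo
svn checkout -q file:///tmp/dp-repo /tmp/dp-wc

# A branch containing one file and one subdirectory
mkdir -p /tmp/dp-wc/branches/bugfix/docs
echo 'fix' > /tmp/dp-wc/branches/bugfix/fix.txt
echo 'doc' > /tmp/dp-wc/branches/bugfix/docs/readme.txt
svn add -q /tmp/dp-wc/branches
svn commit -q -m 'add bugfix branch' /tmp/dp-wc

# Shallow checkout: immediate file children only, no subdirectories
svn checkout -q --depth files file:///tmp/dp-repo/branches/bugfix /tmp/dp-shallow
ls /tmp/dp-shallow   # fix.txt is present, docs/ is not
```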

Looking for a cross-platform Subversion client? Get a free trial of SmartSVN Professional at

WANdisco Releases New Version of Hadoop Distro

We’re proud to announce the release of WANdisco Distro (WDD) version 3.1.1.

WDD is a fully tested, production-ready version of Apache Hadoop 2 that’s free to download. WDD version 3.1.1 includes an enhanced, more intuitive user interface that simplifies Hadoop cluster deployment. WDD 3.1.1 supports SUSE Linux Enterprise Server 11 (Service Pack 2), in addition to Red Hat and CentOS.

“The number of Hadoop deployments is growing quickly and the Big Data market is moving fast,” said Naji Almahmoud, senior director of global business development, SUSE, a WANdisco Non-Stop Alliance partner. “For decades, SUSE has delivered reliable Linux solutions that have been helping global organizations meet performance and scalability requirements. We’re pleased to work closely with WANdisco to support our mutual customers and bring Hadoop to the enterprise.”

All WDD components are tested and certified using the Apache BigTop framework, and we’ve worked closely with both the open source community and leading big data vendors to ensure seamless interoperability across the Hadoop ecosystem.

“The integration of Hadoop into the mainstream enterprise environment is increasing, and continual communication with our customers confirms their requirements – ease of deployment and management as well as support for market leading operating systems,” said David Richards, CEO of WANdisco. “With this release, we’re delivering on those requirements with a thoroughly tested and certified release of WDD.”

WDD 3.1.1 can be downloaded for free now. WANdisco also offers Professional Support for Apache Hadoop.

Apache Subversion Team Releases 1.7.9 and 1.6.21

The Apache Subversion team has announced two new releases: Subversion 1.7.9 and 1.6.21.

Subversion 1.7.9 improves the error messages for svn:date and svn:author props, and it improves the logic in mod_dav_svn’s implementation of lock, as well as a list of other features and fixes:

  • Doxygen docs now ignore prefixes when producing the index

  • Javahl status api now respects the ignoreExternals boolean

  • Executing unnecessary code in log with limit is avoided

  • A fix for a memory leak in `svn log` over svn://

  • An incorrect authz failure when using the neon HTTP library has been fixed

  • A fix for an assertion when rep-cache is inaccessible

More information on Apache Subversion 1.7.9 can be found in the Changes file.

Meanwhile, Subversion 1.6.21 improves memory usage when committing properties in mod_dav_svn, and also improves logic in mod_dav_svn’s implementation of lock, alongside bug fixes including:

  • A fix for a post-revprop-change error that could cancel commits

  • A fix for a compatibility issue with g++ 4.7

More information on Apache Subversion 1.6.21 can be found in the Changes file.

Both versions can be downloaded free via the WANdisco website.

Free Webinar: Enterprise-Enabling Hadoop for the Data Center

We’re pleased to announce that WANdisco will be co-hosting a free Apache Hadoop webinar with Tony Baer, Ovum’s lead Big Data analyst. Ovum is an independent analyst and consultancy firm specializing in the IT and telecommunications industries.

This webinar, ‘Big Data – Enterprise-Enabling Hadoop for the Data Center’, will cover the key issues of availability, performance and scalability and how Apache Hadoop is evolving to meet these requirements.

“This webinar will discuss the importance of availability, performance and scalability,” said Ovum’s Tony Baer. “Ovum believes that for Hadoop to become successfully adopted in the enterprise, it must become a first-class citizen with IT and the data center. Availability, performance and scalability are key issues, and also where significant innovation is occurring. We’ll discuss how the Hadoop platform is evolving to meet these requirements.”

Topics include:

  • How Hadoop is becoming a first class, enterprise-hardened technology for the data center
  • Hadoop components and the role of reliability and performance in those components

  • Disaster recovery challenges faced by globally distributed organizations and how replication technology is crucial to business continuity

  • The importance of seamless Hadoop migration from the public cloud to private clouds, especially for organizations that require secure 24/7 access with real-time performance

‘Big Data – Enterprise-Enabling Hadoop for the Data Center’ will be held on Tuesday, April 30th at 10:00 am Pacific / 1:00 pm Eastern. Register for this free webinar here.

Introduction to SmartSVN

SmartSVN is a powerful and easy-to-use graphical client for Apache Subversion. There are several clients for Subversion, but here are just a few reasons you should try SmartSVN:

  • It’s cross-platform – SmartSVN runs on Windows, Linux and Mac OS X, so you can continue using the operating system (OS) that works the best for you. It can also be integrated into your OS, via Mac’s Finder Integration or Windows Shell.

  • Everything you need, out of the box – SmartSVN comes complete with all the tools you need to manage your Subversion projects:

  1. Conflict solver – this feature combines the freedom of a general three-way merge with the ability to detect and resolve any conflicts that occur during the development lifecycle.

  2. File compare – this allows you to make inner-line comparisons and directly edit the compared files.

  3. Built-in SSH client – allows users to access servers using the SSH protocol. This security-conscious protocol encrypts every piece of communication between the client and the server, for additional protection.

  • A complete view of your project at a glance – the most important files (such as conflicted, modified or missing files) are placed at the top of the file list. SmartSVN also highlights which directories contain local modifications, which directories have been changed in the repository, and whether individual files have been modified locally or in the central repo. This makes it easy to get a quick overview of the state of your project.

  • Fully customizable – maximize productivity by fine-tuning your SmartSVN installation to suit your particular needs: Change keyboard shortcuts, write your own plugin with the SmartSVN API, group revisions to personalize your display, create Change Sets, and alter the context menus and toolbars to suit you. You can learn more about customizing SmartSVN at our ‘5 Ways to Customize SmartSVN’ blog post.

  • Comprehensive bug tracker support – Trac and JIRA are both fully supported.

  • Multitude of support options – SmartSVN users have access to a range of free support, from refcards to blogs and documentation, the SmartSVN forum and a Twitter account maintained by our open source experts. If you need extra support with your SmartSVN installation, expert email support is included with SmartSVN Professional licenses.

Want to learn more about SmartSVN? On April 18th, WANdisco will be holding a free ‘Introduction to SmartSVN’ webinar covering everything you need to get off to a great start with this popular client:

  • Repository basics

  • Checkouts, working folders, editing files and commits

  • Reporting on changes

  • Simple branching

  • Simple merging

This webinar is free so register now.

Subversion Tip of the Week

Tagging and Branching with SmartSVN’s ‘Copy Within Repository’

SmartSVN’s ‘Copy Within Repository’ command allows users to perform pure repository copies, which is particularly useful for quickly creating tags and branches.

To create a repository copy within SmartSVN:

1) Open the ‘Modify’ menu and select ‘Copy within Repository’.

2) From the ‘Copy From’ dropdown menu, select the repository where the source resides.

3) In the ‘Copy From’ textbox, specify the directory being copied. In ‘Source Revision,’ tell SmartSVN whether it should copy the HEAD revision (this is selected by default) or a different revision. Use the ‘Browse’ button if you need more information about the contents of the different directories and/or revisions that make up your project.

copy within repo

4) Select either:

  • Copy To – source is copied into the ‘Directory’ under the filename specified by ‘With Name’

  • Copy Contents Into – the contents of the source are copied directly into the ‘Directory’ under ‘With Name.’

5) Enter the copy’s destination in the ‘Directory’ textbox. You can view the available options by clicking the ‘Browse’ button.

6) Give your copy a name in the ‘With Name’ textbox.

7) The copy is performed directly in the repository, so you’ll need to enter an appropriate commit message.

8) Once you’re happy with the information you’ve entered, hit ‘Copy’ to create your new branch/tag.
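The command-line equivalent of ‘Copy Within Repository’ is a URL-to-URL svn copy, which likewise happens entirely server-side in a single commit. A sketch against a throwaway repository (paths are illustrative):

```shell
set -e
rm -rf /tmp/cp-repo
svnadmin create /tmp/cp-repo

# Create a minimal trunk/tags layout directly in the repository
svn mkdir -q -m 'layout' file:///tmp/cp-repo/trunk file:///tmp/cp-repo/tags

# Pure repository copy: tag trunk as 1.0 without touching a working copy
svn copy -q -m 'tag release 1.0' file:///tmp/cp-repo/trunk file:///tmp/cp-repo/tags/1.0
svn ls file:///tmp/cp-repo/tags   # lists 1.0/
```

Because the copy is cheap (Subversion records it as a reference, not a physical duplicate), this is the standard way to create tags and branches of any size.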

Try SmartSVN Professional free today! Get a free trial at

SmartSVN’s Project Settings: Properties

You can easily change how SmartSVN handles all your Apache Subversion projects using the popular, cross-platform client’s ‘global preferences’ settings. However, sometimes you’ll want to be more flexible and change SmartSVN’s settings on a per-project basis.

In this post, we take a closer look at the changes you can make to Subversion’s properties on a project-by-project basis using SmartSVN’s ‘Project Settings’ menu.

Accessing Project Settings

To access SmartSVN’s Project Settings, open the ‘Project’ menu and select ‘Settings.’ The different options are listed on the dialog box’s left-hand side.

project settings

EOL Style

Subversion doesn’t pay attention to a file’s end-of-line (EOL) markers by default, which can be a problem for teams who are collaborating on a document across different operating systems. Different operating systems use different characters to represent EOL in a text file, and some operating systems struggle when they encounter unexpected EOL markers.

The ‘EOL Style’ option specifies the end-of-line style default for your current project. You can choose from:

  • Platform-Dependent/Native – files contain EOL markers native to your operating system.

  • LF (Line Feed) – files contain LF characters, regardless of the operating system.

  • CR+LF (Carriage Return & Line Feed) – files contain CRLF sequences, regardless of the operating system.

  • CR (Carriage Return) – files contain CR characters, regardless of the operating system.

  • As is (no convention) – this is typically the default value of EOL-style.

The ‘In case of inconsistent EOLs’ option allows you to define how SmartSVN should handle files with inconsistent EOLs.
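SmartSVN’s per-project EOL setting corresponds to Subversion’s standard svn:eol-style property. A minimal command-line sketch, assuming the svn client is installed (paths and filenames are illustrative):

```shell
set -e
rm -rf /tmp/eol-repo /tmp/eol-wc
svnadmin create /tmp/eol-repo
svn checkout -q file:///tmp/eol-repo /tmp/eol-wc

printf 'line one\nline two\n' > /tmp/eol-wc/report.txt
svn add -q /tmp/eol-wc/report.txt

# Normalize line endings to each client's native style on checkout/update
svn propset -q svn:eol-style native /tmp/eol-wc/report.txt
svn propget svn:eol-style /tmp/eol-wc/report.txt   # prints: native
```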

You can read more about EOL Style in the ‘Subversion Properties: EOL-Style’ blog post.

EOL Style — Native

Usually, text files are stored with their ‘native’ EOL Style in the Subversion repository. However, under certain circumstances, it might be convenient to redefine what ‘native’ means, for example, when you’re working on a project on Windows but frequently uploading it to a Unix server. Open this dialog and choose from Linux/Unix, Mac or Windows.

Keyword Substitution

Allows you to automatically add ‘keywords’ into the contents of a file itself. These keywords are useful for automatically maintaining information that would be too time-consuming to keep updating manually.

You can choose from:

  • Author – the username of the person who created the revision.
  • Date – the UTC date and time the revision was created (note: this is based on the server’s clock, not the client’s).

  • ID – a compressed combination of the keywords ‘Author,’ ‘Date’ and ‘Revision.’

  • Revision – describes the last revision in which the selected file was changed in the repository.

  • URL – a link to the latest version of the file in the repository.

  • Header – similar to ‘ID,’ this is a compressed combination of the other keywords, plus the URL information.
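These options map to Subversion’s standard svn:keywords property, and the keyword anchors (e.g. $Revision$) live in the file text itself. A self-contained sketch (throwaway repository, illustrative paths):

```shell
set -e
rm -rf /tmp/kw-repo /tmp/kw-wc
svnadmin create /tmp/kw-repo
svn checkout -q file:///tmp/kw-repo /tmp/kw-wc

# The file carries the unexpanded keyword anchor
printf 'Last changed in: $Revision$\n' > /tmp/kw-wc/version.txt
svn add -q /tmp/kw-wc/version.txt

# Ask Subversion to expand the Revision keyword in this file
svn propset -q svn:keywords 'Revision' /tmp/kw-wc/version.txt
svn commit -q -m 'add version file' /tmp/kw-wc

cat /tmp/kw-wc/version.txt   # anchor now reads e.g. $Revision: 1 $
```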

You can find out more about Keyword Substitution at our ‘Exploring SVN Properties’ post.

Learn more about the other options available in SmartSVN’s ‘Project Settings’ dialog by reading our Subversion Tip of the Week post.

Subversion Tip of the Week

SmartSVN’s Project Settings Menu 

SmartSVN’s ‘global preferences’ is a method of specifying settings across all your SmartSVN projects for efficiency and simplicity. However, sometimes you need to change settings for a single project, which is where the ‘Project Settings’ menu comes in handy.

In this week’s tip, we’ll look at some of the SmartSVN settings you can apply using this menu.

Accessing Project Settings

To access SmartSVN’s Project Settings, open the ‘Project’ menu and select ‘Settings.’ The different options are listed on the dialog box’s left-hand side.

project settings

1) Text File Encoding

This affects how file contents are presented. Choose from:

  • Use system’s default encoding – SmartSVN uses the system’s encoding when displaying files. This is the default setting for SmartSVN.

  • Use the following encoding – Select your own encoding from the dropdown menu. This is useful if you’re dealing with international characters, which may otherwise be encoded incorrectly.

Note, if you’ve specified a file type using the MIME-Type property, SmartSVN will choose this over the text file encoding settings.

2) Refresh/Scan

SmartSVN can either scan the ‘whole project’ or the ‘root directory only’ when you open a project. In most instances, you’ll want to scan the entire project, but if you’re working with particularly large repositories, the ‘root directory only’ option can speed up this initial scan and avoid high memory consumption.

3) Working Copy

Clicking on ‘Working Copy’ presents you with several checkboxes:

working copy

  • (Re)set to Commit-Times after manipulating local files – tells SmartSVN to always use a local file’s internal Apache Subversion property commit-time. This is useful for ensuring consistency across timezones, and between clients and the Subversion repository.

  • Apply auto-props from SVN ‘config’ file to added files – tells SmartSVN to use the auto-props from the SVN ‘config’ file. With auto-props enabled, you can perform tasks such as automatically inserting keywords into text files and ensuring every file has EOLs that are consistent with the operating system. Not only are auto-props a time-saving feature, but they can help you avoid human error within your project.

  • Keep input files after merging (monitored merge) – tells SmartSVN to always keep the .aux files following a merge, even for non-conflicting files. These files are stored in the ‘merged’ state and can be used to gain a deeper insight into what has changed during the merge.

4) Locks

Apache Subversion is built around a ‘copy-modify-merge’ model, but there are times when a ‘lock-modify-unlock’ model may be appropriate, for example when you’re working on image files, which cannot easily be merged. SmartSVN has full support for locking and unlocking files, but if you’re going to make heavy use of locks, you can configure SmartSVN to automatically flag certain files as requiring locking before anyone begins working on them. This is a useful reminder, especially if your project contains multiple non-mergeable files. Open the ‘Lock’ section of the Project Settings dialog and select either ‘all binary files’ or ‘every file,’ if required. The default is ‘no file.’

You can also choose whether SmartSVN should suggest releasing or keeping locks whenever you perform a commit, which is a helpful reminder if your team are working with multiple locks. Finally, the ‘Automatically scan for locks’ option tells SmartSVN to scan for locked files at specified intervals.
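SmartSVN’s lock-required flag corresponds to Subversion’s standard svn:needs-lock property, which makes working-copy files read-only until a lock is taken. A minimal sketch (throwaway repository, illustrative paths):

```shell
set -e
rm -rf /tmp/lk-repo /tmp/lk-wc
svnadmin create /tmp/lk-repo
svn checkout -q file:///tmp/lk-repo /tmp/lk-wc

echo 'binary placeholder' > /tmp/lk-wc/design.png
svn add -q /tmp/lk-wc/design.png

# Mark the file as requiring a lock before editing
svn propset -q svn:needs-lock yes /tmp/lk-wc/design.png
svn commit -q -m 'add design file with needs-lock' /tmp/lk-wc

ls -l /tmp/lk-wc/design.png   # read-only until 'svn lock' is run
```

Running svn lock on the file would then make it writable again for the lock holder, enforcing the ‘lock-modify-unlock’ model for non-mergeable files.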

Find out more about locks by reading our ‘Locking and Unlocking in SmartSVN’ blog post.

5) Conflicts

When SmartSVN encounters conflicts, it adds new extensions to the conflicting files to help distinguish between them. By default, SmartSVN will take its cues from the config file, but if you want to specify particular extensions, you can select ‘Use following extensions’ and type the desired extensions into the textbox.

Remember, you can download your free edition of SmartSVN Professional at

ASF Announces Apache Bloodhound as Top-Level Project

WANdisco submitted Bloodhound to the Apache Incubator in December 2011 and our developers have been involved in the Apache Bloodhound project since its inception. So we’re pleased that today the Apache Software Foundation (ASF) officially announced Bloodhound as a Top-Level Project (TLP).

Bloodhound is a Trac-based software development collaboration tool that includes an Apache Subversion repository browser, wiki, and defect tracker. It’s also compatible with the hundreds of free plugins available for Trac, allowing users to customize their experience even further.

“WANdisco received many requests for an issue tracker and at the time, open source options available for integration were limited, which is why we decided to invest in setting one up in the Apache Incubator,” said David Richards, CEO of WANdisco. “WANdisco has been actively supportive of the ASF, and we’re proud to have played a leading role in Bloodhound.”

“When Bloodhound entered the incubator, while it was built on the Trac framework, it was a completely new project,” said Gary Martin, Vice President of Apache Bloodhound and WANdisco developer. “Bloodhound’s strengths lie in its powerful combination of Apache Subversion source control and robust ticket system.”

You can learn more about Apache Bloodhound, and download the latest 0.5.2 release, at the Bloodhound website.


WANdisco’s March Roundup

Following the recent issuance of our “Distributed computing systems and system components thereof” patent, which covers the fundamentals of active-active replication over a Wide Area Network, we’re excited to announce the filing of three more patents. These patents involve methods, devices and systems that enhance security, reliability, flexibility and efficiency in the field of distributed computing, and will have significant benefits for users of our Hadoop Big Data product line.

“Our team continues to break new ground in the field of distributed computing technology,” said David Richards, CEO for WANdisco. “We are proud to have some of the world’s most talented engineers in this field working for us and look forward to the eventual approval of these most recent patent applications. We are particularly excited about their application in our new Big Data product line.”

Our Big Data product line includes Non-Stop NameNode, WANdisco Hadoop Console and WANdisco Distro (WDD).

This month, we also welcomed Bas Nijjer, who built CollabNet UK from startup to multimillion-dollar recurring revenue, to the WANdisco team. Bas has a proven track record of increasing customer wins, accelerating revenue and providing customer satisfaction, and he takes on the role of WANdisco Sales Director, EMEA.

“Bas is an excellent addition to our team, with great insight on developing and strengthening sales teams and customer relationships as well as enterprise software,” said David Richards. “His expertise and familiarity with EMEA and his results-oriented attitude will help strengthen the WANdisco team and increase sales and renewals. We are pleased to have him join us.”

If joining the WANdisco team interests you, visit our Careers page for all the latest employment opportunities.

We’ve also posted lots of new content at the WANdisco blog. Users of SmartSVN, our cross-platform graphical Subversion client, can find out how to get even more out of their installation with our ‘Performing a Reverse Merge in SmartSVN’ and ‘Backing Up Your SmartSVN Data’ tutorials. For users running the latest and greatest, 7.5.4 release of SmartSVN, we’ve put together a deep dive into the fixes and new functionality in this release with our ‘What’s New in SmartSVN 7.5.4?’ post. If you haven’t tried SmartSVN yet, you can claim your free trial of this release by visiting

We also have a new post from James Creasy, WANdisco’s Senior Director of Product Management, where he takes a closer look at the “WAN” in “WANdisco:”

“We’ve all heard about the globalization of the world economy. Every globally relevant company is now highly dependent on highly available software, and that software needs to be equally global. However, most systems that these companies rely on were architected with a single machine in mind. These machines were accessed over a LAN (local area network) by mostly co-located teams.

All that changed, starting in the 1990’s with widespread adoption of outsourcing. The WAN computing revolution had begun in earnest.”

You can read “What’s in a name, WANdisco?” in full now.

Also at the blog we address the hot topic of ‘Is Subversion Ready for the Enterprise?’ And, if you need more information on the challenges and available solutions for deploying Subversion in an enterprise environment, be sure to sign up for our free-to-attend ‘Scaling Subversion for the Enterprise’ sessions. Taking place a few times a week, these webinars cover limitations and risks related to globally distributed SVN deployments, as well as free resources and live demos to help you overcome them. Take advantage of the opportunity to get answers to your business-specific questions and live demos of enterprise-class SVN products.

Performing a Reverse Merge in SmartSVN

Apache Subversion remembers every change committed to the repository, making it possible to revert to previous revisions of your project. Users of SmartSVN, the cross-platform client for SVN, can easily perform a revert using the built-in ‘Transactions’ window.

Simply right-click on the revision you wish to revert to in SmartSVN’s ‘Transactions’ window (by default, this window is located in the bottom right-hand corner of your SmartSVN screen) and select ‘Rollback.’

[Screenshot: the SmartSVN ‘Transactions’ window]

Alternatively, reverse merges can be performed through the ‘Merge’ dialogue:

1) Select ‘Merge’ from SmartSVN’s ‘Modify’ menu.

2) In the Merge dialogue, enter the revision number you’re reverting to.

[Screenshot: the ‘Merge changes from a different branch’ dialogue]

If you’re not sure of the revision you should be targeting, click the ‘Select…’ button next to the ‘Revision Range’ textbox. In the subsequent dialogue, you can review information about the different revisions, including the commit message, author and the timestamp of the commit.

[Screenshot: selecting a revision]

3) Ensure ‘Reverse merge’ is selected and click ‘Merge.’

4) Remember to commit the reverse merge to the repository to share this change with the rest of your team!

Remember, you can claim your 30 day free trial of SmartSVN Professional now.

Backing Up Your SmartSVN Data

No matter how experienced you are with Apache Subversion, accidents and unavoidable occurrences happen, so it’s important to make repository data backups. If you’re using SmartSVN, the cross-platform graphical client for Subversion, the built-in ‘Export Backup’ functionality makes it quick and easy to create a backup of a selected file/directory.

To backup your data in SmartSVN:

1) Highlight the file(s)/directory to backup, and select the ‘Export’ option from SmartSVN’s ‘Query’ menu.

2) In the subsequent ‘Export Backup’ dialog, you’ll be presented with several options:

  • ‘Relative To’ – the common root of all files to be exported

  • Into zip-file/Into directory – select how you want to export your data. In both cases, you must specify the location where the backup will be created

  • Include Ignored Files – select this option to include files marked as ‘ignored’ in the backup; otherwise they are left out

  • Include Ignored Directories – note, this option includes all the items in the ignored directories

  • Wipe directory before copying – wipe the selected directory before performing your backup

[Screenshot: the ‘Export Backup’ dialog]

Depending on the selection of files or directories, the ‘Export’ option will display either the number of files being exported or an ‘All files and directories’ message.

3) Once you are satisfied with the information you have entered, click ‘Export’ to create your backup.
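SmartSVN’s ‘Export Backup’ dialog is a front-end for Subversion’s ‘svn export’ command, which copies a tree without the administrative .svn metadata. A minimal command-line sketch, with placeholder paths and revision number:

```shell
# Export the working copy (including local modifications) to a backup directory
svn export working-copy backup-dir

# Or export a clean tree straight from the repository at a chosen revision
svn export -r 120 http://example.com/repo/trunk backup-r120
```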

Want more free Subversion training? We offer plenty of webinar replays available on-demand, or you can sign up for our upcoming webinars.

Subversion Tip of the Week

Apache Subversion supports the creation and use of ‘patches’ – text files containing the differences between two files. Patches specify which lines have been removed, added and changed, and are particularly useful when you don’t have write access to a repository. In these instances, you can create a patch file showing the changes between a file as it exists in the repository, and the version in your working copy. Then, you can create a ticket and attach your patch file for someone with repository write access to review and commit the accepted changes to the repository.

To create a patch file, you first need to review the differences between the specific files/revisions you are targeting using the ‘svn diff’ command. In this example, we are examining the differences between the version of the project in our working copy and the central repository.

If you’re satisfied with the differences ‘svn diff’ has identified, run the following command to create a patch:

svn diff > patch_name.diff

All the changes will now be written to a patch on your local machine.

You can now send this patch to a user who does have write access to the repository.

Creating a Patch Between Revisions

Alternatively, if you want to create a patch containing the differences between two revisions, run the following command:

svn diff -r (revision):(revision) (working-copy-location)

Then redirect the output of that same command into a patch file:

svn diff -r (revision):(revision) (working-copy-location) > patch_name.diff

Again, this patch file can now be submitted to someone with write access.
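On the receiving end, a committer using Subversion 1.7 or later can apply the file with the ‘svn patch’ command (earlier versions would use a standalone patch tool instead). A sketch, assuming the patch file from the example above:

```shell
# In an up-to-date working copy of the project:
svn patch patch_name.diff

# Review what the patch touched, then commit on the contributor's behalf.
svn status
svn commit -m "Apply contributed patch"
```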

Want more advice on your Apache Subversion installation? We have a full series of SVN refcards for free download, covering hot topics such as branching and merging, and best practices. You can find out more at

What’s New in SmartSVN 7.5.4?

The latest release of SmartSVN, the cross-platform graphical client for Apache Subversion, features plenty of improvements you will find useful. In this post, we take a closer look at some of the functionality we’ve added to SmartSVN 7.5.4.


SmartSVN’s ‘switch’ option allows users to update a working copy to a different URL. This is particularly useful when you need to update your working copy to mirror a newly created branch. SmartSVN 7.5.4 adds support for the --ignore-ancestry option, which forces SmartSVN to switch to a URL even when it cannot find a common ancestor for the URL and your working copy.

JIRA Fixes

SmartSVN supports the popular JIRA issue tracker through its ‘Bugtraq’ properties option, allowing users to seamlessly integrate JIRA into the commit wizard and other modules. SmartSVN 7.5.4 fixes an internal error that could close the ‘Resolve’ dialogue, ensuring that SmartSVN’s JIRA integration continues to run smoothly.

Shell Integration Updates

In addition to being available as a standalone program, SmartSVN integrates with Windows Explorer and Mac OS X Finder, giving you the freedom to work the way you want. SmartSVN 7.5.4 includes fixes and new functionality for this integration, including:

  • Settings for shell integration are now stored

  • A fix for an internal error that could occur when working with root-level working copies (Windows)

  • A fix for a bug that could cause commands to be erroneously enabled (Windows)


The Transactions view automatically provides information about new project revisions, ensuring users are kept up-to-date with changes being committed to the repository. If you’re using SmartSVN Professional, this Transactions window can watch for commits in any repository, keeping you informed on changes in the libraries being used by your project, or about the Subversion-related activities of your entire team.


SmartSVN 7.5.4 addresses a bug that could cause the ‘Copy Revision Number’ command to copy multiple items.

Additional Fixes

SmartSVN 7.5.4 also includes fixes for:

  • An internal error in the Merge Preview

  • An error in the SmartSVN Log that could occur when loading merged revisions

  • The “smartsvn.defaultConnectionLogging” system property failing to work

  • Trac plugin failing when querying Trac ticket db

More information on what’s new and noteworthy in this release is available at the Changelog.

Haven’t started with SmartSVN? You can claim a free trial of SmartSVN Professional 7.5.4 now.


Resolving Conflicts in Subversion

When you’re committing changes to Apache Subversion’s central repository, you may occasionally encounter a conflict which will cause your commit to fail.

You’ll be unable to commit any changes to the repository until you’ve resolved all the conflicts. The good news is that Apache Subversion has all the functionality needed to quickly resolve whatever conflicts you may encounter.

1) Perform an Update

It’s possible that the changes you’ve made and the changes that have already been committed affect different parts of the conflicted file. Therefore, the first step is to perform an svn update:

svn update (path)

Subversion will then try to merge the changes from the server into your working copy, without overriding any of your local changes. If the changes affect different areas of the file, the server will merge the changes and you’ll be able to perform your commit. However, if you’ve modified the same sections of the file (e.g. the same lines in a text file), Subversion will be unable to automatically merge the changes and the command line window will present you with several options to resolve the conflict:

  • (p) postpone – marks the conflict to be resolved later.

  • (df) diff-full – displays the differences between the HEAD revision and the conflicted file.

  • (e) edit – opens the conflicted file in an editor (this is set in the EDITOR environment variable)

  • (mc) mine-conflict – discards changes from the server that conflict with your local changes; all non-conflicting changes are accepted

  • (tc) theirs-conflict – discards local changes that conflict with changes from the server; all non-conflicting local changes are preserved

  • (s) show all options – displays additional options

Enter ‘s’ to be presented with some additional options:

Once you’ve resolved the conflict, perform an ‘svn commit’ to send your changes to the repository.
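If you chose (p) postpone, the same decisions can be made later, non-interactively, with the ‘svn resolve’ command. A minimal sketch (the filename is a placeholder):

```shell
# Keep your local version of the conflicting lines:
svn resolve --accept mine-conflict file.txt

# Or take the incoming server version instead:
#   svn resolve --accept theirs-conflict file.txt

# With the conflict cleared, the commit can proceed.
svn commit -m "Resolve conflict in file.txt"
```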

Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:

Reviewing Changes with Subversion’s ‘SVN Diff’

Sometimes you need to review the differences between files and revisions, for example before you commit your changes to the repository or when you’re trying to pinpoint the revision you need to revert to. This is when Apache Subversion’s ‘svn diff’ command comes in handy, allowing you to see the differences between files and revisions by printing a line-by-line breakdown of human-readable files. This helps by showing you exactly what has changed in the specified file, at the specified revision. The results include lines prefixed by a character representing the nature of the change:

  • + Line was added

  • - Line was deleted

  • A blank space represents no change

The ‘svn diff’ command can be used to perform several different tasks:

  • View Local Modifications

When ‘svn diff’ is performed on a working copy, it prints line-by-line information on all local modifications:

svn diff (working-copy-path)

  • Compare Different Revisions

To use the ‘svn diff’ command to compare different revisions of the same file, use the ‘-r’ switch:

svn diff -r(number):(number) (working-copy-path)/filename

This command also works at the repository level.

Additional Options

  • --notice-ancestry

By default ‘svn diff’ ignores the ancestry of file(s), but you can force Subversion to take ancestry into consideration by adding the --notice-ancestry switch.

  • --show-copies-as-adds

By default, ‘svn diff’ displays the content difference for a file created by a copy command as a delta against the original file. Adding this switch forces Subversion to display the copied content as though it’s a brand new file.
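Putting these pieces together, a typical review session might look like the following. The revision numbers and the file path are illustrative, and --summarize (which lists only the changed paths) is handy for large revision ranges:

```shell
# Uncommitted local edits against the working-copy base
svn diff

# Line-level changes between two committed revisions of a file
svn diff -r 12:14 src/main.c

# Only the list of changed paths, without line-level detail
svn diff -r 12:14 --summarize ^/trunk
```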


Why svnsync Is Not Good Enough for Enterprise Global Software Development

If you’ve found your way to this article, you are likely already familiar with svnsync, a commonly used, free and open source tool for replicating Subversion repositories using the master-slave paradigm. What you might not know is that it’s far from the best solution for modern multi-site enterprise software development.

You don’t have to take it from me, however. Let our video and case study of Navis’ deployment of Subversion MultiSite make a more convincing argument for why they moved away from svnsync.

In a customer’s own words:

“We were already using SVNSync, … we were frequently having to deal with repositories out of sync, … we didn’t see the stability just using the straight up SVNSync”

“We found that check-in times were up to ten times faster due to the fact that the repository was adjacent to the end user”

“One of the things we were quite surprised with was the actual ease of the implementation”

How does the WANdisco Subversion MultiSite product achieve this remarkable improvement over what would normally be a perfectly serviceable free tool? A 20-page algorithm stands behind our patented, true active-active replication implementation, and it has been proven mathematically optimal. If you are using svnsync today, our solution offers a step up in scalability, performance and data safety for your existing Subversion deployment.

WANdisco Files Three New Patents with USPTO

We are pleased to announce the filing of three new patents with the United States Patent and Trademark Office (USPTO) related to distributed computing.

These three innovations involve methods, devices and systems that enhance security, reliability, flexibility and efficiency in the field of distributed computing. The patents are expected to have significant benefits for users of our new Hadoop Big Data product line.

“Our team continues to break new ground in the field of distributed computing technology,” said David Richards, CEO for WANdisco. “We are proud to have some of the world’s most talented engineers in this field working for us and look forward to the eventual approval of these most recent patent applications. We are particularly excited about their application in our new Big Data product line.”

Our Big Data product line includes Non-Stop NameNode, which turns the NameNode into an active-active shared-nothing cluster, and the comprehensive wizard-driven management dashboard ‘WANdisco Hadoop Console.’ We also offer a free-to-download, fully-tested and production-ready version of Apache Hadoop 2. Visit the WANdisco Distro (WDD) to learn more.

This news comes after we announced the issuance of our “Distributed computing systems and system components thereof” patent, which covers the fundamentals of active-active replication over a Wide Area Network.


Subversion Tip of the Week

SVN Revert

Apache Subversion’s ‘svn revert’ command allows you to discard local changes to a file or directory, restoring the pristine version from your last checkout or update. This saves you the overhead of performing a fresh checkout, and is also helpful when you need to quickly resolve a conflict.

To revert the changes on a single file, run the ‘svn revert’ command followed by the file path:

svn revert (working-copy)/filename

It’s also possible to revert all the changes within an entire directory using the --depth=infinity switch. When this switch is added, any files that have been changed within the specified directory are replaced with their repository equivalent:

svn revert --depth=infinity (working-copy)

Useful Additional Commands

  • svn status

Before discarding your local changes, you may want to review exactly which files have been altered at the working copy level by using the ‘svn status’ command:

svn status (working-copy-path)

  • svn diff

The ‘svn diff’ command prints all the changes that have been made to human-readable files within the working copy, which is useful for identifying the file(s) you want to revert. Each line is prefixed by a character representing the nature of the change:

  1. + Line was added
  2. - Line was deleted
  3. A blank space represents no change

To run ‘svn diff’ enter the following command:

svn diff (working-copy-path)

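Combining these commands, a cautious clean-up session runs them in order, checking what would be lost before discarding anything:

```shell
# 1. Which files have local modifications?
svn status

# 2. Exactly what would be thrown away?
svn diff

# 3. Discard everything below the current directory.
svn revert --depth=infinity .
```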
Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:



Switch your Subversion Working Copy

Apache Subversion’s ‘svn switch’ command allows users to update a working copy to a different URL. This is useful when you need to update your working copy to mirror a newly-created branch.

Although it’s possible to achieve the same effect by performing a fresh checkout, the ‘svn switch’ command is a quicker alternative: it saves you the overhead of running ‘svn checkout’ and applies only the changes required to bring your working copy in line with the new location. It also preserves any changes you’ve made in the working copy.

To perform a switch, run ‘svn switch’ followed by the URL path you wish to mirror. Apache Subversion will then go ahead and update your working copy.

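For example, to move a trunk working copy onto a newly created branch, you can use the caret shorthand for the repository root (the branch name feature-x here is hypothetical):

```shell
# Re-point the working copy at the branch; only differing files are
# fetched, and uncommitted local changes are preserved.
svn switch ^/branches/feature-x

# Confirm which URL the working copy now tracks.
svn info
```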
Additional Options

There are some additional options you can apply to fine-tune the ‘svn switch’ command:

  • Ignore Ancestry

If Subversion cannot find a common ancestor for the URL and your working copy, it will block the operation and display an error message.

It is possible to force Subversion to switch to this URL anyway, by adding the --ignore-ancestry option.

svn switch (target-URL) --ignore-ancestry

  • Target a Particular Revision

You can also specify a particular revision of the URL you’re switching to. Note that Subversion defaults to the HEAD revision if no alternate revision is specified:

svn switch -r(revision-number) (target-URL)

Want more free Subversion training? We offer plenty of webinar replays available on-demand, or you can sign up for our upcoming webinars.

Subversion’s SVN Annotate Command

Apache Subversion’s ‘svn annotate’ command allows users to view a line-by-line breakdown of all the changes that have been applied to a human-readable file in your working copy. This information is printed alongside details on:

  • The person responsible for each change
  • The revision number where each change occurred

Note that this line-by-line attribution is based on the file as it currently appears within the working copy.

To run ‘svn annotate’ on a file, enter:

svn annotate (working-copy-location)/file-being-examined

In this example, we’re examining all the changes for the ‘Changelog’ file, which is located inside the trunk of our working copy.
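Output from ‘svn annotate’ prefixes each line with the revision that last changed it and that revision’s author. For a hypothetical two-author Changelog it might look something like this (revision numbers, authors and content are all illustrative):

```shell
svn annotate trunk/Changelog
#      1    alice  1.0 released
#      3      bob  1.1 released; fixed typo in the 1.0 entry
```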

If you need a more comprehensive printout, the --verbose (-v) switch adds the full datestamp to each line.

The --force Switch

The ‘svn annotate’ command uses Multipurpose Internet Mail Extensions (MIME) types to automatically detect whether a file is human-readable. By default, ‘svn annotate’ will refuse to process any file that isn’t human-readable, so if you attempt to run it on a file that Subversion judges to be binary, you’ll get an error message.

If you want to go ahead regardless, you can add the --force switch. Of course, this may result in a screen full of strange symbols if the file truly isn’t human-readable!

Not yet started with SmartSVN, the easy-to-use graphical client for Subversion? Get your free 30 day trial at

What’s in a Name, WANdisco?


I recently asked a new hire how they ended up considering WANdisco. “The intriguing, memorable name” was the paraphrased answer. Frankly, when I first heard the name myself some years ago I didn’t think much of it. It’s certainly different, but struck me as too different, too weird for a company peddling Subversion support contracts.

That’s before I understood what WANdisco is about.

WANdisco lived at the edge of my radar as a bit player in the SCM space during my time as Director of Product Technology at Perforce Software. There were always little companies scraping out a living around open source software, so we lumped WANdisco in with CollabNet and called it a day. Wrong!

Turns out WANdisco founders David Richards, Jim Campigli and Dr. Yeturu Aahlad had a much more ambitious plan from the start. The core technology, a patented WAN-capable Paxos implementation, is a key enabler for the evolution of software into globally distributed, highly available systems: in other words Wide Area Network distributed computing, or WANdisco.  Much like the name “Microsoft” was about the revolution of software for microcomputers in the 1980’s, the name WANdisco is about the new revolution of WAN-based distributed computing.

What’s so important about the WAN?

We’ve all heard about the globalization of the world economy. Every globally relevant company is now highly dependent on highly available software, and that software needs to be equally global. However, most systems that these companies rely on were architected with a single machine in mind. These machines were accessed over a LAN (local area network) by mostly co-located teams.

All that changed, starting in the 1990’s with widespread adoption of outsourcing. The WAN computing revolution had begun in earnest.

But there was a problem. The WAN wasn’t like a bigger LAN; it’s a different environment altogether. Single-machine systems, like those involving a central server, perform pitifully for remote workers. And it’s not easy to update these systems, because the WAN is a high-failure environment that punishes single-machine systems with their inevitable single points of failure. It was time for the science of distributed computing to come to the rescue of this new requirement of wide area network distributed computing.

Ergo, “WANdisco.”

Is Subversion Ready for the Enterprise?

At WANdisco, we firmly believe that Apache Subversion is a commercial quality version control system ready for the enterprise. With everything that Subversion has to offer enterprise users, it’s easy to see why it’s becoming such a popular choice:

  • It’s open source – cost is one of the most commonly-cited reasons for adopting open source solutions such as Subversion, but there are many other benefits. Most notably, open source projects tend to be collaborative efforts between many developers, so users reap the benefit of a team of developers, all with their own particular skills and areas of expertise.
  • It’s an established project – accepted into the Apache Incubator in 2009 and graduating a year later, today Subversion is an Apache Top Level Project maintained by a global community of contributors.
  • It’s the center of a vibrant ecosystem – Apache Subversion users have access to countless additional client tools, GUIs and plugins. Subversion also integrates with most of the major IDEs, including Eclipse and Microsoft Visual Studio.
  • Free community support – another benefit of utilizing open source technology is the transparent, archived communication that makes up an open source project’s mailing lists and forums, including Subversion’s dedicated SVNForum. This communication can be an invaluable source of information for users, and in many instances, a question will have already been asked – and answered – by someone else. If you can’t find the answer you were looking for, ask the community directly. There’s also no shortage of free training resources available online, including webinars, refcards and tutorials.
  • Professional support option – Subversion has an extensive community of users who are always willing to answer queries, but mailing lists and forums aren’t always the ideal place to reach out to when disaster strikes your enterprise deployment. As a long-established open source solution, there are professional support options available for Apache Subversion.

Our professional support services for Subversion include:

  1. 24-by-7 worldwide coverage
  2. Guaranteed response times
  3. Indemnification coverage
  4. Subversion system health check
  5. 8 hours of free consulting or training

Enterprise training is another option for users who need additional support with their Subversion installation.

Despite all the benefits, there are some potential issues to consider when working with large Subversion deployments. If you’re using multiple SVN repositories across globally distributed teams, you may encounter the following challenges:

  • Loss of productivity when the central server fails and users at remote sites cannot access the latest version of your project.
  • Slow networks encourage developers at remote sites to checkout and/or commit infrequently. This increases the chances of encountering time-consuming conflicts.
  • Unnecessary read operations taking place over the WAN, as users at remote sites repeatedly perform read operations to access the same files. This degrades the performance of both the central Subversion server and the network.
  • Every remote request entails a WAN penalty. Although Subversion clients only send changes to the central server when modifications to existing source code files are committed, when a new file is committed or an existing file is checked out, the entire file is sent over the WAN.
  • When Subversion is implemented with an Apache Web Server as a front-end, and the WebDAV HTTP protocol is used, the WAN penalty can be significant. This is particularly true of commits that consist of a large number of files.

To help enterprises overcome these challenges, we’ve just announced an ongoing series of free webinars. Over the course of each hour-long ‘Scaling Subversion for the Enterprise’ session, our expert Solution Architect Patrick Burma will cover all the issues enterprises can encounter when using multiple Subversion repositories across globally distributed teams. He will also discuss the accompanying solutions from the administrative, business and IT perspectives, and will be available to answer specific questions.

You can register for all of this week’s sessions now:

Subversion Tip of the Week

Using SVN Move

Apache Subversion’s ‘svn move’ command allows the user to move files and directories, and can be applied to both the working copy and the repository. The key difference between this command and ‘svn copy’ is that ‘svn move’ also deletes the original file. This makes running ‘svn move’ equivalent to performing an ‘svn copy’ followed by an ‘svn delete.’

…in the Working Copy

Running this command at the working copy level requires you to specify the file you’re moving and the location you’re moving it to:

svn move (working-copy-path)/item-being-moved (working-copy-path)/new-location

In this example we’re moving the ‘Release2’ item to the ‘Releases’ directory.

Note, you’ll need to perform a commit to send this change to the repository and share it with the rest of your team.
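The complete working-copy sequence, from move through commit, looks like this (the item names follow the example above):

```shell
# Recorded as a copy-with-history plus a delete, so history is preserved.
svn move Release2 Releases/Release2

# 'svn status' shows 'A +' for the new path and 'D' for the old one.
svn status

# Publish the move to the repository.
svn commit -m "Move Release2 into the Releases directory"
```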

…in the Repository

It’s also possible to move items inside the repository. As this creates a new revision, you’ll need to supply a log message alongside the command:

svn move (repository-URL)/item-being-moved -m "log message" (repository-URL)/new-location

In this example we’re moving the item ‘Release’ to the ‘Releases’ directory.

Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:

WANdisco’s February Roundup

This month, we launched a trio of innovative Hadoop products: the world’s first production-ready distro; a wizard-driven management dashboard; and the first and only 100% uptime solution for Apache Hadoop.

We started this string of Big Data announcements with WANdisco Distro (WDD), a fully tested, free-to-download version of Apache Hadoop 2. WDD is based on the most recent Hadoop release, includes all the latest fixes and undergoes the same rigorous quality assurance process as our enterprise software solutions.

This release paved the way for our enterprise Hadoop solutions, and we announced the WANdisco Hadoop Console (WHC) shortly after. WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

The final product in this month’s Big Data announcements was WANdisco Non-Stop NameNode. Our patented technology makes WANdisco Non-Stop Namenode the first and only 100% uptime solution for Hadoop, and offers a string of benefits for enterprise users:

  • Automatic failover and recovery
  • Automatic continuous hot backup
  • Removes single point of failure
  • Eliminates downtime and data loss
  • Every NameNode server is active and supports simultaneous read and write requests
  • Full support for HBase

To support the needs of the Apache Hadoop community, we’ve also launched a dedicated Hadoop forum. At this forum, users can get advice on their Hadoop installation and connect with fellow users, including WANdisco’s core Apache Hadoop developers Dr. Konstantin V. Shvachko, Dr. Konstantin Boudnik, and Jagane Sundar.


For Apache Subversion users, we announced the next webinars in our free training series:

  • Subversion Administration – everything you need to administer a Subversion development environment
  • Introduction to SmartSVN – a short introduction to how Subversion works with the SmartSVN graphical client
  • Checkout Command – how to get the most out of the checkout command, and the meaning of the various error messages you may encounter
  • Commit Command – learn more about this command, including diff usage, working with unversioned files and changelists
  • Introduction to Git – everything a new user needs to get started with Git
  • Hook Scripts – how to use hook scripts to automate tasks such as email notifications, backups and access control
  • Advanced Hook Scripts – an advanced look at hook scripts, including using a config file with hook scripts and passing data to hook scripts

We’ve announced an ongoing series of free webinars, which demonstrate how you can overcome the challenges of large, globally distributed Subversion deployments from an administrative, business and IT perspective, and get the most out of deploying Subversion in an enterprise environment. These ‘Scaling Subversion for the Enterprise’ webinars will be conducted by our expert Solution Architect three times a week (Tuesday, Wednesday and Thursday) at 10.00am PST/1.00pm EST, and will cover:

  • The latest technology that can help you overcome the limitations and risks associated with globally distributed deployments
  • Answers to your business-specific questions
  • How to solve critical issues
  • The free resources and offers that can help solve your business challenges

uberSVN ‘Chimney House’ Release 8 Released

We’re pleased to announce a new release of uberSVN, the free, open ALM platform from WANdisco. The new uberSVN ‘Chimney House’ Release 8 ships with Apache Subversion versions 1.7.8 and 1.6.20, and introduces some new features and fixes, including:

  • The system-wide ability to handle case insensitivity. A new toggle within the Admin > Preferences tab allows you to specify whether usernames should be case insensitive or not.
  • A fix for Subversion password and authz files not being regenerated, an issue that could result in authentication problems.
  • A fix for formatting errors that could result in difficulties using LDAP for uberSVN logins.

More information on what’s new and noteworthy is available at the uberSVN release notes. uberSVN can be downloaded for free from

All About SVN Copy

Apache Subversion’s ‘svn copy’ command allows you to quickly create a copy of item(s) at both the working copy and the repository level. It’s most commonly used when creating branches.

In this tutorial, learn how to use the ‘svn copy’ command to copy file(s) in the working copy and the repository, alongside options such as copying items at specific revisions.

…in the Working Copy

The ‘svn copy’ command allows you to create a copy and place it in a new location within the working copy by running the command followed by the location of the item(s) you’re copying and the new location.

In this example, we’re creating a copy of the ‘Release3’ folder and placing it inside the ‘Releases’ directory.

svn copy (working-copy-path)/item-being-copied (working-copy-path)/item-being-created


Check your working copy and you’ll see the file (‘Release4’) has successfully been created. Remember, this is a local change so you’ll need to perform an ‘svn commit’ to share it with the rest of your team.


…in the Repository

Alternatively, you can create copies at the repository level. This change will automatically create a new revision so you’ll need to provide a log message alongside the ‘svn copy’ command.

svn copy (repository-URL)/item-being-copied -m "log message" (repository-URL)/item-being-created


You can also copy item(s) as they existed in particular revisions, by specifying a revision number:

svn copy -r(revision-number) (repository-URL)/item-being-copied -m "log message" (repository-URL)/item-being-created

If no revision number is given, Subversion will default to HEAD.



…Or Both

Finally, you can copy item(s) between the working copy and the central repository. Note that when you’re copying to/from the repository, the usual rules apply: A log message is required, and the repo will copy the HEAD revision unless instructed otherwise.

In this example, we’re creating a copy of the “Release3” folder in the working copy and adding it to the repository as a folder called “Release5.”

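A working-copy-to-repository copy can be sketched like this (an illustration only, assuming svn and svnadmin are on your PATH; the ‘Release3’/‘Release5’ names are made up for the demo):

```shell
#!/bin/sh
# Scratch demo: copying a committed working-copy folder directly to a
# repository URL (assumes svn/svnadmin on PATH; names are illustrative).
set -e
tmp=$(mktemp -d)
svnadmin create "$tmp/repo"
svn checkout -q "file://$tmp/repo" "$tmp/wc"
mkdir "$tmp/wc/Release3"
svn add -q "$tmp/wc/Release3"
svn commit -q -m "add Release3" "$tmp/wc"
# working copy -> repository URL: this commits immediately, so -m is required
svn copy -q -m "copy Release3 to Release5" "$tmp/wc/Release3" "file://$tmp/repo/Release5"
svn ls "file://$tmp/repo"
```

The final listing shows both ‘Release3’ and the new ‘Release5’ at the repository root.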

Want more advice on your Apache Subversion installation? We have a full series of SVN refcards for free download, covering hot topics such as branching and merging, and best practices. You can find out more at

Checking Your Changes with SVN

Before committing work to the central repository, you may want to review the changes you’ve made. This can be achieved by running Apache Subversion’s ‘svn status’ command at the top of your working copy, which will print out a list of all your local changes:

svn status (working-copy-path)


The characters in the first column represent the state of the file/directory listed. The different characters are:

  • ? – item is not under version control.
  • ! – item is missing.
  • A – item is scheduled to be added to the repository.
  • C – item is in a state of conflict.
  • D – item is scheduled to be deleted in the repository.
  • M – item contents have been modified.
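The most common codes can be provoked in a scratch working copy. This sketch assumes svn and svnadmin are on your PATH; all file names are made up for the demo:

```shell
#!/bin/sh
# Scratch demo provoking common 'svn status' codes (?, A, D, M);
# assumes svn/svnadmin on PATH.
set -e
tmp=$(mktemp -d)
svnadmin create "$tmp/repo"
svn checkout -q "file://$tmp/repo" "$tmp/wc"
cd "$tmp/wc"
echo v1 > tracked.txt
echo v1 > doomed.txt
svn add -q tracked.txt doomed.txt
svn commit -q -m "baseline"
echo v2 >> tracked.txt     # M: contents modified
svn delete -q doomed.txt   # D: scheduled for deletion
echo new > added.txt
svn add -q added.txt       # A: scheduled for addition
touch unversioned.txt      # ?: not under version control
svn status
```

Each of the four files then appears in the ‘svn status’ output with its own first-column code.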

Alternatively, if you need information about all the files in your working copy, regardless of whether they have been modified, you can add Subversion’s --verbose (-v) switch to ‘svn status’:

svn status -v (working-copy-path)


Need additional support with your Subversion installation? We provide a range of professional support services for SVN users, including indemnification coverage, guaranteed response times, system health checks and more.

Subversion Tip of the Week

SVN Blame

In certain situations, you may need more information about how a file changed in a particular Apache Subversion revision and crucially, who was responsible for that change. This is achieved by running the ‘svn blame’ command. This command prints each modified line of the specified file, alongside the revision number and the username of the person responsible for that change.

To run the ‘svn blame’ command, enter:

svn blame (repository-URL)/file


However, sometimes a change may simply be a trivial whitespace or other formatting change. If you suspect this could be the case, the extensions switch (-x) can be used in conjunction with several other switches to filter out these trivial changes:

  • --ignore-all-space (-w) – ignores all whitespace.
  • --ignore-space-change (-b) – ignores all changes in the amount of whitespace.
  • --ignore-eol-style – ignores changes in end-of-line style.

In this example, we’re running ‘svn blame’ on the same file, but this time specifying that any EOL changes should be ignored.

svn blame -x --ignore-eol-style (repository-URL)/file
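A minimal end-to-end blame run can be sketched against a scratch repository (an illustration only, assuming svn and svnadmin are on your PATH; the file name and contents are made up):

```shell
#!/bin/sh
# Scratch demo of 'svn blame': each line is printed with the revision
# (and author) that last changed it; assumes svn/svnadmin on PATH.
set -e
tmp=$(mktemp -d)
svnadmin create "$tmp/repo"
svn checkout -q "file://$tmp/repo" "$tmp/wc"
cd "$tmp/wc"
printf 'alpha\nbeta\n' > notes.txt
svn add -q notes.txt
svn commit -q -m "r1: initial content"
printf 'alpha\nbeta (edited)\n' > notes.txt
svn commit -q -m "r2: edit the second line"
# 'alpha' is annotated with r1, the edited line with r2
svn blame notes.txt
```

Only the second line was touched in r2, so blame attributes the two lines to different revisions.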


Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:

WANdisco Announces Free Online Hadoop Training Webinars

We’re excited to announce a series of free one-hour online Hadoop training webinars, starting with four sessions in March and April. Time will be allowed for audience Q&A at the end of each session.

Wednesday, March 13 at 10:00 AM Pacific, 1:00 PM Eastern

“A Hadoop Overview” will cover Hadoop, from its history to its architecture, as well as:

  • HDFS, MapReduce, and HBase
  • Public and private cloud deployment options
  • Highlights of common business use cases and more

March 27, 10:00 AM Pacific, 1:00 pm Eastern

“Hadoop: A Deep Dive” covers Hadoop misconceptions (not all clusters include thousands of machines) and:

  • Real world Hadoop deployments
  • Review of major Hadoop ecosystem components including: Oozie, Flume, Nutch, Sqoop and others
  • In-depth look at HDFS and more

April 10, 10:00 AM Pacific, 1:00 pm Eastern

“Hadoop: A MapReduce Tutorial” will cover MapReduce at a deep technical level and will highlight:

  • The history of MapReduce
  • Logical flow of MapReduce
  • Rules and types of MapReduce jobs
  • De-bugging and testing
  • How to write foolproof MapReduce jobs

April 24, 10:00 AM Pacific, 1:00 pm Eastern

“Hadoop: HBase In-Depth” will provide a deep technical review of HBase and cover:

  • Its flexibility, scalability and components
  • Schema samples
  • Hardware requirements and more

Space is limited so click here to register right away!

WANdisco Non-Stop NameNode Removes Hadoop’s Single Point of Failure

We’re pleased to announce the release of the WANdisco Non-Stop NameNode, the only 100% uptime solution for Apache Hadoop. Built on our Non-Stop patented technology, Hadoop’s NameNode is no longer a single point of failure, delivering immediate and automatic failover and recovery whenever a server goes offline, without any downtime or data loss.

“This announcement demonstrates our commitment to enterprises looking to deploy Hadoop in their production environments today,” said David Richards, President and CEO of WANdisco. “If the NameNode is unavailable, the Hadoop cluster goes down. With other solutions, a single NameNode server actively supports client requests and complex procedures are required if a failure occurs. The Non-Stop NameNode eliminates those issues and also allows for planned maintenance without downtime. WANdisco provides 100% uptime with unmatched scalability and performance.”

Additional benefits of Non-Stop NameNode include:

  • Every NameNode server is active and supports simultaneous read and write requests.
  • All servers are continuously synchronized.
  • Automatic continuous hot backup.
  • Immediate and automatic recovery after planned or unplanned outages, without the need for administrator intervention.
  • Protection from “split-brain,” where the backup server becomes active before the active server is completely offline, which can result in data corruption.
  • Full support for HBase.
  • Works with Apache Hadoop 2.0 and CDH 4.1.

“Hadoop was not originally developed to support real-time, mission critical applications, and thus its inherent single point of failure was not a major issue of concern,” said Jeff Kelly, Big Data Analyst at Wikibon. “But as Hadoop gains mainstream adoption, traditional enterprises rightly are looking to Hadoop to support both batch analytics and mission critical apps. With WANdisco’s unique Non-Stop NameNode approach, enterprises can feel confident that mission critical applications running on Hadoop, and specifically HBase, are not at risk of data loss due to a NameNode failure because, in fact, there is no single NameNode. This is a major step forward for Hadoop.”

You can learn more about the Non-Stop NameNode at the product page, where you can also claim your free trial.

If you’d like to get first-hand experience of the Non-Stop NameNode and are attending the Strata Conference in Santa Clara this week, you can find us at booth 317, where members of the WANdisco team will be doing live demos of Non-Stop NameNode throughout the event.

WANdisco Announces Non-Stop Hadoop Alliance Partner Program

We’re pleased to announce the launch of our Non-Stop Alliance Partner Program to provide Industry, Technology and Strategic Partners with the competitive advantage required to compete and win in the multi-billion dollar Big Data market.

There are three partner categories:

  • For Industry Partners, which include consultants, system integrators and VARs, the program provides access to customers who are ready to deploy and the competitive advantage necessary to grow business through referral and resale tracks.
  • For Technology and Strategic Partners, including software and hardware vendors, the program accelerates time-to-market through Non-Stop certification and reference-integrated solutions.
  • For Strategic Partners, the program offers access to WANdisco’s non-stop technology for integrated Hadoop solutions (OEM and MSP)

Founding Partners participating in the Non-Stop Alliance Partner Program include Hyve Solutions and SUSE.

“Hyve Solutions is excited to be a founding member of WANdisco’s Non-Stop Alliance Partner Program,” said Steve Ichinaga, Senior Vice President and General Manager of Hyve Solutions. “The integration of WANdisco and SUSE’s technology with Hyve Solutions storage and server platforms gives enterprise companies an ideal way to deploy Big Data environments with non-stop uptime quickly and effectively into their datacenters.”

“Linux is the undisputed operating system of choice for high performance computing. For two decades, SUSE has provided reliable, interoperable Linux and cloud infrastructure solutions to help top global organizations achieve maximum performance and scalability,” said Michael Miller, vice president of global alliances and marketing, SUSE.  “We’re delighted to be a Non-Stop Strategic Technology Founding Partner to deliver highly available Hadoop solutions to organizations looking to solve business challenges with emerging data technologies.”

Find out more about joining the WANdisco Non-Stop Alliance Partner Program or view our full list of partners.

Scaling Subversion for the Enterprise

Apache Subversion is one of the world’s most popular open source version control solutions. It’s also becoming increasingly popular within the enterprise, with plenty to offer enterprise users, including:

  • Established professional support options
  • A commercial-friendly Apache license
  • Atomic commits that allow enterprise users to track and audit changes
  • Plenty of free training resources, such as webinars, refcards and online tutorials

However, large Subversion deployments have limitations that can negatively affect your business. If you are using multiple Subversion repositories across globally distributed teams, you’re likely facing challenges around performance and productivity, repository sync, WAN latency and connectivity, access control or the need for HADR (high availability and disaster recovery).

In our new, free-to-attend ‘Office Hours’ sessions, our expert Solution Architect will conduct live demos, showcasing how our Subversion MultiSite technology can help you overcome the limitations and risks related to globally distributed SVN deployments. Over the course of the hour, our Solution Architect Patrick Burma will cover these issues and accompanying solutions, from the administrative, business and IT perspectives, and will be available to answer all of your business-specific questions.

You can register for all of this week’s sessions now:

All sessions will take place at 10:00am PST (1:00pm EST) and are free to attend.


Subversion Tip of the Week

SVN Import

There are two main options when you need to add new file(s) to your Apache Subversion project: the ‘SVN Add’ command and ‘SVN Import.’ The advantages of performing an ‘SVN Import’ are:

  • ‘SVN Import’ communicates directly with the repository, so no working copy or checkout is required.
  • Your files are immediately committed to the repository, and are therefore available to the rest of the team.
  • Intermediate directories that don’t already exist in the repository are automatically created without the need for additional switches.

‘SVN Import’ is typically used when you have a local file tree that’s being added to your Subversion project. Run the following to add a file/file tree to your repository:

svn import -m "log message" (local file/file tree path) (repository-URL)

In this example, we’re adding the contents of the “Release2” folder to the repository, in an existing ‘branches’ directory.


As already mentioned, intermediate directories do not need to exist prior to running the ‘SVN Import’ command. In this example, we’re again importing the contents of ‘Release2,’ but this time we’re simultaneously creating a ‘Release2’ directory to contain the files.


If you check the repository, you’ll see a new ‘Release2’ directory has been created. The contents of your ‘Release2’ file tree are located inside.
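The full import flow can be sketched like this (an illustration only, assuming svn and svnadmin are on your PATH; the ‘Release2’ tree and ‘branches’ path are made up for the demo):

```shell
#!/bin/sh
# Scratch demo of 'svn import': a plain local tree goes straight into
# the repository, and intermediate directories ('branches') are created
# automatically; assumes svn/svnadmin on PATH.
set -e
tmp=$(mktemp -d)
svnadmin create "$tmp/repo"
mkdir "$tmp/Release2"
echo hello > "$tmp/Release2/app.txt"
# no working copy or checkout needed; the commit happens immediately
svn import -q -m "import Release2" "$tmp/Release2" "file://$tmp/repo/branches/Release2"
svn ls "file://$tmp/repo/branches/Release2"
```

The listing confirms that both the intermediate ‘branches’ directory and the new ‘Release2’ directory now exist in the repository.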


Want more advice on your Apache Subversion installation? We have a full series of SVN refcards for free download, covering hot topics such as branching and merging, and best practices. You can find out more at

Adding and Deleting Files from the Command Line

When working with files under Apache Subversion’s version control, eventually you will need to start adding and removing files from your project. This week’s tip explains how to add a file to a project at the working copy level or, alternatively, commit it straight to the central repository. It will also highlight how to delete a file, either by scheduling it for deletion via the working copy or deleting it straight from the central repository.

Adding Files

Files can be added to a project via the working copy. After you’ve added the file to your working copy, it’ll be sent to the central repository and shared with the rest of your team the next time you perform an ‘svn commit.’

To add a file to your working copy (and schedule it for addition the next time you perform a commit) run:

svn add (working-copy-location)/file-to-be-added

In this example we’re adding a file called ‘executable’ to the trunk directory of the ‘NewRepo’ working copy.


You’ll need to perform a commit to send this item to the repository and share it with the rest of your team.


Deleting Files 

Once you start adding files to your working copy, sooner or later you’ll need to remove files. When files are deleted in the working copy, they’re scheduled for deletion in the repository the next time you perform a commit, in exactly the same way as the ‘svn add’ command.

Schedule files for deletion in the working copy by running:

svn delete (working-copy-location)/file-to-be-deleted

In this example, we’re scheduling ‘executable.png’ for deletion.


Alternatively, you can delete files from the repository immediately. Note, this operation creates a new revision and therefore requires a log message.

svn delete -m "log message" (repository-URL)/file-to-be-deleted
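Both halves of this tip can be sketched in one scratch session (an illustration only, assuming svn and svnadmin are on your PATH; the file name follows the post’s ‘executable.png’ example):

```shell
#!/bin/sh
# Scratch demo: add a file via the working copy, then delete it directly
# in the repository; assumes svn/svnadmin on PATH.
set -e
tmp=$(mktemp -d)
svnadmin create "$tmp/repo"
svn checkout -q "file://$tmp/repo" "$tmp/wc"
cd "$tmp/wc"
echo img > executable.png
svn add -q executable.png                 # scheduled for addition...
svn commit -q -m "add executable.png"     # ...and sent to the repository
# repository-side delete: creates a new revision, so -m is required
svn delete -q -m "remove executable.png" "file://$tmp/repo/executable.png"
svn ls "file://$tmp/repo"
```

After the repository-side delete, the final listing is empty; no second commit was needed because the delete itself created a new revision.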


Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:

Fetching Previous Revisions in Subversion

One of the fundamental features of Apache Subversion is that it remembers every change committed to the central repository, allowing users to easily recover previous versions of their project.

There are several methods available to users who wish to roll back to an earlier revision:

1) Perform a Checkout

By default, Subversion checks out the head revision, but you can instruct it to checkout a previous revision by adding a revision number to your command:

svn checkout -r(revision-number) (repository-URL)

In this example, we’re creating a working copy from the repository data in revision 5.


2) ‘Update’ to Previous Revision

If you already have a working copy, you can ‘update’ it to a previous revision by using ‘svn update’ and specifying the revision number:

svn update -r(revision-number) (working-copy-location)

In this example, we’re updating the ‘Project’ working copy to revision 5.

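The rollback-by-update can be sketched with two revisions standing in for the post’s revision 5 (an illustration only, assuming svn and svnadmin are on your PATH):

```shell
#!/bin/sh
# Scratch demo of rolling a working copy back with 'svn update -r';
# assumes svn/svnadmin on PATH.
set -e
tmp=$(mktemp -d)
svnadmin create "$tmp/repo"
svn checkout -q "file://$tmp/repo" "$tmp/wc"
cd "$tmp/wc"
echo v1 > file.txt
svn add -q file.txt
svn commit -q -m "r1"
echo v2 > file.txt
svn commit -q -m "r2"
svn update -q -r 1 .     # working copy rolled back to revision 1
cat file.txt
```

The working copy now shows the revision 1 contents; the repository itself is untouched, so a plain ‘svn update’ brings the latest revision back.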

3) Perform a Reverse Merge

Alternatively, you can perform a reverse merge on your working copy. Usually, a reverse merge is followed by an svn commit, which sends the previous revision to the repository. This effectively rolls the project back to an earlier version and is useful if recent commit(s) contain errors or features you need to remove.

To perform a reverse merge, run:

svn merge -r(revision-to-be-undone):(target-revision) (working-copy-path)
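A reverse merge followed by a commit can be sketched like this (an illustration only, assuming svn and svnadmin are on your PATH; ‘r2’ plays the role of the bad commit):

```shell
#!/bin/sh
# Scratch demo of a reverse merge plus commit, which rolls the project
# back repository-side; assumes svn/svnadmin on PATH.
set -e
tmp=$(mktemp -d)
svnadmin create "$tmp/repo"
svn checkout -q "file://$tmp/repo" "$tmp/wc"
cd "$tmp/wc"
echo good > file.txt
svn add -q file.txt
svn commit -q -m "r1: good change"
echo bad > file.txt
svn commit -q -m "r2: bad change"
svn update -q            # avoid merging into a mixed-revision working copy
# apply r2's changes backwards (2 -> 1), then commit the rollback
svn merge -q -r 2:1 "file://$tmp/repo" .
svn commit -q -m "r3: revert r2"
cat file.txt
```

Unlike ‘svn update -r’, the committed reverse merge creates a new revision, so the rollback is shared with the whole team.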


Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:

Hadoop Console: Simplified Hadoop for the Enterprise

We are pleased to announce the latest release in our string of Big Data announcements: the WANdisco Hadoop Console (WHC). WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

This innovative Big Data solution offers enterprise users:

  • An S3-enabled HDFS option for securely migrating from Amazon’s public cloud to a private in-house cloud
  • An intuitive UI that makes it easy to install, monitor and manage Hadoop clusters
  • Full support for Amazon S3 features (metadata tagging, data object versioning, snapshots, etc.)
  • The option to implement WHC in either a virtual or physical server environment.
  • Improved server efficiency
  • Full support for HBase

“WANdisco is addressing important issues with this product including the need to simplify Hadoop implementation and management as well as public to private cloud migration,” said John Webster, senior partner at storage research firm Evaluator Group. “Enterprises that may have been on the fence about bringing their cloud applications private can now do so in a way that addresses concerns about both data security and costs.”

More information about WHC is available from the WANdisco Hadoop Console product page. Interested parties can also download our Big Data whitepapers and datasheets, or request a free trial of WHC. Professional support for our Big Data solutions is also available.

This latest Big Data announcement follows the launch of our WANdisco Distro, the world’s first production-ready version of Apache Hadoop 2.

Subversion Tip of the Week

Getting Help With Your Subversion Working Copy

When it comes to getting some extra help with your Apache Subversion installation, you will find plenty of documentation online and even a dedicated forum where SVN users can post their questions and answer others. However, Subversion also comes with some handy built-in commands that can show you specific information about your working copy, files, directories, and all of Subversion’s subcommands and switches. This post explains how to access all of this information from the command line.

1) SVN Help

One of the most useful features of command line Subversion is the instant access to its built-in documentation through the ‘svn help’ command. To review all of the details about a particular subcommand, run:

svn help (subcommand)

In the example below, we’ve requested information on the ‘unlock’ subcommand. The printout includes all the additional switches that can be used in conjunction with ‘svn unlock.’

svn help unlock

Alternatively, if you need to see a list of all the available subcommands, simply run ‘svn help.’

svn help

2) SVN Info

If you need more information about the paths in a particular working copy, run the ‘svn info’ command. This will display:

  • Path
  • Repository URL
  • Repository Root
  • Repository UUID
  • Current revision number
  • Node Kind
  • Schedule
  • Information on the last change that occurred (author, revision number, date)
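On a freshly created working copy, ‘svn info’ already reports most of those fields. A minimal sketch, assuming svn and svnadmin are on your PATH:

```shell
#!/bin/sh
# Scratch demo of 'svn info' on a fresh working copy;
# assumes svn/svnadmin on PATH.
set -e
tmp=$(mktemp -d)
svnadmin create "$tmp/repo"
svn checkout -q "file://$tmp/repo" "$tmp/wc"
# prints Path, URL, Repository Root, Repository UUID, Revision,
# Node Kind and last-change details
svn info "$tmp/wc"
```

Each field appears on its own ‘Name: value’ line, which also makes the output easy to grep from scripts.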


3) SVN Status

This command prints the status of your files and directories in your local working copy:

svn status (working-copy-path)


Want more advice on your Apache Subversion installation? We have a full series of SVN refcards for free download, covering hot topics such as branching and merging, and best practices. You can find out more at


WANdisco Launches Apache Hadoop Forum

Last week, we launched WANdisco Distro (WDD), a fully tested, production-ready version of Apache Hadoop 2 that undergoes the same rigorous quality assurance process as our enterprise software solutions. To support the needs of WDD users and the wider Apache Hadoop community, we’ve also launched a dedicated Apache Hadoop forum.

In addition to sections on the enterprise Hadoop products WDD, Non-Stop NameNode and WANdisco Hadoop Console, forum users can connect with other users and get advice on their Hadoop installations – especially installing and configuring Hadoop, and running Hadoop on Amazon’s Simple Storage Service.

The Hadoop forum is also the place to connect with WANdisco’s core Hadoop developers. These include Dr. Konstantin V. Shvachko, a veteran Hadoop developer, member of the team that created the Hadoop Distributed File System (HDFS) and current member of the Apache Hadoop PMC; Jagane Sundar, who has extensive big data, cloud, virtualization, and networking experience and former Director of Hadoop Performance and Operability at Yahoo!; and Dr. Konstantin Boudnik, one of the original developers of Hadoop and founder of Apache BigTop.

This forum is intended to be a useful resource for the Apache Hadoop community, so we’d love to hear your feedback on the Hadoop Forum. If there’s a section or functionality you would like to suggest we add to improve your forum experience, please let us know. You can leave a post at the forum, at this blog or Contact Us directly.

We look forward to hearing from you!

WANdisco Joins Fusion-io Technology Alliance Program

WANdisco is excited to announce its partnership with Fusion-io. Following the launch of our first Big Data offering, the world’s first production-ready Apache Hadoop 2 distro, we’ve joined Fusion-io’s Technology Alliance Program. This program focuses on working with leaders in strategic market segments to deliver proven solutions, access to resources and expertise to enhance the value of technology offerings.

“With rapid growth in big data demands around the world, customers require proven solutions and expertise that deliver Hadoop availability with no downtime or data loss,” said Tyler Smith, Fusion-io’s Vice President of Alliances. “WANdisco is a valuable addition to our Technology Alliance Program as we work together to fulfill the market demand for innovative and proven big data solutions.”

As mentioned, this partnership news follows the launch of WANdisco Distro (WDD), a fully tested, production-ready version of Apache Hadoop, based on the most recent Hadoop release. WDD lays the foundation for WANdisco’s upcoming enterprise Hadoop solutions, including the WANdisco Hadoop Console, a comprehensive, wizard-driven management dashboard and the Non-Stop NameNode, which combines our patented replication technology with open source Hadoop to deliver optimum performance, scalability and availability on a 24-by-7 basis.

You can find out more about the Technology Alliance announcement by reading the press release, or visiting Fusion-io’s Technology Alliance Program webpage.

Subversion Tip of the Week

Creating a New Directory in Apache Subversion

There are two ways to create a new directory in Apache Subversion. You can either create the directory in your working copy and then commit it to the repository as a separate operation, or simply create the new directory in the central repository.

1) Creating a directory in the working copy:

svn mkdir (working-copy-location/name-of-new-directory)

In this example, we’re creating a new directory called ‘Release2’ in the branches folder. You’ll need to perform a commit to send this new directory to the repository and share it with the rest of the team.


2) Creating a directory in the repository:

svn mkdir -m "log message" (repository-URL/name-of-new-directory)


3) The –parents Option

Note that regardless of whether you’re creating a new directory in the repository or in a working copy, the intermediate directories must already exist. If you need to create the intermediate directories, you must use the --parents option.

In this example, we’re creating two directories in the ‘NewRepo’ repository: a ‘Releases’ directory and a ‘Release3’ directory inside it.
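That two-directories-in-one-step case can be sketched against a scratch repository (an illustration only, assuming svn and svnadmin are on your PATH):

```shell
#!/bin/sh
# Scratch demo of 'svn mkdir --parents': the intermediate 'Releases'
# directory is created on the way to 'Release3', all in one revision;
# assumes svn/svnadmin on PATH.
set -e
tmp=$(mktemp -d)
svnadmin create "$tmp/repo"
svn mkdir -q --parents -m "create Releases/Release3" \
    "file://$tmp/repo/Releases/Release3"
svn ls "file://$tmp/repo/Releases"
```

Without --parents, the same command would fail because ‘Releases’ does not exist yet.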


Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:

New Webinar Replay: The Future of Big Data for the Enterprise

You may have heard that we’ve just launched the world’s first production-ready Apache Hadoop 2 distro. This WANdisco Distro (WDD) is a fully tested, production-ready version of Apache Hadoop, based on the most recent Hadoop release. We’re particularly excited, as the release of WDD lays the foundation for our upcoming enterprise Hadoop solutions. If you want to find out more about WANdisco’s plans for big data, the replay of our ‘The Future of Big Data for the Enterprise’ webinar is now available.

This webinar is led by WANdisco’s Chief Architect of Big Data, Dr. Konstantin Shvachko, and Jagane Sundar, our Chief Technology Officer and Vice President of Engineering for Big Data. Jagane and Konstantin were part of the original Apache Hadoop team, and have unparalleled expertise in Big Data.

This 30 minute webinar replay covers:

  • The cross-industry growth of Hadoop in the enterprise.
  • The new “Active-Active Architecture” for Apache Hadoop that improves performance.
  • Solving the fundamental issues of Hadoop: usability, high availability, HDFS’s single-point of failure and disaster recovery.
  • How WANdisco’s active-active replication technology will alleviate these issues by adding high-availability, data replication and data security to Hadoop, taking a fundamentally different approach to Big Data.

You can watch the full ‘The Future of Big Data for the Enterprise’ replay, along with our other webinars, at our Webinar Replays page.

WANdisco Launches World’s First Production-Ready Apache Hadoop 2 Distro


We’re excited to announce the launch of our WANdisco Distro (WDD) a fully tested, production-ready version of Apache Hadoop 2. WDD is based on the most recent Hadoop release, includes all the latest fixes and undergoes the same rigorous quality assurance process as our enterprise software solutions.

The team behind WDD is led by Dr. Konstantin Boudnik, who is one of the original Hadoop developers, has been an Apache Hadoop committer since 2009 and served as a Hadoop architect with Yahoo! This dedicated team of Apache Hadoop development, QA and support professionals is focused exclusively on delivering the highest quality version of the software.

We are also now offering enterprise-class professional support for organizations deploying Hadoop clusters that utilize WDD. Delivered by our team of open source experts, WANdisco’s professional support for Hadoop includes online service request and case tracking, customer discussion forums, online access to service packs and patches, indemnification coverage, Hadoop cluster health checks, consulting and training and more. You can find out more about the available support options at

We’re particularly excited to make this announcement, as WDD lays the foundation for our enterprise Hadoop solutions that deliver 24-by-7 availability, scalability and performance globally, without any downtime or data loss.

“This is one of a number of key Big Data product announcements WANdisco will be making between now and the upcoming Strata 2013 Big Data conference in Santa Clara, CA, February 26-28. It’s a great time for enterprises requiring a hardened, non-stop Hadoop,” said David Richards, CEO of WANdisco. “Only our patented active-active technology removes the single point of failure inherent in Hadoop and works locally and globally. We are excited to have Dr. Konstantin Boudnik, one of the original developers of Hadoop, leading this rollout.”

You can learn more about WDD at the official press release, or by visiting the Download WANdisco Distro webpage.

Subversion Tip of the Week

Intro to the ‘svnversion’ Command

If you need to discover the revision number of your Apache Subversion working copy, you can use the svnversion command. This is particularly useful when your working copy contains mixed revisions and you want to find out the range of revisions currently present. Run the command, followed by the location of your working copy:

svnversion (working copy path)

This will print either a single revision number or the revision range. In this example, the working copy contains files at revision 31 and revision 32:
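The output looks like this (the path is a placeholder):

```shell
# A working copy with items at r31 and r32 prints a mixed-revision range
svnversion /path/to/working-copy
# 31:32   (a single number, e.g. 32, means all items share one revision;
#          an 'M' suffix would indicate local modifications)
```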


Additional Useful Options

1) --no-newline

Removes the usual newline from the printed output.

svnversion -n (working copy path)

2) --committed

Uses the last-committed revisions rather than the current ones.

svnversion -c (working copy path)

3) --version

Prints the version of svnversion you’re using, and additional information such as compile date and copyright disclaimers.

svnversion --version

Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:

WANdisco Teams up with Cloudera

We’re pleased to announce that WANdisco is now an authorized member of the Cloudera Connect Partner Program. This program focuses on accelerating the innovative use of Apache Hadoop for a range of business applications.

“We are pleased to welcome WANdisco into the Cloudera Connect network of valued service and solution providers for Apache Hadoop and look forward to working together to bring the power of Big Data to more enterprises,” said Tim Stevens, Vice President of Business and Corporate Development at Cloudera. “As a trusted partner, we will equip WANdisco with the tools and resources necessary to support, manage and innovate with Apache Hadoop-based solutions.”

As a member of Cloudera Connect, we are proud to add Cloudera’s extensive tools, use case insight and resources to the expertise of our core Hadoop committers.

You can learn more about this program at Cloudera’s website and by reading the official announcement in full.

At WANdisco, we’re working on our Hadoop-based products, including WANdisco Non-Stop NameNode, which will enable each NameNode server to support simultaneous read and write requests, while balancing workload across servers for optimum scalability and performance.

You can learn more about Non-Stop NameNode, and our other upcoming Hadoop-based offerings at our Hadoop Big Data Products page.

Why DConE is Ideal

In a previous article, I stated that DConE’s performance in coordinating distributed replication was “mathematically ideal.”  Without going into the actual mathematics, one way to shed a little more light on this technology is to examine what distributed computing defines as the properties of “safety” and “liveness”:

Safety: “Will never do anything wrong”
Liveness: “Will eventually do something right”

Note that we need both properties because it’s easy for a program to satisfy only the safety principle, simply by never doing anything!

DConE is ideal in the sense that we are always safe, and live to the extent allowed by physics¹.

Always Safe

With DConE there are no error edge conditions to discover. Paxos (and thus DConE) contains the entire set of failure conditions inside its algorithm².

This is especially important in the failure-rich environment of distributed computing.  The classic “Fallacies of Distributed Computing” outlines the major failure points encountered by the unwary or inexperienced.

Beyond the event horizon

There’s an inevitable tradeoff between safety and liveness, which you can also think of as a tradeoff between correctness and performance. By trading correctness for performance, it’s possible to move theoretical performance of distributed coordination beyond the Paxos 100% correctness event horizon.

How important is your data?

Less than 100% correctness might be an acceptable tradeoff if your data is not very important to you. One example might be an e-commerce website that deliberately chooses to not globally coordinate inventory for products because of performance limitations in their coordination technology. There’s a chance that the customer may order a product that is reported to be in stock, only to be informed later that there was a mistake, and it is backordered. In this case, the system traded correctness for performance, and it’s a business decision to tolerate the occasional misstep.

At the event horizon

DConE sits right at the event horizon of maximum performance with 100% correctness. So when your data Absolutely, Positively has to be there intact, DConE offers you that guarantee with the maximum possible performance.

¹ ² Y. Aahlad

WANdisco’s January Roundup

Happy new year from WANdisco!

This month we have plenty of news related to our move into the exciting world of Apache Hadoop. Not only did another veteran Hadoop developer join our ever-expanding team of experts, but we announced a partnership with Cloudera, and WANdisco CEO David Richards and Vice President of Big Data Jagane Sundar met with Wikibon’s lead analyst for an in-depth discussion on active-active big data deployments.


You may have heard that AltoStor founders and core Apache Hadoop creators Dr. Konstantin Shvachko and Jagane Sundar joined WANdisco last year. Now we’re excited to announce that another veteran Hadoop developer has joined our Big Data team. Dr. Konstantin Boudnik is the founder of Apache BigTop and was a member of the original Hadoop development team. Dr. Boudnik will act as WANdisco’s Director of Big Data Distribution, leading WANdisco’s Big Data team in the rollout of certified Hadoop binaries and a graphical user interface. Dr. Boudnik will ensure quality control and stability of the Hadoop open source code.

“In building our Big Data team, we’ve been seeking Hadoop visionaries and authorities who demonstrate leadership and originality,” said David Richards, CEO of WANdisco. “Konstantin Boudnik clearly fits that description, and we’re honored that he’s chosen to join our team. He brings great professionalism and distribution expertise to WANdisco.”

Also on the Big Data front, CEO David Richards and Vice President of Big Data Jagane Sundar spoke to Wikibon’s lead analyst about our upcoming solution for active-active big data deployments.

“We can take our secret sauce, which is this patented active-active replication algorithm, and apply it to Hadoop to make it bullet-proof for enterprise deployments,” said David Richards. “We have something coming out called the Non-Stop NameNode … that will ensure that Hadoop stays up 100% of the time, guaranteed.”

Watch the ‘WANdisco Hardening Hadoop for the Enterprise’ video in full, or read Wikibon’s Lead Big Data Analyst Jeff Kelly’s post about the upcoming Non-Stop NameNode.

Capping off our Big Data announcements, WANdisco is now an authorized member of the Cloudera Connect Partner Program. This program focuses on accelerating the innovative use of Apache Hadoop for a range of business applications.

“We are pleased to welcome WANdisco into the Cloudera Connect network of valued service and solution providers for Apache Hadoop and look forward to working together to bring the power of Big Data to more enterprises,” said Tim Stevens, Vice President of Business and Corporate Development at Cloudera. “As a trusted partner, we will equip WANdisco with the tools and resources necessary to support, manage and innovate with Apache Hadoop-based solutions.”

As a member of Cloudera Connect, we are proud to add Cloudera’s extensive tools, use case insight and resources to the expertise of our core Hadoop committers.

You can learn more about this program at Cloudera’s website and by reading the official announcement in full.


On the Subversion side of things, the SVN community announced their first release of 2013, with an update to the Subversion 1.6 series.

Apache Subversion 1.6.20 includes some useful fixes for 1.6.x users:

  • Vary: header added to GET responses
  • Fix fs_fs to cleanup after failed rep transmission.
  • A fix for an assert with SVNAutoVersioning in mod_dav_svn

Full details on Apache Subversion 1.6.20 can be found in the Changes file. As always, the latest, certified binaries can be downloaded for free from our website, along with the latest release of the Subversion 1.7 series.

How many developers can a single Apache Subversion server support? In his recent blog post, James Creasy discussed how DConE replication technology can support Subversion deployments of 20,000 or more developers.

“While impressive, DConE is not magic,” writes James. “What DConE delivers is a completely fault tolerant, mathematically ideal coordination engine for performing WAN connected replication.”

In another new DConE post, James explains where DConE fits into the ‘software engineering vs. computer science’ debate, and warns that “in the world of distributed computing, you’d better come armed with deep knowledge of the science.”

Finally, WANdisco China, a Wholly Foreign Owned Enterprise (WFOE), was announced this month, following WANdisco’s first deal in China with major telecommunications equipment company Huawei. From this new office we’ll be providing sales, training, consulting and 24/7 customer support for WANdisco software solutions sold in China, and we’re excited to be expanding our activities within this region.

“We view China as an emerging and high growth market for WANdisco,” said David Richards. “It was a natural progression to establish our Chengdu office as a WFOE and ramp up staff there as so many companies have operations in the country. We are excited about this announcement and look forward to the growth opportunities this brings.”

To keep up with all the latest WANdisco news, be sure to follow us on Twitter.


Subversion and Git (git-svn): A New SVNForum

You’ll soon notice a new discussion area on SVNforum for using Git and Subversion together. Subversion has been the de facto face of SCM for a number of years, and now many open source and other projects are migrating to Git, a DVCS (Distributed Version Control System).

Often a step in that migration is to use a hybrid Subversion/Git environment with features like git-svn that allow exporting code to Git and pushing content back to the central Subversion server.  Just as we are Supporting Git to Support You, we hope to further support you by hosting conversation on important trends affecting the Subversion community, such as the rising popularity of Git.

Do you use Git and Subversion together? Are there any challenges you’re facing working with both? Which toolsets do you use? How has this changed your development environment?

Whether you’re over the learning curve and now seeing the benefits, or just beginning to work in a Git/SVN environment and have a bunch of questions, head on over to the forum and let us know.

This blog was co-authored with James Creasy, our Senior Director of Product Management.

WANdisco Joins Hortonworks’ Technology Partner Program

We’re pleased to announce that WANdisco has joined the Hortonworks Technology Partner Program. This program aims to support and accelerate the growth of the Apache Hadoop ecosystem. As a part of the Hortonworks Technology Partner Program, we will offer our Big Data products on the Hortonworks Data Platform, which is powered by Apache Hadoop.

The Hortonworks Data Platform delivers an enterprise-class distribution of Apache Hadoop that is endorsed and adopted by some of the largest vendors in the IT ecosystem.

“We are pleased to welcome WANdisco into the Hortonworks Technology Partner Program,” said Mitch Ferguson, vice president of business development, Hortonworks. “We look forward to working with WANdisco to deliver innovative Apache Hadoop-based solutions for the enterprise.”

Our upcoming Big Data products will remove the single point of failure inherent in Hadoop, providing enterprises with non-stop availability and allowing servers to be taken offline for planned maintenance without interrupting user access.

“WANdisco is bringing active-active replication technology to enterprises for high-availability global Hadoop deployments,” said David Richards, WANdisco CEO. “Hortonworks Data Platform customers will greatly benefit from WANdisco non-stop Big Data solutions through this partnership.”

You can learn more at the official press release, or get more information on the Technology Partner Program at Hortonworks’ website.

Subversion Tip of the Week

Intro to Automatic Properties 

Properties are a powerful and useful feature of Apache Subversion, but they can be easily overlooked. If you’re regularly using properties in your project, it’s a good idea to configure Subversion to add properties to new files automatically.

Subversion already sets some properties automatically: whenever you add a new file, it sets a value for the MIME type and decides whether the file is executable. You can extend this by leveraging Subversion’s automatic property setting feature, “auto-props.” With auto-props enabled, you can perform tasks such as automatically inserting keywords into text files and ensuring every file has EOLs that are consistent with the OS.

To enable auto-props:

1) Locate and open the config file on your local machine.

2) Scroll down to the [miscellany] section and uncomment the following line:

# enable-auto-props = yes


3) Just below this line, edit the ‘Section for configuring automatic properties’ text according to the properties you want to apply to your files.
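As a rough illustration, the relevant part of the config file might end up looking like this (the file patterns and properties shown are just examples, not a recommended set):

```ini
[miscellany]
enable-auto-props = yes

[auto-props]
### Section for configuring automatic properties.
*.c = svn:eol-style=native
*.txt = svn:eol-style=native;svn:keywords=Author Date Id Rev
*.sh = svn:eol-style=native;svn:executable
*.png = svn:mime-type=image/png
```

Each line maps a filename pattern to one or more property assignments, separated by semicolons, which Subversion applies whenever a matching file is added or imported.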

Apache Bloodhound 0.4 (Incubating) Released

The Apache Bloodhound team has just announced their first release of 2013. Bloodhound (Incubating) is a software collaboration tool that builds on the proven project management and issue tracking system of Trac.

The just-released Bloodhound 0.4 (Incubating) includes the following highlights:

  • White-labeling for error messages and basic branding
  • Improvements to the quick ticket creation form including the ability to specify the select fields and their order
  • A new ‘in-place’ edit and workflow control replacing the ticket edit form
  • Various bug fixes

Congratulations to the Apache Bloodhound team on their 0.4 release!

Although WANdisco are sponsoring some of the initial committers, one of the Apache Bloodhound project’s core goals is to create a strong developer community around the Trac code base in a vendor-neutral location. If you’re interested in participating in the Apache Bloodhound project, we invite you to review the information available at the ‘Getting Involved With Apache Bloodhound’ page.

Apache Bloodhound 0.4 can be downloaded from the Bloodhound website.

Intro to Tagging in Subversion

What are Tags?

In Apache Subversion, branches and tags are essentially the same thing: a copy of an existing folder and its contents, in a new location within the same repository. The key difference is the way the user handles these folders.

Tags should be used as “cold milestones” that provide a snapshot of your project at a specific point in time. Although a revision already acts as a snapshot, tags allow you to give them a more human-readable name (“Release 7.0” rather than “Revision 24973”). Tagging also allows you to take snapshots of specific sections of the repository.

How Do I Create a Tag?

Creating a tag uses the ‘svn copy’ command, followed by the section of the repository that’s being tagged, and the location where the new tag will reside. As ever, don’t forget to leave a log message:

svn copy -m "useful log message" (URL) (location of new tag)

In this example, we are creating a new tag called ‘Release1,’ by copying all the files currently in the trunk.
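The command itself might look like this (the repository URL is hypothetical):

```shell
# Copy the current trunk to tags/Release1
svn copy -m "Tagging Release1" \
  http://svn.example.com/repo/trunk \
  http://svn.example.com/repo/tags/Release1
```

In Subversion a copy is a cheap, constant-time server-side operation, which is why tagging even a very large project is inexpensive.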


Tip. Whether you are creating a branch or a tag, it’s worth putting some thought into your naming strategy. A coherent naming strategy allows others to get an insight into what development work is happening in which branch/tag, at a glance.


Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:

WANdisco Announces New Wholly Foreign Owned Enterprise

We are excited to announce WANdisco China, a Wholly Foreign Owned Enterprise located in Chengdu, China. From this new base in Chengdu, we’ll provide the full suite of WANdisco services: sales, training, consulting and 24/7 customer support for software solutions sold in the country. Chengdu will also serve as our Chinese headquarters, supporting our existing office in Beijing.

“We view China as an emerging and high growth market for WANdisco,” said David Richards, CEO of WANdisco. “It was a natural progression to establish our Chengdu office as a WFOE and ramp up staff there as so many companies have operations in the country. We are excited about this announcement and look forward to the growth opportunities this brings.”

This announcement follows our first sale in China with major telecommunications equipment company Huawei.

You can find out more about WANdisco China at the official press release.

WANdisco Announces Free Subversion Webinars for 2013

After getting a fantastic response to our free Subversion webinars in 2012, we’re pleased to announce the first webinars of 2013.

Getting Started With Subversion

A one-hour course to kickstart newcomers to both Subversion and version control in general, covering everything you need to get up and running with this popular open source version control system.

The webinar will cover:

  • Repository basics
  • Performing commits
  • Performing checkouts
  • Simple merging
  • The working copy
  • Simple branching

‘Getting Started With Subversion’ will take place on January 24th, 2013 at 10:00am PST / 1:00pm EST, so be quick to avoid missing out.

Branching Options for Development

Branching can cause confusion for many Subversion users, but once mastered it can be one of Subversion’s greatest strengths. In this one hour webinar our Subversion expert will cover the different types of branches and deep dive into their particular uses. Topics covered will include:

  • What is concurrent development?
  • What is a branch?
  • Different development models
  • What triggers a branch?
  • Communication for branching and merging

‘Branching Options for Development’ will take place on February 14th, 2013. Registration will open soon, so keep checking back for all the latest information or follow us on Twitter.

Getting Info out of Subversion

Need to build a report based on your Subversion project? This free-to-attend online training will share techniques for extracting information out of Subversion, for reporting purposes.

Topics will include:

  • Log information
  • Property information
  • Difference information
  • Using Project and User information
  • Using Hook scripts to log information

‘Getting Info out of Subversion’ will take place on February 28th, 2013.

Have an idea for a future webinar, or feedback on our current schedule of free Subversion training? Please don’t hesitate to leave us a comment on this blog, or Contact Us directly.


Subversion Tip of the Week

Deleting a Branch

When something is deleted from Apache Subversion, it only disappears from the revision where it was deleted and all subsequent revisions. Deleting a branch has no effect on repository size, as the branch still exists in all previous revisions and can be viewed or recovered at any time. So, the question is: why would you ever delete a branch?

1) House-keeping – regularly deleting branches reduces the clutter in the branches directory, and makes browsing the repository less confusing. When all abandoned branches are routinely deleted, a quick glance at the branches directory can tell you which branches are still active.

2) Following a merge – in some situations where you’ve finished working on a branch and merged the changes into the trunk, the branch may become completely redundant and you should consider deleting the branch to reduce clutter in the repository.

3) Following reintegration – the ‘--reintegrate’ option of ‘svn merge’ allows merging from a branch back to the trunk, by replicating only the changes that are unique to that branch. A ‘--reintegrate’ merge uses Subversion’s merge-tracking features to calculate the correct revision ranges to use, and checks to ensure the branch is truly up-to-date with the latest changes in the trunk. These checks ensure the reintegration will not override work other team members have committed to the trunk.

Once a ‘--reintegrate’ merge has been performed, the branch shouldn’t be used for development, as any future reintegration will be interpreted as a trunk change by Subversion’s merge tracking. This will trigger an attempt to merge the branch-to-trunk merge back into the branch. To avoid this, the reintegrated branch should be deleted.
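Sketching that workflow with hypothetical URLs and branch names:

```shell
# From an up-to-date trunk working copy, reintegrate the branch...
svn merge --reintegrate http://svn.example.com/repo/branches/feature-x
svn commit -m "Reintegrate feature-x into trunk"

# ...then delete the now-redundant branch to keep merge tracking clean
svn delete -m "Remove reintegrated feature-x branch" \
  http://svn.example.com/repo/branches/feature-x
```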

How to Delete a Branch

To delete a branch, run the ‘svn delete’ command alongside the URL of the branch and an appropriate log message. In this example, we are deleting the ‘bug fix branch.’

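The command might look like this (the branch URL is illustrative):

```shell
# Delete the branch directly in the repository, with a log message
svn delete -m "Deleting the bug fix branch" \
  http://svn.example.com/repo/branches/bug-fix-branch
```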

“Shared” is Code for “Single Point of Failure”

You’ll encounter the word “shared” often in the computing world, and I’ve begun to think that word is sometimes a clue to finding SPOFs (Single Points of Failure).  As you likely know, a SPOF is one thing that, if it breaks, brings down a whole system.

For Want of a Nail

“For Want of a Nail…” is a proverb with a rich history, describing how the loss of a seemingly unlikely and unimportant thing can snowball into monumental consequences. I’d call the nail in the proverb a SPOF for losing the Kingdom. But as Jez Humble points out in his article “On Antifragility in Systems and Organizational Architecture,” commenting on Nassim Taleb’s book Antifragile, it’s not always easy to recognize the SPOF, or how multiple redundant components together might collectively form a SPOF. Incidentally, a good example of that might be Netflix’s 2012 Christmas Eve outage. In this case, the whole of Amazon EC2 was, and remains, a Netflix single point of failure.

Smoke that leads to fire

Since WANdisco’s DConE technology operates without a single point of failure, bringing cloud-like capabilities to existing applications, I’m interested in where this capability can be put to good use.  When hunting product opportunities, it’s always nice to have help knowing where to look.  In this case, I think the word “shared” is a red flag for SPOFs, and is the smoke that helps lead to the fire of SPOF fragility.


I didn’t have to go far for an example: the Hadoop NameNode, the subject of WANdisco’s recent AltoStor acquisition, relies on a shared edits directory, which is a single point of failure for a Hadoop deployment.

What examples can you find of “shared” unmasking a SPOF that needs to be addressed?

Software Engineering versus Computer Science

Reading Jeff Hodges’ excellent post, containing “rubber meets the road” advice for budding engineers in distributed computing projects, caused me to reflect on the rise of ad hoc engineering over the certainty of science.

When I took my first college computer science course, the emphasis was on the science part of computer science, and “writing code” was primarily proof that you understood the science. As dissatisfying as that seemed at the time, my theoretical start later felt like a competitive advantage in a world dominated by so-called hackers: often brilliant, rarely disciplined.

In more recent years, the ability to quickly hack together a product, often by combining existing open source technologies in some novel way, has become the foundation of companies such as Twitter, Instagram, and GitHub. My impression, though, is that these are, at least initially, triumphs of software engineering more so than computer science.  Perhaps the impressiveness of these engineering accomplishments leads some to think that most computer software problems are software engineering problems, as opposed to computer science problems?

Much of Jeff’s post is a warning to the unwary hacker plowing into the engineering of distributed systems. Simplistic ideas around replication and transaction coordination are the ticket to a rabbit hole from which there is no escape. Put simply: you can’t debug your way out of not knowing the math.

And there is a mention of math, although it’s in a section titled “Coordination is very hard” that goes on to recommend you avoid it whenever possible. This is, of course, my excuse to mention our patented, WAN-capable Paxos coordination implementation called DConE.  DConE currently powers our SCM-related CVS and Subversion MultiSite products, and will soon come to bear on other important industry needs. In contrast to software-engineering-driven technology, WANdisco’s core technology is clearly and firmly rooted in computer science.

You’ll not catch me saying engineering or science is better. Different types of problems and opportunities will advise different emphasis on one side or the other. But in the world of distributed computing, you’d better come armed with deep knowledge of the science.

Subversion Tip of the Week

Performing Shallow Checkouts

A standard Apache Subversion checkout includes the whole file and directory content of a specified repository. If you don’t require this level of content, you can perform a ‘shallow checkout’ that prevents Subversion from descending recursively through the repository, by restricting the depth of the checkout.

This is achieved by running the ‘svn checkout’ command as normal, but with an additional option:

  • --depth immediates: check out the target and any of its immediate file or directory children. Note that the child directories themselves will be empty.
  • --depth files: check out the target and any of its immediate file children.
  • --depth empty: check out the target only. None of its file or directory children will be included in the operation.

In this example we are performing a shallow checkout on a ‘bug fix branch’ located within the branches folder, specifying that only the immediate file children should be included (--depth files):

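The command might look like this (the repository URL is illustrative):

```shell
# Check out only the branch directory and its immediate file children
svn checkout --depth files \
  http://svn.example.com/repo/branches/bug-fix-branch
```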

If you’re using SmartSVN, the cross-platform graphical client for Subversion, you can set the checkout depth from the drop-down menu when performing your checkout.


Not yet started with SmartSVN? Claim your free 30 day trial at

Veteran Apache Hadoop Developer Joins WANdisco

We’re pleased to announce that founder of Apache BigTop and one of the original developers of Apache Hadoop, Dr. Konstantin Boudnik has joined WANdisco as Director of Big Data Distribution. An Apache Hadoop committer since 2009, Dr. Boudnik has previously worked at Cloudera, Yahoo! and Sun Microsystems.

“In building our Big Data team, we’ve been seeking Hadoop visionaries and authorities who demonstrate leadership and originality,” said David Richards, CEO of WANdisco. “Konstantin Boudnik clearly fits that description, and we’re honored that he’s chosen to join our team. He brings great professionalism and distribution expertise to WANdisco.”

Dr. Boudnik will lead the Big Data team in the rollout of certified Hadoop binaries and its related GUI. Dr. Boudnik joins core Hadoop creators Dr. Konstantin Shvachko and Jagane Sundar as part of WANdisco’s Big Data team.

Apache Subversion 1.6.20 Released

The Apache Subversion community has just announced their first release of 2013, with an update to the Subversion 1.6 series.

Apache Subversion 1.6.20 includes some useful fixes for users of 1.6.x:

  • Vary: header added to GET responses
  • Fix fs_fs to cleanup after failed rep transmission.
  • A fix for an assert with SVNAutoVersioning in mod_dav_svn

More information on Subversion 1.6.20 can be found in the Changes file. As always, the latest, certified binaries can be downloaded for free from the WANdisco website or, if you’re looking for an easy-to-use cross platform Subversion client, why not claim your free 30 day trial of SmartSVN Professional?

Find out more about the benefits of SmartSVN, by visiting the SmartSVN ‘Features’ page.

Scaling Subversion with WANdisco

How many developers do you think can be supported by a single Apache Subversion server? I normally hear answers ranging from 300 up to 1,000 or so with specialized hardware. While this addresses the vast majority of development projects, that level of scalability falls short of the largest enterprise development needs.

With WANdisco’s patented DConE replication technology, we can support Subversion deployments for the largest enterprise development requirements of 20,000 or more developers. Add built-in HADR (High Availability and Disaster Recovery) and transparent global multi-site capabilities, and now the enterprise has a proven and popular industry standard SCM tool that can support any known size of development project.

What is DConE?

DConE, WANdisco’s core technology, powers our SVN MultiSite products. DConE stands for “Distributed Coordination Engine” and implements a WAN (Wide Area Network) capable Paxos coordination algorithm, which is then integrated with Subversion to create the SVN MultiSite product.  DConE implements a mathematically proven ideal solution; there’s no need to wonder if a more efficient coordination solution exists.

How does DConE achieve this?

One reason is that every developer performs all read and write operations to a local, LAN (Local Area Network) resident server, even if the replicated server exists in multiple locations across the world on a WAN. It seems slightly magical, yet accurate, to think of this as a single instance of Subversion existing simultaneously on different machines. Distributed computing refers to this as “one copy equivalence.”

Further, even though writes must occur on each machine in the replication group, they take place continuously and in the background. So another reason for greater scalability is that maximum write load tends to follow the sun, giving the moonlit servers time to catch up. This has the effect of distributing the write load across multiple machines.

Another reason is that Subversion, as with most other data repositories, typically supports far more read traffic than write traffic (c. 95-99% reads). The write data for each server is delivered just once per transaction; the subsequent, much higher read load occurs against the local server, without creating additional WAN traffic.

Not Magic

While impressive, DConE is not magic. The quality of the connection and, ultimately, the speed of light limit how quickly data can be moved from one replication node to another. What DConE delivers is a completely fault-tolerant, mathematically ideal coordination engine for performing WAN-connected replication.

Becoming Cloud-like

As I wrote in the article “Putting the Cloud into an Eyedropper”, the end result is to give today’s essential applications, originally architected for single machines and co-located teams, a new way forward as multi-machine, multi-master, and multi-site replication groups.

The alternative approach often finds these same applications forced into awkward service for globally distributed development teams and modern public, hybrid, and private cloud IT environments. They may employ fragile master/slave replication schemes to scale read traffic, attempt grueling ground up rewrites, or be forced to simply wait to be supplanted by newer technology.

Subversion at Extreme Scale

DConE is an effective way for Subversion to support far more developers than previously thought possible, collaborating on massive enterprise software projects.  We would love to hear about your experiences, positive and negative, with extreme Subversion deployments.

Subversion Tip of the Week

Solving Conflicts with SmartSVN

Conflicts can be tricky for Apache Subversion users, but SmartSVN comes with a dedicated ‘Conflict Solver’ that takes the pain out of resolving them. SmartSVN’s built-in Conflict Solver combines the freedom of a general, three-way-merge with the ability to detect and resolve any conflicts that occur during the development lifecycle.

To access this conflict solver, open the ‘Query’ menu and select ‘Conflict solver.’

conflict solver

The contents of the two conflicted files are displayed in the left and right text areas, and the differences between the left and right content are highlighted by coloured regions within the text views.

Once you have finished editing your files, open the ‘Modify’ menu and select ‘Mark Resolved’ to mark the conflicting file(s) as resolved. In this dialog, you can also opt to:

  • Leave as is – apply no further modifications to the resolved file.
  • Take old – accept the version in the working copy, as it was before the update or merge  was performed.
  • Take new – accept the pristine copy after the update or merge was performed.
  • Take working copy – accept the pristine copy before the update or merge was performed.

If you’re working with conflicted directories, you have the option to ‘Resolve files and subdirectories recursively.’  If selected, all conflicting files and directories within the selected directory will be resolved.

Note, you must resolve all the conflicts before you can commit the file(s)/directories.
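For comparison, command-line users can clear a conflict with ‘svn resolve’ once they’re done editing (a sketch; the file name is hypothetical, and the ‘--accept’ values shown are standard Subversion 1.7 options):

```shell
# After manually editing the conflicted file, keep the current
# working-copy contents and mark the conflict as resolved:
svn resolve --accept=working report.txt

# Alternatively, discard local edits and take the incoming
# repository version wholesale:
svn resolve --accept=theirs-full report.txt
```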

Not yet started with SmartSVN? Claim your free 30 day trial at

Subversion Tip of the Week

Getting More Out of SVN Log

Apache Subversion’s ‘SVN log’ is a useful command, allowing you to view the entire history of a directory, including who made changes, at what revision, the exact time, and any accompanying log messages.

In its most basic form, a simple ‘SVN log’ command will give you a printout of everything that’s changed in your entire directory, across all revisions:

svn log

However, there are plenty of options if you need to tailor the output of ‘SVN log’. If this is far too much information, you can specify a particular revision:

svn log -r14

svn log 2

Or, if you’re after all the available information about this particular revision, add the verbose option to your command:

svn log -r14 -v

svn log 3

Alternatively, if you do not need to review the log message, you can add the quiet option to suppress this piece of information:

svn log -r14 -q

svn log 4

Of course, both of these options can also be applied to an entire directory:

svn log -q
svn log -v

When the two options are combined, ‘svn log’ prints just each revision’s summary line and its changed file paths, without log messages:

svn log -q -v

svn log 5
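These switches also accept revision ranges, which is handy for summarizing a span of history (a sketch against a hypothetical working copy):

```shell
# List the changed paths for revisions 10 through 14,
# newest first, with log messages suppressed:
svn log -r 14:10 -q -v
```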

Looking for an easy-to-use cross platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting:

Problem-centric Products

Some businesses have a better understanding of their company’s products than they do of their customers’ problems. That’s understandable because it’s easier to focus on the thing you build, you control, you sell. Customer problems are slippery, complex, and sometimes uncomfortably unrelated to your product. The result can be a very product-centric message. Ever go to a website full of product features and information, and have a hard time figuring out what the thing is supposed to do?

One problem with a product-centric message is that the hard work of mapping problems to your product is left up to your customer. If it’s too hard to figure out if a product actually will solve my problem, I may just give up. That’s why it’s so important to deeply understand the challenges faced by your customers, and speak to the problems first whenever possible.

Of course nothing is new; a search quickly turned up this nice example: “Sell the Problem” by Seth Godin.

I’m using these principles to create a new product I’m working on for WANdisco. Listening to people tasked with solving specific problems gives me a clear idea of the challenges they face. We can then take a deep look at the technologies we have at our disposal for creating products that solve those problems. Then we also hope there’s a measure of inspiration with the perspiration, as Edison suggested: can we wrap up everything we’ve learned into a solution that goes past solving problems and into enabling previously unrevealed possibilities?

That’s the high bar we hope to reach: transformative solutions inspired by real problems.  Although it’s likely to be powered by WANdisco’s transformative replication technology, you’ll know from the start, YOUR problems were where we started looking first.

The problems we know best have to do with global and local multi-site, replication, high availability, and scalability.  If you are experiencing this with an existing application, let us know!

Happy Holidays from WANdisco!

wandisco-christmas-2012-blog (1)

2012 has been an amazing year for WANdisco: a successful flotation, a patent approval, two acquisitions, a global series of WANdisco-organized Subversion conferences and upgrading our sponsorship of the Apache Software Foundation were just some of the highlights of the past twelve months.

We have plenty of exciting announcements planned for 2013, but for now we’d just like to thank everyone who has used our products, joined us for a webinar, eTraining or enterprise training session, picked us for your support needs, or provided us with the crucial feedback we need to make our products and services even better.

And, of course, we’d like to wish you a very happy holidays from the WANdisco Team!

Subversion Tip of the Week

Intro to Subversion Switch

When working with branches, Apache Subversion provides a useful shortcut for switching your current working copy to a new branch’s location, without the overhead of checking out a fresh working copy containing the targeted branch. Leveraging this functionality, it’s possible to build a working copy that contains data from a range of repository locations, although these locations must originate from the same repository.

To achieve this, enter the ‘svn switch’ command, followed by the URL you wish to switch to:

svn switch repository-URL

svn switch
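As a concrete sketch (the repository URL and branch name here are hypothetical), switching a trunk working copy to a branch and back looks like:

```shell
# From the root of a working copy checked out from trunk,
# re-point it at a bug-fix branch in the same repository:
svn switch http://svn.example.com/repos/project/branches/bug-fix

# Confirm where the working copy now points:
svn info

# Switch back to trunk when the branch work is done:
svn switch http://svn.example.com/repos/project/trunk
```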

Users of SmartSVN, the cross-platform graphical client for SVN, can perform a switch simply by opening the ‘Modify’ menu and selecting the ‘Switch…’ option.

svn switch 2

In the subsequent dialog, enter the new URL – or select the ‘Browse’ option to view the different branches – and specify whether you’re switching to the latest revision (HEAD) or a particular revision number.

svn switch 3

Tip. Use the ‘Select…’ button to view more information about the different revisions.

Not yet started with SmartSVN? Claim your free 30 day trial at



WANdisco’s December Roundup

2012 has been an amazing year for WANdisco, but we still had a few more announcements for you this month, including news that we are extending our suite of service and support offerings to include the Git distributed version control system.

“Expanding our support offering to include Git is an obvious step to enable you to deploy and support the trending as well as the leading SCM tools,” said James Creasy, WANdisco’s Senior Director of Product Management in his ‘Supporting Git to Support You’ blog.

Our Git support includes:

  • Guaranteed response times
  • Availability 24 hours a day, 7 days a week
  • Contact via email or toll-free telephone

Git support is available immediately, please contact for more information.

This news comes hot on the heels of our Big Data and Apache Hadoop announcements last month. WANdisco CEO, David Richards, and core creators of Apache Hadoop Dr. Konstantin Shvachko and Jagane Sundar recently conducted a webinar that covered how WANdisco sees the future of big data, following our acquisition of AltoStor.

This 30 minute webinar discussed:

  • The cross-industry growth of Hadoop in the enterprise.
  • How Hadoop’s limitations, including HDFS’s single-point of failure, are impacting the productivity of the enterprise.
  • How WANdisco’s replication technology will alleviate these issues by adding high-availability, data replication and data security to Hadoop.

If you missed out on the webinar, you can still find out all about WANdisco, Hadoop and Big Data by checking out the webinar slides on SlideShare.

The Subversion community also found the time for one more release before the holiday season. Subversion 1.7.8 features plenty of fixes and enhancements, including:

  • Adding missing attributes to “svn log -v --xml” output
  • Fixing a hang that could occur during error processing
  • Fixing incorrect status returned by 1.6 API
  • Adding Vary: header to GET responses to improve cacheability
  • Subversion 1.7.8 ignores file externals with mergeinfo when merging

A full list of everything that’s new in Subversion 1.7.8 is available at the Changes file. Free binaries of Subversion 1.7.8 are available to download through the WANdisco website. Users of SmartSVN, the popular cross-platform client for Subversion can also grab an update: SmartSVN 7.5.3 features plenty of improvements and bug fixes, including:

  • Refresh option to ask for master password, if required
  • Support launching on Solaris
  • Fix for an internal error that could occur after removing Tag
  • Special characters (e.g. ‘:’) no longer cause problems in URLs

More information on the latest changes is available at the SmartSVN changelog. If you haven’t tried SmartSVN yet, remember you can claim your 30 day free trial of SmartSVN Professional by visiting

There’s been plenty of new content at the blog this month, including the first blog from Hadoop core creator Jagane Sundar, WANdisco’s new Vice President of Engineering of Big Data.

In his ‘Design of the Hadoop HDFS NameNode: Part 1 – Request processing’ post, Jagane demonstrates how a client RPC request to the Hadoop HDFS NameNode flows through the NameNode.

hadoop namenode

When you think of “the cloud”, what comes to mind? In his first WANdisco blog, Director of Product Management James Creasy takes a fresh look at one of IT’s biggest buzzwords. He argues that most of the applications used by enterprises were not originally architected for cloud infrastructures, and looks at how this problem could be overcome by “putting the cloud into a virtual eyedropper.” In his second blog, ‘Planned Downtime Is Still Downtime’ James argues that planned outages of critical applications aren’t inevitable:

“Through the 20th century and into the 21st we’ve gritted our teeth against this inescapable cost. We’ve built massive failover servers, concocted elaborate master/slave replication schemes, and built businesses around High Availability and Disaster Recovery scenarios (HADR). We thought we were doing the best we could.

And we were, until recently.”

You can read the ‘Planned Downtime is Still Downtime’ post in full at the WANdisco blog.

We also had some new team photos taken by our friend and neighbour at our Electric Works offices, Matt Lollar. We even managed to get some shots outside in the Sheffield sunshine.

wandisco team

Finally, to celebrate the holiday season, we had a little ‘Decorate Your Desk’ competition in the Sheffield office. We even had a roaring log fire!

christmas fire

We have plenty of exciting announcements planned for 2013, but for now we’d just like to thank everyone who has used our products, joined us for a webinar, eTraining or enterprise training session, picked us for your support needs, or provided the crucial feedback we need to make our products and services even better. And, of course, we’d like to wish you a very happy holidays from the WANdisco Team.



Apache Subversion 1.7.8 Released

It may be nearly the end of the year, but there’s still time for one more release of Apache Subversion. SVN 1.7.8 features plenty of fixes and enhancements, including:

  • Adding missing attributes to “svn log -v --xml” output
  • Fixing a hang that could occur during error processing
  • Fixing incorrect status returned by 1.6 API
  • Adding Vary: header to GET responses to improve cacheability
  • Subversion 1.7.8 ignores file externals with mergeinfo when merging

A full list of everything that’s new in Subversion 1.7.8 is available at the Changes file. Free binaries of Subversion 1.7.8 are available to download through the WANdisco website.

Looking for a cross-platform Subversion client? Claim your free 30 day trial of SmartSVN Professional by visiting

Subversion Properties: Needs Lock

Apache Subversion is built around a ‘copy-modify-merge’ model, but there are times when a ‘lock-modify-unlock’ model may be appropriate (for example, when you are working on image files, which cannot easily be merged.) Once you’ve mastered locking and unlocking, you may want to look at Subversion’s dedicated lock property, which is useful to help prevent time wasted working on files that have already been locked by others.

If present on a file, the ‘Needs Lock’ property reminds users that they should lock the file before starting work on it. The SmartSVN Subversion client automatically sets files which require locking (due to this property) to read-only when checking out or updating. When a lock token is present, the file becomes read/write. This prevents users from making changes that are difficult to merge, on a file that is also being edited in another working copy (for example, two users simultaneously editing an image file.)

To add this property to a file using SmartSVN, select a file and click the ‘Change Needs Lock’ option in SmartSVN’s ‘Locks’ menu.

smartsvn needs lock

SmartSVN will automatically add this property to the selected file.

smartsvn properties change

To remove the ‘Needs Lock’ property, repeat the process: selecting ‘Change Needs Lock’ for a file that already has this property will remove it instead.
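Command-line users can manage the same property with ‘svn propset’ and ‘svn propdel’ (a sketch; the file name is hypothetical, and since Subversion only checks for the property’s presence, its value is irrelevant):

```shell
# Mark an image as requiring a lock before editing;
# checkouts and updates will receive it read-only:
svn propset svn:needs-lock '*' images/logo.png
svn commit -m "Require a lock on logo.png" images/logo.png

# Remove the property to restore normal read/write behaviour:
svn propdel svn:needs-lock images/logo.png
```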

Looking for a cross-platform graphical client for Apache Subversion? Claim your free 30 day trial at


Subversion Tip of the Week

Intro to Branching in Subversion

For developers who aren’t familiar with version control, getting to grips with Apache Subversion’s branching functionality can be daunting. But, when used correctly, branching can be one of Subversion’s most powerful and useful features.

What is a Branch?

Put simply, a branch is a line of development that exists independently of another line.

intro to branching


Every project is different, but there are some common reasons why developers might choose to create a branch:

  • To isolate development – This is often called a ‘concurrent development branch,’ and is the most common reason to create a branch.
  • To experiment with new technology – Branches can be used to test a new technology that might not necessarily become a part of the main branch. If the experiment works out, the new code can easily be merged into the trunk.
  • To tag a project and keep track of released code – This is often called a ‘release branch.’ Upon completing a release, it is a good idea to create a branch and tag it with a name that is meaningful to that release (for example, “release-2.0.”) This serves as an easy-to-retrieve record of how your code looked in a certain release.
  • To fulfil a particular customer’s need – This is often called a ‘custom branch,’ and is useful when you need to modify your project for a particular customer’s requirements.

Creating Your First Branch

To create a new branch, use the ‘svn copy’ command followed by a log message (-m) and the URL of both the resource being branched, and the location where you want to create your new branch:

svn copy -m "Creating a new branch" folder-being-branched location-of-new-branch

In this example, we are creating a new branch called ‘bug fix branch’ inside the ‘branches’ folder that will contain all the files in the trunk:
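With a hypothetical repository URL, such a command might look like:

```shell
# Create the branch as a cheap server-side copy of trunk;
# no working copy is needed for a URL-to-URL copy:
svn copy -m "Creating a new branch" \
  http://svn.example.com/repos/project/trunk \
  http://svn.example.com/repos/project/branches/bug-fix
```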


Looking for a cross-platform graphical client for Apache Subversion? Claim your free 30 day trial at

Supporting Git to Support You

I can only hope that today’s press release sparked a water-cooler conversation or two:

“WANdisco announces enterprise Git support? I thought they were a Subversion company.”

Of course, we still are a Subversion company, empowering the leading version control system with global scalability and high availability.  Subversion’s dominance of the SCM market has been so great (three times the share of the next most used tool, according to a 2009 Forrester Research survey) that one could be forgiven for seeing Subversion as the de facto “face of SCM.”

However, here at WANdisco, we take a strategic view of the industry to provide products and solutions for wherever your version control and SCM needs take you. Scanning the crowd of faces, new and old, in the SCM marketplace is part of our job. Clearly there is one important newcomer that has proven its mettle in open source and is now knocking on the door of the enterprise: the Git distributed version control system.

Expanding our support offering to include Git is an obvious step to enable you to deploy and support the trending as well as the leading SCM tools.  And if you’ve discovered other challenges deploying Subversion or Git in your environments, we’d love to hear about them.

Intro to SmartSVN’s Revision Graph

SmartSVN’s built-in Revision Graph tool provides a quick and easy way to get an overview of the hierarchical history of your files and directories. This history is primarily represented as ‘nodes’ and ‘branches.’ (Note, because the Revision Graph displays branches and tags, the Tag-Branch-Layout must be configured correctly.)

The Revision Graph is useful for seeing at a glance:

  • Merged revisions
  • Revisions that have yet to be merged
  • Whether a merge occurred in the selected revision
  • Which changes happened in which branch
  • Which revision represents which tag
  • When a file was moved, renamed or copied, along with its history

To access the Revision Graph, open SmartSVN’s ‘Query’ menu and select ‘Revision Graph.’

revision graph 2

This will open the main Revision Graph screen.


The main section of the Revision Graph is the ‘Revisions’ pane. This displays the parent-child relationships between your revisions. Revisions are arranged by date, with the newest at the top.

In the Revision Graph, there are four main types of relationships that are represented by different line styles:

  • Normal parent-child relationship – represented by thick, coloured lines.
  • Complete merge relationship – created by performing a merge commit where all the source revisions are merged into the target. When ‘Merge Arrows’ is enabled, it is represented by thin, coloured lines.
  • Partial merge relationship – created by performing a partial merge (cherry-pick) where not all source revisions are merged into the target. When ‘Merge Arrows’ is enabled it’s represented by thin, coloured, dashed lines.
  • URL relationship – this is where branches have the same URL, but are not related (e.g. when you have removed and re-added a branch.) When ‘Join Same Locations’ is enabled, this is represented by thin, gray lines.

In addition to the main ‘Revisions’ pane, the SmartSVN Revision Graph includes several additional views:

1) Revision Info – displays attributes of the selected revision (revision number, date, state, author who created the revision etc.)

revision info

2) Directories and files – displays the files that were modified as part of the selected revision.

revision graph 3

From this screen, you can access several additional options:

  • Export – export the Revision Graph as an HTML file by selecting ‘Export as HTML…’ from the ‘Graph’ menu. In the recently-released SmartSVN 7.5, this export function was improved to support exporting smaller HTML graphic files.
  • Merge Arrows – select the ‘Show Merge Arrows’ option from the ‘Query’ menu to display the merge arrows. These point from the merge source to the merge target revisions. If the merge source is a range of revisions, the corresponding revisions will be surrounded by a bracket.
  • Merge Sources – select the ‘Show Merge Sources’ option from the ‘Query’ menu to see which revisions have been merged into the currently selected target revision.

Haven’t yet started with SmartSVN? You can claim your free SmartSVN Professional trial by visiting

5 Ways to Customize SmartSVN

One of the features that makes the SmartSVN graphical client for Apache Subversion particularly user-friendly is how easy it is to tailor it to your particular needs. Users have plenty of options available for fine-tuning their SmartSVN installation – in this post, we’ll cover just a few of them.

1) Grouping Revisions

The ‘Transactions’ menu offers several options for grouping your revisions, helping you to get a wealth of information about different revisions at-a-glance.


The different categories are:

  • Ungrouped
  • Weeks
  • Days
  • Date
  • Authors
  • Location (repository)

Changing your revision grouping affects the ‘Transactions’ view (by default, this is located in the bottom right-hand corner of the screen.)

2) Show Branches and Tags

This option can also be found in the ‘Transactions’ menu. When selected, ‘Show Branches and Tags’ displays not just the working copy revisions, but also revisions of the trunk, branches and tags.

3) Accelerators

SmartSVN allows you to set custom ‘accelerators’ for common tasks (e.g. copy name, show more, change commit message etc.) To customize these accelerators, open the ‘Edit’ menu and select the ‘Customize…’ option. Open the ‘Accelerators’ tab.


To create a new accelerator or change an existing one, select a menu item and click the ‘Accelerator’ field at the bottom of the screen.

accelerator 2

Press the key combination you wish to add and click ‘Assign’ to confirm. If you ever need to restore the default accelerators, select the appropriate menu Item and click ‘Reset.’

4) Customize the Toolbar

You can customize SmartSVN’s toolbar to display the icons you use the most. To get started, open the ‘Edit’ menu and select ‘Customize,’ and open the ‘Toolbar’ tab.


Use the ‘Add’ button to add one or more available buttons, and ‘Remove’ to remove buttons from the toolbar. Drag-and-drop the ‘Selected’ icons to rearrange the order in which they appear in the toolbar.

Right-clicking on any icon in the ‘Selected’ pane will bring up a context menu with some additional options:

smartsvn customize

  • Add Fixed Separator – add a separator before the currently selected icon.
  • Add Stretching Separator – add a stretching space before the currently selected button.

(Note, the remaining space is divided and assigned to the stretching separators.)

5) Context Menu

The third option available when you open the ‘Customize…’ dialog is ‘Context Menu.’ In this dialog, open the ‘Context Menu’ dropdown and select which menu you wish to change.

context menu

Once you have chosen your context menu, the available menu items are displayed in the left-hand pane, and the current context menu structure in the right-hand pane.

context menu 2

Use the ‘Add’ and ‘Remove’ buttons to customize the selected context menu or, alternatively, use drag-and-drop. Right-clicking on an item in the right-hand pane will bring up some additional options.

context menu 3

The ‘Add Separator’ and ‘Add Menu’ options can be used to add the corresponding item before the selected item on the right-hand side. You can also click ‘Reset to Defaults’ to undo any changes you’ve made to the context menu.

Remember, you can claim your 30 day free trial of SmartSVN Professional now.




Planned Downtime Is Still Downtime

Unlike unplanned outages of your key systems that cause staff to grumble, pace around the office or think of heading home, most of us endure planned outages of critical applications with a sense of resigned inevitability.  Who has never seen the “Server down for planned maintenance 5-6PM Friday” email? Even for SaaS applications, it is not uncommon to see the cold shoulder of a “Site down for planned service, try again later” message glaring at you from your browser.

WANdisco’s CTO and VP Engineering of Big Data, Jagane Sundar, reports that the weekly planned outages in the big data Hadoop infrastructure while he was at Yahoo! were one of the biggest pain points his team faced. Big business is becoming dependent on big data.

Through the 20th century and into the 21st we’ve gritted our teeth against this inescapable cost. We’ve built massive failover servers, concocted elaborate master/slave replication schemes, and built businesses around High Availability and Disaster Recovery scenarios (HADR). We thought we were doing the best we could.

And we were, until recently.

WANdisco’s active:active replication blows the doors off the horse and carriage of master/slave.  With the ability to give existing applications some cloud-like capabilities, DConE is a patented replication technology that slices the head off of a single point of failure, and with it many of the headaches traditionally associated with master/slave replication and HADR architectures.

Eliminating “planned outages” is one such desirable outcome.  A node can be taken offline and upgraded or serviced. When it can communicate with the group again, it will transparently catch up with the queued transactions.  The users of the replication should not even be aware changes are being made to the infrastructure.

So the next time you see a notice for a planned outage, consider asking yourself if that application could be DConE enabled. And then let us know!

Subversion Properties: Ignore Patterns

When working on your Apache Subversion project with SmartSVN, you may include items in your working copy that do not need to be placed under version control. While it’s possible to include unversioned items in your working copy, these items will continue to clutter up your Subversion dialogs, e.g. the commit dialog: