The inspiration for WANdisco Fusion


Roughly two years ago, we sat down to start work on a project that finally came to fruition this week.

At that meeting, we set ourselves the challenge of redefining the storage landscape. We wanted to map out a world with completely shared storage, where the landscape nevertheless remained entirely heterogeneous.

Why? Because we’d witnessed the beginnings of a trend that has only grown more pronounced with the passage of time.

From the moment we started engaging with customers, we were struck by the extreme diversity of their storage environments. Regardless of whether we were dealing with a bank, a hospital or a utility provider, different types of storage had been introduced across every organization for a variety of use cases.

In time, however, these same companies wanted to start integrating their different silos of data, whether to run real-time analytics or to gain a full 360-degree view of performance. Yet preserving diversity across data centers was critical, given that each storage type has its own strengths.

They didn’t care about uniformity. They cared about performance, and that meant having the best of both worlds. Delivering this became the Holy Grail – at least in the world of data centers.

This isn’t quite the Gordian Knot, but it is certainly a very difficult, complex problem – and possibly one that could only be solved with our core patented IP, DConE.

Then we had a breakthrough.

Months later, I’m proud to formally release WANdisco Fusion (WD Fusion), the only product that enables WAN-scope active-active synchronization across different storage systems, bringing them together into one place.

What does this mean in practice? Well, it means that you can use Hadoop distributions like Hortonworks, Cloudera or Pivotal for compute, Oracle BDA for fast compute, and EMC Isilon for dense storage. You could even use a complete variety of Hadoop distros and versions. Whatever your set-up, with WD Fusion you can leverage new and existing storage assets immediately.

With it, Hadoop is transformed from being something that runs within a data center into an elastic platform that runs across multiple data centers throughout the world. WD Fusion allows you to update your storage infrastructure one data center at a time, without impacting application availability or having to copy vast swathes of data once the update is done.

When we were developing WD Fusion we agreed upon two things. First, we couldn’t produce anything that made changes to the underlying storage system – this had to behave like a client application. Second, anything we created had to enable a complete single global namespace across an entire storage infrastructure.

With WD Fusion, we allow businesses to bring together different storage systems by leveraging our existing intellectual property – the same Paxos-powered algorithm behind Non-Stop Hadoop, Subversion MultiSite and Git MultiSite – without making any changes to the platform you’re using.
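To make that concrete, here’s a minimal sketch of the pattern those two design decisions imply – a client-side proxy that submits every namespace mutation to a coordination engine for global ordering before applying it to whatever storage sits underneath. This is an illustration only, not WANdisco’s actual code: the interfaces below are hypothetical stand-ins for the Hadoop FileSystem API and DConE.

    import java.io.IOException;

    // Hypothetical stand-ins for the Hadoop FileSystem API and the
    // DConE coordination engine; illustrative only.
    interface Storage {
        void mkdir(String path) throws IOException;
    }

    interface CoordinationEngine {
        // Blocks until the operation is globally agreed; returns its
        // position in the agreed total order.
        long agree(String operation);
    }

    // A client-side proxy: applications use it like ordinary storage,
    // but every namespace mutation is agreed across data centers first,
    // and the underlying storage system is never modified.
    class CoordinatedStorage implements Storage {
        private final Storage underlying;        // any storage type
        private final CoordinationEngine engine; // Paxos-based ordering

        CoordinatedStorage(Storage underlying, CoordinationEngine engine) {
            this.underlying = underlying;
            this.engine = engine;
        }

        @Override
        public void mkdir(String path) throws IOException {
            // Agree first: every replica applies mutations in the same
            // order, which is what keeps one global namespace consistent.
            engine.agree("mkdir:" + path);
            underlying.mkdir(path);
        }
    }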

Another way of putting it is we’ve managed to spread our secret sauce even further.

We have some of the best computer scientists in the world working at WANdisco, but I’m confident that this is the most revolutionary project any of us have ever worked on.

I’m delighted to be unveiling WD Fusion. It’s a testament to the talent and character of our firm, the result of looking at an impossible scenario and saying: “Challenge accepted.”

WANdisco Fusion Q&A with Jagane Sundar, CTO

Tuesday we unveiled our new product: WANdisco Fusion. Ahead of the launch, we caught up with WANdisco CTO Jagane Sundar, who was one of the driving forces behind Fusion.

Jagane joined WANdisco in November 2012 after the firm’s acquisition of AltoStor and has since played a key role in the company’s product development and rollout. Prior to founding AltoStor along with Konstantin Shvachko, Jagane was part of the original team that developed Apache Hadoop at Yahoo!.

Jagane, put simply, what is WANdisco Fusion?

JS: WANdisco Fusion is a wonderful piece of technology that’s built around a strongly consistent transactional replication engine, allowing for the seamless integration of different types of storage for Hadoop applications.

It was designed to help organizations get more out of their Big Data initiatives, answering a number of very real problems facing the business and IT worlds.

And the best part? All of your data centers are active simultaneously: You can read and write in any data center. The result is you don’t have hardware that’s lying idle in your backup or standby data center.

What sort of business problems does it solve?

JS: It provides two important new capabilities for customers. First, it keeps data consistent across different data centers, no matter where they are in the world.

And it gives customers the ability to integrate different storage types into a single Hadoop ecosystem. With WANdisco Fusion, it doesn’t matter if you are using Pivotal in one data center, Hortonworks in another and EMC Isilon in a third – you can bring everything into the same environment.

Why would you need to replicate data across different storage systems?

JS: The answer is very simple. Anyone familiar with storage environments knows how diverse they can be. Different types of storage have different strengths depending on the individual application you are running.

However, keeping data synchronized across systems is very difficult to get right. Fusion removes this challenge while maintaining data consistency.

How does it help future-proof a Hadoop deployment?

JS: We believe Fusion will form a critical component of companies’ workflow update procedures. You can update your Hadoop infrastructure one data center at a time, without impacting application availability or having to copy massive amounts of data once the update is done.

This helps you deal with updates from both Hadoop and application vendors in a carefully orchestrated manner.

Doesn’t storage-level replication work as effectively as Fusion?

JS: The short answer is no. Storage-level replication is subject to latency limitations that are imposed by file systems. The result is you cannot really run storage-level replication over long distances, such as a WAN.

Storage-level replication is nowhere near as functional as Fusion: it has to happen at the LAN level, not over a true wide area network.
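To put rough numbers on that latency limitation – the round-trip times below are illustrative assumptions, not measurements – a replication scheme that blocks on a synchronous remote acknowledgement is capped by the network round trip:

    // Back-of-the-envelope: synchronous replication throughput is bounded
    // by round-trip time (RTT), because each write waits for a remote ack.
    public class SyncReplicationCeiling {
        public static void main(String[] args) {
            double lanRttMs = 0.5;  // assumed LAN round trip
            double wanRttMs = 60.0; // assumed cross-country WAN round trip

            // Maximum synchronous writes per second for a single writer:
            System.out.printf("LAN: ~%.0f writes/s%n", 1000.0 / lanRttMs); // ~2000
            System.out.printf("WAN: ~%.0f writes/s%n", 1000.0 / wanRttMs); // ~17
        }
    }

Broadly speaking, Fusion still crosses the WAN, but it pays that cost for agreement on the ordering of operations rather than for a blocking acknowledgement on every write.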

With Fusion, you have the ability to integrate diverse systems such as NFS with Hadoop, allowing you to exploit the full strengths and capabilities of each individual storage system – I’ve never worked on a project as exciting and as revolutionary as this one.

How did WANdisco Fusion come about?

JS: By getting inside our customers’ data centers and witnessing the challenges they faced. It didn’t take long to notice the diversity of storage environments.

Our customers found that different storage types worked well for different applications – and they liked it that way. They didn’t want strict uniformity across their data centers, but to be able to leverage the strengths of each individual storage type.

At that point we had the idea for a product that would help keep data consistent across different systems.

The result was WANdisco Fusion: a fully replicated transactional engine that makes the work of keeping data consistent trivial. You only have to set it up once and never have to bother with checking if your data is consistent.

This vision of a fully utilized, strongly consistent, diverse storage environment for Hadoop is what we had in mind when we came up with the Fusion product.

You’ve been working with Hadoop for the last 10 years. Just how disruptive is WANdisco Fusion going to be?

JS: I’ve actually been in the storage industry for more than 15 years now. Over that period I’ve worked with shared storage systems, and I’ve worked with Hadoop storage systems. WANdisco Fusion has the potential to completely revolutionize the way people use their storage infrastructure. Frankly, this is the most exciting project I’ve ever been part of.

As the Hadoop ecosystem evolved I saw the need for this virtual storage system that integrates different types of storage.

Efforts to make Hadoop run across different data centers have been mostly unsuccessful. For the first time, we at WANdisco have a way to keep your data in Hadoop systems consistent across different data centers.

The reason this is so exciting is because it transforms Hadoop into something that runs in multiple data centers across the world.

Suddenly you have capabilities that even the original inventors of Hadoop didn’t really consider when it was conceived. That’s what makes Fusion exciting.

Scalable and Secure Git

Now that WANdisco has released an integration between Git MultiSite and GitLab, it’s worth putting the entire Git lineup at WANdisco into perspective.

Git MultiSite is the core product providing active-active replication of Git repository data. This underpins our efforts to make Git more reliable and better performing. Active-active replication means that you have full use of your Git data at several locations, not just in a single ‘master’ Git server. You get full high availability and disaster recovery out of the box, and you can load balance your end user and build demands between several Git servers. Plus, users at every location get fast local read and write access. As one of our customers recently pointed out, trying to make regular Git mirrors work this way requires a few man-years of effort.

On top of Git MultiSite you have three options for user management, security, and collaboration.

  • Use WANdisco’s Access Control Plus for unified, scalable user and permission management. It features granular permissions, delegated team management, and full integration with SVN MultiSite Plus for unified Subversion-Git administration.
  • Use Gerrit to take advantage of powerful continuous review workflows that underpin the Android community.
  • Use GitLab for an enterprise-grade social coding and collaboration platform.

Not sure which direction to take? Our solution architects help you understand how to choose between Subversion, Git, and all the other tools that you have to contend with.

Active-active strategies for data protection

A new report preview from 451 Research highlights some of the challenges facing data center operators. Two of the conclusions stood out in particular. First, disaster recovery (DR) strategies are top of mind as IT operations become increasingly centralized, which increases the cost of an outage. 42% of data center operators are evaluating DR strategies, and a majority (62%) are using active-active strategies for data protection. Second, data center operators are playing in a more complicated world now. The ability to operate applications and data centers in a hybrid cloud environment is called out as a particular area of interest.

These findings echo what we’re hearing from our own customers. For many enterprise IT architects, active-active data replication is a checklist item when deploying a vital service like a Hadoop cluster. Many customers of Non-Stop Hadoop and WANdisco Fusion buy our products for precisely that reason. And we’re also seeing strong interest in WANdisco Fusion’s unique ability to provide that replication between Hadoop clusters that use different distributions and storage systems, on-premises or in the cloud.

Read the full report when it’s available, and in the meantime our solution architects can help you evaluate your own DR and hybrid deployment strategies.

Improving HBase Scalability for Real-time Applications

When we introduced Non-Stop for Apache HBase, we explained how it would improve HBase reliability for critical applications. But Non-Stop for Apache HBase also uniquely improves HBase scalability and performance.

By providing multiple active-active region servers, Non-Stop for Apache HBase alleviates some common HBase performance woes. First, clients are load balanced between several region servers for any particular region. Spreading the load among several region servers softens the impact of problems like region ‘hot spots’.

[Diagram: Non-Stop for Apache HBase architecture across a WAN]

So far so good – but you might be thinking that you could get the same benefit by using HBase read-HA. However, HBase read-HA is limited to read operations in a single data center. Non-Stop for Apache HBase lets you put region servers in several data centers, and any of them can handle write operations. That gives you a few nice benefits:

  • Writes can be directed to any region server, reducing the chance that a single region server becomes a bottleneck due to hot spots or garbage collection.
  • Applications at other data centers now have fast access to a ‘local’ region server.
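To make that contrast concrete, here’s roughly what native HBase read-HA looks like from the client side, using the standard HBase 1.0+ client API (the table and row names below are placeholders): the application has to opt into timeline consistency on each read, may get stale data back, and writes still go to a single primary region server.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Consistency;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimelineReadExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("my_table"))) {

                Get get = new Get(Bytes.toBytes("row-42"));
                // Opt into read-HA: any replica may answer, possibly stale.
                get.setConsistency(Consistency.TIMELINE);

                Result result = table.get(get);
                if (result.isStale()) {
                    // Served by a secondary replica lagging the primary.
                    System.out.println("Stale read");
                }
                // Writes, by contrast, always go to the single primary
                // region server for the row's region.
            }
        }
    }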

Although the HBase community continues to try to improve HBase performance, there are some bottlenecks that just can’t be eliminated without active-active replication.  No other solution lets you use several active region servers per region, and put those region servers at any location without regard to WAN latency.

If you’ve ever struggled with HBase performance, you should give Non-Stop for Apache HBase a close look.

Reducing Hadoop network vulnerabilities for backup and replication

Managing network connections between Hadoop clusters in different data centers is a significant challenge for Hadoop and network administrators.  WANdisco Fusion reduces the number of connections required for any cross-cluster data flow, thereby reducing Hadoop network vulnerabilities for backup and replication.

DistCp is the tool used for data transfer in almost every Hadoop backup and workflow system. DistCp requires connectivity from each data node in the source cluster to each data node in the target cluster.

[Diagram: DistCp requires connections from each data node in the source cluster to each data node in the target cluster]

Typically each data node-to-data node connection requires configuring two connections, for inbound and outbound traffic, to cross the firewall and navigate any intermediate proxies. In a case where you have 16 data nodes in each cluster, that means 16 × 16 × 2 connections to configure, secure, and monitor – 512 in total! That’s a nightmare for Hadoop administrators and network operators. Just ask the people responsible for planning and running your Hadoop cluster.

WANdisco Fusion solves this problem by routing cross-cluster communication through a handful of servers.  As the diagram below illustrates, in the simplest case you’ll have one server in the source cluster talking to one server in the target cluster, requiring a grand total of 2 connections to configure.

[Diagram: WANdisco Fusion requires only a handful of network connections]

In a realistic deployment, you’d require additional connections for the redundant WANdisco Fusion servers – this is an active-active configuration after all.  Still, in a large deployment you’d see a few tens of connections, rather than many hundreds.
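The arithmetic is easy to check. Here’s a quick sketch – the Fusion server count is illustrative, and real deployments vary:

    // Firewall rules needed for cross-cluster replication: DistCp needs a
    // full mesh between data nodes, while Fusion only connects its own
    // replication servers.
    public class ConnectionCount {
        public static void main(String[] args) {
            int dataNodes = 16;     // data nodes per cluster
            int fusionServers = 2;  // illustrative redundant pair per cluster

            // One inbound and one outbound rule per server-to-server pair.
            int distcp = dataNodes * dataNodes * 2;         // 512
            int fusion = fusionServers * fusionServers * 2; // 8

            System.out.println("DistCp rules: " + distcp);
            System.out.println("Fusion rules: " + fusion);
        }
    }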

The most recent spate of data breaches is driving a 10% annual increase in cybersecurity spending.  Why make yourself more vulnerable by exposing your entire Hadoop cluster to the WAN?  Our solution architects can help you reduce your Hadoop network exposure.

Improving HBase Resilience for Real-time Applications

HBase is the NoSQL database of choice for Hadoop, and now supports critical real-time workloads in financial services and other industries. As HBase has grown more important for these workloads, the Hadoop community has focused on reducing potential down time in the event of region server failure. Rebuilding a region server can take 15 minutes or more, and even the latest improvements only provide timeline-consistent read access using standby region servers. In many critical applications, losing write access for more than a few seconds is simply unacceptable.

[Diagram: standard HBase architecture]

Enter Non-Stop for Apache HBase. Built on WANdisco’s patented active-active replication engine, it provides fully consistent active-active access to a set of replicated region servers. That means your HBase data is always safe and accessible for read and write activity.


By providing fully consistent active-active replication for region servers, Non-Stop for Apache HBase gives applications always-on read/write access for HBase.

Putting a replica in a remote location also provides geographic redundancy. Unlike native HBase replication, region servers at other data centers are fully writable and guaranteed to be consistent. Non-Stop for Apache HBase includes active-active HBase masters, so full use of HBase can continue even if an entire cluster is lost.

Non-Stop for Apache HBase also simplifies the management of HBase, as you no longer need complicated asynchronous master-slave setups for backup and high availability.

Take a look at what Non-Stop for Apache HBase can do for your low-latency and real-time analysis applications.

Monitoring active-active replication in WANdisco Fusion

WANdisco Fusion provides a unique capability: active-active data replication between Hadoop clusters that may be in different locations and run very different types of Hadoop.

From an operational perspective, that capability poses some new and interesting questions about cross-cluster data flow. At which cluster does data most often originate? How fast is it moving between clusters? And how much data is flowing back and forth?

WANdisco Fusion captures a lot of detailed information about the replication of data that can help answer those questions, and it’s exposed through a set of REST endpoints. The captured information includes:

  • The origin of replicated data (which cluster it came from)
  • The size of the files
  • Transfer rate
  • Transfer start, stop, and elapsed time
  • Transfer status

A subset of this information is visible in the WANdisco Fusion user interface, but I decided it would be a good chance to dust off my R scripts and do some visualization on my own.
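If you want to do the same, the metrics come straight over HTTP. I used R, but any HTTP client will do; here’s a minimal Java sketch (note that the endpoint path below is illustrative – check the Fusion REST API documentation for the real paths on your installation):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FusionTransferMetrics {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; substitute the documented path.
            URI endpoint = URI.create("http://fusion-server:8082/api/transfers");

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(endpoint).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // Each record carries origin cluster, file size, transfer rate,
            // start/stop/elapsed time, and status - ready for R, a
            // spreadsheet, or a dashboard.
            System.out.println(response.body());
        }
    }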

For example, I can see that the replication data transfer rate between my two EC2 clusters is roughly between 800 and 1200 kb/s.

And the size of the data is pretty small, between 600 and 900 kb.

Those are just a couple of quick examples that I captured while running a data ingest process. But over time it will be very helpful to keep an eye on the flow of data between your Hadoop clusters. You could see, for instance, if there are any peak ingest times for clusters in different geographic regions.

Beyond all the other benefits of WANdisco Fusion, it provides this wealth of operational data to help you manage your Hadoop deployment.  Be sure to contact one of our solutions architects if this information could be of use to you.

Transforming health care with Big Data

There’s a lot of hype around Big Data these days, so it’s refreshing to hear a real success story directly from one of the practitioners. A couple of weeks ago I was lucky enough to attend a talk given by Charles Boicey, an Enterprise Analytics Architect, at an event sponsored by WANdisco, Hortonworks, and Slalom Consulting. Charles helped put a Big Data strategy in place at the University of California, Irvine (UCI) Medical Center, and is now working on a similar project at Stony Brook.

If you’ve ever read any of Atul Gawande’s publications, you’ll know that the U.S. health care system is challenged by a rising cost curve. Thoughtful researchers are trying to address costs and improve quality of care by reducing error rates, focusing on root causes of recurring problems, and making sure that health care practitioners have the right data at the right time to make good decisions.

Mr. Boicey is in the middle of these transformational projects.  You can read about this work on his Twitter feed and elsewhere, and WANdisco has a case study available.  One thing that caught my attention in his latest talk is the drive to incorporate data from social media and wearable devices to improve medical care.  Mr. Boicey mentioned that sometimes patients will complain on Facebook while they’re still in the hospital – and that’s probably a good thing for the doctors and nurses to know.

And of course, all of the wearable devices that track daily activity and fitness would be a boon to medical providers if they could get a handle on that data easily.  The Wall Street Journal has a good write-up on the opportunities and challenges in this area.

It’s nice to see Big Data finding concrete applications that will truly benefit society. It’s not just a tool for making the web work better anymore.

Benefits of WANdisco Fusion

In my last post I described WANdisco Fusion’s cluster-spanning file system. Now think of what that offers you:

  • Ingest data to any cluster and share it quickly and reliably with other clusters. That’ll remove fragile data transfer bottlenecks while still letting you process data at multiple places to improve performance and get more utilization out of backup clusters.
  • Support a bimodal or multimodal architecture to enable innovation without jeopardizing SLAs. Perform different stages of the processing pipeline on the best cluster. Need a dedicated high-memory cluster for in-memory analytics? Or want to take advantage of an elastic scale-out on a cheaper cloud environment? Got a legacy application that’s locked to a specific version of Hadoop? WANdisco Fusion has the connections to make it happen. And unlike batch data transfer tools, WANdisco Fusion provides fully consistent data that can be read and written from any site.


  • Put away the emergency pager. If you lose data on one cluster, or even an entire cluster, WANdisco Fusion has made sure that you have consistent copies of the data at other locations.


  • Set up security tiers to isolate sensitive data on secure clusters, or keep data local to its country of origin.


  • Perform risk-free migrations. Stand up a new cluster and seamlessly share data using WANdisco Fusion. Then migrate applications and users at your leisure, and retire the old cluster whenever you’re ready.

Read more

Interested? Check out the product brief or call us for details.