Monthly Archive for March, 2015

Improving HBase Resilience for Real-time Applications

HBase is the NoSQL database of choice for Hadoop, and now supports critical real-time workloads in financial services and other industries. As HBase has grown more important for these workloads, the Hadoop community has focused on reducing potential downtime in the event of region server failure. Rebuilding a region server can take 15 minutes or more, and even the latest improvements only provide timeline-consistent read access using standby region servers. In many critical applications, losing write access for more than a few seconds is simply unacceptable.


Enter Non-Stop for Apache HBase. Built on WANdisco’s patented active-active replication engine, Non-Stop for Apache HBase provides fully consistent active-active access to a set of replicated region servers. That means that your HBase data is always safe and accessible for read and write activity.


By providing fully consistent active-active replication for region servers, Non-Stop for Apache HBase gives applications always-on read/write access for HBase.

Putting a replica in a remote location also provides geographic redundancy. Unlike native HBase replication, region servers at other data centers are fully writable and guaranteed to be consistent. Non-Stop for Apache HBase includes active-active HBase masters, so full use of HBase can continue even if an entire cluster is lost.

Non-Stop for Apache HBase also simplifies the management of HBase, as you no longer need complicated asynchronous master-slave setups for backup and high availability.

Take a look at what Non-Stop for Apache HBase can do for your low-latency and real-time analysis applications.

Monitoring active-active replication in WANdisco Fusion

WANdisco Fusion provides a unique capability: active-active data replication between Hadoop clusters that may be in different locations and run very different types of Hadoop.

From an operational perspective, that capability poses some new and interesting questions about cross-cluster data flow. Which cluster does data most often originate from? How fast is data moving between clusters? And how much data is flowing back and forth?

WANdisco Fusion captures a lot of detailed information about the replication of data that can help to answer those questions, and it’s exposed through a series of REST endpoints. The captured information includes:

  • The origin of replicated data (which cluster it came from)
  • The size of the files
  • Transfer rate
  • Transfer start, stop, and elapsed time
  • Transfer status

A subset of this information is visible in the WANdisco Fusion user interface, but I decided it would be a good chance to dust off my R scripts and do some visualization on my own.
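Here is a minimal sketch of that kind of analysis, written in Python rather than R. It assumes the transfer records have already been fetched and decoded into dictionaries; the field names (`origin`, `size_kb`, `elapsed_s`) and the sample values are hypothetical and are not taken from the WANdisco Fusion REST API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transfer records. The field names are illustrative assumptions,
# not the actual schema returned by the WANdisco Fusion REST endpoints.
transfers = [
    {"origin": "cluster-a", "size_kb": 600, "elapsed_s": 0.75},
    {"origin": "cluster-a", "size_kb": 900, "elapsed_s": 0.75},
    {"origin": "cluster-b", "size_kb": 800, "elapsed_s": 1.0},
]


def summarize_by_origin(records):
    """Group transfers by origin cluster; report count, volume, and mean rate."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["origin"]].append(rec)

    summary = {}
    for origin, recs in grouped.items():
        # Skip zero-length transfers so the rate is always defined.
        rates = [r["size_kb"] / r["elapsed_s"] for r in recs if r["elapsed_s"] > 0]
        summary[origin] = {
            "transfers": len(recs),
            "total_kb": sum(r["size_kb"] for r in recs),
            "mean_rate_kb_s": mean(rates) if rates else None,
        }
    return summary


for origin, stats in summarize_by_origin(transfers).items():
    print(origin, stats)
```

Grouping by origin addresses all three questions at once: where data originates, how fast it moves, and how much of it there is.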

For example, I can see that the replication data transfer rate between my two EC2 clusters is roughly between 800 and 1200 kb/s.

[Figure: file-xfer-rate]

And, the size of the data is pretty small, between 600 and 900 kb.

[Figure: file-xfer-size]

Those are just a couple of quick examples that I captured while running a data ingest process. But over time it will be very helpful to keep an eye on the flow of data between your Hadoop clusters. You could see, for instance, if there are any peak ingest times for clusters in different geographic regions.
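Finding peak ingest times could look something like the sketch below. It buckets transfer volume by hour of day for each origin cluster. As before, the record fields (`origin`, `start`, `size_kb`) are hypothetical, and the timestamps are assumed to be ISO 8601 strings in a single time zone.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records; field names and values are assumptions for illustration.
transfers = [
    {"origin": "cluster-a", "start": "2015-03-10T02:15:00", "size_kb": 700},
    {"origin": "cluster-a", "start": "2015-03-10T02:40:00", "size_kb": 900},
    {"origin": "cluster-a", "start": "2015-03-10T14:05:00", "size_kb": 600},
    {"origin": "cluster-b", "start": "2015-03-10T14:30:00", "size_kb": 800},
]


def peak_hour_by_origin(records):
    """Return {origin: (hour_of_day, total_kb)} for each origin's busiest hour."""
    volume = {}
    for rec in records:
        hour = datetime.fromisoformat(rec["start"]).hour
        # One Counter per origin, keyed by hour of day, accumulating kb moved.
        volume.setdefault(rec["origin"], Counter())[hour] += rec["size_kb"]
    return {origin: hours.most_common(1)[0] for origin, hours in volume.items()}


print(peak_hour_by_origin(transfers))
```

With clusters in different regions, converting the timestamps to one reference time zone first would make the peaks directly comparable.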

Beyond all the other benefits of WANdisco Fusion, it provides this wealth of operational data to help you manage your Hadoop deployment. Be sure to contact one of our solutions architects if this information could be of use to you.

Transforming health care with Big Data

There’s a lot of hype around Big Data these days, so it’s refreshing to hear a real success story directly from one of the practitioners.  I was lucky a couple of weeks ago to attend a talk given by Charles Boicey, an Enterprise Analytics Architect, at an event sponsored by WANdisco, Hortonworks, and Slalom Consulting.  Charles helped put a Big Data strategy in place at the University of California – Irvine (UCI) Medical Center, and is now working on a similar project at Stony Brook.

If you’ve ever read any of Atul Gawande’s publications, you’ll know that the U.S. health care system is challenged by a rising cost curve. Thoughtful researchers are trying to address costs and improve quality of care by reducing error rates, focusing on root causes of recurring problems, and making sure that health care practitioners have the right data at the right time to make good decisions.

Mr. Boicey is in the middle of these transformational projects.  You can read about this work on his Twitter feed and elsewhere, and WANdisco has a case study available.  One thing that caught my attention in his latest talk is the drive to incorporate data from social media and wearable devices to improve medical care.  Mr. Boicey mentioned that sometimes patients will complain on Facebook while they’re still in the hospital – and that’s probably a good thing for the doctors and nurses to know.

And of course, all of the wearable devices that track daily activity and fitness would be a boon to medical providers if they could get a handle on that data easily.  The Wall Street Journal has a good write-up on the opportunities and challenges in this area.

It’s nice to see that Big Data has concrete applications that will truly benefit society. It’s not just a tool for making the web work better anymore.

Benefits of WANdisco Fusion

In my last post I described WANdisco Fusion’s cluster-spanning file system. Now think of what that offers you:

  • Ingest data to any cluster and share it quickly and reliably with other clusters. That’ll remove fragile data transfer bottlenecks while still letting you process data at multiple places to improve performance and get more utilization out of backup clusters.
  • Support a bimodal or multimodal architecture to enable innovation without jeopardizing SLAs. Perform different stages of the processing pipeline on the best cluster. Need a dedicated high-memory cluster for in-memory analytics? Or want to take advantage of an elastic scale-out on a cheaper cloud environment? Got a legacy application that’s locked to a specific version of Hadoop? WANdisco Fusion has the connections to make it happen. And unlike batch data transfer tools, WANdisco Fusion provides fully consistent data that can be read and written from any site.
  • Put away the emergency pager. If you lose data on one cluster, or even an entire cluster, WANdisco Fusion has made sure that you have consistent copies of the data at other locations.
  • Set up security tiers to isolate sensitive data on secure clusters, or keep data local to its country of origin.
  • Perform risk-free migrations. Stand up a new cluster and seamlessly share data using WANdisco Fusion. Then migrate applications and users at your leisure, and retire the old cluster whenever you’re ready.

Interested? Check out the WANdisco Fusion page or call us for details.

Enter WANdisco Fusion

In my last post I talked about some of the problems of setting up data lakes in real Hadoop deployments. And now here’s the better way: WANdisco Fusion lets you build effective, fast, and secure Hadoop deployments by bridging several Hadoop clusters – even if those clusters use different distributions, different versions of Hadoop, or even different file systems.

How does it work?

WANdisco Fusion (WD Fusion for short) lets you share data directories between two or more clusters. The data is replicated using WANdisco’s active-active replication engine – this isn’t just a fancier way to mirror data. Every cluster can write into the shared directories, and changes are coordinated in real-time between the clusters. That’s where the reliability comes from: the Paxos-based replication engine is a proven, patented way to coordinate changes coming from anywhere in the world with 100% reliability. Clusters that are temporarily down or disconnected catch up automatically when they’re back online.
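To make the idea of coordinated changes and automatic catch-up concrete, here is a toy sketch. It is not WANdisco’s engine and it does not implement Paxos. It assumes the agreement step has already produced a single global order of writes, and a simple in-process `Coordinator` (a hypothetical name, as are `Cluster` and `catch_up`) stands in for what a real Paxos-based system does as a distributed protocol. The point it illustrates is only this: if every cluster applies the same agreed sequence, the clusters converge, and a cluster that was offline can replay the entries it missed.

```python
class Cluster:
    """A replica that applies agreed operations strictly in sequence order."""

    def __init__(self, name):
        self.name = name
        self.files = {}     # path -> content
        self.applied = 0    # number of log entries applied so far
        self.online = True

    def catch_up(self, log):
        """Apply every agreed entry this cluster has not seen yet."""
        for path, content in log[self.applied:]:
            self.files[path] = content
        self.applied = len(log)


class Coordinator:
    """Stand-in for the agreement layer: it only assigns one global order."""

    def __init__(self, clusters):
        self.clusters = clusters
        self.log = []       # agreed sequence of (path, content) writes

    def write(self, path, content):
        self.log.append((path, content))
        for cluster in self.clusters:
            if cluster.online:
                cluster.catch_up(self.log)


a, b = Cluster("a"), Cluster("b")
coord = Coordinator([a, b])

coord.write("/shared/x", "v1")
b.online = False                  # cluster b drops off the network
coord.write("/shared/x", "v2")
coord.write("/shared/y", "v1")
b.online = True                   # b reconnects and replays what it missed
b.catch_up(coord.log)

print(a.files == b.files)
```

The single coordinator object is the main simplification here; it keeps the example short but hides the hard part, which is agreeing on the order without a single point of failure.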

The actual data transfer is done as an asynchronous background process and doesn’t consume MapReduce resources.

Selective replication enhances security. You can centrally define whether data is available to every cluster or just one cluster.
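One way to picture a central definition like this is a table that maps shared directories to the clusters allowed to hold them. The sketch below is purely illustrative: the rule format, the paths, the cluster names, and the `clusters_for` helper are all hypothetical and do not reflect WD Fusion’s actual configuration.

```python
# Hypothetical rules: directory prefix -> clusters allowed to hold the data.
RULES = {
    "/data/shared": {"cluster-a", "cluster-b", "cluster-c"},
    "/data/shared/pii": {"cluster-a"},
}


def clusters_for(path, rules=RULES):
    """Return the cluster set of the longest directory prefix matching path."""
    best = None
    for prefix in rules:
        # Match whole path components only, so "/data/sharedother" does not
        # match the "/data/shared" rule.
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    return set(rules[best]) if best is not None else set()


print(sorted(clusters_for("/data/shared/logs/a.txt")))
print(sorted(clusters_for("/data/shared/pii/patients.csv")))
```

Using the longest matching prefix lets a narrow rule for sensitive data override a broader rule for its parent directory.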


What benefits does it bring?

Ok, put aside the technical bits. What can this thing actually do? In the next post I’ll show you how WD Fusion helps you get more value out of your Hadoop clusters.