Improving HBase Resilience for Real-time Applications

HBase is the NoSQL database of choice for Hadoop, and it now supports critical real-time workloads in financial services and other industries. As HBase has grown more important for these workloads, the Hadoop community has focused on reducing potential downtime in the event of a region server failure. Rebuilding a region server can take 15 minutes or more, and even the latest improvements only provide timeline-consistent read access through standby region servers. In many critical applications, losing write access for more than a few seconds is simply unacceptable.

[Figure: HBase architecture]

Enter Non-Stop for Apache HBase. Built on WANdisco’s patented active-active replication engine, it provides fully consistent active-active access to a set of replicated region servers. That means your HBase data is always safe and accessible for both read and write activity.

[Figure: Non-Stop for Apache HBase architecture]

By providing fully consistent active-active replication for region servers, Non-Stop for Apache HBase gives applications always-on read/write access for HBase.

Putting a replica in a remote location also provides geographic redundancy. Unlike native HBase replication, region servers at other data centers are fully writable and guaranteed to be consistent. Non-Stop for Apache HBase includes active-active HBase masters, so full use of HBase can continue even if an entire cluster is lost.

Non-Stop for Apache HBase also simplifies the management of HBase, as you no longer need complicated asynchronous master-slave setups for backup and high availability.

Take a look at what Non-Stop for Apache HBase can do for your low-latency and real-time analysis applications.

Monitoring active-active replication in WANdisco Fusion

WANdisco Fusion provides a unique capability: active-active data replication between Hadoop clusters that may be in different locations and run very different types of Hadoop.

From an operational perspective, that capability poses some new and interesting questions about cross-cluster data flow. Which cluster does data most often originate from? How fast is it moving between clusters? And how much data is flowing back and forth?

WANdisco Fusion captures a lot of detailed information about the replication of data that can help answer those questions, and it’s exposed through a series of REST endpoints (see the sketch after the list below). The captured information includes:

  • The origin of replicated data (which cluster it came from)
  • The size of the files
  • Transfer rate
  • Transfer start, stop, and elapsed time
  • Transfer status
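As a rough, hedged illustration of how you might pull these records yourself, here is a small Python sketch. The host, endpoint path, and JSON field names are assumptions made for the example, not the documented Fusion REST API, so check the REST documentation for your release before relying on anything like this.

```python
# Hypothetical sketch: pull file-transfer records from a Fusion REST endpoint
# and summarize transfer rates. The URL and JSON field names are assumptions,
# not the documented API.
import requests

FUSION_HOST = "http://fusion-node.example.com:8082"   # placeholder host/port
ENDPOINT = f"{FUSION_HOST}/fusion/fs/transfers"       # placeholder path

def fetch_transfers():
    """Return a list of transfer records as dictionaries."""
    resp = requests.get(ENDPOINT, timeout=30)
    resp.raise_for_status()
    return resp.json()

def summarize(transfers):
    """Print origin, size, and transfer rate for each completed transfer."""
    for t in transfers:
        if t.get("status") != "COMPLETE":          # assumed status value
            continue
        size_kb = t["bytes"] / 1024.0
        elapsed_s = (t["stop_time"] - t["start_time"]) / 1000.0
        rate = size_kb / elapsed_s if elapsed_s > 0 else float("nan")
        print(f'{t["origin"]}: {size_kb:.0f} KB at {rate:.0f} KB/s')

if __name__ == "__main__":
    summarize(fetch_transfers())
```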

A subset of this information is visible in the WANdisco Fusion user interface, but I decided it would be a good chance to dust off my R scripts and do some visualization on my own.

For example, I can see that the replication data transfer rate between my two EC2 clusters is roughly between 800 and 1,200 KB/s.

[Figure: file transfer rate]

And the size of the data is pretty small, between 600 and 900 KB.

[Figure: file transfer size]

Those are just a couple of quick examples that I captured while running a data ingest process, but over time it will be very helpful to keep an eye on the flow of data between your Hadoop clusters. You could see, for instance, whether there are peak ingest times for clusters in different geographic regions.
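For anyone who would rather not dust off R, the same sort of chart can be sketched in Python with matplotlib. The (timestamp, rate) pairs are assumed to come from the hypothetical records fetched in the earlier sketch, or from your own export.

```python
# Minimal plotting sketch: transfer rate over time. Expects (start_time_ms, rate_kb_s)
# pairs, e.g. derived from the hypothetical records fetched above.
from datetime import datetime
import matplotlib.pyplot as plt

def plot_rates(samples):
    """samples: iterable of (start_time_ms, rate_kb_s) tuples."""
    samples = sorted(samples)
    times = [datetime.fromtimestamp(ms / 1000.0) for ms, _ in samples]
    rates = [rate for _, rate in samples]
    plt.plot(times, rates, marker="o")
    plt.xlabel("Transfer start time")
    plt.ylabel("Transfer rate (KB/s)")
    plt.title("Replication transfer rate between clusters")
    plt.gcf().autofmt_xdate()
    plt.tight_layout()
    plt.show()
```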

Beyond all its other benefits, WANdisco Fusion provides this wealth of operational data to help you manage your Hadoop deployment. Be sure to contact one of our solutions architects if this information could be of use to you.

Transforming health care with Big Data

There’s a lot of hype around Big Data these days, so it’s refreshing to hear a real success story directly from one of the practitioners. A couple of weeks ago I was lucky enough to attend a talk given by Charles Boicey, an Enterprise Analytics Architect, at an event sponsored by WANdisco, Hortonworks, and Slalom Consulting. Charles helped put a Big Data strategy in place at the University of California, Irvine (UCI) Medical Center, and is now working on a similar project at Stony Brook.

If you’ve ever read any of Atul Gawande’s publications, you’ll know that the U.S. health care system is challenged by a rising cost curve.  Thoughtful researchers are trying to address costs and improve quality of care by reducing error rates, focusing on root causes of recurring problems, and making sure that health care practitioners have the right data at the right time to make good decisions.

Mr. Boicey is in the middle of these transformational projects.  You can read about this work on his Twitter feed and elsewhere, and WANdisco has a case study available.  One thing that caught my attention in his latest talk is the drive to incorporate data from social media and wearable devices to improve medical care.  Mr. Boicey mentioned that sometimes patients will complain on Facebook while they’re still in the hospital – and that’s probably a good thing for the doctors and nurses to know.

And of course, all of the wearable devices that track daily activity and fitness would be a boon to medical providers if they could get a handle on that data easily.  The Wall Street Journal has a good write-up on the opportunities and challenges in this area.

It’s nice to see Big Data being put to concrete applications that will truly benefit society. It’s not just a tool for making the web work better anymore.

Benefits of WANdisco Fusion

In my last post I described WANdisco Fusion’s cluster-spanning file system. Now think of what that offers you:

  • Ingest data to any cluster and share it quickly and reliably with other clusters. That removes fragile data transfer bottlenecks while still letting you process data in multiple locations to improve performance and get more utilization out of backup clusters.
  • Support a bimodal or multimodal architecture to enable innovation without jeopardizing SLAs. Perform different stages of the processing pipeline on the best cluster. Need a dedicated high-memory cluster for in-memory analytics? Or want to take advantage of an elastic scale-out on a cheaper cloud environment? Got a legacy application that’s locked to a specific version of Hadoop? WANdisco Fusion has the connections to make it happen. And unlike batch data transfer tools, WANdisco Fusion provides fully consistent data that can be read and written from any site.

[Figure: sharing data across clusters for different workloads]

  • Put away the emergency pager. If you lose data on one cluster, or even an entire cluster, WANdisco Fusion has made sure that you have consistent copies of the data at other locations.

[Figure: failover to a replicated cluster]

  • Set up security tiers to isolate sensitive data on secure clusters, or keep data local to its country of origin.

[Figure: security tiers across clusters]

  • Perform risk-free migrations. Stand up a new cluster and seamlessly share data using WANdisco Fusion. Then migrate applications and users at your leisure, and retire the old cluster whenever you’re ready.

Read more

Interested? Check out the product brief or call us for details.

Enter WANdisco Fusion

In my last post I talked about some of the problems of setting up data lakes in real Hadoop deployments. Now here’s the better way: WANdisco Fusion lets you build effective, fast, and secure Hadoop deployments by bridging several Hadoop clusters – even if those clusters use different distributions, different versions of Hadoop, or different file systems.

How does it work?

WANdisco Fusion (WD Fusion for short) lets you share data directories between two or more clusters. The data is replicated using WANdisco’s active-active replication engine – this isn’t just a fancier way to mirror data. Every cluster can write into the shared directories, and changes are coordinated in real-time between the clusters. That’s where the reliability comes from: the Paxos-based replication engine is a proven, patented way to coordinate changes coming from anywhere in the world with 100% reliability. Clusters that are temporarily down or disconnected catch up automatically when they’re back online.
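To make the idea of coordinated, ordered changes concrete, here is a deliberately simplified toy in Python. It illustrates the general state-machine-replication pattern – every cluster applies the same agreed sequence of operations, and a cluster that was offline simply replays what it missed – and it is only an illustration of that pattern, not WANdisco’s Paxos implementation.

```python
# Toy illustration of ordered, replicated metadata operations.
# This is NOT WANdisco's implementation; it just shows why agreeing on a
# single global order keeps every replica's view of the namespace identical.

class CoordinationLog:
    """Stand-in for the agreed (e.g. Paxos-ordered) sequence of operations."""
    def __init__(self):
        self.entries = []            # list of (op, path) in agreed order

    def propose(self, op, path):
        """In a real system this is where distributed agreement happens."""
        self.entries.append((op, path))

class Cluster:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.applied = 0             # index of the next log entry to apply
        self.namespace = set()       # toy stand-in for the file system state

    def write(self, path):
        # Any cluster can write; the change goes through the shared log first.
        self.log.propose("create", path)
        self.catch_up()

    def catch_up(self):
        # Apply every agreed operation we have not seen yet, in order.
        while self.applied < len(self.log.entries):
            op, path = self.log.entries[self.applied]
            if op == "create":
                self.namespace.add(path)
            self.applied += 1

log = CoordinationLog()
us, eu = Cluster("us-east", log), Cluster("eu-west", log)

us.write("/data/sales/day1")          # written in the US cluster
eu.write("/data/ads/day1")            # written in the European cluster
us.catch_up()                         # a lagging cluster just replays the log
assert us.namespace == eu.namespace   # both clusters agree on the namespace
```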

The actual data transfer is done as an asynchronous background process and doesn’t consume MapReduce resources.

Selective replication enhances security: you can centrally define whether data is available to every cluster or to just one.
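As a hedged sketch of what selective replication means in practice, the toy rule table below maps directory prefixes to the clusters that should hold them; anything outside a rule stays local. The rule format and cluster names are invented for illustration – Fusion’s real configuration is managed through its own UI and APIs.

```python
# Illustrative only: a path-prefix rule table deciding which clusters a file
# lands on. Fusion's real configuration is managed through its own tooling.
REPLICATION_RULES = {
    "/data/shared/":  {"us-east", "eu-west", "apac"},   # replicated everywhere
    "/data/eu-only/": {"eu-west"},                      # stays in the EU
}

def target_clusters(path, local_cluster):
    """Return the set of clusters that should hold this path."""
    for prefix, clusters in REPLICATION_RULES.items():
        if path.startswith(prefix):
            return clusters
    return {local_cluster}            # unmatched paths are not replicated

print(target_clusters("/data/eu-only/patients.csv", "eu-west"))  # {'eu-west'}
print(target_clusters("/tmp/scratch.txt", "us-east"))            # {'us-east'}
```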

[Figure: WD Fusion architecture]

What benefits does it bring?

Ok, put aside the technical bits. What can this thing actually do? In the next post I’ll show you how WD Fusion helps you get more value out of your Hadoop clusters.

WANdisco Fusion: A Bridge Between Clusters, Distributions, and Storage Systems

The vision of the data lake is admirable: collect all your valuable business data in one repository. Make it available for analysis and generate actionable data fast enough to improve your strategic and tactical business decisions.

Translated to Hadoop language, that implies putting all the data in a single large Hadoop cluster. That gives you the analysis advantages of the data lake while leveraging Hadoop’s low storage costs. And indeed, a recent survey found that 61% of Big Data analytics projects have shifted some EDW workload to Hadoop.

But in reality, it’s not that simple. 35% of those involved in Big Data projects are worried about maintaining performance as data volume and workload increase. 44% are concerned about the lack of enterprise-grade backup. Those concerns argue against concentrating ever more data in one cluster.

And meanwhile, 70% of the companies in that survey have multiple clusters in use. Small clusters that started as department-level pilots become production clusters. Security or cost concerns may dictate the use of multiple clusters for different groups. Upgrades to new Hadoop distributions to take advantage of new components (or abandon old ones) can be a difficult migration process. Whatever the reason, the reality of Hadoop deployments is more complicated than you’d think.

As for making multiple clusters play well together… well, the fragility of tools like DistCp brings back memories of those complicated ETL processes that we wanted to leave behind us.

So are we doomed to an environment of data silos? Isn’t that what we were trying to avoid?

[Figure: survey of Hadoop deployment concerns]

There is a better way. In the next post I’ll introduce WANdisco Fusion, the only Hadoop-compatible file system that quickly and easily shares data across clusters, distributions, and file systems.

Survey source: Wikibon

SmartSVN has a new home

We’re pleased to announce that from 23 February 2015, SmartSVN will be owned, maintained, and managed by SmartSVN GmbH, a wholly owned subsidiary of Syntevo GmbH.

Long-term customers will remember that Syntevo were the original creators and suppliers of SmartSVN before WANdisco’s purchase of the product.

We’ve brought a lot of great features and enhancements to SmartSVN since we purchased it in 2012, particularly the change from SVNKit to JavaHL, which delivered significant performance improvements and means that SmartSVN can track updates to core Subversion much faster than before.

During the last two years the founders of Syntevo have continued to work with WANdisco on both engineering and consulting levels, so the transition back into their ownership will be smooth and seamless. We’re confident that having the original creators of SmartSVN take over the reins again will ensure that SmartSVN remains the best cross-platform Subversion product available for a long time to come.

Will this affect my purchased SmartSVN license?

No, SmartSVN GmbH will continue to support current SmartSVN users and you’ll be able to renew through them when the free upgrade period of your SmartSVN license has expired.

Where should I raise issues in the future?

The best place to go is Syntevo’s contact page where you’ll find the right contact depending on the nature of your issue.

A thank you to the SmartSVN community

Your input has been invaluable in guiding the improvements we’ve made to SmartSVN; we couldn’t have done it without you. We’d like to thank you for your business over the last two years, and we hope you continue to enjoy the product.

Regards,
Team WANdisco

Join WANdisco at Strata

The Strata conferences are some of the best Big Data shows around.  I’m really looking forward to the show in San Jose on February 17-20 this year.  The presentations look terrific, and there are deep-dive sessions into Spark and R for all of the data scientists.

Plus, WANdisco will have a strong presence. Our very own Jagane Sundar and Brett Rudenstein will be in theCUBE to talk about WANdisco’s work on distributed file systems. They’ll also show early demos of some exciting new technology, and you can always stop by our booth to see more.

Look forward to seeing everyone out there!

Register for Hadoop Security webinar

Security in Hadoop is a challenging topic. Hadoop was built without much of a security framework in mind, and so over the years the distribution vendors have added new security layers. Kerberos, Knox, Ranger, Sentry – there are a lot of security components to consider in this fluid landscape. Meanwhile, the demand for security is increasing thanks to growing data privacy concerns, exacerbated by the recent string of security breaches at major corporations.

This week Wikibon’s Jeff Kelly will give his perspective on how to secure sensitive data in Hadoop.  It should be a very interesting and useful Hadoop security webinar and I hope you’ll join us.  Just visit http://www.wandisco.com/webinars to register.

Data locality leading to more data centers

In the ‘yet another headache for CIOs’ category, here’s an interesting read from the Wall Street Journal on why US companies are going to start building more data centers in Europe soon.  In the wake of various cybersecurity threats and some recent political events, national governments are more sensitive to their citizens’ data leaving their area of control.  That’s data locality leading to more data centers – and it’ll hit a lot of companies.

Multinational firms are of course affected, as they have customer data originating from several regions. But in my mind the jury is out on how big the impact will be. If you’re merely a consumer of social media data, do you need a local data center in every region you pull that feed from? It will likely take a few years (and probably some legal rulings) before the dust settles.

You can imagine that this new requirement puts a real crimp in Hadoop deployment plans.  Do you now need at least a small cluster in each area you do business in?  If so, how do you easily keep sensitive data local while still sharing downstream analysis?

This is one of the areas where a geographically distributed HDFS with powerful selective replication capabilities can come to the rescue.  For more details, have a listen to the webinar on Hadoop data transfer pipelines that I ran with 451 Research’s Matt Aslett last week.