Hadoop Data Protection

I just came across this nice summary of data protection strategies for Hadoop.  It hits on a key problem: typical backup strategies simply can’t handle the volume of data in a Hadoop cluster, and dedicated backup vendors don’t have much to offer either.  Because the problem is so daunting, companies often assume that Hadoop is “distributed enough” not to need a bulletproof backup strategy.

But as we’ve heard time and again from our customers, that’s just not the case.  The article shows why: if you read on to the section on DistCp, the tool normally used for cluster backup, you’ll see that it can take hours to back up even a few terabytes of data.
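To give a sense of what that looks like in practice, here’s a minimal sketch of a DistCp run between two clusters.  The NameNode addresses and paths are placeholders; the point is that DistCp launches a MapReduce job that re-copies changed files wholesale over the network, which is why even a few terabytes can take hours.

    # Incrementally copy a directory from the production cluster to a backup
    # cluster. -update skips files whose size and checksum are unchanged, but
    # every file that did change is re-copied in full by the MapReduce job.
    hadoop distcp -update \
        hdfs://prod-nn:8020/data/warehouse \
        hdfs://backup-nn:8020/data/warehouse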

As the article mentions, what’s needed is an efficient block-level backup solution.  Luckily, that’s just what WANdisco provides in Nonstop Hadoop.  The architect of our geographically distributed solution, a unified HDFS data layer that spans clusters, described the approach at a Strata conference last year.

The article actually mentions our solution, but I think there was a slight misunderstanding.  WANdisco does not make a “just a backup” product, so we don’t provide any more versioning than you already get from regular HDFS.  In fact, that’s the whole point: we provide an HDFS data layer that spans multiple clusters and locations.  That gives you a very effective disaster recovery strategy as well as other benefits like cluster zones and multi-data-center ingest.
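For the record, the versioning that regular HDFS gives you comes from its built-in, read-only directory snapshots.  A quick sketch, using a placeholder path and snapshot name:

    # An administrator marks a directory as snapshottable once.
    hdfs dfsadmin -allowSnapshot /data/warehouse

    # Capture a point-in-time, read-only view of the directory.
    hdfs dfs -createSnapshot /data/warehouse before-migration

    # The snapshot is browsable under the hidden .snapshot path.
    hdfs dfs -ls /data/warehouse/.snapshot/before-migration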

[Figure: Nonstop Hadoop architecture components]

Interested in learning more?  We’re here to help.
