I just came across this nice summary of data protection strategies for Hadoop. It hits on a key problem: typical backup strategies can’t handle Hadoop’s data volumes, and there’s not much available from dedicated backup vendors either. Because the problem is so daunting, many companies simply assume that Hadoop is “distributed enough” not to require a bulletproof backup strategy.
But as we’ve heard time and again from our customers, that’s just not the case. The article shows why: read on to the section on DistCp, the tool normally used for cluster backup, and you’ll see that it can take hours to back up even a few terabytes of data.
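To make that concrete, here is what a typical DistCp backup run looks like from the command line. This is a minimal sketch: the namenode addresses and paths are hypothetical placeholders, not anything from the article.

```bash
# Incremental cluster-to-cluster copy with DistCp.
# -update copies only files that are missing or changed on the target;
# -p preserves replication, block size, user, group, and permissions.
# The namenode addresses and paths below are hypothetical placeholders.
hadoop distcp -update -p \
  hdfs://prod-nn:8020/data \
  hdfs://backup-nn:8020/backups/data
```

Even in -update mode, DistCp launches a batch MapReduce job that must enumerate and compare the source tree, and files that have changed are recopied in full. The cost scales with the data scanned rather than with what actually changed at the block level, which is where the hours go on a multi-terabyte cluster.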
As the article mentions, what’s needed is an efficient block-level backup solution. Luckily, that’s exactly what WANdisco provides in Nonstop Hadoop. The architect of our geographically distributed solution, which provides a unified cross-cluster HDFS data layer, described the approach at a Strata conference last year.
The article actually mentions our solution, but I think there was a slight misunderstanding. WANdisco doesn’t make “just a backup” product, so we don’t provide any more versioning than regular HDFS does. In fact, that’s the whole point: we provide an HDFS data layer that spans multiple clusters and locations, which yields a very effective disaster recovery strategy along with other benefits like cluster zones and multi-data-center ingest.
Interested in learning more? We’re here to help.