I was baffled last week to learn that many Hadoop deployments have no backup procedure at all. Hadoop does, of course, replicate data locally, keeping three copies of every file by default. But catastrophes can and do happen. Data centers aren't immune to natural disasters or malicious acts, and if you try to stretch a single cluster by placing some of your DataNodes in a remote site, performance will suffer greatly.
WANdisco, of course, makes products that solve data availability problems among other challenges, so I'm not an impartial observer. But ask yourself this: is the data in your Hadoop cluster less valuable than the photos on your cell phone, which are automatically synced to a remote storage site?
Then ask your Hadoop architect these five questions:
- How is our Hadoop data backed up?
- How much data might we lose if the data center fails?
- How long will it take us to recover data and be operational again if we have a data center failure?
- Have you verified the integrity of the data at the backup site?
- How often do you test our Hadoop applications on the backup site?
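If the answer to the first question is "we don't," a common starting point is a scheduled copy to a second cluster with Hadoop's built-in DistCp tool. The sketch below assumes hypothetical cluster and path names (`prod-nn`, `backup-nn`, `/data`); it is a minimal baseline, not a full disaster-recovery plan, and it still leaves open the recovery-time and integrity questions above.

```shell
# Hypothetical: mirror /data from the production cluster to a backup
# cluster. -update copies only files that changed; -delete removes
# files at the target that no longer exist at the source.
hadoop distcp -update -delete \
  hdfs://prod-nn:8020/data \
  hdfs://backup-nn:8020/data
```

Note that a periodic DistCp run is asynchronous: anything written since the last run is lost if the primary data center fails, which is exactly what the second question is probing.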
The answers might surprise you.