The vision of the data lake is admirable: collect all your valuable business data in one repository, make it available for analysis, and generate actionable insight fast enough to improve your strategic and tactical business decisions.
Translated to Hadoop language, that implies putting all the data in a single large Hadoop cluster. That gives you the analysis advantages of the data lake while leveraging Hadoop’s low storage costs. And indeed, a recent survey found that 61% of Big Data analytics projects have shifted some EDW workload to Hadoop.
But in reality, it’s not that simple. Of those involved in Big Data projects, 35% are worried about maintaining performance as data volume and workload increase, and 44% are concerned about the lack of enterprise-grade backup. Those concerns argue against concentrating ever more data into one cluster.
And meanwhile, 70% of the companies in that survey have multiple clusters in use. Small clusters that started as department-level pilots become production clusters. Security or cost concerns may dictate the use of multiple clusters for different groups. Upgrades to new Hadoop distributions to take advantage of new components (or abandon old ones) can be a difficult migration process. Whatever the reason, the reality of Hadoop deployments is more complicated than you’d think.
As for making multiple clusters play well together… well, the fragility of tools like DistCp brings back memories of the complicated ETL processes that we wanted to leave behind.
So are we doomed to an environment of data silos? Isn’t that what we were trying to avoid?
There is a better way. In the next post I’ll introduce WANdisco Fusion, the only Hadoop-compatible file system that quickly and easily shares data across clusters, distributions, and file systems.
Survey source: Wikibon