A new report from GigaOM analyst Paul Miller provides some insight into the question: is one Hadoop cluster enough for most Big Data needs? It’s surprising how much attention this topic has garnered recently. Until a few months ago I hadn’t thought much about why you’d need more than a single cluster. After all, most of the technical material about Hadoop assumes you’re running everything on one cluster, especially since YARN makes it easier to run multiple applications side by side on the same cluster.
But another recent study shows that a majority of Big Data users are running multiple data centers. The GigaOM report dives into some of the reasons why that might be. Workload optimization, load balancing, taking advantage of the cloud for affordable burst processing and backups, regulatory concerns – there are a host of reasons that are driving Hadoop adopters toward a logical data lake consisting of several clusters. And of course there’s also the fact that many Hadoop deployments evolve from a collection of small clusters set up in isolation.
The report also notes that the tools for managing the flow of data between multiple clusters are still rudimentary. DistCp, which underpins many ETL-style tools like Falcon, can be slow and error-prone. If you only need to sync data between clusters once a day it may be acceptable, but many use cases now demand near real-time roll-up analysis.
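To make the DistCp-style approach concrete, a scheduled sync between two clusters typically looks something like the sketch below. The cluster hostnames, ports, and paths are hypothetical, purely for illustration:

```shell
# Copy yesterday's partition from the primary cluster to a second cluster.
# -update : only copy files that are missing or changed on the target
# -p      : preserve file attributes (permissions, replication, etc.)
hadoop distcp -update -p \
  hdfs://primary-nn:8020/data/events/2015-01-01 \
  hdfs://backup-nn:8020/data/events/2015-01-01
```

Note that this is a batch, point-in-time copy launched as a MapReduce job: the target only catches up when the job runs and finishes, which is exactly why DistCp-based pipelines struggle once the requirement shifts from daily syncs to near real-time analysis.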
That’s why WANdisco provides active-active replication: Non-stop Hadoop lets your data span clusters and geographies.
Interested? Check out some of the reasons why this architecture is attractive to Hadoop data consumers and operators.