In my last post I talked about some of the problems of setting up data lakes in real Hadoop deployments. Here’s the better way: AltoStor lets you build effective, fast, and secure Hadoop deployments by bridging several Hadoop clusters, even if those clusters use different distributions, different versions of Hadoop, or even different file systems.
How does it work?
AltoStor lets you share data directories between two or more clusters. The data is replicated using WANdisco’s active-active replication engine, and this isn’t just a fancier way to mirror data. Every cluster can write into the shared directories, and changes are coordinated in real time between the clusters. That’s where the reliability comes from: the Paxos-based replication engine is a proven, patented way to coordinate changes coming from anywhere in the world with 100% reliability. Clusters that are temporarily down or disconnected catch up automatically when they’re back online.
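To make the idea concrete, here is a toy model of that coordination, not AltoStor's actual implementation or API. It assumes the replication engine's job boils down to getting every cluster to agree on one ordered list of changes. The names `Cluster` and `SharedDirectory` are made up for illustration, and a plain in-memory list stands in for the Paxos agreement step.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for one Hadoop cluster."""
    name: str
    online: bool = True
    applied: int = 0                          # number of agreed changes applied so far
    files: dict = field(default_factory=dict)

    def catch_up(self, log):
        """Apply every agreed change this cluster has not seen yet, in order."""
        if not self.online:
            return
        for path, content in log[self.applied:]:
            self.files[path] = content
        self.applied = len(log)


class SharedDirectory:
    """Toy replicated directory: one ordered change log that all clusters follow."""

    def __init__(self, clusters):
        self.clusters = clusters
        self.log = []

    def write(self, origin, path, content):
        if not origin.online:
            raise RuntimeError(f"{origin.name} is offline")
        # In a real system, agreeing on this position in the log is the consensus step.
        self.log.append((path, content))
        for cluster in self.clusters:
            cluster.catch_up(self.log)

    def reconnect(self, cluster):
        cluster.online = True
        cluster.catch_up(self.log)


east = Cluster("east")
west = Cluster("west")
shared = SharedDirectory([east, west])

shared.write(east, "/shared/a.csv", "v1")     # east writes
shared.write(west, "/shared/c.csv", "v1")     # west writes too (active-active)
west.online = False                           # west drops off the network
shared.write(east, "/shared/a.csv", "v2")     # east keeps working
shared.write(east, "/shared/b.csv", "v1")
shared.reconnect(west)                        # west replays what it missed

print(east.files == west.files)               # True
```

Because every cluster applies the same changes in the same order, a cluster that was offline only needs to replay the part of the log it missed to end up identical to the others.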
The actual data transfer is done as an asynchronous background process and doesn’t consume MapReduce resources.
Selective replication enhances security. You can centrally define whether data is available to every cluster or to just one cluster.
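As a rough illustration of what such a central definition could look like, here is a small sketch. The policy table, the directory paths, the cluster names, and the `clusters_for` helper are all hypothetical, invented for this example. They are not AltoStor configuration syntax.

```python
# Hypothetical policy: shared directory -> clusters allowed to hold a copy.
REPLICATION_POLICY = {
    "/data/public": {"east", "west", "emea"},   # available to every cluster
    "/data/finance": {"east"},                  # available to just one cluster
}


def clusters_for(path, policy=REPLICATION_POLICY):
    """Return the clusters that should receive `path` (longest matching directory wins)."""
    matches = [d for d in policy if path == d or path.startswith(d + "/")]
    if not matches:
        return set()                            # no rule: the data is not replicated
    return policy[max(matches, key=len)]


print(sorted(clusters_for("/data/public/logs/2014.csv")))   # ['east', 'emea', 'west']
print(sorted(clusters_for("/data/finance/payroll.csv")))    # ['east']
```

Keeping the rules in one place means sensitive directories never leave the cluster they belong to, while everything else is shared.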
What benefits does it bring?
OK, let’s put the technical bits aside. What can this thing actually do? In the next post I’ll show you how AltoStor helps you get more value out of your Hadoop clusters.