WANdisco Fusion provides a unique capability: active-active data replication between Hadoop clusters that may be in different locations and run different types of Hadoop.
From an operational perspective, that capability poses some new and interesting questions about cross-cluster data flow. Which cluster does data most often originate from? How fast is it moving between clusters? And how much data is flowing back and forth?
WANdisco Fusion captures detailed information about the replication of data that can help to answer those questions, and exposes it through a series of REST endpoints. The captured information includes:
- The origin of replicated data (which cluster it came from)
- The size of the files
- Transfer rate
- Transfer start, stop, and elapsed time
- Transfer status
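As a rough sketch of how records like these could be summarized, the Python below groups a handful of made-up transfer records by origin cluster and totals the files, bytes, and average transfer rate. The field names (`origin`, `size_bytes`, `start`, `stop`, `status`) and the sample values are hypothetical. They are not the actual field names returned by the WANdisco Fusion REST API, so check the API documentation for the real schema.

```python
from collections import defaultdict

# Hypothetical transfer records; field names and values are illustrative only,
# not the real WANdisco Fusion REST schema. Times are in seconds.
transfers = [
    {"origin": "cluster-a", "size_bytes": 1_000_000, "start": 0.0, "stop": 1.0, "status": "complete"},
    {"origin": "cluster-a", "size_bytes": 3_000_000, "start": 2.0, "stop": 4.0, "status": "complete"},
    {"origin": "cluster-b", "size_bytes": 2_000_000, "start": 1.0, "stop": 5.0, "status": "complete"},
    {"origin": "cluster-b", "size_bytes": 500_000, "start": 6.0, "stop": 6.0, "status": "failed"},
]


def summarize_by_origin(records):
    """Total files, bytes, and elapsed seconds per origin for completed transfers."""
    totals = defaultdict(lambda: {"files": 0, "bytes": 0, "seconds": 0.0})
    for r in records:
        if r["status"] != "complete":
            continue  # skip failed or in-flight transfers
        t = totals[r["origin"]]
        t["files"] += 1
        t["bytes"] += r["size_bytes"]
        t["seconds"] += r["stop"] - r["start"]

    summary = {}
    for origin, t in totals.items():
        # Average rate = total bytes / total elapsed time; guard against zero time.
        rate = t["bytes"] / t["seconds"] if t["seconds"] > 0 else 0.0
        summary[origin] = {**t, "bytes_per_sec": rate}
    return summary


summary = summarize_by_origin(transfers)
for origin, stats in sorted(summary.items()):
    print(origin, stats["files"], stats["bytes"], round(stats["bytes_per_sec"]))
```

Dividing total bytes by total elapsed time weights large files more heavily than averaging per-file rates would, which is usually closer to what you want when judging overall throughput between clusters.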
A subset of this information is visible in the WANdisco Fusion user interface, but I decided it would be a good chance to dust off my R scripts and do some visualization on my own.
For example, I can see that the replication data transfer rate between my two EC2 clusters is roughly between 800 and 1200 kb/s.
Those are just a couple of quick examples that I captured while running a data ingest process. But over time it will be very helpful to keep an eye on the flow of data between your Hadoop clusters. You could see, for instance, if there are any peak ingest times for clusters in different geographic regions.
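To illustrate the peak-ingest idea, here is a minimal Python sketch that buckets transfers by the hour they started and reports the busiest hour for each origin cluster. The cluster names, timestamps, and sizes are invented sample data, and the tuple layout is an assumption made for this example rather than anything WANdisco Fusion returns.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical (origin cluster, transfer start time, size in bytes) samples.
samples = [
    ("us-east", datetime(2024, 1, 1, 9, 15, tzinfo=timezone.utc), 4_000_000),
    ("us-east", datetime(2024, 1, 1, 9, 40, tzinfo=timezone.utc), 6_000_000),
    ("us-east", datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc), 3_000_000),
    ("eu-west", datetime(2024, 1, 1, 2, 30, tzinfo=timezone.utc), 8_000_000),
    ("eu-west", datetime(2024, 1, 2, 2, 10, tzinfo=timezone.utc), 5_000_000),
    ("eu-west", datetime(2024, 1, 1, 17, 0, tzinfo=timezone.utc), 7_000_000),
]


def peak_hours(records):
    """Return {origin: (hour_of_day_utc, total_bytes)} for each origin's busiest hour."""
    by_hour = defaultdict(lambda: defaultdict(int))
    for origin, started, size in records:
        # Bucket by hour of day (UTC), summing across days.
        by_hour[origin][started.hour] += size
    return {
        origin: max(hours.items(), key=lambda kv: kv[1])
        for origin, hours in by_hour.items()
    }


peaks = peak_hours(samples)
for origin, (hour, total) in sorted(peaks.items()):
    print(f"{origin}: busiest hour {hour:02d}:00 UTC, {total} bytes")
```

Bucketing in UTC keeps clusters in different regions on a common clock; converting to each cluster's local time would be the natural next step if you want to line peaks up with local business hours.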
Beyond all the other benefits of WANdisco Fusion, it provides this wealth of operational data to help you manage your Hadoop deployment. Be sure to contact one of our solutions architects if this information could be of use to you.