Managing network connections between Hadoop clusters in different data centers is a significant challenge for Hadoop and network administrators. WANdisco Fusion reduces the number of connections required for any cross-cluster data flow, thereby reducing Hadoop network vulnerabilities for backup and replication.
DistCp is the tool used for data transfer and backup in almost every Hadoop backup and workflow system. DistCp requires connectivity from each data node in the source cluster to each data node in the target cluster.
Typically, each data-node-to-data-node link requires configuring two connections, one inbound and one outbound, to cross the firewall and navigate any intermediate proxies. With 16 data nodes in each cluster, that means 16 x 16 x 2 connections to configure, secure, and monitor: 512 in total! That’s a nightmare for Hadoop administrators and network operators. Just ask the people responsible for planning and running your Hadoop cluster.
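To make the arithmetic concrete, here is a minimal sketch of the connection count for an all-to-all transfer. The function name `distcp_connections` and its parameters are hypothetical, chosen only for illustration; the model simply assumes every source data node must reach every target data node, with one inbound and one outbound connection per pair.

```python
def distcp_connections(source_nodes: int, target_nodes: int, directions: int = 2) -> int:
    """Count the connections to configure when every source data node
    must reach every target data node (hypothetical helper, not a DistCp API)."""
    # One pair per (source node, target node), times inbound + outbound.
    return source_nodes * target_nodes * directions


if __name__ == "__main__":
    # 16 data nodes in each cluster, as in the example above.
    print(distcp_connections(16, 16))  # 512
```

Because the count is a product of the two cluster sizes, it grows quadratically: doubling both clusters to 32 nodes would quadruple the number of connections under this model.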
WANdisco Fusion solves this problem by routing cross-cluster communication through a handful of servers. As the diagram below illustrates, in the simplest case you’ll have one server in the source cluster talking to one server in the target cluster, requiring a grand total of 2 connections to configure.
In a realistic deployment, you’d require additional connections for the redundant WANdisco Fusion servers – this is an active-active configuration after all. Still, in a large deployment you’d see a few tens of connections, rather than many hundreds.
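As a rough comparison, the sketch below counts connections for both approaches. It is an illustration under stated assumptions, not a description of how WANdisco Fusion actually wires its servers: the helper `cross_cluster_connections` is hypothetical, the figure of three redundant servers per cluster is made up for the example, and the model assumes every server on one side talks to every server on the other with one inbound and one outbound connection.

```python
def cross_cluster_connections(source_endpoints: int, target_endpoints: int,
                              directions: int = 2) -> int:
    """Connections to configure when each source endpoint talks to each
    target endpoint (hypothetical model, for illustration only)."""
    return source_endpoints * target_endpoints * directions


if __name__ == "__main__":
    # All 16 data nodes on each side exposed across the WAN.
    all_data_nodes = cross_cluster_connections(16, 16)
    # Simplest routed case: one server per cluster.
    single_pair = cross_cluster_connections(1, 1)
    # Assumed redundant setup: three servers per cluster (made-up figure).
    redundant = cross_cluster_connections(3, 3)

    print(all_data_nodes, single_pair, redundant)  # 512 2 18
```

Under these assumptions, the routed count depends only on the number of dedicated servers, so adding data nodes to either cluster does not add connections to configure.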
The most recent spate of data breaches is driving a 10% annual increase in cybersecurity spending. Why make yourself more vulnerable by exposing your entire Hadoop cluster to the WAN? Our solution architects can help you reduce your Hadoop network exposure.