Complete control over Hadoop data locality

Non-Stop Hadoop provides a unified data layer across Hadoop clusters in one or many locations. This unified data layer solves a number of problems: it delivers a very low recovery point objective (RPO) for critical data, full continuity of data access in the event of a failure, and the ability to ingest and process data at any cluster.

Carrying this layer to its logical conclusion, you may ask whether we’ve introduced a new problem while solving the others: what if you don’t want to replicate all of your HDFS data everywhere?

Perhaps you have to respect data privacy or data locality regulations, or maybe it’s simply not practical to ship all of your raw data across the WAN. Do you have to fall back on workflow management systems like Apache Falcon for scheduled, selective data transfers, and deal with the delays and complexity of building an ETL-style pipeline?

Luckily, no. Non-Stop Hadoop provides a selective replication capability that is more sophisticated than anything you could build manually with the stock data transfer tools. As part of a centralized administration function, you can define the following for each part of the HDFS namespace (sketched in code after this list):

  • Which data centers receive the data
  • The replication factor in each data center
  • Whether data is available for remote (WAN) read even if it is not available locally
  • Whether data can be written in a particular data center
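
To make those dimensions concrete, here is a minimal sketch of how such a per-path policy could be modeled. All class, field, and data center names here are illustrative assumptions made for this post, not Non-Stop Hadoop’s actual administration API.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical model of a selective replication policy -- illustrative only.
public class ReplicationPolicy {
    // Part of the HDFS namespace the policy applies to.
    final String namespacePrefix;
    // Data centers that receive the data, mapped to the local replication factor.
    final Map<String, Integer> replicationByDataCenter;
    // Whether clusters without a local copy may still read the data over the WAN.
    final boolean allowRemoteRead;
    // Data centers permitted to write under this part of the namespace.
    final Set<String> writableDataCenters;

    ReplicationPolicy(String namespacePrefix,
                      Map<String, Integer> replicationByDataCenter,
                      boolean allowRemoteRead,
                      Set<String> writableDataCenters) {
        this.namespacePrefix = namespacePrefix;
        this.replicationByDataCenter = replicationByDataCenter;
        this.allowRemoteRead = allowRemoteRead;
        this.writableDataCenters = writableDataCenters;
    }

    public static void main(String[] args) {
        // Example: sensitive EU data stays in the EU data centers, is not
        // readable over the WAN, and may only be written in Frankfurt
        // (data center names are hypothetical).
        ReplicationPolicy euPii = new ReplicationPolicy(
                "/data/eu/pii",
                Map.of("frankfurt", 3, "london", 2),
                false,
                Set.of("frankfurt"));
        System.out.println("Policy covers " + euPii.namespacePrefix);
    }
}
```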

This solves a host of problems. Perhaps most importantly, if you have sensitive data that cannot be transferred outside a certain area, you can make sure it never reaches data centers in other areas. Further, you can ensure that the restricted part of the namespace is never accessed for reads or writes in other areas.

Non-Stop Hadoop’s selective replication also solves some efficiency problems. You can simply choose not to replicate temporary ‘working’ data, or replicate rarely accessed data only on demand. You also don’t need as high a replication factor when data exists in multiple locations, so you can cut down on local storage costs, as illustrated below.
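
On that last point, stock HDFS already lets you tune the replication factor per file on a single cluster; the snippet below uses that standard API purely to illustrate the kind of setting a selective replication policy would drive in each data center. The path is a made-up example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerLocalReplication {
    public static void main(String[] args) throws Exception {
        // Connect to the local cluster's default file system.
        FileSystem fs = FileSystem.get(new Configuration());

        // If a dataset also exists in another data center, a lower local
        // replication factor (e.g. 2 instead of the default 3) may be enough.
        // Note that HDFS tracks the replication factor per file, not per directory.
        Path file = new Path("/data/shared/clickstream/part-00000"); // hypothetical path
        fs.setReplication(file, (short) 2);

        fs.close();
    }
}
```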

Figure: Selective replication across multiple clusters sharing a Non-Stop Hadoop HDFS data layer. Replication policies control where subsets of HDFS are replicated, the replication factor in each cluster, and the availability of remote (WAN) reads.

Consistent, highly available data is really just the starting point: Non-Stop Hadoop also gives you powerful tools to control where data resides, how it gets there, and how it’s stored.

By now you’ve probably thought of a problem that selective replication can help you solve.  Give our team of Hadoop experts a call to learn more.
