More efficient cluster utilization with Non-Stop Hadoop

Perhaps the most overlooked capability of WANdisco’s Non-Stop Hadoop is its efficient cluster utilization in secondary data centers. These secondary clusters are often used only for backup purposes, which is a waste of valuable computing resources. Non-Stop Hadoop allows you to take full advantage of the CPU and storage resources that you’ve paid for.

Of course, anyone who adopts Hadoop needs a backup strategy, and the typical solution is to put a backup cluster in a remote data center. The distcp tool, part of the core Hadoop distribution, is run periodically to copy data from the primary cluster to the backup cluster. You can also run some read-only jobs on the backup cluster, as long as you don’t need immediate access to the latest data.
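For reference, the periodic copy described above might look something like the following. This is a minimal sketch, not a recommended backup procedure: the NameNode host names, port, and path are hypothetical placeholders, and the helper only builds the command line rather than running it. The `-update` flag tells distcp to skip files that already match on the target.

```python
def build_distcp_command(source_uri, target_uri, update=True):
    """Return the argument list for a one-way distcp copy.

    The URIs passed in are placeholders for illustration only.
    """
    cmd = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ on the target.
        cmd.append("-update")
    cmd += [source_uri, target_uri]
    return cmd


# Hypothetical primary and backup clusters.
cmd = build_distcp_command(
    "hdfs://primary-nn:8020/data/ingest",
    "hdfs://backup-nn:8020/data/ingest",
)
print(" ".join(cmd))
```

A scheduler such as cron would typically run a command like this at a fixed interval, which is why the backup cluster always lags the primary by up to one interval.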

Still, that backup cluster is a big investment that isn’t being used fully. What if you could treat that backup cluster as a part of your unified Hadoop environment, and use it fully for any processing? That would give you a better return on that backup cluster investment, and let you shift some load off the primary cluster, perhaps reducing the need for additional primary cluster capacity.

That’s exactly what Non-Stop Hadoop provides: you can treat several Hadoop clusters as part of a single unified Hadoop file system. All of the important data is replicated efficiently by Non-Stop Hadoop, including the NameNode metadata and the actual data blocks. You can write data into any of the clusters, knowing that the metadata will be kept in sync by Non-Stop Hadoop and that the actual data will be transferred seamlessly (and much faster than with a tool like distcp).

As a simple example, recently I was ingesting two streams of data into a Hadoop cluster. Each ingest job handled roughly the same amount of data. The two jobs combined took up about 28 seconds of cluster CPU time during each run and consumed roughly 500MB of cluster RAM during operation.

Then I decided to run each job separately on two clusters that are part of a single Non-Stop Hadoop deployment. In this case, again running both jobs at the same time, the jobs used 15 seconds of CPU time on the first cluster and 18 seconds on the second cluster, and about 250MB of RAM on each.
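To make the comparison concrete, here is a small sketch that works out how much load left the first cluster, using only the figures quoted above. The variable and function names are just for illustration.

```python
# Figures quoted in the example (CPU seconds per run, RAM in MB).
single = {"cpu_s": 28, "ram_mb": 500}
split = [
    {"name": "cluster 1", "cpu_s": 15, "ram_mb": 250},
    {"name": "cluster 2", "cpu_s": 18, "ram_mb": 250},
]


def relief(single_value, split_value):
    """Fraction of the original load removed from the first cluster."""
    return 1 - split_value / single_value


first = split[0]
cpu_relief = relief(single["cpu_s"], first["cpu_s"])
ram_relief = relief(single["ram_mb"], first["ram_mb"])
total_cpu = sum(c["cpu_s"] for c in split)

print(f"CPU load removed from cluster 1: {cpu_relief:.0%}")
print(f"RAM load removed from cluster 1: {ram_relief:.0%}")
print(f"Combined CPU across both clusters: {total_cpu} s (vs {single['cpu_s']} s)")
```

The combined CPU time across both clusters is somewhat higher than in the single-cluster run, but the first cluster itself carries roughly half of what it did before.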

The exact numbers will vary depending on the job and what else is running on the cluster, but in this simple example I’ve accomplished three very useful things:

  • I’ve gotten useful work out of a second cluster that would otherwise be idle.
  • I’ve shifted roughly half of the processing burden off the first cluster. (It also helps to have six NameNodes instead of two to handle the concurrent writes.)
  • I don’t have to run the distcp job to transfer this data to a backup site – it’s already on both clusters. Not only am I getting more useful work out of my second cluster, I’m avoiding unnecessary overhead work.

So there you have it – Non-Stop Hadoop is the perfect way to get more bang for your backup cluster buck. Want to know more? We’ll be happy to discuss in more detail.
