Spark and Hadoop infrastructure

I just read another article about how Spark stands a good chance of supplanting MapReduce for many use cases. As an in-memory platform, Spark provides answers much faster than MapReduce, which must perform an awful lot of disk I/O to process data.

Yet MapReduce isn’t going away. Besides all of the legacy applications built on it, MapReduce can still handle much larger data sets. Spark’s practical limit is in the terabytes, while MapReduce can handle petabytes.

There’s one interesting question I haven’t seen discussed, however. How do you manage the different hardware profiles for Spark and other execution engines? A general-purpose MapReduce cluster will likely balance I/O, CPU, and RAM, while a cluster tailored for Spark will emphasize RAM and CPU much more heavily than I/O throughput. (As one simple example, switching a Spark MLlib job to a cluster that allowed allocation of 12GB of RAM per executor cut the run time from 370 seconds to 14 seconds.)
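To make the memory knob concrete, here is a minimal sketch of how a per-executor memory request might be passed to spark-submit on a YARN cluster. The 12g figure comes from the example above; the application name, core count, and executor count are illustrative assumptions, not the settings of the job I measured.

```python
# Sketch: build a spark-submit command line that requests large executors.
# The flags are standard spark-submit options; the default values for cores
# and executor count are assumptions for illustration only.

def spark_submit_command(app, executor_memory="12g", executor_cores=4,
                         num_executors=8, master="yarn"):
    """Return the argument list for a spark-submit invocation."""
    return [
        "spark-submit",
        "--master", master,
        "--executor-memory", executor_memory,   # RAM per executor
        "--executor-cores", str(executor_cores),
        "--num-executors", str(num_executors),
        app,                                    # hypothetical application file
    ]


if __name__ == "__main__":
    print(" ".join(spark_submit_command("mllib_job.py")))
```

The request only helps if the nodes can actually grant it, which is the hardware-profile question in a nutshell.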

From what I’ve heard, YARN’s resource manager doesn’t handle hybrid hardware profiles in the same cluster very well yet. It will tend to pick the ‘best’ data node available for a task, but that means it will tend to pick your big-memory, big-CPU nodes for everything, not just Spark jobs.
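A toy model shows why that matters. The sketch below is not YARN's scheduling algorithm; it is a deliberately naive "pick the node with the most free memory" rule, with made-up node sizes and task sizes, meant only to illustrate how small tasks can crowd onto the big nodes.

```python
# Toy placement model (an assumption for illustration, NOT YARN's scheduler):
# every task goes to whichever node currently has the most free memory.
from collections import Counter


def pick_node(candidates):
    """Naive 'best node' rule: most free memory wins, then most cores."""
    return max(candidates, key=lambda n: (n["free_gb"], n["cores"]))


def simulate(task_sizes_gb, nodes):
    """Place each task greedily and count how many tasks land on each node."""
    nodes = [dict(n) for n in nodes]          # work on copies
    placements = Counter({n["name"]: 0 for n in nodes})
    for need in task_sizes_gb:
        candidates = [n for n in nodes if n["free_gb"] >= need]
        if not candidates:
            continue                          # task would have to wait
        node = pick_node(candidates)
        node["free_gb"] -= need
        placements[node["name"]] += 1
    return placements


# Hypothetical hybrid cluster: two big-memory nodes and four ordinary ones.
CLUSTER = (
    [{"name": f"big{i}", "free_gb": 256, "cores": 32} for i in (1, 2)]
    + [{"name": f"small{i}", "free_gb": 64, "cores": 16} for i in (1, 2, 3, 4)]
)

if __name__ == "__main__":
    # Twenty small 2 GB tasks, the kind a MapReduce job might launch.
    print(simulate([2] * 20, CLUSTER))
```

In this model all twenty small tasks land on the two big nodes and the ordinary nodes sit idle, leaving less headroom for the memory-hungry jobs the big nodes were bought for.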

So what’s the answer? One possibility is to set up multiple clusters, each tailored for a different type of processing. They’ll have to share data, of course, which is where it gets complicated. The usual techniques for moving data between clusters (that is, the tools built on distcp) are meant for scheduled synchronization – in other words, backups. Unless you’re willing to accept a delay before data is available to both clusters, and unless you’re extremely careful about which parts of the namespace each cluster effectively ‘owns’, you’re out of luck…unless you use Non-Stop Hadoop, that is.
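For reference, a scheduled distcp sync usually boils down to a command like the one this sketch builds. The NameNode hostnames and paths are hypothetical; `-update` is distcp's option for copying only files that are missing or different at the target.

```python
# Sketch: build the distcp command a scheduled sync job (cron, Oozie, etc.)
# might run. Hostnames and paths below are placeholders, not real clusters.

def distcp_sync_command(source_uri, target_uri):
    """Return the argument list for a one-way, incremental distcp copy."""
    return ["hadoop", "distcp", "-update", source_uri, target_uri]


if __name__ == "__main__":
    cmd = distcp_sync_command(
        "hdfs://mr-cluster-nn:8020/data/events",      # hypothetical source
        "hdfs://spark-cluster-nn:8020/data/events",   # hypothetical target
    )
    print(" ".join(cmd))
```

Note that the copy is one-way and only as fresh as the last run, which is exactly the delay and ownership problem described above.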

Non-Stop Hadoop lets you treat two or more clusters as a unified HDFS namespace, even when the clusters are separated by a WAN. Each cluster can read and write simultaneously, using WANdisco’s active-active replication technology to keep the HDFS metadata in sync. In addition, Non-Stop Hadoop’s efficient block-level replication between clusters means data transfers complete much more quickly.

This means you can set up two clusters with different hardware profiles, running Spark jobs on one and traditional MapReduce jobs on the other, without any additional administrative overhead. Same data, different jobs, better results.

Interested? We’ve got a great demo ready and waiting.

