Running Hadoop trial clusters

How not to blow your EC2 budget

Lately I’ve been spinning up a lot of Hadoop clusters for various demos and just trying new features. I’m pretty sensitive to exploding my company’s EC2 budget, so I dutifully shut down all my EC2 instances when I’m done.

Doing so has been proven to be a bit painful, however. Restarting a cluster takes time and there’s usually a service or two that don’t start properly and need to be restarted. I don’t have a lot of dedicated hardware available — just my trusty MBP. I realize I could run a single-node cluster using a sandbox from one of the distribution vendors, but I really need multiple nodes for some of my testing.

I want to run multiple node clusters as VMs on my laptop and be able to leave these up and running over a weekend if necessary, and I’ve found a couple of approaches that look promising:

  • Setting up traditional VMs using Vagrant. I use Vagrant for almost all of my VMs and this would be comfortable for me.
  • Trying to run a cluster using Docker. I have not used Docker but have heard a lot about it, and I’m hoping it will help with memory requirements. I have 16 GB on my laptop but am not sure how many VMs I can run in practice.

Here’s my review after quickly trying both approaches.

VMs using Vagrant

The Vagrant file referenced in the original article works well. After installing a couple of Vagrant plugins I ran ‘vagrant up’ and had a 4-node cluster ready for CDH installation. Installation using Cloudera Manager started smoothly.

Unfortunately, it looks like running 4 VMs consuming 2 GB of RAM each is just asking too much of this machine. (The issue might be CPU more than RAM — the fan was in overdrive during installation.) I could only get 3 of the VMs to complete installation, and then during cluster configuration only two of them were available to run services.


Running Docker on the Mac isn’t ideal, as you need to fire up a VM to actually manage the containers. (I’ve yet to find the perfect development laptop. Windows isn’t wholly compatible with all of the tools I use, Mac isn’t always just like Linux, and Linux doesn’t play well with the Office tools I have to use.) On the Mac there’s definitely a learning curve. The docker containers that actually run the cluster are only accessible to the VM that’s hosting the containers, at least in the default configuration I used. That means I had to forward ports from that VM to my host OS.

Briefly, my installation steps were:

  • Launch the boot2docker app
  • Import the docker images referenced in the instructions (took about 10 minutes)
  • Run a one line command to deploy the cluster (this took about 3 minutes)
  • Grab the IP addresses for the containers and set up port forwarding on the VM for the important ports (8080, 8020, 50070, 50075)
  • Log in to Ambari and verify configuration

At this point I was able to execute a few simple HDFS operations using webhdfs to confirm that I had an operational cluster.

The verdict

The verdict is simple — I just don’t have enough horsepower to run even a 4-node cluster using traditional VMs. With Docker, I was running the managing VM and three containers in a few minutes, and I didn’t get the sense that my machine was struggling at all. I did take a quick look at resource usage but I hesitate to report the numbers as I had a lot of other stuff running on my machine at the same time.

Docker takes some getting used to, but once I had a sense of how the containers and VM were interacting I figured out how to manage the ports and configuration. I think learning Docker will be like learning Vim after using a full blown IDE — if you invest the time, you’ll be able to do some things very quickly without putting a lot of stress on your laptop.

1 Response to “Running Hadoop trial clusters”

Leave a Reply