Behind the scenes: Rapid Hadoop deployment

If you’ve ever deployed a Hadoop cluster from scratch on internal hardware or EC2, you know there are a lot of details to get right.  Syncing time with NTP, setting up passwordless SSH login across all the nodes, and making sure you have all the prerequisite packages installed are just the beginning.  Then you have to actually deploy Hadoop.  Even with a management tool like Ambari, a lot of time goes into clicking through the web interface and deploying software.  In this article I’m going to describe why we invested in a framework for rapid Hadoop deployment with Docker and Ansible.

At WANdisco we have teams of engineers and solutions architects testing our latest products on a daily basis, so automation is a necessity.  Last year I spent some time on a Vagrant-Puppet toolkit to set up EC2 images and deploy Hadoop using Ambari blueprints.  As an initial effort it was pretty good, but I never invested the time to handle the cross-node dependencies.  For instance, after the images were provisioned with all the prerequisites, I manually ran another Puppet script to deploy Ambari, then another one to deploy Hue, rather than having a master process that handled the timing and coordination.

Luckily we have a great automation team in our Sheffield office that set up a push-button solution using Docker and Ansible.  With a single invocation you get:

  • 3 clusters (mix-and-match with the distributions you prefer)
  • Each cluster has 7 containers.  The first runs the management tool (like Ambari), the second runs the NameNode and most of the master services, the third runs Hue, and the others are data nodes.
  • All of the networking and other services are registered correctly.
  • WANdisco Fusion installed.
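
To make the layout above concrete, here is a minimal Python sketch that enumerates the containers for three clusters of seven.  It is not our actual tooling: the cluster names, the container naming scheme, and the role labels ("manager", "master", "hue", "datanode") are all hypothetical and chosen only for illustration.

```python
# Sketch of the cluster layout described above. Names and role labels are
# illustrative assumptions, not the identifiers used by the real framework.

# The first three containers have fixed roles; the rest are data nodes.
FIXED_ROLES = ["manager", "master", "hue"]


def cluster_layout(cluster_name, containers=7):
    """Return a list of (container_name, role) pairs for one cluster."""
    if containers <= len(FIXED_ROLES):
        raise ValueError("need the three fixed roles plus at least one data node")
    roles = FIXED_ROLES + ["datanode"] * (containers - len(FIXED_ROLES))
    return [
        (f"{cluster_name}-node{index}", role)
        for index, role in enumerate(roles, start=1)
    ]


def deployment_layout(cluster_names, containers=7):
    """Return a dict mapping each cluster name to its container layout."""
    return {name: cluster_layout(name, containers) for name in cluster_names}


if __name__ == "__main__":
    layout = deployment_layout(["cluster1", "cluster2", "cluster3"])
    for name, nodes in layout.items():
        print(name)
        for container, role in nodes:
            print(f"  {container}: {role}")
```

Passing a smaller `containers` value (for example 4) models a trimmed-down cluster with fewer data nodes.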

Starting from a bare-metal host, it takes about 20 minutes to do a one-time setup with Puppet that installs Docker and the Ansible framework and builds the Docker images.  Once that first-time setup is done, a simple script starts the Docker containers and runs Ansible to deploy Hadoop.  That takes about 20 minutes for a clean install, or 2-3 minutes to refresh the clusters with the latest build of our products.

That’s a real time-saver.  Engineers can refresh with a new build in minutes, and solutions architects can set up a brand new demo environment in under half an hour.  Docker is ideal for demo purposes as well.  Cutting down the number of nodes lets the whole package run comfortably on a modern laptop, and simply pausing a container is an easy way to simulate node failures.  (When you’re demonstrating the value of active-active replication, simulating failure is an everyday task.)
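
As a rough illustration of the pause trick, the Python sketch below wraps the standard `docker pause` and `docker unpause` commands to freeze a container for a fixed interval.  The helper names and the container name in the usage line are hypothetical, and this is not the script our team uses; it assumes the `docker` CLI is on the PATH.

```python
# Sketch: simulate a node failure by pausing a container, then resuming it.
# Helper names and the example container name are illustrative assumptions.
import subprocess
import time


def pause_command(container, resume=False):
    """Build the docker CLI command that pauses or resumes a container."""
    action = "unpause" if resume else "pause"
    return ["docker", action, container]


def simulate_outage(container, seconds, run=subprocess.run, sleep=time.sleep):
    """Pause a container for `seconds`, then always try to resume it.

    `run` and `sleep` are injectable so the function can be exercised
    without a Docker daemon.
    """
    run(pause_command(container), check=True)
    try:
        sleep(seconds)
    finally:
        # Resume even if the wait is interrupted, so the node is not left frozen.
        run(pause_command(container, resume=True), check=True)


if __name__ == "__main__":
    # "cluster1-node4" is a made-up container name.
    simulate_outage("cluster1-node4", seconds=30)
```

A paused container keeps its state but stops responding, so the rest of the cluster sees it as an unresponsive node until it is resumed.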

As always, DevOps is a work in progress.  The team is making improvements every week, and I think with improved use of Docker images we can cut the cluster creation time down even more.

That’s a quick peek at how our internal engineering teams are using automation to speed up development and testing of our Hadoop products.  If you’d like to learn more, I encourage you to tweet @wandisco with questions, or ask on our Hadoop forum.
