If you’ve ever deployed a Hadoop cluster from scratch on internal hardware or EC2, you know there are a lot of details to get right. Syncing time with NTP, setting up password-less login across all the nodes, and making sure you have all the prerequisite packages installed are just the beginning. Then you have to actually deploy Hadoop. Even with a management tool like Ambari, there’s a lot of time spent clicking through the web interface and deploying software. In this article I’m going to describe why we invested in a framework for rapid Hadoop deployment with Docker and Ansible.
At WANdisco we have teams of engineers and solutions architects testing our latest products on a daily basis, so automation is a necessity. Last year I spent some time on a Vagrant-Puppet toolkit to set up EC2 images and deploy Hadoop using Ambari blueprints. As an initial effort it was pretty good, but I never invested the time to handle the cross-node dependencies. For instance, after the images were provisioned with all the prerequisites, I manually ran another Puppet script to deploy Ambari, then another to deploy Hue, rather than having a master process handle the timing and coordination.
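For readers who haven’t used them, Ambari blueprints describe a cluster’s layout and stack as JSON that you submit to Ambari’s REST API, instead of clicking through the install wizard. A minimal sketch might look like this (the blueprint name, host group names, cardinalities, and stack version here are illustrative, not our actual configuration):

```json
{
  "Blueprints": {
    "blueprint_name": "demo-cluster",
    "stack_name": "HDP",
    "stack_version": "2.2"
  },
  "host_groups": [
    {
      "name": "master",
      "cardinality": "1",
      "components": [
        { "name": "NAMENODE" },
        { "name": "RESOURCEMANAGER" },
        { "name": "ZOOKEEPER_SERVER" }
      ]
    },
    {
      "name": "workers",
      "cardinality": "4",
      "components": [
        { "name": "DATANODE" },
        { "name": "NODEMANAGER" }
      ]
    }
  ]
}
```

You register the blueprint with Ambari and then create a cluster from it by supplying a mapping of real hosts to the host groups, which is what makes blueprints a good fit for scripted, repeatable deployments.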
Luckily we have a great automation team in our Sheffield office that set up a push-button solution using Docker and Ansible. With a single invocation you get:
- 3 clusters (mix-and-match with the distributions you prefer)
- Each cluster has 7 containers. The first runs the management tool (like Ambari), the second runs the NameNode and most of the master services, the third runs Hue, and the others are data nodes.
- All of the networking and other services are registered correctly.
- WANdisco Fusion installed.
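In Ansible terms, each cluster’s seven containers map naturally onto inventory groups, which is what lets a single playbook run coordinate the cross-node ordering. A hypothetical sketch (the host and group names are invented for illustration):

```ini
# One cluster's seven containers, grouped by role
[management]
hdp-ambari        ; runs the management tool (e.g. Ambari)

[masters]
hdp-master        ; NameNode and most master services

[hue]
hdp-hue           ; Hue web UI

[datanodes]
hdp-data[1:4]     ; four data node containers
```

With an inventory like this, a playbook can target `management` first, wait for the server to come up, and only then register and provision the other groups.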
Starting from a bare metal host, it takes about 20 minutes to do a one-time setup with Puppet that installs Docker and the Ansible framework and builds the Docker images. Once that first-time setup is done, a simple script starts the Docker containers and runs Ansible to deploy Hadoop. That takes about 20 minutes for a clean install, or 2-3 minutes to refresh the clusters with the latest build of our products.
That’s a real time-saver. Engineers can refresh with a new build in minutes, and solutions architects can set up a brand new demo environment in under half an hour. Docker is ideal for demo purposes as well. Cutting down the number of nodes lets the whole package run comfortably on a modern laptop, and simply pausing a container is an easy way to simulate node failures. (When you’re demonstrating the value of active-active replication, simulating failure is an everyday task.)
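Simulating a node failure this way is a one-liner. Assuming a data node container named `hdp-data1` (a hypothetical name), the demo flow looks something like:

```shell
# Freeze every process in the container -- to the rest of the
# cluster the node simply stops responding, as in a real outage
docker pause hdp-data1

# ...demonstrate failover and replication behaviour...

# Resume the node and watch the cluster recover
docker unpause hdp-data1
```

Because `pause` freezes the container rather than stopping it, the node comes back with its state intact, which makes the recovery half of the demo just as quick as the failure.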
As always, DevOps is a work in progress. The team is making improvements every week, and I think with better use of Docker images we can cut the cluster creation time down even further.
That’s a quick peek at how our internal engineering teams are using automation to speed up development and testing of our Hadoop products. If you’d like to learn more, I encourage you to tweet @wandisco with questions, or ask on our Hadoop forum.