Here is what I did to run the newly built Hadoop in distributed mode on three CentOS VMs. The VMs, and the scripts that I used for creating the Hadoop cluster, are available for download later in this blog post.
First, setting up three VMs for running Hadoop
- Install ESXi on a suitable machine. I used a 4-core machine with 8GB of RAM and two 300GB disks
- Create three 32-bit CentOS 6.3 VMs. Each VM has 32GB of disk space and 2GB of RAM. In the CentOS installer, I selected ‘Basic Server’ as the installation type.
- The root password is altostor
- I assigned IP addresses 10.0.0.175, 10.0.0.176 and 10.0.0.177 to the three virtual machines. I assigned the hostnames master.altostor.net, slave0.altostor.net and slave1.altostor.net to these machines and created a hosts file.
- I set up passwordless ssh so that the ‘root’ user can ssh from master.altostor.net to slave0 and slave1 without typing in a password. The web is full of guides for this, but in brief:
  - Turn off SELinux on the three VMs by editing /etc/selinux/config and setting SELINUX=disabled.
  - Run ‘ssh-keygen -t rsa’ on the master
  - Create a /root/.ssh directory on slave0 and slave1 and set its permissions to 700
  - scp the file /root/.ssh/id_rsa.pub to slave0 and slave1 as /root/.ssh/authorized_keys. Set the permissions of authorized_keys to 644. On the master node itself, copy id_rsa.pub to authorized_keys.
  - From master, test that you can ssh into ‘slave0’, ‘slave1’ and into ‘master’ itself
- Create a group hadoop and a user hdfs in this new group on all three machines
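The VM preparation steps above can be sketched as a small shell script, meant to be run as root on the master. This is only a rough sketch and not the exact commands used for this post. It assumes the hosts file also carries the short aliases master, slave0 and slave1, that the key is generated with an empty passphrase at the default location, and that the hdfs user is created over ssh once the keys are in place. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: prints each command by default. Set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"          # printed form loses the original quoting
    else
        "$@"
    fi
}

# Hosts entries for the three VMs. The short aliases are an assumption.
hosts_entries() {
    cat <<'EOF'
10.0.0.175 master.altostor.net master
10.0.0.176 slave0.altostor.net slave0
10.0.0.177 slave1.altostor.net slave1
EOF
}

# Disable SELinux in the config file. Repeat on every VM; a reboot is needed
# before the change takes effect.
disable_selinux() {
    run sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
}

# Passwordless ssh for root from the master to slave0, slave1 and itself.
# The ssh/scp calls to the slaves still prompt for a password at this stage.
setup_ssh() {
    run ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
    for host in slave0 slave1; do
        run ssh "root@$host" "mkdir -p /root/.ssh && chmod 700 /root/.ssh"
        run scp /root/.ssh/id_rsa.pub "root@$host:/root/.ssh/authorized_keys"
        run ssh "root@$host" "chmod 644 /root/.ssh/authorized_keys"
    done
    run cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
    run chmod 644 /root/.ssh/authorized_keys
}

# Group hadoop and user hdfs on all three machines.
create_user() {
    for host in master slave0 slave1; do
        run ssh "root@$host" "groupadd hadoop && useradd -g hadoop hdfs"
    done
}

hosts_entries
disable_selinux
setup_ssh
create_user
```

Note that scp overwrites any existing authorized_keys on the slaves, which is fine on freshly installed VMs but not on machines that already hold other keys.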
Next, creating the Hadoop cluster. Download the file with the scripts and config files necessary for installing Hadoop from the link at the top of this posting. Unpack it on the master, drop in the Hadoop binary tarball hadoop-0.23.3.tar.gz, and run ‘./create-hadoop.sh’. The script does the following:
- Master setup:
  - Kills all running java processes
  - Deletes the data, pid and log directories
  - Copies the config files (config file templates are included with the script tar file) and customizes them
  - Creates the slaves file with slave0 and slave1 in it
  - Formats the HDFS filesystem
  - Starts up the NameNode and creates /user in HDFS. Note that the NameNode java process runs as the Linux user root
- Slave setup:
  - Kills all running java processes
  - Removes the Hadoop data, pid and log directories
  - Copies the Hadoop binaries directory over from master to slave with scp
  - Starts up the Hadoop DataNode process as the Linux user root
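For readers who want a feel for the steps before opening the download, the outline below shows roughly what such a script could look like. It is an illustration only and not the contents of the real create-hadoop.sh. The install path, the data, pid and log directories, and the conf-templates directory name are all hypothetical, and the config customization step is left out. It assumes the Hadoop 0.23 layout with bin/hdfs, sbin/hadoop-daemon.sh and etc/hadoop/slaves. As written it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Illustrative outline only; NOT the real create-hadoop.sh.
DRY_RUN="${DRY_RUN:-1}"
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop-0.23.3}"   # hypothetical install path
DATA_DIR="${DATA_DIR:-/var/hadoop/data}"           # hypothetical
PID_DIR="${PID_DIR:-/var/hadoop/pid}"              # hypothetical
LOG_DIR="${LOG_DIR:-/var/hadoop/log}"              # hypothetical
SLAVES="slave0 slave1"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Kill java processes and wipe old state; used on the master and on each slave.
clean_cmd="pkill java; rm -rf $DATA_DIR $PID_DIR $LOG_DIR"

# One slave hostname per line, as expected in the slaves file.
slaves_file() {
    printf '%s\n' $SLAVES
}

master_setup() {
    run sh -c "$clean_cmd"
    # conf-templates is a hypothetical name for the bundled config templates.
    run cp -r conf-templates/. "$HADOOP_HOME/etc/hadoop/"
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ write $HADOOP_HOME/etc/hadoop/slaves"
    else
        slaves_file > "$HADOOP_HOME/etc/hadoop/slaves"
    fi
    run "$HADOOP_HOME/bin/hdfs" namenode -format
    run "$HADOOP_HOME/sbin/hadoop-daemon.sh" start namenode
    run "$HADOOP_HOME/bin/hdfs" dfs -mkdir /user
}

slave_setup() {
    host="$1"
    run ssh "root@$host" "$clean_cmd"
    # Copy the whole Hadoop directory from the master to the same parent directory.
    run scp -r "$HADOOP_HOME" "root@$host:$(dirname "$HADOOP_HOME")/"
    run ssh "root@$host" "$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode"
}

master_setup
for s in $SLAVES; do
    slave_setup "$s"
done
```

Defaulting to a dry run is a deliberate choice here, since the cleanup step kills every java process and removes directories on all three machines.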