Introduction to Hadoop 2, with a simple tool for generating Hadoop 2 config files

Introduction to Hadoop 2
Core Hadoop 2 consists of the distributed filesystem HDFS and the compute framework YARN.

HDFS is a distributed filesystem that can store anywhere from a few gigabytes to many petabytes of data. It is distributed in the sense that it uses a number of slave servers, ranging from 3 to a few thousand, to store and serve files.

YARN is the compute framework for Hadoop 2. It manages the distribution of compute jobs to the very same slave servers that store the HDFS data. Scheduling work close to the data means that compute jobs can usually read their input from local disk instead of reaching out over the network to access the data stored in HDFS.

Naturally, traditional software written to run on a single server will not work on Hadoop. New software needs to be developed using a special programming paradigm called MapReduce. Hadoop’s native MapReduce framework uses Java; however, Hadoop MR programs can be written in almost any language. Higher-level languages such as Pig may also be used to write scripts that compile into Hadoop MR jobs.
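
To make the paradigm concrete, here is a minimal sketch of the map, shuffle and reduce phases for a word count, written in plain Java with no Hadoop dependency. The class name MapReduceSketch and its methods are purely illustrative and are not part of Hadoop's API. A real job would be written against Hadoop's own MapReduce classes, and the phases would run across many slave servers instead of inside a single JVM.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: simulates the three MapReduce phases inside one JVM.
final class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group every value that was emitted for the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values for one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Drives the three phases in order and returns word -> count, sorted by word.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hdfs stores data", "yarn runs jobs on data");
        // prints {data=2, hdfs=1, jobs=1, on=1, runs=1, stores=1, yarn=1}
        System.out.println(wordCount(lines));
    }
}
```

The point of the split is that map and reduce each see only a small piece of the data, which is what lets a framework run many copies of them in parallel on different servers.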

Hadoop 2 Server Daemons
HDFS: HDFS consists of a master metadata server called the NameNode, and a number of slave servers called DataNodes.

The NameNode function is provided by a single Java daemon, also called NameNode, which runs on just one machine: the master. DataNode functionality is provided by a Java daemon called DataNode that runs on each slave server.

Since the NameNode function is provided by a single Java daemon, it is a Single Point of Failure (SPOF). Open source Hadoop 2 has a number of ways to keep a standby server waiting to take over the function of the NameNode daemon, should the single NameNode fail. All of these standby solutions take 5 to 15 minutes to fail over. While this failover is underway, batch MR jobs on the cluster will fail. Further, an active HBase cluster with a high write load may not survive the NameNode failover. Our company WANdisco has a commercial product called the NonStop NameNode that solves this SPOF problem.

Configuration parameters for the HDFS daemons NameNode and DataNode are all stored in a file called hdfs-site.xml. At the end of this blog entry, I have included a simple Java program that generates all the config files necessary for running a Hadoop 2 cluster, including hdfs-site.xml.
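
To give a feel for what such a file contains, here is a rough sketch of how a minimal hdfs-site.xml could be rendered. This is not the code of the makeconf tool: the class HdfsSiteSketch and its methods are hypothetical. The property names dfs.namenode.name.dir and dfs.datanode.data.dir are the Hadoop 2 settings for the NameNode metadata directory and the DataNode block directory, and the paths are the nominal ones used later in this post. The real tool may well emit more properties than these two.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the makeconf source.
final class HdfsSiteSketch {

    // Render properties in the <configuration>/<property> layout used by Hadoop *-site.xml files.
    static String toSiteXml(Map<String, String> props) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\"?>\n<configuration>\n");
        for (Map.Entry<String, String> p : props.entrySet()) {
            xml.append("  <property>\n")
               .append("    <name>").append(escape(p.getKey())).append("</name>\n")
               .append("    <value>").append(escape(p.getValue())).append("</value>\n")
               .append("  </property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }

    // Escape the characters that are not allowed verbatim in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Assumption: only the two directory settings are written; a real cluster needs more.
    static String hdfsSite(String nnMetadataDir, String dnDataDir) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("dfs.namenode.name.dir", nnMetadataDir);
        props.put("dfs.datanode.data.dir", dnDataDir);
        return toSiteXml(props);
    }

    public static void main(String[] args) {
        System.out.print(hdfsSite("/var/hadoop/name", "/var/hadoop/data"));
    }
}
```

The same toSiteXml helper would serve for any other *-site.xml file, since they all share this name/value layout.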

YARN: The YARN framework has a single master daemon called the YARN ResourceManager that runs on a master node, and a YARN NodeManager on each of the slave nodes. Additionally, YARN has a single Proxy server and a single MapReduce job history server. As the name indicates, the job history server stores and serves a history of the MapReduce jobs that were run on the cluster.

The configuration for all of these daemons is stored in yarn-site.xml.

Daemons that comprise core Hadoop (HDFS and YARN)

Daemon Name     | Number      | Description                                      | Web Port (if any) | RPC Port (if any)
----------------+-------------+--------------------------------------------------+-------------------+------------------
NameNode        | 1           | HDFS Metadata Server, usually run on the master  | 50070             | 8020
DataNode        | 1 per slave | HDFS Data Server, one per slave server           | 50075             | 50010 (Data transfer RPC), 50020 (Block metadata RPC)
ResourceManager | 1           | YARN ResourceManager, usually on a master server | 8088              | 8030 (Scheduler RPC), 8031 (Resource Tracker RPC), 8032 (Resource Manager RPC), 8033 (Admin RPC)
NodeManager     | 1 per slave | YARN NodeManager, one per slave                  | 8042              | 8040 (Localizer), 8041 (NodeManager RPC)
ProxyServer     | 1           | YARN Proxy Server                                | 8034              |
JobHistory      | 1           | MapReduce Job History Server                     | 19888             | 10020
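
As a small illustration of how these ports are used, the sketch below builds the web UI address of a daemon from its hostname. The class DaemonWebUrls is hypothetical, and it only covers the four HDFS and YARN daemons whose web ports are listed above. It also assumes the cluster still uses those default ports; any port can be overridden in the site config files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps a daemon name to its default web UI address.
final class DaemonWebUrls {

    // Default web ports taken from the table of core Hadoop daemons.
    static final Map<String, Integer> WEB_PORTS = new LinkedHashMap<>();
    static {
        WEB_PORTS.put("NameNode", 50070);
        WEB_PORTS.put("DataNode", 50075);
        WEB_PORTS.put("ResourceManager", 8088);
        WEB_PORTS.put("NodeManager", 8042);
    }

    // Build the URL, failing loudly for daemons this sketch does not know about.
    static String webUrl(String daemon, String host) {
        Integer port = WEB_PORTS.get(daemon);
        if (port == null) {
            throw new IllegalArgumentException("No web port known for daemon: " + daemon);
        }
        return "http://" + host + ":" + port + "/";
    }

    public static void main(String[] args) {
        System.out.println(webUrl("NameNode", "hdfs.hadoop.wandisco"));
        System.out.println(webUrl("ResourceManager", "yarn.hadoop.wandisco"));
    }
}
```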

A Simple Program for generating Hadoop 2 Config files
This is a simple program that generates core-site.xml, hdfs-site.xml, yarn-site.xml and capacity-scheduler.xml. You need to supply this program with the following information:

  1. nnHost: HDFS NameNode server hostname
  2. nnMetadataDir: The directory on the NameNode server’s local filesystem where the NameNode metadata will be stored, nominally /var/hadoop/name. If you are using our WDD rpms, note that this is the location that all of our documentation will refer to.
  3. dnDataDir: The directory on each DataNode or slave machine where HDFS data blocks are stored, nominally /var/hadoop/data. This location should be on the biggest disk on the DataNode. If you are using our WDD rpms, note that this is the location that all of our documentation will refer to.
  4. yarnRmHost: YARN ResourceManager hostname

Download the following jar: makeconf
Here is an example run:

$ java -classpath makeconf.jar com.wandisco.hadoop.makeconf.MakeHadoopConf nnHost=hdfs.hadoop.wandisco nnMetadataDir=/var/hadoop/name dnDataDir=/var/hadoop/data yarnRmHost=yarn.hadoop.wandisco
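
For readers curious how key=value arguments like these might be handled, here is a minimal sketch. It is not taken from the makeconf source: the class MakeConfArgsSketch and its methods are hypothetical. The defaultFs helper assumes that core-site.xml points clients at the NameNode through the fs.defaultFS property, using the NameNode RPC port 8020.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of key=value argument handling, not the makeconf source.
final class MakeConfArgsSketch {

    // The four inputs described in the list above.
    static final List<String> REQUIRED =
            Arrays.asList("nnHost", "nnMetadataDir", "dnDataDir", "yarnRmHost");

    // Split each argument at the first '=' and check that nothing required is missing.
    static Map<String, String> parse(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0 || eq == arg.length() - 1) {
                throw new IllegalArgumentException("Expected key=value but got: " + arg);
            }
            parsed.put(arg.substring(0, eq), arg.substring(eq + 1));
        }
        for (String key : REQUIRED) {
            if (!parsed.containsKey(key)) {
                throw new IllegalArgumentException("Missing required argument: " + key);
            }
        }
        return parsed;
    }

    // Assumption: the value written to fs.defaultFS is hdfs://<nnHost>:8020.
    static String defaultFs(Map<String, String> parsed) {
        return "hdfs://" + parsed.get("nnHost") + ":8020";
    }

    public static void main(String[] args) {
        Map<String, String> parsed = parse(args);
        System.out.println("fs.defaultFS = " + defaultFs(parsed));
    }
}
```

Validating all four inputs up front means a typo in an argument name fails immediately instead of producing a config file with a missing value.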

Download the source code for this simple program here: makeconf-source

Incidentally, if you are looking for an easy way to get Hadoop 2, try our free Hadoop distro – the WANdisco Distro (WDD). It is a packaged, tested and certified version of the latest Hadoop 2.



About Jagane Sundar
