Tag Archive for 'Hadoop 2.0'

A View From Strata NY: Big Data is Getting Bigger

In general, a trade show is a dangerous place to gauge sentiment. Full of marketing and sales, backslapping and handshakes, and marketecture rather than architecture, the world is viewed through rose-tinted spectacles. Even so, Strata, the Hadoop and Big Data conference held in New York last week, was very interesting, albeit through my own rose-tinted spectacles.

Firstly, the sheer volume of people, over 3,500, is telling.  This show used to draw a few hundred attendees, primarily techies inventing the future.  The show is now bigger, much bigger.  A cursory glance at the exhibit hall revealed a mix of the biggest tech companies and hot start-ups.  The keynotes, to the disappointment of those original techies, were primarily press-driven product announcements lacking real technical substance.  This is not such a bad thing, though: it's a sign that Hadoop is coming of age. It's what happens when technology moves into the mainstream.

Second, the agenda has changed quite dramatically.  Companies looking to deploy Hadoop are no longer trying to figure out whether it might fit into their data centers; they are trying to figure out how to deploy it.  2014 will indeed be the end of trials and the beginning of full-scale enterprise roll-out.  The use cases are all over the place.  Analysts yearn for clues and clusters to explain this: "Are you seeing mainly telcos or financial services?"  Analysts, of course, must try to enumerate in order to explain, but the shift is seismic, and the only explanation is a fundamental change in the very nature of enterprise applications.

My third theme is the discussion around why Hadoop is driving this move to rewrite enterprise applications.  As someone at the show told me, "the average age of an enterprise application is 19 years."  Hence, this is part of a classic business cycle.  Hadoop is a major technological shift that takes advantage of dramatic changes in the capabilities and economics of hardware.  Expensive spinning disks, processor speeds, bandwidth, networks and so on were limitations, and hence assumptions, that the last generation of enterprise applications had to deal with.  Commodity hardware and massive in-memory processing are the new assumptions that Hadoop exploits.  In a few years we will not be talking about 'Big Data'; we will simply use the term 'Data', because it will no longer be unusual for it to be so large in relative terms.

My fourth observation was that Hadoop 2 has changed the agenda for the type of use case.  In very rough terms, Hadoop 1 was primarily about storage and batch processing; Hadoop 2 is about YARN and run-time applications.  In other words, processing can now take place on top of Hadoop, rather than storing data in Hadoop but processing it somewhere else.  This change is highly disruptive because it means that software vendors can no longer rely on customers using their products alongside Hadoop.  Rather, customers are talking about building on top of Hadoop.  To them, Hadoop is a new type of operating system.  This disruption is very good news for the new breed of companies building pure applications from the ground up, and really bad news for those who believe they can loosely integrate or even store data in two places. That's not going to happen. Some of the traditional companies had a token presence at Strata that suggests they are still unsure of exactly what they are going to do: they are neither fully embracing nor ignoring this new trend.

My final observation is about confusion.  There is a lot of money at stake here, so naturally everyone wants a piece of the action.  There are a lot of flashing lights and noise from vendors, lavish claims and a lack of substance.  Forking core open source is nearly always a disaster. As open-source guru Karl Fogel puts it, forks happen because of irreconcilable technical disagreements or interpersonal conflicts, and they are something developers should fear and try hard to avoid.  Forking creates natural barriers to using third-party products, and with an open source project moving as quickly as this one, you have to stay super-close to the de facto open source project.

A forked version of core Hadoop is not Hadoop; it's something else.  If customers go down a forked path, it is difficult to get back, and they will lose their competitive edge because they will be unable to use the products being built by the wider community.  Customers should think of Hadoop like an operating system or a database: if it is merely embedded and heavily modified, then it is not Hadoop.

So 2014 it is then.  As the Wall Street Journal put it: "Elephant in the Room to Weigh on Growth for Oracle, Teradata."

Here's a great video demo of the new @WANdisco continuous availability technology running on the Hortonworks Hadoop 2.2 distro.

 


About David Richards

David is CEO, President and co-founder of WANdisco and has quickly established WANdisco as one of the world's most promising technology companies. Since co-founding the company in Silicon Valley in 2005, David has led WANdisco on a course for rapid international expansion, opening offices in the UK, Japan and China. David spearheaded the acquisition of Altostor, which accelerated the development of WANdisco's first products for the Big Data market. The majority of WANdisco's core technology is now produced out of the company's flourishing software development bases in David's hometown of Sheffield, England, and in Belfast, Northern Ireland. David has become recognised as a champion of British technology and entrepreneurship. In 2012, he led WANdisco to a hugely successful listing on the London Stock Exchange (WAND:LSE), raising over £24m to drive business growth. With over 15 years' executive experience in the software industry, David sits on a number of advisory and executive boards of Silicon Valley start-up ventures. A passionate advocate of entrepreneurship, he has established many successful start-up companies in enterprise software and is recognised as an industry leader in Enterprise Application Integration and its standards. David is a frequent commentator on a range of business and technology issues, appearing regularly on Bloomberg and CNBC. Profiles of David have appeared in a range of leading publications including the Financial Times, The Daily Telegraph and the Daily Mail. Specialties: IPOs, start-ups, entrepreneurship, CEO, board member and advisor roles, investing, venture capital, offshore development, financing, M&A.

Introduction to Hadoop 2, with a simple tool for generating Hadoop 2 config files

Introduction to Hadoop 2
Core Hadoop 2 consists of the distributed filesystem HDFS and the compute framework YARN.

HDFS is a distributed filesystem that can be used to store anywhere from a few gigabytes to many petabytes of data. It is distributed in the sense that it utilizes a number of slave servers, ranging from three to a few thousand, to store and serve files.
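To make this concrete, here is what storing a file in HDFS looks like from the standard hdfs command-line client once a cluster is running (the directory and file names below are just placeholders for illustration):

$ hdfs dfs -mkdir -p /user/alice/logs          # create a directory in HDFS
$ hdfs dfs -put access.log /user/alice/logs/   # copy a local file into HDFS
$ hdfs dfs -ls /user/alice/logs                # list the directory to confirm the upload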

YARN is the compute framework for Hadoop 2. It manages the distribution of compute jobs to the very same slave servers that store the HDFS data. This encourages data locality: wherever possible, compute tasks are scheduled on the servers that already hold the HDFS blocks they need, so they do not have to reach out over the network to read their input.

Naturally, traditional software written to run on a single server will not work unchanged on Hadoop. New software needs to be developed using a programming paradigm called MapReduce. Hadoop's native MapReduce framework uses Java; however, Hadoop MR programs can be written in almost any language. Also, higher-level languages such as Pig may be used to write scripts that compile into Hadoop MR jobs.
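To make the MapReduce paradigm concrete, below is a minimal sketch of the classic word-count job written against the Hadoop 2 Java API (org.apache.hadoop.mapreduce). It is purely illustrative and is not part of the configuration tool described later in this post.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: mappers emit (word, 1) pairs, reducers sum the counts per word.
public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for every token in the line
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();             // add up the counts for this word
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this would typically be submitted with something along the lines of: hadoop jar wordcount.jar WordCount /input /output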

Hadoop 2 Server Daemons
HDFS:  HDFS consists of a master metadata server called the NameNode, and a number of slave servers called DataNodes.

The NameNode function is provided by a single Java daemon, the NameNode, which runs on just one machine: the master. DataNode functionality is provided by a Java daemon called the DataNode that runs on each slave server.

Since the NameNode function is provided by a single Java daemon, it turns out to be a Single Point of Failure (SPOF). Open source Hadoop 2 has a number of ways to keep a standby server waiting to take over the function of the NameNode daemon, should the single NameNode fail. All of these standby solutions take 5 to 15 minutes to fail over. While this failover is underway, batch MR jobs on the cluster will fail. Further, an active HBase cluster with a high write load may not necessarily survive the NameNode failover. Our company, WANdisco, has a commercial product called the NonStop NameNode that solves this SPOF problem.

Configuration parameters for the HDFS daemons, the NameNode and the DataNode, are all stored in a file called hdfs-site.xml. At the end of this blog entry, I have included a simple Java program that generates all the config files necessary for running a Hadoop 2 cluster, including an hdfs-site.xml.
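For orientation, a generated hdfs-site.xml might contain something along these lines. This is only a minimal sketch; the exact properties and values depend on the arguments you pass to the generator, and the paths shown are the nominal ones mentioned further down.

<?xml version="1.0"?>
<configuration>
  <!-- Local directory on the NameNode where HDFS metadata is kept -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/hadoop/name</value>
  </property>
  <!-- Local directory on each DataNode where HDFS blocks are stored -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/var/hadoop/data</value>
  </property>
  <!-- Number of replicas kept for each HDFS block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>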

YARN:
The YARN framework has a single master daemon, the YARN ResourceManager, that runs on a master node, and a YARN NodeManager on each of the slave nodes. Additionally, YARN has a single proxy server and a single MapReduce job history server. As the name indicates, the function of the MapReduce job history server is to store and serve a history of the MapReduce jobs that were run on the cluster.

The configuration for all of these daemons is stored in yarn-site.xml.
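Again purely for orientation, a minimal yarn-site.xml might look roughly like this. The hostname is the placeholder used in the example run further down; the generator fills in the real values for your cluster.

<?xml version="1.0"?>
<configuration>
  <!-- Hostname of the single YARN ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>yarn.hadoop.wandisco</value>
  </property>
  <!-- Auxiliary service the NodeManagers run so MapReduce jobs can shuffle map output -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>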

Daemons that comprise core Hadoop (HDFS and YARN)

NameNode: 1 instance, the HDFS metadata server, usually run on the master. Web port 50070; RPC port 8020.
DataNode: 1 per slave, the HDFS data server. Web port 50075; RPC ports 50010 (data transfer) and 50020 (block metadata).
ResourceManager: 1 instance, the YARN ResourceManager, usually on a master server. Web port 8088; RPC ports 8030 (Scheduler), 8031 (Resource Tracker), 8032 (Resource Manager) and 8033 (Admin).
NodeManager: 1 per slave, the YARN NodeManager. Web port 8042; RPC ports 8040 (Localizer) and 8041 (NodeManager).
ProxyServer: 1 instance, the YARN proxy server. Web port 8034.
JobHistory: 1 instance, the MapReduce job history server. Web port 19888; RPC port 10020.

A Simple Program for generating Hadoop 2 Config files
This is a simple program that generates core-site.xml, hdfs-site.xml, yarn-site.xml and capacity-scheduler.xml. You need to supply this program with the following information:

  1. nnHost: the HDFS NameNode server hostname
  2. nnMetadataDir: the directory on the NameNode server's local filesystem where the NameNode metadata will be stored, nominally /var/hadoop/name. If you are using our WDD RPMs, note that this is the location all of our documentation refers to.
  3. dnDataDir: the directory on each DataNode (slave) machine where HDFS data blocks are stored, nominally /var/hadoop/data. This location should be on the biggest disk on the DataNode. If you are using our WDD RPMs, note that this is the location all of our documentation refers to.
  4. yarnRmHost: the YARN ResourceManager hostname

Download the following jar: makeconf
Here is an example run:

$ java -classpath makeconf.jar com.wandisco.hadoop.makeconf.MakeHadoopConf nnHost=hdfs.hadoop.wandisco nnMetadataDir=/var/hadoop/name dnDataDir=/var/hadoop/data yarnRmHost=yarn.hadoop.wandisco
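Once the config files are generated, deployment follows the usual Hadoop 2 routine. Roughly, and assuming a standard Apache-style layout where the configuration directory is $HADOOP_HOME/etc/hadoop (adjust paths to your installation):

# Copy the generated XML files into the Hadoop configuration directory on every node
cp core-site.xml hdfs-site.xml yarn-site.xml capacity-scheduler.xml $HADOOP_HOME/etc/hadoop/

# Format HDFS once, on the NameNode only
hdfs namenode -format

# Start the HDFS and YARN daemons
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh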

Download the source code for this simple program here: makeconf-source

Incidentally, if you are looking for an easy way to get Hadoop 2, try our free Hadoop distro, the WANdisco Distro (WDD). It is a packaged, tested and certified version of the latest Hadoop 2.

 


About Jagane Sundar