Hadoop Blog


Could Google have happened on Windows NT?

Last week I had the chance to deliver a keynote presentation at the recent BigDataCamp LA (video). The topic of my talk was how free and open-source software (FOSS) is changing the face of the modern technology landscape. This peaceful, non-violent, and yet very disruptive progress is forcing us to review and adjust our daily technological choices and preferences. From how we browse the Internet, to how we communicate with other people, shop, bank, and guard our privacy – everything is affected by open source platforms, applications, and services.

Open-source-devoted businesses enabled things like Hadoop, NewSQL platforms like VoltDB, and in-memory systems like GridGain. Here at WANdisco, we are working on an open-source implementation of consensus-based replication for HDFS and HBase. This has nothing to do with charity or altruism: delivering enterprise value on top of open-source software is a very viable business model that has been proven by many companies in the last 20 years.

I argue that in the absence of an openly available Linux OS under the GNU General Public License, Internet search as we know it wouldn’t have happened. Can you imagine Google data centers running on Windows NT? Judge for yourself. Even better, start contributing to the revolution!


About Konstantin Boudnik

Consensus-based replication in HBase

Following up on the recent blog about Hadoop Summit 2014, I wanted to share an update on the state of consensus-based replication (CBR) for HBase. As some of our readers might know, we are working on this technology directly in the Apache HBase project. As you may also know, we are big fans and proponents of strong consistency in distributed systems, although I think the phrase “strong consistency” is a made-up tautology, since anything else shouldn’t be called “consistency” at all.

When we first looked into availability in HBase we noticed that it relies heavily on the Zookeeper layer. There’s nothing wrong with ZK per se, but the way the implementation is done at the moment makes ZK an integral part of the HBase source code. This makes sense from a historical perspective, since ZK has been virtually the only technology to provide shared memory storage and distributed coordination capabilities for most of HBase’s lifespan. JINI, developed back in the day by Bill Joy, is worth mentioning in this regard, but I digress and will leave that discussion for another time.

The idea behind CBR is pretty simple: instead of trying to guarantee that all replicas in the system are synced after the fact, once an operation has already happened, such a system coordinates the intent of the operation. If consensus on the feasibility of the operation is reached, it is applied by each node independently. If consensus is not reached, the operation simply won’t happen. That’s pretty much the whole philosophy.
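To make the philosophy concrete, here is a minimal, purely illustrative Python sketch of that propose-then-apply flow. It is not WANdisco’s implementation and the names are made up; a real coordination engine runs across machines and can reject proposals, while this toy always agrees:

# Toy, single-process illustration of consensus-based replication.
# A real coordination engine (e.g. a Paxos variant) would run across machines
# and could reject proposals; this stand-in simply agrees to everything.
class ToyCoordinationEngine:
    def __init__(self, nodes):
        self.nodes = nodes
        self.order = 0                     # global order of agreed operations

    def propose(self, operation):
        agreed = True                      # toy: consensus is always reached
        if not agreed:
            return False                   # no consensus -> the operation never happens
        self.order += 1
        # Each node applies the agreed operation independently, in the agreed
        # order, so replicas converge without any post-factum syncing.
        for node in self.nodes:
            node.apply(self.order, operation)
        return True

class ReplicaNode:
    def __init__(self):
        self.state = {}                    # this node's local replica

    def apply(self, order, operation):
        operation(self.state)

nodes = [ReplicaNode() for _ in range(3)]
engine = ToyCoordinationEngine(nodes)
engine.propose(lambda state: state.update(rowkey="value"))
print([node.state for node in nodes])      # all three replicas end up identical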

Now, the details are more intricate, of course. We think that CBR is beneficial for any distributed system that requires strong consistency (learn more on the topic from the recent Michael Stonebraker interview [6] on Software Engineering Radio). In the Hadoop ecosystem it means that HDFS, HBase, and possibly other components can benefit from a common API to express the coordination semantics. Such an approach will help accommodate a variety of coordination engine (CE) implementations specifically tuned for network throughput, performance, or low-latency. Introducing this concept to HBase is somewhat more challenging, however, because unlike HDFS it doesn’t have a single HA architecture: the HMaster fail-over process relies solely on ZK, whereas HRegionServer recovery additionally depends on write-ahead log (WAL) splitting. Hence, before any meaningful progress on CBR can be made, we need to abstract most, if not all, concrete implementations of ZK-based functionality behind a well-defined set of interfaces. This will provide the ability to plug in alternative concrete CEs as the community sees fit.
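As a rough illustration of the kind of abstraction we have in mind (the actual interfaces are being worked out in the JIRAs referenced below, so treat the names here as hypothetical), HBase and HDFS code would talk only to a narrow coordination-engine interface, and a ZooKeeper-backed engine would be just one of several pluggable implementations:

from abc import ABC, abstractmethod

class CoordinationEngine(ABC):
    # Hypothetical interface sketch, not the actual API proposed in the JIRAs.
    @abstractmethod
    def propose(self, proposal):
        """Submit the intent of an operation for agreement."""

    @abstractmethod
    def register_learner(self, on_agreement):
        """Register a callback invoked for every agreed proposal, in order."""

class ZooKeeperEngine(CoordinationEngine):
    # One possible concrete engine; alternatives tuned for throughput,
    # latency, or WAN operation could be plugged in without touching HBase code.
    def propose(self, proposal):
        ...  # omitted in this sketch

    def register_learner(self, on_agreement):
        ...  # omitted in this sketch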

Below you can find the slides from my recent talk at the HBase Birds of a Feather session during Hadoop Summit [1], which cover the current state of development. References [2-5] will lead you directly to the ASF JIRA tickets that track the project’s progress.


  1. HBase Consensus BOF 2014
  2. https://issues.apache.org/jira/browse/HBASE-10909
  3. https://issues.apache.org/jira/browse/HBASE-11241
  4. https://issues.apache.org/jira/browse/HADOOP-10641
  5. https://issues.apache.org/jira/browse/HDFS-6469
  6. Michael Stonebraker on distributed and parallel DBs

About Konstantin Boudnik

A View From Strata NY: Big Data is Getting Bigger

In general, a trade show is a dangerous place to gauge sentiment. Full of marketing and sales, backslapping and handshakes, and marketecture rather than architecture, the world is indeed viewed through rose-tinted spectacles. Strata, the Hadoop Big Data conference in New York last week, was very interesting, albeit seen through my rose-tinted spectacles.

Firstly, the sheer volume of people, over 3,500, is telling. This show used to be a few hundred, primarily techies inventing the future. The show is now bigger, much bigger. A cursory glance at the exhibit hall revealed a mix of the biggest tech companies and hot start-ups. The keynotes, to the disappointment of those original techies, were primarily press-driven product releases lacking real technical substance. This is not such a bad thing, though. It’s a sign that Hadoop is coming of age. It’s what happens when technology moves into the mainstream.

Second, the agenda has changed quite dramatically. Companies looking to deploy Hadoop are no longer trying to figure out how it might fit into their data centers. They are trying to figure out how to deploy it. 2014 will indeed be the end of trials and the beginning of full-scale enterprise roll-out. The use cases are all over the place. Analysts yearn for clues and clusters to explain this: “Are you seeing mainly telcos or financial services?” Analysts, of course, must try to enumerate in order to explain, but the wave and shift is seismic, and the only explanation is a fundamental shift in the very nature of enterprise applications.

My third theme is the discussion around why Hadoop is driving this move to rewrite enterprise applications. As someone at the show told me, “the average age of an enterprise application is 19 years”. Hence, this is part of a classic business cycle. Hadoop is a major technological shift that takes advantage of dramatic changes in the capabilities and economics of hardware. Expensive spinning hard disks, processing speeds, bandwidth, networks, etc. were limitations, and hence assumptions, that the last generation of enterprise applications had to deal with. Commodity hardware and massive in-memory processing are the new assumptions that Hadoop takes advantage of. In a few years we will not be talking about ‘Big Data’; we will simply use the term ‘Data’, because it will no longer be unusual for it to be so large in relative terms.

My fourth observation was that Hadoop 2 has changed the agenda for the type of use case. In very rough terms, Hadoop 1 was primarily about storage and batch processing. Hadoop 2 is about YARN and run-time applications. In other words, processing can now take place on top of Hadoop rather than storing data in Hadoop but processing it somewhere else. This change is highly disruptive because it means that software vendors cannot rely on customers to use their products in conjunction with Hadoop. Rather, they are talking about building on top of Hadoop. To them Hadoop is a new type of operating system. This disruption is very good news for the new breed of companies that are building applications from the ground up, and really bad news for those who believe that they can mildly integrate or even store data in two places. That’s not going to happen. Some of the traditional companies had a token presence at Strata that suggests they are still unsure of exactly what they are going to do – they are neither fully embracing nor ignoring this new trend.

My final observation is about confusion. There’s a lot of money at stake here, so naturally everyone wants a piece of the action. There are a lot of flashing lights and noise from vendors, lavish claims and a lack of substance. Forking core open source is nearly always a disaster. As open-source guru Karl Fogel says, ‘forks happen due to irreconcilable disagreements, technical disagreements or interpersonal conflicts, and are something developers should be afraid of and try to avoid in any way they can’. Forking creates natural barriers to using third-party products, and with an open-source project moving as quickly as this one, you have to stay super-close to the de facto open source project.

A forked version of core Hadoop is not Hadoop; it’s something else. If customers go down a forked path it’s difficult to get back, and they will lose their competitive edge because they will be unable to use the ecosystem of products being built by the wider community. Customers should think of Hadoop like an operating system or database. If it’s merely embedded and heavily modified, then it is not Hadoop.

So 2014 it is, then. As the Wall Street Journal put it: “Elephant in the Room to Weigh on Growth for Oracle, Teradata”.

Here’s a great video demo of the new @WANdisco continuous availability technology running on Hortonworks Hadoop 2.2 Distro



About David Richards

David is CEO, President and co-founder of WANdisco and has quickly established WANdisco as one of the world’s most promising technology companies. Since co-founding the company in Silicon Valley in 2005, David has led WANdisco on a course for rapid international expansion, opening offices in the UK, Japan and China. David spearheaded the acquisition of Altostor, which accelerated the development of WANdisco’s first products for the Big Data market. The majority of WANdisco’s core technology is now produced out of the company’s flourishing software development base in David’s hometown of Sheffield, England and in Belfast, Northern Ireland. David has become recognised as a champion of British technology and entrepreneurship. In 2012, he led WANdisco to a hugely successful listing on London Stock Exchange (WAND:LSE), raising over £24m to drive business growth. With over 15 years' executive experience in the software industry, David sits on a number of advisory and executive boards of Silicon Valley start-up ventures. A passionate advocate of entrepreneurship, he has established many successful start-up companies in Enterprise Software and is recognised as an industry leader in Enterprise Application Integration and its standards. David is a frequent commentator on a range of business and technology issues, appearing regularly on Bloomberg and CNBC. Profiles of David have appeared in a range of leading publications including the Financial Times, The Daily Telegraph and the Daily Mail. Specialties:IPO's, Startups, Entrepreneurship, CEO, Visionary, Investor, ceo, board member, advisor, venture capital, offshore development, financing, M&A

Non-Stop Hadoop for Hortonworks HDP 2.0

As part of our partnership with Hortonworks, today we announced support for HDP 2.0. With its new YARN-based architecture, HDP 2.0 is the most flexible, complete and integrated Apache Hadoop distribution to date.

By combining WANdisco’s non-stop technology with HDP 2.0, WANdisco’s Non-Stop Hadoop for Hortonworks addresses critical enterprise requirements for global data availability so customers can better leverage Apache Hadoop. The solution delivers 100% uptime for large enterprises using HDP with automatic failover and recovery both within and across data centers. Whether a single server or an entire site goes down, HDP is always available.

“Hortonworks and WANdisco share the vision of delivering an enterprise-ready data platform for our mutual customers,” said David Richards, Chairman and CEO, WANdisco. “Non-Stop Hadoop for Hortonworks combines the YARN-based architecture of HDP 2.0 with WANdisco’s patented Non-Stop technology to deliver a solution that enables global enterprises to deploy Hadoop across multiple data centers with continuous availability.”

Stop by WANdisco’s booth (110) at Strata + Hadoop World in New York October 28-30 for a live demonstration and pick up an invitation to theCUBE Party @ #BigDataNYC Tuesday, October 29, 2013 from 6:00 to 9:00 PM at the Warwick Hotel, co-sponsored by Hortonworks and WANdisco.

Git Data Mining with Hadoop

Detecting Cross-Component Commits

Sooner or later every Git administrator will start to dabble with simple reporting and data mining.  The questions we need to answer are driven by developers (who’s the most active developer) and the business (show me who’s been modifying the code we’re trying to patent), and range from simple (which files were modified during this sprint) to complex (how many commits led to regressions later on). But here’s a key fact: you probably don’t know in advance all the questions you’ll eventually want to answer. That’s why I decided to explore Git data mining with Hadoop.

We may not normally think of Git data as ‘Big Data’. In terms of sheer volume, Git repositories don’t qualify. In several other respects, however, I think Git data is a perfect candidate for analysis with Big Data tools:

  • Git data is loosely structured. There is interesting data available in commit comments, commit events intercepted by hooks, authentication data from HTTP and SSH daemons, and other ALM tools. I may also want to correlate data from several Git repositories. I’m probably not tracking all of these data sources consistently, and I may not even know right now how these pieces will eventually fit together. I wouldn’t know how to design a schema today that will answer every question I could ever dream up.

  • While any single Git repository is fairly small, the aggregate data from hundreds of repositories with several years of history would be challenging for traditional repository analysis tools to handle. For many SCM systems the ‘reporting replica’ is busier than the master server!

Getting Started

As a first step I decided to use Flume to stream Git commit events (as seen by a post-receive hook) to HDFS. I first set up Flume using a netcat source connected to the HDFS sink via a file channel. The flume.conf looks like:

git.sources = git_netcat
git.channels = file_channel
git.sinks = sink_to_hdfs
# Define / Configure source
git.sources.git_netcat.type = netcat
git.sources.git_netcat.bind =
git.sources.git_netcat.port = 6666
# HDFS sinks
git.sinks.sink_to_hdfs.type = hdfs
git.sinks.sink_to_hdfs.hdfs.fileType = DataStream
git.sinks.sink_to_hdfs.hdfs.path = /flume/git-events
git.sinks.sink_to_hdfs.hdfs.filePrefix = gitlog
git.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
git.sinks.sink_to_hdfs.hdfs.batchSize = 1000
# Use a file channel, which buffers events on local disk
git.channels.file_channel.type = file
git.channels.file_channel.checkpointDir = /var/flume/checkpoint
git.channels.file_channel.dataDirs = /var/flume/data
# Bind the source and sink to the channel
git.sources.git_netcat.channels = file_channel
git.sinks.sink_to_hdfs.channel = file_channel

The Git Hook

I used the post-receive-email template as a starting point as it contains the basic logic to interpret the data the hook receives. I eventually obtain several pieces of information in the hook:

  • timestamp

  • author

  • repo ID

  • action

  • rev type

  • ref type

  • ref name

  • old rev

  • new rev

  • list of blobs

  • list of file paths

Do I really care about all of this information? I don’t really know – and that’s the reason I’m just stuffing the data into HDFS right now. I don’t care about all of it right now, but I might need it a couple years down the road.
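For context, here is a rough sketch of the stdin-parsing side of such a hook. Git feeds post-receive one “old-rev new-rev ref-name” line per updated ref, and everything else is derived with ordinary git commands. This is illustrative only and skips the branch-creation and deletion handling that the post-receive-email template takes care of:

#!/usr/bin/env python
# Illustrative post-receive sketch: parse the update lines Git passes on stdin
# and derive a few of the fields listed above with plain git commands.
import sys
from subprocess import check_output

def git(*args):
    return check_output(["git"] + list(args)).decode().strip()

for line in sys.stdin:
    oldrev, newrev, refname = line.split()
    timestamp = git("log", "-1", "--format=%ct", newrev)
    author = git("log", "-1", "--format=%an", newrev)
    short_refname = refname.split("/", 2)[-1]        # refs/heads/master -> master
    paths = git("diff", "--name-only", oldrev, newrev).splitlines()
    # ...assemble the pipe-delimited record and stream it to Flume as shown below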

Once I marshal all the data I stream it to Flume via nc:

from subprocess import Popen, PIPE, STDOUT

nc_data = \
 "{0}|{1}|{2}|{3}|{4}|{5}|{6}|{7}|{8}|{9}|{10}\n".format( \
 timestamp, author, projectdesc, change_type, rev_type, \
 refname_type, short_refname, oldrev, newrev, ",".join(blobs), \
 ",".join(paths))

p = Popen(['nc', NC_IP, NC_PORT], stdout=PIPE, \
 stdin=PIPE, stderr=STDOUT)
nc_out = p.communicate(input="{0}".format(nc_data))[0]
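To sanity-check the pipeline without making a real commit, you can push a fabricated record straight into the netcat source. The values below are made up, but the field order and the pipe delimiter match the record the hook produces:

# Hypothetical smoke test: send one fake commit record to the Flume netcat
# source (port 6666 from flume.conf above) and then check the HDFS sink output.
import socket

fields = [
    "1372694400", "jdoe", "demo-repo", "update", "commit",
    "branch", "master", "0" * 40, "1" * 40,
    "blob-sha-1,blob-sha-2", "modA/Foo.java,modB/Bar.java",
]
conn = socket.create_connection(("127.0.0.1", 6666))
conn.sendall(("|".join(fields) + "\n").encode("utf-8"))
conn.close()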

The First Query

Now that I have Git data streaming into HDFS via Flume, I decided to tackle a question I always find interesting: how isolated are Git commits? In other words, does a typical Git commit touch only one part of a repository, or does it touch files in several parts of the code? If you work in a component based architecture then you’ll recognize the value of detecting cross-component activity.

I decided to use Pig to analyze the data, and started by ingesting data with HCat.

hcat -e "CREATE TABLE GIT_LOGS(time STRING, author STRING, \
  repo_id STRING, action STRING, rev_type STRING, ref_type STRING, \
  ref_name STRING, old_rev STRING, new_rev STRING, blobs STRING, paths STRING) \

Now for the fun part – some Pig Latin! Actually detecting cross-component activity will vary depending on the structure of your code; that’s part of the reason why it’s so difficult to come up with a canned schema in advance. But for a simple example let’s say that I want to detect any commit that touches files in two component directories, modA and modB. The list of file paths contained in the commit is a comma delimited field, so some data manipulation is required if we’re to avoid too much regular expression fiddling.

-- load from hcat
raw = LOAD 'git_logs' using org.apache.hcatalog.pig.HCatLoader();

-- tuple, BAG{tuple,tuple}
-- new_rev, BAG{p1,p2}
bagged = FOREACH raw GENERATE new_rev, TOKENIZE(paths) as value;
DESCRIBE bagged;

-- tuple, tuple
-- tuple, tuple
-- new_rev, p1
-- new_rev, p2
bagflat = FOREACH bagged GENERATE $0, FLATTEN(value);
DESCRIBE bagflat;

-- create list that only has first path of interest
modA = FILTER bagflat by $1 matches '^modA/.*';

-- create list that only has second path of interest
modB = FILTER bagflat by $1 matches '^modB/.*';

-- So now we have lists of commits that hit each of the paths of interest.  Join them...
-- new_rev, p1, new_rev, p2
bothMods = JOIN modA by $0, modB by $0;
DESCRIBE bothMods;

-- join on new_rev
joined = JOIN raw by new_rev, bothMods by $0;
DESCRIBE joined;

-- now that we've joined, we have the rows of interest and can discard the extra fields from both_mods
final = FOREACH joined GENERATE $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10;
DUMP final;

As the Pig script illustrates, I manipulated the data to obtain a new structure that had one row per file per commit. That made it easier to operate on the file path data; I made lists of commits that contained files in each path of interest, then used a couple of joins to isolate the commits that contain files in both paths. There are certainly other ways to get to the same result, but this method was simple and effective.

In A Picture

A simplified data flow diagram shows how data makes its way from a Git commit into HDFS and eventually out again in a report.

Data Flow

What Next?

This simple example shows some of the power of putting Git data into Hadoop. Without knowing in advance exactly what I wanted to do, I was able to capture some important Git data and manipulate it after the fact. Hadoop’s analysis tools make it easy to work with data that isn’t well structured in advance, and of course I could take advantage of Hadoop’s scalability to run my query on a data set of any size. In the future I could take advantage of data from other ALM tools or authentication systems to flesh out a more complete report. (The next interesting question on my mind is whether commits that span multiple components have a higher defect rate than normal and require more regression fixes.)

Using Hadoop for Git data mining may seem like overkill at first, but I like to have the flexibility and scalability of Hadoop at my fingertips in advance.

Hadoop Summit 2013

I left Hadoop Summit last month very excited to see the traction the market is gaining. The number of Hadoop vendors, practitioners and customers continues to grow and the knowledge about the technology continues to deepen.

One of the key areas of discussion on the trade show floor was the design and limitations of the NameNode.

In Apache Hadoop 1.x, the namenode was a single point of failure (SPoF).

This SPoF has become such a significant issue that the community has accelerated its efforts to mitigate this earlier design choice. While the community has started to develop a solution that improves on earlier attempts, the overall system is still what we call an active-passive implementation.

Active-passive solutions have been around for many years and were designed to provide recovery where disruption of services was resolved at a different layer of the stack. For example, active-passive security solutions like firewalls have traditionally been deployed with a primary and a standby unit. In the event the primary failed, the secondary would recover, and client communications (TCP) would retry and retransmit until a connection could be established. With services like HTTP, these active-passive solutions are sufficient and widely deployed.

However, when we start to discuss components of an architecture that are key to availability and access, a new term starts to emerge: Continuous Availability.

In the past, active-passive solutions were enough to deliver what the industry accepted as “Highly Available” systems. However, today’s architectures and technologies are evolving and have shifted to a new need, which is being described as “Continuously Available”.

One area we found ourselves explaining was the difference between our 100% Continuous Availability™ solution for HDFS and the design changes being implemented in the Apache Hadoop 2.x branch. As you can see from the references in the Apache documentation, the new Quorum Journal Manager is an active-passive solution.

“This guide discusses how to configure and use HDFS HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes.” REF: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html

This is the main difference between open-source HA and the technology developed by WANdisco. WANdisco’s Non-Stop NameNode is an active-active solution built on Paxos, a well-known family of protocols for solving consensus in a network of unreliable processors.

WANdisco’s implementation, known as DConE, is the core IP used to ensure 100% uptime of the Apache Hadoop 2.0 NameNode processes, and therefore provides continuous availability of and access to HDFS during planned and unplanned outages of critical infrastructure.

After spending two days speaking to many attendees, it became very clear to me that the strength of WANdisco’s DConE technology, and the way it has been applied to the Apache Hadoop NameNode, are not trivial.

We had the opportunity to talk with representatives from some of the largest Web 2.0 companies in Silicon Valley, including Yahoo, LinkedIn, Facebook and eBay. Being able to demonstrate our active-active solution to key industry technologists, architects and Hadoop developers was the highlight for our team, and we are excited about the official release of our WAN solution.

Hadoop 2: “alpha” elephant or not? Part 2: features and opinions

In the first part of this article I looked into the development that brought us Hadoop 2. Let’s now try to analyze whether Hadoop 2 is ready for general consumption, or if it’s all just business hype at this point. Are you better off sticking with the old, not-that-energetic grandpa who nonetheless delivers every time, or riding with the younger fella who might be a bit “unstable”?

New features

Hadoop 2 introduces a few very important features, such as:

    • HDFS High Availability (HA) with the Quorum Journal Manager (QJM). This is what it does:

      …In order for the Standby node to keep its state synchronized with the Active node in this implementation, both nodes communicate with a group of separate daemons called JournalNodes…In the event of a fail-over, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a fail-over occurs.

      There’s an alternative approach to HDFS HA that requires an external filer (a NAS or NFS server) to store a copy of the HDFS edit logs. In the case of failure of the primary NameNode, a new one can be brought up and the network-stored copy of the logs can be used to serve the clients. This is essentially a less optimal approach than QJM, as it involves more moving parts and requires more complex operations.

    • HDFS federation, which allows combining multiple namespaces/NameNodes into a single logical filesystem. This allows for better utilization of higher-density storage.

    • YARN, which essentially implements the concept of Infrastructure-as-a-Service: you can deploy your non-MR applications to cluster nodes using YARN resource management and scheduling. Another advantage is the split of the old JobTracker into two independent services: resource management and job scheduling. This gives a certain advantage in the case of a fail-over and in general is a much cleaner approach to the MapReduce framework implementation. YARN is API-compatible with MRv1, hence you don’t need to do anything about your MR applications, except perhaps recompile the code; just run them on YARN.


The majority of the optimizations were made on the HDFS side. Just a few examples:

  • overall file system read/write improvements: I’ve seen reports of >30% performance increase from 1.x to 2.x with the same workload
  • read improvements when the DataNode and client are collocated, HDFS-347 (yet to be added to the 2.0.5 release)

A good overall overview of the HDFS roadmap can be found here.


Here’s how the bets are spread among commercial vendors, with respect to supported production-ready versions:

              Hadoop 1.x    Hadoop 2.x
Cloudera      x[1]          x
Hortonworks                 x
Intel                       x
MapR          x[1]          x
Pivotal                     x
Yahoo!        x[2]
WANdisco                    x

The worldview of software stacks

In any platform ecosystem there are always a few layers: they are like onions; onions have layers 😉

  • in the center there’s a core, e.g. OS kernel
  • there are few inner layers: the system software, drivers, etc.
  • and the external layers of the onion… err, the platform — the user space applications: your web browser and email client and such

The Hadoop ecosystem isn’t that much different from Linux. There’s

  • the core: Hadoop
  • system software: HBase, ZooKeeper, Spring Batch
  • user space applications: Pig, Hive, users’ analytics applications, ETL, BI tools, etc.

The responsibility of bringing all the pieces of the Linux onion together lies with the Linux distribution vendors: Canonical, Red Hat, SUSE, etc. They pull certain versions of the kernel, libraries, system and user-space software into place and release these collections to the users. But first they make sure everything fits nicely and add some of their secret sauce on top (think Ubuntu Unity, for example). Kernel maintenance is not a part of the distribution vendors’ daily business, yet they are submitting patches and new features. A set of kernel maintainers is then responsible for bringing changes into the kernel mainline. Kernel advancements happen under very strict guidelines: breaking compatibility with user space is rewarded by placing the guilty person straight into the 8th circle of Inferno.

Hadoop practices a somewhat different philosophy than Linux, though. Hadoop 1.x is considered stable, and only critical bug fixes get incorporated into it (Table2), whereas Hadoop 2.x is moving forward at a higher pace and most improvements are going there. That comes at a cost to user-space applications. The situation is supposedly addressed by labeling Hadoop 2 as ‘alpha’, which it has been for about a year now. On the other hand, such tagging arguably prevents user feedback from flowing into the development community. Why? Because users and application developers alike are generally scared away by the “alpha” label: they’d rather sit and wait until the magic of stabilization happens. In the meantime, they might use Hadoop 1.x.

And, unlike the Canonical or Fedora project, there’s no open-source integration place for the Hadoop ecosystem. Or is there?


There are 12+ different components in the Hadoop stack (as represented by the BigTop project). All of these are moving at their own pace and, more often than not, support both versions of Hadoop. This complicates development and testing and creates a large number of integration issues across these projects. Just think about the variety of library dependencies and such that might all of a sudden conflict or have bugs (HADOOP-9407 comes to mind). Every component also comes with its own configuration, adding insult to injury on top of all the tweaks in Hadoop itself.

All this brings a lot of issues to the DevOps teams who need to install, maintain, and upgrade your average Hadoop cluster. In many cases, they simply don’t have the capacity or knowledge to build and test a new component of the stack (or a newer version of one) before bringing it into the production environment. Most of the smaller companies and application developers don’t have the expertise to build and install multiple versions from the release source tarballs, or to configure and performance-tune the installation.

That’s where software integration projects like BigTop come into the spotlight. BigTop was started by Roman Shaposhnik (ASF BigTop PMC Chair) and Konstantin Boudnik (ASF BigTop PMC) on the Yahoo! Hadoop team back in 2009-2010. It was a continuation of earlier work based on expertise in software integration and OS distributions. BigTop provides a versatile tool for creating software stacks with predefined properties, validates the compatibility of the integral parts, and creates native Linux packaging to ease the installation experience.

BigTop includes a set of Puppet recipes (Puppet being an industry-standard configuration management system) that allow you to spin up a Hadoop cluster in about 10 minutes. The cluster can be configured for Kerberized or non-secure environments. A typical release of BigTop looks like a stack’s bill of materials plus source code. It lets anyone quickly build and test a packaged Hadoop cluster with a number of typical system and user-space components in it. Most of the modern Hadoop distributions are using BigTop openly or under the hood, making BigTop a de facto integration point for all upstream projects.


Here’s Milind Bhandarkar (Chief Architect at Pivotal):

As part of HAWQ stress and longevity testing, we tested HDFS 2.0 extensively, and subjected it to loads it had never seen before. It passed with flying colors. Of course, we have been testing the new features in HDFS since 0.22! eBay was the first to test new features in HDFS 2.0, and I had joined Konstantin Shvachko to declare Hadoop 0.22 stable, when the rest of the community called it crazy. Now they are realizing that we were right.

YARN is known for very high stability. Arun Murthy, release manager of all the 2.0.x-alpha releases and one of the YARN authors, wrote in the 2.0.3-alpha release email:

# Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release – see here)

And there’s this view that I guess is shared by a number of application developers and users sitting on the sidelines:

I would expect to have a non-alpha semi-stable release of 2.0 by late June or early July.  I am not an expert on this and there are lots of things that could show up and cause those dates to slip.

In the meanwhile, six out of seven vendors are using and selling Hadoop 2.x-based versions of storage and data analytics solutions, system software, and services. Who is right? Why has the “alpha” tag been kept on for so long? Hopefully, now you can make your own informed decision.


[1]: EOLed or effectively getting phased out

[2]: Yahoo! is using Hadoop 0.23.x in production, which essentially is very close to the Hadoop 2.x source base


About Konstantin Boudnik

Hadoop 2: “alpha” elephant or not?

Today I will look into the state of Hadoop 2.x and try to understand what has kept it in the alpha state to date. Is it really an “alpha” elephant? This question keeps popping up on the Internet, in conversations with customers and business partners. Let’s start with some facts first.

The first anniversary of the Hadoop 2.0.0-alpha release is around the corner. SHA 989861cd24cf94ca4335ab0f97fd2d699ca18102 was made on May 8th, 2012, marking the first-ever release branch of the Hadoop 2 line (in the interest of full disclosure: the actual release didn’t happen until a few days later, May 23rd).[1]

It was a long-awaited event. And sure enough, the market accepted it enthusiastically.  The commercial vendor Cloudera announced its first Hadoop 2.x-based CDH4.0 at the end of June 2012, according to this statement from Cloudera’s VPoP — just a month after 2.0.0-alpha went live! So, was it solid, fact-based trust of the quality of the code base, or something else?  An interesting nuance: MapReduce v1 (MRv1) was brought back despite the presence of YARN (a new resource scheduler and a replacement for the old MapReduce). One of those things that make you go, “Huh…?”

We’ve just seen the 2.0.4-alpha RC vote getting closed: the fifth release in a row in just under one year. Many great features went in: YARN; HDFS HA; HDFS performance optimizations, to name a few. An incredible amount of stabilization has been done lately, especially in 2.0.4-alpha. Let’s consider some numbers:

Table1: JIRAs committed to Hadoop between 2.0.0-alpha and 2.0.4-alpha releases

HDFS 801
YARN 138

That’s about 1,500 fixes and features since the beginning, which was to be expected considering the scope of the implemented changes and the need to smooth things out.

Let’s for a moment look into Hadoop 1.x — essentially the same old Hadoop 0.20.2xx — per latest genealogy of elephants — a well-respected and stable patriarchy. Hadoop 1.x had 8 releases altogether in 14 months:

  • 1.0.0 released on Dec 12, 2011
  • 1.1.2 released on Feb 15, 2013

Table2: JIRAs committed to Hadoop between 1.0.0 and 1.1.2 releases

HDFS 111

That’s about five times fewer fixes and improvements than what went into Hadoop 2.x over roughly the same period of time. If frequency of change is any indication of stability, then perhaps we are onto something.

“Wow,” one might say, “no wonder the ‘alpha’ tag has been so sticky!” Users definitely want to know if the core platform is turbulent and unstable. But wait… wasn’t there that commercial release that happened a month after the first OSS alpha? If it was more stable than the official public alpha, then why did it take the latter another five releases and 1,500 commits to get where it is today? Why wasn’t the stabilization simply contributed back to the community? Or, if both were of the same high quality to begin with, then why is the public Hadoop 2.x still wearing the “alpha” tag one year later?

Before moving any further: all 13 releases — for 1.x and 2.x —  were managed by engineers from Hortonworks. Tipping my hat to those guys and all contributors to the code!

So, is Hadoop 2 that unstable after all? In the second part of this article I will dig into the technical merits of the new development line so we can decide for ourselves. To be continued

[1] All release info is available from official ASF Hadoop release page


About Konstantin Boudnik

On coming fragmentation of Hadoop platform

I just read this interview with the CEO of Hortonworks in which he expresses a fear of Hadoop fragmentation. He calls attention to a valid issue in the Hadoop ecosystem: forking is getting to the point where the product space is likely to become fragmented.

So why should the BigTop community bother? Well, for one, Hadoop is the core upstream component of the BigTop stack. By filling this unique position, it has a profound effect on downstream consumers such as HBase, Oozie, etc. Although projects like Hive and Pig can partially avoid potential harm by statically linking with Hadoop binaries, this isn’t a solution for any sane integration approach. As a side note: I am especially thrilled by Hive’s way of working around multiple incompatibilities in the MR job submission protocol. The protocol has been naturally evolving for quite some time, and no one could even have guaranteed compatibility in versions like 0.19 or 0.20. Anyway, Hive solved the problem by simply generating a job jar, constructing a launch string and then – you got it already, right? – System.exec()’ing the whole thing. On a separate JVM, that is! Don’t believe me? Go check the source code yourself.

Anecdotal evidence aside, there’s a real threat of fracturing the platform. And there’s no good reason for doing so even if you’re incredibly selfish, or stupid, or want to monopolize the market. Which, by the way, doesn’t work for objective reasons even with so-called “IP protection” laws in place. But that’s a topic for another day.

So, what’s HortonWorks’ answer to the problem? Here it comes:

Amid current Hadoop developments—is there any company NOT launching a distribution with some value added software?—Hortonworks stands out. Why? Hortonworks turns over its entire distribution to the Apache open source project.

While it is absolutely necessary for any human endeavor to be collaborative in order to succeed, the open source niche might be a tricky one. There are literally no incentives for all players to play by the book, and there’s always that one very bold guy who might say, “Screw you guys, I’m going home,” because he is just… you know…

Where could these incentives come from? How can we be sure that every new release is satisfactory for everyone’s consumption? How do we guarantee that HBase’s St.Ack and friends won’t be spending their next weekend trying to fix HBase when it loses its marbles because of that tricky change in Hadoop’s behavior?

And here comes a hint of an answer:

We’re building directly in the core trunk, productizing the package, doing QA and releasing.

I have a couple of issues with this statement. But first, a spoiler alert: I am not going to attack either Hortonworks or their CEO. I don’t have a chip on my shoulder — not even an ARM one. I am trying to demonstrate the fallacy in the logic and show what doesn’t work and why. And now here’s the laundry list:
  • “building directly in the core trunk”: Hadoop isn’t released from the trunk. This is a headache, and it is one of the issues that the BigTop community faced during the most recent stabilization exercise for the Hadoop 2.0.4-alpha release. Why is that a problem? Well, for one, there’s a policy that “everything should go through the trunk”. It means, in the context of Hadoop’s current state, that you have to first commit to the trunk, then back-port to branch-2, which is supposed to be the landing ground for all Hadoop 2.x releases, just like branch-1 is the landing ground for all Hadoop 1.x releases. If there happens to be an active release (or several) in flight at the moment, one would need to back-port the commit to those release branches as well, such as 2.0.4-alpha in this particular example. Mutatis mutandis, some of the changes reach only about 2/3 of the way down. Best-case scenario. This approach also gives fertile ground to all “proponents” of open-source Hadoop, because once their patches are committed to the trunk, they are as open-source as the next guy. They might get released in a couple of years, but hey — what’s a few months between friends, right?
  • “productizing the package”: is Mr. Bearden aware of when development artifacts for an ongoing Hadoop release were last published in the open? ‘Cause I don’t know of a publication of any such thing to date. Neither does Google, by the way. Even the official source tarballs weren’t available until, like, 3 weeks ago. Why does that constitute a problem? How do you expect to perform any reasonable integration validation if you don’t have an official snapshot of the platform? Once your platform package is “productized”, it is a day late to pull your hair out. If you happen to find some issues — come back later. At the next release, perhaps?
  • “doing QA and releasing”: we are trying to build an open-source community here, right? Meaning that the code, the tests and their results, the bug reports, the discussions should be in the open. The only place where the Hadoop ecosystem is being tested at any reasonable length and depth is BigTop. Read here for yourself. And feel free to check the regular builds and test runs for _all_ the components that BigTop releases for both secured and non-secured configurations. What are you testing with and how, Mr. Bearden?
So, what was the solution? Did I miss it in the article? I don’t think so. Because a single player — even one as respected as Hortonworks — can’t solve the issue in question without ensuring that anything produced by the Hadoop project’s developers is always in line with the expectations of downstream players.
That’s how you prevent fracturing: by putting in the open a solid and well-integrated reference implementation of the stack – one that can be installed by anyone using open-standard packaging and loaded with third-party applications without tweaking them every time you go from Cloudera’s cluster to MapR’s. Or another pair of vendors’. Does it sound like I am against making money in open-source software? Not at all: most people in the OSS community do this on the dime of their employers or as part of their own business.
You can consider BigTop’s role in the Hadoop-centric environment to be similar to that of Debian in the Linux kernel/distribution ecosystem. By helping to close the gap between the applications and the fast-moving core of the stack, BigTop essentially brings reassurance of the Hadoop 2.x line’s stability to the user space and the community. BigTop helps to make sure that vendor products are compatible with each other and with the rest of the world, to avoid vendor lock-in, and to guarantee that recent Microsoft stories will not be replayed all over again.

Are there means to achieve the goal of keeping the core contained? Certainly! BigTop does just that. Recent announcements from Intel, Pivotal, and WANdisco are living proof of it: they are all using BigTop as the integration framework and consolidation point. Can these vendors deviate even under such a top-level integration system? Sure. But it will be immensely harder to do.


About Konstantin Boudnik

WANdisco Releases New Version of Hadoop Distro

We’re proud to announce the release of WANdisco Distro (WDD) version 3.1.1.

WDD is a fully tested, production-ready version of Apache Hadoop 2 that’s free to download. WDD version 3.1.1 includes an enhanced, more intuitive user interface that simplifies Hadoop cluster deployment. WDD 3.1.1 supports SUSE Linux Enterprise Server 11 (Service Pack 2), in addition to RedHat and CentOS.

“The number of Hadoop deployments is growing quickly and the Big Data market is moving fast,” said Naji Almahmoud, senior director of global business development, SUSE, a WANdisco Non-Stop Alliance partner. “For decades, SUSE has delivered reliable Linux solutions that have been helping global organizations meet performance and scalability requirements. We’re pleased to work closely with WANdisco to support our mutual customers and bring Hadoop to the enterprise.”

All WDD components are tested and certified using the Apache BigTop framework, and we’ve worked closely with both the open source community and leading big data vendors to ensure seamless interoperability across the Hadoop ecosystem.

“The integration of Hadoop into the mainstream enterprise environment is increasing, and continual communication with our customers confirms their requirements – ease of deployment and management as well as support for market leading operating systems,” said David Richards, CEO of WANdisco. “With this release, we’re delivering on those requirements with a thoroughly tested and certified release of WDD.”

WDD 3.1.1 can be downloaded for free now. WANdisco also offers Professional Support for Apache Hadoop.