Tag Archive for 'hbase'

Consensus-based replication in HBase

Following up on the recent blog post about Hadoop Summit 2014, I wanted to share an update on the state of consensus-based replication (CBR) for HBase. As some of our readers might know, we are working on this technology directly in the Apache HBase project. As you may also know, we are big fans and proponents of strong consistency in distributed systems; however, I think the phrase “strong consistency” is a made-up tautology, since anything else should not be called “consistency” at all.

When we first looked into availability in HBase we noticed that it relies heavily on the ZooKeeper layer. There’s nothing wrong with ZK per se, but the way the implementation is done at the moment makes ZK an integral part of the HBase source code. This makes sense from a historical perspective, since ZK has been virtually the only technology to provide shared memory storage and distributed coordination capabilities for most of HBase’s lifespan. JINI, developed back in the day by Bill Joy, is worth mentioning in this regard, but I digress and will leave that discussion for another time.

The idea behind CBR is pretty simple: instead of trying to guarantee that all replicas of a node in the system are synced after an operation has been performed, such a system coordinates the intent of an operation up front. If a consensus on the feasibility of an operation is reached, it will be applied by each node independently. If consensus is not reached, the operation simply won’t happen. That’s pretty much the whole philosophy.
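To make the idea concrete, here is a minimal single-process sketch. It is not how any real coordination engine works: the `Replica` class, the `coordinate` function, the feasibility rule, and the simple majority quorum are all assumptions made for illustration. A real engine would also have to handle ordering, failures, and network partitions.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node holding a full copy of the state (a plain dict here)."""
    name: str
    state: dict = field(default_factory=dict)

    def is_feasible(self, op):
        # Hypothetical feasibility rule: a "create" must not overwrite a key.
        kind, key, _ = op
        return not (kind == "create" and key in self.state)

    def apply(self, op):
        # Each replica applies the agreed operation on its own.
        _, key, value = op
        self.state[key] = value


def coordinate(replicas, op):
    """Agree on the intent first; apply only if a majority accepts (assumed quorum)."""
    votes = sum(1 for r in replicas if r.is_feasible(op))
    if votes * 2 <= len(replicas):
        return False  # no consensus: the operation simply does not happen
    for r in replicas:
        r.apply(op)
    return True


replicas = [Replica("a"), Replica("b"), Replica("c")]
first = coordinate(replicas, ("create", "row1", "v1"))   # agreed and applied
second = coordinate(replicas, ("create", "row1", "v2"))  # rejected, state untouched
```

The point of the sketch is the order of events: nothing is written until the replicas agree, so there is no divergent state to reconcile afterwards.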

Now, the details are more intricate, of course. We think that CBR is beneficial for any distributed system that requires strong consistency (learn more on the topic from the recent Michael Stonebraker interview [6] on Software Engineering Radio). In the Hadoop ecosystem it means that HDFS, HBase, and possibly other components can benefit from a common API to express the coordination semantics. Such an approach will help accommodate a variety of coordination engine (CE) implementations specifically tuned for network throughput, performance, or low latency. Introducing this concept to HBase is somewhat more challenging, however, because unlike HDFS it doesn’t have a single HA architecture: the HMaster fail-over process relies solely on ZK, whereas HRegionServer recovery additionally depends on write-ahead log (WAL) splitting. Hence, before any meaningful progress on CBR can be made, we need to abstract most, if not all, concrete implementations of ZK-based functionality behind a well-defined set of interfaces. This will provide the ability to plug in alternative concrete CEs as the community sees fit.
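As a rough illustration of what hiding the coordination layer behind an interface could look like, consider the sketch below. Every name in it (`CoordinationEngine`, `InMemoryEngine`, `RejectingEngine`, `RegionAssigner`) is hypothetical and chosen for this example only; none of it is the actual HBase or Hadoop API, whose design is tracked in the JIRA tickets referenced below.

```python
from abc import ABC, abstractmethod


class CoordinationEngine(ABC):
    """Hypothetical interface; not the actual HBase or Hadoop API."""

    @abstractmethod
    def submit(self, proposal):
        """Return True if the proposal was agreed on."""


class InMemoryEngine(CoordinationEngine):
    """Stand-in engine for a single process: accepts every proposal in order."""

    def __init__(self):
        self.agreed = []

    def submit(self, proposal):
        self.agreed.append(proposal)
        return True


class RejectingEngine(CoordinationEngine):
    """Stand-in for an engine that cannot reach agreement."""

    def submit(self, proposal):
        return False


class RegionAssigner:
    """Caller that depends only on the interface, never on a concrete engine."""

    def __init__(self, engine):
        self.engine = engine
        self.assignments = {}

    def assign(self, region, server):
        # The change is recorded only after the engine reports agreement.
        if self.engine.submit(("assign", region, server)):
            self.assignments[region] = server
            return True
        return False


ok = RegionAssigner(InMemoryEngine()).assign("region-1", "rs-a")
rejected = RegionAssigner(RejectingEngine()).assign("region-1", "rs-a")
```

Because `RegionAssigner` only sees `submit`, swapping one engine for another requires no change to the calling code, which is the property the abstraction work is after.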

Below you can find the slides from my recent talk at the HBase Birds of a Feather session during Hadoop Summit [1], which covers the current state of development. References [2-5] will lead you directly to the ASF JIRA tickets that track the project’s progress.

References:

  1. HBase Consensus BOF 2014
  2. https://issues.apache.org/jira/browse/HBASE-10909
  3. https://issues.apache.org/jira/browse/HBASE-11241
  4. https://issues.apache.org/jira/browse/HADOOP-10641
  5. https://issues.apache.org/jira/browse/HDFS-6469
  6. Michael Stonebraker on distributed and parallel DBs

About Konstantin Boudnik

WANdisco’s February Roundup

This month, we launched a trio of innovative Hadoop products: the world’s first production-ready distro; a wizard-driven management dashboard; and the first and only 100% uptime solution for Apache Hadoop.


We started this string of Big Data announcements with WANdisco Distro (WDD), a fully tested, free-to-download version of Apache Hadoop 2. WDD is based on the most recent Hadoop release, includes all the latest fixes and undergoes the same rigorous quality assurance process as our enterprise software solutions.

This release paved the way for our enterprise Hadoop solutions, and we announced the WANdisco Hadoop Console (WHC) shortly after. WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

The final product in this month’s Big Data announcements was WANdisco Non-Stop NameNode. Our patented technology makes WANdisco Non-Stop NameNode the first and only 100% uptime solution for Hadoop, and offers a string of benefits for enterprise users:

  • Automatic failover and recovery
  • Automatic continuous hot backup
  • Removes single point of failure
  • Eliminates downtime and data loss
  • Every NameNode server is active and supports simultaneous read and write requests
  • Full support for HBase

To support the needs of the Apache Hadoop community, we’ve also launched a dedicated Hadoop forum. At this forum, users can get advice on their Hadoop installation and connect with fellow users, including WANdisco’s core Apache Hadoop developers Dr. Konstantin V. Shvachko, Dr. Konstantin Boudnik, and Jagane Sundar.


For Apache Subversion users, we announced the next webinars in our free training series:

  • Subversion Administration – everything you need to administer a Subversion development environment
  • Introduction to SmartSVN – a short introduction to how Subversion works with the SmartSVN graphical client
  • Checkout Command – how to get the most out of the checkout command, and the meaning of the various error messages you may encounter
  • Commit Command – learn more about this command, including diff usage, working with unversioned files and changelists
  • Introduction to Git – everything a new user needs to get started with Git
  • Hook Scripts – how to use hook scripts to automate tasks such as email notifications, backups and access control
  • Advanced Hook Scripts – an advanced look at hook scripts, including using a config file with hook scripts and passing data to hook scripts

We’ve also announced an ongoing series of free webinars that demonstrate how you can overcome the challenges of scaling Subversion from an administrative, business and IT perspective, and get the most out of deploying Subversion in an enterprise environment. These ‘Scaling Subversion for the Enterprise’ webinars will be conducted by our expert Solution Architect three times a week (Tuesday, Wednesday and Thursday) at 10:00am PST/1:00pm EST, and will cover:

  • The latest technology that can help you overcome the limitations and risks associated with globally distributed deployments
  • Answers to your business-specific questions
  • How to solve critical issues
  • The free resources and offers that can help solve your business challenges

WANdisco Announces Free Online Hadoop Training Webinars

We’re excited to announce a series of free one-hour online Hadoop training webinars, starting with four sessions in March and April. Time will be allowed for audience Q&A at the end of each session.

Wednesday, March 13 at 10:00 AM Pacific, 1:00 PM Eastern

“A Hadoop Overview” will cover Hadoop, from its history to its architecture as well as:

  • HDFS, MapReduce, and HBase
  • Public and private cloud deployment options
  • Highlights of common business use cases and more

March 27, 10:00 AM Pacific, 1:00 PM Eastern

“Hadoop: A Deep Dive” covers Hadoop misconceptions (not all clusters include thousands of machines) and:

  • Real world Hadoop deployments
  • Review of major Hadoop ecosystem components including: Oozie, Flume, Nutch, Sqoop and others
  • In-depth look at HDFS and more

April 10, 10:00 AM Pacific, 1:00 PM Eastern

“Hadoop: A MapReduce Tutorial” will cover MapReduce at a deep technical level and will highlight:

  • The history of MapReduce
  • Logical flow of MapReduce
  • Rules and types of MapReduce jobs
  • Debugging and testing
  • How to write foolproof MapReduce jobs

April 24, 10:00 AM Pacific, 1:00 PM Eastern

“Hadoop: HBase In-Depth” will provide a deep technical review of HBase and cover:

  • Its flexibility, scalability and components
  • Schema samples
  • Hardware requirements and more

Space is limited so click here to register right away!

WANdisco Non-Stop NameNode Removes Hadoop’s Single Point of Failure

We’re pleased to announce the release of the WANdisco Non-Stop NameNode, the only 100% uptime solution for Apache Hadoop. Built on our patented Non-Stop technology, it means Hadoop’s NameNode is no longer a single point of failure, and it delivers immediate and automatic failover and recovery whenever a server goes offline, without any downtime or data loss.

“This announcement demonstrates our commitment to enterprises looking to deploy Hadoop in their production environments today,” said David Richards, President and CEO of WANdisco. “If the NameNode is unavailable, the Hadoop cluster goes down. With other solutions, a single NameNode server actively supports client requests and complex procedures are required if a failure occurs. The Non-Stop NameNode eliminates those issues and also allows for planned maintenance without downtime. WANdisco provides 100% uptime with unmatched scalability and performance.”

Additional benefits of Non-Stop NameNode include:

  • Every NameNode server is active and supports simultaneous read and write requests.
  • All servers are continuously synchronized.
  • Automatic continuous hot backup.
  • Immediate and automatic recovery after planned or unplanned outages, without the need for administrator intervention.
  • Protection from “split-brain,” a condition in which the backup server becomes active before the active server is completely offline, which can result in data corruption.
  • Full support for HBase.
  • Works with Apache Hadoop 2.0 and CDH 4.1.

“Hadoop was not originally developed to support real-time, mission critical applications, and thus its inherent single point of failure was not a major issue of concern,” said Jeff Kelly, Big Data Analyst at Wikibon. “But as Hadoop gains mainstream adoption, traditional enterprises rightly are looking to Hadoop to support both batch analytics and mission critical apps. With WANdisco’s unique Non-Stop NameNode approach, enterprises can feel confident that mission critical applications running on Hadoop, and specifically HBase, are not at risk of data loss due to a NameNode failure because, in fact, there is no single NameNode. This is a major step forward for Hadoop.”

You can learn more about the Non-Stop NameNode at the product page, where you can also claim your free trial.

If you’d like to get first-hand experience of the Non-Stop NameNode and are attending the Strata Conference in Santa Clara this week, you can find us at booth 317, where members of the WANdisco team will be doing live demos of Non-Stop NameNode throughout the event.

Hadoop Console: Simplified Hadoop for the Enterprise

We are pleased to announce the latest release in our string of Big Data announcements: the WANdisco Hadoop Console (WHC). WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

This innovative Big Data solution offers enterprise users:

  • An S3-enabled HDFS option for securely migrating from Amazon’s public cloud to a private in-house cloud
  • An intuitive UI that makes it easy to install, monitor and manage Hadoop clusters
  • Full support for Amazon S3 features (metadata tagging, data object versioning, snapshots, etc.)
  • The option to implement WHC in either a virtual or physical server environment
  • Improved server efficiency
  • Full support for HBase

“WANdisco is addressing important issues with this product including the need to simplify Hadoop implementation and management as well as public to private cloud migration,” said John Webster, senior partner at storage research firm Evaluator Group. “Enterprises that may have been on the fence about bringing their cloud applications private can now do so in a way that addresses concerns about both data security and costs.”

More information about WHC is available from the WANdisco Hadoop Console product page. Interested parties can also download our Big Data whitepapers and datasheets, or request a free trial of WHC. Professional support for our Big Data solutions is also available.

This latest Big Data announcement follows the launch of our WANdisco Distro, the world’s first production-ready version of Apache Hadoop 2.