Monthly Archive for January, 2015

Hadoop Data Protection

I just came across this nice summary of data protection strategies for Hadoop.  It hits on a key problem: typical backup strategies for Hadoop just won’t handle the volume of data, and there’s not much available from dedicated backup vendors either.  Because it’s a daunting problem, companies just assume that Hadoop is “distributed enough” to not require a bulletproof backup strategy.

But as we’ve heard time and again from our customers, that’s just not the case.  The article shows why – if you read on to the section on DistCP, the tool normally used for cluster backup, you’ll see that it can take hours to back up a few terabytes of data.
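
To put rough numbers behind that claim, here is a back-of-envelope sketch.  The 3 TB dataset size and 300 MB/s sustained copy rate are illustrative assumptions for this post, not figures from the article.

```python
# Back-of-envelope estimate of a DistCp-style bulk copy.
# The dataset size and sustained throughput are illustrative assumptions.
dataset_tb = 3                 # terabytes to copy
throughput_mb_per_s = 300      # sustained aggregate copy rate

dataset_mb = dataset_tb * 1024 * 1024
hours = dataset_mb / throughput_mb_per_s / 3600
print(f"~{hours:.1f} hours to copy {dataset_tb} TB at {throughput_mb_per_s} MB/s")
# About 2.9 hours, before any retries, and the result is only a point-in-time copy.
```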

As the article mentions, what’s necessary is an efficient block-level backup solution.  Luckily, that’s just what WANdisco provides in Non-Stop Hadoop.  The architect of our geographically distributed solution, a unified cross-cluster HDFS data layer, described the approach at a Strata conference last year.

The article actually mentions our solution, but I think there was a slight misunderstanding.  WANdisco does not actually make a “just a backup” solution, so we don’t provide any more versioning than what you get with regular HDFS.  In fact that’s the whole point – we provide an HDFS data layer that spans multiple clusters and locations.  It provides a very effective disaster recovery strategy as well as other benefits like cluster zones and multiple data-center ingest.

[Figure: Non-Stop Hadoop architecture components]

Interested in learning more?  We’re here to help.

Complete control over Hadoop data locality

Non-Stop Hadoop provides a unified data layer across Hadoop clusters in one or many locations. This unified data layer solves a number of problems by providing a very low recovery point objective for critical data, full continuity of data access in the event of failure, and the ability to ingest and process data at any cluster.

Carrying this layer to its logical conclusion, you may ask whether we’ve introduced a new problem in the process of solving these others: what if you don’t want to replicate all HDFS data everywhere?

Perhaps you have to respect data privacy or locality regulations, or maybe it’s just not practical to ship all your raw data across the WAN. Do you have to fall back to workflow management systems like Falcon to do scheduled selective data transfers, and deal with the delays and complexity of building an ETL-style pipeline?

Luckily, no. Non-Stop Hadoop provides a selective replication capability that is more sophisticated than what you could build manually with the stock data transfer tools. As part of a centralized administration function, for each part of the HDFS namespace you can define the following (a sketch of what such a policy might look like appears after the list):

  • Which data centers receive the data
  • The replication factor in each data center
  • Whether data is available for remote (WAN) read even if it is not available locally
  • Whether data can be written in a particular data center
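
To make this concrete, here is a minimal sketch of what such a policy might look like, written in Python.  The class names, fields, and data center names are hypothetical illustrations for this post, not WANdisco’s actual configuration API.

```python
# Hypothetical sketch of a selective replication policy for one part of the
# HDFS namespace. All names below are illustrative; they are not WANdisco's
# actual configuration API.
from dataclasses import dataclass, field


@dataclass
class DataCenterRule:
    receives_data: bool       # does this data center hold a copy of the data?
    replication_factor: int   # local HDFS replication factor in this data center
    allow_remote_read: bool   # may clients here read the data over the WAN if it is not local?
    allow_write: bool         # may clients here write under this path?


@dataclass
class ReplicationPolicy:
    namespace_path: str                        # subtree of the HDFS namespace
    rules: dict = field(default_factory=dict)  # data center name -> DataCenterRule


# Example: EU customer data stays in the EU data center only.
eu_only = ReplicationPolicy(
    namespace_path="/data/customers/eu",
    rules={
        "dc-frankfurt": DataCenterRule(receives_data=True, replication_factor=3,
                                       allow_remote_read=False, allow_write=True),
        "dc-virginia": DataCenterRule(receives_data=False, replication_factor=0,
                                      allow_remote_read=False, allow_write=False),
    },
)
```

A policy along these lines captures the privacy case discussed next: the restricted subtree is never shipped to, read from, or written in the other data center.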

This solves a host of problems. Perhaps most importantly, if you have sensitive data that cannot be transferred outside a certain area, you can make sure it never reaches data centers in other areas. Further, you can ensure that the restricted part of the namespace is never accessed for reads or writes in other areas.

Non-Stop Hadoop’s selective replication also solves some efficiency problems. Simply choose not to replicate temporary ‘working’ data, or only replicate rarely accessed data on demand. Similarly, you don’t need as high a replication factor if data exists in multiple locations, so you can cut down on some local storage costs.
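
As a quick illustration of the storage arithmetic, consider a 10 TB dataset.  The replication factors below are made-up examples, not recommendations.

```python
# Illustrative storage arithmetic for a 10 TB dataset.
# The replication factors are made-up examples, not recommendations.
dataset_tb = 10

# A single cluster holding the only copy typically keeps HDFS's default
# replication factor of 3.
single_site_rf3 = dataset_tb * 3

# If two clusters each hold the data, a lower local replication factor may be
# acceptable because the other site already provides geographic redundancy.
per_site_rf2 = dataset_tb * 2

print(f"Single site, RF=3: {single_site_rf3} TB of raw storage at that site")  # 30 TB
print(f"Two sites, RF=2 each: {per_site_rf2} TB of raw storage per site")      # 20 TB, 4 copies overall
```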

Selective replication across multiple clusters sharing a Non-Stop Hadoop HDFS data layer: replication policies control where subsets of HDFS are replicated, the replication factor in each cluster, and the availability of remote (WAN) reads

Consistent, highly available data is really just the starting point: Non-Stop Hadoop also gives you powerful tools to control where data resides, how it gets there, and how it’s stored.

By now you’ve probably thought of a problem that selective replication can help you solve.  Give our team of Hadoop experts a call to learn more.

Analyzing Smart Meter Data with Hadoop

Connected Home is a service developed by British Gas for monitoring and controlling energy use; it offers an app that lets customers turn their heating on and off. The internet has dramatically changed home entertainment, but its impact on everyday household life is only beginning, and British Gas aims to differentiate its service by drawing on third parties as well.

In March 2014, WANdisco joined a trial that collects data from the smart meters of one million households to monitor and control energy use. The goal was to demonstrate that, by analyzing the real-time data collected, demand patterns can be dynamically matched with supply, supply can be kept in line with demand, and usage can be controlled for both businesses and households.

To meet the real-time and compliance requirements, Non-Stop Hadoop was deployed; it minimized data loss and downtime on a 100-node cluster while also reducing storage costs substantially.

The ten-month trial concluded successfully, and the system is moving into production at twice the scale. WANdisco signed a three-year subscription agreement with British Gas worth US$750K.


About Kenji Ogawa (小川 研之)

Leading WANdisco’s business in Japan since November 2013. Previously worked at NEC on the development of domestic mainframes, Unix, and middleware, and later on scouting Silicon Valley startups, partner management, and offshore development in India.

Wildcards in Subversion Authorization

Support for wildcards in Subversion authorization rules has been noticeably lacking for many years.  The use cases for wildcards are numerous and well understood: denying write access to a set of protected file types in all branches, granting access to all sandbox branches in all projects, and so on.

So I’m very pleased to announce that WANdisco is now supporting wildcards for Subversion in our Access Control Plus product.  With this feature you can now easily define path restrictions for Subversion repositories using wildcards.

How does this work given that core Subversion doesn’t support wildcards?  Well, wildcard support is a long-standing feature request in the open source Subversion project, and we picked up word that there was a good design under review.  We asked one of the committers who works for WANdisco to create a patch that we could regression test and ship with our SVN MultiSite Plus and Access Control Plus products until the design lands in the core project.

Besides letting you define rules with wildcards, Access Control Plus does a couple of other clever things:

  • Lets you set a relative priority that controls the ordering of sections in the authz file.  Ordering is significant when wildcards are in use, since multiple sections may match the same path (see the sketch after this list).
  • Warns you if two rules may conflict because they affect the same path but have different priorities.
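
To see why ordering matters, here is a small Python sketch.  The glob-style rule paths and the “later matching section wins” resolution are simplifications made up for this example; they are not the exact semantics of Subversion’s authz evaluation or of Access Control Plus.

```python
# Toy model of why section ordering matters when wildcard rules overlap.
# The rule format and "later section wins" resolution are simplifications
# for illustration, not Subversion's or Access Control Plus's exact semantics.
from fnmatch import fnmatch

# Ordered (pattern, user, permission) rules, as they might appear top-to-bottom
# in an authz file. The paths and user name are made up.
rules = [
    ("/*/branches/sandbox/*", "dev-team", "rw"),  # all sandbox branches in all projects
    ("/*/branches/*/*.jar",   "dev-team", "r"),   # protected file type: read-only everywhere
]

def effective_permission(path, user):
    """Return the permission granted by the last matching rule for this user."""
    perm = None
    for pattern, rule_user, rule_perm in rules:
        if rule_user == user and fnmatch(path, pattern):
            perm = rule_perm  # a later matching section overrides an earlier one
    return perm

# Both rules match this path, so the outcome depends entirely on their order:
print(effective_permission("/projA/branches/sandbox/build.jar", "dev-team"))  # prints 'r'
```

Swap the two sections and the same path becomes read-write for the same user.  That silent flip is exactly the kind of surprise the priority and conflict-warning features are meant to catch.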

[Figure: Access Control Plus warning about conflicting wildcard rules]

This feature will likely be a lifesaver for Subversion administrators – just contact us and we’ll help you take advantage of it.

Is One Hadoop Cluster Enough?

A new report from GigaOM analyst Paul Miller provides some insight into the question: is one Hadoop cluster enough for most Big Data needs?  It’s surprising how much attention this topic has garnered recently.  Up until a few months ago I hadn’t really thought much about why you’d need more than a single cluster.  After all, most of the technical information about Hadoop is geared towards running everything on one cluster, especially since YARN makes it easier to run multiple applications on a single cluster.

But another recent study shows that a majority of Big Data users are running multiple data centers.  The GigaOM report dives into some of the reasons why that might be.  Workload optimization, load balancing, taking advantage of the cloud for affordable burst processing and backups, regulatory concerns – there are a host of reasons that are driving Hadoop adopters toward a logical data lake consisting of several clusters.  And of course there’s also the fact that many Hadoop deployments evolve from a collection of small clusters set up in isolation.

The report also notes that the tools for managing the flow of data between multiple clusters are still rudimentary.  DistCP, which underpins many of the ETL-style tools like Falcon, can be quite slow and error-prone.  If you only need to sync data between clusters once a day that may be acceptable, but many use cases demand near real-time roll-up analysis.

That’s why WANdisco provides active-active replication: Non-Stop Hadoop lets your data span clusters and geographies.  In the interest of saving a thousand words:

[Figure: Non-Stop Hadoop reference architecture]

Interested?  Check out some of the reasons why this architecture is attractive to Hadoop data consumers and operators.