Enter AltoStor

In my last post I talked about some of the problems of setting up data lakes in real Hadoop deployments. Now here's the better way: AltoStor lets you build effective, fast, and secure Hadoop deployments by bridging several Hadoop clusters, even if those clusters use different distributions, different versions of Hadoop, or even different file systems.

How does it work?

AltoStor lets you share data directories between two or more clusters. The data is replicated using WANdisco's active-active replication engine, and this isn't just a fancier way to mirror data: every cluster can write into the shared directories, and changes are coordinated in real time between the clusters. The reliability comes from the Paxos-based replication engine, a proven, patented way to coordinate changes coming from anywhere in the world. Clusters that are temporarily down or disconnected catch up automatically when they're back online.
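
To make that coordination idea a bit more concrete, here is a minimal, purely conceptual Python sketch. It is not AltoStor's actual API and it glosses over the Paxos protocol itself; it only shows the behaviour described above: writes submitted from any cluster are placed into one agreed global order, and a cluster that was offline replays that order when it comes back.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Proposal:
        origin: str
        operation: str            # e.g. "create /shared/sales/day1.csv"

    @dataclass
    class CoordinationEngine:
        """Stand-in for the Paxos-based engine: assigns one global order."""
        log: List[Proposal] = field(default_factory=list)

        def agree(self, proposal: Proposal) -> int:
            # In the real engine this is a consensus round across data centers.
            self.log.append(proposal)
            return len(self.log) - 1   # global sequence number

    @dataclass
    class Cluster:
        name: str
        applied: int = -1             # last sequence number applied locally

        def submit(self, engine: CoordinationEngine, operation: str) -> None:
            engine.agree(Proposal(self.name, operation))

        def catch_up(self, engine: CoordinationEngine) -> None:
            # A cluster that was offline simply replays the agreed log from
            # where it left off, which is how clusters catch up automatically.
            for seq in range(self.applied + 1, len(engine.log)):
                p = engine.log[seq]
                print(f"{self.name}: apply #{seq} ({p.operation}) from {p.origin}")
                self.applied = seq

    engine = CoordinationEngine()
    london, chicago = Cluster("london"), Cluster("chicago")
    london.submit(engine, "create /shared/sales/day1.csv")
    chicago.submit(engine, "append /shared/sales/day1.csv")   # writes can originate anywhere
    london.catch_up(engine)
    chicago.catch_up(engine)

The real engine reaches agreement across data centers with Paxos rather than a shared in-memory list, but the ordering-and-replay guarantee is the same idea.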

The actual data transfer is done as an asynchronous background process and doesn’t consume MapReduce resources.

Selective replication enhances security: you can centrally define whether data is available to every cluster or just one.


What benefits does it bring?

OK, put the technical bits aside. What can this thing actually do? In the next post I'll show you how AltoStor helps you get more value out of your Hadoop clusters.

AltoStor: A Bridge Between Clusters, Distributions, and Storage Systems

The vision of the data lake is admirable: collect all your valuable business data in one repository. Make it available for analysis and generate actionable data fast enough to improve your strategic and tactical business decisions.

Translated into Hadoop terms, that implies putting all the data in a single large Hadoop cluster. That gives you the analysis advantages of the data lake while leveraging Hadoop's low storage costs. And indeed, a recent survey found that 61% of Big Data analytics projects have shifted some EDW workload to Hadoop.

But in reality, it's not that simple. 35% of those involved in Big Data projects are worried about maintaining performance as data volume and workload increase. 44% are concerned about the lack of enterprise-grade backup. Those concerns argue against concentrating ever more data into one cluster.

And meanwhile, 70% of the companies in that survey have multiple clusters in use. Small clusters that started as department-level pilots become production clusters. Security or cost concerns may dictate separate clusters for different groups. Upgrading to a new Hadoop distribution to take advantage of new components (or abandon old ones) can mean a difficult migration. Whatever the reason, the reality of Hadoop deployments is more complicated than you'd think.

As for making multiple clusters play well together… well, the fragility of tools like DistCp brings back memories of those complicated ETL processes that we wanted to leave behind us.

So are we doomed to an environment of data silos? Isn’t that what we were trying to avoid?


There is a better way. In the next post I’ll introduce AltoStor, the only Hadoop-compatible file system that quickly and easily shares data across clusters, distributions, and file systems.

Survey source: Wikibon

SmartSVN has a new home

We're pleased to announce that from 23 February 2015 SmartSVN will be owned, maintained and managed by SmartSVN GmbH, a wholly owned subsidiary of Syntevo GmbH.

Long-term customers will remember that Syntevo were the original creators and suppliers of SmartSVN before WANdisco purchased the product.

We've brought a lot of great features and enhancements to SmartSVN since we purchased it in 2012, particularly the change from SVNKit to JavaHL, which brought significant performance improvements and means that SmartSVN can keep pace with updates to core Subversion much faster than before.

During the last two years the founders of Syntevo have continued to work with WANdisco at both the engineering and consulting levels, so the transition back into their ownership will be smooth and seamless. We're confident that having the original creators of SmartSVN take over the reins again will ensure that SmartSVN remains the best cross-platform Subversion product available for a long time to come.

Will this affect my purchased SmartSVN license?

No. SmartSVN GmbH will continue to support current SmartSVN users, and you'll be able to renew through them once the free upgrade period of your SmartSVN license has expired.

Where should I raise issues in the future?

The best place to go is Syntevo's contact page, where you'll find the right contact depending on the nature of your issue.

A thank you to the SmartSVN community

Your input has been invaluable in guiding the improvements we've made to SmartSVN; we couldn't have done it without you. We'd like to thank you for your business over the last two years, and we hope you continue to enjoy the product.

Regards,
Team WANdisco

Join WANdisco at Strata

The Strata conferences are some of the best Big Data shows around.  I’m really looking forward to the show in San Jose on February 17-20 this year.  The presentations look terrific, and there are deep-dive sessions into Spark and R for all of the data scientists.

Plus, WANdisco will have a strong presence.  Our very own Jagane Sundar and Brett Rudenstein will be in the Cube to talk about WANdisco’s work on distributed file systems.  They’ll also show early demos of some exciting new technology, and you can always stop by our booth to see more.

Look forward to seeing everyone out there!

Register for Hadoop Security webinar

Security in Hadoop is a challenging topic.  Hadoop was built without much of a security framework in mind, so over the years the distribution vendors have added new security layers.  Kerberos, Knox, Ranger, Sentry: there are a lot of security components to consider in this fluid landscape.  Meanwhile, the demand for security is increasing thanks to growing data privacy concerns, exacerbated by the recent string of security breaches at major corporations.

This week Wikibon's Jeff Kelly will give his perspective on how to secure sensitive data in Hadoop.  It should be an interesting and useful webinar, and I hope you'll join us.  Just visit http://www.wandisco.com/webinars to register.

Data locality leading to more data centers

In the 'yet another headache for CIOs' category, here's an interesting read from the Wall Street Journal on why US companies are going to start building more data centers in Europe soon.  In the wake of various cybersecurity threats and some recent political events, national governments are more sensitive to their citizens' data leaving their area of control.  That's data locality leading to more data centers, and it will affect a lot of companies.

Multinational firms are of course affected, as they have customer data originating from several areas.  But in my mind the jury is out on how big the impact will be.  Even if you're just a consumer of social media data, do you need a local data center in every area you pull that feed from?  It's likely going to take a few years (and probably some legal rulings) before the dust settles.

You can imagine that this new requirement puts a real crimp in Hadoop deployment plans.  Do you now need at least a small cluster in each area you do business in?  If so, how do you easily keep sensitive data local while still sharing downstream analysis?

This is one of the areas where a geographically distributed HDFS with powerful selective replication capabilities can come to the rescue.  For more details, have a listen to the webinar on Hadoop data transfer pipelines that I ran with 451 Research’s Matt Aslett last week.

Hadoop Data Protection

I just came across this nice summary of data protection strategies for Hadoop.  It hits on a key problem: typical backup strategies for Hadoop just won't handle the volume of data, and there's not much available from dedicated backup vendors either.  Because it's a daunting problem, companies tend to assume that Hadoop is "distributed enough" not to require a bulletproof backup strategy.

But as we've heard time and again from our customers, that's just not the case.  The article shows why: if you read on to the section on DistCp, the tool normally used for cluster backup, you'll see that it can take hours to back up a few terabytes of data.
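
A quick back-of-the-envelope calculation shows why.  The data size below is the article's "few terabytes"; the link speed and overhead figures are assumptions of mine, not numbers from the article:

    # Rough transfer-time estimate for a DistCp-style bulk copy between clusters.
    data_tb = 5           # "a few terabytes"
    link_gbps = 1.0       # assumed dedicated inter-cluster link (an assumption, not from the article)
    efficiency = 0.7      # assumed protocol and MapReduce job overhead

    seconds = (data_tb * 1e12 * 8) / (link_gbps * 1e9 * efficiency)
    print(f"{data_tb} TB over a {link_gbps} Gbps link: about {seconds / 3600:.0f} hours")
    # -> roughly 16 hours, before the copy job even competes for cluster resources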

As the article mentions, what's needed is an efficient block-level backup solution.  Luckily, that's just what WANdisco provides in Non-Stop Hadoop.  The architect of our geographically distributed solution for a unified cross-cluster HDFS data layer described the approach at a Strata conference last year.

The article actually mentions our solution, but I think there was a slight misunderstanding.  WANdisco does not make "just a backup" solution, so we don't provide any more versioning than you get out of regular HDFS.  In fact, that's the whole point: we provide an HDFS data layer that spans multiple clusters and locations.  It provides a very effective disaster recovery strategy as well as other benefits like cluster zones and multiple data-center ingest.


Interested in learning more?  We’re here to help.

Complete control over Hadoop data locality

Non-Stop Hadoop provides a unified data layer across Hadoop clusters in one or many locations. This unified data layer solves a number of problems by providing a very low recovery point objective for critical data, full continuity of data access in the event of failure, and the ability to ingest and process data at any cluster.

Carrying this layer to its logical conclusion, though, you may ask whether we've introduced a new problem while solving those others: what if you don't want to replicate all HDFS data everywhere?

Perhaps you have to respect data privacy or locality regulations, or maybe it's just not practical to ship all of your raw data across the WAN. Do you have to fall back on workflow management systems like Falcon to do scheduled selective data transfers, and deal with the delays and complexity of building an ETL-style pipeline?

Luckily, no. Non-Stop Hadoop provides a selective replication capability that is more sophisticated than what you could build manually with the stock data transfer tools. As part of a centralized administration function, for each part of the HDFS namespace you can define the following (a sketch of one such policy follows the list):

  • Which data centers receive the data
  • The replication factor in each data center
  • Whether data is available for remote (WAN) read even if it is not available locally
  • Whether data can be written in a particular data center
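
As an illustration only, a policy covering one slice of the namespace might capture those four attributes roughly like this. The class and field names below are hypothetical, not Non-Stop Hadoop's actual configuration schema:

    from dataclasses import dataclass
    from typing import Dict, Tuple

    @dataclass
    class ReplicationPolicy:
        """Hypothetical per-path policy object; field names are illustrative."""
        hdfs_path: str
        data_centers: Dict[str, int]   # data center -> local replication factor
        remote_read: bool              # may other sites read this data over the WAN?
        writable_in: Tuple[str, ...]   # data centers allowed to write under this path

    # Keep EU customer data in Frankfurt only, but allow remote (WAN) reads.
    eu_customers = ReplicationPolicy(
        hdfs_path="/data/eu/customers",
        data_centers={"frankfurt": 3},
        remote_read=True,
        writable_in=("frankfurt",),
    )

    # Replicate shared analytics results everywhere, with a lower local replication
    # factor since the data already exists in multiple locations.
    analytics = ReplicationPolicy(
        hdfs_path="/data/analytics/results",
        data_centers={"frankfurt": 2, "newyork": 2, "tokyo": 2},
        remote_read=True,
        writable_in=("frankfurt", "newyork", "tokyo"),
    )

The two example policies mirror the scenarios discussed just below: locality-restricted data kept in a single data center, and widely shared results replicated everywhere with a lower local replication factor.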

This solves a host of problems. Perhaps most importantly, if you have sensitive data that cannot be transferred outside a certain area, you can make sure it never reaches data centers elsewhere. Further, you can ensure that the restricted part of the namespace is never accessed for reads or writes outside that area.

Non-Stop Hadoop’s selective replication also solves some efficiency problems. Simply choose not to replicate temporary ‘working’ data, or only replicate rarely accessed data on demand. Similarly, you don’t need as high a replication factor if data exists in multiple locations, so you can cut down on some local storage costs.

Selective replication across multiple clusters sharing a Non-Stop Hadoop HDFS data layer: replication policies control where subsets of HDFS are replicated, the replication factor in each cluster, and the availability of remote (WAN) reads.

Consistent, highly available data is really just the starting point.  Non-Stop Hadoop also gives you powerful tools to control where data resides, how it gets there, and how it's stored.

By now you’ve probably thought of a problem that selective replication can help you solve.  Give our team of Hadoop experts a call to learn more.

Analyzing Smart Meter Data with Hadoop

Connected Home is a service being developed by British Gas for monitoring and controlling energy use, with an app that lets customers turn their heating on and off. The internet has transformed home entertainment, but it has yet to do the same for everyday life itself, and British Gas aims to differentiate its services in that space, working with third parties as well.

In March 2014, WANdisco joined a trial that collects data from smart meters in one million households to monitor and control energy use. The aim was to demonstrate that analyzing the real-time data collected makes it possible to dynamically match demand patterns with supply, deliver supply that meets demand, and give both businesses and households control over their usage.

To meet the real-time and compliance requirements, Non-Stop Hadoop was deployed, minimizing data loss and downtime on a 100-node cluster while also significantly reducing storage costs.

With the 10-month trial completed successfully, the system is moving into production at twice the scale, and WANdisco has signed a three-year subscription contract with British Gas worth US$750K.

Wildcards in Subversion Authorization

Support for wildcards in Subversion authorization rules has been noticeably lacking for many years.  The use cases are numerous and well understood: denying write access to a set of protected file types in all branches, granting access to all sandbox branches in all projects, and so on.

So I'm very pleased to announce that WANdisco now supports wildcards for Subversion in our Access Control Plus product.  With this feature you can easily define path restrictions for Subversion repositories using wildcards.

How does this work, given that core Subversion doesn't support wildcards?  Well, wildcard support is a long-standing feature request in the open source Subversion project, and we picked up word that there was a good design under review.  We asked one of the Subversion committers who works for WANdisco to create a patch that we could regression-test and ship with our SVN MultiSite Plus and Access Control Plus products until the design lands in the core project.

Besides letting you define rules with wildcards, Access Control Plus does a couple of other clever things.

  • Lets you set a relative priority that controls the ordering of sections in the AuthZ file.  Ordering is significant when wildcards are in use, since multiple sections may match the same path, as the sketch after this list shows.
  • Warns you if two rules may conflict because they affect the same path but have different priorities.
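
To see why that ordering matters, here is a small, purely illustrative Python sketch built on the two use cases mentioned at the top of this post.  It is not Subversion's or Access Control Plus's actual rule-resolution logic, and the pattern syntax is illustrative only; it just demonstrates that several wildcard sections can match the same path.

    from fnmatch import fnmatch

    # Two wildcard rules taken from the use cases above (illustrative syntax):
    rules = [
        ("branches/*/sandbox/**", "@developers = rw"),   # open up all sandbox branches
        ("**/*.exe",              "* = r"),              # protect binaries everywhere
    ]

    path = "branches/1.2/sandbox/setup.exe"

    # fnmatch is only a stand-in (its '*' also crosses '/'); the point is that
    # more than one wildcard section can match the same path, so the order of
    # the sections in the AuthZ file decides which rule ends up applying.
    for pattern, rule in rules:
        if fnmatch(path, pattern):
            print(f"{pattern!r} matches {path!r} -> {rule}")

Access Control Plus's relative priority controls which of those overlapping sections comes first in the generated AuthZ file, and its conflict warning flags exactly this kind of overlap.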

This feature will likely be a lifesaver for Subversion administrators.  Just contact us and we'll help you take advantage of it.