Monthly Archive for February, 2015

WANdisco Fusion: A Bridge Between Clusters, Distributions, and Storage Systems

The vision of the data lake is admirable: collect all your valuable business data in one repository, make it available for analysis, and generate actionable insights fast enough to improve your strategic and tactical business decisions.

Translated into Hadoop terms, that means putting all the data into a single large Hadoop cluster. That gives you the analytical advantages of the data lake while leveraging Hadoop’s low storage costs. And indeed, a recent survey found that 61% of Big Data analytics projects have shifted some enterprise data warehouse (EDW) workload to Hadoop.

But in reality, it’s not that simple. In the same survey, 35% of those involved in Big Data projects are worried about maintaining performance as data volume and workload increase, and 44% are concerned about the lack of enterprise-grade backup. Those concerns argue against concentrating ever more data in one cluster.

And meanwhile, 70% of the companies in that survey have multiple clusters in use. Small clusters that started as department-level pilots become production clusters. Security or cost concerns may dictate separate clusters for different groups. Upgrading to a new Hadoop distribution to take advantage of new components (or to abandon old ones) can be a difficult migration. Whatever the reason, the reality of Hadoop deployments is more complicated than you’d think.

As for making multiple clusters play well together… well, the fragility of tools like DistCp brings back memories of the complicated ETL processes we wanted to leave behind.
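
To see why, consider what a cross-cluster copy looks like today. Below is a minimal sketch of a nightly copy driven through DistCp’s Hadoop 2.x Java API; the NameNode addresses and paths are hypothetical. DistCp runs as a batch MapReduce job, so a failure partway through leaves the target inconsistent until the next successful run, and all the scheduling, retry, and reconciliation logic around it is yours to build.

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    // Sketch of a nightly cross-cluster copy. Everything around it
    // (scheduling, failure detection, retries) is left as an exercise,
    // which is precisely the fragile part.
    public class NightlyCopy {
      public static void main(String[] args) throws Exception {
        // Hypothetical NameNode URIs for the source and target clusters.
        DistCpOptions options = new DistCpOptions(
            Collections.singletonList(new Path("hdfs://nn-a:8020/data/events")),
            new Path("hdfs://nn-b:8020/data/events"));
        options.setSyncFolder(true); // like -update: copy only what changed

        // Submits a MapReduce job and blocks until it finishes. If it dies
        // partway through, the target is stale until the next run succeeds.
        new DistCp(new Configuration(), options).execute();
      }
    }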

So are we doomed to an environment of data silos? Isn’t that what we were trying to avoid?

There is a better way. In the next post I’ll introduce WANdisco Fusion, the only Hadoop-compatible file system that quickly and easily shares data across clusters, distributions, and file systems.

Survey source: Wikibon

SmartSVN has a new home

We’re pleased to announce that, as of 23rd February 2015, SmartSVN will be owned, maintained and managed by SmartSVN GmbH, a wholly owned subsidiary of Syntevo GmbH.

Long-term customers will remember that Syntevo were the original creators and suppliers of SmartSVN before WANdisco’s purchase of the product.

We’ve brought a lot of great features and enhancements to SmartSVN since we purchased it in 2012, particularly the switch from SVNKit to JavaHL, which brought significant performance improvements and means SmartSVN can pick up updates to core Subversion much faster than before.

During the last two years the founders of Syntevo have continued to work with WANdisco at both the engineering and consulting levels, so the transition back into their ownership will be smooth and seamless. We’re confident that having the original creators of SmartSVN take the reins again will ensure that SmartSVN remains the best cross-platform Subversion product available for a long time to come.

Will this affect my purchased SmartSVN license?

No. SmartSVN GmbH will continue to support current SmartSVN users, and you’ll be able to renew through them once the free upgrade period of your SmartSVN license has expired.

Where should I raise issues in the future?

The best place to go is Syntevo’s contact page, where you’ll find the right contact depending on the nature of your issue.

A thank you to the SmartSVN community

Your input has been invaluable in guiding the improvements we’ve made to SmartSVN; we couldn’t have done it without you. We’d like to say thank you for your business over the last two years, and we hope you continue to enjoy the product.

Regards,
Team WANdisco

Join WANdisco at Strata

The Strata conferences are some of the best Big Data shows around.  I’m really looking forward to the show in San Jose on February 17-20 this year.  The presentations look terrific, and there are deep-dive sessions into Spark and R for all of the data scientists.

Plus, WANdisco will have a strong presence.  Our very own Jagane Sundar and Brett Rudenstein will be on theCUBE to talk about WANdisco’s work on distributed file systems.  They’ll also show early demos of some exciting new technology, and you can always stop by our booth to see more.

Look forward to seeing everyone out there!

Register for Hadoop Security webinar

Security in Hadoop is a challenging topic.  Hadoop was built without much of a security framework in mind, and so over the years the distribution vendors have added new authentication and authorization layers.  Kerberos, Knox, Ranger, Sentry – there are a lot of security components to consider in this fluid landscape.  Meanwhile, the demand for security is increasing thanks to heightened data privacy concerns, exacerbated by the recent string of security breaches at major corporations.
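
As a taste of what retrofitted security looks like in practice, here is a minimal sketch of a client authenticating to a Kerberized cluster via Hadoop’s UserGroupInformation API. The principal, keytab path, and NameNode address are hypothetical; every service and job touching the cluster needs this kind of plumbing once Kerberos is switched on.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    // Sketch: log in from a keytab before touching HDFS on a secure cluster.
    public class KerberizedClient {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://nn-a:8020");          // hypothetical NameNode
        conf.set("hadoop.security.authentication", "kerberos"); // simple -> kerberos
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
            "etl@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");

        // All subsequent filesystem calls carry the Kerberos credentials.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/data")));
      }
    }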

This week Wikibon’s Jeff Kelly will give his perspective on how to secure sensitive data in Hadoop.  It should be an interesting and useful session, and I hope you’ll join us.  Just visit http://www.wandisco.com/webinars to register.

Data locality leading to more data centers

In the ‘yet another headache for CIOs’ category, here’s an interesting read from the Wall Street Journal on why US companies will soon start building more data centers in Europe.  In the wake of various cybersecurity threats and some recent political events, national governments are increasingly sensitive to their citizens’ data leaving their jurisdiction.  That’s data locality leading to more data centers – and it will affect a lot of companies.

Multinational firms are of course affected, since they hold customer data originating from several regions.  But in my mind the jury is still out on how big the impact will be.  If you’re merely a consumer of social media data, do you need a local data center in every region whose feed you consume?  It will likely take a few years (and probably some legal rulings) before the dust settles.

You can imagine that this new requirement puts a real crimp in Hadoop deployment plans.  Do you now need at least a small cluster in each area you do business in?  If so, how do you easily keep sensitive data local while still sharing downstream analysis?

This is one of the areas where a geographically distributed HDFS with powerful selective replication capabilities can come to the rescue.  For more details, have a listen to the webinar on Hadoop data transfer pipelines that I ran with 451 Research’s Matt Aslett last week.
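
To make the idea of selective replication concrete, here is a purely hypothetical sketch of the concept – not WANdisco Fusion’s actual interface.  A policy maps path prefixes to the data centers allowed to hold them, so raw customer records stay in their home region while aggregated results replicate everywhere.

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical illustration of a selective replication policy.
    public class SelectiveReplicationSketch {
      // Which data centers may hold each path prefix.
      private static final Map<String, List<String>> RULES = new LinkedHashMap<>();
      static {
        RULES.put("/data/eu/customers", Arrays.asList("eu-west"));               // EU-only
        RULES.put("/data/us/customers", Arrays.asList("us-east"));               // US-only
        RULES.put("/analytics/aggregates", Arrays.asList("eu-west", "us-east")); // shared
      }

      // Longest matching prefix wins; unlisted paths stay in their home region.
      static List<String> replicasFor(String path, String homeRegion) {
        String best = null;
        for (String prefix : RULES.keySet()) {
          if (path.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
            best = prefix;
          }
        }
        return best != null ? RULES.get(best) : Collections.singletonList(homeRegion);
      }

      public static void main(String[] args) {
        System.out.println(replicasFor("/data/eu/customers/2015-02.avro", "eu-west")); // [eu-west]
        System.out.println(replicasFor("/analytics/aggregates/daily.csv", "eu-west")); // [eu-west, us-east]
      }
    }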