The inspiration for WANdisco Fusion


Roughly two years ago, we sat down to start work on a project that finally came to fruition this week.

At that meeting we set ourselves the challenge of redefining the storage landscape. We wanted to map out a world of completely shared storage in which the underlying systems nonetheless remained entirely heterogeneous.

Why? Because we’d witnessed the beginnings of a trend that has only grown more pronounced with the passage of time.

From the moment we started engaging with customers, we were struck by the extreme diversity of their storage environments. Regardless of whether we were dealing with a bank, a hospital or a utility provider, different types of storage had been introduced across every organization for a variety of use cases.

In time, however, these same companies wanted to start integrating their different silos of data, whether to run real-time analytics or to gain a full 360-degree view of performance. Yet preserving diversity across the data center was critical, given that each storage type has its own strengths.

They didn’t care about uniformity. They cared about performance, and that meant having the best of both worlds. Being able to deliver this became the Holy Grail – at least in the world of data centers.

This isn’t quite the Gordian Knot, but it is certainly a difficult, complex problem – and possibly one that could only be solved with our core patented IP, DConE.

Then we had a breakthrough.

Months later, I’m proud to formally release WANdisco Fusion (WD Fusion), the only product that enables WAN-scope active-active synchronization across different storage systems.

What does this mean in practice? Well, it means you can use Hadoop distributions like Hortonworks, Cloudera or Pivotal for compute, Oracle BDA for fast compute, and EMC Isilon for dense storage. You could even mix a complete variety of Hadoop distros and versions. Whatever your set-up, with WD Fusion you can leverage new and existing storage assets immediately.

With it, Hadoop is transformed from something that runs within a single data center into an elastic platform that runs across multiple data centers throughout the world. WD Fusion allows you to update your storage infrastructure one data center at a time, without impacting application availability or having to copy vast swathes of data once the update is done.

When we were developing WD Fusion we agreed on two things. First, we couldn’t produce anything that made changes to the underlying storage system – it had to behave like a client application. Second, anything we created had to enable a single global namespace across the entire storage infrastructure.

With WD Fusion, we allow businesses to bring together different storage systems by leveraging our existing intellectual property – the same Paxos-powered algorithm behind Non-Stop Hadoop, Subversion MultiSite and Git MultiSite – without making any changes to the platform you’re using.

Another way of putting it is we’ve managed to spread our secret sauce even further.

We have some of the best computer scientists in the world working at WANdisco, but I’m confident that this is the most revolutionary project any of us have ever worked on.

I’m delighted to be unveiling WD Fusion. It’s a testament to the talent and character of our firm, the result of looking at an impossible scenario and saying: “Challenge accepted.”

Latest Subversion 1.9 now available for download

The latest Subversion release includes many new features and bug fixes, with performance improvements and more efficient use of network resources. Alongside FSFS, the repository back end that has served for many years, a new back end (FSX) has been introduced, and operations such as log and merge have been improved.
The new features will be presented in the webinar below on September 15 (2:00 AM on September 16 in Japan):

“What’s New in Subversion 1.9.” Register
(A replay will also be available. If you register via the “Register” link above, you will also receive an email about the replay, so please do sign up. An overview of the webinar will be posted separately on this blog.)
The detailed release notes are available here:

http://subversion.apache.org/docs/release-notes/1.9.html
Subversion 1.9 binaries tested by WANdisco are available for download here:

http://www.wandisco.com/subversion/os/downloads

DevOps is eating the world

You know a technology trend has become fully mainstream when you see it written up in the Wall Street Journal.  So it goes with DevOps, as this recent article shows.

DevOps and continuous delivery have been important trends in many firms for several years.  It’s all about building higher quality software products and delivering them more quickly.  For SaaS companies it’s an obvious fit as they sometimes push out minor changes many times a day.  But even companies with more traditional products can benefit.  And internal IT departments can use DevOps principles to start saying “yes” to business users more often.

For example, let’s say that your business analytics team asks for a small Hadoop cluster to try out some of the latest machine learning algorithms on Spark.  Saying “yes” to that request should only take hours, not weeks.  If you have a private cloud and the right level of automation, you can spin up a new Spark cluster in minutes.  Then you can work with the analysts to automate the deployment of their algorithms.  If they’re wildly successful and they need to move their new project to a production cluster it’s just a matter of deploying somewhere with more resources.
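
To make this concrete, here is a minimal, hypothetical PySpark sketch of the kind of prototype an analytics team might hand over. The input path, column names and number of clusters are placeholder assumptions, not anything specific to the scenario above:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    # Hypothetical dataset and columns; swap in whatever the analysts actually use.
    spark = SparkSession.builder.appName("segmentation-prototype").getOrCreate()
    events = spark.read.csv("hdfs:///demo/events.csv", header=True, inferSchema=True)

    # Assemble numeric columns into a feature vector and cluster into 5 segments.
    features = VectorAssembler(inputCols=["visits", "spend"],
                               outputCol="features").transform(events)
    model = KMeans(k=5, seed=42).fit(features)

    # Write the segment assignments back to HDFS for the analysts to inspect.
    model.transform(features).select("visits", "spend", "prediction") \
         .write.mode("overwrite").parquet("hdfs:///demo/segments")
    spark.stop()

Because a job like this only talks to Spark and HDFS, promoting it from a small development cluster to production is largely a matter of pointing it at an environment with more resources.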

Of course, none of this comes easily.  On the operations side you’ll need to invest in the right configuration management and private cloud infrastructure.  Tools like Puppet, Ansible, and Docker can capture the configuration of servers and applications as code.
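
As a small illustration of “configuration as code”, here is a sketch that pins down a server’s runtime configuration using the Docker SDK for Python (the docker package). The image, port mapping and container name are purely hypothetical:

    import docker

    client = docker.from_env()

    # The container's entire runtime configuration lives in code, not in a runbook.
    # Image, name, ports and environment below are placeholders for illustration.
    client.containers.run(
        "nginx:1.25",
        name="edge-proxy",
        ports={"80/tcp": 8080},          # host port 8080 -> container port 80
        environment={"TZ": "UTC"},
        restart_policy={"Name": "always"},
        detach=True,
    )

Check the configuration into version control and every environment (dev, test, prod) can be rebuilt from it on demand.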

But equally important is the development infrastructure.  Companies like Google practice mainline development: all of their work is done from the trunk or mainline, supported by a massive continuous build and test infrastructure.  And Gerrit, a tool that Google sponsors, is perhaps the best code review tool for continuous delivery.

If you look at potential bottlenecks in a continuous delivery pipeline, you need to consider how code gets to the mainline, and then how it gets deployed.  With Gerrit there are only two steps to the mainline (see the sketch after this list):

  • Commit the code.  Gerrit makes a de facto review branch on the fly and initiates a code review.
  • Approve the merge request.  Gerrit handles the merge automatically unless there’s a conflict.
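
From the developer’s side, those two steps amount to an ordinary commit followed by a push to Gerrit’s review ref for the target branch. A minimal sketch, driving Git from Python, with a hypothetical commit message:

    import subprocess

    # Commit local work, then push it to Gerrit's review ref. Gerrit turns the
    # push into an open review (its de facto review branch) instead of updating
    # master directly; once approved, it merges the change automatically.
    subprocess.run(["git", "commit", "-am", "Tune replication batch size"], check=True)
    subprocess.run(["git", "push", "origin", "HEAD:refs/for/master"], check=True)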

With this system you don’t even need to ask a developer to open a pull request or create a private branch.  Gerrit just automates all of that.  And Gerrit will also invoke any continuous build and test automation to make sure that code is passing those tests before a human reviewer even looks at it.

Once it’s on the mainline the rest of the automation kicks in, and those operational tools become important to help you rapidly spin up more realistic test environments.

As you can imagine, this type of infrastructure can put a heavy load on your development systems.  That’s why WANdisco has put the muscle of Git MultiSite behind Gerrit, giving you a horizontally scalable Gerrit infrastructure.

Latest Git binaries available for download

As part of our participation in the open source SCM community, WANdisco provides up-to-date binary downloads for Git and Subversion for all major platforms.  We now have the latest Git binaries available for download on our Git downloads site.

One interesting new feature is git push --atomic.  When you’re pushing several refs (e.g. branches) at once, this feature makes sure that either all the refs are accepted or none are.  That’s useful if you’re making related changes on several branches at once.  Those who merge patches onto several releases at once are often in this position.
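
For example, a fix that touches two release branches might be pushed like this (the branch names are hypothetical, and both client and server need Git 2.4 or later for --atomic):

    import subprocess

    # Either both release branches are updated on the server, or neither is.
    subprocess.run(
        ["git", "push", "--atomic", "origin", "release/1.8", "release/1.9"],
        check=True,
    )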

The Git community has done a great job of ensuring a stable upgrade process, so there’s generally little concern about upgrading.  It’s always a good idea to review the release notes of course.

Big Data Tech Infrastructure Market Share

The Data Science Association just published this infographic showing market share for a variety of different tools and technologies that form part of the Big Data ecosystem.  The data would’ve been more useful if it was grouped into categories, but here are a few observations:

  • Amazon is dominating the field for cloud infrastructure.  It’d be interesting to see how much of that is used for test and development versus serious production deployments.
  • Cloudera has more market share than vanilla Apache Hadoop, Hortonworks, or MapR.  It’ll be interesting to see how this picture evolves over time with the advent of the Open Data Platform.
  • Mesos has a surprising share of 14%.  At a recent Big Data event in Denver an audience survey showed that only one person out of 50 was even experimenting with Mesos.  Perhaps this survey is oriented more towards early adopters.

It’s always interesting to see these types of surveys as a complement to the analyst surveys from 451, Wikibon, and the like.

The 100 Day Progress Report on the ODP

This blog by Cheryle Custer, Director of Strategic Alliance Marketing at Hortonworks, has been republished with the author’s permission.

It was just a little over 100 days ago that 15 industry leaders in the Big Data space announced the formation of the Open Data Platform (ODP) initiative. We’d like to let you know what has been going on in that time, to bring you a preview of what you can expect in the next few months and let you know how you can become involved.

Some Background

What is the Open Data Platform Initiative?
The Open Data Platform Initiative (ODP) is a shared industry effort focused on simplifying adoption, promoting use, and advancing the state of Apache Hadoop® and Big Data technologies for the enterprise. It is a non-profit organization being created by people who helped create Apache, Eclipse, Linux, OpenStack, OpenDaylight, the Open Networking Foundation, OSGi, WS-I (Web Services Interoperability), UDDI, OASIS, the Cloud Foundry Foundation and many others.

The organization relies on the governance of the Apache Software Foundation community to innovate and deliver the Apache project technologies included in the ODP core while using a ‘one member one vote’ philosophy where every member decides what’s on the roadmap. Over the next few weeks, we will be posting a number of blogs to describe in more detail how the organization is governed and how everyone can participate.

What is the Core?
The ODP Core provides a common set of open source technologies that currently includes: Apache Hadoop® (inclusive of HDFS, YARN, and MapReduce) and Apache® Ambari. ODP relies on the governance of the Apache Software Foundation community to innovate and deliver the Apache project technologies included in the ODP core. Once the ODP members and processes are well established, the scope of the ODP Core will expand to include other open source projects.

Benefits of the ODP Core
The ODP core is a set of open source Hadoop technologies designed to provide a standardized core that big data solution providers and software and hardware developers can use to deliver compatible solutions, rooted in open source, that unlock customer choice.

By delivering on a vision of “verify once, run anywhere”, everyone benefits:

  • For Apache Hadoop® technology vendors, reduced R&D costs that come from a shared qualification effort
  • For Big Data application solution providers, reduced R&D costs that come from more predictable and better qualified releases
  • Improved interoperability within the platform and simplified integration with existing systems in support of a broad set of use cases
  • Less friction and confusion for Enterprise customers and vendors
  • Ability to redirect resources towards higher value efforts

100 Day Progress Report

In the 100 days since the announcement, we’ve made some great progress:

Four Platforms Shipping
At Hadoop Summit in Brussels in April, we announced the availability of four Hadoop platforms all based on a vision of a common ODP core: Infosys Information Platform, IBM Open Platform, Hortonworks Data Platform and Pivotal HD. The commercial delivery of ODP-based distributions across multiple industry-leading vendors immediately after the launch of the initiative demonstrates the momentum behind ODP in accelerating the delivery of compatible Hadoop distributions and the simplification an industry-standard core brings to the ecosystem.

New Members and New Participation Levels
In addition to revealing that Telstra is one of the founding Platinum members of the ODP, we’ve added nine new members, including BMC, DataTorrent, PLDT, Squid Solutions, Syncsort, Unifi, zData and Zettaset. We welcome these new members and are looking forward to their participation and their announcements. We also announced a new membership level to provide an easy entrée for any company to participate in the ODP. The Silver level of membership allows companies to have a direct voice in the future of big data and contribute people, tests, and code to accelerate execution on the vision.

Community Collaboration at the Bug Bash
ODP member Altiscale led the effort on a Hadoop Community Bug Bash. This unique event for the Apache Hadoop community, co-sponsored by Hortonworks, Huawei, Infosys, and Pivotal, brought together over 150 participants from eight countries and nine time zones to strengthen Hadoop and honor the work of the community by reviewing and resolving software patches. Read more about the Bug Bash, where 186 issues were resolved either by closure or by patches committed to the code base. Nice job everyone!  You can participate in upcoming bug bashes, so stay tuned.

Technical Working Group and the ASF
Senior engineers and architects from the ODP member companies have come together as a Technical Working Group (TWG). The goal of the TWG is to jump-start the work required to produce ODP core deliverables and to seed the technical community overseeing the future evolution of the ODP core. Delivering on the promise of “verify once and run anywhere”, the TWG is building certification guidelines for “compatibility” (for software running on top of ODP) and “compliance” (for ODP platforms). We have scheduled a second TWG face-to-face meeting at Hadoop Summit, where committers, PMC members and ASF members will continue these discussions.

What’s Next?

Many of the member companies will be at Hadoop Summit in San Jose.

While you’re at Hadoop Summit, you can attend the IBM Meet Up and hear more about the ODP. Stay tuned to this blog as well – we’ll use this as a platform to inform you of new developments and provide you insight on how the ODP works.

Want to know more about the ODP? Here are a few reference documents:

Enterprise Hadoop Adoption: Half Empty or Half Full?

This blog by Shaun Connolly, Hortonworks VP of Corporate Strategy, has been republished with the author’s permission.

As we approach Hadoop Summit in San Jose next week, the debate continues over where Hadoop really is on its adoption curve. George Leopold from Datanami was one of the first to poke the hornet’s nest with his article entitled Gartner: Hadoop Adoption ‘Fairly Anemic’. Matt Asay from TechRepublic and Virginia Backaitis from CMSWire volleyed back with Hadoop Numbers Suggest the Best is Yet to Come and Gartner’s Dismal Predictions for Hadoop Could Be Wrong, respectively.

At the center of the controversy is the report published by Merv Adrian and Nick Heudecker from Gartner: Survey Analysis: Hadoop Adoption Drivers and Challenges. Specifically, the Gartner survey shows that 26% of respondents are deployed, piloting or experimenting; 11% plan to invest within 12 months; and an additional 7% plan to invest within 24 months.

Glass Half Empty or Half Full?

I believe the root of the controversy comes not in the data points stated above, but in the phrasing of one of the key findings statements: “Despite substantial hype and reported successes for early adopters, over half of respondents (54%) report no plans to invest at this time. Additionally, only 18% have plans to invest in Hadoop over the next two years.”

The statement is phrased in the negative sense, from a lack-of-adoption perspective. While not wrong, it represents a half-empty perspective that is more appropriate for analyzing mature markets such as the RDBMS market, which is hundreds of billions of dollars in size and decades into its adoption curve. Comparing today’s Hadoop market size and adoption to today’s RDBMS market is not particularly useful. However, comparing it to the RDBMS market when that market was five years into its own adoption cycle might be an interesting exercise.

When talking about adoption for newer markets like Enterprise Hadoop, I prefer to frame my view using the classic technology adoption lifecycle that models adoption across five categories with corresponding market share %s: Innovators (2.5%), Early Adopters (13.5%), Early Majority (34%), Late Majority (34%), and Laggards (16%).

Putting the Gartner data into this context shows Hadoop in the Early Majority of the market at the classic inflection point of its adoption curve.


As a publicly traded enterprise open source company, not only is Hortonworks’ code open, but our corporate performance and financials are open too. Earlier this month, we released Hortonworks’ first-quarter earnings. In Q4 2014 and Q1 2015, we added 99 and 105 new subscription customers respectively, which means we added over 46% of our 437 subscription customers in the past six months. Looking at the Fortune 100, 40% are Hortonworks subscribers, including 71% of F100 retailers, 75% of F100 telcos, and 43% of F100 banks.


We see these statistics as clear indicators of the building momentum of Open Enterprise Hadoop and the powerful Hortonworks model for extending Hadoop adoption across all industries. I won’t hide the fact that I am guilty of having a Half Full approach to life. As a matter of fact, I proudly wear the t-shirt every chance I get. The Half Full mindset serves us well at Hortonworks, because we see the glass filling quickly. The numbers for the last two quarters show that momentum.

Come Feel the Momentum at Hadoop Summit on June 9th in San Jose!

If you’d like to see the Hadoop momentum for yourself, then come join us at Hadoop Summit in San Jose starting June 9th.

Geoffrey Moore, author of Crossing the Chasm, will be a repeat keynote presenter this year. At Hadoop Summit 2012, he laid out a technology adoption roadmap for Big Data from the point of view of technology providers. Join Geoff as he updates that roadmap with a specific focus on business customers and the buying decisions they face in 2015.

Mike Gualtieri, Principal Analyst at Forrester Research, will also be presenting. Join Mike for his keynote entitled Adoption is the Only Option—Five Ways Hadoop is Changing the World and Two Ways It Will Change Yours.

In addition to keynote speakers, Summit will host more than 160 sessions being delivered by end user organizations, such as Aetna, Ernst & Young, Facebook, Google, LinkedIn, Mercy, Microsoft, Noble Energy, Verizon, Walt Disney, and Yahoo!, so you can get the story directly from the elephant’s mouth.

San Jose Summit 2015 promises to be an informative, innovative and entertaining experience for everyone.

Come join us. Experience the momentum for yourself.

Configuring multiple zones in Hadoop

Hortonworks, a WANdisco partner and another member of the Open Data Platform, recently published a list of best practices for Hadoop infrastructure management.  One of the top recommendations is configuring multiple zones in Hadoop.  Having development, test, and production environments gives you a safe way to test upgrades and new applications without disturbing a production system.

One of the challenges with creating multiple similar zones is sharing data between them.  Whether you’re testing backup procedures and application functionality, or prototyping a new data analysis algorithm, you need to see similar data in all the zones.  Otherwise you’re not really testing in a production-like environment.

But in a large cluster, transferring terabytes of data between zones can be time-consuming, and it’s tough to tell how stale the data really is.  That’s where WANdisco Fusion becomes an essential part of your operational toolkit.  WANdisco Fusion provides active-active data replication between Hadoop clusters.  You can use it to share part of your Hadoop data between dev/test/prod zones in real time.  All of the zones can make full use of the data, although you can of course use your normal access control system to prevent updates from certain zones.
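
For instance, if HDFS ACLs are enabled on the clusters (dfs.namenode.acls.enabled), a replicated dataset can be made read-only for the test zone’s users with a single command. A sketch driven from Python, with hypothetical group and path names:

    import subprocess

    # Grant the test zone's analysts read/browse access to the shared dataset
    # without write permission, so they consume production-like data but cannot
    # change it. Group name and path are placeholders.
    subprocess.run(
        ["hdfs", "dfs", "-setfacl", "-R", "-m",
         "group:test-analysts:r-x", "/shared/telemetry"],
        check=True,
    )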

DevOps principles are coming to Hadoop, so contact one of our solutions architects today to see how WANdisco Fusion can help you maintain multiple zones in your Hadoop deployment.

Different views on Big Data momentum

I was struck recently by two different perspectives on Big Data momentum.  Computing Research just published their 2015 Big Data Review in which they found continued momentum for Big Data projects.  A significantly higher number of their survey respondents in 2015 are using Big Data projects for operational results.  In a contrasting view, Gartner found that only 26% of the respondents were running or even experimenting with Hadoop.

If you dig a little deeper into the Computing study, you’ll see that it’s speaking about a wider range of Big Data options than just Hadoop.  The study mentions that 29% of the respondents are at least considering using Hadoop specifically, up from 15% last year.  So the two studies are closer than they look at first glance, yet the tone is strikingly different.

One possible explanation is that the Big Data movement is much bigger than Hadoop and it’s easier to be optimistic about a movement than a technology.  But even so, I’d tend towards the optimistic view of Hadoop.  If you look at the other technologies being considered for Big Data, analytics tools and databases (including NoSQL databases) are driving tremendous interest, with over 40% of the Computing Research participants evaluating new options.  And the Hadoop community has done a tremendous amount of work to turn Hadoop into a general purpose Big Data platform.

You don’t have to look very far for examples.  Apache Spark is now bundled in mainstream distributions to provide fast in-memory processing, while Pivotal (a member of the Open Data Platform along with WANdisco) has contributed Greenplum and HAWQ to the open source effort.

To sum up, the need for ‘Big Data’ is not in dispute, but the technology platforms that underpin Big Data are evolving rapidly.  Hadoop’s open nature and evolution from a processing framework to a platform are points in its favor.

Behind the scenes: Rapid Hadoop deployment

If you’ve ever deployed a Hadoop cluster from scratch on internal hardware or EC2, you know there are a lot of details to get right.  Syncing time with ntp, setting up password-less login across all the nodes, and making sure you have all the prerequisite packages installed is just the beginning.  Then you have to actually deploy Hadoop.  Even with a management tool like Ambari there’s a lot of time spent going through the web interface and deploying software.  In this article I’m going to describe why we invested in a framework for rapid Hadoop deployment with Docker and Ansible.

At WANdisco we have teams of engineers and solutions architects testing our latest products on a daily basis, so automation is a necessity.  Last year I spent some time on a Vagrant-Puppet toolkit to set up EC2 images and deploy Hadoop using Ambari blueprints.  As an initial effort it was pretty good, but I never invested the time to handle the cross-node dependencies.  For instance, after the images were provisioned with all the prerequisites I manually ran another Puppet script to deploy Ambari, then another one to deploy Hue, rather than having a master process that handled the timing and coordination.
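
For reference, driving an Ambari blueprint deployment boils down to two REST calls: register the blueprint, then create a cluster from it. A rough sketch in Python; the host name, file names and default credentials here are placeholders, not a description of our internal setup:

    import requests

    AMBARI = "http://ambari-host:8080/api/v1"     # hypothetical management node
    AUTH = ("admin", "admin")                     # default credentials; change in practice
    HEADERS = {"X-Requested-By": "ambari"}        # header required by the Ambari REST API

    # 1. Register the blueprint (cluster topology and service layout).
    with open("hdp-small-blueprint.json") as f:
        requests.post(AMBARI + "/blueprints/hdp-small", data=f.read(),
                      auth=AUTH, headers=HEADERS).raise_for_status()

    # 2. Create a cluster from the blueprint using a host-mapping template.
    with open("hdp-small-hosts.json") as f:
        requests.post(AMBARI + "/clusters/demo", data=f.read(),
                      auth=AUTH, headers=HEADERS).raise_for_status()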

Luckily we have a great automation team in our Sheffield office that set up a push-button solution using Docker and Ansible.  With a single invocation you get:

  • 3 clusters (mix-and-match with the distributions you prefer)
  • Each cluster has 7 containers.  The first runs the management tool (like Ambari), the second runs the NameNode and most of the master services, the third runs Hue, and the others are data nodes.
  • All of the networking and other services are registered correctly.
  • WANdisco Fusion installed.

Starting from a bare metal host, it takes about 20 minutes to do a one-time setup with Puppet that installs Docker and the Ansible framework and builds the Docker images.  Once that first-time setup is done, a simple script starts the Docker containers and runs Ansible to deploy Hadoop.  That takes about 20 minutes for a clean install, or 2-3 minutes to refresh the clusters with the latest build of our products.
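
We haven’t published the internal playbooks, but the shape of that single invocation is roughly the following; the image, network, container and playbook names are hypothetical stand-ins:

    import subprocess
    import docker

    client = docker.from_env()

    # Start the pre-built containers for one cluster: management, NameNode/masters,
    # Hue, and four data nodes. Names and image are placeholders for illustration.
    for name in ["ambari", "masters", "hue", "dn1", "dn2", "dn3", "dn4"]:
        client.containers.run("wd-hadoop-base", name=name, hostname=name,
                              network="cluster-net", detach=True, tty=True)

    # Hand over to Ansible to lay Hadoop and WANdisco Fusion down on the containers.
    subprocess.run(["ansible-playbook", "-i", "inventories/demo", "site.yml"],
                   check=True)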

That’s a real time-saver.  Engineers can refresh with a new build in minutes, and solution architects can set up a brand new demo environment in under a half hour.  Docker is ideal for demo purposes as well.  Cutting down the number of nodes lets the whole package run comfortably on a modern laptop, and simply pausing a container is an easy way to simulate node failures.  (When you’re demonstrating the value of active-active replication, simulating failure is an everyday task.)
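
That node-failure trick is just Docker’s pause and unpause. With the Docker SDK for Python it looks like this (“dn3” is a hypothetical data node container name):

    import docker

    client = docker.from_env()

    # Freeze a data node container to simulate a node failure during a demo...
    client.containers.get("dn3").pause()

    # ...and thaw it again to bring the "failed" node back.
    client.containers.get("dn3").unpause()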

As always, DevOps is a work-in-progress.  The team is making improvements every week, and I think with improved use of Docker images we can cut the cluster creation time down even more.

That’s a quick peek at how our internal engineering teams are using automation to speed up development and testing of our Hadoop products.  If you’d like to learn more, I encourage you to tweet @wandisco with questions, or ask on our Hadoop forum.

Storage choices for the Big Data era

This is a summary of Big Data Storage: Options & Recommendations, a webinar we ran with the research firm 451.

Demand for Big Data/Hadoop-oriented storage is (surprisingly) still only around 4%, and it has not yet caught up with the growth in overall storage demand, but that is gradually starting to change.

Originally, Hadoop was used mainly for batch processing, and inexpensive local disk was the usual choice.

However, as Hadoop began to be used for a wide variety of applications such as real-time processing and analytics, many different kinds of storage came into use. As one example, a survey of what network storage is used for found that Big Data showed the largest growth. Whether in the cloud or on premises, the key to success is using each type of storage where it fits best.

In this kind of environment, connectors and replication between different storage systems become necessary. WD Fusion was presented as one solution (for more on WD Fusion, please see our earlier blog posts).

The replay is available at the URL below:

https://www.brighttalk.com/webcast/11809/153683