Different views on Big Data momentum

I was struck recently by two different perspectives on Big Data momentum.  Computing Research just published their 2015 Big Data Review in which they found continued momentum for Big Data projects.  A significantly higher number of their survey respondents in 2015 are using Big Data projects for operational results.  In a contrasting view, Gartner found that only 26% of the respondents were running or even experimenting with Hadoop.

If you dig a little deeper into the Computing study, you’ll see that it’s speaking about a wider range of Big Data options than just Hadoop.  The study mentions that 29% of the respondents are at least considering using Hadoop specifically, up from 15% last year.  So the two studies are closer than they look at first glance, yet the tone is strikingly different.

One possible explanation is that the Big Data movement is much bigger than Hadoop and it’s easier to be optimistic about a movement than a technology.  But even so, I’d tend towards the optimistic view of Hadoop.  If you look at the other technologies being considered for Big Data, analytics tools and databases (including NoSQL databases) are driving tremendous interest, with over 40% of the Computing Research participants evaluating new options.  And the Hadoop community has done a tremendous amount of work to turn Hadoop into a general purpose Big Data platform.

You don’t have to look very far for examples.  Apache Spark is now bundled in mainstream distributions to provide fast in-memory processing, while Pivotal (a member of the Open Data Platform along with WANdisco) has contributed Greenplum and HAWQ to the open source effort.

To sum up, the need for ‘Big Data’ is not in dispute, but the technology platforms that underpin Big Data are evolving rapidly.  Hadoop’s open nature and evolution from a processing framework to a platform are points in its favor.

Behind the scenes: Rapid Hadoop deployment

If you’ve ever deployed a Hadoop cluster from scratch on internal hardware or EC2, you know there are a lot of details to get right.  Syncing time with NTP, setting up password-less login across all the nodes, and making sure you have all the prerequisite packages installed are just the beginning.  Then you have to actually deploy Hadoop.  Even with a management tool like Ambari there’s a lot of time spent going through the web interface and deploying software.  In this article I’m going to describe why we invested in a framework for rapid Hadoop deployment with Docker and Ansible.

At WANdisco we have teams of engineers and solutions architects testing our latest products on a daily basis, so automation is a necessity.  Last year I spent some time on a Vagrant-Puppet toolkit to set up EC2 images and deploy Hadoop using Ambari blueprints.  As an initial effort it was pretty good, but I never invested the time to handle the cross-node dependencies.  For instance, after the images were provisioned with all the prerequisites I manually ran another Puppet script to deploy Ambari, then another one to deploy Hue, rather than having a master process that handled the timing and coordination.

Luckily we have a great automation team in our Sheffield office that set up a push-button solution using Docker and Ansible.  With a single invocation you get:

  • 3 clusters (mix-and-match with the distributions you prefer)
  • Each cluster has 7 containers.  The first runs the management tool (like Ambari), the second runs the NameNode and most of the master services, the third runs Hue, and the others are data nodes.
  • All of the networking and other services are registered correctly.
  • WANdisco Fusion installed.
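
To make the layout in the list above concrete, here is a minimal Python sketch that generates the container-to-role mapping for three clusters of seven containers each.  It is not our team's actual tooling: the role names, the container naming scheme, and the distribution labels are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical role names mirroring the layout described in the post:
# container 1 = management tool, 2 = NameNode and master services,
# 3 = Hue, and the remaining containers = data nodes.
CONTAINERS_PER_CLUSTER = 7
FIXED_ROLES = ["manager", "master", "hue"]


def cluster_layout(cluster_name: str,
                   size: int = CONTAINERS_PER_CLUSTER) -> Dict[str, str]:
    """Map container names (hypothetical scheme) to roles for one cluster."""
    if size <= len(FIXED_ROLES):
        raise ValueError("a cluster needs at least one data node")
    roles = FIXED_ROLES + ["datanode"] * (size - len(FIXED_ROLES))
    return {f"{cluster_name}-node{i}": role
            for i, role in enumerate(roles, start=1)}


def all_clusters(distributions: List[str]) -> Dict[str, Dict[str, str]]:
    """Build one layout per distribution label (mix-and-match)."""
    return {f"cluster{i}-{dist}": cluster_layout(f"cluster{i}")
            for i, dist in enumerate(distributions, start=1)}


if __name__ == "__main__":
    # The distribution labels are placeholders, not a statement of
    # which distributions the real framework supports.
    for name, nodes in all_clusters(["distA", "distB", "distC"]).items():
        print(name)
        for container, role in nodes.items():
            print(f"  {container}: {role}")
```

Cutting `size` down to four or five is one way to picture the smaller laptop-friendly setups.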

Starting from a bare metal host, it takes about 20 minutes to do a one-time setup with Puppet that installs Docker and the Ansible framework and builds the Docker images.  Once that first-time setup is done, a simple script starts the Docker containers and runs Ansible to deploy Hadoop.  That takes about 20 minutes for a clean install, or 2-3 minutes to refresh the clusters with the latest build of our products.

That’s a real time-saver.  Engineers can refresh with a new build in minutes, and solution architects can set up a brand new demo environment in under a half hour.  Docker is ideal for demo purposes as well.  Cutting down the number of nodes lets the whole package run comfortably on a modern laptop, and simply pausing a container is an easy way to simulate node failures.  (When you’re demonstrating the value of active-active replication, simulating failure is an everyday task.)
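
As a rough illustration of the pause trick, the Python sketch below wraps the standard `docker pause` and `docker unpause` commands.  The container name is hypothetical, and the function defaults to a dry run that only returns the commands, so it can be tried on a machine without Docker.

```python
import subprocess
import time
from typing import List


def pause_cmd(container: str) -> List[str]:
    """Command that suspends every process in the container."""
    return ["docker", "pause", container]


def unpause_cmd(container: str) -> List[str]:
    """Command that resumes a paused container."""
    return ["docker", "unpause", container]


def simulate_outage(container: str, seconds: float,
                    dry_run: bool = True) -> List[List[str]]:
    """Pause a container for a while, then bring it back.

    With dry_run=True (the default) nothing is executed and the two
    commands are simply returned for inspection.
    """
    commands = [pause_cmd(container), unpause_cmd(container)]
    if dry_run:
        return commands
    subprocess.run(commands[0], check=True)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the node, even if the sleep is interrupted.
        subprocess.run(commands[1], check=True)
    return commands


if __name__ == "__main__":
    # "cluster1-node4" is a made-up container name.
    for cmd in simulate_outage("cluster1-node4", seconds=60):
        print(" ".join(cmd))
```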

As always, DevOps is a work-in-progress.  The team is making improvements every week, and I think with improved use of Docker images we can cut the cluster creation time down even more.

That’s a quick peek at how our internal engineering teams are using automation to speed up development and testing of our Hadoop products.  If you’d like to learn more, I encourage you to tweet @wandisco with questions, or ask on our Hadoop forum.


This is a summary of the webinar "Big Data Storage: Options & Recommendations" that we held with the research firm 451.



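However, with the rise of diverse applications such as real-time processing and analytics, many different types of storage have come into use.  As one example, a survey of what network storage is being used for found that Big Data showed the largest growth.  Whether in the cloud or on premises, the webinar identified using each type of storage where it fits best as the key to success.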

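In such an environment, connectors and replication between different storage systems become necessary.  WD Fusion was introduced as one solution (see our earlier blog posts for more on WD Fusion).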



Cos Boudnik on Apache Ignite and Apache Spark

In case you missed it, WANdisco’s own Konstantin (Cos) Boudnik wrote a very interesting blog post about in-memory computing recently.  Apache Spark has attracted a lot of attention for its robust programming model and excellent performance.  Cos’ article points out another Apache project that’s worth keeping an eye on, Apache Ignite.

Ignite is a full in-memory computing system, whereas Spark uses memory for processing.  Ignite also features full SQL-99 support and a Java-centric programming model, compared to Spark’s preference for Scala.  (I’ll note that I do appreciate Spark’s strong support for Python as well.)

Although I won’t pretend to understand all the technical nuances of Ignite and Spark, it seems that there is some overlap in use cases.  That’s a good sign for data analysts looking for more choices for faster big data processing.

What is ODP (Open Data Platform)? Apache vs. ODP

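ODP (Open Data Platform) was established in February of this year.  Its sponsors are 19 companies including Hortonworks, Pivotal, IBM, and SAS.  ODP describes itself as a shared industry effort to promote Hadoop and Big Data for the enterprise.  It can look like a fight among Hadoop vendors, and parts of it are hard to follow, but the analysis in the datanami article "Hadoop’s Next Big Battle: Apache Versus ODP" is interesting, so I will summarize it here.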

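There is no doubt that the open source model of the Apache Software Foundation (ASF) built the Hadoop we have today, and that this model is one of Hadoop’s strengths.  However, opinion within the Hadoop community is divided on how to move forward: some argue that a separate governance body, namely ODP, is needed, and others argue that it is not.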

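The article presents the views of our CEO as a proponent of ODP.  Hadoop development moves so fast that third parties cannot keep up.  We used to ship a product that provided HA and DR for Hadoop through a NameNode plug-in, but the time and cost of certification were too great.  We sidestepped the problem by switching to WD Fusion, which provides equivalent functionality through a proxy at a higher layer, but for users and third parties a consistent API is important, and he expects ODP to fill that role.  In his view, technical innovation remains the job of the ASF, while ODP plays a standardization and QA role and does no development.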

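On the opposing side, the MapR CEO says that ODP is redundant and is trying to solve a problem that does not need solving.  Hadoop users are not worried about vendor lock-in; according to a Gartner survey, fewer than 1% cite interoperability or lock-in as a problem.  It is also unclear how ODP’s governance will work.  Cloudera’s CTO shares this view, saying that ODP is like the OSF, which long ago fragmented UNIX.  (Personally, I think ODP ought to be more like X/Open...)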


5 questions for your Hadoop architect

I was baffled last week when I was told that a lot of Hadoop deployments don’t even use a backup procedure.  Hadoop does of course provide local data replication that gives you three copies of every file.  But catastrophes can and do happen.  Data centers aren’t immune to natural disasters or malicious acts, and if you try to put some of your data nodes in a remote site the performance will suffer greatly.

WANdisco of course makes products that solve data availability problems among other challenges, so I’m not an impartial observer.  But ask yourself this: is the data in your Hadoop cluster less valuable than the photos on your cell phone that are automatically synced to a remote storage site?

And after that, ask your Hadoop architect these 5 questions:

  • How is our Hadoop data backed up?
  • How much data might we lose if the data center fails?
  • How long will it take us to recover data and be operational again if we have a data center failure?
  • Have you verified the integrity of the data at the backup site?
  • How often do you test our Hadoop applications on the backup site?

The answers might surprise you.
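
To put rough numbers on the second and third questions, here is a back-of-the-envelope Python sketch.  Every figure in it is made up for illustration, and the model is deliberately simple: it assumes periodic backups, a steady ingest rate, and a constant restore throughput, none of which may hold in your environment.

```python
from dataclasses import dataclass


@dataclass
class BackupPlan:
    """Hypothetical description of a periodic backup scheme."""
    backup_interval_hours: float  # time between backup runs
    ingest_gb_per_hour: float     # average rate of new data
    dataset_tb: float             # total data to restore after a failure
    restore_gb_per_hour: float    # sustained restore throughput

    def worst_case_loss_gb(self) -> float:
        """Data written since the last backup, if failure hits just before the next one."""
        return self.backup_interval_hours * self.ingest_gb_per_hour

    def restore_hours(self) -> float:
        """Time to copy the full dataset back (1 TB taken as 1000 GB)."""
        return self.dataset_tb * 1000 / self.restore_gb_per_hour


if __name__ == "__main__":
    # Illustrative numbers only.
    plan = BackupPlan(backup_interval_hours=24, ingest_gb_per_hour=50,
                      dataset_tb=200, restore_gb_per_hour=2000)
    print(f"Worst-case data loss: {plan.worst_case_loss_gb():.0f} GB")
    print(f"Time to restore:      {plan.restore_hours():.0f} hours")
```

With these sample inputs the sketch reports 1200 GB of potential loss and 100 hours to restore, before counting any time needed to verify integrity or restart applications.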

Q&A with our CTO on the launch of our new product, WD Fusion

Here is a Q&A on WD Fusion with our CTO, Jagane.

Q1: Put simply, what is WANdisco Fusion?






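This makes it possible to integrate different types of storage into a single Hadoop system.  With WD Fusion, it does not matter if you are using Pivotal in one data center, Hortonworks in another, and EMC Isilon in a third; all of them can be treated the same way.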











Q6: How did WD Fusion come about?



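At that point we had the idea for a product that would keep data consistent across different systems.  The result was WD Fusion: a fully transaction-based replication engine that keeps data consistent.  Once you have set it up, you never again have to worry about checking whether your data is consistent.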


Q7: You have been working with Hadoop for the last 10 years.  From that vantage point, do you think WD Fusion will be a disruptive technology?

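I have actually been working in the storage industry for more than 15 years.  I worked with shared storage systems for a long time and then became involved with Hadoop.  WD Fusion has great potential to revolutionize the way storage infrastructure is used.  Frankly, I have never been part of a project this exciting.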





See the following for the WD Fusion datasheet.

Datasheet-WD-Fusion-A4-WEB April2015



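Subversion is centralized and Git is distributed, but in enterprise use Git, like Subversion, ends up with a managed master repository (a golden master).  With Git, however, developers can freely share changes with one another; for example, developer A can hand (push) a change to developer B alone.  As a result, even if A does not have permission to push to the master repository, A’s update can still end up in the master repository by way of B.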

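Git offers so much freedom that it can support a wide variety of workflows, which means a change of mindset is needed.  At the same time, management tools such as GitHub, GitLab, and Gerrit have matured, lowering the barrier to enterprise use.  When migrating from Subversion to Git, we recommend running the two side by side for a period, and tools are available to support this.  One point to watch when using Git is to keep repository sizes small and manageable.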









Hortonworks and WANdisco make it easy to get started with Spark

Hortonworks, one of our partners in the Open Data Platform Initiative, recently released version 2.2.4 of the Hortonworks Data Platform (HDP).  It bundles Apache Spark 1.2.1.  That’s a clear indicator (if we needed another one) that Spark has entered the Hadoop mainstream.  Are you ready for it?

Spark opens up a new realm of use cases for Hadoop since it offers very fast in-memory data processing.  Spark has blown through several Hadoop benchmarks and offers a unified batch, SQL, and streaming framework.

But Spark presents new challenges for Hadoop infrastructure architects.  It favors memory and CPU with a smaller number of drives than a typical Hadoop data node.  The art of monitoring and tuning Spark is still in early days.

Hortonworks is addressing many of these challenges by including Spark in HDP 2.2.4 and integrating it into Ambari.  And now WANdisco is making it even easier to get started with Spark by giving you the flexibility to deploy Spark into a separate cluster while still using your production data.

WANdisco Fusion uses active-active data replication to make the same Hadoop data available and usable consistently from several Hadoop clusters.  That means you can run Spark against your production data, but isolate it on a separate cluster (perhaps in the cloud) while you get up to speed on hardware sizing and performance monitoring.  You can continue to run Spark this way indefinitely in order to isolate any potential performance impact, or eventually migrate Spark to your main cluster.

Shared data but separate compute resources gives you the extra flexibility you need to rapidly deploy new Hadoop technologies like Spark without impacting critical applications on your main cluster.  Hortonworks and WANdisco make it easy to get started with Spark.  Get in touch with our solution architects today to get started.



WANdisco Fusion Q&A with Jagane Sundar, CTO

Tuesday we unveiled our new product: WANdisco Fusion. Ahead of the launch, we caught up with WANdisco CTO Jagane Sundar, who was one of the driving forces behind Fusion.

Jagane joined WANdisco in November 2012 after the firm’s acquisition of AltoStor and has since played a key role in the company’s product development and rollout. Prior to founding AltoStor along with Konstantin Shvachko, Jagane was part of the original team that developed Apache Hadoop at Yahoo!.

Jagane, put simply, what is WANdisco Fusion?

JS: WANdisco Fusion is a wonderful piece of technology that’s built around a strongly consistent transactional replication engine, allowing for the seamless integration of different types of storage for Hadoop applications.

It was designed to help organizations get more out of their Big Data initiatives, answering a number of very real problems facing the business and IT worlds.

And the best part? All of your data centers are active simultaneously: You can read and write in any data center. The result is you don’t have hardware that’s lying idle in your backup or standby data center.

What sort of business problems does it solve?

JS: It provides two new important capabilities for customers. First, it keeps data consistent across different data centers no matter where they are in the world.

And it gives customers the ability to integrate different storage types into a single Hadoop ecosystem. With WANdisco Fusion, it doesn’t matter if you are using Pivotal in one data center, Hortonworks in another and EMC Isilon in a third – you can bring everything into the same environment.

Why would you need to replicate data across different storage systems?

JS: The answer is very simple. Anyone familiar with storage environments knows how diverse they can be. Different types of storage have different strengths depending on the individual application you are running.

However, keeping data synchronized is very difficult if not done right. Fusion removes this challenge while maintaining data consistency.

How does it help future proof a Hadoop deployment?

JS: We believe Fusion will form a critical component of companies’ workflow update procedures. You can update your Hadoop infrastructure one data center at a time, without impacting application availability or having to copy massive amounts of data once the update is done.

This helps you deal with updates from both Hadoop and application vendors in a carefully orchestrated manner.

Doesn’t storage-level replication work as effectively as Fusion?

JS: The short answer is no. Storage-level replication is subject to latency limitations that are imposed by file systems. The result is you cannot really run storage-level replication over long distances, such as a WAN.

Storage-level replication is nowhere near as functional as Fusion: It has to happen at the LAN level and not over a true Wide Area Network.

With Fusion, you have the ability to integrate diverse systems such as NFS with Hadoop, allowing you to exploit the full strengths and capabilities of each individual storage system – I’ve never worked on a project as exciting and as revolutionary as this one.

How did WANdisco Fusion come about?

JS: By getting inside our customers’ data centers and witnessing the challenges they faced. It didn’t take long to notice the diversity of storage environments.

Our customers found that different storage types worked well for different applications – and they liked it that way. They didn’t want strict uniformity across their data centers, but to be able to leverage the strengths of each individual storage type.

At that point we had the idea for a product that would help keep data consistent across different systems.

The result was WANdisco Fusion: a fully replicated transactional engine that makes the work of keeping data consistent trivial. You only have to set it up once and never have to bother with checking if your data is consistent.

This vision of a fully utilized, strongly consistent diverse storage environment for Hadoop is what we had in mind when we came up with the Fusion product.

You’ve been working with Hadoop for the last 10 years. Just how disruptive is WANdisco Fusion going to be?

JS: I’ve actually been in the storage industry for more than 15 years now. Over that period I’ve worked with shared storage systems, and I’ve worked with Hadoop storage systems. WANdisco Fusion has the potential to completely revolutionize the way people use their storage infrastructure. Frankly, this is the most exciting project I’ve ever been part of.

As the Hadoop ecosystem evolved I saw the need for this virtual storage system that integrates different types of storage.

Efforts to make Hadoop run across different data centers have been mostly unsuccessful. For the first time, we at WANdisco have a way to keep your data in Hadoop systems consistent across different data centers.

The reason this is so exciting is because it transforms Hadoop into something that runs in multiple data centers across the world.

Suddenly you have capabilities that even the original inventors of Hadoop didn’t really consider when it was conceived. That’s what makes WANdisco Fusion exciting.