Hadoop Blog

Why Data-Driven Companies Rely on WANdisco Fusion

Hadoop is now clearly gaining momentum. We are seeing more and more customers attempting to deploy enterprise-grade applications. Data protection, governance, performance and availability are top concerns. WANdisco Fusion’s level of resiliency is enabling customers to move out of the lab and into production much faster.

As companies start to scale these platforms and begin the journey to becoming data driven, they are completely focused on business value and return on investment. WANdisco’s ability to optimize resource utilization by eliminating the need for standby servers resonates well with our partners and customers. These companies are not Google or Facebook. They don’t have an endless supply of hardware and their core business isn’t delivering technology.

As these companies add data from more sources to Hadoop, they are implementing backup and disaster recovery plans and deploying multiple clusters for redundancy. One of our customers, a large bank, is beginning to utilize the cloud for DR.

I’ve met 11 new customers in the past eight days. Five of them have architected cloud into their data lake strategy and are evaluating the players. They are looking to run large data sets in the cloud for efficiency as well as backup and DR.

One of those customers, a leader in IT security, tells me they plan to move their entire infrastructure to the cloud within the next 12 months. They already have 200 nodes in production today, which they expect to double in a year.

Many of our partners are interested in how they can make it easy to onboard data from behind the firewall to the cloud while delivering the best performance. They recognize this is fundamental to a successful cloud strategy.

Companies are already embarking on migrations from one Hadoop platform to another. We’re working with customers on migrations from MapR to HDP, CDH to HDP, CDH to Oracle BDA, and, because we are HCFS-compatible, GPFS to IOP. Some of these are petabyte scale.

For many of these companies, WANdisco Fusion’s ability to eliminate downtime, data loss and business disruption is a prerequisite to making that transition. Migration has never been undertaken lightly. I’ve spoken to partners who are unable to migrate their customers due to the required amount of downtime and risk involved.

One customer I met recently completed a large migration to HDP and just last week acquired a company that has a large cluster on Cloudera. We’re talking to them about how we can easily provide a single consistent view of the data. This will allow them to get immediate value from the data they have just acquired. If they choose to migrate completely, they are in control of the timing.

Customers measure their success by time to value. We’re working closely with our strategic partners to ensure our customers don’t have to worry about the nuts and bolts, irrespective of distribution or of on-premises, cloud, or hybrid environments, so they can concentrate on business outcomes.

Please reach out to me if these use cases resonate and you would like to learn more.

Peter Scott
SVP Business Development



The 100 Day Progress Report on the ODP

This blog by Cheryle Custer, Director of Strategic Alliance Marketing at Hortonworks, has been republished with the author’s permission.

It was just a little over 100 days ago that 15 industry leaders in the Big Data space announced the formation of the Open Data Platform (ODP) initiative. We’d like to let you know what has been going on in that time, to bring you a preview of what you can expect in the next few months and let you know how you can become involved.

Some Background

What is the Open Data Platform Initiative?
The Open Data Platform Initiative (ODP) is a shared industry effort focused on simplifying adoption, promoting the use, and advancing the state of Apache Hadoop® and Big Data technologies for the enterprise. It is a non-profit organization being created by people who helped to create Apache, Eclipse, Linux, OpenStack, OpenDaylight, the Open Networking Foundation, OSGi, WS-I (Web Services Interoperability), UDDI, OASIS, the Cloud Foundry Foundation and many others.

The organization relies on the governance of the Apache Software Foundation community to innovate and deliver the Apache project technologies included in the ODP core while using a ‘one member one vote’ philosophy where every member decides what’s on the roadmap. Over the next few weeks, we will be posting a number of blogs to describe in more detail how the organization is governed and how everyone can participate.

What is the Core?
The ODP Core provides a common set of open source technologies that currently includes: Apache Hadoop® (inclusive of HDFS, YARN, and MapReduce) and Apache® Ambari. ODP relies on the governance of the Apache Software Foundation community to innovate and deliver the Apache project technologies included in the ODP core. Once the ODP members and processes are well established, the scope of the ODP Core will expand to include other open source projects.

Benefits of the ODP Core
The ODP core is a set of open source Hadoop technologies designed to provide a standardized core that big data solution providers, software developers, and hardware developers can use to deliver compatible solutions rooted in open source that unlock customer choice.

By delivering on a vision of “verify once, run anywhere”, everyone benefits:

  • For Apache Hadoop® technology vendors, reduced R&D costs that come from a shared qualification effort
  • For Big Data application solution providers, reduced R&D costs that come from more predictable and better qualified releases
  • Improved interoperability within the platform and simplified integration with existing systems in support of a broad set of use cases
  • Less friction and confusion for Enterprise customers and vendors
  • Ability to redirect resources towards higher value efforts

100 Day Progress Report

In the 100 days since the announcement, we’ve made some great progress:

Four Platforms Shipping
At Hadoop Summit in Brussels in April, we announced the availability of four Hadoop platforms, all based on a vision of a common ODP core: Infosys Information Platform, IBM Open Platform, Hortonworks Data Platform, and Pivotal HD. The commercial delivery of ODP-based distributions by multiple industry-leading vendors immediately after the launch of the initiative demonstrates the momentum behind ODP: it accelerates the delivery of compatible Hadoop distributions and simplifies the ecosystem by establishing the core as an industry standard.

New Members and New Participation Levels
In addition to revealing that Telstra is one of the founding Platinum members of the ODP, we’ve added nine new members, including BMC, DataTorrent, PLDT, Squid Solutions, Syncsort, Unifi, zData, and Zettaset. We welcome these new members and look forward to their participation and their announcements. We also announced a new membership level to provide an easy entrée for any company to participate in the ODP. The Silver level of membership allows companies to have a direct voice in the future of big data and to contribute people, tests, and code to accelerate executing on the vision.

Community Collaboration at the Bug Bash
ODP member Altiscale led the effort on a Hadoop Community Bug Bash. This unique event for the Apache Hadoop community, co-sponsored by Hortonworks, Huawei, Infosys, and Pivotal, drew over 150 participants from eight countries and nine time zones to strengthen Hadoop and honor the work of the community by reviewing and resolving software patches. Read more about the Bug Bash, where 186 issues were resolved, either closed or with patches committed to the code. Nice job everyone! You can participate in upcoming bug bashes, so stay tuned.

Technical Working Group and the ASF
Senior engineers and architects from the ODP member companies have come together as a Technical Working Group (TWG). The goal of the TWG is to jump-start the work required to produce ODP core deliverables and to seed the technical community overseeing the future evolution of the ODP core. Delivering on the promise of “verify once and run anywhere”, the TWG is building certification guidelines for “compatibility” (for software running on top of ODP) and “compliance” (for ODP platforms). We have scheduled a second TWG face-to-face meeting at Hadoop Summit, where committers, PMC members and ASF members will meet to continue these discussions.

What’s Next?

Many of the member companies will be at Hadoop Summit in San Jose.

While you’re at Hadoop Summit, you can attend the IBM Meet Up and hear more about the ODP. Stay tuned to this blog as well – we’ll use this as a platform to inform you of new developments and provide you insight on how the ODP works.

Want to know more about the ODP? Here are a few reference documents.

Enterprise Hadoop Adoption: Half Empty or Half Full?

This blog by Shaun Connolly, Hortonworks VP of Corporate Strategy, has been republished with the author’s permission.

As we approach Hadoop Summit in San Jose next week, the debate continues over where Hadoop really is on its adoption curve. George Leopold from Datanami was one of the first to beat the hornet’s nest with his article entitled Gartner: Hadoop Adoption ‘Fairly Anemic’. Matt Asay from TechRepublic and Virginia Backaitis from CMSWire volleyed back with Hadoop Numbers Suggest the Best is Yet to Come and Gartner’s Dismal Predictions for Hadoop Could Be Wrong, respectively.

At the center of the controversy is the report published by Merv Adrian and Nick Heudecker from Gartner: Survey Analysis: Hadoop Adoption Drivers and Challenges. Specifically, the Gartner survey shows that 26% of respondents are deployed, piloting or experimenting; 11% plan to invest within 12 months; and an additional 7% plan to invest within 24 months.

Glass Half Empty or Half Full?

I believe the root of the controversy comes not in the data points stated above, but in the phrasing of one of the key findings statements: “Despite substantial hype and reported successes for early adopters, over half of respondents (54%) report no plans to invest at this time. Additionally, only 18% have plans to invest in Hadoop over the next two years.”

The statement is phrased in the negative sense, from a lack of adoption perspective. While not wrong, it represents a half-empty perspective that is more appropriate for analyzing mature markets such as the RDBMS market, which is $100s of billions in size and decades into its adoption curve. Comparing today’s Hadoop market size and adoption to today’s RDBMS market is not particularly useful. However, comparing the RDBMS market at the time it was five years into its adoption cycle might be an interesting exercise.

When talking about adoption for newer markets like Enterprise Hadoop, I prefer to frame my view using the classic technology adoption lifecycle that models adoption across five categories with corresponding market share %s: Innovators (2.5%), Early Adopters (13.5%), Early Majority (34%), Late Majority (34%), and Laggards (16%).

Putting the Gartner data into this context, Innovators and Early Adopters together account for the first 16% of a market, so with 26% of respondents already deployed, piloting or experimenting, Hadoop has crossed into the Early Majority of the market, at the classic inflection point of its adoption curve.


As a publicly traded enterprise open source company, not only is Hortonworks’ code open, but our corporate performance and financials are open too. Earlier this month, we released Hortonworks’ first quarter earnings. In Q4 2014 and Q1 2015 we added 99 and 105 new subscription customers respectively, which means we added 204 customers, over 46% of our 437 subscription customers, in the past six months. If we look at the Fortune 100, 40% are Hortonworks subscribers, including 71% of F100 retailers, 75% of F100 telcos, and 43% of F100 banks.


We see these statistics as clear indicators of the building momentum of Open Enterprise Hadoop and the powerful Hortonworks model for extending Hadoop adoption across all industries. I won’t hide the fact that I am guilty of having a Half Full approach to life. As a matter of fact, I proudly wear the t-shirt every chance I get. The Half Full mindset serves us well at Hortonworks, because we see the glass filling quickly. The numbers for the last two quarters show that momentum.

Come Feel the Momentum at Hadoop Summit on June 9th in San Jose!

If you’d like to see the Hadoop momentum for yourself, then come join us at Hadoop Summit in San Jose starting June 9th.

Geoffrey Moore, author of Crossing the Chasm, will be a repeat keynote presenter this year. At Hadoop Summit 2012, he laid out a technology adoption roadmap for Big Data from the point of view of technology providers. Join Geoff as he updates that roadmap with a specific focus on business customers and the buying decisions they face in 2015.

Mike Gualtieri, Principal Analyst at Forrester Research, will also be presenting. Join Mike for his keynote entitled Adoption is the Only Option—Five Ways Hadoop is Changing the World and Two Ways It Will Change Yours.

In addition to keynote speakers, Summit will host more than 160 sessions being delivered by end user organizations, such as Aetna, Ernst & Young, Facebook, Google, LinkedIn, Mercy, Microsoft, Noble Energy, Verizon, Walt Disney, and Yahoo!, so you can get the story directly from the elephant’s mouth.

San Jose Summit 2015 promises to be an informational, innovative and entertaining experience for everyone.

Come join us. Experience the momentum for yourself.

WANdisco Fusion Q&A with Jagane Sundar, CTO

Tuesday we unveiled our new product: WANdisco Fusion. Ahead of the launch, we caught up with WANdisco CTO Jagane Sundar, who was one of the driving forces behind Fusion.

Jagane joined WANdisco in November 2012 after the firm’s acquisition of AltoStor and has since played a key role in the company’s product development and rollout. Prior to founding AltoStor along with Konstantin Shvachko, Jagane was part of the original team that developed Apache Hadoop at Yahoo!.

Jagane, put simply, what is WANdisco Fusion?

JS: WANdisco Fusion is a wonderful piece of technology that’s built around a strongly consistent transactional replication engine, allowing for the seamless integration of different types of storage for Hadoop applications.

It was designed to help organizations get more out of their Big Data initiatives, answering a number of very real problems facing the business and IT worlds.

And the best part? All of your data centers are active simultaneously: You can read and write in any data center. The result is you don’t have hardware that’s lying idle in your backup or standby data center.

What sort of business problems does it solve?

JS: It provides two new important capabilities for customers. First, it keeps data consistent across different data centers no matter where they are in the world.

And it gives customers the ability to integrate different storage types into a single Hadoop ecosystem. With WANdisco Fusion, it doesn’t matter if you are using Pivotal in one data center, Hortonworks in another and EMC Isilon in a third – you can bring everything into the same environment.
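To make that concrete, here is a minimal, illustrative sketch of the layer this builds on: Hadoop’s FileSystem abstraction (the HCFS interface), which resolves each URI scheme and authority to its own storage implementation and is what lets mixed storage be addressed uniformly. The cluster URIs below are hypothetical placeholders, not Fusion configuration or Fusion API calls.

    // Illustrative only: the standard Hadoop FileSystem API resolves each URI to an
    // HCFS-compatible implementation, so HDFS clusters, Isilon exposing HDFS, and
    // other stores can be addressed through one interface. URIs are hypothetical.
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MixedStorageListing {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        String[] locations = {
            "hdfs://hortonworks-dc1/data/events",   // an HDP cluster in one data center
            "hdfs://pivotal-dc2/data/events",       // a Pivotal HD cluster in another
            "hdfs://isilon-dc3/data/events"         // EMC Isilon exposing an HDFS interface
        };

        // Each URI is served by whatever file system implementation is registered
        // for its scheme and authority; the application code stays the same.
        for (String location : locations) {
          FileSystem fs = FileSystem.get(URI.create(location), conf);
          for (FileStatus status : fs.listStatus(new Path(location))) {
            System.out.println(status.getPath());
          }
        }
      }
    }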

Why would you need to replicate data across different storage systems?

JS: The answer is very simple. Anyone familiar with storage environments knows how diverse they can be. Different types of storage have different strengths depending on the individual application you are running.

However, keeping data synchronized is very difficult if not done right. Fusion removes this challenge while maintaining data consistency.

How does it help future proof a Hadoop deployment?

JS: We believe Fusion will form a critical component of companies’ workflow update procedures. You can update your Hadoop infrastructure one data center at a time, without impacting application availability or having to copy massive amounts of data once the update is done.

This helps you deal with updates from both Hadoop and application vendors in a carefully orchestrated manner.

Doesn’t storage-level replication work as effectively as Fusion?

JS: The short answer is no. Storage-level replication is subject to latency limitations that are imposed by file systems. The result is you cannot really run storage-level replication over long distances, such as a WAN.

Storage-level replication is nowhere near as functional as Fusion: It has to happen at the LAN level and not over a true Wide Area Network.

With Fusion, you have the ability to integrate diverse systems such as NFS with Hadoop, allowing you to exploit the full strengths and capabilities of each individual storage system – I’ve never worked on a project as exciting and as revolutionary as this one.

How did WANdisco Fusion come about?

JS: By getting inside our customers’ data centers and witnessing the challenges they faced. It didn’t take long to notice the diversity of storage environments.

Our customers found that different storage types worked well for different applications – and they liked it that way. They didn’t want strict uniformity across their data centers, but to be able to leverage the strengths of each individual storage type.

At that point we had the idea for a product that would help keep data consistent across different systems.

The result was WANdisco Fusion: a fully replicated transactional engine that makes the work of keeping data consistent trivial. You only have to set it up once and never have to bother with checking if your data is consistent.

This vision of a fully utilized, strongly consistent, diverse storage environment for Hadoop is what we had in mind when we came up with the Fusion product.

You’ve been working with Hadoop for the last 10 years. Just how disruptive is WANdisco Fusion going to be?

JS: I’ve actually been in the storage industry for more than 15 years now. Over that period I’ve worked with shared storage systems, and I’ve worked with Hadoop storage systems. WANdisco Fusion has the potential to completely revolutionize the way people use their storage infrastructure. Frankly, this is the most exciting project I’ve ever been part of.

As the Hadoop ecosystem evolved I saw the need for this virtual storage system that integrates different types of storage.

Efforts to make Hadoop run across different data centers have been mostly unsuccessful. For the first time, we at WANdisco have a way to keep your data in Hadoop systems consistent across different data centers.

The reason this is so exciting is because it transforms Hadoop into something that runs in multiple data centers across the world.

Suddenly you have capabilities that even the original inventors of Hadoop didn’t really consider when it was conceived. That’s what makes WANdisco Fusion exciting.

The inspiration for WANdisco Fusion


Roughly two years ago, we sat down to start work on a project that finally came to fruition this week.

At that meeting, we had set ourselves the challenge of redefining the storage landscape. We wanted to map out a world where there was complete shared storage, but where the landscape remained entirely heterogeneous.

Why? Because we’d witnessed the beginnings of a trend that has only grown more pronounced with the passage of time.

From the moment we started engaging with customers, we were struck by the extreme diversity of their storage environments. Regardless of whether we were dealing with a bank, a hospital or utility provider, different types of storage had been introduced across every organization for a variety of use cases.

In time, however, these same companies wanted to start integrating their different silos of data, whether to run real-time analytics or to gain a full 360-degree perspective of performance. Yet preserving diversity across data centers was critical, given that each storage type has its own strengths.

They didn’t care about uniformity. They cared about performance and this meant being able to have the best of both worlds. Being able to deliver this became the Holy Grail – at least in the world of data centers.

This isn’t quite the Gordian Knot, but it’s certainly a very difficult, complex problem, and possibly one that could only be solved with our core patented IP, DConE.

Then we had a breakthrough.

Months later, I’m proud to formally release WANdisco Fusion (WD Fusion), the only product that enables WAN-scope active-active synchronization of different storage systems into one place.

What does this mean in practice? Well, it means that you can use Hadoop distributions like Hortonworks, Cloudera or Pivotal for compute, Oracle BDA for fast compute, and EMC Isilon for dense storage. You could even use a complete variety of Hadoop distros and versions. Whatever your set-up, with WD Fusion you can leverage new and existing storage assets immediately.

With it, Hadoop is transformed from something that runs within a data center into an elastic platform that runs across multiple data centers throughout the world. WD Fusion allows you to update your storage infrastructure one data center at a time, without impacting application availability or having to copy vast swathes of data once the update is done.

When we were developing WD Fusion we agreed upon two things. First, we couldn’t produce anything that made changes to the underlying storage system – this had to behave like a client application. Second, anything we created had to enable a complete, single global namespace across an entire storage infrastructure.

With WD Fusion, we allow businesses to bring together different storage systems by leveraging our existing intellectual property – the same Paxos-powered algorithm behind Non-Stop Hadoop, Subversion Multisite and Git Multisite – without making any changes to the platform you’re using.

Another way of putting it is we’ve managed to spread our secret sauce even further.

We have some of the best computer scientists in the world working at WANdisco, but I’m confident that this is the most revolutionary project any of us have ever worked on.

I’m delighted to be unveiling WD Fusion. It’s a testament to the talent and character of our firm, the result of looking at an impossible scenario and saying: “Challenge accepted.”


About David Richards

David is CEO, President and co-founder of WANdisco and has quickly established WANdisco as one of the world’s most promising technology companies. Since co-founding the company in Silicon Valley in 2005, David has led WANdisco on a course for rapid international expansion, opening offices in the UK, Japan and China. David spearheaded the acquisition of AltoStor, which accelerated the development of WANdisco’s first products for the Big Data market. The majority of WANdisco’s core technology is now produced out of the company’s flourishing software development base in David’s hometown of Sheffield, England and in Belfast, Northern Ireland. David has become recognised as a champion of British technology and entrepreneurship. In 2012, he led WANdisco to a hugely successful listing on the London Stock Exchange (WAND:LSE), raising over £24m to drive business growth. With over 15 years' executive experience in the software industry, David sits on a number of advisory and executive boards of Silicon Valley start-up ventures. A passionate advocate of entrepreneurship, he has established many successful start-up companies in Enterprise Software and is recognised as an industry leader in Enterprise Application Integration and its standards. David is a frequent commentator on a range of business and technology issues, appearing regularly on Bloomberg and CNBC. Profiles of David have appeared in a range of leading publications including the Financial Times, The Daily Telegraph and the Daily Mail. Specialties: IPOs, startups, entrepreneurship, venture capital, offshore development, financing, M&A, board membership, and advisory roles.

Analyzing Smart Meter Data with Hadoop

Connected Home is a service developed by British Gas for monitoring and controlling energy use; it provides an app for turning the heating on and off. The Internet has dramatically changed home entertainment, but its impact on everyday domestic life is only beginning, and British Gas aims to differentiate its services in this space, including by working with third parties.

In March 2014, WANdisco joined a trial that collects data from the smart meters of one million households to monitor and control energy use. The goal is to demonstrate that, by analyzing the real-time data collected, demand patterns can be dynamically matched with supply, supply can be adjusted to meet demand, and usage can be controlled for both businesses and households.

Non-Stop Hadoop was deployed to meet the real-time and compliance requirements, minimizing data loss and downtime on a 100-node cluster and substantially reducing storage costs.

The 10-month trial concluded successfully, and the system is moving into production at twice the scale. WANdisco has signed a three-year subscription agreement with British Gas worth US$750K.


About Kenji Ogawa (小川 研之)

Kenji has been building WANdisco’s business in Japan since November 2013. Previously, he worked at NEC on the development of domestic mainframes, Unix, and middleware, and later on scouting Silicon Valley startups, partner management, and offshore development in India.

Hadoop Goes Mainstream in Financial Services

This post introduces a webinar we held with the US research firm Forrester Research. A replay of the webinar can be viewed here:

https://www.brighttalk.com/webcast/11809/134895

The first speaker was Forrester VP Jost Hoppermann, who presented big data case studies from the financial industry. The first was a German bank for which risk management was the most critical issue and which addressed it by applying data warehouse and in-memory technology; it is an example of a bank that does not use the term “big data” even though that is what it is in practice. He then pointed out the need for big data from another angle: survey results show that 81% of banks are considering a transformation by 2018, and big data is needed to make that happen. Where will that transformation start? With customer data, including unstructured data, and big data will undoubtedly underpin it. Even core banking offers opportunities for big data, for example in combination with customer data. Shifting perspective slightly, he also introduced cross-border possibilities: although some countries restrict or completely prohibit moving personal data abroad, he showed a case that achieved results by separating out the necessary data, consolidating it in a single data center, and applying unified risk-management rules.

The next speaker was VP Leslie Owen. She pointed out that only 15% of available data is currently being used, and observed that thinking is shifting from the traditional model of curated, expensive data toward a future in which cheap, diverse data is used to understand what is happening in the world. Forrester’s 2014 definition of big data is “the technologies and business practices that narrow the gap between the large volume of available data and the ability to use it for business.” The 2012 definition took a technology-centered view of how to handle the five Vs (Volume, Variety, Variability, Velocity, Value); adding the business perspective has made the definition more balanced. The same trend can be seen in surveys of business and technology decision makers about their expectations for big data. She then discussed the three Cs needed for this paradigm shift: Culture, Competence, and Capability. Companies succeeding with big data have a culture of investing in it as R&D, and it is important to build an environment in which employees think about what the facts, that is the data, actually show.

Finally, Randy Defauw of WANdisco explained three innovations in financial services. The first is algorithm-based decision making: financial firms are trying to make decisions without delay, using even transient data, with fraud detection cited as a common example. The second, related to the first, is the data lake: accumulating all kinds of data in order to understand customers and markets. The third is process innovation: the financial industry in particular demands returns over short time frames, and firms are achieving large cost savings by moving from traditional data warehouses to Hadoop. He then explained how Non-Stop Hadoop meets these innovation requirements.

I will cover the details in a future post based on the whitepaper.


About Kenji Ogawa (小川 研之)


Avoiding the NameNode Single Point of Failure (QJM and Non-Stop NameNode)

A ThinkIT web article introduced our Non-Stop Hadoop as “a solution to the single point of failure problem caused by NameNode outages”: http://thinkit.co.jp/story/2014/11/11/5413

Since QJM is a comparable solution, I would like to add a few notes about it.

About QJM

QJM (Quorum Journal Manager) was proposed by Todd Lipcon of Cloudera in 2012. It adds redundancy to the NameNode, which had been a single point of failure in the Hadoop architecture. A standby NameNode is added, edit logs are written to multiple JournalNodes, and ZooKeeper detects failures, enabling manual or automatic failover.
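For reference, here is a sketch of the standard HDFS HA settings that a QJM deployment relies on, set programmatically purely for illustration; in practice they live in hdfs-site.xml and core-site.xml. The nameservice ID and host names are hypothetical.

    // A sketch of standard HDFS HA (QJM) configuration, expressed through the
    // Hadoop Configuration API for illustration. Nameservice and hosts are hypothetical.
    import org.apache.hadoop.conf.Configuration;

    public class QjmHaSettings {
      public static Configuration build() {
        Configuration conf = new Configuration();

        // One logical nameservice backed by an active and a standby NameNode.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Edits are written to a quorum of JournalNodes.
        conf.set("dfs.namenode.shared.edits.dir",
            "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");

        // Clients reach whichever NameNode is active through the failover proxy provider.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Automatic failover is coordinated through ZooKeeper (ZKFC).
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum",
            "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

        // The default file system points at the logical nameservice, not a single host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        return conf;
      }
    }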

 

About Non-Stop NameNode

Non-Stop NameNode builds an active-active replication mechanism based on an extension of Paxos (the Distributed Coordination Engine: DConE) into the NameNode. Multiple NameNodes then hold identical metadata and operate as peers, as ConsensusNodes in line with HDFS-6469. Even if a NameNode fails, metadata updates continue under majority (quorum) logic, so operation continues as long as a majority of NameNodes are alive. With five NameNodes, for example, Hadoop does not stop even if two of them fail. Once a failed NameNode recovers, it is automatically caught up to the latest metadata.

Compared with QJM, the main advantages are as follows:

  • 100% uptime, achieved without burdening operators with manual work at failure or recovery time
  • All NameNodes are active, and the JournalNodes and ZooKeeper required by QJM are unnecessary, so resources are 100% utilized
  • Load can be spread across multiple NameNodes for better performance, and the system can be expanded without stopping it

In addition, Non-Stop Hadoop provides the ability to automatically replicate DataNode data within a specified scope. This also makes the following possible:

  • Metadata consistency is guaranteed even when NameNodes sit in different data centers across a WAN. The bulkier DataNode data is replicated asynchronously, and because replicas are created automatically in remote data centers, disaster recovery is also possible.
  • Data generated in a region can be stored in that region’s data center and used from other locations. For example, credit card transaction data can be stored in the Tokyo, New York, and Singapore data centers as appropriate, while the fraud-detection application runs in Tokyo.

In short, multiple Hadoop clusters can be made to appear virtually as one, even when those clusters are spread across different data centers.

All of this is possible because the consistency of the NameNode metadata is guaranteed. The component that guarantees consistency in a distributed environment is our patented Distributed Coordination Engine, an extension of Paxos, which I will cover in a separate post.


About Kenji Ogawa (小川 研之)


Big Data ETL Across Multiple Data Centers

Scientific applications, weather forecasting, click-stream analysis, web crawling, and social networking applications often have several distributed data sources, i.e., big data is collected in separate data center locations or even across the Internet.

In these cases, determining the most efficient architecture for running extract, transform, load (ETL) jobs over the entire data set becomes nontrivial.

Hadoop provides the Hadoop Distributed File System (HDFS) for storage and, in Hadoop 2.0, YARN (Yet Another Resource Negotiator) for resource management. ETL jobs use the MapReduce programming model and run on the YARN framework.

Though these are adequate for a single data center, there is a clear need to enhance them for multi-data center environments. In these instances, it is important to provide active-active redundancy for YARN and HDFS across data centers. Here’s why:

1. Bringing compute to data

Hadoop’s architectural advantage lies in bringing compute to data. Providing active-active (global) YARN accomplishes that on top of global HDFS across data centers.

2. Minimizing traffic on a WAN link

There are three types of data analytics schemes:

a) High-throughput analytics where the output data of a MapReduce job is small compared to the input.

Examples include weblogs, word count, etc.

b) Zero-throughput analytics where the output data of a MapReduce job is equal to the input. A sort operation is a good example of a job of this type.

c) Balloon-throughput analytics where the output is much larger than the input.

Local YARN can crunch the data and use global HDFS to redistribute for high throughput analytics. Keep in mind that this might require another MapReduce job running on the output results, however, which can add traffic to the WAN link. Global YARN mitigates this even further by distributing the computational load.
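As an illustration of case (a), here is the canonical word-count job: a minimal sketch in which the output, one count per distinct word, is far smaller than the input. The input and output paths are supplied on the command line.

    // A minimal word-count MapReduce job (high-throughput analytics: small output,
    // large input). Input and output paths are passed as program arguments.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Emits (word, 1) for every token in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Sums the counts for each word; also used as a combiner to cut shuffle traffic.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }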

Last but not least, fault tolerance is required at the server, rack, and data center levels. Passive redundancy solutions can cause days of downtime before resuming. Active-active redundant YARN and HDFS provide zero-downtime solutions for MapReduce jobs and data.

To summarize, it is imperative for mission-critical applications to have active-active redundancy for HDFS and YARN. Not only does this protect data and prevent downtime, but it also allows big data to be processed at an accelerated rate by taking advantage of the aggregated CPU, network and storage of all servers across datacenters.

– Gurumurthy Yeleswarapu, Director of Engineering, WANdisco

Application Specific Data? It’s So 2013

Looking back at the past 10 years of software, the word ‘boring’ comes to mind. The buzzwords were things like ‘web services’ and ‘SOA’. CIOs loved the promise of these things, but they could not deliver. The idea of build once and reuse everywhere really was the ‘nirvana’.

Well it now seems like we can do all of that stuff.

As I’ve said before, Big Data is not a great name because it implies that all we are talking about is a big database with tons of data. Actually, that’s only part of the story. Hadoop is the new enterprise applications platform. The key word there is platform. If you could have a single general-purpose data store that could service ‘n’ applications, then the whole notion of database design is over. Think about the new breed of apps on a cell phone, the social media platforms and the web search engines. Most of these do this today, storing data in a general-purpose, non-specific data store that is then used by a wide variety of applications. The new phrase for this data store is a ‘data lake’, implying a large quantum of ever-growing and changing data stored without any specific structure.

Talking to a variety of CIOs recently, I have found they are very excited by the prospect of both amalgamating data so it can be used and bringing into play data that previously could not be used: unstructured data in a wide variety of formats, such as Word documents and PDF files. This also means the barriers to entry are low. Many people believe that adopting Hadoop requires a massive re-skilling of the workforce. It does, but not in the way most people think. Actually, getting the data into Hadoop is the easy bit (‘data ingestion’ is the new buzzword). It’s not like the old relational database days, where you first had to model the data using data normalization techniques and then use ETL to get the data into a usable format. With a data lake you simply set up a server cluster and load the data; creating a data model and using ETL is simply not required.
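As a minimal sketch of just how little is involved, the snippet below loads raw files into a data lake directory through the standard Hadoop FileSystem API, with no schema, data model or ETL step; the local and HDFS paths are hypothetical.

    // A minimal sketch of "load the data as-is": raw files are copied into the data
    // lake byte-for-byte, and structure is imposed only when the data is read.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RawIngest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path landingZone = new Path("/datalake/raw/contracts/2014-06-01");
        fs.mkdirs(landingZone);

        // Word documents, PDFs, logs: no normalization, no ETL, just a copy.
        fs.copyFromLocalFile(new Path("file:///exports/contracts/agreement-0001.pdf"), landingZone);
        fs.copyFromLocalFile(new Path("file:///exports/contracts/agreement-0001.docx"), landingZone);

        System.out.println("Ingested into " + landingZone);
      }
    }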

The real transformation and re-skilling is in application development. Applications are moving to the data; today, in a client-server world, it’s the other way around. We have seen this type of re-skilling before, such as the move from COBOL to object-oriented programming.

In the same way that client-server technology disrupted mainframe computer systems, big data will disrupt client-server. We’re already seeing this in the market today. It’s no surprise that the most successful companies in the world today (Google, Amazon, Facebook, etc.) are all actually big data companies. This isn’t a ‘might be’; it’s already happened.


About David Richards
