Application Specific Data? It’s So 2013

Looking back at the past 10 years of software, the word ‘boring’ comes to mind.  The buzzwords were things like ‘web services’ and ‘SOA’.  CIOs loved the promise of these things, but they could not deliver.  The idea of building once and reusing everywhere really was the ‘nirvana’.

Well it now seems like we can do all of that stuff.

As I’ve said before, Big Data is not a great name because it implies that all we are talking about is a big database with tons of data.  Actually that’s only part of the story. Hadoop is the new enterprise application platform, and the key word there is platform.  If you could have a single general-purpose data store that could service ‘n’ applications, then the whole notion of database design is over.  Think about the new breed of apps on a cell phone, the social media platforms and the web search engines: most of these already work this way, storing data in a general-purpose, non-specific data store that is then used by a wide variety of applications.  The new phrase for this data store is a ‘data lake’, implying a large quantum of ever-growing and changing data stored without any specific structure.

Talking to a variety of CIOs recently, they are very excited by the prospect of both amalgamating data so it can be used and bringing into play data that previously could not be used: unstructured data in a wide variety of formats like Word documents and PDF files.  This also means the barriers to entry are low.  Many people believe that adopting Hadoop requires a massive re-skilling of the workforce.  It does, but not in the way most people think.  Actually getting the data into Hadoop is the easy bit (‘data ingestion’ is the new buzzword).  It’s not like the old relational database days, where you first had to model the data using normalization techniques and then use ETL to get the data into a usable format.  With a data lake you simply set up a server cluster and load the data; creating a data model and running ETL is simply not required.
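As a simple illustration of how light the ingestion step can be (the paths here are made up), loading raw office documents into HDFS involves no modelling at all:

hdfs dfs -mkdir -p /data/lake/raw
hdfs dfs -put contracts/*.pdf reports/*.docx /data/lake/raw/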

The real transformation and re-skilling is in application development.  Applications are moving to the data, whereas in today’s client-server world it’s the other way around.  We have seen this type of re-skilling before, for example in the move from COBOL to object-oriented programming.

In the same way that client-server technology disrupted  mainframe computer systems, big data will disrupt client-server.  We’re already seeing this in the market today.  It’s no surprise that the most successful companies in the world today (Google, Amazon, Facebook, etc.) are all actually big data companies.  This isn’t a ‘might be’ it’s already happened.

The hunger for low latency big data

Good grief: Spark has barely hit a 1.0 release and already there are several projects vying to improve on it and perhaps become the next big thing.  I think this is another sign that Spark is here to stay – everyone is focusing on how to beat it!  In fact, even the Berkeley lab that developed Spark has come up with an alternative that is reportedly a couple of orders of magnitude faster than Spark for some types of machine learning.

The bigger lesson here for CIOs and data architects is that your Hadoop infrastructure has to be flexible enough to deploy the latest and greatest tools.  Your ‘customers’ – data scientists, marketers, managers – will keep asking for faster processing time.

Of course here at WANdisco we’ve got some of the best minds in Big Data working on exactly this problem.  Our principal scientists have been working on the innards of Hadoop almost since day one, and they’re evolving our Hadoop products to support very sophisticated deployments.  For instance, Non-stop Hadoop lets you run several Hadoop clusters that share the same HDFS namespace but otherwise operate independently.  That means you can allocate distinct clusters (or carve off part of a cluster) to run dedicated processing pipelines that might require a different hardware or job management profile to support low latency big data operations.

Sound interesting?  It’s a fast-moving field and we’re ready to help!

Paxos and the Distributed Coordination Engine (DConE)

This post introduces DConE, which enables active-active replication in distributed environments. It extends Paxos so that data consistency is guaranteed even across a WAN.
1. About Paxos
Paxos is a family of protocols, proposed by Leslie Lamport, for solving consensus in a network of unreliable processors. The protocol appeared and was named in 1990, but the paper describing it was not published until 1998.
In the Paxos algorithm a state machine is installed on each node of the distributed system to guarantee the order of transactions. Each state machine plays one of the roles of Proposer, Acceptor, or Learner. The algorithm repeats the following three phases until consensus is reached: (1) a node is elected to act as coordinator; (2) a transaction proposal is broadcast, and each receiving node either accepts or rejects it; (3) if a majority of the participating nodes accept, a commit message is broadcast and the transaction is executed. Proposal numbers are kept unique.
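To make the three phases concrete, here is a minimal single-decree sketch in Python. The class names, the in-memory message passing, and the synchronous loop are illustrative simplifications, not DConE's implementation:

class Acceptor:
    def __init__(self):
        self.promised = -1       # highest proposal number promised so far
        self.accepted = None     # (number, value) most recently accepted, if any

    def prepare(self, n):
        # Phase 1: promise not to accept anything numbered below n
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise has been made
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1: the coordinator asks every acceptor for a promise
    replies = [a.prepare(n) for a in acceptors]
    promises = [prior for ok, prior in replies if ok]
    if len(promises) <= len(acceptors) // 2:
        return None              # no majority of promises; retry with a higher n
    # If any acceptor already accepted a value, that value must be re-proposed
    prior = max((p for p in promises if p is not None), default=None)
    chosen = prior[1] if prior else value
    # Phase 2: broadcast the proposal; Phase 3: majority acceptance means commit
    acks = sum(a.accept(n, chosen) for a in acceptors)
    return chosen if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "txn-42"))   # -> txn-42

A real implementation also has to retry with higher proposal numbers, tolerate lost messages, and persist acceptor state, which is exactly the kind of machinery DConE adds on top.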

2. About DConE (Distributed Coordination Engine)
DConE extends Paxos so that it can be used across a WAN. Its main components are shown in the figure. Briefly, it works as follows.
When an application issues a write transaction, the Proposal Manager on that node creates a proposal, assigns it a Local Sequence Number (LSN), and saves it in the Proposal Log so that it can be recovered after a failure. The Agreement Manager then determines an agreement number by incrementing the Global Sequence Number (GSN) it holds; the GSN is the number of the last proposal accepted by all nodes. Using that agreement number it attempts to obtain consensus on the transaction from all nodes. This data is also saved in the Agreement Log, which together with the Proposal Log is used for recovery after a failure.
If a majority accepts, the agreement number becomes the GSN on those nodes. Once consensus is reached, DConE hands the transaction to the Locking Scheduler. The node that issued the proposal waits for agreement on the GSN, but it does not wait for the other nodes to complete the transaction. DConE's main features are described below.

2.1 Preserving local order (US Patent #8364633)
To improve performance, DConE handles consensus rounds in parallel. As a result, the order the application expects within a node could be inverted when GSNs are assigned. Transaction order is therefore determined by two numbers: the LSN (Local Sequence Number), which fixes the order within a node, and the GSN (Global Sequence Number), which fixes the order across the networked nodes. The network can be either a LAN or a WAN.
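As a rough illustration of the two-level ordering (the data structures and the hard-coded interleaving are invented for this sketch, not taken from DConE):

from itertools import count

def proposals(node, payloads):
    # Each node stamps its own proposals with a Local Sequence Number (LSN)
    lsn = count(1)
    return [{"node": node, "lsn": next(lsn), "tx": p} for p in payloads]

node_a = proposals("A", ["a1", "a2"])
node_b = proposals("B", ["b1"])

# Consensus may interleave proposals from different nodes, but proposals from
# the same node are agreed in LSN order, so they can never be swapped.
gsn = count(1)
agreed = []
for p in (node_a[0], node_b[0], node_a[1]):   # one possible interleaving
    p["gsn"] = next(gsn)                      # the agreed global order
    agreed.append(p)

# Every node replays the agreed transactions in GSN order.
for p in sorted(agreed, key=lambda x: x["gsn"]):
    print(p["gsn"], p["node"], p["lsn"], p["tx"])

The GSN gives every node the same total order, while the LSN guarantees that two writes issued by the same node are never reordered relative to each other.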

2.2 The Locking Scheduler
The Locking Scheduler is not aware of the global order; it processes transactions according to its local queue. DConE itself has no interface to the application; the Locking Scheduler takes care of that. Despite the name it does not schedule locks; it behaves more like a database scheduler.

2.3 Improving performance and scalability
DConE adds a number of enhancements so that the Paxos algorithm works well even in large WAN environments. The main ones are:

• Paxos is driven by majority voting, but the way the quorum is formed is configurable. You can, for example, configure a single node (a singleton) to respond to proposals; making the site with the most users the singleton node reduces WAN traffic. Running the majority vote within a region makes follow-the-sun operation possible, and with an even number of nodes a tie-breaker node can be designated (a small sketch of these quorum options follows this list).
• Multiple proposals can be processed in parallel, and the performance degradation caused by collisions when obtaining agreement numbers is mitigated. Nodes can also be added to, and removed from, a running system dynamically.
• Deciding how long the state machine data saved on disk and in memory should be retained is a difficult balance between safety and resource usage; DConE resolves it by having the nodes exchange messages at regular intervals.
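Here is the promised sketch of those quorum options; the strategy functions and node names are illustrative only:

def majority(votes):
    # plain Paxos-style majority of accepting nodes
    return sum(votes.values()) > len(votes) // 2

def singleton(votes, node):
    # only the designated node's response counts
    return votes.get(node, False)

def majority_with_tie_breaker(votes, tie_breaker):
    # with an even number of nodes, a designated node breaks a 50/50 split
    yes = sum(votes.values())
    if 2 * yes == len(votes):
        return votes.get(tie_breaker, False)
    return yes > len(votes) // 2

votes = {"tokyo": True, "ny": True, "singapore": False, "london": False}
print(majority(votes))                          # False: 2 of 4 is not a majority
print(singleton(votes, "tokyo"))                # True
print(majority_with_tie_breaker(votes, "ny"))   # True: the tie is broken by ny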

2.4 Automatic backup and recovery
Active-active replication means every node is a hot backup of every other node. If a node is cut off by a network failure it can still serve reads, but writes are blocked so that a split brain cannot create inconsistent data. The CAP theorem says that Consistency, Availability, and Partition tolerance cannot all be satisfied at once; DConE is designed to favor C and A.
Once the failure is repaired, DConE automatically recovers the transactions agreed during the outage from the Agreement Log and the Proposal Log.

3. Closing remarks
This has been only a rough outline; for the details, please see the DConE whitepaper.
http://www.wandisco.com/system/files/documentation/WANdisco_DConE_White_Paper.pdf

Hadoop security tiers via cluster zones

The recent cyberattack on Sony’s network was a CIO’s nightmare come true. The Wall Street Journal had a good summary of some of the initial findings and recommendations. One of the important points was that data integration, although a huge win for productivity, increases the exposure from a single security breach.

That started me thinking about isolating sensitive data into separate Hadoop security tiers. I’m as excited as anyone by the prospect of Hadoop data lakes; in general, the more data you have available, the happier your data scientists will be. When it comes to your most sensitive data, however, it may be worth protecting it with greater rigor.

Hadoop security has come a long way in recent releases, with better integration with Kerberos and more powerful role-based controls, but there is no substitute for the protection that comes with isolating sensitive data on a separate cluster.

But how do you do that and still allow privileged users full access to the entire set of data for analysis? Non-stop Hadoop offers the answer: you can share the HDFS namespace across the less secure and more secure clusters, and use selective replication to ensure that the sensitive data never moves into the less secure cluster. The picture below illustrates the concept.

[Figure: an ‘open’ cluster and a ‘secure’ cluster sharing one HDFS namespace, with selective replication keeping sensitive data on the secure cluster only]

Users on the ‘open’ cluster can only see the generally available data. Users on the ‘secure’ cluster can access all of the data.

Feel free to get in touch if you have questions about how to add this extra layer of defense into your Hadoop infrastructure.

Machine Learning as a Service?

Rick Delgado had an interesting article on how the widespread availability of machine learning will facilitate the rollout of the Internet of Things (IoT).  Intuitively it makes sense; as algorithms become widely understood and field tested, they evolve from black magic to tools in the engineering kit.  You can see this phenomenon in automotive safety technology.  In the mid 1990s I was working on machine vision algorithms for automotive applications.  Everything was new and exciting; there were a few standard theories, but they had barely been tested at any scale and the processing hardware hadn’t caught up to the data demands.  Now as the Wall Street Journal reports, Toyota is making collision-avoidance gadgets standard on almost every new model.  One driver is the reduced price of the cameras and radars, but I think a bigger driver is the trustworthiness of the autonomous vehicle algorithms that can reliably sense a possible collision.

Of course, here at WANdisco the IoT is of great interest.  For all of this new streaming data to be useful, it has to be ingested, processed, and acted on, often at very high speed.  That’s a challenge for traditional Hadoop architectures – but one that we’re quite prepared to meet.

SmartSVN 8.6.3 General Access Released!

We’re pleased to announce the latest release of SmartSVN, 8.6.3. SmartSVN is the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.6.3 is available immediately for download from our website.

New Features include:

– Show client certificate option in the SSL tab in Preferences

Fixes include:

– Bug reporting now suggests the email address from the license file

For a full list of all improvements and features please see the changelog.

 

Note for Mac OS X 8.6.2 users:
– If you installed version 8.6.2 as a new download (rather than auto-updating), you will need to download and reinstall 8.6.3 to stop the master password window from constantly reappearing
– You will be required to enter the master password once more after the installation

Contribute to further enhancements

Many of the issues resolved in SmartSVN were raised on our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.

An essential Git plugin for Gerrit

One of the frequent complaints about Gerrit is the esoteric syntax of pushing a change for review:

git push origin HEAD:refs/for/master

Translated, that means: push your current HEAD to the remote named origin, onto a special review ref that targets the master branch (rather than pushing to master directly).

If you’re a Gerrit user, you need this plugin:

https://github.com/openstack-infra/git-review

It automates some of the Gerrit syntax so now you can just run:

git review

The only problem is that when you push to a non-Gerrit repository you start to wonder why your review command doesn’t work anymore.  That’s how deeply ingrained code review is in the Gerrit workflow.
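If you want to try it, setup is quick: install the plugin with pip, then let it wire up your repository.

pip install git-review
git review -s                # one-time setup: adds a 'gerrit' remote and installs the commit-msg hook
git review                   # pushes the current branch for review

The plugin reads a .gitreview file at the repository root to find the Gerrit server; the host and project values below are purely illustrative:

[gerrit]
host=review.example.com
port=29418
project=myproject.git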

Avoiding the NameNode single point of failure (QJM and Non-Stop NameNode)

A ThinkIT web article introduced our Non-Stop Hadoop as a solution to the single point of failure created by NameNode outages. http://thinkit.co.jp/story/2014/11/11/5413

QJM is a comparable solution, so I would like to add a few notes about it here.

About QJM

QJM was proposed in 2012 by Todd Lipcon of Cloudera. In the original Hadoop architecture the NameNode was a single point of failure, and QJM adds redundancy: a standby NameNode is added, JournalNodes keep multiple copies of the edit journal, and ZooKeeper detects failures so that manual or automatic failover becomes possible. The resulting configuration is shown in the figure at the upper right.
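For reference, a QJM-based HA deployment is wired together in hdfs-site.xml roughly as follows; the nameservice ID and host names are illustrative, not taken from the article:

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>                 <!-- active and standby NameNodes -->
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>  <!-- the JournalNode quorum -->
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>                    <!-- failover driven by ZooKeeper via the ZKFC -->
</property>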

 

About Non-Stop NameNode

Non-Stop NameNode builds our active-active replication mechanism, the Paxos-based Distributed Coordination Engine (DConE), into the NameNode. Multiple NameNodes then act as ConsensusNodes in line with HDFS-6469, holding the same metadata and operating as equals. Even if a NameNode fails, metadata updates continue via majority agreement, so the system keeps running as long as a majority of the NameNodes is alive. In the figure at the lower right there are five NameNodes, so Hadoop does not stop even if two of them go down. Once the failed NameNode recovers, it is automatically brought back up to date with the latest metadata.

Compared with QJM, the main advantages are:

  • 100% uptime, achieved without placing a burden on operators at failure or recovery time
  • All NameNodes are active, and the JournalNodes and ZooKeeper that QJM requires are unnecessary, so resources are 100% utilized
  • Load can be spread across multiple NameNodes, improving performance, and the system can be expanded without being stopped

In addition, Non-Stop Hadoop can automatically replicate DataNode data within a specified scope, which makes the following possible:

  • Metadata consistency is guaranteed even when the NameNodes sit in different data centers across a WAN. Replication of the much larger DataNode data is done asynchronously, so replicas are created automatically in remote data centers and disaster recovery becomes possible.
  • Data generated in a region can be stored in that region's data center and used from elsewhere. For example, credit card usage data can be stored in the Tokyo, NY, and Singapore data centers as appropriate, while the fraud detection application runs in Tokyo.

In short, multiple Hadoop clusters can be presented as a single virtual cluster, even when those clusters are spread across different data centers.

Everything described above is possible because the consistency of the NameNode metadata is guaranteed. The component that guarantees that consistency in a distributed environment is our patented extension of Paxos, the Distributed Coordination Engine, which I will cover in a separate post.

Another Top 5 List on Hadoop

Top 5 lists are always fun, and here’s another top 5 list on Hadoop.  It’s fairly familiar to anyone who follows the space, but it does highlight a few important trends.  A few comments and quibbles:

  • The fact that open source is the foundation of Big Data software shouldn’t be surprising even to the government anymore.  After all, even the secretive NSA has publicly acknowledged use of Hadoop.
  • The only controversial claim is that Hadoop is set to replace Enterprise Data Warehouses (EDWs).  I’ve heard a lot of arguments for and against that point over the last year.  It seems that Hadoop will at least complement EDWs and allow them to be used more efficiently, but complete replacement will depend on Hadoop maturing in a couple of key areas.  First, it will have to handle low-latency queries more efficiently.  Second, it will have to be as reliable and flexible as mature EDWs.  Keep an eye on projects like Apache Spark and, of course, Non-stop Hadoop in this area.
  • I agree that the Internet of Things (IoT) will be a new and important source of data for Hadoop in the future.  However, just a point of terminology: no one will “embed Hadoop” into small devices.  Rather, data from these devices will be streamed into Hadoop.
  • Siri and the other smart assistants like Cortana are making waves, but IBM’s Watson seems to be years ahead in terms of analyzing complex unstructured situations.  Watson does use Hadoop for distributed processing but it has a much different paradigm than traditional MapReduce processing, and it needs to store a good chunk of its data in RAM.  That’s another sign that the brightest future for Hadoop will require new and exciting analytics frameworks.

 

Binary artifact management in Git

Paul Hammant has an interesting post on whether to check binary artifacts into source control.  Binary artifact management in Git is an interesting question and worth revisiting from time to time.

First, a bit of background.  Centralized SCM systems like Subversion and ClearCase are a bit more capable than Git when it comes to handling binary files.  One reason is sheer performance: since a Git repository has a full copy of the entire history, you just don’t want your clone (working copy) to be too big.  Another reason is assembling your working views.  ClearCase and to a lesser extent Subversion give you some nice tools to pick and choose pieces of a really big central repository and assemble the right working copy.  For example in a ClearCase config spec you can specify that you want a certain version of a third party library dependency.  Git on the other hand is pretty much all or nothing; it’s not easy to do a partial clone of a really big master repository.

Meanwhile, there has been a trend in development toward more formal build and artifact management systems.  You can define a dependency graph in a tool like Maven and use Maven, Artifactory, or even Jenkins to manage the resulting artifacts.  Along with offering benefits like keeping derived objects out of source control, this trend also covers Git’s weak spot in handling binaries.

Now I’m not entirely sure about Paul’s reasons for recommending a switch back to managing binaries in Git.  Personally I prefer to properly capture dependencies in a configuration file like Maven’s POM, as I can exercise proper change control over that file.  The odd thing about SCM working view definitions like config specs is that they aren’t strongly versioned like source code files are.

But that being said,  you may prefer to store binaries in source control, or you may have binaries that are actually source artifacts (like graphics or multimedia for game development).  So is it hopeless with Git?

Not quite.  There are a couple of options worth looking at.  First, you could try out one of the Git extensions like git-annex or git-media.  These have been around a long time and work well in some use cases.  However they do require extra configuration and changes to the way you work.

Another interesting option is the use of shared back-end storage for cloned repositories.  Most Git repository management solutions that offer forks use these options to make efficient use of back-end storage space.  If you can accept working on shared development infrastructure rather than your own workstation, you can clone a Git repository over the file protocol with the -s option to share the object folder.  There’s also the --reference option, which points a new clone at an existing object store.  These options make cloning relatively fast, as you don’t have to create copies of large objects.  They don’t alleviate the pain of having all the checked-out files in your clone directory, but if you’re working on a powerful server that may be acceptable.  The bigger drawback of the file protocol is the lack of access control.
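For example (the paths and URLs here are illustrative):

git clone -s /srv/git/big-repo.git work-copy                                                  # shares objects via .git/objects/info/alternates
git clone --reference /srv/git/big-repo.git https://git.example.com/big-repo.git work-copy2   # fetches remotely but borrows objects from the local copy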

Management of large binaries is still an unsolved problem in the Git community.  There are effective alternatives and work-arounds but it’ll be interesting to see if anyone tries to solve the problem more systematically.

SmartSVN 8.6.2 General Access Now Available

We’re pleased to announce the latest release of SmartSVN, 8.6.2. SmartSVN is the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.6.2 is available immediately for download from our website.

New Features include:

– Support for Mac OS X 10.10 Yosemite

Fixes include:

– Issue with log and graphing when no cache is created

For a full list of all improvements and features please see the changelog.

Contribute to further enhancements

Many of the issues resolved in SmartSVN were raised on our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.