Monthly Archive for April, 2015

Subversion or Git, or Both?

Subversion and Git are the most widely used version control systems, used by 85% of open source projects and roughly 60% of enterprises. In Japan, too, we increasingly hear from teams that are using Git or evaluating it. A webcast covers a comparison of the two, guidelines for choosing between them, migration caveats, and more. Here is a summary.

Subversion is centralized and Git is distributed, but in enterprise use Git, like Subversion, ends up with a managed master repository (a "golden master"). What changes with Git is that developers can share changes with each other freely: developer A can push a change to developer B alone, for example. It also means that even if A has no rights to push to the master repository, A's changes can end up in the master repository via B.
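As a concrete illustration of that peer-to-peer flow, here is a minimal sketch driving plain Git commands from Python; the remote name, host and branch are hypothetical:

```python
import subprocess

def git(*args):
    """Run a git command in the current repository and return its output."""
    return subprocess.check_output(["git"] + list(args)).decode()

# Developer A adds developer B's clone as a remote and pushes a feature
# branch directly to B -- no master repository involved at this point.
git("remote", "add", "dev-b", "ssh://dev-b-host/home/b/project.git")
git("push", "dev-b", "feature/login")

# Later, developer B (who does have push rights) can merge A's branch
# and push the result to the managed "golden master" repository.
```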

Because Git is so flexible, it supports many different workflows, which requires a shift in mindset. At the same time, management tools such as GitHub, GitLab and Gerrit have matured, lowering the barrier to enterprise adoption. When migrating from Subversion to Git, we recommend running the two side by side for a transition period, and tools are available for this. One caution when using Git: keep your repositories small and actively manage their size over time.
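One simple way to keep an eye on repository growth is to track the packed size reported by `git count-objects`; a minimal sketch:

```python
import subprocess

# Report the packed size of the current Git repository so it can be
# tracked over time; repositories that grow unchecked (large binaries,
# unpruned history) hurt clone and fetch performance.
out = subprocess.check_output(["git", "count-objects", "-v"]).decode()
stats = dict(line.split(": ") for line in out.strip().splitlines())
print("packed size: %.1f MiB" % (int(stats["size-pack"]) / 1024.0))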

Finally, a comment on Subversion versus Git from our consulting experience: developers are drawn to Git's powerful features, but once it is rolled out inside an enterprise, teams run into a variety of problems in practice, so proceed carefully.

For details, see https://www.brighttalk.com/webcast/11815/152641.

(The first two minutes or so suffer from an echo, but the audio is fine from the important part onward, so please bear with it.)

 

A GitLab demo is available in the webcast below.

https://www.brighttalk.com/webcast/11817/150559

Finally, here is the datasheet for Access Control Plus, which provides unified access control across both Subversion and Git.

WD-Datasheet-ACplus-A4-Japan-WEB

Hortonworks and WANdisco make it easy to get started with Spark

Hortonworks, one of our partners in the Open Data Platform Initiative, recently released version 2.2.4 of the Hortonworks Data Platform (HDP).  It bundles Apache Spark 1.2.1.  That’s a clear indicator (if we needed another one) that Spark has entered the Hadoop mainstream.  Are you ready for it?

Spark opens up a new realm of use cases for Hadoop since it offers very fast in-memory data processing.  Spark has blown through several Hadoop benchmarks and offers a unified batch, SQL, and streaming framework.
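For a flavor of that unified framework, here is a minimal word count using the RDD API of the Spark 1.2 era; the HDFS path and application name are illustrative assumptions:

```python
from pyspark import SparkContext

sc = SparkContext(appName="word-count")
lines = sc.textFile("hdfs:///data/logs/")        # distributed read from HDFS
words = lines.flatMap(lambda line: line.split())
counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())                         # keep the RDD in memory
print(counts.take(10))
sc.stop()
```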

But Spark presents new challenges for Hadoop infrastructure architects. Spark nodes favor memory and CPU, with fewer drives than a typical Hadoop data node, and the art of monitoring and tuning Spark is still in its early days.

Hortonworks is addressing many of these challenges by including Spark in HDP 2.2.4 and integrating it into Ambari.  And now WANdisco is making it even easier to get started with Spark by giving you the flexibility to deploy Spark into a separate cluster while still using your production data.

WANdisco Fusion uses active-active data replication to make the same Hadoop data available and usable consistently from several Hadoop clusters.  That means you can run Spark against your production data, but isolate it on a separate cluster (perhaps in the cloud) while you get up to speed on hardware sizing and performance monitoring.  You can continue to run Spark this way indefinitely in order to isolate any potential performance impact, or eventually migrate Spark to your main cluster.

Sharing data while keeping compute resources separate gives you the extra flexibility you need to rapidly deploy new Hadoop technologies like Spark without impacting critical applications on your main cluster. Hortonworks and WANdisco make it easy to get started with Spark. Get in touch with our solution architects today.

 

 

WANdisco Fusion Q&A with Jagane Sundar, CTO

Tuesday we unveiled our new product: WANdisco Fusion. Ahead of the launch, we caught up with WANdisco CTO Jagane Sundar, who was one of the driving forces behind Fusion.

Jagane joined WANdisco in November 2012 after the firm’s acquisition of AltoStor and has since played a key role in the company’s product development and rollout. Prior to founding AltoStor along with Konstantin Shvachko, Jagane was part of the original team that developed Apache Hadoop at Yahoo!.

Jagane, put simply, what is WANdisco Fusion?

JS: WANdisco Fusion is a wonderful piece of technology that’s built around a strongly consistent transactional replication engine, allowing for the seamless integration of different types of storage for Hadoop applications.

It was designed to help organizations get more out of their Big Data initiatives, answering a number of very real problems facing the business and IT worlds.

And the best part? All of your data centers are active simultaneously: You can read and write in any data center. The result is you don’t have hardware that’s lying idle in your backup or standby data center.

What sort of business problems does it solve?

JS: It provides two new important capabilities for customers. First, it keeps data consistent across different data centers no matter where they are in the world.

And it gives customers the ability to integrate different storage types into a single Hadoop ecosystem. With WANdisco Fusion, it doesn’t matter if you are using Pivotal in one data center, Hortonworks in another and EMC Isilon in a third – you can bring everything into the same environment.

Why would you need to replicate data across different storage systems?

JS: The answer is very simple. Anyone familiar with storage environments knows how diverse they can be. Different types of storage have different strengths depending on the individual application you are running.

However, keeping data synchronized across them is very difficult to get right. Fusion removes this challenge while maintaining data consistency.

How does it help future proof a Hadoop deployment?

JS: We believe Fusion will form a critical component of companies' workflow update procedures. You can update your Hadoop infrastructure one data center at a time, without impacting application availability or having to copy massive amounts of data once the update is done.

This helps you deal with updates from both Hadoop and application vendors in a carefully orchestrated manner.

Doesn’t storage-level replication work as effectively as Fusion?

JS: The short answer is no. Storage-level replication is subject to latency limitations that are imposed by file systems. The result is you cannot really run storage-level replication over long distances, such as a WAN.

Storage-level replication is nowhere near as functional as Fusion: It has to happen at the LAN level and not over a true Wide Area Network.

With Fusion, you have the ability to integrate diverse systems such as NFS with Hadoop, allowing you to exploit the full strengths and capabilities of each individual storage system – I’ve never worked on a project as exciting and as revolutionary as this one.

How did WANdisco Fusion come about?

JS: By getting inside our customers’ data centers and witnessing the challenges they faced. It didn’t take long to notice the diversity of storage environments.

Our customers found that different storage types worked well for different applications – and they liked it that way. They didn’t want strict uniformity across their data centers, but to be able to leverage the strengths of each individual storage type.

At that point we had the idea for a product that would help keep data consistent across different systems.

The result was WANdisco Fusion: a fully replicated transactional engine that makes the work of keeping data consistent trivial. You only have to set it up once and never have to bother with checking if your data is consistent.

This vision of a fully utilized, strongly consistent, diverse storage environment for Hadoop is what we had in mind when we came up with the Fusion product.

You’ve been working with Hadoop for the last 10 years. Just how disruptive is WANdisco Fusion going to be?

JS: I’ve actually been in the storage industry for more than 15 years now. Over that period I’ve worked with shared storage systems, and I’ve worked with Hadoop storage systems. WANdisco Fusion has the potential to completely revolutionize the way people use their storage infrastructure. Frankly, this is the most exciting project I’ve ever been part of.

As the Hadoop ecosystem evolved I saw the need for this virtual storage system that integrates different types of storage.

Efforts to make Hadoop run across different data centers have been mostly unsuccessful. For the first time, we at WANdisco have a way to keep your data in Hadoop systems consistent across different data centers.

The reason this is so exciting is because it transforms Hadoop into something that runs in multiple data centers across the world.

Suddenly you have capabilities that even the original inventors of Hadoop didn’t really consider when it was conceived. That’s what makes WANdisco Fusion exciting.

The inspiration for WANdisco Fusion


Roughly two years ago, we sat down to start work on a project that finally came to fruition this week.

At that meeting, we had set ourselves the challenge of redefining the storage landscape. We wanted to map out a world where there was complete shared storage, but where the landscape remained entirely heterogeneous.

Why? Because we’d witnessed the beginnings of a trend that has only grown more pronounced with the passage of time.

From the moment we started engaging with customers, we were struck by the extreme diversity of their storage environments. Regardless of whether we were dealing with a bank, a hospital or utility provider, different types of storage had been introduced across every organization for a variety of use cases.

In time, however, these same companies wanted to start integrating their different silos of data, whether to run real-time analytics or to gain a full 360-degree perspective of performance. Yet preserving diversity across data centers was critical, given that each storage type has its own strengths.

They didn’t care about uniformity. They cared about performance and this meant being able to have the best of both worlds. Being able to deliver this became the Holy Grail – at least in the world of data centers.

This isn’t quite The Gordian Knot but it’s certainly a very difficult, complex problem and possibly one that could only be solved with our core, patented IP DConE.

Then we had a breakthrough.

Months later, I'm proud to formally release WANdisco Fusion (WD Fusion), the only product that enables WAN-scope active-active synchronization of different storage systems into one environment.

What does this mean in practice? Well, it means that you can use Hadoop distributions like Hortonworks, Cloudera or Pivotal for compute, Oracle BDA for fast compute, and EMC Isilon for dense storage. You could even use a complete variety of Hadoop distros and versions. Whatever your set-up, with WD Fusion you can leverage new and existing storage assets immediately.

With it, Hadoop is transformed from being something that runs within a data center into an elastic platform that runs across multiple data centers throughout the world. WD Fusion allows you to update your storage infrastructure one data center at a time, without impacting application availability or having to copy vast swathes of data once the update is done.

When we were developing WD Fusion we agreed upon two things. First, we couldn't produce anything that made changes to the underlying storage system – it had to behave like a client application. Second, anything we created had to enable a single global namespace across an entire storage infrastructure.

With WD Fusion, we allow businesses to bring together different storage systems by leveraging our existing intellectual property – the same Paxos-powered algorithm behind Non-Stop Hadoop, Subversion MultiSite and Git MultiSite – without making any changes to the platform you're using.

Another way of putting it is we’ve managed to spread our secret sauce even further.

We have some of the best computer scientists in the world working at WANdisco, but I’m confident that this is the most revolutionary project any of us have ever worked on.

I’m delighted to be unveiling WD Fusion. It’s a testament to the talent and character of our firm, the result of looking at an impossible scenario and saying: “Challenge accepted.”


About David Richards

David is CEO, President and co-founder of WANdisco and has quickly established WANdisco as one of the world’s most promising technology companies.

Since co-founding the company in Silicon Valley in 2005, David has led WANdisco on a course for rapid international expansion, opening offices in the UK, Japan and China. David spearheaded the acquisition of Altostor, which accelerated the development of WANdisco’s first products for the Big Data market. The majority of WANdisco’s core technology is now produced out of the company’s flourishing software development base in David’s hometown of Sheffield, England and in Belfast, Northern Ireland.

David has become recognised as a champion of British technology and entrepreneurship. In 2012, he led WANdisco to a hugely successful listing on the London Stock Exchange (WAND:LSE), raising over £24m to drive business growth.

With over 15 years’ executive experience in the software industry, David sits on a number of advisory and executive boards of Silicon Valley start-up ventures. A passionate advocate of entrepreneurship, he has established many successful start-up companies in Enterprise Software and is recognised as an industry leader in Enterprise Application Integration and its standards.

David is a frequent commentator on a range of business and technology issues, appearing regularly on Bloomberg and CNBC. Profiles of David have appeared in a range of leading publications including the Financial Times, The Daily Telegraph and the Daily Mail.

Specialties: IPOs, startups, entrepreneurship, CEO leadership, investing, board membership, advisory roles, venture capital, offshore development, financing, M&A

Scalable and Secure Git

Now that WANdisco has released an integration between Git MultiSite and GitLab, it’s worth putting the entire Git lineup at WANdisco into perspective.

Git MultiSite is the core product providing active-active replication of Git repository data. This underpins our efforts to make Git more reliable and better performing. Active-active replication means that you have full use of your Git data at several locations, not just in a single ‘master’ Git server. You get full high availability and disaster recovery out of the box, and you can load balance your end user and build demands between several Git servers. Plus, users at every location get fast local read and write access. As one of our customers recently pointed out, trying to make regular Git mirrors work this way requires a few man-years of effort.

On top of Git MultiSite you have three options for user management, security, and collaboration.

  • Use WANdisco’s Access Control Plus for unified, scalable user and permission management. It features granular permissions, delegated team management, and full integration with SVN MultiSite Plus for unified Subversion-Git administration.
  • Use Gerrit to take advantage of powerful continuous review workflows that underpin the Android community.
  • Use GitLab for an enterprise-grade social coding and collaboration platform.

Not sure which direction to take? Our solution architects can help you choose between Subversion, Git, and all the other tools that you have to contend with.

Active-active strategies for data protection

A new report preview from 451 Research highlights some of the challenges facing data center operators. Two of the conclusions stood out in particular.  First, disaster recovery (DR) strategies are top of mind as IT operations become increasingly centralized, raising the cost of an outage. 42% of data center operators are evaluating DR strategies, and a majority (62%) are using active-active strategies for data protection. Second, data center operators are playing in a more complicated world now. The ability to operate applications and data centers in a hybrid cloud environment is called out as a particular area of interest.

These findings echo what we’re hearing from our own customers. For many enterprise IT architects, active-active data replication is a checklist item when deploying a vital service like a Hadoop cluster. Many WANdisco Fusion customers buy our products for precisely that reason. And we’re also seeing strong interest in WANdisco Fusion’s unique ability to provide that replication between Hadoop clusters that use different distributions and storage systems, on-premise or in the cloud.

Visit 451 Research to obtain the full report. In the meantime, our solution architects can help you evaluate your own DR and hybrid deployment strategies.

Improving HBase Scalability for Real-time Applications

When we introduced Non-Stop for Apache HBase, we explained how it would improve HBase reliability for critical applications.  But Non-Stop for Apache HBase also uniquely improves HBase scalability and performance.

By providing multiple active-active region servers, Non-Stop for Apache HBase alleviates some common HBase performance woes.  Clients are load balanced between several region servers for any particular region, spreading the load and reducing the impact of problems like region ‘hot spots’.

[Diagram: Non-Stop for Apache HBase WAN architecture]

So far so good, but you might be thinking that you could get the same benefit by using HBase read-HA.  However, HBase read-HA is limited to read operations in a single data center.  Non-Stop for Apache HBase lets you put region servers in several data centers, and any of them can handle write operations.  That gives you a few nice benefits:

  • Writes can be directed to any region server, reducing the chance that a single region server becomes a bottleneck due to hot spots or garbage collection.
  • Applications at other data centers now have fast access to a ‘local’ region server.
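To make that second point concrete, here is a minimal sketch using the happybase Thrift client (our choice for illustration; any standard HBase client works the same way). The host and table names are hypothetical: the application in each data center simply points at its local region server and can both read and write.

```python
import happybase

# Connect to the *local* data center's region server via its Thrift
# gateway, rather than routing every write to a single remote master.
connection = happybase.Connection(host="hbase-local.dc2.example.com")
table = connection.table("events")

# Writes succeed against the local region server...
table.put(b"row-2015-04-22", {b"cf:status": b"ok"})

# ...and reads are served locally as well.
print(table.row(b"row-2015-04-22"))
connection.close()
```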

Although the HBase community continues to try to improve HBase performance, there are some bottlenecks that just can’t be eliminated without active-active replication.  No other solution lets you use several active region servers per region, and put those region servers at any location without regard to WAN latency.

If you’ve ever struggled with HBase performance, you should give Non-Stop for Apache HBase a close look.

Reducing Hadoop network vulnerabilities for backup and replication

Managing network connections between Hadoop clusters in different data centers is a significant challenge for Hadoop and network administrators. WANdisco Fusion reduces the number of connections required for any cross-cluster data flow, thereby reducing Hadoop network vulnerabilities for backup and replication.

DistCP is the tool used for data transfer in almost every Hadoop backup and workflow system.  DistCP requires connectivity from each data node in the source cluster to each data node in the target cluster.


DistCP requires connections from each data node to each data node
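For reference, a typical DistCP copy between two clusters looks like the sketch below, driven from Python here for illustration; the name node addresses and paths are hypothetical.

```python
import subprocess

# DistCP runs as a MapReduce job whose tasks copy data between the two
# clusters' data nodes directly -- which is where the full-mesh
# connectivity requirement described above comes from.
subprocess.check_call([
    "hadoop", "distcp",
    "hdfs://nn-source.example.com:8020/data/events",
    "hdfs://nn-backup.example.com:8020/data/events",
])
```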

Typically each data-node-to-data-node connection requires configuring two connections, for inbound and outbound traffic, to cross the firewall and navigate any intermediate proxies.  In a case where you have 16 data nodes in each cluster, that means 16 × 16 × 2 connections to configure, secure, and monitor – 512 in total!  That’s a nightmare for Hadoop administrators and network operators.  Just ask the people responsible for planning and running your Hadoop cluster.

WANdisco Fusion solves this problem by routing cross-cluster communication through a handful of servers.  As the diagram below illustrates, in the simplest case you’ll have one server in the source cluster talking to one server in the target cluster, requiring a grand total of 2 connections to configure.


WANdisco Fusion requires only a handful of network connections

In a realistic deployment, you’d require additional connections for the redundant WANdisco Fusion servers – this is an active-active configuration after all.  Still, in a large deployment you’d see a few tens of connections, rather than many hundreds.
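The back-of-the-envelope arithmetic, with the Fusion server counts as illustrative assumptions:

```python
# Firewall rules to manage: two rules (inbound + outbound) per node pair.
def rules(src_servers, dst_servers, rules_per_pair=2):
    return src_servers * dst_servers * rules_per_pair

print(rules(16, 16))  # 512 rules with DistCP's full data-node mesh
print(rules(1, 1))    # 2 rules in the simplest Fusion case
print(rules(3, 3))    # 18 rules with redundant Fusion servers per site
```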

The most recent spate of data breaches is driving a 10% annual increase in cybersecurity spending.  Why make yourself more vulnerable by exposing your entire Hadoop cluster to the WAN?  Our solution architects can help you reduce your Hadoop network exposure.