Monthly Archive for October, 2014

Solving the 3 biggest Hadoop challenges

A colleague recently pointed me to this great article on the 3 biggest Hadoop challenges. The article is written by Sean Suchter, the CEO of Pepperdata, and offers a practical perspective on how these challenges are seen and managed through workarounds.

Ultimately, none of those workarounds is very satisfactory. Fortunately, Non-Stop Hadoop offers a compelling way to solve these challenges, either in whole or in part.

Resource contention due to mixed workloads and multi-tenancy environments

This problem seems to be the biggest driver of Hadoop challenges. Of the many workarounds Suchter discusses, all seem either manually intensive (tweaking Hadoop parameters for better performance) or limiting from a business perspective (gating production jobs or designing workflows to avoid bottlenecks).

As I’ve written before, the concept of a logical data lake with a unified HDFS namespace largely overcomes this challenge. Non-Stop Hadoop lets you set up multiple clusters at one or several locations, all sharing the same data – unless you choose to restrict the sharing through selective replication. Now you can run jobs on the most appropriate cluster (e.g. using high-memory nodes for in-memory processing) and avoid the worst of the resource contention.

Difficult troubleshooting

We all know the feeling of being under the gun while an important production system is offline. While the Hadoop ecosystem will surely mature in the coming years, Non-Stop Hadoop gives you built-in redundancy. Lose a NameNode? You’ve got 8 more. The whole cluster is shot? You’ve got two others that can fill in the gap…immediately.

Inefficient use of hardware

It’s really a tough problem: you need enough hardware to handle peak bursts of activity, but then a lot of it will sit idle during non-peak times. Non-Stop Hadoop gives you a clever solution: put your backup cluster to work. The backup cluster is effectively just an extension of the primary cluster when you use Non-Stop Hadoop. Point some jobs at the second cluster during periods of peak workload and you’ll have easy load balancing.

To borrow an analogy from the electric power industry, do you want to maintain expensive and inefficient peaker units for the two hours when the air-conditioning load is straining the grid? Or do you want to invest in distributed power setups like solar, wind, and neighborhood generation?

A better Hadoop

Non-Stop Hadoop is Hadoop…just better. Let’s solve your problems together.

GitLab and Git MultiSite: Architecture

The architecture of GitLab running with Git MultiSite is worth exploring.  In the interest of saving a thousand words, here’s the picture.


As you can see, the topology is quite a bit more complex with a Git repository management system that relies on multiple data stores.  Git MultiSite coordinates with GitLab to replicate all repository activity, including wiki repositories.  Git MultiSite also replicates some important files, such as the GitLab authorization files used for access control.

As for the other data stores, we’re relying on GitLab’s ability to run with multiple web apps connected to a single logical relational database and a single logical Redis database.  They can be connected directly or via pass-through mirrors.  Kudos to the GitLab team for a clean architecture that facilitates this multi-master setup; they’ve avoided some of the nasty caching issues that other applications encounter.  This topology is in fact similar to what you can do with GitLab when you use shared storage for the repositories.  Git MultiSite provides the missing link: full repository replication with robust performance in a WAN environment and a shared-nothing architecture.

Short of relying completely on Git as a data store for code reviews and other metadata, this architecture is about as clean as it gets.

Now for some nuts and bolts…

We are making some simplifying assumptions for the first release of GitLab integration.  The biggest assumption is that all nodes run all the software, and that all repositories originate in GitLab and exist on all nodes.  We plan to relax some of these constraints in the future.

And what about performance?  Well, I’m happy to report that you’ll see very good performance in all cases and much improved performance in some.  Balancing repository activity across several nodes yields better throughput when the system is under realistic load.


Well, that picture saved a few words, but nothing speaks better than a demo or a proof-of-concept deployment.  Contact us for details!


Scalable Social Coding

I’m very pleased to announce that Git MultiSite now formally supports GitLab, a leading on-premise Git collaboration and management suite.  With this and future integrations, Git MultiSite’s promise of a truly distributed Git solution is coming to fruition.

WANdisco first announced Git MultiSite in 2013.  Git MultiSite provides our patented active-active replication for Git, giving you a deployment of fully writable peer nodes instead of a single ‘master’ Git server.  The next step came with Access Control Plus in 2014, which brought Git repositories under a unified security and management umbrella.

And now we’re tackling the final piece of the puzzle.  Those of you active in the Git ecosystem know that most companies deploy Git as part of an integrated repository management solution that also provides social coding and collaboration tools — code review, wikis, and sometimes lightweight issue tracking.

In one sense, Git MultiSite is still a foundational technology that can replicate Git repositories managed by almost any system.  And indeed we do have customers who deployed Git MultiSite with GitLab long before we did any extra work in this area.

The devil is in the details though.  For one thing, some code review systems actually modify a Git repository using non-standard techniques in response to code review activity like approving a merge request.  So we had to make a few under-the-hood modifications to support that workflow.

Perhaps more importantly, Git MultiSite and Access Control Plus provide consistent (and writable) access to repository and access control data at all sites.  But if the collaboration tool is a key part of the workflow, you really need that portal to be available at every node as well.  And we’ve worked hard with the GitLab crew to make that possible.

So what does that all mean?  You get it all:

  • LAN speed access to repositories at every site
  • A built-in HA/DR strategy for zero downtime
  • Easy scalability for build automation or a larger user base
  • Fast access to the GitLab UI for code reviews and more at every site
  • Consistent access control at every site
  • All backed by WANdisco’s premier support options

Interested?  I’ll be publishing more details on the integration in the near future.  In the meantime, give us a call and we’ll give you a full briefing.


Advanced Gerrit Workflows

As a final note on Gerrit workflows, it’s worth looking into Gerrit’s Prolog engine if you need a customized code approval process.  Now, I know what you’re thinking – do you really need to learn Prolog to use Gerrit?  Certainly not!  You can use Gerrit out of the box very effectively.  But if you need a highly tailored workflow, you can either write a Java plugin or write some rules in Prolog.  The Prolog syntax is well suited for logical expressions, and you can check the Prolog rules in to a Gerrit repo as regular text files.  That’s easier than writing, building, and maintaining a Java plugin.

So what can you do with Prolog?  Two very useful things:


  • Submit rules define when a change can be submitted.  The default requires one vote at the highest level in each label category, with no votes at the lowest (blocking) level in any category.  A common choice is to require a human ‘+2’ and a ‘+1’ from the CI system.  Submit rules can be defined globally or per project.  Submit rules are given a set of facts about a commit (author, message, and so on) and then decide whether the commit can be submitted.
  • Submit types define how a change can be submitted, per project.  You can choose from fast forward only, merge if necessary, merge always, cherry pick, or rebase if necessary.
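
To make the submit-rule side concrete, here’s a sketch modeled on the default rule shown in Gerrit’s Prolog cookbook: a blocking ‘Code-Review’ label that needs a human ‘+2’ and a ‘Verified’ label that needs a ‘+1’ from the CI system.  The label names are Gerrit’s defaults; check the predicate signatures against your Gerrit version’s documentation before adopting this.

```prolog
% Require the highest Code-Review vote (+2) with no blocking -2 votes,
% and a +1 Verified vote (typically from CI) with no blocking -1 votes.
submit_rule(submit(CR, V)) :-
    gerrit:max_with_block(-2, 2, 'Code-Review', CR),
    gerrit:max_with_block(-1, 1, 'Verified', V).
```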


There’s a great Gerrit Prolog workbook to get you started, and Gerrit provides a Prolog shell and debugging environment.

As a simple example, here’s a submit type that only allows fast-forward updates on release branches, but allows other submit types on other branches.

submit_type(fast_forward_only) :-
    gerrit:change_branch(B),
    regex_matches('refs/heads/release.*', B),
    !.
submit_type(T) :- gerrit:project_default_submit_type(T).

Hacking Prolog is not for the brand-new-to-Gerrit, but don’t be scared of it either.  It gives you a tremendous amount of control over how changes flow into your repositories.  If you store configuration data in Git and are subject to PCI regulations or other compliance measures, then a strong Gerrit workflow explicitly defined in Prolog will help satisfy your compliance concerns.
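
As an illustration of the compliance angle, here’s a sketch of a rule that rejects self-approval: the ‘+2’ must come from someone other than the change owner.  This follows the patterns in the Gerrit Prolog cookbook (the `label(..., ok(User))` result term and `gerrit:change_owner/1`), but treat it as an assumption to verify against your Gerrit version, not a drop-in rule.

```prolog
% Sketch: accept the change only if the +2 Code-Review vote was cast
% by someone other than the change owner (self-approval is rejected).
submit_rule(submit(CR, V)) :-
    gerrit:max_with_block(-2, 2, 'Code-Review', CR),
    gerrit:max_with_block(-1, 1, 'Verified', V),
    CR = label('Code-Review', ok(Approver)),
    gerrit:change_owner(Owner),
    Approver \= Owner.
```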

As always if you have any questions just ask.  We have a team of Git experts waiting to help.