Monthly Archive for May, 2014

Why 100% Availability is Critical for Big Data Applications

Jeremy Howard, the former president of Kaggle, opined recently that few people outside the machine learning space “have yet grasped how astonishingly quickly it’s progressing.”  This is no understatement. And as a Big Data company, WANdisco is right in the intellectual middle of this coming revolution.

No human could program an algorithm for a car to safely drive itself; there are simply too many edge cases. Only a learning computer can do this, and no human knows the complete algorithm. The computer actually learns how to drive by watching a human do it. Imagine this being repeated across any number of activities that we take for granted today as ones only humans can perform.

There’s a critical component of these machine learning robots that might be less flashy, but is essential to their success: Big Data.  In the case of the Google driverless car, first an extremely detailed recreation of the world is built. The car must then only see the difference between what’s actually happening and its internal model. That’s where Big Data comes in: petabytes of data about the world and the ability to merge it with a stream of incoming data in real time make this miracle work.

That’s also where these systems take a sharp turn from many computing systems of the past; this Big Data must always be available and working. Clearly a system that drives a car must be available more than 99.99% of the time. 99.99% uptime would still mean roughly 8.6 seconds of failure for every 24 hours of driving, which is clearly not even close to acceptable.
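
The downtime figure above is simple arithmetic, and a short sketch makes it easy to check other availability targets. The function name `downtime_seconds` and the list of targets are illustrative choices, not anything taken from a WANdisco product.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in 24 hours


def downtime_seconds(availability_percent, period_seconds=SECONDS_PER_DAY):
    """Seconds of permitted downtime in a period at the given availability."""
    return period_seconds * (100 - availability_percent) / 100


# Compare a few common availability targets over 24 hours of operation.
for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_seconds(pct):.2f} s of downtime per 24 h")
```

At 99.99% this works out to about 8.64 seconds per day, and even "five nines" still leaves close to a second of failure in every 24 hours of driving.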

Of course, computers have been critical components in cars for many years. But there’s a big difference between these computers and the machine learning, Big Data driverless car of today. Unlike an embedded system that is self-contained in a controlled environment, today’s Big Data technology must work in the high-failure environment of distributed systems.

As the inventor of Paxos, Leslie Lamport, defined it:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”

Given this challenging environment, how does one obtain the kind of guaranteed availability that’s required for critically important functions such as driving a car? WANdisco’s core WAN-capable Paxos technology is the answer, removing single points of failure in existing technology and providing seamless redundancy with 100% data safety in high-failure environments.

So while the future promises a Big Data driven revolution of new capabilities, those capabilities rely on systems that must always work. That’s Why WANdisco.

Permission Precedence in Access Control Plus

Access Control Plus gives you a flexible permission system for both Subversion and Git.  Management is delegated with a hierarchical team system, a necessity for large deployments.  You can’t have every onboarding request bubbling up to the SCM administration team, after all.  Daily permission maintenance is, within boundaries, rightfully the place of team leaders.

But what happens if several rules seem to apply at the same time?  Consider a few examples of authorization rules for Git.  In these examples I’ll look at different rule scopes in terms of where the rule applies (the resource) and who the rule applies to (the target).

Same resource, same target

Harry wants to push a commit to the task105 branch of the Git repository named acme.

  • Harry belongs to the team called acme-devs, which has write access to the entire acme repo.
  • Harry also belongs to the team called acme-qa, which has read access to the entire repo.

In this case Harry has write access to the task105 branch.  Both rules apply to the same resource and target, so Harry gets the most liberal permission.

Different resource, same target

Now consider this case where again Harry wants to push a commit to the task105 branch.

  • Harry belongs to acme-qa which has read access to the entire repo.
  • Harry belongs to acme-leads, a sub-team of acme-qa, which has write access to task105.

In this case Harry again has write access.  The more specific resource of the rule for acme-leads (on the branch as opposed to the entire repo) takes precedence.

Different resource, different target

In another variation:

  • Harry has read access to the entire repo.
  • Harry belongs to acme-leads which has write access to task105.

In this case Harry again has write access.  The more specific resource of the rule for acme-leads (on the branch as opposed to the entire repo) takes precedence.

Different resource, different target – with a wrinkle

Now consider:

  • Harry has write access to the entire repo.
  • Harry belongs to acme-reviewers-only which has read access to task105.

In this case Harry again has write access.  The team rule that grants read access is considered first, but it doesn’t grant or deny the access level Harry needs (write access).  So we keep searching and find the more general rule that grants write access at the repo level.  If we actually wanted to prevent Harry from writing, the team rule would need to deny write access explicitly.

Same resource, different target

And in a final example:

  • Harry belongs to acme-reviewers-only which has read access to task105.
  • Another rule grants Harry himself write access to task105.

In this case Harry gets write access.  The two rules have the same resource (branch level), but the rule that applies to the more specific target (his own account versus a team) is applied first.

Rules of the road

To sum up, the rules of precedence for Git permissions are:

  • Rules on a more specific resource take precedence over more general rules.
  • If two rules apply to the same resource, then rules applying to a specific account take precedence over rules that apply to a team.  (All teams and sub-teams are considered as equivalent identities.)
  • When two rules are equivalent in resource and target, the more liberal rule takes precedence.
  • Rules are considered until a rule is found that grants or denies the requested access level.
  • If no rules apply, fall back on the default access level specified for Git MultiSite users.
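
The precedence rules above can be sketched in a few lines of Python. This is an illustrative model only, not Access Control Plus code: the `Rule` fields and the `resolve` function are hypothetical, and the sketch assumes that a write grant also satisfies a read request and that a deny at one level also blocks any higher level.

```python
from dataclasses import dataclass
from itertools import groupby

LEVELS = {"read": 1, "write": 2}


@dataclass(frozen=True)
class Rule:
    resource: tuple      # ("acme",) = whole repo, ("acme", "task105") = one branch
    target: str          # account name or team name
    is_team: bool        # teams and sub-teams are treated as equivalent identities
    level: str           # "read" or "write"
    allow: bool = True   # False models an explicit deny


def decides(rule, requested):
    """True if the rule grants or denies the requested level (assumed semantics)."""
    if rule.allow:
        return LEVELS[rule.level] >= LEVELS[requested]
    return LEVELS[rule.level] <= LEVELS[requested]


def resolve(rules, user, teams, resource, requested, default=False):
    """Return True if access is granted, False if denied or defaulted to no access."""
    applicable = [
        r for r in rules
        if resource[:len(r.resource)] == r.resource
        and (r.target in teams if r.is_team else r.target == user)
    ]

    # Most specific resource first; on the same resource, accounts before teams.
    def tier(r):
        return (-len(r.resource), r.is_team)

    for _, group in groupby(sorted(applicable, key=tier), key=tier):
        deciding = [r for r in group if decides(r, requested)]
        if deciding:
            # Rules equivalent in resource and target: the most liberal wins.
            return any(r.allow for r in deciding)
    return default  # no rule decided: fall back on the default access level


REPO, BRANCH = ("acme",), ("acme", "task105")

# "With a wrinkle": the branch-level read rule does not decide a write request,
# so the search continues to the repo-level write grant.
wrinkle = [
    Rule(REPO, "harry", False, "write"),
    Rule(BRANCH, "acme-reviewers-only", True, "read"),
]
print(resolve(wrinkle, "harry", {"acme-reviewers-only"}, BRANCH, "write"))  # True

# An explicit deny on the branch is what actually blocks the push.
blocked = [
    Rule(REPO, "harry", False, "write"),
    Rule(BRANCH, "acme-reviewers-only", True, "write", allow=False),
]
print(resolve(blocked, "harry", {"acme-reviewers-only"}, BRANCH, "write"))  # False
```

Sorting rules into tiers and stopping at the first tier that decides the request mirrors the "rules are considered until a rule is found" wording; the other examples in this post can be checked the same way by changing the rule list.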

And in case Harry has any doubts, he can always use the Rule Lookup tool to find out which rule applies.

Migrating to Git, Forensic Considerations

Git has unleashed an unusual number of migrations from legacy tools among a wide variety of companies. The desire to attract new developer talent is one reason we commonly hear. Another is that there is finally an open source version control tool with advantages, perceived as well as real, compelling enough to justify a migration.

Often there’s an initial desire to migrate everything to Git, and shut off the legacy system. If that system is commercial and requires ongoing licensing fees, there’s a stronger incentive. But if it’s an open source tool, there are reasons you might want to pay the maintenance cost of keeping it around.

Easier migrations

One reason is that it’s usually advisable to not attempt a full migration of all history, but instead pick major baselines and only migrate those. By leaving the legacy system running, but in read-only mode, you always have the chance to go back and find something in the full history. That leads us to the next topic.

Proof of invention

Many companies face infrequent but high-stakes litigation around intellectual property disputes. Take the example of an algorithm that’s at issue in a lawsuit. You need to prove you were using the algorithm prior to a certain date. Version control systems are ideal for this, but only if you have the complete historical record. Someone noticed that the CVS repository hadn’t been used in three years and deleted it? Oops.

Forensics in a hybrid system

Thinking ahead a few years, we now have all new development done in Git, with our trusty CVS server patiently waiting for the next lawsuit. Consider that an investigation begun today may start in the Git history, and then need to be traced into the read-only CVS history.

This means that you will need to be able to link history going backwards into time across your hybrid Git-CVS deployment. Practically, this means that this requirement should be taken into consideration during the initial migration to Git.
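
One way to plan for this, sketched here as an assumption rather than a feature of any particular migration tool, is to record a pointer to the legacy baseline in each migrated Git commit message and read it back during an investigation. The `Legacy-Baseline` trailer name, its `system:ref` format, and the `legacy_reference` helper are all hypothetical.

```python
import re

# Hypothetical trailer written into commit messages at migration time,
# e.g. "Legacy-Baseline: cvs:RELEASE_2_0".
TRAILER = re.compile(
    r"^Legacy-Baseline:\s*(?P<system>\w+):(?P<ref>\S+)\s*$",
    re.MULTILINE,
)


def legacy_reference(commit_message):
    """Return (system, ref) if the commit points into the legacy history, else None."""
    match = TRAILER.search(commit_message)
    if match is None:
        return None
    return match.group("system"), match.group("ref")


message = """Import baseline from legacy repository

Legacy-Baseline: cvs:RELEASE_2_0
"""

print(legacy_reference(message))                 # ('cvs', 'RELEASE_2_0')
print(legacy_reference("Fix typo in README\n"))  # None
```

Whatever convention is chosen, the point is that it has to be applied at migration time; it cannot easily be retrofitted once the Git history has been published.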

Imperfect history migration

You might think that doing a full history migration would be a fix for this. In some cases this might be advised; you should generally migrate enough history so that going back into the legacy system is an unusual event. However, the problem here is that perfect fidelity in history migration between SCM systems is rarely possible. There are differences in capabilities or metadata that may have no deterministic answer.

The legacy system remains the definitive system of record pertaining to your intellectual property. Further, you may need to trace that history as it spans legacy and new tools. While its use may be infrequent, you’ll likely be happy you planned ahead during the giddy days of your Git adoption.

HIPAA Compliance and Continuous Delivery

The HIPAA law poses a compliance challenge for developers of software that intersects electronic protected health information (ePHI). Part of the burden is showing proper control over the electronic system: how you provision for auditing, availability, access control, and so on.

If you’re a practitioner of DevOps and continuous delivery, you’ve got a good head start on meeting those challenges. DevOps and continuous delivery believe in the idea of configuration as code. In other words, all of your runtime configuration and environment data is stored in an SCM system like Git or Subversion. As a result, the SCM system is your system of record for how your software was actually deployed, and helps you demonstrate compliance with the HIPAA provisions.

There is, however, a slight wrinkle in the story: the SCM system is now a critical part of your runtime infrastructure, and most SCM systems are not designed to be highly available with no risk of data corruption.

That’s where WANdisco’s family of MultiSite and Clustering products for Git and Subversion comes into play. WANdisco provides a 100% uptime solution; every node in the deployment is a replicated peer, so the loss of a single server does not pose a problem. High Availability and Disaster Recovery are built in with automatic failover and recovery capabilities.

Moreover, these are zero data loss solutions. By the time a piece of runtime configuration data is committed, it is guaranteed to exist on more than one node, ensuring data integrity. Every site, including deployment sites, will see the right set of data.

In an environment bound by regulatory and compliance concerns, you need the peace of mind that a 100% uptime solution with guaranteed data integrity provides. Give us a call for more information on how Subversion and Git MultiSite and Clustering can help you meet your compliance demands.

Reflections on Subversion & Git Live 2014

Yesterday marked the conclusion of this year’s Subversion & Git Live conference tour through New York and San Francisco. This was also WANdisco’s second conference with the DVCS Git under our wing.

Growing Git Sophistication

Last fall for Subversion & Git Live 2013 we targeted Git materials at an introductory level. This turned out to be about exactly right, as our attendees were as novice as they were enthusiastic about the disruptive and beneficial effects of Git in their environments. This year, not only were virtually all attendees familiar with Git, they were also markedly more comfortable with it and the resulting impact on their development organizations.

Where a common question last year was “I’ve heard of Git and I’d like to learn something about it”, this year it was more likely to be “I have Git in my environment and what do you recommend for supporting it successfully?” Follow-up questions were more likely to be about specific tool stacks that could be deployed this year.

Strength of Subversion

All of this hot discussion played out against the backdrop of the enterprise workhorse, Subversion. Significant improvements in speed and scalability are part of the roadmap in 2013-2014, but there were also more ambitious discussions about assimilating more functionality from DVCSs like Git, and even ground-up designs for a new merge engine with move as a first-class operation. I’ve rarely seen WANdisco’s Subversion committers more engaged; fortunately WANdisco’s long resume of large customers means easy access to real-world use cases for complex enterprise software development.

What I see ahead

One change observed at this conference was an accelerating push to end-of-life expensive legacy commercial products, with ClearCase the easy target here. The relevance of newer commercial SCM systems outside niche industries continues to decline as SCM and version control commoditize around open source. Despite significant challenges for creating enterprise-class deployments, Git seems an unstoppable force as a developer productivity and talent attraction tool. WANdisco plays a significant role here, leapfrogging the ubiquitous Web UI paradigm around self-provisioning of repos and engineering a world-scale, foundational backbone for Git.

It was a great conference, and we hope to see you next year!

Apache Announces Subversion 1.8.9

We’re pleased to announce the release of Subversion 1.8.9 on behalf of the Apache Subversion project. Along with the official Apache Software Foundation source releases, our own fully tested and certified binaries are available from our website.

1.8.9 contains a number of bugfixes. For a complete list please check the Apache changelogs for Subversion 1.8.

You can download our fully tested, certified binaries for Subversion 1.8.9 free here.

WANdisco’s binaries are a complete, fully-tested version of Subversion based on the most recent stable release, including the latest fixes, and undergo the same rigorous quality assurance process that WANdisco uses for its enterprise products that support the world’s largest Subversion implementations.

Git Access Control Levels

It seems that every Git management solution has its own flavor of access control permissions.  I thought it’d be useful to have a quick matrix of the capabilities of WANdisco’s Access Control Plus.  Questions?  We’re here to help!

Feature                                             Available
Repository read/write permissions                   now
Branch write permissions                            now
Branch/tag create/delete permissions                now
Path write permissions                              2014
Regular expressions (in refs and paths)             2014
HTTP(S) and SSH protocols                           now
Enforced on all Git replicas (via Git MultiSite)    now
Unified interface for Subversion and Git            now

Unified Git and Subversion Management

Over the past several years the movement in ALM tools has been away from heavy, inflexible tools towards lighter and more flexible solutions. Developers want and need the freedom to experiment and work quickly without being bound by heavy processes and restrictions.

But, of course, an enterprise still needs some level of management and governance over software development. Now it looks like the pendulum is swinging back towards a useful middle ground – and WANdisco’s new Access Control Plus product strikes that fine balance between flexibility and guidance.

Access Control Plus is flexible because it lets team leaders manage access to their repositories.  Site administrators can set overall policies and make sure that the truly sensitive data stays safe. Access Control Plus provides for any level of delegated team management, letting the team leaders closest to the source code manage their teams and permissions. And with accounts backed by any number of LDAP or Active Directory authorities, the grunt work of account management is automated.

Yet Access Control Plus is still an authoritative resource for security, auditing and reporting. It covers permissions for all of your Subversion and Git repositories at any location. That’s important for a number of reasons:

  • Sanity! You need some form of consistent permission management over your repositories.
  • An audit trail of your inventions. With the new America Invents Act, a comprehensive record of your intellectual property is more important than ever.
  • Regulatory regimes. Whether it’s Sarbanes-Oxley, HIPAA, or PCI, can you prove accurately who was accessing and modifying your IP?  That’s a key concern for compliance officers.
  • DevOps. If you practice configuration as code, then some of your crown jewels are stored in SCM, and need to be managed appropriately.
  • Industry standards. From CMMI to ISO 9000, standard processes and controls are the cost of doing business in certain industries.  Access Control Plus ticks all of the auditing and reporting checkmarks for you.

Combined with SVN MultiSite Plus and Git MultiSite, Access Control Plus is a complete solution for making your valuable digital data highly available and secure. Be proactive – give us a call and figure out how to manage all of your Subversion and Git repositories.