Git Blog

DevOps is eating the world

You know a technology trend has become fully mainstream when you see it written up in the Wall Street Journal.  So it goes with DevOps, as this recent article shows.

DevOps and continuous delivery have been important trends in many firms for several years.  It’s all about building higher quality software products and delivering them more quickly.  For SaaS companies it’s an obvious fit as they sometimes push out minor changes many times a day.  But even companies with more traditional products can benefit.  And internal IT departments can use DevOps principles to start saying “yes” to business users more often.

For example, let’s say that your business analytics team asks for a small Hadoop cluster to try out some of the latest machine learning algorithms on Spark.  Saying “yes” to that request should only take hours, not weeks.  If you have a private cloud and the right level of automation, you can spin up a new Spark cluster in minutes.  Then you can work with the analysts to automate the deployment of their algorithms.  If they’re wildly successful and they need to move their new project to a production cluster it’s just a matter of deploying somewhere with more resources.

Of course, none of this comes easily.  On the operations side you’ll need to invest in the right configuration and private cloud infrastructure.   Tools like Puppet, Ansible, and Docker can capture the configuration of servers and applications as code.

But equally important is the development infrastructure.  Companies like Google practice mainline development: all of their work is done from the trunk or mainline, supported by a massive continuous build and test infrastructure.  And Gerrit, a tool that Google sponsors, is perhaps the best code review tool for continuous delivery.

If you look at potential bottlenecks in a continuous delivery pipeline, you need to consider how code gets to the mainline, and then how it gets deployed.  With Gerrit there are only two steps to the mainline:

  • Commit and push the code.  Gerrit makes a de facto review branch on the fly and initiates a code review.
  • Approve the merge request.  Gerrit handles the merge automatically unless there’s a conflict.

With this system you don’t even need to ask a developer to open a pull request or create a private branch.  Gerrit just automates all of that.  And Gerrit will also invoke any continuous build and test automation to make sure that code is passing those tests before a human reviewer even looks at it.

Once it’s on the mainline the rest of the automation kicks in, and those operational tools become important to help you rapidly spin up more realistic test environments.

As you can imagine, this type of infrastructure can put a heavy load on your development systems.  That’s why WANdisco has put the muscle of Git MultiSite behind Gerrit, giving you a horizontally scalable Gerrit infrastructure.

Latest Git binaries available for download

As part of our participation in the open source SCM community, WANdisco provides up-to-date binary downloads for Git and Subversion for all major platforms.  We now have the latest Git binaries available for download on our Git downloads site.

One interesting new feature is git push --atomic.  When you’re pushing several refs (e.g. branches) at once, this feature makes sure that either all the refs are accepted or none are.  That’s useful if you’re making related changes on several branches at once.  Those who merge patches onto several releases at once are often in this position.
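As a rough illustration, the sketch below pushes two branches atomically to a throwaway bare repository standing in for a real server.  The directory layout, file name, and branch names are made up for the demo.

```shell
#!/bin/sh
# Sketch: atomic push of two branches to a temporary local "server".
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q --bare "$work/server.git"      # hypothetical central repository
git init -q "$work/dev"
cd "$work/dev"
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is called master
git remote add origin "$work/server.git"

echo fix > patch.txt
git add patch.txt
git commit -q -m "Fix applied to mainline"
git branch release-1.0                     # the same fix on a release branch

# Either both refs are updated on the server, or neither is.
git push -q --atomic origin master release-1.0

pushed_refs=$(git ls-remote --heads origin | wc -l | tr -d ' ')
echo "branches on server: $pushed_refs"
```

Without --atomic, a rejection of one branch would still let the other branch through, leaving the releases out of step.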

The Git community has done a great job of ensuring a stable upgrade process, so there’s generally little concern about upgrading.  It’s always a good idea to review the release notes of course.

Scalable and Secure Git

Now that WANdisco has released an integration between Git MultiSite and GitLab, it’s worth putting the entire Git lineup at WANdisco into perspective.

Git MultiSite is the core product providing active-active replication of Git repository data. This underpins our efforts to make Git more reliable and better performing. Active-active replication means that you have full use of your Git data at several locations, not just in a single ‘master’ Git server. You get full high availability and disaster recovery out of the box, and you can load balance your end user and build demands between several Git servers. Plus, users at every location get fast local read and write access. As one of our customers recently pointed out, trying to make regular Git mirrors work this way requires a few man-years of effort.

On top of Git MultiSite you have three options for user management, security, and collaboration.

  • Use WANdisco’s Access Control Plus for unified, scalable user and permission management. It features granular permissions, delegated team management, and full integration with SVN MultiSite Plus for unified Subversion-Git administration.
  • Use Gerrit to take advantage of powerful continuous review workflows that underpin the Android community.
  • Use GitLab for an enterprise-grade social coding and collaboration platform.

Not sure which direction to take? Our solution architects can help you choose between Subversion, Git, and all the other tools that you have to contend with.

An essential Git plugin for Gerrit

One of the frequent complaints about Gerrit is the esoteric syntax of pushing a change for review:

git push origin HEAD:refs/for/master

Translated, that means: push your current HEAD to the remote named origin, targeting the special review ref refs/for/master (a review destined for master) rather than the master branch itself.
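If you don’t want to install anything, the refspec in the push command above can be wrapped in a small shell helper.  This is only a sketch; the function name review_refspec is made up for illustration and is not part of Git or Gerrit.

```shell
#!/bin/sh
# review_refspec BRANCH
# Print the refspec that sends the current HEAD to Gerrit's review ref for BRANCH.
review_refspec() {
    printf 'HEAD:refs/for/%s\n' "$1"
}

# Usage against a real Gerrit remote (not run here):
#   git push origin "$(review_refspec master)"
review_refspec master
```

The left side of the colon is what you are sending (HEAD) and the right side is where it should land on the remote (the review ref).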

If you’re a Gerrit user, you need this plugin:

https://github.com/openstack-infra/git-review

It automates some of the Gerrit syntax so now you can just run:

git review

The only problem is that when you push to a non-Gerrit repository you start to wonder why your review command doesn’t work anymore.  That’s how deeply ingrained code review is in the Gerrit workflow.

Binary artifact management in Git

Paul Hammant has an interesting post on whether to check binary artifacts into source control.  Binary artifact management in Git is an interesting question and worth revisiting from time to time.

First, a bit of background.  Centralized SCM systems like Subversion and ClearCase are a bit more capable than Git when it comes to handling binary files.  One reason is sheer performance: since a Git repository has a full copy of the entire history, you just don’t want your clone (working copy) to be too big.  Another reason is assembling your working views.  ClearCase and to a lesser extent Subversion give you some nice tools to pick and choose pieces of a really big central repository and assemble the right working copy.  For example in a ClearCase config spec you can specify that you want a certain version of a third party library dependency.  Git on the other hand is pretty much all or nothing; it’s not easy to do a partial clone of a really big master repository.

Meanwhile, there has been a trend in development toward more formal build and artifact management systems.  You can define a dependency graph in a tool like Maven and use Maven or Artifactory or even Jenkins to manage artifacts.  Along with offering benefits like not storing derived objects in source control, this trend covers Git’s weak spot in handling binaries.

Now I’m not entirely sure about Paul’s reasons for recommending a switch back to managing binaries in Git.  Personally I prefer to properly capture dependencies in a configuration file like Maven’s POM, as I can exercise proper change control over that file.  The odd thing about SCM working view definitions like config specs is that they aren’t strongly versioned like source code files are.

But that being said, you may prefer to store binaries in source control, or you may have binaries that are actually source artifacts (like graphics or multimedia for game development).  So is it hopeless with Git?

Not quite.  There are a couple of options worth looking at.  First, you could try out one of the Git extensions like git-annex or git-media.  These have been around a long time and work well in some use cases.  However they do require extra configuration and changes to the way you work.

Another interesting option is the use of shared back-end storage for cloned repositories.  Most Git repository management solutions that offer forks use these options for efficient use of back-end storage space.  If you can accept working on shared development infrastructure rather than your own workstation, then you can clone a Git repository using the file protocol with the -s option to share the object folder.  There’s also the --reference option to point a new Git clone at an existing object store.  These options make cloning relatively fast as you don’t have to create copies of large objects.  It doesn’t alleviate the pain of having the checked out files in your clone directory, but if you’re working on a powerful server that may be acceptable.  The bigger drawback to the file protocol is the lack of access control.
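Here is a minimal sketch of both clone options, run against a temporary repository.  The repository names and the 1 MiB dummy file are placeholders for a real project with large binaries.

```shell
#!/bin/sh
# Sketch: clones that borrow objects from an existing repository instead of copying them.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/big-repo"               # hypothetical repository with binaries
cd "$work/big-repo"
dd if=/dev/zero of=asset.bin bs=1024 count=1024 2>/dev/null   # 1 MiB stand-in binary
git add asset.bin
git commit -q -m "Add binary asset"
cd "$work"

# -s (--shared): the clone points at the source's object store.
git clone -q -s "$work/big-repo" shared-clone

# --reference: clone from a URL but reuse objects already present locally.
git clone -q --reference "$work/big-repo" "file://$work/big-repo" reference-clone

# Both clones record the borrowed object store here:
cat shared-clone/.git/objects/info/alternates
```

Keep in mind that a clone made this way depends on the source repository’s objects staying in place, so it suits managed server-side storage better than ad hoc workstation use.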

Management of large binaries is still an unsolved problem in the Git community.  There are effective alternatives and work-arounds but it’ll be interesting to see if anyone tries to solve the problem more systematically.

GitLab and Git MultiSite: Architecture

The architecture of GitLab running with Git MultiSite is worth exploring.  In the interest of saving a thousand words, here’s the picture.

[Figure: GitLab deployment architecture with Git MultiSite]

As you can see, the topology is quite a bit more complex when you use a Git repository management system that uses multiple data stores.  Git MultiSite coordinates with GitLab to replicate all repository activity, including wiki repositories.  Git MultiSite also replicates some important files like the GitLab authorization files for access control.

As for the other data stores, we’re relying on GitLab’s ability to run with multiple web apps connected to a single logical relational database and a single logical Redis database.  They can be connected directly or via pass-through mirrors.  Kudos to the GitLab team for a clean architecture that facilitates this multi-master setup; they’ve avoided some of the nasty caching issues that other applications encounter.  This topology is in fact similar to what you can do with GitLab when you use shared storage for the repositories.  Git MultiSite provides the missing link: full repository replication with robust performance in a WAN environment and a shared-nothing architecture.

Short of relying completely on Git as a data store for code reviews and other metadata, this architecture is about as clean as it gets.

Now for some nuts and bolts…

We are making some simplifying assumptions for the first release of GitLab integration.  The biggest assumption is that all nodes run all the software, and that all repositories originate in GitLab and exist on all nodes.  We plan to relax some of these constraints in the future.

And what about performance?  Well, I’m happy to relate that you’ll see very good performance in all cases and much improved performance in some cases.  Balancing repository activity across several nodes gives better throughput when the system is under practical load.

[Figure: performance chart]

Well, that picture saved a few words, but nothing speaks better than a demo or a proof-of-concept deployment.  Contact us for details!

Scalable Social Coding

I’m very pleased to announce that Git MultiSite now formally supports GitLab, a leading on-premise Git collaboration and management suite.  With this and future integrations, Git MultiSite’s promise of a truly distributed Git solution is coming to fruition.

WANdisco first announced Git MultiSite in 2013.  Git MultiSite provides our patented active-active replication for Git, giving you a deployment of fully writable peer nodes instead of a single ‘master’ Git server.  The next step came with Access Control Plus in 2014, which brought Git repositories under a unified security and management umbrella.

And now we’re tackling the final piece of the puzzle.  Those of you active in the Git ecosystem know that most companies deploy Git as part of an integrated repository management solution that also provides social coding and collaboration tools — code review, wikis, and sometimes lightweight issue tracking.

In one sense, Git MultiSite is still a foundational technology that can replicate Git repositories managed by almost any system.  And indeed we do have customers who deployed Git MultiSite with GitLab long before we did any extra work in this area.

The devil is in the details though.  For one thing, some code review systems actually modify a Git repository using non-standard techniques in response to code review activity like approving a merge request.  So we had to make a few under-the-hood modifications to support that workflow.

Perhaps more importantly, Git MultiSite and Access Control Plus provide consistent (and writable) access to repository and access control data at all sites.  But if the collaboration tool is a key part of the workflow, you really need that portal to be available at every node as well.  And we’ve worked hard with the GitLab crew to make that possible.

So what does that all mean?  You get it all:

  • LAN speed access to repositories at every site
  • A built-in HA/DR strategy for zero down time
  • Easy scalability for build automation or a larger user base
  • Fast access to the GitLab UI for code reviews and more at every site
  • Consistent access control at every site
  • All backed by WANdisco’s premier support options

Interested?  I’ll be publishing more details on the integration in the near future.  In the meantime, give us a call and we’ll give you a full briefing.

Advanced Gerrit Workflows

As a final note on Gerrit workflows, it’s worth looking into Gerrit’s Prolog engine if you need a customized code approval process.  Now, I know what you’re thinking – do you really need to learn Prolog to use Gerrit?  Certainly not!  You can use Gerrit out of the box very effectively.  But if you need a highly tailored workflow, you can either write a Java plugin or write some rules in Prolog.  The Prolog syntax is well suited for logical expressions, and you can check the Prolog rules in to a Gerrit repo as regular text files.  That’s easier than writing, building, and maintaining a Java plugin.

So what can you do with Prolog?  Two very useful things:

  • Submit rules define when a change can be submitted.  The default is to require one vote of the highest option from each rule category, with no lowest votes in any category.  A common choice is to require a human ‘+2’ and a ‘+1’ from the CI system.  Submit rules can be defined globally or per project.  Submit rules are given a set of facts about a commit (author, message, and so on) and then decide whether the commit can be submitted.
  • Submit types define how a change can be submitted, per project.  You can choose from fast forward only, merge if necessary, merge always, cherry pick, or rebase if necessary.

There’s a great Gerrit Prolog workbook to get you started, and Gerrit provides a Prolog shell and debugging environment.

As a simple example, here’s a submit type that only allows fast-forward updates on release branches, but allows other submit types on other branches.

submit_type(fast_forward_only) :-
  gerrit:change_branch(B), regex_matches('refs/heads/release.*', B),
  !.
submit_type(T) :- gerrit:project_default_submit_type(T).

Hacking Prolog is not for the brand-new-to-Gerrit, but don’t be scared of it either.  It gives you a tremendous amount of control over how changes flow into your repositories.  If you store configuration data in Git and are subject to PCI regulations or other compliance measures, then a strong Gerrit workflow explicitly defined in Prolog will help satisfy your compliance concerns.

As always if you have any questions just ask.  We have a team of Git experts waiting to help.

Gerrit Administration

So far I’ve been talking a lot about Gerrit’s strong points. Now it’s time to focus on one of Gerrit’s comparative weak points: administration. Gerrit has all the tools you need to run a stable and secure deployment, but you need to be a master mechanic, not a weekend hobbyist.

Although Gerrit has an easy ‘quick start’ mode that’s great for trying it out, you need to do some research before running it in a production environment. Here are some areas that will need attention.

User Management

Gerrit supports several authentication mechanisms. The default is OpenID, which is suitable for open source projects or for enterprise environments that have an internal OpenID provider. Other sites will want to look at using LDAP, Active Directory, or possibly Apache for authentication. Similarly, you can maintain groups internally or via an external directory.
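As a sketch of what switching to LDAP can look like: Gerrit’s site configuration file, etc/gerrit.config, uses the Git config file format, so you can edit it with git config. The site directory, LDAP host, and directory bases below are placeholders, not values from a real deployment.

```shell
#!/bin/sh
# Sketch: point a Gerrit site at LDAP instead of OpenID.
set -eu
site=$(mktemp -d)                 # stand-in for a real Gerrit site directory
mkdir -p "$site/etc"
cfg="$site/etc/gerrit.config"

git config -f "$cfg" auth.type LDAP
git config -f "$cfg" ldap.server ldap://ldap.example.com          # hypothetical host
git config -f "$cfg" ldap.accountBase ou=people,dc=example,dc=com # hypothetical base DN
git config -f "$cfg" ldap.groupBase ou=groups,dc=example,dc=com   # hypothetical base DN

cat "$cfg"
```

A real site would also need bind credentials and a Gerrit restart; check the Gerrit configuration documentation for the settings that apply to your directory.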

Protocols

Gerrit can serve Git repositories over SSH or HTTP/S. SSH is a convenient way to start for small teams, as each user can upload a public key. However maintaining SSH keys for a large user base is cumbersome, and for large deployments we recommend serving over HTTP/S.

Of course you should use HTTPS to secure both the Gerrit UI and the repositories.

Authorization

Gerrit has a robust access control system built in. You set permissions in a hierarchy, with global defaults set for the ‘All Projects’ project. You can set up other project templates and have new projects inherit from the template of your choice.

You can manage permissions on:

  • Branches and tags
  • Change sets uploaded for review
  • Configuration including access control settings and submit rules
  • Code review workflow steps including approving and verifying changes

Integrations

You’ll want to hook up your build system to Gerrit to make best use of its workflow. (The build system can vote on whether to accept a change.) Similarly, you might want to integrate an external ticket system or wiki.

Scalability

I’ll cover this topic in more detail later on. But for now I’ll mention that you should have mirrors available at each location to provide the best performance. If you need Gerrit to enforce access control on the mirrors then you’ll need to run Gerrit in slave mode against a database mirror.

Sound complicated? It is. That’s why WANdisco provides Git MultiSite for Gerrit. You’ll get active-active fully replicated and writable repositories at each site, with regular Gerrit access control enforced.

Need help?

Call our Git support specialists if you need a hand getting started with Gerrit.

Gerrit Workflow

As I mentioned in an earlier post, Gerrit has a unique workflow.  It has some similarities to pull and merge request models, but is more flexible and more automated.  That goes back to its roots in the Android ecosystem; at the scale of work in that community, bottlenecks need to be few and far between.

[Figure: Gerrit workflow model]

Gerrit’s model is unique in several ways:

  • By default all changes are put into temporary pending review branches, which are created automatically on push.
  • The workflow engine enforces rules before changes can be merged to the permanent repository.  Notably, you can require human code review, automated build and test, or both, and use the access control system to specify who’s allowed to perform various steps in the workflow.
  • Review IDs are generated automatically by a hook and group the successive patch sets of a change.  Additional patch sets can be pushed to rework a change based on the result of a review.
  • Gerrit’s Prolog engine can be used to create customized review approval conditions.

Gerrit’s workflow engine is well tuned for ‘continuous review’, which means that commits can be reviewed rapidly and merged into trunk (master or mainline) very quickly.  In some cases only certain commits would be manually reviewed, while all commits would be subject to automated build and test.  Gerrit is thus a good choice for large scale deployments that want to move towards continuous delivery practices.