Git Blog


Gerrit Scalability

As a fundamental part of the Android Open Source Project (AOSP), Gerrit has to support a large user base and a big data set.  In this article I’ll review Gerrit scalability from both an operational and a performance standpoint.

Operational Scalability

Let’s start with operational tasks:

  • Managing users.  Gerrit provides integration with most common enterprise authentication solutions including LDAP and Active Directory, so the Gerrit administrator should not have to worry much about user management.
  • Managing permissions.  Gerrit has a rich set of permissions that govern operations on code, code reviews, and internal Gerrit data.  The permission model is hierarchical, with any project able to inherit permissions from a parent project.  As long as the Gerrit administrator has set up sensible top level defaults, individual team leads can override the settings as necessary and permission management should be easy on a large scale.  The only potential wrinkle comes when Gerrit mirrors are used.  Unless you run the Gerrit UI in slave mode at every site, the mirrors will not have Gerrit access control applied.
  • Auditing.  Gerrit does not provide auditing, so this area can be a challenge.  You may have to set up your own tools to watch SSH and Apache logs as well as Gerrit logs.
  • Monitoring performance.  As a Gerrit administrator you’ll have to set up your own monitoring system using tools like Nagios and Graphite.  You should keep a particular eye on file system size growth, RAM usage, and CPU usage.
  • Monitoring mirrors.  Like most Git mirrors, a Gerrit mirror (as provided by the Gerrit replication plugin) is somewhat fragile.  There’s no automated way to detect that a mirror has fallen out of sync unless you monitor the logs for replication failures (or your users start complaining that their local mirror is out of date).  A sample replication configuration appears after this list.
  • HA/DR.  Gerrit has no HA/DR solution built-in.  Most deployments make use of mirrors for the repositories and database to support a manual failover strategy.
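For reference, mirroring in stock Gerrit is driven by the replication plugin’s etc/replication.config file.  A minimal configuration looks something like the sketch below; the host name, path, and thread count are placeholders, not recommendations.

# etc/replication.config (Gerrit replication plugin); names and paths are examples
[remote "mirror-site"]
  url = gerrit@mirror.example.com:/var/gerrit/git/${name}.git
  push = +refs/heads/*:refs/heads/*
  push = +refs/tags/*:refs/tags/*
  push = +refs/changes/*:refs/changes/*
  mirror = true
  threads = 3

Pushes happen asynchronously, which is why monitoring the replication log (as noted above) is the most reliable way to catch a mirror that has fallen behind.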

If you use Git MultiSite with Gerrit, the last two points are largely addressed.  Git MultiSite nodes are self-healing after temporary failures, and the Git MultiSite console lets you know about nodes that are down or transactions that have failed to replicate due to network issues.  As we’ll see in the next section, Git MultiSite also gives you a 100% uptime solution with automated failover out of the box.

Performance Scalability

Now on to performance.  Gerrit was designed for large deployments (hundreds of repositories, millions of lines of code, thousands of developers) and the Gerrit community has provided some innovations like bitmap indexes.

Nevertheless, running Gerrit on a single machine will eventually hit scalability limits.  Big deployments require big hardware (24-core CPUs, 100+ GB of RAM, fast I/O), and even then they often need several read-only mirrors for load balancing and remote-site support.

If you want to run a big Gerrit deployment without worrying about managing expensive hardware and monitoring a farm of mirrors, Git MultiSite provides an elegant solution.  Using active-active replication, you get a deployment of fully writable Gerrit nodes.  That means no single machine has to be sized as heavily, because you can deploy more writable nodes for load balancing.  You can also put fully writable nodes at remote locations for better performance over the WAN.  To put the icing on the cake, there is no single point of failure in Git MultiSite.  If you have five nodes in your Gerrit deployment, you can tolerate the loss of two of them without any downtime, giving you HA/DR out of the box.

 

And here’s Gerrit with Git MultiSite!

With the recent announcement of Gerrit support in Git MultiSite, it’s worth taking a step back and looking at Gerrit itself.  Gerrit, just like its logo, is a bit of an odd bird.  It has a huge user base and a dynamic community that includes the likes of Google and Qualcomm, yet it is little known outside that world.


Gerrit traces its lineage to Mondrian, a code review tool used internally at Google. Mondrian proved very popular and led to two open source descendants: Rietveld, a code review tool for Subversion and Git, and Gerrit, which was developed as the code review and workflow solution for the Android Open Source Project (AOSP).

In order to support AOSP, Gerrit was designed to be:

  • Scalable. It supports large deployments with thousands of users.
  • Powerful. The workflow engine enforces code review and automated build and test for every commit.
  • Flexible. Gerrit offers a delegated permission model with granular permissions as well as a Prolog interpreter for custom workflows (see the sample access configuration after this list).
  • Secure. Gerrit integrates with enterprise authentication mechanisms including LDAP, Active Directory, and OpenID, and can be served over SSH and HTTPS.
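To make the permission model a bit more concrete, here is a minimal sketch of the access sections in a Gerrit project.config (the file Gerrit keeps in each project’s refs/meta/config branch).  The group names and ref patterns are examples only.

# project.config access sections; group names and ref patterns are examples
[access "refs/heads/*"]
  read = group Developers
  push = group Developers
  label-Code-Review = -2..+2 group Developers
  label-Verified = -1..+1 group Build Bots
[access "refs/heads/stable/*"]
  push = group Release Managers

Because projects inherit from a parent (ultimately All-Projects), a team lead only needs to add the sections that differ from the top-level defaults.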

Gerrit offers three key features: repository management, access control, and the code review and workflow engine.

In future articles I’ll dive into more detail on Gerrit’s workflow and other features, but for now, I’ll conclude by talking about why we decided to put Git MultiSite behind Gerrit.

Gerrit is a scalable system, but still has a centralized architecture. Out of the box it has a master set of repositories and a simple master-slave replication system. That can lead to challenges in performance and uptime – exactly the problems that WANdisco solves with our patented active-active replication technology. Under Git MultiSite, Gerrit repositories can be replicated to any location for maximum performance, or you can add additional local repositories for load balancing. Access control is enforced with the normal Gerrit permissions, and code review and workflow still route through the Gerrit UI.

Gerrit with Git MultiSite gives you 100% uptime and the best possible performance for users everywhere. More details coming soon!

Experiences with R and Big Data

The next releases of Subversion MultiSite Plus and Git MultiSite will embed Apache Flume for audit event collection and transmission. We’re taking an incremental approach to audit event collection and analysis, as the throughput at a busy site could generate a lot of data.

In the meantime, I’ve been experimenting with some more advanced and customized analysis. I’ve got a test system instrumented with a custom Flume configuration that pipes data into HBase instead of our Access Control Plus product. The question then is how to get useful answers out of HBase for questions like: what’s the distribution of SCM activity across the nodes in the system?

It’s actually not too bad to get that information directly from an HBase scan, but I also wanted to see some pretty charts. Naturally I turned to R, which led me again to the topic of how to use R to analyze Big Data.

A quick survey showed three possible approaches:

  • The RHadoop packages provided by Revolution Analytics, which include RHBase and rmr (R MapReduce)
  • The SparkR package
  • The Pivotal package that lets you analyze data in Hawq

I’m not using Pivotal’s distribution and I didn’t want to invest time in a MapReduce-style analysis, so that left me with RHBase and SparkR.

Both packages were reasonably easy to install as these things go, and RHBase let me directly perform a table scan and crunch the output data set (a rough sketch of that step is below). I was a bit worried about what would happen once a table scan started returning millions of rows instead of thousands, though, so I wanted to try SparkR as well.
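Here’s roughly what the RHBase scan-and-crunch step looked like. The table name, column family, and row-key handling are made up for illustration, and the iterator-style calls follow the RHadoop rhbase examples rather than anything specific to my setup.

library(rhbase)                       # RHadoop rhbase package (assumed installed)
hb.init()                             # connect to the HBase Thrift gateway
iter <- hb.scan("authz_event", startrow = "0", colspec = "event:")
tally <- list()
while (length(rows <- iter$get(1000)) > 0) {
  for (r in rows) {
    node <- r[[1]]                    # assuming the node id is encoded in the row key
    tally[[node]] <- if (is.null(tally[[node]])) 1 else tally[[node]] + 1
  }
}
iter$close()                          # release the scanner

From there it’s a short step to turn the tally into a data frame and plot it.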

SparkR let me define a data source (in this case an export from HBase) and then run a functional reduce on it. In the first step I would produce some metric of interest (AuthZ success/failure for some combination of repository and node location) for each input line, and then reduce by key to get aggregate statistics. Nothing fancy, but Spark can handle a lot more data than R on a single workstation. The Spark programming paradigm fits nicely into R; it didn’t feel nearly as foreign as writing MapReduce or HBase scans. Of course, Spark is also considerably faster than normal MapReduce.

Here’s a small code snippet for illustration, filled out with an assumed CSV layout and Spark context setup so that it stands on its own:

 

library(SparkR)                        # the original AMPLab SparkR package
sc <- sparkR.init("local")             # Spark context; the master URL is an assumption
lines <- textFile(sc, "/home/vagrant/hbase-out/authz_event.csv")
mlines <- lapply(lines, function(line) {
  fields <- strsplit(line, ",")[[1]]             # CSV column layout is an assumption
  key <- paste(fields[1], fields[2], sep = ":")  # e.g. node and repository columns
  metric <- 1L                                   # count one AuthZ event per line
  return(list(key, metric))
})
parts <- reduceByKey(mlines, "+", 2L)  # sum the metric per key across 2 partitions
reduced <- collect(parts)              # bring the aggregated results back to R

 

In reality, I might use SparkR in a lambda architecture as part of my serving layer and RHBase as part of the speed layer.

It already feels like these extra packages are making Big Data very accessible to the tools that data scientists use, and given that data analysis is driving a lot of the business use cases for Hadoop, I’m sure we’ll see more innovation in this area soon.

Distributed Code Review

As I’ve written about previously, one of the compelling reasons to look at Git as an enterprise SCM system is the great workflow innovation in the Git community. Workflows like Git Flow have pulled in best practices like short-lived task branches and made them not only palatable but downright convenient. Likewise, the role of workflow tools like Gerrit should not be discounted. They’ve turned mandatory code review from an annoyance into a feature that developers can’t live without (although we call it social coding now).

But as any tool skeptic will tell you, you should hesitate before building your development process too heavily on these tools. You’ll risk locking in to the way the tool works – and the extra data that is stored in these tools is not very portable.

The data stored in Git is very portable, of course. A developer can clone a repository, maintain a fork, and still reasonably exchange data with other developers. Git has truly broken the bond between code and a central SCM service.

As fans of social coding will tell you, however, the conversation is often just as important as the code. The code review data holds a rich history of why a change was rejected, accepted, or resubmitted. In addition, these tools often serve as gatekeepers: if your pull request is rejected, your code isn’t merged.

Consider what happens if you decide you need to switch from one code review tool to another. All of your code review metadata is likely stored in a custom schema in a relational database. Moving, say, from Gerrit to GitLab would be a significant data migration effort – or you just accept the fact that you’ll lose all of the code review information you’ve stored in Gerrit.

For this reason, I was really happy to hear about the distributed code review system now offered in SmartGit. Essentially SmartGit is using Git to store all of the code review metadata, making it as portable as the code itself. When you clone the repository, you get all of the code review information too. They charge a very modest fee for the GUI tools they’ve layered on top, but you can always take the code review metadata with you, and they’ve published the schema so you can make sense of it. Although I’ve only used it lightly myself, this system breaks the chain between my Git repo and the particular tool that my company uses for repository management and access control.

I know distributed bug trackers fizzled out a couple of years ago, but I’m very happy to see Syntevo keep the social coding conversation in the same place as the code.

Git MultiSite Cluster Performance

A common misconception about Git is that having a distributed version control system automatically immunizes you from performance problems. The reality isn’t quite so rosy. As you’ll hear quite often if you read about tools like Gerrit, busy development sites make a heavy investment to cope with the concurrent demands on a Git server posed by developers and build automation.

Here’s where Git MultiSite comes into the picture. Git MultiSite is known for providing a seamless HA/DR solution and excellent performance at remote sites, but it’s also a great way to increase elastic scalability within a single data center by adding more Git MultiSite nodes to cope with increased load. Since read operations (clones and pulls) are local to a single node and write operations (pushes) are coordinated, with the bulk of the data transfer happening asynchronously, Git MultiSite lets you scale out horizontally. You don’t have to invest in extremely high-end hardware or worry about managing and securing Git mirrors.

So how much does Git MultiSite help? Ultimately that depends on your particular environment and usage patterns, but I ran a little test to illustrate some of the benefits even when running in a fairly undemanding environment.

I set up two test environments in Amazon EC2. Both environments used a single instance to run the Git client operations. The first environment used a regular Git server with a new empty repository accessed over SSH. The second environment instead used three Git MultiSite nodes.  All servers were m1.large instances.

The test ran a series of concurrent clone, pull, and push operations for an hour. The split between read and write operations was roughly 7:1, a pretty typical ratio in an environment where developers are pulling regularly and pushing periodically, and automated processes are cloning and pulling frequently. I used both small (1 KB) and large (10 MB) commits while pushing.

What did I find?

Git MultiSite gives you more throughput

Git MultiSite processed more operations in an hour. There were no dropped operations, so the servers were not under unusual stress.


Better Performance

Git MultiSite provided significantly better performance, particularly for reads. That makes a big difference for developer productivity.


More Consistent Performance

Git MultiSite provides a more consistent processing rate.


You won’t hit any performance cliffs as the load increases.


Try it yourself

We perform regular performance testing during evaluations of Git MultiSite. How much speed do you need?

Permission Precedence in Access Control Plus

Access Control Plus gives you a flexible permission system for both Subversion and Git.  Management is delegated with a hierarchical team system, a necessity for large deployments.  You can’t have every onboarding request bubbling up to the SCM administration team, after all.  Daily permission maintenance is, within boundaries, rightfully the place of team leaders.

But what happens if several rules seem to apply at the same time?  Consider a few examples of authorization rules for Git.  In these examples I’ll look at rules of different scope, in terms of where the rule applies (the resource) and who it applies to (the target).

Same resource, same target

Harry wants to push a commit to the task105 branch of the Git repository named acme.

  • Harry belongs to the team called acme-devs, which has write access to the entire acme repo.
  • Harry also belongs to the team called acme-qa, which has read access to the entire repo.

In this case Harry has write access to the task105 branch.  Both rules apply to the same resource and target, so Harry gets the most liberal permission.

Different resource, same target

Now consider this case where again Harry wants to push a commit to the task105 branch.

  • Harry belongs to acme-qa which has read access to the entire repo.
  • Harry belongs to acme-leads, a sub-team of acme-qa, which has write access to task105.

In this case Harry again has access.  The more specific resource of the rule for acme-leads (on the branch as opposed to the entire repo) takes precedence.

Different resource, different target

In another variation:

  • Harry has read access to the entire repo.
  • Harry belongs to acme-leads which has write access to task105.

In this case Harry again has access.  The more specific resource of the rule for acme-leads (on the branch as opposed to the entire repo) takes precedence.

Different resource, different target – with a wrinkle

Now consider:

  • Harry has write access to the entire repo.
  • Harry belongs to acme-reviewers-only which has read access to task105.

In this case Harry again has access.  The team rule that grants read access is considered first, but it doesn’t grant or deny the access level Harry needs (write access).  So we keep searching and find the more general rule that grants write access at the repo level.  If we actually wanted to prevent Harry from writing, the team rule would need to deny write access explicitly.

Same resource, different target

And in a final example:

  • Harry belongs to acme-reviewers-only which has read access to task105.
  • Another rule grants Harry himself write access to task105.

In this case Harry gets write access.  The two rules have the same resource (branch level), but the rule that applies to the more specific target (his own account versus a team) is applied first.

Rules of the road

To sum up, the rules of precedence for Git permissions are:

  • Rules on a more specific resource take precedence over more general rules.
  • If two rules apply to the same resource, then rules applying to a specific account take precedence over rules that apply to a team.  (All teams and sub-teams are considered as equivalent identities.)
  • When two rules are equivalent in resource and target, the more liberal rule takes precedence.
  • Rules are considered until a rule is found that grants or denies the requested access level.
  • If no rules apply, fall back on the default access level specified for Git MultiSite users.
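To make the ordering concrete, here is a minimal R sketch of that precedence logic. The data frame columns and the numeric encodings are illustrative assumptions, not the actual Access Control Plus implementation.

resolve_access <- function(rules, requested = "write", default = FALSE) {
  # Higher numbers are more specific or more liberal:
  #   resource_spec: 2 = branch, 1 = whole repo
  #   target_spec:   2 = individual account, 1 = team or sub-team
  #   liberality:    2 = write, 1 = read
  ord <- order(-rules$resource_spec, -rules$target_spec, -rules$liberality)
  for (i in ord) {
    r <- rules[i, ]
    if (r$level == requested)       # a rule only settles the request if it
      return(r$effect == "allow")   # explicitly grants or denies this level
  }
  default                           # no rule applied: fall back on the default
}

# The "wrinkle" example above: a repo-wide write grant for Harry himself,
# plus a branch-level read grant for the acme-reviewers-only team
rules <- data.frame(resource_spec = c(1, 2), target_spec = c(2, 1),
                    liberality = c(2, 1), level = c("write", "read"),
                    effect = c("allow", "allow"), stringsAsFactors = FALSE)
resolve_access(rules, "write")      # TRUE: the branch read rule doesn't decide "write"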


 

And in case Harry has any doubts, he can always use the Rule Lookup tool to find out which rule applies.


Git Access Control Levels

It seems that every Git management solution has its own flavor of access control permissions.  I thought it’d be useful to have a quick matrix of the capabilities of WANdisco’s Access Control Plus.  Questions?  We’re here to help!

 

Feature                                             Available
Repository read/write permissions                   now
Branch write permissions                            now
Branch/tag create/delete permissions                now
Path write permissions                              2014
Regular expressions (in refs and paths)             2014
HTTP(S) and SSH protocols                           now
Enforced on all Git replicas (via Git MultiSite)    now
Unified interface for Subversion and Git            now

Unified Git and Subversion Management

Over the past several years the movement in ALM tools has been away from heavy, inflexible tools towards lighter and more flexible solutions. Developers want and need the freedom to experiment and work quickly without being bound by heavy processes and restrictions.

But, of course, an enterprise still needs some level of management and governance over software development. Now it looks like the pendulum is swinging back towards a useful middle ground – and WANdisco’s new Access Control Plus product strikes that fine balance between flexibility and guidance.

Access Control Plus is flexible because it lets team leaders manage access to their repositories.  Site administrators can set overall policies and make sure that the truly sensitive data stays safe. Access Control Plus provides for any level of delegated team management, letting the team leaders closest to the source code manage their teams and permissions. And with accounts backed by any number of LDAP or Active Directory authorities, the grunt work of account management is automated.

Yet Access Control Plus is still an authoritative resource for security, auditing and reporting. It covers permissions for all of your Subversion and Git repositories at any location. That’s important for a number of reasons:

  • Sanity! You need some form of consistent permission management over your repositories.
  • An audit trail of your inventions. With the new America Invents Act, a comprehensive record of your intellectual property is more important than ever.
  • Regulatory regimes. Whether it’s Sarbanes-Oxley, HIPAA, or PCI, can you prove accurately who was accessing and modifying your IP?  That’s a key concern for compliance officers.
  • DevOps. If you practice configuration as code, then some of your crown jewels are stored in SCM, and need to be managed appropriately.
  • Industry standards. From CMMI to ISO 9000, standard processes and controls are the cost of doing business in certain industries.  Access Control Plus ticks all of the auditing and reporting boxes for you.

Combined with SVN MultiSite Plus and Git MultiSite, Access Control Plus is a complete solution for making your valuable digital data highly available and secure. Be proactive – give us a call and figure out how to manage all of your Subversion and Git repositories.

 

The AIA Prior Use Defense and DevOps

Configuration as Highly Valuable Code

As I wrote about earlier, the expanded scope of the ‘prior use’ defense in the America Invents Act (AIA) gives you improved protection against patent litigation. If you’ve adopted DevOps and Continuous Delivery, you need to make sure that you have a strong record of how you’re deploying your software, not just how it was developed. After all, some of your secret sauce may well be your deployment process – a clever way of scaling your application on Azure or EC2, or perhaps a sophisticated canary deployment technique.

Proving that your clever deployment tricks were in use at some point in time is just another reason to treat configuration as code and store it in your Git repositories. To do that, you need to work through a few key problems:

  • How do you secure the production data while still making less sensitive deployment data available to development teams?
  • How do you prove that your production data was actually in use?
  • How do you manage having Git repositories on production app servers that may be outside your firewall?

WANdisco’s Git Access Control and Git MultiSite provide easy answers to those challenges.  Git Access Control lets you control write access down to the file level, so you can easily let developers modify staging data without giving them access to production data in the same repository. These permissions are applied consistently on every repository, on every server.

Similarly, Git Access Control provides comprehensive audit capabilities so you can see when data was cloned or fetched to a particular server. You can also use these auditing capabilities to satisfy regulatory concerns over access to production environment data.

Finally, Git MultiSite’s flexible replication groups let you securely control where and how a DevOps repository is used. For example, you may want to have the DevOps repository available for full use on internal servers, but only available for clones and pulls on a production server.

If DevOps has taught us anything, it’s that configuration and environment data is as important as source code in many cases. Git Access Control and Git MultiSite give you the control you need to confidently store configuration as code and establish your ‘prior use’ history.

Intro to Gerrit – Subversion and Git Live 2014

You may be aware of Gerrit, the web based code review system. Our Director of Product Marketing, Randy Defauw, has a number of good reasons for adopting it as part of your development process:

The most interesting thing about Gerrit is that it facilitates what some call ‘continuous review’. Code review is often seen as a bottleneck in continuous delivery, but it’s also widely recognized as a way to improve quality. Gerrit resolves this conundrum with innovative features like dynamic review branch creation and the incorporation of continuous build into the heart of the review process.
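For context, the “dynamic review branch creation” comes from Gerrit’s refs/for convention: developers push to a magic ref, and Gerrit turns each new commit into a reviewable change instead of advancing the target branch directly. The remote and branch names below are just placeholders.

# Push local commits to Gerrit for review against master; Gerrit records them
# as changes under refs/changes/... rather than updating master itself
git push origin HEAD:refs/for/master

Once the change is approved (and verified by the automated build), Gerrit submits it to master on the developer’s behalf.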

Gerrit is also notable as the most enterprise-friendly Git code review system, despite its open source roots. It integrates with all standard authentication frameworks, has a delegated permission model, and was designed for large deployments.

Randy is Director of Product Marketing for WANdisco’s ALM products. He focuses on understanding in detail how WANdisco’s products help solve real world problems, and has deep background in development tools and processes. Prior to joining WANdisco he worked in product management, marketing, consulting, and development. He has several years of experience applying Subversion and Git workflows to modern development challenges.

If you’d like to hear more about Gerrit, or Git in general, come see us at Subversion and Git Live 2014.