As a fundamental part of the Android Open Source Project (AOSP), Gerrit has to support a large user base and a big data set. In this article I’ll review Gerrit scalability from both operational and performance standpoints.
Let’s start with operational tasks:
- Managing users. Gerrit provides integration with most common enterprise authentication solutions including LDAP and Active Directory, so the Gerrit administrator should not have to worry much about user management.
- Managing permissions. Gerrit has a rich set of permissions that govern operations on code, code reviews, and internal Gerrit data. The permission model is hierarchical, with any project able to inherit permissions from a parent project. As long as the Gerrit administrator has set up sensible top-level defaults, individual team leads can override the settings as necessary, and permission management should stay easy even at large scale. The only potential wrinkle comes when Gerrit mirrors are used: unless you run the Gerrit UI in slave mode at every site, the mirrors will not have Gerrit access control applied.
- Auditing. Gerrit does not provide auditing, so this area can be a challenge. You may have to set up your own tools to watch SSH and Apache logs as well as Gerrit logs.
- Monitoring performance. As a Gerrit administrator you’ll have to set up your own monitoring system using tools like Nagios and Graphite. You should keep a particular eye on file system size growth, RAM usage, and CPU usage.
- Monitoring mirrors. Like most Git mirrors, a Gerrit mirror (as provided by the Gerrit replication plugin) is somewhat fragile. There’s no automated way to detect if a Gerrit mirror is out of sync, unless you monitor the logs for replication failures (or your users start to complain that their local mirror is out of date).
- HA/DR. Gerrit has no HA/DR solution built-in. Most deployments make use of mirrors for the repositories and database to support a manual failover strategy.
If you use Git MultiSite with Gerrit, those last two points will be largely addressed. Git MultiSite nodes are self-healing in the case of temporary failure, and the Git MultiSite console will let you know about nodes that are down or transactions that have failed to replicate due to network issues. Similarly, as we’ll see in the next section, Git MultiSite gives you a 100% uptime solution with automated failover out of the box.
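For deployments that rely on plain mirrors, the out-of-sync check described in the list above can be approximated by comparing the refs that the primary and a mirror advertise. The following is a minimal sketch, not part of Gerrit or the replication plugin: it assumes `git` is on the PATH and that both repositories are reachable by URL, and the function names are hypothetical.

```python
import subprocess


def parse_ls_remote(output):
    """Turn `git ls-remote` output (one '<sha><TAB><ref>' per line) into {ref: sha}."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split(None, 1)
        refs[ref.strip()] = sha
    return refs


def diff_refs(primary, mirror):
    """Return {ref: (primary_sha, mirror_sha)} for every ref that differs.

    A missing ref is reported with None on the side where it is absent.
    """
    drift = {}
    for ref in sorted(set(primary) | set(mirror)):
        p, m = primary.get(ref), mirror.get(ref)
        if p != m:
            drift[ref] = (p, m)
    return drift


def ls_remote(url):
    """Ask a remote repository for its refs (hypothetical helper; needs git installed)."""
    result = subprocess.run(
        ["git", "ls-remote", url],
        check=True, capture_output=True, text=True,
    )
    return parse_ls_remote(result.stdout)
```

Calling `diff_refs(ls_remote(primary_url), ls_remote(mirror_url))` from a scheduled job returns an empty dictionary when the two agree. A mirror can legitimately lag for a short time while replication is in flight, so it is probably wiser to alert only when the same drift persists across several runs.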
Now on to performance. Gerrit was designed for large deployments (hundreds of repositories, millions of lines of code, thousands of developers), and the Gerrit community has contributed performance innovations such as bitmap indexes, which speed up clone and fetch operations.
Nevertheless, a Gerrit deployment running on a single machine will eventually hit scalability limits. Big deployments require big hardware (24-core CPUs, 100+ GB of RAM, fast I/O), and even so they may use several read-only mirrors for load balancing and remote site support.
If you want to run a big Gerrit deployment without worrying about managing expensive hardware and monitoring a farm of mirrors, Git MultiSite provides an elegant solution. Using active-active replication, you’ll have a deployment of fully writable Gerrit nodes. That means no single machine has to be sized as large, because you can deploy more writable nodes for load balancing. You can also put fully writable nodes at remote locations for better performance over the WAN. To put the icing on the cake, there is no single point of failure in Git MultiSite. If you have 5 nodes in your Gerrit deployment, you can tolerate the loss of 2 of those nodes without any downtime, giving you HA/DR out of the box.
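The 5-node, 2-failure figure is what you would expect if the cluster needs a strict majority of nodes to stay available. The article does not spell out Git MultiSite's quorum rule, so treat the majority rule below as an assumption; the function name is hypothetical and the sketch only illustrates the arithmetic.

```python
def tolerated_failures(nodes):
    """Number of nodes that can be lost while a strict majority survives.

    Assumes a simple majority quorum; this is an illustration, not a
    statement of how Git MultiSite is actually configured.
    """
    if nodes < 1:
        raise ValueError("a deployment needs at least one node")
    quorum = nodes // 2 + 1  # smallest strict majority
    return nodes - quorum


for n in (3, 4, 5, 7):
    print(n, "nodes tolerate", tolerated_failures(n), "failures")
```

Under that assumption, adding a fourth node to a three-node deployment does not raise the number of failures it can absorb, which is one reason odd node counts are common in majority-based systems.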