I recently found myself reflecting on how our continuous integration builds have changed, and was amazed at how a series of seemingly small tweaks and improvements has accumulated, over time, into a business-changing new "normal".
Imagine the scene: it's 9AM on a Monday morning in late 2016, and you start the day with a coffee and a scan through your emails. You see the usual "Build Failed" email from the build server, which sets off the following chain of events:
- Firstly you ask yourself, "was it me?"
- Then, has anyone else looked into it?
- If not, then it's time for you to investigate.
- You discover there were five changes merged on Friday evening, and any one of them could be the reason the unit tests have broken and no artefacts were generated.
- Do you revert them, and ask each developer to investigate?
- Or do you dig in and identify the cause?
With the build broken, the Dev Team can't pull and work with the latest changes, QA might be blocked from testing, and people are looking for someone to blame (in a good-natured way, but it's embarrassing for the recipient all the same).
- You dig in and find the problem. It wasn't your fault, but the culprit is on vacation; this was their parting gift before a two-week break.
- You identify the issue, fix it, and merge your patch… Fingers crossed that what worked for you, now works on Jenkins.
It's now 12 noon, and you have a successful build, crisis resolved.
Hmmm... Maybe not. In that 3-hour window, 8 people across development and QA were effectively blocked. So behind the jovial social stigma for the person who broke the build, there is a real business cost for this downtime. In this case, 3 man-days of effort are lost!
This was a situation we found ourselves in more than once, so we decided to take action.
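The three-day figure above is simple arithmetic, and it can be handy to have it as a function when making the case for change. This is only a sketch: the 8-hour working day is my assumption, and the function name is made up for illustration.

```python
# Figures taken from the scenario above.
HOURS_BLOCKED = 3    # 9AM until 12 noon
PEOPLE_BLOCKED = 8   # developers and QA combined

# Assumption: a working day is 8 hours long.
HOURS_PER_WORKING_DAY = 8


def lost_person_days(hours_blocked, people_blocked, hours_per_day=HOURS_PER_WORKING_DAY):
    """Convert a period of blocked work into lost person-days of effort."""
    return hours_blocked * people_blocked / hours_per_day


lost = lost_person_days(HOURS_BLOCKED, PEOPLE_BLOCKED)
print(f"{lost:g} person-days lost")  # prints "3 person-days lost"
```

Plug in your own team size and outage length to see how quickly a "quick fix" adds up.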
# Don't Break The Build…
- GitBlit - Git repository server.
  - Served out of our Sheffield office.
  - Patch files would be created locally by developers for each of their changes.
- JIRA - Bug tracking tool.
  - A patch for each ticket would be added to JIRA for code review.
- Jenkins - Build Server.
  - Builds would be triggered by polling the git repository.
  - Circa 40 jobs executed to cover all distributions to which we deploy.
  - Each job would execute: Unit tests, PMD, FindBugs, Packaging.
  - Served out of our San Ramon office.
| Pros | Cons |
|---|---|
| Repeatable Builds | Post Commit Tests. |
| Git SCM | Post Commit Static Analysis. |
| | Broken builds affect wider team. |
| | Single point of failure (Git) |
| | Patch files with little to no context. |
| | JIRA inappropriate code review tool. |
| | Merge conflicts when applying patches. |
| | Long running build jobs (> 2 hours combined) |
| | Transatlantic Clones for Builds and Developers outside UK. |
| | No integration tests. |
As you can see, the fundamental problem with this approach is that any failures are identified after a change is merged, so the change and the problems associated with it are passed into the domain of the wider team.
# Break it Early… (The New Normal):
18 months on, through a series of tweaks and, in places, changes in tooling and mindset, developers now get more immediate feedback with a lot less effort, resulting in higher-quality code and little to no downtime.
Our first step was to adopt Gerrit MS internally, giving ourselves live Git repositories at each of our development centres. This not only reduced build times through reduced latency but made developers' lives easier too. Gerrit also gave us a gateway to control access to the central Git repositories. At a high level, we got a fully featured code review tool that allowed us to collaboratively review changes and provide feedback, but deeper down we got the ability to front load verification tests, moving the point at which we identified and dealt with failures from after the merge to before it.
This switch from back loading to front loading failures was a big change in mindset for management and developers alike. All of a sudden, we had a roadblock preventing us from merging changes: "surely this can't be better?" For a period of time, I felt the pressure, and would continually hear complaints of "we just need to get this change in" or "the tests are too slow", but I knew we were doing the right thing simply because I wasn't getting any more "build failed" emails. In the first week alone, we must have caught at least 10 different failures. How many man-days had we already saved?
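To make the front-loading idea concrete, here is a toy model of a pre-merge gate. It is a sketch of the principle only, not how Gerrit or Jenkins implement it: the `Change` class, the check names, and the `try_merge` function are all hypothetical, and the checks are stand-in lambdas rather than real test runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Change:
    """A proposed change awaiting verification (hypothetical model)."""
    change_id: str
    author: str
    # Maps a check name to a function that returns True when the check passes.
    checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)


def verify(change: Change) -> List[str]:
    """Run every check and return the names of those that failed."""
    return [name for name, check in change.checks.items() if not check()]


def try_merge(change: Change, mainline: List[str]) -> bool:
    """Merge only when verification passes; a failure never reaches mainline."""
    failures = verify(change)
    if failures:
        # Only the author is affected; everyone else keeps a working mainline.
        print(f"{change.change_id}: blocked for {change.author}, failed {failures}")
        return False
    mainline.append(change.change_id)
    return True


mainline: List[str] = []
good = Change("change-1", "alice", {"unit-tests": lambda: True, "pmd": lambda: True})
bad = Change("change-2", "bob", {"unit-tests": lambda: False, "findbugs": lambda: True})

try_merge(good, mainline)
try_merge(bad, mainline)
print(mainline)  # only the verified change was merged
```

The point of the model is the ordering: the checks run before the merge, so a failing change stays with its author instead of landing on everyone else.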
At no point do I recall anyone complaining about the fact that we were catching breakages early, more the pleasant surprise that public shaming was now a thing of the past, and no longer was anyone blocked by a bad merge. However, in order to make this a success, we needed to kill off the questions around performance and remove the sense of the new way being a bottleneck to our productivity. What were we still doing wrong?
When we first moved the tests, we simply ran the original post-commit build as a pre-commit step as well, but without the deployment of artefacts. That was OK, but we were doing too much. So we stripped the packaging elements out of the verification build and began to parallelise the test execution, both by physically distributing it across different Jenkins slaves and by introducing performance tweaks in Gradle. There's a bit of trial and error required here, but I found this blog from Keval Patel quite practical in approaching Gradle performance changes. Ultimately, we cut the build time from hours to minutes.
Once we got to this point, the complaints died away, but the constant drive to improve the system continues. This requires feedback from the whole team. As such, one of the recent innovations came from our QA team, who wanted to introduce smoke tests that do some level of sanity testing before a build reaches them, rather than receiving a build only to find that core functionality has been compromised.
The interesting thing from my perspective is that, after a short period of unrest, what we do is now just considered the new "normal", and the mindset of both the team and management has adjusted to treat identifying problems as a primary concern and not, as was previously the case, an afterthought.
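One way to think about distributing tests across build machines is as a balancing problem. The sketch below uses a greedy "longest test first" split. It is an illustration under assumptions rather than a description of our Jenkins setup: the suite names and timings are invented, and `shard_tests` is a hypothetical helper.

```python
from typing import Dict, List


def shard_tests(durations: Dict[str, float], workers: int) -> List[List[str]]:
    """Greedily assign tests to workers, longest first, to balance run time."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    shards: List[List[str]] = [[] for _ in range(workers)]
    loads = [0.0] * workers
    # Sort by descending duration, breaking ties by name for a stable result.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        target = loads.index(min(loads))  # the least loaded worker so far
        shards[target].append(name)
        loads[target] += seconds
    return shards


# Invented suite timings in seconds, purely for illustration.
durations = {
    "CoreTest": 600,
    "ReplicationTest": 480,
    "ApiTest": 300,
    "CliTest": 240,
    "UtilTest": 60,
}

shards = shard_tests(durations, workers=2)
for index, shard in enumerate(shards):
    total = sum(durations[name] for name in shard)
    print(f"worker {index}: {shard} ({total}s)")
```

With these made-up numbers, a 28-minute serial run becomes two 14-minute shards. Real suites rarely balance this neatly, which is where the trial and error comes in.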
So, what we have now looks like the following:
- Gerrit MS - Active-Active Git version control system.
  - Served from Sheffield, San Ramon, and Boston. (All
  - GUI based code review tool.
- JIRA - Bug tracking tool.
  - Automatically linked to Gerrit review for changes.
- Jenkins - Build/Verification Server.
  - Builds triggered by Gerrit Events.
  - Configuration Matrix Jobs.
  - Verification jobs for Testing, PMD, and FindBugs.
  - Build Jobs for packaging (run nightly).
  - Smoke Test Job for short Integration/Sanity Test.
  - Served out of San Ramon office.
| Pros | Cons |
|---|---|
| Repeatable Builds | Post Commit Smoke Test meaning we still don't catch some problems until after they are merged. |
| GUI based code review tooling, with full context. | |
| Pre-commit Unit Tests, meaning failures are caught early. | |
| Pre-commit Static Analysis, meaning failures are caught early. | |
| Broken verification build affects only the individual developer, and prevents an otherwise broken commit being merged. | |
| Replicated Git Repository local to each development centre/build server. | |
| Rebasing/Merging through Gerrit UI. | |
| Verification Job (~40 matrix combinations) takes <= 30 minutes to execute combined. | |
| Build Job (~40 matrix configurations) takes <= 30 minutes to execute combined. | |
| Integration Tests executed for every merged commit. | |
Our build system isn't perfect by any means, but it's better than what we had, and soon it will be better than it is now.
As one of my colleagues pointed out, "now we don't have to worry about the builds anymore; we just fix real problems".
So... we must be doing something right. 😃