Application Specific Data? It’s So 2013

Looking back at the past 10 years of software, the word ‘boring’ comes to mind.  The buzzwords were things like ‘web services’ and ‘SOA’.  CIOs loved the promise of these things, but they could not deliver.  The idea of build once and reuse everywhere really was the ‘nirvana’.

Well it now seems like we can do all of that stuff.

As I’ve said before, Big Data is not a great name because it implies that all we are talking about is a big database with tons of data.  Actually, that’s only part of the story.  Hadoop is the new enterprise application platform, and the key word there is platform.  If you can have a single general-purpose data store that services ‘n’ applications, then the whole notion of database design is over.  Think about the new breed of apps on a cell phone, the social media platforms and the web search engines: most of them already work this way, storing data in a general-purpose, non-specific data store that is then used by a wide variety of applications.  The new phrase for this data store is a ‘data lake’, implying a large quantum of ever-growing and changing data stored without any specific structure.

Talking to a variety of CIOs recently, I’ve found they are very excited by the prospect of amalgamating data so it can be used, and of bringing into play data that previously could not be used: unstructured data in a wide variety of formats like Word documents and PDF files.  This also means the barriers to entry are low.  Many people believe that adopting Hadoop requires a massive re-skilling of the workforce.  It does, but not in the way most people think.  Actually getting the data into Hadoop is the easy bit (‘data ingestion’ is the new buzzword).  It’s not like the old relational database days, where you first had to model the data using normalization techniques and then use ETL to get it into a usable format.  With a data lake you simply set up a server cluster and load the data; creating a data model and using ETL is simply not required.

The real transformation and re-skilling is in application development.  Applications are moving to the data – in today’s client-server world it’s the other way around.  We have seen this type of re-skilling before, like the move from COBOL to object-oriented programming.

In the same way that client-server technology disrupted mainframe computer systems, big data will disrupt client-server.  We’re already seeing this in the market today.  It’s no surprise that the most successful companies in the world today (Google, Amazon, Facebook, etc.) are all actually big data companies.  This isn’t a ‘might be’ – it’s already happened.

Gerrit Administration

So far I’ve been talking a lot about Gerrit’s strong points. Now it’s time to focus on one of Gerrit’s comparative weak points: administration. Gerrit has all the tools you need to run a stable and secure deployment, but you need to be a master mechanic, not a weekend hobbyist.

Although Gerrit has an easy ‘quick start’ mode that’s great for trying it out, you need to do some research before running it in a production environment. Here are some areas that will need attention.

User Management

Gerrit supports several authentication mechanisms. The default is OpenID, which is suitable for open source projects or for enterprise environments that have an internal OpenID provider. Other sites will want to look at using LDAP, Active Directory, or possibly Apache-based HTTP authentication. Similarly, you can maintain groups internally or via an external directory.

Protocols

Gerrit can serve Git repositories over SSH or HTTP/S. SSH is a convenient way to start for small teams, as each user can upload a public key. However, maintaining SSH keys for a large user base is cumbersome, and for large deployments we recommend serving over HTTP/S.
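Getting a user set up for SSH access mostly comes down to generating a key pair and pasting the public half into the Gerrit web UI. A minimal sketch (the file path and comment are illustrative, not requirements):

```shell
set -e
# Generate a fresh key pair for Gerrit SSH access (path is illustrative)
rm -f /tmp/gerrit_key /tmp/gerrit_key.pub
ssh-keygen -q -t ed25519 -N "" -C "dev@example.com" -f /tmp/gerrit_key
# The public half is what gets pasted into the SSH keys page of your
# Gerrit user settings
cat /tmp/gerrit_key.pub
```

Once the key is registered, clones and pushes go through Gerrit’s built-in SSH daemon (port 29418 by default).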

Of course you should use HTTPS to secure both the Gerrit UI and the repositories.

Authorization

Gerrit has a robust access control system built in. You set permissions in a hierarchy, with global defaults set for the ‘All Projects’ project. You can set up other project templates and have new projects inherit from the template of your choice.

You can manage permissions on:

  • Branches and tags
  • Change sets uploaded for review
  • Configuration including access control settings and submit rules
  • Code review workflow steps including approving and verifying changes

Integrations

You’ll want to hook up your build system to Gerrit to make best use of its workflow. (The build system can vote on whether to accept a change.) Similarly, you might want to integrate an external ticket system or wiki.

Scalability

I’ll cover this topic in more detail later on. But for now I’ll mention that you should have mirrors available at each location to provide the best performance. If you need Gerrit to enforce access control on the mirrors then you’ll need to run Gerrit in slave mode against a database mirror.

Sound complicated? It is. That’s why WANdisco provides Git MultiSite for Gerrit. You’ll get active-active fully replicated and writable repositories at each site, with regular Gerrit access control enforced.

Need help?

Call our Git support specialists if you need a hand getting started with Gerrit.

Unlimited Holidays? Old news to us!

Don’t get me wrong. It’s a great idea, though it also looks a bit like an attempt to sell a book – but this is Sir Richard Branson, a very smart and exceedingly canny man, who I believe has pledged to never undertake any task in life if he can’t make any money from it. This may sound mercenary but to my knowledge Sir Richard has never done so at the expense of or by stepping on other people. Which is nice.
Anyway, holidays. To all of us here at WANdisco, this kind of thing is old news. I’m lucky enough to work for a company that adopted the same policy a couple of years ago and I tell you what – it’s liberating, is probably the best word. I realise it may not work for every individual, but to know that you’re trusted to do your job and to know enough about what your colleagues are doing and what projects are on the go and to plan your holidays around that is something special.
Much like Netflix, we’ve found that treating people like grownups works. If you’re forced to report weekly, daily, even hourly in some cases on what you’re doing, and need to put your hand up to ask if you can use the bathroom, do you feel trusted? It’s a weird feeling, having been out of school for several years, to find yourself in an environment that’s not much different. No one wants to feel like just a number, and policies (or the lack of them!) such as these have a big impact on working life.
A common question when people announce this sort of thing is ‘won’t the office just be empty all the time?’. Here at WANdisco we found that not to be the case; in actual fact, the last time we crunched the numbers we had to go out and make sure people took their statutory minimum holiday entitlement – in addition to the 8 bank holidays. All of us appreciate the fact that we’re given the choice to take holiday when we need it, but for the most part we love coming to work.
It may not be the sort of thing that could work at your company, but if you want to engender satisfaction and loyalty in your workforce and if you want them to be proud of the company they work for, it’s certainly worth considering.

Starting at WANdisco (part 2)

Part 1

So, that was the majority of my first few months. I can do forums, blogging, writing, all that – that’s fine, but I needed to learn Subversion and Git because I need to be able to answer forum posts helping people use them and, arguably more importantly, to replicate issues that are raised and report them to the developers.

Just as an aside here, this is a fairly important place to be – in between the devs and the customers, understanding the language of both sides and translating from one to the other. I find it extremely satisfying and quite often have a lot of fun with it.

*ahem* training. Subversion was where I started, which is probably the smart move as it’s a fair bit simpler than Git though arguably not as good depending on your point of view. My understanding is that Subversion was written so that non-coders could have some control over what code gets committed or not, whereas Git was written by coders for coders, with the things that coders want in it, hence it’s significantly more complex.

It was good training, in fact the same training that our support engineers are given when they start (and our support engineers are incredible guys). It taught me a lot about Subversion – especially because it was written for Windows and TortoiseSVN, and I was following it using SmartSVN on a mixture of OS X and Linux. In all honesty I can totally recommend that approach, as you learn so much more by applying instructions written for one tool to another that is similar in a lot of ways but fundamentally different in others.

The svn command line stuff is all the same no matter the operating system – you’re giving commands to the program, so they’re the same whatever platform it’s running on. It’s when you get to the GUI stuff that things are different. TortoiseSVN is not the same as SmartSVN, and when your instructions are to view the repository log, or even find the graph version of the log, with helpful screenshots of a totally different application, there’s a lot of looking up in help files and googling.

And as for setting up a server…well. Windows may have its share of detractors, but tickboxes for ‘start SVN server on startup’ and ‘install as a service’ basically take care of everything you need to worry about for a standard setup so it’s hard to argue. It took several VMs before I had a working SVN over http server running on Linux and (yes, noob, I know) several more before I had one that would still be working after a reboot.
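For reference, the Linux side mostly comes down to a mod_dav_svn block in the Apache configuration. A rough sketch (the paths and auth file name are illustrative):

```apache
# Serve every repository under /var/svn at http://server/svn/<name>
<Location /svn>
  DAV svn
  SVNParentPath /var/svn
  AuthType Basic
  AuthName "Subversion repositories"
  AuthUserFile /etc/svn-auth-users
  Require valid-user
</Location>
```

This assumes mod_dav_svn is installed and loaded; user accounts are then added with Apache’s htpasswd tool.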

Git…now that took a little longer. The training was a bit more in depth, but also very good in that it’s basically a list of tasks – achievements, if you will – and fairly vague ones at that. It also included a list of resources, although I mostly used the Git book (http://git-scm.com/book) which is invaluable, seriously. I’m not sure if it works this way for everyone but I certainly remember a lot more when I’ve had to figure something out for myself.

For example, “Rebase to edit a commit message”. That was it, that’s the full instruction. Not the first one in the training though, so at least when I came to it I definitely knew what committing was and why it would have a message. Rebasing I had to read up on. As I said though, fortunately for me and indeed everyone else, the Git book is brilliant.

So I learned a lot about Git, GitLab and Gitosis, and in the process a fair bit about Ubuntu and CentOS as well (I’d used Ubuntu before – in fact it runs my home server), and came to the conclusion that I like both of them, even though installing and configuring Git over HTTP on Linux is not the easiest thing I’ve ever done. Throw something like GitLab with its dependency on Ruby into the mix and you may well spend a fair amount of time following installation guides.

So, training done, let’s hit the forums – but not literally, because that would be silly. Forums are the lifeblood of my job and the hub of the WANdisco community, which I am here to help and grow as much as possible – we’ll leave aside the occasional urge to lmgtfy (which I’ve managed to resist so far).

At the moment things are fairly quiet, but the spam is cleaned out daily (I make sure of that) so that’s improving things, and now we have someone in there during (UK) business hours as well. At present things aren’t busy enough to warrant them being looked at outside those times – though I’m sure some of our guys in the US look through them from time to time – but if I have my way (and I fully intend to) it’ll get a lot busier.

So, how?

Well, in the first instance, by cleaning up the spam and being present in the forums. Then getting the word out. Social media is very powerful, but I think our best strategy is to be as knowledgeable and helpful as possible. The more that happens and the more we get out there and help with stuff, the more word will spread. Along with our own forums for Subversion, Git and Hadoop there’s StackOverflow and LinkedIn for those more technical queries, Facebook and to some degree LinkedIn again for less techy more human stories, and Twitter to tie things together and also point out new articles, forum threads and with any luck, engage in some banter as well.

The blogs, then – release blogs usually, for a new version of one of our products, but if something interesting happens then we like to talk about it, so we do. Hence this, and other blogs you’ll see shortly. We want to talk about what we’re doing a bit more in the office, whether it’s related to Big Data, improving our working environment, or just plain having fun.

So that’s it, really. Hopefully this has given you some insight into my journey and an idea of what we’re hoping to accomplish in the near future. Beyond that? I dunno. World domination might be nice.

 


If you want to find me you can on the above forums, I’m on Twitter as @WANdisco_Matt or there’s always my LinkedIn page – give me a shout if I can help with anything, and cheers for reading this far :)

Gerrit Workflow

As I mentioned in an earlier post, Gerrit has a unique workflow.  It has some similarities to pull and merge request models, but is more flexible and more automated.  That goes back to its roots in the Android ecosystem; at the scale of work in that community, bottlenecks need to be few and far between.

[Diagram: the Gerrit review model]

Gerrit’s model is unique in a couple of ways:

  • By default all changes are put into temporary pending review branches, which are created automatically on push.
  • The workflow engine enforces rules before changes can be merged to the permanent repository.  Notably, you can require human code review, automated build and test, or both, and use the access control system to specify who’s allowed to perform various steps in the workflow.
  • Change-Ids are generated automatically via a commit hook and group successive patch sets together.  Additional patch sets can be uploaded for rework based on the results of a review.
  • Gerrit’s Prolog engine can be used to create customized review approval conditions.
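The first point is worth making concrete. A contributor never pushes to a branch directly; they push to the magic refs/for/<branch> namespace, and Gerrit turns the push into a pending change. Against a plain Git remote, as in this self-contained sketch with throwaway paths, the ref is simply created – but the command is exactly what a Gerrit user runs:

```shell
set -e
# Throwaway "server" repo and working clone (paths are made up)
rm -rf /tmp/gr-server.git /tmp/gr-client
git init -q --bare /tmp/gr-server.git
git clone -q /tmp/gr-server.git /tmp/gr-client
cd /tmp/gr-client
git config user.email dev@example.com && git config user.name Dev
git commit -q --allow-empty -m "example change"
# Against a real Gerrit server this push opens a code review instead of
# updating master directly; plain Git just creates the ref
git push -q origin HEAD:refs/for/master
git ls-remote origin refs/for/master
```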

Gerrit’s workflow engine is well tuned for ‘continuous review’, which means that commits can be reviewed rapidly and merged into trunk (master or mainline) very quickly.  In some cases only certain commits would be manually reviewed, while all commits would be subject to automated build and test.  Gerrit is thus a good choice for large scale deployments that want to move towards continuous delivery practices.

 

Starting at WANdisco

9 years. I hadn’t expected to last at a job that long, but then I’d never had a job that felt like a career before. Unfortunately, it stopped feeling like a career and went back to being a job, so when a new opportunity knocked I answered with ebullience.

I’d been working in various customer forums and social media for the past half decade or so, and the opportunity to become Communications Lead for WANdisco was quite simply far too good to pass up.

So, that was it… off I went. It’s surprisingly easy to change jobs, in spite of how difficult it seems. Bear in mind if you’re thinking similarly, it’s the change that we fear and it’s nothing to be scared of. It’s a good thing. Chances are it’s what you need, especially if you feel bogged down and like you aren’t going anywhere. As an aside, if you like what you’re reading and think we’d be a good fit for you we are recruiting at the moment – why not check out the posts we have on offer at http://www.wandisco.com/careers?

Having said that though, moving from ISP support (essentially) to supporting version control systems is a fair leap and has involved an awful lot of learning. This also has been a good thing.

So, the runup to the change. Some clandestine emails (from a personal account of course), an after work visit to the new office for a chat, and finally the handing in of notice, which was kind of satisfying but mostly…melancholy, I think is the best word for it, though it wasn’t unpleasant. After sorting out the remainder of my holidays and arranging for a week off in between jobs (heartily recommended and well enjoyed), the first day dawned.

[Photo: office panorama]

Apologies for potatocam. Panorama shots are like that sometimes.

As luck would have it, a few others had trodden the path I was soon to walk so I wasn’t heading into a strange place filled with new people – several of them I’d worked with before which certainly helped tamp down the first day nerves. I even found myself sat next to a friend I’d had since secondary school, which was an interesting experience – we spent more than a few classes sat next to each other and while we hadn’t done the same in, oh, twenty years or so, it felt eerily familiar. Fortunately we were both professional enough to not let things interfere with the work that has to be done.

The other people I didn’t know? Lovely, lovely people. All of them. Especially the content team (but then I’m biased, and also a part of that team. Coinkydink? Decide for yourself). I feel like I fitted in well and nothing has happened to make me think otherwise so I’m going to assume the feeling is mutual or at least not totally opposite.

And the coffee? Oh my, the coffee.

[Photo: the office coffee machine]

The coffee machine says ‘COFFEE READY’. The sticker says ‘GLADIATOR READY’.

Never underestimate the power of a decent coffee machine. You’ll save so much money, at least you will if you like coffee. It’s what, £4 for a decent sized cost-bucks? Twice a day for some people, especially in the IT industry. There’s also pool and ping pong if you like that sort of thing, which I do. So that’s nice.

[Photo: the pool and ping pong tables]

Also bike parking and meeting rooms with panoramic floor to ceiling windows (not pictured).

So that’s the people and the office, summed up in a couple of paragraphs. I could go on, but I don’t think that would be the best thing in the world, so I will move on to training and learning and working which are all things that happen in the world of jobs.

To start, version control. I’ve not written code. I’ve tinkered, and could – with much messing about and no small amount of internet searching – probably hack existing code with copy and pasted bits of other code in order to get it doing what I want it to do. I realise that’s how most hackers get started, and I enjoy it, but I’ve not done enough to actually learn code. I could explain an array or a variable, but I couldn’t write one without googling.

Therefore, I have never used version control. It sounds simple enough, right? Keep a copy of this code, if someone makes a change remember both how it was and how it is now, and give each change a sequential reference.

Now, let’s scale that up.

But… but we can’t. You have a repository server, which clients connect to and commit code. How can you scale that?

Well, you have more than one repository server.

Eh?

What do you mean, ‘Eh’? More than one. Many. Many servers, for many many many clients.

But…

Well, indeed. How do those servers know about each other? How do they know when a client has connected and added more changes and files, and how do they talk to each other to make sure there aren’t conflicts and that changes aren’t missed?

That’s what we do. We sell software (and support for said software) that guarantees 100% uptime for distributed version control systems. We have a number of large clients with big names, too. (Oooh, get me.) We also do training, which is lucky as (to finally close this rapidly expanding circle of text) I needed some.

 

 

Part two to follow in a week or so. If you want to find me you can on our forums, I’m on Twitter as @WANdisco_Matt or there’s always my LinkedIn page – give me a shout if I can help with anything, and cheers for reading this far :)

 

Gerrit Scalability

As a fundamental part of the Android Open Source Project (AOSP), Gerrit has to support a large user base and a big data set.  In this article I’ll review Gerrit scalability from both a performance and operational standpoint.

Operational Scalability

Let’s start with operational tasks:

  • Managing users.  Gerrit provides integration with most common enterprise authentication solutions including LDAP and Active Directory, so the Gerrit administrator should not have to worry much about user management.
  • Managing permissions.  Gerrit has a rich set of permissions that govern operations on code, code reviews, and internal Gerrit data.  The permission model is hierarchical, with any project able to inherit permissions from a parent project.  As long as the Gerrit administrator has set up sensible top level defaults, individual team leads can override the settings as necessary and permission management should be easy on a large scale.  The only potential wrinkle comes when Gerrit mirrors are used.  Unless you run the Gerrit UI in slave mode at every site, the mirrors will not have Gerrit access control applied.
  • Auditing.  Gerrit does not provide auditing, so this area can be a challenge.  You may have to set up your own tools to watch SSH and Apache logs as well as Gerrit logs.
  • Monitoring performance.  As a Gerrit administrator you’ll have to set up your own monitoring system using tools like Nagios and Graphite.  You should keep a particular eye on file system size growth, RAM usage, and CPU usage.
  • Monitoring mirrors.  Like most Git mirrors, a Gerrit mirror (as provided by the Gerrit replication plugin) is somewhat fragile.  There’s no automated way to detect if a Gerrit mirror is out of sync, unless you monitor the logs for replication failures (or your users start to complain that their local mirror is out of date).
  • HA/DR.  Gerrit has no HA/DR solution built-in.  Most deployments make use of mirrors for the repositories and database to support a manual failover strategy.
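To put the monitoring point in practical terms, the basics can be scripted and fed to cron or a tool like Nagios. A minimal sketch (the site path is hypothetical, and is created here only so the example runs anywhere):

```shell
GERRIT_SITE=/tmp/gerrit-site-demo    # stand-in for your real Gerrit site path
mkdir -p "$GERRIT_SITE/git"          # created so the sketch is self-contained
du -sh "$GERRIT_SITE/git"                           # repository disk growth
free -m | awk 'NR==2 {print "RAM used (MB):", $3}'  # memory pressure
uptime                                              # CPU load average
```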

If you use Git MultiSite with Gerrit, those last two points will be largely addressed.  Git MultiSite nodes are self-healing in the case of temporary failure, and the Git MultiSite console will let you know about nodes that are down or transactions that have failed to replicate due to network issues.  And similarly, as we’ll see in the next section, Git MultiSite gives you a 100% uptime solution with automated failover out of the box.

Performance Scalability

Now on to performance.  Gerrit was designed for large deployments (hundreds of repositories, millions of lines of code, thousands of developers) and the Gerrit community has provided some innovations like bitmap indexes.

Nevertheless, running Gerrit on a single machine will eventually reach some scalability limits.  Big deployments require big hardware (24 core CPUs, 100+ GB of RAM, fast I/O), and even so they may use several read-only mirrors for load balancing and remote site support.

If you want to run a big Gerrit deployment without worrying about managing expensive hardware and monitoring a farm of mirrors, Git MultiSite provides an elegant solution.  Using active-active replication, you’ll have a deployment of fully writable Gerrit nodes.  That means no single machine has to be sized as large, as you can deploy more writable nodes for load balancing.  You can also put fully writable nodes at remote locations for better performance over the WAN.  To put the icing on the cake, there is no single point of failure in Git MultiSite.  If you have 5 nodes in your Gerrit deployment you can tolerate the loss of 2 of those nodes without any downtime, giving you HA/DR out of the box.

 

And here’s Gerrit with Git MultiSite!

With the recent announcement of Gerrit support in Git MultiSite, it’s worth taking a step back and looking at Gerrit itself.  Gerrit, just like its logo, is a bit of an odd bird. It has a huge user base and dynamic community including the likes of Google and Qualcomm, yet is little known outside of that community.

[Image: the Gerrit logo]

Gerrit is one of two known descendants of Mondrian, a code review tool used internally at Google. Mondrian proved very popular and led to Rietveld, an open source code review tool for Subversion and Git, and then to Gerrit, which was developed as the code review and workflow solution for the Android Open Source Project (AOSP).

In order to support AOSP, Gerrit was designed to be:

  • Scalable. It supports large deployments with thousands of users.
  • Powerful. The workflow engine enforces code review and automated build and test for every commit.
  • Flexible. Gerrit offers a delegated permission model with granular permissions as well as a Prolog interpreter for custom workflows.
  • Secure. Gerrit integrates with enterprise authentication mechanisms including LDAP, Active Directory, and OpenID, and can be served over SSH and HTTPS.

Gerrit offers three key features: repository management, access control, and the code review and workflow engine.

In future articles I’ll dive into more detail on Gerrit’s workflow and other features, but for now, I’ll conclude by talking about why we decided to add Gerrit support to Git MultiSite.

Gerrit is a scalable system, but still has a centralized architecture. Out of the box it has a master set of repositories and a simple master-slave replication system. That can lead to challenges in performance and uptime – exactly the problems that WANdisco solves with our patented active-active replication technology. Under Git MultiSite, Gerrit repositories can be replicated to any location for maximum performance, or you can add additional local repositories for load balancing. Access control is enforced with the normal Gerrit permissions, and code review and workflow still route through the Gerrit UI.

Gerrit with Git MultiSite gives you 100% uptime and the best possible performance for users everywhere. More details coming soon!

A bit of programming language history

When I started programming, I used C and just a bit of Fortran. I took my first degrees in electrical engineering, and at the time those languages were the default choice for scientific and numerical computing on workstations. The Java wave was just building at the time, Perl was for sysadmins, and Python was a toy.

That’s how the landscape appeared from my limited perspective. As I started working more deeply in computer science, I started glimpsing odd languages that I couldn’t quite place (Smalltalk? Tcl?). If you follow data analytics and big data, you’ll see a bewildering array of new and old languages in use. Java is still around, but we also have a lot of functional languages to consider as there’s a concerted effort to expose data analysis languages to big data infrastructure. R, Erlang, Go, Scala, of course Java and Python – how do we keep track?

I was very happy to find a lovely diagram showing how these languages have evolved from common heritage. It’s on slide 2 of this presentation from the Data Science Association.

This may be old hat to those who’ve been in the space for a long time, but I find this sort of programming language history very useful. Now I’ve got to find out what in the world Algol 60 was.

Sample datasets for Big Data experimentation

Another week, another gem from the Data Science Association. If you’re trying to prototype a data analysis algorithm, benchmark performance on a new platform like Spark, or just play around with a new tool, you’re going to need reliable sample data.

As anyone familiar with testing knows, good data can be tough to find. Although there’s plenty of data in the public domain, most of it is not ready to use. A few months ago, I downloaded some data sets from a US government site and it took a few hours of cleaning before I had the data in shape for analysis.

Behold: https://github.com/datasets. Here the Frictionless Data Project has compiled a set of easily accessible and well documented data sets. The specific data may not be of much interest, but these are great for trials and experimentation. For example, if you want to analyze time series financial data, there’s a CSV file with updated S&P 500 data.
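As a trivial illustration of why ready-cleaned CSVs are so handy, standard Unix tools can answer simple questions immediately. (The column names and numbers below are invented to mimic the shape of a time-series dataset, not taken from the real one.)

```shell
# A toy file in the shape of a time-series CSV (values are made up)
cat > /tmp/sp500-sample.csv <<'EOF'
Date,SP500
2014-01-31,1782.59
2014-02-28,1859.45
2014-03-31,1872.34
EOF
tail -n 1 /tmp/sp500-sample.csv   # most recent observation
```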

Well worth a look!

SmartSVN 8.6 Available Now

We’re pleased to announce the release of SmartSVN 8.6, the popular graphical Subversion (SVN) client for Mac, Windows, and Linux. SmartSVN 8.6 is available immediately for download from our website.

New Features include:

  • Bug reporting now optionally allows uploading bug reports directly to WANdisco from within SmartSVN
  • Improved handling of svn:global-ignores inherited property
  • Windows SASL authentication support added and required DLLs provided

Fixes include:

  • Internal error when selecting a file in the file table
  • Possible internal error in repository browser related to file externals
  • Potentially incorrect rendering of directory tree in Linux

For a full list of all improvements and features please see the changelog.

Contribute to further enhancements

Many issues resolved in this release were raised by our dedicated SmartSVN forum, so if you’ve got an issue or a request for a new feature, head there and let us know.

Get Started

Haven’t yet started using SmartSVN? Get a free trial of SmartSVN Professional now.

If you have SmartSVN and need to update to SmartSVN 8, you can update directly within the application. Read the Knowledgebase article for further information.