Causality in a Real-World Distributed Consensus Implementation

I read with interest Peter Bailis's excellent article Causality is expensive (and what to do about it), because causality lies at the heart of our global replication solutions for Subversion, Git and Hadoop.

Clock-less causal ordering enables WAN-based, 100% data-safe replication by guaranteeing that writes are applied in the same order at every node of a replication group. The result is a set of independent nodes that evolve in exactly the same way, remaining shared-nothing replicas of one another. This true active-active replication is the key technology for eliminating single points of failure in any system, such as the NameNode in Hadoop, while also reducing latency in SCM systems such as Subversion and Git.
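To make that idea concrete, here is a minimal sketch, not DConE code and with hypothetical names, of a shared-nothing replica that applies agreed writes strictly in global sequence order. Start every replica from the same state, feed them the same ordered writes, and they evolve identically.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a shared-nothing replica (illustrative, not DConE code).
// Writes are applied strictly in the agreed global sequence order, so any two
// replicas that start from the same state end up with identical state.
class Replica {
    private final List<String> state = new ArrayList<>(); // stand-in for a repository or namespace
    private long nextSeq = 0;                              // next global sequence number expected

    void apply(long globalSeq, String write) {
        if (globalSeq != nextSeq) {
            throw new IllegalStateException("out-of-order write, expected seq " + nextSeq);
        }
        state.add(write); // deterministic mutation of local state
        nextSeq++;
    }

    List<String> snapshot() {
        return new ArrayList<>(state);
    }
}
```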

The part of the article that most caught my eye, however, was the section on how expensive causality is in terms of storage. O(N) space per vector clock, where N is the number of processes, appears to be the theoretical lower bound. Clearly this can get expensive.
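For reference, this is what that cost looks like in a classic vector clock, with one counter per process carried on every message. This is a generic illustration of the O(N) cost, not DConE's wire format.

```java
import java.util.Arrays;

// Generic vector clock (illustrates the O(N) storage cost, not DConE's format):
// one counter per process, incremented on local events and merged on receive.
class VectorClock {
    private final long[] clock; // N entries => O(N) storage per clock
    private final int self;     // index of the local process

    VectorClock(int numProcesses, int self) {
        this.clock = new long[numProcesses];
        this.self = self;
    }

    void tickLocal() {
        clock[self]++;
    }

    // On message receipt: element-wise max with the sender's clock, then tick.
    void merge(long[] received) {
        for (int i = 0; i < clock.length; i++) {
            clock[i] = Math.max(clock[i], received[i]);
        }
        tickLocal();
    }

    long[] toWire() {
        return Arrays.copyOf(clock, clock.length); // O(N) values on every message
    }
}
```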

So how does WANdisco’s replication engine deal with this limit when processing millions of writes a month on a system with dozens of nodes?

Bailis outlines four sophisticated approaches in his article, none of which we employ in WANdisco’s core DConE replication technology.

Instead of weakening order guarantees or reducing availability, DConE uses two main methods to reduce the storage cost of causality:

1. Garbage collection
2. Sidelining

Garbage collection works because each node, when it finishes processing a transaction, broadcasts the highest global sequence number it has applied; every node can then garbage collect the transactions known to have been processed everywhere. There are a few wrinkles that go beyond the scope of this article, but that's the basic technique.
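A rough sketch of the idea, with hypothetical names and none of the wrinkles: each node reports the highest global sequence number it has finished applying, and anything at or below the minimum reported value can be dropped from the log.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Sketch of the garbage-collection idea (illustrative, not DConE's implementation).
// Transactions at or below the minimum reported "highest applied" sequence number
// have been processed by every node and can be discarded.
class TransactionLog {
    private final NavigableMap<Long, String> log = new TreeMap<>();   // globalSeq -> transaction
    private final Map<String, Long> highestApplied = new HashMap<>(); // nodeId -> highest applied seq

    TransactionLog(Set<String> nodeIds) {
        for (String id : nodeIds) {
            highestApplied.put(id, -1L); // nothing reported yet
        }
    }

    void append(long globalSeq, String txn) {
        log.put(globalSeq, txn);
    }

    // Called when a node broadcasts that it has applied everything up to 'globalSeq'.
    void reportApplied(String nodeId, long globalSeq) {
        highestApplied.merge(nodeId, globalSeq, Math::max);
    }

    // Drop every transaction that all nodes have already applied.
    void garbageCollect() {
        long lowWaterMark = highestApplied.values().stream()
                .mapToLong(Long::longValue)
                .min().orElse(-1L);
        log.headMap(lowWaterMark, true).clear(); // remove entries <= low-water mark
    }
}
```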

Sidelining is used when a node has fallen so far behind that the other nodes would otherwise accumulate too large a transaction queue. This can happen when a node is partitioned from the group. Since the remaining nodes can still achieve quorum, they can continue to process write transactions. The queue can't be garbage collected, however, because the partitioned node can't signal completion, much less actually process the queued transactions. Sidelining means that we give up on the partitioned node and garbage collect without it. When the node comes back online, a recovery process with helper nodes assists in bringing it back to current.
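Sidelining then amounts to excluding the straggler from the low-water-mark calculation, roughly like this (again hypothetical names, not the actual DConE logic):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the sidelining decision (illustrative only). A node that has fallen
// more than 'maxLag' transactions behind the head of the agreed sequence is
// excluded from the garbage-collection low-water mark, so the healthy nodes can
// keep trimming their queues. The sidelined node is later recovered out-of-band.
class Sideliner {
    private final long maxLag;
    private final Set<String> sidelined = new HashSet<>();

    Sideliner(long maxLag) {
        this.maxLag = maxLag;
    }

    // 'highestApplied' holds each node's last reported progress; 'headSeq' is
    // the newest agreed global sequence number.
    void review(Map<String, Long> highestApplied, long headSeq) {
        for (Map.Entry<String, Long> e : highestApplied.entrySet()) {
            if (headSeq - e.getValue() > maxLag) {
                sidelined.add(e.getKey());
            }
        }
    }

    // Low-water mark for garbage collection, computed over healthy nodes only.
    long lowWaterMark(Map<String, Long> highestApplied) {
        return highestApplied.entrySet().stream()
                .filter(e -> !sidelined.contains(e.getKey()))
                .mapToLong(Map.Entry::getValue)
                .min().orElse(-1L);
    }
}
```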

These techniques work because in our real-world implementations the ordering is used to modify the state of something else, in our case Subversion, Git and Hadoop. Once we've applied the state, we need to keep around only enough causal history to catch up a temporarily slowed node. A node can be recovered by restoring a known application state out-of-band and then starting replication at the appropriate global sequence number.
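As a rough illustration of that recovery path (hypothetical names again), the restored node only needs the writes that follow its snapshot:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of out-of-band recovery (illustrative). The application state is restored
// from a snapshot taken at a known global sequence number; the node then catches up
// by replaying, in order, only the agreed writes after that point, supplied by a
// helper node that still holds them.
class Recovery {
    static List<String> recover(List<String> snapshotState,        // state restored out-of-band
                                List<String> writesAfterSnapshot)  // ordered writes after the snapshot's sequence number
    {
        List<String> state = new ArrayList<>(snapshotState);
        for (String write : writesAfterSnapshot) {
            state.add(write); // deterministic re-application, continuing the global order
        }
        return state;
    }
}
```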

Because our applications are deployed at hundreds of companies worldwide, we often take a pragmatic approach rather than a research-oriented one. That said, the advanced techniques outlined in the article are not unknown to us, and they serve as inspiration as we stretch the capabilities of our products and technology.
