“FLP” stands for Fischer, Lynch and Paterson and refers to a famous paper by the aforementioned authors: “Impossibility of Distributed Consensus with One Faulty Process” (https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). This post is a continuation of our series of posts on time.
As we have seen in the previous post on “Time, Clocks and the Ordering of Events in a Distributed System” (https://blogs.wandisco.com/lamports-time-clocks-and-the-ordering-of-events-in-a-distributed-system/) solving consensus is key to the replicated state machine approach to building replication solutions.
The FLP result states that with a perfect network where all messages are eventually delivered (ie no lost messages) and are all delivered in order it is impossible to solve consensus with just one faulty process, the failure can be a simple fail stop of the process such that it simply stops functioning. This result cast a pall over research into solving consensus for a long time.
The result is for asynchronous systems where processors can run at arbitrary speeds and messages can take an arbitrary long time to be delivered, essentially a system without time. In contrast for synchronous system the processors run at fixed speeds and messages are delivered within fixed time limits, an example of a synchronous system is aircraft control system where the processors have a known speed, they run a real time operating system and network is closed a system.
The advantage of having a algorithm that works for an asynchronous system is that it will work for a synchronous system also, essentially synchronous systems are a subset of asynchronous systems. Solve the problem for an asynchronous system and you have a solution for a synchronous system.
At the heart of the FLP result is the fact that it is impossible to tell the difference between a processor that has failed or one that is simply running really slowly.
Often systems are built with timeouts, for example if a message is not received from a node within some timeout period then assume the node has failed – but what if the node has not failed and the timeout was too short, could something bad have happened? Aahlad (https://www.wandisco.com/about-us) our Chief Scientist has a simple piece of advice around this: “timeouts cannot be used for correctness”.
There are algorithms to solve consensus for synchronous systems (for example Byzantine Generals, https://www.microsoft.com/en-us/research/publication/byzantine-generals-problem/) but there is no solution for asynchronous systems.
So what about the real world. The internet cannot be treated as a synchronous system, but if it is treated as an asynchronous system then consensus cannot be solved. The solution is to treat it as “partially synchronous”, this means that the system is asynchronous but at times it behaves in a synchronous way. This maps to how real systems work, often they behave in a predictable way and within expected time bounds but occasionally they behave in unpredictable ways with unpredictable timing (due to things like network issues, GC pauses etc). When the system is behaving asynchronous the algorithm should be safe and when the system is behaving synchronously the algorithm should reach consensus.
What does this mean for consensus algorithms such as Paxos (https://lamport.azurewebsites.net/pubs/paxos-simple.pdf)? For Paxos to reach consensus you need a single leader process for two rounds of messages, with a quorum of processors in the system available for that period. FLP tells us there is no way we can guarantee the above conditions in an asynchronous system, however without these conditions Paxos is safe as it will simply not make progress. Paxos is safe when there are two leaders, but will not reach consensus. This leads to the fuller understanding of when timeouts can be used: timeouts cannot be used for correctness but they can be used to ensure progress.