Butler Lampson
Abstract: In this talk I will describe the general recipe for building a highly available system and show how to do it efficiently in practice. The idea is to build a replicated state machine. The tricky part is getting all the replicas to agree on the content and order of the inputs. For this you need a general fault-tolerant algorithm for reaching consensus. Two-phase commit uses the simplest possible such algorithm, but I will describe Lamport's "Paxos" algorithm, which is the best one that works without real-time guarantees. Since consensus is expensive, you also need to minimize its use, and you do that with leases, which are locks that time out.
The other theme of the talk is a general method for designing or understanding a complex system, especially a concurrent or fault-tolerant one. Using the Paxos algorithm as the example, I will show you how to write a precise spec for such a system and how to prove that an implementation satisfies the spec. There won't be any formulas in the talk.
Availability is not having to say “I’m sorry”.
A
distributed system is a system in which I can’t do my work because some
computer has failed that I’ve never even heard of.
Lamport
Lamport’s “Paxos” algorithm.
The system can work even when some parts are broken.
Several copies of each part.
Each non-faulty copy does the same thing.
This idea is due to Leslie Lamport.
If a state machine is deterministic, then feeding two copies the same inputs will produce the same outputs and states.
We call each copy a process.
So all we need is to agree on the inputs.
Examples:
Replicated storage with Read(a) and Write(a, d) steps.
Airplane flight control system with ReadInstrument(i) and RaiseFlaps(d) steps.
Agreeing on some value is called consensus.
A replicated state machine needs to agree on a sequence of values:
Input 1 Write(x, 3)
Input 2 Read(x)
. . .
A single process, with the same state as the specification, tells everyone else the outcome (two-phase commit works this way).
This is not fault-tolerant — it fails if the coordinator fails.
A set of processes, each choosing a value. If a majority choose the same value, that is the outcome.
This is not fault-tolerant — it fails if a majority doesn’t agree, or if a member of the majority fails.
Like ordinary locks, leases can be hierarchical.
Only the root lease need be granted by consensus.
Consider a synchronous system: processes have clocks with bounded skew.
The lease expires at a stated time.
The lease is granted to a process which becomes the master for x.
Writes of x must take bounded time, so they end before the lease expires.
If the master fails, x is inaccessible until the lease expires.
Run consensus again to renew the lease.
x is a multi-ported disk
x is a cache block
Running consensus to issue a lease is too expensive.
Run consensus once to elect a czar C and give C a lease.
Now C gives out sub-leases on x and y to masters.
Each master controls its part of the state.
The masters renew their sub-leases with the czar. This is cheap.
The czar renews its lease by consensus. This costs more, but there’s only one czar lease.
Also, the czar can be simple and less likely to fail, so a longer lease may be OK.
This method is commonly used in replicated file systems and in clusters.
In an asynchronous system with a single faulty process, there is no algorithm for consensus that is guaranteed to terminate (Fisher-Lynch-Paterson).
In a synchronous system consensus is possible even with processes that have arbitrary or malicious faults (Byzantine agreement), but it is expensive in messages sent and in time.
Committing a distributed transaction: agree on commit or abort.
Electing a leader: agree on the new leader.
Group membership: agree on current members.
You can’t observe the actual state of the system from outside.
state: a set of values, usually divided into named variables.
actions: named changes in the state; internal and external.
Data abstractions
Concurrent systems
Distributed systems
every external behavior of Y is an external behavior of X, and
Y’s liveness property implies X’s liveness property.
This expresses the idea that Y
implements X if
you can’t tell Y apart from X by looking only at the external actions.
Choose the state to make the spec clear, not to match the code.
What they do to the state.
What they return.
Notation is important, because it helps you to think about what’s going on.
Invent a suitable vocabulary.
Less is more. Fewer actions are better.
More non-determinism is better, because it allows more implementations.
State:
outcome : Value È {nil} initially nil
Name |
Guard |
Effect |
allow(v) |
|
choose
if outcome = nil then outcome := v |
outcome |
|
choose
return outcome |
State:
outcome : Value È {nil} initially
nil
done :
Bool initially false
Name |
Guard |
Effect |
allow(v) |
|
choose
if outcome = nil then outcome := v |
outcome |
|
choose
return outcome |
|
|
|
terminate |
outcome ¹ nil |
done := true |
State:
outcome : Value È {nil} initially nil
done :
Bool initially false
allowed : set Value initially {}
Name |
Guard |
Effect |
allow(v) |
|
allowed := allowed È {v} |
outcome |
|
choose
return outcome |
|
|
|
agree(v) |
outcome = nil |
outcome := v |
terminate |
outcome ¹ nil |
done := true |
Define an abstraction function f from the state of Y to the state of X.
Show that Y simulates X:
1) f maps initial states of Y to initial states of X.
2) For
each Y-action and each state y
there is a sequence of X-actions that is the same externally,
such that the diagram commutes.
This always works!
Invariants describe the reachable states of Y; simulation only needs to work from a reachable state.
A single coordinator process, with the same state as the specification, tells everyone else the outcome (this is how two-phase commit works). The abstraction function is:
outcome = the outcome of the coordinator.
done = everyone has gotten the outcome.
This is not fault-tolerant — it fails if the coordinator fails.
A set of processes, each choosing a value. If a majority choose the same value, that is the outcome. The abstraction function is:
outcome = the choice of
a majority, or nil if there’s no majority.
This is not fault-tolerant — it fails if a majority doesn’t agree, or if a member of the majority fails.
Embody the key idea in the abstraction function.
Add invariants to make this easier. Each action must maintain them.
Change the implementation (or the spec) until this works.
More efficiency means more complicated invariants.
You might need to change the spec to get an efficient implementation.
A set of agent processes, indexed by I. The agents do what they are told.
Agents have “persistent” storage that survives crashes.
Some leader processes that tell the agents what to do.
A set of
rounds, indexed by N. Each round has at most one value vn.
In each round, some of the agents may
accept the value.
If a majority accept, that value is the
outcome.
Two majorities always intersect.
An algorithm based on this idea was invented by Leslie Lamport, who called it “Paxos”, and independently by Liskov and Oki as part of a replicated storage system.
The state of the agents is the persistent variables (for the moment, don’t worry about how to code this efficiently):
si, n : Value È {no, neutral} initially neutral
A state component can only change if it is neutral.
Define the value of round n:
vn º if
some agent i has a Value in si, n then si, n
else nil
Invariant 1: A round has at most one value.
That is, in a given round all the agents with a value have the same value.
The abstraction function is
outcome = vn for some round n that has a
majority for vn,
or nil if there is no
such n
Invariant 2: If two rounds have majorities for a value, it is the same value.
A predicate P is stable º once true, P is true henceforth º P Þ q P
This is important because it’s safe to act on the truth of a stable predicate. Anything else might change because of concurrency or crashes.
These predicates are stable:
si, n = no
si, n = v
vn = v
n is dead º round n has a majority for no
n is successful º round n has a majority for some value
n is anchored º for all m £ n, m
is dead or vn = vm
For the last, we need a total ordering on N.
Invariant 2: Any two successful rounds have the same value.
which follows from
Invariant 3: for all n and m £ n, if m is successful then vn = nil or vn = vm
which follows from
Invariant 4: for all n and m £ n, if m is not dead then vn = nil or vn = vm
º for all n and m £ n, m is dead or vn = nil or vn = vm
º for all n, vn = nil or (for all m £ n, m is dead or vn = vm)
º for all n, vn = nil or n is anchored
So all we have to do is choose each vn so that
there is only one, and
n is anchored.
Now the rest of the algorithm is obvious.
Maintain invariant 1 (a round has at most one value) by having
at most one leader process per round
that keeps the current round and its value in volatile variables
and doesn’t resume an old round after a crash
Let N = (J, L). The ordering is lexicographic.
Leader l chooses (j, l) for n,
where j is a J that l has not used before (for instance, a local clock).
The state of a leader is the volatile variables:
nl :
N È {nil} initially nil
ul : Value È {nil} initially nil
allowedl : set Value initially
{}
The abstraction function is allowed = Èl Î L allowedl
A
crash of leader l sets nl and ul to nil and may set allowedl to {}.
Maintain invariant 4 by choosing vn to keep n anchored.
Leader l |
Message |
Agent
i |
Choose a new n |
|
|
Query a majority of agents for their status |
query(n)
® |
for all m < n, if
si, m = neutral then si, m := no |
|
¬ report(i, si) |
|
Choose vn to keep n anchored.
If all m < n are dead, choose any vn in allowedl |
|
|
Command
a majority |
command(n, vn) ® |
if
si, n = neutral |
|
¬ report(i, n, si, n) |
|
If a majority accepts, publish the outcome vn |
outcome(vn) ® |
|
|
Status
|
|
|
vn sa, n sb, n sc, n
|
vn sa, n sb, n sc, n
|
Round 1
Round 2
Round 3
|
7 7 no no 8 8 no no 9 no no 9
|
8 8 no no 9 9 no 9 9 no no 9 |
Leader’s
|
7, 8, 9 if a, b, c
report 8 if a, b or b, c report 9 if b, c report |
9 no
matter what majority reports |
Since it isn’t dead, any later round must choose the same value.
When an agent accepts and forms a majority.
But no one
knows that this has happened until later!
With one leader that doesn’t fail, Paxos terminates when the leader succeeds in both querying and commanding a majority of agents (perhaps not the same majority).
With multiple leaders, Paxos may never terminate, because later queries can make earlier rounds dead before the agents get commands.
Here is a sloppy algorithm for choosing a single leader, if processes have clocks and the usual maximum time to send, receive, and process a message is known:
Every potential leader that is up broadcasts its name.
You become the leader one round-trip time after doing a broadcast unless you have received the broadcast of a bigger name.
The relevant part of si is just the most recent value and the later no states:
si, last = v,
si, m = no for all m between last and next, and
si, m = neutral for all m ³ next.
Encode this as (v, last, next).
For a sequence of consensus problems (for instance, the successive steps of a state machine) we run a sequence of instances of Paxos, numbered by another index p.
Make the state
(v, last, next) for agent i and instance p encode
sq, i, m = no
for all q £ p and m < next.
Then a query only needs to be done once each time the leader changes.
Piggyback the outcome message on the next instance.
The result is 2 messages (1 round trip) for each consensus.
Run a replicated deterministic state machine, and get consensus on each input.
Use leases to replace most of the consensus steps with actions by one process.
Lamport’s “Paxos” algorithm, based on
Repeating rounds until you get a majority.
Ensuring that every round after a majority has the same value.
Write a simple spec as a state machine.
Find the abstraction function from the implementation to the spec.
Show that the implementation simulates the spec.
Specifications Lamport, A simple approach to specifying concurrent systems. Comm acm, 32, 1, Jan. 1989.
Impossibility Fischer, Lynch, and Paterson, Impossibility of distributed consensus with one faulty process. J. acm 32, 2, April 1985.
Paxos algorithm Lamport, The part-time parliament. Technical Report 49, Digital Equipment Corp, Palo Alto, Sep. 1989.
Liskov and Oki,
Viewstamped replication, Proc. 7th PODC, Aug. 1988.
State machines Lamport, Using time instead of timeout for fault-tolerant distributed systems. acm TOPLAS 6, 2, April 1978.
Schneider, Implementing fault-tolerant services using the state-machine approach: A tutorial. Computing Surveys 22 (Dec 1990).
This talk Lampson,
How to build a highly available system using consensus. In Distributed Algorithms, ed. Babaoglu and Marzullo, LNCS 1151, Springer, 1996.