The ABCD’s of Paxos

Butler W. Lampson
Microsoft
1-617-547-9580
blampson@microsoft.com
We explain how consensus is used to implement replicated state machines, the general mechanism for fault-tolerance. We describe an abstract version of Lamport’s Paxos algorithm for asynchronous consensus. Then we derive the Byzantine, classic, and disk versions of Paxos from the abstract one, show how they are related to each other, discuss the safety, liveness, and performance of each one, and give the abstraction functions and invariants for simulation proofs of safety.
Categories and Subject Descriptors: D.2.4 [Software]: Correctness Proofs—abstraction function, invariant, simulation; Fault Tolerance—Byzantine, Paxos, replicated state machine, view change; [Theory]—consensus, liveness, safety.

General Terms: Algorithms, Reliability, Security, Theory

Keywords: Paxos, asynchronous consensus, fault-tolerant, replication, Lamport, Byzantine, state machine
We give an abstract version AP of Lamport’s Paxos algorithm for asynchronous consensus that captures its idea, but is not directly implementable because some of the actions are non-local. Then we give three implementations of AP that solve this problem in different ways, together with the abstractions and invariants of their simulation proofs:
· Classic Paxos, CP, from Lamport’s original paper [10] and from Liskov and Oki [14], tolerates n/2 stopped processes and requires conditional write (compare and swap) operations on persistent state variables.
· Disk Paxos, DP, from Gafni and Lamport’s recent paper [6], is a generalization of AP and CP that requires only read and write operations on persistent state variables.
· Byzantine Paxos, BP, from Castro and Liskov [1], [2], tolerates n/3 processes with arbitrary faults. Their papers also describe a replicated state machine implementation, based on BP, that has good performance and the same fault tolerance.
AP, CP, and BP are summarized in the appendix.
I’ve tried to answer all the questions I had when I read these papers, about how simple the algorithms can be made, the minimum conditions for them to work, and how they are related. The role that General Λαμπσων played in the original Paxos paper makes it especially appropriate for me to write about a Byzantine version.
I don’t know whether a practical algorithm could be developed in this top-down fashion. Certainly the three that we give were not invented in this way, but our exposition does clarify the relationships among them and perhaps will suggest other variations.[2]
The main application for fault-tolerant consensus is replicated state machines. This is the fundamental technique for general fault-tolerance, first described by Lamport [8]. It goes like this:
· Cast your problem as a deterministic state machine that takes input requests for state transitions, called steps, from the client, performs the steps, and returns the output to the client. Any computation can be done this way.
· Make n copies or ‘replicas’ of the state machine.
· Using consensus, feed all the replicas the same input sequence. Then they all generate the same output sequence.
· If a replica fails, it can recover by starting in the initial state and replaying all the inputs. Like a transaction system [7], it can speed up this complete replay by starting with a previous state instead of at the beginning. (A sketch of the whole recipe follows the list.)
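A minimal sketch of this recipe in Python; the consensus primitive decide(i, request), which returns the same i-th input at every replica, and all other names are illustrative assumptions, not from the paper:

```python
# Sketch of a replica of a deterministic state machine driven by consensus.

class Replica:
    def __init__(self, initial_state, apply_step):
        self.state = initial_state        # deterministic machine state
        self.apply_step = apply_step      # (state, input) -> (state, output)
        self.next_step = 0                # sequence number of the next step

    def run_step(self, request, decide):
        # Every replica gets the same decided input for step i, so every
        # replica produces the same state and output.
        decided = decide(self.next_step, request)
        self.state, output = self.apply_step(self.state, decided)
        self.next_step += 1
        return output

    def recover(self, checkpoint_state, checkpoint_step, decided_inputs):
        # Speed up replay by starting from a previous state instead of
        # from the beginning, as with transaction-system recovery.
        self.state, self.next_step = checkpoint_state, checkpoint_step
        for decided in decided_inputs[checkpoint_step:]:
            self.state, _ = self.apply_step(self.state, decided)
            self.next_step += 1
```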
The steps of the state machine can be arbitrarily complicated as long as they are deterministic, atomic, and strictly local to one replica. To make a big step atomic, use transactions [7]. Of course a replica can involve more than one physical machine; in fact, like any good idea in computer science, the entire method can be applied recursively.
Even reading the state must be done with a step, unless the client is willing to accept output based on an old state. If a read also returns the sequence number of the last step that affected it, the client can pay for better read performance with complexity by doing an occasional step to learn the current step, and then accepting read outputs that are not too far out of date. With a sloppy notion of real time the state machine can give the client a bound on the number of seconds a read might be out of date.
Since fault-tolerant consensus makes all the inputs persistent, exactly-once semantics needs no extra persistent writes. The state machine does have to check that an input hasn’t been accepted already, which it can do by remembering the most recent input from each client, or just its hash, sequence number, or time-stamp.
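A sketch of this duplicate check (Python; client_id and request_id are illustrative names):

```python
# Exactly-once check: remember each client's most recent request id (or
# its hash, sequence number, or timestamp) and refuse to re-execute it.

last_seen = {}          # client_id -> most recent request_id

def fresh(client_id, request_id):
    if last_seen.get(client_id) == request_id:
        return False    # already accepted; don't execute twice
    last_seen[client_id] = request_id
    return True
```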
The most common application is to data storage systems such as a file system [14]. The method is much more general, however. For instance, state machine actions can be used to change the sets of processes that form the various quorums on which consensus depends, so that no special algorithms are needed to deal with processes that arrive and depart in an orderly way.
Many applications combine a replicated state machine with leases, which are locks on portions of the state. A lease differs from a lock because it times out, so the system doesn’t block indefinitely if the leaseholder fails. To keep the lock the holder must renew the lease. There is an obvious tradeoff between the cost of frequent renewals and the cost of waiting for the lease to expire when the leaseholder fails. A client (or a subordinate state machine) with a lease can do arbitrary reads and writes of the leased state without taking any steps of the main state machine, except for a single step that combines all the writes. The most important use of leases is to allow holders to cache part of the state.
Like locks, leases can have different modes such as shared and exclusive, and they can be hierarchical. A parent leaseholder can issue child leases for sub-portions of its state without using consensus; of course the child’s lease must expire no later than the parent’s.
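A sketch of such leases (Python, assuming loosely synchronized clocks; all names are illustrative):

```python
import time

class Lease:
    def __init__(self, holder, expiry, parent=None):
        if parent is not None:
            expiry = min(expiry, parent.expiry)   # child bounded by parent
        self.holder, self.expiry, self.parent = holder, expiry, parent

    def valid(self):
        return time.time() < self.expiry          # timed out => not held

    def renew(self, duration):
        # Holder must renew before expiry to keep the lock.
        new_expiry = time.time() + duration
        if self.parent is not None:
            new_expiry = min(new_expiry, self.parent.expiry)
        self.expiry = new_expiry

    def child(self, holder, duration):
        # A parent can issue a child lease without using consensus.
        return Lease(holder, time.time() + duration, parent=self)
```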
Consensus is also useful for group membership and transaction commit, if a full replicated state machine is not needed.
A consensus algorithm decides on one from a set of input values (such as the state machine inputs). It uses a set of processes, called agents in this paper. The simplest form of fault-tolerant consensus decides when a majority of agents choose the same value. This is not very fault-tolerant for two reasons: there may never be a majority, and even when there is, it may remain permanently invisible if some of its agents stop. Since we can’t distinguish a stopped agent from a slow one, we can’t tell whether the invisible majority will reappear, so we can’t ignore it.
To avoid these problems, Paxos uses a sequence of views.[3] A majority in any view decides (or more generally, a decision quorum; see section 4.2), but if a view doesn’t work out, a later view can supersede it. This makes the algorithm fault-tolerant, but introduces a new problem: decisions in all views must agree.
The key idea of Paxos is that a later view v need not know that an earlier view decided in order to agree with it. Instead, it’s enough to classify each earlier view u into one of two buckets: either it can never decide, in which case we say that it’s out, or it has made a choice and it must decide for that choice if it decides at all. In the latter case v just needs to know u’s choice.
Thus a view chooses and then decides. The choice can be superseded, but the decision cannot. On the other hand, the choice must be visible unless the view is visibly out, but the decision need not be visible because we can run another view to get a visible decision. This separation between decision and visibility is the heart of the algorithm.
A decision will be unique as long as every later choice agrees with it. We ensure this by anchoring the choice: if all previous views are out, v can choose any input value; if not, it can take the choice of the latest previous view that isn’t known to be out. By induction, this ensures that v will agree with any previous decision. To keep the algorithm from blocking, each previous view must be visibly out or have a visible choice. See section 4.3 for a picture of the anchor-choose-decide sequence.
In each view a primary process initiates the choice. A view eventually decides unless the primary fails or a later view starts. A later view may be necessary if the primary fails. Since asynchronous consensus with faults cannot be live [5], there is no reliable way to decide when to start another view. Paxos uses some unreliable way based on timeouts. Thus views may run concurrently.
Our description of the algorithms is based on a methodology for designing fault-tolerant systems. There are five principles:
Use only stable predicates to communicate state among processes. A predicate is stable if once true, it never becomes false. Hence information about non-local state can never become false. This makes it much easier to reason about the effects of failures and other concurrent actions. We say that a variable is stable if its non-nil value doesn’t change: y is stable if (y = constant ∧ y ≠ nil) is stable. Often variables that are not stable encode stable predicates; see section 4.8 for an example.
Structure the program as a set of separate atomic actions. This simplifies reasoning about failures. If sequencing is necessary, code it into the state; the actions of the primary in CP below are an example of this. This avoids having a program counter and invariants that connect it to the state. State should be either persistent, or local to a sequence of actions that can be abandoned.
Make the actions as non-deterministic as possible, with the weakest possible guards. This allows more implementations, and also makes it clearer why the algorithm works.
Separate safety, liveness, and performance. Start with an algorithm that satisfies a safety property expressed as a state machine specification. Then strengthen the guards on some of the actions to ensure liveness or to schedule the actions; this reduces the number of possible state transitions and therefore cannot affect safety.
Use an abstraction function and a simulation proof to show that an algorithm satisfies its safety specification.[4] Put all the relationships between actions into invariants; it should never be necessary to do an explicit induction on the number of actions. Liveness proofs are more ad hoc.
The top-down development often works by introducing new variables that are related to the abstract variables by an invariant, and modifying the actions so that they depend only on the new variables and not on the abstract ones. The abstract variables thus become history variables in the proof.
Classic Paxos was invented independently by Lamport [10] and by Liskov and Oki [14]. This version of Paxos tolerates only stopping faults.
Lamport’s work was neglected because of the complicated Paxon fiction he used to describe it. He calls an agent a ‘priest’ and a view a ‘ballot’, and describes the application to replicated state machines in detail. A recent extension called Disk Paxos allows read-write memory such as a disk to be used as an agent [6]. My previous exposition of Classic Paxos and state machines calls a view a ‘round’ and a primary a ‘leader’ [13].
Liskov and Oki’s work is embedded in an algorithm for data replication, so the fact that they describe a consensus algorithm was overlooked. Not surprisingly, they call an agent a ‘replica’; they also use the terms ‘primary’ and ‘backup’.
Castro and Liskov introduced Byzantine Paxos, which tolerates arbitrary faults [1][2]. They present it as Liskov and Oki do.
There is an extensive literature on consensus problems, thoroughly surveyed by Lynch [15]. Dwork et al [4] give consensus algorithms that, like Paxos, have a sequence of views (called ‘rounds’) and are guaranteed safe but are live only if views are started prudently. Malkhi and Reiter treat Byzantine quorums [16].
Section 2 gives the background: notation, failure model, and quorums. Section 3 is the specification for consensus, followed by AP in section 4 and its DP generalization in section 5. Section 6 explains how we abstract communication, and sections 7 and 8 use this abstraction for CP and BP. Section 9 concludes. An appendix summarizes the notation and the main actions of AP, CP, and BP.
To avoid a clutter of parentheses, we usually write subscripts and superscripts for function arguments, so g(v, a) becomes gva. We use subscripts for views and superscripts for processes. Other subscripts are part of the name, as in v0 or Qout.
Lower-case letters denote variables and upper-case letters denote sets and predicates (except that q and z denote sets of processes, so that Q and Z can denote sets of sets). A type is a set, but also overloads functions and operators. Names starting with t denote variables of type T.
No-argument functions on the state are ‘state functions’, used like variables except that we don’t assign to them. Rather than recompute such an r each time it’s used, a real program might have a variable r′ and maintain the invariant r = r′.
We use g for a predicate on the state, and G for a process predicate, a function from a process to a predicate. F and S denote specific process predicates; see section 2.2. We lift logical operators to process predicates, writing G1 ∧ G2 for (λ m | G1m ∧ G2m).
We write {x ∈ X | G(x)} in the usual way to describe a set: the elements of X that satisfy G. This extends to {x, y | G(x, y) | f(x, y)} for {z | (∃ x, y | G(x, y) ∧ z = f(x, y))}.
The following schema describes actions:

  Name      Guard                         State change
  Closev    cv = nil ∧ x ∈ anchorv        → cv := x
The name of the action is in bold. The guard is a predicate that must be true for the action to happen. The last column describes how the state changes; read “guard → state change” as “if guard then state change”. A free variable in an action can take on any value of its type. An action is atomic.
A variable declaration
var y : Y := nil
gives the variable’s name y, type Y, and initial value nil.
When an action or formula derives from a previous version, boxes highlight the parts that change, except for process superscripts. Shading highlights non-local information. Underlines mark the abstract variables in a simulation proof of refinement.
The appendix has a summary of the notation in table 3, and the variables and main actions of the various algorithms in table 4.
We have a set M (for Machine) of processes, and write m or k for a process, and later a or p for an agent or primary process.
We admit faulty processes that can send any messages, and stopped processes that do nothing. A failed process is faulty or stopped; a process that isn’t failed is OK. Our model is asynchronous, which means that you can’t tell a stopped process from a slow one (after all, both begin with ‘s’). A process that crashes and restarts without losing its state is not stopped, but only slow. A primary process may have a crash or reset action that does lose some state; this is also not a failure.
We define predicates on processes: Fm is true when m is faulty, Sm when m is stopped. These are stable, since a process that fails stays failed. OK = ¬(F ∨ S). When a process fails its state stops changing, since failed processes don’t do actions. Thus every action at m has ∧ OKm in its guard, except a send from a faulty process. To reduce clutter we don’t write this conjunct explicitly.
A faulty process can send arbitrary messages. For reasoning from the contents of messages to be sound, any g inferred from a message from m must therefore be weaker than Fm, that is, equal to g ∨ Fm. You might think that the state of a faulty process should change arbitrarily, but this is unnecessary. It does all its damage by sending arbitrary messages. Those are its external actions, and they are the same for arbitrary state and for frozen state.
The reason for distinguishing faulty from stopped processes is that faulty processes compromise safety: the system does the wrong thing. Stopped processes can only compromise liveness: the system does nothing. Often safety is much more important than liveness. This is like the distinction in security between integrity and availability (preventing denial of service).
We limit the extent of failures with sets ZF, the set of all sets of processes that can be faulty simultaneously, ZS the same for stopped, and ZFS the same for failed. Clearly ZF ⊆ ZFS and ZS ⊆ ZFS.
The simplest example is bounds f and s on the number of faulty and stopped processes. We define Z≤i = {z | |z| ≤ i}. Then ZF = Z≤f, any set of size ≤ f, and ZS = Z≤s, any set of size ≤ s. If f = 0 there are no faulty processes and only {} is in ZF.
A different example for faults is mutual mistrust. Each process belongs either to Intel or to Microsoft, and both an Intel and a Microsoft process cannot be faulty:
ZF = {z | z ⊆ zIntel ∨ z ⊆ zMicrosoft}.
Similarly, for stops we might use geographical separation: all the processes in one city, plus at most one in the other, can stop at the same time:
ZS = {zb ⊆ zBoston, zs ⊆ zSeattle | |zb| ≤ 1 ∨ |zs| ≤ 1 | zb ∪ zs}
It seems natural to assume that F ⇒ S, since a faulty process might appear stopped by sending no messages. This implies ZF ⊆ ZS = ZFS. For the bounded case, it implies f ≤ s. It’s not essential, however, that faulty imply stopped. The important thing about a faulty process is that it can send a false message, which can affect safety, while a stopped process can only affect liveness.
For example, F ⇒ S implies that Intel-Microsoft has no live quorums (see below), since all the Intel processes can be faulty, but if they can all be stopped then none are left to form the Intel part of a quorum. We could, however, configure such a system on the assumption that no more than two processes will stop; then any three processes from each side is a live quorum. This makes sense if each side insists that no decision can depend entirely on the other side, but is willing to wait for a decision if the other side is completely stopped.
A quorum set Q is a set of sets of processes. Define Q#G = ({m | Gm ∨ Fm} ∈ Q), that is, G ∨ F holds at every process in some quorum in Q. F is there to make the predicate a sound conclusion from a message. We write Q[rv*=x] for Q#(λ m | rvm = x); here rvm = x stands for any expression.
We require Q to be monotonic (q ∈ Q ∧ q ⊆ q′ ⇒ q′ ∈ Q), so that making G true at more processes doesn’t falsify Q#G. Thus if G is stable, so is Q#G. If G1 ⇒ G2 then Q#G1 ⇒ Q#G2.
It’s natural to define Q~F = {q | q ∉ ZF}; these are good quorums, with at least one non-faulty process.
Quorum sets Q and Q′ are (mutually) exclusive if we can’t have both a Q quorum for G and a Q′ quorum for its negation: (∀ G | Q#G ⇒ ¬Q′#¬G). This holds if every Q quorum intersects every Q′ quorum in a set of processes that can’t all be faulty:
∀ q ∈ Q, q′ ∈ Q′ | q ∩ q′ ∈ Q~F
This is how we lift local exclusion G1 ⇒ ¬G2 to global exclusion Q#G1 ⇒ ¬Q′#G2. Exclusion is what we need for safety.
For liveness we need to relate various quorums to the sets of possibly faulty or stopped processes.
To ensure G holds at some non-faulty process, we need to hear it from a good quorum, one that can’t be all faulty, that is, one in Q~F. If g = Gm is independent of m, then Q~F#G ⇒ g; this is how we establish g by hearing from some processes.
To ensure that henceforth there’s a visible Q quorum satisfying a predicate G, we need a quorum Q+ satisfying G that still leaves a Q quorum after losing any set that can fail:
Q+ = {q′ | (∀ z ∈ ZFS | q′ − z ∈ Q)}.
If Q+ ≠ {} then Q is live: there’s always some quorum of OK processes in Q.
The most popular quorum sets are based only on the size of the quorums: Q≥i = {q | |q| ≥ i}. If there are n processes, then for Q≥i and Q≥j to be exclusive, we need i + j > n + f. If ZF = Z≤f then Q~F = Q≥f+1. If ZFS = Z≤s then Q≥i+ = Q≥s+i and Q≥i live requires i ≤ n – s, since Q>n = {}. So we get n + f < i + j ≤ 2(n – s), or n > f + 2s. Also i > n + f – j ≥ n + f – (n – s), or i > f + s. With the minimum n = f + 2s + 1, f + s < i ≤ f + s + 1, so we must have i = f + s + 1. If s = f, we get the familiar n = 3f + 1 and i = 2f + 1.
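As a quick check of this arithmetic, here is a sketch (Python, illustrative names) of the exclusivity and liveness conditions for size-based quorums:

```python
# With n processes, bound f on faulty and s on stopped, size-i and size-j
# quorums are exclusive iff i + j > n + f, and a size-i quorum set is
# live iff i <= n - s.

def exclusive(n, f, i, j):
    return i + j > n + f          # any i-set and j-set intersect in > f

def live(n, s, i):
    return i <= n - s             # a quorum of OK processes always exists

n, f, s = 4, 1, 1                 # the familiar n = 3f + 1 with s = f
i = f + s + 1                     # the forced quorum size, here 2f + 1 = 3
assert exclusive(n, f, i, i) and live(n, s, i)
assert not exclusive(n, f, i - 1, i)   # anything smaller is not exclusive
```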
With f = 0 there are exclusive ‘grid’ quorum sets: arrange the processes in a rectangular grid and take Q to be the rows and Q′ the columns. If Q must exclude itself, take a quorum to be a row and a column, minus the intersection if both have more than two processes. The advantage: a quorum is only √n or 2(√n – 1) processes, not n/2. This generalizes to f > 0 because quorums of i rows and j columns intersect in ij processes [16].
For the Intel-Microsoft example, an exclusive quorum must be the union of an exclusive quorum on each of the two sides.
The external actions are Input, which provides an input value from the client, and Decision, which returns the decision, waiting until there is one.[5] Consensus collects the inputs in the input set, and the internal Decide action picks one from the set.
type X = …                            values to agree on
var d : (X ∪ {nil}) := nil            Decision
    input : set X := {}
  Name          Guard                  State change
  Input(x)                             input := input ∪ {x}
  Decision: X   d ≠ nil                → ret d
  Decide        d = nil ∧ x ∈ input    → d := x
For replicated state machines, the inputs are requests from the clients. Typically there is more than one at a time; those that don’t win are carried over to input for the next step.
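For concreteness, here is the spec transcribed as a Python sketch (the class and method names are just the action names above):

```python
# The consensus spec: Input collects values, the internal Decide
# nondeterministically picks one, Decision returns it once set.

class ConsensusSpec:
    def __init__(self):
        self.d = None                 # decision d, initially nil
        self.input = set()

    def Input(self, x):
        self.input.add(x)

    def Decide(self):                 # guard: d = nil and some x in input
        if self.d is None and self.input:
            self.d = next(iter(self.input))   # any x in input

    def Decision(self):               # guard: d != nil
        assert self.d is not None
        return self.d
```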
It’s interesting to observe that there is a simpler spec with identical behavior.[6] It has the same d and Decision, but drops input and Decide, doing the work in Input.
var d : (X ∪ {nil}) := nil            Decision

  Name          Guard      State change
  Input(x)                 if d = nil then optionally d := x
  Decision: X   d ≠ nil    → ret d
A simulation proof that the first spec implements the second, however, requires a prophecy variable or backward simulation.
This spec says nothing about liveness, because there is no live algorithm for asynchronous consensus [5].
As we said in section 1.2, the idea of Paxos is to have a sequence of views until one of them forms a quorum that is noticed. So each view has three stages:
· Choose an input value that is anchored: guaranteed to be the same as any previous decision.
· Try to get a decision quorum of agents to accept the value.
· If successful, finish by recording the decision at the agents.
This section describes AP, an abstract version of Paxos. AP can’t run on your computers because some of the actions refer to non-local state (shaded in the correctly formatted versions of the paper, so you can easily see where the implementation must differ). In particular, Choose and cv are completely non-local in AP. Later we will see different ways to implement AP with actions that are entirely local; the key problem is implementing Choose.
AP has external actions with the same names as the spec, of course. They are almost identical to the actions of the spec.
  Name            Guard       State change
  Input(x)                    input := input ∪ {x}
  Decisiona: X    da ≠ nil    → ret da
type V = ...                          View; totally ordered
     Y = X ∪ {out, nil}
     A ⊆ M = …                        Agent
     Q = set A                        Quorum
const Qdec : set Q := ...             decision Quorum set
      Qout : set Q := ...             out Quorum set
      v0 : V := ...                   smallest V
The views must be totally ordered, with a first view v0. Qdec and Qout must be exclusive.
var rva : Y := nil, but rv0a := out   Result
    da : X ∪ {nil} := nil             Decision
    cv : X ∪ {nil} := nil             Choice
    input : set X := {}
    activev : Bool := false
Each agent has a decision da, and a result rva for each view; we take rv0a = out for every a. AP doesn’t say where the other variables live.
We define a state function rv that is a summary of the rva: the view’s choice if there’s a decision quorum for that among the agents, or out if there’s an out quorum for that, or nil otherwise.
sfunc rv : Y = if Qdec[rv*=x] then x           view v decided x        (A1)
               else if Qout[rv*=out] then out   view v is out
               else nil                         no result yet
According to the main idea of Paxos, there should be a decision if there’s a decision quorum in some view. Thus
abstract d     = if rv ∈ X for some v then rv else nil
         input = input
Figure 1 summarizes the state variables and functions.
Figure 1: AP state variables and functions
All the variables with short names are stable: rva, da, rv, cv. In addition, activev, x ∈ input, and x ∈ anchorv are stable, although the sets are not because they can grow:
· input grows in Input;
· anchorv is empty until every earlier view is out or has a choice, and then becomes X or that choice; see (A8) below.
AP maintains the following plausible invariants. All but (A3) are summarized in figure 2.

invariant  da ≠ nil ⇒ (∃ v | rv = da)           decision is a result           (A2)
           rv = x ∧ ru = x′ ⇒ x = x′             all results agree              (A3)
           rva = x ⇒ rva = cv                    agent’s result is view’s cv    (A4)
           cv = x ⇒ cv ∈ input ∩ anchorv         cv is input and anchored       (A5)
           rva ≠ nil ∧ u < v ⇒ rua ≠ nil         Close/Acceptv do all u < v     (A6)
Figure 2: AP data flow
Invariant (A3) ensures a unique decision. To see how to maintain it, we rewrite it with some of the universal quantifiers made explicit so that we can push them around:
∀ x′, u | rv = x ∧ ru = x′ ⇒ x = x′
By symmetry, we can assume u < v. Symbol-pushing and substituting the definition of ru = x′ yields
rv = x ⇒ (∀ u < v, x′ ≠ x | ¬Qdec[ru*=x′])                                      (A7)
How can we exclude Qdec[ru*=x′]? In the scope of x′ ≠ x,
rua ∈ {x, out} ⇒ ¬(rua = x′)
Lifting this exclusion to the exclusive decision and out quorums (see section 2.3), we get Qout[ru*∈{x,out}] ⇒ ¬Qdec[ru*=x′]. In addition, cu = x ⇒ ¬Qdec[ru*=x′] by (A4), since a decision quorum can’t be all faulty. Substituting the stronger predicates, we see that (A7) is implied by
rv = x ⇒ (∀ u < v | cu = x ∨ Qout[ru*∈{x,out}])
where we drop the quantifier over x′ since x′ no longer appears. You might think that by (A4) Qout[ru*=out] would be just as good as Qout[ru*∈{x,out}], but in fact it’s too strong if there are faults, since we can get x from a faulty agent in the quorum even though cu ≠ x.
If we limit rva to values of X that satisfy the right hand side, this will be an invariant. With this in mind, we define

sfunc anchorv : set X = {x | (∀ u < v | cu = x ∨ Qout[ru*∈{x,out}])}            (A8)

This says that x is in anchorv if each view less than v chose x or has an out quorum for (out or x). If all the earlier views are out, anchorv is all of X. If we make cv anchored and set rva only to cv, then (A3) will hold.
Note that this definition does not require every previous view to be decided or out (that would be ... Qdec[ru*=x] ∨ Qout[ru*=out], which is ru ≠ nil). It’s strong enough, however, to ensure that if there is a previous decision it is the only element of anchor, because a decision excludes an out quorum for anything else.
To compute anchor directly from the definition (A8), we need to know a choice or out for each previous view. We can, however, compute it recursively by splitting the quantifier’s domain at u:

anchorv = {x | (∀ w | v0 ≤ w < v ⇒ cw = x ∨ Qout[rw*∈{x,out}])}
        =   {x | (∀ w | v0 ≤ w < u ⇒ cw = x ∨ Qout[rw*∈{x,out}])}
          ∩ {x | cu = x ∨ Qout[ru*∈{x,out}]}
          ∩ {x | (∀ w | u < w < v ⇒ cw = x ∨ Qout[rw*∈{x,out}])}

We define outu,v = (∀ w | u < w < v ⇒ rw = out): all views between u and v are out. If this is true, then the third term is just X, so since cu ∈ anchoru by (A5):

anchorv = {x | cu = x} ∪ (anchoru ∩ {x | Qout[ru*∈{x,out}]})                    (A9)
If rua = x is the latest visible x, then cu = x by (A4), and the Closev action below makes all views later than u out and ensures that x is in anchorv; note that this x is not necessarily unique. If all the views earlier than v are out, anchorv = X. Thus we have

anchorv ⊇ if outu,v ∧ rua = x then {x}                                          (A10)
In BP, however, rua may not be visible, so we need the more inclusive (A9) to ensure that Choose can happen; see section 8.3.
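To make the anchoring rule concrete, here is a sketch (Python, illustrative names) that computes a non-empty subset of anchorv from visible agent results using (A10); it covers only the fault-free case, since as just noted BP needs the more inclusive (A9):

```python
# results[u][a] is the visible r_u^a (None if not seen); out_quorum tests
# membership in Qout. Scanning back from v: if every later view is visibly
# out and some agent reported r_u^a = x, then c_u = x by (A4) and (A10)
# puts x in anchor_v; if all earlier views are visibly out, anchor_v = X.

def anchored(v, views, results, out_quorum, X):
    for u in reversed([w for w in views if w < v]):    # latest view first
        visible = results.get(u, {})
        xs = {r for r in visible.values() if r not in (None, 'out')}
        if xs:
            return {xs.pop()}       # some r_u^a = x: x anchored by (A10)
        if not out_quorum({a for a, r in visible.items() if r == 'out'}):
            return set()            # u neither visibly out nor chosen: wait
    return set(X)                   # every earlier view is visibly out
```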
With this machinery the algorithm is straightforward. We Choose an anchored input and then Accept (which can’t happen until after Choose, since it needs cv ≠ nil). That leads to a decision, which Finish records for posterity. This is the whole story for safety.
  Name       Guard                             State change
  Choosev    cv = nil ∧ x ∈ input ∩ anchorv    → cv := x
  Acceptva   rva = nil ∧ cv ≠ nil              → rva := cv; Closeva
  Finishva   rv ∈ X                            → da := rv
For the safety proof, Input and Decisiona simulate Input and Decision in the spec. All the other actions do not change the abstract state and therefore simulate skip in the spec, except for the Accept that forms a decision quorum of agents for cv. This Accept simulates Decide. However, the agent whose Accept simulates Decide has no way of knowing this. In fact, if some agent in the quorum fails before anyone else finds out that it accepted cv, there’s no way for anyone to know that there’s been a decision. There will be another view, and by the magic of anchoring it will choose the value already decided. This can happen repeatedly, until finally there’s a view in which the agents in a decision quorum stay up long enough that others can find out about it; see section 4.4 for an example.
For liveness, however, this is not enough, because Choose needs a non-empty anchor, which we get by doing Close on enough agents to ensure that every previous view either is out or has made a choice. An out quorum is definitely enough. Anchor happens when an out quorum has done Close; it marks the end of a view change (see section 4.9) even though there’s no state change.
  Startv     u < v too slow     → activev := true
  Closeva    activev            → for all u < v do
                                    if rua = nil then rua := out
                                  post: u < v ⇒ rua ≠ nil
  Anchorv    anchorv ≠ {}       → none
Note that we do not need, and do not necessarily get, ru ≠ nil, since some agents may never close, and even closing all the agents may yield a view that’s neither decided nor out.
Agents are just memories; they don’t do anything complicated. They cannot be simple read-write memories, however, since they must do the conditional-write or compare-and-swap operations of Close and Accept. Disk Paxos (section 5) implements AP without conditional writes.
With these actions AP finishes provided there are quorums of OK agents and a final view that is the last one in which Close actions occur; see section 4.5 for details.
Figure 3: Abstract Paxos
Figure 3 shows the actions of AP for one complete view. It shows communication with vague wavy arrows, since AP abstracts away from that, but the “transmit” part says what information needs to flow to enable the next action. Capitalized items refer to the state machine application of Paxos: the client (boxed in the figure) provides an input request, the machine takes a step, and it sends the client some output. If there are no faults, any agent could send the output.
An example may help your intuition about why this works when there are no faults. The table shows views 1-3 in two runs of AP with agents a, b, c, two agents in a quorum, and input = {7, 8, 9}.
               left run                     right run
               cv   rva  rvb  rvc           cv   rva  rvb  rvc
  View 1        7    7   out  out            8    8   out  out
  View 2        8    8   out  out            9    9   out   9
  View 3        9   out  out   9             9   out  out   9

  input ∩ anchor4 = {7, 8, 9} seeing a, b, c     {9} no matter what
In the left run all three views are out, so if we compute anchor4 by seeing all three agents, we can choose any input value. If we see only a and b, view 3 appears out but view 2 does not, and hence we must choose 8. If we see only a and c or b and c, view 3 doesn’t appear out and hence we must choose 9.
In the right run, view 2 is decided, but we don’t see that if either a or c is stopped. Nonetheless, we must choose 9, since we see that value in a non-out view no matter what out quorum we see. Thus a decided view such as 2 acts as a barrier which keeps any later view from choosing another value.
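The right-hand run can be replayed with the anchored sketch given earlier; whichever out quorum we see, 9 is the only anchored value:

```python
# Replaying the right-hand run above; 'out' and the values are r_v^a.

right = {1: {'a': 8, 'b': 'out', 'c': 'out'},
         2: {'a': 9, 'b': 'out', 'c': 9},
         3: {'a': 'out', 'b': 'out', 'c': 9}}
majority = lambda q: len(q) >= 2               # 2 of 3 agents

for seen in ({'a', 'b'}, {'a', 'c'}, {'b', 'c'}):
    visible = {v: {a: r for a, r in rs.items() if a in seen}
               for v, rs in right.items()}
    print(seen, anchored(4, [1, 2, 3], visible, majority, {7, 8, 9}))
# each line prints {9}: view 2's decision acts as a barrier
```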
The reason there were three views in both runs is that each view was interrupted by Close in a later one before it had a chance to finish. Otherwise view 1 would have succeeded. If new views keep starting, the algorithm can continue indefinitely.
We want AP to finish in a final view v if there’s no Close action in any later view. It will do so if the actions can see certain things:
· Finishv must see a decision d (that is, must see Qdec[rv*=d]). This means that Qdec must be live. Since there are no later views to mess with rva, if Qdec is live Accept will eventually run at enough agents to make d visible. However, d need not be visible in the view that made it. In fact, it’s fundamental to Paxos that until Finish has run at a live quorum, you may have to run another view to make d visible. This can’t happen in a final view, since it can only happen in u if a later view does Close and sets some rua to out.
· Acceptv must see the choice cv, though again perhaps not in every view if processes fail at bad times. This depends on the implementation of cv, which varies. It is trivial with no faults: one process, called a primary, chooses cv and announces it, which works if there’s only one primary for v and it doesn’t stop. With faults, BP uses a quorum to get a unique and visible cv, which works if all OK processes choose the same cv and the quorum is live.
· Choosev must see at least one element of anchor. Since this doesn’t get easier when you run another view, we insist that it be true in every view. This means that every previous view w must become visibly out (Qout[rw*=out] is visible) back to a view u that has a visible choice (A10) or at least is visibly anchored (A9). Hence Qout must be live. Since anchor involves the choice, this also depends on the implementation.
Some element x of anchor that Choosev sees must also be in input. But either anchor = X, in which case input ⊆ anchor, or x = cu for some u, in which case x ∈ input by (A5).
If Qout is live, Closev always leads eventually to a visible out quorum of OK agents in every u < v. In this quorum either every agent is out, in which case u is out, or some rua = cu by (A4). So if no faults are allowed, we get a non-empty anchorv immediately from this out quorum by (A10). If there are faults, there may be other out quorums as well, in which we may see rua = x ≠ cu from a faulty a. Since we can’t tell which out quorum is OK, (A10) isn’t enough to anchor v. We need (A9) and some delicate reasoning; see section 8.3.
A view can finish by seeing only an out quorum (for x ∈ anchorv, which Choose needs) and a decision quorum (for rv = x, which Finish needs). Thus the requirement is Qout and Qdec both live. With no faults and equal size-based quorums, for example, both quorums are the same: more than n/2 agents.
Doing Close in views later than v may keep Acceptv from happening, by setting too many rva to out before Acceptv has a chance to set them to cv; of course this can’t happen in the final view because there are no later views. To get a final view, activev controls the scheduling of Closev. Since asynchronous consensus can’t be guaranteed to terminate, there is no foolproof way to do this scheduling.
Schedulers either randomize or estimate the longest time RT for a round-trip from one process to another and back; note that RT includes the time for the processes to run as well as the time for the messages to travel. The idea is that if a view doesn’t complete within 2RT, you multicast a new view v. If v is smaller than any other view you hear about within another RT, v becomes active. Obviously this can fail in an asynchronous system, where there is no guarantee that the RT estimate is correct.
Castro and Liskov [1] use the Ethernet’s exponential backoff technique to estimate RT; a process backs off whenever it fails to hear from a quorum within its current RT estimate. This works as long as RT does not increase without bound. The estimate can be as much as b times the actual RT, where b is the backoff multiplier, commonly 2. More serious is that if processes stop and then recover, the estimate may be much too large.
Summing this up, with proper scheduling AP finishes as soon as there are Qdec and Qout quorums of processes that haven’t failed. We can’t implement proper scheduling in general, but it’s not hard in most practical situations.
Once a Q~F+ quorum knows a decision all the other state can be discarded, since no matter what failures occur there will be a good quorum to report d.
  Q~F+[d*≠nil]    → for all v do rva := nil; input := {}
Closeva leaves rua out or cu for all u < v, and by (A10) anchor only depends on the latest view with rua = x. Hence an agent a only needs to keep track of the latest view u for which rua = x and the range (maybe empty) of later views w for which rwa = out. The following variables do this:

  vXlasta   the latest u for which rua = x (v0 if there is no such u),
  xlasta    that x (arbitrary if there’s no such u), and
  vlasta    the earliest v ≥ u for which rva ≠ out.

For views w between vXlasta and vlasta, rwa = out; for views past vlasta, rwa = nil. Thus vXlasta = u ≠ v0 and xlasta = x encode the predicate rua = x, and vXlasta = u and vlasta = v encode
∀ w | (u < w < v ⇒ rwa = out) ∧ (v < w ⇒ rwa = nil) ∧ (u ≠ v ⇒ rva = nil).
These predicates are stable, although the variables of course are not. Here is the picture.

  rwa:    don’t know | xlasta |   out   | nil
  view:   v0         vXlasta          vlasta
This encoding uses space logarithmic rather than linear in the number of views, which makes it cheaper both to store and to transmit the agent state. In practice, of course, we use a fixed amount of space for a view. Close and Accept become
  Closeva    activev ∧ vlasta < v     → vlasta := v
  Acceptva   cv ≠ nil ∧ vlasta = v    → vXlasta := v; xlasta := cv; vlasta := v
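A sketch of this encoding (Python, integer views with v0 = 0, illustrative names); result decodes the predicates the triple encodes, returning 'unknown' for the views before vXlasta whose exact values were discarded:

```python
class AgentLast:
    def __init__(self, v0=0):
        self.v0 = v0
        self.vXlast, self.xlast, self.vlast = v0, None, v0

    def result(self, v):                     # decode r_v^a from the triple
        if v == self.vXlast:
            return self.xlast if self.xlast is not None else 'out'  # r_v0 = out
        if self.vXlast < v < self.vlast:
            return 'out'
        if v >= self.vlast:
            return None                      # r_v^a = nil
        return 'unknown'                     # before vXlast: only "not nil"

    def close(self, v, active):              # Close_v^a from the table above
        if active and self.vlast < v:
            self.vlast = v

    def accept(self, v, c_v):                # Accept_v^a from the table above
        if c_v is not None and self.vlast == v:
            self.vXlast, self.xlast, self.vlast = v, c_v, v
```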
When we use Paxos (or any other consensus algorithm) to implement a replicated state machine, we need to reach consensus on a sequence of values: the first step of the state machine, the second step, etc. By observing that anchorv does not depend on cv, we can compute it in parallel for any number of steps. For most of these, of course, there will have been no previous activity, so the agent states for all the steps can be represented in the same way. We only need to keep track of the last step for which this is not true, and keep separate last triples just for this and any preceding steps that are not decided. To bound this storage, we don’t start a step if too many previous steps are not known to be decided.
With this optimization we do Close and compute anchor only when the view changes, and we can use one view for a whole sequence of steps. Each step then requires Choose and Accept to reach a decision, and Finish to tell everyone. Finish can be piggy-backed on the next accept, so this halves the number of messages.
It’s possible to run several steps in parallel. However, in the state machine application the ordering of steps is important: to maintain external consistency a step should not decide an input x that arrives after a later step decided y and sent its output. Otherwise the clients will see that the inputs execute in the order x; y even though they also see that x was not submitted until after y completed; this is generally considered to be bad. To avoid this problem, fill any gaps in the sequence of steps with a special skip step. Of course there shouldn’t be nothing but skips.
If there are lots of state machine steps they can be batched, so one run of AP decides on many steps. This is like group commit for transactions [7], with the same tradeoffs: more bandwidth but greater latency, since the client gets no output until a batch runs.
An agent can send its rva directly to the client as well as to the other agents, reducing the client’s latency by one message delay. Of course the client must see the same result from a decision quorum of agents; otherwise it retransmits the request. A state machine agent can tentatively do a step and send the output to the client, which again must wait for a decision quorum. In this case the agent must be able to undo the step in case v doesn’t reach a decision and a later view decides on a different step. Castro and Liskov call this ‘tentative execution’ [1].
If a step is read-only (doesn’t change the state of the state machine), an agent can do it immediately and send the client the output. The client still needs a decision quorum, which it may not get if different agents order the read-only step differently with respect to concurrent write steps that affect the read-only result. In this case, the client must try the step again without the read-only optimization.
It’s convenient to describe an algorithm in terms of the persistent variables. In practice we don’t keep each one in its own disk block, but instead log all the writes to them in a persistent log. In some applications this log can be combined with the log used for local transactions.
We would like to implement the agent with memory that has only read and write operations, rather than the conditional writes that AP does in Close and Accept. The main motivation for this is to use commodity disks as agents; hence the name Disk Paxos (DP) [6]. These disks implement block read and write operations, but not the conditional-write operations that AP agents use.
To this end we add separate state variables rxva and rova in the agent for x and out, and change Close and Accept to unconditionally write out into ro and cv into rx. We want the code to look only at the values of rx and ro, so that rva becomes a history variable, that is, the behavior of the algorithm is unchanged when we remove it.
What makes this work is an invariant that allows us to infer a lot about rva from rxva and rova:
invariant   rxva = nil ∧ rova = nil  ⇒  rva = nil       relates state to history   (D1)
            rxva = nil ∧ rova = out  ⇒  rva = out
            rxva = x   ∧ rova = nil  ⇒  rva = x
            rxva = x   ∧ rova = out  ⇒  rva ≠ nil
In particular, if anchor is non-empty we can still always compute at least one of its elements, because the only information lost is some cases in which the view is out, and in those cases we get rva instead, which is enough by (A10). We may miss anchor = X, but we only need a non-empty anchor (and this can happen in AP as well if we don’t hear from some agents that are out). We may also sometimes miss a decision because we only know rva ≠ nil when actually rva = x, but this only costs another view (and this too can happen in AP if we don’t hear from some agents that accept). In the final view rova = nil and we don’t lose any information, so liveness is unaffected.
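The (D1) table reads directly as a decoding function; a sketch (Python, with None for nil and 'not nil' marking the lossy fourth row):

```python
# What the readable variables rx, ro let us infer about the history
# variable r at one view of one agent.

def infer_r(rx, ro):
    if rx is None and ro is None: return None       # r = nil
    if rx is None and ro == 'out': return 'out'     # r = out
    if rx is not None and ro is None: return rx     # r = x
    return 'not nil'                                # r is x or out: lossy
```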
var rxva : X ∪ {nil} := nil           Result X
    rova : {out, nil} := nil          Result out
    rva : Y := nil                    history
  Closeva    activev                           → for all u < v do
                                                   (roua := out; if rua = nil then rua := out)
                                                 post: u < v ⇒ rua ≠ nil
  Choosev    cv = nil ∧ x ∈ input ∩ anchorv    → cv := x
  Acceptva   cv ≠ nil                          → rxva := cv; Closeva;
                                                 if rva = nil then rva := cv

invariant   rxva = x ⇒ rxva = cv                                                (D2)
With the abstraction rva = rva, DP simulates AP.
A more general version encompasses both AP and DP, by allowing either a conditional or an unconditional write in Close and Accept. It replaces the boxed sections with the following:
  Closeva  … if rxua = nil, or optionally anyway, roua := out
  Acceptva … if rova = nil, or optionally anyway, rxva := cv
Liveness and scheduling are the same as for AP. The last-triple optimization needs special handling; it is discussed in section 7.2.
For the algorithm to progress, the processes must communicate. We abstract away from messages by adding to m’s state a stable predicate Tm called its ‘truth’ that includes everything m knows to be true from others; T also stands for ‘transmitted’. If g is a stable predicate, we write g@m for Tm ⇒ g, and read it “m knows g” or “m sees g” or “g is visible at m”. The safety invariant is
invariant   g@m ⇒ g                                                             (T1)
In other words, everything a process knows is actually true. This invariant allows us to replace a non-local guard g in an action at m with the stronger local g@m. The resulting code makes fewer transitions and therefore satisfies all the safety properties of the original, non-local code. Liveness may be a challenge.
We lift @ to process predicates: G@m = (λ k | Gk@m). Then (Q#G)@m = Q#(G@m): seeing G from a quorum is the same as seeing a quorum for G. Read Q[g@*] as “a Q quorum knows g”, where g is a predicate, not a function from processes to predicates.
Note that m can’t communicate g@m if m might be faulty. This is not an issue when we use g@m in a guard at m, but we can only get (g@m ∨ Fm)@k rather than (g@m)@k.
The implementation, of course, is that g@m becomes true when m receives a message from k asserting g; recall that g is stable and therefore cannot become false if k fails. We model the message channel as a set ch of terms gk→m (read “g from k to m”). Here are all the actions that affect ch or T:
  Name           Guard          State change
  Localk(g)      g              → Tk := Tk ∧ (g ∨ Fk)       post (g ∨ Fk)@k
  Sendk,m(g)     g@k            → ch := ch ∪ {gk→m}         post gk→m ∈ ch
  SendFk,m(g)    Fk             → ch := ch ∪ {gk→m}         post gk→m ∈ ch
  Receivem(g)    gk→m ∈ ch      → Tm := Tm ∧ (g@k ∨ Fk)     post (g@k ∨ Fk)@m
  Drop(g)        gk→m ∈ ch      → ch := ch − {gk→m}
So k can use Local to add to Tk any true predicate g; presumably g will only mention k’s local state, since otherwise it would be in Tk already.[7] Then k can send gk→m to any process m if either k knows g or k is faulty. We separate the two send actions because SendF is not fair: there’s no guarantee that a faulty process will send any messages.
invariant   gk→m ∈ ch ⇒ g@k ∨ Fk                                                (T2)
            (g@k ∨ Fk)@m ⇒ g@k ∨ Fk                                             (T3)
From the two send actions we have (T2) since g is stable and therefore g@k ∨ Fk is stable. Receivem(g) adds g@k ∨ Fk to m’s truth. Since this is the only way to establish (g@k ∨ Fk)@m, (T3) follows from (T2). (T1) follows from this and Local, since they are the only ways to establish g@m.
These actions express our assumption that the only way m can receive g from a non-faulty k is for g to be true. In other words, there’s no way to fake the source of a message. Usually we get this security either by trusting the source address of a message or by cryptographic message authentication; see [1] for details of how this works for BP.
We now abstract away from the channel to actions that establish g@m directly:
· k can transmit g@k to all the other OK processes, even if k fails. This allows for messages that remain in ch after k fails.
· A faulty k can transmit anything.
  Transmitk,m(g)    g@k ∧ OKm    → Tm := Tm ∧ (g@k ∨ Fk)    post (g@k ∨ Fk)@m
  TransmitFk,m(g)   Fk ∧ OKm     → Tm := Tm ∧ (g@k ∨ Fk)    post (g@k ∨ Fk)@m
We say that m hears g@k ∨ Fk from k. When there’s a quorum Q#G@m, we say that m hears G from a Q quorum. In the simulation proof Receivem(g) of gk→m simulates Transmitk,m(g) by (T2) because g@k is stable, and the Send actions simulate skip.
As before, both Transmitk,m and Broadcastk,m (see below) are fair if k is OK, and so is Broadcastm, but TransmitF is not. This means that if g@k holds, and OKk and OKm continue to hold, then eventually (g@k ∨ Fk)@m or g@m will hold.
A history variable can appear in a predicate g in Tk, even though it can’t appear directly in a guard or in an expression assigned to an ordinary variable, since it’s not supposed to affect the actions that occur. Such a g can get into Tk initially if an invariant (such as (C1)) says it’s implied by a g′ that doesn’t contain a history variable. Once it’s in Tk, g can be transmitted in the usual way. This is just a way of encoding “g′ was true at some time in the past”. So if g′ has no history variables, and g and g′ ⇒ g are stable:

  Localk(g)   g′ ∧ (g′ ⇒ g)   → Tk := Tk ∧ (g ∨ Fk)    post (g ∨ Fk)@k
The Local, Transmit, and Broadcast actions are the only ones we need for the rest of the paper.
If a Q~F+ quorum ever knows g, then henceforth there’s always a Q~F quorum of OK processes that knows g. Hence repeated Transmits will establish (Q~F[g@*])@m at every OK process m. But Q~F[g@*] ⇒ g, so this establishes g@m. We package this in an action:

  Broadcastm(g)   Q~F+[g@*] ∧ OKm   → Tm := Tm ∧ g    post g@m
If we have broadcast messages (signed by public keys) there’s a more direct way to broadcast a predicate. We can drop the m from gk→m, since any process can read the messages.[8] This means that if Receivek establishes g@k, then g@m follows too, not simply (g@k ∨ Fk)@m. In other words, k can transmit a transmitted g@k without weakening it, by simply forwarding the messages that k received. If g@k follows from Localk, g@k = (g@k ∨ Fk). Thus, provided k remembers the signed evidence for g, it can do
  Broadcastk,m(g)   g@k ∧ OKm   → Tm := Tm ∧ g    post g@m
We transmit predicates, but since they take only a few forms, an implementation encodes a predicate as a message with a kind field that says what kind of predicate it is, plus one field for each part of the predicate that varies. For example, after doing Closeva agent a sends (closed-state, a, last-triplea).
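For instance, the closed-state message might be encoded like this sketch (Python; the field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClosedState:          # the predicate an agent sends after Close_v^a
    kind: str               # 'closed-state': says which predicate this is
    agent: str              # a, the sender
    vXlast: int             # the last triple, the varying parts
    xlast: object
    vlast: int

msg = ClosedState('closed-state', 'a', vXlast=3, xlast=8, vlast=5)
```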
The Send actions that implement Transmit need to be scheduled to provide congestion and flow control, any necessary retransmission, and prudent use of network resources. How this is done depends on the properties of the message channel. For example, TCP is a standard way to do it for unicast on an IP network. For a multicast such as Broadcast, scheduling may be more complex.
Since processes can fail, you may have to retransmit a message even after a quorum has acknowledged its delivery.
To turn AP into an implementation, we can take AP’s agent almost as is, since the agent’s Close, Accept, and Finish actions only touch its local state rva. We need to implement input, activev, and cv, which are the non-local variables of AP, and the Input, Start, and Choose actions that set them. We also need to tell the agents that they should invoke their actions, and give them activev and cv. Our first implementation, CP, tolerates stopped processes but no faults.
Since CP is a real implementation, the actions refer only to local state. The original formatting again uses shading, now to mark state in T transmitted from other processes. We discuss the scheduling of these Transmit actions in section 7.1. Look at Table 4 to see how non-local information in AP becomes either local state or transmitted information in CP and BP.
CP implements AP by doing Input, Start, and Choose in a primary process. For fault tolerance there can be several primaries. However, for each view there is exactly one process that can be its primary; in other words, there is a function p(v) that maps each view to its primary. If in addition a primary never reuses a view for which it has already chosen a value, there is at most one cv for each v. A simple implementation of pv is to represent a view by a pair, with the name of its primary as the least significant part.
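A sketch of this view representation (Python, illustrative names): pairs compare lexicographically, so views are totally ordered and p(v) is just a projection:

```python
# A view is (n, primary): the primary's name is the least significant
# part, so each view has exactly one possible primary and each primary
# has its own disjoint supply of views.

def pv(v):                      # p(v): the one primary allowed to run v
    return v[1]

def new_view(v, me):            # a fresh view that only `me` can run
    return (v[0] + 1, me)

assert (3, 'p1') < (3, 'p2') < (4, 'p1')   # total order, number first
```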
An agent’s state must be persistent, but we allow a primary to reset, lose its state, and restart in a fixed state. Then it starts working on a new view, one for which it never chose a result before. We discuss later how to find such a view.
The primary’s job is to coax the agents to a decision, by telling them when to close, choosing cv, and relaying information among them. Once it has a new view, the primary’s Choose action chooses an anchored value cv for the view. To do this it must collect enough information from the agents to compute a non-empty subset of anchorv. (A10) tells us how much information suffices: either that all previous views are out, or that all views since u are out and cu. So it’s enough to trigger Closeva at an out quorum (with Closep) and then collect the state from that quorum.
Once the primary has cv, it can try (with Acceptp) to get the agents to accept it. They respond with their state, and if the primary sees a decision quorum for cv, then there is a decision which the primary can tell all the agents about (with Finishp).
We fearlessly overload variable names, so we have cv and cp, for example, and v and vp.
The agent variables of AP become agent variables of CP.
var rva : Y := nil, but rv0a := out   Result
    da : X ∪ {nil} := nil             Decision
All the other variables of AP become primary variables of CP, except that activep is coded by vp:
type P ⊆ M = …                        Primary
var vp : V := v0                      Primary’s View
    cp : X ∪ {nil} := nil             Primary’s Choice
    inputp : set X := {}
sfunc activep = (vp ≠ v0)
These are not stable across resets, so we add history variables that are, with the obvious invariants relating them to cp and activep.
var cv : X ∪ {nil} := nil             history
    input : set X := {}               history
    activev : Bool := false           history
invariant   activep ∧ cp ≠ nil ⇒ cp = cvp                                       (C1)
            inputp ⊆ input
            activep ⇒ activevp                                                   (C2)
Thus all the variables of AP are also variables of CP with the identity abstraction function to AP. The invariants (A2-A6) of AP are also invariants of CP.
Any primary can accept an input. For a state machine, this means that any primary can receive requests from clients. The client might have to do Inputp at several primaries if some fail.
  Inputp(x)      → inputp := inputp ∪ {x}; input := input ∪ {x}
We define the primary’s estimates of rv and anchorv in the obvious way. We define revp rather than just rep because p needs views earlier than vp to compute anchor. From (A1) for rv:
sfunc revp : Y = if (Qdec[rv*=x])@p then x           view v decided x     (C3)
                 else if (Qout[rv*=out])@p then out   view v is out
                 else nil
From (A10) for anchor:
  outu,vp = (∀ w | u < w < v ⇒ rewp = out)

sfunc anchorp ⊇ if outu,vp ∧ (rua = x)@p then {x}                          (C4)
Two unsurprising invariants characterize the estimates:
invariant   revp ≠ nil ⇒ revp = rv                                          (C5)
            anchorp ⊆ anchorvp                                               (C6)
We avoid a program counter variable by using the variables vp and cp to keep track of what the primary is doing:

  vp      cp       p’s view of agent state    Action
  = v0    –        –                          Startp
  ≠ v0    = nil    anchorp = {}               Closep
  ≠ v0    = nil    anchorp ≠ {}               Choosep
  ≠ v0    ≠ nil    revpp ∉ X                  Acceptp
  ≠ v0    ≠ nil    revpp ∈ X                  Finishp
To keep cvp stable we need to know cvp = nil before setting it. The following invariant lets us establish this from local state:
invariant   activep ∧ cp = nil ⇒ cvp = nil                                  (C7)
To maintain this invariant we put a suitable guard on the Startp action that makes p active. This is an abstract action since it involves cv; section 7.3 discusses how to implement it.
  Startvp   u < v too slow   → activev := true; vp := v; cp := nil
With this machinery, we can define Choosep as a copy of AP’s Choosev, with activep added to the guard and the primary’s versions of c, input, and anchor replacing the truth. (C3) and (C7) ensure that Choosev’s guard is not weakened.
  Choosep   activep ∧ cp = nil ∧ x ∈ inputp ∩ anchorp   → cp := x; cvp := x
The agent’s actions are the same as in AP (see section 4.3) with cv@a and revp@a for cv and rv. With these actions it’s easy to show that CP simulates AP, using (A2-A6) and (C5-C6).
We can use the last-triple optimization (section 4.8) in CP just as in AP, and of course the view change optimization works the same way.
As we saw above, the definitions of revp and anchorp imply that the agents tell the primary their state after Closea and Accepta. In addition, the primary tells the agents when to close, and what values to use for accept and finish. It implements these actions by sending trigger messages to the agents; since we are abstracting away from messages, we describe them informally. The agents respond by returning their state.
  Closep    activep ∧ cp = nil        → trigger Closeva at all agents, sending vp
  Anchorp   anchorp ≠ {}              → none
  Acceptp   cp ≠ nil ∧ revpp = nil    → trigger Acceptva at all agents, sending cp
  Finishp   revpp ∈ X                 → trigger Finishva at all agents, sending revpp
Figure 4: Classic Paxos
Figure 4 shows the actions of CP for one complete view, with n = 3 agents; compare figure 3. The arrows show the flow of messages, and the “transmit” part shows their contents and whether they are unicast or multicast. An n* means that if the primary is also an agent, only n – 1 messages need to flow. To finish, of course, only a quorum of agents is needed, and only the corresponding messages. In normal operation, however, when no processes are stopped, it’s desirable to keep all n of them up to date, so they should all get at least the Finish message.
Liveness, scheduling, and cleanup are the same as AP’s. A primary can discard all its state at any time with Reset (section 7.3).
In practice the primary is usually one of the agents, and only two other agents are needed to tolerate one stopped process. It’s also possible to compute only at the primary and use the agents just to store the state of the state machine; in this case the Finish message contains the state changes instead of d.
Implementing DP with CP is completely straightforward except for the log-space representation of the agent state. We can’t just use the triple of last values, because if a primary overwrites one of those unconditionally with an earlier view, it will change some rva back to nil. Instead, we keep a triple for each primary, so the state of an agent is the last triple as in AP, but each component is a function from p to a value (implemented, of course, as an array indexed by p). Then the primary rather than the agents can enforce the guards on writing the agent state, since each variable has only one writer. We abstract vXlast and vlast as the maximum over the primaries, and xlast as the value that goes with vXlast. Reading an agent’s state thus requires reading the triples for all the primaries.
This read operation is not atomic, however, so these abstractions are not enough to show that DP-last implements AP-last. Fortunately, they don't need to be, since what we care about is implementing DP. For this we don't need the last values but only enough information about rxa and roa to do the actions. As we saw in section 4.8, each lastpa triple encodes two predicates on ra, and all of them together encode the conjunction of the predicates. Thus setting vXlast,pa := u and xlast,pa := x is equivalent to setting rxua := x, and setting vXlast,pa := u and vlast,pa := v is equivalent to setting rowa := out for all w between u and v. (In addition, some information about earlier values of rxa and roa may be lost, but nothing is changed.) There's never a contradiction in these predicates, because cv is the only value we write into rxva. By reading all the triples, we get a predicate that implies the facts about rxa and roa that would follow from:
vXlasta = max over p of vXlast,pa
xlasta  = xlast,pa for the p for which vXlast,pa = vXlasta
vlasta  = max over p of vlast,pa
It follows that DP-last implements DP.
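A small sketch of this abstraction, assuming integer views and a dict from each primary to the (vXlast, xlast, vlast) triple it wrote at agent a; the function name is illustrative.

def abstract_last(triples):
    # triples: primary -> (vXlast_pa, xlast_pa, vlast_pa) as read from agent a
    vX = max(t[0] for t in triples.values())                # vXlast_a
    x = next(t[1] for t in triples.values() if t[0] == vX)  # xlast_a goes with vXlast_a
    v = max(t[2] for t in triples.values())                 # vlast_a
    return (vX, x, v)

# Reading both primaries' triples yields the combined predicate:
print(abstract_last({1: (3, 'A', 5), 2: (4, 'B', 4)}))      # (4, 'B', 5)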
A primary p can write all three values at once provided it finds suitable values vXlast,p and xlast,p to write into vXlast,pa and xlast,pa in Close. This is useful because it allows a to keep the whole triple in a single disk block. The values already there are suitable; so are those that accompany the largest vlast,pa in an out quorum.
Precisely, we have:
Closev,pa |  | → vlast,pa := v; vXlast,pa := vXlast,p; xlast,pa := xlast,p
Acceptv,pa | cv ≠ nil | → vlast,pa := v; vXlast,pa := v; xlast,pa := cv
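A sketch of these two writes, assuming each agent is a dict-like disk keyed by primary and the whole triple fits in one block; the helper names are mine.

def close_write(disk, p, v, vX, x):
    # Close_v,pa: write the whole triple at once; suitable (vX, x) are the
    # values already there, or ones read from an out quorum (see above)
    disk[p] = (vX, x, v)                 # ..., vlast_pa := v, in one block

def accept_write(disk, p, v, cv):
    if cv is not None:                   # guard: c_v ≠ nil
        disk[p] = (v, cv, v)             # vXlast_pa := v; xlast_pa := cv; vlast_pa := v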
If the primary has a little persistent state, for example a clock, it can use that to implement Startp, by choosing (clock, p) as a v that it has never used before, which ensures cv = nil.
To get by without any persistent state at the primary, Startp queries the agents and chooses a view later than some view in which a decision quorum of agents is not closed.
Resetp |  | → vp := v0; inputp := {}; cp := nil
Startvp | u < v too slow | → vp := v; activevp := true
This works because before choosing a result, a primary closes an out quorum at all previous views, and the two quorums must intersect. The invariants we need are (A6) and
invariant | Qdec[ru* = nil] ∧ v > u ⇒ cv = nil | (C8)
This argument is trickier than it looks, since Qdec[ru*=nil] is not stable. The true, stable condition is “at some time after the primary reset, a decision quorum of agents was still open”. Then p can conclude cv = nil if pv = p, since only p can change cv. To establish this condition, the query must not get a reply that was generated before the reset. We can ensure this if there’s a known upper bound on how long the reply can take to arrive (which is true for SCSI disks, for example), or with standard techniques for at-most-once messages on channels with unbounded delays. Unfortunately, the latter require some persistent state in the primary, which is what we are trying to avoid. We won’t formalize this argument.
If the primary sees any agent out in vp or sees any non-nil agent variable for a bigger view u, it restarts, since this means that a later view has superseded the current one. To restart, p chooses one of its views that is bigger than any it has seen to be out. This is another implementation of the abstract Startp, more efficient when the primary’s state hasn’t been lost.
Restartvp | activep ∧ vp < u < v ∧ pv = p | → vp := v; cp := nil
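Here is a sketch of the three ways to get a suitable view, with assumptions of mine: time.time() as the clock, integer views in the query and restart variants for simplicity, and max_written mapping each queried agent to the largest view at which it reported a non-nil result.

import time

def start_with_clock(p):
    return (time.time(), p)       # never used before, so cv = nil is assured

def start_by_query(max_written, dec_quorums):
    # some decision quorum q has r_u* = nil for every u above max over q of
    # max_written; by (C8) any view of p's later than such a u has cv = nil
    u = min(max(max_written[a] for a in q) for q in dec_quorums) + 1
    return u + 1

def restart(seen_out):
    return max(seen_out) + 1      # one of p's views bigger than any seen out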
As figure 4 shows, a normal run of CP that doesn’t need a view change multicasts two messages from the primary to the agents, and each agent sends one reply. The output to the client can go in parallel with the second multicast, so that the client’s latency is one client-primary round-trip plus one primary-agents round trip. Usually the finish message piggybacks on the accept message for the next step, so its cost is negligible. Cleanup takes another (piggybacked) agents-primary-agents round trip. See table 1 in section 8.7. With tentative execution (section 4.10) the primary-agents round-trip is reduced to one way.
A view change adds another primary-agents round trip, and if the primary has to run Start, there is a third one. The last only happens when the primary crashes, however, in which case this cost is probably small compared to others.
For a more detailed analysis see [3].
BP is a different implementation of AP, due to Castro and Liskov [1], that tolerates arbitrary faults in ZF of the agents. Their description interweaves the consensus algorithm and the state machine, assumes the primary is also an agent, and distinguishes it from other agents (called ‘backups’) much more than we do here. They use different names than ours; see table 2 in the appendix for a translation.
With faults it is unattractive to have separate primary processes for Choose or for relaying information among the agents, so we do Choose in the agents and use multicast for communication among them. Thus BP starts with AP, keeps all the agent variables rva and da, and adds agent versions of the other variables, and a history variable for input as in CP.
const Qch : set Q := …                      choice quorums
var rva : Y := nil, except rv0a := out      result
    da : X ∪ {nil} := nil                   decision
    cva : X ∪ {nil} := nil                  choice
    inputa : set X := {}
    input : set X := {}                     history
    activeva : Bool := false
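A sketch of one BP agent's state in Python, with None for nil and a string marker for out; the class is illustrative, not Castro and Liskov's code.

OUT = 'out'

class BPAgent:
    def __init__(self):
        self.r = {0: OUT}     # rv_a: view -> value in X or OUT; absent = nil
        self.d = None         # d_a, the decision
        self.c = {}           # cv_a: view -> this agent's choice
        self.input = set()    # input_a, inputs heard from clients
        self.active = set()   # the views v with active_va = true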
Thus all the variables of AP are also variables of BP with the identity abstraction function, except for cv. The abstraction to cv is a choice quorum of the agents’ choices.
abstract | cv = if Qch[cv* = x] then x else nil
Agent a adds a value to inputa when a client transmits it; we don’t formalize this transmission. Since clients can also fail, other agents may not see this value.
Inputa(x) |  | → inputa := inputa ∪ {x}; input := input ∪ {x}
There’s still only one choice cv for a view, however, because Qch excludes itself, and the quorum must agree that the input came from the client. Thus any decision is still for a client input and still unique no matter how many faulty clients there are. For the effect of faulty clients on liveness, see the end of section 8.3.
We define ceva as a’s estimate of cv, like cv except for the “@a”:
sfunc | ceva = if (Qch[cv* = x])@a then x else nil | a's estimate of cv | (B1)
Similarly, a quorum of rvas makes a result (as in AP), and reva is a’s estimate of that result, the same as CP’s revp:
sfunc | reva = if (Qdec[rv* = x])@a then x else … | view v decided x | (B2)
The state function rv is defined in AP; it’s (B2) without the “@a”.
With these definitions, the agents’ non-nil estimates of r and c agree with the abstract ones, because they are all stable and (A4) means we see at most one rva from the OK agents. These invariants are parallel to (C1) and (C5):
invariant | ceva ≠ nil ⇒ ceva = cv | c estimates agree | (B3)
          | reva ≠ nil ⇒ reva = rv | r estimates agree | (B4)
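Both estimates are the same quorum scan over state heard at a, so one sketch covers (B1) and (B2); here heard maps each agent to the value it reported to a (None for nil or not yet heard), and quorums is Qch for ceva or Qdec for reva. The exclusivity of the quorums is what makes the returned value unique, as (B3)-(B4) assert.

def estimate(heard, quorums):
    # return x if some quorum unanimously reported x at a, else None
    for q in quorums:
        vals = {heard.get(m) for m in q}
        if len(vals) == 1 and None not in vals:
            return vals.pop()         # (Q[..* = x])@a holds for this x
    return None

# e.g. ce_va = estimate(heard_choices, Qch); re_va = estimate(heard_results, Qdec)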
We take AP’s anchorv as a state function of BP also. Following (A9) with reva for rv and without the cu term, we define:
outu,va = (∀w | u < w < v ⇒ rewa = out)
sfunc | anchorva = anchoru ∩ {x | Qout[ru* ∈ {x, out}]@a} if outu,va | (B5)
The missing a on anchoru is not a misprint. We have the obvious
invariant | anchorva ⊆ anchorv | (B6)
There is still a role for a primary, however: to propose a choice to the agents. This is essential for liveness, since if the agents can’t get a quorum for some choice, the view can’t proceed. BP is thus roughly a merger of AP’s agents and CP’s primary. As in CP, the primary is usually an agent too, but we describe it separately.
Safety cannot depend on the primary, since it may be faulty and propose different choices to different agents. If there’s no quorum for any choice, the view never does Accept and BP advances to the next view as discussed in section 8.4.
The primary has a persistent stable cvp (but see section 8.8 for an optimization that gets rid of this). The primary needs inputp in order to choose, but it doesn’t need vp since it just works on the last anchored view.
var cvp : X ∪ {nil} := nil     primary's choice
    inputp : set X := {}
An annoying complication is that when the primary chooses cvp, it needs to be able to broadcast cvp ∈ anchorv so that all the agents will go along with it. To broadcast, p needs Q~F+[(cvp ∈ anchorv)@*] (see the discussion of broadcast at the end of section 6), so a value that's anchored at the primary had better be anchored at enough agents, because anchorva is their only approximation of anchorv. Then
sfunc | anchorvp = {x | Q~F+[x ∈ anchorv*]@p} | (B7)
Thus to compute anchorvp, p needs to hear from Q~F+ agents.
The agents’ actions are essentially the same as in AP; (B3-B4) imply that the guards are stronger and the state change is the same.
Closeva | activeva | → for all u < v do if rua = nil then rua := out
Anchorva | anchorva ≠ {} | → none
Acceptva | ceva ≠ nil ∧ rva = nil | → rva := ceva; Closeva
Finishva | reva ∈ X | → da := reva
The primary does Input and Anchor as in CP, though the definition of anchorvp is quite different.
Inputp(x) |  | → inputp := inputp ∪ {x}
Anchorvp | anchorvp ≠ {} | → none
Choose is like AP’s Choose, but at both agents and primary:
The primary chooses for a view that belongs to it and is anchored, but where it hasn’t chosen already.
An agent only chooses the primary’s apparent choice.
(B5)-(B6) mean that the agents’ guards are stronger than in AP; this is what matters, since cva is what’s in the abstraction to cv.
Choosevp | pv = p ∧ cvp = nil ∧ x ∈ inputp ∩ anchorvp | → cvp := x
Chooseva | cva = nil ∧ (cvp = x)@a ∧ x ∈ inputa ∩ anchorva | → cva := x
There’s no guarantee that cvp is in input, but this wouldn’t be strong enough anyway, since for liveness it must be in inputa for a choice quorum.
invariant | cvp ≠ nil ⇒ cvp ∈ inputp ∩ anchorvp | (B8)
          | cva ≠ nil ⇒ cva ∈ input ∩ anchorva | (B9)
A client must hear da from a good quorum of agents.
For safety, in addition to AP's assumption that Qdec and Qout are exclusive, Qch must exclude itself. Then the invariants (A2-A6) of AP hold in BP, and the Closea, Accepta, and Finisha actions of BP simulate the same actions in AP. All the other actions simulate skip, except the Choosea action that forms a quorum, which simulates AP's Choose.
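As a concrete check of these conditions in the size-based case from the summary at the end of the paper (n = 3f+1, quorums of size 2f+1), the following sketch verifies that any two quorums intersect in more than f agents, so Qch excludes itself and Qdec, Qout are exclusive, and that a quorum survives f faults; the arithmetic is mine, not quoted from the paper.

def size_based_ok(f):
    n, q = 3 * f + 1, 2 * f + 1
    exclusive = 2 * q - n > f     # two quorums share > f agents, hence an OK one
    live = n - f >= q             # a full quorum exists among the OK agents
    return exclusive and live

assert all(size_based_ok(f) for f in range(1, 10))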
Figure 5: Byzantine Paxos
Figure 5 shows the flow of messages in BP. This is the logical flow. If the client-agents network is much slower than the inter-agent network, which is common in practice, the client can send an input just to the primary, including message authenticators for all the agents; the primary forwards the input to the agents. This does not change any costs or affect cryptographic authentication of messages. It does mean that the client may have to resend to another primary if the first one turns out to be faulty.
In BP the primary’s only job is to propose cvp to the agents, who are responsible for everything else including scheduling, since they can’t count on the possibly faulty primary. So Transmit and Broadcast are all there is to say about communication.
We want to show that a view with an OK primary will produce a decision unless a later view starts. We assume that Qdec, Qout, and Qch are live, and Qch ⊆ Q~F+; if these don't hold BP is still safe, but it may not decide. Suppose initially that anchorva ⊇ anchorvp ≠ {} at a choice quorum of OK agents; this is the case after a view change. Then we have normal operation, which is the easy part of the liveness argument. If the clients are OK:
Since anchorvp ≠ {}, an OK primary can do Choosevp, which leads to cvp@a at all the agents, so that they can all do Chooseva by (B7) provided cvp Î inputa, that is, a knows the client really sent cvp. An OK client will send its input to all the agents, but a faulty client may fail to do so.
This leads to an OK choice quorum for cvp because Qch is live, which leads to knowing that quorum at all the agents so that they can all do Accepta.
This leads to an OK decision quorum for rva = cvp because Qdec is live, which leads to knowing that quorum at all the agents, so that they can all do Finisha.
This leads to knowing a good quorum for da = cvp (or alternatively for the output) at the client, because Q~F is live.
Now we consider what happens in a view change. For an OK agent a to get anchorva ≠ {}, it needs (x ∈ anchoru)@a and Qout[ru* ∈ {x, out}]@a for some u < v, and rewa = out for each u < w < v, from (B5). This is the tricky part, since these require quorums, and having a quorum is no guarantee that it's visible.
Since Qout is live, there's an out quorum q of OK agents that will eventually do Closev. Hence eventually Qout[rw* ∈ {cw, out}]@a for each w < v, since this is the post-condition of Close. Either all these views are out at a, or there's some largest u < v and a′ ∈ q with rua′ = cu ≠ nil. In the former case anchorv = X. In the latter case Qch[cu* = x], and hence Qch[(x ∈ anchoru)@*] by (A5) and (B6). Hence x ∈ anchoru is broadcast since Qch ⊆ Q~F+, so eventually we have (x ∈ anchoru)@a, as required.
This argument is more subtle than it looks. Note that a doesn’t know which of the out quorums it sees come from OK agents, so it doesn’t know x = cu, and in fact cu might be nil.
To sum up, since a eventually hears from a Qout of OK agents, it hears, for some view u and each w between u and v,
an out quorum for w and
Qout[ru* ∈ {x, out}]@a and (x ∈ anchoru)@a,
and this is all a needs to anchor v, by (B5).
All that remains is the liveness of Anchorvp: the primary must see a non-empty intersection of anchorva sets from a Q~F+ quorum. Since such a quorum is live, eventually every agent in it will hear from the same q of OK agents and come up with the same x ∈ anchorv, which will thus be broadcast.
So BP is live, except for faulty clients, although it’s hanging on by its fingernails.
If the client is faulty, it can fail to deliver inputs to some agents. A view change that has x = cu ≠ nil and broadcasts x ∈ anchoru can broadcast x ∈ input as well and override the client's failings. This is essential, since there might be a decision for x. During normal operation, however, a faulty client can cause a view change if the primary chooses an input that the client did not send to Qch OK agents.
Agents can keep track of such clients and refuse to accept more input from them. If there are lots of them mounting a denial of service attack, however, performance can still be significantly affected. I don’t know any way to prevent this except for the primary to insist that each input be broadcast by getting an ack from Q~F+ agents, or by public key as in section 8.6. This is expensive, since it happens in normal operation, not just in a view change.
A faulty primary cannot keep BP from satisfying its safety spec, but it can certainly prevent progress. We therefore need a way to ensure that there are times when there’s only a non-faulty primary. To do this, we let the agents become primary in round-robin order. That is, we use integers as views and take a view’s primary to be the view modulo n: pv = v mod n.
An agent a keeps an estimate PT of the time to process a client input. If a gets input from a client at time t and doesn’t see some decision by t + PT, a assumes that the primary has failed. It advances to the next view v, does Closeva, and multicasts its state in a Closev message. Other agents’ timers expire, they do the same thing, and when a sees enough Closev messages it does Anchorva and sends its anchorva set to the new primary pv; see section 8.3.
Startva | v–1 too slow ∨ Q~F[activev*]@a | → activeva := true
If a gets Closeu messages for various u > v from a good quorum, it changes its v to the smallest u and does Closeu. Thus the OK agents increase v at most n times before they agree on the next view, and faulty agents can’t disrupt this agreement. This is not quite the same as self-stabilization, since it relies on not running out of values for v.
BP uses the same exponential backoff as AP to adjust PT.
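A sketch of this scheduling rule, with the timer plumbing assumed (the paper gives none): round-robin primaries, a decision timeout PT, and exponential backoff.

class ViewScheduler:
    def __init__(self, n, pt):
        self.n, self.v, self.pt = n, 0, pt
        self.deadline = None

    def primary(self):
        return self.v % self.n            # p_v = v mod n

    def on_client_input(self, now):
        self.deadline = now + self.pt     # expect a decision by t + PT

    def on_tick(self, now, decided):
        if self.deadline and not decided and now >= self.deadline:
            self.v += 1                   # advance a view: do Close_va, multicast Close_v
            self.pt *= 2                  # exponential backoff on PT
            self.deadline = now + self.pt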
This is similar to AP, but there is a lot more agent state. As in CP, the primary can discard its state at any time, and the extra transmits for Cleanup can be piggybacked on the next step.
Cleanupa | Q~F+[d* ≠ nil] | → rva := nil; ina := {}; …
As we saw in section 6, if messages can be broadcast securely, that is, signed by public keys, then a process can forward information to other processes so that they don’t have to get it from the source. This does not add any new power, but it avoids the Q~F+ acknowledgements otherwise needed for a broadcast.
There are two points where BP needs a broadcast, of inputa in normal operation and of anchora in a view change:
1. The primary can broadcast an input to all the agents, so a faulty client cannot force a view change.
2. During a view change the Anchorvp action is not needed. That is, an agent does not need to acknowledge x Î anchorva to the primary, since the information on which anchorva is based is broadcast.
This does not reduce the amount of message traffic in the normal case, since we are cheating there by not broadcasting input and taking some risk from faulty clients. Thus there is no performance gain to balance the large loss from doing public key operations, except when there are lots of faulty clients.
We separate the client-Paxos costs from the internal costs. They are not really comparable, for two reasons:
They often involve a network with very different properties.
Internal traffic can often have much bigger batches since it can combine the traffic from all the clients.
Figure 5 shows that in the normal case BP has one client-agents round trip (ng is a good quorum), by comparison with a client-primary round-trip in CP. In addition, there is one 1→n message from the primary as in CP, and two n→n messages among the agents, compared with one n→1 message to the primary in CP, and one 1→n message from the primary that can go in parallel with output. Thus BP has one extra message latency before the client gets output. What about throughput?
In a network that supports multicast efficiently (for example, any broadcast LAN or a switched LAN whose switches support it), the extra cost for n receivers is small. Table 1 shows the cost comparison on this assumption. BP is about twice as expensive as CP, or almost three times as expensive for the same number of failures (f or s). It’s not surprising that faults are much more costly.
If there’s no efficient multicast, agents can relay their messages to other agents through the primary, complete with authenticators, so that there are 2n messages after Choosea or Accepta rather than n².
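The internal totals in Table 1 below can be tallied directly; this sketch merely redoes the table's arithmetic, counting multicasts as the table does.

def bp_internal(n):
    return 1 + (n - 1) + n       # Choose 1→n-1, Accept n-1→n, Finish n→n = 2n

def cp_internal(n):
    return 1 + (n - 1) + 0       # Accept 1→n-1, Finish n-1→1, Finish_a piggybacked

assert bp_internal(4) == 8 and cp_internal(3) == 3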
Table 1: Cost of a normal run of BP and CP

Enables        | Message flow          | BP    | cost  | CP    | cost
Inputp         | client→agents/primary | 1→n   | 1     | 1→1   | 1
output         | agents/primary→client | ng→1  | f + 1 | 1→1   | 1
Total external |                       |       | f + 2 |       | 2
Choosea        | primary→agents        | 1→n-1 | 1     |       |
Accept         | agents/primary→agents | n-1→n | n - 1 | 1→n-1 | 1
Finish         | agents→agents/primary | n→n   | n     | n-1→1 | n - 1
Finisha        | primary→agents        |       |       | 1→n-1 | 0
Total internal |                       |       | 2n    |       | n
Smallest non-trivial n |               | f = 1 |       | s = 1 |
The optimizations of AP work in BP: compressing state with the last-triple, using one view change for many steps, and batching.
BP does not have to transmit the client’s entire input in each message. It’s sometimes enough to just send an ‘authenticator’, a signature of the message implemented by hashing it with a key shared between sender and receiver.
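For instance, with Python's standard hmac module (the key and message here are only illustrative), an authenticator and its check look like this:

import hmac, hashlib

def authenticator(key: bytes, msg: bytes) -> bytes:
    # a hash of the message keyed by a secret shared by sender and receiver
    return hmac.new(key, msg, hashlib.sha256).digest()

def verify(key: bytes, msg: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(authenticator(key, msg), tag)

tag = authenticator(b'shared-key', b'accept: v=7, x=step-42')
assert verify(b'shared-key', b'accept: v=7, x=step-42', tag)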
An undesirable property of BP’s view change is that the agents must remember all their cwa values, since they don’t know which one might be needed. This means that the last-triple optimization is not enough to avoid storage linear in the number of views. To avoid this, notice that if x ∈ anchorvp then x ∈ anchorv is broadcast by (B7), so if agent a is the primary for v then a can discard cwa for all w < v, since these are only needed for finding an element of anchorw, and anchorw ⊇ anchorv. For this to work, each agent a′ must remember its contribution to x ∈ anchorv. If anchorva′ = {cva′}, remembering cva′ is enough. If anchorva′ = X, then a′ must remember that; this is a new requirement. An agent must remember at most n values of cwa or anchorwa = X before its turn as primary comes along. If agents don’t act as primaries, then they need to collect the x ∈ anchorv facts themselves at regular intervals.
An agent’s inputa need not be persistent, because of the way input is defined as a history variable. If an agent discards input, however, the clients might have to retransmit their inputs.
It’s unfortunate that the primary has a persistent cvp. If it’s also an agent, then this can be the agent’s cva, so the only cost is that it must be persisted before it’s sent to any other agent. To get a primary with no persistent state, follow the model of CP: introduce a volatile cp, make cvp a history variable, and maintain invariants corresponding to (C1) and (C8) as in section 7.3:
invariant | cvp ≠ nil ⇒ cp = cvp | (B10)
          | Qdec[ru* = nil] ∧ v > u ⇒ cvp = nil | (B11)
To do Choosevp the primary must establish cvp = nil using (B11). This may require a new view; to preserve the round-robin scheduling of primaries, make a V a pair (i, j), where i determines the primary (p(i, j) = i mod n) and p can use j to start another view.
We started with an abstract Paxos algorithm AP that uses n agents and has only the agent actions Close, Accept, and Finish and an abstract Choose (plus the external actions Input and Decision). AP works by running a sequence of views until there’s one that runs for long enough to make a visible decision quorum for some input. Provided no later view starts, this will always happen as long as the choice is made and is visible. AP’s operation is divided into view change and normal operation; the latter requires one round-trip of agent-agent communication. AP can do any number of successive decisions with a single view change plus one normal operation per decision. AP’s agents are memories that can do conditional writes, but DP is a generalization that works with read-write memories.
AP can’t be implemented directly because it has actions that touch state at more than one process, in particular the Choosev action. We showed two implementations in which the processes communicate stable predicates about their state that are strong enough to convey all the information that AP’s actions need. Both CP and BP have essentially the same agent actions as AP. Both implement AP’s Anchor and Choose actions in a primary process that is logically separate, though in practice it is combined with an agent unless the agents are disks.
CP also uses the primary to relay information among the agents. It doesn’t tolerate any faults. It needs Qout and Qdec exclusive for safety, and live for liveness. For size-based quorums we have f = 0, s < n/2 and Qout = Qdec = Q≥s+1. In normal operation there are n internal messages if a multicast counts as 1, and the client latency is one client-primary round-trip plus one primary-agent round trip.
BP does tolerate faults, so it needs Anchor and Choose actions at both agents and primary, and uses multicast to share information among agents. In addition to CP’s requirements on quorums, it also needs Qch exclusive with itself for safety, and Qch live and Qch ⊆ Q~F+ for liveness. For size-based quorums and F ⇒ S we have Q~F = Q≥f+1 and Q~F+ = Qout = Qdec = Qch = Q≥2f+1. In normal operation there are 2n internal messages, and CP’s primary-agent round-trip is replaced by a primary-agent multicast plus an agent-agent round trip.
The main application for Paxos is replicated state machines.
[1] Castro, M. and Liskov, B. Practical Byzantine fault tolerance. Proc. 3rd OSDI, Feb. 1999.
[2] Castro, M. and Liskov, B. Proactive recovery in a Byzantine-fault-tolerant system. Proc. 4th OSDI, Oct. 2000.
[5] Fischer, M., Lynch, N., and Paterson, M. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2, Apr. 1985, 374-382.
[6] Gafni, E. and Lamport, L. Disk Paxos. Proc. DISC 2000, LNCS 1914, Springer, 2000, 330-344.
[7] Gray, J. and Reuter, A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[9] Lamport, L. A simple approach to specifying concurrent systems. Comm. ACM 32, 1, Jan. 1989, 32-45.
[10] Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2, May 1998, 133-169. Originally appeared as Research Report 49, Digital Systems Research Center, 1989.
[11] Lampson, B., Lynch, N., and Søgaard-Andersen, J. Correctness of at-most-once message delivery protocols. Proc. 6th Conf. on Formal Description Techniques, 1993.
[14] Liskov, B. and Oki, B. Viewstamped replication. Proc. 7th PODC, Aug. 1988.
[15] Lynch, N. Distributed Algorithms. Morgan Kaufmann, 1996.
[16] Malkhi, D. and Reiter, M. Byzantine quorum systems. Proc. 29th ACM STOC, 1997.
Table 2 gives some correspondences between the terminology of this paper and that of Castro and Liskov.
Table 3 lists all the names for variables and constants in alphabetical order, followed by the @, #, [], and Q+ notation and the names of the actions for communication.
Table 4 collects the variables, abstractions, state functions, actions, and invariants of AP, CP, and BP to help you see how they are related. To save space, we shorten the names of actions to two characters, and shorten input, active, and anchor to in, act, and anc.
The external actions are first, then the internal ones in the order of a complete run. Changes from the item to the left are marked by boxes except for p and a superscripts. A ditto mark " means that the entry is a copy of the corresponding entry to the left.
The legend in the lower left corner summarizes the way we mark non-local, changed, and abstract variables. We mark as non-local anything in an action that came from other processes, even though in CP and BP it is of course local when the action occurs.
Figure 6 collects from figures 3-5 the pictures of the flow of actions and messages in AP, CP, and BP. Notice that they start slightly differently, and note the extra Choose action in BP and the extra Finisha action in CP.
Table 2: Our terminology for BP vs. Castro and Liskov’s
Action    | C-L state          | Our state      | C-L msg     | Our msg
Closep    |                    |                | view-change | ra, ca
Anchorva  | in view v          | anchorva ≠ {}  | view-ack    | anchorva
Anchorvp  | in view v          | anchorvp ≠ {}  | new-view    |
Choosep   | pre-prepared       | cvp ≠ nil      | pre-prepare | cvp
Choosea   | pre-prepared       | cva ≠ nil      | prepare     | cva
Accept    | prepared           | rva ≠ nil      | commit      | rva
Finish    | committed          | da ≠ nil       |             |
Q~F       | weak certificate   |                |             |
Q~F+      | quorum certificate |                |             |
Table 3: Variables, constants, notation, and communication
           |         | Spec, failure, quorum | AP         | DP         | CP     | BP
           |         |                       |            | Δ from AP  | Δ from AP | Δ from CP
in section |         | § 2, 3                | § 4        | § 5        | § 7    | § 8
Agent      | a       |                       | a          |            |        |
Choice     | c       |                       | cv         |            | cp     | cva, ceva, cvp
Decision   | d       | d                     | da         |            |        |
Faulty     |         | f, Fm                 |            |            |        |
predicate  | g, G    |                       |            |            |        |
Integer    | i, j    |                       |            |            |        |
process    | k, m    |                       |            |            |        |
|A|        |         | n                     |            |            |        |
Primary    | p       |                       |            |            | p, pv  |
Quorum     | Q, q    | Q~F, Q+               | Qdec, Qout |            |        | Qch
Result     | r       |                       | rva, rv    | rxva, rova | revp   | reva
Stopped    |         | s, Sm                 |            |            |        |
Truth      | T       | T (§6)                |            |            |        |
View       | u, v, w |                       | v          |            | vp     |
value      | x, y    |                       |            |            |        |
failures   | Z, z    | ZF, ZS, ZFS           |            |            |        |

g@m      | Tm ⇒ g                         | Communication
G@m      | (λ k | Gk@m)                   |
Q#G      | {m | Gm ∨ Fm} ∈ Q              | Localk(g)
Q[rv*=x] | Q#(λ m | rvm = x)              | Transmitk,m(g)
Q+       | {q′ | (∀z ∈ ZFS | q′ – z ∈ Q)} | TransmitFk,m(g)
Q~F      | {q | q ∉ ZF}                   | Broadcastm(g)
Figure 6: Summary of actions
Table 4: Summary of declarations, actions, and invariants
          | AP (implements spec)                            | CP (implements AP)                              | BP (implements AP)
var       | rva, da    result, decision                     | rva, da    = rva, da                            | rva, da    = rva, da
          | cv         choice                               | cv         history, = cv                        | cva
          |                                                 | vp, cp     view, choice                         | cvp
          |                                                 | inp                                             | ina
input     | in                                              | in         history, = in                        | in         history, = in
active    | actv                                            | actv       history, = actv                      | actva
abstract  | d = if rv ∈ X then rv else nil                  |                                                 | cv = if Qch[cv*=x] then x else nil
          |                                                 | in = in                                         | in = in; actv = (∃a | actva)
sfunc     |                                                 | actp = (vp ≠ v0)                                | ceva = if (Qch[cv*=x])@a then x else nil  (B1)
          | rv = if Qdec[rv*=x] then x else …       (A1)    | revp = if (Qdec[rv*=x])@p then x else … (C3)    | reva = if (Qdec[rv*=x])@a then x else …   (B2)
          | anchorv = {x | (∀u<v | cu = x …)}       (A8)    | "                                               | "
          | anchorv = ancu ∩ {x | Qout[ru*∈{x,out}]} (A9)   |                                                 | ancva = ancu ∩ {x | Qout[ru*∈{x,out}]@a}  (B5)
          | anchorv ⊇ if outu,v ∧ rua = x then {x}  (A10)   | ancp ⊇ if outu,vp ∧ (rua = x)@p then {x} (C4)   |
          |                                                 |                                                 | ancvp = {x | Q~F+[x ∈ anchorv*]@p}        (B7)
          | outu,v = (∀w | u<w<v ⇒ rw = out)                | outu,vp = (∀w | u<w<v ⇒ rewp = out)             | outu,va = (∀w | u<w<v ⇒ rewa = out)

Actions   | Name       Guard           State change         | Name    Guard             State change         | Name   Guard              State change
          | Input(x)                   in := in ∪ {x}       | Inp(x)                    inp := inp ∪ {x}; …  | Ina(x)                    "
          | Decisiona  da ≠ nil        → ret da             | "       "                                      | "      "
          | Startv     u<v too slow    → actv := true       | Stvp    u<v too slow      → actv := true; …    | Stva   v–1 too slow       → actva := true
          | Closeva    actv            → for all u<v do …   | "       "                                      | "      actva              "
          | Anchorv    ancv ≠ {}       → none               | Anp     ancp ≠ {}         → none               | Anva   ancva ≠ {}         → none
          |                                                 |                                                | Anvp   ancvp ≠ {}         → none
          | Choosev    cv = nil        → cv := x            | Chp     actp ∧ cp = nil   → cp := x; …         | Chvp   pv = p ∧ cvp = nil → cvp := x
          |                                                 |                                                | Chva   cva = nil          → cva := x
          | Acceptva   cv ≠ nil        → rva := cv; …       | "       cv@a ≠ nil        → rva := cv@a; …     | "      ceva ≠ nil         → rva := ceva; …
          | Finishva   rv ∈ X          → da := rv           | "       revp@a ∈ X        → da := revp@a       | "      reva ∈ X           → da := reva
          | Cleanupa   Q~F+[d* ≠ nil]  → rva := nil; in := {} | "     "                                      | "      "                  → rva := nil; …

invariant | da ≠ nil ⇒ (∃v | rv = da)       (A2)            | "                                              | "
          | rv = x ∧ ru = x′ ⇒ x = x′       (A3)            | "                                              | "
          | rva = x ⇒ rva = cv              (A4)            | "                                              | "
          | cv = x ⇒ cv ∈ in ∩ ancv         (A5)            | "                                              | "
          | rva ≠ nil ∧ u < v ⇒ rua ≠ nil   (A6)            | "                                              | "
          |                                                 | actp ∧ cvp ≠ nil ⇒ cp = cvp      (C1)          | ceva ≠ nil ⇒ ceva = cv           (B3)
          |                                                 | revp ≠ nil ⇒ revp = rv           (C5)          | reva ≠ nil ⇒ reva = rv           (B4)
          |                                                 | ancp ⊆ ancvp                     (C6)          | ancva ⊆ ancv                     (B6)
          |                                                 | Qdec[ru*=nil] ∧ v>u ⇒ cv = nil   (C8)          | Qdec[ru*=nil] ∧ v>u ⇒ cvp = nil  (B11)
          |                                                 | actp ⇒ actvp; inp ⊆ in           (C2)          | cvp ≠ nil ⇒ cvp ∈ inp ∩ ancvp    (B8)
          |                                                 | actp ∧ cp = nil ⇒ cvp = nil      (C7)          | cva ≠ nil ⇒ cva ∈ ina ∩ ancva    (B9)

Legend: ancv  non-local; in  abstract variable; actp∧  changed from item on left; "  copy of item on left.
[1] This paper was presented as an invited talk at the 2001 Principles of Distributed Computing Conference. It has not been published.
[2] There is a similar treatment of reliable messages in [11] and [12].
[3] Views are ‘ballots’ in Lamport’s original paper, and ‘rounds’ in other papers. ‘View’ suggests a view of the state or a view of the membership of a group, although these are only applications of consensus.
[4] See [9] and [13] for informal explanations of simulation proofs, and [15] for a thorough account.
[5] A different spec would allow it to return nil if there’s no decision, but then it must be able to return nil even if there has already been a decision, since a client may do the Decision action at a process that hasn’t yet heard about the decision. For this paper it makes no difference.
[6] I am indebted to Michael Jackson for a remark that led to this idea.
[7] The “∨ F” is there to simplify the definition of Broadcastk,m in section 6.3.
[8] A more careful treatment would reflect the fact that a receiver must remember the message and its signature in order to forward them, since Gk→m may disappear from ch.