de Prisco, Lampson, and Lynch - Revisiting the Paxos Algorithm

This document was made by OCR from a scan of the technical report. It has not been edited or proofread and is not meant for human consumption, but only for search engines. To see the scanned original, replace OCR.htm with Abstract.htm or Abstract.html in the URL that got you here.

Theoretical Computer Science 243 (2000) 35–91

www.elsevier.com/locate/tcs

Fundamental Study
Revisiting the PAXOS algorithm *

Roberto De Prisco^a,^* , Butler Lampson^b, Nancy Lynch^a

^aMIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139, USA
^bMicrosoft Corporation, 180 Lake View Ave, Cambridge, MA 02138, USA

Received October 1998; revised July 1999
Communicated by M. Mavronicolas

Abstract

The PAXOS algorithm is an efficient and highly fault-tolerant algorithm, devised by Lamport, for reaching consensus in a distributed system. Although it appears to be practical, it seems to be not widely known or understood. This paper contains a new presentation of the PAXOS algorithm, based on a formal decomposition into several interacting components. It also contains a correctness proof and a time performance and fault-tolerance analysis. The formal framework used for the presentation of the algorithm is provided by the Clock General Timed Automaton (Clock GTA) model. The Clock GTA provides a systematic way of describing timing-based systems in which there is a notion of “normal” timing behavior, but that do not necessarily always exhibit this “normal” timing behavior. ~c 2000 Elsevier Science B.V. All rights reserved.

Keywords: I/O automata models; Formal verification; Distributed consensus; Partially synchronous systems; Fault-tolerance

Contents

1. Introduction...............................................................................................36 1.1. Related work.........................................................................................38 1.2. Road map............................................................................................39

2. Models....................................................................................................39 2.1. I/O automata and the GTA..........................................................................39 2.2. The Clock GTA.....................................................................................40 2.3. Composition of automata............................................................................42

* A preliminary version of this paper appeared in Proceedings of the 11th International Workshop on Distributed Algorithms, Saarbrucken, Germany, September 1997, Lecture Notes in Computer Science, Vol. 1320, 1997, pp. 111–125. The first author is on leave from the Dipartimento di Informatica ed Applicazioni, Universitâ di Salerno, 84081 Baronissi (SA), Italy.

* Corresponding author.

E-mail addresses: robdep@theory.lcs.mit.edu (R. De Prisco), blampson@microsoft.com (B. Lampson), lynch@theory.lcs.mit.edu (N. Lynch).

3. The distributed setting....................................................................................43 3.1. Processes.............................................................................................44 3.2. Channels.............................................................................................44 3.3. Distributed systems ..................................................................................45

4. The consensus problem...................................................................................46

5. Failure detector and leader elector .......................................................................47 5.1. A failure detector....................................................................................47 5.2. A leader elector......................................................................................49

6. The PAXOS algorithm .....................................................................................52 6.1. Overview.............................................................................................52 6.2. Automaton BASICPAXOS ..............................................................................54 6.3. Automaton SPAX .....................................................................................80 6.4. Correctness and analysis of ^SPAX ................................................................... 81 6.5. Messages.............................................................................................86 6.6. Concluding remarks..................................................................................86

7. The MULTIPAXOS algorithm ...............................................................................87

8. Application to data replication............................................................................88 9. Conclusion................................................................................................89 Acknowledgement............................................................................................89 References...................................................................................................89

1. Introduction

Reaching consensus is a fundamental problem in distributed systems. Given a distributed system in which each process starts with an initial value, to solve a consensus problem means to give a distributed algorithm that enables each process to eventually output a value of the same type as the input values, in such a way that three conditions, called agreement, validity and termination, hold. There are different definitions of the problem depending on what these conditions require. Distributed consensus has been extensively studied. A good survey of early results is provided in [13]. We refer the reader to [24] for a more recent treatment of consensus problems.

Real distributed systems are often partially synchronous systems subject to process, channel and timing failures and process recoveries. In a partially synchronous distributed system, processes take actions within " time and messages are delivered within d time, for given constants " and d. However, these time bounds hold when the system exhibits a “normal” timing behavior. Hence the above-mentioned bounds of " and d can be occasionally violated (timing failures). Processes may stop and recover; it is possible to keep the state of a process, or part of it, in a stable storage so that the state, or part of it, survives the failure. Messages can be lost, duplicated or reordered. Any practical consensus algorithm needs to consider the above practical setting. Moreover, the basic safety properties must not be affected by the occurrence of failures. Also, the performance of the algorithm must be good when there are no failures, while when failures occur, it is reasonable to not expect efficiency.

Lamport’s PAXOS algorithm [19] meets these requirements. The model considered is a partially synchronous distributed system where each process has a direct communication channel with each other process. The failures allowed are timing failures, loss,

duplication and reordering of messages, and process stopping failures. Process recoveries are allowed; some stable storage is needed. PAXOS is guaranteed to work safely, that is, to satisfy agreement and validity, regardless of process, channel and timing failures and process recoveries. When the distributed system stabilizes, meaning that there are no failures, nor process recoveries, and a majority of the processes are not stopped, for a sufficiently long time, termination is also achieved and the performance of the algorithm is good. Hence PAXOS has good fault-tolerance properties and when the system is stable it combines those fault-tolerance properties with the performance of an efficient algorithm, so that it can be useful in practice. In the original paper [19], the PAXOS algorithm is described as the result of discoveries of archaeological studies of an ancient Greek civilization. That paper contains also a proof of correctness and a discussion of the performance analysis. The style used for the description of the algorithm often diverts the reader’s attention. Because of this, we found the paper hard to understand and we suspect that others did as well. Indeed the PAXOS algorithm, even though it appears to be a practical and elegant algorithm, seems not widely known or understood.

In [19] a variation of PAXOS that considers multiple concurrent runs of PAXOS for reaching consensus on a sequence of values is also presented. We call this variation the MULTIPAXOS algorithm. ¹

This paper contains a new, detailed presentation of the PAXOS algorithm, based on a formal decomposition into several interacting components. It also contains a correctness proof and a time performance and fault-tolerance analysis. The MULTIPAXOS algorithm is also described, together with an application to data replication. The formal framework used for the presentation is provided by the Clock General Timed Automaton (Clock GTA), which has been developed in [5]. The Clock GTA is a special type of Lynch and Vaandrager’s General Timed Automaton (GTA) model [26–28]. The Clock GTA uses the timing mechanisms of the GTA to provide a systematic way of describing both the normal and the abnormal timing behaviors of a partially synchronous distributed system subject to timing failures. The model is intended to be used for performance and fault-tolerance analysis of practical distributed systems based upon the stabilization of the system.

The correctness proof uses automata composition and invariant assertion methods. Automata composition is useful for representing a system using separate components. We provide a modular presentation of the PAXOS algorithm, obtained by decomposing it into several components. Each one of these components copes with a specific aspect of the problem. In particular there is a “failure detector” module that detects process failures and recoveries. There is a “leader elector” module that copes with the problem of electing a leader; processes elected leaders by this module, are used as leaders

1 PAXOS is the name of the ancient civilization studied in [19]. The actual algorithm is called the “single-decree synod” protocol and its variation for multiple consensus is called the “multi-decree parliament” protocol. We use the name PAXOS for the single-decree protocol and the name MULTIPAXOS for the multi-decree parliament protocol.

in PAXOS. The PAXOS algorithm is then split into a basic part that ensures agreement and validity and into an additional part that ensures termination when the system stabilizes; the basic part of the algorithm, for the sake of clarity of presentation, is further subdivided into three components. The correctness of each piece is proved by means of invariants, i.e., properties of system states which are always true in any execution.

The time performance and fault-tolerance analysis is conditional on the stabilization of the system behavior starting from some point in an execution. Using the Clock GTA we prove that when the system stabilizes PAXOS reaches consensus in O(1) time and uses O(n) messages, where n is the number of processes. We also briefly discuss the MULTIPAXOS protocol and a data replication algorithm which uses MULTIPAXOS. With MULTIPAXOS the high availability of the replicated data is combined with high fault tolerance.

1.1. Related work

The consensus algorithms of Dwork et al. [9] and of Chandra and Toueg [2] bear some similarities with PAXOS. The algorithm of [9] also uses “rounds” conducted by a leader, but the strategy used in each round is different from the one used by PAXOS. Also, [9] does not consider process restarts. The time analysis provided in [9] is conditional on a “global stabilization time” after which process response times and message delivery times satisfy the time assumptions. This is similar to our stabilized analysis. A similar time analysis, applied to the problem of reliable group communication, can be found in [12].

The algorithm of Chandra and Toueg is based on the idea of an abstract failure detector [2]. It turns out that failure detectors provide an abstract and modular way of incorporating partial synchrony assumptions in the model of computation. A P failure detector incorporates the partial synchrony considered in this paper. One of the algorithms in [2] uses a 9' failure detector, which is weaker than a P failure detector. This algorithm is based on the rotating coordinator paradigm and as PAXOS uses majorities to achieve consistency. However it takes, in the worst case, longer time than PAXOS to achieve termination. Chandra et al. [1] identified the “weakest” failure detector that can be used to solve the consensus problem. This weakest failure detector is #" and it is equivalent to 9'. The Chandra and Toueg algorithm does not consider channel failures (however, it can be modified to work with loss of messages but the resulting algorithm is less efficient than PAXOS with respect to the number of messages sent).

The failure detector provided in this paper differs from those classified by Chandra and Toueg in that it provides reliability conditional on the system stabilization. If the system eventually stabilizes then our failure detector can be classified in the class of the eventually perfect failure detectors. However it should be noted that in order for PAXOS to achieve termination it is not needed that the system become stable forever but only for a sufficiently long time.

Dolev et al. [8] have adapted the Chandra and Toueg’s definition of failure detector to consider also omission failures and have given a distributed consensus protocol that allows majorities to achieve consensus.

MULTIPAXOS can be easily used to implement a data replication algorithm. The data replication algorithms in [23, 30, 17,22] are based on ideas similar to the ones used in

PAXOS.

PAXOS bears some similarities with the three-phase commit protocol [34]. However, the three-phase commit protocol does not always guarantee majorities to progress. The commit algorithm of Keidar and Dolev [18] is similar to PAXOS in that it always guarantees majorities to progress. Also, PAXOS is more efficient than the three-phase commit protocol when the system is stable and consensus has to be reached on a sequence of values (a three-phase protocol is needed only for the first consensus problem, while all the subsequent ones can be solved with a two-phase exchange of messages).

Cristian’s timed asynchronous model [4] is similar to the distributed setting considered in this paper. It assumes, however, a bounded clock drift even when the system is unstable. Our model is weaker in the sense that makes no assumption on clock drift when the system is unstable. The Clock GTA provides a formal way of modelling the stability property of the timed asynchronous model. In [31] Patt-Shamir introduces a special type of GTA used for the clock synchronization problem. The Clock GTA considers only the local time; our goal is to model good timing behavior starting from some point on and thus we are not concerned with synchronization of the local clocks.

In [20] Lampson provides a brief overview of the PAXOS algorithm together with key ideas for proving the correctness of the algorithm. We used these ideas in the correctness proof provided in this paper.

1.2. Road map

Section 2 describes the I/O automaton models used and Section 3 describes the distributed system considered. Section 4 gives a formal definition of the consensus problem. In Section 5 a failure detector and a leader elector are presented; they are used by the PAXOS algorithm. The PAXOS algorithm itself is described and analyzed in Section 6. Section 7 describes MULTIPAXOS and Section 8 discusses how to use MULTIPAXOS to implement a data replication algorithm.

2. Models

Our formal framework is provided by I/O automaton models, specifically by the Clock GTA model developed in [5]. In this section we briefly describe essential notions about I/O automata needed to read the rest of the paper. We refer the interested reader to [24, Chapters 8 and 23] for more information and references about I/O automaton models, and to [5] for a more detailed presentation of the Clock GTA model.

2.1. I/O automata and the GTA

The I/O automata models are formal models suitable for describing asynchronous and partially synchronous distributed systems. An I/O automaton is a simple type of

state machine in which transitions are associated with named actions. These actions are classified into categories, namely input, output, internal and, for the timed models, time-passage. Input and output actions are used for communication with the external environment, while internal actions are local to the automaton. The time-passage actions are intended to model the passage of time. The input actions are assumed not to be under the control of the automaton, that is, they are controlled by the external environment, which can force the automaton to execute the input actions. Internal and output actions are controlled by the automaton. The time-passage actions are also controlled by the automaton (though this may at first seem somewhat strange, it is just a formal way of modelling the fact that the automaton must perform some action before some amount of time elapses).

The General Timed Automaton (GTA) uses time-passage actions called v(t), t E + to model the passage of time. The time-passage action v(t) represents the passage of time by the amount t.

A GTA consists of four components: (i) the signature, consisting of four disjoint sets of actions, namely, the input, output, internal and time-passage actions; (ii) the set of states; (iii) the set of initial states, which is a nonempty subset of the set of states; (iv) the state-transition relation, which specifies all the possible state to state transitions.

A state-to-state transition, usually called a step, is a triple (s, 1r, s') where s and ^s^'are states of the automaton and 1r is an action that takes the automaton from s to s'. If for a particular state s and action 1r, there is some transition of the form (s, 1r, s'), then we say that 1r is enabled in s. Input actions are enabled in every state.

A timed execution fragment of a GTA is defined to be either a finite sequence

· = s0,1r1,s1,1r2,...,1r,.,s,. or an infinite sequence = s0,1r1,s1,1r2,. ..,1r,.,s,.,..., where the s’s are states, the 1r’s are actions (either input, output, internal, or time-passage), and (sk, 1rk+1 , sk+1) is a step for every k. Note that if the sequence is finite, it must end with a state. The length of a finite execution fragment = s0, 1r1,s1, 1r2,. .. , 1r,., s,. ^is,.. A timed execution fragment beginning with a start state is called a timed execution. If

· is any timed execution fragment and _1r,.is any action in , then we say that the time of occurrence of _1r,.is the sum of all the reals in the time-passage actions preceding 1r,. ⁱⁿ. A timed execution fragment is said to be admissible if the sum of all the reals in the time-passage actions in is oc. A state is said to be reachable if it is the final state of a finite timed execution of the GTA.

In the rest of the paper we will often refer to timed executions (resp. timed execution fragments) simply as executions (resp. execution fragments).

2.2. The Clock GTA

A Clock GTA is a GTA with a special component included in the state; this special variable is called Clock and it can assume values in . The purpose of Clock is to model the local clock of the process. The only actions that are allowed to modify Clock are the time-passage actions v(t). When a time-passage action v(t) is executed

by the automaton, the Clock is incremented by an amount of time t' 0 independent of the amount t of time specified by the time-passage action.² Since the occurrence of the time-passage action v(t) represents the passage of (real) time by the amount t, by incrementing the local variable Clock by an amount t' different from t we are able to model the passage of (local) time by the amount t'. As a special case, we have some time-passage actions in which t' = t; in these cases the local clock of the process is running at the speed of real time.

In the following and in the rest of the paper, we use the notation s .x to denote the value of state component x in state s.

Definition 2.1. A step (sk 1,V(t),sk) of a Clock GTA is called regular if sk.Clock - sk 1. Clock = t; it is called irregular if it is not regular.

That is, a time-passage step executing action v(t) is regular if it increases Clock by t' = t. In a regular time-passage step, the local clock is increased by the same amount as the real time, whereas in an irregular time-passage step v(t) that represents the passage of real time by the amount t, the local clock is increased either by t' <t (the local clock is slower than the real time) or by t' > t (the local clock is faster than the real time).

Definition 2.2. A timed execution fragment of a Clock GTA is called regular if all the time-passage steps of are regular. It is called irregular if it is not regular, i.e., if at least one of its time-passage step is irregular.

In a partially synchronous distributed system processes are expected to respond and messages are expected to be delivered within given time bounds. A timing failure is a violation of these time bounds. An irregular time-passage step can model the occurrence of a timing failure. We remark that a timing failure can actually be either an upper bound violation (a process or a channel is slower than expected) or a lower bound violation (a process or a channel is faster than expected). Obviously, in a regular execution fragment there are no timing failures.

Though we have defined a regular execution fragment so that it does not contain any of the timing failures, we remark that for the the scope of this paper we actually need only that the former type of timing failures (upper bound) does not happen. That is, for the scope of this paper, we could have defined a regular step v(t) as one that increases the clock time by an amount t', t' t.

2.2.1. Using MMTAs to describe Clock GTAs

GTAs encode timing restrictions explicitly into the code of the automata. This provides a lot of flexibility but requires more complicated code to explicitly handle the

2 Formally, we have that if (s, v(t), s') is a step then also (s, v(t^˜), s'), for any ^˜t> 0, is a step. Hence a Clock GTA cannot keep track of the real time.

time and the time bounds. In many situations however we do not need such a flexibility and we only need to specify simple time bounds (e.g., an enabled action is executed within " time). The MMTA³ model is a subclass of the GTA model suitable for describing such simple time bounds. The MMTA does not have time-passage actions but each action is coupled with its time of execution so that the execution of an MMTA is a (not necessarily finite) sequence = s0, (in1, t1), s1, (in2, t2),. .. , (inr, tr), sr,..., where the s’s are states, the in’s are actions, and the t’s are times in [I^¿0.To specify the time bounds an MMTA has a fifth component (with respect to the four components of a GTA) called task partition, which is an equivalence relation on the locally controlled actions (i.e., internal and output action). Each equivalence class is called a task of the automaton. A task C having at least one enabled action is said enabled. Each task C has a lower bound, lower(C), and an upper bound upper(C), on the time that can elapse before an enabled action belonging to the task C is executed. If the task is not enabled then there is no restriction.

There is a standard technique that transforms any MMTA into a GTA (see [24, Section 23.1]). This technique can be extended to transform any MMTA into a Clock GTA (see [5]). In the rest of the paper we will sometimes use MMTAs to describe Clock GTAs and when using MMTAs we will always use lower(C) =0 and upper(C) = ". The following lemma [5] holds.

Lemma 1. Consider a regular execution fragment of a Clock GTA described with the MMTA model, starting from a reachable state s0 and lasting for more than " time. Then (i) any task C enabled in s0 either has a step or is disabled within " time, and (ii) any new enabling of C has a subsequent step or disabling within " time, provided that lasts for more than " time from the enabling of C.

2.3. Composition of automata

The composition operation allows an automaton representing a complex system to be constructed by composing automata representing simpler system components. The most important characteristic of the composition of automata is that properties of isolated system components still hold when those isolated components are composed with other components. The composition identifies actions with the same name in different component automata. When any component automaton performs a step involving action in, so do all component automata that have in in their signatures. Since internal actions of an automaton A are intended to be unobservable by any other automaton B, automaton

A cannot be composed with automaton B unless the internal actions of A are disjoint from the actions of B. (Otherwise, A’s performance of an internal action could force

B to take a step.) Moreover, A and B cannot be composed unless the sets of output actions of A and B are disjoint. (Otherwise two automata would have the control of an

3 The name MMT derives from Merritt, Modugno, and Tuttle who introduced this automaton [29].

output action.) When A and B can be composed we say that they are compatible. The transitions of the composition are obtained by allowing all the components that have a particular action t in their signature to participate, simultaneously, in steps involving t, while all the other components do nothing. Note that this implies that all the components participate in time-passage steps, with the same amount of time passing for all of them.

For a formal definition of the composition operation we refer the reader to [24, Section 23.2.3]. Here we recall the following theorems.

Theorem 2. The composition of a compatible collection of GTAs is a GTA.

Given the execution = s0, t1,s1,..., of a GTA A obtained by composing a compatible collection {^Ai}iEI of GTAs, the notation IAi denotes the sequence obtained from  by deleting each pair t,., s,. for which _t,.is not action of Ai and by replacing each remaining _s,.^by(s,.)i, that is, automaton Ai’s piece of s,..

Theorem 3. Let {^Ai}iEI be a compatible collection of GTAs and let A be the composition of Ai, for all i EI. If is an execution of A, then IAi is an execution of Ai, for every iEI.

The above theorem is important because it enables us to claim that properties proven to be true for a particular automaton A are still true for a bigger automaton obtained by composing automaton A with other automata. We will make extensive use of this theorem in the rest of the paper.

Clock GTAs are GTAs; hence, they can be composed as GTAs are composed. However we point out that the composition of Clock GTAs does not yield a Clock GTA but a GTA.

3. The distributed setting

In this section we discuss the distributed setting. We consider a partially synchronous distributed system consisting of n processes. The distributed system provides a bidirectional channel for every two processes. Each process is uniquely identified by its identifier i E I, where I is a totally ordered finite set of n identifiers. The set I is known by all the processes. Moreover each process of the system has a local clock. Local clocks can run at different speeds, though in general we expect them to run at the same speed as real time. We assume that a local clock is available also for channels; though this may seem somewhat strange, it is just a formal way to express the fact that a channel is able to deliver a given message within a fixed amount of time, by relying on some timing mechanism (which we model with the local clock). We use Clock GTAs to model both processes and channels.

Throughout the rest of the paper we use two constants, ‘ and d, to represent upper bounds on the time needed to execute an enabled action and to deliver a message,

respectively. These time bounds do not necessarily hold for every action and message in every execution; a violation of these bounds is a timing failure.

3.1. Processes

We allow process stopping failures and recoveries, and timing failures. To formally model process stops and recoveries we model process i with a Clock GTA which has a special state component called Statusi and two input actions Stop_i and Recoveri. The state variable Statusi reflects the current condition of process i. The effect of action Stop_i is to set Statusi to stopped, while the effect of Recoveri is to set Statusi to alive. Moreover when Statusi = stopped, all the locally controlled actions are not enabled and the input actions have no effect, except for action Recoveri. We say that a process i is alive (resp. stopped) in a given state s if we have s. Statusi = alive (resp. s.Statusi = stopped). We say that a process i is alive (resp. stopped) in a given execution fragment, if it is alive (resp. stopped) in all the states of the execution fragment. An automaton modelling a process is called a process automaton.

Between a failure and a recovery a process does not lose its state. We remark that PAXOS needs only a small amount of stable storage (see Section 6.6); however, for simplicity, we assume that the entire state of a process is stable. We also assume that there is an upper bound of ‘ on the elapsed clock time if some locally controlled action is enabled. This time bound can be violated if timing failures happen.

Finally, we provide the following definition of “stable” execution fragment of a given process automaton. This definition is used later to define a stable execution of a distributed system.

Definition 3.1. Given a process automaton PROCESSi, we say that an execution fragment  of PROCESSi is stable if process i is either stopped or alive in and is regular.

3.2. Channels

We consider unreliable channels that can lose and duplicate messages. Reordering of messages is allowed, i.e., is not considered a failure. Timing failures are also possible. Fig. 1 shows the code of a Clock GTACHANNELi_; j, which models the communication channel from process i to process j; there is one automaton for each possible choice of i and j. Notice that we allow the possibility that the sender and the receiver are the same process. We denote by h' the set of messages that can be sent over the channels.

The time-passage actions of CHANNELi_;j do not let pass the time beyond t^+ d if a message (m_;t^), that is, a message m sent at time t^,is in the channel. Clearly this restriction is on the local time and messages can also be lost. However if the execution is regular and no messages are lost then a particular message is delivered in a timely manner. The following definition of “stable” execution fragment for a channel captures the condition under which messages are delivered on time.

CHANNELi_, _j

Signature:

Input: Send(m)i_, j, Losei_, j, Duplicatei, _j

Output: Receive(m)i_, _j

Time-passage: v(t)

State:

Clock E , initially arbitrary

Msgs, a set of elements of .h' × , initially empty

Actions:

input Send(m)i, _j

Eff: add (m, Clock) to Msgs

output Receive(m)i_, _j

Pre: (m, t) is in Msgs, for some t Eff: remove (m, t) from Msgs

input Losei_, _j

Eff: remove one element of Msgs

input Duplicatei, _j

Eff: let (m, t) be an element of Msgs let t' such that t6t'6Clock place (m,t') into Msgs

time-passage v(t)

Pre: Let t'¿0 be such that

for all (m, t'') E Msgs

Clock + t' 6t'' + d E: Clock := Clock + _t'

Fig. 1. Automaton CHANNELi_,j.

Definition 3.2. Given a channelCHANNELi_, j, we say that an execution fragment of CHANNELi_, _jis stable if no Losei_, _jand Duplicate i, j actions occur in and is regular.

We remark that the above definition requires also that no Duplicate i, j actions happen. This is needed for the performance analysis (duplicated messages may introduce delays in the PAXOS algorithm).

The next lemma follows from the above discussion.

Lemma 4. In a stable execution fragment of CHANNELi_,j beginning in a reachable state s and lasting for more than d time, we have that (i) all messages (m, t) that in state s are in Msgsi, _jare delivered by time d, and (ii) any message sent in is delivered within time d of the sending, provided that lasts for more than d time from the sending of the message.

3.3. Distributed systems

A distributed system is the composition of automata modelling channels and processes. We are interested in modelling bad and good behaviors of a distributed system;

in order to do so we provide some definitions that characterize the behavior of a distributed system. The definition of “nice” execution fragment given later in this section, captures the good behavior of a distributed system. Informally, a distributed system behaves nicely if there are no process failures and recoveries, no channel failures and no irregular steps – remember that an irregular step models a timing failure – and a majority of the processes are alive.

Definition 3.3. A communication system for the set 5 of processes, is the composition of channel automataCHANNELi_, jfor all possible choices of i,j E 5.

Definition 3.4. A distributed system is the composition of process automata modeling the set 5 of processes and a communication system for 5.

We define the communication system _SCHAto be the communication system for the set 5 of all processes.

Next we provide the definition of “stable” execution fragment for a distributed system exploiting the definition of stable execution fragment given previously for channels and process automata.

Definition 3.5. Given a distributed system S, we say that an execution fragment of S is stable if: (i) for all automata PROCESSi modelling process i, i E S it holds that IPROCESSi is a stable execution fragment for process i; (ii) for all channels CHANNELi_, _jwith i,j E S it holds that ICHANNELi_, jis a stable execution fragment for CHANNELi_, j.

Finally we provide the definition of “nice” execution fragment that captures the conditions under which PAXOS satisfies termination.

Definition 3.6. Given a distributed system S, we say that an execution fragment of S is nice if is a stable execution fragment and a majority of the processes are alive in .

The above definition requires a majority of processes to be alive. As is explained in Section 6.6, any quorum scheme could be used instead of majorities.

In the rest of the paper, we will often use the word “system” to mean “distributed system”.

4. The consensus problem

Several different but related agreement problems have been considered in the literature. All have in common that processes start the computation with initial values and at the end of the computation each process must reach a decision. The variations mostly concern stronger or weaker requirements that the solution to the problem has to satisfy. The requirement that a solution to the problem has to satisfy are captured by three properties, usually called agreement, validity and termination. It is clear that

the definition of the consensus problem must take into account the distributed setting in which the problem is considered.

We assume that for each process i there is an external agent that provides an initial value v by means of an action Init(v)i. We denote by V the set of possible initial values and, given a particular execution , we denote by V  the subset of V consisting of those values actually used as initial values in , that is, those values provided by Init(v)i actions executed in . A process outputs a decision v by executing an action Decide(v)i. If a process i executes action Decide(v)i more than once then the output value v must be the same.

To solve the consensus problem means to give a distributed algorithm that, for any execution of the system, satisfies

· Agreement: All the Decide(v) actions in have the same v.

· Validity: For any Decide(v) action in , v belongs to V.

and, for any admissible execution , satisfies

· Termination: If = fly and y is a nice execution fragment and for each process i alive in y an Init(v)i action occurs in while process i is alive, then any process i alive in y, executes a Decide(v)i action in .

The agreement and termination conditions require, as one can expect, that processes “agree” on a particular value. The validity condition is needed to relate the output value to the input values (otherwise a trivial solution, i.e., always output a default value, exists).

5. Failure detector and leader elector

In this section we provide a failure detector algorithm and then we use it to implement a leader election algorithm, which, in turn, is used in Section 6 to implement PAXOS. The failure detector and the leader elector we implement here are both sloppy, meaning that they are guaranteed to give accurate information on the system only in a stable execution. However, this is enough for implementing PAXOS.

5.1. A failure detector

In this section we provide an automaton that detects process failures and recoveries. This automaton satisfies certain properties that we will need in the rest of the paper. We do not provide a formal definition of the failure detection problem, however, roughly speaking, the failure detection problem is the problem of checking which processes are alive and which ones are stopped.

Fig. 2 shows a Clock GTA, called DETECTOR(z, c)i, which detects failures. In our setting failures and recoveries are modeled by means of actions Stop_i and Recoveri. These two actions are input actions of DETECTOR(z, c)i. Moreover DETECTOR(z, c)i has InformStopped(j)i and InformAlive(j)i as output actions which are executed when, respectively, the stopping and the recovering of process j are detected.

Actions:

input Stops

Eff: Status := stopped

output Send(“Alive”)s, _j

Pre: Status = alive

Eff: Lastsend(j) := Clock + z

input Receive(“Alive”)j, _s

Eff: if Status = alive then Prevrec(j) := Clock if j E Alive then

Alive := Alive U {j} Lastcheck(j) := Clock + c

internal Check(j)s

Pre: Status = alive

jEAlive

Eff: Lastcheck(j) := Clock + c

if Clock¿Prevrec(j) +z + d then Alive := Alive\{j}

input Recovers

Eff: Status:=alive

output InformStopped(j)s

Pre: Status= alive

j E Alive

Eff: Lastinform(j) := Clock + "

output InformAlive(j)s

Pre: Status= alive

jEAlive

Eff: Lastinform(j) := Clock + "

time-passage v(t)

Pre: none

Eff: if Status = alive then

Let t' be such that

Vj, Clock + t' <Lastinform(j) Vj, Clock + t' <Lastsend(j) Vj, Clock + t' <Lastcheck(j)

Clock := Clock + _t'

time t + z) and checks for incoming messages at least once in an interval of time of a fixed length c. Let us denote by _SDETthe system consisting of system _SCHAand an automaton DETECTOR(z, c)s for each process s E 5.

For simplicity of notation, henceforth we assume that z = t' and c = t', that is, we use DETECTOR(t', t')s. In practice the choice of z and c may be different.

Using the strategy used by DETECTOR(t', t')sit is not hard to prove the following lemmas (for a detailed formal proof we refer the interested reader to [5]).

Lemma 5. If an execution fragment of _SDET,starting in a reachable state and lasting for more than 3t' + 2d time, is stable and process s is stopped in , then by time 3t' + 2d from the beginning of , for each process j alive in , an action InformStopped(s) _jis executed and no subsequent InformAlive(s) _jaction is executed in .

Lemma 6. If an execution fragment of _SDET,starting in a reachable state and lasting for more than d + 2t' time, is stable and process s is alive in , then by time d+2t' from the beginning of ,for each process j alive in , an action InformAlive(s)j is executed and no subsequent InformStopped(s) _jaction is executed in .

The strategy used by DETECTORs is a straightforward one. For this reason it is very easy to implement. However the failure detector so obtained is not reliable, i.e., it does not give accurate information, in the presence of failures (Stops, Loses_, j, irregular executions). For example, it may consider a process stopped just because the “Alive” message of that process was lost in the channel. Automaton DETECTORs is guaranteed to provide accurate information on faulty and alive processes only when the system is stable.

5.2. A leader elector

Electing a leader in an asynchronous distributed system is a difficult task. An informal argument that explains this difficulty is that the leader election problem is somewhat similar to the consensus problem (which, in an asynchronous system subject to failures is unsolvable [14]) in the sense that to elect a leader all processes must reach consensus on which one is the leader. It is fairly clear how a failure detector can be used to elect a leader. Indeed the failure detector gives information on which processes are alive and which ones are not alive. This information can be used to elect the current leader. We use the DETECTOR(t', t')s automaton to check for the set of alive processes. Fig. 3 shows automaton LEADERELECTORs which is an MMTA. Remember that we use MMTAs to describe in a simpler way Clock GTAs. Automaton LEADERELECTORs interacts with DETECTOR(t',t')s by means of actions InformStopped(j)s, which inform process s that process j has stopped, and InformAlive(j)s, which inform process s that process j has recovered. Each process updates its view of the set of alive processes when these two actions are executed. The process with the biggest identifier in the set of alive processes

LEADERELECTORi

Signature:

Input: InformStopped(j)i, InformAlive(j)i, Stop_i, Recoveri

Output: Leaderi, NotLeaderi

State:

Status E {alive_; stopped} initially alive

Pool E ₂5initially {i}

Derived variable:

Leader, defined as max of Pool Actions:

input Stop_i

Eff: Status := stopped

output Leaderi

Pre: Status= alive
i = Leader

Eff: none

input InformStopped(j)i

Eff: if Status = alive then Pool := Pool\{j}

Tasks and bounds:

{Leaderi, NotLeaderi}, bounds [0_; t']

input Recoveri

Eff: Status := alive

output NotLeaderi Pre: Status= alive

i = Leader Eff: none

input InformAlive(j)i

Eff: if Status = alive Pool :=PoolU{j}

Fig. 3. Automaton LEADERELECTOR for process i.

is declared leader. We denote with _SLEAthe system consisting of _SDETcomposed with a LEADERELECTORi automaton for each process i E 5. Fig. 4 shows _SLEA;it also shows SDET, which is a subsystem of SLEA.

Since DETECTOR(t'_; t')i is not a reliable failure detector, also LEADERELECTORi is not reliable. Thus, it is possible that processes have different views of the system so that more than one process considers itself leader, or the process supposed to be the leader is actually stopped. However, as the failure detector becomes reliable when the system SDET executes a stable execution fragment (see Lemmas 5 and 6), also the leader elector becomes reliable when system _SLEAis stable. Notice that when _SLEAexecutes a stable execution fragment, so does SDET.

Formally, we say that a state s of system _SLEA,is a unique-leader state if there exists an alive process i such that for all alive processes j it holds that s:Leader _j= i.

Fig. 4. The system SLEA.

In such a case, process i is the leader of state s. Moreover, we say that an execution of system _SLEA,is a unique-leader execution if all the states of are unique-leader states with the same leader in all the states.

Next lemma states that in a stable execution fragment, eventually there is unique-leader state.

Lemma 7. If an execution fragment of _SLEA,starting in a reachable state and lasting for more than 4t' + 2d, is stable, then by time 4t' + 2d, there is a state occurrence s such that in state s and in all the states after s there is a unique leader. Moreover this unique leader is always the process with the biggest identier among the processes alive in .

Proof. First notice that the system _SLEAconsists of system _SDETcomposed with other automata. Hence by Theorem 3 we can use any property of _SDET.In particular we can use Lemmas 5 and 6 and thus we have that by time 3t' + 2d each process has a consistent view of the set of alive and stopped processes. Let i be the leader. Since

is stable and thus also regular, by Lemma 1, within additional t' time, actions Leaderj and NotLeader _jare consistently executed for each process j, including process j = i. The fact that i is the process with the biggest identifier among the processes alive in

follows directly from the code of LEADERELECTORi.

We remark that, for many algorithms that rely on the concept of leader, it is important to provide exactly one leader. For example when the leader election is used to generate a new token in a token ring network, it is important that there is exactly one process (the leader) that generates the new token, because the network gives the right to send messages to the owner of the token and two tokens may result in an interference between two communications. For these algorithms, having two or more leaders jeopardizes the correctness. Hence the sloppy leader elector provided before is not suitable. However, for the purpose of this paper, LEADERELECTOR is all we need.

6. The PAXOS algorithm

PAXOS was devised a very long time ago⁴ but its discovery, due to Lamport, is very recent [19].

In this section we describe the PAXOS algorithm, provide an implementation using Clock GT automata, prove its correctness and analyze its performance. The performance analysis is given assuming that there are no failures nor recoveries, and a majority of the processes are alive for a sufficiently long time. We remark that when no restrictions are imposed on the possible failures, the algorithm might not terminate.

6.1. Overview

Our description of PAXOS is modular: we have separated various parts of the overall algorithm; each piece copes with a particular aspect of the problem. This approach should make the understanding of the algorithm much easier. The core part of the algorithm is a module that we call BASICPAXOS; this piece incorporates the basic ideas on which the algorithm itself is built. The description of this piece is further subdivided into three components, namely BPLEADER, BPAGENT and BPSUCCESS.

In BASICPAXOS processes try to reach a decision by running what we call a “round”. A process starting a round is the leader of that round. BASICPAXOS guarantees that, no matter how many leaders start rounds, agreement and validity are not violated. This means that in any run of the algorithm no two different decisions are ever made and any decision is equal to some input value. However to have a complete algorithm that satisfies termination when there are no failures for a sufficiently long time, we need to augment BASICPAXOS with another module; we call this module STARTER. The functionality of STARTER is to make the current leader start a new round if the previous one is not completed within some time bound.

Leaders are elected by using the LEADERELECTOR algorithm provided in Section 5. We remark that this is possible because the presence of two or more leaders does not jeopardize agreement or validity; however, to get termination there must be a unique leader.

⁴The most accurate information dates it back to the beginning of this millennium [19].

Fig. 5. PAXOS: process i. Some of the actions shown in the figure will be defined later in this section.

Thus, our implementation of PAXOS is obtained by composing the following automata: CHANNELi_; _jfor the communication between processes, DETECTORi and LEADERELECTORi for the leader election, BASICPAXOSi and STARTERi, for every process i_;j 5. The resulting system is called SPAX.

Fig. 5 shows the automaton at process i. Notice that not all of the actions are drawn in the picture: we have drawn only some of them and we refer to the formal code for all of the actions. Actions Stop_i and Recoveri are input actions of all the automata. The _SPAXautomaton at process i interacts with automata at other processes by sending messages over the channels. Channels are not drawn in the picture.

Fig. 6 shows the messages exchanged by processes i and j. The automata that send and receive these messages are shown in the picture. We remark that channels and actions interacting with channels are not drawn, as well as other actions for the interaction with other automata.

Fig. 6. BASICPAXOS: Messages.

It is worth to remark that some pieces of the algorithm do need to be able to measure the passage of the time (DETECTORi, STARTERi and BPSUCCESSi) while others do not.

We will prove (Theorems 9 and 10) that the system _SPAXsolves the consensus problem ensuring partial correctness – any output is guaranteed to be correct, that is, agreement and validity are satisfied – and (Theorem 17) that _SPAXguarantees also termination when the system executes a nice execution fragment, that is, without failures and recoveries and with at least a majority of the processes remaining alive.

6.1.1. Roadmap for the rest of the section

In Section 6.2 we provide automaton BASICPAXOS. This automaton is responsible for carrying out a round in response to an external request. We prove that any round satisfies agreement and validity and we provide a performance analysis for a successful round. Then in Section 6.3 we provide automaton STARTER which takes care of the problem of starting new rounds. In Section 6.4 we prove that the entire system _SPAX^iscorrect and provide a performance analysis. In Section 6.5 we provide some comments about the number of messages used by the algorithm. Finally Section 6.6 contains some concluding remarks.

6.2. Automaton BASICPAXOS

In this section we present the automaton BASICPAXOS which is the core part of the PAXOS algorithm. We begin by providing an overview of how automaton BASICPAXOS works, then we provide the automaton code along with a detailed description and finally we prove that it satisfies agreement and validity.

6.2.1. Overview

The basic idea is to have processes propose values until one of them is accepted by a majority of the processes; that value is the final output value. Any process may

propose a value by initiating a round for that value. The process initiating a round is said to be the leader of that round while all processes, including the leader itself, are said to be agents for that round. Informally, the steps for a round are the following.

(1) To initiate a round, the leader sends a “Collect” message to all agents⁵ announcing that it wants to start a new round and at the same time asking for information about previous rounds in which agents may have been involved.

(2) An agent that receives a message sent in Step 1 from the leader of the round, responds with a “Last” message giving its own information about rounds previously conducted. With this, the agent makes a kind of commitment for this particular round that may prevent it from accepting (in Step 4) the value proposed in some other round. If the agent is already committed for a round with a bigger round number then it informs the leader of its commitment with an “OldRound” message.

(3) Once the leader has gathered information about previous rounds from a majority of agents, it decides, according to some rules, the value to propose for its round and sends to all agents a “Begin” message announcing the value and asking them to accept it. In order for the leader to be able to choose a value for the round it is necessary that initial values be provided. If no initial value is provided, the leader must wait for an initial value before proceeding with Step 3. The set of processes from which the leader gathers information is called the info-quorum of the round.

(4) An agent that receives a message from the leader of the round sent in Step 3, responds with an “Accept” message by accepting the value proposed in the current round, unless it is committed for a later round and thus must reject the value proposed in the current round. In the latter case the agent sends an “OldRound” message to the leader indicating the round for which it is committed.

(5) If the leader gets “Accept” messages from a majority of agents, then the leader sets its own output value to the value proposed in the round. At this point the round is successful. The set of agents that accept the value proposed by the leader is called the accepting-quorum.

Since a successful round implies that the leader of the round reached a decision, after a successful round the leader still needs to do something, namely to broadcast the reached decision. Thus, once the leader has made a decision it broadcasts a “Success” message announcing the value for which it has decided. An agent that receives a “Success” message from the leader makes its decision choosing the value of the successful round. We use also an “Ack” message sent from the agent to the leader, so that the leader can make sure that everyone knows the outcome.

Fig. 7 shows: (a) the steps of a successful round r; (b) the responses from an agent that informs the leader that an higher numbered round r' has been already initiated; (c) the broadcast of a decision. The parameters used in the messages will be explained later. Section 6.2.2 contains a description of the messages.

5 Thus it sends a message also to itself. This helps in that we do not have to specify different behaviors for a process according to the fact that it is both leader and agent or just an agent. We just need to specify the leader behavior and the agent behavior.

Fig. 7. Exchange of messages.

Since different rounds may be carried out concurrently (several processes may concurrently initiate rounds), we need to distinguish them. Every round has a unique identifier. Next we formally define these round identifiers. A round number is a pair (x, i) where x is a nonnegative integer and i is a process identifier. The set of round numbers is denoted by . A total order on elements of is defined by (x, i) <(y,j) iffx<y or, x=y and i<j.

We say that round r precedes round r' if r<r'.

If round r precedes round r' then we also say that r is a previous round, with respect to round r'. We remark that the ordering of rounds is not related to the actual time the rounds are conducted. It is possible that a roundr' is started at some point in time and a previous round r, that is, one with r <r', is started later on.

For each process i, we define a “+i” operation that given a round number (x,j) and an integer y, returns the round number (x,j) +i y = (x + y, i).

Every round in the algorithm is tagged with a unique round number. Every message sent by the leader or by an agent for a round (with round number) r E , carries the

round number r so that no confusion among messages belonging to different rounds is possible.

However the most important issue is about the values that leaders propose for their rounds. Indeed, since the value of a successful round is the output value of some processes, we must guarantee that the values of successful rounds are all equal in order to satisfy the agreement condition of the consensus problem. This is the tricky part of the algorithm and basically all the difficulties derive from solving this problem. Consistency is guaranteed by choosing the values of new rounds exploiting the information about previous rounds from at least a majority of the agents so that, for any two rounds, there is at least one process that participated in both rounds.

In more detail, the leader of a round chooses the value for the round in the following way. In Step 1, the leader asks for information and in Step 2 an agent responds with the number of the latest round in which it accepted the value and with the accepted value or with round number (0,j) and nil if the agent has not yet accepted a value. Once the leader gets such information from a majority of the agents (which is the info-quorum of the round), it chooses the value for its round to be equal to the value of the latest round among all those it has heard from the agents in the info-quorum or equal to its initial value if all agents in the info-quorum were not involved in any previous round. Moreover, in order to keep consistency, if an agent tells the leader of a round r that the last round in which it accepted a value is round T1, r¹ <r, then implicitly the agent commits itself not to accept any value proposed in any other round

T¹¹,T¹<T¹¹<T.

Given the above setting, if T¹ is the round from which the leader of round T gets the value for its round, then, when a value for round T has been chosen, any round T11, T¹ <T ¹¹<T, cannot be successful; indeed at least a majority of the processes are committed for round T, which implies that at least a majority of the processes are rejecting round T¹¹.This, along with the fact that info-quorums and accepting-quorums are majorities, implies that if a round T is successful, then any round with a bigger round number ^˜1>T is for the same value. Indeed the information sent by processes in the info-quorum of round i ˜ is used to choose the value for the round, but since info-quorums and accepting-quorums share at least one process, at least one of the processes in the info-quorum of round T¹ is also in the accepting-quorum of round T. Indeed, since the round is successful, the accepting-quorum is a majority. This implies that the value of any round ^˜i> T must be equal to the value of round T, which, in turn, implies agreement.

We remark that instead of majorities for info-quorums and accepting-quorums, any quorum system can be used. Indeed the only property that is required is that there be a process in the intersection of any info-quorum with any accepting-quorum.

Example. Fig. 8 shows how the value of a round is chosen. In this example we have a network of 5 processes, A, B, C, D, E (where the ordering is the alphabetical one) and vA, vB denote the initial values of A and B. At some point process B is the leader and starts round (1,B). It receives information from A,B,E (the set {A,B,E} is the

Fig. 8. Choosing the values of rounds. Empty boxes denote that the process is in the info-quorum, and black boxes denote acceptance. Dotted lines indicate commitments.

info-quorum of this round). Since none of them has been involved in a previous round, process B is free to choose its initial value VB as the value of the round. However it receives acceptance only from B, C (the set {B, C} is the accepting-quorum for this round). Later, process A becomes the leader and starts round (2,A). The info-quorum for this round is {A, D, E}. Since none of this processes has accepted a value in a previous round, A is free to choose its initial value for its round. For round (2,D) the info-quorum is {C,D,E}. This time in the quorum there is process C that has accepted a value in round (1, B) so the value of this round must be the same of that of round (1,B). For round (3,A) the info-quorum is {A,B,E} and since A has accepted the value of round (2,A) then the value of round (2,A) is chosen for round (3,A). For round (3,B) the info-quorum is {A, C,D}. In this case there are three processes that accepted values in previous rounds: process A that has accepted the value of round (2,A) and processes C,D, that have accepted the value of round (2,D). Since round (2,D) is the higher round number, the value for round (3,B) is taken from round (2,D). Round (3,B) is successful.

To end up with a decision value, rounds must be started until at least one is successful. The basic consensus module BASICPAXOS guarantees that a new round does not

violate agreement or validity, that is, the value of a new round is chosen in such a way that if the round is successful, it does not violate agreement and validity. However, it is necessary to make BASICPAXOS start rounds until one is successful. We deal with this problem in Section 6.3.

6.2.2. The code

In order to describe automaton BASICPAXOS for process we provide three automata. One is called BPLEADER and models the “leader” behavior of the process; another one is called BPAGENT and models the “agent” behavior of the process; the third one is called BPSUCCESS and it simply takes care of broadcasting a reached decision. Automaton

BASICPAXOS is the composition of BPLEADER , BPAGENT and BPSUCCESS .

Figs. 9 and 10 show the code for BPLEADER , while Fig. 11 shows the code for BPAGENT . We remark that these code fragments are written using the MMTA model. Remember that we use MMTA to describe in a simpler way Clock GT automata. Figs. 12 and 13 show automaton BPSUCCESS . The purpose of this automaton is simply to broadcast the decision once it has been reached by the leader of a round. Figs. 6 and 7 describe the exchange of messages used in a round.

It is worth noticing that the code fragments are “tuned” to work efficiently when there are no failures. Indeed messages for a given round are sent only once, that is, no attempt is made to try to cope with losses of messages and responses are expected to be received within given time bounds. Other strategies to try to conduct a successful round even in the presence of some failures could be used. For example, messages could be sent more than once to cope with the loss of some messages or a leader could wait more than the minimum required time before abandoning the current round and starting a new one – this is actually dealt with in Section 6.3. We have chosen to send only one message for each step of the round: if the execution is nice, one message is enough to conduct a successful round. Once a decision has been made, there is nothing to do but try to send it to others. Thus once the decision has been made by the leader, the leader repeatedly sends the decision to the agents until it gets an acknowledgment. We remark that also in this case, in practice, it is important to choose appropriate time-outs for the re-sending of a message; in our implementation we have chosen to wait the minimum amount of time required by an agent to respond to a message from the leader; if the execution is stable this is enough to ensure that only one message announcing the decision is sent to each agent.

We remark that there is some redundancy that derives from having separate automata for the leader behavior and for the broadcasting of the decision. For example, both automata BPLEADER and BPSUCCESS need to be aware of the decision, thus both have a Decision variable (the Decision variable of BPSUCCESS is updated when action RndSuccess is executed by BPLEADER after the Decision variable of BPLEADER is set). Having only one automaton would have eliminated the need of such a duplication. However we preferred to separate BPLEADER and BPSUCCESS because they accomplish different tasks.

State:

Status E {alive, stopped} init. alives

IamLeader, a boolean init. false

Mode E {collect , gatherlast,

wait ,begincast, gatheraccept,

decided,done} init. done

Init Value E VU nil init. nil

Decision E V U {nil} init. nil

Derived Variable:

CurRnd E R init. (0, i)

HighestRnd E R init. (0, i)

Value E V U {nil} init. nil

ValFromE R init. (0,i)

Info Quo E ₂Iinit. {}

Accept Quo E ₂Iinit. {}

InMsgs, multiset of msgs init. {}
OutMsgs, multiset of msgs init. {}

input Stop_i

Eff: Status := stopped

input Recoveri

Eff: Status:= alive

input Leaderi

Eff: if Status = alive then IamLeader := true

input NotLeaderi

Eff: if Status = alive then IamLeader := false

output Send(m)i_, _j

Pre: Status = alive mi, j E OutMsgs

E: removemi_, jfrom OutMsgs

input Receive(m)j_, _i

Eff: if Status = alive then addmj_, ito InMsgs

input Init(v)i

Eff: if Status = alive then
In it Value := v

input NewRoundi

Eff: if LeaderAlive = true then CurRnd := HighestRnd +i 1 HighestRnd := CurRnd Mode := collect

output Collecti

Pre: LeaderAlive = true

Mode = collect

Eff: ValFrom := (0, i) Info Quo := {} Accept Quo := {}

Vj put (CurRnd, “Collect”)i_, _j

in OutMsgs

Mode := gatherlast

internal GatherLast(m)i Pre: LeaderAlive = true

Mode = gatherlast

m = (r, “Last”,r', ^v)j, i

m E InMsgs CurRnd = r

E: remove m from InMsgs

Info Quo := Info Quo U {j}

if ValFrom<r' and v=nil then

Value := v ValFrom :=r'

if I Info QuoI >n/2 then

Mode := gathered

output Gathered(Value)

Pre: LeaderAlive = true Mode = gathered

Eff: if Value = nil and

InitValue = nil then

Value := Init Value if Value = nil then

Mode := begincast

else

Mode := wait

internal Continues

Pre: LeaderAlive = true
Mode = wait

Value = nil

Init Value = nil

Eff: Value := Init Value Mode := begincast

internal BeginCast_i

Pre: LeaderAlive = true Mode = begincast Eff: Vj, let m be

(CurRnd, “Begin”, Value)i_, _jput m in OutMsgs Mode := gatheraccept

internal GatherAccept(m),

Pre: LeaderAlive = true Mode = gatheraccept m = (T, “Accept”)j, _imE InMsgs

CurRnd = T

E: remove m from InMsgs AcceptQuo := AcceptQuo U {j} if I Accept QuoI >n/2 then

Decision := Value

Mode := decided

output RndSuccess(Decision )j Pre: LeaderAlive = true Mode = decided

E: Mode := done

internal GatherOldRound(m), Pre: Status = alive

m = (T, “OldRound”, ^T')j, i mE InMsgs

HighestRnd <r'

E: remove m from InMsgs HighestRnd :=r'

BPAGENTs

Signature:

Input: Receive(m)j_, s, m E {“Collect”, “Begin”}
Init(v)s, Stops, Recovers

Internal: LastAccept(m)s, m is a “Collect” message Accept(m)s, m is a “Begin” message

Output: Send(m)s_, j, m E {“Last”, “Accept”, “OldRound”}

State:

Status E {alive, stopped}	init. alive	Commit E	init. (0, s)
LastRE	init. (0,s)	InMsgs, multiset of msgs	init. {}
LastVEVU{nil}	init. nil	OutMsgs, multiset of msgs	init. {}
Actions:

input Stops

E: Status := stopped

input Recovers

Eff: Status := alive

output Send(m)s_, _j

Pre: Status = alive

m E OutMsgs

Eff: removems_, jfrom OutMsgs

input Receive(m)j_, _s

Eff: if Status = alive then add _mj,sto InMsgs

input Init(v)s

Eff: if Status = alives then if Last V= nil then LastV:= v

Tasks and bounds:

internal LastAccept(m)s

Pre: Status = alive

m = (r, “Collect”)j_, sE InMsgs E: remove m from InMsgs

if r Commit then

Commit := r

put (r, “Last”, LastR , Last V)s, j in OutMsgs

else

put (r, “OldRound”, Commit)s_, _jin OutMsgs

internal Accept(m)s

Pre: Status = alive

m = (r, “Begin”,v)j_, sE InMsgs E: remove m from InMsgs

if r Commit then

put (r, “Accept”)s, _jin InMsgs LastR:=r, LastV:=v

else

put (r, “OldRound”, Commit)s, _jin OutMsgs

BPSUCCESSs

Signature:

Input: Receive(m)j_, s, m E {“Ack”, “Success”}

Stops, Recovers, Leaders, NotLeaders, RndSuccess(v)s

Internal: SendSuccesss, Checks

Output: Send(m)s_, j, m E {“Ack”, “Success”}
Decide(v)s

Time-passage: v(t)

State:

Clock E init. arbitrary For each j E I

StatusE{alive,stopped} init. alive Acked(j), a boolean init. false

IamLeader, a boolean init. false LastSendAck(j) E U {00} init. 00

Decision E V U {nil} init. nil LastSendSuc(j)E U{00} init. 00

Prevsend E U{nil} init. nil OutAckMsgs(j), set of msgs init. {}

Last Check E U {00} init. 00 OutSucMsgs(j), set of msgs init. {}

LastSS E U {00} init. 00

Actions:

input Stops

E: Status := stopped input Leaders

Eff: if Status = alive and

IamLeader = false then

IamLeader := true

if Decision = nil then

LastSS := clock + ‘

PrevSend := nil output Send(m)s_, _j

Pre: Status = alive

ms, j E OutAckMsgs(j)

Eff: OutAckMsgs(j) := {}

LastSendAck(j) := 00 output Send(m)s_, _j

Pre: Status = alive

ms, j E OutSucMsgs(j)
Eff: OutSucMsgs(j) := {}
LastSendSuc(j) := 00

input Recovers

Eff: Status:= alive

input NotLeaders

Eff: if Status = alive then IamLeader := false LastSS := 00

LastCheck := 00

For each j E I

LastSendSuc(j) := 00 input Receive( (“Ack”) )j, s Eff: if Status = alive then

Acked(j) := true input Receive( (“Success”, v) )j, s Eff: if Status = alive then

Decision := v

put (“Ack”)s_, _jinto OutAckMsgs(j) LastSendAck(j) := Clock + ‘

Text Box: input RndSuccess(v)i
Eff: if Status = alive then
Decision := v
if IamLeader = true then LastSS := Clock + " PrevSend := nil
internal Check,
Pre: Status = alive
PrevSend = nil t=PrevSend +(2" +2d) Clock>t
E: PrevSend:=nil LastSS := Clock + " LastCheck := 00
output Decide(v)i
Pre: Status = alive
Decision = nil
Decision = v
E: none internal SendSuccess,

Pre: Status = alive

IamLeader = true Decision = nil

PrevSend = nil

j = i, Acked(j)= false

Eff: Vj=i such that Acked(j)= false

put (“Success”, Decision),; _j

in OutSucMsgs(j)

LastSendSuc(j) := Clock + " PrevSend := Clock

LastCheck := Clock + (2"+ 2d) + " LastSS := 00

time-passage v(t)

Pre: none

Eff: if Status = alive then

Let t' be such that

Clock + t' <Last Check Clock + t' <LastSS and for each j E 5

Clock + t' <LastSendAck(j)

Clock + t' <LastSendSuc(j)

Clock := Clock + _t'

If the agent did not accept any value in previous rounds, then v is either nil or the initial value of the agent and r' is (0_;j).

(3) “Begin” messages, m = (r_; “Begin”_;v)i_; j.This message is sent by the leader of round r to announce the value v of the round and at the same time to ask to accept it.

(4) “Accept” messages, m = (r_; “Accept”)j_; i. This message is sent by an agent to respond to a “Begin” message from the leader. With this message an agent accepts the value proposed in the current round.

(5) “OldRound” messages, m = (r_; “OldRound”_;r')j_; i.This message is sent by an agent to respond either to a “Collect” or a “Begin” message. It is sent when the agent is committed to reject round r and it informs the leader about round r', which is the higher numbered round for which the agent is committed to reject round r.

(6) “Success” messages, m = (“Success”_;v)i_; j.This message is sent by the leader to broadcast the decision.

(7) “Ack” messages, m = (“Ack”)j_; i. This message is an acknowledgment, so that the

leader can be sure that an agent has received the “Success” message.

We use the kind of a message to indicate any message of that kind. For example the notation m E {“Collect”, “Begin”} means that m is either a “Collect” message, that is m = (r_; “Collect”) for some r, or a “Begin” message, that is m = (r_; “Begin”_; v) for some r and v.

Automaton BPLEADERi. Variable Statusi is used to model process failures and recoveries. Variable IamLeaderi keeps track of whether the process is leader. Variable Modei is used like a program counter, to go through the steps of a round. Variable Init Valuei contains the initial value of the process. Variable Decisioni contains the value, if any, decided by process i. Variable CurRndi contains the number of the round for which process i is currently the leader. Variable HighestRnd_i stores the highest round number seen by process i. Variable Valuei contains the value being proposed in the current round. Variable ValFromi is the round number of the round from which Valuei has been chosen (recall that a leader sets the value for its round to be equal to the value of a particular previous round, which is round ValFromi). Variable Info Quo_i contains the set of processes for which a “Last” message has been received by process i (that is, the info-quorum). Variable Accept Quo contains the set of processes for which an “Accept” message has been received by process i (that is, the accepting-quorum). We remark that in the original paper by Lamport, there is only one quorum which is fixed in the first exchange of messages between the leader and the agents, so that only processes in that quorum can accept the value being proposed. However, there is no need to restrict the set of processes that can accept the proposed value to the info-quorum of the round. Messages from processes in the info-quorum are used only to choose a consistent value for the round, and once this has been done anyone can accept that value. This improvement is also suggested in Lamport’s paper [19]. Finally, variables InMsgs_i and OutMsgs_i are buffers used for incoming and outcoming messages.

Actions Stops and Recovers model process failures and recoveries. Actions Leaders and NotLeader are used to update IamLeaderi. Actions Send(m)i_, and Receive(m)i_, _jsend messages to the channels and receive messages from the channels. Action Init(v)i is used by an external agent to set the initial value of process i. Action NewRoundi starts a new round. It sets the new round number by increasing the highest round number ever seen. Action Collects resets to the initial values all the variables that describe the status of the round being conducted and broadcasts a “Collect” message. Action GatherLast(m) collects the information sent by agents in response to the leader’s “Collect” message. This information is the number of the last round accepted by the agent and the value of that round. Upon receiving these messages, GatherLast(m) updates, if necessary, variables Value, and ValFromi. Also it updates the set of processes which eventually will be the info-quorum of the current round. Action GatherLast(m) is executed until information is received from a majority of the processes. When “Last” messages have been collected from a majority of the processes, the info-quorum is fixed and GatherLast(m) is no longer enabled. At this point action Gathered(v) is enabled. If Value, is defined then the value for the round is set, and action BeginCast_i is enabled. If Value, is not defined (and this is possible if the leader does not have an initial value and does not receive any value in “Last” messages) the leader waits for an initial value before enabling action BeginCast_i. When an initial value is provided, action Continues can be executed and it sets Value, and enables action BeginCast_i. Action BeginCast _ibroadcasts a “Begin” message including the value chosen for the round. Action GatherAccept(m) gathers the “Accept” messages. If a majority of the processes accept the value of the current round then the round is successful and GatherAccept_i sets the Decision, variable to the value of the current round. When variable Decisions has been set, action RndSuccess(v) is enabled. Action RndSuccess is used to pass the decision to BPSUCCESSi. Action GatherOldRound(m) collects messages that inform process i that the round previously started by i is “old”, in the sense that a round with a higher number has been started. Process i can update, if necessary, variable HighestRnd_i.

Automaton BPAGENTi. Variable Status is used to model process failures and recoveries. Variable LastR, is the round number of the latest round for which process i has sent an “Accept” message. Variable LastViis the value for round LastRi. Variable Commits specifies the round for which process i is committed and thus specifies the set of rounds that process i must reject, which are all the rounds with round number less than Commits. We remark that when an agent commits for a round r and sends to the leader of round r a “Last” message specifying the latest round r¹ <r in which it has accepted the proposed value, it is enough that the agent commits to not accept the value of any round T¹¹in between T¹ and T. To make the code simpler, when an agent commits for a round T, it commits to reject any round T¹¹<T. Finally, variables InMsgs _iand OutMsgs _iare buffers used for incoming and outcoming messages.

Actions Stops and Recover, model process failures and recoveries. Actions Send(m)i, _jand Receive(m)i_, send messages to the channels and receive messages from the

channels. Action LastAccept responds to the “Collect” message sent by the leader by sending a “Last” message that gives information about the last round in which the agent has been involved. Action Accept responds to the “Begin” message sent by the leader. The agent accepts the value of the current round if it is not rejecting the round. In both LastAccept and Accept actions, if the agent is committed to reject the current round because of a higher numbered round, then an “OldRound” message is sent to the leader so that the leader can update the highest round number ever seen. Action Init(v) sets to v the value of LastV only if this variable is undefined. With this, the agent sends its initial value in a “Last” message whenever the agent has not yet accepted the value of any round.

Automaton BPSUCCESS . Variable Status is used to model process failures and recoveries. Variable IamLeader keeps track of whether the process is leader. Variable Decision stores the decision. Variable Acked(j) contains a boolean that specifies whether or not process j has sent an acknowledgment for a “Success” message. Variable Prevsend records the time of the previous broadcast of the decision. Variables Last Check , LastSS , and variables LastSendAck(j) , LastSendSuc(j) , for j = , are used to impose the time bounds on enabled actions. Their use should be clear from the code. Variables OutAckMsgs(j) and OutSucMsgs(j) , for j = , are buffers for outcoming “Ack” and “Success” messages, respectively. There are no buffers for incoming messages because incoming messages are processed immediately, that is, by action Receive(m) ; j.

Actions Stop and Recover model process failures and recoveries. Actions Leader and NotLeader are used to update IamLeader . Actions Send(m) ; j and Receive(m) ; j send messages to the channels and receive messages from the channels. Action Receive (m) handles the receipt of “Ack” and “Success” messages. Action RndSuccess simply takes care of updating the Decision variable and sets a time bound for the execution of action SendSuccess . Action SendSuccess sends the “Success” message, along with the value of Decision to all processes for which there is no acknowledgment. It sets the time bounds for the re-sending of the “Success” message and also the time bounds LastSendSuc(j) for the actual sending of the messages. Action Check re-enable action SendSuccess after an appropriate time bound. We remark that 2‘ + 2d is the time needed to send the “Success” message and get back an “Ack” message (see the analysis in the proof of Lemma 11).

We remark that automaton BPSUCCESS needs to be able to measure the passage of time.

6.2.3. Partial correctness

Let us define the system _SBPXto be the composition of system _SCHAand automaton BASICPAXOS for each process E 5 (remember that BASICPAXOS is the composition of automata BPLEADER , BPAGENT and BPSUCCESS ). In this section we prove the partial correctness of _SBPX:we show that in any execution of the system _SBPX,agreement and validity are guaranteed.

For these proofs, we augment the algorithm with a collection ' of history variables. Each variable in ' is an array indexed by the round number. For every round number r a history variable contains some information about round r. In particular the set ' consists of:

Hleader(r) E 5 U nil, initially nil (the leader of round r). Hvalue(r) E VU nil, initially nil (the value for round r).

Hf rom(r) E U nil, initially nil (the round from which Hvalue(r) is taken). Hinfquo(r), subset of 5, initially { } (the info-quorum of round r). Haccquo(r), subset of 5, initially { } (the accepting-quorum of round r). Hrej ect(r), subset of 5, initially { } (processes committed to reject round r).

The code fragments of automata BPLEADER and BPAGENT augmented with the history variables are shown in Figs. 14 and 15. The figures show only the actions that change history variables. Actions of BPSUCCESS do not change history variables.

Initially, when no round has been started yet, all the information contained in the history variables is set to the initial values. All but Hrej ect(r) history variables of round r are set by the leader of round r, thus if the round has not been started these variables remain at their initial values. More formally we have the following lemma.

Lemma 8. In any state of an execution of _SBPX,if Hleader(r) = nil then Hvalue(r) = nil, Hf rom(r) = nil, Hinfquo(r) = { }, Haccquo(r) = { }.

Proof. By an easy induction.

Given a round r, Hrej ect(r), is modified by all the processes that commit themselves to reject round r, and we know nothing about its value at the time round r is started.

Next we define some key concepts that will be instrumental in the proofs.

Definition 6.1. In any state of the system _SBPX,a round r is said to be dead if |Hreject(r)|¿n/2.

That is, a round r is dead if at least n/2 of the processes are rejecting it. Hence, if a round r is dead, there cannot be a majority of the processes accepting its value, i.e., round r cannot be successful.

We denote by _Sthe set {r E | Hleader(r) = nil} of started rounds and by _Vthe set {r E | Hvalue(r) = nil} of rounds for which the value has been chosen. Clearly in any state s of an execution of _SBPX,we have that _V S.

Next we formally define the concept of anchored round which is crucial to the proofs. The idea of anchored round is borrowed from [21]. Informally a round r is anchored if its value is consistent with the value chosen in any previous round r'. Consistent means that either the value of round r is equal to the value of round r' or round r' is dead. Intuitively, it is clear that if all the rounds are either anchored or dead, then agreement is satisfied.

ABPleader (history variables)

input NewRoundi

Eff: if LeaderAlive = true then CurRnd :=HighestRnd + 1

· Hleader(CurRnd) := i HighestRnd := CurRnd Mode := collect

output BeginCasti

Pre: LeaderAlive = true Mode = begincast

Eff: Vj put (CurRnd, “Begin”, Value)i, _j

in OutMsgs

· Hinfquo(CurRnd) := Info Quo

· Hf rom(CurRnd):= ValFrom

· Hvalue(CurRnd):= Value

Mode := gatheraccept

internal GatherAccept(m), Pre: LeaderAlive = true

Mode = gatheraccept

m = (r, “Accept”)j_, E InMsgs

CurRnd =r

E: remove m from InMsgs

Accept Quo := Accept Quo U{j}

if |AcceptQuo| > n/2 then

Decision := Value

· Haccquo(CurRnd) := Accept Quo

Mode := decide

Fig. 14. Actions of BPLEADER for process i augmented with history variables. Only the actions that do change history variables are shown. Other actions are the same as in BPLEADERi, i.e. they do not change history variables. Actions of BPSUCCESS do not change history variables.

Definition 6.2. A round r E _Vis said to be anchored if for every round r' E _Vsuch that r' <r, either round r' is dead or Hvalue(r') = Hvalue(r).

Next we prove that _SBPXguarantees agreement, by using a sequence of invariants. The key invariant is Invariant 6.8 which states that all rounds are either dead or anchored. The first invariant, Invariant 6.3, captures the fact that when a process sends a “Last” message in response to a “Collect” message for a round r, then it commits to not vote for rounds previous to round r.

ABPagent _i(history variables)

internal LastAccept(m)i

Pre: Status=alive

= (r, “Collect”)j_, iE InMsgs

Eff: remove m from InMsgs

if r Commit then Commit := r

· For all r', LastR<r'<r

· Hrej ect(r') := Hrej ectt(r' ) U {i}

put (r, “Last”, LastR , LastV)i_, _j

in OutMsgs else

put (r, “OldRound”, Commit)i, _j

in OutMsgs

Fig. 15. Actions of BPAGENT for process i augmented with history variables. Only the actions that do change history variables are shown. Other actions are the same as in BPAGENTi, i.e. they do not change history variables. Actions of BPSUCCESS do not change history variables.

Invariant 6.3. In any state s of an execution of _SBPX,if message (r, “Last”, r'',v)j_, iis in OutMsgs_j, then jEHreject(r'), for all r' such that r''<r'<r.

Proof. We prove the invariant by induction on the length k of the execution c. The base is trivial: if k =0 then = s0, and in the initial state no message is in OutMsgs_j. Hence the invariant is vacuously true. For the inductive step assume that the invariant is true for = s0lr1s1 ... rksk and consider the execution _{s0r1s1...kskrs.}We need to prove that the invariant is still true in s. We distinguish two cases.

Case 1: (r, “Last”, r'', _v₎_j,E sk. OutMsgs_j. By the inductive hypothesis we have j E sk. Hreject(r'), for all r' such that T'' <T' <T. Since no process is ever removed from any Hrej ect set, we have j E s.Hrej ect(r'), for all T ' such that r''<T' <T.

Case 2: (T, “Last”,r'', ^v)j, iE=sk.OutMsgs_j. Since by hypothesis we have (T, “Last”, r'', ^v)j, iE s.OutMsgs_j, it must be that ir = LastAccept(m)j, with m = (T, “Collect”) and it must be sk.LastR _j=r''.Then the invariant follows by the code of LastAccept(m)j which puts process j into Hreject(r') for all T ' such that r''<T' <T.

The next invariant states that the commitment made by an agent when sending a “Last” message is still in effect when the message is in the communication channel. This should be obvious, but to be precise in the rest of the proof we prove it formally.

Invariant 6.4. In any state s of an execution of _SBPX,if message (T, “Last”,r'', v)j, iis inCHANNELj_, i,then j E Hreject(T'), for all T ' such that r''<T' <T.

Proof. We prove the invariant by induction on the length k of the execution c. The base is trivial: if k = 0 then = s0, and in the initial state no messages are in CHANNELj_, i. Hence the invariant is vacuously true. For the inductive step assume that the invariant is true for = s0lr1s1 . . . lrksk and consider the execution s0ir1s1 . . . iikskirs. We need to prove that the invariant is still true in s. We distinguish two cases.

Case 1: (r, “Last”, r'', _v₎_j,E sk.CHANNELj_, . By the inductive hypothesis we have j E sk. Hreject(r'), for all r' such that T'' <T' <T. Since no process is ever removed from any Hrej ect set, we have j E s.Hrej ect(T'), for all T' such that T'' <T' <T.

Case 2: (T, “Last”, T'',v)j_, iE= sk.CHANNELj_,i. Since by hypothesis (T, “Last”, T'',v)j_, iEs. OutMsgs_j, it must be that ir = Send(m)j_, with m = (T, “Last”, T'',v)j_, i.By the precondition of action Send(m)j_, we have that message (T, “Last”, T'',v)j_, iE sk.OutMsgs_j. By Invariant 6.3 we have that process j Esk.Hreject(T') for all T' such that T'' <T' <T. Since no process is ever removed from any Hreject set, we have j Es.Hreject(T'), for all T' such that T'' <T' <T.

The next invariant states that the commitment made by an agent when sending a “Last” message is still in effect when the message is received by the leader. Again, this should be obvious.

Invariant 6.5. In any state s of an execution of _SBPX,if message (T, “Last”, T'', v)j_, iis in InMsgs_i, then j E Hreject(T'), for all T' such that T'' <T' <T.

Proof. We prove the invariant by induction on the length k of the execution x. The base is trivial: if k =0 then = s0, and in the initial state no messages are in InMsgs_i. Hence the invariant is vacuously true. For the inductive step assume that the invariant is true for = s0lr1s1 . . . rksk and consider the execution s0lr1s1 . . . nkskns. We need to prove that the invariant is still true in s. We distinguish two cases.

Case 1: (T, “Last”, T'',v)j_, iE sk.InMsgs_i. By the inductive hypothesis we have j E sk. Hreject(T'), for all T' such that T'' <T' <T. Since no process is ever removed from any Hrej ect set, we have j E s.Hrej ect(T'), for all T' such that T'' <T' <T.

Case 2: (T, “Last”, T'',v)j_, iE= sk.InMsgs_i. Since by hypothesis (T, “Last”, T'',v)j_, iEs. InMsgs_i, it must be that ir = Receive(m)i_, _jwith m = (T, “Last”, T'',v)j_, i.In order to execute action Receive(m)i_, _jwe must have (T, “Last”, T'',v)j_, iE sk.CHANNELj_, i. By Invariant 6.4 we have j E sk.Hreject(T') for all T' such that T'' <T' <T. Since no process is ever removed from any Hreject set, we have j Es.Hreject(T'), for all T' such that T'' <T' <T.

The following invariant states that the commitment to reject smaller rounds, made by the agent is still in effect when the leader updates its information about previous rounds using the agents’ “Last” messages.

Invariant 6.6. In any state s of an execution _SBPX,if process j E Info Quo_i, for some process i, and CurRnd = T, then VT' such that s. ValFrom <T' <T, we have that

j E Hreject(T').

Proof. We prove the invariant by induction on the length k of the execution c. The base is trivial: if k =0 then = s0, and in the initial state no process j is in Info Quo_i for any i. Hence the invariant is vacuously true. For the inductive step assume that the invariant is true for = s0lr1s1 ... lrksk and consider the execution s0ir1s1 ... iikskirs. We need to prove that the invariant is still true in s. We distinguish two cases.

Case 1: In state sk, j E Info Quo_i, for some process i, and CurRnd = r. Then by the inductive hypothesis, in state sk we have that j EHreject(r'), for all r' such that sk.ValFrom <T' <r. Since no process is ever removed from any Hreject set and, as long as CurRnd is not changed, variable ValFrom, is never decreased, then also in state s we have that j E Hrej ect(r'), for all r' such that s. ValFrom <T' <r.

Case 2: In state sk, it is not true that j E Info Quo_i, for some process i, and CurRndi = T. Since in state s it holds that j E Info Quo_i, for some process i, and CurRnd = r, it must be the case that ir = GatherLast(m) with m = (T, “Last”, T'',vj_, i.Notice that, by the precondition of GatherLast(m)i, m E InMsgs_i. Hence, by Invariant 6.5 we have that j E Hreject(T'), for all T' such that r''<T' <T. By the code of the GatherLast(m)i action we have that ValFromi¿r''. Whence the invariant is proved.

The following invariant is basically the previous one stated when the leader has fixed the info-quorum.

Invariant 6.7. In any state of an execution of _SBPX,ifj E Hinfquo(T) then VT' such that Hf rom(T)<T'<T, we have that jEHreject(T').

Proof. We prove the invariant by induction on the length k of the execution x. The base is trivial: if k =0 then = s0, and in the initial state we have that for every round T, Hleader(T) = nil and thus by Lemma 8 there is no process j in Hinfquo(T). Hence the invariant is vacuously true. For the inductive step assume that the invariant is true for = s0lr1s1...ksk and consider the execution s0lr1s1 ... lrksklrs. We need to prove that the invariant is still true in s. We distinguish two cases.

Case 1: In state sk, j E Hinfquo(T). By the inductive hypothesis, in state sk we have that j E Hrej ect(T'), for all T' such that Hf rom(T)<T' <T. Since no process is ever removed from any Hreject set, then also in state s we have that j E Hreject(T'), for all T' such that Hf rom(T)<T'<T.

Case 2: In state sk, j E Hinfquo(T). Since in state s, j EHinfquo(T), it must be the case that action ir puts j in Hinfquo(T). Thus it must be ir = BeginCast _ifor some process i, and it must be sk.CuTRnd = T and j E sk.InfoQuo_i. Since action BeginCast_i does not change CurRnd and Info Quo _iwe have that s.CuTRnd = T and j E s.InfoQuo_i. By Invariant 6.6 we have that j E Hreject(T') for all T' such that s.ValFrom <T' <T. By the code of BeginCast _iwe have that Hf rom(T) = s. ValFromi.

We are now ready to prove the main invariant.

Invariant 6.8. In any state of an execution of _SBPX,any nondead round T E RV is anchored.

Proof. We proceed by induction on the length k of the execution . The base is trivial. When k =0 we have that = s0 and in the initial state no round has been started yet. Thus Hleader(r) = nil and by Lemma 8 we have that _V= { } and thus the assertion is vacuously true. For the inductive step assume that the assertion is true for = s0lr1s1] ... lrksk and consider the execution s0lr1s1 ... iikskirs. We need to prove that, for every possible action ir the assertion is still true in state s. First we observe that the definition of “dead” round depends only upon the history variables and that the definition of “anchored” round depends upon the history variables and the definition of “dead” round. Thus the definition of “anchored” depends only on the history variables. Hence actions that do not modify the history variables cannot affect the truth of the assertion. The actions that change history variables are:

(1) ir = NewRoundi

(2) ir = BeginCast_i

(3) ir = GatherAccept(m)i

(4) ir = LastAccept(m)i

Case 1: Assume ir = NewRoundi. This action sets the history variable Hleader(r), where r is the round number of the round being started by process i. The new round r does not belong to _Vsince Hvalue(r) is still undefined. Thus the assertion of the lemma cannot be contradicted by this action.

Case 2: Assume ir = BeginCast_i. Action ir sets Hvalue(r), Hf rom(r) and Hinfquo(r) where r =sk.CurRndi. Round r belongs to _Vin the new state s. In order to prove that the assertion is still true it suffices to prove that round r is anchored in state s and any roundr', r' >r is still anchored in state s. Indeed rounds with round number less than T are still anchored in state s, since the definition of anchored for a given round involves only rounds with smaller round numbers.

First we prove that round T is anchored. From the precondition of BeginCast _iwe have that Hinfquo(T) contains more than n/2 processes; indeed variable Modes is equal to begincast only if the cardinality of Info Quo _iis greater than n/2. Using Invariant 6.7 for each process j in s.Hinfquo(T), we have that for every roundr', such that s.Hfrom(r)<r' <T, there are more than n/2 processes in the set Hreject(r'), which means that every round r', s.Hfrom(T) <r' <T, is dead. Moreover, by the code of ir we have that s.Hfrom(T) = sk. ValFrom and s.Hvalue(T) = sk. Values. From the code (see action GatherLasti) it is immediate that in any state Values is the value of round ValFromi. In particular we have that sk. Values = sk.Hvalue(sk. ValFromi). Hence we have s.Hvalue(T) = s.Hvalue(s.Hfrom(r)). Finally we notice that round Hf rom(r) is anchored (any round previous to T is still anchored in state s) and thus we have that any round r' <, is either dead or such that s.Hvalue(s.Hfrom(T)) = s.Hvalue(r'). Hence for any round r' < we have that either roundr' is dead or that s.Hvalue(r) = s.Hvalue(r'). Thus round T is anchored in state s.

Finally, we need to prove that any non-dead roundr', r' >T that was anchored in sk is still anchored in s. Since action BeginCast _imodifies only history variables for round T, we only need to prove that in state s, Hvalue(r') = Hvalue(T). Let r'' be equal to Hf rom(r). Sincer' is anchored in state sk we have that sk.Hvalue(r') = sk.Hvalue(r'').

Again because BeginCast _imodifies only history variables for round r, we have that s.Hvalue(r') = s.Hvalue(r''). But we have proved that round r is anchored in state s and thus s.Hvalue(r) = s.Hvalue(r''). Hence s.Hvalue(r') = s.Hvalue(r).

Case 3: Assume ir = GatherAccept(m)i. This action modifies only variable Haccquo, which is not involved in the definition of anchored. Thus this action cannot make the assertion false.

Case 4: Assume ir = LastAccept(m)i. This action modifies Hinf quo and Hrej ect. Variable Hinf quo is not involved in the definition of anchored. Action LastAccept(m)i may put process i in Hrej ect of some rounds and this, in turn, may make those rounds dead. However this cannot make false the assertion; indeed if a round r was anchored in sk it is still anchored when another round becomes dead.

The next invariant follows from the previous one and gives a more direct statement about the agreement property.

Invariant 6.9. In any state of an execution of _SBPX,all the Decision variables that are not nil, are set to the same value.

Proof. We prove the invariant by induction on the length k of the execution c. The base of the induction is trivially true: for k = 0 we have that = s0 and in the initial state all the Decisioni variables are undefined.

Assume that the assertion is true for = s0lr1s1 ... irksk and consider the execution s0ir1s1 ... irkskirs. We need to prove that, for every possible action ir the assertion is still true in state s. Clearly the only actions which can make the assertion false are those that set Decisioni, for some process i. Thus we only need to consider actions GatherAccept( (r, “Accept”))i and actions RndSuccess(v)i and Receive( (“Success”, v) )i, j of automaton BPSUCCESSi.

Case 1. Assume r= GatherAccept((r,“Accept”))i. This action sets Decisioni to Hvalue(r). If all Decisionj, j = i, are undefined then Decisioni is the first decision and the assertion is still true. Assume there is only one Decisionj already defined. Let Decisionj = Hvalue(r') for some roundr'. By Invariant 6.8, rounds r andr' are anchored and thus we have that Hvalue(r') = Hvalue(r). Whence Decisioni = Decisionj. If there are some Decisionj, j = i, which are already defined, then by the inductive hypothesis they are all equal. Thus, the lemma follows.

Case 2. Assume ir = RndSuccess(v)i. This action sets Decisioni to v. By the code, value v is equal to the Decisionj of some other process. The lemma follows by the inductive hypothesis.

Case 3. Assume ir = Receive( (“Success”, v) )i. This action sets Decisioni to v. It is easy to see (by the code) that the value sent in a “Success” message is always the Decision of some process. Thus we have that Decisioni is equal to Decisionj for some other process j and the lemma follows by the inductive hypothesis.

Finally we can prove that agreement is satisfied.

Validity is easier to prove since the value proposed in any round comes either from a value supplied by an Init(v)i action or from a previous round.

Invariant 6.10. In any state of an execution of _SBPX,for any r E RV we have that Hvalue(r) E V.

Proof. We proceed by induction on the length k of the execution . The base of the induction is trivially true: for k = 0 we have that = s0 and in the initial state all the Hvalue variables are undefined.

Assume that the assertion is true for = s0lr1s1 ::: irksk and consider the execution s0ir1s1 ::: irkskirs. We need to prove that, for every possible action ir the assertion is still true in state s. Clearly the only actions that can make the assertion false are those that modify Hvalue. The only action that modifies Hvalue is BeginCast. Thus, assume ir = BeginCast_i. This action sets Hvalue(r) to Valuei. We need to prove that all the values assigned to Valuei are in the set V. Variable Valuei is modified by actions NewRoundi and GatherLast(m)i. We can easily take care of action NewRoundi because it simply sets Valuei to be InitValuei which is obviously in V. Thus we only need to worry about GatherLast(m)i actions. A GatherLast(m)i action sets variable Valuei to the value specified into the “Last” message if that value is not nil. The value specified into any “Last” message is either nil or the value Hvalue(r') of a previous round r'; by the inductive hypothesis we have that Hvalue(r') belongs to V.

Invariant 6.11. In any state of an execution of _SBPX,all the Decision variables that are not undefined are set to some value in V.

Proof. A variable Decision is always set to be equal to Hvalue(r) for some r. Thus the invariant follows from Invariant 6.10.

Theorem 10. In any execution of the system _SBPX;validity is satisfied. Proof. Immediate from Invariant 6.11.

6.2.4. Analysis of SBPX

In this section we analyze the performance of _SBPX.Since termination is not guaranteed by _SBPXin this section we provide a performance analysis (Lemma 14) assuming that a successful round is conducted. Then in Section 6.4, Theorem 17 provides the performance analysis of _SPAX,which, in a nice execution fragment, guarantees termination.

Let us begin by making precise the meaning of the expressions “the start (end) of a round”.

Definition 6.12. In an execution fragment whose states are all unique-leader states with process being the unique leader, the start of a round is the execution of action NewRound and the end of a round is the execution of action RndSuccess .

A round is successful if it ends, that is, if the RndSuccess action is executed by the leader . Moreover we say that a process reaches its decision when automaton BPSUCCESS sets its Decision variable. We remark that, in the case of a leader, the decision is actually reached when the leader knows that a majority of the processes have accepted the value being proposed. This happens in action GatherAccept(m) of BPLEADER . However, to be precise in our proofs, we consider the decision reached when the variable Decision of BPSUCCESS is set; for the leader this happens exactly at the end of a successful round. Notice that the Decide(v) action, which communicates the decision v of process to the external environment, is executed within " time from the point in time when process reaches the decision, provided that the execution is regular (in a regular execution actions are executed within the expected time bounds).

The following lemma states that once a round has ended, if the execution is stable, the decision is reached by all the alive processes within linear (in the number of processes) time.

Lemma 11. If an execution fragment of the system _SBPX,starting in a reachable state s and lasting for more than 3" + 2d time, is stable and unique-leader, with process leader, and process reaches a decision in state s, then by time 3" + 2d, every alive process j = has reached a decision, and the leader has Acked(j) = true for every alive process j = .

Proof. First notice that _SBPXis the composition of CHANNEL , j and other automata. Hence, by Theorem 3 we can apply Lemma 4. Let / be the alive processes j = such that Acked(j) = false. If / is empty then the lemma is trivially true. Hence assume / = { }.

By assumption, the action that brings the system into state s is action RndSuccess (the leader reaches a decision in state s). Hence action SendSuccess is enabled. By the code of BPSUCCESS , action SendSuccess is executed within " time. This action puts a “Success” message for each process j / into OutSucMsgs(j) . By the code of BPSUCCESS , each of these messages is put on CHANNEL _,j,i.e., action Send((“Success”, v) ) ,j is executed, within " time. By Lemma 4 each alive process j / receives the “Success” message, i.e., executes a Receive( (“Success”, v)) , jaction, within d time. This action sets Decisionj to v and puts an “Ack” message into OutAckMsgs( )j. By the code of BPSUCCESSj, this “Ack” message is put on CHANNELj_, , i.e., action Send(“Ack”)j_, is executed, within " time, for every process j. By Lemma 4 the leader

receives the “Ack” message, i.e., executes a Receive((“Ack”) _)j,action, within d time, for each process j. This action sets Acked(j) = true.

Summing up the time bounds we get the lemma.

In the following we are interested in the time analysis from the start to the end of a successful round. We consider a unique-leader execution fragment , with process i leader, and such that the leader i has started a round by the first state of (that is, in the first state of , CurRndi = r for some round number r).

We remark that in order for the leader to execute step 3 of a round, i.e., action BeginCast_i, it is necessary that Valuei be defined. If the leader does not have an initial value and no agent sends a value in a “Last” message, variable Valuei is not defined. In this case the leader needs to wait for the execution of the Init(v)i to set a value to propose in the round (see action Continuei). Clearly the time analysis depends on the time of occurrence of the Init(v)i. To deal with this we use the following definition.

Definition 6.13. Given an execution fragment , we define ^ti to be 0, if variable InitValuei is defined in the first state of ; the time of occurrence of action Init(v)i, if variable InitValuei is undefined in the first state of and action Init(v)i is executed in ; oc, if variable InitValuei is undefined in the first state of and no Init(v)i action is executed in . Moreover, we define ⁷ⁱ to be max{7" + 2d, ^ti + 2"}.

Informally, the above definition of 7i  gives the time, counted from the beginning of a round, by which a BeginCast _iaction is expected to be executed, assuming that the execution is stable and the round being conducted is successful. More formally we have the following lemma.

Lemma 12. Suppose that for an execution fragment of the system _SBPX,starting in a reachable state s in which s:Decision = nil, then it holds that

(i) is stable;

(ii) is a unique-leader execution, with process i leader;

(iii) lasts for more than ⁷ⁱ ;

(iv) the action that brings the system into state s is action NewRoundi for some round r;

(v) round r is successful.

Then we have that action Begin Cast_i for round r is executed within time ^7jof the beginning of .

Proof. First notice that _SBPXis the composition of CHANNELi_, _jand other automata. Hence, by Theorem 3 we can apply Lemmas 1 and 4. Since the execution is stable, it is also regular, and thus by Lemma 1 actions of BPLEADERi and BPAGENTi are executed within " time and by Lemma 4 messages are delivered within d time.

Action NewRoundi enables action Collecti which is executed in at most " time. This action puts “Collect” messages, one for each agent j, into OutMsgs_i. By the code of BPLEADERi (see tasks and bounds) each one of these messages is sent on CHANNELi_,j i.e., action Sendi_, _jis executed for each of these messages, within " time. By Lemma 4 a “Collect” message is delivered to each agent j, i.e., action Receivei_, _jis executed, within d time. Then it takes " time for an agent to execute action LastAccept _jwhich

puts a “Last” message in OutMsgs_j. By the code of BPAGENT (see tasks and bounds) it takes additional " time to execute action Sendj_, to send the “Last” message on CHANNELj_, . By Lemma 4, this “Last” message is delivered to the leader, i.e., action Receivej_, is executed, within additional d time. By the code of BPLEADER (see tasks and bounds) each one of these messages is processed by GatherLast within " time. Action Gathered is executed within additional " time.

At this point there are two possible cases: ( ) Value is defined and ( ) Value is not defined. In case (i), action BeginCast is enabled and is executed within " time. Summing up the times considered so far we have that action BeginCast is executed within 7"+2d time from the start of the round. In case (ii), action Continue is executed within " time of the execution of action Continue , and thus by time t  + ". This action enables action BeginCast which is executed within additional " time. Hence action BeginCast is executed by time t  + 2". Putting together the two cases we have that action BeginCast is executed by time max{7" + 2d, t  + 2"}.

Hence we have proved that action BeginCast is executed in by time T  .

Next lemma gives a bound for the time that elapses between the execution of the BeginCast action and the RndSuccess action for a successful round in a stable execution fragment.

Lemma 13. Suppose that for an execution fragment of the system _SBPX,starting in a reachable state s in which s:Decision = nil, then it holds that:

(i) is stable;

(ii) is a unique-leader execution, with process leader;

(iii) lasts for more than 5" + 2d time;

(iv) the action that brings the system into state s is action BeginCast for some round r;

(v) round r is successful.

Then we have that action RndSuccess is performed by time 5" + 2d from the beginning of .

Proof. First notice that _SBPXis the composition of CHANNEL , j and other automata. Hence, by Theorem 3 we can apply Lemmas 1 and 4. Since the execution is stable, it is also regular, and thus by Lemma 1 actions of BPLEADER and BPAGENT are executed within " time and by Lemma 4 messages are delivered within d time.

Action BeginCast puts “Begin” messages for round r in OutMsgs . By the code of BPLEADER (see tasks and bounds) each one of these messages is put on CHANNEL , j by means of action Send , j in at most " time. By Lemma 4 a “Begin” message is delivered to each agent j, i.e., action Receive , jis executed, within d time. By the code of BPAGENT|j (see tasks and bounds) action Accept_j is executed within " time. This action puts an “Accept” message in OutMsgs_j. By the code of BPAGENT _jthe “Accept” message is put on CHANNELj_, , i.e., action Sendj_, for this message is executed, within "

time. By Lemma 4 the message is delivered, i.e., action Receivej_, _ifor that message is executed, within d time. By the code of BPLEADERi action GatherAccept _iis executed for a majority of the “Accept” messages within additional ‘ time. At this point variable Decisioni is defined and action RndSuccessi is executed within ‘ time. Summing up all the times we have that the round ends within 5‘ + 2d.

We can now easily prove a time bound on the time needed to complete a round.

Lemma 14. Suppose that for an execution fragment of the system _SBPX,starting in a reachable state s in which s.Decision = nil, then it holds that

(i) is stable;

(ii) is a unique-leader execution, with process i leader;

(iii) lasts for more than ⁷ⁱ + 5‘ + 2d;

(iv) the action that brings the system into state s is action NewRoundi for some round r;

(v) round r is successful.

Then we have that action Begin Cast_i for round r is executed within time ⁷ⁱ of the beginning of and action RndSuccessi is executed by time ⁷ⁱ + 5‘ + 2d of the beginning of .

Proof. Follows from Lemmas 12 and 13.

The previous lemma states that in a stable execution a successful round is conducted within some time bound. However, it is possible that even if the system executes nicely from some point in time on, no successful round is conducted and to have a successful round a new round must be started. We take care of this problem in the next section. We will use a more refined version of Lemma 14; this refined version replaces condition (v) with a weaker requirement. This weaker requirement is enough to prove that the round is successful.

Lemma 15. Suppose that for an execution fragment of _SBPX,starting in a reachable state s in which s.Decision = nil, then it holds that

(i) is nice;

(ii) is a unique-leader execution, with process i leader;

(iii) lasts for more than ⁷ⁱ + 5‘ + 2d time;

(iv) the action that brings the system into state s is action NewRoundi for some round r;

(v) there exists a set / c 5 of processes such that every process in / is alive and / is a majority, for every j E /, s. Commitj r and in state s for every j E / and k E 5, CHANNELk_, _jand InMsgs _jdo not contain any “Collect” message belonging to any round r' ¿r.

Then we have that action BeginCasti is performed by time ⁷ⁱ and action RndSuccessi is performed by time ⁷ⁱ + 5‘ + 2d from the beginning of .

Proof. Process sends a “Collect” message which is delivered to all the alive voters. All the alive voters, and thus all the processes in /, respond with “Last” messages which are delivered to the leader. No process j / can be committed to reject round r. Indeed, by assumption, process j is not committed to reject round r in state s and process j cannot commit to reject round r. The latter is due to the fact that in state s no message that can cause process j to commit to reject round r is either in InMsgs_j nor in any channel to process j, and in the only leader is , which only sends messages belonging to round r. Since / is a majority, the leader receives at least a majority of “Last” messages and thus it is able to proceed with the next step of the round. The leader sends a “Begin” message which is delivered to all the alive voters. All the alive voters, and thus all the processes in /, respond with “Accept” messages since they are not committed to reject round r. Since / is a majority, the leader receives at least a majority of “Accept” messages. Therefore given that lasts for enough time round r is successful.

Since round r is successful, the lemma follows easily from Lemma 14.

6.3. Automaton SPAX

To reach consensus using _SBPX,rounds must be started by an external agent by means of the NewRound action that makes process start a new round. In this section we provide automata STARTER that start new round. Composing STARTER with _SBPX^weobtain SPAX.

The system _SBPXguarantees that running rounds does not violate agreement and validity, even if rounds are started by many processes. However, since running a new round may prevent a previous one from succeeding, initiating too many rounds is not a good idea. The strategy used to initiate rounds is to have a leader election algorithm and let the leader initiate new rounds until a round is successful. We exploit the robustness of BASICPAXOS in order to use the sloppy leader elector provided in Section 5. As long as the leader elector does not provide exactly one leader, it is possible that no round is successful, however agreement and validity are always guaranteed. This means that regardless of termination, in any run of the algorithm no two different decisions are ever made and any decision is equal to some input value. Moreover, when the leader elector provides exactly one leader, if the system _SBPXis executing a nice execution fragment then a round is successful.

Automaton STARTER takes care of the problem of starting new rounds. This automaton interacts with LEADERELECTOR by means of the Leader and NotLeader actions and with BASICPAXOS by means of the NewRound , Gathered(v) , Continue and RndSuccess(v) actions. Fig. 5, given at the beginning of the section, shows the interaction of the STARTER automaton with the other automata.

The code of automaton STARTER is shown in Figs. 16 and 17. Automaton STARTER does the following. Whenever process becomes leader, the STARTER automaton starts a new round by means of action NewRound . Moreover the automaton checks that action BeginCast is executed within the expected time bound (given by Lemma 14).

If BeginCast is not executed within the expected time bound, then STARTER starts a new round. Similarly once BeginCast has been executed, the automaton checks that action RndSuccess(v) is executed within the expected time bound (given by Lemma 14). Again, if such an action is not executed within the expected time bound, STARTER starts a new round. We remark that to check for the execution of BeginCast , the automaton actually checks for the execution of action Gathered(v) . This is because the expected time of execution of BeginCast depends on whether an initial value is already available when action Gathered(v) is executed. If such a value is available when Gathered(v) is executed then BeginCast is enabled and is expected to be executed within " time of the execution of Gathered(v) . Otherwise the leader has to wait for the execution of action Init(v) which enables action Continue and action BeginCast is expected to be executed within " time of the execution of Continue .

In addition to the code we provide some comments about the state variables and the actions. Variables IamLeader and Status are self-explanatory. Variable Start is true when a new round needs to be started. Variable RndSuccess is true when a decision has been reached. Variables DlineGat and DlineSuc are used to check for the execution of actions Gathered(v) and RndSuccess(v) . They are also used, together with variable LastNR , to impose time bounds on enabled actions.

Automaton STARTER updates variable IamLeader according to the input actions Leader and NotLeader and executes internal and output actions whenever it is the leader. Variable Start is used to start a new round and it is set either when a Leader action changes the leader status IamLeader from false to true, that is, when the process becomes leader or when the expected time bounds for the execution of actions Gathered(v) and RndSuccess(v) elapse without the execution of these actions. Variable RndSuccess is updated by the input action RndSuccess(v) . Action NewRound starts a new round. Actions CheckGathered and CheckRndSuccess check, respectively, whether actions Gathered(v) and RndSuccess(v) are executed within the expected time bounds. Using an analysis similar to the one done in the proof of Lemma 12 we have that action Gathered(v) is supposed to be executed within 6" + 2d time of the start of the round. The time bound for the execution of action RndSuccess(v) depends on whether the leader has to wait for an Init(v) event. However by Lemma 13 action RndSuccess(v) is expected to be executed within 5" + 2d time from the time of occurrence of action BeginCast and action BeginCast is executed either within " time of the execution of action Gathered(v) , if an initial value is available when this action is executed, or else within " time of the execution of action Continue . Hence actions Gathered(v) and Continue both set a deadline of 6" + 2d for the execution of action RndSuccess(v) . Actions CheckGathered and CheckRndSuccess start a new round if the above deadlines expire.

6.4. Correctness and analysis of SPAX

Even in a nice execution fragment a round may not reach success. This is possible when agents are committed to reject the first round started in the nice execution

Actions:

input Stops

E: Status := stopped

input Recovers

Eff: Status := alive

input Leaders

Eff: if Status = alive then

if IamLeader = false then IamLeader := true

if RndSuccess = false then

Start := true

DlineGat := 00 DlineSuc := 00 LastNR := Clock + t'

input NotLeaders

Eff: if Status = alive then
LastNR := 00

DlineSus := 00

DlineGat := 00

IamLeader := false

output NewRounds Pre: Status= alive IamLeader = true

Start = true

Eff: Start:=false

DlineGat := Clock + 6t' + 2d LastNR := 00

input Gathered(v)s

Eff: if Status = alive then
DlineGat := 00

if v = nil then

DlineSuc := Clock + 6t' + 2d

input Continues

Eff: if Status = alive then DlineSuc := Clock + 6t' + 2d

Fig. 17. Automaton STARTER for process i (part 2).

fragment because they are committed for higher numbered rounds started before the beginning of the nice execution fragment. However, in such a case a new round is started and there is nothing that can prevent the success of the new round. Indeed in the newly started round, alive processes are not committed for higher numbered rounds since during the first round they inform the leader of the round number for which they are committed and the leader, when starting a new round, always uses a round number greater than any round number ever seen. In this section we will prove that in a long enough nice execution fragment termination is guaranteed.

Remember that SPA X is the system obtained by composing system _SLEAwith one automaton BASICPAXOS and one automaton STARTER for each process i 5. Since this system contains as a subsystem the system _SBPX,it guarantees agreement and validity. However, in a long enough nice execution fragment of _SPAXtermination is achieved, too.

The following lemma states that in a long enough nice, unique-leader execution, the leader reaches a decision. We recall that 7 = max{7t' + 2d, ^ti + 2t'} and that ^ti is the time of occurrence of action Init(v) in (see Definition 6.15).

Lemma 16. Suppose that for an execution fragment of _SPAX,starting in a reachable state s in which s.Decision = nil, then it holds that

(i) is nice;

(ii) is a unique-leader execution, with process i leader;

(iii) lasts for more than 7ⁱ + 20t' + 7d time.

Then by time 7ⁱ + 20t' + 7d the leader i has reached a decision.

Proof. First we notice that system _SPAXcontains as subsystem _SBPX;hence by using Theorem 3, the projection of on the subsystem _SBPXis actually an execution of SBPX and thus Lemmas 14 and 15 are still true in .

For simplicity, in the following we assume that ^7j=0, i.e., that process i has executed an Init(v)i action before . At the end of each case we consider, we will add ⁷ⁱto the time bound to take into account the possibility that process i has to wait for an Init(v)i action. Notice that 7ⁱ =0 implies that7i  =0 for any fragment fi of starting at some state of and ending in the last state of .

Let s' be the first state of such that no “Collect” message sent by a process k = i is present in CHANNELk_;j nor in InMsgs _jfor any j. State s' exists in and its time of occurrence is less or equal to d + t'. Indeed, since the execution is nice, all the messages that are in the channels in state s are delivered by time d and messages present in any InMsgs set are processed within t' time. Since i is the unique leader, in state s' no messages sent by a process k = i is present in any channel nor in any InMsgs set. Let ' be the fragment of beginning at s'. Since ' is a fragment of , we have that ' is nice, process i is the unique leader in ' and ⁷ⁱ,=0.

If process i has started a round r' by state s' and roundr' is successful, then round r' ends by time ⁷ⁱ,+ 5t' + 2d = 5t' + 2d in '. Indeed if the action that brings that system in state s' is a NewRoundi action for round r' then by Lemma 14 we have that the round ends by time ⁷ⁱ,+ 5t'+ 2d = 5t' + 2d. If action NewRoundi for roundr' has been executed before, roundr' ends even more quickly and the time bound holds anyway. Since the time of occurrence of s' is less or equal to t' + d we have that round r' ends by time 6t' + 3d. Considering the possibility that process i has to wait for an Init(v)i action we have that round r' ends by time 7ⁱ + 6t' + 3d in . Hence the lemma is true in this case.

Assume that either (a) process i has started a round r' by state s' but round r is not successful or (b) that process i has not started any round by state s'. In both cases process i executes a NewRoundi action by time ⁷ⁱ,+ 7t' + 2d = 7t' + 2d in ' . Indeed in case (a), by the code of STARTERi, action CheckRndSuccessi is executed within ⁷ⁱ,+ 6t' + 2d = 6t' + 2d time and it takes additional t' time to execute action NewRoundi. In case (b), by the code of BPLEADERi, action NewRoundi is executed within t' time. Let r '' be the round started by such an action.

Let s'' be the state after the execution of the NewRoundi action and let '' be the fragment of starting in s''. Since '' is a fragment of , we have that '' is nice, process i is the unique leader in '' and ⁷ⁱ,,=0. We notice that since the time of occurrence of state s' is less or equal to t' + d the time of occurrence of s'' is less or equal to 8t' + 3d in .

We now distinguish two possible cases.

Case 1: Roundr'' is successful. In this case, by Lemma 14 we have that round _r''is successful within ⁷ⁱ,,+ 5t' + 2d = 5t' + 2d time in ''. Since the time of occurrence

of S'' is less or equal to 8t' + 3d, we have that round T'' ends by time 13t' + 5d in . Considering the possibility that process i has to wait for an Init(v)i action we have that round T'' ends by time 7ⁱ + 13t' + 5d in . Hence the lemma is true in this case.

Case 2. Round T'' is not successful. By the code of STARTERi, action NewRoundi is executed within ^Ti,,+ 7" + 2d = 6t' + 2d time in ''. Indeed, it takes ^Ti,,+ 5' + 2d to execute action CheckRndSuccessi and additional " time to execute action NewRoundi. Let T''' be the new round started by i with such an action, let 5''' be the state of the system after the execution of action NewRoundi and let ''' be the fragment of '' beginning at 5'''. The time of occurrence of 5''' is less or equal than 15t' + 5d in .

Clearly ''' is nice, process i is the unique leader in '''. Any alive process j that rejected round T'' because of a round ^˜i , ^˜i>T'', has responded to the “Collect” message of round T'', with a message (T'',“OldRound”,^˜rj_; iinforming the leader i about round ^˜i. Since '' is nice all the “OldRound” messages are received before state 5'''. Since action NewRoundi uses a round number greater than all the ones received in “OldRound” messages, we have that for any alive process j, s'''.Commit j<T'''. Let / be the set of alive processes. In state s''', for every j E / and any k E 5, CHANNELk_, _jdoes not contain any “Collect” message belonging to any round ^˜1>T''' nor such a message is present in any InMsgs _jset (indeed this is true in state s'). Finally, since '' is nice, by definition of nice execution fragment, we have that / contains a majority of the processes.

Hence, we can apply Lemma 15 to the execution fragment '''. By Lemma 15, round T''' is successful within ^Ti,,,+ 5' + 2d = 5' + 2d time from the beginning of '''. Since the time of occurrence of 5''' is less or equal to 15t' + 5d in , we have that round T''' ends by time 20t' + 7d in . Considering the possibility that process i has to wait for an Init(v)i action we have that round T''' ends by time ^7j+ 20t' + 7d in . Hence the lemma is true also in this case.

If the execution is stable for enough time, then the leader election eventually elects a unique leader (Lemma 7). In the following theorem we consider a nice execution fragment and we let i be the process eventually elected unique leader. We recall that ti is the time of occurrence of action Init(v)i in and that " and d are constants.

Theorem 17. Let be a nice execution fragment of _SPAXstarting in a reachable state and lasting for more than tⁱ + 35t' + 13d. Then the leader i executes Decide(v')i by time tⁱ + 32t' + 11d from the beginning of . Moreover by time tⁱ + 35t' + 13d from the beginning of any alive process j executes Decide(v' )j.

Proof. Since _SPAX^containsSLEA ^andSBPX as subsystems, by Theorem 3 we can use any property of _SLEA^andSBPX. Since the execution fragment is nice (and thus stable), by Lemma 7 there is a unique leader by time 4t' + 2d. Let 5' be the first unique-leader state of and let i be the leader. By Lemma 7 the time of occurrence of s' before or at time 4'+2d. Let ' be the fragment of starting in state s'. Since is nice, ' is nice.

By Lemma 16 we have that the leader reaches a decision by time 7j, + 20t' + 7d from the beginning of '. Summing up the times and noticing that ⁷ⁱ,ti , + 7" + 2d

and that ^ti, <t ⁱ we have that the leader reaches a decision by time ^ti + 31t' + 11d. Within additional t' time action Decide(v')i is executed.

The leader reaches a decision by time ^ti + 31t' + 11d. By Lemma 11 we have that a decision is reached by every alive process j within additional 3t' + 2d time, that is by time ^ti + 34t' + 13d. Within additional t' time action Decide(v' )j is executed.

6.5. Messages

It is not difficult to see that in a nice execution, which is an execution with no failures, the number of messages spent in a round is linear in the number of processes. Indeed in a successful round the leader broadcasts two messages and the agents respond to the leader’s messages. Once the leader reached a decision another broadcast is enough to spread this decision to the agents. It is easy to see that, if everything goes well, at most 6n messages are sent to have all the alive processes reach the decision.

However failures may cause the sending of extra messages. It is not difficult to construct situations where the number of messages sent is quadratic in the number of processes. For example if we have that before i becomes the unique leader, all the processes act as leaders and send messages, even if i becomes the unique leader and conducts a successful round, there are 61(n²) messages in the channels which are delivered to the agents which respond to these messages.

Automaton BPSUCCESS keeps sending messages to processes that do not acknowledge the “Success” messages. If a process is dead and never recovers, an infinite number of messages is sent. In a real implementation, clearly the leader should not send messages to dead processes.

Finally the automaton DETECTOR sends an infinite number of messages. However the information provided by this automaton can be used also by other applications.

6.6. Concluding remarks

The PAXOS algorithm was devised in [19]. In this section we have provided a new presentation of the PAXOS algorithm. We conclude this section with a few remarks.

The first remark concerns the use of majorities for info-quorums and accepting-quorums. The only property that is used is that there exists at least one process common to any info-quorum and any accepting-quorum. Thus any quorum scheme for info-quorums and accepting-quorums that guarantees the above property can be used.

As pointed out in also [21], the amount of stable storage needed can be reduced to a very few state variables. These are the last round started by a leader (which is stored in the CurRnd variable), the last round in which an agent accepted the value and the value of that round (variables LastR, Last V), and the round for which an agent is committed (variable Commit). These variables are used to keep consistency, that is, to always propose values that are consistent with previously proposed values, so if they are lost then consistency might not be preserved. In our setting we assumed that the entire state of the processes is in stable storage, but in a practical implementation only the variables described above need to be stable.

A practical implementation of PAXOS should cope with some failures before abandoning a round. For example a message could be sent twice, since duplication is not a problem for the algorithm (it may only affect the message analysis), or the time bound checking may be done later than the earliest possible time to allow some delay in the delivery of messages.

A recover may cause a delay. Indeed if the recovered process has a bigger identifier than the one of the leader then it will become the leader and will start new rounds, possibly preventing the old round from succeeding. As suggested in Lamport’s original paper, one could use a different leader election strategy which keeps a leader as long as it does not fail. However, it is not clear to us how to design such a strategy.

7. The MULTIPAXOS algorithm

The PAXOS algorithm allows processes to reach consensus on one value. We consider now the situation in which consensus has to be reached on a sequence of values; more precisely, for each integer k, processes need to reach consensus on the kth value. The MULTIPAXOS algorithm reaches consensus on a sequence of values; it was discovered by Lamport at the same time as PAXOS [19].

Informally, to achieve consensus on a sequence of values we can use an instance of PAXOS for each integer k, so that the kth instance is used to agree on the kth value. Since we need an instance of PAXOS to agree on the kth value, we need for each integer k an instance of the BASICPAXOS and STARTER automata. To distinguish instances we use an additional parameter that specifies the ordinal number of the instance. So, we have

BASICPAXOS( 1), BASICPAXOS(2), BASICPAXOS(3), etc., where BASICPAXOS(k) is used to agree

on the kth value. This additional parameter will be present in each action. For instance, the Init(v)i and Decide(v')i actions of process i become Init(k_; v)i and Decide(k; v' )i in BASICPAXOS(k)i. Similar modifications are needed for all other actions. The STARTERi automaton for process i has to be modified in a similar way. Also, messages belonging to the kth instance need to be tagged with k.

This simple approach has the problem that an infinite number of instances must be started unless we know in advance how many instances of PAXOS are needed. Hence it is not practical. Furthermore, we have not defined the composition of Clock GTAs for an infinite number of automata (see Section 2).

We can follow a different approach consisting in modifying the BASICPAXOS and STARTER automata of PAXOS to obtain the MULTIPAXOS algorithm. This differs from the approach described above because we do not have separate automata for each single instance. The MULTIPAXOS algorithms takes advantage of the fact that, in a normal situation, there is a unique leader that runs all the instances of PAXOS. The leader can use a single message for step 1 of all the instances. Similarly step 2 can also be handled grouping all the instances together. As a consequence, less messages are used. Then, from step 3 on each instance must proceed separately; however step 3 is performed only when an initial value is provided.

Though the approach described above is conceptually simple, it requires some change to the code of the automata we developed in Section 6. To implement MULTIPAXOS we need to modify BASICPAXOS and STARTER. Indeed BASICPAXOS and STARTER are designed to handle a single instance of PAXOS, while now we need to handle many instances all together for the first two steps of a round. As the changes to the automata code are only technical, we do not provide the modified code; however we refer the interested reader to [5].

The correctness follows from the correctness of PAXOS. Indeed for every instance of PAXOS, the code of MULTIPAXOS provided in this section does exactly the same thing that PAXOS does; the only difference is that Step 1 (as well as Step 2) is handled in a single shot for all the instances. It follows that Theorem 17 can be restated for each instance k of PAXOS.

8. Application to data replication

Providing distributed and concurrent access to data objects is an important issue in distributed computing. The simplest implementation maintains the object at a single process which is accessed by multiple clients. However, this approach does not scale well as the number of clients increases and it is not fault-tolerant. Data replication allows faster access and provides fault tolerance by replicating the data object at several processes.

One of the best known replication techniques is majority voting (e.g., [15, 16]). With this technique both update (write) and non-update (read) operations are performed at a majority of the processes of the distributed system. This scheme can be extended to consider any “write quorum” for an update operation and any “read quorum” for a non-update operation. Write quorums and read quorums are just sets of processes satisfying the property that any two quorums, one of which is a write quorum and the other one is a read quorum, intersect (e.g., [12]). A simple quorum scheme is the write-all=read-one scheme (e.g., [3]) which gives fast access for non-update operations.

Another well-known replication technique relies on a primary copy. A distinguished process is considered the primary copy and it coordinates the computation: the clients request operations to the primary copy and the primary copy decides which other copies must be involved in performing the operation. The primary copy technique works better in practice if the primary copy does not fail. Complex recovery mechanisms are needed when the primary copy crashes. Various data replication algorithms based on the primary copy technique have been devised (e.g., [10, 11,23]).

It is possible to use MULTIPAXOS to design a data replication algorithm that guarantees sequential consistency and provides the same fault tolerance properties of MULTIPAXOS. The resulting algorithm lies between the majority voting and the primary copy techniques. It is similar to voting schemes since it uses majorities to achieve consistency and it is similar to primary copy techniques since a unique leader is required to achieve termination. Using MULTIPAXOS gives much flexibility. For instance, it is not a disaster

when there are two or more “primary” copies. This can only slow down the computation, but never results in inconsistencies. The high fault tolerance of MULTIPAXOS results in a highly fault tolerant data replication algorithm, i.e., process stop and recovery, loss, duplication and reordering of messages, timing failures are tolerated.

We can use MULTIPAXOS in the following way. Each process in the system maintains a copy of the data object. When client i requests an update operation, process i proposes that operation in an instance of MULTIPAXOS. When an update operation is the output value of an instance of MULTIPAXOS and the previous update has been applied, a process updates its local copy and the process that received the request for the update gives back a report to its client. A read request can be immediately satisfied returning the current state of the local copy. We refer the reader to [5] for automaton code implementing the above algorithm.

9. Conclusion

This paper revisits Lamport’s PAXOS algorithm which is a practical and elegant algorithm for solving distributed consensus. Nevertheless, it seems to be not widely known or understood. A modular and detailed description of the algorithm is provided along with a formal proof of correctness and a performance analysis. The formal framework used is provided by the Clock GTA model which is a special I/O automaton model suitable for practical time performance analysis based on the stabilization of the physical system.

Possible future work encompasses an implementation of PAXOS and of data replication algorithms based on PAXOS. We recently learned that Lee and Thekkath [22] used an algorithm based on PAXOS to replicate state information within their Petal system which implements a distributed file server.

Acknowledgements

The first author would like to thank Idit Keidar and Sam Toueg for useful discussions on related work.

References

[1] T.D. Chandra, V. Hadzilacos, S. Toueg, The weakest failure detector for solving consensus, J. ACM 43 (2) (1996) 685–722. A preliminary version appeared in the Proc. 11th Annual ACM Symp. on Principles of Distributed Computing, Vancouver, British Columbia, Canada, August 1992, pp. 147–15 8.

[2] T.D. Chandra, S. Toueg, Unreliable failure detector for asynchronous distributed systems, J. ACM 43 (2) (1996) 225–267. A preliminary version appeared in the Proc. 10th Annual ACM Symp. on Principles of Distributed Computing, August 1991, pp. 325–340.

[3] E.C. Cooper, Replicated distributed programs, UCB/CSD 85/231, University of California, Berkeley, CA, May 1985.

[4] F. Cristian, C. Fetzer, The timed asynchronous system model, Dept. of Computer Science, UCSD, La Jolla, CA. Tech. Rep. CSE97-519.

[5] R. De Prisco, Revisiting the Paxos algorithm, M.S. Thesis, Massachusetts Institute of Technology, Lab. for Computer Science, Cambridge, MA, June 1997. Tech. Rep. MIT-LCS-TR-717, Lab. for Computer Science, MIT.

[6] R. De Prisco, B. Lampson, N. Lynch, Revisiting the Paxos algorithm, in Proc. 11th Internat. Workshop on Distributed Algorithms, Saarbrücken, Germany, September 1997, pp. 111–125.

[7] D. Dolev, C. Dwork, L. Stockmeyer, On the minimal synchrony needed for distributed consensus, J. ACM 34(1) (1987) 77–97.

[8]D. Dolev, R. Friedman, I. Keidar, D. Malkhi, Failure detectors in omission failure environments, in Proc. 16th Annual ACM Symp. on Principles of Distributed Systems, Santa Barbara, CA, August 1997, p. 286. Also TR 96-1608, Department of Computer Science, Cornell University, September, 1996 and TR CS96-13, Institute of Computer Science, The Hebrew University of Jerusalem.

[9] C. Dwork, N. Lynch, L. Stockmeyer, Consensus in the presence of partial synchrony, J. ACM 35(2) (1988) 288–323.

[10] A. El Abbadi, D. Skeen, F. Cristian, An efficient fault-tolerant protocol for replicated data management, Proc. 4th ACM SIGACT/SIGMOD Conf. on Principles of Database Systems, 1985.

[11] A. El Abbadi, S. Toueg, Maintaining availability in partitioned replicated databases, Proc. 5th ACM SIGACT/SIGMOD Conf. on Principles of Data Base Systems, 1986.

[12] A. Fekete, N. Lynch, A. Shvartsman, Specifying and using a partitionable group communication service, in Proc. 16th Annual ACM Symp. on Principles of Distributed Computing, August 1997, pp. 53–62.

[13] M.J. Fischer, The consensus problem in unreliable distributed systems (a brief survey). Rep. YALEU/DSC/RR-273. Dept. of Computer Science, Yale Univ., New Have, Conn., June 1983.

[14] M.J. Fischer, N. Lynch, M. Paterson, Impossibility of distributed consensus with one faulty process, J. ACM 32(2) (1985) 374–382.

[15] D.K. Gifford, Weighted voting for replicated data, Proc. 7th ACM Symp. on Oper. Systems Principles, SIGOPS Oper. Systems Rev. 13 (5) (1979) 150–162.

[16] M.P. Herlihy, A quorum-consensus replication method for abstract data types, ACM Trans. Comput. Systems 4(1) (1986) 32–53.

[17] I. Keidar, D. Dolev, Efficient message ordering in dynamic networks, in Proc. 15th Annual ACM Symp. on Principles of Distributed Computing, May 1996, pp. 68–76.

[18] I. Keidar, D. Dolev, Increasing the resilience of distributed and replicated database systems, J. Comput. System Sci. (JCSS), special issue with selected papers from PODS 1995, 57 (3) (1998) 309–324.

[19] L. Lamport, The part-time parliament, ACM Trans. Comput. Systems 16 (2) (1998) 133–169. Also Research Report 49, Digital Equipment Corporation Systems Research Center, Palo Alto, CA, September 1989.

[20] B. Lampson, How to build a highly available system using consensus, in Proc. 10th Internat. Workshop on Distributed Algorithms, Bologna, Italy, 1996, pp. 1–15.

[21] B. Lampson, W. Weihl, U. Maheshwari, Principle of Computer Systems: Lecture Notes for 6.826, Fall 1992, Research Seminar Series MIT-LCS-RSS 22, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, July 1993.

[22] E.K. Lee, C.A. Thekkath, Petal: distributed virtual disks, in Proc. 7th Internat. Conf. on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, October 1996, pp. 84–92.

[23] B. Liskov, B. Oki, Viewstamped replication: A new primary copy method to support highly-available distributed systems, in Proc. 7th Annual ACM Symp. on Principles of Distributed Computing, August 1988, pp. 8–17.

[24] N. Lynch, Distributed Algorithms, Morgan Kaufmann Publishers, San Francisco, 1996.

[25] N. Lynch, M.R. Tuttle, An introduction to I/O automata, CWI-Quart. 2 (3) 219–246. CWI, Amsterdam, The Netherlands, Sep 89. Technical Memo MIT-LCS-TM-373, Lab. for Computer Science, MIT, Cambridge, MA, USA, Nov 88.

[26] N. Lynch, F. Vaandrager, Forward and backward simulations for timing-based systems. in Real-Time: Theory in Practice, Lecture Notes in Computer Science, Vol. 600, Springer, Berlin, 1992, pp. 397–446.

[27] N. Lynch, F. Vaandrager, Forward and backward simulations – Part II: Timing-based systems, Technical

Memo MIT-LCS-TM-487.b, Lab. for Computer Science, MIT, Cambridge, MA, USA, April 1993.
[28] N. Lynch, F. Vaandrager, Actions transducers and timed automata, Technical Memo MIT-LC S-TM‑

480.b, Lab. for Computer Science, MIT, Cambridge, MA, USA, October 1994.

[29] M. Merritt, F. Modugno, M.R. Tuttle, Time constrained automata. CONCUR 91: 2nd Internat. Conf. on Concurrency Theory, Lecture Notes in Computer Science, Vol. 527, Springer, Berlin, 1991, pp. 408–423.

[30] B. Oki, Viewstamped replication for highly-available distributed systems, Ph.D. Thesis, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, 1988.

[31] B. Patt-Shamir, A theory of clock synchronization, Ph.D. Thesis, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, October 1994.

[32] B. Patt-Shamir, S. Rajsbaum, A theory of clock synchronization, in Proc. 26th Symp. on Theory of Computing, May 1994.

[33] M. Pease, R. Shostak, L. Lamport, Reaching agreement in the presence of faults, J. ACM 27(2) (1980) 228–234.

[34] D. Skeen, Nonblocking Commit Protocols, Proc. ACM SIGMOD Internat. Conf. on Management of Data, May 1981, pp. 133–142.

[35] D. Skeen, D.D. Wright, Increasing availability in partitioned database systems, TR 83-581, Dept. of Computer Science, Cornell University, Mar 1984.