

Correctness of At-Most-Once Message Delivery Protocols

Butler W. Lampson (a), Nancy A. Lynch (b) and Jørgen F. Søgaard-Andersen (c)*

(a) Cambridge Research Laboratory, DEC, Cambridge, MA 02139, USA.

(b) Lab. for Computer Science, MIT, 545 Tech. Square, Cambridge, MA 02139, USA.

(c) Department of Computer Science, Technical University of Denmark, Building 344, DK-2800 Lyngby, Denmark.

This paper addresses the issues of formal description and verification for communication protocols. Specifically, we present the results of a project concerned with proving correctness of two different solutions to the at-most-once message delivery problem. The two implementations are the well-known five-packet handshake protocol and a timing-based protocol developed for networks with bounded message delays.

We use an operational automaton-based approach to formal specification of the problem statement and the implementations, plus intermediate levels of abstraction in a step-wise development from specification to implementations. We use simulation techniques for proving correctness.

In the project we deal with safety, timing, and liveness properties. In this paper, however, we concentrate on safety and timing properties.

Keyword Codes: C.2.2; D.2.4; F.1.1

Keywords: Computer Systems Organization, Network Protocols; Software, Program Verifica­tion; Theory of Computation, Models of Computation

1. INTRODUCTION

During the past few years, the technology for formal verification of communication protocols has matured to the point where we believe that it now provides practical assistance for protocol design and validation. Recent advances include the development of formal models that allow reasoning about timed systems as well as untimed systems, e.g., [2, 5, 13], and the development of simulation techniques (including refinement mappings and forward and backward simulations) for proving that one protocol implements another, e.g., [1, 5-7, 13]. In this paper, we show how these techniques can be used to verify an important class of communication protocols - those for at-most-once message delivery. The goals of our project are twofold: to provide better understanding, documentation and proof for these protocols, and to test the adequacy of the models and proof techniques.

The at-most-once message delivery problem is that of delivering a sequence of messages submitted by a user at one location to a user at another location. Ideally, we would like to insist that all messages be delivered in the order in which they are sent, each exactly once, and that an acknowledgement be returned for each delivered message.1

*Research supported in part by the Danish Research Academy.

1 Our definition of at-most-once message delivery is different from what some people call at-most-once message delivery in that we include acknowledgements and require messages to be delivered in order.


Unfortunately, it is expensive to achieve these goals in the presence of failures (e.g., node crashes and timing anomalies). In fact, it is impossible to achieve them at all unless some change is made to the stable state (i.e., the state that survives a crash) each time a message is delivered. To permit less expensive solutions, we weaken the statement of the problem slightly. We allow some messages to be lost when a node crash occurs; however, no messages should otherwise be lost, and those messages that are delivered should not be reordered or duplicated. (The specification is weakened in this way because message loss is generally considered to be less damaging than duplicate delivery.) Now it is required that the user receive either an acknowledgement that the message has been delivered, or in the case of crashes, an indication that the message might have been lost.

There are various ways to solve the at-most-once message delivery problem. All are based on the idea of tagging a message with an identifier and transmitting it repeatedly to overcome the unreliability of the channel. The receiver2 keeps a stock of "good" identifiers that it has never accepted before; when it sees a message tagged with a good identifier, it accepts it, delivers it, and removes that identifier from the set. Otherwise, the receiver just discards the message, perhaps after acknowledging it. Different protocols use different methods to keep the sender and the receiver more or less in agreement about what identifiers to use.
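To make the idea concrete, here is a minimal sketch (in Python, our own illustration rather than any of the protocols treated in this paper) of a receiver that keeps such a stock of good identifiers; the names Receiver, good_ids and on_packet are ours.

class Receiver:
    """Sketch of the common idea: accept a message only if its identifier is
    still in the set of "good" (never previously accepted) identifiers."""

    def __init__(self, good_ids):
        self.good_ids = set(good_ids)    # identifiers the receiver is still willing to accept
        self.delivered = []              # messages passed to the user, in order

    def on_packet(self, msg, ident):
        if ident in self.good_ids:
            self.good_ids.remove(ident)  # never accept this identifier again
            self.delivered.append(msg)   # deliver the message exactly once
            return "ack"
        return "discard"                 # duplicate or stale packet; possibly acknowledge anyway

Retransmissions of the same packet are then discarded, so duplicates are never delivered even over a channel that duplicates packets.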

A desirable property, which is not directly related to correctness, is that the implementations offer a way of cleaning up "old" information when this cannot affect the future behavior.

In this work, we consider two protocols that are important in practice: the clock-based protocol of Liskov, Shrira and Wroclawski [10] and the five-packet handshake protocol of Belsnes [3]. The latter is the standard protocol for setting up network connections, used in TCP, ISO TP-4, and many other transport protocols. It is sometimes called the three-way handshake, because only three packets are needed for message delivery; the additional packets are required for acknowledgement and cleaning up the state. The former protocol was developed as an example to show the usefulness of clocks in network protocols [9] and has been implemented at M.I.T. Both protocols are sufficiently complicated that formal specification and proof seems useful.

The basic model we use is the (timed) automaton model of Lynch and Vaandrager [13] with an additional component to express liveness [5]. We express both protocols, as well as the formal specification of the at-most-once message delivery problem, in terms of this model. In the project we carry out complete correctness proofs for both protocols. Some highlights of our proofs are as follows:

Although the two protocols appear to be quite different, we have found that both can be expressed formally as implementations of a common generic protocol G, which, in turn, is an implementation of the problem specification. To prove that G implements the specification, for proof-technical reasons we introduce an additional level of abstraction, the delayed-decision specification D. This is depicted in Figure 1. Introducing intermediate levels of abstraction, like G and D, is a general proof strategy that allows large, complicated proofs to be split into smaller and more manageable subproofs.

2We denote by "receiver" the protocol entity that is situated on the receiver node, and use phrases like "the user at the receiver end" to denote the user that communicates with the receiver. Correspondingly for "sender".


 

[Figure 1 shows the levels of abstraction: the specification S at the top, below it the delayed-decision specification D, below that the generic protocol G, and at the bottom the two implementations, the clock-based protocol C and the five-packet handshake protocol H.]

Figure 1. Overview of the levels of abstraction in our work.


The proof that an implementation correctly implements a specification is divided into two parts. First, a simulation technique is applied to show that the implementation safely implements the specification, i.e., that all safety (and timing) properties of the implementation are allowed by the specification. Then, building heavily on the simulation result, we prove liveness properties and thus correctness.

As proof techniques we use a backward simulation to show that D safely implements the specification, and forward simulations to show that each of the two protocols safely implements G and that G safely implements D. Because of space limitations, we only treat safe implementation in this paper. Our intention with this paper is to give an overview of our work and to convey our experiences with specifying and verifying practical communication protocols. For this reason we have left out many formal details. We refer to our full report [8] for such details. The full report also contains an exhaustive treatment of liveness properties of the protocols.

The rest of the paper is organized as follows. We start in Section 2 by giving an introduction to our model. Then in Sections 3 and 4 we present the problem specification and the low-level protocol C. In this paper we will only briefly deal with the H protocol. In Section 5 we show how aspects of both low-level protocols can be captured in the generic protocol G, and in Section 6 we outline the correctness proofs. In doing so we present the delayed-decision specification D. Finally, we give concluding remarks in Section 7.

2. THE UNDERLYING THEORY

The general model we use to specify safety and timing properties at all levels of abstraction is based on the (timed) automaton model of [5, 13] and the I/O automaton model of [11, 12]. An automaton is a state machine with named actions associated with its transitions. Thus, an automaton consists of


·      a (possibly infinite) set of states. In timed systems a now component indicates the time,

·      a (nonempty) set of start states,

·      a set of actions partitioned into visible actions (which are furthermore partitioned into input actions and output actions), internal actions (which are invisible from the environment of the system), and, for timed systems, a special time-passage action, and

·      a transition relation of ((pre-)state, action, (post-)state) triples. Each triple is called a step.

For timed systems, the steps involving the time-passage action have to satisfy certain axioms that give natural properties of real time, e.g., that time cannot go backwards.
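As a concrete (and purely illustrative) rendering of this definition, the following Python sketch represents an untimed automaton by its components; the field and method names are ours, not part of the model of [5, 13].

from dataclasses import dataclass

@dataclass
class Automaton:
    states: set               # possibly infinite in general; finite here for illustration
    start_states: set         # nonempty subset of states
    input_actions: set        # visible actions controlled by the environment
    output_actions: set       # visible actions controlled by the automaton itself
    internal_actions: set     # actions invisible to the environment
    steps: set                # set of (pre-state, action, post-state) triples

    def visible_actions(self):
        return self.input_actions | self.output_actions

    def enabled(self, state, action):
        """True if some step performs `action` from `state`."""
        return any(s == state and a == action for (s, a, _) in self.steps)

For the timed case one would add a now component to every state and a time-passage action whose steps advance now monotonically.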

Correctness of an automaton is specified in terms of its external behavior given by the set of traces, each of which consists of a sequence of visible actions that the automaton can perform. In timed systems these actions are furthermore paired with their time of occurrence to form timed traces. One automaton, A, safely implements another automaton, B, if the set of (timed) traces of A is included in that of B.

The papers [6, 13, 5] present a collection of simulation proof techniques (refinement mappings, forward and backward simulations, etc.) for showing that one (timed) automaton safely implements another. Each of these techniques involves describing a relation between the states of an implementation automaton and those of a specification automaton. Each technique requires a particular type of correspondence between initial states of the two automata, as well as a particular type of correspondence between their steps. Figure 2 illustrates how each step of the implementation automaton must correspond to a (possibly empty) sequence of steps of the specification automaton containing exactly the same visible actions.


The main distinction between forward and backward simulations, which are abstract representations of history [16] and prophecy [1] variables, respectively, lies in the way these corresponding sequences of steps of the specification are found given a step of the implementation: in a forward simulation it must be shown that from each state of the specification which is related to the pre-state of the step of the implementation, there exists a sequence of steps (with the right actions) ending in some state of the specification which is related to the post-state of the step of the implementation. Thus, one must, for all states related to the pre-state, trace forward to find some state related to the post-state. Conversely, in a backward simulation one must, for all states related to the post-state, trace backward to find some state related to the pre-state. A refinement mapping is a forward simulation where the relation is a function.

Since we only have to consider the steps of the implementation automaton which start in a reachable state, we will usually prove some invariants, i.e., properties that are true of all the reachable states, to restrict the states that we need to consider.

More formally, let A and B be automata and let R be a relation over the states of A and the states of B. Then R is a forward simulation from A to B if

·      For each start state of A, there is a related (by R) start state of B.

·      For each step (s, a, s') of A, where s and s' satisfy the invariants of A, and each state u related to s that satisfies the invariants of B, there exists a sequence of steps of B, starting in u and ending in some state related to s', such that the sequence of steps contains the same visible actions as (s, a, s').


[Figure 2 shows a specification-level execution fragment and an implementation-level execution fragment connected by the simulation relation: each step of the implementation corresponds to a sequence of specification steps with the same visible actions.]

Figure 2. Example of a simulation. The actions a and b are visible actions. The rest of the transitions are thought of as labelled by internal actions.


For timed systems the now components and time-passage actions require additional treat­ment. A backward simulation can be defined similarly.
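For finite automata, the two conditions of this definition can be checked by brute force. The Python sketch below is our own illustration (it ignores invariants and the timed case, and bounds the length of the matching sequences); FiniteAutomaton and the helper function are assumptions of the sketch, not part of the cited papers.

from collections import namedtuple

# start_states: set of states; steps: set of (pre, action, post) triples; visible: set of actions
FiniteAutomaton = namedtuple('FiniteAutomaton', 'start_states steps visible')

def is_forward_simulation(R, A, B, max_len=3):
    """R is a set of (state-of-A, state-of-B) pairs."""
    # Condition 1: every start state of A is related to some start state of B.
    for s in A.start_states:
        if not any((s, u) in R for u in B.start_states):
            return False
    # Condition 2: every step of A can be matched, from every related state of B,
    # by a (possibly empty) sequence of B-steps with the same visible actions.
    for (s, a, s2) in A.steps:
        wanted = (a,) if a in A.visible else ()
        targets = {u2 for (x, u2) in R if x == s2}
        for u in [u for (x, u) in R if x == s]:
            if not (matching_states(B, u, wanted, max_len) & targets):
                return False
    return True

def matching_states(B, u, wanted, max_len):
    """States of B reachable from u by at most max_len steps whose sequence of
    visible actions equals `wanted` (internal steps may be interleaved freely)."""
    frontier, results = {(u, wanted)}, set()
    for _ in range(max_len + 1):
        new_frontier = set()
        for (v, rest) in frontier:
            if not rest:
                results.add(v)
            for (x, b, v2) in B.steps:
                if x != v:
                    continue
                if b in B.visible:
                    if rest and rest[0] == b:
                        new_frontier.add((v2, rest[1:]))
                else:
                    new_frontier.add((v2, rest))
        frontier = new_frontier
    return results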

Inductive proofs show the soundness of the simulation proof techniques, i.e., that they imply safe implementation. However, not all techniques are complete, meaning that for some of the simulation techniques, e.g., forward and backward simulations, safe imple­mentation does not imply the existence of such a simulation. Also, different simulation techniques apply to different situations. Thus, although a forward simulation is the most intuitive technique, there are some situations that require other techniques, like backward simulations. For instance, a backward, but not a forward, simulation can be used in situations where the implementation makes some decisions later than the corresponding decisions are made in the specification. We will see an example of this in Section 6.

3. SPECIFICATION S

This section is devoted to giving the specification, called S, of the at-most-once message delivery problem more formally. Since this is an untimed specification, we give the specification in terms of an untimed automaton.

Figure 3 shows the user-interface of the protocols to be developed by depicting the specification as a "black box" with visible input and output actions. A user can send a message m by performing a send msg(m) action and the protocol can deliver the next message m by performing a receive msg(m) action. A Boolean acknowledgement b is passed to the user at the sender side by an ack(b) action. At both the sender and receiver sides, a crash action causes the protocol to enter a recovery phase where messages might be lost (modelled by an internal lose action). A recover action, at the side where the crash occurred, then signals the end of the recovery phase, after which no messages can be lost unless new crashes occur.

The following code describes the specification S in a simple precondition-effect style commonly used for I/O-automata protocols [11, 12]. Note that input actions have no preconditions, since our system should be able to respond to input from the environment at any time.


[Figure 3 shows the specification S as a black box. On the sender side the visible actions are send msg(m), ack(b), crash_s and recover_s; on the receiver side they are receive msg(m), crash_r and recover_r.]

Figure 3. The specification S as a "black box"


The state of the automaton contains a queue of pending messages, plus two flags rec_s and rec_r, where the subscripts refer to the sender and receiver sides, to indicate that we are in a recovery phase, and a status component giving the status of the last message sent. The status can be either "?", denoting that the last message sent is still in queue; true, denoting that the last message sent has been successfully delivered to the user; or false. A status value of false, and therefore a negative acknowledgement, only allows the user to conclude that there has been a crash, but even in this situation the last message sent may have been successfully delivered.3

 

send msg(m)
Eff: queue := queue"m
     status := ?

ack(b)
Pre: status = b (∈ Bool)
Eff: none

receive msg(m)
Pre: queue ≠ () ∧ hd queue = m
Eff: queue := tl queue
     if queue = () ∧ status = ? then status := true

crash_s
Eff: rec_s := true

crash_r
Eff: rec_r := true

lose
Pre: rec_s = true ∨ rec_r = true
Eff: delete arbitrary elements of queue
     if the element at the end of queue was deleted then status := false
     else optionally status := false

3Throughout this paper we use the following operations on lists: let l be a list (e_0, ..., e_n) with elements e_0 through e_n. Then l"m and tl l denote the lists (e_0, ..., e_n, m) and (e_1, ..., e_n), respectively, and hd l and last l denote the elements e_0 and e_n, respectively.


[Figure 4 shows the structure of the clock-based protocol C: the Sender and the Receiver are connected by Channel_sr and Channel_rs through send pkt and receive pkt actions, and interact with their users through send msg(m), ack(b), crash_s, recover_s on the sender side and receive msg(m), crash_r, recover_r on the receiver side.]

Figure 4. The Clock-Based Protocol C.

recover_s
Pre: rec_s = true
Eff: rec_s := false

recover_r
Pre: rec_r = true
Eff: rec_r := false
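The precondition-effect code for S translates almost directly into an executable sketch. The Python rendering below is ours; the initial value of status is not specified in the excerpt and is chosen arbitrarily, and the nondeterministic choices of the lose action are made explicit parameters.

class SpecS:
    """Executable sketch of specification S; '?' is the undetermined status."""

    def __init__(self):
        self.queue = []                  # pending messages, oldest first
        self.status = True               # status of the last message sent (initial value arbitrary)
        self.rec_s = self.rec_r = False  # recovery-phase flags

    def send_msg(self, m):
        self.queue.append(m)
        self.status = '?'

    def ack(self):
        assert self.status in (True, False)      # precondition: status is Boolean
        return self.status

    def receive_msg(self):
        assert self.queue                        # precondition: queue is nonempty
        m = self.queue.pop(0)
        if not self.queue and self.status == '?':
            self.status = True
        return m

    def crash_s(self): self.rec_s = True
    def crash_r(self): self.rec_r = True

    def lose(self, drop_indices, also_mark_false=False):
        assert self.rec_s or self.rec_r          # only enabled during a recovery phase
        last_dropped = bool(self.queue) and (len(self.queue) - 1) in drop_indices
        self.queue = [m for i, m in enumerate(self.queue) if i not in drop_indices]
        if last_dropped or also_mark_false:      # "optionally status := false"
            self.status = False

    def recover_s(self):
        assert self.rec_s
        self.rec_s = False

    def recover_r(self):
        assert self.rec_r
        self.rec_r = False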

4. CLOCK-BASED PROTOCOL C

Figure 4 shows the structure of the clock-based protocol of [10], which we call C. An additional component, "the clock subsystem", is needed to keep the local clocks of the sender and the receiver "almost" synchronized.

Informally, the clock-based protocol works as follows. The sender associates a timestamp with each message it wishes to transmit. The timestamps are obtained from the sender's local clock time_s. The receiver uses the associated timestamp to decide whether or not to accept a received message: roughly, it will accept a message provided that the associated timestamp is greater than the timestamp of the last message that was accepted. However, the receiver does not always remember the timestamp of the last accepted message: it might forget this information because of a crash, or simply because a long time has elapsed since the last message was accepted and it is no longer efficient to remember it. Therefore, the receiver uses safe time estimates determined from its own local clock (time_r) to decide when to accept a message. The estimates are kept in the variables lower_r and upper_r; the receiver accepts if the message's timestamp is in the interval (lower_r, upper_r].

The lower_r bound is designed to be at least as big as the time of the last message accepted. It can be bigger, however, as long as it is sufficiently less than the receiver's local time (at least approximately a one-way message delay less). This is because the receiver should not accidentally fail to accept a valid message that takes the maximum time to arrive. We note that the reason we do not want to remember just the last timestamp is that we envision using this protocol in parallel for many users, and a single lower_r bound could be used for all users that have not sent messages for a long while.

The upper_r bound is chosen to be big enough so that the receiver still accepts the most recent messages, even if they arrive very fast. That is, it should be somewhat larger than the current time. But this bound will be kept in stable storage, and therefore should not be updated very often. Thus, it will generally be set to be a good deal larger than the current local time. In particular, it will be larger than the timestamp of any message so far accepted.



All that needs to be kept in stable storage is just the local clocks, plus the one variable upper_r of the receiver. When the receiver side crashes and recovers again, it resets its lower_r bound to the old upper_r bound, to be sure that it will not accept, and thus deliver, any message twice. This explains why we cannot just set upper_r to infinity.
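The window management just described fits in a few lines. The Python sketch below is ours; the variable names follow the paper, beta stands for the constant β added to the receiver's clock, and the initial value of lower_r is chosen to match the recovery code (last_r := 0).

class ReceiverWindow:
    """Sketch of the receiver's acceptance window in the clock-based protocol C."""

    def __init__(self, time_r, beta):
        self.beta = beta
        self.lower_r = 0                 # at least the timestamp of the last accepted message
        self.upper_r = time_r + beta     # kept in stable storage

    def accept(self, t):
        # Accept a message timestamped t iff t lies in the interval (lower_r, upper_r].
        if self.lower_r < t <= self.upper_r:
            self.lower_r = t             # never accept this or an older timestamp again
            return True
        return False

    def recover(self, time_r):
        # After a crash only upper_r (and the clock) survives; refusing every timestamp
        # up to the old upper_r guarantees that no message is delivered twice.
        self.lower_r = self.upper_r
        self.upper_r = time_r + self.beta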

There is also a simple acknowledgement protocol, which we do not discuss in detail here.

We now consider how to model this protocol formally. Since we are in a timed setting, the architecture of the protocol consists of the parallel composition of a timed automaton for each of the sender and receiver, plus timed automata to represent the two communication channels, plus an additional timed automaton to represent an almost-synchronized clock subsystem.

Each communication channel acts as follows. A send pkt(p) action places the indicated packet4 p in the channel. The channel is allowed to lose, duplicate, and reorder any packets, but we will assume that for every k send pkt(p) actions for a particular packet, at least one packet p is not lost. For each such packet p, a receive pkt (p) action will occur within time d, the maximum channel delay time. We note that it is possible to give a more abstract description of the channels; our results do not depend on this particular description.

The clock subsystem updates the local clocks of the sender and receiver by issuing tick actions in arbitrary ways so as to keep those clocks within ε of real time. We do not describe the channels and the clock subsystem formally in this paper.
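Operationally, the channel just described can be sketched as below (our illustration only); it models loss, duplication and reordering, but not the timing bound d or the guarantee that one out of every k retransmissions gets through.

import random

class Channel:
    """Sketch of an unreliable channel: packets may be lost, duplicated or reordered."""

    def __init__(self):
        self.in_transit = []

    def send_pkt(self, p):
        if random.random() < 0.9:                      # loss: some packets never arrive
            copies = 2 if random.random() < 0.1 else 1   # duplication: occasionally deliver twice
            self.in_transit.extend([p] * copies)

    def receive_pkt(self):
        if not self.in_transit:
            return None
        i = random.randrange(len(self.in_transit))     # reordering: deliver in any order
        return self.in_transit.pop(i)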

For the sender and receiver, we first mention the state variables not described above and then define the transition relations.

For the sender, mode_s, ranging over idle, send, and rec, indicates whether the sender is idle, sending the current message to the channel, or in recovery phase, respectively. The variable current-msg_s holds the current message, while last_s holds its timestamp. The list of messages buf_s contains the remaining messages waiting to be sent. Finally, current-ack_s of type Bool holds the acknowledgement received from the receiver.

For the receiver, mode_r ranges over idle, rcvd, ack, and rec. Here rcvd indicates that packets have been received on the channel but the associated messages not yet passed to the user, ack indicates that all messages have been passed to the user and that the receiver is issuing positive acknowledgements, and rec indicates the recovery phase. The list of messages buf_r holds the messages received from the channel but not yet passed to the user. For the simple acknowledgement protocol, we have last_r to hold the timestamp of the last accepted message, and nack-buf_r to hold the timestamps for which negative acknowledgements must be issued. Finally, the variable rm-time_r is used by the cleanup_r action.

The definition of the steps is listed below. We have placed each send pkt action next to the corresponding receive pkt action to increase readability.

4Here and elsewhere, we use the term "packet" to denote objects sent over the channels in an implementation; we reserve the term "message" for the "higher-level", user-meaningful messages that appear, e.g., in the specification.



We treat timing requirements implicitly by giving upper time bounds on certain classes of actions. This corresponds to the way timing requirements are specified in [15]. The code contains three unspecified timing constants α, β, and ρ, to be explained after the code.

send msg(m)
Eff: if mode_s ≠ rec then buf_s := buf_s"m

choose id(t)
Pre: mode_s = idle ∧ buf_s ≠ () ∧ time_s = t ∧ t > last_s
Eff: mode_s := send
     last_s := t
     current-msg_s := hd buf_s
     buf_s := tl buf_s

 

send pkt_sr(m, t)
Pre: mode_s = send ∧ current-msg_s = m ∧ last_s = t
Eff: none

receive pktsr (in, t)

Eff: if moder=rec                                            ~then

if lowerr <t < upperr then

moder := rcvd

bufr := bufr " in

lastr, lowerr :=t

else if lastr <t < lowerr then

nack-bufr := nack- buf r "t

else if moder =idle A lastr = t then moder := ack

 

receive msg(m)
Pre: mode_r = rcvd ∧ buf_r ≠ () ∧ hd buf_r = m
Eff: buf_r := tl buf_r
     if buf_r = () then
       mode_r := ack
       rm-time_r := time_r


receive pkt_rs(t, b)
Eff: if mode_s = send ∧ last_s = t then
       mode_s := idle
       current-ack_s := b
       current-msg_s := nil

ack(b)
Pre: mode_s = idle ∧ buf_s = () ∧ current-ack_s = b
Eff: none


send pkt_rs(t, true)
Pre: mode_r = ack ∧ last_r = t
Eff: mode_r := idle

send pkt_rs(t, false)
Pre: mode_r ≠ rec ∧ nack-buf_r ≠ () ∧ hd nack-buf_r = t
Eff: nack-buf_r := tl nack-buf_r


recover_s
Pre: mode_s = rec
Eff: mode_s := idle
     last_s := time_s
     buf_s := ()
     current-msg_s := nil
     current-ack_s := false

recover_r
Pre: mode_r = rec ∧ upper_r + 2ε < time_r
Eff: mode_r := idle
     last_r := 0
     rm-time_r := ∞
     buf_r, nack-buf_r := ()
     lower_r := upper_r
     upper_r := time_r + β

increase-lower_r(t)
Pre: mode_r ≠ rec ∧ lower_r < t ≤ time_r − ρ
Eff: lower_r := t

increase-upper_r(t)
Pre: mode_r ≠ rec ∧ upper_r < t ≤ time_r + β
Eff: upper_r := t

cleanup_r
Pre: mode_r ∈ {idle, ack} ∧ time_r > rm-time_r + α
Eff: mode_r := idle
     last_r := 0
     rm-time_r := ∞

tick_s(t)
Eff: time_s := t

tick_r(t)
Eff: time_r := t

The correctness of the clock-based protocol requires that we put upper time bounds on one set of actions of the sender. Informally,

·      every time a send pkt_sr(m, t) action becomes possible (or stays possible after being executed), it must occur within time l_s unless it gets disabled in the meantime.

Similarly, for the receiver we need to put upper bounds on two classes of actions:

·      send pkt_rs(id, true) has an upper bound of l_r, and

·      increase-upper_r(t) has an upper bound of l_u.

The correctness of the protocol depends on the timing constants in the code being related properly to these time bounds, and to channel and local clock characteristics. The requirements are: β > 2ε + l_u, ρ > k·l_s + d + 2ε, and α > k(l_r + d) + (k−1)·k·l_s + 2ε. Note how all three constants depend on the maximum difference of 2ε between the local clocks.
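Read as constraints, these relations are easy to instantiate. The numbers below are invented, purely to illustrate how α, β and ρ are obtained from ε, d, k and the action bounds; only the inequalities themselves come from the text.

# Illustrative check of the timing-constant requirements for protocol C (all values made up).
eps, d, k = 0.01, 0.5, 3            # clock skew bound, maximum channel delay, retransmission count
l_s, l_r, l_u = 0.1, 0.1, 0.1       # bounds on send pkt_sr, send pkt_rs(., true), increase-upper_r

beta = 2 * eps + l_u + 0.05                                   # beta  > 2*eps + l_u
rho = k * l_s + d + 2 * eps + 0.05                            # rho   > k*l_s + d + 2*eps
alpha = k * (l_r + d) + (k - 1) * k * l_s + 2 * eps + 0.05    # alpha > k*(l_r + d) + (k-1)*k*l_s + 2*eps

assert beta > 2 * eps + l_u
assert rho > k * l_s + d + 2 * eps
assert alpha > k * (l_r + d) + (k - 1) * k * l_s + 2 * eps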

In this paper we do not give a formal specification of H. Unlike C, which uses timing assumptions, H uses handshakes to first make the sender and receiver agree on a message identifier and then perform the actual message transmission. An additional packet type is used as cleanup information. We proceed by describing the generic protocol G.


5. GENERIC PROTOCOL G

The two protocols C and H both go through three major phases during normal operation:

Choosing a message identifier The sender picks an identifier id that is within the set of identifiers that the receiver is willing to accept. In C time bounds are used to choose a good identifier; in H an initial handshake between the sender and the receiver is used.

Sending the message and getting acknowledgement This phase is similar in both C and H. The sender (re)transmits the current message with the chosen id, until it receives an acknowledgement packet for that id.

Cleaning up Here again, C uses time bounds (in particular timeouts) whereas H uses a handshake to determine when some "old" information may be discarded.

Our generic protocol G is designed to capture these three phases in an abstract way that both C and H implement. The key abstractions incorporated into the protocol G are two variables, good_s and good_r. The variable good_s represents the identifiers that the sender might shortly assign to messages, and good_r represents the identifiers that the receiver is willing to accept.

Four actions of G deal with "growing" and "shrinking" good_s and good_r. As an example, remember that in C the identifier for the current message is taken from the sender's local clock time_s. Every time the local clock time_s is advanced, the identifier that the sender might assign to the current message is changed from the old value of time_s to the new value. In G this corresponds to first "shrinking" good_s with the old value and then "growing" it with the new value.

The preconditions of the grow and shrink actions are designed to preserve certain key invariants, one of which we will present in Section 6. We allow more freedom in these actions than is actually needed by C and H. This leaves open the possibility that low-level protocols other than C and H can be proved to be implementations of G. We show the parts of G that deal with good_s and good_r.

send msg(m) · · ·

prepare
Pre: mode_s = idle ∧ buf_s ≠ ()
Eff: mode_s := needid
     good_s := { }
     current-msg_s := hd buf_s
     buf_s := tl buf_s
     if mode_r ≠ rec then current-ok := true

choose id(id)
Pre: mode_s = needid ∧ id ∈ good_s
Eff: mode_s := send
     last_s := id
     used_s := used_s"id


send pkt_sr(m, id)
Pre: mode_s = send ∧ last_s = id ∧ current-msg_s = m
Eff: none

receive pkt_sr(m, id)
Eff: if mode_r ≠ rec then
       if id ∈ good_r then
         mode_r := rcvd
         buf_r := buf_r"m
         last_r := id
         good_r := good_r \ {id' | id' ≤ id}
         if id = last_s ∧ mode_s = send then current-ok := false
       else if id ≠ last_r then
         optionally nack-buf_r := nack-buf_r"id
       else if mode_r = idle then
         mode_r := ack


receive msg(m) · · ·
receive pkt_rs(id, b) · · ·
ack(b) · · ·
crash_s · · ·
recover_s · · ·
send pkt_rs(id, true) · · ·
send pkt_rs(id, false) · · ·
crash_r · · ·
recover_r · · ·


shrink good_s(ids)
Pre: none
Eff: good_s := good_s \ ids

grow good_s(ids)
Pre: mode_s ≠ needid ∨
     ((mode_r ≠ rec ⇒ ids ⊆ issued_r) ∧
      (current-ok = true ⇒ ids ⊆ good_r) ∧
      (ids ∩ used_s = { }))
Eff: good_s := good_s ∪ ids


shrink good_r(ids)
Pre: current-ok = false ∨
     ((mode_s = needid ⇒ ids ∩ good_s = { }) ∧
      (mode_s = send ⇒ last_s ∉ ids))
Eff: good_r := good_r \ ids

grow good_r(ids)
Pre: ids ∩ issued_r = { }
Eff: good_r := good_r ∪ ids
     issued_r := issued_r ∪ ids

cleanup_r
Pre: mode_r ∈ {idle, ack} ∧
     (mode_s = send ⇒ last_s ≠ last_r)
Eff: mode_r := idle
     last_r := nil


6. CORRECTNESS PROOFS

In this section we sketch the proofs of safe implementation for the different levels of abstraction in our work. We will not be strictly formal. For such formal treatment and full proofs we refer to our full report [8].


6.1. Correctness of G

To prove that G is a safe implementation of S, we need a backward simulation. Informally this is because G (and the lower-level protocols) may postpone the decisions about which messages to lose because of a crash until after recovery, whereas in S message loss occurs between crash and recovery. It is due to certain race conditions on the channels that the decisions are delayed in G, C, and H.

Since a backward simulation directly from G to S is still more complicated than we would like, we split the task one more time. Our strategy is to try to localize the backward simulation reasoning, because reasoning in this way seems to be inherently difficult compared to the more intuitive forward simulation and refinement mapping techniques. Thus, we define a new, "delayed-decision" version D of the specification (see Figure 1). D is just like S except that it delays the point at which the decision about loss of messages is made. Now, when a crash occurs, messages in the system and status may get marked. Later, even after recovery, any marked message is allowed to be lost. Also, a marked status is allowed to be lost, i.e., changed to false.

Due to space constraints we will not define D formally, nor will we define or prove the backward simulation from D to S. We only note that such a backward simulation exists, which allows us to conclude that D safely implements S.

The proof that G safely implements D uses certain invariants. For instance, the following key invariant of G states that when no crashes have occurred since the sender executed a prepare action (i.e., current-ok = true) and the sender still has not chosen an identifier, then the sender can only choose identifiers that are considered good by the receiver. The shrink good and grow good actions are explicitly designed to preserve this invariant (among others).

If current-ok = true ∧ mode_s = needid then good_s ⊆ good_r.
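The role of the preconditions can be seen in a small sketch (ours, in Python, following the grow and shrink preconditions as listed above): growing good_s and shrinking good_r assert exactly what is needed to keep the invariant.

class GenericG:
    """Fragment of G: only what is needed for the invariant
       current_ok and mode_s == 'needid'  implies  good_s <= good_r."""

    def __init__(self):
        self.good_s, self.good_r = set(), set()
        self.used_s, self.issued_r = set(), set()
        self.mode_s, self.mode_r = 'idle', 'idle'
        self.current_ok, self.last_s = False, None

    def grow_good_s(self, ids):
        ids = set(ids)
        assert (self.mode_s != 'needid' or
                ((self.mode_r == 'rec' or ids <= self.issued_r) and
                 (not self.current_ok or ids <= self.good_r) and
                 not (ids & self.used_s)))
        self.good_s |= ids

    def shrink_good_r(self, ids):
        ids = set(ids)
        assert (not self.current_ok or
                ((self.mode_s != 'needid' or not (ids & self.good_s)) and
                 (self.mode_s != 'send' or self.last_s not in ids)))
        self.good_r -= ids

    def invariant(self):
        return not (self.current_ok and self.mode_s == 'needid') or self.good_s <= self.good_r

If the invariant holds before grow_good_s or shrink_good_r, the asserted preconditions guarantee that it still holds afterwards.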

The proof that G safely implements D is now discharged by exhibiting a refinement mapping from G to D. Again, we do not give details, but refer to [8] for the complete proofs.

Together, the existence of the backward simulation from D to S and the refinement mapping from G to D allow us to conclude that G safely implements S.

6.2. Correctness of C and H

To prove that C and H safely implement S we just have to prove some simulations from C and H to G, since we have already shown that G safely implements S. We only consider C.

The first step, a technical one, in the correctness proof involves adding a history variable [16] deadline to C to get an equivalent version of the protocol (which we still call C). The variable deadline is set to the current real time plus a maximum one-way delay when the current message gets a timestamp, and gets reset (to ∞) when either a crash occurs or the current message gets accepted by the receiver. Below we show an invariant that states that this deadline is always met in any execution of C. Another history variable, used_s, is a list containing the timestamps used so far.

Next, note that G is formalized as an untimed automaton whereas C is formalized as a timed automaton. We resolve this model inconsistency by converting G into the timed model to get G^t. That is, an untimed automaton can be considered to be a timed automaton where time can pass arbitrarily. (Additional liveness restrictions on G will make sure that time-passage is not the only activity of G^t.) Below we prove that C safely implements G^t in the timed setting. Certain embedding results then allow us to conclude that since G safely implements S, G^t safely implements the converted specification S^t, which furthermore implies that C safely implements S^t. Thus, the embedding results allow us to work mostly within the simpler untimed model.



Definition 1 (Refinement from C to G^t) If s is a state of C then R_CG(s) is the state u of G^t such that

u.good_s       = {s.time_s} \ {s.last_s}
u.good_r       = (s.lower_r, s.upper_r]
u.issued_r     = (0, s.upper_r]
u.current-ok   = (s.deadline ≠ ∞)

All remaining variables (including the channels) have the same contents in C and G^t.

The timestamp that the sender in C might associate with a message, corresponding to good_s in G, is taken from the local clock time_s, but only if the local clock has advanced since the last timestamp (last_s). This explains the line for good_s. The lines for good_r and issued_r can be explained similarly.
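Stated as code, the mapping is simply a recomputation of the four abstract variables from the concrete state. The Python sketch below (ours) follows Definition 1, representing a protocol state as a dictionary and the half-open intervals symbolically as tuples.

def refinement_R_CG(s):
    """Map a state s of the clock-based protocol C to the corresponding state u of G^t.
    Only the four abstract variables of Definition 1 are recomputed; all other
    variables, including the channels and now, carry over unchanged."""
    u = dict(s)
    u['good_s'] = {s['time_s']} - {s['last_s']}                 # a fresh timestamp, if the clock has advanced
    u['good_r'] = ('open-closed', s['lower_r'], s['upper_r'])   # the interval (lower_r, upper_r]
    u['issued_r'] = ('open-closed', 0, s['upper_r'])            # the interval (0, upper_r]
    u['current_ok'] = (s['deadline'] != float('inf'))           # a message is in transit and no crash has occurred
    return u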

A key invariant we need in the refinement proof is that the history variable deadline of C is always at least as big as real time, i.e., now ≤ deadline. This can then be used to prove the following invariant: for all timestamps t on the channel from sender to receiver, deadline ≠ ∞ ⇒ upper_r > t. This invariant states that upper_r is always sufficiently large when the current message is being transmitted on the channel and no crashes have occurred.

Lemma 2 R_CG is a refinement mapping from C to G^t.

Proof The proof of a refinement mapping in the timed setting has three points, which we sketch here:

·      For any state s of C, R_CG(s) has the same real time (now) as s. This is satisfied immediately by the definition of R_CG.

·      For any start state s of C, R_CG(s) is a start state of G^t. This is easy to check.

·      For each step (s, a, s') of C, where s and s' satisfy the invariants of C, we must show the existence of a sequence of steps (R_CG(s), a_1, ..., a_n, R_CG(s')) of G^t with the same trace (see Figure 2). We conduct such a proof by doing a case analysis based on the different actions a of C. For instance, if a = increase-lower_r(t), we show that (R_CG(s), shrink good_r((0, t]), R_CG(s')) is a step of G^t.

The correctness of H can also be proved using the refinement mapping technique.


7. CONCLUSION

In this paper, we have used a simple automaton model to present the at-most-once message delivery problem and two interesting solutions, and have used simulation techniques to prove that the two algorithms meet the specification. We have only given arguments for safety and timing properties here, and have left liveness for a longer report. We believe that this work yields important insights into the protocols, and also serves to show the adequacy of the model and proof techniques. Similar protocols have been verified formally, using different techniques. LOTOS has, for instance, been used in [14].

There is a considerable amount of further work remaining. First, if the timing assumptions on C are weakened or removed, the resulting algorithm still will not deliver any message more than once; however, it may lose messages even in the absence of a crash. It remains to formulate the weaker specification and prove that the weaker version of C satisfies it.

Second, there are other algorithms that solve the at-most-once message delivery problem, for example, using bounded identifier spaces or cryptographic assumptions. We would also like to verify these, again reusing as much of our proofs as possible. In [17] bounded identifier spaces are dealt with for similar protocols.

Third, we would like to automate our simulation proofs using a mechanical theorem prover. We have already begun this work, by proving the equivalence of versions of S and D using the Larch Prover [18, 4]. We have been pleased with the preliminary results: the prover has not only been able to check our hand proofs, but in fact has been able to fill in many of the details. We can draw several conclusions:

·      Automata, invariants, and simulations are all excellent tools for verifying timed and untimed communication protocols. The methods scale well, yield insight, and are not too difficult to use.

·      A general model such as timed automata is needed in order to model and verify most communication protocols in a single coherent framework. However, for reasoning about particular protocols, it is often better to work in a simplified special case of the general model. (For instance, for untimed protocols such as H, it is best to avoid details of timing.) What is needed is a collection of special models, each of which can be easily "embedded" in the general model.

·      Safety proofs are challenging. They require insight to obtain the right invariants and simulations, and a lot of detailed work to verify these choices. Computer assistance can help with the details; however, the insight will always be required.

·      Backward simulations are much harder to do than refinements and forward simulations but are necessary in certain situations.

·      Many algorithms can be treated as implementations of a common abstract algorithm.

·      Verifying a coordinated collection of protocols, rather than just a single isolated protocol, is extremely valuable. It leads to the discovery of useful abstractions, and tends to make the proofs more elegant and reusable.


• Doing proofs for realistic communication protocols is feasible now. We predict that it will become more so, and will be of considerable practical importance.

REFERENCES

1. M. Abadi and L. Lamport. The existence of refinement mappings. Theoretical Computer Science, 82(2):253-284, May 1991.

2. M. Abadi and L. Lamport. An old-fashioned recipe for real time. In J. W. de Bakker, C. Huizing, and G. Rozenberg, editors, Proceedings of REX Workshop "Real-Time: Theory in Practice", Mook, The Netherlands, June 1991, number 600 in Lecture Notes in Computer Science, pages 1-27. Springer-Verlag, 1992.

3. D. Belsnes. Single message communication. IEEE Transactions on Communications, COM-24(2), February 1976.

4. Stephen J. Garland and John V. Guttag. A guide to LP, the Larch Prover. Technical Report 82, DEC, Systems Research Center, December 1991.

5. R. Gawlick, N. Lynch, R. Segala, and J. Søgaard-Andersen. Liveness in timed and untimed systems. Technical report, MIT, Laboratory for Computer Science, 1993.

6. B. Jonsson. Simulations between specifications of distributed systems. In J. C. M. Baeten and J. F. Groote, editors, Proceedings of CONCUR '91, 2nd International Conference on Concurrency Theory, Amsterdam, The Netherlands, August 1991, number 527 in Lecture Notes in Computer Science, pages 346-360. Springer-Verlag, 1991.

7. S. S. Lam and A. U. Shankar. Protocol verification via projections. IEEE Transactions on Software Engineering, 10(4), 1984.

8. B. Lampson, N. Lynch, and J. Søgaard-Andersen. Reliable at-most-once message delivery protocols. Technical report, MIT, Laboratory for Computer Science, 1993. Full report.

9. B. Liskov. Practical uses of synchronized clocks in distributed systems. In Proceedings of the Tenth Annual ACM Symposium on Principles of Distributed Computing, pages 1-10, 1991.

10. B. Liskov, L. Shrira, and J. Wroclawski. Efficient at-most-once messages based on synchronized clocks. ACM Transactions on Computer Systems, 9(2):125-142, May 1991.

11. N. Lynch and M. Tuttle. Hierarchical correctness proofs for distributed algorithms. Technical Report MIT/LCS/TR-387, MIT, Laboratory for Computer Science, Cambridge, MA 02139, April 1987.

12. N. Lynch and M. Tuttle. An introduction to Input/Output automata. CWI-Quarterly, 2(3):219-246, September 1989.

13. N. Lynch and F. Vaandrager. Forward and backward simulations for timing-based systems. In J. W. de Bakker, C. Huizing, and G. Rozenberg, editors, Proceedings of REX Workshop "Real-Time: Theory in Practice", Mook, The Netherlands, June 1991, number 600 in Lecture Notes in Computer Science, pages 397-446. Springer-Verlag, 1992.

14. Eric Madelaine and Didier Vergamini. Specification and verification of a sliding window protocol in LOTOS. In K. R. Parker and G. A. Rose, editors, Formal Description Techniques, IV, pages 495-510. North-Holland, 1991.

15. M. Merritt, F. Modugno, and M. Tuttle. Time-constrained automata. In Proceedings of CONCUR '91, 2nd International Conference on Concurrency Theory, Amsterdam, The Netherlands, August 1991, number 527 in Lecture Notes in Computer Science, pages 408-423. Springer-Verlag, 1991.

16. S. Owicki and D. Gries. An axiomatic proof technique for parallel programs I. Acta Informatica, 6(4):319-340, 1976.

17. A. U. Shankar. Verified data transfer protocols with variable flow control. ACM Transactions on Computer Systems, 7(3):281-316, August 1989.

18. Jørgen F. Søgaard-Andersen, Stephen J. Garland, John V. Guttag, Nancy A. Lynch, and Anna Pogosyants. Computer-assisted simulation proofs. In Costas Courcoubetis, editor, Computer Aided Verification, 5th International Conference, CAV '93, Elounda, Greece, June/July 1993, volume 697 of Lecture Notes in Computer Science, pages 305-319. Springer-Verlag, 1993.