Butler Lampson - Interactive machine-language programming

University of California, Berkeley

INTRODUCTION

The problems of machine language programming, in the broad sense of coding in which it is possible to write each instruction out explicitly, have been curiously neglected in the literature. There are still many problems which must be coded in the hardware language of the computer on which they are to run, either because of stringent time and space requirements or because no suitable higher level language is available.

It is a sad fact, however, that a large number of these problems never run at all because of the inordinate amount of effort required to write and debug machine language programs. On those that are undertaken in spite of this obstacle, a great deal of time is wasted in struggles between programmer and computer which might be avoided if the proper systems were available. Some of the necessary components of these systems,. both hardware and software, have been developed and intensively used at a few installations. To most programmers, however, they remain as unfamiliar as other tools which are presented for the first time below.

In the former category fall the most important features of a good assembler:" macro-instructions implemented by character substitution, conditional assembly instructions, and reasonably free linking of independently assembled programs. The basic components of a debugging system are also known but relatively unfamiliar.• For these the essential prerequisite is an interactive environment, in which the power of the computer is available at a console for long periods of time. The batch processing mode in which large systems are operated today of course precludes interaction, but programs for small machines are normally

*This research was supported in part by the Advanced Research Projects Agency of the Department of Defense under contract SD-185. _

debugged in this way, and as time-sharing becomes more widespread the interactive environment will become common.

It is clear that interactive debugging systems must have abilities very different from those of off-line systems. Large volumes of output are intolerable, so that dumps and traces are to be avoided at all costs. To take the place of dumps, selective examination and alteration of memory locations are provided. Traces give way to breakpoints, which cause control to return to the system at selected instructions. It is also essential to escape from the switches-and-lights console debugging common on small machines without adequate software. To this end, type-in and type-out of information must be symbolic rather than octal where this is convenient. The goal, which can be very nearly achieved, is to make the symbolic representation of an instruction produced by the system identical to the original symbolic written by the user. The emphasis is on convenience to the user and rapidity of communication.

The combination of an assembler and a debugger of this kind is a powerful one which can reduce by a factor of perhaps five the time required to write and debug a machine language program. A full system for interactive machine language programming (IMP), however, can do much more and, if properly designed, need not be more difficult to implement. The basic ideas behind this system are these:

1. Complete integration of the assembler and the
debugging system, so that all input goes through
the same processor. Much redundant coding is
thus eliminated, together with one of two differ‑
ent languages serving the same purpose: to
specify instructions in symbolic form. This con‑
cept requires that code be- assembled directly
- into core—or into a core ima^cre on, secondary

storage. Relocatable output and relocatable loaders are thereby done away with. (A remark on terminology: It will be convenient in the sequel to speak of the "assembler" and the "debugger" in the IMP system. These terms should be understood in the light of the foregoing: different parts of the same language are being referred to, rather than distinct languages.)

2. Commands for editing the symbolic source program. The edit commands simultaneously modify the binary program in core and the symbolic on secondary storage. Corrections made during debugging are thus automatically incorporated into the symbolic, and the labor of keeping the latter current is almost eliminated.

3. A powerful string-handling capability in the assembler, which makes it quite easy to write macros for compiling algebraic expressions, to take a popular example, which can be handled in a few other systems but rather clumsily. The point is not that one wants to write such macros, but that in particular applications one may want macros of a similar degree of complexity.

These matters are discussed in more detail below. We consider the assembler first and then the debugger since the command language of the latter makes heavy use of the assembler's features.

Before beginning the discussion it may be well to describe briefly the machine on which this system is implemented. It is a Scientific Data Systems 930, a 2-microsecond, single-address computer with indirect addressing ant' one index register. Our system irchides a drum which is large enough to hold for each user all the symbolic for a program being debugged, together with the system, a core image of the program and some tables. Backup storage of at least this size is essential for the editing features of the IMP system. The rest of the system could be implemented after a fashion with tapes.

THE BASIC ASSEMBLER

The input format of the assembler was originated on the TX-0 at M.I.T. It has been adopted by DEC for most of its machines, but is unknown or unpopular elsewhere in the industry. Although it looks strange at first, it has substantial advantages in terms of simplicity, both for the user and for the system. The latter is a nonnegligible consideration, equally often ignored and overemphasized. -

The basic idea is that the assembler processes each line of input as an _expression (unless it is a directiveor macro call).⁵ The expression is evaluated and the value is put into core at the word addressed by the location counter, after which the location counter is advanced by 1. Expressions are made up of operands, which may be symbols, constants, numeric or alphanumeric and parenthesized subexpressions; and operators. Available operators are -f-, *, /, .AND, .OR, .NOT with their usual meaning and precedence; .E (equals), .G (greater), .GE, .L, .LE, .NE, which are

binary operators with precedence less than and yield 1 or 0 depending on whether the indicated relation holds between the operands or not; and #, a unary operator with lowest precedence which causes its operand to be taken as a literal. This means that it is assigned a storage location, which is the same as the location assigned to other literals with the same value, and the address of this location is the value of the

literal. Blanks have the following significance: Any string of blanks not at the beginning or end of an expression is taken as a single plus sign. An expression is terminated by carriage return or semicolon. Several instructions may therefore be written on one physical line. This trivial feature proves in practice to have significant advantages.

It is not immediately clear how instructions are conveniently written as expressions, and in fact the scheme used depends on the fact that the object machine is a single-address, word-oriented computer with a reasonable number of modifiers in a single instruction. It would work on the PDP-6, but not on the IBM 7030.

The idea is simple: all operation code mnemonics are predefined symbols with values equal to the octal encodings of the instructions. On the SDS 930, for instance, LDA (load A) is defined as 7600000 (all numbers are in octal). The expression LDA+200 then evaluates to 7600200. When the convention about spaces is invoked, the expression

LDA 200

evaluates to the same thing, which is just the instruction we expect from this symbolic line in a conventional assembler.

Modifiers are handled in the same spirit. In the 24 bit word of the 930 there is an index bit, which is the second from the left, and an indirect bit, which is the tenth. With the predefined symbols

I=40000

X=20000000

the expression

LDA I 200 X

evaluates to 27640200. In more Conventional form it would look like this:

LDA* 200,2 -
There is little to choose between them for _brevity or.

clarity. Note that the order of the terms in the expression is arbitrary.

The greatest advantages of the uniform use of expressions accrue to the assembler, but the programmer gains a good deal of flexibility. Examples will readily occur to the reader.

Using this convention the implementation of the basic assembler is very simple. Essentially all that is required is an expression analyzer and evaluator, which will not run to more than three or four hundred instructions on any machine. Because all assembly is into core, there is no such thing as relocatability.

Two rather conventional methods are provided for defining symbols. A symbol appearing at the beginning of a line and followed by a comma is defined to be the current value of the location counter. Such a symbol may not be redefined. In addition, a line such as

SYM-4600

defines SYM. Any earlier definition is simply overridden. The right side may of course be any expression which can be evaluated.

The special symbol . refers to the location counter. It may appear on the left of a = sign. Thus, the line A, .=. 40

is equivalent to

A BSS 40

in a conventional assembler.

Note that the first punctuation character in a line of input to the assembler must be comma or space. The character . is not a punctuation character, but behaves exactly like a letter. Symbols reserved by the system begin with dot ordinarily. For convenience in forming negative addresses, the symbol .. is provided with a permanent value such that ..-1 is —1 truncated to the address field. On the 930, a two's complement machine with a 14 bit address field, .. is 40000.

Strings of characters encoded in ASCII may be written surrounded by single or double quotes, " or " ". If the string is less than 4 characters in length, it is equivalent to the number obtained by left-justifying it in a 24-bit word. Otherwise, it must appear alone on a line and generates enough words to accommodate all its characters. Strings in single quotes are scanned for : and & (see below); those in double quotes are taken literally.

The characters space * signal a comment, which is ignored up to the next carriage return. An initial * also has this effect.

There remains one point about the basic assembler which is crucially important to the implementation: the treatment of undefined symbols. When an expression is encountered during assembly, there is no guarantee that

it can be evaluated, since all the symbols in it may not _ •

be defined. This is the reason why most assemblers are two pass: the first pass serves to define the symbols. The increase in speed obtained by looking at the symbolic only once is so great, however, that it is worth a good deal of trouble. Even if every expression contains an undefined symbol on the first pass, it still takes only one-fifth as long to evaluate the already analyzed expressions as to read the input again, and this for a program with no macros. The assembler therefore keeps track of undefined expressions explicitly.

There is a general way of doing this, in which the undefined expression, translated for convenience into reverse Polish, is added to a list of such expressions, together with the address of the word it is to occupy. At suitable intervals this list is scanned and all the newly defined expressions are evaluated and inserted in the proper locations. For complex expressions there is no avoiding some such mechanism, and it has the advantage of simplicity. It is, however, wasteful of storage and also of time, since an expression may be examined many times while it is on the list before it can be evaluated. One important case can be treated much more efficiently, and this is the case of an instruction with an undefined address, which includes perhaps 90% of the occurrences of undefined expressions.

For example, when the assembler sees this code: X, BRU A *BRANCH UNCONDITIONAL LDA B

A, STA C

the instruction at X has an undefined address which becomes defined when the label A is encountered. This situation can be kept track of by putting in the symbol table entry for A the location of the first word containing A as an address. In the address of this word we put the location of the second such word, and so build a list through all the words containing the undefined symbol A as an address. The list is terminated by making the address field point to itself. When the symbol is defined we simply run down the chain and fill in the proper value. This scheme will work as long as the address field contains only A, since there is then no other information which must be preserved. Note that no storage is wasted and that when A is defined the correct address can be filled in very quickly.

STRINGS AND MACROS

The description of the basic assembler is now complete, except for a few nonessential details, and we turn to the macro and string handling facility. There is a uniform method for delimiting strings of characters, which may be illustrated by the assignment of such a string as the value of a symbol:

A = <B,(C,D),E,F>:

In order to describe the result of using A after this assignment, we introduce a distinction between the appearance of a symbol in a literal and in a normal context.

A symbol inside string brackets < > or single quotes or in a macro argument is in a literal context; all other contexts but one are normal. In a normal context, the value of the symbol, whether a string or a number, is substituted for the symbol. In a literal context, on the other hand, the characters of the symbol are passed on unaltered. The case of a symbol on the left side of an assignment is an exceptional one; such a symbol is of course not normally evaluated.

To permit the value of a symbol to be obtained in a literal context, the convention is introduced that a colon preceding the symbol causes it to be evaluated if the colon is at the top level of parentheses, brackets, and quotes. If its value is a string, the characters of the string replace the symbol; if it is a number the shortest string of digits which can represent the number in the prevailing radix replaces the symbol. Colon in a normal context is illegal.

For convenience in delimiting string names a second colon may follow a name preceded by a colon. This second colon serves only to delimit the name and is otherwise ignored. Thus if

AB = <XYZ>

then

<:AB> = <XYZ> and <:AB:CD> = <XYZCD>

There are times when it is desirable to force evaluation of a symbol in a normal context when it would normally pass unevaluated. The character & preceding the symbol has this effect; it is exactly like : except that it acts only in a normal context. Continuing the previous example:

VW&AB = VWXYZ and

&AB = 12 is equivalent to XYZ = 12.

A string may be thought of as having two kinds of structure:

1. It is composed of a sequence of characters.

2. It is composed of a sequence of substrings delimited by commas not enclosed in parentheses, brackets, or quotes.

With reference to the first structure, a single character may be selected by a subscript enclosed in brackets. Referring to the string assigned to A, we note that

A[2] is <,>, A[6] is <D>, and A[7] is <)>. By an obvious extension of this notation,

A [3,7] is <(C,D)> and A[9,11] is <E,F>. Subscripts which reference the substring structure are enclosed in parentheses. Thus

A(1) = <B> and A(2).= <C,D>.

Note that a single pair of parentheses surrounding a sub‑

string is removed. Subscripting may be iterated: A(2)(2) = <D>.

Subscripting is applied only to a string-valued symbol which is in a normal context or is evaluated by a colon. Subscripting of a name on the left side of an assignment forces it to be evaluated even if it is not preceded by a colon.

Two operations, .L and .LC, determine respectively the number of substrings and the number of characters in their arguments. Thus

.L(A) 4, .L(A(2)) = 2 and .LC(A) 11.

Having dealt with the general machinery for handling strings, we now turn to the slight refinement which adds macros with arguments to the system. This takes the form of a modification to the ordinary line assigning a string to a symbol, which permits an argument string to be specified. Thus

STORE <ARG> =

<.RPT.FOR T = 1, .L(ARG(2)),1 <ST&ARG (1 ) ARG (2) (T)>> >

defines a macro with two arguments, the first a string which, when appended to <ST>, creates a store instruction, and the second a list of locations to be stored into. Whenever STORE is used, the string of characters beginning with the first following nonblank character and ending with a line delimiter or unmatched right parenthesis is made the value of ARG. The string which is the value of STORE is then substituted for it as usual.

STORE might be called with

STORE A, ( S1 ,S2,S3 )

which is, because of the definition, equivalent to .RPT.FOR T = 1,3,1

To complete the expansion we must consider the .RPT directive which has been used above. This directive causes the string which follows to be scanned repeatedly. It takes one of two forms:

1. .RPT N <...>

which causes N repetitions, or

2. .RPT.FOR J = nl,n2,n3 <...>

which causes (n2—nl)/n3 + 1 repetitions with J initially set to nl, and then incremented by n3 until it ex-ceeds n2. Zero repetitions are possible. The n3 may be elided if it is 1.

The STORE macro call above may now be seen to expand into

STA Si

STA S2

STA S3

We illustrate with two further examples. The first is a generalized MOVE macro which takes as its arguments a_sequence of _pairs of _lists. The first list of each

pair specifies the locations to load from, while the second gives the corresponding locations to store into. A list may of course have only one element.

MOVE <ARG> =

<.RPT.FOR S1 = 1, .L (ARG) ,2

*THIS LINE STEPS THROUGH THE PAIRS OF LISTS

<.RPT.FOR S2 = 1, .L(ARG(S1))

*THIS LINE STEPS THROUGH THE ELEMENTS OF ONE PAIR OF LISTS

< LDA ARG(S1) (S2)

< STA ARG(S1 1)(S2) >>>

thus

MOVE A,B,C,D

becomes

LDA A

STA B

LDA C

STA D

So does

MOVE (A,C),(B,D)

Suppose that we have some two-word data structures to manipulate. We can attach to the name of each structure a string of the form <A,B>. A is the address of the first word of the structure, B of the second. A macro can do this and assign the storage.

TW <ARG> =

< Twsi = TWS + 1

ARG(1) = <TW:TWS,TW:TWS1> TW&TWS, 0

TW&TWS1, 0

TWS TWS + 2 >

Now, if we call TW twice after setting TWS to 1:

TW A

TW B

we will have given A the value <TW1,TW2> and B the value <TW3,TW4> and defined the four TW symbols.

We can now use A and B in the MOVE macro. In fact

MOVE A,B

expands to

LDA TWI

STA TW3

LDA TW2

STA TW4

With the addition of one more device we can proceed to the definition of a very grandiose macro. The directives .IF and .ELSF, used thus:

.IF E, <...>

. .ELSF E₂ <...>.ELSF En <...>

cause each E, in turn to be evaluated until one is greater than zero. The string following this one is then scanned and the rest of the structure ignored.

*THIS NACRE] CceaTUIS AN ARITMAITIC EXPRESSION CONSISTING OF SINGLE‑

·LETTER VARIABLES. BINART + AND - AND PmusalasES. IT CALLS TILE *MACRO ERROR IF ITOS EXPRESSION IS NOT WELL FORKED.

ARITH <ARO> -

· LXPR-crARG(1)4 ^•ATTEND - 10 TNI EXPRESSION

STX-<⁺> +INTTIALLTA THE SIAM 1414104 HANDLES

+TARENTRIALI

J-1 *INITIALIZE INK CULL= POTETZE

11-0 *INITIALIZE 111 TIDEPORA/T STORAGE C04111113

⁺IF 1tnpCRARY STORAGE TS REQUIEM IT IS ASSIOCID AS Taal, +iron, ETC., AND Ti MEI TRACK OP TIM NEXT ARK/LULL LOCATION.

01 +THIS IS THE MACAO WHICH DOES THE WORK

.1, T .NE 4211TAR>

*CHICX THAT EXPRESSION VAS NOT TERMINATED BY A RIGHT FAXININESIS.

·TH/S MACRO MULCTS A SUB-EXPRESSION cONsISTITC OF OeKLAJODS *STRUNG TOGETHER WITH + AND -. IP THE SUBEXPRESSION IS A SINGLE +VARIAELE. COP (CURRENT OPERAND) WILL BE THAT VARLARLE ON EXIT. ^AOTHERw/SE IT WILL BE M^yra.

< COP - +ENSURE THAT C01^. IS NOT Earn INITIALLY

^DAN men 001^, NEARS THAT CODE DAS BEEN ASSEMBLED LEAVING A VALUE

·IN THE A REGISTER. IT COP IS A LETTER, IT IS THE VARIA31X
*WHICH IS THE CURRENT OPERAND.

OPERAND *GET THE FIRST OPERAND

ALP? FOR E-1,1,0 *0 IS SET TO 2 WREN Tian ARE NO MORE + ON

*SIGNS

< ⁺EXPECTING AN OPERATOR ON Mod:KATT=

.11 T .0 .OR T .0 ')' 4-2>
+SET E 70 TERNINATE THE LOO? DI THIS CASE.

T .E <Co^peILE ADD.ADD>

T .E ‹COMTILE SUB,(CNA,ADD)>

+IF A + OR - IS PRESENT, GET THE SECOND OPERAND AND COMPILE CODE,

.ELSF I ^<ERROR> *OTHERWISE, ERROR

> > ..cLosE LOOP AND MACRO

+THIS MACRO COLLECTS THE SECOND OPERAND OF A BINARY OPERATOR AND *CONSTRUCTS CODE TO PERPORM THE SPECIFIED OPERATION. IT USES ITS ⁺FIRST ARGUMENT IF THE FIRST OPERAND IS IN THE A REGISTER. ITS ⁺SECOND AEGUMINT IF THE SECOND OPERAND ma? BE IN A AND THE FIRST ⁺TAKEN FROM MEMORY.

CON 01.1 <CARG> ‑

< OPERAND *GET THE SECOND OPERAND

.IF .LC(COP) .0 0

*TN THIS CASE THE SEODRD OPERAND IS A SOMA VARIABLE. < .LC (,REVD}) .0 0 <LOA TRIVOT>

+IF TEX FIRST OPERA/41) IS ALIO • VIETAILE (CS A MCP LOCATION) *BRING IT INTO A

CARG(1) CO? > ^RAND COEPTLE CODE

.EIS? 1 ^er-^Ala(^Z) ra^yon.

+0TNEIW/sE INA SECOND OPERAND MUST id TR A. AND TIE FIRST 11 LEASER! COP.< >

+SET COP TO INDICATE A VALUE IN A AND CLOSE TIM MACRO.

MIS MACRO COLLECTS AN 01^.1RAND, 1010) MAY BE • FARENTNISIZIMI ⁺SW/EXPRESSION

OPERAND.

< I - ..,EXPR[J]. +GET THE SKIT CHARACTER

J-J+1 +IT SHOULD BE A LITTER OR
.IF T .0

,LC(COT) .1 0

^DIY WE ALREADY NAVE I VALUE IN A IT ?CST BE SAVED IN TEXFORART *STORAGE /MLLE THE SUBEXPRISS/oN IS EVALUATED.

< TI TI +1:

STA TINP611 +CONSTRUCT A MCP LOCATION TO SAVE IT IN

C0?-<1101PI^.TI> > +AND RIDOOttall IT IN COP

STN-1/C.070STK^. +STICK COP ON THE FRONT OF STK
XI

.17 T .141 ^TZRIOR>

E-1 +RESET no TERMINATION SWITCH FOR XL

PREV01^.-./STK(1)> +sET TREVOP 70 THE OLD COP WHIM WAS SAVED
STK-c,sT^y(2,.1(100))^, >

⁺REMOTE OLD COP THINS STK AND TERMINATE THIS CASE. X1 HAS sir COP .ELSF T .CE ^.A^.-AND 7 .LE ^.E^.

^-.II I^. IS A LETTER (RECALL THAT THE CNARACTIE_Cool LS ASCII)

c EREToe-<,coe>

CUP-</EXYR[J-1], > -.EIS? 1 ^TRAAOR> >

This macro, called by

ARITH ( (A + B) — (C— D))

would generate

LDA A ADD B

STA TEMPI

LDA C SUB D CNA

ADD TEMPI

Note that there are only three lines in the definition which actually generate code. The temporary storage location TEMPI must be defined elsewhere.

The implementation of all this is quite straightforward. When a string is encountered, it is collected character by character, due attention being paid to colons, ampersands, brackets, and quotes, and stored away. When it is referenced, the routine which delivers characters to the assembler, which we will call CHAR, is switched from the input medium to the saved string. This process is of course recursive. When the string which is the current source of characters ends, CHAR is switched back to the string it was working on before. All the various occurrences of strings are treated perfectly uniformly, except that in the case of macro definitions the substrings of the argument string are delimited when the latter is collected to improve the efficiency. Perfectly arbitrary nesting of the various constructs is possible because of the recursiveness of the string collection and reference routines.

In the interests of efficiency the IF directive is not handled in this way, since its subject string is scanned either once or not at all. All that is necessary is a flag which indicates whether an .ELSF directive is to be considered or ignored.

THE DEBUGGING SYSTEM

An interactive debugging system should not be designed for the occasional user. Its emphasis must be on completeness, convenience, and conciseness, not on highly mnemonic commands and self-explanatory output. The basic capabilities required are quite simple in the main, but the form is all important because each command will be given so many times.

One essential, completely symbolic input and output is half taken care of by the assembler. The other half is easier than it might seem: given a word to be printed in symbolic form, the symbol table is scanned for an exact match on the opcode bits. If no match is found, the word is printed as a number. Otherwise the opcode mnemonic is printed, indirect and index bits are checked, the proper symbols printed, and the table is scanned for

the largest symbol not greater than the remainder of the word. This symbol is printed out, followed if necessary by a + and a constant.

The most fundamental commands are single characters, possibly preceded by modifiers. Thus to examine a register the user types

/x1-3; LDA I NUTS + 2

where the system's response is printed in capitals. This command may be preceded by any combination of modifiers:

C for printout in constant form

S for printout in symbolic form

0 for octal radix

D for decimal radix

R for relative (symbolic) address

A for absolute address

H for printout as ASCII characters

I for printout as signed integer

N for no printing of addresses

L (load) for no printing of register contents
The modifiers hold until the user types a carriage return or gives another / command.

For examining a sequence of registers, the commands + and — are available. The former examines the preceding register, the latter the following register. In the absence of a carriage return the modifiers of the last examination hold. The —› command examines the register addressed by the one last examined.

The contents of a register may be modified after examination simply by typing the desired new contents. Note that the assembler is always part of the command processor, and that debugging commands are differentiated by their format from words to be assembled (as noted above, an assembler line has comma or space at its first punctuation character, and all debugger lines have some other initial punctuation character). Furthermore, debugging commands may occur in macros, so that very elaborate operations can be constructed and then called on with the two or three characters of a macro name.

To increase the flexibility of debu^gging macros, the unary operator @ is defined. The value of @ SYM3 is the contents of location SYM3. With this operator, macros may be defined to type out words depending on very complicated conditions. A simple example is

TG<A> =

< .RPT.FOR TEMP = A(1),37777,1

*SCAN THROUGH ALL OF STORAGE STARTING AT THE LOCATION GIVEN BY

*THE FIRST ARGUMENT

< .IF @ TEMP E. (A)2

.*IF THE CURRENT LOCATION MATCHES THE

SECOND ARGUMENT, THE SCAN IS OVER

</ TEMP; *PRINT OUT THE
TEMPI =TEMP CONTENTS

TEMP = 37777 *SAVE THE ADDRESS

>>>> *AND TERMINATE

THE SCAN

Called with

TG 100,20

it will type out the first location after 100 with contents greater than 20.

Another important command causes an expression to be typed in a specified format. Thus if SYM has the value 1253 then

sym; 1253

would be the result of giving the = command. All the modifiers are available but the normal mode of type-out is constant rather than symbolic. If no expression is given, the one most recently typed is taken. Thus, after the above command, the user might try

; SYM (the system's response the sym‑

bolic equivalent of 1253, fol‑
lows the ;)

It is often necessary to search storage for occurrences of a particular word. This may be done with a macro, as indicated above, but long searches would be quite slow. A faster search can be made with

/expression;

which causes all the locations matching the specified expression to be typed out. The match may be masked, and the bounds of the search are adjustable. This command takes all the typeout modifiers as well as

which searches for a specified effective address (including indexing and indirect addressing) and

which searches for all exceptional words (which do not match). For additional flexibility the user may specify a macro which will be executed each time a matching word is found.

In addition to being able to examine and modify his program, the user also needs to he able to run it. To this end he may start it at a specified location with

,G location

If he wishes to monitor its progress he may insert breakpoints at certain locations with the command

,B location

This causes execution of the program to be interrupted at the specified location. Control returns to the system, which types some useful information and awaits further commands. An alternate form of this command is

,B location,marco name

which causes the specified macro to be executed at each break, instead of returning control directly to the

typewriter. Very powerful conditional tracing may be done in this way.

After a break has occurred, execution of the program may be resumed with the ,P command. The breakpoint is not affected. To prevent another break until the breakpoint has been passed n time the form

\n;

may be used. Modifiers may precede the command.

To step through the program, instruction by instruction, the command ,S may be used instead of ,P. It allows one instruction to be executed and then breaks again. $n; allows n instructions to be executed before breaking. A fully automatic trace has been deliberately omitted, but presents no difficulties in principle.

THE EDITOR

There remains one feature of great importance in the IMP system, the symbolic editor. The debugger provides facilities, which have already been described, for modifying the contents of core. These modifications, however, are not recorded in the symbolic version of the program. To permit this to be done, so that reloading will result in a correctly updated binary program, several commands are available which act both on the assembler binary and on the symbolic.

This operation is not as straightforward as it might appear, since there is no one to one correspondence between lines of symbolic and words of binary. Addresses given to the debugger of course refer to core locations, but for editing it is more convenient to address lines of symbolic. To permit proper correlation of these line references with the binary program, a copy Df the symbolic file is made during loading with the address of the first and last assembled words explicitly appended to each line. Since the program is not moved around during editing, these numbers do not change except locally. When a debugging session is complete, the edited symbolic is rewritten without this information.

We illustrate this with an example. Consider the symbolic and resulting binary

S I MOVE A,B (200,201) S I LDA A 200

STA B 201

ADD C (202,202) ADD C 202

STORE D,E (203,204) STA D 203

STA 204

S2 BRU SI (205,205) S2 BRU Si 205

and the editing command

,I S2-1 insert before line S2-1
SUB F

which gives rise to the following:

S1 MOVE A,B (200,201) Si LDA A 200

STA B 201

ADD C (202,1512) BRU .END 202

	SUB F	(1513,1513)		BRU	.END 1	203
	STORE D,E	(1514,204)		STA	E	204
S2	BRU Si	(205,205)	S2	BRU	_Si	205
			.END	ADD	C	1512
				SUB	F	1513
				STA	D	1514
				BRU	Si 4	1515
				BRU	Si 5	1516

All the BRU (branch unconditional) instructions are inserted to guarantee that the right thing happens if any of the instructions causes a skip. The alternative to this rather simple-minded scheme appears to be complete reassembly, which has been rejected as too slow. The arrangement outlined will deal correctly with patches made over other patches; although the binary may come to look rather peculiar, the symbolic will always be readable.

To give the user access to the readable symbolic the command

,L symbolic line address[,symbolic line address]^•.

(where the contents of the brackets is optionally included) causes the specified block of lines to be printed. Two other edit commands are available:

,D symbolic line address[,symbolic line address];

which deletes the specified block of lines, and

,C same arguments;

which deletes and then inserts the text which follows. Deleting S1 1 from the original program would, result in binary as follows

S1 LDA A

BRU .END

BRU .END 1

STA D STA E

S2 BRU SI

.END STA 11

BRU SI 3

The implementation of these commands is quite straightforward. One entire edit command is collected and the new text, if any, is assembled. Then the changed core addresses are computed and the appropriate record of the symbolic file rewritten.

The scheme has two drawbacks: It does not work properly for skips of more than one instruction or for subroutine Calls which pick up arguments from following locations, and it leaves core in a rather confusing state, especially after several patches have been made at the same location. The first difficulty can be avoided by changing large enough segments of the symbolic. The second can be alleviated by reassembly whenever things get too unreadable.

The only other published approach to the problem of patching binary programs automatically is that of Evans,' who keeps relocation information and relocates

the entire program after each change. This procedure is not very fast, and in any event is not practical for a system with no relocation.

EFFICIENCY

The IMP system depends for its viability on fast assembly. The implementation techniques discussed in this paper have permitted the first version of the assembler to attain the unremarkable but satisfactory speed of 200 lines per second. Simple character handling hardware would probably double assembly speed on simple assemblies and produce even greater improvement on programs with many macros and repeats.

Using the latter figures, •we^-deduce that a program of 10,000 instructions, a large one by most standards, will load in 25 seconds. This number indicates that the cost of the IMP approach is not at all unreasonable—far more computer time, including overhead, is likely to be spent in the debugging operations which follow this load. When only minor changes are made it is, of course, possible to save the binary core image and thus avoid reloading.

In spite of the speed of the assembler, it is possible that a relocatable loader might be a desirable adjunct to the system. There are no basic reasons why it should not be included.

As to the size of the system, the assembler is about 2500 instructions, the debugger and editor about 2000.

ACKNOWLEDGMENTS

The ideas in this paper owe a great deal to many stimulating conversations between the author and Peter Deutsch. I am especially indebted to him for the idea that all strings in the input can be handled uniformly with string brackets. A system very similar to this one has been implemented by him for the CDC 3100.

REFERENCES

1. M. Halpern, "XPOP—A Metalanguage without Metaphysics," AFIPS Conf. Proc., vol. 25, 1964.

2. G. Mealy, "Anatomy of an Assembly System," RAND Corporation (Dec. 1962).

3. S. Boilen et al, "A Time-Sharing Debugging System for a Small Computer," AFIPS Conf. Proc., vol. 23, Spartan Books, Washington D.C., 1963, pp. 51-58.

4. L. P. Deutsch and B. W. Lampson, `DDT—TimeSharing Debugging System Reference Manual," Project GENIE Doc. 30.40.10 (May 1965).

5. "The MIDAS Assembly Program," internal memorandum, M.I.T.

6. Thomas G. Evans and D. L. Darby, "DEBUG‑