.\" expand -8 | grn | eqn | tbl | troff -ms
.nr LL 6.5i
.nr PO 1.0i
.nr PI 0.2i
.\" Get an -me horizontal line
.de hl
.br
\l'\\n(.lu-\\n(.iu'
.sp
..
.\" Two macros for code display --
.\" Use fixed width font so indents show up right,
.\" decrease point size and vertical spacing by two
.de LS
.br
.DS B \\$1
.ft C
.ps -2
.vs -2
..
.de LE
.vs +2
.ps +2
.ft P
.DE
.br
..
.\" Section headers with TOC generation
.\" Parameter 1 is the level, param 2 is the header string
.\" .XS and .XE are standard macros from "new" ms
.de hd
.NH \\$1
\\$2
.XS
\\*(SN \\$2
.XE
..
.EQ
delim $$
.EN
.ND " "
.\".RP
.ps +8
.vs +8
.\".na
.po +.25i
.ce 99
\fBPRESTO:
A System For Object-Oriented 
Parallel Programming\fP
.ps -8
.vs -8
.sp -2
.ce 0
.po -.25i
.TL
.AU
\fRBrian N. Bershad, Edward D. Lazowska, and Henry M. Levy\fP
.AI
Department of Computer Science
University of Washington
.sp 0.5
September 1987
Revised January 1988
.AB
PRESTO is a programming system for writing object-oriented
parallel programs in a multiprocessor environment.  
PRESTO provides the programmer with a set of pre-defined object
types
that simplify the construction of parallel programs.  Examples
of PRESTO objects are threads, which provide fine-grained
control over a program's execution, and synchronization
objects, which allow simultaneously executing threads to coordinate
their activities.
.PP
The goals of PRESTO are to provide a programming
environment that makes
it easy to express concurrent algorithms, to do so efficiently,
and to do so in a manner that invites extensions and modifications.
The first two goals, which are the focus of this
paper, allow a programmer
to use parallelism in a
way that is naturally suited to the problem at hand,
rather than being constrained by the limitations of a
particular underlying kernel or hardware architecture.
The third goal is touched upon but not emphasized in this paper.
.PP
PRESTO is written in C++; it currently runs on
the Sequent shared-memory multiprocessor on top
of the Dynix operating system.
In this paper we describe the system model, its applicability
to parallel
programming, experiences with the initial implementation,
and some early performance measurements.
.AE
.LP
\fBCR Categories and Subject Descriptors:\fR  
C.1.2 [Processor Architectures]:  Multiprocessors \- \fIparallel processors\fP; 
D.1.3 [Programming Techniques]: Concurrent Programming;
D.3.3 [Programming Languages]: Language Constructs \- \fIAbstract data types\fP;
D.4.1 [Operating Systems]: Process Management;
D.4.8 [Operating Systems]: Performance \- \fIMeasurements\fP
.LP
\fBGeneral Terms:\fR  Design, Languages, Measurement, Performance
.LP
\fBAdditional Key Words and Phrases:\fR  parallel computing, software
parallelism, speedup, efficiency
.sp
.ce 0
.sp
.FS
Our work is supported by the National Science
Foundation (Grants No. CCR-8619663, CCR-8700106,
and CCR-8703049), the Naval Ocean
Systems Center, U S WEST Advanced Technologies, the
Washington Technology Center, the USENIX Association,
and Digital Equipment Corporation (the
Systems Research Center and the External Research Program).
.FE
.NH 1
Introduction
.PP
PRESTO is a programming system for writing object-oriented
parallel programs in a multiprocessor environment.  PRESTO
consists of an object-oriented language (C++ \&[8]), \&
a library of basic tools constructed in this language,
a run-time system providing efficient support, and,
most important but least tangible, a programming
methodology.
.PP
Our first goal in designing and implementing PRESTO
was to apply our experiences in building
distributed object-oriented systems \&[1,\|5] \&
to the world of
multiprocessors.
In distributed systems, an object-oriented programming paradigm
makes it easier to think about and to express
concurrent algorithms.  Problem decomposition and run-time
synchronization details can be neatly described by an object model.
Each object is responsible for solving some small part of
an overall problem, and each is responsible for maintaining
its own (and only its own) internal consistency.  These are exactly
the qualities that are needed when building parallel applications
for a multiprocessor.
.PP
Our second goal
was to provide efficient concurrency and synchronization
mechanisms.
The primitives provided by
many existing parallel programming systems
are so expensive that they become
the major factor in determining the structure
of applications.
(For example, one may be forced to design
algorithms that use only as many threads
of control as there are physical processors.)\ 
Our experience (and common sense) suggested
that the construction of parallel applications \- a
daunting task under the best of circumstances \- was
made considerably more difficult by such constraints.
PRESTO allows the programmer to use parallelism in
the manner most natural to the problem at hand,
with minimal external performance constraints arising
from an underlying kernel or hardware architecture.
.PP
Our third goal was to provide
an "open" environment that could be used as a "toolkit"
for building efficient support for a variety of "models"
of parallel programming.
Most parallel programming systems present themselves in
terms of a fixed set of primitives running on top of
a closed run-time kernel.
The primitives together with the kernel define a "model"
of parallel programming that, while pleasing to the
implementor, may not always be satisfactory to the
application programmer.
In PRESTO, it is possible to redefine the behavior
of the lowest level system primitives.
New constructs (and thus new abstractions) for
constructing parallel applications can be introduced
quickly, and without the level of overhead normally
associated with a layered system.
.PP
This paper concentrates on describing PRESTO in terms
of the first two goals:  its object-orientation and
its performance, and the impact of these characteristics
on parallel programming.
The third goal, that of providing an open environment for
building parallel programming systems, is fully described
in a companion paper \&[2]. \&
The examples in this paper are drawn from what could be called the "default"
PRESTO programming model \- a Mesa-like environment where threads,
monitors, and condition variables define the basic primitives for
the programmer.
.PP
The remainder of this paper discusses the implementation language
for PRESTO (C++), the use of objects in building parallel programs,
a few of the default PRESTO objects, the use of PRESTO on
a multiprocessor, and
some preliminary performance measurements.
.NH 1
PRESTO and the C++ Programming Language
.PP
PRESTO is implemented in C++.
To quote from the C++ reference manual:
.QP
C++ is a superset of the C programming language that retains
the efficiency and notational convenience of C,
while providing facilities for type checking, data abstraction,
operator overloading and object-oriented programming.
.PP
We chose C++ for three reasons, each touched upon
in the above quotation.
First, we wanted an object-oriented programming language.
Second, C++ is implemented
as a preprocessor to C, making it portable to any
system with
a C compiler.  (Although PRESTO exists now on only one machine,
we intend to port it to other multiprocessors as they become
available to us.)\  
Third, we wanted PRESTO to be widely used, and
C++ is relatively easy to learn for C programmers.
.PP
Object-oriented programming in C++ is made possible
by the concept of a \fIclass\fP.  A class is a user-defined data type
allowing the programmer to specify an object in terms of
its data representation and operations.
As an example, the class definition for a simple
stack of integers might appear as
.LS
        // This is a comment.
        class Stack     {
                // Private Data 
                int             st_size;        // maximum size
                int             st_sp;          // current stack pointer
                int             *st_elements;   // data
                // Private Operations
                void            st_growstack(); // make the stack larger
        public:
                // Public Operations
                Stack(int maxsize)                      // CONSTRUCTOR
                        { st_size = maxsize;
                          st_sp = 0;
                          st_elements = new int[st_size];
                        }
                ~Stack()                                // DESTRUCTOR
                        { delete [] st_elements; }
                int             depth()                 // current depth
                        { return st_sp; }       
                virtual void    push(int newelement)
                        { if (st_sp == st_size)
                                st_growstack();         // ensure we have room
                          st_elements[st_sp++] = newelement;
                        }
                virtual int     pop()
                        { if (depth() == 0) {           // nothing left?
                                // ERROR HANDLER HERE
                                return 0;               // placeholder result
                          }
                          return st_elements[--st_sp];
                        }
        };
.LE
.LP
Each instance of a \fCStack\fP has its
own private version of the class's variables.  An object's
operations  are shared by all instances of the
object's class.
The declarations in the private section specify those parts of the class
accessible only from within the operations themselves.
If another object creates a stack,
.LS
        Stack *S = new Stack(100);      // Invoke the CONSTRUCTOR; maxsize = 100
.LE
then only the operations
.LS
        S->push(x);
        x = S->pop();
        sz = S->depth();
        delete S;               // Invoke the DESTRUCTOR
.LE
can be performed on \fCS\fP.  
\fCS\fP's private operations and private data can be referenced
only from within these public operations.  The process of performing
an operation is called an \fIinvocation\fP.
.PP
The keyword \fCvirtual\fP in the class definition indicates
that \fCpush()\fP and \fCpop()\fP
can be redefined by any classes that are
defined as a sub-class of \fCStack\fP.  In C++, a sub-class
can be derived from a super-class, allowing classes to exist
hierarchically. The sub-class's qualities are those
it defines for itself, as well as
those inherited from its super-class.  If a sub-class
redefines any of its inherited virtual operations, the
new definitions take precedence.
The sub-class satisfies the \fIisa\fP
relation on its super-class.  That is, if class \fCSynchronizedStack\fP
is a sub-class of class \fCStack\fP, 
.LS
        // Sub-class                    isa             Super-class
        class SynchronizedStack         : public        Stack   {
                //
                // Serialize access to the stack 
                //
        };
.LE
.LP
then an instance of class \fCSynchronizedStack\fP  
can be used anywhere an instance of class \fCStack\fP
is expected \- \fCSynchronizedStack\fP \fIisa\fP \fCStack\fR.
A synchronized stack object can ensure that concurrent
operations on the private data defined in its super-class are
serialized, without having to redefine the implementation of those
operations.  A more complete definition for 
\fCSynchronizedStack\fP appears
later in this paper.
.PP
An object's operations may also include \fIconstructor\fP
and \fIdestructor\fP routines.  These are
procedures that are called automatically
when the object is created and deleted respectively, allowing the object
to specify an initialization and termination sequence.
New objects can be created on the call-stack by declaring
them within a basic block, statically by declaring them
outside of any block, or on the heap, by using the \fCnew\fP
operator.
.PP
The definition for an operation can be either
included in the class definition itself
or given elsewhere.
To define the operation
\fCst_growstack()\fP separately from the declaration, 
it is necessary to use the qualifying syntax "\fC::\fP".
.LS
        //
        //      Class::Operation qualifies Operation under Class
        //
        void                            // return type
        Stack::st_growstack()           // Class::Operation
        {
                int j;
                int newsize = st_size * 2;              // double size
                int *newstack = new int[newsize];
                for (j = 0; j < st_size; j++)           // copy old stack
                        newstack[j] = st_elements[j];   // into new
                st_size = newsize;
                delete [] st_elements;
                st_elements = newstack;
        }
.LE
.LP
(These programming segments are meant only to provide enough exposure to
C++ so that the reader can understand the remaining examples
in this paper.  For complete information, see [8].)
.PP
C++ is an inherently sequential language.  Unlike languages such
as Emerald or Modula-2+ \&[6], \&
C++ has no notion of concurrency or synchronization.
.PP
It would have been possible to extend the language in this
direction by modifying the compiler, but we felt that the language
was sufficiently rich that our objectives
could be achieved without changing it.
Furthermore, including knowledge about concurrency
and synchronization in the compiler would have seriously
limited future extensions to PRESTO itself (unless
one were willing to again modify the compiler).
.PP
Available as part of the standard C++ distribution
is a set of classes for defining concurrent objects on a uniprocessor.
These objects execute as co-routines, and they are limited in terms
of how they can be used.  (For example,
objects can only be single-threaded and synchronize only with messages.)\ 
Making these objects work on
a multiprocessor would have been possible (and indeed has
been done elsewhere \&[4])
but would have precluded us from realizing the goals
of efficient primitives implemented on an open system.
We have used PRESTO to define classes that
efficiently mimic the behavior of those provided
as part of the C++ distribution, without assuming that this
behavior is the "PRESTO-definitive" mode of parallel programming.
.PP
Since PRESTO is written in C++, it is most naturally
used with applications written in that language.
Although it is possible to use the system from other languages
(such as C or Pascal), many of PRESTO's concepts
will be difficult and time-consuming to express.
For this
reason, users are encouraged either to write completely in 
C++, or to build application-specific interfaces between
languages.
.NH 1
Exploiting the Object Model in Parallel Programs
.PP
PRESTO provides the programmer with
several classes useful for writing parallel
programs.  These classes, and the environment in which
they execute, help support two of the major goals of
PRESTO \- efficient execution and comfortable abstractions for
expressing concurrency. 
.PP 
In PRESTO, all objects execute in a single address space
shared by all processors, allowing for fast inter-object
communication and synchronization through shared storage.
The object model allows objects to exist in a "safe" environment,
making it difficult (although not impossible) for objects to haphazardly
trounce one another.  
In a sequential object-oriented system, an object hides
its data and its implementation.  In PRESTO, an object
hides not only its data and its
implementation, but also its \fIexecution\fP.  That is, when
a caller invokes an operation on an object, the caller is unaware
whether that operation executes sequentially or in parallel.  The
implementor of an object
determines the extent of parallelism that is appropriate
to the object, much as he/she decides what data structures best
suit the needs of the object.  Dealing with concurrency in this
manner simplifies the task of writing parallel programs.
.PP
The following sections describe the major classes used by
PRESTO programs, and discuss how their design addresses
the goals of the system.
.NH 2
The Thread Class
.PP
Thread objects (threads)
are the building blocks of PRESTO parallel
programs.  As the basic unit of execution, threads conceptually
consist of a program counter and a stack of invocation records.
There are two essential operations
that can be performed on a thread.  A thread can be \fIcreated\fP,
allowing the creator to specify the thread's qualities,
such as its name and maximum storage requirements.  Once
created, a thread can be \fIstarted\fP executing some operation
of some object,  wherein it executes in parallel
with the starting thread.  \fIStart\fP, in fact, 
is an operation
defined for threads;  parameters include the object,
the operation 
where the thread is to be started,
and any parameters
expected by that operation.  For example,
.LS
        Stack *S = new Stack(100);
        // Create a new thread named "Pusher" having id TID.
        Thread *t = new Thread("Pusher", TID);

        // Let t be responsible for pushing 43 onto the stack.
        t->start(S, S->push, 43);       
.LE
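.PP
For readers more familiar with present-day standard C++, the same
pattern can be sketched with \fCstd::thread\fP.
This is only an analogue under stated assumptions, not the PRESTO interface:
\fCIntStack\fP and \fCpush_async()\fP are hypothetical names introduced
for illustration, and \fCstd::thread\fP stands in for the PRESTO
\fCThread\fP class.
.LS
        #include <thread>
        #include <vector>

        // Hypothetical stand-in for the Stack class shown earlier.
        class IntStack {
                std::vector<int> elems;
        public:
                void push(int x)        { elems.push_back(x); }
                int  depth() const      { return (int) elems.size(); }
        };

        // Start a new thread executing S->push(value), much as
        // t->start(S, S->push, 43) does in the PRESTO example.
        std::thread push_async(IntStack *S, int value)
        {
                return std::thread(&IntStack::push, S, value);
        }
.LE
.LP
A caller would keep the returned thread object and join on it
before inspecting the stack, since the push runs in parallel with the caller.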
.PP
As noted earlier, PRESTO extends conventional
object-oriented programming by allowing an object
to hide and control not only its data and its
implementation, but also its execution.
The user of an
object chooses between synchronous and asynchronous invocations,
and the implementor of an object chooses between
sequential and parallel execution.  
.\"An object can be passive,
.\"waiting for others to invoke it, it can be active,  animated
.\"by a single thread, or it can have internal parallelism with
.\"many threads executing within it.
Table 1
shows how these choices fit together, and how their combinations affect the 
overall execution of a program.
.sp
.TS
box, center, tab(:);
cfB cfB cfB 
cfB cfB cfB 
cp9 cp9 lp9w(3i).
User of Object:Implementor of Object:Effect
(\fIU\fP) Chooses:(\fII\fP) Chooses:\&
=
synchronous:sequential:T{
\fIU\fP invokes \fII\fP's operation.  
\fIU\fP blocks
until \fII\fP finishes. \fII\fP runs single-threaded
with \fIU\fP's thread.
T}
_
synchronous:parallel:T{
As above, only \fII\fP creates multiple
threads and starts them executing
its own operations.  These threads
compute in parallel.
T}
_
asynchronous:sequential:T{
\fIU\fP creates a new thread and starts it
executing \fII\fP's operation.  \fIU\fP
continues, and \fII\fP runs with the
single thread that \fIU\fP started within it.
When \fII\fP returns, its thread is destroyed.
T}
_
asynchronous:parallel:T{
\fIU\fP creates a new thread and starts it executing
\fII\fP's operation.  \fIU\fP continues, while \fII\fP
creates multiple threads and starts them executing
\fII\fP's operations.
T}
.TE
.sp
.LG
.ce
\fBTable 1 \- Control Over Execution\fP
.NL
.sp
.PP
An object
cannot tell whether it is being invoked synchronously or asynchronously,
and the user of an object cannot tell whether an invocation
is being performed sequentially or in parallel.
For example, the user of a matrix object probably is
not concerned with whether the object implements multiplication
by using hundreds of parallel threads or a single thread executing
over the whole problem.  Only the ultimate product is important.
It is the
responsibility of the matrix object to determine where, when,
and how much parallelism is dictated by a given invocation
of the multiply operation.
In cases such as this,
synchronous invocation with parallel execution
is most appropriate.  The following code segment shows
how a parallel matrix multiplication operation might be
defined in PRESTO:
.LS
        class Matrix    {
                int     **ma_elems;
                int     ma_rows;
                int     ma_cols;
                void    ma_dotproduct(int *el, Matrix *M, int i, int j)
                        {       // Compute the dot product of the i'th row
                                // of "this" (in ma_elems) and
                                // the j'th column of M.  Store in *el.
                        }
        public:
                Matrix(int rows, int columns);          // CONSTRUCT new matrix
                int     numcolumns();
                int     numrows();
                Matrix *multiply(Matrix *M);            // multiply "this" by M
                // Other operations...
        };

        // Create a separate thread to compute each element in the
        // product matrix in parallel.
        Matrix*
        Matrix::multiply(Matrix *M)
        {
                Matrix *P = new Matrix(ma_rows, M->numcolumns());       // product
                for (int i = 0; i < ma_rows; i++)
                        for (int j = 0; j < M->numcolumns(); j++)       {
                                Thread *t = new Thread("multiplier");
                                t->start(this, this->ma_dotproduct,
                                         &(P->ma_elems[i][j]), M, i, j);
                                         // arguments to operation
                        }
                //
                // Wait here until all threads terminate
                //
                return P;
        }
.LE
.PP
Alternatively, the insertion of a new entry into
a directory object is a non-parallel operation, best
handled asynchronously by the object doing the insertion, not by
the directory object itself.  The insert invocation might appear as:
.LS
        // Create the directory
        Directory *dir = new Directory("my_directory");

        // Create a thread to do the insertion
        Thread *t = new Thread("dir_inserter");

        // Start the thread doing the insertion of some file.
        t->start(dir, dir->insert, someFileName, someFileContents);

        // Run in parallel with the dir->insert routine.  
        // Its termination is transparent.
.LE
.PP
A thread may only be started once.  
It executes either until it terminates itself,
or until it returns
from the operation in which it
was started.  A thread may join on another thread, causing the joining
thread to block until the joinee finishes.  The joinee's return value
from the operation in which it started is returned to the joining
thread when it resumes.
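.PP
The step marked "Wait here until all threads terminate" in the matrix
example amounts to a join on every thread that was started.
A minimal sketch in present-day standard C++ follows, assuming
\fCstd::thread\fP as a stand-in for PRESTO threads;
\fCGrid\fP, \fCdotproduct()\fP, and \fCmultiply()\fP are hypothetical
names, not part of PRESTO.
Unlike a PRESTO join, \fCstd::thread::join()\fP returns no value, so
each thread stores its result through a pointer.
.LS
        #include <thread>
        #include <vector>

        typedef std::vector<std::vector<int> > Grid;    // hypothetical

        // Dot product of row i of A and column j of B, stored in *el.
        static void dotproduct(int *el, const Grid *A, const Grid *B,
                               int i, int j)
        {
                int sum = 0;
                for (int k = 0; k < (int) B->size(); k++)
                        sum += (*A)[i][k] * (*B)[k][j];
                *el = sum;
        }

        // One thread per product element; join on each before returning.
        // Assumes A and B are non-empty and conformable.
        Grid multiply(const Grid &A, const Grid &B)
        {
                int rows = (int) A.size();
                int cols = (int) B[0].size();
                Grid P(rows, std::vector<int>(cols, 0));
                std::vector<std::thread> workers;
                for (int i = 0; i < rows; i++)
                        for (int j = 0; j < cols; j++)
                                workers.push_back(std::thread(dotproduct,
                                        &P[i][j], &A, &B, i, j));
                for (std::thread &t : workers)
                        t.join();       // block until the joinee finishes
                return P;
        }
.LE
.LP
One thread per element is reasonable only when threads are cheap,
which is the property PRESTO aims to provide; with heavier threads one
would likely assign a row or a block of rows to each.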
.PP
Each thread has its own call-stack.  Any objects declared on that
call-stack are visible \fIonly\fP from
within the thread to which the call-stack belongs.
Objects requiring references
from more than one thread must be heap allocated or static.
If data is to be shared
between threads, then it should be declared as such within the object's
definition.
.NH 2
The Synchronization Class
.PP
Although a thread can be executing in only one object
at a time, it is possible to have multiple threads executing
within a single object simultaneously.  In a multiprocessor
system, true concurrency can occur.  To provide a controlled
environment for multi-threaded objects,  PRESTO provides two basic
classes of synchronization objects: relinquishing and non-relinquishing
locks.
.NH 3
Relinquishing Locks
.PP
A thread executes until it is preempted, terminates, or voluntarily
relinquishes the processor by performing an operation on a relinquishing
object.
The simplest relinquishing objects are those defined
by the class \fCLock\fP.   The two primary operations on locks are
\fClock()\fP and \fCunlock()\fP.  
.LS
        // l is a reference to a Lock (Lock* l)
        l->lock();
                // critical code
        l->unlock();
.LE
.LP
Return from a \fClock()\fP operation indicates that the caller
\fIholds\fP the lock.  A lock may be held by only
one thread at a time.  A thread trying to lock an already held
lock relinquishes the processor on which it is running,
allowing another ready thread to execute on that processor.
The relinquishing thread is made ready for execution when
the lock becomes free.
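.PP
As a rough present-day analogue, \fCstd::mutex\fP in standard C++
behaves much like a relinquishing lock: a thread that finds the mutex
held is typically descheduled rather than left spinning.
The sketch below is an illustration under that assumption, not PRESTO
code; \fClocked_count()\fP is a hypothetical helper.
.LS
        #include <mutex>
        #include <thread>
        #include <vector>

        // Each of nthreads threads adds 1 to a shared counter iters
        // times, holding the lock around every update.
        int locked_count(int nthreads, int iters)
        {
                int counter = 0;
                std::mutex l;           // stand-in for a PRESTO Lock
                std::vector<std::thread> workers;
                for (int t = 0; t < nthreads; t++)
                        workers.push_back(std::thread([&counter, &l, iters]() {
                                for (int i = 0; i < iters; i++) {
                                        l.lock();
                                        counter++;      // critical code
                                        l.unlock();
                                }
                        }));
                for (std::thread &w : workers)
                        w.join();
                return counter;
        }
.LE
.LP
Because only one thread holds the lock at a time, no increment is lost
and the result equals the product of the two arguments.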
.NH 3
Non-Relinquishing Locks
.PP
Hardware
atomic locks serve as the basis for the
non-relinquishing synchronization object \fISpinlock\fP.
Spinlocks have a potential performance advantage over simple 
relinquishing locks.  It is less expensive
to acquire and release a non-relinquishing lock.  Further,  if the 
average waiting time 
is less than the time to relinquish and reacquire
a processor, non-relinquishing
locks are more efficient.  
.PP
As with simple relinquishing locks,
there are two operations on spinlocks, \fClock()\fP
and \fCunlock()\fP.
.LS
        // s is a reference to a Spinlock (Spinlock *s)
        s->lock();              // spin here if already locked
                // critical code
        s->unlock();
.LE
The thread that most recently locked the spinlock is the lock's
owner.  A thread
relinquishes ownership by unlocking the spinlock.   
A spinlock can have only one owner, but
unlike relinquishing locks,
a thread trying to lock an owned spinlock consumes CPU cycles polling
the lock until it becomes free.
If a thread tries to lock a spinlock that it already owns,
the thread will spin forever.  The implementation of
spinlocks causes a thread to become non-preemptible once it acquires one.
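.PP
The polling behavior can be sketched in present-day standard C++ with
\fCstd::atomic_flag\fP, whose test-and-set operation plays the role of
the hardware atomic lock.
\fCSpinlockSketch\fP and \fCspin_count()\fP are hypothetical names; the
sketch does not model the non-preemptibility described above, and it is
not the PRESTO implementation.
.LS
        #include <atomic>
        #include <thread>
        #include <vector>

        class SpinlockSketch {
                std::atomic_flag held = ATOMIC_FLAG_INIT;
        public:
                void lock()
                        { while (held.test_and_set(std::memory_order_acquire))
                                ;       // spin here if already locked
                        }
                void unlock()
                        { held.clear(std::memory_order_release); }
        };

        // Each of nthreads threads increments a shared counter iters
        // times inside the spinlock.
        int spin_count(int nthreads, int iters)
        {
                int counter = 0;
                SpinlockSketch s;
                std::vector<std::thread> workers;
                for (int t = 0; t < nthreads; t++)
                        workers.push_back(std::thread([&counter, &s, iters]() {
                                for (int i = 0; i < iters; i++) {
                                        s.lock();
                                        counter++;      // critical code
                                        s.unlock();
                                }
                        }));
                for (std::thread &w : workers)
                        w.join();
                return counter;
        }
.LE
.LP
As in the text, a thread that calls \fClock()\fP twice without an
intervening \fCunlock()\fP would spin forever in this sketch.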
.NH 2
More Sophisticated Synchronization Classes
.NH 3
Monitors and Condition Variables
.PP
Although straightforward and easy to understand, simple relinquishing
locks can be difficult to use and thus prone to misuse.  A more
refined relinquishing synchronization mechanism is available
through monitors and condition variables\**.
.FS
Condition variables are really condition objects, but
the former terminology is 
well-established,
and is therefore retained.
.FE
These work together to provide Mesa-like
synchronization semantics \&[3,\|7,\|9].
As noted in the introduction, we view this
as the "default" PRESTO programming model,
but we encourage users to build (and share)
support in PRESTO for other parallel programming
models when this is dictated by their applications.
.PP
In PRESTO, a
section of critical code is surrounded by an \fCentry\fP and \fCexit\fP
invocation on a monitor object.  If several operations must
coexist within the same monitor,
the programmer is obligated to explicitly name the
monitor within each operation.
Although there is no direct compile-time support for monitors,
it is not necessary for the programmer to make explicit calls 
to a monitor's \fCentry\fP and \fCexit\fP routines when
writing a monitored object.  PRESTO provides a type \fCMONITOR\fP
that can be used to guard access to blocks of code.  The form
.LS
        { // example of monitored block of code controlled by Monitor *m;
          MONITOR ENTRY(m);     // ENTRY is a nice sounding dummy variable
                // .. code here 
         }      // m is automatically released when ENTRY goes out of scope here
.LE
.LP
is equivalent to
.LS
        {
          m->entry();
                // .. code here
          m->exit();
        }
.LE
but is syntactically cleaner.  Furthermore, the first form makes
it impossible for the programmer to forget to
release the monitor, \fIeven if\fP the code returns from
within the block.
The constructor for a \fCMONITOR\fP object enters the named
monitor, and remembers it.  When the \fCMONITOR\fP object goes
out of scope (at the bottom of a block), its destructor is automatically
called.  Within the destructor, the entered monitor is exited.
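.PP
The constructor and destructor pairing can be sketched as follows in
present-day standard C++.
\fCMonitorSketch\fP, \fCMonitorGuard\fP, and \fCfirst_negative()\fP are
hypothetical names, and a \fCstd::mutex\fP stands in for the monitor;
the actual definition of \fCMONITOR\fP in PRESTO may differ.
.LS
        #include <mutex>

        // Hypothetical stand-in for a PRESTO Monitor.
        class MonitorSketch {
                std::mutex mu;
                int exits;              // how many times exit() has run
        public:
                MonitorSketch() : exits(0) {}
                void entry()            { mu.lock(); }
                void exit()             { exits++; mu.unlock(); }
                int  exit_count()
                        { std::lock_guard<std::mutex> g(mu); return exits; }
        };

        // Guard object: the constructor enters, the destructor exits.
        class MonitorGuard {
                MonitorSketch *mon;
        public:
                explicit MonitorGuard(MonitorSketch *m) : mon(m)
                        { mon->entry(); }
                ~MonitorGuard()
                        { mon->exit(); }
                MonitorGuard(const MonitorGuard &) = delete;
                MonitorGuard &operator=(const MonitorGuard &) = delete;
        };

        // Returns from inside the guarded block; the monitor is
        // still released on every path.
        int first_negative(MonitorSketch *m, const int *v, int n)
        {
                MonitorGuard ENTRY(m);
                for (int i = 0; i < n; i++)
                        if (v[i] < 0)
                                return i;       // ENTRY destroyed here
                return -1;                      // ... and here
        }
.LE
.LP
Both return paths run the destructor of \fCENTRY\fP, so the monitor is
exited exactly once per call without any explicit call to \fCexit()\fP.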
.PP
Only one thread can be active
within a monitor at any one time.  That thread
is called the monitor's \fIowner\fP.  When a thread attempts
to enter a monitor that is already owned, the thread 
is blocked, relinquishing the processor on which
the thread is executing.  Eventually, the owner
will release the monitor, causing the least recently
executed thread waiting on the monitor
to be resumed in an attempt to become the owner.
.\" HEY, READ THIS!!
.\"It is
.\"not guaranteed that the least recently executed thread
.\"will actually acquire the monitor, as some intermediate thread
.\"may acquire the monitor during the period between it's release
.\"and the actual execution of the least recently executed thread.
.\"These semantics permit a more efficient implementation at the cost of
.\"fairness.  ACTUALLY, YOU CAN GET STARVATION DOING IT THIS WAY.
.\"I DON'T WORRY ABOUT IT TOO MUCH, SINCE THE UNDERLYING SPINLOCK MECHANISM
.\"ALSO PERMITS STARVATION (SPIN ON CACHE, CHECK BUS, WHOOPS, SPIN ON CACHE,
.\"ETC...).  Just thought you might be interested.
.PP
A thread may \fCwait\fP on a condition variable.
When a condition variable is created, it must be bound to
some monitor.  The condition variable should only be used
from within that monitor.  It is an error to do otherwise.
A thread waiting on a condition variable releases the associated
monitor as it blocks.
Another thread can \fCsignal\fP the condition
variable, causing
the condition variable's longest waiting thread to eventually
resume.  The signaller continues
to own the monitor until it waits  or exits, so a
signalled thread, since it must reacquire the monitor,
does not execute immediately upon being
signalled.  A signal must be regarded
merely as a hint that an acceptable state was reached at some
prior point.
When a waiting thread next runs,
it should check that the condition on which
it waited is still satisfied.
A thread may also \fCbroadcast\fP
on a condition variable, causing all threads waiting on that
condition variable to be signalled.  
.PP
The following example demonstrates monitored access to stacks.
The class \fCSynchronizedStack\fP is defined as a sub-class of
the \fCStack\fP class presented earlier.  Synchronized stacks have
all the characteristics of their super-class, but guarantee that
access to the stack is atomic by redefining \fCpop()\fP and
\fCpush()\fP to require the possession of
an exclusive private monitor.  Further,
a thread trying
to \fCpop()\fP from an empty stack 
blocks, and does not resume
until the stack becomes non-empty.
.LS
                // Sub-class            isa             Super-class
        class SynchronizedStack         : public        Stack   {
                // 
                // Our own private data for synchronizing.
                //
                Monitor *s_monitor;             // - for atomic access
                Condition *s_condition;         // - reads are blocking
        public:
                // Constructor to create a new stack
                SynchronizedStack(int sz)
                : Stack(sz)     // Call CONSTRUCTOR of Super-class
                        { 
                          s_monitor = new Monitor("StackMonitor");
                          s_condition = new Condition(s_monitor, "StackCondition");
                        }
                void push(int newitem)  
                        {
                          MONITOR ENTRY(s_monitor);
                                  // Qualify to the super-class operation
                                  Stack::push(newitem);
                                  if (depth() == 1)     {
                                          // Signal if any could be waiting
                                        s_condition->signal();
                                  }
                        }
                int pop()               
                        {
                          MONITOR ENTRY(s_monitor);
                          int topitem;
                                while (depth() == 0)    {
                                        s_condition->wait();
                                        // Consider the signal only as a hint.
                                        // Must check depth again.
                                }
                                topitem = Stack::pop(); 
                          return topitem;
                        }
        };
.LE
.LP
Using instances of this class, multiple threads can safely share
the same stack.  Furthermore, operations written
to operate on the super-class \fCStack\fP
can just as easily operate on a \fCSynchronizedStack\fP.
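.PP
The same monitor and condition variable discipline can be sketched in
present-day standard C++, assuming \fCstd::mutex\fP and
\fCstd::condition_variable\fP as stand-ins for the PRESTO
\fCMonitor\fP and \fCCondition\fP classes.
\fCBlockingStack\fP is a hypothetical name.
The sketch signals on every push rather than only when the depth
becomes one, a conservative choice that avoids leaving a second waiter
blocked when two pushes occur before any waiter runs.
.LS
        #include <condition_variable>
        #include <mutex>
        #include <vector>

        class BlockingStack {
                std::vector<int>        elems;
                std::mutex              mon;            // plays the monitor
                std::condition_variable nonempty;       // plays the condition
        public:
                void push(int x)
                {
                        std::lock_guard<std::mutex> entry(mon);
                        elems.push_back(x);
                        nonempty.notify_one();  // signal a waiter, if any
                }
                int pop()
                {
                        std::unique_lock<std::mutex> entry(mon);
                        while (elems.empty())   // the signal is only a hint
                                nonempty.wait(entry);   // releases mon while blocked
                        int top = elems.back();
                        elems.pop_back();
                        return top;
                }
        };
.LE
.LP
The \fCwhile\fP loop in \fCpop()\fP re-checks the condition after every
wakeup, matching the Mesa-like semantics described above.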
.NH 3
Atomic Integers
.PP
To address the common situation where one would 
like simply to update a counter or some other integral value
within an otherwise unsynchronized region of code,
PRESTO provides an atomic integer class.
The class \fCAtomicInt\fP guarantees multiple-reader, single-writer
semantics for integers by automatically enclosing their
reference within a spinlock's \fClock()\fP and \fCunlock()\fP. 
\fCAtomicInt\fP
supports the full complement of integer operations
(assignment, increment, decrement, etc.).  An 
\fCAtomicInt\fP can
be used anywhere an integer is expected.
.LS
        {
                AtomicInt       a;
                AtomicInt       b = 10;
                int             c = a;
                int             d;

                // w_lock -> write lock; w_unlock -> write unlock
                // r_lock -> read lock;  r_unlock -> read unlock

                d = a++;        // w_lock a, a++, d = a, w_unlock a

                b += c;         // w_lock b, increment b by c, w_unlock b

                c = a + b;      // r_lock a; a' = a; r_unlock a;
                                // r_lock b; b' = b; r_unlock b;
                                // c = a' + b'

                a = a + b;      // r_lock a; a' = a; r_unlock a;
                                // r_lock b; b' = b; r_unlock b;
                                // w_lock a; a = a' + b'; w_unlock a;
        }
.LE
.PP
Atomic integers are an example of how the object-oriented 
programming model meshes well with the needs of parallel programs.
Instances of class \fCAtomicInt\fP are responsible for ensuring
their own synchronization and providing their
own access semantics in a parallel environment.   Users
of the class are insulated from the details of its
implementation and are assured of its correct
operation.
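.PP
A self-synchronizing integer of this kind can be sketched in standard C++11.
This is not PRESTO's implementation: the class name \fCAtomicIntSketch\fP is
hypothetical, \fCstd::atomic_flag\fP stands in for the PRESTO spinlock, and for
brevity a single exclusive lock guards both reads and writes rather than
distinguishing readers from writers.
.LS
```cpp
#include <atomic>

// Hypothetical stand-in for AtomicInt: every access to the value is
// bracketed by a spinlock acquire and release.
class AtomicIntSketch {
public:
        AtomicIntSketch(int v = 0) : value(v) {}
        // Read: lets the object be used where an int is expected.
        operator int() { lock(); int v = value; unlock(); return v; }
        // Assignment and update hold the lock for the whole operation.
        AtomicIntSketch &operator=(int v) { lock(); value = v; unlock(); return *this; }
        AtomicIntSketch &operator+=(int v) { lock(); value += v; unlock(); return *this; }
        // Prefix increment returns the updated value.
        int operator++() { lock(); int v = ++value; unlock(); return v; }
private:
        void lock() { while (busy.test_and_set(std::memory_order_acquire)) { } }
        void unlock() { busy.clear(std::memory_order_release); }
        std::atomic_flag busy = ATOMIC_FLAG_INIT;
        int value;
};
```
.LE
.LP
As in the example above, an expression such as \fCa = a + b\fP takes the lock
three times (two reads and one write), so the expression as a whole is
not atomic even though each individual access is.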
.NH 1
System Architecture
.PP
PRESTO exists as a run-time library on a Sequent Balance 21000
shared-memory multiprocessor.
(PRESTO soon will also be operational
on the DEC SRC Firefly, an experimental
prototype multiprocessor workstation.)
.PP
The Sequent's operating system is Dynix, a
4.2BSD \&
.UX
lookalike with support for shared memory.  
Dynix \&[10] \&
provides support for writing parallel programs, but
this support is limited.
The Dynix
unit of execution is the UNIX process, an expensive
("heavyweight") and
inflexible entity.
Because the basic synchronization mechanisms are cumbersome
to use, a "parallel programming library" is provided.
This library restricts
the "threadedness" of a
parallel program to the number of physical processors
in the system, prohibiting the design of algorithms
that have hundreds (or even thousands) of independent
threads of execution.
Even if the parallel programming library were redesigned
to remove this restriction, the performance of
the basic system primitives would seriously limit the
ways in which parallelism could be used.
(We must emphasize that these limitations are \fInot\fR
unique to Dynix and the Sequent, and that on balance
we are delighted with the Sequent system.)
.PP
The basic role of the PRESTO run-time system is to map a user's threads
onto physical processors and to provide access to a global
shared memory in which all objects reside.
In the case of a system like Dynix,
PRESTO maps threads onto Dynix processes, relying on the Dynix kernel
to complete the mapping onto a physical processor.  Although two levels
of indirection are required, a Dynix process can be permanently
bound to a physical processor, so the second level of
indirection is done only once.
All details of the mappings are
invisible to the PRESTO programmer.
.PP
The architecture of
PRESTO adheres to the threaded object
model described earlier.  The system maintains a single
scheduler object.  The scheduler object keeps track of all
threads that are runnable but not yet running.  
A thread becomes
runnable when first started within an object,
or when resumed by a synchronization object after
blocking.  Each 
processor in the system is represented by a processor object.
There may be more processor objects than physical processors, although this is not
the intended configuration.  One scheduler thread runs within each
processor object,
and that thread's only activity
is to request runnable threads
from the scheduler object.  When a scheduler thread obtains a runnable
thread from the scheduler object, the scheduler thread stops, and the
processor on which the scheduler thread was running begins running
the now-runnable thread.  When the newly-running thread blocks or
terminates, the scheduler thread is resumed and continues to check
for more runnable threads.
.PP
Simultaneous requests 
to the scheduler object
from multiple scheduler threads
are synchronized so that no thread can be scheduled
on more than one processor at any instant.  However, a thread
may execute on different processors at different times.  Migration
occurs only if a thread is blocked and then resumed
at some later time when some other processor is idle.  
Scheduler threads are an exception to this \- they
\fInever\fP migrate.  A scheduler thread
runs only on the processor for which it is scheduling.  Figure 1
illustrates how the scheduler object, processor objects and 
physical CPUs are related.  Because these objects
interact with one another only through their operations,
each can be easily replaced or modified without affecting
the others.  For example, the scheduler object could
be changed to maintain
multiple priority queues for threads
rather than a single runnable queue.  Since scheduler
threads interact with the scheduler only through a \fCGetAThread()\fP
operation, they would remain
unaffected by the change.
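.PP
The replaceability argument can be sketched in standard C++.
Everything here other than the operation name \fCgetAThread()\fP (spelled
\fCGetAThread()\fP in PRESTO) is an assumption made for illustration:
threads are reduced to integer ids, the \fCmakeRunnable()\fP operation and the
class names are hypothetical, and the locking that serializes simultaneous
requests is omitted.
.LS
```cpp
#include <deque>
#include <queue>
#include <utility>
#include <vector>

// Threads are represented only by an integer id (an assumption).
// A negative id means that nothing is runnable.
class SchedulerSketch {
public:
        virtual ~SchedulerSketch() {}
        virtual void makeRunnable(int thread, int priority) = 0;
        virtual int getAThread() = 0;
};

// Single first-in, first-out runnable queue; priority is ignored.
class FifoScheduler : public SchedulerSketch {
public:
        void makeRunnable(int thread, int) override { ready.push_back(thread); }
        int getAThread() override {
                if (ready.empty()) return -1;
                int t = ready.front();
                ready.pop_front();
                return t;
        }
private:
        std::deque<int> ready;
};

// Replacement that hands out the highest-priority thread first.
class PriorityScheduler : public SchedulerSketch {
public:
        void makeRunnable(int thread, int priority) override {
                ready.push(std::make_pair(priority, thread));
        }
        int getAThread() override {
                if (ready.empty()) return -1;
                int t = ready.top().second;
                ready.pop();
                return t;
        }
private:
        std::priority_queue<std::pair<int, int> > ready;
};

// Stand-in for a scheduler thread's loop: drains the scheduler and
// records the order in which threads would have been run.
std::vector<int> drain(SchedulerSketch &s) {
        std::vector<int> order;
        for (int t = s.getAThread(); t >= 0; t = s.getAThread()) order.push_back(t);
        return order;
}
```
.LE
.LP
The \fCdrain()\fP loop is identical for both schedulers; only the order in
which thread ids come back differs.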
.KF
.hl
.GS
pointscale on
height 4
width 5
file scheduler.grn
.GE
.sp
.LG
.ce
\fBFigure 1 \- PRESTO Components\fP
.NL
.hl
.sp
.KE
.PP
The PRESTO scheduler eventually halts when there are
no longer any runnable or running threads.  At this point,
all existing synchronization objects are destroyed.  If
any one of them indicates a waiting thread, the system
declares deadlock and displays the state of all interminably
blocked threads.  Because a thread waiting on a spinlock is
still technically executing, this very simple
criterion for detecting deadlock fails if one or more threads
are waiting for a spinlock to become free.   More sophisticated
halting semantics have been implemented through a redefinition
of the scheduler.  For example, a message-based discrete
event simulation scheduler has been built that
resolves deadlock arising from circular message dependencies
when no threads are runnable.
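.PP
The simple halting criterion can be sketched as a scan over the
synchronization objects that remain when the scheduler stops.
The structure \fCSyncObjectSketch\fP and the function \fCblockedAtHalt()\fP
are hypothetical names, not part of PRESTO.
.LS
```cpp
#include <string>
#include <vector>

// Hypothetical view of a synchronization object at halt time.
struct SyncObjectSketch {
        std::string name;
        int waiters;            // threads still blocked on this object
};

// Called only when no thread is runnable or running.  Any thread that
// is still waiting can never be resumed, so its object is reported.
std::vector<std::string> blockedAtHalt(const std::vector<SyncObjectSketch> &objects) {
        std::vector<std::string> report;
        for (const SyncObjectSketch &o : objects)
                if (o.waiters > 0) report.push_back(o.name);
        return report;
}
```
.LE
.LP
A thread spinning on a spinlock never appears as a waiter in such a scan,
which is why the criterion misses that case.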
.NH 1
Program Structure
.PP
In PRESTO, a
user's parallel program consists of a set of class
definitions for objects used in the program, and a set of 
implementation routines that define the operations
for each class.  In addition, the programmer must provide
one operation for a system-defined class called \fCMain\fP.
.LS
        //
        // Programmer supplied 
        //
        int
        Main::main()
        {
                //
                // Called once the system has started.  There
                // will be at least one thread started in this 
                // routine.
                //
        }
.LE
.LP
The programmer links his code with the PRESTO library
and obtains an executable program.  The function
\fCmain()\fP required by the
.UX
loader is already provided by the library.  This routine
creates an object of class \fCMain\fP and starts at least
one thread in the operation \fCMain::main()\fP for that object.
.PP
The programmer may also provide two other 
operations, \fCMain::init()\fP and 
\fCMain::done()\fP. \fCMain::init()\fP, if provided, 
is called before the system begins executing; it
can be used
to override certain system default parameters such
as the number of processors to 
use.  \fCMain::done()\fP, if provided,
is called when there are no more 
runnable threads,  allowing
PRESTO programs to clean up after themselves.
Once running in \fCMain::main()\fP,  
the system is under the control
of the programmer. 
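.PP
The order of these calls can be sketched without the PRESTO library.
In the sketch below, \fConInit()\fP, \fConMain()\fP and \fConDone()\fP stand in for
\fCMain::init()\fP, \fCMain::main()\fP and \fCMain::done()\fP;
the class \fCMainSketch\fP, the function \fCrunProgram()\fP and the
\fCprocessors\fP field are hypothetical, and no threads are actually created.
.LS
```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the system-defined class Main.
class MainSketch {
public:
        std::vector<std::string> log;   // records the order of the calls
        int processors = 1;             // a default that onInit() may override
        void onInit() { processors = 4; log.push_back("init"); }
        int onMain() { log.push_back("main"); return 0; }
        void onDone() { log.push_back("done"); }
};

// Stand-in for the entry point that the library supplies to the loader.
int runProgram(MainSketch &program) {
        program.onInit();               // before the system begins executing
        int status = program.onMain();  // the first thread starts here
        program.onDone();               // no runnable threads remain
        return status;
}
```
.LE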
.NH 1
Some Early Performance Figures
.PP
This section presents early performance measurements for
PRESTO.  All figures represent measurements
taken from a Sequent Balance 21000 with ten NS32032 processors.
As a baseline, a processor can do a null procedure call
and return in approximately 15 $mu$secs, and can execute a single
iteration of a for-loop in 7 $mu$secs.  
.NH 2
Program Performance
.PP
Figure 2 illustrates the performance of PRESTO
when running a matrix multiplication
algorithm over varying numbers of
processors.  
The problem was decomposed equally among
as many threads as there were processors, and each thread
ran independently of the others.  The two optimal curves 
are based on the performance of the algorithm
when run with $n$ threads on $n$ processors.  These are
clearly best-case examples, since the scheduling and 
synchronization costs imposed by the algorithm are essentially
zero.  Nevertheless, the figure shows that an optimal breakdown
of an optimal problem can yield nearly optimal results under
PRESTO.  An implementation of the same algorithm directly on top
of Dynix performs identically.  This is not surprising since
the processor speed is the same, and neither PRESTO nor Dynix is
doing anything to assist (or hinder) the computation.
.PP
A difference does exist
between the optimal and measured curves,
and this same difference exists in
the Dynix
implementation of the same algorithm.  It is 
primarily attributable to the startup costs of the program
and of initializing the processors.  
The matrices to be multiplied must
first be initialized.  While data initialization could
be done in parallel, the program represented in Figure 2 doesn't
do so.  The data initialization costs are reflected only
in the measured curves, not in the optimal ones.  In addition
to program initialization, PRESTO itself must be initialized.
The time to initialize
and begin executing on nine processors is much greater than
for a single processor (the cost is
roughly 55 msecs per processor), but
the total lifetime of the computation is much shorter.
Consequently, the percentage of time doing work unrelated to
the multiplication
algorithm increases with the number of processors, and this
appears as a break from the optimal curve when the total computation
time gets very small.  For longer-running computations,
the initialization effects disappear.
.KF
.hl
.GS
pointscale on
height 4
width 5
file SPEEDUP.grn
.GE
.sp
.LG
.ce
\fBFigure 2 \- Matrix Multiplication Speedup\fP
.NL
.hl
.sp
.KE
.NH 2
The Cost of Threads
.PP
In Figure 2, the matrix multiplication algorithm
was designed so that the number of threads was
equal to the number of available processors.
This is in some sense "optimal" from a performance
point of view.
.PP
In the case of matrix multiplication, designing
an algorithm that is "parameterized" by the number
of processors is straightforward.
In other problem settings, though, there may be
a "natural" decomposition of the problem into
threads of control, and "warping" this decomposition
onto a specific number of processors may impose a
significant hardship.
As noted, a key goal of PRESTO is to decrease the
cost of parallelism so that the problem structure,
rather than the underlying system, can be
allowed to determine the way in which parallelism
is used.
.PP
To demonstrate that this key goal was achieved,
Figure 3 illustrates the performance of the PRESTO
matrix multiplication algorithm as the number of threads
is allowed to get very large (as many as $200$ threads working on
a $200 times 200$ matrix).  The figure demonstrates
that fine granularity using many threads can be
inexpensive with PRESTO \- an important advantage
over many existing parallel programming systems.
In PRESTO, the cost
of several hundred threads is not much more than
the cost of a few threads (except
for the small cost of first scheduling each thread).
Thus, in PRESTO, threads can be used as a "program
structuring" tool.
.KF
.hl
.GS
pointscale on
height 4
width 5
file manythreads.grn
.GE
.sp
.LG
.ce
\fBFigure 3 \- Matrix Multiplication Thread Decomposition (200x200)\fP
.NL
.hl
.KE
.PP
An interesting characteristic of Figure 3 is the small hump
at around 25 threads for several of the curves.  The hump
is due to a decomposition of the problem inappropriate to
the number of processors used.  Let $E$ be the execution time
of one thread on one processor, $n$ be the number of processors
available, and $T$ be the number of threads over which a 
computation is divided.   Since $E$ is the same for all
threads, execution essentially proceeds in lock-step.  At each step,
$n$ threads complete, so the computation finishes in $T / n$ steps,
.\".  One expression for
.\"the total computation time $t sub c$ might then be
.\".EQ 
.\"t sub c ~=~ {T over n} times E
.\".EN
.\".LP
except that
when $n$ is not a factor of $T$, it is not
possible to evenly distribute the final $T~ mod ~n$
threads over the $n$ available processors.  Consequently,
the total computation time, $t sub c$, is
.EQ 
t sub c ~=~ left ceiling {T over n} right ceiling times E
.EN
.LP
When $T$ is small and $E$ is large (as it is with only
25 threads working on a $200 times 200$ matrix), this
"tail effect" can be quite pronounced and
manifests itself as the hump seen
in Figure 3.  The hump is largest
when $T~ mod ~n$ is small, but non-zero.
For example, when $n$ is eight,  seven processors are
wasted during the final phase of the computation as
only one thread remains to execute.  With nine processors
though, seven threads execute during the final phase, leaving
only two processors idle.  As would be expected, 25 threads
on five processors produces no hump.
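.PP
The lock-step model can be checked with a few lines of C++.
The function names are hypothetical; the arithmetic follows the
expression for the total computation time given above.
.LS
```cpp
// Number of lock-step phases needed to run T equal threads on n
// processors: the ceiling of T / n, in integer arithmetic.
int phases(int T, int n) { return (T + n - 1) / n; }

// Threads that run in the final phase: n when n divides T evenly,
// otherwise the remainder T mod n.
int threadsInFinalPhase(int T, int n) {
        int last = T % n;
        return last == 0 ? n : last;
}

// Total time, ceiling(T / n) times E, in the same units as E.
double totalTime(int T, int n, double E) { return phases(T, n) * E; }
```
.LE
.LP
For 25 threads, \fCthreadsInFinalPhase()\fP returns 1 with eight processors,
7 with nine processors and 5 with five processors, matching the three
cases described above.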
.PP
This phenomenon of some processors idling while
others work is called \fIstarvation\fP.  When processors
execute in lock-step, the 
amount of wasted processing time due to starvation effects
is $E times n sub s$ where $n sub s$ is the number of
processors suffering from starvation.  
Even when processors execute asynchronously, unless
all threads terminate concurrently, there must come
a time near the end of a computation when there are fewer
runnable threads than processors.  Starvation can even become
a factor when the "optimal" number of threads
is used but some delay exists between the starting
time of the first thread and that of the last.
Clearly, the negative
impact of starvation diminishes with decreasing $E$.  For
fixed size problems, $E$ decreases with increasing $T$.
If the overhead due to a large $T$ can be 
offset by preventing the degrading effects of starvation, appreciable
performance benefits can be realized through very fine
grained decomposition.
PRESTO makes this possible.
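.PP
The effect of finer decomposition on starvation can be illustrated
numerically.
The sketch assumes lock-step execution and a fixed amount of work
\fCW\fP divided evenly among the threads, so that each thread runs for
\fCW / T\fP; per-thread overhead is ignored, and the function names are
hypothetical.
.LS
```cpp
// Processors with nothing to run in the final lock-step phase.
int starved(int T, int n) {
        int last = T % n;
        return last == 0 ? 0 : n - last;
}

// Wasted processor time, E times the number of starved processors,
// assuming E = W / T for a fixed total amount of work W.
double wastedTime(double W, int T, int n) {
        double E = W / T;
        return E * starved(T, n);
}
```
.LE
.LP
With eight processors and 200 units of work, 25 threads waste 56 units
of processor time, 100 threads waste 8, and 200 threads waste none.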
.NH 2
Towards Cheaper Threads
.PP
The construction of a thread's
call-stack is a significant contributor
to thread creation cost.
To mitigate this in PRESTO,
threads are reclaimed upon
termination for possible reuse.
When the programmer requests a \fCnew Thread\fP,
the system checks whether any reclaimed threads
are available.
If so, a new thread is not created; the reclaimed one
is reinitialized and returned to the programmer.  If not,
a thread template is created and marked as incomplete.
The thread template can be manipulated in the same ways as a complete
thread.  Eventually, the thread 
template will attempt to execute for the first time.
When this happens, the reclaim pool is checked again.  Only if it
is empty the second time is an entirely new thread created and 
initialized with the values stored in the template.
.PP
This aggressive design, which is totally
transparent to the programmer, significantly reduces
the cost of threads in situations where
a number of threads are started
simultaneously and run to termination without blocking:  only
as many call-stacks will be allocated as there are processors.
A peculiar side-effect is that it is difficult to talk in
a meaningful way about the "cost of thread creation" in PRESTO,
since this cost depends upon the style of use.
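.PP
The reclamation scheme can be modelled in a few lines of C++.
This is a single-threaded sketch under stated assumptions, not PRESTO's code:
a call-stack is reduced to an integer id, the pool is not locked, and the
names \fCThreadSketch\fP, \fCReclaimPoolSketch\fP and \fCstacksNeeded()\fP are
hypothetical.
.LS
```cpp
#include <vector>

// A stack is just an id here; allocating one stands in for the
// expensive call-stack construction.
struct ThreadSketch {
        int stack;      // -1 while the thread is only a template
};

class ReclaimPoolSketch {
public:
        int stacksAllocated = 0;

        // new Thread: reuse a reclaimed stack, else hand back a template.
        ThreadSketch create() {
                ThreadSketch t;
                t.stack = takeReclaimed();
                return t;
        }
        // First execution: check the pool again before allocating.
        void firstRun(ThreadSketch &t) {
                if (t.stack < 0) t.stack = takeReclaimed();
                if (t.stack < 0) t.stack = stacksAllocated++;
        }
        // Termination: the stack goes back to the pool for reuse.
        void terminate(ThreadSketch &t) {
                if (t.stack >= 0) reclaimed.push_back(t.stack);
                t.stack = -1;
        }
private:
        int takeReclaimed() {
                if (reclaimed.empty()) return -1;
                int s = reclaimed.back();
                reclaimed.pop_back();
                return s;
        }
        std::vector<int> reclaimed;
};

// Create all threads at once, then run them to completion in groups of
// `processors` without blocking; returns the number of stacks allocated.
int stacksNeeded(int threads, int processors) {
        ReclaimPoolSketch pool;
        std::vector<ThreadSketch> all;
        for (int i = 0; i < threads; i++) all.push_back(pool.create());
        for (int i = 0; i < threads; i += processors) {
                int end = i + processors < threads ? i + processors : threads;
                for (int j = i; j < end; j++) pool.firstRun(all[j]);
                for (int j = i; j < end; j++) pool.terminate(all[j]);
        }
        return pool.stacksAllocated;
}
```
.LE
.LP
In this model, starting 100 threads on eight processors allocates
eight stacks; every later thread picks up a reclaimed one when it
first runs.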
.PP
Table 3 shows the time to
create and destroy PRESTO threads when every thread can be reclaimed.
The table demonstrates two points.  First, thread creation is 
relatively inexpensive.  For the single processor case, the average
time to create and destroy a thread is about 440 $mu$secs.
Second, the rate at which threads can be created,
while not linear with the number of processors, is much better
than constant.  The non-linearity arises because
all threads originate from and return
to a common pool, and access to this pool can be a bottleneck as
locking must occur.  In practice, though, bursty periods of thread
creation are usually the result of a single thread coordinating the
activities of the new threads, so high contention is unlikely.
.sp
.TS
center, box, tab(:);
cfB | cfB | cfB
c | cfB | c  
n | n | n.
Processors:Threads Created:Elapsed Time
\&:and Destroyed:(secs)
=
1:100,000:44.0
2:100,000:28.2
3:100,000:21.6
4:100,000:18.1
5:100,000:15.8
6:100,000:14.0
7:100,000:12.8
8:100,000:11.8
.TE
.sp 
.ce
.LG
\fBTable 3 \- Thread Creation and Destruction Costs\fP
.NL
.sp
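.PP
The per-thread cost and the speedup implied by Table 3 follow from simple
arithmetic, sketched here with hypothetical function names.
.LS
```cpp
// Average create-and-destroy time per thread, in microseconds.
double microsecondsPerThread(double elapsedSeconds, long threads) {
        return elapsedSeconds * 1e6 / threads;
}

// Ratio of the single-processor elapsed time to the n-processor one.
double speedup(double oneProcessorSeconds, double nProcessorSeconds) {
        return oneProcessorSeconds / nProcessorSeconds;
}
```
.LE
.LP
For the single-processor row, 44.0 seconds over 100,000 threads is
440 $mu$secs per thread.
For the eight-processor row, the speedup is 44.0 / 11.8, or about 3.7.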
.NH 2
The Cost of Synchronization
.PP
There are two types of synchronization costs in a 
parallel program:  non-competitive and competitive.  
A non-competitive cost is incurred whenever a thread
accesses a synchronization object and is able to immediately
become its owner.  The programmer pays the competitive cost
whenever a thread must wait for some other thread to
relinquish ownership, or when the thread blocks on a 
condition variable.  The competitive cost always includes,
and is therefore greater than, the non-competitive cost.
Table 4 shows these costs for four different synchronization operations.
Only the overhead involved in actually blocking and unblocking a thread
is reflected in these figures.  
.PP
The top half of the table represents non-competitive synchronization
overhead, and the bottom, competitive.  \fILock_test\fP,
\fImonitor_test\fP, \fIspin_test\fP,
and \fIatom_test\fP
show the times required for threads to
acquire and release a simple relinquishing lock,
monitor, spinlock and atomic
integer, respectively.  The 21 $mu$secs required to use a spinlock
is due to the slowness of the hardware atomic locks available on the
Sequent.  Since spinlocks serve as the basis for all other
synchronization primitives, their lackluster performance
negatively influences the other timings.
When \fIatom_test\fP is run with
two processors, the elapsed time increases slightly over the
single processor case.  This is due to an optimization for spinlocks
(on which atomic integers are based) biased towards non-competitive
acquisition.   When the optimization is removed, one processor
performs no better than two.  
.PP
The time required for two threads to switch back and
forth on a condition variable is shown for both one and
two processors in \fIswitch_test\fP.  
Each thread enters the monitor, signals
a condition variable, and then waits on that condition variable,
relinquishing the monitor (and the processor on which it is
running).  Although only one thread can be
active in the monitor at any instant, two processors
perform substantially better than one.  The reason is that the
context switch times for the waiting and signalled threads can
be overlapped so that one thread can be switched in while
the other is being switched out.  With a single processor,
this isn't possible.  
.sp
.TS
center, box, tab(:);
cfB | cfB | cfB | cfB | cfB |  cfB | cfB
c |c | c | c | c |  c | c
cfB | cfI p8 | n | n | n | n | n.
\&:Benchmark:Processors:Threads:Iterations:Elapsed Time:Average Time
\&:\&:\&:\&:\&:(secs):($mu$secs)
=
\&:lock_test:1:1:1,000,000:94.9:94.9
\&:monitor_test:1:1:1,000,000:153:153
Non-Competitive:spin_test:1:1:1,000,000:21.1:21.1
\&:atom_test:1:1:1,000,000:25.7:25.7
=
\&:lock_test:2:2:1,000,000:158.1:158.1
\&:monitor_test:2:2:1,000,000:238.3:238.3
Competitive:switch_test:1:2:100,000:123.9:1239
\&:switch_test:2:2:100,000:73.1:731
\&:atom_test:2:2:1,000,000:27.7:27.7
.TE
.sp 
.LG
.ce 1
\fBTable 4 \- Synchronization Costs\fP
.NL
.sp
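.PP
The exchange timed by \fIswitch_test\fP can be approximated with standard
C++11 threads.
This is only a sketch of the pattern, not the benchmark itself:
\fCstd::mutex\fP and \fCstd::condition_variable\fP stand in for a PRESTO
monitor and condition variable, an explicit \fCturn\fP variable is added so
that each wakeup can be rechecked, and the name \fCpingPong()\fP is
hypothetical.
.LS
```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Two threads hand a turn back and forth inside one monitor.
// Returns the total number of handoffs performed.
int pingPong(int rounds) {
        std::mutex monitor;
        std::condition_variable cond;
        int turn = 0;           // id of the thread allowed to proceed
        int switches = 0;       // protected by the monitor

        auto player = [&](int me) {
                std::unique_lock<std::mutex> entry(monitor);
                for (int i = 0; i < rounds; i++) {
                        // Waiting releases the monitor until signalled.
                        while (turn != me) cond.wait(entry);
                        turn = 1 - me;
                        switches++;
                        cond.notify_one();
                }
        };

        std::thread a(player, 0), b(player, 1);
        a.join();
        b.join();
        return switches;
}
```
.LE
.LP
Each thread takes \fCrounds\fP turns, so the function returns twice that
number; timing a call and dividing by the result gives an average
switch cost comparable in spirit to the figures in Table 4.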
.NH 1
Conclusions
.PP
PRESTO is both a production system for use in 
writing everyday parallel programs and a flexible
research tool with which various scheduling,
synchronization and granularity issues can be
explored.
.PP
The former goal is met by joining
the classical notions of concurrent programming
with the powerful concepts of object-oriented design.
Objects can be made completely responsible for their own
execution, as well as modification and presentation.
This relieves the user of an object from concern 
about potential misuse
in a parallel environment.  
By exploiting the class mechanism of 
the language, programmers can derive parallel, 
inherently safe
objects (such as a synchronized stack)
from simpler, well-understood, sequential versions.
A key point, emphasized in this paper, is that
the performance of PRESTO's primitives
is sufficiently good that the "natural"
decompositions of problems, rather than artificial
constraints imposed by the system, can be the
determining factor in the structure of parallel
algorithms.
.PP
The utility of PRESTO as a research vehicle arises
from its underlying structure.
A system component (the scheduler, a processor,
even a thread) can be redefined through inheritance without
affecting the other components.
.PP
PRESTO is not a toy.  It is the current
system of choice for parallel programming
at the University of Washington.
Certain applications have been built on top
of the "default" Mesa-like environment; for
example, a parallel solution package for
queueing network performance models.
Other applications have customized certain
aspects of PRESTO, taking advantage of its
"open" design; a parallel Othello program
involves a new PRESTO scheduler; an
instrumentation package for parallel programs
involves an extension of threads to include
monitoring capabilities.
.\"Some users have built support for
.\"substantially different "models" of
.\"parallel programming using the PRESTO
.\""toolkit," inlcuding a parallel discrete event
.\"simulation mentioned earlier in this paper.
.SH
Acknowledgements
.PP
We'd like to thank Kenneth Almquist, Tom Anderson, Jeff
Chase, and David Wagner for their user-view feedback
on the design and implementation of PRESTO.
They, along with Ellen Ratajak, Doug Comer, and
the referees, provided many helpful comments
concerning this paper.
.sp 2
.]<
.\"Almes.G.T.-Black.A.P.-Lazowska.E.D.-Eden-System:-A-Techn-1
.ds [F 1
.]-
.ds [A G.T. Almes
.as [A ", A.P. Black
.as [A ", and E.D. Lazowska
.ds [T The Eden System: A Technical Review
.ds [J IEEE Transactions On Software Engineering
.ds [V SE-11
.ds [P 43-58
.nr [P 1
.ds [D January 1985
.nr [T 0
.nr [A 0
.nr [O 0
.][ 1 journal-article
.\"Bershad.B.N.-Lazowska.E.D.-Wagner.D.-Levy.H.M.-Open-Environment-for-2
.ds [F 2
.]-
.ds [A B.N. Bershad
.as [A ", E.D. Lazowska
.as [A ", D. Wagner
.as [A ", and H.M. Levy
.ds [T An Open Environment for Building Parallel Programming Systems
.ds [I Department of Computer Science, University of Washington
.ds [R Technical Report 88-01-03 (submitted for publication)
.ds [D January 1988
.ds [K parallel programming environment, kernel, presto2
.nr [T 0
.nr [A 0
.nr [O 0
.][ 4 tech-report
.\"Hoare.C.A.R.-Monitors:-An-Operati-3
.ds [F 3
.]-
.ds [A C.A.R. Hoare
.ds [J Communications of the ACM
.ds [T Monitors: An Operating System Structuring Concept
.ds [I ACM
.ds [V 17
.ds [N 10
.ds [P 549-557
.nr [P 1
.ds [D October 1974
.ds [K hoare monitors, synchronization, concurrent programming
.nr [T 0
.nr [A 0
.nr [O 0
.][ 1 journal-article
.\"Jr..T.W.D-Gebele.A.J.-C++-on-a-Parallel-Ma-4
.ds [F 4
.]-
.ds [A T.W. Doeppner Jr.
.as [A " and A.J. Gebele
.ds [T C++ on a Parallel Machine
.ds [I Department of Computer Science, Brown University
.ds [R Report CS-87-26
.ds [D November 1987
.nr [T 0
.nr [A 0
.nr [O 0
.][ 4 tech-report
.\"Jul.E.-Levy.H.-Hutchinson.N.-Black.A.-Fine-Grained-Mobilit-5
.ds [F 5
.]-
.ds [A E. Jul
.as [A ", H. Levy
.as [A ", N. Hutchinson
.as [A ", and A. Black
.ds [T Fine-Grained Mobility in the Emerald System
.ds [J ACM Transactions on Computer Systems
.ds [D to appear
.ds [O Originally presented at the
.as [O " \f2Eleventh ACM Symposium on Operating Systems Principles\fP,
.as [O " Austin, TX,
.as [O " 8-11 November 1987
.nr [T 0
.nr [A 0
.nr [O 0
.][ 1 journal-article
.\"Modula2+-Reference-M-6
.ds [F 6
.]-
.ds [T Modula2+ Reference Manual
.ds [I Digital Equipment Corporation
.ds [D April 1986
.ds [K modula2+
.nr [T 0
.nr [A 0
.nr [O 0
.][ 2 book
.\"Redell.B.W.L.D.D.-Experiences-with-Pro-7
.ds [F 7
.]-
.ds [A B.W. Lampson
.as [A " and D.D. Redell
.ds [T Experience with Processes and Monitors in Mesa
.ds [J Communications of the ACM
.ds [I ACM
.ds [V 23
.ds [N 2
.ds [P 105-117
.nr [P 1
.ds [D February 1980
.ds [K mesa, monitors, concurrent programming, experiences with mesa
.nr [T 0
.nr [A 0
.nr [O 0
.][ 1 journal-article
.\"Stroustrup.B.-C++-Programming-Lang-8
.ds [F 8
.]-
.ds [A B. Stroustrup
.ds [T The C++ Programming Language
.ds [I Addison-Wesley
.ds [D March 1986
.ds [K c++ cplusplus
.nr [T 0
.nr [A 0
.nr [O 0
.][ 2 book
.\"Sweet.J.G.M.W.M.R.-Mesa-Language-Manual-9
.ds [F 9
.]-
.ds [A J.G. Mitchell
.as [A ", W. Maybury
.as [A ", and R. Sweet
.ds [T Mesa Language Manual
.ds [R Technical Report CSL-79-3
.ds [I Xerox Palo Alto Research Center
.ds [D April 1979
.ds [K monitors, concurrent programming, synchronization
.nr [T 0
.nr [A 0
.nr [O 0
.][ 4 tech-report
.\"Thakkar.S.S.-Gifford.P.-Fielland.G.-Balance:-A-Shared-Me-10
.ds [F 10
.]-
.ds [A S.S. Thakkar
.as [A ", P. Gifford
.as [A ", and G. Fielland
.ds [T Balance: A Shared Memory Multiprocessor
.ds [J Proceedings, 2nd International Conference on Supercomputing
.ds [C Santa Clara
.ds [D May 1987
.ds [K dynix
.nr [T 0
.nr [A 0
.nr [O 0
.][ 1 journal-article
.]>
