THE 801 MINICOMPUTER

advertisement
THE 801 MINICOMPUTER
George Radin, IBM Fellow
IBM Thomas J. Watson Research Center
P. O. Box 218
Yorktown Heights, New York 10598
Abstract
But it is just this pernicious effect on the primitive
instructions that has made us suspicious. Most instruction
frequency-studies show a sharp skew in favor of high usage of
primitive instructions (such as Load, Store, Branch, Compare,
Add). If the presence of a more complex set adds just one logic
level to a 10 level basic machine cycle (e.g. to fetch a micro
• instruction from ROS), the CPU has been slowed down by
10%. The frequency and performance improvement of the
complex functions must first overcome this 10% degradation,
and then justify the additional cost. If the presence of complex
functions results in the CPU exceeding a packaging constraint
on some level (e.g. a chip, a-board), the performance degradation can be even more substantial.
This paper provides an overview of an experimental
system developed at the IBM T. J. Watson Research Center. It
consists of a running hardware prototype, a control program and
an optimizing compiler. The basic concepts underlying the
system are discussed as are the performance characteristics of
the prototype. In particular, three principles are examined:
- system orientation towards the pervasive use of high
level language programming and a sophisticated compiler,
- a primitive instruction set which can be completely
hard-wired,
- storage hierarchy and I / O organization to enable the
CPU to execute an instruction at almost every cycle.
Often, however, a minicomputer that boasts of a rich set
of complex instructions has not spent additional hardware at all,
but has simply microprogrammed the functions. These microinstructions are designed to execute in a single cycle and, in that
cycle, to set controls most useful for the functions desired. This
however, is exactly the design goal of the 801 primitive instruction set. We question, therefore, the need for a separate set of
instructions.
Introduction
In October, 1975, a group of about twenty researchers
at the IBM T. J. Watson Research Center began the design of
a minicomputer, a compiler, and a control program whose goal
was to achieve significantly better cost/performance for high
level language programs than that attainable by existing systems. The name 801 was chosen because it was the IBM number of the building in which the project resided• (The twenty
creative researchers were singularly uninspired namers.)
In fact, for "vertical microcode", the benefits claimed
are generally not due to the power of the instructions as much
as to their residence in a high-speed control store.
This
amounts to a hardware architect attempting to guess which
subroutines, or macros, are most frequently used and assigning
high speed memory to them. It has resulted, for instance, in
functions like Extended-Precision Floating Point Divide and
Translate-and-Test on S / 370' s residing in high speed storage,
while procedure prologues and the First-Level-InterruptHandler are in main storage. The 801 CPU gets its instructions
from an "instruction cache" which is managed by LRU information. Thus all frequently used functions are very likely to be
found in this high-speed storage, exhibiting the performance
characteristics of vertical microcode.
in addition to a running research prototype the project
resulted in an understanding of many design mistakes and thus
has spawned a second generation research activity which is
currently being pursued. This paper is a description of the basic
design principles and the resulting system components
(hardware and software).
Basic Concepts
Single Cycle Implementation
Probably the major distinguishing characteristic of the
801 architecture is that its instructions are constrained to execute in a single, straightforward, rather primitive machine cycle.
A similar general approach has been pursued by a group at the
University of California [1].
Programming complex functions as software procedures
or macros rather than in microcode has three advantages:
First, the CPU is interruptible at "microcode" boundaries, hence more responsive. Architectures with complex instructions either restrict interrupts to instruction boundaries, or
(as in, for instance, the Move Characters Long instruction on
the S/370) define specific interruptible points. If the instruction must be atomic, the implementation must ensure that it can
successfully complete before any observable state is saved.
Thus, in the S / 370 Move Character instruction, before starting
the move all pages are pretouched (and locked, in an MP system) to guard against a page fault interrupt occurring after the
move has begun. If interruptible points are architected, the
state must be such that the instruction is restartable.
Complex, high-function instructions, which require
several cycles to execute, are conventionally realized by some
combination of random logic and microcode. It is often true
that implementing a complex function in random logic will result
in its execution being significantly faster than if the function
were programmed as a sequence of primitive instructions. Examples are floating point arithmetic and fixed point multiply. We
have no objection to this strategy, provided the frequency of
use justifies the cost and, more important, provided these complex instructions in no way slow down the primitive instructions.
Permission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct
commercial advantage, the ACM copyright notice and the title of the
publication and its date appear, and notice is given that copying is by
permission of the Association for Computing Machinery. To copy
otherwise, or to republish, requires a fee and/or specific permission.
@
1982 A C M
0-89791-066-4 82/03/0039
$00.75
39
Most c o n v e n t i o n a l architectures m a k e this decision
difficult because every store of data can be a~modification of an
instruction, perhaps even the one following the store. Thus the
hardware must ensure that the two caches are properly synchronized, a job that is either expensive or degrading, or both.
E v e n instruction prefetch m e c h a n i s m s are complex since the
effective address of a store must be compared to the Instruction
Address Register.
The second advantage of programming these functions
is that an optimizing compiler can often separate their components, moving some parts out of a loop, commoning others, etc.
Thirdly, it is often possible for parts of a complex instruction to be computed at compile time. Consider, for instance, the S / 3 7 0 Move Character instruction once again. Each
execution of this instruction must determine the optimal move
strategy by e x a m i n i n g the lengths of the source a n d - t a r g e t
strings, whether (and in what direction) they overlap, and what
their alignment characteristics are. But, for most programming
languages, these may all be k n o w n at compile time. Consider
also a multiply instruction. If one of the operands is a constant,
known at compile time, the compiler, can often produce more
efficient " s h i f t / a d d " sequences than the general multiply microcode subroutine.
Historically, as soon as index registers were introduced
into computers the frequency of instruction modification fell
dramatically until, today, instructions are almost never modified.
Therefore the 801 architecture does not require this hazard
detection. Instead it exposes the existence of the split-cache t9
software and provides instructions by which software can synchronize the caches when required. In our system the only
program that modifies instructions is the one that loads programs into memory.
The major disadvantage to using procedures instead of
microcode to impldment complex functions occurs wheri the
micro instruction set is defined to permit its o p e r a n d s to be
indirectly named by the register name fields in the instruction
which i s being interpreted. Since, in the 801 and in most conventional architectures, the register numbers are bound into the
instructions a compiler must adopt some specific register usage
convention for the procedure operands, and move the operands
to these registers when necessary.
Similarly, in conventional systems in which the existence
of a cache is unobservable to the software, I / O must (logically)
go through the cache. This is often accomplished in less expensive systems by sending the I / O physically through the cache.
The result is that the CPU is idle while the I / O proceeds, and
that after an I / O burst the c o n t e n t s of the cache no longer
reflect the working set of the process being executed, forcing it
back into transient mode. Even in more expensive systems a
b r o a d c a s t i n g or d i r e c t o r y - d u p l i c a t i o n s t r a t e g y may result in
some performance degradation.
A c o m p u t e r whose instructions all execute very efficiently, however, is attractive only if the n u m b e r of such instructions required to perform a task is not c o m m e n s u r a t e l y
larger than that required of a more complex instruction set. The
801 project was concerned only with the execution of programs
compiled by our optimizing compiler. Therefore, within the
constraints of a primitive data flow, we left the actual definition
of the instructions to the compiler writers. The results will be
discussed later, b u t g e n e r a l l y show pathlengths (that is, n u m b e r
of instructions executed) about equivalent to those on a S / 3 7 0
for systems code, and up to 5 0 % longer for commercial and
scientific applications (given no hardware floating point).
We observed that responsibility for the initiation of I / O
in current systems was evolving t o w a r d s paging supervisors,
system I / O managers using fixed block transfers, and, for low
speed devices, a buffer s t r a t e g y which moves data b e t w e e n
subsystem buffers and user areas. This results in the I / O manager knowing the location and extent of the storage being accessed, and k n o w i n g when an 1 / O transfer is in process. Thus
this software can properly synchronize the caches, and the I / O
hardware can transmit directly to and from the backing store.
The result of this system approach in our prototype is that even
when half of the memory bandwidth is being used for I / O the
CPU is virtually undegraded.
Overlapped Storage Access
801 instruction mixes show that about 3 0 % of instructions go to storage to send or receive data, and between 1 0 %
and 2 0 % of instructions are taken branches. Moreover, for
many applications, a significant portion of the memory bandwidth is used by I / O . If the CPU is forced to wait many cycles
for storage access its internal performance will be wasted.
Notice that in the p r e c e d i n g discussions (and in the
earlier discussion of complex instructions) an underlying strategy is being pervasively applied. Namely, wherever there is a
system function that is expensive or slow in all its generality,
but where software can recognize a frequently occurring degenerate case (or can move the entire function from run time to
compile time) that function is moved from hardware to software, resulting in lower cost and improved performance.
The second major design goal of the 801 project, therefore, was to organize the storage hierarchy and develop a system architecture to minimize C P U idle time due to storage
access. First, it was clear that a cache was required whose
access time was consistent with the machine cycle of the CPU.
Secondly we chose a " s t o r e - i n - c a c h e " s t r a t e g y (instead of
"storing t h r o u g h " to the b a c k i n g store) so that the 1 0 % of
expected store instructions would not degrade the performance
severely. (For instance, if the time to store a word through to
the backing store is ten cycles, and 1 0 % of instructions are
stores, this will add up to one cycle to each instruction on average depending on the a m o u n t of execution overlap.)
An interesting example of the application of this strategy concerns managing the cache itself. In the 801 the cache
line is 32 bytes and the largest unit of a store is four bytes. In
such a cache, whose line size is larger than the unit of a store
and in which a "store in cache" approach is taken, a store directed at a word which is not in the cache must initiate a fetch
of the entire line from the backing store into the cache. This is
because, as far as the cache can tell, a load of another word
from this line might be requested subsequently. Frequently,
however, the store is simply the first store into what to the
program is newly acquired space. It could be a new activation
on a process stack just pushed on procedure call (e.g. P L / I
Automatic); it could be an area obtained by a request to the
operating system; or it could be a register save area used by the
First Level Interrupt Handler. In all of these cases the hardware does not know that no old values from that line will be
needed, while to the compiler and supervisor this situation is
quite clear. We have defined explicit instructions in the 801 for
But a CPU organization that needs a new instruction at
every cycle as well as accessing data every third cycle will still
be degraded by a single conventional cache that delivers a word
every cycle. Thus we decided to split the cache into a part
containing data and a part containing instructions. In this way
we effectively doubled the bandwidth to the cache and allowed
asynchronous fetching of instructions and data at the backing
store.
40
A Compiler-Based System
c a c h e m a n a g e m e n t so t h a t s o f t w a r e c a n reduce these u n n e c e s sary loads a n d stores of c a c h e lines.
So f a r we h a v e discussed t w o m a j o r ideas w h i c h p e r v a d e
the 801 system. First, build a C P U t h a t c a n execute its instructions quickly (i.e. in o n e relatively s h o r t m a c h i n e cycle), a n d
define these i n s t r u c t i o n s to be a g o o d t a r g e t f o r c o m p i l a t i o n so
t h a t resulting p a t h l e n g t h s are g e n e r a l l y c o m m e n s u r a t e with
those for the same f u n c t i o n s o n m o r e c o m p l e x i n s t r u c t i o n sets
(e.g. S / 3 7 0 ) . S e c o n d , define the s t o r a g e h i e r a r c h y a r c h i t e c t u r e ,
the C P U instructions, the 1 / O a r c h i t e c t u r e a n d the s o f t w a r e so
t h a t the C P U will g e n e r a l l y not h a v e to wait for s t o r a g e access•
T h e third m a j o r idea c e n t e r s a b o u t the 801 compiler. A f u n d a mental decision of the 801 p r o j e c t w a s to base the entire s y s t e m
on its pervasive use. This has resulted in the following system
characteristics:
O n e o t h e r 801 system s t r a t e g y leads to m o r e effective
use of the cache. C o n v e n t i o n a l s o f t w a r e a s s u m e s t h a t its m e m ory is r a n d o m l y a d d r e s s a b l e . Because of this a s s u m p t i o n e a c h
service p r o g r a m in the s u p e r v i s o r a n d s u b s y s t e m s has its o w n
local t e m p o r a r y storage. T h u s an application p r o g r a m r e q u e s t ing these services will cause r e f e r e n c e s to m a n y different addresses.
In a h i g h - l e v e l - l a n g u a g e - b a s e d s y s t e m like the 801,
control p r o g r a m services are C A L L ' e d just like a user's s u b r o u t ines. The result is t h a t all these service p r o g r a m s get their
t e m p o r a r y areas f r o m the same stack, resulting in m u c h reuse of
c a c h e lines and, t h e r e f o r e , higher c a c h e hit ratios.
So f a r we h a v e d i s c u s s e d 801 f e a t u r e s t h a t result in
o v e r l a p p e d access to the c a c h e b e t w e e n i n s t r u c t i o n s a n d d a t a ,
o v e r l a p p e d b a c k i n g store access a m o n g the c a c h e s a n d 1 / O , less
h a r d w a r e s y n c h r o n i z i n g a m o n g the c a c h e s a n d I / O , a n d techniques to improve the c a c h e hit ratios. O n e o t h e r aspect of the
801 C P U design a n d a r c h i t e c t u r e should be d e s c r i b e d to c o m plete the picture.
I n s t r u c t i o n sets f o r c o n v e n t i o n a l C P U ' s have b e e n defined with a n implicit a s s u m p t i o n t h a t m a n y p r o g r a m m e r s will
use a s s e m b l y l a n g u a g e . This a s s u m p t i o n h a s m o t i v a t e d the
d e f i n i t i o n of c o m p l e x i n s t r u c t i o n s ( s u c h as E d i t a n d M a r k ,
T r a n s l a t e a n d Test) almost as m u c h as has the n o t i o n of a f a s t
c o n t r o l store. But, increasingly, p r o g r a m m e r s d o not use a s s e m bly l a n g u a g e except w h e r e optimal p e r f o r m a n c e is essential or
m a c h i n e f u n c t i o n s are r e q u i r e d t h a t are n o t r e f l e c t e d in the
source language.
E v e n if almost all i n s t r u c t i o n a n d d a t a r e f e r e n c e s are
f o u n d in the cache, a n d the c a c h e a n d b a c k i n g store are a l w a y s
available to the C P U , a c o n v e n t i o n a l C P U will still o f t e n be idle
while w a i t i n g f o r a load to c o m p l e t e or f o r the t a r g e t of a
b r a n c h to be fetched. Sophisticated C P U ' s o f t e n k e e p b r a n c h t a k e n histories or fetch a h e a d o n b o t h p a t h s in o r d e r to overc o m e this idle time. In the 801 p r o j e c t we o b s e r v e d that, with a
small n u m b e r of h a r d w a r e primitives, s o f t w a r e (i.e. the c o m p i ler) c o u l d r e o r d e r p r o g r a m s so t h a t the s e m a n t i c s r e m a i n e d
u n c h a n g e d but the h a r d w a r e could easily o v e r l a p this idle time
with useful w o r k .
The compiler for the 801 has d e m o n s t r a t e d t h a t it c a n
p r o d u c e o b j e c t c o d e t h a t is close e n o u g h to b e s t h a n d c o d e
g e n e r a l l y so t h a t a s s e m b l y l a n g u a g e p r o g r a m m i n g is a l m o s t
n e v e r n e e d e d f o r p e r f o r m a n c e • T h e o p e r a t i n g system has isolated those m a c h i n e - d e p e n d e n t f u n c t i o n s n o t r e f l e c t e d in the
l a n g u a g e (such as Disable, Start I/.O, D i s p a t c h ) a n d d e v e l o p e d
efficient p r o c e d u r e s w h i c h provide these f u n c t i o n s with minimal
linkage o v e r h e a d .
T h e result is a s y s t e m in w h i c h less t h a n a t h o u s a n d
lines of s u p e r v i s o r c o d e ( a n d some of the " m i c r o e o d e " s u b r o u t ine i m p l e m e n t a t i o n s of the c o m p l e x f u n c t i o n s ) are w r i t t e n in
A s s e m b l y l a n g u a g e . This has relieved the 801 a r c h i t e c t u r e of
the b u r d e n of being easy to p r o g r a m directly. Virtually the o n l y
p r o g r a m m e r s w h o are c o n c e r n e d with the n a t u r e of the a r c h i t e c ture are the compiler writers, a n d the " c o r e " supervisor writers.
All o t h e r s see the s y s t e m only t h r o u g h a high level l a n g u a g e .
Because of this, the 801 a r c h i t e c t s were able to base their decisions solely o n the n e e d s of these f e w p r o g r a m m e r s , a n d o n
cost/performance considerations.
O n load instructions the register t h a t is to be the t a r g e t
of the load is locked b y the C P U . The C P U then c o n t i n u e s
execution of the i n s t r u c t i o n s t r e a m until it r e a c h e s a n instruction t h a t requires this register, at w h i c h time it idles until the
load is completed. Thus, if the compiler c a n find a useful instruction to put a f t e r the load t h a t d o e s not require the result of
the load, the C P U will not be idle at all while the d a t a c a c h e
f e t c h e s the r e q u e s t e d w o r d .
( A n d if the c o m p i l e r c a n find
several s u c h i n s t r u c t i o n s to put a f t e r the l o a d , e x e c u t i o n of
these will even o v e r l a p c a c h e miss.)
Similarly for b r a n c h e s , the 801 a r c h i t e c t u r e defines, f o r
e v e r y t y p e of b r a n c h i n s t r u c t i o n , a n a l t e r n a t e f o r m called
B r a n c h with Execute. (This is similar to the d e l a y e d b r a n c h in
the R I S C c o m p u t e r [1].) T h e s e i n s t r u c t i o n s h a v e e x a c t l y the
s a m e s e m a n t i c s as their c o r r e s p o n d i n g b r a n c h i n s t r u c t i o n s ,
except t h a t while the i n s t r u c t i o n c a c h e is f e t c h i n g the b r a n c h
t a r g e t the C P U executes the i n s t r u c t i o n t h a t has b e e n placed
i m m e d i a t e l y a f t e r the B r a n c h with E x e c u t e i n s t r u c t i o n . F o r
instance, in the sequence:
T h u s the 801 a r c h i t e c t u r e w a s d e f i n e d as t h a t set of
r u n - t i m e o p e r a t i o n s which:
- could n o t be m o v e d to compile time,
- could not be m o r e efficiently e x e c u t e d b y o b j e c t c o d e
p r o d u c e d b y a c o m p i l e r w h i c h u n d e r s t o o d the high level intent
of the p r o g r a m ,
- was to be i m p l e m e n t e d in r a n d o m logic m o r e effectively t h a n the equivalent s e q u e n c e of s o f t w a r e instructions.
L O A D R1, A
BNZ L
It m i g h t at first s e e m s u r p r i s i n g t h a t c o m p i l e r w r i t e r s
w o u l d n o t w a n t p o w e r f u l high level instructions• But in f a c t
these i n s t r u c t i o n s are o f t e n h a r d to use since the compiler must
find t h o s e c a s e s w h i c h e x a c t l y fit the a r c h i t e c t e d c o n s t r u c t .
C o d e selection b e c o m e s n o t just finding the fewest instructions,
but the right instructions. A n d w h e n these i n s t r u c t i o n s n a m e
o p e r a n d s in s t o r a g e instead of in registers, c o d e selection will
d e p e n d u p o n the results of register allocation.
the C P U w o u l d be idle while the i n s t r u c t i o n c a c h e was f e t c h i n g
L, if the b r a n c h was taken.
C h a n g i n g the B R A N C H - N O N Z E R O to a B R A N C H N O N - Z E R O W I T H E X E C U T E , a n d
m o v i n g the L O A D instruction results in:
BNZX L
LOAD R1,A
The 801 a p p r o a c h to p r o t e c t i o n is s t r o n g l y b a s e d u p o n
this c o m p i l e r i n t e r m e d i a r y b e t w e e n users a n d the h a r d w a r e .
Conventional systems expect application programmers, and
certainly s u b s y s t e m p r o g r a m m e r s , to use a s s e m b l y l a n g u a g e or
o t h e r l a n g u a g e s in w h i c h it is possible to s u b v e r t the s y s t e m
w h i c h has exactly the same s e m a n t i c s but allows the C P U to
execute the L O A D while the i n s t r u c t i o n c a c h e is f e t c h i n g the
instruction at L. The 801 compiler is able, generally, to c o n v e r t
a b o u t 6 0 % of the b r a n c h e s in a p r o g r a m into the execute form.
41
The linkage between procedures on the 801 is another
example of a consistent machine design based on a system used
solely via a high level language. We wanted applications on the
801 to be p r o g r a m m e d using good p r o g r a m m i n g style. This
implies a large number of procedures and many calls. In particular it implies that very short procedures can be freely written
and invoked. Thus, for these short procedures, the linkage must
be minimal.
(either deliberately or accidentally). Thus hardware facilities
are required to properly isolate these users. The most popular
examples of these facilities are storage protect keys, multiple
virtual address spaces, and supervisor state. These facilities are
often costly and sometimes degrade performance. But what is
more important is that they are often inadequate. Since even
16 different keys are insufficient to assign uniquely, for instance, different users are sometimes given the same key or the
system limits the n u m b e r of active users. Also, because the key
disciplines are only two level, many subsystems are forced to
run with full addressing capability.
The 801 procedure linkage attempts to keep arguments
in registers where possible. It also expects some register values
to be destroyed across a C A L L . The result is that a procedure
call can be as cheap as a Branch and Link instruction when the
called procedure can execute entirely out of available registers.
As more complex functions are required they increase the overhead for linkage incrementally.
If, however, users are Constrained to a properly defined
source language, and their programs are processed by an intelligent compiler and run on an operating system that understands
the addressing strategies of the compiler, it. is possible to provide better protection at less cost. The 801 system, therefore, is
based upon the assumption that certain critical c o m p o n e n t s of
the compiler are correct, and that all programs executing on the
system (except for a small supervisor core) have been compiled
by this compiler. The system will guarantee:
Finally, the pervasive use of a high level language and
compiler has given the project great freedom to change. The
architecture has undergone several drastic changes, and countless minor ones. The linkage conventions, storage m a p p i n g
strategies, and run time library have similarly been changed as
experience provided new insights. In almost every case the cost
of the change was limited to recompilations.
• that all references to data (scalars, arrays, structures,
areas) really do point to that data, and that the extents of the
references are included in the extents of the data,
This ability to preserve source code, thus limiting the
impact of change, can have significant long range impact on
systems. New technologies (and packaging) often offer great
p e r f o r m a n c e and cost benefits if they can be exploited with
architecture changes.
• that a reference to d y n a m i c a l l y a l l o c a t e d - a n d - f r e e d
data is made only between an allocation and a free,
• that all branches are to labels, and all calls are to
proper entry points in procedures,
System Components
• that the extents of all arguments to a procedure match
the extents of their corresponding parameters, so that the protection persists across calls,
The ProgrammingLanguage
• that all declarations of global (external) variables in
separately compiled procedures have consistent extents.
The source language for the 801 system is called PL.8.
It was defined to be an appropriate language for writing systems
programs and to produce o p t i m i z e d code with the c h e c k i n g
described above.
PL.8 began as an almost-compatible subset of P L / I , so
that the PL.8 compiler was initially compiled by the P L / I Optimizer. It contains, for instance, the P L / I storage classes,
functions, floating point variables, varying c h a r a c t e r strings,
arrays with adjustable extents, the structured control primitives
of P L / I , the string-handling built-in-functions, etc. It differs
from P L / I in its interpretation of bit strings as binary numbers,
in its binary arithmetic (which simply reflects the arithmetic of
the 801 h a r d w a r e ) and in some language a d d i t i o n s b o r r o w e d
from Pascal. It does not contain full P L / I O N conditions,
multiple entry .points, or the ability to develop absolute pointers
to A u t o m a t i c or Static storage. Relative pointers, called Offsets, can be developed only to Areas. This discipline has several advantages:
- All program and data areas can be moved freely by
the system, since absolute addresses are never stored in useraddressable data structures.
This checking is often done at compile time, link edit
time, or program fetch time, but, when necessary, trap instructions are introduced into the object code to check at run time.
The resulting increase in pathlength due to this run time checking is generally less than 1 0 % because this code is optimized
along with the rest of the program [5].
Notice that this is not a "strongly typed" approach to
checking. Overlays of one data type on another are permitted,
provided the domains are not exceeded. But our experience in
running code conventionally on the S / 3 7 0 and then on the 801
with this checking has shown that many program bugs are discovered and that, more importantly, they tend to be the kinds of
bugs that elude normal c o m p o n e n t test procedures.
It was noted earlier that, because the operating system
was also written in the 801's high level language and compiled
by the 801 compiler, its service programs were simply C A L L ' e d
like any external procedure, resulting in better cache behavior.
An even more important consequence of this design, however, is
that the checking of matches between arguments and parameters
is performed at the time a program is loaded into memory and
linked to the supervisor. This results in efficient calls to supervisor services, especially when compared to conventional overhead. It means, also, that the compiler-generated " t r a c e b a c k "
mechanism continues into the operating system, so that when an
error occurs the entire symbolic call chain can be displayed.
- Any a r i t h m e t i c data type can be used as an offset
(relative pointer) and all a r i t h m e t i c o p e r a t i o n s can be freely
performed, since the extent checks will be made on every use.
- A store, using a computed offset, can only affect other
data in that particular area. Thus, the locations whose values
could have been changed by this store are significantly limited.
This enhances the power of the optimization algorithms.
- It leads to b e t t e r structured, more easily readable
programs.
42
The Optimizing ~ompiler
equivalent to the problem of coloring the graph so that no two
nodes connected by a vertex are colored with the same crayon.
There have been about seven programmers in the compiler group since the project began. A running compiler was
completed after about two years. Since then the group has been
involved with language extensions, new optimization techniques,
debugging, and useability aids. It should be noted, however,
that the C o m p u t e r Sciences d e p a r t m e n t at Y o r k t o w n Heights
has been working on compiler algorithms for about t w e n t y
years, many of which were simply incorporated into this compiler.
This global approach has proven very effective. Surprisingly many procedures "color" so that no s t o r e / l o a d sequences
are necessary to keep results in storage temporarily. (At present the compiler "colors" only computations. There is, however, no technical reason why local variables could not also be
"col6red" and we intend to do this eventually.) When it does
fail, other algorithms which use this graph information are employed to decide what to store. Because of this ability of the
compiler to effectively utilize a large number of registers, we
decided to implement 32 ge~neral purpose registers in the hardware.
The PL.8 compiler adopts two strategies which lead to
its excellent object code. The first is a strategy which translates
in the most straightforward, inefficient (but correct) m a n n e r
from P L 8 source language to an intermediate language (IL).
This translation has as its only objective the p r o d u c t i o n of
semantically correct object code. It seeks almost no special
cases, so that it is relatively easy to debug. Moreover the intermediate language which is its target is at a very low level, almost the real 801 machine.
The compiler will also accept Pasca[ programs, producing compatible object code so that PL.8 and Pascal procedures
can freely call one another. It will also produce efficient object
code for the S / 3 7 0 , thus providing source code portability.
Instructions and O p e r a n d s
The next phase of the compiler develops flow graphs of
the program as described in [2], and, using these graphs, performs a series of conventional optimization algorithms, such as:
•
•
•
•
Instruction formats and data representations are areas
which saw significant change as the project evolved. This section describes the current version of the architecture; "The kind
of instruction and operand set requested by the compiler developers turned out, fortunately, to be precisely one which made
hardware i m p l e m e n t a t i o n easier. The overriding theme was
regularity. For instance:
common sub-expression elimination,
moving code out of loops,
eliminating dead code,
strength reduction.
- All operands must be aligned on boundaries consistent
with their size (i.e. halfwords on halfword boundaries, words on
word boundaries). All instructions are fullwords on fullword
boundaries. (This results in an increase in program size over
two-and-four byte formats, but the larger format allows us to
define more powerful instructions resulting in shorter pathlengths.) Since the 801 was designed for a c a c h e / m a i n
s t o r e / h a r d disk hierarchy, and virtual memory addressing, the
consequence of larger programs is limited to more disk space
and larger working sets (i.e., penalties in c a c h e - h i t - r a t i o and
page-fault frequency).
Each of these algorithms transforms an IL program into
a semantically equivalent, but more efficient, IL program. Thus
these procedures can be (and are) called repetitively and in any
order. While these procedures are quite sophisticated, since each
of them acts on the entire program and on all programs, a bug
in one of them is very easily observed.
The power of this approach is not only in the optimizing
power of the algorithms but in the fact that they are applied to
such a low level IL. Conventional global optimizing compilers
perform their transformations at a much higher level of text,
primarily because they were designed to run in relatively small
size memory. Thus they can often not do much more than
convert one program to another which could have been written
by a more careful programmer. The PL.8 compiler, on the
other hand, applies its o p t i m i z a t i o n algorithms to addressing
code, domain checking code, procedure linkage code, etc.
With this alignment constraint the hardware is greatly
simplified. Each data or instruction access can cause at most
one cache miss or one page fault. The caches will have to
access at most one aligned word. Instruction prefetch mechanisms can easily find op codes if they are searching for branches. Instruction alignment and data alignment are unnecessary.
Instruction L e n g t h C o u n t fields (as in the S / 3 7 0 PSW) are
unnecessary and software can always b a c k t r a c k instructions.
Moreover, for data, traces show that misaligned operands rarely
appear and when they do are often the result of poor programming style.
The second compiler strategy which is different from
c o n v e n t i o n a l compilers is our approach to register allocation
[3,4]. The IL, like that of most compilers, assumes an arbitrarily large number of registers. In fact the result of each different
computation in the program is assigned a different (symbolic)
register. The job for register allocation is simply to assign real
registers to these symbolic registers. C o n v e n t i o n a l approaches
use some subset of the real registers for slaecial purposes (e.g.
pointers to the stack, to the code, to the parameter list). The
remaining set is assigned locally within a statement, or at best a
basic block (e.g. a loop). Between these assignments results
which are to be preserved are temporarily stored, and variables
are redundantly loaded.
- Given four byte instructions, other benefits accrue.
Register fields in instructions are made five bits long so that the
801 can name 32 registers. (This aspect of 801 architecture
makes it feasible to use the g01 to emulate other architectures
which have 16 G P R ' s , since 16 801 registers are still available
for emulator use.)
Four byte instructions also allow the target register of
every instruction to be named explicitly so that the input operands need not be destroyed. This facility is applied pervasively, as in "Shift Reg A Left by contents of Reg B and Store
Result in Reg C". This feature of the architecture simplifies
register allocation and, eliminates many Move Register instructions.
The 801 approach observes that the register assignment
problem is equivalent to the graph coloring problem, where each
symbolic register is a node and the real registers are different
colors. If two symbolic registers have the property that there is
at least one point in the program where both their values must
be retained, we model that property on the graph as a vertex
between the two nodes. Thus the register allocation problem is
43
- T h e r e are C o m p a r e a n d T r a p i n s t r u c t i o n s d e f i n e d
which allow the S / 3 7 0 C o m p a r e - B r a n c h C o n d i t i o n s e q u e n c e
to be e x e c u t e d in one m a c h i n e cycle for those cases w h e r e the
test is for a n i n f r e q u e n t l y - e n c o u n t e r e d e x c e p t i o n c o n d i t i o n .
These instructions are used to implement the r u n - t i m e extent
c h e c k i n g discussed earlier.
- The 801 is a true 32 .bit a r c h i t e c t u r e , not a 16 bit
a r c h i t e c t u r e with e x t e n d e d registers.
A d d r e s s e s are 32 bits
long; arithmetic is 32 bit t w o ' s c o m p l e m e n t ; logical a n d shift
instructions deal with 32 bit w o r d s (and c a n shift d i s t a n c e s up
to 32). A useful w a y to reduce p a t h l e n g t h ( a n d c a c h e misses) is
to define a rich set of i m m e d i a t e fields, but of course it is impossible., to e n c o d e a general 32 bit c o n s t a n t to fil into an immediate field in a f o u r byte instruction. The 801 defines the following subsets of such c o n s t a n t s w h i c h meet most requirements:
- A r i t h m e t i c is 32 bit t w o ' s c o m p l e m e n t . T h e r e are
special instructions defined to allow M A X , M I N , . a n d decimal
a d d a n d s u b t r a c t to be c o d e d efficiently. T h e r e are also t w o
instructions defined (Multiply Step a n d Divide Step) to allow
two 32 bit w o r d s to be multiplied in 16 cycles (yielding a 64 bit
p r o d u c t ) a n d a 64 bit dividend to be divided b y a 32 bit divisor
in 32 cycles (yielding a 32 bit q u o t i e n t a n d a 32 bit r e m a i n d e r ) .
• A 16 bit i m m e d i a t e field for a r i t h m e t i c a n d
address c a l c u l a t i o n (D field) w h i c h is i n t e r p r e t e d as a
t w o ' s c o m p l e m e n t signed integer. ( T h u s the c o n s t a n t s
_+.215 c a n be r e p r e s e n t e d immediately.)
• A 16 bit logical c o o s t a n t . E a c h logical o p e r a tion has t w o i m m e d i a t e f o r m s - u p p e r . a n d lower, so t h a t
in at m o s t t w o i n s t r u c t i o n s "(cycles) logical o p e r a t i o n s
c a n be p e r f o r m e d using a 3 2 bit logical c o n s t a n t .
- The 801 has
These were d e v e l o p e d
emulator "microcode"
functions, all available
• A n 11 bit e n c o d i n g of a M a s k (i.e. a s u b s t r i n g
of o n e s s u r r o u n d e d b y z e r o s or z e r o s s u r r o u n d e d b y
ones). T h u s for shift, insert, a n d isolate o p e r a t i o n s the
s u b s t r i n g c a n be defined immediately.
a rich set of shift a n d insert instructions.
to m a k e device c o n t r o l l e r " m i c r o c o d e " ,
a n d s y s t e m s c o d e very efficient. T h e
in One m a c h i n e cycle, are:
• ring shift a register up to 31 positions (specified
in a n o t h e r zegister or in a n i m m e d i a t e field),
• using a mask (in a n o t h e r register or in a n i m m e diate field) m e r g e this shifted w o r d with all zeros (i.e. isolate the field) or with a n y o t h e r register (i.e. merge), or with the result of the previous shift (i.e. long' shift),
• s t o r e this b a c k into a n y o t h e r register or into
s t o r a g e (i.e. m o v e c h a r a c t e r string).
• A 16 bit i m m e d i a t e field f o r b r a n c h t a r g e t
c a l c u l a t i o n (D-field) w h i c h is i n t e r p r e t e d as a signed
t w o ' s c o m p l e m e n t offset f r o m the a d d r e s s of the c u r r e n t
instruction. ( T h u s a relative b r a n c h to a n d f r o m a n y w h e r e within a 3 2 K b y t e p r o c e d u r e c a n be specified
immediately.)
• A 26 bit i m m e d i a t e field s p e c i f y i n g a n offset
f r o m the a d d r e s s of the c u r r e n t i n s t r u c t i o n or a n a b s o lute address, so t h a t b r a n c h e s b e t w e e n p r o c e d u r e s , to
supervisor services, or to " m i c r o c o d e s u b r o u t i n e s " c a n
be specified w i t h o u t h a v i n g to establish addressability.
(This last facility allows m i s a l i g n e d s o u r c e a n d t a r g e t
character string
moves
to
execute
as
fast
as
two
characters/cycle.)
Interrupts
- L o a d a n d Store i n s t r u c t i o n s are a v a i l a b l e in e v e r y
c o m b i n a t i o n of the following options:
and
1/0
I / O in the 801 p r o t o t y p e is c o n t r o l l e d b y a set of a d a p ters w h i c h a t t a c h to the C P U a n d m e m o r y b y two buses. T h e
E x t e r n a l Bus a t t a c h e s the a d a p t e r s to the C P U . It is used b y
s o f t w a r e to s e n d c o m m a n d s a n d receive status, by m e a n s of
s y n c h r o n o u s R e a d a n d Write instructions. D a t a is t r a n s m i t t e d
b e t w e e n the a d a p t e r s a n d the 801 b a c k i n g s t o r e t h r o u g h the
M I O ( M e m o r y - I / O ) bus. (As d e s c r i b e d a b o v e it is the r e s p o n sibility of the s o f t w a r e to s y n c h r o n i z e the caches.)
• L o a d or Store,
• c h a r a c t e r , h a l f w o r d , s i g n - e x t e n d e d h a l f w o r d , fullword,
• Base + Index, or Base + D i s p l a c e m e n t effective
a d d r e s s calculation. ( U s a g e statistics for S / 3 7 0
s h o w low use for the full B + X + D form. T h u s a
three i n p u t a d d e r did n o t seem w a r r a n t e d . )
• S t o r e the effective a d d r e s s b a c k into the b a s e
register (i.e. " a u t o i n c r e m e n t " ) or not.
R a t h e r t h a n s u p p o r t i n t e g r a t e d a n d c o m p l e x (multilevel) i n t e r r u p t h a r d w a r e , the 801 a g a i n m o v e s to s o f t w a r e
f u n c t i o n s t h a t c a n be p e r f o r m e d m o r e efficiently b y p r o g r a m ming. S o f t w a r e o n s y s t e m s t h a t provide, say, eight i n t e r r u p t
levels o f t e n find this n u m b e r i n a d e q u a t e as a distinguisher of
i n t e r r u p t handlers. T h u s a s o f t w a r e first level i n t e r r u p t h a n d l e r
is p r o g r a m m e d o n top of the h a r d w a r e , i n c r e a s i n g the real time
to r e s p o n d . M o r e o v e r the r e q u i r e m e n t to s u p p o r t eight sets of
registers results in these b e i n g s t o r e d in s o m e f a s t m e m o r y
r a t h e r t h a n in logic o n - c h i p . This results in a slower m a c h i n e
cycle. If the real time r e s p o n s i v e n e s s of a s y s t e m is m e a s u r e d
realistically it m u s t include not only the time to get to an interr u p t h a n d l e r b u t the time to process the i n t e r r u p t , w h i c h clearly
d e p e n d s o n the length of the m a c h i n e cycle. T h u s in a practical
sense the 801 is a g o o d real-time system.
- B r a n c h e s are a v a i l a b l e with the f o l l o w i n g b r a n c h
t a r g e t specifications
• a b s o l u t e 26 bit address,
• Instruction Add/ess Register + Displacement
(signed 16 or 26 bit w o r d offset),
• R e g i s t e r + Register,
B r a n c h a n d L i n k f o r m s are d e f i n e d normally. But c o n ditional b r a n c h e s are d e f i n e d not only b a s e d u p o n the state of
the C o n d i t i o n Register but o n the presence or a b s e n c e of a o n e
in a n y bit position in a n y register. (This allows tile Test U n d e r
M a s k - B r a n c h C o n d i t i o n s e q u e n c e in S / 3 7 0 to be e x e c u t e d in
o n e m a c h i n e cycle ( a n d no s t o r a g e r e f e r e n c e s ) if the bit is
a l r e a d y in a register. A g a i n the p o w e r of global register allocation m a k e s this m o r e p r o b a b l e . )
I n t e r r u p t d e t e r m i n a t i o n a n d p r i o r i t y h a n d l i n g is p a c k a g e d o u t b o a r d of the C P U chips in a special unit called the
E x t e r n a l I n t e r r u p t c o n t r o l l e r ( a l o n g with the s y s t e m clocks,
timers, a n d a d a p t e r locks).
(This p a c k a g i n g d e c i s i o n allows
o t h e r v e r s i o n s of 801 s y s t e m s to c h o o s e d i f f e r e n t i n t e r r u p t
strategies w i t h o u t i m p a c t i n g the C P U design.) in this c o n t r o l l e r
44
there are (logically) t w o bit vectors. The first, the I n t e r r u p t
R e q u e s t V e c t o r ( I R V ) c o n t a i n s a bit f o r e a c h device w h i c h m a y
wish to i n t e r r u p t the C P U (plus one e a c h for the clocks, timers,
a n d the C P U itself for simulating external interrupts). These
bits are tied b y lines to the devices.
Prototype
A h a r d w a r e p r o t o t y p e has b e e n built f o r a n e a r l y version of the 801 a r c h i t e c t u r e , o u t of M E C L 10K DIP's. It r u n s
at 1.1 cycles per instruction. (This n u m b e r m u s t be t a k e n as a n
o u t - o f - c a c h e p e r f o r m a n c e figure b e c a u s e the a p p l i c a t i o n s w h i c h
c u r r e n t l y run s h o w hit ratios at close to 1 0 0 % a f t e r the initial
c a c h e load.) We d o not yet have m u l t i p l e - u s e r m e a s u r e m e n t s .
The s e c o n d v e c t o r , called the I n t e r r u p t M a s k V e c t o r
( I M V ) c o n t a i n s a bit c o r r e s p o n d i n g to e a c h bit in the IRV. The
IMV is loaded by s o f t w a r e in the C P U . It d y n a m i c a l l y e s t a b lishes the priority levels of the i n t e r r u p t requesters. If there is a
o n e in a position in the IRV c o r r e s p o n d i n g to a o n e in the
c o r r e s p o n d i n g position of the IMV, a n d the 801 C P U is e n a b l e d
for interrupt, the C P U is i n t e r r u p t e d .
The register file is c a p a b l e of r e a d i n g out a n y three a n d
writing b a c k a n y t w o registers within a single cycle. Thus the
C P U is pipelined as follows:
O n i n t e r r u p t the C P U b e c o m e s disabled a n d u n r e l o c a t ed, a n d begins e x e c u t i n g the F i r s t Level I n t e r r u p t H a n d l e r
( F L I H ) in l o w e r m e m o r y . T h e F L I H stores the i n t e r r u p t e d
state, reads the IRV a n d d e t e r m i n e s the requester. U s i n g this
position n u m b e r , it s e n d s a new I M V (reflecting the priority of
the r e q u e s t e r ) a n d b r a n c h e s to the i n t e r r u p t h a n d l e r for t h a t
requester, w h i c h executes e n a b l e d a n d relocated. P a t h l e n g t h s
for the F L I H are less t h a n 100 i n s t r u c t i o n s ( a n d c a n be r e d u c e d
for a subclass of f a s t - r e s p o n s e i n t e r r u p t s ) , a n d less t h a n 150
i n s t r u c t i o n s f o r the d i s p a t c h e r ( w h e n the i n t e r r u p t h a n d l e r
completes).
Internal
Hardware
- The first level of the pipeline d e c o d e s the i n s t r u c t i o n ,
r e a d s t w o registers into the A L U , executes the A L U , a n d either
latches the result or, f o r L o a d o r Store instructions, s e n d s the
c o m p u t e d a d d r e s s to the cache. O n a store i n s t r u c t i o n , the d a t a
w o r d is also f e t c h e d f r o m the register file a n d sent to the cache.
- T h e s e c o n d level of the pipeline s e n d s the l a t c h e d
result t h r o u g h the shifter, sets the c o n d i t i o n register bits, a n d
stores the result b a c k into a register. D u r i n g this cycle also, if a
w o r d has b e e n received f r o m the c a c h e as the result of a load
instruction it is loaded into the register.
Bus
( T h e h a r d w a r e m o n i t o r s r e g i s t e r n a m e s to b y p a s s the
load w h e n the result is being i m m e d i a t e l y used.)
We have, so far, d e s c r i b e d a C P U that must have the
following (logical) buses to storage:
-
T h e c a c h e is designed so t h a t o n a miss the r e q u e s t e d
w o r d is sent directly to the C P U , thus r 6 d u c i n g l o c k o u t while
the c a c h e line is being filled.
a c o m m a n d bus to describe the f u n c t i o n r e q u e s t e d ,
an a d d r e s s bus,
a source d a t a bus for Stores,
a target d a t a bus for Loads.
Performance
Comparisons
Figures 1 a n d 2 s h o w s o m e ea~'ly p e r f o r m a n c e c o m p a r i sons. Since the compiler will p r o d u c e o b j e c t c o d e f o r the S / 3 7 0
as well as the 801, these c o m p a r i s o n s are possible for the s a m e
source p r o g r a m s a n d the s a m e compiler. We use the n u m b e r of
cycles in the i n n e r loops, a n d the n u m b e r of s t o r a g e r e f e r e n c e s
in the inner loops to a p p r o x i m a t e d y n a m i c p e r f o r m a n c e .
We o b s e r v e d t h a t o t h e r f u n c t i o n s m i g h t be i m p l e m e n t e d
o u t b o a r d of the C P U a n d could a t t a c h to the C P U via these
same buses (e.g. floating point). T h e r e f o r e we e x p o s e d these
buses in an 801 instruction, called Internal Bus O p e r a t i o n . This
instruction has o p e r a n d s to n a m e the following:
the bus unit being r e q u e s t e d ,
- the c o m m a n d ,
- the two o p e r a n d s (B,D, or B,X) w h i c h will be a d d e d
to p r o d u c e the o u t p u t on the a d d r e s s bus,
- the source register,
- the t a r g e t register, if n e e d e d ,
-
Figure 1 s h o w s results for a n i n - m e m o r y sort p r o c e d u r e .
Figure 2 s h o w s the results f o r r a n d o m l y selected modules f r o m
the c o m p i l e r itself. N o t e t h a t as the m o d u l e s get l a r g e r the
p o w e r of global register a l l o c a t i o n results in f e w e r s t o r a g e references. N o t e also that, in spite of the f a c t t h a t the 801 c o n t a i n s
no c o m p l e x instructions, the 801 m o d u l e s c o n t a i n f e w e r instructions a n d f e w e r i n s t r u c t i o n e x e c u t i o n s . This is b e c a u s e the
c o m p l e x i n s t r u c t i o n s are g e n e r a l l y very i n f r e q u e n t w h e r e a s the
801 has a m o r e p o w e r f u l set of primitive instructions.
a n d three flags:
- privileged c o m m a n d or not,
- t a r g e t register required or not,
- address bus sent b a c k to Base register, or not.
H a v i n g defined this generic instruction we gave bus unit
n a m e s to the instruction a n d d a t a caches, the external i n t e r r u p t
controller, the timer, a n d the relocate controller, a n d assigned
the IBO op c o d e to all i n s t r u c t i o n s directed to these units.
45
Figure 1
Performance Comparisons
Program: Heap Sor.._..tprogrammed in PL.8
In Inner Loop
CPU
Code Size (Bytes) No. of Instructions
Data Ref's
Cycles Cycles/Inst.
S/370-168
236
33
8
56
801
240
28
6
31
Figure 2
Programs: Random.L~ Selected Modules in PL.8 Compiler
Module
(In increasing
size order
FIND
SEARCHV
LOAD S
P2 EXTS
SORT S1
PM ADD1
ELMISS
PM GKV
P5DBG
DESCRPT
ENTADD
Total
Code Size (Bytes)
Dynamic Comparisons
Instructions
Executed
Data Storage
References
801/S/370
801/S/370
801/370
1.02
.93
.83
1.00
.86
.86
.87
.92
.98
.86
.79
.91
.83
.91
1.00
.78
.96
.86
.76
.81
.75
.76
.60
.38
.43
.57
.59
.63
.69
.46
.52
.42
.42
.90
.80
.50
46
1.7
1.1
References
Conclusions
[11
Patterson, David A. and Sdquin, Carlo H., "RISC-I: A
Reduced Instruction Set VLSI C o m p u t e r , " Eighth Annual Symposium in C o m p u t e r Architecture, May, 1981.
[2]
Cocke, John and Markstein, Peter W., " M e a s u r e m e n t of
Program I m p r o v e m e n t A l g o r i t h m s , " I n f o r m a t i o n Processing 80, North-Holland Publishing Co., 1980.
I3l
Chaitin, Gregory J. et al, "Register Allocation via Coloring," C o m p u t e r Languages, Vol. 6, pp. 47-57, 1981,
G r e a t Britain.
[41
Chaitin, G r e g o r y J., " R e g i s t e r A l l o c a t i o n and Spilling
via Coloring," IBM Research Report RC9124, 1981.
[51
Markstein,
V.,
Coeke,
J.,
and
P. Markstein,
" O p t i m i z a t i o n of Range Checking," IBM Research Report R C 8 4 5 6 , 1980.
While we do not have nearly enough measurements to
draw hard conclusions, the 801 group has developed a set of
intuitive principles which seem to hold consistently:
- At least in the low-to-mid range of processor complexity, a general purpose, register-oriented instruction set can
be at least as good as any special vertical microcode set. Thus
there should be only one hard-wired instruction set, and it
should be directly available to the compiler.
- A good global register allocator can effectively use a
large number of general purpose registers. Therefore all the
registers which the CPU can afford to build in hardware should
be directly and simultaneously addressable. Stack machines,
machines that hide some of the registers to improve C A L L
performance, multiple-interrupt level machines, all seem to
make poorer use of the available registers.
- Protection is far more effectively provided at a level
where the source language program is understood.
- It is easy to design and build a fast, cheap CPU, and
will be more so as VLS1 evolves. The harder problem is to
develop software, architecture and hardware which does not
keep the CPU idling due to storage access.
Acknowledgments
The seminal idea for the 801 and many s u b s e q u e n t
concepts are due to John Cocke.
In six years, the list of contributors has grown too large
to list here. The following people have been with the project
from beginning until the present and are responsible for most of
the design and implementation.
Hardware: Frank Carrubba, manager. Paul Stuckert,
Norman Kreitzer, Richard Freitas, Kenneth Case
Software: Marc Auslander, manager
Compiler: Martin Hopkins, manager. Richard
Goldberg, Peter Oden, Philip Owens, Peter Markstein, Gregory
Chaitin.
C o n t r o l Program:
Richard Oehler, manager.
Albert Chang.
Joel Birnbaum was the first manager of the project and
later a constant supporter. Bill Worley also contributed significantly through the years.
47
Download