PERFORMANCE EVALUATION OF MULTIPLE REGISTER SETS
Richard J. Eickemeyer and Janak H. Patel
Computer Systems Group
Coordinated Science Laboratory
University of Illinois
1101 W. Springfield
Urbana, IL 61801
Abstract
In this paper a DEC VAX with multiple register sets is
evaluated under many differently sized register sets. Both the
number of register sets and the number of registers per set were
varied. Performance, measured in terms of memory traffic, is
compared to that of a standard VAX. Memory traffic is measured
from many real program traces on the standard processor and
from transformations of the trace for the multiple register set
processors. Results are presented for each program; an empirical
formula is derived which describes the average program's
behavior. A decrease in memory references of approximately
16% can be expected using multiple register sets.
A goal of the research presented here is to determine the
effect of MRSs on overall processor performance. To accomplish
this goal one needs to compare processors with MRSs to processors
without MRSs. The machines should be identical in other
respects. While this may be an unfair comparison because a
traditional architecture could use the chip area for other purposes
instead of register sets, it is useful to compare MRS architectures,
as well as other architectures, against a basic type of processor
architecture. In this way one can observe the effects on performance
of an individual architecture feature. While MRSs certainly
reduce CALL overhead, the effect on overall performance
has not been demonstrated.
With the specification of what to measure it becomes necessary
to determine how to exercise the architectural feature--how
to drive the model. Of course, there are many issues involved in
measuring performance. Some of these are choice of programs to
run, machines to run them on, and how to get results, i.e.,
simulation, modeling, or actual program execution. An important part
of any performance study is the workload used--one cannot
specify performance without describing the workload [Ferr78].
The workload should be representative of the type of work the
processor will do. The workload used in this paper is a set of ten
common C programs. To measure performance, address traces
were collected for the programs while running on a VAX 11/780.
Each trace was mapped into the trace that would have resulted
from running the program on a modified VAX with multiple
register sets. The change in memory traffic between an MRS
processor and a single register set (SRS) processor is the basis for
comparison.
1. INTRODUCTION
Two major issues have emerged in computer architecture
since the introduction of the Berkeley Reduced Instruction Set
Computer (RISC) [PaSe82]. One concerns the merits of using an
instruction set consisting of simple, but fast instructions in
contrast to the trend of using complex, but powerful instructions. A
second issue is an organization to keep more data in fast memories
such as registers and cache. In a multiple register set (MRS)
architecture, such as RISC, several register sets are kept in a
register file and each time a procedure is called, a new register set is
allocated from the register file. Instead of storing registers in
memory at each procedure call, register contents are kept on chip,
reducing both the overhead of a procedure call and the overall
execution time. In most of these machines, the number of
registers allocated at each CALL is constant. A further reduction in
memory traffic is obtained by using registers for passing
parameters and for allocating local variables. A number of different
authors have discussed using several register sets or multiple
contexts in a large register set [Dann79, Site79, Lamp82, EiPa85,
HuLa85b]. This paper deals with performance of multiple
register set architectures.
In the next section of this paper, previous performance
results of MRS machines are summarized. Each is examined in
terms of the goals of this paper; the deficiencies are described.
One particular MRS architecture is described and an
implementation of a VAX with MRSs is presented. The method used to
evaluate the performance for this processor is given. Results are
presented for several differently sized MRS configurations and a
discussion of the results is presented. An empirical model for
performance improvement using MRSs is derived.
2. PREVIOUS RESEARCH
This research was supported by the Semiconductor Research Corporation (SRC) under contract 86-12-109.
In one of the first RISC papers from Berkeley [PaSe82], RISC
was compared to other processors, such as 68000 and VAX 11.
Later papers make similar comparisons [Patt85]. Results were
given showing the overall execution time for several different
benchmarks on various processors. However, these processors
have different cycle times, instruction sets, and numbers of
registers. The effect of MRSs was mixed with several other factors.
Several different benchmarks were chosen so the overall
workload may have been representative of typical systems workloads.
Permission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct commercial
advantage, the ACM copyright notice and the title of the publication and
its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery. To copy otherwise, or to
republish, requires a fee and/or specific permission.
© 1987 ACM 0084-7495/87/0600-0264 $00.75
The overall results, however, are inconclusive with regard to the
specific contribution of MRS architectures on performance.
Other work has attempted to expand the results. In one
such study, three machines were simulated with and without
MRSs [HiSp85]. Results were given in terms of the number of
words transferred between the processor and main memory,
which is a good indication of overall processor performance in
memory-saturated processors such as RISC I. Impressive results
were obtained, and were claimed to demonstrate the performance
gain by using MRSs. Furthermore, these results were simulations
of actual routines produced by compilers. The routines chosen,
however, were small, procedure intensive, and had few local
variables or parameters per procedure. Thus, they were not
typical of real workloads. Because MRS machines were specifically
intended to handle procedure calls well, the type of program used
in this study, dominated by procedure calls, should run well.
However, other types of statements may run equally fast
whether or not MRSs are present. Whatever the case, the
influence of other types of statements is overshadowed by the
large number of procedure calls. Because the amount of register
space needed by these programs is small, all local variables can fit
in registers. In real workloads, there may be many instances
where this is not the case and performance is degraded.
In another project [HuLa85b], several different MRS
organizations were compared. Results from three systems programs
were presented. No comparison was made to a standard
type of architecture, and no overall performance
results were presented, only register file statistics, such as overflow and
underflow and number of words transferred. The authors did, however,
present some data that may be useful in evaluating MRSs when
context switches are present.
being the current register set. The bottom of each stack is
accessible in the event that an overflow is imminent. Should this occur,
the data can be read from the bottom of the stacks to memory. A
stack underflow is generated when the top of the stacks is the
only level with valid data and a POP occurs. The program
counter has its own stack so it too is saved on a procedure call. The
top of the program counter stack is the return address. When
returning from a procedure the stacks are popped, bringing data
up one level. In addition to the register stacks, a few global
registers are also provided. These registers are not associated with
stacks.
The use of the register stacks is demonstrated by a
description of the call and return sequences. To call a procedure a PUSH
instruction is issued. This instruction pushes all the register
stacks. Should the bottom of the stacks contain valid data, a trap
to a software routine occurs. This routine saves all the registers
at the bottoms of the stacks in a memory stack. Once overflow
has been handled the stacks can then be pushed. Parameter
passing takes place next. Parameters are moved to the appropriate
register stack. Because a PUSH does not alter the contents of the
tops of the stacks these data are available for parameters if
necessary. A CALL instruction causes a jump to the new
procedure and the saving of the return address in a special stack
register. In the called procedure, local variables may be allocated
to stack registers that were not used for parameter passing. If the
procedure has a large number of parameters or local variables,
memory is needed and is used similarly to standard techniques.
To return from a procedure, a RETURN instruction causes a
jump to the return address. A POP instruction restores the stack
register contents for the resumption of execution. The use of a
special register--the Return Mask--allows selective writing to the
tops of stacks from the level below (not shown in the figure).
This can be used to return multiple values such as in
call-by-reference.
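The PUSH and POP behavior described above can be illustrated with a small software model. The following Python sketch is not the hardware design of the paper: the class name `ParallelStacks` and its methods are hypothetical, the overflow and underflow traps are reduced to moving one register set to or from a list that stands in for the memory stack, and the Return Mask and the separate CALL and RETURN jumps are left out.

```python
class ParallelStacks:
    """Toy model of parallel register stacks: one stack per register, S levels deep."""

    def __init__(self, num_regs, num_sets):
        self.num_regs = num_regs
        self.num_sets = num_sets
        # levels[0] is the top of the stacks, i.e. the current register set.
        self.levels = [[0] * num_regs]
        self.memory_stack = []  # register sets saved by the overflow trap
        self.overflows = 0
        self.underflows = 0

    @property
    def top(self):
        """The register set visible to the running procedure."""
        return self.levels[0]

    def push(self):
        """PUSH: move every level down by one; the tops keep their contents."""
        if len(self.levels) == self.num_sets:
            # The bottom holds valid data: trap and save it in memory first.
            self.memory_stack.append(self.levels.pop())
            self.overflows += 1
        self.levels.insert(0, list(self.levels[0]))

    def pop(self):
        """POP: bring every level up by one, restoring the caller's set."""
        if len(self.levels) == 1:
            # Only the top is valid: trap and reload the caller's set.
            # (Assumes a matching earlier overflow; raises IndexError otherwise.)
            self.levels.append(self.memory_stack.pop())
            self.underflows += 1
        self.levels.pop(0)
```

With three register sets, a fourth nested PUSH in this model triggers one overflow, and the matching final POP triggers one underflow that reloads the outermost caller's registers.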
A complex model of an MRS machine was used to determine
performance in another study [EiPa85]. The model was driven
by statistics from systems programs and, hence, is representative
of typical systems program workloads. A comparison was made
between a machine with MRSs and an otherwise identical
machine without MRSs. While it is easy to change the input
parameters to a model, the program statistics for different types
of programs are not always readily available. The performance
model used was specific to one machine organization.
In summary, the research presented used several different
methods to determine performance. In some cases, sample
programs were run or simulated. These programs may or may not
have been representative of the typical workloads that would be
run on the processors. Some of the studies did not present overall
results, but results related only to procedure calls. These results
must first be put into the context of a complete program to be
meaningful. In some cases other architectural issues were
included so MRS effects could not be isolated. This paper
addresses the problem of obtaining overall results on MRS effects
in real programs.
Interrupts are handled as procedure calls and use the stacks
similarly. A context switch requires all register sets (in use) to
be saved. This overhead is not included in the performance study
which follows; however, since register saving is only a small part
of the context-switch overhead in most large operating systems, a
context switch on an MRS processor should not be much more
expensive than on an SRS processor.
3. A PARALLEL STACK VAX-MRS
3.1. Parallel Stack Architecture
Address traces from a VAX are mapped into a VAX-MRS
(VAX with multiple register sets). The register set organization
of VAX-MRS is that of the Parallel Stack Architecture (similar
to the Parallel Stack Processor [EiPa85]). Figure 1 shows the
basic Parallel Stack Architecture. Register file storage is
organized as a number of stacks operating in parallel, one for each
processor register. The register file is a collection of hardware
stacks--shift registers. The top of each register stack is a
processor register addressed as in standard processors. On a procedure
call, all register stacks are pushed simultaneously, saving the
contents in the next level of the stacks. At the same time all
lower levels are moved to the next lower level. The tops of the
stacks are now available for the next procedure. Thus each level
of the parallel stack constitutes a distinct register set, the top
[Figure 1: diagram of the parallel register stacks; labels include Global Registers, Bottom of Stack Registers, and Processor Bus.]
Fig. 1. Parallel Stack Architecture: RA: Return Address; PSW:
Program Status Word, including the program counter.
3.2. Conventions for Register Usage
In the VAX there are 16 registers: six (0-5) are used for
temporary space in the C compiler, six (6-11) are used for
user-declared register allocation, and four (12-15) are used by the
system for argument pointer, frame pointer, stack pointer, and
program counter [LeEc80]. To modify a VAX to a VAX-MRS the
following changes are made. Registers 0-5 of VAX are
unchanged; these are the global registers in VAX-MRS. VAX
registers 6-11 become the tops of stacks in VAX-MRS. Because
the argument pointer, frame pointer, and program counter (VAX
registers 12, 13, and 15, respectively) are saved in memory on
VAX on a CALLS instruction, these three registers also have
stacks. The stack pointer (register 14) is not saved in the VAX
calling sequence so it was not given a stack. Since all variations
of the VAX-MRS architecture handle registers 12-15 in the same
way, the number of registers per set, R, will refer to the number
of user registers per set (corresponding to VAX registers 6-11).
In this example R is 6, but different sizes are used below. The
number of register sets, S, is the number of levels in the stacks.
If only the top of the stacks contains valid data, such as at the
beginning of a program, up to S-1 consecutive procedure calls
would not cause an overflow. In the work that follows both R
and S are varied. A particular configuration can be described as
MRS(R,S). Note that VAX corresponds to MRS(6,1) in terms of
the number of registers available.
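The register conventions above can be summarized in a few lines of Python. This is only an illustrative tally under the assumptions stated in the text (six global registers, an unstacked stack pointer, and three stacked system registers in every set); the names `register_cells` and `calls_before_overflow` are hypothetical and do not come from the paper.

```python
# Roles of the 16 VAX registers in VAX-MRS, as described in the text.
GLOBAL = tuple(range(0, 6))      # r0-r5: compiler temporaries, global, no stacks
USER = tuple(range(6, 12))       # r6-r11: tops of stacks (R = 6 in the example)
STACKED_SYSTEM = (12, 13, 15)    # AP, FP, PC: saved by CALLS, so given stacks
UNSTACKED_SYSTEM = (14,)         # SP: not saved by CALLS, so no stack


def register_cells(R, S):
    """Count the register storage cells of an MRS(R,S) configuration."""
    per_set = R + len(STACKED_SYSTEM)  # user registers plus AP, FP, PC
    return len(GLOBAL) + len(UNSTACKED_SYSTEM) + S * per_set


def calls_before_overflow(S):
    """Consecutive calls that cannot overflow when only the top level is valid."""
    return S - 1
```

As a sanity check, `register_cells(6, 1)` gives 16, matching the statement that a standard VAX corresponds to MRS(6,1).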
(cc). The ptrace facility allows a program to be traced in
single-step mode and allows an examination of the program's core image
at each step. An instruction is simulated at each step to produce
a trace consisting of all memory and register references as well as
function CALLs and RETURNs and size of space needed for each
function--number of parameters, number of registers, and size of
local space. The ten programs traced were:
PI        Pascal interpreter code translator;
NROFF     Text formatting program;
CCOM      Main part of C Compiler (cc);
TBL       Table formatting program;
EQN       Equation formatting program;
COMPACT   Adaptive Huffman code file compressor;
SORT      File sorter;
INDENT    C source program indenting program;
AS        Assembler used in C Compiler; and
YACC      Parser creator.
Each trace covered complete execution from start to finish for
each program. The programs were run on small input files.
However, the execution time was sufficiently long to reduce the
effects of initialization in the programs. A few summarizing
statistics are presented in Table 1. The numbers in the table
show the number of logical memory requests (Refs) and the
actual number of bytes transferred between the CPU and
memory.
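The two derived columns of Table 1 appear to follow directly from the raw counts. The short Python sketch below recomputes them for three of the programs, assuming that "Data Refs per Instruction" is data references divided by instruction references and that "Instructions per Call" is instruction references divided by calls; this interpretation is inferred from the numbers and is not stated explicitly in the paper.

```python
# (instruction refs, data refs, calls) for three rows of Table 1.
TABLE1 = {
    "PI": (365132, 446133, 12956),
    "CCOM": (521280, 485246, 12582),
    "SORT": (39567, 44397, 812),
}


def derived_columns(instr_refs, data_refs, calls):
    """Return (data refs per instruction, instructions per call), rounded as in Table 1."""
    return round(data_refs / instr_refs, 2), round(instr_refs / calls, 1)


for name, counts in TABLE1.items():
    print(name, *derived_columns(*counts))
```

For PI this reproduces the tabulated 1.22 data references per instruction and 28.2 instructions per call.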
Table 2 shows the overflow generated by the CALLs and
RETURNs of each program for an MRS with the indicated
number of sets. Data for overflow are presented in terms of what
fraction of procedure calls cause an overflow for each number of
sets. In this table "0" means exactly zero, and "0.0" means a small
number greater than zero. In general, the overflow frequency for
this workload is less than the overflow measured by Huguet and
Lang [HuLa85a] and greater than measured by Tamir and Séquin
[TaSe83].
The operation of the VAX-MRS is like that described for the
Parallel Stack Architecture. On a CALL, the stacks are pushed,
saving the current registers and allocating a new set of registers.
The actual PUSH occurs before parameters are passed, allowing
access to variables in registers and allowing parameters to be
passed in registers. The stack is popped on a RETURN. As in
VAX, register 0 is used for the return value. In the event of an
overflow, the bottoms of the stacks are written to memory by
trapping to a software routine. A similar method on underflow
reads memory into the processor registers. Instructions analogous
to VAX's PUSHR and POPR do the actual moving to or from
memory. The argument count, condition handler, and register
mask-PSW words are stored in memory on VAX-MRS, just like
VAX, since they are not normally kept in registers. The VAX
has a mask used by the CALLS instruction to specify which
registers to save. VAX-MRS does not use a similar mask; a stack
register would be required to hold such a mask. Performance
would be improved over that presented below, especially for
large register sets, if a mask were to be used. The VAX-MRS
implementation studied here does not make use of the Return
Mask of the Parallel Stack Architecture.
4. MEASUREMENT METHOD
It is interesting to compare the overflow frequency with the
program's procedure nesting depth. Table 3 gives the frequency
of CALLs that occur at a nesting depth greater than or equal to
the indicated depth (top level is at depth 0). The "Maximum
Depth" is the total number of different nesting levels in the
program's execution. It is also the minimum number of register
sets needed so that there is no overflow. Comparing overflow and
nesting data indicates that overflow frequency is far less than the
nesting depth frequency. This is due to locality of execution among
the procedures within a narrow range of nesting at any nesting depth.
The figures in Table 3 can be interpreted in another way. Use of
a different overflow/underflow strategy, where data (if available)
are loaded at every RETURN, would result in overflow
frequencies equal to the nesting depth frequencies of Table 3.
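The difference between overflow frequency and nesting depth frequency can be reproduced on a synthetic call/return sequence. The Python sketch below is an illustration only: the function name and the trace encoding ("C" for a call, "R" for a return) are hypothetical, and it assumes that a call counts as being "at depth d" when the called procedure runs at depth d, which is the reading under which the maximum depths in Table 3 agree with the zero entries in Table 2.

```python
def overflow_and_depth(events, num_sets):
    """Return (overflow fraction, fraction of calls at depth >= num_sets).

    events is a string of 'C' (call) and 'R' (return) characters.
    """
    depth = 0        # nesting depth of the running procedure (top level is 0)
    resident = 1     # register sets currently holding valid data
    calls = overflows = deep_calls = 0
    for event in events:
        if event == "C":
            calls += 1
            if resident == num_sets:
                overflows += 1       # bottom set is written to memory
            else:
                resident += 1
            depth += 1
            if depth >= num_sets:
                deep_calls += 1
        else:
            depth -= 1
            if resident > 1:
                resident -= 1
            # else: underflow, the caller's set is reloaded and stays resident
    return overflows / calls, deep_calls / calls


# Two nested calls, then four call/return pairs at the same depth, then unwind.
trace = "CC" + "CR" * 4 + "RR"
print(overflow_and_depth(trace, num_sets=2))
```

In this synthetic trace five of the six calls occur at depth 2 or deeper, but only two of them overflow, because repeated calls and returns around one nesting level reuse the register set freed by the previous return.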
4.1. Program Measurements
Measurements were made by tracing program execution on a
VAX 11/780 running Unix version 4.2BSD and the C compiler
Table 1. Program Characteristics.
          Memory Traffic
          Instructions        Data                Total               Data Refs per           Instructions
Name      Refs      Bytes     Refs      Bytes     Refs      Bytes     Instruction   Calls     per Call
PI        365132    1360454   446133    1642192   811265    3002646   1.22          12956     28.2
NROFF     270599    1085987   306958    1173333   577557    2259320   1.13          9913      27.3
CCOM      521280    1712723   485246    1826378   1006526   3539101   0.93          12582     41.4
TBL       185506    620455    171856    621622    357362    1242077   0.93          4164      44.6
EQN       138007    492427    116756    394407    254763    886834    0.85          2396      57.6
COMPACT   265079    1091705   271085    991020    536164    2082725   1.02          2045      129.6
SORT      39567     128705    44397     151140    83994     279845    1.12          812       48.7
INDENT    124980    500863    97322     318497    222302    819360    0.78          1160      107.7
AS        156429    580866    147541    485354    303970    1066220   0.94          2601      60.1
YACC      337699    1268428   343626    1311620   681325    2580048   1.02          2659      127.0
4.2. Mapping to Multiple Register Sets
After the execution trace is made, the trace is mapped into
the trace that would have resulted from tracing VAX-MRS. The
mapping program takes as inputs the number of sets, the number
of registers per set, and the original memory and register trace.
References to local variables, parameters, and user registers are
mapped to registers or memory as follows.
The number of instructions is changed due to several
factors. Due to the division of a procedure call into two
instructions, PUSH and CALL, one extra instruction is required for each
call and for each return. Local space is allocated in VAX by
changing the stack pointer. The instruction to do this is not
needed if all variables fit in registers. Since VAX-MRS does not
use a register-save mask on call, the reference to this is
eliminated from the trace. Parameters allocated in registers in VAX
are passed in memory then loaded into registers after the call. In
VAX-MRS this loading is not necessary since the parameter will
either have been passed in a register or, if there were too few
registers, passed in memory and never assigned to a register. In
addition, instruction sizes can change due to the use of different
addressing modes to access variables. One common place where
instruction sizes change is in accessing local variables. In VAX, a
local variable is referenced as an offset from the frame pointer (2
bytes). On VAX-MRS this may become a register reference (1
byte). The mapping process changes instruction addresses to
reflect the different number of instructions and changes in
addressing modes, but the changed addresses are not relevant for
the performance results presented here.
Within a function, registers are allocated first to parameters,
next to register-declared local variables, and finally to other local
variables. If the number of registers per set is not large enough to
hold the above variables, the remaining variables are assigned to
memory. This arbitrary mapping is used for simplicity: an
intelligent mapping for the MRSs could be done to keep the most
frequently used parameters and variables in registers and the rest in
memory. Therefore, the performance results reported here will
be somewhat pessimistic. References to stack addresses and user
registers in the traced program are changed to reflect the new
distribution of variables in registers and memory. Parameters are
identified by their offset from the argument pointer. Local
variables are identified by their longword (4-byte) address and offset
from the frame pointer. This means that up to four characters
could be assigned to a single register and that local array elements
could be assigned to a register. This would not normally be
done on a programmed VAX-MRS, but was done to simplify the
mapping.
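The fixed allocation order described above is simple enough to state as code. The following Python sketch is a hypothetical reconstruction, not the authors' mapping program: the function name `assign_variables` and its list-based interface are assumptions, and each variable is treated as one longword-sized unit.

```python
def assign_variables(params, register_locals, other_locals, regs_per_set):
    """Split a function's variables into (in registers, in memory).

    Registers go first to parameters, then to register-declared locals,
    then to other locals; whatever does not fit is assigned to memory.
    """
    ordered = list(params) + list(register_locals) + list(other_locals)
    return ordered[:regs_per_set], ordered[regs_per_set:]


in_regs, in_mem = assign_variables(["a", "b"], ["i"], ["x", "y"], regs_per_set=4)
print(in_regs, in_mem)
```

Because the order ignores how often each variable is referenced, a rarely used parameter can displace a heavily used local, which is why the text describes the resulting performance figures as somewhat pessimistic.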
Accesses to intermediate-level variables (through
pointers) are mapped to a memory address regardless of whether
the variable is actually in a register or memory. These are
relatively rare. Parameter passing assigns parameters, in order, to
available registers. Conflicts within the registers at parameter
passing time are not taken into account. A conflict can result if a
variable from the calling function is needed as a parameter, the
register it is in is needed for parameter passing of a different
parameter, and the use of registers is such that an extra
register-to-register move is required for parameter passing. Global data
addresses are not changed.
5. RESULTS
Ten programs were traced and mapped into various
configurations of MRSs. Results were obtained for several
numbers of registers per set: 0, 2, 4, 6, 8, 10, 12, 14, and 16.
Recall that in addition to these registers, three other registers for
argument pointer, frame pointer, and program counter are
included in each register set and the stack pointer and six global
registers are also part of the current register set. For each of
these set sizes, 2, 4, 8, and 12 register sets were used. The results
for each program are plotted in Figure 2 (a-j). The graphs show
Table 2. Overflow Frequency of Programs.
Number    Overflow Frequency (%)
of Sets   PI     NROFF  CCOM   TBL    EQN    COMPACT  SORT   INDENT  AS     YACC
2         37.2   44.7   44.1   38.1   32.1   17.9     5.3    21.6    21.9   36.7
3         18.7   20.7   23.2   7.0    9.3    2.2      1.5    5.3     13.2   3.1
4         7.6    6.0    13.0   2.4    1.5    0.3      0.5    0.2     7.8    0.9
5         3.0    2.4    8.2    1.3    0.7    0.1      0.2    0       4.2    0.5
6         1.5    1.7    5.1    0.6    0.3    0        0.1    0       2.2    0.3
7         1.0    1.1    3.2    0.3    0.1    0        0      0       0.2    0.2
8         0.6    0.8    2.0    0.2    0.0    0        0      0       0.1    0.1
10        0.3    0.2    1.1    0.1    0      0        0      0       0      0
12        0.2    0.0    0.6    0.0    0      0        0      0       0      0
14        0.1    0      0.3    0      0      0        0      0       0      0
Table 3. Nesting Depth Frequency of Programs.
Nesting   Nesting Depth Frequency (%)
Depth     PI     NROFF  CCOM   TBL    EQN    COMPACT  SORT   INDENT  AS     YACC
2         99.9   98.6   100.0  99.9   99.9   50.4     97.8   47.3    99.2   99.5
3         99.8   94.2   99.9   99.0   97.7   5.7      94.7   10.3    97.3   81.3
4         73.4   56.7   96.3   97.6   67.7   3.8      53.1   0.7     40.3   49.2
5         49.7   37.1   85.3   78.0   26.0   1.8      12.8   0       32.3   20.6
6         39.0   20.7   77.6   45.8   4.0    0        0.4    0       16.9   0.6
7         30.5   15.6   72.3   19.6   0.4    0        0      0       10.9   0.4
8         27.8   12.7   67.3   9.5    0.0    0        0      0       2.5    0.2
10        20.7   3.4    55.5   2.1    0      0        0      0       0      0
12        11.2   0.4    34.9   0.1    0      0        0      0       0      0
14        6.1    0      17.1   0      0      0        0      0       0      0
Maximum
Depth     21     14     23     13     9      6        7      5       10     10
the number of registers per set increases to about 6 or 8, the
curves become level. For those programs with high overflow,
performance is lost by increasing the number of registers past
some point. The overhead of handling the overflow increases
with an increasing number of registers. For MRS(6,8) the
performance improvement is about 16%. Those programs with high
overflow and a large number of calls perform worse with 2
register sets than the SRS VAX, primarily because in this study
VAX-MRS did not use a register-save mask while the VAX has
such a mask.
the percent change in number of (logical) memory references for
each number of registers compared to the memory references in
the raw trace; a downward slope indicates decreasing memory
traffic and increasing performance. (The number of registers in
the raw trace was not changed, so all VAX-MRS configurations
are compared to the same VAX trace data.) Since MRSs are
intended to reduce processor-to-memory traffic, the change in the
number of memory requests is an appropriate measure of
performance. Using execution time as a measure of performance
introduces other processor characteristics that may be independent of
the register sets. Results were also obtained in terms of number
of bytes transferred at each MRS configuration. These results are
similar in form and are not shown here because the number of
requests is more important from an architectural point of view.
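The quantity plotted for each configuration is a relative change in reference counts, which can be written out explicitly. The Python sketch below assumes the conventional definition (mapped count minus raw count, divided by raw count); the function name is hypothetical and the counts in the example are round illustrative numbers, not measurements from the paper.

```python
def percent_change(vax_refs, mrs_refs):
    """Percent change in memory references from the raw VAX trace to a mapped trace.

    Negative values mean fewer memory references, i.e. better performance.
    """
    return 100.0 * (mrs_refs - vax_refs) / vax_refs


# Illustrative only: 1000 references in the raw trace, 840 after mapping.
print(percent_change(1000, 840))
```

Under this definition a configuration that removes 16 references out of every 100 is plotted at -16, and a configuration whose overflow handling adds references is plotted above zero.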
One can observe in the graphs of Figure 2 that as the
number of registers increases the rate of change in performance
improvement decreases (in absolute value). There are two
reasons for this. First, very few procedures can make use of a large
number of registers. As the number increases, the number of
procedures that use all the registers decreases; procedures
requiring fewer registers achieve no benefit from extra unused registers.
For programs with little overflow, two or three of the
curves for the greater numbers of sets are very close together. There is
usually little change going from 8 sets to 12. In general, when
[Figure 2, panels (a) PI, (b) NROFF, (c) CCOM, EQN, and (f) COMPACT: percent change in memory references (vertical axis) plotted against the number of registers (horizontal axis, 0 to 16), with one curve per number of register sets.]
Fig. 2. Change in Number of Memory References from VAX trace to VAX-MRS for several programs. Each graph
is for a variety of numbers of register sets and numbers of registers per set. The number of registers does
not include global registers or registers for argument pointer, frame pointer, stack pointer, and program
counter.
and may actually decrease performance. Table 4 shows the percent change compared to the SRS processor for instruction and
memory data references and for the total memory references and
bytes transferred.
One observes that the number of instructions increases
when using the MRS architecture. The factors changing the
number of instructions were discussed earlier in Section 4. In
addition to the changes due to mapping, overflow and underflow
increase the number of instructions. On overflow one instruction
handles the register saving and a few additional instructions are
required to change from the normal stack pointer to the overflow
stack pointer and back again. The change in number of instructions is a small positive change for all programs. The change in
memory traffic when measured in bytes is greater than memory
traffic measured in references primarily because of changes in the
sizes of instructions.
The second reason for the rate of change is the presence of
overflow. As the number of registers increases, more registers
must be saved on overflow. If the frequency of overflow is great
enough, performance decreases with more registers, as happens for
PI with 2 and 4 sets, for example. For programs with little
overflow, such as SORT, no significant change occurs for a large
number of registers. Should the graphs be extended for more
registers, all curves in which some overflow occurs would show a
decrease in performance at some point. If one compares the
overflow frequencies in Table 2 with the graphs, the greater
separation between curves occurs when the difference in overflow
for the numbers of sets is large.
6. DISCUSSION
6.1. Program Statistics with MRSs
Table 4 gives statistics for the ten programs similar to those
in Table 1. In this case, however, the data are for MRS(6,8). At
this point, for most of the programs, the vast majority of possible
performance gain is achieved. An increase in registers per set or
number of sets results in little further performance improvement.
The change in the number of data references shows a large
decrease, with changes ranging from 21% to 43%. Data can be
divided into three types: local variables and parameters,
call/return processing, and other (typically global data). References of the "other" type are always made to memory in both
VAX and VAX-MRS.
Since approximately 45% of data references are "other," the reductions in local and parameter references
and in register saving are substantial, greater than 60% in both
cases. The difference in the number of data references is
accounted for by three factors. Register saving to memory on a
procedure call is eliminated by the use of multiple register sets.
By using registers for variable allocation and parameter passing,
fewer references to memory are required throughout the program
in referencing these variables. These two factors reduce the number
of references compared to the SRS case. The third factor, the handling of overflow,
causes an increase in the number of references due to saving of
registers, but it is usually offset by a decrease in register saving in
SRS. Note the substantial drop in the "Data Per Instruction"
ratio when compared to Table 1.
[Figure 2 (continued), panels (g) SORT, (h) INDENT, (i) AS, (j) YACC: Percent Change versus Number of Registers, with curves for 2, 4, 8, and 12 register sets.]
Fig. 2. (Continued)
6.2. Empirical Formula
In order to better understand the behavior of MRSs and to
characterize it in a general form, an empirical formula was
derived from the graphs in Figure 2. The formula is based on the
behavior of the "average" program. Since no weighting of relative frequency of execution of programs is available, the program
statistics were averaged without weighting as follows. Statistics
from each program were normalized by the number of instructions executed for the program in the VAX trace. These results
were then averaged. This results in 1 instruction and 0.994 data
references for the SRS case and 1.029 instructions and 0.647 data
references for MRS(6,8). This same procedure can be followed
for each combination of registers per set and number of sets.

Table 4. Program Characteristics on VAX-MRS at MRS(6,8).
(Memory traffic columns; percentages are changes relative to SRS.)

Name      Instruction        Data               Total              Total               Data Refs per
          References         References         References         Bytes               Instruction
PI        384896 (+5.4%)     273014 (-38.8%)    657910 (-18.9%)    2338041 (-22.1%)    0.71
NROFF     287682 (+6.3%)     177152 (-42.2%)    464834 (-19.5%)    1758675 (-22.2%)    0.62
CCOM      540476 (+3.7%)     281572 (-42.0%)    822048 (-18.3%)    2731101 (-22.8%)    0.52
TBL       192907 (+4.0%)      97745 (-43.1%)    290652 (-18.7%)     935688 (-24.7%)    0.51
EQN       141377 (+2.4%)      83507 (-28.5%)    224884 (-11.7%)     757050 (-14.6%)    0.59
COMPACT   268963 (+1.5%)     213907 (-21.1%)    482870 (-9.9%)     1870942 (-10.2%)    0.80
SORT       39965 (+1.0%)      30221 (-31.9%)     70186 (-16.4%)     220158 (-21.3%)    0.76
INDENT    126939 (+1.6%)      70356 (-27.7%)    197295 (-11.2%)     702930 (-14.2%)    0.55
AS        159413 (+1.9%)     104149 (-29.4%)    263562 (-13.3%)     895295 (-16.0%)    0.65
YACC      340998 (+1.0%)     200115 (-41.8%)    541113 (-20.6%)    1911593 (-25.9%)    0.59
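The unweighted averaging described for Table 4 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' tooling. It assumes the SRS instruction count of each program can be recovered from the rounded percentage in the table, and it derives data references as total references minus instruction references. The names `TABLE4` and `unweighted_averages` are hypothetical.

```python
# Table 4 rows at MRS(6,8):
# (name, instruction refs, % change in instructions vs. SRS, total memory refs)
TABLE4 = [
    ("PI",      384896, 5.4, 657910),
    ("NROFF",   287682, 6.3, 464834),
    ("CCOM",    540476, 3.7, 822048),
    ("TBL",     192907, 4.0, 290652),
    ("EQN",     141377, 2.4, 224884),
    ("COMPACT", 268963, 1.5, 482870),
    ("SORT",     39965, 1.0,  70186),
    ("INDENT",  126939, 1.6, 197295),
    ("AS",      159413, 1.9, 263562),
    ("YACC",    340998, 1.0, 541113),
]

# SRS case from the text: 1 instruction + 0.994 data references per instruction.
SRS_REFS_PER_INSTR = 1.0 + 0.994


def unweighted_averages(rows):
    """Return (instructions, data refs) per SRS instruction, averaged over programs."""
    instr_ratios, data_ratios = [], []
    for _name, mrs_instr, pct_change, total_refs in rows:
        # Assumption: recover the SRS instruction count from the rounded percentage.
        srs_instr = mrs_instr / (1.0 + pct_change / 100.0)
        # Assumption: data references = total references - instruction references.
        data_refs = total_refs - mrs_instr
        instr_ratios.append(mrs_instr / srs_instr)
        data_ratios.append(data_refs / srs_instr)
    n = len(rows)
    return sum(instr_ratios) / n, sum(data_ratios) / n


avg_instr, avg_data = unweighted_averages(TABLE4)
relative_change = (avg_instr + avg_data) / SRS_REFS_PER_INSTR - 1.0
print(f"instructions per SRS instruction: {avg_instr:.3f}")
print(f"data refs per SRS instruction:    {avg_data:.3f}")
print(f"change in memory references:      {relative_change:+.1%}")
```

Because each program contributes one normalized ratio regardless of its length, a short trace such as SORT counts as much as a long one such as CCOM. The results should land close to the 1.029 instructions and 0.647 data references quoted in the text, with small differences due to the rounded percentages.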
local variables or parameters be needed their values are still
available and the two methods may still be equivalent. Some
extra register-to-register moves may be needed if there are
conflicts between incoming and outgoing parameters. In the
Parallel Stack Architecture some parameters may not have to be
moved if the same variable is an incoming and an outgoing
parameter using the same register.
The basic form of the formula should have a term dependent only on the number of registers for references with an
infinite number of sets. A second term is the product of the
overflow and underflow handling overhead times the frequency
of overflow. The overhead depends on the number of registers
while the overflow frequency depends on the number of sets.
The number of memory references per SRS instruction is

    (0.484 e^{-0.261R} + 1.58) + (38 + 2R)(0.00241) e^{(-.[?]9S + [illegible]/2)},

where R is the number of registers and S is the number of sets.
The results from this equation can be compared to the results
from SRS of 1.994 to determine the relative change. Observe that
the first term decays to some minimum number of references as
registers are added. Figure 3 shows the change in memory references of the "average" program and the change derived from the
empirical formula. Since the equation is derived from
unweighted results no claim can be made on the applicability to a
specific workload; however, the general form should be similar.
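The decay of the first term toward a floor can be checked numerically. The sketch below evaluates only that term, that is, the limit of an infinite number of sets with no overflow. It assumes the garbled exponent in this copy reads -0.261R, which is an interpretation and not something the surviving text confirms. The overflow term is omitted because its exponent is not legible here. The function names are hypothetical.

```python
import math

SRS_REFS_PER_INSTR = 1.994  # SRS memory references per instruction, from the text


def infinite_set_refs(registers):
    """First term only: references per SRS instruction with no overflow.

    Assumption: the exponent is -0.261 * R (reconstructed from a garbled copy).
    """
    return 0.484 * math.exp(-0.261 * registers) + 1.58


def relative_change(registers):
    """Fractional change in memory references versus SRS, ignoring overflow."""
    return infinite_set_refs(registers) / SRS_REFS_PER_INSTR - 1.0


for r in (2, 6, 10, 14):
    refs = infinite_set_refs(r)
    print(f"R={r:2d}: {refs:.3f} refs/instr, {relative_change(r):+.1%} vs. SRS")
```

Under this reading the term approaches 1.58 references per SRS instruction as R grows, so the improvement over SRS cannot exceed roughly 21% however many registers are added. The overflow term can only move a finite configuration away from that bound.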
The different configurations of the register file can also be
compared using the total number of registers as the independent
variable instead of number of registers per set. Table 5 lists the
register configuration that performs best for register files of
different sizes. Performance is the change in memory traffic compared to SRS as indicated by the formula. The sizes refer to the
total number of stack registers (including argument pointer,
frame pointer, and program counter) times the number of sets.
The size of configurations is at most the "Maximum Available
Registers."
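The size rule in the preceding paragraph can be made concrete with a short check. The sketch below reads the rule as (registers per set + 3) times the number of sets, where the 3 covers the argument pointer, frame pointer, and program counter. That reading is an assumption, as are the row pairings, which are taken from Table 5 as it appears in this copy. The names `TABLE5` and `register_file_size` are hypothetical.

```python
# (maximum available registers, registers per set, number of sets), read from Table 5
TABLE5 = [
    (36, 6, 4),
    (52, 10, 4),
    (68, 14, 4),
    (104, 10, 8),
    (136, 14, 8),
]

# Assumption: argument pointer, frame pointer, and program counter are
# replicated in every set and counted toward the register file size.
POINTER_REGISTERS = 3


def register_file_size(registers_per_set, num_sets):
    """Total stack registers across all sets, including the per-set pointers."""
    return (registers_per_set + POINTER_REGISTERS) * num_sets


for max_available, regs, sets in TABLE5:
    size = register_file_size(regs, sets)
    print(f"R={regs:2d}, S={sets}: {size:3d} registers (limit {max_available})")
```

Under this interpretation each listed configuration fills its register file exactly, which is consistent with the statement that a configuration's size is at most the maximum available.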
[Figure 3: Percent Change versus Register File Width for the "average" program.]
Fig. 3. Change in Number of Memory References from VAX
trace to VAX-MRS for an "average" program. Solid
lines are from average of measured programs and
dashed lines are from empirical formula.
Table 5. Optimal MRS Configuration for a Given Size Register File.

Maximum Available    Optimal MRS Configuration
Registers            Registers per Set    Number of Sets    Change vs. SRS
 36                   6                   4                 -13.2%
 52                  10                   4                 -16.1%
 68                  14                   4                 -16.8%
104                  10                   8                 -18.6%
136                  14                   8                 -19.8%

6.3. Other Issues
The MRS architecture described in this paper has complete
overlap of incoming and outgoing parameters and local variable
space, unlike RISC's overlapping windows. The principal contributions to performance improvement, reduction of procedure call
overhead and increased use of registers for variable allocation, are
similar for various types of MRS architectures. Since a PUSH in
the Parallel Stack Architecture frees all registers, performance in
that architecture and RISC are equivalent if no local variables or
incoming parameters are to be outgoing parameters. Should some
In this study the number of instructions is essentially
unchanged from SRS to MRS architecture. The average size of
instructions decreases due to the increased use of registers instead
of memory references because a shorter addressing mode can be
used. A comparison of two processors with fewer addressing
modes would show a smaller increase or a decrease in instruction
count from SRS to MRS. In the SRS case references to variables
in memory would first require a LOAD instruction before processing. This LOAD could be eliminated in the MRS case for local
variables now kept in registers. In both cases the number of
instructions would be greater than in the corresponding VAX
processors. Handling overflow would require more instructions if
a multiple-register save instruction were not available. Finally, a
simpler processor may have less overhead on a CALL instruction.
It is not necessary to keep three pointers into the memory stack.
Fewer stack registers would be required in addition to the
decrease in overhead. The saving of other information on a CALL
as in VAX may not be needed. The results shown here are
specific to the VAX architecture and other architectures may
show performance changes of different magnitudes.
7. CONCLUSIONS
There had been a lack of performance measurements of
multiple register set architectures that show the effect of the
register sets on overall processor performance for real workloads.
This paper compares results from several real programs on an
SRS and many MRS architectures. A decrease in memory references of approximately 16% can be expected on average. Since
VAX makes use of C-language register declarations, a comparison
to a processor that used no C-language register declarations
would show a greater improvement.