PERFORMANCE EVALUATION OF MULTIPLE REGISTER SETS

Richard J. Eickemeyer and Janak H. Patel
Computer Systems Group
Coordinated Science Laboratory
University of Illinois
1101 W. Springfield
Urbana, IL 61801

Abstract

In this paper a DEC VAX with multiple register sets is evaluated under many differently sized register sets. Both the number of register sets and the number of registers per set were varied. Performance, measured in terms of memory traffic, is compared to that of a standard VAX. Memory traffic is measured from many real program traces on the standard processor and from transformations of the trace for the multiple register set processors. Results are presented for each program; an empirical formula is derived which describes the average program's behavior. A decrease in memory references of approximately 16% can be expected using multiple register sets.

A goal of the research presented here is to determine the effect of MRSs on overall processor performance. To accomplish this goal one needs to compare processors with MRSs to processors without MRSs. The machines should be identical in other respects. While this may be an unfair comparison because a traditional architecture could use the chip area for other purposes instead of register sets, it is useful to compare MRS architectures, as well as other architectures, against a basic type of processor architecture. In this way one can observe the effects on performance of an individual architecture feature. While MRSs certainly reduce CALL overhead, the effect on overall performance has not been demonstrated.

With the specification of what to measure it becomes necessary to determine how to exercise the architectural feature, that is, how to drive the model. Of course, there are many issues involved in measuring performance.
Some of these are choice of programs to run, machines to run them on, and how to get results, i.e., simulation, modeling, or actual program execution. An important part of any performance study is the workload used: one cannot specify performance without describing the workload [Ferr78]. The workload should be representative of the type of work the processor will do. The workload used in this paper is a set of ten common C programs. To measure performance, address traces were collected for the programs while running on a VAX 11/780. Each trace was mapped into the trace that would have resulted from running the program on a modified VAX with multiple register sets. The change in memory traffic between an MRS processor and a single register set (SRS) processor is the basis for comparison.

1. INTRODUCTION

Two major issues have emerged in computer architecture since the introduction of the Berkeley Reduced Instruction Set Computer (RISC) [PaSe82]. One concerns the merits of using an instruction set consisting of simple, but fast instructions in contrast to the trend of using complex, but powerful instructions. A second issue is an organization to keep more data in fast memories such as registers and cache. In a multiple register set (MRS) architecture, such as RISC, several register sets are kept in a register file and each time a procedure is called, a new register set is allocated from the register file. Instead of storing registers in memory at each procedure call,
register contents are kept on chip, reducing both the overhead of a procedure call and the overall execution time. In most of these machines, the number of registers allocated at each CALL is constant. A further reduction in memory traffic is obtained by using registers for passing parameters and for allocating local variables. A number of different authors have discussed using several register sets or multiple contexts in a large register set [Dann79, Site79, Lamp82, EiPa85, HuLa85b].

This paper deals with performance of multiple register set architectures. In the next section of this paper, previous performance results of MRS machines are summarized. Each is examined in terms of the goals of this paper; the deficiencies are described. One particular MRS architecture is described and an implementation of a VAX with MRSs is presented. The method used to evaluate the performance for this processor is given. Results are presented for several differently sized MRS configurations and a discussion of the results is presented. An empirical model for performance improvement using MRSs is derived.

This research was supported by the Semiconductor Research Corporation (SRC) under contract 86-12-109.

2. PREVIOUS RESEARCH

In one of the first RISC papers from Berkeley [PaSe82], RISC was compared to other processors, such as 68000 and VAX 11. Later papers make similar comparisons [Patt85]. Results were given showing the overall execution time for several different benchmarks on various processors.
However, these processors have different cycle times, instruction sets, and numbers of registers. The effect of MRSs was mixed with several other factors. Several different benchmarks were chosen so the overall workload may have been representative of typical systems workloads. The overall results, however, are inconclusive with regard to the specific contribution of MRS architectures on performance.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1987 ACM 0084-7495/87/0600-0264 $00.75

Other work has attempted to expand the results. In one such study, three machines were simulated with and without MRSs [HiSp85]. Results were given in terms of the number of words transferred between the processor and main memory, which is a good indication of overall processor performance in memory-saturated processors such as RISC I. Impressive results were obtained, and were claimed to demonstrate the performance gain by using MRSs. Furthermore, these results were simulations of actual routines produced by compilers. The routines chosen, however, were small, procedure intensive, and had few local variables or parameters per procedure. Thus, they were not typical of real workloads. Because MRS machines were specifically intended to handle procedure calls well, the type of program used in this study, dominated by procedure calls, should run well. However, other types of statements may run equally fast whether or not MRSs are present.
Whatever the case, the influence of other types of statements is overshadowed by the large number of procedure calls. Because the amount of register space needed by these programs is small, all local variables can fit in registers. In real workloads, there may be many instances where this is not the case and performance is degraded.

In another project [HuLa85b], several different MRS organizations were compared. Results from three systems programs were presented. There was no comparison made to a standard type of architecture, and no overall performance results are presented, only register file statistics, such as overflow and underflow and number of words transferred. They did, however, present some data that may be useful in evaluating MRSs when context switches are present.

being the current register set. The bottom of each stack is accessible in the event that an overflow is imminent. Should this occur, the data can be read from the bottom of the stacks to memory. A stack underflow is generated when the top of the stacks is the only level with valid data and a POP occurs. The program counter has its own stack so it too is saved on a procedure call. The top of the program counter stack is the return address. When returning from a procedure the stacks are popped, bringing data up one level. In addition to the register stacks, a few global registers are also provided. These registers are not associated with stacks.

The use of the register stacks is demonstrated by a description of the call and return sequences. To call a procedure a PUSH instruction is issued. This instruction pushes all the register stacks. Should the bottom of the stacks contain valid data, a trap to a software routine occurs. This routine saves all the registers at the bottoms of the stacks in a memory stack.
Once overflow has been handled the stacks can then be pushed. Parameter passing takes place next. Parameters are moved to the appropriate register stack. Because a PUSH does not alter the contents of the tops of the stacks, these data are available for parameters if necessary. A CALL instruction causes a jump to the new procedure and the saving of the return address in a special stack register. In the called procedure, local variables may be allocated to stack registers that were not used for parameter passing. If the procedure has a large number of parameters or local variables, memory is needed and is used similarly to standard techniques.

To return from a procedure, a RETURN instruction causes a jump to the return address. A POP instruction restores the stack register contents for the resumption of execution. The use of a special register, the Return Mask, allows selective writing to the tops of stacks from the level below (not shown in the figure). This can be used to return multiple values such as in call-by-reference.

A complex model of an MRS machine was used to determine performance in another study [EiPa85]. The model was driven by statistics from systems programs and, hence, is representative of typical systems program workloads. A comparison was made between a machine with MRSs and an otherwise identical machine without MRSs. While it is easy to change the input parameters to a model, the program statistics for different types of programs are not always readily available. The performance model used was specific to one machine organization.

In summary, the research presented used several different methods to determine performance. In some cases, sample programs were run or simulated. These programs may or may not have been representative of the typical workloads that would be run on the processors.
Some of the studies did not present overall results, but results related only to procedure calls. These results must first be put into a context of a complete program to be meaningful. In some cases other architectural issues were included so MRS effects could not be isolated. This paper addresses the problem of obtaining overall results on MRS effects in real programs.

Interrupts are handled as procedure calls and use the stacks similarly. A context switch requires all register sets (in use) to be saved. This overhead is not included in the performance study which follows; however, since register saving is only a small part of the context-switch overhead in most large operating systems, a context switch on an MRS processor should not be much more expensive than on an SRS processor.

3. A PARALLEL STACK VAX-MRS

3.1. Parallel Stack Architecture

Address traces from a VAX are mapped into a VAX-MRS (VAX with multiple register sets). The register set organization of VAX-MRS is that of the Parallel Stack Architecture (similar to the Parallel Stack Processor [EiPa85]). Figure 1 shows the basic Parallel Stack Architecture. Register file storage is organized as a number of stacks operating in parallel, one for each processor register. The register file is a collection of hardware stacks (shift registers). The top of each register stack is a processor register addressed as in standard processors. On a procedure call, all register stacks are pushed simultaneously, saving the contents in the next level of the stacks. At the same time all lower levels are moved to the next lower level. The tops of the stacks are now available for the next procedure.
Thus each level of the parallel stack constitutes a distinct register set, the top

Fig. 1. Parallel Stack Architecture: RA: Return Address; PSW: Program Status Word, including the program counter.

3.2. Conventions for Register Usage

In the VAX there are 16 registers: six (0-5) are used for temporary space in the C compiler, six (6-11) are used for user declared register allocation, and four (12-15) are used by the system for argument pointer, frame pointer, stack pointer, and program counter [LeEc80]. To modify a VAX to a VAX-MRS the following changes are made. Registers 0-5 of VAX are unchanged; these are the global registers in VAX-MRS. VAX registers 6-11 become the tops of stacks in VAX-MRS. Because the argument pointer, frame pointer, and program counter (VAX registers 12, 13, and 15, respectively) are saved in memory on VAX on a CALLS instruction, these three registers also have stacks. The stack pointer (register 14) is not saved in the VAX calling sequence so it was not given a stack.

Since all variations of the VAX-MRS architecture handle registers 12-15 in the same way, the number of registers per set, R, will refer to the number of user registers per set (corresponding to VAX registers 6-11). In this example R is 6, but different sizes are used below. The number of register sets, S, is the number of levels in the stacks. If only the top of the stacks contains valid data, such as at the beginning of a program, up to S - 1 consecutive procedure calls would not cause an overflow. In the work that follows both R and S are varied. A particular configuration can be described as MRS(R,S).
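As a quick check on the conventions above, the register storage implied by a configuration MRS(R,S) can be counted. The sketch below is illustrative only: the function names are hypothetical, and the count assumes exactly what the text states, namely six unstacked global registers, an unstacked stack pointer, and S levels that each hold R user registers plus the argument pointer, frame pointer, and program counter.

```python
def registers_in_file(regs_per_set, num_sets, global_regs=6):
    """Total registers held on chip by an MRS(R,S) configuration.

    Assumes each stack level holds R user registers plus AP, FP, and PC,
    and that the global registers and the stack pointer are not stacked.
    """
    per_level = regs_per_set + 3          # user registers + AP, FP, PC
    unstacked = global_regs + 1           # global registers + stack pointer
    return num_sets * per_level + unstacked


def calls_before_overflow(num_sets):
    """Consecutive calls possible when only the top level holds valid data."""
    return num_sets - 1


print(registers_in_file(6, 1))    # 16, the register count of a standard VAX
print(registers_in_file(6, 8))    # 79
print(calls_before_overflow(8))   # 7
```

Under these assumptions a single-level configuration with R = 6 holds the same 16 registers as the standard VAX.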
Note that VAX corresponds to MRS(6,1) in terms of the number of registers available.

(cc). The ptrace facility allows a program to be traced in single-step mode and allows an examination of the program's core image at each step. An instruction is simulated at each step to produce a trace consisting of all memory and register references as well as function CALLs and RETURNs and the size of space needed for each function: number of parameters, number of registers, and size of local space. The ten programs traced were:

PI        Pascal interpreter code translator;
NROFF     Text formatting program;
CCOM      Main part of C Compiler (cc);
TBL       Table formatting program;
EQN       Equation formatting program;
COMPACT   Adaptive Huffman code file compressor;
SORT      File sorter;
INDENT    C source program indenting program;
YACC      Parser creator; and
AS        Assembler used in C Compiler.

Each trace covered complete execution from start to finish for each program. The programs were run on small input files. However, the execution time was sufficiently long to reduce the effects of initialization in the programs. A few summarizing statistics are presented in Table 1. The numbers in the table show the number of logical memory requests (Refs) and the actual number of bytes transferred between the CPU and memory.

Table 2 shows the overflow generated by the CALLs and RETURNs of each program for an MRS with the indicated number of sets. Data for overflow are presented in terms of what fraction of procedure calls cause an overflow for each number of sets.
In this table "0" means exactly zero, and "0.0" means a small number greater than zero. In general, the overflow frequency for this workload is less than the overflow measured by Huguet and Lang [HuLa85a] and greater than that measured by Tamir and Séquin [TaSe83].

The operation of the VAX-MRS is like that described for the Parallel Stack Architecture. On a CALL, the stacks are pushed, saving the current registers and allocating a new set of registers. The actual PUSH occurs before parameters are passed, allowing access to variables in registers and allowing parameters to be passed in registers. The stack is popped on a RETURN. As in VAX, register 0 is used for the return value. In the event of an overflow, the bottoms of the stacks are written to memory by trapping to a software routine. A similar method on underflow reads memory into the processor registers. Instructions analogous to VAX's PUSHR and POPR do the actual moving to or from memory.

The argument count, condition handler, and register mask-PSW words are stored in memory on VAX-MRS, just like VAX, since they are not normally kept in registers. The VAX has a mask used by the CALLS instruction to specify which registers to save. VAX-MRS does not use a similar mask; a stack register would be required to hold such a mask. Performance would be improved over that presented below, especially for large register sets, if a mask were to be used. The VAX-MRS implementation studied here does not make use of the Return Mask of the Parallel Stack Architecture.

4. MEASUREMENT METHOD

It is interesting to compare the overflow frequency with the program's procedure nesting depth. Table 3 gives the frequency of CALLs that occur at a nesting depth greater than or equal to the indicated depth (top level is at depth 0). The "Maximum Depth" is the total number of different nesting levels in the program's execution. It is also the minimum number of register sets needed so that there is no overflow. Comparing overflow and nesting data indicates that the overflow frequency is far less than the nesting depth frequency. This is due to locality of execution among the procedures within a narrow range of nesting at any nesting depth.

The figures in Table 3 can be interpreted in another way. Use of a different overflow/underflow strategy, where data (if available) are loaded at every RETURN, would result in overflow frequencies equal to the nesting depth frequencies of Table 3.

4.1. Program Measurements

Measurements were made by tracing program execution on a VAX 11/780 running Unix version 4.2BSD and the C compiler

Table 1. Program Characteristics.

                          Memory Traffic
Name        Instr.    Instr.      Data      Data     Total     Total   Data Refs   Calls    Instr.
              Refs     Bytes      Refs     Bytes      Refs     Bytes  per Instr.          per Call
PI          365132   1360454    446133   1642192    811265   3002646        1.22   12956      28.2
NROFF       270599   1085987    306958   1173333    577557   2259320        1.13    9913      27.3
CCOM        521280   1712723    485246   1826378   1006526   3539101        0.93   12582      41.4
TBL         185506    620455    171856    621622    357362   1242077        0.93    4164      44.6
EQN         138007    492427    116756    394407    254763    886834        0.85    2396      57.6
COMPACT     265079   1091705    271085    991020    536164   2082725        1.02    2045     129.6
SORT         39567    128705     44397    151140     83994    279845        1.12     812      48.7
INDENT      124980    500863     97322    318497    222302    819360        0.78    1160     107.7
AS          156429    580866    147541    485354    303970   1066220        0.94    2601      60.1
YACC        337699   1268428    343626   1311620    681325   2580048        1.02    2659     127.0

4.2. Mapping to Multiple Register Sets

After the execution trace is made, the trace is mapped into the trace that would have resulted from tracing VAX-MRS. The mapping program takes as inputs the number of sets, the number of registers per set, and the original memory and register trace. References to local variables, parameters, and user registers are mapped to registers or memory as follows.

The number of instructions is changed due to several factors. Due to the division of a procedure call into two instructions, PUSH and CALL, one extra instruction is required for each call and for each return. Local space is allocated in VAX by changing the stack pointer. The instruction to do this is not needed if all variables fit in registers. Since VAX-MRS does not use a register-save mask on call, the reference to this is eliminated from the trace. Parameters allocated in registers in VAX are passed in memory then loaded into registers after the call. In VAX-MRS this loading is not necessary since the parameter will either have been passed in a register or, if there were too few registers, passed in memory and never assigned to a register.
In addition, instruction sizes can change due to the use of different addressing modes to access variables. One common place where instruction sizes change is in accessing local variables. In VAX, a local variable is referenced as an offset from the frame pointer (2 bytes). On VAX-MRS this may become a register reference (1 byte). The mapping process changes instruction addresses to reflect the different number of instructions and changes in addressing modes, but the changed addresses are not relevant for the performance results presented here.

Within a function, registers are allocated first to parameters, next to register-declared local variables, and finally to other local variables. If the number of registers per set is not large enough to hold the above variables, the remaining variables are assigned to memory. This arbitrary mapping is used for simplicity: an intelligent mapping for the MRSs could be done to keep the most frequently used parameters and variables in registers and the rest in memory. Therefore, the performance results reported here will be somewhat pessimistic. References to stack addresses and user registers in the traced program are changed to reflect the new distribution of variables in registers and memory. Parameters are identified by their offset from the argument pointer. Local variables are identified by their longword (4-byte) address and offset from the frame pointer. This means that up to four characters could be assigned to a single register and local array elements could be assigned to a register.
These would not normally be done on a programmed VAX-MRS, but were done to simplify the mapping. Accesses to intermediate-level variables (through pointers) are mapped to a memory address regardless of whether the variable is actually in a register or memory. These are relatively rare.

Parameter passing assigns parameters, in order, to available registers. Conflicts within the registers at parameter passing time are not taken into account. A conflict can result if a variable from the calling function is needed as a parameter, the register it is in is needed for parameter passing of a different parameter, and the use of registers is such that an extra register-to-register move is required for parameter passing. Global data addresses are not changed.

5. RESULTS

Ten programs were traced and mapped into various configurations of MRSs. Results were obtained for several numbers of registers per set: 0, 2, 4, 6, 8, 10, 12, 14, and 16. Recall that in addition to these registers, three other registers for argument pointer, frame pointer, and program counter are included in each register set and the stack pointer and six global registers are also part of the current register set. For each of these set sizes, 2, 4, 8, and 12 register sets were used. The results for each program are plotted in Figure 2 (a-j). The graphs show

Table 2. Overflow Frequency of Programs.
Number                              Overflow Frequency
of Sets       PI   NROFF    CCOM     TBL     EQN COMPACT    SORT  INDENT      AS    YACC
 2         37.2%   44.7%   44.1%   38.1%   32.1%   17.9%    5.3%   21.6%   21.9%   36.7%
 3         18.7    20.7    23.2     7.0     9.3     2.2     1.5     5.3    13.2     3.1
 4          7.6     6.0    13.0     2.4     1.5     0.3     0.5     0.2     7.8     0.9
 5          3.0     2.4     8.2     1.3     0.7     0.1     0.2     0       4.2     0.5
 6          1.5     1.7     5.1     0.6     0.3     0       0.1     0       2.2     0.3
 7          1.0     1.1     3.2     0.3     0.1     0       0       0       0.2     0.2
 8          0.6     0.8     2.0     0.2     0.0     0       0       0       0.1     0.1
10          0.3     0.2     1.1     0.1     0       0       0       0       0       0
12          0.2     0.0     0.6     0.0     0       0       0       0       0       0
14          0.1     0       0.3     0       0       0       0       0       0       0

Table 3. Nesting Depth Frequency of Programs.

Nesting                           Nesting Depth Frequency
Depth         PI   NROFF    CCOM     TBL     EQN COMPACT    SORT  INDENT      AS    YACC
 2         99.9%   98.6%  100.0%   99.9%   99.9%   50.4%   97.8%   47.3%   99.2%   99.5%
 3         99.8    94.2    99.9    99.0    97.7     5.7    94.7    10.3    97.3    81.3
 4         73.4    56.7    96.3    97.6    67.7     3.8    53.1     0.7    40.3    49.2
 5         49.7    37.1    85.3    78.0    26.0     1.8    12.8     0      32.3    20.6
 6         39.0    20.7    77.6    45.8     4.0     0       0.4     0      16.9     0.6
 7         30.5    15.6    72.3    19.6     0.4     0       0       0      10.9     0.4
 8         27.8    12.7    67.3     9.5     0.0     0       0       0       2.5     0.2
10         20.7     3.4    55.5     2.1     0       0       0       0       0       0
12         11.2     0.4    34.9     0.1     0       0       0       0       0       0
14          6.1     0      17.1     0       0       0       0       0       0       0
Maximum
Depth        21      14      23      13       9       6       7       5      10      10

the number of registers per set increases to about 6 or 8, the curves become level. For those programs with high overflow, performance is lost by increasing the number of registers past some point. The overhead of handling the overflow increases with an increasing number of registers. For MRS(6,8) the performance improvement is about 16%. Those programs with high overflow and a large number of calls perform worse with 2 register sets than the SRS VAX, primarily because in this study VAX-MRS did not use a register-save mask while the VAX has such a mask.
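The relation between the overflow frequencies of Table 2 and the nesting depth frequencies of Table 3 can be illustrated with a small simulation. The following is a minimal sketch, not the mapping program used in the study: the function names and the 'C'/'R' trace encoding are hypothetical, a call's depth is assumed to be the depth of the called procedure, and one level is spilled or reloaded per trap.

```python
def overflow_count(trace, num_sets, eager_restore=False):
    """Count calls that overflow a register file with num_sets levels.

    trace is a string of 'C' (CALL) and 'R' (RETURN) events.
    eager_restore=True models reloading a saved level at every RETURN.
    """
    in_regs = 1      # levels with valid data; the top level is the current set
    in_memory = 0    # levels spilled to the memory stack
    overflows = 0
    for event in trace:
        if event == "C":
            if in_regs == num_sets:      # bottom level is valid: spill it first
                in_regs -= 1
                in_memory += 1
                overflows += 1
            in_regs += 1                 # PUSH allocates a new top level
        else:
            in_regs -= 1                 # POP discards the top level
            if in_memory and (eager_restore or in_regs == 0):
                in_memory -= 1           # underflow handling: reload one level
                in_regs += 1
    return overflows


def calls_at_depth(trace, min_depth):
    """Count calls whose called procedure runs at depth >= min_depth."""
    depth = 0
    count = 0
    for event in trace:
        if event == "C":
            depth += 1
            if depth >= min_depth:
                count += 1
        else:
            depth -= 1
    return count


# Three nested calls, then five call/return pairs at the same level.
trace = "CCC" + "CR" * 5 + "RRR"
for sets in (2, 4):
    print(sets, overflow_count(trace, sets),
          overflow_count(trace, sets, eager_restore=True),
          calls_at_depth(trace, sets))
# 2 3 7 7
# 4 1 5 5
```

In this toy trace, repeated calls at one nesting level overflow only once when registers are reloaded on underflow alone, whereas reloading at every RETURN makes every sufficiently deep call overflow, which is the count given by the nesting depth.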
the percent change in number of (logical) memory references for each number of registers compared to the memory references in the raw trace; a downward slope indicates decreasing memory traffic and increasing performance. (The number of registers in the raw trace was not changed, so all VAX-MRS configurations are compared to the same VAX trace data.) Since MRSs are intended to reduce processor-to-memory traffic, the change in the number of memory requests is an appropriate measure of performance. Using execution time as a measure of performance introduces other processor characteristics that may be independent of the register sets. Results were also obtained in terms of number of bytes transferred at each MRS configuration. These results are similar in form and are not shown here because the number of requests is more important from an architectural point of view.

One can observe in the graphs of Figure 2 that as the number of registers increases the rate of change in performance improvement decreases (in absolute value). There are two reasons for this. First, very few procedures can make use of a large number of registers. As the number increases, the number of procedures that use all the registers decreases; procedures requiring fewer registers achieve no benefit from extra unused registers.

For programs with little overflow, two or three of the curves of greater numbers of sets are very close together. There is usually little change going from 8 sets to 12.
In general, when

[Figure 2, panels (a)-(f): plots of Percent Change versus Number of Registers, one curve per number of register sets. Legible panel titles: (a) PI, (b) NROFF, (c) CCOM, EQN, (f) COMPACT.]

Fig. 2. Change in Number of Memory References from VAX trace to VAX-MRS for several programs. Each graph is for a variety of numbers of register sets and numbers of registers per set. The number of registers does not include global registers or registers for argument pointer, frame pointer, stack pointer, and program counter.

and may actually decrease performance. Table 4 shows the percent change compared to the SRS processor for instruction and memory data references and for the total memory references and bytes transferred. One observes that the number of instructions increases when using the MRS architecture. The factors changing the number of instructions were discussed earlier in Section 4. In addition to the changes due to mapping, overflow and underflow increase the number of instructions. On overflow one instruction handles the register saving and a few additional instructions are required to change from the normal stack pointer to the overflow stack pointer and back again. The change in number of instructions is a small positive change for all programs.
The change in m e m o r y traffic w h e n measured in bytes is greater than m e m o r y traffic measured in references primarily because of changes in the sizes of instructions. The second reason for the rate of change is the presence of overflow, As the number of registers increases, more registers must be saved on overflow. If the frequency of overflow is great enough, performance decreases with more registers as happens for PI with 2 and 4 sets, for example. For programs with little overflow, such as SORT, no significant change occurs for a large number of registers. Should the graphs be extended for more registers, all curves in which some overflow occurs would show a decrease in performance at some Point. If one compares the overflow frequencies in Table 2 with the graphs, the greater separation between curves occurs w h e n the difference in overflow for the numbers of sets is large. 6. DISCUSSION 6.1. P r o g r a m Statistics w i t h MRSs Table 4 gives statistics for the ten programs similar to those in Table 1. In this case, however, the data are for MRS(6,8). At this point, for most of the programs, the vast majority of Possible performance gain is achieved. An increase in registers per set or number of sets results in little further performance improvement The change in the number of data references shows a large decrease, with changes ranging from 2 1 % to 43%. Data can be divided into three types: local variables and parameters. call/return processing, and other (typically global data). References of the "'other" type are always made to m e m o r y in both V A X and V A X - M R S . Since approximately 4 5 % of data references are "'other," the reduction in local and parameter references and in register saving are substantial--greater than 6 0 % in both cases. The difference in the number of data references is accounted for by three factors. Register saving to m e m o r y on a procedure call is eliminated by the use of multiple register sets. 
By using registers for variable allocation and parameter passing, fewer references to memory are required throughout the program in referencing these variables. These factors reduce the number of references compared to the SRS case. The handling of overflow causes an increase in the number of references due to saving of registers, but it is usually offset by a decrease in register saving in SRS. Note the substantial drop in the "Data Per Instruction" ratio when compared to Table 1.

[Fig. 2 (Continued), panels (g) SORT, (h) INDENT, (i) AS, and (j) YACC. Each panel plots Percent Change against Number of Registers (0 to 16), with one curve per number of register sets.]

Fig. 2. (Continued)

Table 4. Program Characteristics on VAX-MRS at MRS(6,8).
Memory traffic columns give counts at MRS(6,8), with the percent change from SRS in parentheses.

Name      Instruction        Data               Total              Total               Data Refs per
          References         References         References         Bytes               Instruction
PI        384896 (+5.4%)     273014 (-38.8%)    657910 (-18.9%)    2338041 (-22.1%)    0.71
NROFF     287682 (+6.3%)     177152 (-42.2%)    464834 (-19.5%)    1758675 (-22.2%)    0.62
CCOM      540476 (+3.7%)     281572 (-42.0%)    822048 (-18.3%)    2731101 (-22.8%)    0.52
TBL       192907 (+4.0%)     97745 (-43.1%)     290652 (-18.7%)    935688 (-24.7%)     0.51
EQN       141377 (+2.4%)     83507 (-28.5%)     224884 (-11.7%)    757050 (-14.6%)     0.59
COMPACT   268963 (+1.5%)     213907 (-21.1%)    482870 (-9.9%)     1870942 (-10.2%)    0.80
SORT      39965 (+1.0%)      30221 (-31.9%)     70186 (-16.4%)     220158 (-21.3%)     0.76
INDENT    126939 (+1.6%)     70356 (-27.7%)     197295 (-11.2%)    702930 (-14.2%)     0.55
AS        159413 (+1.9%)     104149 (-29.4%)    263562 (-13.3%)    895295 (-16.0%)     0.65
YACC      340998 (+1.0%)     200115 (-41.8%)    541113 (-20.6%)    1911593 (-25.9%)    0.59

6.2. Empirical Formula

In order to better understand the behavior of MRSs and to characterize it in a general form, an empirical formula was derived from the graphs in Figure 2. The formula is based on the behavior of the "average" program. Since no weighting of relative frequency of execution of programs is available, the program statistics were averaged without weighting as follows. Statistics from each program were normalized by the number of instructions executed for the program in the VAX trace. These results were then averaged. This results in 1 instruction and 0.994 data references for the SRS case and 1.029 instructions and 0.647 data references for MRS(6,8). This same procedure can be followed for each combination of registers per set and number of sets.

[...] local variables or parameters be needed, their values are still available and the two methods may still be equivalent. Some extra register-to-register moves may be needed if there are conflicts between incoming and outgoing parameters. In the Parallel Stack Architecture some parameters may not have to be moved if the same variable is an incoming and an outgoing parameter using the same register.

The basic form of the formula should have a term dependent only on the number of registers for references with an infinite number of sets. A second term is the product of the overflow and underflow handling overhead times the frequency of overflow. The overhead depends on the number of registers while the overflow frequency depends on the number of sets. The number of memory references per SRS instruction is

$$(0.484\,e^{-0.261R} + 1.58) + (38 + 2R)(0.00241)\,e^{(\ldots)}$$

[exponent of the second term illegible in this copy]

where R is the number of registers and S is the number of sets. The results from this equation can be compared to the results from SRS of 1.994 to determine the relative change. Observe that the first term decays to some minimum number of references as registers are added.
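The unweighted averaging described in this section can be retraced from the Table 4 entries. The following is a minimal sketch, not the authors' procedure verbatim. It assumes that each program's VAX instruction count can be recovered from the MRS(6,8) count and the rounded percent change, and that the YACC data-reference count is the total minus the instruction references (541113 - 340998 = 200115). The names `TABLE4`, `normalized_averages`, and `percent_change` are illustrative.

```python
# Table 4 rows at MRS(6,8): (name, MRS instruction references,
# percent change in instructions vs. SRS, MRS data references).
# The YACC data count is an assumption: total minus instruction references.
TABLE4 = [
    ("PI",      384896, 5.4, 273014),
    ("NROFF",   287682, 6.3, 177152),
    ("CCOM",    540476, 3.7, 281572),
    ("TBL",     192907, 4.0,  97745),
    ("EQN",     141377, 2.4,  83507),
    ("COMPACT", 268963, 1.5, 213907),
    ("SORT",     39965, 1.0,  30221),
    ("INDENT",  126939, 1.6,  70356),
    ("AS",      159413, 1.9, 104149),
    ("YACC",    340998, 1.0, 200115),
]

# SRS average from the text: 1 instruction + 0.994 data references.
SRS_REFS_PER_INSTR = 1.994


def normalized_averages(rows):
    """Unweighted means of instruction and data references, each program
    first divided by its instruction count in the original VAX trace."""
    instr_sum = data_sum = 0.0
    for _name, mrs_instr, pct_instr, mrs_data in rows:
        # Back out the VAX (SRS) instruction count from the percent change.
        srs_instr = mrs_instr / (1.0 + pct_instr / 100.0)
        instr_sum += mrs_instr / srs_instr
        data_sum += mrs_data / srs_instr
    return instr_sum / len(rows), data_sum / len(rows)


def percent_change(mrs_refs, srs_refs=SRS_REFS_PER_INSTR):
    """Relative change in memory references per SRS instruction."""
    return 100.0 * (mrs_refs - srs_refs) / srs_refs


avg_instr, avg_data = normalized_averages(TABLE4)
print(f"instructions per SRS instruction: {avg_instr:.3f}")
print(f"data refs per SRS instruction:    {avg_data:.3f}")
print(f"change vs. SRS: {percent_change(avg_instr + avg_data):.1f}%")
```

Up to rounding in the tabulated percentages, this should come out close to the 1.029 instructions and 0.647 data references quoted above, a change of roughly -16% against the SRS value of 1.994. If the constant in the first term of the formula is read as 1.58, `percent_change(1.58)` gives the limiting change with many registers and no overflow, about -20.8%.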
Figure 3 shows the change in memory references of the "average" program and the change derived from the empirical formula. Since the equation is derived from unweighted results, no claim can be made on the applicability to a specific workload; however, the general form should be similar.

[Fig. 3: Change plotted against Register File Width (0 to 16), with curves labeled by number of sets.]

Fig. 3. Change in Number of Memory References from VAX trace to VAX-MRS for an "average" program. Solid lines are from average of measured programs and dashed lines are from empirical formula.

The different configurations of the register file can also be compared using the total number of registers as the independent variable instead of number of registers per set. Table 5 lists the register configuration that performs best for register files of different sizes. Performance is the change in memory traffic compared to SRS as indicated by the formula. The sizes refer to the total number of stack registers (including argument pointer, frame pointer, and program counter) times the number of sets. The size of configurations is at most the "Maximum Available Registers."

Table 5. Optimal MRS Configuration for a Given Size Register File.

Maximum Available    Optimal MRS Configuration
Registers            Registers per Set    Number of Sets    Change vs. SRS
36                   6                    4                 -13.2%
52                   10                   4                 -16.1%
68                   14                   4                 -16.8%
104                  10                   8                 -18.6%
136                  14                   8                 -19.8%

6.3. Other Issues

The MRS architecture described in this paper has complete overlap of incoming and outgoing parameters and local variable space, unlike RISC's overlapping windows. The principal contributions to performance improvement, reduction of procedure call overhead and increased use of registers for variable allocation, are similar for various types of MRS architectures. Since a PUSH in the Parallel Stack Architecture frees all registers, performance in that architecture and RISC are equivalent if no local variables or incoming parameters are to be outgoing parameters. Should some [...]

In this study the number of instructions is essentially unchanged from SRS to MRS architecture. The average size of instructions decreases due to the increased use of registers instead of memory references because a shorter addressing mode can be used. A comparison of two processors with fewer addressing modes would show a smaller increase or a decrease in instruction count from SRS to MRS. In the SRS case references to variables in memory would first require a LOAD instruction before processing. This LOAD could be eliminated in the MRS case for local variables now kept in registers. In both cases the number of instructions would be greater than in the corresponding VAX processors. Handling overflow would require more instructions if a multiple-register save instruction were not available. Finally, a simpler processor may have less overhead on a CALL instruction. It is not necessary to keep three pointers into the memory stack. Fewer stack registers would be required in addition to the decrease in overhead. The saving of other information on a CALL as in VAX may not be needed. The results shown here are specific to the VAX architecture and other architectures may show performance changes of different magnitudes.

7. CONCLUSIONS

There had been a lack of performance measurements of multiple register set architectures that show the effect of the register sets on overall processor performance for real workloads. This paper compares results from several real programs on an SRS and many MRS architectures. A decrease in memory references of approximately 16% can be expected on average. Since VAX makes use of C-language register declarations, a comparison to a processor that used no C-language register declarations would show a greater improvement.

REFERENCES

[Dann79] R. B. Dannenberg, "An Architecture with Many Operand Registers to Efficiently Execute Block-Structured Languages," Proc. 6th Annu. Symp. Comput.
Arch., Apr. 1979, pp. 50-57.
[EiPa85] R. J. Eickemeyer and J. H. Patel, "A Parallel Stack Processor (PSP)," IEEE International Conference on Computer Design: VLSI in Computers, Port Chester, NY, Oct. 1985, pp. 473-476.
[Ferr78] D. Ferrari, Computer Systems Performance Evaluation, Englewood Cliffs, NJ: Prentice-Hall, 1978.
[HiSp85] C. Y. Hitchcock III and H. M. B. Sprunt, "Analyzing Multiple Register Sets," 12th Annu. Int. Symp. Comput. Arch., IEEE, Boston, MA, June 1985, pp. 55-63.
[HuLa85a] M. Huguet and T. Lang, "A C-Oriented Register Set Design," Proc. 29th Symp. on Mini and Microcomputers, Sant Feliu de Guixols, Catalonia, June 1985, pp. 182-189.
[HuLa85b] M. Huguet and T. Lang, "A Reduced Register File for RISC Architectures," Comput. Arch. News, vol. 27, no. 10, Sept. 1985, pp. 22-31.
[Lamp82] B. W. Lampson, "Fast Procedure Calls," Proc. Symp. Architectural Support for Prog. Lang. and Oper. Syst., ACM, Palo Alto, CA, Mar. 1982, pp. 66-76.
[LeEc80] H. M. Levy and R. H. Eckhouse, Jr., Computer Programming and Architecture--The VAX-11, Bedford, MA: Digital Press, 1980.
[Patt85] D. A. Patterson, "Reduced Instruction Set Computers," Comm. ACM, vol. 28, no. 1, Jan. 1985, pp. 8-21.
[PaSe82] D. A. Patterson and C. H. Séquin, "A VLSI RISC," Computer, vol. 15, no. 9, Sept. 1982, pp. 8-21.
[Site79] R. L. Sites, "How to Use 1000 Registers," Caltech Conference on VLSI, Jan. 1979, pp. 527-532.
[TaSe83] Y. Tamir and C. H. Séquin, "Strategies for Managing the RISC Register File," IEEE Transactions on Computers, vol. C-32, no. 11, November 1983, pp. 977-989.