Indian Institute of Science
Bangalore 560 012, India

Computer Aided Design Laboratory; Knowledge Based Computer Systems Laboratory, Supercomputer Education and Research Centre; Microprocessor Applications Laboratory, Supercomputer Education and Research Centre, Dept. of Computer Science and Automation

Abstract
The increase in the complexity of VLSI digital circuit design demands faster logic simulation techniques than those currently available. One of the ways of speeding up existing logic simulation algorithms is by exploiting the inherent parallelism of the sequential version. In this paper, we explore the possibility of mapping an event-driven logic simulation algorithm onto a cluster of processors interconnected by an Ethernet. The set of events at any simulation time step is partitioned by the Master Task (running on the host processor) among the Worker Tasks (running on the other processors). The partitioning scheme ensures a balanced load. Each Worker Task determines the circuit elements to be evaluated, evaluates them independently and comes up with the new event list, which is passed on to the Master Task. After receiving the event lists from all the Workers, the Master Task increments the simulation time step and computes the new event list partition for the next simulation cycle. We have implemented this distributed logic simulation on a cluster of 8 VAXstations using the CT package for distributed programming. The paper concludes with a note on the speedup figures obtained on the ISCAS benchmark circuits.

1 Introduction

Logic simulators play a major role in verifying the functionality of VLSI circuit designs. Existing logic simulators can be broadly classified into three types: compiled code [1], event driven [1] and T-Algorithm [2]. In the case of compiled code simulation, all elements are evaluated at each time step, and zero delay modelling of the elements is implied. The event-driven approach is more flexible compared to compiled code, as different delay models can be incorporated and an element is evaluated only when a logic value at its input changes. The T-algorithm is more efficient than the event-driven approach because it reduces the number of table lookups [2].

Most of the existing sequential techniques take a considerable amount of time to carry out logic simulation of VLSI circuits. Their execution times can be reduced by exploiting the inherent parallelism of the algorithm. The methods used to speed up the simulation are:

• tuning up the basic algorithm to be suitable for vector processing [3], and
• implementing the algorithm on special-purpose hardware simulation engines.

The first approach demands a large amount of computational power from the vector processors (high vectorization ratio, long vector length) [3]. The latter approach gives a good speedup, but suffers from the disadvantage that any change in the algorithm cannot be reflected back in the hardware easily, and the cost to performance ratio is high, thereby making it less attractive for speeding up the simulation.
The current trends in carrying out parallel logic simulation are concentrated on mapping the simulation algorithm onto the processors of general purpose parallel machines. We can either exploit the functional parallelism which is inherent in the algorithm or the data parallelism. In the former, the simulation task is partitioned and assigned to several functional units which operate in a pipelined fashion. One of the requirements is that the simulation task has to be divided into a fixed number of units in such a way that each functional unit takes almost the same amount of time to carry out the execution. The limitation of this approach is that it does not scale very easily as the number of processors is increased [8]. In the case of the latter, the circuit is partitioned into subcircuits, and each subcircuit is assigned to a processor which carries out the simulation independently. The performance of this approach depends mainly on the partitioning strategy. Partitioning of the circuit into subcircuits is a vital step in this method [8]. Secondly, synchronization of the simulation time step among the processors affects the speedup. Synchronization plays a major role in distributed logic simulation [9]. In our approach, we dynamically partition the event list at runtime, and this helps the CT package achieve an efficient load balance with lower communication overheads. This adds to a considerable increase in the speedup of the simulation process. This paper describes one possible approach of mapping the event-driven logic simulation algorithm onto a distributed computing system using the concept of a centralized simulation time maintained by the Master task.
Figure 1: The Master-Worker Tasking Abstraction on a Distributed System
2 The Event-Driven Simulation Algorithm
In a VLSI circuit, at any simulation time step, only a small portion of the entire circuit is active. Therefore it is inefficient to evaluate all the elements at each simulation step. Whenever the value of the output of an element changes, an event is said to be generated. This new event affects the gates which are connected to the output signal line; these elements have to be reevaluated. In other words, at each time step only those elements whose inputs have changed have to be reevaluated. This technique is called event-driven simulation. Event-driven simulators supporting more complex delay models require a special type of data structure to find the fanins of an element for evaluation and the fanouts for finding the potentially active elements [4]. In addition, a time flow mechanism is necessary to keep track of the time ordering of the event lists.
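To make these data structures concrete, the following fragment shows one possible layout in C. It is only an illustrative sketch: the names (Element, Event, event_queue, schedule_event) and the sizes are assumptions of the illustration, not the structures of our implementation.

    /* Hypothetical data layout for event-driven simulation (illustrative only):
     * each element keeps its fanin and fanout lists, and events are kept in
     * per-time-step buckets so they can be processed in time order. */
    #include <stdlib.h>

    #define MAX_FANIN    8
    #define MAX_FANOUT  16
    #define MAX_TIME  1024          /* length of the time wheel */

    typedef struct Element {
        int  id;
        int  output;                /* current logic value               */
        int  delay;                 /* propagation delay of this element */
        int  num_fanins,  fanins[MAX_FANIN];    /* driving elements      */
        int  num_fanouts, fanouts[MAX_FANOUT];  /* driven elements       */
    } Element;

    typedef struct Event {
        int element_id;             /* element whose output changes */
        int new_value;              /* value it changes to          */
        struct Event *next;
    } Event;

    /* One event list per simulation time step (a simple time wheel). */
    static Event *event_queue[MAX_TIME];

    /* Schedule a change of 'elem' to 'value' after the element's delay. */
    static void schedule_event(const Element *elem, int value, int now)
    {
        Event *ev = malloc(sizeof *ev);
        ev->element_id = elem->id;
        ev->new_value  = value;
        int t = (now + elem->delay) % MAX_TIME;
        ev->next = event_queue[t];
        event_queue[t] = ev;
    }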
At any time step, an event is picked from the event list and, using the event data, the element's logic value is updated to the present value. This value is propagated to all fanouts connected to this output, and the fanout elements are pushed into the appropriate evaluation stacks. This phase is called the fanout phase. In the element evaluation phase, the element is popped from the evaluation stack, the input logic values of the element are gathered and the element is evaluated to find the new logic output value. If the evaluated value is different from the existing value, a new event is generated and it is queued in the simulation time queue depending on the delay characteristic of the element. This simulation cycle is repeated for every increment in simulation time step, until the simulation period is completed.
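A sequential version of this cycle can be sketched as follows, building on the structures above. The sketch is for illustration only; evaluate_element stands in for the gate function lookup and, like MAX_EVAL, is a placeholder assumed here rather than part of the actual simulator.

    #define MAX_EVAL 4096           /* assumed bound on active elements per step */

    int evaluate_element(const Element *e, const Element circuit[]); /* placeholder */

    /* One simulation cycle at time step 'now' (sequential sketch).
     * Fanout phase: apply each queued event and push its fanouts for evaluation.
     * Evaluation phase: re-evaluate the affected elements and schedule new events. */
    static void simulate_one_step(Element circuit[], int now)
    {
        int eval_stack[MAX_EVAL];           /* elements to re-evaluate */
        int top = 0;

        /* ---- fanout phase ---- */
        for (Event *ev = event_queue[now]; ev != NULL; ev = ev->next) {
            Element *e = &circuit[ev->element_id];
            e->output = ev->new_value;      /* commit the scheduled value   */
            for (int i = 0; i < e->num_fanouts; i++)
                eval_stack[top++] = e->fanouts[i];   /* mark fanouts active */
        }

        /* ---- evaluation phase ---- */
        while (top > 0) {
            Element *e = &circuit[eval_stack[--top]];
            int new_value = evaluate_element(e, circuit);  /* gate function */
            if (new_value != e->output)
                schedule_event(e, new_value, now);  /* queue at now + delay */
        }
    }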
2.1 The Master-Worker Tasking Abstraction
We implement the parallel algorithm using the Master-Worker tasking paradigm of the CT kernel package. The CT Kernel [6] is a distributed programming package that extends C with the tasktype, a tasking abstraction. A tasktype [11] is an abstract datatype that encapsulates a set of data objects and defines operations on them. These operations form the sequential body of an instance of the tasktype activation, called the Task. Some of the operations within the body of the task allow the pertinent task to communicate with other tasks either synchronously or asynchronously. The communication interface of a task is uniform over the multiple processors and so is its namespace. A tasktype is different from Ada's task type but has some implementation similarities.

A task is explicitly created by declaring variables of the tasktype and then explicitly invoking it. On activation, a task executes sequentially the sequence of statements in the body of the tasktype definition. On completion of the execution the task terminates automatically, following a well-defined termination semantics. There can be multiple instances of a tasktype active at any time. Since each task is an independently executable unit, and given the bias towards message passing architectures, there cannot be any global (shared) variables between tasks. All function and procedure invocations from the body of a task have all the necessary parameters passed to them explicitly. However, the reentrant code for procedures, functions and constant declarations is shared between tasks.
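As an illustration of the abstraction, the fragment below merely mirrors the constructs used in the pseudo code of Figs. 2 and 3 and is not exact CT kernel syntax; echo_task, Message and process are hypothetical names. A task body communicates with its creator through tell and hear:

    /* Schematic use of the tasking abstraction, in the style of Figs. 2 and 3.
     * TASK_TYPE/END_TASK_TYPE delimit a tasktype body; new() activates an
     * instance; tell()/hear() pass messages; PARENT names the creating task. */
    TASK_TYPE echo_task()
        Message msg;                 /* data objects encapsulated by the tasktype */

        hear(PARENT, msg);           /* block until the creator sends a message   */
        process(msg);                /* sequential body of this task instance     */
        tell(PARENT, msg);           /* return the result to the creator          */
    END_TASK_TYPE

    /* In the creating task:
     *     tasktype echo_task worker;
     *     new(worker);              -- spawn an instance on some processor
     *     tell(worker, msg);        -- send it work
     *     hear(worker, msg);        -- receive the result                        */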
TASK master_task()
    int i, j;
    /* Declare the workers */
    tasktype worker_task worker[MAX_WORKERS];
    /* Workload for each worker */
    Worker_load event_set[MAX_WORKERS], total_event;
    Logic_sim_state state, incr_state, temp_vals[MAX_WORKERS];

    /* Spawn the worker tasks */
    for (i = 0; i < MAX_WORKERS; i++)
        new(worker[i]);
    initialize_master();

    for (i = 0; i < MAX_NO_SIMULATION_TIME_STEPS; get_next_sim_step(&i))
    {
        partition_event_list(total_event, event_set);
        /* Tell Workers about fanout computation */
        for (j = 0; j < MAX_WORKERS; j++)
            tell(worker[j], event_set[j]);
        /* Hear from Workers the results */
        for (j = 0; j < MAX_WORKERS; j++)
            hear(worker[j], temp_vals[j]);
        update_logic_sim_state(temp_vals, state, incr_state);
        /* Tell all workers - incremental state */
        tell(worker, incr_state);
        /* Hear from workers - evaluation results */
        for (j = 0; j < MAX_WORKERS; j++)
            hear(worker[j], event_set[j]);
        generate_total_events(event_set, total_event);
    }
    conclude();
    print_results();
END_TASK

Figure 2: The pseudo code for Master Task

TASK_TYPE worker_task()
    Worker_load event_set, temp_set, new_event_set;
    Logic_sim_state state, incr_state;

    /* Executed for every simulation time step */
    hear(PARENT, event_set);
    compute_fanouts(event_set, temp_set);              /* fanout phase     */
    tell(PARENT, temp_set);
    hear(PARENT, incr_state);
    update_sim_state_worker(incr_state, state);
    evaluate_circuit_elements(state, new_event_set);   /* evaluation phase */
    tell(PARENT, new_event_set);
END_TASK_TYPE

Figure 3: The pseudo code for Worker Task
In the CT kernel, we define a Master Task tasktype as one that is the first task to be initiated on the host processor and which in turn spawns a number of Worker Tasks. The Master task takes in the event list generated in every cycle and communicates it to the Workers. We define the Worker Task tasktype as one that takes in the event list communicated by the Master task, does the fanout and element evaluation and, after the computation of the new set of events, passes back this event list to the Master task. It then awaits the next set of events to be processed. We have a Master task and generate a number of Worker Tasks which are assigned to different processors as shown in Fig. 1. There can be more than two Workers per processor, as this increases the CPU utilization factor: while one Worker awaits the messages from the Master, the others can do the processing (but the logic simulation state is maintained on a per processor basis). We use the tell and hear communication statements inherited by every tasktype definition. Tasks can also be created as an array, as is done when the array of workers is specified in the Master task.

2.2 The Eventlist Partitioning Algorithm

When the computational load is not evenly distributed among the Worker tasks, some of the Workers may be idle, which in turn causes processor underutilization. The number of elements a Worker Task is going to evaluate depends on the number of fanouts of the elements present in the events assigned to it. Assuming an equal evaluation time for all types of elements, we can say that the partitioning can be carried out on the basis of the number of fanouts. The partitioning method used in our implementation gives an even balance of total fanouts across the Workers and is given below:
Step 1: Sort the elements of the current event list according to their fanout counts.
Step 2a: Assign one element to each Worker task (from the sorted order).

Step 2b: Subsequently assign each item of the event list to the Worker task having the lowest total fanout.

Step 2c: Continue until the event list for the current simulation time step is exhausted.
This way we ensure that all the workers are almost equally loaded with the fanout elements. This indirectly influences the load-balancing policies of the parallel algorithm.
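A minimal sketch of this greedy scheme, under the stated assumption of equal per-element evaluation time, is given below. The types, the descending sort order and the MAX_WORKERS value are assumptions of the illustration, not details of our implementation.

    /* Greedy fanout-balanced partitioning of one time step's event list
     * (illustrative sketch). Events are ordered by fanout count and each is
     * given to the worker whose accumulated fanout total is currently lowest. */
    #include <stdlib.h>

    #define MAX_WORKERS 8           /* assumed; one worker per processor */

    typedef struct { int element_id; int fanout_count; } EventRec;

    static int by_fanout_desc(const void *a, const void *b)
    {
        return ((const EventRec *)b)->fanout_count -
               ((const EventRec *)a)->fanout_count;
    }

    /* assign[i] receives the worker index chosen for events[i];
     * n_workers is assumed to be at most MAX_WORKERS. */
    static void partition_events(EventRec events[], int n_events,
                                 int assign[], int n_workers)
    {
        long total_fanout[MAX_WORKERS] = {0};   /* per-worker load so far */

        qsort(events, n_events, sizeof events[0], by_fanout_desc);  /* Step 1 */

        /* Steps 2a-2c: with all totals starting at zero, the first n_workers
         * events go one to each worker, then each event goes to the least
         * loaded worker until the list is exhausted. */
        for (int i = 0; i < n_events; i++) {
            int best = 0;
            for (int w = 1; w < n_workers; w++)
                if (total_fanout[w] < total_fanout[best])
                    best = w;
            assign[i] = best;
            total_fanout[best] += events[i].fanout_count;
        }
    }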
3 The Master and Worker Task Algorithm
In Figs. 2 and 3 we give the pseudo C code of the Master and Worker task definitions.

The essential aspects of this parallel algorithm are as follows. The state of the logic simulation at any time step is given by the state of the logic elements (some of which may be in a delay state). We duplicate the initial state of the logic simulation on each of the processors and, for each cycle of the simulation, the state of the logic simulation is updated on each processor and thus kept consistent globally. This helps avoid formulating the necessary set of logic elements into a message packet and sending it along with the partitioned eventlist to a worker task for simulation. The master task accepts the results of a simulation cycle and, after updating its state of logic simulation, broadcasts it to all the worker tasks (the relevant updated portion of the state). The pseudo code expresses this succinctly.
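The replicated state and the incremental update can be pictured as follows; the types and names (SimState, StateDelta, apply_state_delta) are hypothetical and serve only to illustrate how each processor's copy is kept consistent.

    /* Each processor holds a full copy of the logic simulation state; after
     * every cycle the master broadcasts only the changed element values, and
     * every task applies the same delta so all copies stay consistent. */
    #define MAX_ELEMENTS 65536      /* assumed circuit size bound */

    typedef struct {
        int values[MAX_ELEMENTS];        /* current logic value of every element */
    } SimState;

    typedef struct {
        int count;                       /* number of elements that changed      */
        int element_id[MAX_ELEMENTS];
        int new_value[MAX_ELEMENTS];
    } StateDelta;

    /* Applied identically by the master and by every worker after hear(). */
    static void apply_state_delta(SimState *state, const StateDelta *delta)
    {
        for (int i = 0; i < delta->count; i++)
            state->values[delta->element_id[i]] = delta->new_value[i];
    }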
Logic simulation is essentially divided into two phases of activity in every simulation cycle. These are the fanout phase, where the result of evaluation of a logic element is propagated to its fanout elements, and the evaluation phase, where the logic element is evaluated.

In the fanout phase, the events queued for the pertinent simulation time step are taken and their output values propagated as the input values of the corresponding fanout elements. These fanout elements are scattered across the processors so that the evaluation phase can be carried out. In the evaluation phase, those logic elements whose inputs were updated during the fanout phase are taken, functionally evaluated, and the new events are inserted into the appropriate simulation time queue, taking into account the delay characteristics.

The master task maintains the simulation time as well as the event list; the worker tasks carry out the actual fanout and evaluation phases in a distributed manner. The master task interacts with every worker task twice in each simulation cycle.

The initialization consists of reading in the circuit description and initializing the data structures representing the state of logic simulation. This is carried out by all tasks, master and workers, on a per processor basis.

As seen from the pseudo code, the master task partitions the eventlist of the pertinent simulation time step and tells all the workers their partitioned event sets. The workers carry out the fanout phase and later the master hears the results from each of the workers. The Master then updates the state of logic simulation and broadcasts the incremental state change to all the workers. The Workers, after hearing the state changes and updating their logic simulation state (on a per processor basis), carry out the element evaluation corresponding to the eventlist partitioned and assigned to them. After the completion of the evaluation phase, all workers tell the output values generated to the master task, which goes on to update the eventlist and initiate the simulation cycle for the next time step.
Observations:

The above method is not a very efficient solution in a distributed environment. It can be made efficient if we maintain the logic simulation state consistent across the processors efficiently in every simulation cycle. The communication costs per cycle are high as compared to the computation costs, and there is good scope for improvement.
In the present implementation, we reduce the communication costs by cutting down on the messages, which are sent only when needed. Thus, during the task assignment, if the number of events is less than optimal per task, then only a few of the worker tasks are chosen and sent the workload, and the rest of the worker tasks directly go to the next simulation time step.
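For example, the master's assignment step could skip idle workers along the following lines; this is a hypothetical fragment, and MIN_EVENTS_PER_WORKER, workers_for_cycle and the threshold value are not taken from our implementation.

    /* Use fewer workers when the current event list is small, so that workers
     * with no assigned events receive no message for this cycle (sketch only). */
    #define MIN_EVENTS_PER_WORKER 4      /* assumed threshold */

    static int workers_for_cycle(int n_events, int n_workers)
    {
        int useful = n_events / MIN_EVENTS_PER_WORKER;
        if (useful < 1)
            useful = 1;                  /* always keep at least one worker busy */
        return (useful < n_workers) ? useful : n_workers;
    }

    /* In the master's loop, messages then go only to the chosen workers:
     *     int active = workers_for_cycle(n_events, MAX_WORKERS);
     *     for (j = 0; j < active; j++)
     *         tell(worker[j], event_set[j]);
     * while the remaining workers simply proceed to the next time step.        */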
4 Results

We chose two representative circuits from the ISCAS benchmarks for evaluating the performance of this algorithm. They are the c432 and one other ISCAS circuit, and were chosen to be within the limits of the resources available. Table 1 gives the timing of the parallel logic simulation runs for the above-described circuits on a cluster of eight VAXstation 2000 workstations interconnected by a LAN.
Our experience with this partitioning scheme is that the load between simulation time steps is characterized by spurts, and between spurts there is less work for most worker tasks than during the spurts. Again, we find that for more than seven processors there is no decrease in the time taken, which essentially implies that processors were going idle with less work and the communication costs were rising as compared to the computation costs. More refinements need to be incorporated into the event based load partitioning algorithm for, perhaps, an overlapped and balanced load.
5 Conclusions

This paper discusses the mapping of an event-driven logic simulation algorithm onto a distributed system. A load balancing scheme presented ensures that the tasks are evenly loaded. Results indicate that for large circuits, where the ratio of computation to communication costs is high, the performance is good. The algorithm also presents good scope for improvement. We feel that this algorithm will be useful given the general availability of networked workstations.
Acknowledgements

The authors thank Prof. V. Rajaraman for the encouragement and members of the KBCS, CADL, MAL and SERC for their help.

References

[1] M.A. Breuer and A.D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, Rockville, 1976.

[2] N. Ishiura, H. Yasuura and S. Yajima, Time First Evaluation Algorithm for High-Speed Logic Simulation, Proc. ICCAD, November 1984, pp. 397-399.

[…] Hardware Engines for Design and Simulation, Elsevier Science Pub., NH, CAD for VLSI, Vol. 2, 1986.

[…] Mohan, A Tasking Abstraction for Message Passing Architectures, Proc. of PARCOM-90, Pune, India, 1990.

[8] Srinivas Patil, Prithviraj Banerjee and Constantine D. Polychronopoulos, Efficient Circuit Partitioning Algorithms for Parallel Logic Simulation, Supercomputing Conference 1989, pp. 361-370.

[9] Friedrich Woppe, Accelerated Logic Simulation using Parallel Processing, CompEuro 1988, pp. 156-163.