Computer Aided Design Laboratory
Knowledge Based Computer Systems Laboratory, Supercomputer Education and Research Centre
Microprocessor Applications Laboratory, Supercomputer Education and Research Centre
Dept. of Computer Science and Automation
Indian Institute of Science, Bangalore 560 012, India

Abstract

The increase in the complexity of VLSI digital circuit design demands faster logic simulation techniques than those currently available. One of the ways of speeding up existing logic simulation algorithms is by exploiting the inherent parallelism of the sequential version. In this paper, we explore the possibility of mapping an event-driven logic simulation algorithm onto a cluster of processors interconnected by an Ethernet. The set of events at any simulation time step is partitioned by the Master Task (running on the host processor) among the Worker Tasks (running on the other processors). The partitioning scheme ensures a balanced load. Each Worker Task evaluates its elements independently and comes up with the new event list, which is passed on to the Master Task. After receiving the event lists from all the Workers, the Master Task increments the simulation time step and computes the new event list partition for the next simulation cycle. We have implemented this distributed logic simulation on the hardware of eight VAXstations using CT, a package for distributed programming. The paper concludes with the speedup figures obtained on the ISCAS benchmark circuits.

1 Introduction

Logic simulators play a major role in verifying the functionality of VLSI circuit designs. Existing logic simulators can be broadly classified into three types: compiled code [1], event driven [1] and T-Algorithm [2]. In the case of compiled code simulation, all elements are evaluated at each time step, and zero delay modelling of the elements is implied. The event-driven approach is more flexible than compiled code, as different delay models can be incorporated and an element is evaluated only when the logic value at one of its inputs changes. The T-Algorithm is more efficient than the event-driven approach because it reduces the number of table lookups [2].

Most of the existing sequential techniques take a considerable amount of time to carry out logic simulation of VLSI circuits. Their execution times can be reduced by exploiting the inherent parallelism of the algorithm. The methods used to speed up the simulation are:

1. tuning up the basic algorithm to be suitable for vector processing [3], and
2. mapping the algorithm directly onto special purpose hardware.

The first approach demands the computational power of vector processors (high vectorization ratio, long vector length) [3]. The latter approach gives a good speedup, but suffers from the disadvantage that any change in the algorithm cannot be reflected back in the hardware easily; moreover, the cost to performance ratio is high, making it less attractive for speeding up the simulation.

The current trends in carrying out parallel logic simulation are concentrated on mapping the simulation algorithm onto the processors of general purpose parallel machines. We can exploit either the functional parallelism inherent in the algorithm or the data parallelism. In the former, the simulation task is partitioned and assigned to several functional units which operate in a pipeline fashion.
One of the requirements is that the simulation task has to be divided into a fixed number of functional units in such a way that each functional unit takes almost the same amount of time to carry out its part of the execution. The limitation of this approach is that it does not scale very easily as the number of processors is increased [8]. In the case of the latter, the circuit is partitioned into subcircuits, and each subcircuit is assigned to a processor which carries out the simulation independently. The performance of this approach depends mainly on the partitioning strategy; partitioning of the circuit into subcircuits is a vital step in this method [8]. Secondly, synchronization of the simulation time step among processors affects the speedup, and synchronization plays a major role in distributed logic simulation [9]. In our approach, we dynamically partition the event list at runtime, and this aids the CT package in achieving an efficient load balance with low communication overheads. This adds a considerable increase to the speedup of the simulation process. This paper describes one possible approach of mapping the event-driven logic simulation algorithm onto a distributed computing system, using the concept of a centralized simulation time maintained by the Master task.

Figure 1: The Master-Worker Tasking Abstraction on a Distributed System

In a VLSI circuit, at any simulation time step, only a small portion of the entire circuit is active. Therefore it is inefficient to evaluate all the elements at each simulation step. Whenever the value of the output of an element changes, an event is said to be generated. This new event affects the gates which are connected to the output signal line, and these elements have to be reevaluated. In other words, at each time step only those elements whose inputs have changed have to be reevaluated. This technique is called event-driven simulation. Event-driven simulators supporting more complex delay models require a special type of data structure to find the fanins of an element for evaluation and the fanouts for finding the potentially active elements [4]. In addition, a time flow mechanism is necessary to keep the event lists in time order. At any time step, an event is picked from the event list and, using the event data, the element logic value is updated to the present value. This value is propagated to all fanouts connected to this output, and the fanout elements are pushed onto the appropriate evaluation stacks. This phase is called the fanout phase. In the element evaluation phase, an element is popped from the evaluation stack, the input logic values of the element are gathered and the element is evaluated to find the new logic output value. If the evaluated value is different from the existing value, a new event is generated and it is queued in the simulation time queue depending on the delay characteristic of the element. This simulation cycle is repeated for every increment in simulation time step, until the simulation period is completed.
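The cycle just described can be made concrete with a small sketch. The following C program is only an illustration of one sequential simulation time step, not the simulator described in this paper: the gate model (two-input NAND gates only), the array bounds, the circular time wheel and all identifiers are our own simplifying assumptions.

    #include <stdio.h>

    #define MAX_GATES   64
    #define MAX_EVENTS  256
    #define MAX_TIME    128
    #define MAX_FANOUT  8

    typedef struct { int gate; int new_value; } Event;

    static Event queue[MAX_TIME][MAX_EVENTS];     /* time-ordered event lists   */
    static int   nqueued[MAX_TIME];
    static int   eval_stack[MAX_GATES];           /* potentially active gates   */
    static int   top;
    static int   value[MAX_GATES], delay[MAX_GATES];
    static int   in0[MAX_GATES], in1[MAX_GATES];  /* two-input NAND gates only  */
    static int   fanout[MAX_GATES][MAX_FANOUT], nfanout[MAX_GATES];

    static int eval_gate(int g)                   /* gather inputs and evaluate */
    {
        return !(value[in0[g]] && value[in1[g]]);
    }

    static void simulate_step(int t)
    {
        int i, j, g, v;

        /* Fanout phase: apply each queued event and push the affected gates. */
        for (i = 0; i < nqueued[t]; i++) {
            g = queue[t][i].gate;
            value[g] = queue[t][i].new_value;
            for (j = 0; j < nfanout[g]; j++)
                eval_stack[top++] = fanout[g][j];
        }
        nqueued[t] = 0;

        /* Evaluation phase: re-evaluate only the marked gates; if an output
           changes, queue a new event after the gate's delay.                 */
        while (top > 0) {
            g = eval_stack[--top];
            v = eval_gate(g);
            if (v != value[g]) {
                int tn = (t + delay[g]) % MAX_TIME;
                queue[tn][nqueued[tn]].gate = g;
                queue[tn][nqueued[tn]].new_value = v;
                nqueued[tn]++;
            }
        }
    }

    int main(void)
    {
        int t;

        /* Toy circuit: gate 1 = NAND(gate 0, gate 0); an input change on
           gate 0 at time 0 is injected as the first event.               */
        delay[1] = 2;
        in0[1] = 0; in1[1] = 0;
        fanout[0][0] = 1; nfanout[0] = 1;
        value[1] = 1;

        queue[0][0].gate = 0;
        queue[0][0].new_value = 1;
        nqueued[0] = 1;

        for (t = 0; t < 8; t++) {
            simulate_step(t);
            printf("t=%d gate1=%d\n", t, value[1]);
        }
        return 0;
    }

In this toy run the change on gate 0 at time 0 causes gate 1 to be re-evaluated, and its new value appears two time steps later, reflecting the gate delay.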
2.1 The Master-Worker Tasking Abstraction

We implement the parallel algorithm using the Master-Worker tasking paradigm of the CT kernel package. The CT Kernel [6] is a distributed programming package that extends C with tasktype, a tasking abstraction. A tasktype [6] is an abstract datatype that encapsulates a set of data objects and defines operations on them. These operations form the sequential body of an instance of the tasktype activation, called the Task. Some of the operations within the body of the task allow the pertinent task to communicate with other tasks either synchronously or asynchronously. The communication interface of a task is uniform over the multiple processors, and so is its namespace. A tasktype is different from Ada's tasktype but has some implementation similarities. A task is explicitly created by declaring variables of the tasktype and then explicitly invoking it. On activation, a task executes sequentially the sequence of statements in the body of the tasktype definition. On completion of the execution, the task terminates automatically, following well-defined termination semantics. There can be multiple instances of a tasktype active at any time. Since each task is an independently executable unit, and given the bias towards message passing architectures, there cannot be any global (shared) variables between tasks. All function and procedure invocations from the body of a task have all the necessary parameters passed to them explicitly. However, the reentrant code for the procedures, functions and constant declarations is shared between tasks.

In the CT kernel, we define the Master Task tasktype as one that is the first task to be initiated on the host processor and which in turn spawns a number of Worker Tasks. The Master task takes in the event list generated in every cycle and communicates it to the Workers. We define the Worker Task tasktype as one that takes in the event list communicated by the Master task, does the fanout and element evaluation and, after the computation of the new set of events, passes this event list back to the Master task. It then awaits the next set of events to be processed. We have a Master task and generate a number of Worker tasks which are assigned to different processors as shown in Fig. 1. There can be more than two Workers per processor, as this increases the CPU utilization factor: while one Worker awaits messages from the Master, the others can do the processing (but the logic simulation state is maintained on a per-processor basis). We use the tell-hear asynchronous communication statements inherited by every tasktype definition; these statements can also address an array of workers collectively when one is specified.

    TASK master_task()
        int i, j;
        /* Declare the workers */
        tasktype worker_task worker[MAX_WORKERS];
        /* Workload for each worker */
        Worker_load event_set[MAX_WORKERS], total_event;
        Logic_sim_state state, incr_state, temp_vals[MAX_WORKERS];

        /* Spawn the worker tasks */
        for (i = 0; i < MAX_WORKERS; i++)
            new(worker[i]);
        initialize_master();
        for (i = 0; i < MAX_NO_SIMULATION_TIME_STEPS; get_next_sim_step(&i)) {
            partition_event_list(total_event, event_set);
            /* Tell Workers about fanout computation */
            for (j = 0; j < MAX_WORKERS; j++)
                tell(worker[j], event_set[j]);
            /* Hear from Workers the results */
            for (j = 0; j < MAX_WORKERS; j++)
                hear(worker[j], temp_vals[j]);
            update_logic_sim_state(temp_vals, state, incr_state);
            /* Tell all workers - incremental state */
            tell(worker, incr_state);
            /* Hear from workers - evaluation results */
            for (j = 0; j < MAX_WORKERS; j++)
                hear(worker[j], event_set[j]);
            generate_total_events(event_set, total_event);
        }
        print_results();
    END_TASK

Figure 2: The pseudo code for the Master Task

    TASKTYPE worker_task()
        hear(PARENT, event_set);
        compute_fanout(event_set, temp_set);
        tell(PARENT, temp_set);
        hear(PARENT, incr_state);
        update_sim_state_worker(incr_state, state);
        evaluate_circuit_elements(state, new_event_set);
        tell(PARENT, new_event_set);
        conclude();
    END_TASKTYPE

Figure 3: The pseudo code for the Worker Task
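The pseudo code in Figs. 2 and 3 leaves the message types Worker_load and Logic_sim_state undefined. The following sketch shows one plausible layout of these messages; the field names and bounds are our assumptions and are not taken from the paper or from the CT package.

    #define MAX_EVENTS_PER_SET 512      /* assumed bound on events per Worker  */
    #define MAX_ELEMENTS       4096     /* assumed bound on circuit elements   */

    /* One partitioned event set, as told to a Worker by the Master. */
    typedef struct {
        int n_events;
        struct {
            int element;                /* element whose output changed        */
            int new_value;              /* logic value to propagate to fanouts */
        } event[MAX_EVENTS_PER_SET];
    } Worker_load;

    /* Incremental simulation state, as broadcast by the Master. */
    typedef struct {
        int n_changed;
        int element[MAX_ELEMENTS];      /* elements whose values changed       */
        int value[MAX_ELEMENTS];        /* their new logic values              */
    } Logic_sim_state;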
2.2 The Eventlist Partitioning Algorithm

When the computational load is not evenly distributed among the Worker tasks, some of the Workers may be idle, which in turn causes processor underutilization. The number of elements a Worker Task is going to evaluate depends on the number of fanouts of the elements present in its event set. Assuming an equal evaluation time for all types of elements, the partitioning can be carried out on the basis of the number of fanouts. The partitioning method used in our implementation gives an even balance of the total fanouts among the Worker tasks and is given below.

Step 1: Sort the events of the current simulation time step in decreasing order of their fanout counts.
Step 2a: Assign one element (from the sorted order) to each Worker.
Step 2b: Subsequently assign the next item of the event list to the Worker task having the lowest total fanout.
Step 2c: Continue until the event list for the current simulation time step is exhausted.

This way we ensure that all the workers are almost equally loaded with the fanout elements. This indirectly influences the load-balancing policies of the parallel algorithm.
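A compact sketch of this fanout-balanced assignment is given below. It is only an illustration of the steps above, not the routine used in the implementation; the Event layout, the function names and the fixed MAX_WORKERS bound are assumptions of ours.

    #include <stdlib.h>

    #define MAX_WORKERS 8

    typedef struct { int element; int nfanout; } Event;

    /* Order events by decreasing fanout count (Step 1). */
    static int by_fanout_desc(const void *a, const void *b)
    {
        return ((const Event *)b)->nfanout - ((const Event *)a)->nfanout;
    }

    /* Greedily hand each event to the worker with the smallest running
       fanout total (Steps 2a-2c); owner[i] records the worker for event i. */
    static void partition_events(Event *ev, int n, int nworkers, int *owner)
    {
        int load[MAX_WORKERS] = {0};
        int i, w, best;

        qsort(ev, n, sizeof(Event), by_fanout_desc);
        for (i = 0; i < n; i++) {
            best = 0;
            for (w = 1; w < nworkers; w++)
                if (load[w] < load[best])
                    best = w;
            owner[i] = best;
            load[best] += ev[i].nfanout;
        }
    }

With all running totals starting at zero, the greedy loop hands the first nworkers events to distinct workers, which reproduces Step 2a; every later event then goes to the least loaded worker, as in Steps 2b and 2c.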
3 The Master and Worker Task Algorithm

In Figs. 2 and 3 we give the pseudo C code of the Master and Worker task definitions. The essential aspects of this parallel algorithm are as follows.

The state of the logic simulation at any time step is given by the state of the logic elements (some of which may be in a delay state). We duplicate the initial state of the logic simulation on each of the processors and, for each cycle of the simulation, the state of the logic simulation is updated on each processor and thus kept consistent globally. This helps avoid formulating the necessary set of logic elements into a message packet and sending it along with the partitioned eventlist to a worker task for simulation. The master task accepts the results of a simulation cycle and, after updating its state of logic simulation, broadcasts the relevant updated portion of the state to all the worker tasks. The pseudo code expresses this succinctly.

Logic simulation is essentially divided into two phases of activity in every simulation cycle. These are the fanout phase, where the result of evaluation of a logic element is propagated to its fanout elements, and the evaluation phase, where the logic element is evaluated. In the fanout phase, the events queued for the pertinent simulation time step are taken and their output values are propagated as the input values of the corresponding fanout elements. These fanout elements are scattered across the processors so that the evaluation phase can be carried out. In the evaluation phase, those logic elements whose inputs were updated during the fanout phase are taken and functionally evaluated; any resulting events are queued onto the appropriate simulation time queue, taking the delay characteristic into account.

The master task maintains the simulation time as well as the event list; the worker tasks carry out the actual fanout and evaluation phases in a distributed manner. The master task interacts with every worker task twice in each simulation cycle. The initialization consists of reading in the circuit description and initializing the data structures representing the state of logic simulation; this is carried out by all tasks, master and workers, on a per processor basis. As seen from the pseudo code, the master task partitions the eventlist of the pertinent simulation time step and tells all the workers their partitioned event sets. The workers carry out the fanout phase and the master then hears the results from each of the workers. The Master then updates the state of logic simulation and broadcasts the incremental state change to all the workers. The workers, after hearing the state changes and updating their logic simulation state (on a per processor basis), carry out the element evaluation corresponding to the eventlist partitioned and assigned to them. After the completion of the evaluation phase, all workers tell the output values generated to the master task, which goes on to update the eventlist and initiate the simulation cycle for the next time step.

Observations: The above method is not a very efficient solution in a distributed environment. It can be made efficient if we maintain the logic simulation state consistent across the processors for every simulation cycle in an efficient way. The communication costs per cycle are high as compared to the computation costs, and there is good scope for improvement. In the present implementation, the communication costs are contained by state updates which are sent only when needed. Thus, during the task assignment, if the number of events per task is less than optimal, only a few of the worker tasks are chosen and sent the workload, and the rest of the worker tasks directly go to the next time step.

4 Results

We chose two representative circuits from the ISCAS benchmarks for evaluating the performance of this algorithm. They are the c432 and c880, and were chosen to be within the limits of the resources available. Table 1 gives the timings of the parallel logic simulation runs for the above circuits on a cluster of eight VAXstation 2000 workstations interconnected by a LAN. Our experience with this partitioning scheme is that the load between simulation time-steps is characterized by spurts, and between spurts there is less work for most worker tasks than during the spurts. Again, we find that for more than seven processors there is no decrease in the time taken, which essentially implies that processors were going idle with less work and the communication costs were rising as compared to the computation costs. More refinements need to be incorporated into the event based load partitioning algorithm, perhaps for overlapped and balanced load.

5 Conclusions

This paper discusses the mapping of an event-driven logic simulation algorithm onto a distributed system. The load balancing scheme presented ensures that the tasks are evenly loaded. Results indicate that for large circuits, where the ratio of computation to communication costs is high, the performance is good. The algorithm also presents good scope for improvement.
We feel that this algorithm will be useful given the general availability of networked workstations.

Acknowledgements

The authors thank Prof. V. Rajaraman for the encouragement and the members of the KBCS, CADL, MAL and SERC for their help.

References

[1] M.A. Breuer and A.D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, Rockville, 1976.

[2] N. Ishiura, H. Yasuura and S. Yajima, Time First Evaluation Algorithm for High Speed Logic Simulation, Proc. ICCAD, November 1984, pp. 197-199.

[5] Hardware Engines for Design and Simulation, CAD for VLSI, Vol. 2, Elsevier Science Pub., North-Holland, 1986.

[6] Mohan, A Tasking Abstraction for Message Passing Architectures, Proc. of PARCOM-90, Pune, India, 1990.

[8] Srinivas Patil, Prithviraj Banerjee and Constantine D. Polychronopoulos, Efficient Circuit Partitioning Algorithms for Parallel Logic Simulation, Proc. Supercomputing Conference, 1989, pp. 361-370.

[9] Friedrich Woppe, Accelerated Logic Simulation using Parallel Processing, Proc. CompEuro, 1988, pp. 156-163.