A LOW-OVERHEAD COHERENCE SOLUTION FOR MULTIPROCESSORS WITH PRIVATE CACHE MEMORIES

Mark S. Papamarcos and Janak H. Patel
Coordinated Science Laboratory
University of Illinois
1101 W. Springfield
Urbana, IL 61801

ABSTRACT

This paper presents a cache coherence solution for multiprocessors organized around a single time-shared bus. The solution aims at reducing bus traffic and hence bus wait time. This in turn increases the overall processor utilization. Unlike most traditional high-performance coherence solutions, this solution does not use any global tables. Furthermore, this coherence scheme is modular and easily extensible, requiring no modification of cache modules to add more processors to a system. The performance of this scheme is evaluated by using an approximate analysis method. It is shown that the performance of this scheme is closely tied with the miss ratio and the amount of sharing between processors.

[Fig. 1: System Organization -- processors with private caches connected by a timeshared bus to main memory]

I. INTRODUCTION

The use of cache memory has long been recognized as a cost-effective means of increasing the performance of uniprocessor systems [Conti69, Meade70, Kaplan73, Strecker76, Rao78, Smith82]. In this paper, we will consider the application of cache memory in a tightly-coupled multiprocessor system organized around a timeshared bus. Many computer systems, particularly the ones which use microprocessors, are heavily bus-limited. Without some type of local memory, it is physically impossible to gain a significant performance advantage through multiple microprocessors on a single bus.

Generally, there are two different implementations of multiprocessor cache systems. One involves a single shared cache for all processors [Yeh83]. This organization has some distinct advantages, in particular, efficient cache utilization. However, this organization requires a crossbar between the processors and the shared cache.
It is impractical to provide communication between each processor and the shared cache using a shared bus. The other alternative is a private cache for each processor, as shown in Fig. 1. However, this organization suffers from the well-known data consistency or cache coherence problem. Should the same writeable data block exist in more than one cache, it is possible for one processor to modify its local copy independently of the rest of the system.

The simplest way to solve the coherence problem is to require that the address of the block being written in cache be transmitted throughout the system. Each cache must then check its own directory and purge the block if present. This scheme is most frequently referred to as broadcast-invalidate. Obviously, the invalidate traffic grows very quickly and, assuming that writes constitute 25% of the memory references, the system becomes saturated with less than four processors. In [Bean79], a bias filter is proposed to reduce the cache directory interference that results from this scheme. The filter consists of a small associative memory between the bus and each cache. The associative memory keeps a record of the most recently invalidated blocks, inhibiting some subsequent wasteful invalidations. However, this only serves to reduce the amount of cache directory interference without actually reducing the bus traffic.

Another class of coherence solutions is of the global-directory type. Status bits are associated with each block in main memory. Upon a cache miss or the first write to a block in cache, the block's global status is checked. An invalidate signal is sent only if another cache has a copy. Requests for transfers due to misses are also screened by the global table to eliminate unnecessary cache directory interference. The performance associated with these solutions is very high if one ignores the interference in the global directory. The hardware required to implement a global directory for low access interference is extensive, requiring a distributed directory with full crossbar. These schemes and their variations have been analyzed by several authors [Tang76, Censier78, Dubois82, Yen82, Archibald83].

A solution more appropriate for bus-organized multiprocessors has been proposed by Goodman [Goodman83]. In this scheme, an invalidate request is broadcast only when a block is written in cache for the first time. The updated block is simultaneously written through to main memory. Only if a block in cache is written to more than once is it necessary to write it back before replacing it. This particular write strategy, a combination of write-through and write-back, is called write-once. A dual cache directory system is employed in order to reduce cache interference.

We seek to integrate the high performance of global directory solutions, associated with the inhibition of all ineffective invalidations, with the modularity and easy adaptability to microprocessors of Goodman's scheme. In a bus-organized system with dual directories for interrogation, it is possible to determine at miss time if a block is resident in another cache. Therefore a status may be kept for each block in cache indicating whether it is Exclusive or Shared. All unnecessary invalidate requests can be cut off at the point of origin. Bus traffic is therefore reduced to cache misses, actual invalidations and writes to main memory. Of these, the traffic generated by cache misses and actual invalidations represents the minimum unavoidable traffic. The number of writes to main memory is determined by the particular policy of write-through or write-back. Therefore, for a multiprocessor on a timeshared bus, performance should approach the maximum possible for a cache coherent system under the given write policy.

The cache coherence solution to be presented is applicable to both write-through and write-back policies. However, it has been shown that write-back generates less bus traffic than write-through [Norton82]. This has been verified by our performance studies. Therefore, we have chosen a write-back policy in the rest of this paper. Under a write-back policy, coherence is not maintained between a cache and main memory as can be done with a write-through policy. This in turn implies that I/O processors must follow the same protocol as a cache for data transfers to and from main memory.

ACKNOWLEDGEMENTS: This research was supported by the Naval Electronics Systems Command under VHSIC contract N00039-80-C-0556 and by the Joint Services Electronics Program under contract N00014-84-C-0149.

0194-7111/84/0000/0348$01.00 (c) 1984 IEEE

II. PROPOSED COHERENCE SOLUTION

In this section we present a low-overhead coherence algorithm. To implement this algorithm, it is necessary to associate two status bits with each block in cache. No status bits are associated with the main memory. The first bit indicates either Shared or Exclusive ownership of a block, while the second bit is set if the block has been locally modified. Because the state Shared-Modified is not allowed in our scheme, this status is used instead to denote a block containing invalid data. A write-back policy is assumed. The four possible statuses of a block in cache at any given time are then:

1. Invalid: Block does not contain valid data.

2. Exclusive-Unmodified (Excl-Unmod): No other cache has this block. Data in block is consistent with main memory.

3. Shared-Unmodified (Shared-Unmod): Some other caches may have this block. Data in block is consistent with main memory.

4. Exclusive-Modified (Excl-Mod): No other cache has this block. Data in block has been locally modified and is therefore inconsistent with main memory.

A block is written back to main memory when evicted only if its status is Excl-Mod. If a write-through cache was desired then one would not need to differentiate between Excl-Mod and Excl-Unmod. Writes to an Exclusive block result only in modification of the cached block and the setting of the Modified status. The status of Shared-Unmod says that some other caches may have this block. Initially, when a block is declared Shared-Unmod, at least two caches must have this block. However, at a later time when all but one cache evicts this block, it is no longer truly Shared. But the status is not altered, in favor of simplicity of implementation.

Detailed flow charts of the proposed coherence algorithm are given in Figs. 2 and 3. Fig. 2 gives the required operations during a read cycle and Fig. 3 describes the write cycle. The following is a summary of the algorithm and some implementation details which are not present in the flow charts.

Upon a cache miss, a read request is broadcast to all caches and the main memory. If the miss was caused by a write operation, an invalidate signal accompanies the request. If a cache directory matches the requested address then it inhibits the main memory from putting data on the bus. Assuming cache operations are asynchronous with each other and the bus, possible multiple responses can be resolved with a simple priority network, such as a daisy chain. The highest priority cache among the responding caches will then put the data on the bus. If no cache has the block then the memory provides the block. A unique response is thus guaranteed. On a read operation, all caches which match the requested address set the status of the corresponding block to Shared-Unmod. In addition, the block is written back to main memory concurrently with the transfer if its status was Excl-Mod. On a write, all matching caches set the block status to Invalid. The requesting cache sets the status of the block to Shared-Unmod if the block came from another cache and to Excl-Unmod if the block came from main memory.

[Fig. 2: Cache Read Operation -- flow chart]
[Fig. 3: Cache Write Operation -- flow chart]
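The per-block behavior summarized above can be viewed as a small state machine. The following sketch is our own illustration, not from the paper: the four state names follow the paper's statuses, but the class, method names, and return conventions are invented for exposition.

```python
# Per-block state machine implied by the read/write flow charts above.
# State names come from the paper; everything else is our own sketch.

class CacheBlock:
    def __init__(self):
        self.state = "Invalid"

    # ----- actions taken by the local processor -----

    def proc_read(self, supplied_by_other_cache=False):
        """Returns the bus request generated, or None on a quiet hit."""
        if self.state == "Invalid":                       # read miss
            self.state = ("Shared-Unmod" if supplied_by_other_cache
                          else "Excl-Unmod")
            return "read"                                 # broadcast read request
        return None                                       # read hit: no bus traffic

    def proc_write(self):
        if self.state == "Invalid":                       # write miss
            self.state = "Excl-Mod"
            return "read+invalidate"                      # read with invalidate signal
        if self.state == "Shared-Unmod":                  # first write to a Shared block
            self.state = "Excl-Mod"
            return "invalidate"
        self.state = "Excl-Mod"                           # Exclusive: silent write
        return None

    # ----- reactions to requests observed on the bus -----

    def bus_read(self):
        """Returns True if this cache must supply the block (with write-back)."""
        supply_and_writeback = (self.state == "Excl-Mod")
        if self.state != "Invalid":
            self.state = "Shared-Unmod"                   # the block is now shared
        return supply_and_writeback

    def bus_invalidate(self):
        self.state = "Invalid"
```

Note that a write to an Exclusive block generates no bus traffic at all, and an invalidate goes on the bus only for the first write to a Shared-Unmod block; this is precisely where the scheme saves bus cycles over broadcast-invalidate.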
Upon a subsequent cache write to such a block, an invalidation signal is broadcast with the block address only if the status is Shared-Unmod, thus minimizing unnecessary invalidation traffic.

As will be seen in the following sections, the performance of the proposed coherence algorithm is directly dependent on the miss ratio and the degree of sharing, while in algorithms not utilizing global tables the performance is tied closely with the write frequency. Since the number of cache misses is far fewer than the number of writes, intuitively it is clear that the proposed algorithm should perform better than other modular algorithms.

Most multiprocessing systems require the use of synchronization and mutual exclusion primitives. These primitives can be implemented with indivisible read-modify-write operations (e.g., test-and-set) to main memory. Indivisible read-modify-write operations are a challenge to most cache coherence solutions. However, in our system, the bus provides a convenient "lock" operation with which to solve the read-modify-write problem. In our scheme, if the block is either Excl-Unmod or Excl-Mod, no special action is required to perform an indivisible read-modify-write operation on that block. However, if the block is declared Shared-Unmod, we must account for the contingency in which two processors are simultaneously accessing a Shared block. If the operation being performed is designated as indivisible, then the cache controllers must first capture the bus before proceeding to execute the instruction. Through the normal bus arbitration mechanism, only one cache controller will get the bus. This controller can then complete the indivisible operation. In the process, of course, the other block is invalidated and the other processor treats the access as a cache miss and proceeds on that basis.

An implicit assumption in this scheme is that the controller must know before it starts executing the instruction that it is an indivisible operation. Some current microprocessors are capable of locking the bus for the duration of an instruction. Unfortunately, with some others it is not possible to recognize a read-modify-write before the read is complete; it is then too late to backtrack. For specific processors we have devised elaborate methods using interrupts and system calls to handle such situations. We will not present the specifics here, but it suffices to say that the schemes involve either the aborting and retrying of instructions or decoding instructions in the cache controller.

III. PERFORMANCE ANALYSIS

The analysis of this coherence solution stems from an approximate method proposed by Patel [Patel82]. In this method, a request for a block transfer is broken up into several unit requests for service. The waiting time is also treated as a series of unit requests. Furthermore, these unit requests are treated as independent and random requests to the bus. It was shown in that paper that this rather non-obvious transformation of the problem results in a much simpler but fairly accurate analysis. The errors introduced by the approximation are less than 5% for a low miss ratio.

First, let us define the system parameters:

N  number of processors
m  miss ratio
a  processor memory reference rate
w  fraction of memory references that are writes
d  probability that a block in cache has been locally modified before eviction, i.e., the fraction of "dirty" blocks
u  fraction of write requests that reference unmodified blocks in cache
s  fraction of write requests that reference Shared blocks, equivalent to the fraction of Shared blocks in cache if references are assumed to be equally distributed throughout the cache
A  number of cycles required for bus arbitration
T  number of cycles required for a block transfer
I  number of cycles required for a block invalidate
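To get a feel for the relative magnitudes involved, a quick back-of-the-envelope calculation (our own, using the default parameter values adopted later in Section IV: m = 5%, a = 90%, w = 20%, s = 5%, u = 30%) compares the block-transfer traffic due to misses, ma, against the invalidation traffic, (1-m)awsu:

```python
# Our own arithmetic check, using the paper's Section IV default values.
m, a, w, s, u = 0.05, 0.90, 0.20, 0.05, 0.30

miss_traffic  = m * a                    # block-transfer requests per useful cycle
inval_traffic = (1 - m) * a * w * s * u  # invalidates from first writes to Shared blocks

print(f"miss traffic  = {miss_traffic:.4f}")   # 0.0450
print(f"inval traffic = {inval_traffic:.4f}")  # 0.0026
```

Even with these deliberately pessimistic defaults, the invalidation traffic is more than an order of magnitude below the miss traffic, which is why the scheme's performance tracks the miss ratio rather than the write frequency.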
To analyze our cache system, consider an interval of time comprising k units of useful processor activity. In that time, kb bus requests will be issued, where

    b = ma + (1-m)awsu

The term ma in the above expression represents the bus accesses due to cache misses and the term (1-m)awsu accounts for the invalidate requests resulting from writes to Shared-Unmod blocks.

Now we account for cache interference from other processors. If no dual cache directory is assumed, the performance degradation due to cache interference can be extremely severe. Therefore, we have assumed dual directories in cache. In this case, cache interference will occur only in the following situations:

1. A given processor receives invalidate requests from the (N-1) other processors at the rate of (N-1)(1-m)awsu. We assume that all invalidates are effective and that, on the average, one cache is invalidated. The penalty for an invalidate is assumed to be one cache cycle.

2. Transfer requests occur at the rate (N-1)ma, of which (N-1)mas are for Shared blocks. We again assume that, on the average, one cache responds to the request. The penalty for a transfer is T cycles.

We define Q to be the sum of these two effects, namely

    Q = (1-m)awsu + masT

Cache interference is assumed to be distributed over the processor execution time, yielding the number of cycles required for 1 useful unit of work:

    Z = 1 + bA + maT + madT + (1-m)awsuI + bW + Q        (1)

where Z is the real execution time for 1 useful unit of work and W is the average waiting time per bus request. The cpu idle times per useful cpu cycle are the factors bA for bus arbitration, maT for fetching blocks on misses, madT for write-back of Modified blocks, (1-m)awsuI for invalidate cycles, and bW for waiting time to acquire the bus.

The unit request rate for each of the N processors, as seen by the bus, is (Z - 1 - bA - Q)/Z. The probability that no processor is requesting the bus is given by

    (1 - (Z - 1 - bA - Q)/Z)^N

Therefore, the probability that at least one processor is requesting the bus, that is, the average bus utilization B, is

    B = 1 - (1 - (Z - 1 - bA - Q)/Z)^N        (2)

To solve for B, W and Z, we need one more expression for the bus utilization. That can be obtained by multiplying N by the actual bus time used, averaged over the execution period, giving

    B = N(Z - 1 - bA - bW - Q)/Z        (3)

Now we can solve for B, W and Z using equations (1), (2) and (3). Similar derivations exist for the case of no coherence, the case of no coherence and no bus contention (infinite crossbar), and Goodman's scheme. The processor utilization U is simply 1/Z.

IV. DISCUSSION OF RESULTS

In this section we present the analytical results to demonstrate the effect of various parameters on the cpu performance and bus traffic. The values of cache parameters used span a reasonable range covering most practical situations. In some cases we have chosen pessimistic values to emphasize the fact that our cache coherence solution still gives good performance. The following values were used as default cache parameters:

m = 5%  Miss ratio: It may actually be lower for reasonable cache sizes, so this is a pessimistic assumption. Lower miss ratios would be appropriate for single-tasking processors, while the 7.5% figure may be appropriate for multi-tasking environments involving many context switches.

a = 90%  Processor to memory access rate: Here we assume that 90% of cpu cycles result in a cache request, although a smaller fraction is more likely in processors with a large register set.

d = 50%  Write-back probability: Assume here that approximately half of all blocks are locally modified before eviction, although 20% and 80% are tried in order to see the effect of this parameter.
w = 20%  Write frequency: Assumed to be about 20% of all memory references. This is a fairly standard number. Since it only appears as a factor in the generation of invalidate requests with u and s, its actual value is not critical.

u = 30%  Fraction of writes to unmodified blocks: Assume that roughly one third of all write hits are first-time writes to a given unmodified block and the remainder are subsequent writes to the modified block.

s = 5%  Degree of sharing: In most cases we have assumed that 5% of writes are to a block which is declared Shared-Unmod. This should be a pessimistic assumption except for programs which pass large amounts of data between processors, in which case s = 15% is more reasonable. In systems where most sharing occurs only on semaphores, the 5% figure is more likely.

A = 1  Bus arbitration time: Assume that the logic for determining the next bus master settles within one cache cycle.

T = 2  Block transfer time: In a microprocessor environment, blocks are likely to be small. Therefore, in most cases we have assumed that it takes approximately two cache cycles to transfer a block to a cache. We have also considered the effect of varying block transfer times due to differing technologies or larger cache blocks.

I = 2  Block invalidate time: We have assumed that the time taken for an invalidate cycle should be only slightly longer than a normal cache cycle, since the invalidate operation consists only of transmitting an address and modifying the affected cache directories.

The analytical method was verified using a time-driven simulator of the performance model. In all cases tested, the predicted performance differed by no more than 5% from the simulated performance. This error tended to approach 0 with heavier bus loading. Because of the comparative ease of generating data using the analytical solution, all results shown have been derived analytically. On each graph, all parameters assume their default values except the one being varied.

Figs. 4 through 6 illustrate the effects of different miss ratios on bus utilization, system performance, and processor utilization as functions of the number of processors. System performance is expressed as NU, where N is the number of processors and U is the single processor utilization. The system performance is limited primarily by the bus. From Fig. 4 we see that for a 7.5% miss ratio the bus saturates with about 8 processors. As the miss ratio decreases to 2.5%, the bus saturates with about 18 processors. The effect of bus saturation on system performance can be seen in Fig. 5. Note that, in general, bus utilization and system performance increase almost linearly with N until the bus reaches saturation. At this point, processor utilization begins to approach a curve proportional to 1/N, as seen in Fig. 6. If a 1% miss ratio could be achieved, performance would top out with N of about 29.

[Fig. 4: Effect of Miss Ratio m: Bus Utilization vs. Number of Processors]
[Fig. 5: Effect of Miss Ratio m: System Performance vs. Number of Processors]
[Fig. 6: Effect of Miss Ratio m: Processor Utilization vs. No. of Processors]

Figs. 7 and 8 illustrate the effect of differing degrees of interprocessor sharing. The effect of the sharing factor, s, on system performance is relatively small compared with the effect of miss ratio. It is the factor (1-m)awsu that is responsible for the generation of invalidation traffic, which is generally smaller than the miss traffic. These graphs are also demonstrative of the effects of variations in the write frequency (w) and the percentage of first writes (u). The value of w is relatively fixed between 20% and 30%, and u should be fairly constant as well, except when moving large quantities of data. The s = 100% case corresponds to a standard write-back coherence scheme in which any block is potentially sharable.
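The trends in these figures can be reproduced by solving equations (1)-(3) numerically. The sketch below is our own code, not the paper's: the parameter defaults are the paper's, but reducing the three equations to a single equation in the wait time W and solving it by bisection is our implementation choice.

```python
# Our own numerical sketch of the Section III model, equations (1)-(3).
# Unknowns Z, W, B are found by bisection on the bus wait time W.

def solve(N, m=0.05, a=0.90, w=0.20, d=0.50, u=0.30, s=0.05, A=1, T=2, I=2):
    inval = (1 - m) * a * w * s * u          # invalidate request rate
    b = m * a + inval                        # bus requests per useful cycle
    Q = inval + m * a * s * T                # cache interference term
    busy = m*a*T + m*a*d*T + inval*I         # actual bus time per useful cycle

    def z(W):                                # equation (1)
        return 1 + b*A + busy + b*W + Q

    def gap(W):                              # eq. (3) minus eq. (2); root gives B
        Z = z(W)
        return N*busy/Z - (1 - (1 - (Z - 1 - b*A - Q)/Z)**N)

    lo, hi = 0.0, 1e6                        # gap(lo) >= 0 > gap(hi)
    for _ in range(100):                     # bisection on W
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    Z = z(0.5 * (lo + hi))
    return 1.0/Z, N*busy/Z                   # U (processor util.), B (bus util.)

for N in (2, 4, 8, 12, 16):
    U, B = solve(N)
    print(f"N={N:2d}  U={U:.2f}  NU={N*U:.2f}  B={B:.2f}")
```

With the default 5% miss ratio this reproduces the qualitative behavior of Figs. 4 through 6: NU grows almost linearly until the bus utilization B approaches 1, after which U falls off roughly as 1/N.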
[Fig. 7: Effect of Degree of Sharing s: Bus Utilization vs. Number of Processors]
[Fig. 8: Effect of Degree of Sharing s (5%, 15%, 100%): System Performance vs. Number of Processors]

With a write-back frequency of 30% instead of 50% to compensate for initial write-throughs, the curves for Goodman's scheme are almost identical to those for s = 100%.

Fig. 9 illustrates the effect of different write-back frequencies. The results here are fairly predictable. Write-back is yet another factor which contributes to the bus traffic. A write-through policy would contribute much more traffic than this.

[Fig. 9: Effect of Write-Back Probability d (20%, 50%, 80%): System Performance vs. Number of Processors]

Fig. 10 illustrates the degradation due to increasing block transfer times. System performance is so limited by transfer times of 4 cycles or more that it is absolutely necessary to be able to bring a block into cache in one or two cycles.

[Fig. 10: Effect of Block Transfer Time T: System Performance vs. No. of Processors]

Finally, Fig. 11 shows that the proposed coherence solution is very close to the achievable ideal for system performance on a timeshared bus. The top curve represents a system not constrained by a bus, while the second corresponds to a system with no coherence overhead. The bottom curve, representing the proposed solution, is very close to the middle curve, clearly showing that little system performance is lost in maintaining cache consistency using our algorithm.

[Fig. 11: Overhead of Coherence Solution: System Performance vs. No. of Processors]

V. CONCLUDING REMARKS

In this paper we have introduced a new coherence algorithm for multiprocessors on a timeshared bus. It takes advantage of the relatively small amount of data shared between processors without the need for a global table. In addition, it is easily extensible to an arbitrary number of processors and relatively uncomplicated. The applications of a system of this type are many. Processing modules could be added as needed, and the system need not be redesigned for each new application. For example, an interesting application would be to allocate one processing module to each user, with one possibly dedicated to operating system functions. Again, the primary advantage is easy expandability and very little performance degradation as a result of it. For any multiprocessor system on a timeshared bus, this coherence solution is as easy to implement as any other, save broadcast-invalidate, and offers a significant performance improvement if the amount of shared data memory is reasonably small.

VI. ACKNOWLEDGEMENTS

We would like to thank Professor Faye Briggs of Rice University and Professor Jean-Loup Baer of the University of Washington for helpful discussions concerning this paper.

REFERENCES

[Archibald83] J. Archibald and J. L. Baer, "An Economical Solution to the Cache Coherence Problem," University of Washington Technical Report 83-10-07, October 1983.

[Bean79] B. M. Bean, K. Langston, R. Partridge, and K. K. Sy, "Bias Filter Memory for Filtering out Unnecessary Interrogations of Cache Directories in a Multiprocessor System," United States Patent 4,142,234, February 27, 1979.

[Censier78] L. M. Censier and P. Feautrier, "A New Solution to Coherence Problems in Multicache Systems," IEEE Trans. Comput., vol. C-27, December 1978, pp. 1112-1118.

[Conti69] C. J. Conti, "Concepts for Buffer Storage," IEEE Computer Group News, vol. 2, March 1969, pp. 9-13.

[Dubois82] M. Dubois and F. A. Briggs, "Effects of Cache Coherency in Multiprocessors," IEEE Trans. Comput., vol. C-31, November 1982, pp. 1083-1099.

[Goodman83] J. R. Goodman, "Using Cache Memory to Reduce Processor-Memory Traffic," Proc. 10th Annual Symp. on Computer Architecture, June 1983, pp. 124-131.

[Kaplan73] K. R. Kaplan and R. O. Winder, "Cache-Based Computer Systems," Computer, March 1973, pp. 30-36.

[Meade70] R. M. Meade, "On Memory System Design," AFIPS Conf. Proc., vol. 37, 1970, pp. 33-43.

[Norton82] R. L. Norton and J. A. Abraham, "Using Write Back Cache to Improve Performance of Multiuser Multiprocessors," Proc. 1982 Int. Conf. on Parallel Processing, August 1982, pp. 326-331.

[Patel82] J. H. Patel, "Analysis of Multiprocessors with Private Cache Memories," IEEE Trans. Comput., vol. C-31, April 1982, pp. 296-304.

[Rao78] G. S. Rao, "Performance Analysis of Cache Memories," J. ACM, vol. 25, no. 3, July 1978, pp. 376-395.

[Smith82] A. J. Smith, "Cache Memories," Computing Surveys, vol. 14, no. 3, September 1982, pp. 473-530.

[Strecker76] W. D. Strecker, "Cache Memories for PDP-11 Family Computers," Proc. 3rd Annual Symp. on Computer Architecture, January 1976, pp. 155-158.

[Tang76] C. K. Tang, "Cache System Design in the Tightly Coupled Multiprocessor System," AFIPS Conf. Proc., vol. 45, 1976, pp. 749-753.

[Yeh83] P. C. C. Yeh, J. H. Patel, and E. S. Davidson, "Shared Cache for Multiple-Stream Computer Systems," IEEE Trans. Comput., vol. C-32, January 1983, pp. 38-47.

[Yen82] W. C. Yen and K. S. Fu, "Coherence Problem in a Multicache System," Proc. 1982 Int. Conf. on Parallel Processing, 1982, pp. 332-339.