A LOW-OVERHEAD COHERENCE SOLUTION FOR MULTIPROCESSORS WITH PRIVATE CACHE MEMORIES

Mark S. Papamarcos and Janak H. Patel
Coordinated Science Laboratory
University of Illinois
1101 W. Springfield
Urbana, IL 61801
ABSTRACT

This paper presents a cache coherence solution for multiprocessors organized around a single time-shared bus. The solution aims at reducing bus traffic and hence bus wait time. This in turn increases the overall processor utilization. Unlike most traditional high-performance coherence solutions, this solution does not use any global tables. Furthermore, this coherence scheme is modular and easily extensible, requiring no modification of cache modules to add more processors to a system. The performance of this scheme is evaluated by using an approximate analysis method. It is shown that the performance of this scheme is closely tied with the miss ratio and the amount of sharing between processors.
I. INTRODUCTION

Fig. 1 System Organization (processors, each with a private cache, connected by a timeshared bus to main memory)
The use of cache memory has long been recognized as a cost-effective means of increasing the performance of uniprocessor systems [Conti69, Meade70, Kaplan73, Strecker76, Rao78, Smith82]. In this paper, we will consider the application of cache memory in a tightly-coupled multiprocessor system organized around a timeshared bus. Many computer systems, particularly the ones which use microprocessors, are heavily bus-limited. Without some type of local memory, it is physically impossible to gain a significant performance advantage through multiple microprocessors on a single bus.
Generally, there are two different implementations of multiprocessor cache systems. One involves a single shared cache for all processors [Yeh83]. This organization has some distinct advantages, in particular, efficient cache utilization. However, this organization requires a crossbar between the processors and the shared cache. It is impractical to provide communication between each processor and the shared cache using a shared bus. The other alternative is a private cache for each processor, as shown in Fig. 1. However, this organization suffers from the well known data consistency or cache coherence problem. Should the same writeable data block exist in more than one cache, it is possible for one processor to modify its local copy independently of the rest of the system.
The simplest way to solve the coherence problem is to require that the address of the block being written in cache be transmitted throughout the system. Each cache must then check its own directory and purge the block if present. This scheme is most frequently referred to as broadcast-invalidate. Obviously, the invalidate traffic grows very quickly and, assuming that writes constitute 25% of the memory references, the system becomes saturated with less than four processors. In [Bean79], a bias filter is proposed to reduce the cache directory interference that results from this scheme. The filter consists of a small associative memory between the bus and each cache. The associative memory keeps a record of the most recently invalidated blocks, inhibiting some subsequent wasteful invalidations. However, this only serves to reduce the amount of cache directory interference without actually reducing the bus traffic.
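As a rough sanity check on the saturation figure quoted above, one can assume that every write occupies the bus for one invalidate-broadcast cycle and that 90% of cpu cycles issue a memory reference (the default reference rate used later in Section IV). The numbers below are an illustrative back-of-envelope estimate, not the authors' model:

```python
# Back-of-envelope bus load under broadcast-invalidate (illustrative
# assumption: each write costs one bus cycle for the invalidate broadcast).
a = 0.90   # memory references per cpu cycle (default value from Section IV)
w = 0.25   # fraction of references that are writes (figure quoted above)

per_processor_bus_load = a * w          # bus cycles demanded per cpu cycle
saturation_n = 1.0 / per_processor_bus_load

print(f"bus load per processor per cycle: {per_processor_bus_load:.3f}")
print(f"processors needed to saturate the bus: {saturation_n:.1f}")
```

This crude bound ignores miss traffic entirely, so the true saturation point is somewhat lower, consistent with the less-than-four-processors figure.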
Another class of coherence solutions are of the global-directory type. Status bits are associated with each block in main memory. Upon a cache miss or the first write to a block in cache, the block's global status is checked. An invalidate signal is sent only if another cache has a copy. Requests for transfers due to misses are also screened by the global table to eliminate unnecessary cache directory interference. The performance associated with these solutions is very high if one ignores the interference in the global directory. The hardware required to implement a global directory for low access interference is extensive, requiring a distributed directory with full crossbar. These schemes and their variations have been analyzed by several authors [Tang76, Censier78, Dubois82, Yen82, Archibald83].

ACKNOWLEDGMENTS: This research was supported by the Naval Electronics Systems Command under VHSIC contract N00039-80-C-0556 and by the Joint Services Electronics Program under contract N00014-84-C-0149.

0194-7111/84/0000/0348 $01.00 © 1984 IEEE
A solution more appropriate for bus-organized multiprocessors has been proposed by Goodman [Goodman83]. In this scheme, an invalidate request is broadcast only when a block is written in cache for the first time. The updated block is simultaneously written through to main memory. Only if a block in cache is written to more than once is it necessary to write it back before replacing it. This particular write strategy, a combination of write-through and write-back, is called write-once. A dual cache directory system is employed in order to reduce cache interference.

We seek to integrate the high performance of global directory solutions, associated with the inhibition of all ineffective invalidations, and the modularity and easy adaptability to microprocessors of Goodman's scheme. In a bus-organized system with dual directories for interrogation, it is possible to determine at miss time if a block is resident in another cache. Therefore a status may be kept for each block in cache indicating whether it is Exclusive or Shared. All unnecessary invalidate requests can be cut off at the point of origin. Bus traffic is therefore reduced to cache misses, actual invalidations, and writes to main memory. Of these, the traffic generated by cache misses and actual invalidations represents the minimum unavoidable traffic. The number of writes to main memory is determined by the particular policy of write-through or write-back. Therefore, for a multiprocessor on a timeshared bus, performance should then approach the maximum possible for a cache coherent system under the given write policy.

The cache coherence solution to be presented is applicable to both write-through and write-back policies. However, it has been shown that write-back generates less bus traffic than write-through [Norton82]. This has been verified by our performance studies. Therefore, we have chosen a write-back policy in the rest of this paper. Under a write-back policy, coherence is not maintained between a cache and a main memory as can be done with a write-through policy. This in turn implies that I/O processors must follow the same protocol as a cache for data transfer to and from main memory.

II. PROPOSED COHERENCE SOLUTION

In this section we present a low-overhead coherence algorithm. To implement this algorithm, it is necessary to associate two status bits with each block in cache. No status bits are associated with the main memory. The first bit indicates either Shared or Exclusive ownership of a block, while the second bit is set if the block has been locally modified. Because the state Shared-Modified is not allowed in our scheme, this status is used instead to denote a block containing invalid data. A write-back policy is assumed. The four possible statuses of a block in cache at any given time are then:

1. Invalid: Block does not contain valid data.

2. Exclusive-Unmodified (Excl-Unmod): No other cache has this block. Data in block is consistent with main memory.

3. Shared-Unmodified (Shared-Unmod): Some other caches may have this block. Data in block is consistent with main memory.

4. Exclusive-Modified (Excl-Mod): No other cache has this block. Data in block has been locally modified and is therefore inconsistent with main memory.

A block is written back to main memory when evicted only if its status is Excl-Mod. If a write-through cache was desired then one would not need to differentiate between Excl-Mod and Excl-Unmod. Writes to an Exclusive block result only in modification of the cached block and the setting of the Modified status. The status of Shared-Unmod says that some other caches may have this block. Initially, when a block is declared Shared-Unmod, at least two caches must have this block. However, at a later time when all but one cache evicts this block, it is no longer truly Shared. But the status is not altered, in favor of simplicity of implementation.

Detailed flow charts of the proposed coherence algorithm are given in Figs. 2 and 3. Fig. 2 gives the required operations during a read cycle and Fig. 3 describes the write cycle. The following is a summary of the algorithm and some implementation details which are not present in the flow charts.

Upon a cache miss, a read request is broadcast to all caches and the main memory. If the miss was caused by a write operation, an invalidate signal accompanies the request. If a cache directory matches the requested address then it inhibits the main memory from putting data on the bus. Assuming cache operations are asynchronous with each other and the bus, possible multiple responses can be resolved with a simple priority network, such as a daisy chain. The highest priority cache among the responding caches will then put the data on the bus. If no cache has the block then the memory provides the block. A unique response is thus guaranteed. On a read operation, all caches which match the requested address set the status of the corresponding block to Shared-Unmod. In addition, the block is written back to main memory concurrently with the transfer if its status was Excl-Mod. On a write, all matching caches set the block status to Invalid. The requesting cache sets the status of the block to Shared-Unmod if the block came from another cache and to Excl-Unmod if the block came from main memory.
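The statuses and bus transactions described in this section can be sketched as a per-block state machine. The following is our own illustrative sequential model, not the authors' hardware: class and method names are invented, and timing, bus arbitration, and the dual directories are omitted.

```python
# Illustrative model of the proposed per-block statuses and transitions.
INVALID, EXCL_UNMOD, SHARED_UNMOD, EXCL_MOD = range(4)

class Cache:
    def __init__(self):
        self.state = {}                      # block address -> status

    def status(self, addr):
        return self.state.get(addr, INVALID)

class Bus:
    """Connects the caches; main memory supplies a block no cache holds."""
    def __init__(self, caches):
        self.caches = caches

    def read_miss(self, requester, addr):
        owners = [c for c in self.caches
                  if c is not requester and c.status(addr) != INVALID]
        for c in owners:                     # every matching cache goes Shared
            c.state[addr] = SHARED_UNMOD     # (an Excl-Mod copy is written back)
        requester.state[addr] = SHARED_UNMOD if owners else EXCL_UNMOD

    def write(self, requester, addr):
        st = requester.status(addr)
        if st in (EXCL_UNMOD, EXCL_MOD):     # Exclusive: silent local write
            requester.state[addr] = EXCL_MOD
            return
        if st == INVALID:                    # write miss: fetch the block first
            self.read_miss(requester, addr)
        for c in self.caches:                # invalidate every other copy
            if c is not requester:
                c.state.pop(addr, None)
        requester.state[addr] = EXCL_MOD

caches = [Cache(), Cache()]
bus = Bus(caches)
bus.read_miss(caches[0], 0x40)              # first reader: Excl-Unmod
bus.read_miss(caches[1], 0x40)              # second reader: both Shared-Unmod
bus.write(caches[1], 0x40)                  # invalidate sent; writer Excl-Mod
print(caches[0].status(0x40), caches[1].status(0x40))
```

Note that a write to an Exclusive block generates no bus traffic at all, which is the source of the scheme's advantage over broadcast-invalidate and write-once.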
Fig. 2 Cache Read Operation

Fig. 3 Cache Write Operation

Upon a subsequent cache write, an invalidation signal is broadcast with the block address only if the status is Shared-Unmod, thus minimizing unnecessary invalidation traffic.

As will be seen in the following sections, the performance of the proposed coherence algorithm is directly dependent on the miss ratio and the degree of sharing, while in algorithms not utilizing global tables the performance is tied closely with the write frequency. Since the number of cache misses are far fewer than the number of writes, intuitively it is clear that the proposed algorithm should perform better than other modular algorithms.

Most multiprocessing systems require the use of synchronization and mutual exclusion primitives. These primitives can be implemented with indivisible read-modify-write operations (e.g., test-and-set) to memory. Indivisible read-modify-write operations are a challenge to most cache coherence solutions. However, in our system, the bus provides a convenient "lock" operation with which to solve the read-modify-write problem. In our scheme if the block is either Excl-Unmod or Excl-Mod no special action is required to perform an indivisible read-modify-write operation on that block. However, if the block is declared Shared-Unmod, we must account for the contingency in which two processors are simultaneously accessing a Shared block. If the operation being performed is designated as indivisible, then the cache controllers must first capture the bus before proceeding to execute the instruction. Through the normal bus arbitration mechanism, only one cache controller will get the bus. This controller can then complete the indivisible operation. In the process, of course, the other block is invalidated and the other processor treats the access as a cache miss and proceeds on that basis. An implicit assumption in this scheme is that the controller must know before it starts executing the instruction that it is an indivisible operation. Some current microprocessors are capable of locking the bus for the duration of an instruction. Unfortunately, with some others it is not possible to recognize a read-modify-write before the read is complete; it is then too late to backtrack. For specific processors we have devised elaborate methods using interrupts and system calls to handle such situations. We will not present the specifics here, but it suffices to say that the schemes involve either the aborting and retrying of instructions or decoding instructions in the cache controller.
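The rule for indivisible operations reduces to a short decision procedure. The sketch below is our own illustration of that rule; `bus_acquire` and `invalidate_others` are hypothetical stand-ins for the normal bus arbitration mechanism and the invalidation broadcast.

```python
# Sketch of the indivisible read-modify-write rule described above
# (our own model; the callbacks stand in for real bus hardware).
INVALID, EXCL_UNMOD, SHARED_UNMOD, EXCL_MOD = range(4)

def indivisible_rmw(status, bus_acquire, invalidate_others):
    """Return the block's new status after an indivisible RMW (test-and-set)."""
    if status in (EXCL_UNMOD, EXCL_MOD):
        return EXCL_MOD                  # Exclusive: no special action required
    bus_acquire()                        # Shared or missing: capture bus first;
    invalidate_others()                  # the losing processor sees a miss
    return EXCL_MOD

log = []
new_status = indivisible_rmw(SHARED_UNMOD,
                             lambda: log.append("bus captured"),
                             lambda: log.append("invalidate broadcast"))
print(new_status, log)
```

The key point the sketch makes is that an Exclusive block needs no bus activity at all, so uncontended locks are serviced entirely within the local cache.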
III. PERFORMANCE ANALYSIS

The analysis of this coherence solution stems from an approximate method proposed by Patel [Patel82]. In this method, a request for a block transfer is broken up into several unit requests for service. The waiting time is also treated as a series of unit requests. Furthermore, these unit requests are treated as independent and random requests to the bus. It was shown in that paper that this rather non-obvious transformation of the problem results in a much simpler but fairly accurate analysis. The errors introduced by the approximation are less than 5% for a low miss ratio.

First, let us define the system parameters:

N  number of processors
a  processor memory reference rate
m  miss ratio
w  fraction of memory references that are writes
d  probability that a block in cache has been locally modified before eviction, i.e., the block is "dirty"
u  fraction of write requests to unmodified blocks in cache
s  fraction of write requests that reference Shared blocks, equivalent to the fraction of Shared blocks in cache if references are assumed to be equally distributed throughout cache
A  number of cycles required for bus arbitration logic
T  number of cycles required for a block transfer
I  number of cycles required for a block invalidate

To analyze our cache system, consider an interval of time comprising k units of useful processor activity. In that time, kb bus requests will be issued, where

    b = ma + (1-m)awsu

The term ma in the above expression represents the bus accesses due to cache misses and the term (1-m)awsu accounts for the invalidate requests resulting from writes to Shared-Unmod blocks. The actual execution time for 1 useful unit of work, disregarding cache interference, will be

    1 + bA + maT + madT + (1-m)awsuI + bW

where W is the average waiting time per bus request. The cpu idle times per useful cpu cycle are the factors bA for bus arbitration, maT for fetching blocks on misses, madT for write-back of Modified blocks, (1-m)awsuI for invalidate cycles, and bW for waiting time to acquire the bus.

Now we account for cache interference from other processors. If no dual cache directory is assumed the performance degradation due to cache interference can be extremely severe. Therefore, we have assumed dual directories in cache. In this case, the cache interference will occur only in the following situations:

1. A given processor receives invalidate requests from (N-1) other processors at the rate of (N-1)(1-m)awsu. We assume that all invalidates are effective and that, on the average, one cache is invalidated. The penalty for an invalidate is assumed to be one cache cycle.

2. Transfer requests occur at the rate (N-1)ma, of which (N-1)mas are for Shared blocks. We again assume that, on the average, one cache responds to the request. The penalty for a transfer is T cycles.

We define Q to be the sum of these two effects, namely

    Q = (1-m)awsu + masT

Cache interference is assumed to be distributed over the processor execution time, yielding the number of cycles required for a reference:

    Z = 1 + bA + maT + madT + (1-m)awsuI + bW + Q/Z    (1)

where Z is the real execution time for 1 useful unit of work. The unit request rate for each of the N processors as seen by the bus is

    (Z - 1 - bA - Q/Z) / Z

The probability that no processor is requesting the bus is given by

    (1 - (Z - 1 - bA - Q/Z)/Z)^N

Therefore, the probability that at least one processor is requesting the bus, that is, the average bus utilization B, is

    B = 1 - (1 - (Z - 1 - bA - Q/Z)/Z)^N    (2)

To solve for B, W and Z, we need one more expression for the bus utilization. That can be obtained by multiplying N by the actual bus time used, averaged over the execution period, giving

    B = N(Z - 1 - bA - bW - Q/Z)/Z    (3)

Now we can solve for B, W and Z using equations (1), (2) and (3). Similar derivations exist for the case of no coherence, no coherence and no bus contention (infinite crossbar), and Goodman's scheme. The processor utilization U is simply 1/Z.
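Equations (1)-(3) form a nonlinear system in Z, W and B. The paper does not spell out a solution procedure, so the sketch below is one plausible approach of our own, a simple fixed-point iteration over the three equations using the default parameter values listed in Section IV; the function name and iteration scheme are assumptions, not the authors' solver.

```python
def solve(N, m=0.05, a=0.90, w=0.20, d=0.50, u=0.30, s=0.05, A=1, T=2, I=2):
    """Fixed-point solution of equations (1)-(3) for Z, W and B; U = 1/Z."""
    b = m * a + (1 - m) * a * w * s * u          # bus requests per useful cycle
    Q = (1 - m) * a * w * s * u + m * a * s * T  # cache-interference term
    Z, W = 1.0, 0.0
    for _ in range(500):
        Z = 1 + b*A + m*a*T + m*a*d*T + (1 - m)*a*w*s*u*I + b*W + Q/Z   # (1)
        rho = (Z - 1 - b*A - Q/Z) / Z            # per-processor request rate
        B = 1 - (1 - rho) ** N                   # (2)
        W = max(0.0, (Z - 1 - b*A - Q/Z - B*Z/N) / b)                   # (3)
    return Z, W, B

Z, W, B = solve(N=8)
print(f"U = {1/Z:.2f}, bus utilization B = {B:.2f}")
```

The iteration converges here because the bus-time term contracts as B approaches saturation; sweeping N with this routine reproduces the qualitative behavior discussed in the next section (utilization falling off as the bus saturates).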
IV. DISCUSSION OF RESULTS

In this section we present the analytical results to demonstrate the effect of various parameters on the cpu performance and bus traffic. The values of cache parameters used span a reasonable range covering most practical situations. In some cases we have chosen pessimistic values to emphasize the fact that our cache coherence solution still gives good performance. The following values were used as default cache parameters:
m = 5%   Miss ratio: It may actually be lower for reasonable cache sizes, so this is a pessimistic assumption. Lower miss ratios would be appropriate for single-tasking processors, while the 7.5% figure may be appropriate for multi-tasking environments involving many context switches.

a = 90%  Processor to memory access rate: Here we assume 90% of cpu cycles result in a cache request, although a smaller fraction is more likely in processors with a large register set.

d = 50%  Write-back probability: Assume here that approximately half of all blocks are locally modified before eviction, although 20% and 80% are tried in order to see the effect of this parameter.

w = 20%  Write frequency: Assumed to be about 20% of all memory references. This is a fairly standard number. Since it only appears as a factor in the generation of invalidate requests with u and s, its actual value is not critical.

u = 30%  Fraction of writes to unmodified blocks: Assume that roughly one third of all write hits are first-time writes to a given unmodified block and the remainder are subsequent writes to the modified block.

s = 5%   Degree of sharing: In most cases we have assumed that 5% of writes are to a block which is declared Shared-Unmod. This should be a pessimistic assumption except for programs which pass large amounts of data between processors, in which case s = 15% is more reasonable. In systems where most sharing occurs only on semaphores, the 1% figure is more likely.

A = 1    Bus arbitration time: Assume that the logic for determining the next bus master settles within one cache cycle.

T = 2    Block transfer time: In a microprocessor environment blocks are likely to be small. Therefore, in most cases we have assumed that it takes approximately two cache cycles to transfer a block to a cache. We have also considered the effect of varying block transfer times due to differing technologies or larger cache blocks.

I = 2    Block invalidate time: We have assumed that the time taken for an invalidate cycle should be only slightly longer than a normal cache cycle, since the invalidate operation consists only of transmitting an address and modifying the affected cache directories.

The analytical method was verified using a time-driven simulator of the performance model. In all cases tested, the predicted performance differed by no more than 5% from the simulated performance. This error tended to approach 0 with heavier bus loading. Because of the comparative ease of generating data using the analytical solution, all results shown have been derived analytically. On each graph, all parameters assume their default values except the one being varied.

Figs. 4 through 6 illustrate the effects of different miss ratios on bus utilization, system performance, and processor utilization as functions of the number of processors. System performance is expressed as NU, where N is the number of processors and U is the single processor utilization. The system performance is limited primarily by the bus. From Fig. 4 we see that for a 7.5% miss ratio the bus saturates with about 8 processors. As the miss ratio decreases to 2.5% the bus saturates with about 18 processors. The effect of bus saturation on system performance can be seen in Fig. 5. Note that, in general, bus utilization and system performance increase almost linearly with N until the bus reaches saturation. At this point, processor utilization begins to approach a curve proportional to 1/N as seen in Fig. 6. If a 1% miss ratio could be achieved, performance would top out with N = 29.

Fig. 4 Effect of Miss Ratio m: Bus Utilization vs. Number of Processors

Fig. 5 Effect of Miss Ratio m: System Performance vs. Number of Processors

Fig. 6 Effect of Miss Ratio m: Processor Utilization vs. No. of Processors
Figs. 7 and 8 illustrate the effect of differing degrees of interprocessor sharing. The effect of the sharing factor, s, on system performance is relatively small compared with the effect of miss ratio. It is the factor (1-m)awsu that is responsible for the generation of invalidation traffic, which is generally smaller than the miss traffic. These graphs are also demonstrative of the effects of variations in the write frequency (w) and the percentage of first writes (u). The value of w is relatively fixed between 20% and 30%, and u should be fairly constant as well, except when moving large quantities of data. The s = 100% case corresponds to a standard write-back coherence scheme in which any block is potentially sharable. With a write-back frequency of 30% instead of 50% to compensate for initial write-throughs, the curves for Goodman's scheme are almost identical to those for s = 100%.

Fig. 7 Effect of Degree of Sharing s: Bus Utilization vs. Number of Processors

Fig. 8 Effect of Degree of Sharing s: System Performance vs. Number of Processors

Fig. 9 illustrates the effect of different write-back frequencies. The results here are fairly predictable. Write-back is yet another factor which contributes to the bus traffic. A write-through policy would contribute much more traffic than this.

Fig. 9 Effect of Write-Back Probability d: System Performance vs. Number of Processors

Fig. 10 illustrates the degradation due to increasing block transfer times. System performance is so limited by transfer times of 4 cycles or more that it is absolutely necessary to be able to bring a block into cache in one or two cycles.

Finally, Fig. 11 shows that the proposed coherence solution is very close to the ideal achievable system performance for a timeshared bus. The top curve represents a system not constrained by a bus, while the second corresponds to a system with no coherence overhead. The bottom curve, representing the proposed solution, is very close to the middle curve, clearly showing that little system performance is lost in maintaining cache consistency using our algorithm.
Fig. 10 Effect of Block Transfer Time T: System Performance vs. No. of Processors

Fig. 11 Overhead of Coherence Solution: System Performance vs. No. of Processors

V. CONCLUDING REMARKS

In this paper we have introduced a new coherence algorithm for multiprocessors on a timeshared bus. It takes advantage of the relatively small amount of data shared between processors without the need for a global table. In addition, it is easily extensible to an arbitrary number of processors and relatively uncomplicated. The applications of a system of this type are many. Processing modules could be added as needed, and the system need not be redesigned for each new application. For example, an interesting application would be to allocate one processing module to each user, with one possibly dedicated to operating system functions. Again, the primary advantage is easy expandability and very little performance degradation as a result of it. For any multiprocessor system on a timeshared bus, this coherence solution is as easy to implement as any other, save broadcast-invalidate, and offers a significant performance improvement if the amount of shared data memory is reasonably small.

VI. ACKNOWLEDGEMENTS

We would like to thank Professor Faye Briggs of Rice University and Professor Jean-Loup Baer of the University of Washington for helpful discussions concerning this paper.

REFERENCES

[Archibald83] J. Archibald and J. L. Baer, "An Economical Solution to the Cache Coherence Problem," University of Washington Technical Report 83-10-07, October 1983.

[Bean79] B. M. Bean, K. Langston, R. Partridge, and K. K. Sy, "Bias Filter Memory for Filtering out Unnecessary Interrogations of Cache Directories in a Multiprocessor System," United States Patent 4,142,234, February 27, 1979.

[Censier78] L. M. Censier and P. Feautrier, "A New Solution to Coherence Problems in Multicache Systems," IEEE Trans. Comput., vol. C-27, December 1978, pp. 1112-1118.

[Conti69] C. J. Conti, "Concepts for Buffer Storage," IEEE Computer Group News, vol. 2, March 1969, pp. 9-13.

[Dubois82] M. Dubois and F. A. Briggs, "Effects of Cache Coherency in Multiprocessors," IEEE Trans. Comput., vol. C-31, November 1982, pp. 1083-1099.

[Goodman83] J. R. Goodman, "Using Cache Memory to Reduce Processor-Memory Traffic," Proc. 10th Annual Symp. on Computer Architecture, June 1983, pp. 124-131.

[Kaplan73] K. R. Kaplan and R. O. Winder, "Cache-Based Computer Systems," Computer, March 1973, pp. 30-36.

[Meade70] R. M. Meade, "On Memory System Design," AFIPS Proc. FJCC, vol. 37, 1970, pp. 33-43.

[Norton82] R. L. Norton and J. A. Abraham, "Using Write Back Cache to Improve Performance of Multiuser Multiprocessors," Proc. 1982 Int. Conf. on Parallel Processing, August 1982, pp. 326-331.

[Patel82] J. H. Patel, "Analysis of Multiprocessors with Private Cache Memories," IEEE Trans. Comput., vol. C-31, April 1982, pp. 296-304.

[Rao78] G. S. Rao, "Performance Analysis of Cache Memories," J. ACM, vol. 25, No. 3, July 1978, pp. 376-395.

[Smith82] A. J. Smith, "Cache Memories," Computing Surveys, vol. 14, No. 3, September 1982, pp. 473-530.

[Strecker76] W. D. Strecker, "Cache Memories for PDP-11 Family Computers," Proc. 3rd Annual Symp. on Computer Architecture, January 1976, pp. 155-158.

[Tang76] C. K. Tang, "Cache System Design in the Tightly Coupled Multiprocessor System," AFIPS Proc. NCC, vol. 45, 1976, pp. 749-753.

[Yeh83] P. C. C. Yeh, J. H. Patel, and E. S. Davidson, "Shared Cache for Multiple-Stream Computer Systems," IEEE Trans. Comput., vol. C-32, January 1983, pp. 38-47.

[Yen82] W. C. Yen and K. S. Fu, "Coherence Problem in a Multicache System," Proc. 1982 Int. Conf. on Parallel Processing, 1982, pp. 332-339.