Memory Consistency
 Reads and writes to shared memory face a consistency problem
 Need to achieve controlled
consistency in memory events
 Shared memory behavior
determined by:
Program order
Memory access order
 Challenges
Modern processors reorder operations
Compiler optimizations (scalar replacement, instruction rescheduling)
Basic Concept
 On a multiprocessor:
Concurrent instruction streams (threads) on different processors
Memory events performed by one process may create data to
be used by another
Events: read and write
 Memory consistency model specifies how the memory
events initiated by one process should be
observed by other processes
 Event ordering
Declares which memory access is allowed, and which process should
wait for a later access when processes compete
Uniprocessor vs.
Multiprocessor Model
Understanding Program Order
Initially X = 2

P1:                      P2:
.....                    .....
r0 = Read(X)             r1 = Read(X)
r0 = r0 + 1              r1 = r1 + 1
Write(r0, X)             Write(r1, X)
.....                    .....

Possible execution sequences:

Interleaved execution:
P1: r0 = Read(X)
P2: r1 = Read(X)
P1: r0 = r0 + 1
P1: Write(r0, X)
P2: r1 = r1 + 1
P2: Write(r1, X)
Result: X = 3

P2 completes before P1:
P2: r1 = Read(X)
P2: r1 = r1 + 1
P2: Write(r1, X)
P1: r0 = Read(X)
P1: r0 = r0 + 1
P1: Write(r0, X)
Result: X = 4
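The same effect can be reproduced on a real machine. The following is a minimal sketch (an added illustration, not from the slides) in C with POSIX threads: two threads perform the unsynchronized read-modify-write shown above, and the printed result is 3 or 4 depending on the interleaving.

/* Two threads increment the shared variable X without synchronization,
 * mirroring P1 and P2 above; the result depends on the interleaving. */
#include <pthread.h>
#include <stdio.h>

int X = 2;                      /* shared, initially 2 */

void *increment(void *arg) {
    (void)arg;
    int r = X;                  /* r = Read(X)  */
    r = r + 1;                  /* r = r + 1    */
    X = r;                      /* Write(r, X)  */
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, increment, NULL);
    pthread_create(&p2, NULL, increment, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("X = %d\n", X);      /* prints 3 or 4 */
    return 0;
}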
Interleaving
A, B, C are shared variables (initially 0); the processors access shared
memory through a switch.

P1:             P2:             P3:
a. A = 1;       c. B = 1;       e. C = 1;
b. Print B,C;   d. Print A,C;   f. Print A,B;
Program orders of individual instruction streams
may need to be modified because of interaction
among them
Finding the optimum global memory order is an NP-hard
problem
Example
Same program and shared variables A, B, C (initially 0) as in the Interleaving slide above.
 Concatenate the program orders in P1, P2 and P3
The printed output forms a 6-bit binary string (64 possible output combinations)
(a,b,c,d,e,f) => 001011 (in-order execution)
(a,c,e,b,d,f) => 111111 (in-order execution)
(b,d,f,e,a,c) => 000000 (out-of-order execution)
6! = 720 possible execution interleavings
Mutual exclusion problem
mutual exclusion problem in concurrent
programming
allow two threads to share a single-use
resource without conflict, using only shared
memory for communication.
avoid the strict alternation of a naive turn-taking algorithm
Definition
If two processes attempt to enter a critical
section at the same time, allow only one process
in, based on whose turn it is.
If one process is already in the critical section,
the other process will wait for the first process
to exit.
How would you implement this while guaranteeing:
mutual exclusion,
freedom from deadlock, and
freedom from starvation?
Solution: Dekker’s Algorithm
This is done by the use of two flags f0
and f1 which indicate an intention to enter
the critical section and a turn variable
which indicates who has priority between
the two processes.
flag[0] := false
flag[1] := false
turn := 0 // or 1
P0:
flag[0] := true
while flag[1] = true {
    if turn ≠ 0 {
        flag[0] := false
        while turn ≠ 0 {
            // busy wait
        }
        flag[0] := true
    }
}
// critical section
...
// end of critical section
turn := 1
flag[0] := false
// remainder section

P1:
flag[1] := true
while flag[0] = true {
    if turn ≠ 1 {
        flag[1] := false
        while turn ≠ 1 {
            // busy wait
        }
        flag[1] := true
    }
}
// critical section
...
// end of critical section
turn := 0
flag[1] := false
// remainder section
Disadvantages
limited to two processes
makes use of busy waiting instead of
process suspension.
Modern CPUs execute their instructions out of order,
and even memory accesses can be reordered
Peterson’s Algorithm
flag[0] = 0;
flag[1] = 0;
int turn;

P0:
flag[0] = 1;
turn = 1;
while (flag[1] == 1 && turn == 1) {
    // busy wait
}
// critical section
...
// end of critical section
flag[0] = 0;

P1:
flag[1] = 1;
turn = 0;
while (flag[0] == 1 && turn == 0) {
    // busy wait
}
// critical section
...
// end of critical section
flag[1] = 0;
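Because modern CPUs and compilers may reorder the plain loads and stores above (the problem noted for Dekker's algorithm), a common fix is to mark the shared variables atomic. The sketch below is an added illustration using C11 <stdatomic.h> with its default sequentially consistent ordering; the lock/unlock function names are my own.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];   /* intention to enter the critical section, initially false */
atomic_int  turn;      /* which process yields when both want to enter             */

void lock(int self) {              /* self is 0 or 1 */
    int other = 1 - self;
    atomic_store(&flag[self], true);
    atomic_store(&turn, other);
    /* default memory_order_seq_cst prevents the store/load reordering
       that breaks the plain-variable version on out-of-order CPUs */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;                          /* busy wait */
    /* critical section follows */
}

void unlock(int self) {
    atomic_store(&flag[self], false);
}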
Lamport's bakery algorithm
The analogy is a bakery with a numbering machine.
The 'customers' are threads, identified by the letter i, obtained from a global variable.
More than one thread might get the same number (ties are then broken by thread id, as in the code below).
// declaration and initial values of global variables
Entering: array [1..NUM_THREADS] of bool = {false};
Number: array [1..NUM_THREADS] of integer = {0};

lock(integer i) {
    Entering[i] = true;
    Number[i] = 1 + max(Number[1], ..., Number[NUM_THREADS]);
    Entering[i] = false;
    for (j = 1; j <= NUM_THREADS; j++) {
        // Wait until thread j receives its number:
        while (Entering[j]) { /* nothing */ }
        // Wait until all threads with smaller numbers or with the same
        // number, but with higher priority, finish their work:
        while ((Number[j] != 0) && ((Number[j], j) < (Number[i], i))) {
            /* nothing */
        }
    }
}

unlock(integer i) {
    Number[i] = 0;
}

Thread(integer i) {
    while (true) {
        lock(i);
        // The critical section goes here...
        unlock(i);
        // non-critical section...
    }
}
Models
Strict Consistency: Read always returns with most recent Write to same address
Sequential Consistency: The result of any execution appears as the interleaving of
individual programs strictly in sequential program order
Processor Consistency: Writes issued by
each processor are in program order, but
writes from different processors can be
out of order (Goodman)
Weak Consistency: Programmer uses
synch operations to enforce
sequential consistency (Dubois)
Reads from each processor are not restricted
More opportunities for pipelining (see the C11 sketch below)
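As an added, present-day illustration of such synch operations under weak ordering: in C11 the programmer tags the synchronization variable atomic and uses release/acquire operations, which order the surrounding ordinary accesses (the variable names below are illustrative).

#include <stdatomic.h>

int data;                 /* ordinary shared data                  */
atomic_int ready;         /* synchronization variable, initially 0 */

void producer(void) {
    data = 42;                                          /* ordinary write */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);        /* synch op       */
}

void consumer(void) {
    while (atomic_load_explicit(&ready,
                                memory_order_acquire) == 0)
        ;                                               /* wait for flag  */
    /* the release/acquire pair orders the ordinary accesses:
       data is guaranteed to be visible as 42 here */
    int x = data;
    (void)x;
}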
Relationship to Cache
Coherence Protocol
Cache coherence protocol must observe the
constraints imposed by the memory consistency
model
Ex: Read hit in a cache
Reading without waiting for the completion of a previous write
may violate sequential consistency
Cache coherence protocol provides a mechanism
to propagate the newly written value
Memory consistency model places an additional
constraint on when the value can be propagated
to a given processor
Latency Tolerance
Scalable systems
Distributed shared memory architecture
Access to remote memory: long latency
Processor speed vs. the memory and
interconnect
Need for
Latency reduction, avoidance, hiding
Latency Avoidance
Organize user applications at
architectural, compiler or application
levels to achieve program/data locality
Possible when applications exhibit:
Temporal or spatial locality
How do you enhance locality?
Locality Enhancement
Architectural support:
Cache coherency protocols, memory consistency
models, fast message passing, etc.
User support
High Performance Fortran: program instructs
compiler how to allocate the data (example ?)
Software support
Compiler performs certain transformations
Example?
Latency Reduction
What if locality is limited?
Data access is dynamically changing?
For ex: sorting algorithms
We need latency reduction mechanisms
Target communication subsystem
Interconnect
Network interface
Fast communication software
• Cluster: TCP, UDP, etc
Latency Hiding
Hide communication latency within computation
Overlapping techniques
Prefetching techniques
• Hide read latency
Distributed coherent caches
• Reduce cache misses
• Shorten time to retrieve clean copy
Multiple context processors
• Switch from one context to another when a long-latency
operation is encountered (hardware-supported
multithreading)
Memory
Delays
 SMP
high in multiprocessors due to added contention for shared
resources such as a shared bus and memory modules
 Distributed
are even more pronounced in distributed-memory
multiprocessors where memory requests may need to be
satisfied across an interconnection network.
 By masking some or all of these significant memory
latencies, prefetching can be an effective means of
speeding up multiprocessor applications
Data Prefetching
Overlapping computation with memory
accesses
Rather than waiting for a cache miss to
perform a memory fetch, data prefetching
anticipates such misses and issues a fetch to
the memory system in advance of the actual
memory reference.
Cache Hierarchy
Popular latency reducing technique
But still common for scientific programs to
spend more than half their run times
stalled on memory requests
partially a result of the “on demand” fetch
policy
fetch data into the cache from main memory only
after the processor has requested a word and
found it absent from the cache.
Why do scientific applications
exhibit poor cache utilization?
 Is something wrong with the principle of locality?
 The traversal of large data arrays is often at the heart of this
problem.
 Temporal locality in array computations
 once an element has been used to compute a result, it is often not
referenced again before it is displaced from the cache to make room
for additional array elements.
 Sequential array access patterns exhibit a high degree of spatial
locality, but many other types of array access patterns do not.
 For example, in a language which stores matrices in row-major order, a
column-wise traversal of a matrix will result in consecutively referenced
elements being widely separated in memory. Such strided reference
patterns result in low spatial locality if the stride is greater than the
cache block size. In this case, only one word per cache block is actually
used while the remainder of the block remains untouched even though
cache space has been allocated for it.
Figure: execution timeline for memory references r1, r2 and r3 that are not in
the cache, contrasting time spent in computation and in references satisfied
within the cache hierarchy with main memory access time.
Challenges
 Cache pollution
 Even if data arrive early enough to hide all of the memory latency, the data
must be held in the processor cache for some period of time before being
used by the processor.
 During this time, the prefetched data are exposed to the cache replacement
policy and may be evicted from the cache before use.
 Moreover, the prefetched data may displace data in the cache that is currently in
use by the processor.
 Memory bandwidth
 Back to figure:
 No prefetch: the three memory requests occur within the first 31 time units of program
startup,
 With prefetch: these requests are compressed into a period of 19 time units.
 By removing processor stall cycles, prefetching effectively increases the
frequency of memory requests issued by the processor.
 Memory systems must be designed to match this higher bandwidth to avoid
becoming saturated and nullifying the benefits of prefetching.
Spatial Locality
Block transfer is a way of prefetching
(1960s)
Software prefetching later (1980s)
Binding Prefetch
Non-blocking load instructions
these instructions are issued in advance of the actual
use to take advantage of the parallelism between the
processor and memory subsystem.
Rather than loading data into the cache, however,
the specified word is placed directly into a processor
register.
the value of the prefetched variable is bound to
a named location at the time the prefetch is
issued.
Software-Initiated Data
Prefetching
Some form of fetch instruction
can be as simple as a load into a processor register
Fetches are non-blocking memory operations
Allow prefetches to bypass other outstanding
memory operations in the cache.
Fetch instructions cannot cause exceptions
The hardware required to implement software-initiated prefetching is modest
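On current compilers such a fetch instruction is usually exposed as an intrinsic. A minimal sketch, assuming GCC/Clang's __builtin_prefetch and an arbitrary prefetch distance of 16 elements:

/* Prefetch a[i + 16] while working on a[i]; __builtin_prefetch is
 * non-blocking and never faults, so running past the end of the
 * array is harmless. */
void scale(double *a, int n, double k) {
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 1, 3);  /* 1 = for writing, 3 = keep in cache */
        a[i] *= k;
    }
}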
Prefetch Challenges
 prefetch scheduling.
judicious placement of fetch instructions within the
target application.
not possible to precisely predict when to schedule a
prefetch so that data arrives in the cache at the
moment it will be requested by the processor
uncertainties not predictable at compile time
careful consideration when statically scheduling prefetch
instructions.
may be added by the programmer or by the compiler
during an optimization pass.
programming effort?
Suitable spots for “Fetch”
most often used within loops responsible
for large array calculations.
common in scientific codes,
exhibit poor cache utilization
predictable array referencing patterns.
Example (assume a four-word cache block):
Issues with the straightforward prefetching loop:
Cache misses during the first iteration
Unnecessary prefetches in the last iteration of the unrolled loop
How to solve these two issues? Software pipelining (see the sketch below)
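A hedged sketch of the software-pipelined loop, assuming the four-word cache block above, a simple vector sum, and GCC/Clang's __builtin_prefetch (the function and array names are illustrative): a prologue prefetch covers the first iteration, and the epilogue iterations issue no prefetches past the end of the array.

/* Vector sum with one-block-ahead prefetching, software-pipelined. */
long sum(const int *a, int n) {
    long s = 0;
    int i;

    __builtin_prefetch(&a[0]);              /* prologue: cover the first iteration */

    for (i = 0; i + 4 < n; i += 4) {        /* steady state                        */
        __builtin_prefetch(&a[i + 4]);      /* block used by the next iteration    */
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }

    for (; i < n; i++)                      /* epilogue: no useless prefetches     */
        s += a[i];

    return s;
}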
Assumptions
 implicit assumption
Prefetching one iteration ahead of the data’s actual use is
sufficient to hide the latency
 What if the loops contain small computational bodies?
Define prefetch distance
initiate prefetches d iterations before the data is referenced
How do you determine “d”?
• Let
– l be the average cache miss latency, measured in processor
cycles,
– s be the estimated cycle time of the shortest possible execution
path through one loop iteration, including the prefetch overhead.
• Then the prefetch distance is d = ⌈ l / s ⌉.
Revisiting the example
Let us assume an average miss latency of 100 processor cycles and a loop
iteration time of 45 cycles.
Then d = ⌈100 / 45⌉ = 3, i.e. the loop should handle a prefetch distance of
three (see the sketch below).
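Continuing the same assumptions, a short sketch with the prefetch distance set to d = 3 unrolled iterations (12 elements, i.e. three four-word blocks, ahead):

/* Same loop with prefetch distance d = 3 (three unrolled iterations ahead). */
long sum_d3(const int *a, int n) {
    long s = 0;
    int i;

    for (i = 0; i < 12 && i < n; i += 4)    /* prologue: first d blocks            */
        __builtin_prefetch(&a[i]);

    for (i = 0; i + 12 < n; i += 4) {       /* steady state                        */
        __builtin_prefetch(&a[i + 12]);     /* arrives >= 100 cycles before use    */
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }

    for (; i < n; i++)                      /* epilogue                            */
        s += a[i];

    return s;
}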
Case Study
 Given a distributed-shared multiprocessor
 let’s define a remote access cache (RAC)
Assume that RAC is located at the network interface of each node
Motivation: prefetched remote data could be accessed at a speed
comparable to that of local memory while the processor cache
hierarchy was reserved for demand-fetched data.
 Which one is better: having a RAC or prefetching data
directly into the processor cache hierarchy?
Despite significantly increasing cache contention and
reducing overall cache space,
the latter approach results in higher cache hit rates, which turn
out to be the dominant performance factor.
Case Study
 Transfer of individual cache blocks across the interconnection
network of a multiprocessor yields low network efficiency
what if we propose transferring prefetched data in larger units?
 Method: a compiler schedules a single prefetch command
before the loop is entered rather than software pipelining
prefetches within a loop.
transfer of large blocks of remote memory used within the loop body
prefetched into local memory to prevent excessive cache pollution.
 Issues:
this is a binding prefetch, since data stored in a processor’s local memory are
not exposed to any coherency policy
it imposes constraints on the use of prefetched data which, in turn, limits
the amount of remote data that can be prefetched.
What about code other than
loops?
 Prefetching is normally restricted to loops
 array accesses whose indices are linear functions of the loop indices
 compiler must be able to predict memory access patterns when
scheduling prefetches.
 such loops are relatively common in scientific codes but far less so in
general applications.
 Irregular data structures
 difficult to reliably predict when a particular data will be accessed
 once a cache block has been accessed, there is less of a chance that
several successive cache blocks will also be requested when data
structures such as graphs and linked lists are used.
 such applications also tend to exhibit comparatively high temporal locality,
which results in high cache utilization, thereby diminishing the benefit of
prefetching.
What is the overhead of fetch
instructions?
 require extra execution cycles
 fetch source addresses must be calculated and stored in the processor
 to avoid recalculation for the matching load or store instruction.
How:
• Register space
Problem:
• compiler will have less register space to allocate to other active variables.
• fetch instructions increase register pressure
• It gets worse when
– the prefetch distance is greater than one
– multiple prefetch addresses
 code expansion
 may degrade instruction cache performance.
 software-initiated prefetching is done statically
 unable to detect when a prefetched block has been prematurely evicted
and needs to be re-fetched.
Hardware-Initiated Data
Prefetching
Prefetching capabilities without the need
for programmer or compiler intervention.
No changes to existing executables
instruction overhead completely eliminated.
can take advantage of run-time
information to potentially make
prefetching more effective.
Cache Blocks
 Typically: fetch data from main memory into the processor cache in
units of cache blocks.
 multiple word cache blocks are themselves a form of data prefetching.
 large cache blocks
Effective prefetching vs cache pollution.
 What is the complication for SMPs with private caches
false sharing: when two or more processors wish to access different
words within the same cache block and at least one of the accesses is
a store.
cache coherence traffic is generated to ensure that the changes
made to a block by a store operation are seen by all processors
caching the block.
• Unnecessary traffic
• Increasing the cache block size increases the likelihood of such
occurrences
 How do we take advantage of spatial locality without
introducing some of the problems associated with large
cache blocks?
Sequential prefetching
one block lookahead (OBL) approach
initiates a prefetch for block b+1 when block
b is accessed.
How is it different from doubling the block
size?
prefetched blocks are treated separately with
regard to the cache replacement and
coherency policies.
OBL: Case Study
 Assume that a large block contains one word which is
frequently referenced and several other words which are
not in use.
 Assume that an LRU replacement policy is used,
 What is the implication?
the entire block will be retained even though only a portion of
the block’s data is actually in use.
 How do we solve?
Replace large block with two smaller blocks,
one of them could be evicted to make room for more active data.
use of smaller cache blocks reduces the probability of false
sharing
OBL implementations
 Based on “what type of access to block b initiates the
prefetch of b+1”
prefetch on miss
Initiates a prefetch for block b+1 whenever an access for block b
results in a cache miss.
If b+1 is already cached, no memory access is initiated
tagged prefetch algorithms
Associates a tag bit with every memory block.
Use this bit to detect
• when a block is demand-fetched or
• when a prefetched block is referenced for the first time.
Then, next sequential block is fetched.
Which one is better in terms of reducing miss rate? Prefetch on
miss vs tagged prefetch?
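A pseudocode-style sketch in C of the two trigger conditions (the cache-controller helpers such as cache_contains and issue_prefetch are assumptions, not real APIs):

/* Assumed cache-controller operations, modeled as external helpers. */
extern int  cache_contains(int block);
extern void fetch_block(int block);
extern void issue_prefetch(int block);
extern int  tag_is_set(int block);
extern void set_tag(int block);
extern void clear_tag(int block);

/* Prefetch-on-miss: only a demand miss on block b triggers a prefetch of b+1. */
void access_prefetch_on_miss(int b) {
    if (!cache_contains(b)) {
        fetch_block(b);
        if (!cache_contains(b + 1))
            issue_prefetch(b + 1);
    }
}

/* Tagged prefetch: a demand fetch OR the first reference to a prefetched
 * block (tag bit still set) triggers a prefetch of the next block. */
void access_tagged(int b) {
    if (!cache_contains(b)) {
        fetch_block(b);
        set_tag(b);                  /* block not referenced before now  */
    }
    if (tag_is_set(b)) {             /* first reference to this block    */
        clear_tag(b);
        if (!cache_contains(b + 1)) {
            issue_prefetch(b + 1);
            set_tag(b + 1);          /* prefetched block carries the tag */
        }
    }
}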
Prefetch on miss vs tagged
prefetch
With a strictly sequential access pattern over three contiguous blocks,
prefetch-on-miss still takes a miss on every other block, whereas tagged
prefetch misses only on the first block of the stream.
Shortcoming of the OBL
prefetch may not be initiated far enough
in advance of the actual use to avoid a
processor memory stall.
A sequential access stream resulting from a
tight loop, for example, may not allow
sufficient time between the use of blocks b
and b+1 to completely hide the memory
latency.
How do you solve this
shortcoming?
Increase the number of blocks prefetched
after a demand fetch from one to “d”
As each prefetched block, b, is accessed for
the first time, the cache is interrogated to
check if blocks b+1, ... b+d are present in
the cache
What if d=1? What kind of prefetching is
this?
Tagged
Another technique with
d-prefetch
 d prefetched blocks are brought into a FIFO stream
buffer before being brought into the cache.
As each buffer entry is referenced, it is brought into the cache
while the remaining blocks are moved up in the queue and a
new block is prefetched into the tail position.
If a miss occurs in the cache and the desired block is also not
found at the head of the stream buffer, the buffer is flushed.
 Advantage:
prefetched data are not placed directly into the cache,
avoids cache pollution.
 Disadvantage:
requires that prefetched blocks be accessed in a strictly
sequential order to take advantage of the stream buffer.
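A rough software model of such a d-entry stream buffer, with the memory-side helpers assumed (this is a sketch of the behaviour described above, not a hardware design):

#define D 4                                /* degree of prefetching               */

extern void issue_prefetch(int block);     /* assumed memory-side helpers         */
extern void move_to_cache(int block);

typedef struct {
    int blocks[D];                         /* prefetched block addresses (FIFO)   */
    int head;                              /* index of the oldest entry           */
    int valid;                             /* buffer holds a live stream?         */
} stream_buffer;

/* Called when block b misses in the cache. */
void on_cache_miss(stream_buffer *sb, int b) {
    if (sb->valid && sb->blocks[sb->head] == b) {
        /* head hit: promote it to the cache, shift the queue up and
           prefetch a new block into the freed tail position */
        move_to_cache(b);
        int tail_block = b + D;            /* next sequential block               */
        sb->blocks[sb->head] = tail_block; /* reuse the slot as the new tail      */
        sb->head = (sb->head + 1) % D;
        issue_prefetch(tail_block);
    } else {
        /* not found at the head: flush and restart the stream from b+1 */
        sb->head = 0;
        sb->valid = 1;
        for (int k = 0; k < D; k++) {
            sb->blocks[k] = b + 1 + k;
            issue_prefetch(sb->blocks[k]);
        }
        /* the demand miss on b itself is serviced by the normal cache fill */
    }
}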
Tradeoffs of
d-prefetching?
Good: increasing the degree of prefetching
reduces miss rates in sections of code that show a
high degree of spatial locality
Bad
additional traffic and cache pollution are generated
by sequential prefetching during program phases that
show little spatial locality.
What if we are able to vary d?
Adaptive sequential
prefetching
 d is matched to the degree of spatial locality exhibited by the
program at a particular point in time.
 a prefetch efficiency metric is periodically calculated
 Prefetch efficiency
 ratio of useful prefetches to total prefetches
a useful prefetch occurs whenever a prefetched block results in a cache hit.
 d is initialized to one,
 incremented whenever efficiency exceeds a predetermined upper
threshold
 decremented whenever the efficiency drops below a lower threshold
 If d=0, no prefetching
 Which one is better? adaptive or tagged prefetching?
 Miss ratio vs Memory traffic and contention
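A small sketch of the adaptive adjustment, with illustrative threshold values (the slide does not specify them):

#define UPPER_THRESHOLD 0.75               /* illustrative values                 */
#define LOWER_THRESHOLD 0.40

static int d = 1;                          /* degree of prefetching, initially 1  */

/* Called periodically with counters gathered since the last call. */
void adjust_degree(int useful_prefetches, int total_prefetches) {
    if (total_prefetches == 0)
        return;
    double efficiency = (double)useful_prefetches / total_prefetches;

    if (efficiency > UPPER_THRESHOLD)
        d++;                               /* strong spatial locality: prefetch more       */
    else if (efficiency < LOWER_THRESHOLD && d > 0)
        d--;                               /* weak locality: back off; d == 0 disables it  */
}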
Sequential prefetching
summary
 Does sequential prefetching require changes to existing
executables?
 What about the hardware complexity?
 Which one offers both simplicity and performance?
 TAGGED
 Compared to software-initiated prefetching, what might be the
problem?
 tend to generate more unnecessary prefetches.
 Non-sequential access patterns are handled poorly
Ex: scalar references or array accesses with large strides will result
in unnecessary prefetch requests
they do not exhibit the spatial locality upon which sequential prefetching is
based.
 To enable prefetching of strided and other irregular data access
patterns, several more elaborate hardware prefetching techniques
have been proposed.
Prefetching with arbitrary
strides
Reference Prediction Table (RPT)
States: initial, transient, steady
Figure: RPT entries and the state transition diagram
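A hedged sketch of one RPT entry and a simplified update rule built only from the three states listed above (the exact transition rules of the real scheme are more detailed; field and helper names are assumptions):

typedef enum { INITIAL, TRANSIENT, STEADY } rpt_state;

typedef struct {
    unsigned long pc;          /* address of the load/store (table tag)   */
    unsigned long prev_addr;   /* last effective address it generated     */
    long          stride;      /* last observed stride                    */
    rpt_state     state;
} rpt_entry;

extern void issue_prefetch(unsigned long addr);   /* assumed helper */

/* Called each time the tracked instruction issues effective address addr. */
void rpt_update(rpt_entry *e, unsigned long addr) {
    long new_stride = (long)(addr - e->prev_addr);

    if (new_stride == e->stride) {
        /* stride confirmed: (re)enter STEADY and prefetch one stride ahead */
        e->state = STEADY;
        issue_prefetch(addr + (unsigned long)e->stride);
    } else {
        /* stride changed: leave STEADY and relearn the stride */
        e->state = (e->state == STEADY) ? INITIAL : TRANSIENT;
        e->stride = new_stride;
    }
    e->prev_addr = addr;
}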
Matrix Multiplication
Assume starting addresses a = 10000, b = 20000, c = 30000,
and a one-word cache block
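The loop nest assumed for this example is a standard i-j-k matrix multiply (a reconstruction, since the slide shows only the RPT snapshots); with row-major storage, the inner loop gives a[i][j] a stride of 0, b[i][k] a stride of one word, and c[k][j] a stride of one row:

#define N 100

double a[N][N], b[N][N], c[N][N];

void matmul(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                /* inner loop: a[i][j] stride 0, b[i][k] stride 1 word,
                   c[k][j] stride N words -- the strides the RPT learns */
                a[i][j] += b[i][k] * c[k][j];
}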
After the first iteration of inner loop
Matrix Multiplication
After the second iteration of inner loop
Hits/misses?
Matrix Multiplication
After the third iteration
b and c hits provided that a prefetch of distance one is enough
RPT Limitations
Prefetch distance to one loop iteration
Loop entrance : miss
Loop exit: unnecessary prefetch
How can we solve this?
Use longer distance
Prefetch address = effective address +
(stride x distance )
with lookahead program counter (LA-PC)
Summary
Prefetches should be
timely, useful, and introduce little overhead.
Reduce secondary effects in the memory
system
strategies are diverse and no single
strategy provides optimal performance
Summary
Prefetching schemes are diverse.
To help categorize a particular approach it
is useful to answer three basic questions
concerning the prefetching mechanism:
1) When are prefetches initiated,
2) Where are prefetched data placed,
3) What is the unit of prefetch?
Software vs Hardware
Prefetching
 Prefetch instructions actually increase the amount of
work done by the processor.
 Hardware-based prefetching techniques do not require
the use of explicit fetch instructions.
hardware monitors the processor in an attempt to infer
prefetching opportunities.
no instruction overhead
generates more unnecessary prefetches than software-initiated
schemes.
need to speculate on future memory accesses without the benefit
of compile-time information
• Cache pollution
• Consume memory bandwidth
Conclusions
 Prefetches can be initiated either by
 explicit fetch operation within a program (software initiated)
 logic that monitors the processor’s referencing pattern (hardware-initiated).
 Prefetches must be timely.
 issued too early
chance that the prefetched data will displace other useful data or be
displaced itself before use.
 issued too late
may not arrive before the actual memory reference and introduce stalls
 Prefetches must be precise.
 The software approach issues prefetches only for data that is likely to
be used
 Hardware schemes tend to fetch more data unnecessarily.
Conclusions
 The decision of where to place prefetched data in the
memory hierarchy
prefetched data must be placed high enough in the memory hierarchy to
provide a performance benefit.
 The majority of schemes
prefetched data in some type of cache memory.
 Prefetched data in processor registers
binding and additional constraints must be imposed on the use
of the data.
 Finally, multiprocessor systems can introduce additional
levels into the memory hierarchy which must be taken
into consideration.
Conclusions
Data can be prefetched in units of single
words, cache blocks or larger blocks of memory.
determined by the organization of the underlying
cache and memory system.
Uniprocessors and SMPs
Cache blocks appropriate
Distributed memory multiprocessor
larger memory blocks
to amortize the cost of initiating a data transfer across an
interconnection network